## NLP Text Categorization Competition

In this project I use a BERT model to classify text data, as part of the AIcrowd NLP competition.

I use the ktrain library, which works with TensorFlow 2.0. I found this high-level library extremely useful for quickly building a BERT model with state-of-the-art prediction capabilities. The project was done in a Google Colab notebook with a GPU runtime.

### The Contest: Transfer Learning for International Crisis Response

The purpose of this contest is to improve the text classification algorithms used within the DEEP humanitarian information platform. To quote their website:

"The aim of the platform is to provide insights from years of historical and in-crisis humanitarian text data. The platform allows users to upload documents and classify text snippets according to predefined humanitarian target labels, grouped into and referred to as analytical frameworks. DEEP is now successfully functional in several international humanitarian organizations and the United Nations across the globe."

Currently, multiple organizations use the DEEP platform, and each has its own classification algorithm. These algorithms vary in how well they perform, and none is trained on data from the other organizations. The aim of this project is to build a unified algorithm that both performs better and generalizes well enough to work across all organizations.

### The Categories: 12 Different Labels

(1) Agriculture

(2) Cross: short for Cross-sectoral; areas of humanitarian response that require action in more than one sector. For example, malnutrition requires humanitarian interventions in health, access to food, access to basic hygiene items and clean water, and access to non-food items such as bottles to feed infants.

(3) Education

(4) Food

(5) Health

(6) Livelihood: Access to employment and income

(7) Logistics: Any logistical support needed to carry out humanitarian activities, e.g. air transport, satellite phone connection, etc.

(8) NFI: Non-food items needed in daily life, such as bedding, mattresses, jerrycans, and coal or oil for heating

(9) Nutrition

(10) Protection

(11) Shelter

(12) WASH (Water, Sanitation and Hygiene)
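
Since the model works with the numeric labels, it can help to keep the list above available as a lookup table. A minimal sketch; the names `LABEL_NAMES` and `label_name` are illustrative and do not appear in the original notebook:

```python
# Lookup table for the 12 contest labels listed above.
# LABEL_NAMES and label_name are illustrative names, not from the notebook.
LABEL_NAMES = {
    1: "Agriculture",
    2: "Cross",
    3: "Education",
    4: "Food",
    5: "Health",
    6: "Livelihood",
    7: "Logistics",
    8: "NFI",
    9: "Nutrition",
    10: "Protection",
    11: "Shelter",
    12: "WASH",
}


def label_name(label_id):
    """Return the category name for a numeric label, e.g. a model prediction."""
    return LABEL_NAMES[int(label_id)]


print(label_name(9))  # Nutrition
```

Accepting anything that `int()` can convert means the helper also works if a prediction comes back as a string or a NumPy integer.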

Although the contest had already finished by the time I wrote this, I never got the chance to try a BERT model during it. I built this notebook as a challenge to myself, to see whether I could beat my previous best performance of 78% training accuracy and 69% validation accuracy. I am confident that BERT can help me achieve this!

Most of the data cleaning for this project was done in a different notebook, and the cleaned data was then loaded into this one. To begin, I import TensorFlow 2.0 into the notebook and verify that GPU hardware is attached with the code below.

In [6]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [7]:
# installing the ktrain library we will be using to train our bert model

! pip install ktrain

# installing dependencies for ktrain

! pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
! pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082

Collecting git+https://github.com/amaiya/eli5@tfkeras_0_10_1
  Cloning https://github.com/amaiya/eli5 (to revision tfkeras_0_10_1) to /tmp/pip-req-build-lzdhs9j1
  Running command git clone -q https://github.com/amaiya/eli5 /tmp/pip-req-build-lzdhs9j1
  Running command git checkout -b tfkeras_0_10_1 --track origin/tfkeras_0_10_1
  Switched to a new branch 'tfkeras_0_10_1'
  Branch 'tfkeras_0_10_1' set up to track remote branch 'tfkeras_0_10_1' from 'origin'.
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.10.1-py2.py3-none-any.whl size=106682 sha256=b5e5f9a3027f99d77871e3a7c7ac2e983a4bc27b45cc787bc7340d23af23728f
  Stored in directory: /tmp/pip-ephem-wheel-cache-gfwbms3w/wheels/51/59/0a/0f48442b8d209583a4453580938d7ba2270aca40edacee6d45
Successfully built eli5
Collecting git+https://github.com/amaiya/stellargraph@no_tf_dep_082
  Cloning https://github.com/amaiya/stellargraph (to revision no

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [0]:
# importing the ktrain library and pandas for preprocessing

import ktrain
from ktrain import text
import pandas as pd
from sklearn.model_selection import train_test_split

In [12]:
# loading data from csv file, this file was cleaned in another notebook
# datasets will be separated into a train and test, with just the text column and the category it belongs to

# load data
data = pd.read_csv("train_no_dummies.csv")

# splitting and shuffling train and test, test size 10%
train, test = train_test_split(data, test_size=0.1, shuffle=True, random_state=2)

print('size of training set: %s' % (len(train['text'])))
print('size of validation set: %s' % (len(test['text'])))

size of training set: 8619
size of validation set: 958
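
The two sizes printed above follow directly from the 10% split. As a quick arithmetic check, assuming the test size is rounded up and the remaining rows go to training (which matches the output here):

```python
import math

n_samples = 8619 + 958                 # total rows, from the sizes printed above
n_test = math.ceil(0.1 * n_samples)    # test_size=0.1, rounded up
n_train = n_samples - n_test           # everything else is used for training

print(n_samples, n_train, n_test)  # 9577 8619 958
```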


In [13]:
# viewing the first 5 rows of training data

train.head()

Unnamed: 0,text,labels
2521,The management of Mitiga Airport has confirmed...,2
4646,And in Venezuelaestán reappearing controladasc...,5
825,"Reportedly, transportation costs, accommodatio...",10
5196,"As for distance, 60% of households are less th...",2
8094,"[Seasonal calendar] Climate conditions, Hazard...",6


In [14]:
# splitting data into correct format for preprocessing
x_train = train.text
y_train = train.labels
x_test = test.text
y_test = test.labels

# viewing some training examples before preprocessing
x_train[:5]

2521    The management of Mitiga Airport has confirmed...
4646    And in Venezuelaestán reappearing controladasc...
825     Reportedly, transportation costs, accommodatio...
5196    As for distance, 60% of households are less th...
8094    [Seasonal calendar] Climate conditions, Hazard...
Name: text, dtype: object

In [15]:
text.print_text_classifiers()

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained word vectors [https://arxiv.org/abs/1712.09405]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]


In [16]:
# we are using a distilbert model, a quicker training bert model that still works great

# labels fall into one of 12 categories
categories = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# using ktrain's preprocessing function, I had to specify lang='en' (English) for the function to work
trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                          x_test=x_test, y_test=y_test,
                                          class_names=categories,
                                          preprocess_mode='distilbert',
                                          lang='en',
                                          maxlen=350)

task: text classification
preprocessing train...
language: en
train sequence lengths:
	mean : 63
	95percentile : 130
	99percentile : 185


preprocessing test...
language: en
test sequence lengths:
	mean : 63
	95percentile : 130
	99percentile : 181
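
These length statistics can guide the choice of `maxlen`: the value of 350 used here is well above the 99th percentile of 185. Below is a minimal sketch of how such percentiles might be computed. It counts whitespace-separated words as a stand-in for the DistilBERT tokenizer (an assumption; subword counts are usually higher), the `percentile` helper is illustrative, and the three texts are placeholders rather than contest data.

```python
import math


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted, non-empty list (0 < q <= 100)."""
    rank = max(1, math.ceil(q * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


# Placeholder texts for illustration only; in the notebook this would be x_train.
texts = [
    "short snippet about water access",
    "a somewhat longer snippet describing food prices in local markets",
    "health facilities report shortages",
]

# Approximate each text's length by its number of whitespace-separated words.
lengths = sorted(len(t.split()) for t in texts)

for q in (50, 95, 99):
    print(q, percentile(lengths, q))
```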


In [17]:
# defining model and wrapping in learner

model = text.text_classifier('distilbert', train_data=trn, preproc=preproc)

learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)

Is Multi-Label? False
maxlen is 350
done.


In [18]:
# viewing the layers of the model

model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  9997      
_________________________________________________________________
dropout_19 (Dropout)         multiple                  0         
Total params: 66,963,469
Trainable params: 66,963,469
Non-trainable params: 0
_________________________________________________________________
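
The parameter counts in this summary can be checked by hand. The sketch below assumes a hidden size of 768, which is consistent with the `pre_classifier` count, and works backwards from the `classifier` count to the number of output units:

```python
hidden = 768  # assumed hidden size, consistent with the pre_classifier count

# A Dense layer from 768 to 768 units: weights plus biases.
pre_classifier = hidden * hidden + hidden
assert pre_classifier == 590_592

# Each output unit of the classifier has 768 weights and 1 bias.
classifier_params = 9_997
n_outputs = classifier_params // (hidden + 1)
assert n_outputs * (hidden + 1) == classifier_params

total = 66_362_880 + pre_classifier + classifier_params
print(n_outputs, total)  # 13 66963469
```

The classification head therefore appears to have 13 output units rather than 12. One plausible reading is that the integer labels 1 to 12 were one-hot encoded with an unused slot for index 0, but the notebook itself does not confirm this.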


In [19]:
# fitting model to data

learner.fit_onecycle(3e-5, 1)



begin training using onecycle policy with max lr of 3e-05...
Train for 1437 steps, validate for 160 steps


<tensorflow.python.keras.callbacks.History at 0x7feef605feb8>
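
The step counts in the log follow from the dataset sizes and the batch size of 6 passed to `get_learner`. A quick check, assuming the last partial batch counts as a full step:

```python
import math

batch_size = 6

train_steps = math.ceil(8619 / batch_size)  # training set size from the split
val_steps = math.ceil(958 / batch_size)     # validation set size from the split

print(train_steps, val_steps)  # 1437 160
```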

That was a great first epoch! The model is already much more accurate than my previous best, with validation accuracy at 78.7% compared with 69% before. Interestingly, training accuracy after one epoch is lower than validation accuracy (69% versus 79%). This is common early in training: training accuracy is averaged over the whole epoch while the model is still improving and dropout is active, whereas validation accuracy is measured at the end of the epoch with dropout turned off. There is still plenty of room for improvement on the training set, so we will train for two more epochs.

In [20]:
learner.fit_onecycle(3e-5, 2)



begin training using onecycle policy with max lr of 3e-05...
Train for 1437 steps, validate for 160 steps
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7feea8220710>
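
The training log mentions a "onecycle policy with max lr of 3e-05". As a rough illustration of the idea, the sketch below ramps the learning rate linearly up to the maximum over the first half of training and back down over the second half. The function name `one_cycle_lr`, the 50/50 split, and the starting rate of `max_lr / 10` are assumptions made for this example; ktrain's actual schedule may differ in its details.

```python
def one_cycle_lr(step, total_steps, max_lr=3e-5, start_div=10.0):
    """Triangular one-cycle schedule (illustrative, not ktrain's exact code)."""
    min_lr = max_lr / start_div
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # ramp up from min_lr to max_lr
    else:
        frac = (total_steps - step) / half  # ramp back down to min_lr
    return min_lr + frac * (max_lr - min_lr)


total = 1437  # steps in one epoch, taken from the training log
for s in (0, total // 2, total):
    print(s, one_cycle_lr(s, total))
```

Starting low and peaking mid-run lets the pretrained weights adjust gently at first, which is one reason this kind of schedule is popular for fine-tuning.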

## Final Thoughts

After two more epochs, training accuracy improved dramatically to 89%, while validation accuracy improved only slightly, to around 80.4%. More training might yield further small gains on the validation data, but the widening gap between training and validation accuracy suggests that further training risks overfitting. There is more that could be done with this project, including hyperparameter tuning and trying other BERT variants, but for me this is a good stopping point. I am extremely happy with the improvements that the mighty BERT model helped me achieve.

## Predictions On New Data

I will see how the model performs on sample text, a real example from the DEEP platform.

In [22]:
predictor = ktrain.get_predictor(model, preproc)

predictor.predict("Data revealed that the majority (35 out of 46) of the restaurants and tea shops are cooking by using wood fuel in mud stove, while 16 of them are using LPG and LPG Stove")

9

The model predicted category 9: Nutrition as the most likely label. This is a very reasonable label for a post about cooking methods. If we were applying multiple labels, 4: Food or even 5: Health would also be good choices. In reality, the DEEP system does apply multiple labels, but for the purposes of this competition each text has only one.
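
To illustrate the multi-label idea, the sketch below selects every category whose predicted probability clears a threshold, instead of only the single most likely one. The probability values are made up for illustration and are not output from this model; the function name `labels_above` and the 0.2 threshold are likewise assumptions.

```python
def labels_above(probs, class_ids, threshold=0.2):
    """Return class ids whose probability is at least `threshold`, most likely first."""
    ranked = sorted(zip(class_ids, probs), key=lambda pair: pair[1], reverse=True)
    return [cid for cid, p in ranked if p >= threshold]


class_ids = list(range(1, 13))

# Hypothetical probabilities, one per category (illustration only).
example_probs = [0.01, 0.02, 0.01, 0.30, 0.22, 0.03,
                 0.01, 0.04, 0.33, 0.01, 0.01, 0.01]

print(labels_above(example_probs, class_ids))  # [9, 4, 5]
```

With a trained model, the probabilities would come from the predictor rather than a hand-written list; ktrain's `predict` method has a `return_proba` option for this, though the exact shape of its output should be checked against the installed version.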

Thank you BERT!