## NLP Text Categorization Competition

In this project I use a BERT model to classify text data, as part of the AIcrowd NLP competition.

I use the ktrain library, which works with TensorFlow 2.0. I found this high-level library extremely useful for quickly building a strong BERT model with state-of-the-art prediction capabilities. This project was done using Amazon SageMaker on an Amazon Web Services GPU server.

### The Contest: Transfer Learning for International Crisis Response

The purpose of this contest is to improve the text classification algorithms used within the DEEP humanitarian information platform. To quote their website:

"The aim of the platform is to provide insights from years of historical and in-crisis humanitarian text data. The platform allows users to upload documents and classify text snippets according to predefined humanitarian target labels, grouped into and referred to as analytical frameworks. DEEP is now successfully functional in several international humanitarian organizations and the United Nations across the globe."

Currently, multiple organizations use the DEEP platform, and each has its own classification algorithm. These algorithms vary in how well they perform, and none is trained on data from the other organizations. The aim of this project is to build a unified algorithm that both performs better and generalizes well enough to work across all organizations.

### The Categories: 12 Different Labels

(1) Agriculture

(2) Cross: short for Cross-sectoral; areas of humanitarian response that require action in more than one sector. For example, malnutrition requires humanitarian interventions in health, access to food, access to basic hygiene items and clean water, and access to non-food items such as bottles to feed infants.

(3) Education

(4) Food

(5) Health

(6) Livelihood: Access to employment and income

(7) Logistics: Any logistical support needed to carry out humanitarian activities, e.g. air transport, satellite phone connection, etc.

(8) NFI: Non-food items needed in daily life, such as bedding, mattresses, jerrycans, and coal or oil for heating

(9) Nutrition

(10) Protection

(11) Shelter

(12) WASH (Water, Sanitation and Hygiene)
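
Since the model works with numeric label ids, a small lookup table makes predictions easier to read. This is a minimal sketch built from the category list above; `LABEL_NAMES` and `label_name` are hypothetical names of my own and not part of ktrain or the competition code.

```python
# Label ids and names, copied from the competition's category list above.
LABEL_NAMES = {
    1: "Agriculture",
    2: "Cross",
    3: "Education",
    4: "Food",
    5: "Health",
    6: "Livelihood",
    7: "Logistics",
    8: "NFI",
    9: "Nutrition",
    10: "Protection",
    11: "Shelter",
    12: "WASH",
}


def label_name(label_id):
    """Return the category name for a numeric label id (hypothetical helper)."""
    return LABEL_NAMES[int(label_id)]


print(label_name(9))  # Nutrition
```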

Although the contest had already finished by the time I wrote this, I never got the chance to try a BERT model in it. I built this notebook as a challenge to myself, to see if I could beat my previous best performance of 78% training accuracy and 69% validation accuracy. I am confident that BERT can help me achieve this!

To begin, I need to load a TensorFlow 1.14 or standard Python 3 SageMaker notebook, then use pip to upgrade to TensorFlow 2. In other notebooks I also tried loading TensorFlow 2.0 via a Docker image, but the method below worked best for me. A major reason I am using Amazon Web Services for this project is access to their powerful graphics processing units (GPUs), which I do not have on my personal computer.

The data cleaning was done in a different notebook, and the cleaned data was then loaded into this one.

In [2]:
# upgrading pip and setuptools to ensure everything installs smoothly

! pip install --upgrade pip
! pip install --upgrade setuptools

# install the ktrain library we will be using to train our bert model

! pip3 install ktrain

# installing dependencies for ktrain

! pip3 install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
! pip3 install git+https://github.com/amaiya/stellargraph@no_tf_dep_082

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (20.0.1)
Requirement already up-to-date: setuptools in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (45.1.0)


Collecting git+https://github.com/amaiya/eli5@tfkeras_0_10_1
  Cloning https://github.com/amaiya/eli5 (to revision tfkeras_0_10_1) to /tmp/pip-req-build-7lnx5cdq
  Running command git clone -q https://github.com/amaiya/eli5 /tmp/pip-req-build-7lnx5cdq
  Running command git checkout -b tfkeras_0_10_1 --track origin/tfkeras_0_10_1
  Switched to a new branch 'tfkeras_0_10_1'
  Branch tfkeras_0_10_1 set up to track remote branch tfkeras_0_10_1 from origin.
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.10.1-py2.py3-none-any.whl size=105885 sha256=ab56c6adef0bfec8c935a95aa34fe99ffcbd493a5ffa050d28279c8d9401ad2e
  Stored in directory: /tmp/pip-ephem-wheel-cache-qq1znpad/wheels/93/23/c2/479f99e6e981887ac70af72d4ff763471acf7184d1b80a9268
Successfully built eli5
Collecting git+https://github.com/amaiya/stellargraph@no_tf_dep_082
  Cloning https://github.com/amaiya/stellargraph (to revision no_tf_de

Building wheels for collected packages: stellargraph
  Building wheel for stellargraph (setup.py) ... done
  Created wheel for stellargraph: filename=stellargraph-0.8.2-py3-none-any.whl size=141652 sha256=383f5a1ac37ce433a30bcff0a9b285f3cebaa91eedea4164068211319326b162
  Stored in directory: /tmp/pip-ephem-wheel-cache-2ktksw5p/wheels/9d/18/94/4155a499aea4ffd22b73a1983d594d4f4118e107068a30142f
Successfully built stellargraph


In [3]:
# upgrade to the latest TensorFlow 2.x GPU build (2.1.0 when this was run)

! pip3 install --user --upgrade tensorflow-gpu

Requirement already up-to-date: tensorflow-gpu in /home/ec2-user/.local/lib/python3.6/site-packages (2.1.0)


In [4]:
import tensorflow as tf

print(tf.__version__)

2.1.0


In [5]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
# number GPUs in PCI bus order and expose only the first one to TensorFlow
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [6]:
# importing the ktrain library and pandas for preprocessing

import ktrain
from ktrain import text
import pandas as pd
from sklearn.model_selection import train_test_split

using Keras version: 2.2.4-tf


In [7]:
# load the data and split it into train and test sets,
# keeping just the text column and the category it belongs to

# labels fall into one of 12 categories
categories = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# load data
data = pd.read_csv("train_no_dummies.csv")

# shuffle and split into train and test sets; the 20% test split is used as the validation set
train, test = train_test_split(data, test_size=0.2, shuffle=True, random_state=1)

print('size of training set: %s' % (len(train['text'])))
print('size of validation set: %s' % (len(test['text'])))

size of training set: 7661
size of validation set: 1916
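
With 12 categories, it helps to know how balanced the labels are, because the accuracy of always predicting the most common class gives a floor for judging any model. The sketch below computes that baseline. `majority_baseline` is a hypothetical helper, and the toy labels are made up for illustration rather than taken from the competition data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels for illustration only; not drawn from the competition data.
toy_labels = [5, 5, 5, 10, 10, 2, 9, 4]
label, baseline = majority_baseline(toy_labels)
print(label, baseline)  # 5 0.375
```

In this notebook the same function could be called as `majority_baseline(train['labels'].tolist())`.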


In [8]:
# first 5 rows of training data

train.head()

| Unnamed: 0 | text | labels |
|---|---|---|
| 7072 | The areas in Badakhshan province, where the la... | 2 |
| 9561 | Nutrition: Children under 5 years with SAM adm... | 9 |
| 5552 | In Cúcuta, Jorgenys started selling candy in t... | 10 |
| 3259 | Medical professionals in Derna reported four p... | 5 |
| 6202 | [Seasonal Calendar]On health issues, the Baran... | 5 |


In [9]:
# splitting data into correct format for preprocessing
x_train = train.text
y_train = train.labels
x_test = test.text
y_test = test.labels

# verify correct format
x_train.dtype

dtype('O')

In [10]:
# viewing some training examples before preprocessing

x_train[:5]

7072    The areas in Badakhshan province, where the la...
9561    Nutrition: Children under 5 years with SAM adm...
5552    In Cúcuta, Jorgenys started selling candy in t...
3259    Medical professionals in Derna reported four p...
6202    [Seasonal Calendar]On health issues, the Baran...
Name: text, dtype: object

In [11]:
# list the text classifiers available in ktrain before building a model

text.print_text_classifiers()

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained word vectors [https://arxiv.org/abs/1712.09405]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]


In [12]:
# using ktrain's preprocessing function; I had to specify lang='en' (English) for it to work
# we are using DistilBERT, a smaller BERT variant that trains faster and still performs well

trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                          x_test=x_test, y_test=y_test,
                                          class_names=categories,
                                          preprocess_mode='distilbert',
                                          lang='en',
                                          maxlen=350)

preprocessing train...
language: en


preprocessing test...
language: en


In [13]:
model = text.text_classifier('distilbert', train_data=trn, preproc=preproc)

Is Multi-Label? False
maxlen is 350
done.


In [14]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)

In [15]:
# fitting model to data

learner.fit_onecycle(3e-5, 1)



begin training using onecycle policy with max lr of 3e-05...
Train for 1277 steps, validate for 320 steps


<tensorflow.python.keras.callbacks.History at 0x7fea601d0048>

That was a great first epoch! With validation accuracy at 80%, the model is already much more accurate than my previous best, which only reached 69%. The training accuracy is lower than the validation accuracy after one epoch. This looks surprising but is common: training accuracy is averaged over the whole epoch while the model is still improving (and with dropout active), whereas validation accuracy is measured at the end of the epoch. There is still room for improvement with continued training, so we will train for two more epochs.
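
The step counts in the training log above follow from the dataset sizes and the batch size of 6. The arithmetic below is a small check, assuming the final partial batch is kept, which is consistent with the log; `steps_per_epoch` is a hypothetical helper name.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to cover the data, keeping the last partial batch."""
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(7661, 6))  # 1277 training steps
print(steps_per_epoch(1916, 6))  # 320 validation steps
```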

In [16]:
learner.fit_onecycle(3e-5, 2)



begin training using onecycle policy with max lr of 3e-05...
Train for 1277 steps, validate for 320 steps
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fe9a45105f8>
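
Both training runs use `fit_onecycle`, which varies the learning rate over the course of training instead of holding it fixed at 3e-5. The sketch below shows the general idea with a simplified triangular schedule: the rate rises linearly to the maximum and then falls back. It is an illustration only and not ktrain's exact implementation; the function name `onecycle_lr`, the starting rate of `max_lr / 10`, and the symmetric shape are all my own assumptions.

```python
def onecycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Simplified triangular one-cycle learning rate at a given step.

    Assumption: the rate starts at max_lr / div_factor, climbs linearly to
    max_lr at the midpoint, then descends linearly back to the start value.
    """
    base_lr = max_lr / div_factor
    half = total_steps / 2
    if step <= half:
        frac = step / half
    else:
        frac = (total_steps - step) / half
    return base_lr + frac * (max_lr - base_lr)


total = 1277  # steps in one epoch, taken from the training log
for step in (0, total // 2, total):
    print(step, onecycle_lr(step, total, 3e-5))
```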

## Final Thoughts

It is interesting to see that after two more epochs, accuracy on the training dataset improved, but accuracy on the validation dataset stayed at 80%. This seems to be the best we can achieve with this model and this data: training accuracy keeps rising while validation accuracy stays flat, so any further training would risk overfitting. I am sure there is much more fine-tuning that could be done in this project, including hyperparameter tuning and trying other BERT model variations, but for me this is a good stopping point. I am extremely happy with the improvements that the mighty BERT model helped me achieve.
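
The stopping decision I made by eye can also be written as a simple early-stopping rule: stop once validation accuracy has failed to improve for a set number of epochs. This is a minimal sketch with a hypothetical helper, `best_epoch`, and made-up scores that are not the results from this notebook.

```python
def best_epoch(val_scores, patience=1, min_delta=0.0):
    """Return the 1-based epoch with the best validation score.

    Scanning stops once `patience` consecutive epochs fail to beat the
    best score so far by more than `min_delta`.
    """
    best_idx, best = 0, val_scores[0]
    waited = 0
    for i, score in enumerate(val_scores[1:], start=1):
        if score > best + min_delta:
            best_idx, best, waited = i, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx + 1


# Made-up validation accuracies for illustration only.
print(best_epoch([0.60, 0.65, 0.65, 0.64]))  # 2
```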

## Predictions On New Data

I will see how the model performs on sample text, a real example from the DEEP platform.

In [25]:
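# the cell that built `predictor` is not shown; this assumes ktrain's standard get_predictor call
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict("Data revealed that the majority (35 out of 46) of the restaurants and tea shops are cooking by using wood fuel in mud stove, while 16 of them are using LPG and LPG Stove")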

9

The model assigned the highest probability to category 9: Nutrition. This is a very reasonable label for a post about cooking methods. However, if we were applying multiple labels, then 4: Food or even 5: Health would also be good choices. In reality, the DEEP system does apply multiple labels, but it does not for the purposes of this competition.
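
If multiple labels were allowed, one simple approach would be to keep the few categories with the highest predicted probability rather than only the top one. The sketch below ranks a probability vector against the 12 label ids. `top_k_labels` is a hypothetical helper, and the probabilities are made up for illustration, not real model output. In practice the vector would come from the predictor (ktrain's text predictor has a `predict_proba` method, though I did not run it here).

```python
def top_k_labels(probs, class_names, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


categories = list(range(1, 13))
# Made-up probabilities for illustration only; not real model output.
toy_probs = [0.01, 0.02, 0.01, 0.25, 0.15, 0.03, 0.01, 0.05, 0.40, 0.02, 0.03, 0.02]
print(top_k_labels(toy_probs, categories))  # [(9, 0.4), (4, 0.25), (5, 0.15)]
```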

Thank you BERT!