## NLP Text Categorization Competition

In this project I use a BERT model to classify text data, as part of the AIcrowd NLP competition.

I use the ktrain library, which works with TensorFlow 2.0. I found this high-level library extremely useful for quickly building a strong BERT model with state-of-the-art prediction capabilities. This project was done using Amazon SageMaker on an Amazon Web Services GPU server.

### The Contest: Transfer Learning for International Crisis Response

The purpose of this contest is to improve the text classification algorithms used within the DEEP humanitarian information platform. To quote their website:

"The aim of the platform is to provide insights from years of historical and in-crisis humanitarian text data. The platform allows users to upload documents and classify text snippets according to predefined humanitarian target labels, grouped into and referred to as analytical frameworks. DEEP is now successfully functional in several international humanitarian organizations and the United Nations across the globe."

Currently, multiple organizations use the DEEP platform, and each has its own classification algorithm. These algorithms vary in how well they perform, and none is trained on data from the other organizations. The aim of this project is to build a unified algorithm that both performs better and generalizes well enough to work across all organizations.

### The Categories: 12 Different Labels

(1) Agriculture

(2) Cross: short for Cross-sectoral, meaning areas of humanitarian response that require action in more than one sector. For example, malnutrition requires humanitarian interventions in health, access to food, access to basic hygiene items and clean water, and access to non-food items such as bottles to feed infants.

(3) Education

(4) Food

(5) Health

(6) Livelihood: Access to employment and income

(7) Logistics: Any logistical support needed to carry out humanitarian activities, e.g. air transport, satellite phone connection, etc.

(8) NFI: Non-food items needed in daily life, such as bedding, mattresses, jerrycans, coal or oil for heating

(9) Nutrition

(10) Protection

(11) Shelter

(12) WASH (Water, Sanitation and Hygiene)
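Since the data stores these categories as integers, a small lookup table makes labels and predictions easier to read. This is only a sketch: it assumes the label IDs follow the numbering of the list above, and the names `CATEGORY_NAMES` and `label_name` are illustrative, not part of the competition code.

```python
# Illustrative lookup from label ID to category name.
# Assumption: the IDs match the numbering used in the category list.
CATEGORY_NAMES = {
    1: "Agriculture",
    2: "Cross",
    3: "Education",
    4: "Food",
    5: "Health",
    6: "Livelihood",
    7: "Logistics",
    8: "NFI",
    9: "Nutrition",
    10: "Protection",
    11: "Shelter",
    12: "WASH",
}


def label_name(label_id):
    """Return the category name for a label ID, or 'Unknown' if it is not listed."""
    return CATEGORY_NAMES.get(int(label_id), "Unknown")


print(label_name(9))
```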

Although the contest had already finished by the time I wrote this, I never got the chance to try a BERT model while it was running. I built this notebook as a challenge to myself, to see if I could beat my previous best performance of 78% training accuracy and 69% validation accuracy. I am confident that BERT can help me achieve this!

To begin, I need to start from a TensorFlow 1.14 or standard Python 3 SageMaker notebook, then use pip to upgrade to TensorFlow 2.0. In other notebooks I also tried loading TensorFlow 2.0 via a Docker image, but the method below worked best for me. A major reason I am using Amazon Web Services for this project is access to their powerful graphics processing units (GPUs), which I do not have on my personal computer.

The data cleaning was done in a different notebook, and the cleaned data was then loaded into this one.

In [1]:
# upgrading pip and setuptools to ensure everything installs smoothly

! pip install --upgrade pip
! pip install --upgrade setuptools

# installing the ktrain library we will be using to train our bert model

! pip install ktrain

# installing dependencies for ktrain

! pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
! pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082

Collecting pip
  Using cached https://files.pythonhosted.org/packages/54/0c/d01aa759fdc501a58f431eb594a17495f15b88da142ce14b5845662c13f3/pip-20.0.2-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
Successfully installed pip-20.0.2
Collecting setuptools
  Using cached setuptools-45.1.0-py3-none-any.whl (583 kB)
ERROR: tensorflow-serving-api 1.15.0 requires tensorflow~=1.15.0, which is not installed.
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 41.6.0
    Uninstalling setuptools-41.6.0:
      Successfully uninstalled setuptools-41.6.0
Successfully installed setuptools-45.1.0
Processing /home/ec2-user/.cache/pip/wheels/3d/e7/4b/6f839c9443003c6fabfc286d29228050d8e2e9b0d1d50399d0/ktrain-0.8.3-cp36-none-any.whl
Processing /home/ec2-user/.cache/pip/wheels/63/dc/87/3260cb91f3aa32c0f85c5375429

Processing /home/ec2-user/.cache/pip/wheels/59/b1/91/f02e76c732915c4015ab4010f3015469866c1eb9b14058d8e7/dill-0.3.1.1-cp36-none-any.whl
Collecting tensorflow-metadata
  Using cached tensorflow_metadata-0.21.0-py2.py3-none-any.whl (30 kB)
Processing /home/ec2-user/.cache/pip/wheels/6e/9c/ed/4499c9865ac1002697793e0ae05ba6be33553d098f3347fb94/future-0.18.2-py3-none-any.whl
Processing /home/ec2-user/.cache/pip/wheels/bb/df/3f/81b36f41b66e6a9cd69224c70a737de2bb6b2f7feb3272c25e/keras_multi_head-0.22.0-cp36-none-any.whl
Processing /home/ec2-user/.cache/pip/wheels/d1/bc/b1/b0c45cee4ca2e6c86586b0218ffafe7f0703c6d07fdf049866/keras_embed_sim-0.7.0-cp36-none-any.whl
Processing /home/ec2-user/.cache/pip/wheels/5b/a1/a0/ce6b1d49ba1a9a76f592e70cf297b05c96bc9f418146761032/keras_pos_embd-0.11.0-cp36-none-any.whl
Processing /home/ec2-user/.cache/pip/wheels/54/80/22/a638a7d406fd155e507aa33d703e3fa2612b9eb7bb4f4fe667/keras_layer_normalization-0.14.0-cp36-none-any.whl
Processing /home/ec2-user/.cache/pip/wh

Building wheels for collected packages: stellargraph
  Building wheel for stellargraph (setup.py) ... done
  Created wheel for stellargraph: filename=stellargraph-0.8.2-py3-none-any.whl size=141652 sha256=6f91834a123d945b821c4abda513aa91afa101b59a91605f163be867567ca661
  Stored in directory: /tmp/pip-ephem-wheel-cache-mwnjipr6/wheels/9d/18/94/4155a499aea4ffd22b73a1983d594d4f4118e107068a30142f
Successfully built stellargraph
Installing collected packages: stellargraph
Successfully installed stellargraph-0.8.2


In [2]:
# upgrade to Tensorflow 2.0 for gpu

! pip install --user --upgrade tensorflow-gpu

Requirement already up-to-date: tensorflow-gpu in /home/ec2-user/.local/lib/python3.6/site-packages (2.1.0)


In [3]:
import tensorflow as tf

print(tf.__version__)

2.1.0


In [4]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
# number CUDA devices in PCI bus order and expose only the first GPU
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [5]:
# importing the ktrain library and pandas for preprocessing

import ktrain
from ktrain import text
import pandas as pd
from sklearn.model_selection import train_test_split

using Keras version: 2.2.4-tf


In [6]:
# loading data from csv file, this file was cleaned in another notebook
# datasets will be separated into a train and test, with just the text column and the category it belongs to

# load data
data = pd.read_csv("train_no_dummies.csv")

# splitting and shuffling train and test, test size 10%
train, test = train_test_split(data, test_size=0.1, shuffle=True, random_state=2)

print('size of training set: %s' % (len(train['text'])))
print('size of validation set: %s' % (len(test['text'])))

size of training set: 8619
size of validation set: 958
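The two sizes follow directly from the 10% hold-out fraction. As a quick sanity check, the sketch below reproduces them with plain arithmetic. It assumes the hold-out size is rounded up and the remainder goes to training, which is my understanding of how scikit-learn treats a fractional `test_size`; the helper name `split_sizes` is made up for this example.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test) for a fractional hold-out.

    Assumption: the hold-out count is rounded up and the rest is used for training.
    """
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test


total_rows = 8619 + 958  # rows in the cleaned CSV, per the output above
print(split_sizes(total_rows, 0.1))
```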


In [7]:
# viewing the first 5 rows of training data

train.head()

Unnamed: 0,text,labels
2521,The management of Mitiga Airport has confirmed...,2
4646,And in Venezuelaestán reappearing controladasc...,5
825,"Reportedly, transportation costs, accommodatio...",10
5196,"As for distance, 60% of households are less th...",2
8094,"[Seasonal calendar] Climate conditions, Hazard...",6


In [8]:
# splitting data into correct format for preprocessing
x_train = train.text
y_train = train.labels
x_test = test.text
y_test = test.labels

# viewing some training examples before preprocessing
x_train[:5]

2521    The management of Mitiga Airport has confirmed...
4646    And in Venezuelaestán reappearing controladasc...
825     Reportedly, transportation costs, accommodatio...
5196    As for distance, 60% of households are less th...
8094    [Seasonal calendar] Climate conditions, Hazard...
Name: text, dtype: object

In [9]:
text.print_text_classifiers()

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained word vectors [https://arxiv.org/abs/1712.09405]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]


In [10]:
# we are using DistilBERT, a smaller BERT variant that trains faster and still performs well

# labels fall into one of 12 categories
categories = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# using ktrain's preprocessing function, I had to specify lang='en' (English) for the function to work
trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                          x_test=x_test, y_test=y_test,
                                          class_names=categories,
                                          preprocess_mode='distilbert',
                                          lang='en',
                                          maxlen=350)

preprocessing train...
language: en


preprocessing test...
language: en


In [11]:
# defining model and wrapping in learner

model = text.text_classifier('distilbert', train_data=trn, preproc=preproc)

learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)

Is Multi-Label? False
maxlen is 350
done.


In [12]:
# viewing the layers of the model

model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  9997      
_________________________________________________________________
dropout_19 (Dropout)         multiple                  0         
Total params: 66,963,469
Trainable params: 66,963,469
Non-trainable params: 0
_________________________________________________________________
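The parameter counts in the summary can be cross-checked by hand: a dense layer with n inputs and m units has n × m weights plus m biases. The sketch below assumes a hidden size of 768 for the base DistilBERT model; the helper names are illustrative. Working backwards from 9,997 parameters suggests the classifier has 13 output units, one more than the 12 categories. A plausible reading is that integer labels 1 to 12 are treated as class indices 0 to 12, but that is an inference and the output above does not confirm it.

```python
HIDDEN_SIZE = 768  # assumption: hidden width of the base DistilBERT model


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def units_from_params(n_params, n_inputs):
    """Recover the number of units of a dense layer from its parameter count."""
    units, remainder = divmod(n_params, n_inputs + 1)
    if remainder:
        raise ValueError("parameter count does not fit a dense layer of this width")
    return units


print(dense_params(HIDDEN_SIZE, HIDDEN_SIZE))   # pre_classifier layer
print(units_from_params(9997, HIDDEN_SIZE))     # output units of the classifier layer
print(66362880 + 590592 + 9997)                 # total across the three layers
```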


In [13]:
# fitting model to data

learner.fit_onecycle(3e-5, 1)



begin training using onecycle policy with max lr of 3e-05...
Train for 1437 steps, validate for 160 steps


<tensorflow.python.keras.callbacks.History at 0x7fb18c0e5048>
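The step counts in the log come from the dataset sizes and the batch size of 6 passed to `get_learner`: each epoch needs enough batches to cover every sample, with the last batch possibly smaller. A minimal check, using a hypothetical helper `steps_per_epoch`:

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


print(steps_per_epoch(8619, 6))  # training set
print(steps_per_epoch(958, 6))   # validation set
```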

That was a great first epoch! Validation accuracy is already 78.5%, well above the 69% of my previous best model. Surprisingly, training accuracy is lower than validation accuracy after one epoch (69% on training versus about 79% on validation). A likely explanation is that the training figure is averaged over the whole epoch, while the weights are still improving and dropout is active, whereas validation is measured at the end of the epoch with dropout turned off. There is still plenty of room to improve the training accuracy, so we will train for two more epochs.

In [14]:
learner.fit_onecycle(3e-5, 2)



begin training using onecycle policy with max lr of 3e-05...
Train for 1437 steps, validate for 160 steps
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fb0c252a4e0>
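For intuition about what a one-cycle policy does with the maximum learning rate of 3e-5, here is a generic triangular schedule: the rate ramps up linearly to the maximum at the midpoint of training, then back down. This is a simplified sketch, not ktrain's actual implementation; the starting rate of one tenth of the maximum (`div_factor=10`) and the function name `one_cycle_lr` are assumptions.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: max_lr / div_factor -> max_lr -> max_lr / div_factor."""
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps] with total_steps > 0")
    base_lr = max_lr / div_factor
    half = total_steps / 2
    # 1.0 at the start and end of training, 0.0 at the midpoint (the peak)
    distance_from_peak = abs(step - half) / half
    return max_lr - (max_lr - base_lr) * distance_from_peak


total_steps = 1437 * 2  # two epochs of 1437 steps, as in the log above
for step in (0, total_steps // 4, total_steps // 2, total_steps):
    print(step, one_cycle_lr(step, total_steps, 3e-5))
```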

## Final Thoughts

It is interesting to see that after two more epochs, accuracy on the training set improved dramatically, while validation accuracy dropped after the second epoch and then recovered slightly in the third, reaching 80.6%. This may be close to the best we can achieve with this model and this dataset; any further training risks overfitting. I am sure much more fine-tuning could be done on this project, including hyperparameter tuning and trying other BERT variants, but for me this is a good stopping point. I am extremely happy with the improvements that the mighty BERT model helped me achieve.

## Predictions On New Data

Let's see how the model performs on a sample text, a real example from the DEEP platform.

In [16]:
predictor = ktrain.get_predictor(model, preproc)

predictor.predict("Data revealed that the majority (35 out of 46) of the restaurants and tea shops are cooking by using wood fuel in mud stove, while 16 of them are using LPG and LPG Stove")

9

The model's highest-probability category is 9: Nutrition. This is a very reasonable label for a passage about cooking methods. If we were applying multiple labels, 4: Food or even 5: Health would also be good choices. In practice, the DEEP system does apply multiple labels, but this competition uses a single label per text.
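If the model's class probabilities are available, one simple way to approximate multiple labels is to keep every class above a threshold instead of only the top one. I believe ktrain's predictor can return probabilities through a `return_proba=True` argument, but check the documentation for your version. In the sketch below the helper `top_labels`, the threshold, and especially the probability vector are all made up for illustration; the numbers are not output from the trained model.

```python
def top_labels(probabilities, class_names, threshold=0.2, max_labels=3):
    """Return (label, probability) pairs at or above the threshold, highest first."""
    ranked = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)
    return [(label, prob) for label, prob in ranked if prob >= threshold][:max_labels]


class_names = list(range(1, 13))

# Invented probabilities for illustration only, NOT real model output.
example_probs = [0.01, 0.02, 0.01, 0.25, 0.22, 0.02, 0.01, 0.05, 0.35, 0.02, 0.02, 0.02]

print(top_labels(example_probs, class_names))
```

Raising the threshold toward 1.0 returns fewer labels, and a single-label prediction is the special case `max_labels=1` with `threshold=0`.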

Thank you BERT!