## NLP Disaster Tweets Kaggle Competition
Notebook by Kea Kohv, 2023

### Introduction

The Kaggle competition "Natural Language Processing with Disaster Tweets" provides a dataset of tweets labeled as being about a disaster or not. The aim of the competition is to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. In short, it is a binary text classification task.

The contents of this notebook are the following:
1. Importing libraries
2. Reading in the data
3. Training set overview and quality assessment
4. Test set overview and quality assessment
5. Data clean-up
6. Modelling: A simple SGDClassifier
7. Modelling: Simple Neural Network with BERT encodings
8. Modelling: fine-tuning DistilBERT
9. Conclusions and thoughts of what else could be done

### Importing libraries

In [1]:
# The following code was used with Python version 3.10.4

# Import the necessary libraries

import pandas as pd # for analyzing, cleaning, exploring, and manipulating data
import numpy as np # for working with arrays
from datasets import Dataset # for data pre-processing
import pickle # to store objects into file

# sklearn for building ML models, cross-validation, data splitting, data pre-processing
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# tensorflow for building ML models
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import optimizers, losses, activations

# transformers for DistilBERT model
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

# data clean-up methods for this specific task
from clean_data import clean_data

### Read in the data

In [15]:
# Read in the data
df_train = pd.read_csv('data/train.csv', dtype={'id': np.int16, 'target': np.int8})
df_test = pd.read_csv('data/test.csv', dtype={'id': np.int16})

In [4]:
# Display data shapes
df_train.shape, df_test.shape

((7613, 5), (3263, 4))

### Training set overview and quality assessment

In [5]:
# A look at the first rows in the training set
df_train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [6]:
# Display dataframe info, incl non-null value count and data types
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int16 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int8  
dtypes: int16(1), int8(1), object(3)
memory usage: 200.9+ KB


In [7]:
# See the % of null values
print("Overview of null values in the training set as a percentage")
df_train.isnull().sum() * 100 / len(df_train)

Overview of null values in the training set as a percentage


id           0.000000
keyword      0.801261
location    33.272035
text         0.000000
target       0.000000
dtype: float64

In [8]:
# See the unique values count
print("Overview of unique values in the training set")
df_train.nunique()

Overview of unique values in the training set


id          7613
keyword      221
location    3341
text        7503
target         2
dtype: int64

In [9]:
# See if there are any duplicates
print("Duplicated rows in the training set:")
df_train.duplicated().sum()

Duplicated rows in the training set:


0

In [10]:
# Take a look at target class balance
print('Target class balance assessment with 0 (no disaster) and 1 (disaster) value counts:')
print(df_train['target'].value_counts())
print()
print('Target class distribution in percentages:')
print(df_train['target'].value_counts() *100 / len(df_train))

Target class balance assessment with 0 (no disaster) and 1 (disaster) value counts:
0    4342
1    3271
Name: target, dtype: int64

Target class distribution in percentages:
0    57.034021
1    42.965979
Name: target, dtype: float64


As can be seen, there are more samples with class 0 (no disaster, about 57%) than with class 1 (disaster, about 43%), so the dataset is only mildly imbalanced.
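To put later scores in context, the class counts above imply two trivial baselines. The following is a small sketch that uses only the two counts printed above. The baseline framing is an addition rather than part of the original analysis, and it describes the training set only, since the label balance of the test set is not known.

```python
# Class counts copied from the value_counts() output above
n_negative = 4342  # target == 0 (no disaster)
n_positive = 3271  # target == 1 (disaster)
n_total = n_negative + n_positive

# Baseline 1: always predict the majority class (0)
majority_accuracy = n_negative / n_total

# Baseline 2: always predict the positive class (1).
# Every real positive is found (recall = 1), but precision equals the positive rate.
precision = n_positive / n_total
recall = 1.0
always_positive_f1 = 2 * precision * recall / (precision + recall)

print(f"Majority-class accuracy: {majority_accuracy:.3f}")
print(f"Always-positive F1:      {always_positive_f1:.3f}")
```

Any model worth keeping should beat both numbers on held-out data.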

### Test set overview and quality assessment

In [11]:
# Display first rows of the testset
df_test.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |


In [12]:
# Display testset info, incl non-null value count and data types
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int16 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int16(1), object(3)
memory usage: 83.0+ KB


In [13]:
# Look at the percentage of null values
# As can be seen, the text field has no null values, so the models below use only that field
print("Overview of null values in the test set as a percentage")
df_test.isnull().sum() * 100 / len(df_test)

Overview of null values in the test set as a percentage


id           0.000000
keyword      0.796813
location    33.864542
text         0.000000
dtype: float64

In [14]:
# Display unique values count
print("Overview of unique values in the test set")
df_test.nunique()

Overview of unique values in the test set


id          3263
keyword      221
location    1602
text        3243
dtype: int64

In [None]:
# See if there are any duplicates
print("Duplicated rows in the test set:")
df_test.duplicated().sum()

Duplicated rows in the test set:


0

### Data clean-up

In [7]:
# Data clean-up can take around 10 minutes.
# The alternative is to load the already cleaned data from file, see below.
df_train['text_cleaned'] = df_train['text'].apply(clean_data)
df_test['text_cleaned'] = df_test['text'].apply(clean_data)

In [9]:
# Save cleaned dataframes to file
df_train.to_csv('data/train_cleaned.csv',index=False)
df_test.to_csv('data/test_cleaned.csv',index=False)

In [17]:
# Load cleaned dataframes from file
df_train = pd.read_csv('data/train_cleaned.csv')
df_test = pd.read_csv('data/test_cleaned.csv')

In [12]:
# An example of data cleaning
print('An example of cleaned text')
print('Original: ', df_train['text'][100])
print('Cleaned: ', df_train['text_cleaned'][100])

An example of cleaned text
Original:  .@NorwayMFA #Bahrain police had previously died in a road accident they were not killed by explosion https://t.co/gFJfgTodad
Cleaned:  NorwayMFA Bahrain police had previously died in a road accident they were not killed by explosion 
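The `clean_data` function comes from a local module that is not shown in this notebook, so its actual steps are unknown. As a rough illustration only, the hypothetical function below (the name `clean_tweet_sketch` and its regular expressions are assumptions) removes URLs and non-alphanumeric symbols, which is enough to reproduce the example above. The real function probably does more, given that it takes around 10 minutes to run.

```python
import re

def clean_tweet_sketch(text: str) -> str:
    """Hypothetical stand-in for clean_data: strips URLs and symbols."""
    text = re.sub(r"https?://\S+", "", text)     # drop URLs
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)   # drop '@', '#', '.' and other punctuation
    text = re.sub(r"[ \t]+", " ", text)          # collapse repeated spaces
    return text

original = (".@NorwayMFA #Bahrain police had previously died in a road accident "
            "they were not killed by explosion https://t.co/gFJfgTodad")
print(repr(clean_tweet_sketch(original)))
```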


In [13]:
# Drop keyword, location and text columns
df_train.drop(columns=['keyword','location','text'], inplace=True)
df_test.drop(columns=['keyword','location','text'], inplace=True)

### SGDClassifier

First, let's try a simple and fast SGDClassifier. It is a regularized linear model trained with stochastic gradient descent (SGD).
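To make the idea concrete, here is a minimal pure-Python sketch of stochastic gradient descent on the logistic loss with an L2 penalty. It is not scikit-learn's implementation: the function name `sgd_logistic`, the constant learning rate, and the toy data are all assumptions made for illustration, whereas `SGDClassifier` uses its own learning-rate schedule and works on sparse tf-idf features.

```python
import math
import random

def sgd_logistic(X, y, alpha=0.001, lr=0.1, epochs=50, seed=123):
    """Fit weights w and bias b by SGD on log loss + (alpha / 2) * ||w||^2."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                      # visit samples in random order
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted P(y = 1)
            err = p - y[i]                      # gradient of the log loss w.r.t. z
            # One-sample gradient step; alpha * wj is the L2 penalty term
            w = [wj - lr * (err * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

# Toy, linearly separable data: feature 0 signals class 1, feature 1 signals class 0
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
y = [1, 1, 0, 0]
w, b = sgd_logistic(X, y)
predictions = [int(sum(wj * xj for wj, xj in zip(w, row)) + b > 0) for row in X]
print(predictions)
```

The `alpha` argument plays the same role as `clf__alpha` in the grid below: larger values shrink the weights more strongly.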

In [20]:
# Create a pipeline
sgd = Pipeline([('vect', CountVectorizer()), # tokenization
                ('tfidf', TfidfTransformer()), # to tf-idf
                ('clf', SGDClassifier(random_state=123)), # classifier
               ])

# Grid that will be used to find the best hyperparameters during cross-validation
params = {
    "clf__loss" : ["hinge", "log_loss", "squared_hinge", "modified_huber"],
    "clf__alpha" : [0.0001, 0.001, 0.01, 0.1],
    "clf__penalty" : ["l2", "l1", "none"],
}

# Cross-validation hyperparameter optimization
search = GridSearchCV(sgd, params, n_jobs=2)
search.fit(df_train['text_cleaned'], df_train['target'])

# Print out the validation accuracy and hyperparameters of the best model
# Note that on Kaggle, the assessment metric is the F1 score, not accuracy
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Best parameter (CV score=0.730):
{'clf__alpha': 0.001, 'clf__loss': 'log_loss', 'clf__penalty': 'none'}


In [21]:
# Predict on test data and save the predictions to a submission file
y_pred = search.best_estimator_.predict(df_test['text_cleaned'])

submission = pd.DataFrame()
submission['id'] = df_test['id']
submission['target'] = y_pred

submission.to_csv('submissions/sgd_optimized.csv', index=False)

The submission got an F1 score of 0.786 on Kaggle, which is not bad considering how fast and easy this model was to train.
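Because Kaggle scores this competition with the F1 score while the cross-validation above reports accuracy, the two numbers are not directly comparable. The sketch below shows how they can diverge on the same predictions. The labels are made up for illustration and the helper `precision_recall_f1` is a hypothetical name.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only (not taken from the competition data)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In the notebook itself, `sklearn.metrics.f1_score` computes the same quantity, and passing `scoring="f1"` to `GridSearchCV` would select hyperparameters on the metric Kaggle actually uses.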

### A simple neural network with BERT encodings

Next, let's make a simple neural network with BERT encodings as inputs.

BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. It was originally published by Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2018.

The BERT encoder used here has L=12 hidden layers (i.e., Transformer blocks), a hidden size of H=768, and A=12 attention heads.

In [15]:
# Split the training set into training and validation sets.
# Ideally, cross-validation would be preferable, but because training NNs takes a lot of time, I will not be doing that.
# Use stratification to ensure class labels are proportionally represented in both sets
X_train, X_valid, y_train, y_valid = train_test_split(df_train['text_cleaned'], df_train['target'], stratify=df_train['target'])
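As an illustration of what `stratify` achieves, the sketch below splits indices class by class so that each split keeps roughly the same class proportions. This is a simplified stand-in, not scikit-learn's algorithm: the function name `stratified_split` and the toy labels are assumptions, and the 25% validation share mirrors the default `test_size` of `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, valid_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                              # shuffle within each class
        n_valid = round(len(indices) * test_fraction)     # same share from every class
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx

# Made-up labels with a 57/43 balance similar to the training set (100 samples)
labels = [0] * 57 + [1] * 43
train_idx, valid_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in valid_idx))
```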

In [18]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")



In [19]:
# Bert layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)

# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])

In [20]:
# See a summary of the model
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_type_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_word_ids':                                                
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128)}                                                      

In [21]:
# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [28]:
# Train the model on the training data
# Ideally the number of epochs should be larger, currently there could be underfitting
model.fit(X_train, y_train, epochs=2, batch_size = 16)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x23d2b8d7d90>

In [29]:
# Get the loss and accuracy on the validation set
# Note that this is ordinary accuracy, not the F1 score used on Kaggle
model.evaluate(X_valid, y_valid)



[0.5437530875205994, 0.7463235259056091]

In [30]:
# Get the testset label probabilities
y_predicted = model.predict(df_test['text_cleaned'])
y_predicted = y_predicted.flatten()



In [32]:
# Probabilities into predictions
y_predicted = np.where(y_predicted > 0.5, 1, 0)

In [33]:
# Write predictions into a submission file
submission = pd.DataFrame()
submission['id'] = df_test['id']
submission['target'] = y_predicted

submission.to_csv('submissions/bert_simple_nn.csv', index=False)

This simple NN got an F1 score of 0.729 on Kaggle, which is lower than the SGDClassifier's 0.786, even though the NN took much longer to train. However, the NN is very simple, its hyperparameters are not optimized, and it is probably underfitting after only 2 epochs.

### Fine-tuned DistilBERT

Now, instead of training a small classification layer on top of BERT encodings, let's take a pre-trained model and fine-tune it.

DistilBERT is a small, fast, cheap and light HuggingFace Transformer model trained by distilling BERT base.
The following code is based on the tutorials provided by HuggingFace: https://huggingface.co/docs/transformers/model_doc/distilbert

In [3]:
# Download the pre-trained model checkpoints
distilbert_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_19', 'pre_classifier', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [4]:
# Download the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [5]:
# Function to help tokenize the text, used with the map function
def tokenize(df):
    return tokenizer(df["text_cleaned"], padding="max_length", truncation=True, max_length=140)
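The `padding="max_length"`, `truncation=True` and `max_length=140` arguments make every tokenized tweet exactly 140 ids long. The sketch below imitates that behaviour in plain Python for a short sequence. It is an approximation of what the tokenizer does, not its implementation: the helper name `pad_or_truncate` and the word-piece ids are made up, and the special-token ids are assumed to be those of the `bert-base-uncased` vocabulary.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # assumed special-token ids ([PAD], [CLS], [SEP])

def pad_or_truncate(token_ids, max_length):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to exactly max_length."""
    body = token_ids[: max_length - 2]            # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)         # 1 marks a real token
    n_pad = max_length - len(input_ids)
    # 0 in the attention mask marks padding the model should ignore
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# Made-up word-piece ids, for illustration only
ids, mask = pad_or_truncate([7592, 2088, 999], max_length=8)
print(ids)
print(mask)
```

The `input_ids` and `attention_mask` columns passed to `to_tf_dataset` further down correspond to these two lists.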

In [9]:
# Split the training set into training and validation set
train_set, valid_set = train_test_split(df_train, test_size=0.2, stratify=df_train['target']) # Also shuffles because default is shuffle=True

In [10]:
# 16 samples used to estimate the error gradient
# Potentially could also try higher batch size values. However, a too large value can lead to lower accuracy
batch_size = 16

# Tokenize the training set and turn it into a tensorflow dataset
dataset_train = Dataset.from_pandas(train_set[['text_cleaned','target']])
train_tokenized = dataset_train.map(tokenize)
tf_train = train_tokenized.to_tf_dataset(batch_size=batch_size, columns=['input_ids', 'attention_mask'], label_cols=['target'])

# Tokenize the validation set and turn it into a tensorflow dataset
dataset_valid = Dataset.from_pandas(valid_set[['text_cleaned','target']])
valid_tokenized = dataset_valid.map(tokenize)
tf_valid = valid_tokenized.to_tf_dataset(batch_size=batch_size, columns=['input_ids', 'attention_mask'], label_cols=['target'])


Map:   0%|          | 0/6090 [00:00<?, ? examples/s]

Map:   0%|          | 0/1523 [00:00<?, ? examples/s]

In [11]:
# Compile the model with standard hyperparameters, not optimized.
optimizer = optimizers.Adam(learning_rate=3e-5)
loss = losses.SparseCategoricalCrossentropy(from_logits=True)
distilbert_model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

In [63]:
distilbert_model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


In [12]:
# Train the model on training data
distilbert_model.fit(tf_train, batch_size=batch_size, epochs=2) # Training with 2 epochs took around 2 hours

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x2f583c9f880>

In [13]:
# Get the validation accuracy
benchmarks = distilbert_model.evaluate(tf_valid, return_dict=True, batch_size=batch_size)
print(benchmarks)

{'loss': 0.5095885992050171, 'accuracy': 0.8161523342132568}


In [18]:
# Tokenize the test set and turn it into tensorflow dataset
dataset_test = Dataset.from_pandas(df_test['text_cleaned'].to_frame())
test_encoded = dataset_test.map(tokenize)
tf_test = test_encoded.to_tf_dataset(batch_size=batch_size, columns=['input_ids', 'attention_mask'])

Map:   0%|          | 0/3263 [00:00<?, ? examples/s]

In [20]:
# Get testset label probabilities
y_pred = distilbert_model.predict(tf_test).logits
y_pred = activations.softmax(tf.convert_to_tensor(y_pred)).numpy()



In [60]:
# Turn the probabilities into labels
# Column 0 holds P(class 0): predict 0 when it exceeds 0.5, otherwise predict 1
y_pred_lbls = np.where(y_pred[:,:1] > 0.5, 0, 1)
y_pred_lbls = y_pred_lbls.flatten()
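The two steps above (softmax over the logits, then thresholding the first column) can be checked on a few made-up logit rows. This sketch uses plain Python instead of TensorFlow and NumPy, and the logits are invented for illustration. It also shows that, for two classes, the threshold rule matches picking the most probable class, apart from exact ties.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three tweets; column 0 = "no disaster", column 1 = "disaster"
logits = [[2.0, -1.0], [-0.5, 1.5], [0.3, 0.1]]
probs = [softmax(row) for row in logits]

# Rule used in the notebook: label 0 when P(class 0) > 0.5, otherwise label 1
labels_threshold = [0 if p[0] > 0.5 else 1 for p in probs]
# Alternative for two classes: index of the most probable class
labels_argmax = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(labels_threshold, labels_argmax)
```

With NumPy, the second rule would be `np.argmax(y_pred, axis=1)`.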

In [62]:
# Write predictions into a submission file
submission = pd.DataFrame()
submission['id'] = df_test['id']
submission['target'] = y_pred_lbls

submission.to_csv('submissions/fine-tuned-distilbert.csv', index=False)

This simple fine-tuned DistilBERT model got an F1 score of 0.81 on the Kaggle test set. For comparison, the best real submissions on Kaggle score around 0.84, so this model is close. With optimized hyperparameters and more experimentation, the result could likely be improved further.

### What else could be done?

A few ideas:
- Further exploratory data analysis: adding visualizations, displaying the most frequently occurring words and phrases
- Add metafeatures (e.g. word count, mean word length) to the dataset and use these for model training
- Further data clean-up, e.g. correcting mislabeled samples
- Increasing the simple NN model epochs to avoid underfitting, grid-search to optimize hyperparameters
- Explore validation set predicted targets vs real targets with Confusion Matrix, assess precision and recall separately, report fscore as a metric
- Try other models and approaches
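
As one example of the metafeature idea from the list above, the sketch below computes word count and mean word length for a single tweet. The function name `metafeatures` and the whitespace-based word definition are assumptions; other definitions (e.g. ignoring punctuation) are equally reasonable.

```python
def metafeatures(text: str) -> dict:
    """Compute simple per-tweet metafeatures from whitespace-separated tokens."""
    words = text.split()
    word_count = len(words)
    mean_word_length = sum(len(w) for w in words) / word_count if word_count else 0.0
    return {"word_count": word_count, "mean_word_length": mean_word_length}

# Example on a tweet text shown in the training set preview
features = metafeatures("Forest fire near La Ronge Sask. Canada")
print(features)
```

In the notebook, such a function could be applied to the text column with `apply` before that column is dropped, and the resulting numbers joined to the text features used for training.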