# Disaster Tweet Identification with HuggingFace DistilBert

Transformers are powerful tools for NLP. I always found them a bit complex myself, but the HuggingFace libraries make them quite simple to use. In this notebook, I use the TensorFlow DistilBert model from HuggingFace for tweet classification. It works very well and keeps the notebook short.

A couple of articles I used as a basis: [article 1](https://medium.com/geekculture/hugging-face-distilbert-tensorflow-for-custom-text-classification-1ad4a49e26a7), [article 2](https://www.analyticsvidhya.com/blog/2022/04/building-state-of-the-art-text-classifier-using-huggingface-and-tensorflow/)

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

import tensorflow as tf

from transformers import DistilBertTokenizer
from transformers import TFDistilBertForSequenceClassification

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df_train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
df_test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

I combined the location, keyword, and tweet text into a single field (`text_combo`) that I use as input for the transformer. I got just as good results with the tweet text alone, so the combination adds little here, but it is a reminder that such experiments are simple to run.

In [None]:
df_train['text_combo'] = df_train['location'].astype(str) + " : " + df_train['keyword'].astype(str) + " : " + df_train['text'].astype(str)
df_test['text_combo'] = df_test['location'].astype(str) + " : " + df_test['keyword'].astype(str) + " : " + df_test['text'].astype(str)
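
One thing to watch with the combination above: many rows have no location or keyword, and depending on the pandas version, `astype(str)` can turn those missing values into the literal text `nan`. Below is a minimal sketch of an alternative that treats missing values as empty strings. The helper name `combine_fields` and the toy rows are made up for illustration, and I have not checked whether this changes the model's score.

```python
import numpy as np
import pandas as pd


def combine_fields(df, columns=("location", "keyword", "text"), sep=" : "):
    """Join the given columns into one string per row, treating missing values as empty."""
    parts = [df[col].fillna("").astype(str) for col in columns]
    combined = parts[0]
    for part in parts[1:]:
        combined = combined + sep + part
    return combined


# Toy frame standing in for train.csv (hypothetical rows)
toy = pd.DataFrame({
    "location": ["London", np.nan],
    "keyword": ["flood", np.nan],
    "text": ["River burst its banks", "What a lovely day"],
})
toy["text_combo"] = combine_fields(toy)
print(toy["text_combo"].tolist())
```

In the notebook this would be used as `df_train['text_combo'] = combine_fields(df_train)`, and the same for `df_test`.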


In [None]:
df_train_subset = df_train[["text_combo", "target"]].copy()
df_train_subset.rename(columns = {'text_combo':'text'}, inplace = True)
X_train, X_test = train_test_split(df_train_subset, test_size=0.05, random_state=0, stratify=df_train["target"])

In [None]:
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
BATCH_SIZE = 16
# Generally I got the best results after the first epoch already,
# but it is worth trying a few more to see.
N_EPOCHS = 3

# Tokenization

In [None]:
# HuggingFace models come with their own tokenizers, matched to the input each model expects
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

In [None]:
#tokenize the text
train_encodings = tokenizer(list(X_train["text"]),
                            truncation=True, 
                            padding=True)

test_encodings = tokenizer(list(X_test["text"]),
                           truncation=True, 
                           padding=True)


## Tokenization Example

Maybe we can learn something by looking at how the tokenizer handles some input?

In [None]:
X_train["text"].iloc[188]

In [None]:
# token ids for the same tweet, including any padding
input_ids = train_encodings["input_ids"][188]
# map the ids back to the tokens the model actually sees
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(f"Tokenized tokens: {tokens}")
print(f"Tokenized text: {tokenizer.convert_tokens_to_string(tokens)}")

It seems the max length could be shorter, since tweets rarely contain that many words or tokens. The tokenizer also lowercases all text and splits off special characters such as #.
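
To check how much shorter the max length could be, one option is to count the real (non-padding) tokens per tweet from the attention mask that the tokenizer returns alongside `input_ids`. This is a rough sketch: the helper name `token_length_stats` and the toy masks are my own, and the percentiles are just one reasonable way to summarize the lengths.

```python
import numpy as np


def token_length_stats(attention_masks, percentiles=(50, 95, 100)):
    """Return the requested percentiles (rounded up) of real token counts per row."""
    # each mask row has 1 for a real token and 0 for padding
    lengths = np.asarray(attention_masks).sum(axis=1)
    return {p: int(np.ceil(np.percentile(lengths, p))) for p in percentiles}


# Toy masks standing in for train_encodings["attention_mask"]
toy_masks = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
]
stats = token_length_stats(toy_masks)
print(stats)
```

In the notebook the call would be `token_length_stats(train_encodings["attention_mask"])`. A value near the upper percentiles could then be passed to the tokenizer as `max_length` together with `truncation=True`.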

In [None]:

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings),
                                    list(X_train["target"].values)))

test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings),
                                    list(X_test["target"].values)))

# Model Training

In [None]:
# load the pretrained model
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)
# choose the optimizer
#optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
optimizer = tf.keras.optimizers.Adam(learning_rate=18e-6)
# define the loss function
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# compile the model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['accuracy'])

checkpoint_filepath = 'mycheckpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)

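# the training dataset is already batched, so fit() does not need batch_size;
# the validation set needs no shuffling and can use the same batch size
model.fit(train_dataset.shuffle(len(X_train)).batch(BATCH_SIZE),
          epochs=N_EPOCHS,
          callbacks=[model_checkpoint_callback],
          validation_data=test_dataset.batch(BATCH_SIZE))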

In [None]:
# load the best model weights saved by the checkpoint callback
model.load_weights(checkpoint_filepath)

# Test Set Prediction

In [None]:
def predict_proba(text_list, model, tokenizer):
    """Return an array of class probabilities, one row per input text."""
    encodings = tokenizer(text_list, 
                          #max_length=MAX_LEN, 
                          truncation=True, 
                          padding=True)

    # somehow these APIs never read very intuitively :/ 
    dataset = tf.data.Dataset.from_tensor_slices((dict(encodings)))
    # predict() needs a batched dataset; all rows are padded to the same
    # length, so a larger batch than 1 works and runs faster
    preds = model.predict(dataset.batch(BATCH_SIZE)).logits  
    
    # transform the logits to an array of probabilities
    res = tf.nn.softmax(preds, axis=1).numpy()      
    
    return res
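
For reference, the softmax step at the end of `predict_proba` can be reproduced in plain NumPy. This is only an illustration of what the transformation does to each row of logits, using made-up logit values, and not a replacement for `tf.nn.softmax`.

```python
import numpy as np


def softmax(logits, axis=1):
    """Numerically stable softmax along the given axis."""
    logits = np.asarray(logits, dtype=float)
    # subtracting the row maximum avoids overflow in exp() without changing the result
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Hypothetical logits for three tweets, columns = [not disaster, disaster]
toy_logits = np.array([[2.0, -1.0], [0.0, 0.0], [-3.0, 3.0]])
toy_probs = softmax(toy_logits)
print(toy_probs.round(3))
```

Each row sums to 1, and equal logits give equal probabilities, which is why the two columns of `preds` can be read as class probabilities.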

In [None]:
#test_texts = list(df_test["text"])
test_texts = list(df_test["text_combo"])

In [None]:
preds = predict_proba(test_texts, model, tokenizer)

# Predictions Distribution

First the predicted probabilities for class 0 (not disaster), followed by those for class 1 (disaster).

In [None]:
n, bins, patches = plt.hist(preds[:,0])
plt.show()

In [None]:
n, bins, patches = plt.hist(preds[:,1])
plt.show()

# Submission

In [None]:
df_submission = pd.DataFrame()
df_submission["id"] = df_test["id"]
df_submission["target"] = preds[:, 1] >= 0.5
df_submission["target"] = df_submission["target"].astype(int)
df_submission
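
The 0.5 cutoff used above is a choice, not a requirement. With two classes it gives the same labels as taking the argmax of each row (apart from an exact 0.5 tie), and raising it makes the model more reluctant to flag a tweet as a disaster. A small sketch with made-up probabilities, where `to_labels` is a hypothetical helper:

```python
import numpy as np


def to_labels(probs, threshold=0.5):
    """Label a row 1 (disaster) when its class-1 probability reaches the threshold."""
    return (np.asarray(probs)[:, 1] >= threshold).astype(int)


# Hypothetical class probabilities, columns = [not disaster, disaster]
toy_preds = np.array([[0.9, 0.1], [0.3, 0.7], [0.45, 0.55]])

default_labels = to_labels(toy_preds)                  # same cutoff as the submission
strict_labels = to_labels(toy_preds, threshold=0.6)    # stricter cutoff
argmax_labels = toy_preds.argmax(axis=1)               # most probable class per row

print(default_labels, strict_labels, argmax_labels)
```

Whether a different threshold helps the leaderboard score would need to be checked on the held-out split; I have not tested that here.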

In [None]:
df_submission.to_csv("kaggle_submission.csv", index=False)

In [None]:
!head kaggle_submission.csv