<a href="https://colab.research.google.com/github/Dansah2/Classifying-Disaster-Tweets/blob/main/Hugging_Face_Classifying_Disaster_Tweets_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying Disaster Tweets

Kaggle Dataset Download API Command:

`kaggle competitions download -c nlp-getting-started`

I will classify a tweet as either a 'Disaster Tweet' or 'Non-Disaster Tweet'.

## Project Outline:

1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data

4) Classify using Vader

5) Classify using Bag of Words

6) Classify using Hugging Face

## Download the Dataset

1) Install required libraries

2) Import required libraries

3) Upload Data from Google Drive


#### Install Required Libraries

In [None]:
!pip install -q -U scikit-learn
!pip install -q -U numpy
!pip install -q -U transformers datasets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.25.2 which is incompatible.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.25.2 which is incompatible.

#### Import Required Libraries

In [None]:
# handling data
import numpy as np
import pandas as pd

# graphing data
pd.options.plotting.backend = "plotly"
import plotly.graph_objects as go

# mounting Google Drive
from google.colab import drive

# splitting data
from sklearn.model_selection import train_test_split

# loading and training the model
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

#### Upload Data from Google Drive

In [None]:
# Mount google drive to store Kaggle API for future use
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# read in the data
HF_train = pd.read_csv('/content/drive/My Drive/Disaster_Tweets/train_df.csv').copy()
HF_test = pd.read_csv('/content/drive/My Drive/Disaster_Tweets/test_df.csv').copy()

## **Classify Using Hugging Face**

https://blog.tensorflow.org/2019/11/hugging-face-state-of-art-natural.html

### Split / Preprocess Training and Testing Data

In [None]:
# create a function that splits the data into training and validation sets
def split_data_frame(data_frame, target, test_size, shuffle=True):
  X = data_frame.drop(columns=target)
  y = data_frame[target]
  X_train, X_valid, y_train, y_valid = train_test_split(X,y, test_size=test_size, shuffle=shuffle)
  print(f'X_train: {X_train.shape}, X_valid: {X_valid.shape}, y_train: {y_train.shape}, y_valid: {y_valid.shape}')
  return X_train, X_valid, y_train, y_valid

X_train, X_valid, y_train, y_valid = split_data_frame(data_frame=HF_train, target='target', test_size=0.10)

X_train: (6851, 1), X_valid: (762, 1), y_train: (6851,), y_valid: (762,)
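The printed shapes follow from how the 10% hold-out is rounded: for a float `test_size`, scikit-learn rounds the validation count up and gives the remainder to training. The sketch below reproduces that arithmetic in plain Python and uses a seeded shuffle so the same rows land in the validation set on every run. The helper name `split_indices` and the seed value are illustrative assumptions, not part of the notebook; in the notebook itself, passing `random_state` to `train_test_split` would serve the same purpose.

```python
import math
import random

def split_indices(n_samples, test_size, seed=42):
    """Return (train_idx, valid_idx) for a shuffled hold-out split.

    Hypothetical helper: mirrors the rounding used for a float test_size
    (validation count rounded up, the rest goes to training).
    """
    n_valid = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    return indices[n_valid:], indices[:n_valid]

# 7613 rows is the total implied by the printed shapes (6851 + 762)
train_idx, valid_idx = split_indices(n_samples=7613, test_size=0.10)
print(len(train_idx), len(valid_idx))
```

Without a fixed seed, each run draws a different validation set, which makes accuracy numbers harder to compare between runs.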


In [None]:
# tokenizer from a pretrained bert model
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# create a tokenizer function
def tokenize_reviews(X_train, X_valid, y_train, y_valid, X_test):
  # Tokenize the tweets, padding each set to its own longest sequence
  token_train_data = tokenizer(X_train['text'].to_list(), return_tensors='np', padding=True)
  token_valid_data = tokenizer(X_valid['text'].to_list(), return_tensors='np', padding=True)
  token_test_data = tokenizer(X_test['text'].to_list(), return_tensors='np', padding=True)

  # convert labels to a numpy array
  train_labels = np.array(y_train)
  valid_labels = np.array(y_valid)

  return token_train_data, token_valid_data, train_labels, valid_labels, token_test_data

In [None]:
token_train_data, token_valid_data, train_labels, valid_labels, token_test_data = tokenize_reviews(X_train, X_valid, y_train, y_valid, HF_test)

In [None]:
token_train_data['input_ids'][0]

array([  101,  1332, 13152,  1733, 17107, 15619,  1124,  3982,  6523,
       17328, 15907,  1252,  1220,  6467,  1302,   146, 18747,  1124,
        3982,  1398, 26949,  1706,   157,  3048,  6258,  8413,  1204,
        2528,  2559, 10294,  4206,  1182,  2591,  2591,  1324,  1475,
         102,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0])

In [None]:
token_test_data['input_ids'][0]

array([ 101, 2066, 2171,  170, 6434, 1610, 5683,  102,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0])
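In the arrays above, 101 and 102 are the ids of BERT's `[CLS]` and `[SEP]` markers and 0 is the `[PAD]` id, so everything after 102 is padding. Because `padding=True` pads to the longest sequence in each call, the training, validation, and test arrays can end up with different widths. The sketch below uses plain Python lists to show how the padding and the matching attention mask relate. `pad_ids` and `attention_mask` are hypothetical helpers for illustration only, since the tokenizer already returns an `attention_mask` alongside `input_ids`.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-cased vocabulary

def pad_ids(ids, length, pad_id=PAD_ID):
    """Right-pad a list of token ids to a fixed length (hypothetical helper)."""
    if len(ids) > length:
        raise ValueError("sequence is longer than the target length")
    return ids + [pad_id] * (length - len(ids))

def attention_mask(ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding."""
    return [0 if token == pad_id else 1 for token in ids]

# Non-padding ids of the first test tweet shown above
test_tweet = [101, 2066, 2171, 170, 6434, 1610, 5683, 102]

padded = pad_ids(test_tweet, length=12)
mask = attention_mask(padded)
print(padded)
print(mask)
print("real tokens:", sum(mask))
```

The model uses the mask to ignore padded positions, so the amount of padding affects compute cost rather than the prediction itself.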

## Create and Train baseline model
1) Create the model

2) Compile the model

3) Train the model

4) Save the Model

5) Plot Training Accuracy / Loss

### Create the model

In [None]:
# create a method to load the pre-trained model
def load_pre_trained_model(model_name, NUM_LABELS):
  # load the pretrained model 'bert-base-cased'
  model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=NUM_LABELS)

  return model

In [None]:
model = load_pre_trained_model('bert-base-cased', 2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Compile the model

In [None]:
# create a method to create the loss and compile the model
def loss_and_compile(model):
  # create the loss; the model outputs raw logits, so set from_logits=True
  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

  # compile the model
  model.compile(optimizer=tf.keras.optimizers.Adam(5e-6), loss=loss, metrics=['accuracy'])

  return model

In [None]:
model_1 = loss_and_compile(model)
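`TFAutoModelForSequenceClassification` returns raw logits rather than probabilities, so the loss is computed from logits. As a sketch of what sparse categorical cross-entropy works out to for a single example in that setting, the plain-Python function below takes a list of logits and an integer label. The function name and the sample logits are made up for illustration.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]

# Illustrative logits for the two classes (index 0 and index 1)
confident_right = sparse_ce_from_logits([2.0, 0.0], label=0)
confident_wrong = sparse_ce_from_logits([2.0, 0.0], label=1)
print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss is small when the largest logit belongs to the true label and grows as the model becomes confidently wrong.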

### Train the model

In [None]:
# create a method to train the model
def fit_model(model, BATCH_SIZE, EPOCHS, train_data, train_labels, val_data, val_labels):
  # fit the model
  history = model.fit(dict(train_data),
            train_labels,
            validation_data=(dict(val_data), val_labels),
            batch_size=BATCH_SIZE,
            epochs=EPOCHS)

  return model, history

In [None]:
model, history = fit_model(model_1, 64, 100, token_train_data, train_labels, token_valid_data, valid_labels)

Epoch 1/100
...
Epoch 78
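
To gauge how much work a 100-epoch run represents, the number of optimizer steps can be derived from the training-set size and the batch size. The sketch below assumes the 6851 training rows printed by the split and the batch size of 64 passed to `fit_model`; the variable names are illustrative.

```python
import math

n_train = 6851      # training rows, from the split output
batch_size = 64     # value passed to fit_model
epochs = 100        # value passed to fit_model

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"examples in the final batch: {last_batch}")
print(f"total steps over {epochs} epochs: {total_steps}")
```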

### Save the Model



In [None]:
model.save('/content/drive/My Drive/Disaster_Tweets/disaster_tweets_model', save_format="keras")



### Plot Training Accuracy / Loss

In [None]:
loss = history.history['loss']
val_loss = history.history['val_loss']
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
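
Before plotting, a quick numeric summary of the curves can be useful. The sketch below finds the epoch with the lowest validation loss and the gap between training and validation accuracy at the final epoch. The `example_history` values are invented for illustration and are not results from this notebook; with a real run, `history.history` has the same keys.

```python
def summarize_history(hist):
    """Summarize a Keras-style history dict (hypothetical helper)."""
    val_loss = hist["val_loss"]
    # Epochs are reported 1-based, so add 1 to the list index
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    final_gap = hist["accuracy"][-1] - hist["val_accuracy"][-1]
    return {
        "best_epoch": best_epoch,
        "best_val_loss": val_loss[best_epoch - 1],
        "final_accuracy_gap": final_gap,
    }

# Made-up numbers, only to show the shape of the data
example_history = {
    "loss": [0.60, 0.45, 0.35, 0.25],
    "val_loss": [0.55, 0.48, 0.50, 0.58],
    "accuracy": [0.70, 0.80, 0.86, 0.91],
    "val_accuracy": [0.74, 0.79, 0.78, 0.76],
}

summary = summarize_history(example_history)
print(summary)
```

A large positive accuracy gap combined with an early best epoch is a common sign of overfitting.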

In [None]:
def plot_train_loss_and_accuracy(train_loss, train_accuracy):
    # Create a figure with Plotly
    fig = go.Figure()

    # Add traces for training loss and accuracy
    fig.add_trace(go.Scatter(x=list(range(1, len(train_loss) + 1)), y=train_loss, mode='lines', name='Training Loss', yaxis='y1'))
    fig.add_trace(go.Scatter(x=list(range(1, len(train_accuracy) + 1)), y=train_accuracy, mode='lines', name='Training Accuracy', yaxis='y2'))

    # Update layout and labels
    fig.update_layout(title='Training Loss and Accuracy vs. Epochs',
                      xaxis=dict(title='Epochs'),
                      yaxis=dict(title='Training Loss', side='left', showgrid=False),
                      yaxis2=dict(title='Training Accuracy', side='right', overlaying='y', showgrid=False))

    fig.show()

plot_train_loss_and_accuracy(loss, acc)

Notice that the validation loss increases over time while the validation accuracy decreases, and both curves are unstable. This pattern suggests the model is overfitting. Changing the learning rate and/or using early stopping may improve the model.
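
Early stopping, mentioned above, halts training once the validation loss has stopped improving for a set number of epochs. Keras provides this as the `tf.keras.callbacks.EarlyStopping` callback (with `monitor`, `patience`, and `restore_best_weights` arguments), which can be passed to `model.fit` through `callbacks`. The plain-Python sketch below only illustrates the stopping rule itself; the function name, the patience of 3, and the loss values are assumptions chosen for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops at the first epoch where the validation loss has failed
    to improve by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ran for every epoch
    return len(val_losses), best_epoch

# Made-up validation losses that bottom out and then rise
example_val_loss = [0.60, 0.50, 0.45, 0.47, 0.49, 0.52, 0.55]
stopped, best = early_stopping_epoch(example_val_loss, patience=3)
print(f"stopped after epoch {stopped}, best epoch was {best}")
```

Restoring the weights from the best epoch, rather than keeping the final ones, is what keeps the later, overfit epochs from affecting the saved model.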

In [None]:
def plot_valid_loss_and_accuracy(val_loss, val_accuracy):
    # Create a figure with Plotly
    fig = go.Figure()

    # Add traces for validation loss and accuracy
    fig.add_trace(go.Scatter(x=list(range(1, len(val_loss) + 1)), y=val_loss, mode='lines', name='Validation Loss', yaxis='y1'))
    fig.add_trace(go.Scatter(x=list(range(1, len(val_accuracy) + 1)), y=val_accuracy, mode='lines', name='Validation Accuracy', yaxis='y2'))

    # Update layout and labels
    fig.update_layout(title='Validation Loss and Accuracy vs. Epochs',
                      xaxis=dict(title='Epochs'),
                      yaxis=dict(title='Validation Loss', side='left', showgrid=False),
                      yaxis2=dict(title='Validation Accuracy', side='right', overlaying='y', showgrid=False))

    fig.show()

plot_valid_loss_and_accuracy(val_loss, val_acc)