# Classifying Disaster-Related Tweets as Real or Fake
Continuation of the project: building the classification model using a Transformer.

In [1]:
# Libraries

import pandas as pd
import numpy as np
import tensorflow as tf

import warnings
warnings.filterwarnings('ignore')

In [2]:
train_df = pd.read_csv('/kaggle/input/text-classification/train.csv')
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [3]:
train_df.shape

(7613, 5)

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
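
The non-null counts in the `info()` output imply how much of each column is missing. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative and not part of the original notebook):

```python
# Row count and non-null counts copied from the train_df.info() output.
total_rows = 7613
non_null = {"id": 7613, "keyword": 7552, "location": 5080, "text": 7613, "target": 7613}

# Number and share of missing values per column.
missing = {col: total_rows - count for col, count in non_null.items()}
missing_share = {col: n / total_rows for col, n in missing.items()}

for col in non_null:
    print(f"{col:<9} missing: {missing[col]:>5} ({missing_share[col]:.1%})")
```

With the DataFrame in memory, `train_df.isna().sum()` reports the same counts directly.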


# Data Cleaning

In [5]:
# dropping the variables that are not useful for our modeling

train_df.drop(['id', 'location', 'keyword'], axis=1, inplace=True)

train_df.head()

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |


# Build a Transformer Model

In [6]:
from sklearn.model_selection import train_test_split

X = train_df["text"]
y = train_df["target"]

# split the data into training (80%) and validation sets (20%)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 100)
print(F"X_train: {X_train.shape}, X_val: {X_val.shape}, y_train: {y_train.shape}, y_val: {y_val.shape}")

X_train: (6090,), X_val: (1523,), y_train: (6090,), y_val: (1523,)
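
The split sizes printed above follow from the 20% validation share. A small sketch of the arithmetic, assuming the validation count is the test share rounded up, which matches the shapes shown:

```python
import math

n_samples = 7613   # rows in train_df
test_size = 0.2    # validation share passed to train_test_split

# 0.2 * 7613 = 1522.6, rounded up to a whole number of rows.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(f"train rows: {n_train}, validation rows: {n_val}")
```

If the class balance should be preserved in both splits, `train_test_split` also accepts a `stratify` argument (for example `stratify=y`); the notebook does not use it.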


In [7]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification


In [8]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

print(f"Tokenizer Model Maximum Length: {tokenizer.model_max_length}")
print(f"Tokenizer Model Vocabulary Size: {tokenizer.vocab_size}")

Tokenizer Model Maximum Length: 512
Tokenizer Model Vocabulary Size: 30522


In [9]:
train_encoding = tokenizer(list(X_train), truncation=True, padding=True)
val_encoding = tokenizer(list(X_val), truncation=True, padding=True)
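
To make the effect of `truncation=True` and `padding=True` concrete, here is a toy sketch in plain Python. It is a simplified stand-in and not the tokenizer's real implementation: `pad_and_truncate` and the token ids are hypothetical, and the real tokenizer also preserves the closing special token when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy version of truncating, then padding to the longest sequence."""
    # Truncation: cut every sequence to at most max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the longest one in this call.
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids, chosen only for illustration.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 2, 102]]
encoded = pad_and_truncate(batch, max_length=6)

for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Because `padding=True` pads to the longest sequence in each call, `train_encoding` and `val_encoding` can end up with different sequence lengths.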

In [10]:
# Transforming to tensorflow datasets
# Preparing the data for transformer

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encoding), tf.constant(y_train.values, dtype=tf.int32)))

val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encoding), tf.constant(y_val.values, dtype=tf.int32)))


# configuring the datasets

train_dataset = train_dataset.shuffle(len(X_train)).batch(16)

val_dataset = val_dataset.batch(16)
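
With a batch size of 16, the number of batches per epoch follows from the split sizes printed earlier. A quick sketch of that arithmetic (variable names are illustrative):

```python
import math

batch_size = 16
n_train, n_val = 6090, 1523   # split sizes printed earlier

# .batch(16) keeps the final partial batch by default, so round up.
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)

# Size of the last (possibly partial) batch in each dataset.
last_train_batch = n_train % batch_size or batch_size
last_val_batch = n_val % batch_size or batch_size

print(f"train: {train_batches} batches, last batch has {last_train_batch} examples")
print(f"val:   {val_batches} batches, last batch has {last_val_batch} examples")
```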

In [11]:
# Build the model

model = TFAutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 2)

# Define optimizer, loss, and metrics 
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

# Compile the model 
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

# Fit the model
model.fit(train_dataset, epochs=10, verbose = False)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 


<tf_keras.src.callbacks.History at 0x7a92c82100a0>
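
The model outputs raw logits rather than probabilities, which is why the loss is created with `from_logits=True`. The NumPy sketch below shows what that loss computes for integer labels: a softmax over the logits followed by the negative log-probability of the true class, averaged over the batch. The function name and example values are hypothetical, and this is an illustration rather than Keras's actual implementation.

```python
import numpy as np


def sparse_ce_from_logits(logits, labels):
    """Mean sparse categorical cross-entropy computed from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Subtract the row maximum for numerical stability (log-sum-exp trick).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the true class of each example.
    per_example = -log_probs[np.arange(len(labels)), labels]
    return per_example.mean()


# Hypothetical logits for two tweets and their true labels.
example_logits = [[0.0, 0.0], [2.0, 0.0]]
example_labels = [0, 0]
example_loss = sparse_ce_from_logits(example_logits, example_labels)
print(f"loss: {example_loss:.4f}")
```

Equal logits give a loss of ln 2 for that example, the value expected from an uninformed two-class guess.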

In [12]:
# Model Evaluation

train_loss, train_acc = model.evaluate(train_dataset)
print(F"Train set accuracy: {train_acc}")

val_loss, val_acc = model.evaluate(val_dataset)
print(F"Validation set accuracy: {val_acc}")

Train set accuracy: 0.9878489375114441
Validation set accuracy: 0.8240315318107605
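
Accuracy alone does not show which kind of error the model makes. As a sketch, the snippet below turns logits into class predictions with `argmax` and computes precision, recall, and F1 for the disaster class. The logits and labels are hypothetical values for illustration; in the notebook they would come from the model's predictions on `val_dataset` and from `y_val`.

```python
import numpy as np

# Hypothetical logits (one row per tweet) and true labels.
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5],
                   [1.2, 0.4], [-1.0, 2.0], [0.2, 0.1]])
y_true = np.array([0, 1, 1, 1, 1, 0])

# The predicted class is the index of the larger logit.
y_pred = logits.argmax(axis=1)

# Confusion counts for the positive (disaster) class.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

accuracy = float(np.mean(y_pred == y_true))
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```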


# Conclusion
In this project, I built a deep learning classification model using TensorFlow. I used a real-world tweets dataset to predict whether a tweet refers to a real disaster or not.

Across this notebook and the previous one in the same project, I started with a shallow neural network and worked up to a Transformer-based model. The performance of these models is summarized below:

+ **`Shallow Neural Network`**: training accuracy 58%, validation accuracy 58%
+ **`Multilayer Deep Text Classification Model`**: training accuracy 56%, validation accuracy 57%
+ **`Multilayer Bidirectional LSTM Model`**: training accuracy 95%, validation accuracy 78%
+ **`Transformer Model`**: training accuracy 98.8%, validation accuracy 82.4%

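The Transformer model performs best, with the highest validation accuracy (82%). Its training accuracy is considerably higher than its validation accuracy, which suggests some overfitting.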
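
The comparison can be made explicit by computing the gap between training and validation accuracy for each model. This sketch uses the accuracies reported in this notebook; the dictionary and variable names are illustrative.

```python
# (training accuracy, validation accuracy) as reported in this notebook.
reported = {
    "Shallow Neural Network": (0.58, 0.58),
    "Multilayer Deep Text Classification Model": (0.56, 0.57),
    "Multilayer Bidirectional LSTM Model": (0.95, 0.78),
    "Transformer Model": (0.9878, 0.8240),
}

# A large positive gap means the model fits the training set much better
# than unseen data.
gaps = {name: train - val for name, (train, val) in reported.items()}

# The model with the highest validation accuracy.
best = max(reported, key=lambda name: reported[name][1])

for name, gap in gaps.items():
    print(f"{name}: train-validation gap = {gap:+.1%}")
print(f"Best validation accuracy: {best}")
```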
