<a href="https://colab.research.google.com/github/anthonywu2000/CSFIntershipAssessment2023/blob/main/Disaster_Tweet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Given a jupyter notebook that has a functioning implementation of a machine learning model that identifies unique individuals out of a crowd through gait analysis, how would you translate that notebook to a piece of software that can be used to apply the model to any arbitrary images or videos provided.



Extract Model Code: Identify the relevant code sections from the Jupyter notebook that contain the machine learning model implementation. Extract these sections into separate modules or functions.

Convert Pre-processing Steps: If the Jupyter notebook includes any pre-processing steps specific to the dataset used for training, convert them into reusable functions that can be called within the software application.

Load Trained Model: Save the trained model weights from the Jupyter notebook and load them into the software application.

Implement Inference Code: Write code to perform inference using the loaded model. This code should accept input images or videos, apply any necessary pre-processing, and feed them into the model for predictions.

Post-processing and Visualization: If there are any post-processing steps required to refine the model's predictions or visualize the results, implement those in the software application. 

Package and Distribute: Package the software application into a deployable form, such as a standalone executable, a Docker container, or a Python package.

# Import Packages and Installations

In [None]:
# uncomment for installing libraries
# !pip install spacy
# !python -m spacy download en_core_web_sm
!pip install sentence-transformers
!pip install torch torchvision

In [2]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# potential models to be used
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# using spacy models
import spacy
import en_core_web_sm

# using sentence transformer models
from sentence_transformers import SentenceTransformer


import warnings
warnings.filterwarnings("ignore")

# Data Splitting and Cleaning

In [3]:
data = pd.read_csv('./train.csv', encoding = "ISO-8859-1")
data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


For this project, we can focus on the text and target columns, text is the main focus of pre-processing and successive predictions

In [4]:
data.drop(["id", "keyword", "location"], axis = 1, inplace = True)

Perform train test split with 80 percent training data and 20 percent test data. 

In [5]:
train_df, test_df = train_test_split(data, test_size = 0.2, random_state = 1234)

In [6]:
train_df.head()

Unnamed: 0,text,target
5850,@savannahross_4 see tryna ruin my life,0
1045,Womens Buckle Casual Stylish Shoulder Handbags...,0
2287,Think Akwa Ibom!: DonÂÃÂªt come to Uruan and...,0
2395,Dozens Die As two Trains Derail Into A River I...,1
3011,New Mad Max Screenshots Show Off a Lovely Dust...,0


In [7]:
# check for class imbalance in the dataset
none_disaster = sum(train_df["target"] == 0)
is_disaster = sum(train_df["target"] == 1)

none_disaster_probs = none_disaster / train_df.shape[0]
is_disaster_probs = is_disaster / train_df.shape[0]

print(none_disaster_probs)
print(is_disaster_probs)

0.5748768472906404
0.42512315270935963


Class imbalance is fairly not a problem in this training data, we see that around 57% has text that does not have a disaster sentiment and 43% of training set has text relating to a disaster sentiment. 

By looking at the head of the data, we see that some text may contain hashtags, mentions, internet links, and symbols. I have decided to use regex to remove those as they can be unnecessary to the contribution to the prediction of the problem.

In [8]:
def clean_text(dat):
  pattern_url = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+\/\w+' # pattern to remove url
  pattern_unicode = r'[^\x00-\x7F]+' # pattern to remove the unicodes
  pattern_mention = r'@[A-Za-z0-9_]+' # pattern to remove hashtags 
  pattern_symb = r'[^a-zA-Z0-9\s]' # pattern to remove symbols
  dat["text"] = dat["text"].apply(lambda x: re.sub(
    pattern_url + '|' + pattern_unicode + '|' + pattern_mention + '|' + pattern_symb,'', x)
  )
  return dat

In [9]:
train_df_cleaned = clean_text(train_df)
test_df_cleaned = clean_text(test_df)

In [10]:
train_df_cleaned

Unnamed: 0,text,target
5850,see tryna ruin my life,0
1045,Womens Buckle Casual Stylish Shoulder Handbags...,0
2287,Think Akwa Ibom Dont come to Uruan and demolis...,0
2395,Dozens Die As two Trains Derail Into A River I...,1
3011,New Mad Max Screenshots Show Off a Lovely Dust...,0
...,...,...
3276,I think of that every time I go to the epicen...,0
7221,Incredulous at continued outcry of welfare bei...,1
1318,rip the world its burning,0
723,Did that look broken or bleeding,0


In [11]:
X_train = train_df_cleaned['text']
y_train = train_df_cleaned['target']
X_test = test_df_cleaned['text']
y_test = test_df_cleaned['target']

# Models

## Bag-of-words (CountVectorizer) representation for text analysis + Logistic Regression + Random Forest Classifier

In [12]:
# bag-of-words pipeline models
logpipeline_bow = make_pipeline(
    CountVectorizer(stop_words = 'english'),
    LogisticRegression(max_iter = 3000)
)

rfpipeline_bow = make_pipeline(
    CountVectorizer(stop_words = 'english'),
    RandomForestClassifier()
)

In [13]:
# cross-validate logpipeline_bow
pd.DataFrame(
    cross_validate(logpipeline_bow, X_train, y_train, cv = 5, scoring='accuracy', return_train_score=True)
)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.540046,0.094106,0.79064,0.960796
1,0.637492,0.070852,0.788177,0.957718
2,0.598788,0.088377,0.797209,0.961207
3,0.581554,0.093104,0.769294,0.961823
4,0.603138,0.059053,0.778325,0.961823


In [14]:
# cross-validate rfpipeline_bow
pd.DataFrame(
    cross_validate(rfpipeline_bow, X_train, y_train, cv = 5, scoring='accuracy', return_train_score=True)
)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,9.664599,0.240005,0.767652,0.989122
1,15.971554,0.369926,0.771757,0.987069
2,10.973398,0.18001,0.763547,0.98789
3,8.60058,0.188967,0.745484,0.990353
4,8.443532,0.211845,0.786535,0.989532


It seems that RandomForestClassifier tends to overfit a little bit and does not perform as well as LogisticRegression in terms of accuracy. We can now confidently select LogisticRegression as the main classifier model (other than using spaCy embedding and SentenceTransformer utilized later in the project). 

In this project, accuracy will be the main metric that is used. This is just to see how many targets are correctly predicted out of all predictions.

## GridSearch over Hyperparameters for CountVectorizer and Logistic Regression

In [15]:
parameters = {
    'logisticregression__C': [1, 10, 50, 100],
    'countvectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
}

grid = GridSearchCV(logpipeline_bow,
                    parameters,
                    cv=5,
                    scoring='accuracy') 
grid.fit(X_train, y_train)

print("Best C: ", grid.best_params_['logisticregression__C'])
print("Best ngram_range: ", grid.best_params_['countvectorizer__ngram_range'])
print("Best accuracy score: ", grid.best_score_)

Best C:  1
Best ngram_range:  (1, 2)
Best accuracy score:  0.7912972085385879


In [16]:
grid.score(X_test, y_test)

0.8036769533814839

After obtaining hyperparameters, we see that the best validation score is 0.791 and the best test score is 0.804. Both scores are relatively close to each other, and the model is performing relatively well on unseen data. 

There is no sign of potential overfitting nor optimization bias as the validation score is represents the test score relatively well. Also, the validation score isn't higher.

## spaCy's average embedding representation for text analysis 

In [17]:
nlp = spacy.load("en_core_web_sm")

In [18]:
X_train_embeddings = pd.DataFrame([text.vector for text in nlp.pipe(X_train)])
X_test_embeddings = pd.DataFrame([text.vector for text in nlp.pipe(X_test)])

In [19]:
classifier = LogisticRegression(
    max_iter = 3000
)

params = {
    'C': [1, 50, 100, 150, 200, 1000]
}

grid_ae = GridSearchCV(
    classifier, 
    params,
    cv = 5,
    scoring = 'accuracy',
)

grid_ae.fit(X_train_embeddings, y_train)
print("Best C: ", grid_ae.best_params_['C'])
print("Best accuracy score: ", grid_ae.best_score_)

Best C:  100
Best accuracy score:  0.7001642036124796


In [20]:
grid_ae.score(X_test_embeddings, y_test)

0.7143795141168746

It seems like using transfer learning from pre-trained model has a lower accuracy score than with just bag-of-words. This could be due to the reason that using pre-trained model may not generalize very well onto unseen datasets. They may be trained very well on other data, but not in this case.

## Using advanced sentence representation for sentiment analysis

In [None]:
embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")

In [22]:
def encode(dat):
  emb_dat = embedder.encode(dat.tolist())
  return pd.DataFrame(emb_dat)

In [23]:
emd_X_train = encode(X_train)
emd_X_test = encode(X_test)

In [24]:
# using the same logistic regression and same set of hyperparameters
grid_emd = GridSearchCV(
    classifier, 
    params,
    cv = 5,
    scoring = 'accuracy',
)

grid_emd.fit(emd_X_train, y_train)
print("Best C: ", grid_emd.best_params_['C'])
print("Best accuracy score: ", grid_emd.best_score_)

Best C:  1
Best accuracy score:  0.7926108374384236


In [25]:
grid_emd.score(emd_X_test, y_test)

0.7977675640183848

Using sentence transformer performs better than using spaCy's average embedding representation. However, it has a relatively same performance compared to bag-of-words method, with accuracy at around 0.80.

# BERT Experiment (with PyTorch)

In [26]:
# install packages
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [27]:
# install packages
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [28]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [29]:
data_cleaned = clean_text(data)
data_cleaned

Unnamed: 0,text,target
0,Our Deeds are the Reason of this earthquake Ma...,1
1,Forest fire near La Ronge Sask Canada,1
2,All residents asked to shelter in place are be...,1
3,13000 people receive wildfires evacuation orde...,1
4,Just got sent this photo from Ruby Alaska as s...,1
...,...,...
7608,Two giant cranes holding a bridge collapse int...,1
7609,The out of control wild fires in California ...,1
7610,M194 0104 UTC5km S of Volcano Hawaii,1
7611,Police investigating after an ebike collided w...,1


In [30]:
# create a class for creating input dataset for BERT Model
# use the whole dataset for tokenizer to work
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
labels = {
    0:0,
    1:1
}

class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):

        self.labels = [labels[label] for label in data_cleaned['target']]
        self.texts = [tokenizer(text, 
                               padding='max_length', max_length = 512, truncation=True,
                                return_tensors="pt") for text in data_cleaned['text']]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Fetch a batch of labels
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        # Fetch a batch of inputs
        return self.texts[idx]

    def __getitem__(self, idx):
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_texts, batch_y

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [31]:
# split the data into train, validation, and test set
# method referenced from Rubert Winastwan
np.random.seed(12345)
df_train, df_test = np.split(data_cleaned.sample(frac=1, random_state=123), 
                             [int(.8 * len(data_cleaned))])
print(len(df_train), len(df_test))

6090 1523


In [32]:
# BERT model - with 12 layers of Transformer encoder (bidirectional)
# BERT-base consists of 12 transformer blocks,
# each containing 12 self-attention heads, and a hidden layer size of 768.
from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):
  def __init__(self, dropout = 0.5):
    super(BertClassifier, self).__init__()
    self.model = BertModel.from_pretrained('bert-base-uncased')
    self.dropout = nn.Dropout(dropout) # neural network structure
    self.linear = nn.Linear(768, 2) # BERT Architecture
    self.sigmoid = nn.Sigmoid() # Activation function
  
  def forward(self, input_id, mask):
    # embedding vectors for all the tokens in the sequence
    # pooled_output contains the embedding vector of [CLS] token
    _, pooled_output = self.model(input_ids=input_id, attention_mask=mask,return_dict=False)
    dropout_output = self.dropout(pooled_output)
    linear_output = self.linear(dropout_output)
    final_layer = self.sigmoid(linear_output)
    return final_layer

In [33]:
# training session
from torch.optim import Adam
from tqdm import tqdm

def train(model, train_data, learning_rate, epochs):

    train = Dataset(train_data)

    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr= learning_rate)

    if use_cuda:
      model = model.cuda()
      criterion = criterion.cuda()

    for epoch_num in range(epochs):
            total_acc_train = 0
            total_loss_train = 0

            for train_input, train_label in tqdm(train_dataloader):
                train_label = train_label.to(device)
                mask = train_input['attention_mask'].to(device)
                input_id = train_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)
                
                batch_loss = criterion(output, train_label.long())
                total_loss_train += batch_loss.item()
                
                acc = (output.argmax(dim=1) == train_label).sum().item()
                total_acc_train += acc

                model.zero_grad()
                batch_loss.backward()
                optimizer.step()
            
            print(
                f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} \
                | Train Accuracy: {total_acc_train / len(train_data): .3f}'
            )

In [34]:
ep = 2
model = BertClassifier()
learning_rt = 0.001

train(model, df_train, learning_rt, ep)

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 3807/3807 [14:11<00:00,  4.47it/s]


Epochs: 1 | Train Loss:  0.433                 | Train Accuracy:  0.711


100%|██████████| 3807/3807 [14:08<00:00,  4.49it/s]


Epochs: 2 | Train Loss:  0.438                 | Train Accuracy:  0.712


For simplicity and the limitation of computational resources, we just train on 2 epochs with learning rate of 0.001. For sure, more hyperparameter tuning should be done on training the BERT model.

In [37]:
# def testmod(model, test_data):

#     test = Dataset(test_data)

#     test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

#     use_cuda = torch.cuda.is_available()
#     device = torch.device("cuda" if use_cuda else "cpu")

#     if use_cuda:
#         model = model.cuda()

#     total_acc_test = 0
#     with torch.no_grad():
#         for test_input, test_label in test_dataloader:
#               test_label = test_label.to(device)
#               mask = test_input['attention_mask'].to(device)
#               input_id = test_input['input_ids'].squeeze(1).to(device)

#               output = model(input_id, mask)

#               acc = (output.argmax(dim=1) == test_label).sum().item()
#               total_acc_test += acc
#     print(len(test_data))
#     print(total_acc_test)
#     print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

In [39]:
def testmod(model, test_data):

    test = Dataset(test_data)

    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model = model.cuda()

    correct_predictions = 0
    total_predictions = 0
    with torch.no_grad():
        for test_input, test_label in test_dataloader:
              test_label = test_label.to(device)
              mask = test_input['attention_mask'].to(device)
              input_id = test_input['input_ids'].squeeze(1).to(device)

              output = model(input_id, mask)

              _, predicted_labels = torch.max(output, dim=1)
              correct_predictions += (predicted_labels == test_label).sum().item()
              total_predictions += len(test_label)

    accuracy = correct_predictions / total_predictions
    print(f'Test Accuracy: {accuracy:.3f}')

In [40]:
testmod(model, df_test)

Test Accuracy: 0.559


Well, we have a lower test accuracy of 56% and train accuracy of 71%. Due to time constraints and limited computational resources, we could not do more tedious computations on hyperparameter tuning. But overall, BERT pre-trained model should definitely give us a more robust score than the previously described methods. However, the BERT model is just an experiment (extenstion) on this project. 

Given more time, I would explore more on different pre-trained model from spaCy and SentenceTransformer. I would also spend more time tuning the hyperparameters for BERT model. Additionally, I would also try other pre-trained models, like GPT. I would also like to perform simpler methods and use it as a baseline for this classification task.

We can see that using pretrained model from spaCy and SentenceTransformer does not give us improved scores than the bag-of-words method. This is likely due to the fact that pre-trained models are not suited for this tweet data task. Overall, the best accuracy we achieved is 80% on the test set, and this score only depicts on how the model performs on unseen data. 

Overfitting and Optimization bias are not a big issue in this task; however, by experience, BERT model tends to give very high accuracy scores for both train and test set. 