Deep Learning for Disaster Tweet Classification - Machine Learning Final
Charisma Ricarte
Trieu Do
Jonathan Garcia	

Dataset source: https://www.kaggle.com/datasets/vstepanenko/disaster-tweets?select=tweets.csv 


In [3]:
import pandas as pd
dataset = pd.read_csv("tweets.csv", # the location to the data file
                       sep=",", nrows = 10000
                       )
dataset

Unnamed: 0,id,keyword,location,text,target
0,0,ablaze,,"Communal violence in Bhainsa, Telangana. ""Ston...",1
1,1,ablaze,,Telangana: Section 144 has been imposed in Bha...,1
2,2,ablaze,New York City,Arsonist sets cars ablaze at dealership https:...,1
3,3,ablaze,"Morgantown, WV",Arsonist sets cars ablaze at dealership https:...,1
4,4,ablaze,,"""Lord Jesus, your love brings freedom and pard...",0
...,...,...,...,...,...
9995,9995,terrorism,,3yrs after IPOB formed Biafra security Service...,0
9996,9996,terrorism,www,France agrees to send more troops to West Afri...,1
9997,9997,terrorism,USA,"While the press feasts off a tiny ""he-said, sh...",0
9998,9998,terrorism,North Pole,● NEWS ● #meduza #russia ☞ Man who made Russia...,0


In [11]:
# Clean the text of URLs, mentions, symbols, and other non-letter characters.
# This standardizes words: for example, "FIRE!!!", "fire.", and "fire🔥" all map to "fire".
# It also keeps the vocabulary from filling up with unneeded tokens.

import re        # for regular expressions (text cleaning)

def clean_text(t):
    t = t.lower()
    t = re.sub(r"http\S+", "", t)  # remove URLs
    t = re.sub(r"@\w+", "", t)     # remove mentions
    t = re.sub(r"#", "", t)        # remove hashtag symbols (keep the word)
    t = re.sub(r"[^a-z\s]", "", t) # remove non-letters
    return t.strip()
dataset["clean_text"] = dataset["text"].apply(clean_text)
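
As a quick sanity check on the cleaning step, the sketch below runs the same function on a few sample strings. The function body is repeated only so the snippet runs on its own, and the last sample (handle and URL included) is made up for illustration rather than taken from tweets.csv.

```python
import re

def clean_text(t):
    # Same steps as the cell above, repeated so this snippet is self-contained.
    t = t.lower()
    t = re.sub(r"http\S+", "", t)   # remove URLs
    t = re.sub(r"@\w+", "", t)      # remove mentions
    t = re.sub(r"#", "", t)         # remove hashtag symbols (keep the word)
    t = re.sub(r"[^a-z\s]", "", t)  # remove non-letters
    return t.strip()

# "\U0001F525" is the fire emoji; the last string is a hypothetical tweet.
samples = [
    "FIRE!!!",
    "fire.",
    "fire\U0001F525",
    "@user Huge #wildfire near https://t.co/abc123 now",
]
cleaned = [clean_text(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Removing a URL or mention from the middle of a tweet can leave two spaces in a row. That should be harmless here, since the vectorizers split on word boundaries rather than on single spaces.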

In [17]:
# Import the vectorizers used to turn the cleaned tweets into model features.
# First attempt: a bag-of-words binary feature matrix.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
# binary=True records 1 or 0 for whether a word appears in a tweet;
# stop_words="english" drops common English words like "the", "and", and "is"
vectorizer = CountVectorizer(binary=True, stop_words="english")
X = vectorizer.fit_transform(dataset["clean_text"]) # converts each tweet into a sparse vector of 0s and 1s
df_tf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()) # one column per vocabulary word
df_tf # predictors (independent variables)

Unnamed: 0,aa,aaaaaaaaacccccckkkkkkkk,aab,aadharcard,aalaathun,aampe,aampes,aap,aaron,aayega,...,zonal,zone,zoo,zoom,zoomedin,zorro,zuckerberg,zulaykhas,zuma,zw
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We decided to try a different vectorizer, TF-IDF, which produces a weighted feature matrix.
Instead of a 0 or 1, each word gets a weight: words that are frequent in a tweet but rare across the whole dataset get a higher weight, while words that appear in most tweets get a lower one.
With bigrams and a capped vocabulary, it surfaces more meaningful terms than the binary vectorizer above (compare the column names in the two outputs), which should help in training the models.

In [33]:
vectorizer = TfidfVectorizer(
    norm="l2",
    stop_words="english",
    ngram_range=(1,2),      # include unigrams + bigrams
    max_features=20000,     # cap vocab size: keeps the 20,000 most frequent terms, dropping very rare words and noise
    lowercase=True
)
X = vectorizer.fit_transform(dataset["clean_text"])  # fit on the cleaned text column

# take a look at the first 100 rows of the new feature matrix
pd.DataFrame(X[:100].toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,aap,aap chronology,ab,abandon,abandoned,abbott,abby,abc,abc news,abiding,...,zip bts,zip photos,zombie,zombie apocalypse,zombies,zone,zoo,zoom,zuma,zuma did
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
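# Partition the dataset into train and test sets (80/20, stratified by target)
# Use a pipeline so the vectorizer is fit on the training split only, which prevents data leakage
# Set up the baseline: logistic regression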
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_text = dataset["clean_text"]  
y = dataset["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.20, random_state=123, stratify=y
)

# Logistic Regression model- baseline

pipe = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_features=20000),
    LogisticRegression(max_iter=1000, class_weight="balanced")
)

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))

              precision    recall  f1-score   support

           0      0.940     0.909     0.924      1647
           1      0.631     0.728     0.676       353

    accuracy                          0.877      2000
   macro avg      0.786     0.818     0.800      2000
weighted avg      0.885     0.877     0.880      2000



0 = Non-disaster tweets 
1 = Disaster tweets
Precision = out of all tweets predicted as “disaster,” how many actually were?
Recall = out of all real disaster tweets, how many did the model correctly identify?
F1-score = harmonic mean of precision & recall (balances both)
Support = number of true samples in that class
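
The definitions above can be checked by hand on a small example. This is a sketch with ten hypothetical labels (not model output from this notebook), computing each metric for the disaster class from the raw counts.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for ten tweets (1 = disaster, 0 = non-disaster).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # disaster tweets correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-disaster tweets wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # disaster tweets that were missed

precision = tp / (tp + fp)                          # 3 / 5
recall = tp / (tp + fn)                             # 3 / 4
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = sum(t == 1 for t in y_true)               # true disaster tweets

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} support={support}")

# The hand-computed values should agree with scikit-learn's.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```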

1st run
The LR model performs well overall with a 70/30 split.
It does very well at finding non-disaster tweets, with an F1-score of 92%, and only OK at finding disaster tweets, at 67%.
The support reveals that there were over 2400 samples of non-disaster tweets (2,470) but only 530 disaster tweets in the test set, so the classes are imbalanced.

              precision    recall  f1-score   support

           0      0.936     0.913     0.924      2470
           1      0.635     0.708     0.669       530

    accuracy                          0.876      3000
   macro avg      0.785     0.810     0.797      3000
weighted avg      0.882     0.876     0.879      3000


2nd run - we will use this one as the baseline
The LR baseline improved slightly with an 80/20 split.
The F1-score for disaster tweet detection improved by almost 1 point (0.669 to 0.676). Accuracy (0.876 to 0.877), macro avg F1 (0.797 to 0.800), and weighted avg F1 (0.879 to 0.880) also improved slightly.

              precision    recall  f1-score   support

           0      0.940     0.909     0.924      1647
           1      0.631     0.728     0.676       353

    accuracy                          0.877      2000
   macro avg      0.786     0.818     0.800      2000
weighted avg      0.885     0.877     0.880      2000
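
The two runs above differ only in the test split size. Below is a sketch of how that comparison could be repeated in one call. The helper name `f1_by_test_size` is hypothetical, and it assumes both runs used the same pipeline and seed, which the notes above do not state. The toy corpus is made up so the snippet runs without tweets.csv, and its scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def f1_by_test_size(texts, labels, test_sizes=(0.30, 0.20), seed=123):
    """Return {test_size: F1 for class 1}, refitting the baseline for each split."""
    scores = {}
    for size in test_sizes:
        X_tr, X_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=size, random_state=seed, stratify=labels
        )
        # A fresh pipeline per split, so the vectorizer only ever sees training text.
        pipe = make_pipeline(
            TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=20000),
            LogisticRegression(max_iter=1000, class_weight="balanced"),
        )
        pipe.fit(X_tr, y_tr)
        scores[size] = f1_score(y_te, pipe.predict(X_te), pos_label=1)
    return scores


# Tiny made-up corpus: 20 disaster-like and 20 everyday sentences.
toy_texts = [
    "fire burning homes", "flood destroys bridge", "earthquake hits city",
    "wildfire forces evacuation", "storm damage reported",
] * 4 + [
    "love this song", "great coffee today", "my team is on fire",
    "what a lovely day", "new phone arrived",
] * 4
toy_labels = [1] * 20 + [0] * 20

toy_scores = f1_by_test_size(toy_texts, toy_labels)
print(toy_scores)
```

In the notebook itself, the call would be `f1_by_test_size(dataset["clean_text"], dataset["target"])`.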