Link to GitHub Repo: https://github.com/DStull99/MSCS-Boulder-Machine-Learning-Final-Projects

# Brief Description of Problem, Relevant Features, and Strategy

During real-world emergencies, people often tweet about ongoing disasters. The challenge is to classify tweets as disaster-related or not. This is difficult because many tweets use dramatic wording or sarcasm without referring to actual disasters (for example, "this party is on fire"). Our goal is to build a model that can distinguish real disaster tweets from unrelated ones.

Dataset structure: each tweet in the dataset has several fields:

- id: unique identifier for the tweet
- text: the tweet content (actual tweet text)
- keyword: a keyword from the tweet (might be empty if no keyword provided)
- location: the location the tweet was sent from (could be blank)
- target: label indicating if the tweet is about a real disaster (1) or not (0)

The classification task is binary: predict target = 1 for disaster tweets and 0 for non-disaster tweets. This is important for emergency responders, who need to filter signal from noise on Twitter. Correctly identifying disaster tweets (true positives) is crucial, and missing a real disaster tweet (a false negative) could mean missing critical information. Therefore, we will prioritize recall (catching as many real disaster tweets as possible) while still aiming for good precision to avoid too many false alarms.
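One common way to act on a recall priority is to lower the decision threshold applied to the predicted probabilities. The sketch below only illustrates that trade-off: the labels and scores are made up, and `precision_recall` and `pick_threshold` are hypothetical helper names, not part of this notebook's pipeline.

```python
# Hypothetical example: trading precision for recall by lowering the
# decision threshold. Labels and scores are invented for illustration.

def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) for the positive class at a threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pick_threshold(y_true, scores, min_recall, candidates):
    """Highest candidate threshold whose recall reaches min_recall, else None."""
    for threshold in sorted(candidates, reverse=True):
        _, recall = precision_recall(y_true, scores, threshold)
        if recall >= min_recall:
            return threshold
    return None

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.2, 0.1]

for threshold in (0.5, 0.4, 0.3):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

chosen = pick_threshold(y_true, scores, min_recall=0.75, candidates=[0.5, 0.4, 0.3])
print("Chosen threshold:", chosen)
```

In practice the threshold would be chosen on a validation set, not on the data used for the final evaluation.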

Tweets are sequences of words, and understanding the context (word order and history) is key. RNNs are designed for sequential data and maintain a memory of previous words, which helps in understanding the tweet in context.

Unlike bag-of-words models, RNNs consider word order, so a phrase like "on fire" is interpreted in light of the surrounding words. Specifically, we will experiment with two RNN variants: LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). These networks include gating mechanisms that help them learn long-term dependencies in text. LSTMs have three gates (input, forget, output) and a cell state, whereas GRUs have a simpler design with two gates (reset and update) and no separate cell state.
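To make the two GRU gates concrete, here is a minimal NumPy sketch of a single GRU time step applied over a short sequence. It is an illustration under stated assumptions, not the Keras implementation: the weights are random, the dimensions are tiny, and the names (`gru_step`, `params`) are hypothetical. Sign conventions for the update gate also differ between references; this sketch lets `z` weight the previous state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step. x has shape (input_dim,), h_prev has shape (units,)."""
    # Update gate: how much of the previous hidden state to keep.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])
    # Candidate hidden state built from the input and the reset-scaled history.
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    # Blend old state and candidate; there is no separate cell state.
    return z * h_prev + (1.0 - z) * h_cand

# Illustrative sizes and random weights (assumptions, not trained values).
input_dim, units, steps = 3, 4, 5
rng = np.random.default_rng(0)
params = {}
for gate in "zrh":
    params["W" + gate] = rng.normal(scale=0.5, size=(units, input_dim))
    params["U" + gate] = rng.normal(scale=0.5, size=(units, units))
    params["b" + gate] = np.zeros(units)

sequence = rng.normal(size=(steps, input_dim))  # stand-in for word embeddings
h = np.zeros(units)
for x_t in sequence:
    h = gru_step(x_t, h, params)

print("Final hidden state shape:", h.shape)
```

An LSTM step follows the same pattern but adds a third gate and carries a cell state alongside the hidden state.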

In [1]:
# importing relevant libraries:
import pandas as pd
import numpy as np
from collections import Counter
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, Dense
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
# Loading & Inspecting the Data:

df = pd.read_csv('train.csv')
# Inspect first few rows
print(df.head())
print("\nMissing values per column:")
print(df.isnull().sum())

   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  

Missing values per column:
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64


The output above shows 61 missing values in the "keyword" column and 2,533 in the "location" column. To handle these missing values, we will simply fill the empty cells with the placeholder word "missing".

In [3]:
# Imputing missing values:

df[['keyword', 'location']] = df[['keyword', 'location']].fillna('missing')
print(df[['keyword','location']].head(5))


   keyword location
0  missing  missing
1  missing  missing
2  missing  missing
3  missing  missing
4  missing  missing


In [4]:
# Examining the distributions of word counts for each tweet:

df['word_count'] = df['text'].apply(lambda x: len(x.split()))
print(df['word_count'].describe().round(2))
print("Tweet lengths (words) for the first 20 entries:", df['word_count'].head(20).tolist())


count    7613.00
mean       14.90
std         5.73
min         1.00
25%        11.00
50%        15.00
75%        19.00
max        31.00
Name: word_count, dtype: float64
Tweet lengths (words) for the first 20 entries: [13, 7, 22, 8, 16, 18, 14, 15, 12, 10, 9, 27, 12, 7, 11, 3, 3, 3, 5, 3]

In [5]:
# Analyzing the most common words in these tweets:

words = re.findall(r'\w+', " ".join(df['text']).lower())
word_counts = Counter(words)
print("Top 5 most common words (including stopwords):", word_counts.most_common(5))


Top 5 most common words (including stopwords): [('t', 5199), ('co', 4740), ('http', 4309), ('the', 3277), ('a', 2200)]


As we can see above, the most common tokens carry little useful information: 't', 'co', and 'http' are fragments of shortened links (http://t.co/...), while 'the' and 'a' are common stopwords. We will have to clean the text data to remove these stopwords and other unnecessary text.
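Since the top three tokens come from links, one optional extra step is to strip URLs before counting or tokenizing. The following is a minimal sketch using only the standard library. The regular expression, the helper name `strip_urls`, and the sample URLs are assumptions for illustration and are not part of the preprocessing used in this notebook.

```python
import re
from collections import Counter

# Assumed pattern: matches http(s) links and bare www. links up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Replace every URL in the text with a single space."""
    return URL_PATTERN.sub(" ", text)

# Sample tweets with made-up shortened links.
sample_tweets = [
    "Forest fire near La Ronge http://t.co/abc123",
    "Evacuation orders issued https://t.co/xyz789 stay safe",
]

raw_counts = Counter(re.findall(r"\w+", " ".join(sample_tweets).lower()))
clean_counts = Counter(
    re.findall(r"\w+", " ".join(strip_urls(t) for t in sample_tweets).lower())
)

print("Before:", raw_counts["t"], raw_counts["co"], raw_counts["http"])
print("After: ", clean_counts["t"], clean_counts["co"], clean_counts["http"])
```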

In [6]:
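# Cleaning the text data to remove stopwords and other unnecessary text:

# Fetch the NLTK stopword list if it is not already available locally
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_text(text):
    """Lowercase, strip punctuation, drop stopwords and 1-char tokens, then stem."""
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    words = text.split()
    # Remove stopwords and single-character tokens
    words = [w for w in words if w not in stop_words and len(w) > 1]
    # Apply stemming
    words = [ps.stem(w) for w in words]
    return " ".join(words)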

# Apply preprocessing to all tweets
df['text_clean'] = df['text'].apply(preprocess_text)

# Show original vs cleaned text for a few samples
for i in range(5):
    print("Original:", df.loc[i, 'text'])
    print("Cleaned: ", df.loc[i, 'text_clean'])
    print("------")


Original: Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Cleaned:  deed reason earthquak may allah forgiv us
------
Original: Forest fire near La Ronge Sask. Canada
Cleaned:  forest fire near la rong sask canada
------
Original: All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
Cleaned:  resid ask shelter place notifi offic evacu shelter place order expect
------
Original: 13,000 people receive #wildfires evacuation orders in California 
Cleaned:  13 000 peopl receiv wildfir evacu order california
------
Original: Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
Cleaned:  got sent photo rubi alaska smoke wildfir pour school
------


As we can see in the output above, the cleaned data does not contain stopwords or punctuation. The Porter stemming algorithm does not always produce dictionary words (we can see stems like "wildfir", "receiv", "earthquak", etc.), and removing punctuation splits numbers such as "13,000" into "13 000", but this result will suffice for this analysis. Next, we will transform the data using the TF-IDF vectorizer.

In [7]:
# Transforming the data using the tf-idf vectorizer:

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['text_clean'])
print("TF-IDF matrix shape:", X_tfidf.shape)
# Show sample TF-IDF values for the first tweet
feature_names = tfidf.get_feature_names_out()
nz_indices = X_tfidf[0].nonzero()[1]
print("Sample TF-IDF features for first tweet:")
print({feature_names[i]: round(X_tfidf[0, i], 2) for i in nz_indices})


TF-IDF matrix shape: (7613, 18404)
Sample TF-IDF features for first tweet:
{'us': 0.28, 'forgiv': 0.46, 'allah': 0.41, 'may': 0.29, 'earthquak': 0.33, 'reason': 0.35, 'deed': 0.48}
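To show where weights like these come from, the sketch below recomputes TF-IDF by hand on four of the cleaned tweets shown earlier. It assumes scikit-learn's default settings as I understand them (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization per row). It splits on whitespace instead of using the vectorizer's token pattern, so it is an approximation of `TfidfVectorizer`, and `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [
    "forest fire near la rong sask canada",
    "resid ask shelter place notifi offic evacu shelter place order expect",
    "13 000 peopl receiv wildfir evacu order california",
    "got sent photo rubi alaska smoke wildfir pour school",
]

rows = tfidf_rows(docs)
for term, weight in sorted(rows[2].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.2f}")
```

Terms that appear in several documents (such as "wildfir" or "evacu" here) receive a lower weight than terms unique to one document, which is the behaviour TF-IDF is designed to produce.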


Now that our data has been cleaned, we will compare two RNN implementations: LSTM and GRU. The RNNs take padded sequences of integer word indices as input rather than the TF-IDF matrix, so we will first tokenize the cleaned text and split the data into training and validation sets.

In [8]:
# Tokenizing the cleaned text and splitting the data into training and validation sets:

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text_clean'])
X_seq = tokenizer.texts_to_sequences(df['text_clean'])
max_len = max(len(seq) for seq in X_seq)
X_seq = pad_sequences(X_seq, maxlen=max_len, padding='post')
y = df['target'].values

X_train, X_val, y_train, y_val = train_test_split(X_seq, y, test_size=0.25, random_state=42)
print("Train size:", X_train.shape[0], "tweets; Validation size:", X_val.shape[0], "tweets")
print("Max sequence length:", max_len)

Train size: 5709 tweets; Validation size: 1904 tweets
Max sequence length: 25
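To make the tokenize-and-pad step above concrete, here is a simplified pure-Python sketch of the same idea: words are indexed by descending frequency starting at 1, each text becomes a list of indices, and shorter lists are padded at the end with 0. It is not the Keras implementation, and `build_word_index` and `texts_to_padded` are hypothetical names. One deliberate simplification is labeled in the code.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, max_len):
    """Convert texts to fixed-length index lists with trailing zeros."""
    rows = []
    for text in texts:
        # Words missing from the index are dropped.
        ids = [word_index[w] for w in text.split() if w in word_index]
        # Simplification: cut overly long sequences at the end
        # (Keras pad_sequences truncates from the start by default).
        ids = ids[:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

train_texts = ["forest fire near la rong sask canada", "fire evacu order"]
word_index = build_word_index(train_texts)
max_len = max(len(t.split()) for t in train_texts)

padded = texts_to_padded(train_texts, word_index, max_len)
unseen = texts_to_padded(["tornado fire order"], word_index, max_len)

print(word_index)
print(padded)
print(unseen)
```

Index 0 is reserved for padding, which is why the vocabulary size passed to an embedding layer is the number of indexed words plus one.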


# Model Architecture

For both the LSTM and GRU models, we will use the same simple architecture: (1) an Embedding layer as the input layer, initialized randomly and trained with the rest of the model, to learn a vector for each word index; (2) one LSTM or GRU layer with 32 units (the dimensionality of the RNN's hidden state); (3) dropout inside the RNN layer, via the dropout and recurrent_dropout arguments, to help prevent overfitting; and (4) a Dense output layer with a sigmoid activation to produce a probability for the positive class.
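The "simpler design" of the GRU shows up directly in the number of trainable weights. The sketch below counts parameters for the recurrent and output layers using the standard formulas, with 32 units and a 50-dimensional embedding as the recurrent layer's input. The GRU formula assumes the Keras default of two bias vectors per gate (`reset_after=True`); treat that as an assumption to verify against `model.summary()`. The embedding layer is omitted because its size depends on the vocabulary.

```python
def lstm_params(input_dim, units):
    """Four weight blocks (three gates plus the candidate), one bias each."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units, reset_after=True):
    """Three weight blocks; two bias vectors each when reset_after is True."""
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)

def dense_params(input_dim, outputs):
    """Weights plus one bias per output unit."""
    return input_dim * outputs + outputs

embed_dim, units = 50, 32

print("LSTM layer: ", lstm_params(embed_dim, units))
print("GRU layer:  ", gru_params(embed_dim, units))
print("Dense layer:", dense_params(units, 1))
```

With these sizes the GRU layer has roughly three quarters of the LSTM layer's parameters, so the two models differ mainly in the recurrent layer.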

In [9]:
# Training both the LSTM and GRU Models:

vocab_size = len(tokenizer.word_index) + 1
embed_dim = 50  

lstm_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embed_dim, input_length=max_len),
    LSTM(32, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

gru_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embed_dim, input_length=max_len),
    GRU(32, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
gru_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train both models (using a small number of epochs for demonstration)
lstm_model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)
gru_model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2c49bc60bb0>

In [10]:
# Making predictions on the validation set and comparing the classification results for the 2 models:

y_pred_lstm = (lstm_model.predict(X_val) > 0.5).astype(int)
y_pred_gru  = (gru_model.predict(X_val) > 0.5).astype(int)

print("LSTM Classification Report:")
print(classification_report(y_val, y_pred_lstm))
print("LSTM Confusion Matrix:")
print(confusion_matrix(y_val, y_pred_lstm))
print("\nGRU Classification Report:")
print(classification_report(y_val, y_pred_gru))
print("GRU Confusion Matrix:")
print(confusion_matrix(y_val, y_pred_gru))

LSTM Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.78      0.78      1091
           1       0.70      0.69      0.70       813

    accuracy                           0.74      1904
   macro avg       0.74      0.74      0.74      1904
weighted avg       0.74      0.74      0.74      1904

LSTM Confusion Matrix:
[[848 243]
 [249 564]]

GRU Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.72      0.76      1091
           1       0.67      0.77      0.72       813

    accuracy                           0.74      1904
   macro avg       0.74      0.75      0.74      1904
weighted avg       0.75      0.74      0.74      1904

GRU Confusion Matrix:
[[787 304]
 [187 626]]


In conclusion, the two models reach the same validation accuracy (0.74), but the GRU achieves higher recall on the disaster class (0.77 versus 0.69 for the LSTM), at the cost of somewhat lower precision (0.67 versus 0.70). Because we prioritize minimizing false negatives, we will use the GRU model to generate predictions on the test set for submission.
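The comparison can be checked directly from the two confusion matrices printed above. The sketch below recomputes precision, recall, and accuracy for the disaster class, and adds an F2 score, which weights recall more heavily than precision. The F2 score is an extra metric introduced here for illustration and is not reported elsewhere in this notebook; `metrics_from_confusion` is a hypothetical helper name.

```python
def metrics_from_confusion(cm, beta=2.0):
    """Metrics for class 1 from a [[tn, fp], [fn, tp]] confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_beta": f_beta}

# Confusion matrices copied from the validation output above.
lstm_cm = [[848, 243], [249, 564]]
gru_cm = [[787, 304], [187, 626]]

lstm_metrics = metrics_from_confusion(lstm_cm)
gru_metrics = metrics_from_confusion(gru_cm)

for name, m in (("LSTM", lstm_metrics), ("GRU", gru_metrics)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```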

In [11]:
# importing and preprocessing the test dataset:

df2 = pd.read_csv('test.csv')
df2[['keyword', 'location']] = df2[['keyword', 'location']].fillna('missing')
df2['text_clean'] = df2['text'].apply(preprocess_text)
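# TF-IDF features for the test set (not used by the GRU, which takes padded sequences)
X_tfidf_test = tfidf.transform(df2['text_clean'])

# Reuse the tokenizer and max_len fitted on the training data
X_seq_test = tokenizer.texts_to_sequences(df2['text_clean'])
X_seq_test = pad_sequences(X_seq_test, maxlen=max_len, padding='post')

# predict() returns shape (n, 1); flatten to a 1-D array of 0/1 labels
y_pred_gru = (gru_model.predict(X_seq_test) > 0.5).astype(int).ravel()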



In [None]:
# Appending the predictions to the dataframe
df2['target'] = y_pred_gru
df2.drop(columns=['keyword', 'location', 'text', 'text_clean'], inplace=True)
df2.head()

   id  target
0   0       0
1   2       1
2   3       1
3   9       1
4  11       1


In [15]:
# Saving as a csv to submit:
df2.to_csv('Disaster Tweets Submission.csv', index=False)