# MIT 15.776: Hands-On Deep Learning
## Final Project: Disaster Tweets Analysis 

**Giuseppe Iannone, Luca Sfragara, Trisha Sutivong, Hanna Zhang**

This notebook contains our exploration of NLP models that classify tweets as real (1) or fake (0) disaster announcements, i.e., tweets that report an actual disaster versus tweets that do not.



In [1]:
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'
import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
keras.utils.set_random_seed(42) # setting seed 
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score


### Step 0: Loading Data
The dataset comprises 7,613 labeled tweets for training and 3,263 unlabeled tweets for evaluation. Each instance contains:
- Text: The tweet content (max 280 characters)
- Keyword: Optional disaster-related keyword (e.g., “wildfire”, “earthquake”)
- Location: Optional user-provided location information
- Target: Binary label (1 = real disaster, 0 = not a disaster)

Because the test set provided by the competition is unlabeled, we split the labeled training data into train, validation, and test sets (68% / 17% / 15%). We use the training set to fit the models, the validation set for model selection and tuning, and the test set to evaluate final performance in this assignment.

In [2]:
train_df = pd.read_csv("nlp-getting-started/train.csv")
test_df = pd.read_csv("nlp-getting-started/test.csv")

# separate out test set 
train_val_texts, test_texts, train_val_labels, test_labels = train_test_split(
    train_df['text'].values, 
    train_df['target'].values,
    test_size=0.15,  # 15% for test set
    random_state=42,
    stratify=train_df['target'].values
)

# split remaining train into train (68%) and validation (17%)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_val_texts,
    train_val_labels,
    test_size=0.2,  # 20% of remaining
    random_state=42,
    stratify=train_val_labels
)

# print sizes of all sets
print("train set size: ", len(train_texts))
print("validation set size: ", len(val_texts))
print("test set size: ", len(test_texts))
print("submission set size (not used in this assignment): ", len(test_df))
print("example data: ")
print(train_df.head()) # looking at data 

train set size:  5176
validation set size:  1295
test set size:  1142
submission set size (not used in this assignment):  3263
example data: 
   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  
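
The split sizes printed above can be reproduced by hand. The sketch below is a minimal illustration, assuming `train_test_split` rounds the held-out portion up to a whole number of samples and leaves the remainder to the other side; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_total, test_frac=0.15, val_frac_of_rest=0.20):
    """Hypothetical helper: expected train/val/test sizes for the two-stage split."""
    # Stage 1: hold out the test set (assumed to be rounded up).
    n_test = math.ceil(test_frac * n_total)
    n_rest = n_total - n_test
    # Stage 2: hold out the validation set from what remains.
    n_val = math.ceil(val_frac_of_rest * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(7613)
print(n_train, n_val, n_test)
# Share of the full labeled data that ends up in each split.
print([round(n / 7613, 3) for n in (n_train, n_val, n_test)])
```

Taking 20% of the remaining 85% yields roughly 17% of the labeled data for validation, which leaves about 68% for training.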


### Step 1: Defining Evaluation Metric and Naive Baseline
Following Kaggle's competition guidelines, we define the four standard classification metrics we will use to assess model performance: F1 score, accuracy, precision, and recall. We define two helper functions for printing results later and establish a naive baseline, i.e., a model that always predicts the most common class.

In [None]:
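def evaluate_model(y_true, y_pred, model_name):
    # zero_division=0 reports 0 (without a warning) when a metric is undefined,
    # e.g. precision for a model that never predicts the positive class
    f1 = f1_score(y_true, y_pred, zero_division=0)
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)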
    return {
        'model': model_name,
        'f1': f1,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall
    }

def print_results(list_dicts):
    models = [d['model'] for d in list_dicts]
    print(f"\n{'Metric':<12}", end="")
    for model in models:
        print(f"{model:<30}", end="")
    print()
    for metric in ['f1', 'accuracy', 'precision', 'recall']:
        print(f"{metric:<12}", end="")
        for d in list_dicts:
            print(f"{d[metric]:<30.4f}", end="")
        print()

In [None]:
master_list = [] # create list for future models
# examine distribution of labels in data 
print("Target distribution in data:")
counts = train_df['target'].value_counts()
percentages = train_df['target'].value_counts(normalize=True) * 100
print(f"fake: {counts[0]} ({percentages[0]:.2f}%)")
print(f"real: {counts[1]} ({percentages[1]:.2f}%)")

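# baseline - what is our accuracy score if we always predict the most common class?
# the majority class is taken from the training labels only, so the held-out test set is not used
most_common = pd.Series(train_labels).mode()[0]
baseline_pred = np.full(len(test_labels), most_common)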
# evaluate baseline
baseline_metrics = evaluate_model(test_labels, baseline_pred,"Most Common Class (Baseline)")
master_list.append(baseline_metrics)
print_results(master_list)

Target distribution in data:
fake: 4342 (57.03%)
real: 3271 (42.97%)

Metric      Most Common Class (Baseline)
f1          0.0000
accuracy    0.5701
precision   0.0000
recall      0.0000
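
The zeros in the baseline row follow directly from the metric definitions. The sketch below recomputes the four metrics from confusion counts on a small made-up label vector. `toy_true` is illustrative data, not drawn from the dataset, and `binary_metrics` is a hypothetical helper. It assumes the convention of reporting 0 whenever a ratio has a zero denominator.

```python
def binary_metrics(y_true, y_pred):
    """Hypothetical helper: metrics for the positive class (1) from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {'f1': f1, 'accuracy': accuracy,
            'precision': precision, 'recall': recall}

# Illustrative labels: three negatives and two positives.
toy_true = [0, 0, 0, 1, 1]
# A majority-class baseline predicts 0 for every tweet.
majority_pred = [0] * len(toy_true)
toy_result = binary_metrics(toy_true, majority_pred)
print(toy_result)
```

A model that never predicts class 1 has no true positives, so precision, recall, and F1 all collapse to 0, while accuracy equals the share of the majority class. This is why accuracy alone would overstate the quality of the baseline.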




### Step 2: NLP Models From Scratch

### Step 3.1: Leveraging Feature Engineering

### Step 3.2: Leveraging Pre-Trained Models

### Step 4: Ensemble Models

### Step 5: Evaluation