**Step 1: Data Loading**

In [48]:
from google.colab import files

# Uploading file
uploaded = files.upload()


Saving train.csv to train (5).csv
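
The upload message suggests Colab renamed the file to `train (5).csv` because a `train.csv` already existed in the session, so reading the literal name `train.csv` later may pick up an older copy. Below is a minimal sketch of one way around this, assuming `files.upload()` returns a dict that maps file names to raw bytes. The helper name `first_upload` and the stand-in dict are hypothetical.

```python
import io

def first_upload(uploaded):
    """Return (name, in-memory buffer) for the first uploaded file."""
    name = next(iter(uploaded))
    return name, io.BytesIO(uploaded[name])

# Stand-in for the dict returned by files.upload(); the contents are made up.
fake_uploaded = {"train (5).csv": b"id,keyword,location,text,target\n"}
name, buffer = first_upload(fake_uploaded)
print(name, buffer.readline().decode().strip())
```

In the notebook, `pd.read_csv(buffer)` would then read exactly the bytes that were just uploaded, whatever name Colab gave the file.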


**Step 2: Exploring the Dataset**

In [49]:
import pandas as pd

df = pd.read_csv('train.csv')

print("Shape of the dataset:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nMissing values:\n", df.isnull().sum())

# Class distribution
print("\nClass distribution (target = 1 means real disaster):")
print(df['target'].value_counts())

# Show some examples per class
print("\nExamples - Real Disaster (target=1):")
print(df[df['target'] == 1]['text'].sample(3).values)

print("\nExamples - Not a Disaster (target=0):")
print(df[df['target'] == 0]['text'].sample(3).values)


Shape of the dataset: (7613, 5)

Columns: ['id', 'keyword', 'location', 'text', 'target']

Missing values:
 id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

Class distribution (target = 1 means real disaster):
target
0    4342
1    3271
Name: count, dtype: int64

Examples - Real Disaster (target=1):
['Trauma injuries involving kids and sport usually cycling related: Director Trauma NS  http://t.co/8DdijZyNkf #NS http://t.co/52Uus4TFN3'
 '@UnivSFoundation For the people who died in Human Experiments by Unit 731 of Japanese military http://t.co/vVPLFQv58P http://t.co/Rwaph6dAUv'
 'Dutch crane collapses demolishes houses: Dramatic eyewitness video captures the moment a Dutch crane hoisting... http://t.co/dYy7ml2NzJ']

Examples - Not a Disaster (target=0):
["@GeoffRickly I don't see the option to buy the full collapse vinyl with tee bundle just the waiting?"
 "'My Fifty Online Dates and why I'm still single' by Michael Windstorm $2.99 B&amp

**Step 3: Preprocessing Pipeline**

NLTK was originally considered for preprocessing, but its tokenizer raised an error (`punkt_tab`), so spaCy was used instead. spaCy offers integrated lemmatization and stopword removal.

In [50]:
!pip install spacy --quiet
!python -m spacy download en_core_web_sm


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.3.2 which is incompatible.
scipy 1.13.1 requires numpy<2.3,>=1.22.4, but you have numpy 2.3.2 which is incompatible.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 2.3.2 which is incompatible.
cupy-cuda12x 13.3.0 requires numpy<2.3,>=1.22, but you have numpy 2.3.2 which is incompatible.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 2.3.2 which is incompatible.
opencv-contrib-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 2.3.2 which is incompatible.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have numpy 2.3.2 which is incompatible.
tsfresh 0.21.0 requires scipy>=1.14.0; python_version >= "3.10

In [69]:
import spacy
import re

nlp = spacy.load("en_core_web_sm")

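def preprocess(text):
    """Lowercase, strip noise, lemmatize and drop stopwords."""
    text = text.lower()
    # Remove URLs, @mentions, whole hashtags (including the tagged word) and digits
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+|\d+", "", text)
    # Remove any remaining punctuation
    text = re.sub(r"[^\w\s]", "", text)

    doc = nlp(text)
    # Keep lemmas of tokens that are neither stopwords nor punctuation
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

    return " ".join(tokens)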

df['clean_text'] = df['text'].apply(preprocess)


In [52]:
sample_text = df['text'].iloc[0]
print("Original:\n", sample_text)
print("\nCleaned:\n", preprocess(sample_text))


Original:
 Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all

Cleaned:
 deed reason   allah forgive
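
The cleaned sample loses the word "earthquake" because the pattern `#\w+` deletes the whole hashtag, and the gaps left behind show up as extra spaces. The sketch below is a hypothetical regex-only variant (the name `strip_noise` is not part of the notebook) that keeps the hashtag word and collapses whitespace. It leaves out the spaCy lemmatization and stopword step, so its output is not directly comparable to `preprocess`.

```python
import re

def strip_noise(text):
    """Regex-only cleaning that keeps hashtag words (hypothetical variant)."""
    text = text.lower()
    # Drop URLs, @mentions and digits, as in the original pattern.
    text = re.sub(r"http\S+|www\S+|@\w+|\d+", "", text)
    # Keep the word after '#' instead of deleting the whole hashtag.
    text = re.sub(r"#(\w+)", r"\1", text)
    # Remove remaining punctuation, then collapse repeated whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_noise("Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all"))
```

Whether keeping hashtag words helps the classifiers here is untested; it is one thing to try.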


**Step 4: Feature Engineering**

In [70]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Splitting features and labels (clean_text was already computed in Step 3)
X_text = df['clean_text']
y = df['target']

# Bag-of-Words (unigrams)
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(X_text)

# TF-IDF (unigrams + bigrams)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
X_tfidf = tfidf_vectorizer.fit_transform(X_text)

print("BoW shape:", X_bow.shape)
print("TF-IDF shape:", X_tfidf.shape)


BoW shape: (7613, 11435)
TF-IDF shape: (7613, 48494)
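
The TF-IDF matrix has roughly four times as many columns as the Bag-of-Words matrix because `ngram_range=(1,2)` adds every adjacent word pair as its own feature. The toy sketch below illustrates that growth in plain Python. The three sentences are invented, and the counting is a simplification of what the scikit-learn vectorizers do.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy cleaned tweets, invented for illustration only.
docs = ["forest fire near town", "fire evacuation order", "forest fire spread"]

unigrams, bigrams = set(), set()
for doc in docs:
    tokens = doc.split()
    unigrams.update(ngrams(tokens, 1))
    bigrams.update(ngrams(tokens, 2))

print(len(unigrams), "unigram features")
print(len(unigrams | bigrams), "unigram + bigram features")
```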


Because of unresolved compatibility issues between gensim and numpy on Google Colab, Sentence-BERT was used to produce dense embeddings instead. This also covers an optional extension mentioned in the assignment.

In [58]:
!pip install -U sentence-transformers --quiet

In [59]:
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  # small, fast, 384-dim

# Convert each cleaned tweet into a dense vector (pass a plain list of strings)
X_bert = model.encode(df['clean_text'].tolist(), show_progress_bar=True)

print("Sentence-BERT shape:", X_bert.shape)



Sentence-BERT shape: (7613, 384)


**Step 5: Modeling & Evaluation**

In [64]:
from sklearn.model_selection import train_test_split

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)

train_idx = X_train_text.index
test_idx = X_test_text.index


In [65]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Splitting TF-IDF features
X_train_tfidf = X_tfidf[train_idx]
X_test_tfidf = X_tfidf[test_idx]

# Training and predicting
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
y_pred_nb = nb_model.predict(X_test_tfidf)

# Evaluating
print("Multinomial Naive Bayes on TF-IDF:")
print(classification_report(y_test, y_pred_nb))


Multinomial Naive Bayes on TF-IDF:
              precision    recall  f1-score   support

           0       0.77      0.91      0.84       869
           1       0.85      0.65      0.73       654

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.79      1523
weighted avg       0.81      0.80      0.79      1523
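
For class 1 the report shows high precision (0.85) but lower recall (0.65): most tweets flagged as disasters really are, but about a third of the real disasters are missed. The sketch below shows how these three columns are derived from raw prediction counts. The helper `binary_report` and the toy labels are hypothetical and are not taken from the dataset.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(binary_report(y_true, y_pred))
```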



In [66]:
from sklearn.linear_model import LogisticRegression

# Training and predicting
lr_model_tfidf = LogisticRegression(max_iter=1000)
lr_model_tfidf.fit(X_train_tfidf, y_train)
y_pred_lr_tfidf = lr_model_tfidf.predict(X_test_tfidf)

print("Logistic Regression on TF-IDF:")
print(classification_report(y_test, y_pred_lr_tfidf))


Logistic Regression on TF-IDF:
              precision    recall  f1-score   support

           0       0.76      0.93      0.84       869
           1       0.87      0.60      0.71       654

    accuracy                           0.79      1523
   macro avg       0.82      0.77      0.78      1523
weighted avg       0.81      0.79      0.78      1523



In [67]:
# Splitting dense features (Sentence-BERT)
X_train_bert = X_bert[train_idx]
X_test_bert = X_bert[test_idx]

# Training and predicting
lr_model_bert = LogisticRegression(max_iter=1000)
lr_model_bert.fit(X_train_bert, y_train)
y_pred_lr_bert = lr_model_bert.predict(X_test_bert)

print("Logistic Regression on Sentence-BERT:")
print(classification_report(y_test, y_pred_lr_bert))


Logistic Regression on Sentence-BERT:
              precision    recall  f1-score   support

           0       0.82      0.86      0.84       869
           1       0.80      0.74      0.77       654

    accuracy                           0.81      1523
   macro avg       0.81      0.80      0.80      1523
weighted avg       0.81      0.81      0.81      1523
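
To put the reported accuracies (0.79 to 0.81) in context, it helps to know what a model that always predicts the majority class would score. The sketch below computes that baseline from the class counts printed earlier. The helper name `majority_share` is hypothetical.

```python
# Class counts copied from the notebook outputs.
full_counts = {0: 4342, 1: 3271}   # whole dataset
test_counts = {0: 869, 1: 654}     # held-out test split (support column)

def majority_share(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())

print(f"Full data:  {majority_share(full_counts):.3f}")
print(f"Test split: {majority_share(test_counts):.3f}")
```

The two shares are nearly identical, which is what the `stratify=y` argument in the split is meant to achieve.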



**Step 6: Analysis & Discussion**

**Naive Bayes vs. Logistic Regression**

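On TF-IDF features, the two classifiers were very close, and Naive Bayes was actually slightly ahead: accuracy 0.80 vs. 0.79 and disaster-class F1 0.73 vs. 0.71. Logistic Regression was a little more precise on real disasters (0.87 vs. 0.85) but missed more of them (recall 0.60 vs. 0.65). Naive Bayes was also faster to train. It assumes that all words are independent given the class, which is a simplification for real tweets, but it held up well here.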
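
The independence assumption means that Naive Bayes scores a tweet by adding one log-probability term per word, without looking at word order or combinations. The sketch below is a tiny from-scratch multinomial Naive Bayes with Laplace smoothing that makes this visible. It is not scikit-learn's implementation, and the function names and toy training sentences are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = {c: Counter() for c in set(labels)}
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    model = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior = math.log(labels.count(c) / len(labels))
        log_lik = {w: math.log((counts[w] + alpha) / total) for w in vocab}
        model[c] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Score each class as log prior + sum of per-word log likelihoods."""
    scores = {}
    for c, (log_prior, log_lik) in model.items():
        # Each known word adds its own independent term; unseen words are skipped.
        scores[c] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)

# Toy training data, invented for illustration only.
docs = ["fire burn forest", "flood destroy home", "love this song", "great movie night"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "forest fire"))
```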

**Sparse vs. Dense Features**

I tried two types of features:

1. TF-IDF, which is based on counting words

2. Sentence-BERT embeddings, which are based on meaning

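Logistic Regression on Sentence-BERT embeddings gave the best overall result: accuracy 0.81 and disaster-class F1 0.77, compared with 0.79 and 0.71 for the same classifier on TF-IDF. The gain comes mainly from recall on real disasters (0.74 vs. 0.60), at the cost of some precision (0.80 vs. 0.87). I think this shows Sentence-BERT was better at catching the meaning of short messages, while TF-IDF was more literal. TF-IDF still worked well and was faster.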
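
One way to see why word-count features are "more literal": two tweets only look similar if they share exact tokens. The sketch below uses raw counts rather than TF-IDF weights for brevity, and the three sentences are invented. Sentence embeddings are designed to place paraphrases closer together, though that is not measured here.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy sentences, invented for illustration only.
s1 = Counter("wildfire spread fast".split())
s2 = Counter("forest blaze grow quickly".split())   # similar meaning, no shared words
s3 = Counter("wildfire spread slowly".split())      # two shared words

print(cosine(s1, s2))
print(cosine(s1, s3))
```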

**Time, Memory & Practical Stuff**

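The Sentence-BERT step took more time, since it has to download a pre-trained model and run every tweet through it, but it gave more meaningful vectors. If I had a much larger dataset, I might go with TF-IDF. When accuracy and semantic understanding matter more, Sentence-BERT seems worth it.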
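
A rough back-of-the-envelope check on memory, using the matrix shapes printed earlier. It assumes the embeddings are stored as 4-byte floats and that a dense TF-IDF array would use 8-byte floats. scikit-learn keeps TF-IDF as a sparse matrix, so its real footprint is far smaller than the dense figure, which is only there to show why sparse storage matters.

```python
# Shapes copied from the notebook outputs.
n_rows, n_tfidf, n_sbert = 7613, 48494, 384

# Assumptions: float32 (4 bytes) embeddings, float64 (8 bytes) for a dense TF-IDF array.
sbert_bytes = n_rows * n_sbert * 4
tfidf_dense_bytes = n_rows * n_tfidf * 8

print(f"Sentence-BERT matrix: {sbert_bytes / 1e6:.1f} MB")
print(f"TF-IDF if stored densely: {tfidf_dense_bytes / 1e9:.2f} GB")
```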

**Explainability**

TF-IDF is easier to explain because you can see which words mattered the most. Sentence-BERT, on the other hand, is more like a black box: it's powerful but harder to interpret.
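
Seeing which words mattered can be as simple as sorting a linear model's weights. The sketch below does this for a list of feature names and weights. The helper `top_features` and the example values are hypothetical and are not results from this notebook.

```python
def top_features(feature_names, weights, k=3):
    """Return the k most positive and k most negative (name, weight) pairs."""
    ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1])
    return ranked[-k:][::-1], ranked[:k]

# Hypothetical names and weights, invented for illustration only.
names = ["fire", "love", "flood", "lol", "earthquake", "song"]
weights = [2.1, -1.4, 1.7, -1.9, 2.6, -0.8]

positive, negative = top_features(names, weights, k=2)
print("Pushes toward class 1:", positive)
print("Pushes toward class 0:", negative)
```

In the notebook, the inputs could be `tfidf_vectorizer.get_feature_names_out()` and `lr_model_tfidf.coef_[0]`. There is no equally direct reading for the 384 Sentence-BERT dimensions.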