# NLP Disaster Tweets — Mini Project (Week 4)

This project is part of the Deep Learning Module 4 assignment.  
The goal is to classify tweets as **disaster-related (1)** or **not disaster-related (0)** using NLP techniques.

Dataset source:  
Kaggle — “NLP Getting Started” competition  
https://www.kaggle.com/competitions/nlp-getting-started


## Alexander Voit

## 1. Problem Description & Data Overview

This is a binary classification task where the goal is to determine whether a tweet is related to a real disaster (label = 1) or not (label = 0).  
The dataset contains short, noisy Twitter messages that may include hashtags, links, emojis, abbreviations, and informal language — making it a typical real-world NLP problem.

**Dataset summary:**
- Training samples: 7,613 tweets  
- Test samples: 3,263 tweets  
- Columns:
  - `id`  
  - `keyword` (may contain helpful context)  
  - `location` (optional, often noisy)  
  - `text` (main feature: the tweet itself)  
  - `target` (0 or 1, only in training set)

Next, we load the dataset and inspect its structure.


In [None]:
import pandas as pd

# Load data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample_sub = pd.read_csv("sample_submission.csv")

# Show basic info
train.head(), train.shape, test.shape


**Observation:**  
The training set contains 7,613 rows and 5 columns.  
`keyword` and `location` have missing values, which is expected for tweets and will not prevent modeling since the main signal is in the `text` column.


## 2. Exploratory Data Analysis (EDA)

In this step, we inspect the dataset to understand its structure, identify missing values,  
explore the distribution of the target variable, and analyze tweet characteristics such as text length.

The goal of this EDA is to decide how to clean and preprocess the tweets before modeling.


### 2.1 Data Structure Overview

We begin by inspecting the dataset structure, column types, and the presence of missing values.  
This helps us understand how clean the data is and what preprocessing will be necessary.


In [None]:
# Basic information
train.info()


**Observation:**  
The training set has 7,613 entries in five columns.  
`keyword` and `location` have fewer non-null entries than the other columns; the missing values are counted at the end of this section.


### 2.2 Target Distribution

Next, we examine the balance between the two classes:  
- `0` → non-disaster  
- `1` → disaster  

This helps evaluate whether we need to address class imbalance.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(data=train, x="target")
plt.title("Target Distribution (0 = non-disaster, 1 = disaster)")
plt.show()


**Observation:**  
The classes are slightly imbalanced: non-disaster tweets are more frequent.  
However, the imbalance is not severe and we can proceed without applying special balancing techniques.


### 2.3 Tweet Length Distribution

We analyze the number of characters per tweet to understand the typical tweet size.  
This helps determine preprocessing limits such as maximum sequence length.


In [None]:
# Number of characters in each tweet
train["text_len"] = train["text"].str.len()

plt.figure(figsize=(8,5))
sns.histplot(train["text_len"], bins=40, kde=True)
plt.title("Tweet Length Distribution")
plt.xlabel("Tweet length (characters)")
plt.ylabel("Count")
plt.show()

train["text_len"].describe()


**Observation:**  
Most tweets fall between 50 and 150 characters.  
The distribution is typical for Twitter data: short, noisy messages.  
We will keep full text length during preprocessing since truncation is not necessary.
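
The histogram above counts characters, whereas a sequence model's length limit is counted in tokens. The following is a minimal sketch of how a token-length cutoff could be read off the data. It assumes plain whitespace tokenization, uses a hypothetical nearest-rank `percentile` helper, and runs on made-up sample tweets rather than the real `train["text"]` column.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of `values` for 0 < q <= 100 (hypothetical helper)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up examples standing in for train["text"]
sample_tweets = [
    "Huge wildfire spreading near the highway, residents told to leave now",
    "I love this song",
    "Flooding reported downtown after the storm http://example.com",
    "This exam was a total disaster lol",
]

# Token counts under simple whitespace splitting
token_counts = [len(tweet.split()) for tweet in sample_tweets]

# Smallest length that covers at least 95% of the sample tweets without truncation
suggested_max_len = percentile(token_counts, 95)
print(token_counts, suggested_max_len)
```

On the real data, a similar figure could be obtained with `train["text"].str.split().str.len().quantile(0.95)`, although pandas interpolates between ranks by default, so the two values may differ slightly.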


In [None]:
# Count missing values per column
train.isnull().sum()


**Observation:**  
`keyword` and `location` contain missing values, but these fields are optional and not critical.  
Our main feature is the tweet text, so we can proceed without filling these columns.


## 3. Model Architecture

In this section, we describe the preprocessing pipeline, feature extraction methods,  
and the model architectures used for the classification task.  
We begin with traditional NLP approaches (TF-IDF + Logistic Regression)  
and then build a deep learning model using an Embedding layer followed by an LSTM/GRU network.

The goal is not only to build a working model but also to demonstrate an understanding  
of how sequential neural networks (RNN, LSTM, GRU) process text data.


### 3.1 Text Preprocessing

Tweets often contain URLs, punctuation, mentions, hashtags, and inconsistent casing.  
To clean the text before tokenization, we apply the following steps:

- Lowercase all text  
- Remove URLs  
- Remove HTML tags  
- Remove punctuation  
- Remove numbers  
- Remove extra spaces  
- (Optional) Remove stopwords (not included in `clean_text()`)  

We implement a reusable `clean_text()` function for both training and test datasets.


In [None]:
import re
import string

def clean_text(text):
    # lowercase
    text = text.lower()
    
    # remove URLs (http\S+ already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # remove numbers
    text = re.sub(r'\d+', '', text)
    
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply cleaning
train["clean_text"] = train["text"].apply(clean_text)
test["clean_text"] = test["text"].apply(clean_text)

train[["text", "clean_text"]].head()


**Observation:**  
After preprocessing, the tweets become cleaner and more uniform.  
URLs, punctuation, and noise are removed, which helps the model focus on semantic content  
rather than surface-level artifacts. This prepares the text for vectorization and embedding.
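
The optional stopword step from the list above is not part of `clean_text()`. The sketch below shows one way it could be added, assuming the input has already been lowercased and stripped of punctuation. The `STOPWORDS` set and the `remove_stopwords` name are illustrative, not taken from the notebook; a real run would more likely use a fuller list such as the ones shipped with NLTK or scikit-learn.

```python
# Small illustrative stopword set (an assumption, not a standard list)
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from already-cleaned, lowercase, space-separated text."""
    return " ".join(word for word in text.split() if word not in stopwords)

print(remove_stopwords("the fire is spreading in the forest"))
# prints: fire spreading forest
```

Whether this helps on the tweet data is untested here. Negations such as "not" are left out of the set on purpose, since removing them can flip the meaning of a short message.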


### 3.2 Feature Extraction

We use two feature extraction methods:

**1. TF-IDF (traditional NLP approach)**  
This converts each tweet into a sparse vector based on term frequency and inverse document frequency.  
It serves as a strong baseline for text classification with simple models such as Logistic Regression.

**2. Tokenizer + Embedding (for RNN models)**  
For LSTM/GRU, we convert text into integer sequences and learn dense word embeddings.  
This allows the model to capture semantic relationships between words.
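
To make the TF-IDF weighting concrete, here is a small pure-Python sketch on three made-up documents. It assumes raw term counts, the smoothed IDF formula that scikit-learn uses by default, and L2 normalization. The helper names are hypothetical, and the sketch ignores details such as n-grams and `max_features`.

```python
import math
from collections import Counter

# Toy corpus (illustrative only)
docs = [
    "fire in the forest",
    "fire alarm test",
    "love the forest",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "in" appears in only one of the three documents and therefore receives a higher weight than "fire", "the", or "forest", which each appear in two.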


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

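# TF-IDF vectorizer (unigrams and bigrams, at most 5,000 features)
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))

# Note: the vectorizer is fitted on all training tweets before the split,
# so validation tweets contribute to the vocabulary and IDF weights.
X_tfidf = tfidf.fit_transform(train["clean_text"])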
y = train["target"]

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42
)

# Baseline model: Logistic Regression
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# Predictions
y_pred = log_reg.predict(X_val)

# Accuracy
baseline_acc = accuracy_score(y_val, y_pred)
baseline_acc


**Observation:**  
The baseline TF-IDF + Logistic Regression model achieved an accuracy of **0.8056** on the validation set.  
This is a strong classical NLP baseline and shows that simple bag-of-words features already capture useful signal in the dataset.  
We will now build a deep learning model (LSTM/GRU) to determine whether sequential modeling can improve performance.
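
In the baseline cell, the vectorizer is fitted before the train/validation split, so the validation tweets influence the vocabulary and IDF weights. The sketch below shows one possible alternative: split the raw text first and wrap both steps in a scikit-learn `Pipeline`, so that TF-IDF is fitted on the training part only. It also reports F1, the metric the Kaggle competition uses, instead of accuracy. The toy texts and labels are made up for illustration, and the resulting score says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for train["clean_text"] and train["target"] (illustrative data only)
texts = [
    "forest fire near the town", "earthquake shakes the city",
    "flood waters rising fast", "residents evacuated after wildfire",
    "i love this sunny day", "my new phone is on fire lol",
    "great movie last night", "traffic was a disaster haha",
] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Split the raw text first so the vectorizer never sees validation tweets
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Same hyperparameters as the baseline cell, chained into one estimator
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=200)),
])
pipe.fit(X_tr, y_tr)

val_pred = pipe.predict(X_va)
val_f1 = f1_score(y_va, val_pred)
print(f"Validation F1: {val_f1:.4f}")
```

With the real data, `texts` and `labels` would be replaced by `train["clean_text"]` and `train["target"]`. The score from such a run may differ from the accuracy reported above, because both the split and the fitted vocabulary change.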


### 3.3 Tokenization and Sequence Preparation (for LSTM/GRU)

For the deep learning model, we convert each cleaned tweet into a sequence of integer indices:

1. Fit a Keras `Tokenizer` on the cleaned training text.  
2. Transform tweets into integer sequences.  
3. Pad or truncate sequences to a fixed maximum length.

This representation allows us to use an Embedding layer followed by an LSTM/GRU network.


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Parameters
max_words = 20000      # maximum vocabulary size
max_len = 40           # maximum tweet length in tokens (we'll experiment later)

# Fit tokenizer on training text
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(train["clean_text"].values)

# Text -> sequences
X_seq = tokenizer.texts_to_sequences(train["clean_text"].values)
X_test_seq = tokenizer.texts_to_sequences(test["clean_text"].values)

# Pad sequences
X_pad = pad_sequences(X_seq, maxlen=max_len, padding="post", truncating="post")
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding="post", truncating="post")

y = train["target"].values

X_pad.shape, X_test_pad.shape
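

The three steps above can be illustrated with a toy pure-Python version. It is a sketch that mirrors the conventions used in the cell (index 0 for padding, index 1 for the out-of-vocabulary token, more frequent words getting smaller indices, `post` padding and truncation). It is not the Keras implementation; the helper names are hypothetical, and it skips Keras details such as character filtering.

```python
from collections import Counter

PAD, OOV = 0, 1  # index 0 is padding; index 1 marks out-of-vocabulary words

def build_vocab(texts, max_words):
    """Map the most frequent words to indices 2, 3, ... (0 and 1 are reserved)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_words - 2)]
    return {word: idx for idx, word in enumerate(most_common, start=2)}

def encode(text, vocab):
    """Replace each word with its index, or OOV if it was not seen during fitting."""
    return [vocab.get(word, OOV) for word in text.split()]

def pad_post(seq, max_len):
    """Truncate from the end or right-pad with PAD to exactly max_len items."""
    return seq[:max_len] + [PAD] * max(0, max_len - len(seq))

# Made-up training texts
train_texts = ["fire in the forest", "fire alarm"]
vocab = build_vocab(train_texts, max_words=10)

encoded = pad_post(encode("fire near forest", vocab), max_len=5)
print(encoded)
# prints: [2, 1, 5, 0, 0]
```

Here "fire" is the most frequent word and gets index 2, "near" was never seen and maps to the OOV index 1, and the two trailing zeros are padding.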
