
**Explanation of Architecture/Hyperparameter Choices**

Architecture (BiLSTM & RNN):

BiLSTM (Bidirectional LSTM):

Chosen to capture bidirectional context in text (past and future word dependencies), which is critical for understanding disaster-related language nuances.

LSTM’s memory cells mitigate the vanishing gradient problem, making it suitable for sequential data like text.

RNN:

Used as a baseline for comparison with the BiLSTM. A plain RNN is simpler and computationally lighter, but it struggles with long-term dependencies.

Hyperparameters:

Embedding Size (300):

Matches the dimensionality of common pre-trained embeddings (e.g., GloVe 300D), although no pre-trained vectors are loaded here and the embedding layer is trained from scratch. Balances dimensionality and computational cost.

Hidden Size (512):

Provides sufficient capacity to learn complex patterns without excessive parameters.

Layers (2):

Adds depth for hierarchical feature extraction while avoiding overfitting.

Dropout (0.4):

Regularizes the model to prevent overfitting, especially important with limited training data.

Learning Rate (0.0005):

A small learning rate keeps the AdamW updates stable.

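Vocabulary Size (20,000):

Sets the number of rows in the embedding table. Limiting the vocabulary to frequent words reduces noise and computational load. Note that the tokenizer in this notebook is created with num_words=10000, so only indices below 10,000 are actually produced and the upper half of the table goes unused.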
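
To make the capacity and cost trade-off concrete, the sketch below counts trainable parameters for both models from the hyperparameters listed above. It assumes PyTorch's layout for nn.LSTM and nn.RNN (input weights, hidden weights, and two bias vectors per gate block, with the second layer fed by the full output of the first). The helper names are illustrative and not part of the notebook.

```python
def recurrent_params(input_size, hidden_size, num_layers, gates, directions):
    """Count weights and biases of a stacked recurrent module.

    gates=4 for an LSTM, gates=1 for a plain RNN (assumed PyTorch layout:
    input weights, hidden weights, and two bias vectors per gate block).
    """
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        per_direction = gates * (
            layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size
        )
        total += directions * per_direction
        # The next layer reads the concatenated output of all directions
        layer_input = hidden_size * directions
    return total


def classifier_params(vocab_size, embed_size, hidden_size, output_size,
                      num_layers, gates, directions):
    """Embedding table + recurrent stack + final linear layer."""
    embedding = vocab_size * embed_size
    recurrent = recurrent_params(embed_size, hidden_size, num_layers, gates, directions)
    linear = hidden_size * directions * output_size + output_size
    return embedding + recurrent + linear


bilstm_total = classifier_params(20000, 300, 512, 2, 2, gates=4, directions=2)
rnn_total = classifier_params(20000, 300, 512, 2, 2, gates=1, directions=1)
print(f"BiLSTM: {bilstm_total:,} parameters")
print(f"RNN:    {rnn_total:,} parameters")
```

Under these assumptions the BiLSTM comes to roughly 15.6 million parameters and the RNN to roughly 6.9 million, with the 6 million embedding weights included in both counts.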

**Import Necessary Libraries**

We import essential libraries such as pandas for data handling, numpy for numerical computations, re and string for text preprocessing, nltk for NLP tasks, torch for deep learning, and sklearn for data splitting. The Keras Tokenizer and pad_sequences utilities (from tensorflow) are used for tokenization and padding.

In [2]:
import pandas as pd
import numpy as np
import os
import re
import string
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

**Download Necessary NLTK Resources**

To ensure tokenization, stopword removal, and lemmatization work properly, we download required datasets from nltk.

In [5]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

**Load Dataset**

This dataset contains text data and labels (0 for non-disaster and 1 for disaster tweets).

In [6]:
df = pd.read_csv('/content/train (1).csv')

**Preprocessing the text data**

The preprocess_text function cleans and normalizes each tweet in the following steps:

1. Convert text to lowercase to ensure uniformity.

2. Remove URLs to clean the text.

3. Remove punctuation to reduce noise.
   
4. Tokenize text into words.
    
5. Remove stopwords to focus on meaningful words.
    
6. Apply lemmatization to get the root form of words.
    

In [7]:
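# Build these once; recreating them for every word or every tweet is very slow
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()
# Escape the punctuation so characters such as ], \ and ^ are literal inside [...]
PUNCT_PATTERN = re.compile(f'[{re.escape(string.punctuation)}]')

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = PUNCT_PATTERN.sub('', text)  # Remove punctuation
    words = word_tokenize(text)  # Tokenization
    words = [word for word in words if word not in STOP_WORDS]  # Remove stopwords
    words = [LEMMATIZER.lemmatize(word) for word in words]  # Lemmatization
    return ' '.join(words)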


In [8]:
# Apply preprocessing
df['cleaned_text'] = df['text'].apply(preprocess_text)
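
As a quick illustration of the first three steps, here is a minimal standard-library sketch applied to a made-up tweet. It leaves out the nltk tokenization, stopword removal, and lemmatization, and basic_clean is a hypothetical helper rather than the notebook's function.

```python
import re
import string

# Escape the punctuation so characters like ']' and '\' are literal inside [...]
PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")


def basic_clean(text):
    """Lowercase, drop URLs, drop punctuation, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = PUNCTUATION.sub("", text)
    return text.split()


sample = "Forest FIRE near La Ronge! http://t.co/abc #wildfire"  # made-up example
tokens = basic_clean(sample)
print(tokens)  # ['forest', 'fire', 'near', 'la', 'ronge', 'wildfire']
```

The exclamation mark and the hashtag symbol both disappear, which is the kind of signal the conclusion flags as possibly lost to over-aggressive preprocessing.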

**Tokenization and Padding**

We convert text into numerical sequences using the Tokenizer class. Padding ensures all sequences have the same length for model training.

In [9]:
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(df['cleaned_text'])
sequences = tokenizer.texts_to_sequences(df['cleaned_text'])
padded_sequences = pad_sequences(sequences, maxlen=100, padding='post', truncating='post')
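
To show what these two calls produce, the following pure-Python sketch approximates the documented behavior of Tokenizer(num_words=..., oov_token=...) and pad_sequences(..., padding='post', truncating='post'). It is not the Keras implementation, and the function names are hypothetical. The key assumptions are that index 0 is reserved for padding, index 1 for the OOV token, and that any word ranked at or beyond num_words is mapped to OOV.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Assign index 1 to the OOV token, then 2, 3, ... by descending frequency."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_padded(texts, word_index, num_words, maxlen, oov_token="<OOV>"):
    """Map words to indices, then truncate and zero-pad at the end of each row."""
    oov = word_index[oov_token]
    rows = []
    for text in texts:
        ids = [word_index.get(word, oov) for word in text.split()]
        ids = [i if i < num_words else oov for i in ids]  # rare words become OOV
        ids = ids[:maxlen]                                # truncating='post'
        rows.append(ids + [0] * (maxlen - len(ids)))      # padding='post'
    return rows


corpus = ["fire fire flood", "flood fire storm"]  # toy corpus
word_index = build_word_index(corpus)
padded = texts_to_padded(["fire storm quake"], word_index, num_words=4, maxlen=5)
print(word_index)
print(padded)  # [[2, 1, 1, 0, 0]]
```

With maxlen=100 and short tweets, most of each row is trailing zeros, which matters for any model that reads the last time step.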

**Split Data into Training and Testing Sets**

We divide the dataset into 80% training and 20% testing data for model evaluation. We then convert the data into tensors so that it can be used with PyTorch for training.

In [10]:
y = df['target'].values
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, y, test_size=0.2, random_state=42)

X_train_tensor = torch.tensor(X_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.long)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

**Create Dataset and DataLoader**

We create a PyTorch Dataset class to load data efficiently in batches for training and evaluation.

In [11]:
class TextDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset = TextDataset(X_train_tensor, y_train_tensor)
test_dataset = TextDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


**Define LSTM And RNN Model**

In [12]:
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size, num_layers, dropout):
        super(BiLSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True, bidirectional=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size * 2, output_size)
        self.dropout = nn.Dropout(dropout)

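    def forward(self, x):
        x = self.embedding(x)  # (batch, seq_len, embed_size)
        lstm_out, _ = self.lstm(x)  # (batch, seq_len, 2 * hidden_size)
        # Take the output at the last time step. With padding='post' this position
        # is usually padding, and the backward direction has seen only that one step.
        x = self.dropout(lstm_out[:, -1, :])
        x = self.fc(x)  # (batch, output_size) logits
        return x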


In [13]:
class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size, num_layers, dropout):
        super(RNNClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, num_layers, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.embedding(x)
        rnn_out, _ = self.rnn(x)
        x = self.dropout(rnn_out[:, -1, :])
        x = self.fc(x)
        return x
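
Both classifiers read the output at the last time step, and because the sequences are padded with padding='post' that position is usually a padding token. One possible alternative, shown here only as a sketch, is to pack the padded batch and classify from each direction's final hidden state over the real tokens. It assumes index 0 is the padding index (as produced by pad_sequences), and the class name BiLSTMFinalState is hypothetical. This variant was not trained in the notebook, so its effect on accuracy is unknown.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class BiLSTMFinalState(nn.Module):
    """Hypothetical variant: classify from the final hidden states of real tokens."""

    def __init__(self, vocab_size, embed_size, hidden_size, output_size, num_layers, dropout):
        super().__init__()
        # padding_idx=0 keeps the padding embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers,
                            batch_first=True, bidirectional=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size * 2, output_size)

    def forward(self, x):
        # Number of non-padding tokens per row; at least 1 so packing never fails
        lengths = (x != 0).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(self.embedding(x), lengths,
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n has shape (num_layers * 2, batch, hidden); the last two entries are
        # the top layer's forward and backward final states
        final = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(self.dropout(final))


model = BiLSTMFinalState(vocab_size=50, embed_size=8, hidden_size=16,
                         output_size=2, num_layers=2, dropout=0.0)
model.eval()
short = torch.tensor([[5, 7, 9, 0, 0, 0]])
longer = torch.tensor([[5, 7, 9, 0, 0, 0, 0, 0, 0]])  # same tokens, more padding
with torch.no_grad():
    same = torch.allclose(model(short), model(longer), atol=1e-6)
print(same)
```

The final check compares the logits for the same tokens with different amounts of padding. With packing, the amount of padding should no longer influence the prediction.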


**Define Model Parameters and Initialize Model**

Hyperparameters such as embedding size, hidden size, and number of layers follow the choices explained at the start of this notebook. Both models share the same values so that the comparison is like for like.

In [14]:
vocab_size = 20000
embed_size = 300
hidden_size = 512
output_size = 2
num_layers = 2
dropout = 0.4


In [15]:
rnn_model = RNNClassifier(vocab_size, embed_size, hidden_size, output_size, num_layers, dropout)
lstm_model = BiLSTMClassifier(vocab_size, embed_size, hidden_size, output_size, num_layers, dropout)


**Define Loss Function and Optimizer**

We use CrossEntropyLoss over the two output logits (non-disaster vs. disaster) and the AdamW optimizer, with a separate optimizer instance for each model.

In [16]:
criterion = nn.CrossEntropyLoss()
optimizer_rnn = optim.AdamW(rnn_model.parameters(), lr=0.0005)
optimizer_lstm = optim.AdamW(lstm_model.parameters(), lr=0.0005)
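
For a single example, cross-entropy is the negative log-probability that the softmax assigns to the true class. The pure-Python sketch below evaluates that formula directly (it is not the PyTorch implementation, and cross_entropy is a hypothetical helper) to give a reference point: a model that outputs equal logits for both classes scores ln 2, about 0.693.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


chance = cross_entropy([0.0, 0.0], target=1)     # no preference between classes
confident = cross_entropy([2.0, 0.0], target=0)  # correct and fairly confident
print(f"chance-level loss: {chance:.4f}")
print(f"confident, correct: {confident:.4f}")
```

The training log further down reports losses of 0.6889 and 0.6851, only slightly below this chance-level value.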

**Train the Model**

We train each model for 2 epochs, updating weights using backpropagation. A checkpoint is saved after every epoch so that training can resume if it is interrupted.

In [18]:
rnn_checkpoint_path = "rnn_model_checkpoint.pth"
lstm_checkpoint_path = "lstm_model_checkpoint.pth"

def save_checkpoint(epoch, model, optimizer, loss, checkpoint_path):
    """Save training checkpoint."""
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss
    }, checkpoint_path)
    print(f"Checkpoint saved at epoch {epoch+1}")

def load_checkpoint(model, optimizer, checkpoint_path):
    """Load training checkpoint if available."""
    if os.path.exists(checkpoint_path):
        checkpoint = torch.load(checkpoint_path)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        print(f"Resuming from epoch {checkpoint['epoch']+1}")
        return checkpoint['epoch'] + 1
    return 0

def train_model(model, train_loader, criterion, optimizer, epochs=2, checkpoint_path="model_checkpoint.pth"):
    model.train()


    start_epoch = load_checkpoint(model, optimizer, checkpoint_path)

    for epoch in range(start_epoch, epochs):
        total_loss = 0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")


        save_checkpoint(epoch, model, optimizer, avg_loss, checkpoint_path)

print("Training LSTM Model...")
train_model(lstm_model, train_loader, criterion, optimizer_lstm, epochs=2, checkpoint_path=lstm_checkpoint_path)
print("Training RNN Model")
train_model(rnn_model, train_loader, criterion, optimizer_rnn, epochs=2, checkpoint_path=rnn_checkpoint_path)



Training LSTM Model...
Epoch 1/2, Loss: 0.6889
Checkpoint saved at epoch 1
Epoch 2/2, Loss: 0.6851
Checkpoint saved at epoch 2
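
One consequence of the resume logic is worth spelling out. The checkpoint stores a zero-based epoch index, and training restarts from the index after it. The sketch below mimics only that arithmetic (epochs_to_run is a hypothetical helper, not part of the notebook) to show which epochs a rerun of the training cell would execute.

```python
def epochs_to_run(saved_epoch, epochs):
    """Return the epoch indices the training loop would iterate over.

    saved_epoch is the zero-based index stored in the checkpoint,
    or None when no checkpoint file exists.
    """
    start_epoch = 0 if saved_epoch is None else saved_epoch + 1
    return list(range(start_epoch, epochs))


print(epochs_to_run(None, 2))  # fresh run: [0, 1]
print(epochs_to_run(0, 2))     # interrupted after the first epoch: [1]
print(epochs_to_run(1, 2))     # checkpoint from a finished run: []
```

Once a run has finished, re-running the cell with the same epochs value trains nothing. Raising epochs or deleting the .pth file is needed before further training happens.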


**Evaluate the Model**

We calculate accuracy on the test dataset to measure model performance.

In [27]:
def evaluate_model(model, test_loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    print(f'Accuracy: {correct / total:.4f}')


In [28]:
print("Evaluating RNN Model")
evaluate_model(rnn_model, test_loader)

Evaluating RNN Model
Accuracy: 0.5739


In [29]:
print("Evaluating LSTM Model")
evaluate_model(lstm_model, test_loader)

Evaluating LSTM Model
Accuracy: 0.5739
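
Accuracy alone cannot show whether a model has collapsed to predicting a single class, which is one possible explanation (an assumption, not verified here) for both models reporting exactly 0.5739. A minimal sketch of two complementary checks follows: the accuracy of always predicting the most frequent label, and the F1 score for the positive class. The function names and the toy labels are illustrative only.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def binary_f1(y_true, y_pred, positive=1):
    """F1 score for the positive class; returns 0.0 when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, not the notebook's data
y_true = [0, 0, 0, 0, 1, 1, 1]
all_zero = [0] * len(y_true)  # a model that always predicts "non-disaster"
baseline = majority_baseline_accuracy(y_true)
collapsed_f1 = binary_f1(y_true, all_zero)
print(f"majority baseline accuracy: {baseline:.4f}")
print(f"F1 of the all-zero predictor: {collapsed_f1:.4f}")
```

If a model's test accuracy equals the majority baseline and its F1 for the disaster class is zero, it has learned nothing beyond the class prior.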


**Generate Predictions for Kaggle Submission**

In [38]:
test_df = pd.read_csv('/content/test (1).csv')
test_df['cleaned_text'] = test_df['text'].apply(preprocess_text)
test_sequences = tokenizer.texts_to_sequences(test_df['cleaned_text'])
test_padded = pad_sequences(test_sequences, maxlen=100, padding='post', truncating='post')
test_tensor = torch.tensor(test_padded, dtype=torch.long)

model = rnn_model
model.eval()
with torch.no_grad():
    test_outputs = model(test_tensor)
    _, test_predictions = torch.max(test_outputs, 1)

submission = pd.DataFrame({'id': test_df['id'], 'target': test_predictions.numpy()})
submission.to_csv('submission.csv', index=False)



**Conclusion**

Best Model in Terms of Quality/Resources:

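Both the RNN and the BiLSTM reached the same test accuracy (57.39%), which points to underfitting or insufficient training (only 2 epochs). Agreement to four decimal places also suggests that both models may be predicting the same class for every test tweet, which is worth checking.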

In theory the BiLSTM should give better quality because it reads context in both directions, but it requires more resources (memory and computation). The RNN is lighter but less effective at capturing long-range dependencies.

How to Improve Results:

Increase Training Epochs: Train for more epochs (e.g., 10–20) to allow convergence.

Hyperparameter Tuning: Experiment with embedding/hidden sizes, dropout rates, and learning rates.

Pre-trained Embeddings: Use GloVe or BERT embeddings for better word representations.

Attention Mechanisms: Add attention layers to focus on critical words.

Data Augmentation: Expand the dataset with synonym replacement or back-translation.

Class Balancing: Address potential class imbalance in disaster/non-disaster tweets.
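
For the pre-trained embeddings suggestion, a common recipe is to build a matrix whose row i holds the vector for the word with tokenizer index i. The sketch below uses a made-up three-dimensional vector table in place of a real GloVe file, and build_embedding_matrix is a hypothetical helper. In the notebook, the resulting matrix could then be converted to a float tensor and loaded into the embedding layer, for example with nn.Embedding.from_pretrained.

```python
import random


def build_embedding_matrix(word_index, vectors, num_words, dim, seed=0):
    """Row i holds the vector for the word whose index is i.

    Row 0 (padding) stays all zeros; words without a pre-trained vector
    get small random values so they can still be learned.
    """
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, idx in word_index.items():
        if idx == 0 or idx >= num_words:
            continue  # outside the table, same cutoff as the tokenizer
        if word in vectors:
            matrix[idx] = list(vectors[word])
        else:
            matrix[idx] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return matrix


# Hypothetical toy inputs standing in for tokenizer.word_index and a GloVe file
word_index = {"<OOV>": 1, "fire": 2, "flood": 3, "quake": 4}
vectors = {"fire": [0.1, 0.2, 0.3], "flood": [0.4, 0.5, 0.6]}
matrix = build_embedding_matrix(word_index, vectors, num_words=4, dim=3)
for row in matrix:
    print(row)
```

Here the word with index 4 is dropped because it falls outside num_words, mirroring how the tokenizer maps rare words to the OOV index.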

Difficulties Encountered:

Low Accuracy: The models underperformed (57.39% accuracy), likely due to:

Insufficient training time (2 epochs).

Over-aggressive preprocessing (e.g., removing punctuation might discard context like “!” in disaster tweets).

Suboptimal hyperparameters or architecture details, such as reading the last time step of post-padded sequences, which is usually a padding position.

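Framework Mixing: Using Keras (Tokenizer) with PyTorch for modeling adds a heavy TensorFlow dependency for two utilities and makes it easy for settings to drift apart, as with the tokenizer's num_words=10000 versus the model's vocab_size=20000.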

Recommendation:

The BiLSTM holds more potential with adjustments (more epochs, tuned hyperparameters, and pre-trained embeddings). For immediate improvements, prioritize extended training and embedding enhancements.