### Overview of the DistilBERT + BiLSTM Classification Model

The DistilBERT + BiLSTM classification model combines the lightweight DistilBERT encoder with a bidirectional LSTM network for text classification. DistilBERT supplies contextual embeddings, and the LSTM adds a recurrent layer on top of them before the final classification step. The combination targets scenarios where computational efficiency is crucial but the contextual understanding required for complex NLP tasks should not be significantly sacrificed.

### Key Components of the Model:

1. **DistilBERT Model**:
   - Serves as the backbone for generating contextual embeddings from input text. DistilBERT is a distilled version of BERT; its authors report that it is about 40% smaller and 60% faster while retaining about 97% of BERT's language-understanding performance, which makes it suitable for environments with limited computational resources.

2. **Bidirectional LSTM**:
   - Enhances the sequence processing capabilities by handling information from both past and future contexts simultaneously. This is particularly useful for understanding the context in which words appear within sequences, thereby improving the accuracy of predictions for tasks like sentiment analysis or contextual text classification.

3. **Dropout and Linear Layer**:
   - Dropout is applied to the output of the LSTM to prevent overfitting, ensuring that the model remains generalizable to new, unseen data. The linear layer then maps the LSTM outputs to the final classification labels.
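
The way these three components fit together can be sketched with plain PyTorch. The block below is a minimal sketch, not the notebook's implementation: a random tensor stands in for DistilBERT's output, `BiLSTMHead` is a hypothetical name, and running the LSTM over every token embedding (with padding skipped via packed sequences) is an assumption about the design described above. The notebook code further down passes only the first-token embedding to its LSTM.

```python
import torch
import torch.nn as nn


class BiLSTMHead(nn.Module):
    """Hypothetical head: bidirectional LSTM, dropout, then a linear classifier."""

    def __init__(self, embed_dim=768, lstm_hidden_dim=128, num_labels=3, dropout=0.1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=lstm_hidden_dim,
                            num_layers=1, bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        # Forward and backward states are concatenated, hence 2 * hidden size.
        self.fc = nn.Linear(lstm_hidden_dim * 2, num_labels)

    def forward(self, token_embeddings, attention_mask):
        # token_embeddings: (batch, seq_len, embed_dim); attention_mask: (batch, seq_len)
        lengths = attention_mask.sum(dim=1).cpu()
        # Packing makes the LSTM skip padded positions.
        packed = nn.utils.rnn.pack_padded_sequence(
            token_embeddings, lengths, batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n: (2, batch, hidden); index 0 is the forward pass, index 1 the backward pass.
        summary = torch.cat([h_n[0], h_n[1]], dim=1)  # (batch, 2 * hidden)
        return self.fc(self.dropout(summary))


torch.manual_seed(0)
batch, seq_len, embed_dim = 4, 10, 768           # 768 is distilbert-base-uncased's hidden size
embeddings = torch.randn(batch, seq_len, embed_dim)  # stand-in for last_hidden_state
mask = torch.ones(batch, seq_len, dtype=torch.long)
mask[1, 6:] = 0                                   # second example has 4 padded positions

head = BiLSTMHead()
head.eval()                                       # disable dropout for a deterministic check
logits = head(embeddings, mask)
print(logits.shape)  # torch.Size([4, 3])
```

Taking the final hidden state of each direction gives one fixed-size vector per input, which is one of several reasonable pooling choices; mean pooling over the LSTM outputs would be another.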

### Implementation Details:

1. **Data Preprocessing and Tokenization**:
   - Raw text is cleaned before tokenization: it is lowercased, and URLs, user mentions, digits, a fixed list of emoticons, punctuation, and other non-letter characters are removed, so that the DistilBERT tokenizer processes only plain words.
   - `DistilBertTokenizer` tokenizes the cleaned texts, truncates them to at most 128 tokens, and pads each split to the length of its longest sequence so that inputs within a split have a consistent size.

2. **Dataset and DataLoader**:
   - A custom `TweetDataset` class is utilized to manage the tokenized data and corresponding labels. This setup facilitates efficient data loading during training and evaluation through the PyTorch `DataLoader`.

3. **Model Configuration**:
   - The model feeds DistilBERT's output into a single-layer bidirectional LSTM with a hidden size of 128 per direction. In the code below, only the first-token (`[CLS]`) embedding of each input is passed to the LSTM, and the concatenated forward and backward outputs (256 features) go through dropout and the linear layer.

4. **Training and Evaluation**:
   - The model is trained for three epochs by minimizing the cross-entropy loss between predicted and actual labels; gradients are computed by backpropagation and the weights are updated with the AdamW optimizer (learning rate 5e-6).
   - The data is split 70/15/15 into training, validation, and test sets. The validation set is evaluated after every epoch, and the held-out test set is evaluated once at the end to estimate how well the model generalizes to unseen data.

5. **GPU Utilization for Enhanced Performance**:
   - The script checks for GPU availability with `torch.cuda.is_available()` and moves the model and each batch to the GPU when one is present, which speeds up training and inference considerably; otherwise it falls back to the CPU.
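
The cleaning step in item 1 can be illustrated on a single made-up tweet. This is a simplified sketch rather than the notebook's `clean_text`: `clean_tweet_sketch` is a hypothetical name, the emoticon list is omitted, and the sample text and URL are invented for illustration.

```python
import re


def clean_tweet_sketch(text: str) -> str:
    """Simplified cleaning: lowercase, then strip URLs, mentions, digits, and non-letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove user mentions
    text = re.sub(r"\d+", "", text)           # remove digits
    text = re.sub(r"[^a-z]+", " ", text)      # collapse everything that is not a letter into one space
    return text.strip()


sample = "@user1 Check THIS out!!! https://example.com/x?y=1 #wow 100%"
print(clean_tweet_sketch(sample))  # check this out wow
```

Removing mentions before digits matters here: a handle such as `@user1` is dropped as a whole instead of leaving a stray `@user` behind.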

### Benefits of Using DistilBertLSTM:

- **Efficient Computation**: By using DistilBERT, the model achieves faster computation times and requires less memory, making it feasible for deployment in resource-constrained environments.
- **Enhanced Sequence Modeling**: The bidirectional LSTM lets the model use context on both sides of each position in a sequence, which can lead to more accurate predictions, especially in tasks requiring a deep understanding of textual context.
- **Adaptability**: This model can be easily adapted and fine-tuned for various NLP tasks, including but not limited to sentiment analysis, hate speech detection, and other forms of text classification.

This hybrid model is a reasonable choice for applications that need a balance between predictive performance and computational efficiency, combining contextual embeddings with recurrent sequence modeling for text classification.

In [12]:
import re
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm
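from transformers import DistilBertTokenizer, DistilBertModel
# AdamW comes from torch.optim: transformers.AdamW is deprecated and missing from recent releases.
# Note that torch's default weight_decay is 0.01, whereas transformers.AdamW defaulted to 0.0.
from torch.optim import AdamW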

# Define the hybrid model
class DistilBertLSTMForSequenceClassification(nn.Module):
    def __init__(self, distilbert_model, lstm_hidden_dim, num_labels):
        super(DistilBertLSTMForSequenceClassification, self).__init__()
        self.distilbert = distilbert_model
        self.lstm_hidden_dim = lstm_hidden_dim
        self.dropout = nn.Dropout(0.1)
        self.lstm = nn.LSTM(bidirectional=True, num_layers=1, input_size=distilbert_model.config.dim,
                            hidden_size=lstm_hidden_dim, batch_first=True)
        self.fc = nn.Linear(lstm_hidden_dim * 2, num_labels)  # Multiply by 2 for bidirectional LSTM

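    def forward(self, input_ids, attention_mask):
        # DistilBERT encoding: last_hidden_state has shape (batch, seq_len, dim)
        distilbert_outputs = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        # Keep only the first-token ([CLS]) embedding: shape (batch, dim)
        pooled_output = distilbert_outputs.last_hidden_state[:, 0, :]
        # LSTM processing. With batch_first=True, unsqueeze(0) yields shape (1, batch, dim),
        # so the LSTM reads the [CLS] vectors of the batch as one sequence of length `batch`.
        # It does not iterate over the tokens of each tweet, and each prediction
        # depends on the other examples in the same batch.
        lstm_output, _ = self.lstm(pooled_output.unsqueeze(0))
        lstm_output = lstm_output.squeeze(0)  # (batch, 2 * lstm_hidden_dim)
        # Classification layer
        output = self.fc(self.dropout(lstm_output))
        return output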

# Define text cleaning function
def clean_text(text):
    emoticons = [':-)', ':)', '(:', '(-:', ':))', '((:', ':-D', ':D', 'X-D', 'XD', 'xD', '<3', '3', ':*', ':-*', 'xP', 'XP', 'XP', 'Xp', ':-|', ':->', ':-<', '8-)', ':-P', ':-p', '=P', '=p', ':*)', '*-*', 'B-)', 'O.o', 'X-(', ')-X']
    text = text.lower()
    text = re.sub(r'https?://[^\s]+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'\d+', '', text)
    for emoticon in emoticons:
        text = text.replace(emoticon, '')
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)
    text = re.sub(r"([?.!,¿])", r" ", text)
    text = re.sub(r'[" "]+', " ", text)
    return text.strip()

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')

# Load dataset and split
df = pd.read_csv('/kaggle/input/dataset/labeled_data.csv')
df['tweet'] = df['tweet'].apply(clean_text)

train_texts, temp_texts, train_labels, temp_labels = train_test_split(df['tweet'], df['class'], test_size=0.3, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels, test_size=0.5, random_state=42)

# Tokenization with DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True, max_length=128)

# Dataset class
class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels.tolist())
test_dataset = TweetDataset(test_encodings, test_labels.tolist())
val_dataset = TweetDataset(val_encodings, val_labels.tolist())

# DataLoader initialization
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Model initialization
distilbert_model = DistilBertModel.from_pretrained('distilbert-base-uncased')
lstm_hidden_dim = 128
num_labels = 3
model = DistilBertLSTMForSequenceClassification(distilbert_model, lstm_hidden_dim, num_labels)
optimizer = AdamW(model.parameters(), lr=5e-6)
model.to(device)

# Training function
def train(epoch):
    model.train()
    total_loss, total_accuracy = 0, 0
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch}"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        _, predictions = torch.max(outputs, 1)
        total_accuracy += torch.sum(predictions == labels).item() / len(labels)
    
    avg_loss = total_loss / len(train_loader)
    avg_accuracy = total_accuracy / len(train_loader)
    print(f"Training Loss: {avg_loss:.3f}")
    print(f"Training Accuracy: {avg_accuracy:.3f}")

# Evaluation function
def evaluate(loader, desc="Evaluating"):
    model.eval()
    total_loss, total_accuracy = 0, 0
    all_predictions, all_labels = [], []
    
    with torch.no_grad():
        for batch in tqdm(loader, desc=desc):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            
            total_loss += loss.item()
            _, predictions = torch.max(outputs, 1)
            total_accuracy += torch.sum(predictions == labels).item() / len(labels)
            
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(loader)
    avg_accuracy = total_accuracy / len(loader)
    print(f"{desc} Loss: {avg_loss:.3f}")
    print(f"{desc} Accuracy: {avg_accuracy:.3f}")
    
    return all_labels, all_predictions

# Main training loop
for epoch in range(1, 4):
    train(epoch)
    evaluate(val_loader)

# Final evaluation on test set
labels, predictions = evaluate(test_loader, "Final Test Evaluation")
print(classification_report(labels, predictions, target_names=['Hate Speech', 'Offensive Language', 'Neither']))

# Accuracy
accuracy = accuracy_score(labels, predictions)
print(f"Test Accuracy: {accuracy:.3f}")


Using device: cuda


Training Epoch 1: 100%|██████████| 543/543 [00:47<00:00, 11.33it/s]


Training Loss: 0.461
Training Accuracy: 0.839


Evaluating: 100%|██████████| 117/117 [00:03<00:00, 36.06it/s]


Evaluating Loss: 0.308
Evaluating Accuracy: 0.902


Training Epoch 2: 100%|██████████| 543/543 [00:47<00:00, 11.37it/s]


Training Loss: 0.280
Training Accuracy: 0.909


Evaluating: 100%|██████████| 117/117 [00:03<00:00, 35.93it/s]


Evaluating Loss: 0.262
Evaluating Accuracy: 0.911


Training Epoch 3: 100%|██████████| 543/543 [00:47<00:00, 11.35it/s]


Training Loss: 0.241
Training Accuracy: 0.918


Evaluating: 100%|██████████| 117/117 [00:03<00:00, 35.98it/s]


Evaluating Loss: 0.253
Evaluating Accuracy: 0.907


Final Test Evaluation: 100%|██████████| 117/117 [00:03<00:00, 36.91it/s]

Final Test Evaluation Loss: 0.246
Final Test Evaluation Accuracy: 0.915
                    precision    recall  f1-score   support

       Hate Speech       0.51      0.36      0.42       207
Offensive Language       0.94      0.96      0.95      2880
           Neither       0.89      0.89      0.89       631

          accuracy                           0.91      3718
         macro avg       0.78      0.73      0.75      3718
      weighted avg       0.91      0.91      0.91      3718

Test Accuracy: 0.914
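
Two details of this output can be checked by hand. First, the `macro avg` and `weighted avg` F1 rows follow from the per-class rows: the macro average treats the three classes equally, so the weak Hate Speech score pulls it down to 0.75, while the weighted average is dominated by the 2880 Offensive Language samples. Second, the test set has 3718 samples, which at batch size 32 gives 116 full batches and one final batch of 6. `evaluate` averages per-batch accuracies, so that small final batch counts as much as a full one, which most likely explains why it prints 0.915 while `accuracy_score` reports 0.914. The sketch below recomputes the averages from the rounded report values; the batch counts in its second part are made-up toy numbers, not values from this run.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support)
report = {
    "Hate Speech": (0.51, 0.36, 0.42, 207),
    "Offensive Language": (0.94, 0.96, 0.95, 2880),
    "Neither": (0.89, 0.89, 0.89, 631),
}

f1_scores = [row[2] for row in report.values()]
supports = [row[3] for row in report.values()]

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_scores) / len(f1_scores)
# Weighted average: mean over classes weighted by the number of true samples.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)
print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.75 0.91


def mean_of_batch_accuracies(correct_per_batch, batch_sizes):
    """Average of per-batch accuracies; every batch counts equally regardless of size."""
    return sum(c / n for c, n in zip(correct_per_batch, batch_sizes)) / len(batch_sizes)


def overall_accuracy(correct_per_batch, batch_sizes):
    """Fraction of correct predictions over all samples."""
    return sum(correct_per_batch) / sum(batch_sizes)


# Toy numbers (hypothetical): two full batches of 32 and a short final batch of 6.
correct = [24, 28, 6]
sizes = [32, 32, 6]
print(round(mean_of_batch_accuracies(correct, sizes), 3),
      round(overall_accuracy(correct, sizes), 3))  # 0.875 0.829
```

Accumulating the number of correct predictions and dividing by the dataset size once at the end would make the logged accuracy agree with `accuracy_score`.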



