### Overview of the BERT + BiLSTM Classification Model

The BERT + BiLSTM classification model feeds BERT's contextual token embeddings into a bidirectional LSTM (Long Short-Term Memory) network. BERT supplies a context-aware representation of each token; the LSTM then reads those representations in order, in both directions, and summarizes the sequence for classification. The combination targets tasks such as sentiment analysis or contextual classification, where the label depends on the whole sequence rather than on isolated words.

### Key Components of the Model:

1. **BERT Model**:
   - Serves as the embedding layer: it converts input tokens into contextualized embeddings (768 dimensions per token for `bert-base-uncased`). The pre-trained weights give the downstream layers a strong starting point and are fine-tuned together with the rest of the model.

2. **Bidirectional LSTM**:
   - Follows the BERT embedding layer and reads the token embeddings in both forward and backward directions. Each position's representation therefore reflects both the preceding and the following tokens, which helps the model summarize the whole sequence.

3. **Dropout and Linear Layer**:
   - A dropout layer is applied after the LSTM to reduce overfitting by randomly zeroing a fraction of the features during training (10% in the code below). A linear layer then maps the LSTM output to one score (logit) per class.
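
The LSTM, dropout, and linear components can be sketched in isolation to show the tensor shapes involved. This is a minimal sketch under stated assumptions, not the notebook's exact model: a random tensor stands in for BERT's per-token output so that no pre-trained weights are needed, and the sequence summary is taken to be the concatenation of the final forward and final backward hidden states. The sizes (`batch_size`, `seq_len`) are arbitrary illustration values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, bert_hidden = 4, 16, 768  # 768 is the hidden size of bert-base models
lstm_hidden_dim, num_labels = 128, 3

# Stand-in for BERT's per-token output: random values, no real model involved
token_embeddings = torch.randn(batch_size, seq_len, bert_hidden)

lstm = nn.LSTM(input_size=bert_hidden, hidden_size=lstm_hidden_dim,
               num_layers=1, bidirectional=True, batch_first=True)
dropout = nn.Dropout(0.1)
fc = nn.Linear(lstm_hidden_dim * 2, num_labels)  # 2x because of the two directions

# per_token: (batch, seq_len, 2 * lstm_hidden_dim)
# h_n:       (2, batch, lstm_hidden_dim), one final state per direction
per_token, (h_n, _) = lstm(token_embeddings)

# Concatenate the final forward state and the final backward state
summary = torch.cat((h_n[0], h_n[1]), dim=1)

logits = fc(dropout(summary))
print(per_token.shape, summary.shape, logits.shape)
```

The linear layer takes `lstm_hidden_dim * 2` inputs because the forward and backward states are concatenated, and it produces one logit per class for each example in the batch.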

### Model Architecture Details:

- **Input**: The model takes tokenized text (token IDs plus an attention mask). BERT turns the tokens into contextual embeddings, which are then fed into the bidirectional LSTM.
- **Output**: A linear layer produces one logit per predefined category; the predicted class is the category with the highest logit.
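
To make the input format concrete, the following plain-Python sketch shows how token ID lists of different lengths become the padded `input_ids` and the matching `attention_mask`. The helper `pad_batch` is hypothetical and only imitates the padding step of a tokenizer; it does not perform tokenization or truncation. The IDs 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` tokens of the `bert-base-uncased` vocabulary, while the other IDs are arbitrary placeholders.

```python
PAD_ID = 0  # [PAD] in the bert-base-uncased vocabulary


def pad_batch(token_id_lists):
    """Pad every sequence to the longest one and build the attention mask.

    Hypothetical helper for illustration; a real tokenizer does this internally.
    """
    longest = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences: [CLS] ... [SEP], with placeholder IDs in between
batch = pad_batch([[101, 7592, 2088, 102], [101, 7592, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model tell real tokens from padding once all sequences in a batch share the same length.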

### Implementation Details:

1. **Data Preprocessing and Tokenization**:
   - Text is lowercased and stripped of URLs, user mentions, digits, emoticons, punctuation, and other special characters, leaving only words separated by single spaces.
   - The BERT tokenizer (`bert-base-uncased`) converts the cleaned text into token IDs and attention masks, truncating each tweet to at most 128 tokens and padding shorter ones.

2. **Dataset and DataLoader**:
   - A custom `TweetDataset` class manages the tokenized data and labels, ensuring efficient batch processing during training and evaluation via PyTorch’s `DataLoader`.

3. **Model Training and Evaluation**:
   - Training runs for three epochs, minimizing the cross-entropy loss between predicted and actual labels. The AdamW optimizer updates all weights, including BERT's, with a learning rate of 5e-6.
   - Evaluation assesses model performance on validation and test datasets to ensure generalizability beyond the training data.

4. **Computational Efficiency**:
   - The code selects a CUDA GPU when one is available and falls back to the CPU otherwise; the model and every batch are moved to that device, which speeds up both training and inference considerably on a GPU.

5. **Loss and Accuracy Computation**:
   - During training and validation, loss and accuracy metrics are computed to monitor the model's performance and guide training decisions.
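
One detail of the accuracy computation is worth illustrating. The training and evaluation functions in this notebook average the per-batch accuracies, which gives every batch equal weight. When the dataset size is not a multiple of the batch size, the last batch is smaller, so this average can differ from the accuracy over all examples. The sketch below uses made-up counts and hypothetical helper names to show the difference.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies, giving every batch equal weight."""
    return sum(correct / size for correct, size in batches) / len(batches)


def sample_weighted_accuracy(batches):
    """Divide all correct predictions by all examples seen."""
    total_correct = sum(correct for correct, _ in batches)
    total_seen = sum(size for _, size in batches)
    return total_correct / total_seen


# Hypothetical (correct predictions, batch size) pairs; the last batch is partial
batches = [(30, 32), (28, 32), (1, 4)]

per_batch = mean_of_batch_accuracies(batches)
per_sample = sample_weighted_accuracy(batches)
print(f"mean of batch accuracies: {per_batch:.4f}")   # 0.6875
print(f"sample-weighted accuracy: {per_sample:.4f}")  # 0.8676
```

The toy numbers exaggerate the effect; with hundreds of batches and a single partial one, the gap is usually small. The sample-weighted figure is the one that matches `accuracy_score` computed over all predictions.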

### Usage and Application:

This hybrid model is intended for text classification tasks in which both the context of individual words and the order of words in the sequence matter. Examples include sentiment analysis, topic classification, and other NLP tasks that depend on language context.

### Benefits of the BertLSTM Model:

- **Enhanced Contextual Understanding**: BERT supplies contextual embeddings and the LSTM models sequence dynamics on top of them, which can improve on models that use only one of the two.
- **Flexibility and Adaptability**: The model can be adapted to other text classification tasks by changing `num_labels` and fine-tuning on the new dataset.
- **Robust Performance**: The bidirectional LSTM adds a further stage of context processing, which may raise accuracy on tasks involving complex linguistic structures.

The model thus combines a transformer encoder with a recurrent layer for text classification.

In [10]:
import re
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm
from torch.optim import AdamW
from transformers import BertTokenizer, BertModel

# Define the hybrid model
class BertLSTMForSequenceClassification(nn.Module):
    def __init__(self, bert_model, lstm_hidden_dim, num_labels):
        super(BertLSTMForSequenceClassification, self).__init__()
        self.bert = bert_model
        self.lstm_hidden_dim = lstm_hidden_dim
        self.dropout = nn.Dropout(0.1)
        self.lstm = nn.LSTM(bidirectional=True, num_layers=1, input_size=bert_model.config.hidden_size,
                            hidden_size=lstm_hidden_dim, batch_first=True)
        self.fc = nn.Linear(lstm_hidden_dim * 2, num_labels)  # Multiply by 2 for bidirectional LSTM

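    def forward(self, input_ids, attention_mask):
        # BERT encoding: one contextual vector per token, shape (batch, seq_len, hidden_size)
        bert_outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        token_embeddings = bert_outputs.last_hidden_state
        # Pack the sequences so the LSTM skips padding positions
        lengths = attention_mask.sum(dim=1).cpu()
        packed = nn.utils.rnn.pack_padded_sequence(
            token_embeddings, lengths, batch_first=True, enforce_sorted=False)
        # LSTM processing over the token dimension; h_n has shape (2, batch, lstm_hidden_dim)
        _, (h_n, _) = self.lstm(packed)
        # Concatenate the final forward and final backward hidden states
        sequence_summary = torch.cat((h_n[0], h_n[1]), dim=1)
        # Classification layer
        output = self.fc(self.dropout(sequence_summary))
        return output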

# Emoticons removed when they appear as standalone tokens
EMOTICONS = {':-)', ':)', '(:', '(-:', ':))', '((:', ':-D', ':D', 'X-D', 'XD', 'xD', '<3', '3', ':*', ':-*', 'xP', 'XP', 'Xp', ':-|', ':->', ':-<', '8-)', ':-P', ':-p', '=P', '=p', ':*)', '*-*', 'B-)', 'O.o', 'X-(', ')-X'}

# Define text cleaning function
def clean_text(text):
    # Remove URLs and user mentions
    text = re.sub(r'https?://\S+', '', text, flags=re.IGNORECASE)
    text = re.sub(r'@\w+', '', text)
    # Drop emoticon tokens before lowercasing and digit removal,
    # which would otherwise prevent matches such as ':D' or '<3'
    text = ' '.join(token for token in text.split() if token not in EMOTICONS)
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    # Replace every run of non-letter characters (punctuation, symbols, whitespace) with one space
    text = re.sub(r'[^a-z]+', ' ', text)
    return text.strip()

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')

# Load dataset and split
df = pd.read_csv('/kaggle/input/dataset/labeled_data.csv')
df['tweet'] = df['tweet'].apply(clean_text)

train_texts, temp_texts, train_labels, temp_labels = train_test_split(df['tweet'], df['class'], test_size=0.3, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels, test_size=0.5, random_state=42)

# Tokenization with BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True, max_length=128)

# Dataset class
class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels.tolist())
test_dataset = TweetDataset(test_encodings, test_labels.tolist())
val_dataset = TweetDataset(val_encodings, val_labels.tolist())

# DataLoader initialization
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Model initialization
bert_model = BertModel.from_pretrained('bert-base-uncased')
lstm_hidden_dim = 128
num_labels = 3
model = BertLSTMForSequenceClassification(bert_model, lstm_hidden_dim, num_labels)
optimizer = AdamW(model.parameters(), lr=5e-6)
model.to(device)

# Training function
def train(epoch):
    model.train()
    total_loss, total_accuracy = 0, 0
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch}"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        _, predictions = torch.max(outputs, 1)
        total_accuracy += torch.sum(predictions == labels).item() / len(labels)
    
    avg_loss = total_loss / len(train_loader)
    avg_accuracy = total_accuracy / len(train_loader)
    print(f"Training Loss: {avg_loss:.3f}")
    print(f"Training Accuracy: {avg_accuracy:.3f}")

# Evaluation function
def evaluate(loader, desc="Evaluating"):
    model.eval()
    total_loss, total_accuracy = 0, 0
    all_predictions, all_labels = [], []
    
    with torch.no_grad():
        for batch in tqdm(loader, desc=desc):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            
            total_loss += loss.item()
            _, predictions = torch.max(outputs, 1)
            total_accuracy += torch.sum(predictions == labels).item() / len(labels)
            
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(loader)
    avg_accuracy = total_accuracy / len(loader)
    print(f"{desc} Loss: {avg_loss:.3f}")
    print(f"{desc} Accuracy: {avg_accuracy:.3f}")
    
    return all_labels, all_predictions

# Main training loop
for epoch in range(1, 4):
    train(epoch)
    evaluate(val_loader)

# Final evaluation on test set
labels, predictions = evaluate(test_loader, "Final Test Evaluation")
print(classification_report(labels, predictions, target_names=['Hate Speech', 'Offensive Language', 'Neither']))

# Accuracy
accuracy = accuracy_score(labels, predictions)
print(f"Test Accuracy: {accuracy:.3f}")


Using device: cuda


Training Epoch 1: 100%|██████████| 543/543 [01:32<00:00,  5.88it/s]


Training Loss: 0.511
Training Accuracy: 0.832


Evaluating: 100%|██████████| 117/117 [00:06<00:00, 18.67it/s]


Evaluating Loss: 0.353
Evaluating Accuracy: 0.904


Training Epoch 2: 100%|██████████| 543/543 [01:32<00:00,  5.88it/s]


Training Loss: 0.312
Training Accuracy: 0.909


Evaluating: 100%|██████████| 117/117 [00:06<00:00, 18.72it/s]


Evaluating Loss: 0.287
Evaluating Accuracy: 0.909


Training Epoch 3: 100%|██████████| 543/543 [01:32<00:00,  5.88it/s]


Training Loss: 0.258
Training Accuracy: 0.917


Evaluating: 100%|██████████| 117/117 [00:06<00:00, 18.74it/s]


Evaluating Loss: 0.266
Evaluating Accuracy: 0.909


Final Test Evaluation: 100%|██████████| 117/117 [00:06<00:00, 19.26it/s]

Final Test Evaluation Loss: 0.252
Final Test Evaluation Accuracy: 0.913
                    precision    recall  f1-score   support

       Hate Speech       0.52      0.20      0.29       207
Offensive Language       0.93      0.96      0.95      2880
           Neither       0.87      0.92      0.90       631

          accuracy                           0.91      3718
         macro avg       0.77      0.70      0.71      3718
      weighted avg       0.90      0.91      0.90      3718

Test Accuracy: 0.913
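
The averaged rows of the classification report can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1 averages and the Hate Speech F1 score from the rounded values printed above; because the inputs are already rounded to two decimals, recomputed figures can differ from the report in the last digit for some other cells.

```python
# Per-class f1-score and support, copied from the report above (already rounded)
f1 = {"Hate Speech": 0.29, "Offensive Language": 0.95, "Neither": 0.90}
support = {"Hate Speech": 207, "Offensive Language": 2880, "Neither": 631}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

# F1 is the harmonic mean of precision and recall (Hate Speech row)
precision, recall = 0.52, 0.20
hate_f1 = 2 * precision * recall / (precision + recall)

print(f"total support: {total}")            # 3718
print(f"macro F1: {macro_f1:.2f}")          # 0.71
print(f"weighted F1: {weighted_f1:.2f}")    # 0.90
print(f"hate speech F1: {hate_f1:.2f}")     # 0.29
```

The macro average is pulled down by the Hate Speech class (recall 0.20), while the weighted average and the overall accuracy are dominated by Offensive Language, which accounts for 2880 of the 3718 test examples.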



