**Title:** Bidirectional LSTM for Text Classification

**Introduction:**
This notebook demonstrates the implementation of a Bidirectional Long Short-Term Memory (LSTM) model for text classification using PyTorch. The dataset used comprises labeled tweets, and the objective is to classify the tweets into three categories: Hate Speech, Offensive Language, and Neither. The Bidirectional LSTM architecture is employed to capture both past and future context in sequential data, followed by classification using a fully connected layer.

**Content:**

1. **Environment Setup:** The notebook begins with setting up the Python environment and importing necessary libraries.

2. **Model Initialization with Bidirectional LSTM:** The Bidirectional LSTM model for text classification is defined in this section. It consists of an embedding layer, a single Bidirectional LSTM layer, and a fully connected layer.

3. **Text Cleaning:** A function for cleaning text data is defined to preprocess the tweets. It lowercases the text and removes URLs, mentions, digits, emoticons, punctuation, and other special characters.

4. **Dataset Loading and Splitting:** The labeled tweet dataset is loaded and split into training (70%), validation (15%), and test (15%) sets.

5. **Tokenization with BERT Tokenizer:** The tweets are tokenized using the BERT tokenizer, which converts text inputs into token IDs.

6. **Dataset Class:** A custom dataset class is defined to process the tokenized inputs and corresponding labels.

7. **DataLoader Initialization:** DataLoaders are initialized for the training, validation, and test datasets to efficiently load data in batches during model training and evaluation.

8. **Model Training:** The training function is defined to train the Bidirectional LSTM model on the training dataset using backpropagation.

9. **Evaluation Function:** An evaluation function is defined to assess the model's performance on the validation and test datasets.

10. **Main Training Loop:** The main training loop runs for a specified number of epochs, during which the model is trained on the training dataset and evaluated on the validation dataset.

11. **Final Evaluation on Test Set:** The trained model's performance is evaluated on the test dataset, and classification metrics such as precision, recall, and F1-score are computed.

12. **Conclusion:** The notebook concludes by printing the test accuracy of the Bidirectional LSTM model for text classification.

**Conclusion:**
This notebook implements a Bidirectional LSTM for text classification in PyTorch, covering each step from data preprocessing to model evaluation. After three epochs the model reaches a test accuracy of about 0.90, with F1-scores of 0.94 for Offensive Language and 0.86 for Neither. The Hate Speech class is much harder: its F1-score is 0.29 on 207 test samples, so the overall accuracy is driven largely by the majority class. These results suggest that a Bidirectional LSTM is a reasonable baseline for this task, with the imbalance between classes as the likely main weakness.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/dataset/labeled_data.csv


In [2]:
import re
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm
from transformers import BertTokenizer
import torch.nn as nn

In [3]:
# Part 1: Model initialization with Bidirectional LSTM
class BiLSTMForSequenceClassification(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(BiLSTMForSequenceClassification, self).__init__()
        # Keep hidden_dim on the module so forward() does not rely on a global variable
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(input_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)  # Multiply hidden_dim by 2 for bidirectional LSTM

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        # Concatenate the forward state at the last time step with the backward state at the first time step
        combined_out = torch.cat((lstm_out[:, -1, :self.hidden_dim], lstm_out[:, 0, self.hidden_dim:]), dim=1)
        output = self.fc(combined_out)
        return output
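
The slicing in `forward` relies on how a bidirectional `nn.LSTM` lays out its output: the first `hidden_dim` features of each time step come from the forward direction and the rest from the backward direction. The following is a minimal sketch, with toy sizes chosen only for illustration, that checks those two slices against the final hidden states `h_n` returned by the LSTM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 4  # toy size, not the notebook's 128
lstm = nn.LSTM(input_size=3, hidden_size=hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(2, 5, 3)  # (batch, seq_len, features)
lstm_out, (h_n, c_n) = lstm(x)

# lstm_out: (batch, seq_len, 2 * hidden_dim); h_n: (num_directions, batch, hidden_dim)
forward_last = lstm_out[:, -1, :hidden_dim]   # forward direction after reading the whole sequence
backward_last = lstm_out[:, 0, hidden_dim:]   # backward direction after reading it in reverse

same_forward = torch.allclose(forward_last, h_n[0])
same_backward = torch.allclose(backward_last, h_n[1])
print(lstm_out.shape, h_n.shape, same_forward, same_backward)
```

One caveat worth keeping in mind: the tweets in this notebook are padded to a common length, so for shorter tweets the last time step is a padding position and the forward state has also processed the padding tokens.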

In [4]:
# Part 2: Define text cleaning function
emoticons = [':-)', ':)', '(:', '(-:', ':))', '((:', ':-D', ':D', 'X-D', 'XD', 'xD', '<3', ':*', ':-*', 'xP', 'XP', 'Xp', ':-|', ':->', ':-<', '8-)', ':-P', ':-p', '=P', '=p', ':*)', '*-*', 'B-)', 'O.o', 'X-(', ')-X']

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://[^\s]+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)             # remove @mentions
    text = re.sub(r'\d+', '', text)              # remove digits
    # The text is already lowercased and digit-free here, so entries that contain
    # uppercase letters or digits (e.g. 'XD', '<3', '8-)') never match as written.
    for emoticon in emoticons:
        text = text.replace(emoticon, '')
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)  # replace everything except letters and basic punctuation
    text = re.sub(r"([?.!,¿])", r" ", text)       # replace the remaining punctuation with a space
    text = re.sub(r'[" "]+', " ", text)           # collapse repeated spaces
    return text.strip()

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')


Using device: cuda
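
To make the effect of the cleaning steps concrete, here is a small self-contained sketch that applies the same sequence of substitutions to two made-up tweets. The sample strings are invented for illustration, and the emoticon list is shortened to a few entries on the assumption that the full list behaves the same way on these inputs.

```python
import re

# Shortened, illustrative subset of the notebook's emoticon list (assumption).
EMOTICONS = [':-)', ':)', ':D', 'XD', '<3']

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://[^\s]+', '', text)   # URLs
    text = re.sub(r'@\w+', '', text)              # mentions
    text = re.sub(r'\d+', '', text)               # digits
    for emoticon in EMOTICONS:
        text = text.replace(emoticon, '')
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)  # anything but letters and basic punctuation
    text = re.sub(r"([?.!,¿])", r" ", text)       # remaining punctuation
    text = re.sub(r'[" "]+', " ", text)           # repeated spaces
    return text.strip()

# Hypothetical sample tweets, not taken from the dataset.
cleaned = clean_text("@user Check THIS out!! https://t.co/abc123 :) 100%")
print(cleaned)  # check this out

# Letter-based emoticons survive as words because lowercasing happens first.
letters_survive = clean_text("LOL XD")
print(letters_survive)  # lol xd
```

The second call shows a side effect of the step order: emoticons spelled with uppercase letters are no longer recognized once the text has been lowercased, so they remain in the text as ordinary tokens.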


In [5]:
# Part 3: Load dataset and split into 70% train, 15% validation, 15% test
df = pd.read_csv('/kaggle/input/dataset/labeled_data.csv')
df['tweet'] = df['tweet'].apply(clean_text)

train_texts, temp_texts, train_labels, temp_labels = train_test_split(df['tweet'], df['class'], test_size=0.3, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels, test_size=0.5, random_state=42)
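
The two calls above first hold out 30% of the rows and then halve that held-out part, which gives roughly a 70/15/15 split. The sketch below works through the arithmetic for a hypothetical dataset of 1,000 rows. It assumes that `train_test_split` rounds a fractional test share up and assigns the remainder to the training side; the helper name `split_sizes` is made up for this example.

```python
import math

def split_sizes(n_rows, first_test_size=0.3, second_test_size=0.5):
    """Return (n_train, n_val, n_test) for the two-stage split used above."""
    # Stage 1: hold out a temporary set (assumed rounding: ceil for the test share).
    n_temp = math.ceil(first_test_size * n_rows)
    n_train = n_rows - n_temp
    # Stage 2: the temporary set is divided into validation ("train" side) and test.
    n_test = math.ceil(second_test_size * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)  # 1000 rows is a hypothetical size
print(n_train, n_val, n_test)  # 700 150 150
```

Note that neither call passes `stratify`, so the class proportions in each subset depend on the random shuffle rather than being enforced.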


In [6]:
# Part 4: Tokenization with BERT tokenizer (pads to the longest sequence, truncates at 128 tokens)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True, max_length=128)
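
The `truncation`, `padding`, and `max_length` arguments control the shape of the resulting `input_ids`. As a rough illustration only, the toy function below mimics the idea on lists of token IDs: cut each sequence at `max_length`, then pad the shorter ones with a pad ID up to the longest remaining sequence. The function name and the token IDs are hypothetical, and the sketch ignores details of the real tokenizer such as keeping the closing special token when truncating.

```python
def pad_and_truncate(batch_ids, max_length=128, pad_id=0):
    """Toy version of truncation plus pad-to-longest on lists of token IDs."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Hypothetical token IDs; max_length=4 is chosen to make the truncation visible.
toy_batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
input_ids, attention_mask = pad_and_truncate(toy_batch, max_length=4)
print(input_ids)       # [[101, 7592, 102, 0], [101, 7592, 2088, 999]]
print(attention_mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The notebook's model only consumes `input_ids`; the attention mask produced by the tokenizer is carried along in each batch but is not used by the LSTM.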



In [7]:
# Part 5: Dataset class wrapping the tokenizer encodings and labels
class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels.tolist())
test_dataset = TweetDataset(test_encodings, test_labels.tolist())
val_dataset = TweetDataset(val_encodings, val_labels.tolist())


In [8]:
# Part 6: DataLoader initialization
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [9]:
# Part 7: Model instantiation and optimizer setup
input_dim = len(tokenizer.get_vocab())
hidden_dim = 128
output_dim = 3
model = BiLSTMForSequenceClassification(input_dim, hidden_dim, output_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
model.to(device)


BiLSTMForSequenceClassification(
  (embedding): Embedding(30522, 128)
  (lstm): LSTM(128, 128, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=256, out_features=3, bias=True)
)
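
The printed module summary is enough to estimate the model size by hand. The sketch below is a back-of-the-envelope count based on the shapes shown above, assuming the standard PyTorch LSTM parameterization (four gates, each with an input weight, a recurrent weight, and two bias vectors per direction).

```python
vocab_size, hidden_dim, num_classes = 30522, 128, 3

# Embedding(30522, 128): one 128-dimensional vector per vocabulary entry
embedding_params = vocab_size * hidden_dim

# LSTM(128, 128): 4 gates x (input weights + recurrent weights + two biases), per direction
lstm_params_per_direction = 4 * (hidden_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
lstm_params = 2 * lstm_params_per_direction  # forward and backward directions

# Linear(256, 3): weights plus bias
fc_params = (2 * hidden_dim) * num_classes + num_classes

total_params = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total_params)
# 3906816 264192 771 4171779
```

Under these assumptions the embedding table accounts for the large majority of the roughly 4.2 million parameters, while the LSTM and classifier together contribute well under 10%.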

In [12]:
# Part 8: Training function (one pass over train_loader)
def train(epoch):
    model.train()
    total_loss, total_accuracy = 0, 0
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch}"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        _, predictions = torch.max(outputs, 1)
        total_accuracy += torch.sum(predictions == labels).item() / len(labels)
    
    avg_loss = total_loss / len(train_loader)
    avg_accuracy = total_accuracy / len(train_loader)
    print(f"Training Loss: {avg_loss:.3f}")
    print(f"Training Accuracy: {avg_accuracy:.3f}")


In [13]:
# Part 9: Evaluation function (returns true labels and predictions for further metrics)
def evaluate(loader, desc="Evaluating"):
    model.eval()
    total_loss, total_accuracy = 0, 0
    all_predictions, all_labels = [], []
    
    with torch.no_grad():
        for batch in tqdm(loader, desc=desc):
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            
            total_loss += loss.item()
            _, predictions = torch.max(outputs, 1)
            total_accuracy += torch.sum(predictions == labels).item() / len(labels)
            
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

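    # Note: these are means over batches, so a smaller final batch counts as much as a full one;
    # the accuracy can therefore differ slightly from accuracy_score over all samples.
    avg_loss = total_loss / len(loader)
    avg_accuracy = total_accuracy / len(loader)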
    print(f"{desc} Loss: {avg_loss:.3f}")
    print(f"{desc} Accuracy: {avg_accuracy:.3f}")
    
    return all_labels, all_predictions
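
Because `evaluate` averages the accuracy of each batch rather than counting correct predictions over all samples, its printed accuracy can differ slightly from `accuracy_score`, which is one plausible explanation for the 0.900 versus 0.899 seen in the test output. The sketch below illustrates the effect with made-up batch results; the function names are hypothetical.

```python
import math

def mean_of_batch_accuracies(batches):
    """Average of per-batch accuracies, as computed in evaluate()."""
    return sum(correct / size for correct, size in batches) / len(batches)

def sample_weighted_accuracy(batches):
    """Correct predictions divided by total samples, as accuracy_score computes it."""
    return sum(correct for correct, _ in batches) / sum(size for _, size in batches)

# Made-up (correct, batch_size) pairs; the last batch is smaller than the others.
toy_batches = [(4, 4), (2, 4), (2, 2)]
batch_mean = mean_of_batch_accuracies(toy_batches)
overall = sample_weighted_accuracy(toy_batches)
print(round(batch_mean, 3), overall)  # 0.833 0.8

# With 3718 test samples and batch_size=32, the final batch is much smaller than the rest.
n_batches = math.ceil(3718 / 32)
last_batch_size = 3718 - (n_batches - 1) * 32
print(n_batches, last_batch_size)  # 117 6
```

Accumulating the number of correct predictions and dividing by the dataset size once at the end would make the two figures agree.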


In [15]:
# Part 10: Main training loop (3 epochs, validating after each)
for epoch in range(1, 4):
    train(epoch)
    evaluate(val_loader)

# Final evaluation on test set
labels, predictions = evaluate(test_loader, "Final Test Evaluation")
print(classification_report(labels, predictions, target_names=['Hate Speech', 'Offensive Language', 'Neither']))

# Accuracy
accuracy = accuracy_score(labels, predictions)
print(f"Test Accuracy: {accuracy:.3f}")

Training Epoch 1: 100%|██████████| 543/543 [00:03<00:00, 147.09it/s]
Training Loss: 0.430
Training Accuracy: 0.845
Evaluating: 100%|██████████| 117/117 [00:00<00:00, 247.19it/s]
Evaluating Loss: 0.337
Evaluating Accuracy: 0.884

Training Epoch 2: 100%|██████████| 543/543 [00:03<00:00, 157.36it/s]
Training Loss: 0.304
Training Accuracy: 0.893
Evaluating: 100%|██████████| 117/117 [00:00<00:00, 246.68it/s]
Evaluating Loss: 0.302
Evaluating Accuracy: 0.896

Training Epoch 3: 100%|██████████| 543/543 [00:03<00:00, 156.01it/s]
Training Loss: 0.254
Training Accuracy: 0.912
Evaluating: 100%|██████████| 117/117 [00:00<00:00, 228.77it/s]
Evaluating Loss: 0.299
Evaluating Accuracy: 0.898

Final Test Evaluation: 100%|██████████| 117/117 [00:00<00:00, 250.22it/s]
Final Test Evaluation Loss: 0.285
Final Test Evaluation Accuracy: 0.900
                    precision    recall  f1-score   support

       Hate Speech       0.48      0.21      0.29       207
Offensive Language       0.93      0.94      0.94      2880
           Neither       0.81      0.92      0.86       631

          accuracy                           0.90      3718
         macro avg       0.74      0.69      0.70      3718
      weighted avg       0.89      0.90      0.89      3718

Test Accuracy: 0.899
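
The summary rows of the classification report can be reproduced from the per-class rows, which helps when interpreting them. The sketch below recomputes the macro averages and the weighted F1-score from the rounded values printed above. Because it starts from two-decimal figures, small deviations from the report are possible for other cells.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class values copied from the report above (already rounded to two decimals).
report = {
    "Hate Speech":        {"precision": 0.48, "recall": 0.21, "f1": 0.29, "support": 207},
    "Offensive Language": {"precision": 0.93, "recall": 0.94, "f1": 0.94, "support": 2880},
    "Neither":            {"precision": 0.81, "recall": 0.92, "f1": 0.86, "support": 631},
}

# Macro average: every class counts equally, regardless of its size.
macro_precision = sum(r["precision"] for r in report.values()) / len(report)
macro_recall = sum(r["recall"] for r in report.values()) / len(report)
macro_f1 = sum(r["f1"] for r in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
total_support = sum(r["support"] for r in report.values())
weighted_f1 = sum(r["f1"] * r["support"] for r in report.values()) / total_support

hate_f1 = f1_from(0.48, 0.21)

print(round(macro_precision, 2), round(macro_recall, 2), round(macro_f1, 2))  # 0.74 0.69 0.7
print(round(weighted_f1, 2), round(hate_f1, 2), total_support)                # 0.89 0.29 3718
```

The gap between the macro F1 (0.70) and the weighted F1 (0.89) reflects the weak Hate Speech scores: that class makes up only 207 of the 3718 test samples, so it barely moves the weighted figure or the overall accuracy.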



