### Overview of the Optimized BERT and CNN Integration

The BERT and CNN integration was refined to simplify the architecture, manage memory more carefully, and stabilize training. The changes aim to reduce computational overhead while using BERT for contextual token embeddings and the CNN for local feature extraction in text classification.

### Detailed Changes Made:

1. **Simplified Model Architecture**:
    - **Last Hidden Layer Output**: The model now uses only BERT's last hidden layer output (`last_hidden_state`) instead of combining outputs from multiple layers. This shrinks the input to the CNN, which lowers computational cost and may reduce overfitting.
    - **Simplified CNN Structure**: The CNN was reduced to a single `Conv1d` layer (768 input channels, 256 output channels, kernel size 3) followed by an adaptive max pooling layer and a linear classifier. The pooling keeps the strongest activation of each filter across the sequence, so the classifier receives a fixed-size 256-dimensional vector regardless of sequence length.

2. **Optimized Memory Management**:
    - **Garbage Collection and CUDA Cache Clearing**: `gc.collect()` and `torch.cuda.empty_cache()` are called after every training batch. The first frees unreferenced Python objects and the second releases cached GPU memory that PyTorch is no longer using. This can lower the risk of out-of-memory errors, which are common when training large models, at the cost of some per-batch overhead.

3. **Streamlined Data Preprocessing**:
    - **Standardized Text Cleaning**: The text cleaning function lowercases each tweet and removes URLs, user mentions, digits, emoticons, and punctuation consistently across the dataset. This reduces noise in the input so that training focuses on the relevant textual content.

4. **Updated Training and Evaluation Loops**:
    - **Dynamic Progress Updates**: `tqdm` progress bars track every batch during training and evaluation, and each evaluation pass prints its average loss and accuracy, so the model’s performance is visible after every epoch.
    - **Gradient Clipping**: Gradients are clipped to a maximum norm of 1.0 before each optimizer step to prevent exploding gradients, which helps keep training stable in deep networks.

5. **Detailed Batch and DataLoader Handling**:
    - **Batch Data Handling in GPU Memory**: Each batch's `input_ids`, `attention_mask`, and `labels` tensors are moved explicitly to the selected device at the start of every step, and cached GPU memory is cleared after each training step.
    - **RandomSampler and SequentialSampler**: `RandomSampler` is used for the training set and `SequentialSampler` for the validation and test sets. The model therefore sees shuffled data during training, while evaluation order stays fixed and reproducible.

6. **Model Saving and Loading**:
    - **Conditional Model Saving**: The model's weights are saved only when validation loss improves. This saves storage space and keeps the best state available for resuming training or for final evaluation. The code cell in this section does not include this step; it evaluates the weights from the final epoch.
    - **Loading Pre-trained Weights**: The BERT encoder is initialized from the pre-trained `bert-base-uncased` checkpoint, and previously saved model weights can be loaded if available, which can accelerate convergence by starting from learned parameters.

7. **Class Weight Handling**:
    - **Imbalanced Data Handling**: If the dataset is imbalanced, class weights are calculated and applied in the loss function so that errors on underrepresented classes count more, with the aim of improving recall on those classes. The code cell in this section uses an unweighted `nn.CrossEntropyLoss()`.
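
The class weighting described in item 7 can be sketched in plain Python. This is a minimal illustration, assuming the common "balanced" heuristic `n_samples / (num_classes * class_count)`; the helper name `balanced_class_weights` is hypothetical and not part of the notebook. The example counts are the class supports from the test-set classification report at the end of this section; in practice the weights would be computed from the training labels.

```python
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Return one weight per class: n_samples / (num_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    weights = []
    for cls in range(num_classes):
        if counts[cls] == 0:
            raise ValueError(f"class {cls} has no samples")
        weights.append(n_samples / (num_classes * counts[cls]))
    return weights


# Supports from the test-set report: 0 = Hate Speech, 1 = Offensive Language,
# 2 = Neither. Used here only as an illustration of the imbalance.
example_labels = [0] * 207 + [1] * 2880 + [2] * 631
class_weights = balanced_class_weights(example_labels, num_classes=3)

# The rarest class (Hate Speech) receives the largest weight.
print([round(w, 3) for w in class_weights])
```

The resulting list could then be converted to a float tensor on the training device and passed as the `weight` argument of `nn.CrossEntropyLoss`.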

### Impact on Accuracy:

These changes are intended to improve the model’s accuracy in several ways:

- **Reduced Overfitting**: A simpler BERT-to-CNN input and a smaller CNN leave fewer parameters to fit noise, which should help the model generalize to unseen data.
- **Efficient Memory Management**: Fewer out-of-memory interruptions let training run to completion consistently.
- **Focused Feature Learning**: With cleaner data and max-pooled convolutional features, the model concentrates on the most informative parts of each tweet.
- **Stable Training Dynamics**: Gradient clipping bounds the size of each update, which keeps training stable.
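
Norm-based gradient clipping rescales all gradients by a common factor whenever their combined L2 norm exceeds a threshold. The plain-Python sketch below illustrates only the arithmetic; `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption added to avoid division by zero. In the training code this job is done by `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Gradients already within the limit are left untouched (scale == 1.0).
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# A gradient with norm 5.0 is scaled down to a norm of roughly 1.0.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# A gradient with norm 0.5 is returned unchanged.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)
```

Because every component is multiplied by the same factor, clipping changes the step size but not the direction of the update.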

Together, these optimizations aim to make training more efficient and stable. In the run recorded at the end of this section, the model reaches a test accuracy of about 0.91, although recall for the Hate Speech class remains low at 0.39.
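
The conditional saving described in item 6 comes down to tracking the lowest validation loss seen so far. Below is a minimal sketch of that bookkeeping, assuming a hypothetical `BestLossTracker` helper that is not part of the notebook. The three validation losses are the ones reported in the run log at the end of this section.

```python
class BestLossTracker:
    """Remember the lowest validation loss and the epoch that produced it."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None

    def update(self, epoch, val_loss):
        """Return True when val_loss improves on the best value so far."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            return True
        return False


val_losses = {1: 0.280, 2: 0.263, 3: 0.317}
tracker = BestLossTracker()
decisions = {}
for epoch, loss in val_losses.items():
    # In a training loop, a True result marks the point where the weights
    # would be written, e.g. torch.save(model.state_dict(), "best_model.pt")
    # (the file name is a placeholder).
    decisions[epoch] = tracker.update(epoch, loss)

print(decisions)           # {1: True, 2: True, 3: False}
print(tracker.best_epoch)  # 2
```

With these losses, epoch 3 would not overwrite the epoch 2 checkpoint, so a final evaluation that reloads the saved weights would use the epoch 2 model rather than the last one.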

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from transformers import BertTokenizer, BertModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm
import gc
import re  # required by clean_text below

# Data Preprocessing and Tokenization
def clean_text(text):
    emoticons = [':-)', ':)', '(:', '(-:', ':))', '((:', ':-D', ':D', 'X-D', 'XD', 'xD', '<3', '3', ':*', ':-*', 'xP', 'XP', 'XP', 'Xp', ':-|', ':->', ':-<', '8-)', ':-P', ':-p', '=P', '=p', ':*)', '*-*', 'B-)', 'O.o', 'X-(', ')-X']
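    text = text.lower()
    text = re.sub(r'https?://[^\s]+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)  # remove user mentions
    text = re.sub(r'\d+', '', text)  # remove digits
    # Remove emoticons. This runs after lower(), so entries containing
    # uppercase letters (e.g. 'XD') never match.
    for emoticon in emoticons:
        text = text.replace(emoticon, '')
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)  # keep only letters and basic punctuation
    text = re.sub(r"([?.!,¿])", r" ", text)  # replace the remaining punctuation with a space
    text = re.sub(r'[" "]+', " ", text)  # collapse repeated spaces
    return text.strip()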

df = pd.read_csv('/kaggle/input/dataset/labeled_data.csv')
df['tweet'] = df['tweet'].apply(clean_text)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(df['tweet'], df['class'], test_size=0.3, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels, test_size=0.5, random_state=42)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True, max_length=128)

class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels.tolist())
val_dataset = TweetDataset(val_encodings, val_labels.tolist())
test_dataset = TweetDataset(test_encodings, test_labels.tolist())
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=RandomSampler(train_dataset))
val_loader = DataLoader(val_dataset, batch_size=batch_size, sampler=SequentialSampler(val_dataset))
test_loader = DataLoader(test_dataset, batch_size=batch_size, sampler=SequentialSampler(test_dataset))

# BERT + CNN Integrated Model
class BertCNN(nn.Module):
    def __init__(self, bert_model, num_classes):
        super().__init__()
        self.bert = bert_model
        # 768 matches the hidden size of bert-base-uncased
        self.conv1 = nn.Conv1d(in_channels=768, out_channels=256, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        # (batch, seq_len, 768) -> (batch, 768, seq_len), the layout Conv1d expects
        x = outputs.last_hidden_state.permute(0, 2, 1)
        x = F.relu(self.conv1(x))  # (batch, 256, seq_len)
        x = self.pool(x).squeeze(2)  # max over the sequence -> (batch, 256)
        x = self.fc(x)  # class logits: (batch, num_classes)
        return x

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model = BertModel.from_pretrained('bert-base-uncased')
model = BertCNN(bert_model, num_classes=3)
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Gradient clipping function
def clip_gradients(model, max_norm=1.0):
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Training Function
def train(epoch):
    model.train()
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch}"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        clip_gradients(model)
        optimizer.step()
        gc.collect()
        torch.cuda.empty_cache()

# Evaluation Function
def evaluate(loader, desc="Evaluating"):
    model.eval()
    total_loss, total_accuracy = 0, 0
    all_predictions, all_labels = [], []
    with torch.no_grad():
        for batch in tqdm(loader, desc=desc):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            total_loss += loss.item()
            predictions = torch.argmax(outputs, dim=1)
            total_accuracy += (predictions == labels).sum().item() / labels.size(0)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    avg_loss = total_loss / len(loader)
    avg_accuracy = total_accuracy / len(loader)
    print(f"{desc} Loss: {avg_loss:.3f}")
    print(f"{desc} Accuracy: {avg_accuracy:.3f}")
    return all_labels, all_predictions

# Main Training Loop
for epoch in range(1, 4):
    train(epoch)
    evaluate(val_loader)

# Final Evaluation on Test Set
labels, predictions = evaluate(test_loader, "Final Test Evaluation")
print(classification_report(labels, predictions, target_names=['Hate Speech', 'Offensive Language', 'Neither']))
accuracy = accuracy_score(labels, predictions)
print(f"Test Accuracy: {accuracy:.3f}")


Training Epoch 1: 100%|██████████| 543/543 [03:41<00:00,  2.45it/s]
Evaluating: 100%|██████████| 117/117 [00:06<00:00, 19.08it/s]


Evaluating Loss: 0.280
Evaluating Accuracy: 0.903


Training Epoch 2: 100%|██████████| 543/543 [03:37<00:00,  2.50it/s]
Evaluating: 100%|██████████| 117/117 [00:06<00:00, 18.97it/s]


Evaluating Loss: 0.263
Evaluating Accuracy: 0.911


Training Epoch 3: 100%|██████████| 543/543 [03:41<00:00,  2.45it/s]
Evaluating: 100%|██████████| 117/117 [00:06<00:00, 18.99it/s]


Evaluating Loss: 0.317
Evaluating Accuracy: 0.909


Final Test Evaluation: 100%|██████████| 117/117 [00:06<00:00, 19.38it/s]

Final Test Evaluation Loss: 0.295
Final Test Evaluation Accuracy: 0.910
                    precision    recall  f1-score   support

       Hate Speech       0.48      0.39      0.43       207
Offensive Language       0.95      0.95      0.95      2880
           Neither       0.86      0.92      0.89       631

          accuracy                           0.91      3718
         macro avg       0.76      0.75      0.75      3718
      weighted avg       0.91      0.91      0.91      3718

Test Accuracy: 0.911
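
The two test accuracy figures above differ slightly (0.910 vs. 0.911) because `evaluate` averages per-batch accuracies, while `accuracy_score` divides total correct predictions by total samples. With 3718 test samples and a batch size of 32, the last of the 117 batches holds only 6 samples, yet it counts as much as a full batch in the first method. The sketch below illustrates the effect with made-up per-batch correct counts; the helper names are hypothetical.

```python
def mean_of_batch_accuracies(correct_per_batch, batch_sizes):
    """Average the accuracy of each batch, giving every batch equal weight."""
    per_batch = [c / n for c, n in zip(correct_per_batch, batch_sizes)]
    return sum(per_batch) / len(per_batch)


def sample_weighted_accuracy(correct_per_batch, batch_sizes):
    """Divide total correct predictions by total samples."""
    return sum(correct_per_batch) / sum(batch_sizes)


# 116 full batches of 32 plus one final batch of 6 gives 3718 samples.
batch_sizes = [32] * 116 + [6]
# Made-up counts: 29 of 32 correct in each full batch, 4 of 6 in the last one.
correct = [29] * 116 + [4]

batch_mean = mean_of_batch_accuracies(correct, batch_sizes)
weighted = sample_weighted_accuracy(correct, batch_sizes)
print(f"mean of batch accuracies: {batch_mean:.4f}")
print(f"sample-weighted accuracy: {weighted:.4f}")
```

The sample-weighted figure is the one that matches `accuracy_score`. Accumulating the correct count and the sample count separately inside `evaluate` would make both printed values agree.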



