### Overview of Merging BERT and CNN Models

Combining BERT and a CNN in a single architecture pairs BERT's contextual token representations with a CNN's ability to pick out local, n-gram-like patterns for text classification. The combination can help on tasks such as sentiment analysis or hate speech detection, where both sentence-level context and short local cues matter.

### Steps in Merging BERT and CNN:

1. **Data Preprocessing and Tokenization**:
   - **Text Cleaning**: The text is lowercased, and URLs, user mentions, emoticons, digits, and punctuation are removed. The pipeline treats these as noise rather than signal for this task.
   - **Tokenization**: BERT's tokenizer converts the cleaned text into token IDs, the integer inputs the BERT model expects, along with an attention mask. Sequences are truncated to at most 128 tokens and padded to a common length so that they can be batched.

2. **Dataset Preparation**:
   - A custom `Dataset` class handles the storage of tokenized text and labels, facilitating easy batch loading during training and evaluation through PyTorch’s `DataLoader`.

3. **Model Architecture (BertCNN)**:
   - **BERT Layer**: BERT serves as the feature extractor and produces one embedding per token. These embeddings are contextual: each one encodes the meaning of a word in light of the surrounding words. In the code below, BERT runs under `torch.no_grad()`, so its weights stay frozen and only the CNN and classifier layers are trained.
   - **CNN Layers**: Two 1D convolutional layers with kernel size 3 are applied to the BERT embeddings along the token dimension. They extract higher-level features from short windows of adjacent tokens, which helps with local constructs such as negations or conditionals.
   - **Pooling and Classification**: After convolution, global max pooling over the token dimension keeps the strongest activation of each of the 128 feature maps, yielding a fixed-size vector regardless of sequence length. A fully connected layer then maps this vector to the target classes.

4. **Training and Evaluation**:
   - **Training Loop**: In each epoch, the model adjusts its weights to minimize the cross-entropy loss between predicted and actual class labels. Gradients are computed for each batch, and weights are updated with the AdamW optimizer.
   - **Evaluation Loop**: The model is evaluated on the validation set after every epoch to monitor for overfitting, and on the test set once after training. Loss and accuracy are reported for each.

5. **Model Deployment**:
   - Once trained and validated, the model can be deployed for inference to classify new, unseen text.
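
The cleaning in step 1 can be tried in isolation with the standard library. The sketch below is illustrative rather than the notebook's exact code: the function name `clean_tweet`, the shortened emoticon list, and the sample tweet are all assumptions. It shows the order of operations, with emoticons stripped before lowercasing so that mixed-case variants such as `:D` still match.

```python
import re

# Hypothetical, shortened emoticon list for illustration only.
EMOTICONS = [":-)", ":)", ":D", "<3"]

def clean_tweet(text: str) -> str:
    """Strip emoticons, URLs, mentions, digits, and punctuation from a tweet."""
    for emoticon in EMOTICONS:                 # match before lowercasing
        text = text.replace(emoticon, "")
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop user mentions
    text = re.sub(r"\d+", "", text)            # drop digits
    text = re.sub(r"[^a-z]+", " ", text)       # everything else becomes one space
    return text.strip()

sample = "@user Check THIS out :D https://example.com 100% true!!"
print(clean_tweet(sample))  # check this out true
```

Only lowercase words survive, which is the form passed to the uncased BERT tokenizer.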

### Explanation of the Merged Code:

The merged code defines an end-to-end workflow covering data preprocessing, model definition, training, and evaluation:

- **Data Loading and Cleaning**: Data is loaded from a CSV file and cleaned using a predefined function that strips unnecessary characters and normalizes the text.
- **Tokenization**: Texts are converted into tokenized formats suitable for BERT.
- **Dataset and DataLoader Setup**: Tokenized texts are encapsulated in a custom dataset class, which is then used with DataLoader for efficient batch processing during model training.
- **Model Definition (BertCNN)**:
  - The model integrates BERT for obtaining deep contextual embeddings of the text.
  - Convolutional layers process these embeddings to capture patterns across neighboring tokens.
  - The output of the convolutional layers is max-pooled and then fed into a dense layer for classification.
- **Training and Evaluation**:
  - Detailed training and evaluation functions are defined, incorporating loss computation, backpropagation, and accuracy calculation.
  - The training function includes gradient zeroing and optimizer steps, essential for correct weight updates.
  - Evaluation assesses the model's performance on unseen data, crucial for testing the model's generalizability.
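
One detail of the accuracy calculation is worth illustrating. Averaging per-batch accuracies gives every batch equal weight, so a smaller final batch counts as much as a full one, and the result can differ slightly from the accuracy over all samples. The sketch below uses made-up batch counts (an assumption for illustration) to show the gap.

```python
# (correct, total) per batch. The last batch is smaller, as happens when the
# dataset size is not a multiple of the batch size. All numbers are made up.
batches = [(30, 32), (28, 32), (6, 6)]

# Mean of per-batch accuracies: each batch has equal weight.
batch_mean = sum(correct / total for correct, total in batches) / len(batches)

# Sample-weighted accuracy: each example has equal weight.
overall = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(f"mean of batch accuracies: {batch_mean:.4f}")
print(f"overall accuracy:         {overall:.4f}")
```

This is a likely reason why the accuracy printed by the evaluation loop can differ in the third decimal place from the figure `accuracy_score` computes on the concatenated predictions.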

This integration combines the contextual understanding of text provided by BERT with the local feature extraction of CNNs, with the aim of building a stronger text classifier.
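
To make the convolution-and-pooling stage concrete, here is a dependency-free sketch of a single-channel 1D convolution with kernel size 3 and padding 1, followed by ReLU and global max pooling. It is a toy stand-in, not the PyTorch layers: the function names, input values, and kernel weights are illustrative assumptions, and real `nn.Conv1d` layers operate on many channels with learned weights and a bias.

```python
def conv1d_same(seq, kernel):
    """Single-channel 1D cross-correlation with zero padding (odd kernel size)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(seq))
    ]

def relu(values):
    """Zero out negative activations."""
    return [max(0.0, v) for v in values]

def global_max_pool(values):
    """Collapse the sequence dimension to its largest activation."""
    return max(values)

# One feature channel over a 6-token sequence (made-up numbers).
features = [0.1, 0.9, -0.4, 0.3, 0.8, -0.2]
kernel = [0.5, 1.0, 0.5]  # hypothetical weights; real ones are learned

activations = relu(conv1d_same(features, kernel))
print(len(activations))              # 6: padding of 1 keeps the sequence length
print(global_max_pool(activations))  # a single value for this channel
```

Because pooling takes the maximum over all positions, each channel contributes one number however long the input is, which is what lets a fixed-size linear layer follow.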

import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from transformers import BertTokenizer, BertModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm

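# Data Preprocessing and Tokenization
EMOTICONS = [':-)', ':)', '(:', '(-:', ':))', '((:', ':-D', ':D', 'X-D', 'XD', 'xD', '<3', ':*', ':-*', 'xP', 'XP', 'Xp', ':-|', ':->', ':-<', '8-)', ':-P', ':-p', '=P', '=p', ':*)', '*-*', 'B-)', 'O.o', 'X-(', ')-X']
# Match emoticons only as standalone tokens so that entries such as 'XP'
# do not eat into ordinary words.
EMOTICON_RE = re.compile(r'(?<!\S)(?:' + '|'.join(re.escape(e) for e in EMOTICONS) + r')(?!\S)')

def clean_text(text):
    """Strip emoticons, URLs, mentions, digits, and punctuation; return lowercase text."""
    # Remove emoticons before lowercasing and digit removal; otherwise
    # mixed-case entries such as ':D' and digit entries such as '<3' never match.
    text = EMOTICON_RE.sub('', text)
    text = text.lower()
    text = re.sub(r'https?://[^\s]+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)             # user mentions
    text = re.sub(r'\d+', '', text)              # digits
    text = re.sub(r'[^a-z]+', ' ', text)         # punctuation and symbols -> one space
    return text.strip()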

df = pd.read_csv('/kaggle/input/dataset/labeled_data.csv')
df['tweet'] = df['tweet'].apply(clean_text)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(df['tweet'], df['class'], test_size=0.3, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels, test_size=0.5, random_state=42)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True, max_length=128)

class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels.tolist())
val_dataset = TweetDataset(val_encodings, val_labels.tolist())
test_dataset = TweetDataset(test_encodings, test_labels.tolist())
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# BERT + CNN Integrated Model
class BertCNN(nn.Module):
    def __init__(self, bert_model, num_classes):
        super(BertCNN, self).__init__()
        self.bert = bert_model
        self.conv1 = nn.Conv1d(in_channels=768, out_channels=256, kernel_size=3, padding=1)  # Adjust parameters as needed
        self.conv2 = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, input_ids, attention_mask):
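        # BERT is used as a frozen feature extractor: no gradients flow into it,
        # so only the convolutional and linear layers are trained.
        with torch.no_grad():
            outputs = self.bert(input_ids, attention_mask=attention_mask)
        # outputs[0] (last hidden state): [batch_size, seq_length, hidden_size]
        x = outputs[0].permute(0, 2, 1)  # Conv1d expects [batch_size, hidden_size, seq_length]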
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool1d(x, kernel_size=x.size(2)).squeeze(2)
        x = self.fc(x)
        return x

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model = BertModel.from_pretrained('bert-base-uncased')
model = BertCNN(bert_model, num_classes=3)
model.to(device)
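# weight_decay is set explicitly because AdamW implementations differ in their default.
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)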

# Training Function
def train(epoch):
    model.train()
    total_loss, total_accuracy = 0, 0
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch}"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        predictions = torch.argmax(outputs, dim=1)
        total_accuracy += (predictions == labels).sum().item() / labels.size(0)
    avg_loss = total_loss / len(train_loader)
    avg_accuracy = total_accuracy / len(train_loader)
    print(f"Training Loss: {avg_loss:.3f}")
    print(f"Training Accuracy: {avg_accuracy:.3f}")

# Evaluation Function
def evaluate(loader, desc="Evaluating"):
    model.eval()
    total_loss, total_accuracy = 0, 0
    all_predictions, all_labels = [], []
    with torch.no_grad():
        for batch in tqdm(loader, desc=desc):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            total_loss += loss.item()
            predictions = torch.argmax(outputs, dim=1)
            total_accuracy += (predictions == labels).sum().item() / labels.size(0)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    avg_loss = total_loss / len(loader)
    avg_accuracy = total_accuracy / len(loader)
    print(f"{desc} Loss: {avg_loss:.3f}")
    print(f"{desc} Accuracy: {avg_accuracy:.3f}")
    return all_labels, all_predictions

# Main Training Loop
for epoch in range(1, 4):
    train(epoch)
    evaluate(val_loader)

# Final Evaluation on Test Set
labels, predictions = evaluate(test_loader, "Final Test Evaluation")
print(classification_report(labels, predictions, target_names=['Hate Speech', 'Offensive Language', 'Neither']))
accuracy = accuracy_score(labels, predictions)
print(f"Test Accuracy: {accuracy:.3f}")


Training Epoch 1: 100%|██████████| 543/543 [00:31<00:00, 17.46it/s]


Training Loss: 0.460
Training Accuracy: 0.837


Evaluating: 100%|██████████| 117/117 [00:06<00:00, 18.94it/s]


Evaluating Loss: 0.332
Evaluating Accuracy: 0.879


Training Epoch 2: 100%|██████████| 543/543 [00:29<00:00, 18.28it/s]


Training Loss: 0.320
Training Accuracy: 0.881


Evaluating: 100%|██████████| 117/117 [00:06<00:00, 19.03it/s]


Evaluating Loss: 0.301
Evaluating Accuracy: 0.890


Training Epoch 3: 100%|██████████| 543/543 [00:29<00:00, 18.23it/s]


Training Loss: 0.291
Training Accuracy: 0.891


Evaluating: 100%|██████████| 117/117 [00:06<00:00, 18.94it/s]


Evaluating Loss: 0.284
Evaluating Accuracy: 0.898


Final Test Evaluation: 100%|██████████| 117/117 [00:06<00:00, 19.45it/s]

Final Test Evaluation Loss: 0.271
Final Test Evaluation Accuracy: 0.898
                    precision    recall  f1-score   support

       Hate Speech       0.54      0.33      0.41       207
Offensive Language       0.93      0.95      0.94      2880
           Neither       0.82      0.86      0.84       631

          accuracy                           0.90      3718
         macro avg       0.76      0.71      0.73      3718
      weighted avg       0.89      0.90      0.89      3718

Test Accuracy: 0.899
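
The macro and weighted averages in the report can be re-derived from the per-class rows, which makes the effect of class imbalance explicit. The sketch below copies the rounded F1 scores and supports from the table above, so its results agree with the report only up to rounding.

```python
# Per-class (f1, support) copied from the classification report. The F1 values
# are already rounded to two decimals, so the results are approximate.
report = {
    "Hate Speech": (0.41, 207),
    "Offensive Language": (0.94, 2880),
    "Neither": (0.84, 631),
}

total = sum(support for _, support in report.values())

# Macro average: every class counts equally, so the weak minority class
# pulls the score down.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: classes count in proportion to their support, so the
# largest class dominates the score.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print(total)
print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

Hate Speech accounts for 207 of the 3718 test examples (about 5.6%), so its low F1 barely moves the weighted average or the overall accuracy. The macro average is the more informative summary when minority-class performance matters.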



