**Title:** CNN for Text Classification

**Introduction:**
This notebook implements a Convolutional Neural Network (CNN) for text classification in PyTorch. The dataset is a set of labeled tweets, and the task is to classify each tweet into one of three categories: Hate Speech, Offensive Language, or Neither. The CNN learns features from the token sequence, and a fully connected layer performs the classification.

**Content:**

1. **Environment Setup:** The notebook begins with setting up the Python environment and importing necessary libraries.

2. **Model Initialization with CNN:** The CNN model for text classification is defined in this section. It consists of an embedding layer, two 1D convolutional layers, global max pooling over the sequence, and a fully connected layer.

3. **Text Cleaning:** A function for cleaning text data is defined to preprocess the tweets: it lowercases the text and removes URLs, mentions, digits, emoticons, punctuation, and other special characters.

4. **Dataset Loading and Splitting:** The labeled tweet dataset is loaded and split into training, validation, and test sets (70% / 15% / 15%).

5. **Tokenization with BERT Tokenizer:** The tweets are tokenized using the `bert-base-uncased` BERT tokenizer, which converts text inputs into token IDs. Only the tokenizer is used; no BERT model is loaded.

6. **Dataset Class:** A custom dataset class is defined to process the tokenized inputs and corresponding labels.

7. **DataLoader Initialization:** DataLoaders are initialized for the training, validation, and test datasets to efficiently load data in batches during model training and evaluation.

8. **Model Training:** The training function is defined to train the CNN model on the training dataset using backpropagation.

9. **Evaluation Function:** An evaluation function is defined to assess the model's performance on the validation and test datasets.

10. **Main Training Loop:** The main training loop runs for three epochs; in each epoch the model is trained on the training dataset and then evaluated on the validation dataset.

11. **Final Evaluation on Test Set:** The trained model's performance is evaluated on the test dataset, and classification metrics such as precision, recall, and F1-score are computed.

12. **Conclusion:** The notebook concludes by printing the test accuracy of the CNN model for text classification.

**Conclusion:**
This notebook implements a CNN for text classification in PyTorch, covering the process from data preprocessing to model evaluation. After three epochs the model reaches a test accuracy of 0.897, but performance is uneven across classes: F1 is 0.94 for Offensive Language and 0.84 for Neither, yet only 0.23 for Hate Speech (recall 0.15), the smallest class in the test set (207 of 3,718 tweets).

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/mydatasetof/labeled_data.csv


In [2]:
import re
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from tqdm import tqdm
from transformers import BertTokenizer

# Part 1: CNN model definition
class CNNForSequenceClassification(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # input_dim is the vocabulary size; each token ID maps to a 128-d vector
        self.embedding = nn.Embedding(input_dim, 128)
        self.conv1 = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3)
        self.conv2 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3)
        self.fc = nn.Linear(32, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)  # (batch, seq_len, 128)
        embedded = embedded.permute(0, 2, 1)  # (batch, 128, seq_len): Conv1d expects channels first
        conv_out1 = F.relu(self.conv1(embedded))  # (batch, 64, seq_len - 2)
        conv_out2 = F.relu(self.conv2(conv_out1))  # (batch, 32, seq_len - 4)
        # Global max pooling over the sequence dimension
        pooled = F.max_pool1d(conv_out2, kernel_size=conv_out2.size(2)).squeeze(2)  # (batch, 32)
        output = self.fc(pooled)  # (batch, output_dim) logits
        return output

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')


Using device: cuda
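
To make the shape changes in `forward` concrete, the sketch below applies the output-length formula from the PyTorch `Conv1d` documentation to the two kernel-size-3 convolutions. The helper name `conv1d_out_len` and the example length of 128 tokens are illustrative assumptions, not part of the notebook.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution (formula from the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

seq_len = 128  # hypothetical padded sequence length
after_conv1 = conv1d_out_len(seq_len, kernel_size=3)      # conv1: 128 -> 64 channels
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3)  # conv2: 64 -> 32 channels
print(seq_len, after_conv1, after_conv2)

# Shortest input for which both convolutions still produce at least one position
min_len = next(n for n in range(1, 20)
               if conv1d_out_len(conv1d_out_len(n, 3), 3) >= 1)
print(min_len)
```

Each unpadded convolution with kernel size 3 shortens the sequence by two positions, and the global max pool then reduces whatever length remains to a single value per channel. This is why the final linear layer can take a fixed 32 inputs regardless of sequence length.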


In [3]:
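# Part 2: Define text cleaning function
# Note: clean_text lowercases the text and strips digits before the emoticon loop runs,
# so entries containing uppercase letters or digits (e.g. ':-D', 'XD', '<3', '8-)') never match.
# Their punctuation is still removed by the non-letter regex below; letters (e.g. 'xd') remain.
emoticons = [':-)', ':)', '(:', '(-:', ':))', '((:', ':-D', ':D', 'X-D', 'XD', 'xD', '<3', '3', ':*', ':-*', 'xP', 'XP', 'Xp', ':-|', ':->', ':-<', '8-)', ':-P', ':-p', '=P', '=p', ':*)', '*-*', 'B-)', 'O.o', 'X-(', ')-X']

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://[^\s]+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)  # remove @mentions
    text = re.sub(r'\d+', '', text)  # remove digits
    for emoticon in emoticons:
        text = text.replace(emoticon, '')
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)  # anything except letters and ? . ! , ¿ becomes a space
    text = re.sub(r"([?.!,¿])", r" ", text)  # remaining punctuation becomes a space
    text = re.sub(r'[" "]+', " ", text)  # collapse repeated spaces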
    return text.strip()
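
As an illustration of what the regex steps do, the sketch below applies them one at a time to a made-up tweet and records the intermediate strings. It mirrors the order used in `clean_text` but omits the emoticon list for brevity and writes the final whitespace pattern as ` +`. The names `STEPS` and `trace_cleaning` and the sample text are assumptions for illustration only.

```python
import re

# Regex steps in the same order as clean_text (emoticon list omitted for brevity)
STEPS = [
    ("urls", r"https?://[^\s]+", ""),
    ("mentions", r"@\w+", ""),
    ("digits", r"\d+", ""),
    ("non_letters", r"[^a-zA-Z?.!,¿]+", " "),
    ("punctuation", r"[?.!,¿]", " "),
    ("spaces", r" +", " "),
]

def trace_cleaning(text):
    """Return the text after each step so the effect of every regex is visible."""
    text = text.lower()
    trace = [("lower", text)]
    for name, pattern, repl in STEPS:
        text = re.sub(pattern, repl, text)
        trace.append((name, text))
    trace.append(("strip", text.strip()))
    return trace

sample = "RT @user: Check this out!! https://t.co/abc123 100%"  # made-up tweet
for name, cleaned in trace_cleaning(sample):
    print(f"{name:12s} {cleaned!r}")
```

For this sample the final string is `rt check this out`: the retweet marker survives as an ordinary token, while the mention, URL, digits, and punctuation are gone.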

In [5]:
# Part 3: Load dataset, clean the tweets, and split 70% / 15% / 15%
df = pd.read_csv('/kaggle/input/mydatasetof/labeled_data.csv')
df['tweet'] = df['tweet'].apply(clean_text)

train_texts, temp_texts, train_labels, temp_labels = train_test_split(df['tweet'], df['class'], test_size=0.3, random_state=42)
val_texts, test_texts, val_labels, test_labels = train_test_split(temp_texts, temp_labels, test_size=0.5, random_state=42)
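
The two chained calls give a 70/15/15 split: the first holds out 30% of the rows, and the second divides that holdout in half. The sketch below works through the arithmetic, assuming scikit-learn's behavior of rounding a fractional `test_size` up and giving the remainder to the other side. The helper `split_sizes` and the dataset size of 1,000 are hypothetical.

```python
import math

def split_sizes(n_samples, holdout=0.3, test_share=0.5):
    """Sizes produced by two chained train_test_split calls with float test_size."""
    n_temp = math.ceil(holdout * n_samples)  # first call: rows held out of training
    n_train = n_samples - n_temp
    n_test = math.ceil(test_share * n_temp)  # second call: test portion of the holdout
    n_val = n_temp - n_test
    return n_train, n_val, n_test

print(split_sizes(1000))  # hypothetical dataset size
```

Neither call passes `stratify`, so class proportions in the three sets can drift slightly from those of the full dataset.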


In [6]:
# Part 4: Tokenization with BERT tokenizer (truncate to 128 tokens, pad each split to its longest sequence)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding=True, max_length=128)
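
The effect of `truncation=True, padding=True, max_length=...` can be sketched on plain lists of token IDs. This is a simplified illustration, not the tokenizer's implementation: the helper `pad_and_truncate` and the example IDs are assumptions, although 0, 101, and 102 are the `[PAD]`, `[CLS]`, and `[SEP]` IDs in `bert-base-uncased`.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each ID list to max_length (keeping the final token), then pad to the longest."""
    truncated = [
        seq if len(seq) <= max_length else seq[:max_length - 1] + seq[-1:]
        for seq in sequences
    ]
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Hypothetical token IDs; 101 and 102 are [CLS] and [SEP]
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
ids, mask = pad_and_truncate(batch, max_length=4)
print(ids)
print(mask)
```

The CNN in this notebook consumes only `input_ids` and ignores the attention mask, so padded positions are embedded and convolved like any other token.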



In [7]:
# Part 5: Dataset class that returns one tokenized tweet and its label as tensors
class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels.tolist())
test_dataset = TweetDataset(test_encodings, test_labels.tolist())
val_dataset = TweetDataset(val_encodings, val_labels.tolist())


In [8]:
# Part 6: DataLoader initialization (only the training data is shuffled)
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
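
With `batch_size = 32`, the number of batches per epoch follows directly from the dataset size. The sketch below uses a hypothetical helper, `num_batches`, to compute it; the test-set size of 3,718 is taken from the classification report at the end of the notebook.

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

# 3718 is the test-set support reported in the classification report
print(num_batches(3718, 32))
last_batch = 3718 % 32 or 32  # size of the final, possibly smaller batch
print(last_batch)
```

The result, 117, matches the batch count shown in the test progress bar, and the final batch holds only 6 tweets.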


In [9]:
# Part 7: Instantiate the model and optimizer
input_dim = len(tokenizer.get_vocab())
output_dim = 3
model = CNNForSequenceClassification(input_dim, output_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
model.to(device)


CNNForSequenceClassification(
  (embedding): Embedding(30522, 128)
  (conv1): Conv1d(128, 64, kernel_size=(3,), stride=(1,))
  (conv2): Conv1d(64, 32, kernel_size=(3,), stride=(1,))
  (fc): Linear(in_features=32, out_features=3, bias=True)
)
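
The layer sizes printed above are enough to count the trainable parameters by hand. The sketch below does so with small helper functions whose names are illustrative; the dimensions are copied from the model summary.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def conv1d_params(in_channels, out_channels, kernel_size):
    return in_channels * out_channels * kernel_size + out_channels  # weights + biases

def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights + biases

counts = {
    "embedding": embedding_params(30522, 128),
    "conv1": conv1d_params(128, 64, 3),
    "conv2": conv1d_params(64, 32, 3),
    "fc": linear_params(32, 3),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:10s} {n:>10,d}  ({n / total:.1%})")
print(f"{'total':10s} {total:>10,d}")
```

The embedding table accounts for about 99% of the roughly 3.9 million parameters; the convolutional and linear layers together contribute only about 31,000.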

In [10]:
# Part 8: Training function (one pass over train_loader)
def train(epoch):
    model.train()
    total_loss, total_accuracy = 0, 0
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch}"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids)
        loss = F.cross_entropy(outputs, labels)  # same loss as nn.CrossEntropyLoss, without building a module per batch
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        _, predictions = torch.max(outputs, 1)
        total_accuracy += torch.sum(predictions == labels).item() / len(labels)
    
    avg_loss = total_loss / len(train_loader)
    avg_accuracy = total_accuracy / len(train_loader)
    print(f"Training Loss: {avg_loss:.3f}")
    print(f"Training Accuracy: {avg_accuracy:.3f}")
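
To help read the loss values the training function prints, here is a minimal pure-Python sketch of the cross-entropy for a single example: the log-sum-exp of the logits minus the logit of the true class. PyTorch's default additionally averages this over the batch. The function below and the example logits are illustrative assumptions.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one example: log-sum-exp of the logits minus the true-class logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Uninformative logits give the chance-level loss for three classes, log(3)
print(round(cross_entropy([0.0, 0.0, 0.0], label=1), 4))
# A confident, correct prediction gives a loss near zero (made-up logits)
print(round(cross_entropy([0.2, 5.0, -1.0], label=1), 4))
```

A three-class model that has learned nothing sits near log(3), about 1.10, which gives a reference point for the logged losses.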


In [11]:
# Part 9: Evaluation function (no gradient updates; returns true labels and predictions)
def evaluate(loader, desc="Evaluating"):
    model.eval()
    total_loss, total_accuracy = 0, 0
    all_predictions, all_labels = [], []
    
    with torch.no_grad():
        for batch in tqdm(loader, desc=desc):
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids)
            loss = F.cross_entropy(outputs, labels)  # same loss as nn.CrossEntropyLoss, without building a module per batch
            
            total_loss += loss.item()
            _, predictions = torch.max(outputs, 1)
            total_accuracy += torch.sum(predictions == labels).item() / len(labels)
            
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(loader)
    avg_accuracy = total_accuracy / len(loader)
    print(f"{desc} Loss: {avg_loss:.3f}")
    print(f"{desc} Accuracy: {avg_accuracy:.3f}")
    
    return all_labels, all_predictions
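
Both `train` and `evaluate` report the mean of per-batch accuracies, which weights a small final batch as heavily as a full one. The sketch below contrasts this with the per-sample accuracy on hypothetical `(correct, batch_size)` pairs; the function names and numbers are made up for illustration.

```python
def batch_mean_accuracy(batches):
    """Average of per-batch accuracies, as the train and evaluate functions compute it."""
    return sum(correct / size for correct, size in batches) / len(batches)

def sample_accuracy(batches):
    """Fraction of all examples classified correctly."""
    return sum(c for c, _ in batches) / sum(s for _, s in batches)

# Hypothetical (correct, batch_size) pairs; the last batch is smaller
batches = [(16, 32), (16, 32), (4, 4)]
print(round(batch_mean_accuracy(batches), 3))
print(round(sample_accuracy(batches), 3))
```

With many full batches and a single short one, the gap is small: in the logged test run, the per-batch figure and the `accuracy_score` value agree to three decimals (0.897).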


In [12]:
# Part 10: Main training loop (three epochs, validating after each)
for epoch in range(1, 4):
    train(epoch)
    evaluate(val_loader)

# Final evaluation on test set
labels, predictions = evaluate(test_loader, "Final Test Evaluation")
print(classification_report(labels, predictions, target_names=['Hate Speech', 'Offensive Language', 'Neither']))

# Accuracy
accuracy = accuracy_score(labels, predictions)
print(f"Test Accuracy: {accuracy:.3f}")

Training Epoch 1: 100%|██████████| 543/543 [00:04<00:00, 120.17it/s]


Training Loss: 0.365
Training Accuracy: 0.863


Evaluating: 100%|██████████| 117/117 [00:00<00:00, 270.74it/s]


Evaluating Loss: 0.295
Evaluating Accuracy: 0.897


Training Epoch 2: 100%|██████████| 543/543 [00:03<00:00, 163.59it/s]


Training Loss: 0.250
Training Accuracy: 0.912


Evaluating: 100%|██████████| 117/117 [00:00<00:00, 270.58it/s]


Evaluating Loss: 0.303
Evaluating Accuracy: 0.886


Training Epoch 3: 100%|██████████| 543/543 [00:03<00:00, 169.71it/s]


Training Loss: 0.201
Training Accuracy: 0.928


Evaluating: 100%|██████████| 117/117 [00:00<00:00, 264.58it/s]


Evaluating Loss: 0.310
Evaluating Accuracy: 0.889


Final Test Evaluation: 100%|██████████| 117/117 [00:00<00:00, 274.56it/s]

Final Test Evaluation Loss: 0.286
Final Test Evaluation Accuracy: 0.897
                    precision    recall  f1-score   support

       Hate Speech       0.49      0.15      0.23       207
Offensive Language       0.91      0.97      0.94      2880
           Neither       0.87      0.81      0.84       631

          accuracy                           0.90      3718
         macro avg       0.76      0.64      0.67      3718
      weighted avg       0.88      0.90      0.88      3718

Test Accuracy: 0.897
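
The gap between the macro and weighted averages in the report comes from class imbalance. The sketch below recomputes both F1 averages from the per-class rows, along with the accuracy of a baseline that always predicts the largest class. The numbers are copied from the report; the variable names are illustrative.

```python
# Per-class F1 and support copied from the classification report
report = {
    "Hate Speech": (0.23, 207),
    "Offensive Language": (0.94, 2880),
    "Neither": (0.84, 631),
}

total = sum(support for _, support in report.values())
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total
majority_baseline = max(support for _, support in report.values()) / total

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
print(f"accuracy of always predicting the largest class: {majority_baseline:.3f}")
```

The macro average treats the three classes equally, so the weak Hate Speech score pulls it down to 0.67. The weighted average is dominated by Offensive Language, which makes up most of the test set. The majority-class baseline of about 0.775 is a useful reference when reading the 0.897 test accuracy.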



