# Text Classification and Analysis of Data Protection Acts

This project classifies and analyzes data protection acts from Kenya, South Africa, and Europe. The goal is to build a text classification model using PyTorch and the Transformers library. This notebook walks through the process for the GDPR dataset, from data loading and preprocessing to model training and evaluation.

## Project Structure

### 1. Introduction
- **Description**: Overview of the project, its objectives, and the datasets used.
- **Libraries**: Import necessary libraries such as `torch`, `transformers`, `pandas`, and `sklearn`.

### 2. Data Loading and Exploration
- **Datasets Path**: Define the path to the datasets.
- **Loading Data**: Load the GDPR dataset and display the first few rows.
- **Data Cleaning**: Check for missing values and drop unnecessary columns.

### 3. Data Preprocessing
- **Text Preparation**: Combine relevant columns to create the input text for the model.
- **Data Splitting**: Split the data into training, validation, and test sets.
- **Text Cleaning**: Clean the input text by removing unwanted characters and spaces.

### 4. Tokenization
- **Tokenizer Initialization**: Initialize the BERT tokenizer.
- **Tokenization Function**: Define a function to tokenize the data.
- **Tokenize Data**: Tokenize the training, validation, and test datasets.

### 5. Label Encoding
- **Label Mapping**: Create a mapping of categories to numerical labels.
- **Apply Labels**: Map the categories to labels in the datasets.
- **Dataset Preparation**: Prepare the datasets for the model by creating a custom `Dataset` class.

### 6. Model Training
- **Model Initialization**: Initialize the BERT model for sequence classification.
- **Training Arguments**: Define the training arguments such as batch size, learning rate, and number of epochs.
- **Trainer Initialization**: Initialize the `Trainer` with the model, training arguments, and datasets.
- **Model Training**: Train the model using the `Trainer`.

### 7. Model Evaluation
- **Evaluation**: Evaluate the model on the validation dataset.
- **Metrics**: Display the evaluation metrics to assess the model's performance.

### 8. Conclusion
- **Summary**: Summarize the results and discuss potential improvements and future work.

## Variables and Data Structures

- **datasets_path**: Path to the datasets directory.
- **gdpr**: DataFrame containing the GDPR dataset.
- **gdpr_path**: Path to the GDPR CSV file.
- **label_mapping**: Dictionary mapping categories to numerical labels.
- **model**: BERT model for sequence classification.
- **test_data**: DataFrame containing the test data.
- **test_data_labels**: Numpy array of labels for the test data.
- **test_dataset**: Custom dataset object for the test data.
- **test_encoding**: Tokenized test data.
- **tokenizer**: BERT tokenizer.
- **train_data**: DataFrame containing the training data.
- **train_dataset**: Custom dataset object for the training data.
- **train_encoding**: Tokenized training data.
- **trainer**: Trainer object for model training.
- **training_args**: Training arguments for the Trainer.
- **training_data**: DataFrame containing the original training data.
- **training_data_copy**: Cleaned copy of the training data.
- **training_labels**: Numpy array of labels for the training data.
- **val_data**: DataFrame containing the validation data.
- **val_dataset**: Custom dataset object for the validation data.
- **val_encoding**: Tokenized validation data.
- **val_labels**: Numpy array of labels for the validation data.

This notebook walks through building a text classification model for data protection acts with PyTorch and Transformers.

In [None]:
import torch as th
import transformers as tr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BertTokenizer
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
import os

In [None]:
import sys
print(sys.version)

### Data Loading and Exploration

#### Step 1: Importing Libraries
- Import the libraries used throughout the notebook: `torch`, `transformers`, `pandas`, `numpy`, and `sklearn`.

#### Step 2: Loading the Dataset
- Set the datasets path (`../Datasets/`) used for all datasets in the project.
- Load `gdpr4.csv` using pandas' `read_csv` function.
- Display the first few rows of the dataset using the `head` method to get an initial understanding of the data structure.

#### Step 3: Inspecting the Dataset
- Use the `info` method to get a concise summary of the dataset, including the number of non-null entries and data types of each column.
- Use the `describe` method to generate descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distribution.

#### Step 4: Checking for Missing Values
- Check for missing values in the dataset using the `isnull` method combined with `sum` to get the total count of missing values per column.


In [None]:
datasets_path = '../Datasets/'
gdpr_path = datasets_path + 'gdpr4.csv'
gdpr = pd.read_csv(gdpr_path)
gdpr.head()

In [None]:
gdpr.info()

In [None]:
gdpr.describe()

In [None]:
gdpr.isna().sum()

In [None]:
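# Drop the columns that are not used for classification
# (`columns=` already selects the axis, so `axis=1` is not needed)
gdpr = gdpr.drop(columns=["Unnamed: 4", "Recitals"])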
gdpr.head()

In [None]:
gdpr.isna().sum()

In [None]:
train_data = pd.DataFrame()

In [None]:
# Build the model input as "Category: Description (GDPR Article: N)"
train_data["input_text"] = gdpr["Category"] + ": " + gdpr["Description"] + " (GDPR Article: " + gdpr["GDPR Articles"] + ")"
train_data.head()

In [None]:
# Strip the parentheses around the article reference
train_data["input_text"] = train_data["input_text"].str.replace(r"[()]", "", regex=True)
train_data.head()
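
The cleaning step can be tried on a single string outside pandas. The sketch below is a minimal illustration using only the standard `re` module; the helper name `clean_text` and the sample sentence are hypothetical, and the whitespace collapsing is an extra step that the cell above does not apply.

```python
import re

def clean_text(text: str) -> str:
    """Remove parentheses, then collapse runs of whitespace into one space."""
    no_parens = re.sub(r"[()]", "", text)
    return re.sub(r"\s+", " ", no_parens).strip()

# Hypothetical sample in the same shape as input_text
sample = "Consent: Lawful basis for processing  (GDPR Article: 6)"
print(clean_text(sample))
```

Applied to a DataFrame column, the same function could be passed to `Series.map`.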

In [None]:
import csv
train_data.to_csv('../Datasets/gdpr4_training_data.csv', index=False, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
train_data.head()

In [None]:
# Step 1: Extract the unique categories.
# The category is read from the source column: input_text separates the
# category with ": ", so splitting it on ", " would not isolate the category.
unique_text = gdpr["Category"].unique()

# Step 2: Create a label mapping
label_mapping = {label: idx for idx, label in enumerate(unique_text)}
print(label_mapping)

# Step 3: Map the labels to integers (train_data shares its index with gdpr)
train_data["label"] = gdpr["Category"].map(label_mapping)
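
Predictions come back as integer ids, so an inverse mapping is handy for reading them as category names. The following is a minimal sketch with made-up category names (the real ones come from the `Category` column); `label_to_id` and `id_to_label` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Hypothetical categories standing in for the values of gdpr["Category"]
categories = ["Consent", "Data Breach", "Consent", "Right to Erasure"]

# One id per category, assigned in first-seen order
unique_categories = list(dict.fromkeys(categories))
label_to_id = {label: idx for idx, label in enumerate(unique_categories)}

# Inverse mapping for decoding predicted ids back to category names
id_to_label = {idx: label for label, idx in label_to_id.items()}

predicted_ids = [2, 0]
decoded = [id_to_label[i] for i in predicted_ids]
print(decoded)
```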

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

encoding = tokenizer(
    train_data["input_text"].tolist(),
    truncation=True,
    padding=True,
    max_length=512,
    return_tensors="pt"
)

In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments, EvalPrediction

training_data, test_data = train_test_split(train_data, test_size=0.2, random_state=42)
train_data, val_data = train_test_split(training_data, test_size=0.1, random_state=42)

def tokenize_data(df: pd.DataFrame, tokenizer: tr.PreTrainedTokenizer, max_length: int = 128) -> tr.BatchEncoding:
    return tokenizer(
        df["input_text"].tolist(),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

train_encoding = tokenize_data(train_data, tokenizer)
val_encoding = tokenize_data(val_data, tokenizer)
test_encoding = tokenize_data(test_data, tokenizer)
print(train_encoding.keys())
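
Because the second split is taken from the remaining 80%, `test_size=0.1` there corresponds to about 8% of the full dataset, which leaves roughly 72% for training. The sketch below works through that arithmetic; `split_sizes` is a hypothetical helper, and rounding the held-out part up is an assumption about how fractional sizes are resolved, so exact counts may differ by a row.

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.1):
    """Return (train, validation, test) row counts for a two-stage split."""
    n_test = math.ceil(n_rows * test_size)       # first split: hold out the test set
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_size)    # second split: validation from the rest
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

print(split_sizes(100))
```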

In [None]:
class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = [int(label) for label in labels]

    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = th.tensor(self.labels[idx])
        return item
    
train_dataset = TextDataset(train_encoding, train_data["label"].tolist())
val_dataset = TextDataset(val_encoding, val_data["label"].tolist())
test_dataset = TextDataset(test_encoding, test_data["label"].tolist())

In [None]:
print(type(train_dataset))

In [None]:
train_data.columns

In [None]:
train_data["label"].nunique()

In [None]:
len(train_data)

In [None]:
train_data

In [None]:
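# Size the classification head from the full label mapping, so classes that
# appear only in the validation or test split are still covered
num_labels = len(label_mapping)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)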

training_args = TrainingArguments(
    output_dir='./results',  # Directory to save model and logs
    num_train_epochs=25,  # Number of training epochs
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=64,  # Batch size for evaluation
    warmup_steps=500,  # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # Strength of weight decay
    logging_dir='./logs',  # Directory for storing logs
    logging_steps=10,  # How often to log metrics
    eval_strategy='steps',  # When to evaluate the model
)
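
With a small dataset it is worth comparing `warmup_steps` with the total number of optimizer steps: under a linear warmup schedule, if training ends before warmup does, the learning rate never reaches its configured peak. A rough sketch of that arithmetic, assuming a single device, no gradient accumulation, and a hypothetical training set of 72 rows (the real size is `len(train_dataset)`):

```python
import math

def total_training_steps(n_train, batch_size, epochs):
    """Optimizer steps for a full run, counting the final partial batch."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

n_train = 72          # hypothetical; replace with len(train_dataset)
warmup_steps = 500    # value used in the training arguments above
steps = total_training_steps(n_train, batch_size=16, epochs=25)

print(f"total steps: {steps}, warmup steps: {warmup_steps}")
if warmup_steps >= steps:
    print("warmup is longer than the whole run")
```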

In [None]:
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc
    }
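
Accuracy alone can hide weak performance on rare categories. The pure-Python sketch below shows what `argmax(-1)` followed by accuracy computes on raw logits, and adds a macro-averaged F1 for comparison. The logits and labels are made-up values, and `argmax_rows`, `accuracy`, and `macro_f1` are hypothetical helpers rather than part of the notebook.

```python
def argmax_rows(logits):
    """Index of the largest score in each row (the predicted class id)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(l == c and p == c for l, p in zip(labels, preds))
        fp = sum(l != c and p == c for l, p in zip(labels, preds))
        fn = sum(l == c and p != c for l, p in zip(labels, preds))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up logits for four examples and three classes
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [1.1, 0.9, 0.2], [0.1, 0.2, 3.0]]
labels = [0, 1, 1, 2]
preds = argmax_rows(logits)

print(preds, accuracy(labels, preds), macro_f1(labels, preds))
```

In the notebook itself, a macro F1 could be returned from `compute_metrics` alongside `accuracy`, for example via `sklearn.metrics.f1_score` with `average='macro'`.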

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

def train_model(learning_rate, batch_size):
    # Start every run from the pretrained checkpoint; reusing the global
    # `model` would carry fine-tuned weights over from the previous run
    run_model = BertForSequenceClassification.from_pretrained(
        'bert-base-uncased', num_labels=len(label_mapping)
    )

    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=25,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 4,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        eval_strategy='steps',
        learning_rate=learning_rate,
    )

    trainer = Trainer(
        model=run_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )
    trainer.train()
    # trainer.train() reports training metrics only; eval_accuracy comes from evaluate()
    return trainer.evaluate()

# Example usage of train_model
best_accuracy = 0
best_lr = None
best_bs = None

for lr in [1e-5, 3e-5, 5e-5]:
    for bs in [8, 16, 32]:
        eval_metrics = train_model(lr, bs)
        print(f'Available metrics: {eval_metrics.keys()}')  # Debugging line
        accuracy = eval_metrics.get('eval_accuracy', 0)  # Use .get() to avoid KeyError
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_lr = lr
            best_bs = bs

print(f'Best Accuracy: {best_accuracy} with LR: {best_lr} and Batch Size: {best_bs}')
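
The selection logic of the search can be separated from training and checked on its own. The following is a minimal sketch using `itertools.product`; `fake_eval_accuracy` is a stand-in with invented scores (not real results), where the notebook would call `train_model` and read `eval_accuracy` instead, and `grid_search` is a hypothetical helper.

```python
from itertools import product

def fake_eval_accuracy(learning_rate, batch_size):
    # Invented scores for illustration only; these are not measured results
    scores = {(3e-5, 16): 0.9, (5e-5, 8): 0.85}
    return scores.get((learning_rate, batch_size), 0.5)

def grid_search(evaluate, learning_rates, batch_sizes):
    """Return the best-scoring combination over the full grid."""
    best = {"accuracy": float("-inf"), "lr": None, "bs": None}
    for lr, bs in product(learning_rates, batch_sizes):
        acc = evaluate(lr, bs)
        if acc > best["accuracy"]:
            best = {"accuracy": acc, "lr": lr, "bs": bs}
    return best

best = grid_search(fake_eval_accuracy, [1e-5, 3e-5, 5e-5], [8, 16, 32])
print(best)
```

Passing the evaluation function as an argument keeps the loop testable without running any training.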
