---
# Install the required packages

If needed, install the following packages:

In [1]:
# !pip install datasets transformers imbalanced-learn evaluate

---
# Imports

In [2]:
from datasets import load_dataset

# Write your code here. Add as many boxes as you need.

In [3]:
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
from sklearn.metrics import classification_report, precision_recall_fscore_support, accuracy_score
import evaluate
import torch




---
# Laboratory Exercise - Run Mode (8 points)

## Introduction

The primary objective of this laboratory assignment is to fine-tune a pre-trained language model to detect toxic sentences (binary classification).

The dataset contains two attributes: 
- `text`: The sentence to be classified as toxic or non-toxic
- `label`: 0/1 indicator of whether the given sentence is toxic

**Note: You are required to perform this laboratory assignment on your local machine.**

# Read the data

The code for reading the dataset is provided. Just run the following two cells.

**DO NOT MODIFY IT! Just analyse how the data is read, because in the future this part won't be given.**

In [4]:
dataset = load_dataset(
    'csv', 
    data_files={'train': 'data/train.tsv', 'val': 'data/val.tsv','test': 'data/test.tsv'},
    delimiter='\t'
)

**The prediction target column MUST be named 'label' in the dataset!**

See the dataset structure:

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 3130
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3132
    })
})

---
# Natural Language Processing

## Generate the Tokenizer and Data Collator

For the purposes of this lab you will be using `DistilBertTokenizer` and `DataCollatorWithPadding`.

In [6]:
# Write your code here. Add as many boxes as you need.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [7]:
def tokenize(sample):
    return tokenizer(sample['text'], truncation=True, max_length=15)

## Tokenize the dataset

To reduce the amount of computation, set the `max_length` parameter to 15.

In [8]:
# Write your code here. Add as many boxes as you need.
tokenized_datasets = dataset.map(tokenize, batched=True)
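
To make the effect of `max_length=15` concrete, here is a toy sketch of truncation. It assumes a hypothetical whitespace tokenizer (`toy_encode` is not part of `transformers`) instead of the real WordPiece tokenizer. The one behaviour it mirrors is that the length limit also counts the `[CLS]` and `[SEP]` special tokens, so at most 13 content tokens survive.

```python
def toy_encode(text, max_length=15):
    """Hypothetical stand-in for a tokenizer call with truncation=True."""
    tokens = text.lower().split()      # a real tokenizer splits into word pieces
    budget = max_length - 2            # reserve room for [CLS] and [SEP]
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]


long_sentence = " ".join(f"word{i}" for i in range(40))
encoded = toy_encode(long_sentence)

print(len(encoded))                    # bounded by max_length
print(toy_encode("short sentence"))    # short inputs are left untouched
```

Sentences longer than the limit lose their tail, which is the price paid here for faster training.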

In [9]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
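
Because the tokenization step above truncates but does not pad, the collator is what brings every batch to a rectangular shape. The sketch below is a simplified, pure-Python illustration of that idea and not the library's implementation; `pad_batch` and the example token ids are made up for the demonstration, and the pad id of 0 is an assumption.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch instead of to a global maximum means short batches stay short, which saves computation.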

## Define the model

The required model for this lab is the `DistilBertForSequenceClassification`.

In [10]:
# Write your code here. Add as many boxes as you need.
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Define the training arguments

To lower the compute time, I recommend using the following parameters:
- per_device_train_batch_size=128
- per_device_eval_batch_size=128
- **num_train_epochs=1**

In [11]:
# Write your code here. Add as many boxes as you need.
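training_args = TrainingArguments(
    output_dir="./results",
    # Evaluate on the validation set at the end of every epoch.
    # Note: recent transformers releases rename this argument to `eval_strategy`.
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=1,
    weight_decay=0.01,
    fp16=True,  # mixed precision; typically requires a CUDA GPU, set to False on CPU
)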



## Load the metrics

Load the best metric for this specific problem.

In [12]:
# Write your code here. Add as many boxes as you need.
metric = evaluate.load("accuracy")

### Define the function to compute the metrics

In [13]:
# Write your code here. Add as many boxes as you need.
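def compute_metrices(eval_pred):
    # eval_pred unpacks into the raw model outputs (logits) and the true labels.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = logits.argmax(axis=-1)
    # average='binary' reports precision/recall/F1 for the positive class (label 1).
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}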

## Generate the Trainer object

In [14]:
# Write your code here. Add as many boxes as you need.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'], 
    eval_dataset=tokenized_datasets['val'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrices,
)

## Train the model

Use the trainer to train the model.

In [15]:
# Write your code here. Add as many boxes as you need.
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.672615,0.761981,0.734018,0.821725,0.775399


TrainOutput(global_step=8, training_loss=0.684434175491333, metrics={'train_runtime': 13.7247, 'train_samples_per_second': 72.861, 'train_steps_per_second': 0.583, 'total_flos': 3880880820000.0, 'train_loss': 0.684434175491333, 'epoch': 1.0})
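
The `global_step=8` in the output follows from the dataset size and the batch size. A quick check, assuming a single device, no gradient accumulation, and that the last partial batch is kept:

```python
import math

num_train_rows = 1000   # from the DatasetDict output
batch_size = 128        # per_device_train_batch_size
num_epochs = 1          # num_train_epochs

steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
# The final batch is smaller because 1000 is not a multiple of 128.
last_batch_size = num_train_rows - (steps_per_epoch - 1) * batch_size

print(total_steps, last_batch_size)
```

With only 8 optimizer steps, the `No log` entry under Training Loss is most likely because the default `logging_steps` value (500, as far as I know) was never reached. The validation loss of about 0.67, close to ln 2 ≈ 0.69, also suggests the model has barely started to learn after one epoch.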

---
# Evaluate the model

## Generate predictions for the test set

In [16]:
# Write your code here. Add as many boxes as you need.
predictions = trainer.predict(tokenized_datasets['test'])
logits = predictions.predictions

## Extract the predictions (class 0 or 1) from the logits

In [17]:
# Write your code here. Add as many boxes as you need.
predicted_labels = logits.argmax(axis=-1)
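
Taking `argmax` directly on the logits is enough: softmax is monotonic, so converting to probabilities first never changes which class wins. A small pure-Python check on made-up logits (the helper functions and values below are illustrative only):

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


toy_logits = [[1.2, -0.3], [-0.8, 0.4], [0.1, 0.2]]
labels_from_logits = [argmax(row) for row in toy_logits]
labels_from_probs = [argmax(softmax(row)) for row in toy_logits]

print(labels_from_logits, labels_from_probs)
```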

## Analyze the performance of the model

In [19]:
# Write your code here. Add as many boxes as you need.
print(compute_metrices((logits, predictions.label_ids)))

{'accuracy': 0.7618135376756067, 'precision': 0.7313769751693002, 'recall': 0.8275862068965517, 'f1': 0.7765128819652487}
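
To see how the four numbers relate, the sketch below recomputes them from confusion-matrix counts. The counts are not printed by the notebook; they are inferred from the reported metrics together with the class supports (1566 per class) shown in the bonus-task test report, so treat them as a reconstruction.

```python
# Inferred counts for the positive class (label 1); an assumption, see above.
tp, fp, fn, tn = 1296, 476, 270, 1090

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the sentences flagged toxic, how many were toxic
recall = tp / (tp + fn)      # of the toxic sentences, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Under this reconstruction the model misses 270 toxic sentences and raises 476 false alarms, which is why recall is higher than precision.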


# Laboratory Exercise - Bonus Task (+ 2 points)

Implement a simple machine learning pipeline to classify whether a given text is **toxic** or not. Use TF-IDF vectorization to convert the text into numerical features and train a `MultinomialNB` model. If needed, use `RandomUnderSampler()`. Compare the results with the transformer model.

In [20]:
# Write your code here. Add as many boxes as you need.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

In [21]:
import pandas as pd
train_data = pd.read_csv('data/train.tsv', sep='\t')
test_data = pd.read_csv('data/test.tsv', sep='\t')

In [22]:
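# RandomUnderSampler expects a 2-D feature array, so reshape the text column to (n_samples, 1).
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(train_data['text'].values.reshape(-1, 1), train_data['label'])
# Flatten back to a 1-D array of strings, which is what TfidfVectorizer expects.
X_resampled = X_resampled.flatten()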
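
For intuition, random undersampling simply discards examples from the larger class until both classes have the same size. The following is a minimal pure-Python sketch of that idea on toy data; `undersample` is a hypothetical helper, and it will not pick the same rows as `RandomUnderSampler`.

```python
import random
from collections import Counter, defaultdict


def undersample(texts, labels, seed=42):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = min(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        for text in rng.sample(items, target):
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


texts = [f"sentence {i}" for i in range(10)]
labels = [0] * 7 + [1] * 3          # toy imbalance: 7 vs 3
X_small, y_small = undersample(texts, labels)

print(Counter(y_small))
```

In this notebook the validation report shows 99 examples per class in a 20% split, which suggests the resampled training data is already close to the original 1000 rows, so undersampling removes very little here.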

In [23]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', MultinomialNB())
])
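
As a rough picture of what the `tfidf` step produces, the sketch below computes TF-IDF weights by hand. It is a simplification: it splits on whitespace, whereas `TfidfVectorizer` uses its own token pattern, and it uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which I believe matches the scikit-learn defaults. The `tfidf` function and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


vecs = tfidf(["you are great", "you are awful", "great great day"])
for vec in vecs:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in fewer documents (such as `awful` above) receive a larger weight than terms shared across documents (such as `you`), and these weights are the features `MultinomialNB` is trained on.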

In [24]:
X_train, X_val, y_train, y_val = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

In [25]:
pipeline.fit(X_train, y_train)

In [26]:
y_pred_val = pipeline.predict(X_val)
y_pred_test = pipeline.predict(test_data['text'])

In [27]:
print("Validation Metrics:")
print(classification_report(y_val, y_pred_val))
print("Test Metrics:")
print(classification_report(test_data['label'], y_pred_test))

Validation Metrics:
              precision    recall  f1-score   support

           0       0.76      0.70      0.73        99
           1       0.72      0.78      0.75        99

    accuracy                           0.74       198
   macro avg       0.74      0.74      0.74       198
weighted avg       0.74      0.74      0.74       198

Test Metrics:
              precision    recall  f1-score   support

           0       0.82      0.69      0.75      1566
           1       0.73      0.85      0.79      1566

    accuracy                           0.77      3132
   macro avg       0.78      0.77      0.77      3132
weighted avg       0.78      0.77      0.77      3132
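
To make the requested comparison explicit, the snippet below places the test-set figures reported above side by side. The DistilBERT values are rounded from the `compute_metrices` output; the Naive Bayes values are the accuracy and the class `1` row of the test classification report, which is only printed to two decimals.

```python
reported = {
    "DistilBERT (1 epoch)": {"accuracy": 0.7618, "precision": 0.7314, "recall": 0.8276, "f1": 0.7765},
    "TF-IDF + MultinomialNB": {"accuracy": 0.77, "precision": 0.73, "recall": 0.85, "f1": 0.79},
}

metrics = ["accuracy", "precision", "recall", "f1"]
print(f"{'model':<24}" + "".join(f"{m:>11}" for m in metrics))
for name, scores in reported.items():
    print(f"{name:<24}" + "".join(f"{scores[m]:>11.3f}" for m in metrics))

best_f1 = max(reported, key=lambda name: reported[name]["f1"])
print("Highest F1:", best_f1)
```

On these numbers the Naive Bayes baseline is slightly ahead on recall and F1 and roughly level on precision. That is plausible given that DistilBERT saw only 1000 training sentences for a single epoch (8 steps) with inputs truncated to 15 tokens; the gap is small enough that a different seed or a longer training run could reverse it.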

