---
# Install the required packages

If needed, install the following packages:

In [1]:
# !pip install datasets transformers imbalanced-learn evaluate

---
# Imports

In [2]:
import pandas as pd
from datasets import load_dataset

# Write your code here. Add as many boxes as you need.

---
# Laboratory Exercise - Run Mode (8 points)

## Introduction

The primary objective of this laboratory assignment is to fine-tune a pre-trained language model to detect toxic sentences (binary classification).

The dataset contains two attributes: 
- `text`: The sentence to be classified as toxic or non-toxic
- `label`: 0/1 indicator of whether the given sentence is toxic

**Note: You are required to perform this laboratory assignment on your local machine.**

# Read the data

The code for reading the dataset is provided. Just run the following cells.

**DO NOT MODIFY IT! Just study how the data is read, since this part will not be provided in the future.**

In [3]:
import os
os.environ["HF_DATASETS_CACHE"] = "disable"

In [4]:
# dataset = load_dataset(
#     'csv', 
#     data_files={'train': 'data/train.tsv', 'val': 'data/val.tsv','data': 'data/test.tsv'},
#     delimiter='\t'
# )

In [5]:
import pandas as pd
from datasets import Dataset, DatasetDict

# Step 1: Load TSV files into pandas DataFrames
train_df = pd.read_csv('data/train.tsv', delimiter='\t')
val_df = pd.read_csv('data/val.tsv', delimiter='\t')
test_df = pd.read_csv('data/test.tsv', delimiter='\t')

# Step 2: Convert pandas DataFrames to Hugging Face Dataset objects
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

# Step 3: Combine into a DatasetDict (optional, for train/val/test split handling)
dataset = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset,
    "test": test_dataset
})

# Verify the dataset
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 3130
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3132
    })
})


**The prediction target column MUST be named 'label' in the dataset!**

The dataset structure is shown in the output above.

---
# Natural Language Processing

## Generate the Tokenizer and Data Collator

For the purposes of this lab you will be using `DistilBertTokenizer` and `DataCollatorWithPadding`.

In [6]:
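# Write your code here. Add as many boxes as you need.
from transformers import AutoTokenizer, DataCollatorWithPadding

# Note: the task asks for the DistilBERT tokenizer ("distilbert-base-uncased").
# This run loads "bert-base-uncased" instead, so the tokenized output also
# contains `token_type_ids`, which a DistilBERT tokenizer would not produce.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Quick check: tokenize the raw training sentences (no truncation or padding yet)
tokenized_texts = tokenizer(dataset["train"]["text"])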

## Tokenize the dataset

To reduce the computational cost, set the `max_length` parameter to 15.

In [7]:
def tokenize(sample):
    return tokenizer(sample["text"], truncation=True, padding=True, max_length=15)

In [8]:
# Write your code here. Add as many boxes as you need.
# Reuse the tokenizer created earlier and apply `tokenize` to every split;
# batched=True passes many rows to `tokenize` per call.
tokenized_dataset = dataset.map(tokenize, batched=True)

In [9]:
tokenized_dataset["train"]

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})

In [10]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
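
As a rough illustration of what dynamic padding means, the sketch below pads a batch of token-id lists to the length of the longest one and builds the matching attention masks. It is a plain-Python approximation, not the library's implementation: `pad_batch` and the sample ids are hypothetical, and a pad id of 0 is assumed (it matches `[PAD]` in the BERT uncased vocabulary).

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two sentences of different length
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to one global length keeps short batches small, which is the main reason to leave padding to the collator.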

## Define the model

The required model for this lab is the `DistilBertForSequenceClassification`.

In [11]:
from transformers import DistilBertForSequenceClassification

# Write your code here. Add as many boxes as you need.
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Define the training arguments

To reduce compute time, the following parameters are recommended:
- per_device_train_batch_size=128
- per_device_eval_batch_size=128
- **num_train_epochs=1**

In [12]:
# Write your code here. Add as many boxes as you need.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="trainer",
    eval_strategy="epoch",  # evaluate at the end of every epoch
    per_device_train_batch_size=128,  # batch size for training
    per_device_eval_batch_size=128,  # batch size for evaluation
    metric_for_best_model="f1",  # metric name reported by compute_metrics
    num_train_epochs=1,
)




## Load the metrics

Load the best metric for this specific problem.

In [21]:
# Write your code here. Add as many boxes as you need.
import evaluate
import numpy as np

metric = evaluate.load("f1")

### Define the function to compute the metrics

In [22]:
# Write your code here. Add as many boxes as you need.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")
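
To make the `average="weighted"` option used above more concrete, here is a minimal plain-Python sketch of the same idea: compute F1 per class, then average the per-class scores weighted by how often each class occurs in the true labels. The helper names and the toy labels are hypothetical, and the sketch is meant as an explanation rather than a replacement for the `evaluate` metric.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's share of y_true."""
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, c) * y_true.count(c) / n
        for c in sorted(set(y_true))
    )


# Hypothetical labels: four non-toxic (0) and two toxic (1) sentences
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(f1_for_class(y_true, y_pred, 0))  # 0.75
print(f1_for_class(y_true, y_pred, 1))  # 0.5
print(weighted_f1(y_true, y_pred))
```

In this toy case the weighted score sits closer to the majority-class F1 than to the toxic-class F1. If the toxic class is the one of interest, reporting the F1 of class 1 on its own may be a useful complement.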

## Generate the Trainer object

In [25]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3130
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3132
    })
})

In [26]:
# Write your code here. Add as many boxes as you need.
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

## Train the model

Use the trainer to train the model.

In [27]:
# Write your code here. Add as many boxes as you need.
trainer.train()


Epoch,Training Loss,Validation Loss,F1
1,No log,0.554275,0.785005


TrainOutput(global_step=8, training_loss=0.631373941898346, metrics={'train_runtime': 12.3546, 'train_samples_per_second': 80.941, 'train_steps_per_second': 0.648, 'total_flos': 3880880820000.0, 'train_loss': 0.631373941898346, 'epoch': 1.0})
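
The `global_step=8` in the output can be checked by hand: 1000 training rows at a batch size of 128 need eight batches per epoch, the last one being partial. The short calculation below assumes a single device and no gradient accumulation. The `No log` entry under Training Loss is most likely because the run ends before the Trainer's default logging interval (`logging_steps=500`) is reached.

```python
import math

num_train_rows = 1000  # from the DatasetDict printout
batch_size = 128       # per_device_train_batch_size
num_epochs = 1         # num_train_epochs

# The final, smaller batch still counts as one optimizer step
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 8
```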

---
# Evaluate the model

## Generate predictions for the test set

In [29]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 3130
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3132
    })
})

In [31]:
dataset['test']

Dataset({
    features: ['text', 'label'],
    num_rows: 3132
})

In [33]:
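# Write your code here. Add as many boxes as you need.
# Predict on the tokenized test split. Passing the raw `dataset['test']` fails with
# "ValueError: You should supply an encoding or a list of encodings to this method
# that includes input_ids, but you provided ['label']" because it has no `input_ids`.
predictions = trainer.predict(tokenized_dataset['test'])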

## Extract the predictions (class 0 or 1) from the logits

In [18]:
# Write your code here. Add as many boxes as you need.
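
One possible way to fill in this step, shown on made-up numbers: the object returned by `trainer.predict(...)` exposes the raw logits in its `.predictions` attribute, with one row per example and one column per class, and the predicted class is the column with the largest logit. The `logits` array below is hypothetical and only stands in for that attribute.

```python
import numpy as np

# Hypothetical logits for three sentences; in the notebook they would come
# from the `.predictions` attribute of the trainer.predict(...) result.
logits = np.array([[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]])

# Pick the class (0 = non-toxic, 1 = toxic) with the highest logit per row
predicted_labels = np.argmax(logits, axis=-1)
print(predicted_labels)  # [0 1 0]
```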

## Analyze the performance of the model

In [19]:
# Write your code here. Add as many boxes as you need.
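
A minimal sketch of one way to analyze the results, assuming scikit-learn is installed (it is a dependency of `imbalanced-learn`). The label lists are hypothetical placeholders; in the notebook, `y_true` would be the test labels and `y_pred` the classes extracted from the logits.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels (0 = non-toxic, 1 = toxic)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1, plus macro and weighted averages
print(classification_report(y_true, y_pred, target_names=["non-toxic", "toxic"]))
```

The per-class rows of the report are worth reading alongside the single weighted F1 value, since a model can score well overall while missing many toxic sentences.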

# Laboratory Exercise - Bonus Task (+ 2 points)

Implement a simple machine learning pipeline to classify whether a given text is **toxic**. Use TF-IDF vectorization to convert the text into numerical features and train a `MultinomialNB` model. If needed, use `RandomUnderSampler()`. Compare the results with those of the transformer model.

In [20]:
# Write your code here. Add as many boxes as you need.
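
A possible starting point for the bonus task, sketched on a handful of invented sentences so that it runs on its own. The texts and labels are hypothetical; in the notebook the pipeline would be fitted on `train_df["text"]` and `train_df["label"]` and evaluated on the test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real training split
train_texts = [
    "you are a stupid idiot",
    "i hate you so much",
    "shut up you moron",
    "have a nice day",
    "thank you for your help",
    "the weather is lovely today",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each text into a sparse feature vector; Naive Bayes classifies it
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(train_texts, train_labels)

preds = baseline.predict(["you are an idiot", "have a lovely day"])
print(preds)
```

If the real training data turns out to be imbalanced, `RandomUnderSampler` could be placed between the vectorizer and the classifier using `make_pipeline` from `imblearn.pipeline`, which accepts samplers as steps; that variant is left out here.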

---