# 032-03 - Fine Tuning - Solution Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

## About 

Context : 

Let's Continue the Party! 

Data  : 

**You can find the dataset [here](https://huggingface.co/docs/datasets/en/index).**




## Preliminaries

### System

These commands show and change the working directory.

Uncomment these lines if needed. 

In [None]:
# pwd

In [None]:
# cd ..

In [None]:
# ls

In [None]:
# cd ..

In [None]:
# ls

### Install

In [1]:
#! pip install transformers datasets evaluate

In [None]:
# Install PyTorch built for CUDA 11.6 (pick the index URL that matches your CUDA version)
# !pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

### Imports

In [None]:
import os

In [None]:
import pandas as pd

# numpy is required by compute_metrics (np.argmax)
import numpy as np

In [None]:
import torch

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

In [None]:
from transformers import TrainingArguments

In [None]:
import evaluate
from transformers import Trainer

### Data

Load the data, using the first 10% of the CSV for training and the next 10% for evaluation: 

In [None]:
dataset_train = load_dataset(
    "csv",
    data_files="archive/amazon Food Reviews 100k Dataset.csv",
    split="train[:10%]",
)

dataset_eval = load_dataset(
    "csv",
    data_files="archive/amazon Food Reviews 100k Dataset.csv",
    split="train[10%:20%]",
)
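
The split strings use the slicing syntax of `datasets`: `train[:10%]` takes the first tenth of the rows and `train[10%:20%]` the next tenth, so the training and evaluation sets should not overlap. Below is a minimal plain-Python sketch of that arithmetic. It assumes the CSV holds 100,000 rows (as the file name suggests) and uses a hypothetical helper, `percent_slice`, that is not part of the library; the library's own rounding may differ when the row count is not a multiple of 100.

```python
def percent_slice(n_rows, start_pct, end_pct):
    """Return the half-open row range [start, end) covered by a percent slice.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    start = n_rows * start_pct // 100
    end = n_rows * end_pct // 100
    return range(start, end)


n_rows = 100_000  # assumption: row count implied by the file name
train_rows = percent_slice(n_rows, 0, 10)   # mirrors split="train[:10%]"
eval_rows = percent_slice(n_rows, 10, 20)   # mirrors split="train[10%:20%]"

print(len(train_rows), len(eval_rows))
# The two ranges share no row index.
print(set(train_rows).isdisjoint(eval_rows))
```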

Rename the columns to `text` and `label`, and drop the unused `Id` column: 

In [None]:
dataset_train = (
    dataset_train.rename_column("Review", "text")
    .rename_column("Rating", "label")
    .remove_columns(["Id"])
)

dataset_eval = (
    dataset_eval.rename_column("Review", "text")
    .rename_column("Rating", "label")
    .remove_columns(["Id"])
)

## Fine tuning

### Tokenizer

Init tokenizer : 

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer

Apply tokenizer : 


In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [None]:
tokenized_dataset_train = dataset_train.map(tokenize_function, batched=True)
tokenized_dataset_eval = dataset_eval.map(tokenize_function, batched=True)
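
With `padding="max_length"` and `truncation=True`, every review ends up with the same number of token ids: long reviews are cut and short ones are filled with a padding id, while an attention mask marks the real tokens. The toy sketch below illustrates the idea in plain Python. The function `pad_or_truncate` and the token ids are made up for illustration, and the sketch ignores special tokens such as `[CLS]` and `[SEP]`; it is not how the tokenizer is implemented.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of padding="max_length" combined with truncation=True.

    Illustration only: a real tokenizer also handles special tokens,
    which this sketch ignores.
    """
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([11, 12, 13], max_length=5)
long = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)
print(short)
print(long)
```

Since no `max_length` is passed in `tokenize_function`, the tokenizer should fall back to the model's maximum input length.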

### Process labels

Shift each label by -1: the ratings run from 1 to 5, but the model expects class indices from 0 to 4.

In [None]:
def minus_1(example):
    example["label"] = example["label"] - 1
    return example

Apply the shift to both datasets: 

In [None]:
tokenized_dataset_train = tokenized_dataset_train.map(minus_1)
tokenized_dataset_eval = tokenized_dataset_eval.map(minus_1)
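
A quick sanity check on the shifted labels can catch out-of-range classes before training starts. The sketch below works on a plain list rather than on the dataset itself and assumes the ratings are integer stars from 1 to 5; `shift_rating` is a hypothetical stand-in for `minus_1`.

```python
NUM_LABELS = 5  # matches num_labels=5 used for the model in this notebook


def shift_rating(rating):
    """Map a 1-based star rating to a 0-based class index."""
    return rating - 1


ratings = [1, 2, 3, 4, 5]  # assumption: ratings are integer stars from 1 to 5
labels = [shift_rating(r) for r in ratings]
print(labels)

# Every label must be a valid class index for a 5-way classifier.
assert all(0 <= label < NUM_LABELS for label in labels)
```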

Inspect the result as a pandas DataFrame: 

In [None]:
tokenized_dataset_train.to_pandas().head()

### Load the model

Start by loading your model and specify the number of expected labels.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=5
)

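Manage CUDA

In [None]:
# Setting CUDA_VISIBLE_DEVICES to "" hides every GPU and forces CPU execution.
# Uncomment only if you want to train on the CPU.
# os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [None]:
# Use the GPU when one is visible, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)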

### Training hyperparameters

Specify where to save the checkpoints from your training with `output_dir`.

Set the `evaluation_strategy` parameter in your training arguments to report the evaluation metric at the end of each epoch (recent releases of `transformers` rename this parameter to `eval_strategy`):

In [None]:
training_args = TrainingArguments(
    output_dir="test_trainer", evaluation_strategy="epoch"
)

### Evaluate

Our Metric : 

In [None]:
metric = evaluate.load("accuracy")

Metric function : 

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
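
`compute_metrics` picks, for each review, the class with the highest logit and compares it with the true label. The plain-Python sketch below reproduces that computation on made-up scores, without `numpy` or `evaluate`; the helpers `argmax` and `accuracy` are illustrative names, not library functions.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up scores for 3 reviews and 5 classes.
toy_logits = [
    [0.1, 0.2, 3.0, 0.0, 0.5],   # highest score at class 2
    [2.5, 0.1, 0.3, 0.2, 0.0],   # highest score at class 0
    [0.0, 0.1, 0.2, 0.3, 4.0],   # highest score at class 4
]
toy_labels = [2, 1, 4]

# Two of the three predictions match their label.
print(accuracy(toy_logits, toy_labels))
```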

### Trainer

Init the trainer : 

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_train,
    eval_dataset=tokenized_dataset_eval,
    compute_metrics=compute_metrics,
)

Train : 

In [None]:
trainer.train()

Eval : 

In [None]:
trainer.evaluate()

Check which GPUs are available:

In [None]:
if torch.cuda.is_available():
    print("Number of GPU devices:", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print("Device name:", torch.cuda.get_device_name(i))
else:
    print("No GPU available.")

In [None]:
# NVIDIA driver downloads: https://www.nvidia.com/Download/index.aspx