# Statistical Learning - Final Project
## Model Finetuning
This notebook showcases our finetuning process using the `transformers` library.

Link to dataset: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail 

|Team member|Student ID|
|-----------|----------|
|Nguyễn Duy Đăng Khoa|21127078|
|Phạm Nguyễn Quốc Thanh|21127428|
|Nguyễn Vũ Minh Khôi|21127518|
|Âu Dương Khang|21127621|

**NOTE**: Because our dataset is so large (>300k rows), we ran the actual finetuning on a hosted **IBM Cloud** instance (~300GB RAM, 2 V100 GPUs). As a result, the code in this notebook may differ slightly from the notebook we actually used.  
The difference lies mainly in the **data loading** phase, where we had to adapt to working with `Cloud Object Storage` & `boto`.

## Loading the data

In [None]:
!pip install -q datasets
!pip install -q evaluate

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.
ibis-framework 8.0.0 requires pyarrow<16,>=2, but you have pyarrow 17.0.0 which is incompatible.

In [None]:
!pip install -q rouge_score
!pip install -q bert_score


In [None]:
!pip install --upgrade -q pyarrow==15.0.2

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 15.0.2 which is incompatible.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Because the dataset is so large, we pre-tokenized it once and saved the result to disk (the tokenization code is also available below).
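
The "load it if it is already on disk, otherwise tokenize" pattern can be sketched as a small helper. This is only an illustration: `load_or_build`, `load_fn`, and `build_fn` are hypothetical names that do not appear in our notebook, and the helper works with any loader and builder passed to it.

```python
from pathlib import Path


def load_or_build(cache_dir, load_fn, build_fn):
    """Load a cached artifact if its directory exists, otherwise build it.

    All names here are hypothetical and used for illustration only.
    """
    cache_dir = Path(cache_dir)
    if cache_dir.is_dir():
        # A previous run already saved the result here
        return load_fn(str(cache_dir))
    # Nothing cached yet: fall back to the (slow) build step
    return build_fn()
```

With `datasets`, one possible call would pass `Dataset.load_from_disk` as `load_fn` and a function that runs the tokenization step as `build_fn`.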

In [None]:
# LOAD THE TOKENIZED DATASET IF AVAILABLE ON DISK
# If loaded, ignore all actions in "loading the data & tokenize" and move to trainer initialization
from datasets import Dataset

tokenized_train_dataset = Dataset.load_from_disk('/content/drive/MyDrive/StatisticalLearning_Datasets/tokenized/train')
tokenized_test_dataset = Dataset.load_from_disk('/content/drive/MyDrive/StatisticalLearning_Datasets/tokenized/test')
tokenized_validation_dataset = Dataset.load_from_disk('/content/drive/MyDrive/StatisticalLearning_Datasets/tokenized/validation')

In [None]:
# path = '/content/drive/MyDrive/StatisticalLearning_Datasets/tokenized_csv/'

# tokenized_train_dataset.to_csv(path + 'train.csv')
# tokenized_test_dataset.to_csv(path + 'test.csv')
# tokenized_validation_dataset.to_csv(path + 'validation.csv')


Loading the data & tokenize

In [None]:
import pandas as pd

train_df = pd.read_csv('/content/drive/MyDrive/StatisticalLearning_Datasets/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/StatisticalLearning_Datasets/test.csv')
validation_df = pd.read_csv('/content/drive/MyDrive/StatisticalLearning_Datasets/validation.csv')

In [None]:
# Drop the 'id' column, which is not used for training
train_df.drop('id', axis=1, inplace=True)
test_df.drop('id', axis=1, inplace=True)
validation_df.drop('id', axis=1, inplace=True)

In [None]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
validation_dataset = Dataset.from_pandas(validation_df)

In [None]:
train_dataset[0]

{'article': "By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A . State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection through contaminated food while attending a conference for new

## Initialize the tokenizer and the model + Trainer class

In [None]:
import torch

#Enable GPU if applicable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cpu


In [None]:
from transformers import AutoTokenizer

#Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Falconsai/text_summarization")

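def tokenize_function(examples):
    # Articles become the model inputs and highlights become the labels (via text_target);
    # both are truncated and padded to the tokenizer's maximum length
    model_inputs = tokenizer(examples['article'], text_target=examples['highlights'], truncation=True, padding="max_length")

    return model_inputs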



In [None]:
#Skip if already tokenized

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)
tokenized_validation_dataset = validation_dataset.map(tokenize_function, batched=True)
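
To make the effect of `truncation=True, padding="max_length"` concrete, here is a toy sketch on plain Python lists. It is not the tokenizer's implementation and it ignores special tokens such as the end-of-sequence marker; `pad_or_truncate`, the pad ID of 0, the sample IDs, and the length of 8 are assumptions chosen for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return exactly max_length IDs: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_length]                   # what truncation does
    padding = [pad_id] * (max_length - len(clipped))   # what max_length padding does
    return clipped + padding


short = pad_or_truncate([71, 1712, 1], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short)
print(long)
```

Every row ends up with the same length, which simplifies batching but also stores many padding tokens for short highlights.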

In [None]:
# OPTIONAL: Save the tokenized datasets to disk for later use
# path = '/content/drive/MyDrive/StatisticalLearning_Datasets/tokenized/'

# tokenized_train_dataset.save_to_disk(path + 'train')
# tokenized_test_dataset.save_to_disk(path + 'test')
# tokenized_validation_dataset.save_to_disk(path + 'validation')

Saving the dataset (0/7 shards):   0%|          | 0/287113 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/11490 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/13368 [00:00<?, ? examples/s]

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Falconsai/text_summarization").to(device)

Trainer arguments

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,   # Required for text generation tasks
    logging_dir='./logs',
)
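
As a rough check on training cost, the number of optimizer steps implied by these arguments can be worked out by hand. The sketch below assumes the 287,113 training examples reported when saving the tokenized dataset, no gradient accumulation, and no dropped final batch, and it treats the device count as a parameter. `optimizer_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_devices, num_epochs):
    """Estimate total optimizer steps, assuming no gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    # The last, partially filled batch still counts as one step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


single_gpu = optimizer_steps(287_113, 32, num_devices=1, num_epochs=3)
dual_gpu = optimizer_steps(287_113, 32, num_devices=2, num_epochs=3)
print(single_gpu, dual_gpu)
```

Under these assumptions, one device needs 26,919 steps for three epochs, and two devices need 13,461.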

Setting evaluation metric

In [None]:
import numpy as np
import evaluate

metric_rouge = evaluate.load("rouge")
metric_bertscore = evaluate.load("bertscore")

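def compute_metrics(eval_pred):
    # EvalPrediction exposes its fields as attributes, not dictionary keys
    predictions, labels = eval_pred.predictions, eval_pred.label_ids

    # Swap the -100 ignore index for the pad token so the IDs can be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # ROUGE and BERTScore compare strings, so decode the token IDs first
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    rouge_score = metric_rouge.compute(predictions=decoded_preds, references=decoded_labels)
    # The articles and highlights are in English, hence lang='en'
    bert_score = metric_bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang='en')

    # Return flat scalars: the ROUGE scores plus the mean per-example BERTScore F1
    return {**rouge_score, 'bertscore_f1': float(np.mean(bert_score['f1']))}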
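
For intuition about what ROUGE measures, here is a simplified ROUGE-1 F1 computed on lowercased, whitespace-split tokens. It is a sketch only: the `rouge` metric loaded above applies its own tokenization, so its scores can differ. `rouge1_f1` is a hypothetical helper, and the two sentences are made up for the example.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word is matched at most as often as it occurs in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the bishop has hepatitis", "the bishop was diagnosed with hepatitis")
print(round(score, 3))
```

Here three of the four predicted words appear in the six-word reference, giving a precision of 0.75, a recall of 0.5, and an F1 of 0.6.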

In [None]:
# Create the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

## Start training & Evaluate on test set

In [None]:
trainer.train()