# <font color = 'indianred'>**Sentiment Analysis with the IMDB Dataset using Pre-Trained model** </font>

**Objective:**

In this notebook, we aim to build upon the foundational concepts and techniques introduced in the first notebook, "Sentiment Analysis using Hugging Face Ecosystem." We focus on enhancing the sentiment analysis model's performance by leveraging a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. The techniques and methods introduced are aligned with current industry standards, making the insights and skills gained from this notebook directly applicable to real-world problems.


**Plan**

1. Set Environment
2. Load Dataset
3. Accessing and Manipulating Splits
4. Load Pre-trained Tokenizer
5. Create Function for Tokenizer
6. Train Model
  1. Download pre-trained model <br>
  2. Download and modify the model config file <br>
  3. Compute Metric Function <br>
  4. Training Arguments <br>
  5. Instantiate Trainer <br>
  6. Setup WandB <br>
  7. Training and Validation
7. Performance on Test Set
8. Model Inference

-----------
**Previous Process**

<img src ="https://drive.google.com/uc?export=view&id=1XkQs_Ohx_bdj3XmliqrKSQmQvBFwL-gJ" width =1000>

-----
**Revised Process**

<img src ="https://drive.google.com/uc?export=view&id=1fuxcrnb4hMlQsYBHhaTsiduV1uBxV58p" width =600>




















# <font color = 'indianred'> **1. Setting up the Environment** </font>



<font color = 'indianred'> *Load Libraries* </font>

In [None]:
## Distill BERT

In [None]:
!pip install torchtext -qq
!pip install transformers evaluate wandb datasets accelerate -U -qq ## NEW LINES ##

In [None]:
basepath = '/content/drive/MyDrive/hw6'

In [None]:
# standard data science libraries for data handling and visualization
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
import seaborn as sns
from pathlib import Path


# New libraries introduced in this notebook
import evaluate
from datasets import load_dataset, DatasetDict
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import AutoConfig
from transformers import pipeline
import wandb

import torch
import torch.nn as nn

# <font color = 'indianred'> **2. Load Dataset**</font>
    


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv("/content/drive/MyDrive/Natural Language Processing Datasets/train.csv")
test = pd.read_csv("/content/drive/MyDrive/Natural Language Processing Datasets/test.csv")

# <font color = 'indianred'> **3. Accessing and Manipulating Splits**</font>



<font color = 'indianred'>*Extract Splits*</font>

In [None]:
Text = df['Tweet'].apply(lambda x: x.lower())
Labels = df.drop(['ID','Tweet'],axis=1)


In [None]:
y = Labels.iloc[:,:].values.astype(float)


In [None]:
X =Text.values

In [None]:
test = test['Tweet'].apply(lambda x: x.lower())
testset = test.values


In [None]:
len(Text.values)

7724

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
from datasets import Dataset,DatasetDict
trainset = Dataset.from_dict({
    'text': X_train,
    'label': y_train
})

validset = Dataset.from_dict({
    'text': X_val,
    'label': y_val
})
testset = Dataset.from_dict({
    'text':testset

})


In [None]:
train_val = DatasetDict(
    {"train": trainset, "valid": validset, 'test': testset})

<font color = 'indianred'>*Create further subdivisions of the splits*</font>

<font color = 'indianred'>*Small subset for initial experimentation*</font>

# <font color = 'indianred'>**4. Load pre-trained Tokenizer**</font>



In [None]:
checkpoint = "google/flan-t5-base"
# problem_type is a model/config option, not a tokenizer option, so it is not passed here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

# <font color = 'indianred'> **5. Create function for Tokenizer**</font>



In [None]:
def tokenize_fn(batch):
    return tokenizer(text = batch["text"], truncation=True, padding=True, return_tensors="pt")

<font color = 'indianred'> *Use map function to apply tokenization to all splits*</font>

In [None]:

tokenized_dataset= train_val.map(tokenize_fn, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(
    ['text']
)
tokenized_dataset.set_format(type='torch')

Map:   0%|          | 0/6951 [00:00<?, ? examples/s]

Map:   0%|          | 0/773 [00:00<?, ? examples/s]

Map:   0%|          | 0/3259 [00:00<?, ? examples/s]

In [None]:
tokenized_dataset['test'][0]  # inspect the first tokenized test example

{'input_ids': tensor([  101,  1030,  4748,  7229,  1035,  1035,  6275,  2575,  1035,  1035,
          1030,  2004, 29337, 17048,  9148,  4095,  2123,  2102,  4737,  2796,
          2390,  2003,  2006,  2049,  3971,  2000, 18365,  2035, 15554,  2000,
          3109,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0])}
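
The example above shows what `padding=True` does: shorter sequences are filled with a pad id (0 here) and `attention_mask` marks real tokens with 1 and padding with 0. The following is a minimal pure-Python sketch of that idea only. The helper `pad_batch` and the token ids are made up for illustration, and this is not how the Hugging Face tokenizer is implemented internally.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length in the batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens first, then pad ids up to the batch maximum
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy "sentences" of different lengths (hypothetical ids)
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length depends on the longest sequence in each batch, examples tokenized in different batches can end up with different padded lengths.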

#  <font color = 'indianred'> **6. Model Training**</font>

##  <font color = 'indianred'> **6.1 Download pre-trained model**</font>

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 11, problem_type = 'multi_label_classification')  # We are using the same checkpoint as we used for the tokenizer


config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at google/flan-t5-base and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
def calculate_pos_weights(dataset):
    # Initialize counters for all labels
    num_labels = len(dataset['train']['label'][0])
    total_positives = [0] * num_labels
    total_negatives = [0] * num_labels

    # Count positives and negatives for each label
    for label_array in dataset['train']['label']:
        for i, label in enumerate(label_array):
            if label == 1:
                total_positives[i] += 1
            else:
                total_negatives[i] += 1

    # Calculate pos_weight for each label
    pos_weight = [total_negatives[i] / max(total_positives[i], 1) for i in range(num_labels)]
    return torch.tensor(pos_weight)

# Calculate the pos_weight using the training set
pos_weights = calculate_pos_weights(train_val)


In [None]:
pos_weights

tensor([ 1.6973,  5.9930,  1.6420,  4.7304,  1.6848,  8.2434,  2.3743,  7.6455,
         2.3841, 17.8886, 19.0896])

In [None]:
pos_weights = [2,4,2,4,2,4,2.5,4,2.5,4,4]

In [None]:
pos_weights = torch.tensor(pos_weights)

In [None]:
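class CustomTrainer(Trainer):
    # Override the loss so that rare labels get a larger weight via pos_weight.
    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer
    # transformers releases pass to compute_loss.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels").float()  # Ensure labels are float for BCE loss
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # Move the weights to the same device as the model
        device = next(model.parameters()).device

        loss_fct = nn.BCEWithLogitsLoss(pos_weight=pos_weights.to(device))
        loss = loss_fct(logits, labels)

        return (loss, outputs) if return_outputs else loss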
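
To see what `pos_weight` does to the loss, here is a small pure-Python sketch of the weighted binary cross-entropy formula, averaged over all elements as `nn.BCEWithLogitsLoss` does by default. The function `weighted_bce_with_logits` and the toy numbers are illustrative assumptions; the sketch is not numerically stable for large logits and is not a replacement for the PyTorch loss.

```python
import math


def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean over all elements of -(w * y * log(p) + (1 - y) * log(1 - p)), p = sigmoid(x)."""
    total, count = 0.0, 0
    for row_x, row_y in zip(logits, targets):
        for x, y, w in zip(row_x, row_y, pos_weight):
            p = 1.0 / (1.0 + math.exp(-x))
            # Only the positive term is scaled by the label's weight
            total += -(w * y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# One example, two labels: the first is positive, the second negative
logits = [[0.0, 0.0]]
targets = [[1.0, 0.0]]
plain = weighted_bce_with_logits(logits, targets, pos_weight=[1.0, 1.0])
weighted = weighted_bce_with_logits(logits, targets, pos_weight=[4.0, 1.0])
print(plain, weighted)
```

With a weight of 4 on the first label, a missed positive costs four times as much as a missed negative, which pushes the model toward predicting rare labels more often.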


In [None]:
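# Note: this cell relies on `training_args` (6.4) and `compute_metrics` (6.3),
# so run it after those cells have been executed.
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)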

##  <font color = 'indianred'> **6.2 Download and Modify Model Config File**</font>

In [None]:
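# Reload the config with the task settings; assigning a plain AutoConfig would
# reset num_labels and problem_type to their defaults.
config = AutoConfig.from_pretrained(checkpoint, num_labels=11, problem_type='multi_label_classification')
model.config = config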


##  <font color = 'indianred'> **6.3 compute_metrics function** </font>



In [None]:
accuracy_metric = evaluate.load('accuracy', 'multilabel')
f1 = evaluate.load('f1','multilabel')


def compute_metrics(eval_pred):
    # accuracy_metric = evaluate.load('accuracy', 'multilabel')

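    logits, labels = eval_pred
    # Some models (e.g. T5) return a tuple of outputs; the logits come first
    if isinstance(logits, tuple):
        logits = logits[0]
    # logit > 0 is equivalent to sigmoid(logit) > 0.5
    preds = (logits > 0).astype(int)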
    accuracy = accuracy_metric.compute(predictions=preds, references=labels)
    f1_micro = f1.compute(predictions=preds, references=labels, average='micro')
    f1_macro = f1.compute(predictions=preds, references=labels, average='macro')
    return {'f1_micro':f1_micro['f1'],
            'f1_macro':f1_macro['f1'],
            'accuracy':accuracy['accuracy'],
            }

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]
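
The three metrics returned by `compute_metrics` can differ a lot on multi-label data. The sketch below recomputes them by hand on a toy example, assuming (as far as we can tell) that the `multilabel` accuracy is exact-match accuracy over all labels of an example. The helper names and the toy predictions are made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(preds, refs):
    n_labels = len(refs[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for p_row, r_row in zip(preds, refs):
        for j, (p, r) in enumerate(zip(p_row, r_row)):
            tp[j] += int(p == 1 and r == 1)
            fp[j] += int(p == 1 and r == 0)
            fn[j] += int(p == 0 and r == 1)
    # Micro: pool the counts over all labels, then compute one F1
    micro = f1_from_counts(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per label, then take the unweighted mean
    macro = sum(f1_from_counts(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    # Exact-match accuracy: every label of an example must be correct
    subset_acc = sum(p == r for p, r in zip(preds, refs)) / len(refs)
    return {"f1_micro": micro, "f1_macro": macro, "accuracy": subset_acc}


refs = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
preds = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
scores = multilabel_scores(preds, refs)
print(scores)
```

A single wrong label makes the whole example count as wrong for exact-match accuracy, which is one plausible reason the accuracy values in this notebook sit far below the F1 scores.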

## <font color = 'indianred'> **6.4 Training Arguments**</font>







In [None]:
# Define the directory where model checkpoints will be saved
run_name = "emotions_distilbert_im"
base_folder = Path(basepath)
model_folder = base_folder / "models"/run_name
# Create the directory if it doesn't exist
model_folder.mkdir(exist_ok=True, parents=True)

# Configure training parameters
training_args = TrainingArguments(
    # Training-specific configurations
    num_train_epochs=10,  # Total number of training epochs
    # Number of samples per training batch for each device
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,

    weight_decay=0.1,  # Apply L2 regularization to prevent overfitting
    learning_rate=1e-4,  # Step size for the optimizer during training
    lr_scheduler_type='linear',
    warmup_steps=0,  # Number of warmup steps for the learning rate scheduler
    optim='adamw_torch',  # Optimizer,
    max_grad_norm = 1.0,
    # max_grad_value = 1.0,

    # Checkpoint saving and model evaluation settings
    output_dir=str(model_folder),  # Directory to save model checkpoints
    evaluation_strategy='steps',  # Evaluate model at specified step intervals
    eval_steps=20,  # Perform evaluation every 20 training steps
    save_strategy="steps",  # Save model checkpoint at specified step intervals
    save_steps=20,  # Save a model checkpoint every 20 training steps
    load_best_model_at_end=True,  # Reload the best model at the end of training
    save_total_limit=2,  # Retain only the best and the most recent model checkpoints
    # Use macro F1 on the validation set as the metric to determine the best model
    metric_for_best_model="eval_f1_macro",
    greater_is_better=True,  # A model is 'better' if its macro F1 is higher


    # Experiment logging configurations
    logging_strategy='steps',
    logging_steps=20,
    report_to='wandb',  # Log metrics and results to Weights & Biases platform
    run_name=run_name,  # Experiment name for Weights & Biases

    fp16=False,
    bf16=False,
    tf32= False
)


##  <font color = 'indianred'> **6.5 Initialize Trainer**</font>



In [None]:
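# initialize trainer; CustomTrainer (defined in 6.1) applies the pos_weight-weighted BCE loss,
# whereas the plain Trainer would ignore pos_weights
trainer = CustomTrainer(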
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


## <font color = 'indianred'> **6.6 Setup WandB**</font>

In [None]:
wandb.login()
%env WANDB_PROJECT = imdb_bert

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

env: WANDB_PROJECT=imdb_bert


##  <font color = 'indianred'> **6.7 Training and Validation**</font>

In [None]:
trainer.train()  # start training

wandb: Currently logged in as: shahabashraf05 (shahabashraf). Use `wandb login --relogin` to force relogin


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

<font color = 'indianred'> *Evaluate model on Validation Set* </font>


In [None]:
eval_results = trainer.evaluate(tokenized_dataset['valid'])

In [None]:
eval_results

{'eval_loss': 0.6578822135925293,
 'eval_f1_micro': 0.6737481031866465,
 'eval_f1_macro': 0.5952107097836232,
 'eval_accuracy': 0.2069857697283312,
 'eval_runtime': 1.2687,
 'eval_samples_per_second': 609.277,
 'eval_steps_per_second': 38.622,
 'epoch': 3.0}

In [None]:
prediction = trainer.predict( tokenized_dataset['test'] )

In [None]:
prediction = (prediction[0] > 0 ).astype(int)
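
Thresholding the logits at 0 is the same as thresholding the sigmoid probabilities at 0.5, since sigmoid(0) = 0.5 and the sigmoid is increasing. A small pure-Python check of that equivalence follows. The logits are made-up values, and `prob_to_logit` is a hypothetical helper showing how a different probability threshold would translate to a logit threshold.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_to_logit(t):
    """Logit value that corresponds to probability threshold t (0 < t < 1)."""
    return math.log(t / (1.0 - t))


logits = [-2.0, -0.1, 0.0, 0.3, 1.5]
by_logit = [int(x > 0) for x in logits]
by_prob = [int(sigmoid(x) > 0.5) for x in logits]
print(by_logit, by_prob)

# A lower probability threshold (e.g. 0.3) maps to a negative logit threshold
print(prob_to_logit(0.3))
```

Lowering the threshold for rare labels is one common way to trade precision for recall, though no threshold tuning was done in this notebook.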

In [None]:
pred = pd.DataFrame(prediction)

In [None]:
path = "/content/drive/MyDrive/pred.xlsx"

In [None]:
pred.to_excel(path, index=False, header=False)

In [None]:
### Second model: DistilRoBERTa (`distilroberta-base`)


In [None]:
checkpoint = "distilroberta-base"
# problem_type is a model/config option, not a tokenizer option, so it is not passed here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [None]:
def tokenize_fn(batch):
    return tokenizer(text = batch["text"], truncation=True, padding=True, return_tensors="pt")

In [None]:
tokenized_dataset= train_val.map(tokenize_fn, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(
    ['text']
)
tokenized_dataset.set_format(type='torch')

Map:   0%|          | 0/6179 [00:00<?, ? examples/s]

Map:   0%|          | 0/1545 [00:00<?, ? examples/s]

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 11,  problem_type = 'multi_label_classification')  # We are using the same checkpoint as we used for the tokenizer


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
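# Reload the config with the task settings; assigning a plain AutoConfig would
# reset num_labels and problem_type to their defaults.
config = AutoConfig.from_pretrained(checkpoint, num_labels=11, problem_type='multi_label_classification')
model.config = config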


In [None]:
accuracy_metric = evaluate.load('accuracy', 'multilabel')
f1 = evaluate.load('f1','multilabel')


def compute_metrics(eval_pred):
    # accuracy_metric = evaluate.load('accuracy', 'multilabel')

    logits, labels = eval_pred
    # print(logits.shape)
    preds = (logits > 0).astype(int)
    accuracy = accuracy_metric.compute(predictions=preds, references=labels)
    f1_micro = f1.compute(predictions=preds, references=labels, average='micro')
    f1_macro = f1.compute(predictions=preds, references=labels, average='macro')
    return {'f1_micro':f1_micro['f1'],
            'f1_macro':f1_macro['f1'],
            'accuracy':accuracy['accuracy'],
            }

In [None]:


run_name = "emotions_test_bert"
base_folder = Path(basepath)
model_folder = base_folder / "models"/run_name
# Create the directory if it doesn't exist
model_folder.mkdir(exist_ok=True, parents=True)

# Configure training parameters
training_args = TrainingArguments(
    # Training-specific configurations
    num_train_epochs=3,  # Total number of training epochs
    # Number of samples per training batch for each device
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    # auto_find_batch_size=True,
    weight_decay=0.01,  # Apply L2 regularization to prevent overfitting
    learning_rate=2e-5,  # Step size for the optimizer during training
    optim='adamw_torch',  # Optimizer,

    # Checkpoint saving and model evaluation settings
    output_dir='/content/drive/MyDrive/hw6/saved_models',  # Directory to save model checkpoints
    evaluation_strategy='steps',  # Evaluate model at specified step intervals
    eval_steps=100,  # Perform evaluation every 100 training steps
    save_strategy="steps",  # Save model checkpoint at specified step intervals
    save_steps=100,  # Save a model checkpoint every 100 training steps
    load_best_model_at_end=True,  # Reload the best model at the end of training
    save_total_limit=2,  # Retain only the best and the most recent model checkpoints
    # Use micro F1 as the metric to determine the best model
    metric_for_best_model="f1_micro",
    greater_is_better=True,  # A model is 'better' if its micro F1 is higher

    # Experiment logging configurations
    logging_strategy='steps',
    logging_steps=100,
    report_to='wandb',  # Log metrics and results to Weights & Biases platform
    run_name=run_name,  # Experiment name for Weights & Biases

    fp16=True,
)



In [None]:
# initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


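# DataLoaderConfiguration lives in accelerate.utils; note that this object is not passed to the Trainer above
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)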


In [None]:
trainer.train()  # start training

| Step | Training Loss | Validation Loss | F1 Micro | F1 Macro | Accuracy |
|---|---|---|---|---|---|
| 100 | 0.4955 | 0.448874 | 0.244688 | 0.101194 | 0.049191 |
| 200 | 0.3977 | 0.389757 | 0.540179 | 0.271969 | 0.172168 |
| 300 | 0.3808 | 0.363328 | 0.587168 | 0.363948 | 0.196117 |
| 400 | 0.3546 | 0.354558 | 0.615122 | 0.391984 | 0.200647 |
| 500 | 0.3297 | 0.349933 | 0.598094 | 0.372742 | 0.213592 |
| 600 | 0.3231 | 0.340373 | 0.625589 | 0.438484 | 0.224595 |
| 700 | 0.3144 | 0.33295 | 0.638357 | 0.447754 | 0.219417 |
| 800 | 0.317 | 0.329499 | 0.645091 | 0.466516 | 0.242071 |
| 900 | 0.2973 | 0.326521 | 0.645401 | 0.457202 | 0.233657 |
| 1000 | 0.3004 | 0.325128 | 0.648848 | 0.46292 | 0.235599 |


TrainOutput(global_step=1161, training_loss=0.3423300027641934, metrics={'train_runtime': 233.1008, 'train_samples_per_second': 79.524, 'train_steps_per_second': 4.981, 'total_flos': 749161526826594.0, 'train_loss': 0.3423300027641934, 'epoch': 3.0})
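
The `global_step` value reported above can be reproduced from the dataset size and the batch settings. The sketch below is a back-of-the-envelope calculation, assuming a single device, no dropped last batch, and that partial accumulation groups at the end of an epoch are floored (which matches the step counts reported in this notebook, but may differ in other `transformers` versions). The function name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Estimate the number of optimizer updates for a full training run."""
    # Number of mini-batches the dataloader yields per epoch
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update per `grad_accum` mini-batches
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


# 6179 training examples, batch size 16, no accumulation, 3 epochs
steps = total_optimizer_steps(num_examples=6179, batch_size=16, grad_accum=1, epochs=3)
print(steps)  # 1161

# Same data with batch size 8 and 2 gradient accumulation steps
steps_accum = total_optimizer_steps(num_examples=6179, batch_size=8, grad_accum=2, epochs=3)
print(steps_accum)  # 1158
```

With `eval_steps=100`, a run of 1161 steps is evaluated 11 times, so it helps to know the total step count before choosing evaluation and checkpoint intervals.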

In [None]:
eval_results = trainer.evaluate(tokenized_dataset['valid'])

In [None]:
eval_results

{'eval_loss': 0.32355475425720215,
 'eval_f1_micro': 0.6514426785989816,
 'eval_f1_macro': 0.46463613709713125,
 'eval_accuracy': 0.24336569579288025,
 'eval_runtime': 1.6674,
 'eval_samples_per_second': 926.587,
 'eval_steps_per_second': 58.174,
 'epoch': 3.0}

In [None]:
## Use Flan-T5 for classification

In [None]:
checkpoint = "google/flan-t5-base"
# problem_type is a model/config option, not a tokenizer option, so it is not passed here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [None]:
def tokenize_fn(batch):
    return tokenizer(text = batch["text"], truncation=True, padding=True, return_tensors="pt")

In [None]:
tokenized_dataset= train_val.map(tokenize_fn, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(
    ['text']
)
tokenized_dataset.set_format(type='torch')

Map:   0%|          | 0/6179 [00:00<?, ? examples/s]

Map:   0%|          | 0/1545 [00:00<?, ? examples/s]

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 11, problem_type = 'multi_label_classification')  # We are using the same checkpoint as we used for the tokenizer


config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at google/flan-t5-base and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
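# Reload the config with the task settings; assigning a plain AutoConfig would
# reset num_labels and problem_type to their defaults.
config = AutoConfig.from_pretrained(checkpoint, num_labels=11, problem_type='multi_label_classification')
model.config = config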


In [None]:
accuracy_metric = evaluate.load('accuracy', 'multilabel')
f1 = evaluate.load('f1','multilabel')


def compute_metrics(eval_pred):
    # accuracy_metric = evaluate.load('accuracy', 'multilabel')

    logits, labels = eval_pred
    # print(logits.shape)
    preds = (logits[0] > 0).astype(int)
    accuracy = accuracy_metric.compute(predictions=preds, references=labels)
    f1_micro = f1.compute(predictions=preds, references=labels, average='micro')
    f1_macro = f1.compute(predictions=preds, references=labels, average='macro')
    return {'f1_micro':f1_micro['f1'],
            'f1_macro':f1_macro['f1'],
            'accuracy':accuracy['accuracy'],
            }


In [None]:


run_name = "emotions_test_bert"
base_folder = Path(basepath)
model_folder = base_folder / "models"/run_name
# Create the directory if it doesn't exist
model_folder.mkdir(exist_ok=True, parents=True)

# Configure training parameters
training_args = TrainingArguments(
    # Training-specific configurations
    num_train_epochs=3,  # Total number of training epochs
    # Number of samples per training batch for each device
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    # auto_find_batch_size=True,
    weight_decay=0.01,  # Apply L2 regularization to prevent overfitting
    learning_rate=2e-5,  # Step size for the optimizer during training
    optim='adamw_torch',  # Optimizer,
    max_grad_norm = 1.0,
    gradient_accumulation_steps=2,

    # Checkpoint saving and model evaluation settings
    output_dir='/content/drive/MyDrive/hw6/saved_models',  # Directory to save model checkpoints
    evaluation_strategy='steps',  # Evaluate model at specified step intervals
    eval_steps=100,  # Perform evaluation every 100 training steps
    save_strategy="steps",  # Save model checkpoint at specified step intervals
    save_steps=100,  # Save a model checkpoint every 100 training steps
    load_best_model_at_end=True,  # Reload the best model at the end of training
    save_total_limit=2,  # Retain only the best and the most recent model checkpoints
    # Use micro F1 as the metric to determine the best model
    metric_for_best_model="f1_micro",
    greater_is_better=True,  # A model is 'better' if its micro F1 is higher

    # Experiment logging configurations
    logging_strategy='steps',
    logging_steps=100,
    report_to='wandb',  # Log metrics and results to Weights & Biases platform
    run_name=run_name,  # Experiment name for Weights & Biases

    fp16=False,
    bf16=False,
    tf32=False
)



In [None]:
# initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


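# DataLoaderConfiguration lives in accelerate.utils; note that this object is not passed to the Trainer above
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)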


In [None]:
trainer.train()  # start training

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


| Step | Training Loss | Validation Loss | F1 Micro | F1 Macro | Accuracy |
|---|---|---|---|---|---|
| 100 | 0.5583 | 0.475642 | 0.001643 | 0.000936 | 0.023948 |
| 200 | 0.4585 | 0.435733 | 0.337849 | 0.143708 | 0.08932 |
| 300 | 0.4262 | 0.390595 | 0.549392 | 0.279658 | 0.16699 |
| 400 | 0.3967 | 0.380185 | 0.576852 | 0.30558 | 0.171521 |
| 500 | 0.3798 | 0.373825 | 0.575696 | 0.303967 | 0.18123 |
| 600 | 0.3685 | 0.370068 | 0.594919 | 0.328618 | 0.173463 |
| 700 | 0.368 | 0.362532 | 0.599224 | 0.343484 | 0.184466 |
| 800 | 0.3775 | 0.357581 | 0.602582 | 0.347056 | 0.190291 |
| 900 | 0.3571 | 0.356751 | 0.606061 | 0.349867 | 0.193528 |
| 1000 | 0.3554 | 0.355898 | 0.608722 | 0.356833 | 0.196764 |


There were missing keys in the checkpoint model loaded: ['transformer.encoder.embed_tokens.weight', 'transformer.decoder.embed_tokens.weight'].


TrainOutput(global_step=1158, training_loss=0.3965080900307558, metrics={'train_runtime': 1185.2448, 'train_samples_per_second': 15.64, 'train_steps_per_second': 0.977, 'total_flos': 1732810563829464.0, 'train_loss': 0.3965080900307558, 'epoch': 3.0})

In [None]:
eval_results = trainer.evaluate(tokenized_dataset['valid'])

In [None]:
eval_results

{'eval_loss': 0.3547779619693756,
 'eval_f1_micro': 0.6098127224887788,
 'eval_f1_macro': 0.3562746846330755,
 'eval_accuracy': 0.1948220064724919,
 'eval_runtime': 18.2773,
 'eval_samples_per_second': 84.531,
 'eval_steps_per_second': 10.614,
 'epoch': 3.0}