# Task

The task in this notebook is to evaluate how well a small language model (SLM) performs on text classification.

We use the AG News dataset.

The model is evaluated under three conditions:
- First, we fine-tune the SLM with no prompt to see how well it classifies.
- Second, we add a prompt that describes the task and evaluate the SLM with no fine-tuning.
- Finally, we combine the prompt with fine-tuning.

At the end, we compare the three cases to determine whether the prompt and/or fine-tuning is necessary for an SLM to perform well on a classification task.

In [1]:
from collections import defaultdict, Counter
import json
from matplotlib import pyplot as plt
import numpy as np
import torch
from datasets import load_dataset, DatasetDict, ClassLabel
from torch.utils.data import DataLoader
from transformers import TrainingArguments, Trainer
from transformers import TrainerCallback, EarlyStoppingCallback
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, AutoModelForSequenceClassification



### Load the dataset

In [2]:
dataset_name1 = "fancyzhx/ag_news"

agnews_dataset = load_dataset(dataset_name1)

In [3]:
agnews_train_val_data = agnews_dataset['train'].train_test_split(test_size=0.2)
agnews_train_val_data['train'][:5]

{'text': ['San Francisco Rules Could Bar Elephants from Zoo Elephants must receive hundreds of times more space to live at San Francisco #39;s zoo or not be kept at the facility, city legislators said on Tuesday in legislation that could effectively bar pachyderms for good.',
  ' #39;Bricolage #39; Barroso takes high risk EU gamble By refusing to sack controversial new justice commissioner Rocco Buttiglione, the new Brussels chief may face a protest vote in the European Parliament next Wednesday.',
  'Middle East ; Iran Rules Out Complete Nuclear Dismantling  quot;Americans also have no right to raise something like this, quot; he said, adding that Iran had never used its nuclear power program for weapons production.',
  'Mistakes hinder Knicks in loss It will be labeled a learning experience, but that won #39;t remove any of the sting. Momentum was lost again Wednesday night as the Knicks blew a lead and lost 94-93 to the Detroit Pistons at Madison Square Garden.',
  "Coping with Cont

#### Prepare the model

In [4]:
model = AutoModelForSequenceClassification.from_pretrained('roneneldan/TinyStories-1M', num_labels=4)
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
# model.to(device)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model.config.pad_token_id = model.config.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

Some weights of GPTNeoForSequenceClassification were not initialized from the model checkpoint at roneneldan/TinyStories-1M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### First Case : Fine-Tune with no prompt

In [5]:
agnews_dataset = DatasetDict(
    train=agnews_train_val_data['train'].shuffle(seed=1111),
    val=agnews_train_val_data['test'].shuffle(seed=1111),
)

In [6]:
agnews_tokenized = agnews_dataset.map(
    lambda input: tokenizer(input['text'], padding=True, truncation=True),
    batched=True,
    batch_size=16
)
agnews_tokenized = agnews_tokenized.remove_columns(["text"])
agnews_tokenized = agnews_tokenized.rename_column("label", "labels")
agnews_tokenized.set_format("torch")
agnews_tokenized['train'][:2]

Map: 100%|██████████| 96000/96000 [01:10<00:00, 1355.86 examples/s]
Map: 100%|██████████| 24000/24000 [00:19<00:00, 1247.83 examples/s]


{'labels': tensor([2, 3]),
 'input_ids': tensor([[   18,  5849, 28244, 38068, 15792, 47875,  1377,  4942,    78,   220,
          12682, 28154,   357, 12637,     8,   532,  7683,  4664, 15792, 47875,
           3457,    13,  1222,  2528,    26,    32,   367, 31688,  2625,  4023,
           1378,  2503,    13, 24859,   273,    13,   260,  5843,    13,   785,
             14, 13295, 25178,    13, 31740,    30,    83, 15799,    28,    34,
             13,    45,  2496, 33223, 29522,    14, 24209, 10951,    14, 12853,
          22708,     1,     5, 13655,    26,    34,    13,    45,     5,  2528,
             26,    14,    32,     5, 13655,    26,   220, 12353,   389,  4305,
            262,  1664,    11,   706,   852,  1043, 12387,   220,  4497,   329,
            938,  1227,   338,  4137, 16512,   286, 15792, 47875,   338,   220,
           2839,  3331,   287,  2869,    11,   257,  1048,  5385,   351,   262,
           2300,   531,    13],
         [   35,   695, 12043,  6023, 17264,   2
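Because `padding=True` is applied inside `map` with `batch_size=16`, each group of 16 examples is padded to the longest sequence in that group rather than to one global length. The pure-Python sketch below illustrates that behaviour on toy token ids. `pad_batch` and the pad id `0` are hypothetical and chosen only for illustration; they are not part of the `transformers` API, and the notebook itself pads with the EOS token.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (ids are illustrative only).
batch = pad_batch([[18, 5849, 28244], [35, 695]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

This is why the two rows shown in the output above have the same length: they were tokenized in the same mapped batch.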

#### Fine-Tune

In [7]:
agnews_arguments = TrainingArguments(
    output_dir="agnews_no_prompt_trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    eval_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    seed=224
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Returns accuracy and macro-averaged precision, recall and F1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    conf_matrix = confusion_matrix(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
    }



agnews_trainer = Trainer(
    model=model,
    args=agnews_arguments,
    train_dataset=agnews_tokenized['train'],
    eval_dataset=agnews_tokenized['val'], # change to test when you do your final evaluation!
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [8]:
class LoggingCallback(TrainerCallback):
    def __init__(self, log_path):
        self.log_path = log_path
    # on_log is called at each logging step, as set by TrainingArguments.logging_steps
    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero:
            with open(self.log_path, "a") as f:
                f.write(json.dumps(logs) + "\n")
    # def on_epoch(...)


agnews_trainer.add_callback(EarlyStoppingCallback())
agnews_trainer.add_callback(LoggingCallback("agnews_no_prompt_trainer/log.jsonl"))
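The callback writes one JSON object per line, so the log can be read back later to inspect or plot the loss curves. The sketch below shows one way to do that with the standard library only. `read_jsonl` is a hypothetical helper, and the two sample records are made-up stand-ins for what the callback would append, written to a temporary file so the example is self-contained.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records standing in for what LoggingCallback would append.
sample_logs = [
    {"loss": 0.43, "epoch": 1.0},
    {"eval_loss": 0.42, "epoch": 1.0},
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "log.jsonl")
    with open(log_path, "a") as f:
        for record in sample_logs:
            f.write(json.dumps(record) + "\n")
    records = read_jsonl(log_path)

# Training and evaluation entries use different keys, so they can be separated.
train_losses = [r["loss"] for r in records if "loss" in r]
eval_losses = [r["eval_loss"] for r in records if "eval_loss" in r]
print(train_losses, eval_losses)
```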

In [9]:
# train the model
agnews_trainer.train()

    There is an imbalance between your GPUs. You may want to exclude GPU 0 which
    has less than 75% of the memory or cores of GPU 1. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| 1 | 0.4291 | 0.417415 | 0.849583 | 0.853389 | 0.849123 | 0.847334 |
| 2 | 0.3092 | 0.31074 | 0.892875 | 0.896142 | 0.892845 | 0.892331 |
| 3 | 0.242 | 0.283598 | 0.903875 | 0.90464 | 0.903745 | 0.903966 |
| 4 | 0.1984 | 0.271825 | 0.908792 | 0.908717 | 0.908636 | 0.908627 |
| 5 | 0.164 | 0.273746 | 0.911167 | 0.911007 | 0.911008 | 0.910945 |
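With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` ranks checkpoints by validation loss, and `EarlyStoppingCallback()` uses a patience of 1 by default. The sketch below replays that selection on the validation losses from the table above. The helper names are hypothetical and the logic is a simplified illustration, not the library's implementation.

```python
# Validation loss per epoch, copied from the table above.
val_loss = {1: 0.417415, 2: 0.310740, 3: 0.283598, 4: 0.271825, 5: 0.273746}


def best_epoch(losses):
    """Return the epoch with the lowest validation loss."""
    return min(losses, key=losses.get)


def epochs_without_improvement(losses):
    """Count consecutive trailing epochs that did not beat the best loss so far."""
    best = float("inf")
    stale = 0
    for epoch in sorted(losses):
        if losses[epoch] < best:
            best = losses[epoch]
            stale = 0
        else:
            stale += 1
    return stale


print(best_epoch(val_loss))
print(epochs_without_improvement(val_loss))
```

Under these assumptions, the checkpoint reloaded at the end of training would be the epoch-4 one (F1 0.908627), not the epoch-5 one with the slightly higher F1.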



TrainOutput(global_step=10000, training_loss=0.3053658576965332, metrics={'train_runtime': 3552.5833, 'train_samples_per_second': 135.113, 'train_steps_per_second': 2.815, 'train_loss': 0.3053658576965332, 'epoch': 5.0})
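The reported `global_step=10000` can be cross-checked against the dataset size and batch settings. The arithmetic below is a small sanity check using only numbers printed in this notebook. The implied device count is an inference from those numbers and from the multi-GPU warning, not something the notebook states.

```python
train_examples = 96_000   # size of the training split (from the Map progress bar)
epochs = 5
per_device_batch = 16     # per_device_train_batch_size
global_step = 10_000      # reported by TrainOutput

samples_seen = train_examples * epochs
effective_batch = samples_seen / global_step
implied_devices = effective_batch / per_device_batch
print(samples_seen, effective_batch, implied_devices)

# Independent check: runtime multiplied by throughput should give the same sample count.
samples_from_throughput = 3552.5833 * 135.113
print(round(samples_from_throughput))
```

An effective batch of 48 with 16 examples per device is consistent with the batch being split across three GPUs.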

### Second Case : Add a prompt

In [10]:
model = AutoModelForSequenceClassification.from_pretrained('roneneldan/TinyStories-1M', num_labels=4)
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
# model.to(device)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model.config.pad_token_id = model.config.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

Some weights of GPTNeoForSequenceClassification were not initialized from the model checkpoint at roneneldan/TinyStories-1M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
def add_prompt(input):
    # Wrap the article text in an instruction that lists the four label ids
    prompt = f" Answer by 0 (world), 1 (sports), 2 (business) or 3 (sci/tech). The article \"{input['text']}\" is about "
    return {
        'text': prompt,
        'label': input['label']
    }
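To see what the model actually receives, it helps to apply the function to a single record. The sketch below redefines `add_prompt` so it runs on its own (the parameter is renamed to `example` to avoid shadowing the built-in `input`) and applies it to a shortened headline from the sample shown earlier. The label `1` is an assumption made for the illustration.

```python
def add_prompt(example):
    """Wrap the article text in the classification prompt, keeping the label."""
    prompt = (
        " Answer by 0 (world), 1 (sports), 2 (business) or 3 (sci/tech). "
        f"The article \"{example['text']}\" is about "
    )
    return {"text": prompt, "label": example["label"]}


# Shortened headline; the label value is assumed for this illustration.
record = {"text": "Mistakes hinder Knicks in loss", "label": 1}
prompted = add_prompt(record)
print(prompted["text"])
```

Note that the model used here is a sequence classifier: it never generates the answer digit, so the prompt only acts as extra context around the article text.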

In [12]:
agnews_dataset = DatasetDict(
    train=agnews_train_val_data['train'].shuffle(seed=1111).map(add_prompt),
    val=agnews_train_val_data['test'].shuffle(seed=1111).map(add_prompt),
)

Map: 100%|██████████| 96000/96000 [00:14<00:00, 6443.52 examples/s]
Map: 100%|██████████| 24000/24000 [00:03<00:00, 7005.07 examples/s]


In [13]:
agnews_tokenized = agnews_dataset.map(
    lambda input: tokenizer(input['text'], padding=True, truncation=True),
    batched=True,
    batch_size=16
)
agnews_tokenized = agnews_tokenized.remove_columns(["text"])
agnews_tokenized = agnews_tokenized.rename_column("label", "labels")
agnews_tokenized.set_format("torch")
agnews_tokenized['train'][:2]

Map: 100%|██████████| 96000/96000 [01:17<00:00, 1235.34 examples/s]
Map: 100%|██████████| 24000/24000 [00:18<00:00, 1275.21 examples/s]


{'labels': tensor([2, 3]),
 'input_ids': tensor([[23998,   416,   657,   357,  6894,   828,   352,   357, 32945,   828,
            362,   357, 22680,     8,   393,   513,   357, 36216,    14, 13670,
            737,   383,  2708,   366,    18,  5849, 28244, 38068, 15792, 47875,
           1377,  4942,    78,   220, 12682, 28154,   357, 12637,     8,   532,
           7683,  4664, 15792, 47875,  3457,    13,  1222,  2528,    26,    32,
            367, 31688,  2625,  4023,  1378,  2503,    13, 24859,   273,    13,
            260,  5843,    13,   785,    14, 13295, 25178,    13, 31740,    30,
             83, 15799,    28,    34,    13,    45,  2496, 33223, 29522,    14,
          24209, 10951,    14, 12853, 22708,     1,     5, 13655,    26,    34,
             13,    45,     5,  2528,    26,    14,    32,     5, 13655,    26,
            220, 12353,   389,  4305,   262,  1664,    11,   706,   852,  1043,
          12387,   220,  4497,   329,   938,  1227,   338,  4137, 16512,   286,


#### Evaluate

In [14]:
no_training_args = TrainingArguments(
    output_dir="agnews_prompt_no_train",
    eval_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
)

agnews_no_train = Trainer(
    model=model,
    args=no_training_args,
    train_dataset=agnews_tokenized['train'],
    eval_dataset=agnews_tokenized['val'], # change to test when you do your final evaluation!
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [15]:
eval_results = agnews_no_train.evaluate()
eval_results



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 1.447458028793335,
 'eval_accuracy': 0.24579166666666666,
 'eval_precision': 0.11694861946107933,
 'eval_recall': 0.24771749298938822,
 'eval_f1_score': 0.15288915380787896,
 'eval_runtime': 175.4355,
 'eval_samples_per_second': 136.802,
 'eval_steps_per_second': 5.7}
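The macro precision (0.117) is well below the accuracy (0.246), and the `_warn_prf` line in the output indicates that precision was undefined for at least one class. Both are what one would expect from a classifier that rarely or never predicts some classes. The toy example below shows the effect with a predictor that always answers the same class on balanced four-class labels. `macro_scores` is a hypothetical pure-Python helper that scores undefined ratios as 0, and the labels are synthetic, not the AG News data.

```python
def macro_scores(labels, preds, num_classes):
    """Macro-averaged precision, recall and F1, scoring undefined ratios as 0."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        predicted = sum(1 for p in preds if p == c)
        actual = sum(1 for y in labels if y == c)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


labels = [0, 1, 2, 3] * 25   # toy balanced labels, not the AG News data
preds = [2] * 100            # a classifier that always answers class 2
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
precision, recall, f1 = macro_scores(labels, preds, num_classes=4)
print(accuracy, precision, recall, f1)
```

In this toy case accuracy and macro recall both sit at 0.25 while macro precision and F1 fall far lower, the same pattern as in the evaluation results above.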

### Third Case : Add a prompt and Fine-Tune

In [16]:
model = AutoModelForSequenceClassification.from_pretrained('roneneldan/TinyStories-1M', num_labels=4)
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
# model.to(device)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model.config.pad_token_id = model.config.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

Some weights of GPTNeoForSequenceClassification were not initialized from the model checkpoint at roneneldan/TinyStories-1M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
agnews_arguments = TrainingArguments(
    output_dir="agnews_prompt_trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    eval_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    seed=224
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Returns accuracy and macro-averaged precision, recall and F1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    conf_matrix = confusion_matrix(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
    }



agnews_trainer = Trainer(
    model=model,
    args=agnews_arguments,
    train_dataset=agnews_tokenized['train'],
    eval_dataset=agnews_tokenized['val'], # change to test when you do your final evaluation!
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [18]:
agnews_trainer.add_callback(EarlyStoppingCallback())
agnews_trainer.add_callback(LoggingCallback("agnews_prompt_trainer/log.jsonl"))

In [19]:
# train the model
agnews_trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| 1 | 0.4248 | 0.398094 | 0.860458 | 0.862527 | 0.859912 | 0.858525 |
| 2 | 0.3092 | 0.313359 | 0.891167 | 0.895912 | 0.891236 | 0.891142 |
| 3 | 0.2417 | 0.28757 | 0.902625 | 0.905031 | 0.902607 | 0.902889 |
| 4 | 0.1989 | 0.278183 | 0.907333 | 0.907623 | 0.907187 | 0.907373 |
| 5 | 0.1619 | 0.276344 | 0.909875 | 0.90961 | 0.909693 | 0.90964 |



TrainOutput(global_step=10000, training_loss=0.303584033203125, metrics={'train_runtime': 3542.93, 'train_samples_per_second': 135.481, 'train_steps_per_second': 2.823, 'train_loss': 0.303584033203125, 'epoch': 5.0})

### Conclusion

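Fine-tuning is necessary for good classification with this model:
- The F1-score reaches 0.91 when the model is fine-tuned, versus 0.15 when it is not. The untrained result is at chance level (accuracy of about 0.25 on four classes), which is expected because the classification head (`score.weight`) is randomly initialized, so a prompt alone cannot compensate.
- With a prompt, the F1-score after the first epoch is higher (0.859 vs 0.847), but the results after five epochs are nearly identical (0.910 vs 0.911), so it is difficult to conclude whether good prompting leads to better classification.
- Looking at the training loss, convergence with a prompt is marginally better (0.1619 vs 0.1640 at epoch 5).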
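The three cases can be put side by side with a few lines of Python. The sketch below uses only the validation metrics reported in this notebook (epoch-5 rows for the two fine-tuned runs, rounded to six decimals); the dictionary layout and case names are illustrative choices.

```python
# Final validation metrics copied from the outputs above.
results = {
    "fine-tuned, no prompt (epoch 5)": {"accuracy": 0.911167, "f1": 0.910945},
    "prompt, no fine-tuning": {"accuracy": 0.245792, "f1": 0.152889},
    "prompt + fine-tuned (epoch 5)": {"accuracy": 0.909875, "f1": 0.909640},
}

print(f"{'case':<34}{'accuracy':>10}{'f1':>10}")
for name, m in results.items():
    print(f"{name:<34}{m['accuracy']:>10.4f}{m['f1']:>10.4f}")

# Difference in F1 attributable to the prompt once the model is fine-tuned.
gap = (
    results["prompt + fine-tuned (epoch 5)"]["f1"]
    - results["fine-tuned, no prompt (epoch 5)"]["f1"]
)
print(f"F1 difference from adding the prompt: {gap:+.4f}")
```

The difference between the two fine-tuned runs is about a tenth of a percentage point, which is small enough that a single run per case cannot separate it from run-to-run variation.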