<a href="https://colab.research.google.com/github/RizLuigi/NLU-project/blob/main/GPT2_Evaluation_and_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT2 - Evaluation and fine-tuning

This project is inspired by [this notebook](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) provided by Hugging Face, in particular for the fine-tuning procedure and parameters.

## Imports and setup
First, we install Hugging Face `Transformers` and `Datasets`:

In [None]:
!pip install transformers
!pip install datasets



Then we instantiate our GPT2 model and its tokenizer:

In [None]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

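# Name of the pretrained checkpoint to use
MODEL_NAME = "gpt2"
# Use the GPU when available, otherwise fall back to the CPU
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# Create the model and its tokenizer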
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME).to(DEVICE)
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)

Since GPT2 is a causal language model, we follow the approach described in the notebook: once the texts are tokenized, we concatenate all of them and then split the result into chunks of the same length.

First, we define a function to tokenize all the samples in our dataset.

In [None]:
def tokenize(examples):
    return tokenizer(examples["text"])

To tokenize our dataset we will use the `map` method of Hugging Face `Datasets` as follows:
```
tokenized_dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
```

Note that we remove the `"text"` column, since it is no longer needed once the tokens have been obtained.
We use `batched=True` (the function receives many samples at once) and `num_proc=4` (four worker processes) to speed up the computation.
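To make the effect of `batched=True` concrete, here is a minimal, dependency-free sketch. It assumes a hypothetical whitespace tokenizer (`toy_tokenizer`) in place of the real GPT2 tokenizer; the only point is the shape of the data: the mapped function receives a dict of lists and returns a dict of lists.

```python
# Hypothetical stand-in for the real tokenizer: it splits on whitespace and
# maps each word to an integer id. It is NOT the GPT2 BPE tokenizer.
def toy_tokenizer(texts):
    vocab = {}
    input_ids = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab)) for word in text.split()]
        input_ids.append(ids)
    return {"input_ids": input_ids}


def toy_tokenize(examples):
    # With batched=True, `examples` maps each column name to a list
    # holding one entry per sample in the batch.
    return toy_tokenizer(examples["text"])


batch = {"text": ["the cat sat", "the dog sat down"]}
tokenized_batch = toy_tokenize(batch)
print(tokenized_batch)
```

The returned lists have one entry per input text, and each entry can have a different length. This is why a separate grouping step is needed before training.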

At this point we need to define the function to concatenate all the tokens and divide them into fixed-length chunks.

In [None]:
block_size = 128

def group_and_chunk(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

The `map` function needed in this case will be:
```
datasets = tokenized_dataset.map(
    group_and_chunk,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
```
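As an illustration of what this step produces, the sketch below runs the same logic on a tiny hand-made batch. `group_and_chunk_demo` is a hypothetical copy of `group_and_chunk` that takes the block size as an argument, and the token ids are made up.

```python
def group_and_chunk_demo(examples, block_size):
    # Same logic as `group_and_chunk`, with the block size passed explicitly.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three "tokenized texts" of different lengths (made-up token ids).
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
chunks = group_and_chunk_demo(batch, block_size=4)
print(chunks["input_ids"])
```

The ten tokens become two blocks of four, and the last two tokens are dropped. Because `map` is called with `batch_size=1000`, this remainder is dropped once per batch of 1000 texts. The `labels` are a plain copy of `input_ids` because the model shifts them by one position internally when computing the loss.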

Another useful function extracts from the validation set the samples with the highest and lowest perplexity (ppl), so that we can compare the behaviour of the model before and after training.
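Perplexity is the exponential of the mean negative log-likelihood of the tokens, which is why the loss is passed through `math.exp` in this notebook. A minimal sketch, assuming made-up probabilities that a model assigns to the correct next tokens (`perplexity` is a hypothetical helper, not part of any library):

```python
import math


def perplexity(token_probs):
    # Mean negative log-likelihood of the observed tokens (the LM loss) ...
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # ... exponentiated to obtain the perplexity.
    return math.exp(nll)


# Hypothetical probabilities assigned to the correct next tokens.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.01, 0.02, 0.01, 0.02])
print(confident, uncertain)
```

A model that always gives the correct token probability 0.5 has perplexity 2; the lower the probabilities, the higher the perplexity.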

In [None]:
import math

import torch
from tqdm import tqdm

def highest_lowest_ppl_samples(trainer, tokenizer):
  """Return the evaluation chunks with the lowest and highest perplexity."""
  min_chunk, max_chunk = None, None  # avoid shadowing the built-ins min/max
  min_loss, max_loss = float('inf'), float('-inf')
  for chunk in tqdm(trainer.eval_dataset):
    act_chunk = torch.tensor(chunk['input_ids']).to(DEVICE)
    with torch.no_grad():  # inference only, no gradients needed
      outputs = trainer.model(act_chunk, labels=act_chunk)  # Compute the loss using the model
    loss = outputs.loss.item()  # convert the loss tensor to a Python float
    if loss > max_loss:
      max_loss, max_chunk = loss, act_chunk
    if loss < min_loss:
      min_loss, min_chunk = loss, act_chunk
  return {
    'Min': tokenizer.decode(min_chunk), 'Min_ppl': math.exp(min_loss),
    'Max': tokenizer.decode(max_chunk), 'Max_ppl': math.exp(max_loss),
  }

At this point we're ready to fine-tune using the three different datasets.

## wikitext / wikitext-2-raw-v1

### Preprocessing

First, we load the dataset:

In [None]:
from datasets import load_dataset

dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')

print(dataset)


DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


Then we tokenize the dataset, keeping only the tokenizer outputs (`input_ids` and `attention_mask`) and discarding the `text` column.

In [None]:
tokenized_dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])


At this point we process the dataset using the `group_and_chunk` function, defined previously.

In [None]:
datasets = tokenized_dataset.map(
    group_and_chunk,
    batched=True,
    batch_size=1000,
    num_proc=4,
)


Now we have to set up the `Trainer` and its `TrainingArguments`.

In [None]:
from transformers import Trainer, TrainingArguments

In [None]:
path='/content/drive/MyDrive/NLU project/'  # Save checkpoints in MyDrive

wiki_training_args = TrainingArguments(
    path+"wikitext-2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

In [None]:
wiki_trainer = Trainer(
    model=GPT2LMHeadModel.from_pretrained(MODEL_NAME).to(DEVICE), # to be sure to work on a copy of gpt
    args=wiki_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

### Initial evaluation

First, we evaluate the model as it comes out of the box, without fine-tuning it.

In [None]:
import math
eval_results = wiki_trainer.evaluate()
print(f"Perplexity before training: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 1931
  Batch size = 8


Perplexity before training: 59.12


Now we search over the validation set, looking for the samples with highest and lowest perplexity.

In [None]:
pre_training = highest_lowest_ppl_samples(wiki_trainer, tokenizer)

print(pre_training)

100%|██████████| 1931/1931 [01:00<00:00, 32.06it/s]

{'Min': "arty transferred to Western Washington University and played football for the Vikings. McCarty was immediately a significant factor in the Vikings'gameplan. In the season opener, he rushed for 139 yards and three touchdowns on 30 carries against the Humboldt State Lumberjacks. He also played a large role in the passing game early in the season, making eight receptions for 126 yards through the first two games. After starting the first seven games for the Vikings, McCarty broke his foot in a game against the South Dakota Hardrockers. At the time of his injury, he led the Vikings in rushing and receiving yards. He finished the", 'Min_ppl': 14.073583184913058, 'Max': "ately sculpted but damaged roof lintel, possibly showing Dark Sun engaged in a ritual dance around AD 810. The temple shrine possesses two chambers. \n Temple IV is the tallest temple @-@ pyramid at Tikal, measuring 70 metres ( 230 ft ) from the plaza floor level to the top of its roof comb. Temple IV marks the reig




### Training

Fine-tuning our model is very simple, since we already defined all the parameters in the previous sections. We just need to call the `train()` method.

In [None]:
wiki_trainer.train()

***** Running training *****
  Num examples = 18666
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7002


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.4877 | 3.410611 |
| 2 | 3.3615 | 3.395832 |
| 3 | 3.3016 | 3.392999 |


TrainOutput(global_step=7002, training_loss=3.4143140964322822, metrics={'train_runtime': 4559.7536, 'train_samples_per_second': 12.281, 'train_steps_per_second': 1.536, 'total_flos': 3657957801984000.0, 'train_loss': 3.4143140964322822, 'epoch': 3.0})
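The step count reported in the training log can be checked by hand. The sketch below assumes that the last, partially filled batch of each epoch still counts as one step, and uses the values printed in the log (no gradient accumulation).

```python
import math

num_examples = 18666  # "Num examples" from the training log
batch_size = 8        # "Total train batch size"
num_epochs = 3        # "Num Epochs"

# The last batch of an epoch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the result matches the `Total optimization steps = 7002` line and the `global_step=7002` field of `TrainOutput`.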

Now let's save the fine-tuned model, so we can easily reload it using `from_pretrained()`:

In [None]:
wiki_trainer.save_model(wiki_training_args.output_dir+'/model')

Saving model checkpoint to /content/drive/MyDrive/NLU project/wikitext-2/model
Configuration saved in /content/drive/MyDrive/NLU project/wikitext-2/model/config.json
Model weights saved in /content/drive/MyDrive/NLU project/wikitext-2/model/pytorch_model.bin


### Final evaluation

As before, we calculate the perplexity of the model over the validation set:




In [None]:
import math
eval_results = wiki_trainer.evaluate()
print(f"Perplexity after training: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 1931
  Batch size = 8


Perplexity after training: 29.76
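This number can be cross-checked against the training log, assuming the final evaluation loss equals the validation loss logged at the end of epoch 3 (no training happened in between):

```python
import math

final_validation_loss = 3.392999  # epoch 3 validation loss from the training log
perplexity = math.exp(final_validation_loss)
print(f"{perplexity:.2f}")
```

The value should land very close to the perplexity reported by the evaluation cell.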


Finally we extract the lowest- and highest-perplexity samples and compare them with the ones found before fine-tuning.

In [None]:
post_training = highest_lowest_ppl_samples(wiki_trainer, tokenizer)

print(post_training)

100%|██████████| 1931/1931 [01:01<00:00, 31.47it/s]

{'Min': ' @-@ millimetre ( 1 @.@ 9 in ) three @-@ pounder Hotchkiss guns and four 47 @-@ millimetre 2 @.@ 5 @-@ pounder Hotchkiss guns. The former were mounted in the superstructure and the latter in the fighting tops. The three @-@ pounder gun fired 3 @.@ 19 @-@ pound ( 1 @.@ 45 kg ) projectiles at a muzzle velocity of 1 @,@ 927 ft / s ( 587 m / s ), while the 2 @.@ 5 @-@ pounder fired 2 @', 'Min_ppl': 6.582622026878315, 'Max': "ately sculpted but damaged roof lintel, possibly showing Dark Sun engaged in a ritual dance around AD 810. The temple shrine possesses two chambers. \n Temple IV is the tallest temple @-@ pyramid at Tikal, measuring 70 metres ( 230 ft ) from the plaza floor level to the top of its roof comb. Temple IV marks the reign of Yik ’ in Chan Kawil ( Ruler B, the son of Ruler A or Jasaw Chan K 'awiil I ) and two carved wooden lintels over the doorway that leads into the temple on the pyramid ’ s summit record a long count date ( 9", 'Max_ppl': 129.8181524083341}





In [None]:
print('MIN_PPL:\n\t{} - ppl = {:.2f}\n\t{} - ppl = {:.2f}'.format(pre_training['Min'][0:75], pre_training['Min_ppl'], post_training['Min'][0:75], post_training['Min_ppl']))
print('MAX_PPL:\n\t{} - ppl = {:.2f}\n\t{} - ppl = {:.2f}'.format(pre_training['Max'][0:75], pre_training['Max_ppl'], post_training['Max'][0:75], post_training['Max_ppl']))

MIN_PPL:
	arty transferred to Western Washington University and played football for t - ppl = 14.07
	 @-@ millimetre ( 1 @.@ 9 in ) three @-@ pounder Hotchkiss guns and four 47 - ppl = 6.58
MAX_PPL:
	ately sculpted but damaged roof lintel, possibly showing Dark Sun engaged i - ppl = 279.78
	ately sculpted but damaged roof lintel, possibly showing Dark Sun engaged i - ppl = 129.82


## amazon_reviews_multi

### Preprocessing

First, we load the dataset and remove the unneeded columns:

In [None]:
dataset = load_dataset( 'amazon_reviews_multi', 'en')

dataset = dataset.remove_columns(['review_id', 'product_id', 'reviewer_id', 'stars', 'review_title', 'language', 'product_category'])  # Just retain the text, remove the rest
dataset = dataset.rename_column('review_body', 'text')

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text'],
        num_rows: 5000
    })
})


Then we tokenize the dataset, keeping only the tokenizer outputs (`input_ids` and `attention_mask`) and discarding the `text` column.

In [None]:
tokenized_dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])

At this point we process the dataset using the `group_and_chunk` function, defined previously.

In [None]:
datasets = tokenized_dataset.map(
    group_and_chunk,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Now we have to set up the `Trainer` and its `TrainingArguments`.

In [None]:
amazon_training_args = TrainingArguments(
    path+"amazon_reviews_multi",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)



In [None]:
amazon_trainer = Trainer(
    model=GPT2LMHeadModel.from_pretrained(MODEL_NAME).to(DEVICE), # to be sure to work on a copy of gpt
    args=amazon_training_args,
    train_dataset=datasets["train"].shard(num_shards=3, index=0),  # use only one third of the training chunks
    eval_dataset=datasets["validation"],
)


### Initial evaluation

First, we evaluate the model as it comes out of the box, without fine-tuning it.

In [None]:
import math
eval_results = amazon_trainer.evaluate()
print(f"Perplexity before training: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 1602
  Batch size = 8


Perplexity before training: 59.80


Now we search over the validation set, looking for the samples with highest and lowest perplexity.

In [None]:
pre_training = highest_lowest_ppl_samples(amazon_trainer, tokenizer)

print(pre_training)

100%|██████████| 1602/1602 [00:51<00:00, 31.40it/s]

{'Min': " Christmas a couple years ago. I have always had trouble with earbuds staying in my ears, so I thought I'd give these a try after reading the reviews. When I got them, I was pleasantly surprised with how easily and comfortably they fit. Then I plugged them in. I was immediately impressed with the sound quality. I used them all the time, working out at the gym, on long plane rides, and just sitting around and relaxing. Through some normal wear and tear, the wires split and stripped a little. Once that happened, the sound cut in and out. Since I loved these earbuds so much, I", 'Min_ppl': 13.75678233667308, 'Max': ' they are torn & threaded and beyond repair. And as others say line inner tube Line up before adding air. If I get a years out of 2 I’m happy. Also I hate plastic and fabric cover is so much nicer. To the Company please do more stitching and maybe heavier material closer sewn & if the bottom material was little thicker it last longer. I do like just up grade. We will 




### Training

Fine-tuning is performed as in the previous case.

In [None]:
amazon_trainer.train()

amazon_trainer.save_model(amazon_training_args.output_dir+'/model')

***** Running training *****
  Num examples = 21134
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7926


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.7505 | 3.647052 |
| 2 | 3.6548 | 3.616153 |
| 3 | 3.6079 | 3.609838 |



### Final evaluation

As before, we calculate the perplexity of the model:

In [None]:
import math
eval_results = amazon_trainer.evaluate()
print(f"Perplexity after training: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 1602
  Batch size = 8


Perplexity after training: 36.96


Finally we extract the lowest- and highest-perplexity samples and compare them with the ones found before fine-tuning.

In [None]:
post_training = highest_lowest_ppl_samples(amazon_trainer, tokenizer)

print(post_training)

100%|██████████| 1602/1602 [00:50<00:00, 32.03it/s]

{'Min': " Christmas a couple years ago. I have always had trouble with earbuds staying in my ears, so I thought I'd give these a try after reading the reviews. When I got them, I was pleasantly surprised with how easily and comfortably they fit. Then I plugged them in. I was immediately impressed with the sound quality. I used them all the time, working out at the gym, on long plane rides, and just sitting around and relaxing. Through some normal wear and tear, the wires split and stripped a little. Once that happened, the sound cut in and out. Since I loved these earbuds so much, I", 'Min_ppl': 10.524118713747495, 'Max': ' they are torn & threaded and beyond repair. And as others say line inner tube Line up before adding air. If I get a years out of 2 I’m happy. Also I hate plastic and fabric cover is so much nicer. To the Company please do more stitching and maybe heavier material closer sewn & if the bottom material was little thicker it last longer. I do like just up grade. We will




In [None]:
print('MIN_PPL:\n\t{} - ppl = {:.2f}\n\t{} - ppl = {:.2f}'.format(pre_training['Min'][0:75], pre_training['Min_ppl'], post_training['Min'][0:75], post_training['Min_ppl']))
print('MAX_PPL:\n\t{} - ppl = {:.2f}\n\t{} - ppl = {:.2f}'.format(pre_training['Max'][0:75], pre_training['Max_ppl'], post_training['Max'][0:75], post_training['Max_ppl']))

MIN_PPL:
	 Christmas a couple years ago. I have always had trouble with earbuds stayi - ppl = 13.76
	 Christmas a couple years ago. I have always had trouble with earbuds stayi - ppl = 10.52
MAX_PPL:
	 they are torn & threaded and beyond repair. And as others say line inner t - ppl = 224.35
	 they are torn & threaded and beyond repair. And as others say line inner t - ppl = 153.86


## cc_news

### Preprocessing

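First, we load the dataset and remove the unwanted fields. Note that we take only 1/100 of the original dataset to keep the training time reasonable. Since `cc_news` only provides a `train` split, we hold out 10% of the sampled rows as a test split, which is used for evaluation.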

In [None]:
from datasets import load_dataset

dataset = load_dataset('cc_news')

print(dataset)

dataset = dataset["train"].shard(num_shards=100, index=0)

print(dataset)

dataset = dataset.remove_columns(['title', 'domain', 'date', 'description', 'url', 'image_url'])  # Just retain the text, remove the rest

print(dataset)

dataset = dataset.train_test_split(test_size=0.1)

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
        num_rows: 708241
    })
})
Dataset({
    features: ['title', 'text', 'domain', 'date', 'description', 'url', 'image_url'],
    num_rows: 7083
})
Dataset({
    features: ['text'],
    num_rows: 7083
})
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 6374
    })
    test: Dataset({
        features: ['text'],
        num_rows: 709
    })
})
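The row counts printed above can be reproduced with a little arithmetic. This is only a sketch under two assumptions about the library's behaviour, checked solely against the printed counts: shard 0 takes every 100th row starting from row 0, and the test split size is rounded up.

```python
import math

total_rows = 708241  # rows in the full cc_news train split (printed above)
num_shards = 100
test_size = 0.1

# Assumption: shard 0 takes every 100th row, starting at row 0.
shard_rows = len(range(0, total_rows, num_shards))

# Assumption: the test split size is rounded up, the rest goes to train.
test_rows = math.ceil(shard_rows * test_size)
train_rows = shard_rows - test_rows
print(shard_rows, train_rows, test_rows)
```

Note that `train_test_split` shuffles by default, so which rows end up in each split can change between runs unless a seed is fixed.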


Then we tokenize the dataset, keeping only the tokenizer outputs (`input_ids` and `attention_mask`) and discarding the `text` column.

In [None]:
tokenized_dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])

Token indices sequence length is longer than the specified maximum sequence length for this model (1263 > 1024). Running this sequence through the model will result in indexing errors

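At this point we process the dataset using the `group_and_chunk` function, defined previously. The tokenizer warning about sequences longer than 1024 tokens is harmless here, because every chunk passed to the model contains only `block_size = 128` tokens.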

In [None]:
datasets = tokenized_dataset.map(
    group_and_chunk,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Now we have to set up the `Trainer` and its `TrainingArguments`.

In [None]:
path='/content/drive/MyDrive/NLU project/'

cc_training_args = TrainingArguments(
    path+"cc_news",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

In [None]:
cc_trainer = Trainer(
    model=GPT2LMHeadModel.from_pretrained(MODEL_NAME).to(DEVICE), # to be sure to work on a copy of gpt
    args=cc_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

### Initial evaluation

First, we evaluate the model as it comes out of the box, without fine-tuning it.

In [None]:
import math
eval_results = cc_trainer.evaluate()
print(f"Perplexity before training: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 2951
  Batch size = 8


Perplexity before training: 37.68


Now we search over the held-out test split, looking for the samples with highest and lowest perplexity.

In [None]:
pre_training = highest_lowest_ppl_samples(cc_trainer, tokenizer)

print(pre_training)

100%|██████████| 2951/2951 [01:36<00:00, 30.56it/s]

{'Min': " mode: 'thumbnails-c', container: 'taboola-interstitial-gallery-thumbnails-5', placement: 'Interstitial Gallery Thumbnails 5', target_type:'mix' }); _taboola.push({flush: true});\nwindow._taboola = window._taboola || []; _taboola.push({ mode: 'thumbnails-c', container: 'taboola-interstitial-gallery-thumbnails-7', placement: 'Interstitial Gallery Thumbnails 7', target_type:'mix' }); _taboola.push({flush: true});\nPhoto: Gene J. Puskar, AP Image 1 of / 7 Caption Close Image 1 of 7 COR", 'Min_ppl': 2.0547252199402473, 'Max': '\n“The person worked for a vendor who sold T-shirts and other clothing items. The vendor was shut down and evicted from the site, and will not be returning.”|\nOnline Mentorship Assignment Help giving by assignmenthelps.co.uk moves essentially from others. they require a swung to start by strategies for seeking after down the necessities of each and every customer’s demand by then watch the fundamental honest to goodness individual with their skilled pool of




### Training

Fine-tuning our model is performed as in the previous cases.

In [None]:
cc_trainer.train()

cc_trainer.save_model(cc_training_args.output_dir+'/model')

***** Running training *****
  Num examples = 26126
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 9798


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.4371 | 3.307143 |
| 2 | 3.3382 | 3.295411 |
| 3 | 3.2808 | 3.293159 |



### Final evaluation

As before, we calculate the perplexity of the model:

In [None]:
import math
eval_results = cc_trainer.evaluate()
print(f"Perplexity after training: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 2951
  Batch size = 8


Perplexity after training: 26.93
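As a quick sanity check, the perplexity can be recomputed by hand from the validation losses reported during training, since it is just the exponential of the mean token-level cross-entropy. This is a minimal sketch that assumes the final `evaluate()` call reproduces the epoch-3 validation loss; the dictionary below is simply copied from the training table.

```python
import math

# Validation loss per epoch, copied from the training output.
val_losses = {1: 3.307143, 2: 3.295411, 3: 3.293159}

for epoch, loss in val_losses.items():
    # perplexity = exp(mean cross-entropy loss)
    print(f"epoch {epoch}: loss = {loss:.4f} -> perplexity = {math.exp(loss):.2f}")

final_ppl = math.exp(val_losses[3])
```

The epoch-3 value rounds to 26.93, in agreement with the perplexity printed above.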


Finally, we extract the samples with the lowest and the highest perplexity and compare them with the ones obtained before fine-tuning.

In [None]:
post_training = highest_lowest_ppl_samples(cc_trainer, tokenizer)

print(post_training)

100%|██████████| 2951/2951 [01:38<00:00, 29.98it/s]

{'Min': " mode: 'thumbnails-c', container: 'taboola-interstitial-gallery-thumbnails-5', placement: 'Interstitial Gallery Thumbnails 5', target_type:'mix' }); _taboola.push({flush: true});\nwindow._taboola = window._taboola || []; _taboola.push({ mode: 'thumbnails-c', container: 'taboola-interstitial-gallery-thumbnails-7', placement: 'Interstitial Gallery Thumbnails 7', target_type:'mix' }); _taboola.push({flush: true});\nPhoto: Gene J. Puskar, AP Image 1 of / 7 Caption Close Image 1 of 7 COR", 'Min_ppl': 1.3320189228616837, 'Max': '\nSEE ALSO: AOL Instant Messenger is being laid to rest and the internet is mourning very loudly\nwether ur sn was xsportybabe92x or u TyPeD lYk tHiS, aim wuz a truly special time in all of our lives\ntho communicating online has never ben easier then it is now, the thrill that aim inspired has yet 2 b matched by any of its predecessors. getting home frum skool n seein ur bff was online was more thrilling then the latest song by puddle of mudd. it was the on




In [None]:
print('MIN_PPL:\n\t{} - ppl = {:.2f}\n\t{} - ppl = {:.2f}'.format(pre_training['Min'][0:75], pre_training['Min_ppl'], post_training['Min'][0:75], post_training['Min_ppl']))
print('MAX_PPL:\n\t{} - ppl = {:.2f}\n\t{} - ppl = {:.2f}'.format(pre_training['Max'][0:75], pre_training['Max_ppl'], post_training['Max'][0:75], post_training['Max_ppl']))

MIN_PPL:
	 mode: 'thumbnails-c', container: 'taboola-interstitial-gallery-thumbnails- - ppl = 2.05
	 mode: 'thumbnails-c', container: 'taboola-interstitial-gallery-thumbnails- - ppl = 1.33
MAX_PPL:
	
“The person worked for a vendor who sold T-shirts and other clothing items - ppl = 338.36
	
SEE ALSO: AOL Instant Messenger is being laid to rest and the internet is  - ppl = 305.99
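The selection of the two extreme samples can be sketched as follows. This is a simplified stand-in, not the notebook's `highest_lowest_ppl_samples`: it assumes the per-sample mean losses have already been computed, and the function name, the sample strings and the middle loss value are hypothetical. Only the shape of the returned dictionary mirrors the output shown above.

```python
import math

def lowest_highest_ppl(samples, losses):
    """Pick the samples with the lowest and highest perplexity.

    `losses` holds one mean cross-entropy loss per sample.
    """
    ppls = [math.exp(loss) for loss in losses]
    lo = min(range(len(samples)), key=ppls.__getitem__)
    hi = max(range(len(samples)), key=ppls.__getitem__)
    return {
        "Min": samples[lo], "Min_ppl": ppls[lo],
        "Max": samples[hi], "Max_ppl": ppls[hi],
    }

# Toy input: two losses are back-computed from the perplexities printed
# above, the middle one is made up for illustration.
toy_samples = ["ad-widget boilerplate", "ordinary news text", "chat-speak text"]
toy_losses = [math.log(1.33), 3.0, math.log(305.99)]
result = lowest_highest_ppl(toy_samples, toy_losses)
print(result)
```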


## Comparisons between models

First, we load the fine-tuned models from storage. This is particularly useful when working in a different runtime from the one that performed the training.

In [None]:
# Reload models from checkpoints

path='/content/drive/MyDrive/NLU project/'

model = GPT2LMHeadModel.from_pretrained('gpt2')
wiki_model = GPT2LMHeadModel.from_pretrained(path+'wikitext-2/model')
amazon_model = GPT2LMHeadModel.from_pretrained(path+'amazon_reviews_multi/model')
cc_model = GPT2LMHeadModel.from_pretrained(path+'cc_news/model')

### wikitext

We load and preprocess the `wikitext` dataset as in the previous steps, then evaluate each model on it and report the resulting perplexities.

In [None]:
from datasets import load_dataset
from transformers import Trainer, TrainingArguments
import math

dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')
tokenized_dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
datasets = tokenized_dataset.map(group_and_chunk, batched=True, batch_size=1000, num_proc=4)

tmp_training_args = TrainingArguments(
    path+"cross_testing",
)

original_on_wiki_trainer = Trainer(
    model=model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

wiki_on_wiki_trainer = Trainer(
    model=wiki_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

amazon_on_wiki_trainer = Trainer(
    model=amazon_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

cc_on_wiki_trainer = Trainer(
    model=cc_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

original_on_wiki_eval_results = original_on_wiki_trainer.evaluate()
wiki_on_wiki_eval_results = wiki_on_wiki_trainer.evaluate()
amazon_on_wiki_eval_results = amazon_on_wiki_trainer.evaluate()
cc_on_wiki_eval_results = cc_on_wiki_trainer.evaluate()

print(f"Perplexity of original model on wiki: {math.exp(original_on_wiki_eval_results['eval_loss']):.2f}")
print(f"Perplexity of wiki model on wiki: {math.exp(wiki_on_wiki_eval_results['eval_loss']):.2f}")
print(f"Perplexity of amazon model on wiki: {math.exp(amazon_on_wiki_eval_results['eval_loss']):.2f}")
print(f"Perplexity of cc model on wiki: {math.exp(cc_on_wiki_eval_results['eval_loss']):.2f}")

Reusing dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)



Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/cache-047a92b0a522ade3.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/cache-1b3e4910d080860b.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/cache-faff2ce990115c51.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/cache-0c909f58aa265c2c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20/cache-0a2745020e99af17.arrow
Loading cached 

***** Running Evaluation *****
  Num examples = 1931
  Batch size = 8


***** Running Evaluation *****
  Num examples = 1931
  Batch size = 8


***** Running Evaluation *****
  Num examples = 1931
  Batch size = 8


Perplexity of original model on wiki: 59.12
Perplexity of wiki model on wiki: 29.76
Perplexity of amazon model on wiki: 90.97
Perplexity of cc model on wiki: 67.17
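The four nearly identical `Trainer` definitions could also be folded into a single loop. The helper below is a hypothetical sketch, not part of the original notebook: `perplexity_table` and `eval_loss_fn` are names introduced here, and the demo uses stand-in losses back-computed from the perplexities printed above so that it runs without loading any model.

```python
import math

def perplexity_table(models, eval_loss_fn):
    """Return {name: perplexity} for every entry of `models`.

    `eval_loss_fn` is any callable that maps a model to its mean
    cross-entropy loss on the evaluation split.
    """
    return {name: math.exp(eval_loss_fn(m)) for name, m in models.items()}

# Stand-in demo: strings replace the real models, and the losses are
# derived from the reported perplexities (an assumption for illustration).
stub_losses = {
    "original": math.log(59.12),
    "wiki": math.log(29.76),
    "amazon": math.log(90.97),
    "cc": math.log(67.17),
}
demo = perplexity_table({name: name for name in stub_losses}, stub_losses.get)
for name, ppl in demo.items():
    print(f"Perplexity of {name} model on wiki: {ppl:.2f}")
```

With the notebook's objects, `eval_loss_fn` could plausibly be `lambda m: Trainer(model=m, args=tmp_training_args, eval_dataset=datasets["validation"]).evaluate()["eval_loss"]`, with `models` mapping each name to `model`, `wiki_model`, `amazon_model` and `cc_model`.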


### amazon_reviews_multi

In this case we evaluate the models on the `amazon_reviews_multi` dataset.

In [None]:
dataset = load_dataset( 'amazon_reviews_multi', 'en')
dataset = dataset.remove_columns(['review_id', 'product_id', 'reviewer_id', 'stars', 'review_title', 'language', 'product_category'])  # Just retain the text, remove the rest
dataset = dataset.rename_column('review_body', 'text')

tokenized_dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
datasets = tokenized_dataset.map(group_and_chunk, batched=True, batch_size=1000, num_proc=4)

tmp_training_args = TrainingArguments(
    path+"cross_testing",
)

original_on_amazon_trainer = Trainer(
    model=model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

wiki_on_amazon_trainer = Trainer(
    model=wiki_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

amazon_on_amazon_trainer = Trainer(
    model=amazon_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

cc_on_amazon_trainer = Trainer(
    model=cc_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["validation"],
)

original_on_amazon_eval_results = original_on_amazon_trainer.evaluate()
wiki_on_amazon_eval_results = wiki_on_amazon_trainer.evaluate()
amazon_on_amazon_eval_results = amazon_on_amazon_trainer.evaluate()
cc_on_amazon_eval_results = cc_on_amazon_trainer.evaluate()

print(f"Perplexity of original model on amazon: {math.exp(original_on_amazon_eval_results['eval_loss']):.2f}")
print(f"Perplexity of wiki model on amazon: {math.exp(wiki_on_amazon_eval_results['eval_loss']):.2f}")
print(f"Perplexity of amazon model on amazon: {math.exp(amazon_on_amazon_eval_results['eval_loss']):.2f}")
print(f"Perplexity of cc model on amazon: {math.exp(cc_on_amazon_eval_results['eval_loss']):.2f}")

Downloading and preparing dataset amazon_reviews_multi/en (download: 82.11 MiB, generated: 58.69 MiB, post-processed: Unknown size, total: 140.79 MiB) to /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609...

Dataset amazon_reviews_multi downloaded and prepared to /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609. Subsequent calls will reuse this data.

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Evaluation *****
  Num examples = 1602
  Batch size = 8


***** Running Evaluation *****
  Num examples = 1602
  Batch size = 8


***** Running Evaluation *****
  Num examples = 1602
  Batch size = 8


***** Running Evaluation *****
  Num examples = 1602
  Batch size = 8


Perplexity of original model on amazon: 59.80
Perplexity of wiki model on amazon: 131.50
Perplexity of amazon model on amazon: 36.96
Perplexity of cc model on amazon: 58.61


### cc_news

In this last case the models are provided with the cc_news dataset.

In [None]:
dataset = load_dataset('cc_news')
dataset = dataset["train"].shard(num_shards=100, index=0)
dataset = dataset.remove_columns(['title', 'domain', 'date', 'description', 'url', 'image_url'])
dataset = dataset.train_test_split(test_size=0.1)
tokenized_dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
datasets = tokenized_dataset.map(group_and_chunk, batched=True, batch_size=1000, num_proc=4)

tmp_training_args = TrainingArguments(
    path+"cross_testing",
)

original_on_cc_trainer = Trainer(
    model=model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

wiki_on_cc_trainer = Trainer(
    model=wiki_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

amazon_on_cc_trainer = Trainer(
    model=amazon_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

cc_on_cc_trainer = Trainer(
    model=cc_model,
    args=tmp_training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

original_on_cc_eval_results = original_on_cc_trainer.evaluate()
wiki_on_cc_eval_results = wiki_on_cc_trainer.evaluate()
amazon_on_cc_eval_results = amazon_on_cc_trainer.evaluate()
cc_on_cc_eval_results = cc_on_cc_trainer.evaluate()

print(f"Perplexity of original model on cc: {math.exp(original_on_cc_eval_results['eval_loss']):.2f}")
print(f"Perplexity of wiki model on cc: {math.exp(wiki_on_cc_eval_results['eval_loss']):.2f}")
print(f"Perplexity of amazon model on cc: {math.exp(amazon_on_cc_eval_results['eval_loss']):.2f}")
print(f"Perplexity of cc model on cc: {math.exp(cc_on_cc_eval_results['eval_loss']):.2f}")

Reusing dataset cc_news (/root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b)



Loading cached split indices for dataset at /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b/cache-418825a28d992132.arrow and /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b/cache-12bb78e54e32bcc9.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b/cache-f9583e077a90a7ae.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b/cache-cd2bea28bd521e0b.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/6cdde8d7fdaae3e50fb61b5d08d5387c2f0bbea1ee68755ef954af539a6a3a1b/cache-015de278d94aa83c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/cc_news/p

***** Running Evaluation *****
  Num examples = 2718
  Batch size = 8


***** Running Evaluation *****
  Num examples = 2718
  Batch size = 8


***** Running Evaluation *****
  Num examples = 2718
  Batch size = 8


Perplexity of original model on cc: 38.88
Perplexity of wiki model on cc: 94.40
Perplexity of amazon model on cc: 55.74
Perplexity of cc model on cc: 24.19


### Visualize perplexities

As a final step, we collect the obtained perplexity values in a table to get an overall view of the results.

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Model": pd.Categorical(["original", "fine tuned on wikitext", "fine tuned on amazon_reviews", "fine tuned on cc_news"]),
        "Perplexity on wikitext": np.array([math.exp(original_on_wiki_eval_results['eval_loss']), math.exp(wiki_on_wiki_eval_results['eval_loss']), math.exp(amazon_on_wiki_eval_results['eval_loss']), math.exp(cc_on_wiki_eval_results['eval_loss'])]),
        "Perplexity on amazon_reviews": np.array([math.exp(original_on_amazon_eval_results['eval_loss']), math.exp(wiki_on_amazon_eval_results['eval_loss']), math.exp(amazon_on_amazon_eval_results['eval_loss']), math.exp(cc_on_amazon_eval_results['eval_loss'])]),
        "Perplexity on cc_news": np.array([math.exp(original_on_cc_eval_results['eval_loss']), math.exp(wiki_on_cc_eval_results['eval_loss']), math.exp(amazon_on_cc_eval_results['eval_loss']), math.exp(cc_on_cc_eval_results['eval_loss'])]),
    }
)

display(df)

| | Model | Perplexity on wikitext | Perplexity on amazon_reviews | Perplexity on cc_news |
|---|---|---|---|---|
| 0 | original | 59.121695 | 59.800978 | 38.879353 |
| 1 | fine tuned on wikitext | 29.755059 | 131.497932 | 94.401627 |
| 2 | fine tuned on amazon_reviews | 90.965899 | 36.960074 | 55.737504 |
| 3 | fine tuned on cc_news | 67.172469 | 58.607422 | 24.194568 |
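To make the pattern in the table easier to read, the sketch below finds the best model for each dataset and expresses every fine-tuned model's perplexity as a ratio to the original model's. The numbers are copied from the table; the dictionary layout and variable names are illustrative choices, not part of the notebook.

```python
# Perplexities copied from the table (outer key: model, inner key: dataset).
ppl = {
    "original":       {"wikitext": 59.121695, "amazon_reviews": 59.800978,  "cc_news": 38.879353},
    "wikitext":       {"wikitext": 29.755059, "amazon_reviews": 131.497932, "cc_news": 94.401627},
    "amazon_reviews": {"wikitext": 90.965899, "amazon_reviews": 36.960074,  "cc_news": 55.737504},
    "cc_news":        {"wikitext": 67.172469, "amazon_reviews": 58.607422,  "cc_news": 24.194568},
}
dataset_names = list(ppl["original"])

# Model with the lowest perplexity on each dataset.
best = {d: min(ppl, key=lambda m: ppl[m][d]) for d in dataset_names}

# Ratio to the original model: above 1 means worse than the original.
ratio = {
    m: {d: ppl[m][d] / ppl["original"][d] for d in dataset_names}
    for m in ppl if m != "original"
}

print(best)
for m, row in ratio.items():
    print(m, {d: round(r, 2) for d, r in row.items()})
```

Each fine-tuned model is the best one on its own dataset, and most off-diagonal ratios are above 1, which is the specialization effect the table suggests.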


### Compare generative capabilities

The idea here is to make the models generate some text from a given sentence start and to evaluate how the different fine-tunings affected their behaviour. As a first seed sentence we use the generic `"Hello, I'm a language model,"`.

In [None]:
from transformers import pipeline, set_seed
from random import randint

original_generator = pipeline('text-generation', model=model.to('cpu'), tokenizer=tokenizer)
wiki_generator = pipeline('text-generation', model=wiki_model, tokenizer=tokenizer)
amazon_generator = pipeline('text-generation', model=amazon_model, tokenizer=tokenizer)
cc_generator = pipeline('text-generation', model=cc_model, tokenizer=tokenizer)
set_seed(42) # For reproducibility

num_seq = 5
seed_sentence = "Hello, I'm a language model,"
texts = {}
texts['original model'] = original_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['wiki model'] = wiki_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['amazon model'] = amazon_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['cc model'] = cc_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']

print('\nSeed sentence:{}\n'.format(seed_sentence)+'-'*50)
for item in texts.items():
  print(item)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Seed sentence:Hello, I'm a language model,
--------------------------------------------------
('original model', "Hello, I'm a language model, I'm writing a new language for you. But first, I'd like to tell you about the language itself")
('wiki model', 'Hello, I\'m a language model, one of the best ever produced in our history. " \n Her last performance was in 1964. She left')
('amazon model', "Hello, I'm a language model, but it's so much more than what I'm used to, and it is just so confusing, especially in")
('cc model', "Hello, I'm a language model, so for the most part, I'm just a beginner.\nBut since I'm now a master, it")


As the results above show, the four models behave similarly on this sentence start, meaning that they all generalize decently to it. Only the wiki model has some problems, since it ends the sentence in a totally incoherent way.

As a second experiment a new sentence start is taken from the `wikitext` dataset:

In [None]:
seed_sentence = "In the 1940s and 50s,"
texts = {}
texts['original model'] = original_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['wiki model'] = wiki_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['amazon model'] = amazon_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['cc model'] = cc_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']

print('\nSeed sentence:{}\n'.format(seed_sentence)+'-'*50)
for item in texts.items():
  print(item)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Seed sentence:In the 1940s and 50s,
--------------------------------------------------
('original model', 'In the 1940s and 50s, the US military trained several hundred Special Forces and trained them to carry out air strikes against communist-dominated Warsaw Pact')
('wiki model', 'In the 1940s and 50s, a wide variety of activities for young musicians existed in London, from classical concerts to concerts, particularly at the Barb')
('amazon model', 'In the 1940s and 50s, this would have been a very nice and stylish shoe. Some were uncomfortable and others felt like they would stay for')
('cc model', 'In the 1940s and 50s, the group moved to Los Angeles from Connecticut.\n"When we moved in, \'Hey, I like the')


This time the behaviour changes completely: the original model and the one fine-tuned on `wikitext` are able to incorporate meaningful information in the generated strings, whereas the other two models produce phrases that are somewhat incoherent, in particular at the end of the sentence.

As a third experiment, the sentence start is extracted from `amazon_reviews`:

In [None]:
seed_sentence = "Perfect for winter season,"
texts = {}
texts['original model'] = original_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['wiki model'] = wiki_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['amazon model'] = amazon_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['cc model'] = cc_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']

print('\nSeed sentence:{}\n'.format(seed_sentence)+'-'*50)
for item in texts.items():
  print(item)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Seed sentence:Perfect for winter season,
--------------------------------------------------
('original model', 'Perfect for winter season, these t-shirts will look fantastic on an outdoor porch. You can give these t-shirts to friends and family for $')
('wiki model', 'Perfect for winter season, the lake also runs over a vast area of the watershed. The largest lake of its kind in southeastern Canada, the Oj')
('amazon model', 'Perfect for winter season, I used for my baby in summerMy husband loved this rug! It is so cute. Very well designed but we kept it')
('cc model', 'Perfect for winter season, which makes this recipe even more unique because it combines four sweet peas and two quinoa-based dishes.\n3. Bring')


The situation is pretty similar to the one described above: the original model and the one fine-tuned on `amazon_reviews` produce a coherent output, the `cc_model` tries to incorporate knowledge from recipes in a strange way, whereas the `wiki_model` outputs a descriptive sentence with very low coherence.

Lastly, the sentence start is taken from the `cc_news` dataset:

In [None]:
seed_sentence = "The report sets out a series of recommendations,"
texts = {}
texts['original model'] = original_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['wiki model'] = wiki_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['amazon model'] = amazon_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']
texts['cc model'] = cc_generator(seed_sentence, max_length=30, num_return_sequences=num_seq)[randint(0, num_seq-1)]['generated_text']

print('\nSeed sentence:{}\n'.format(seed_sentence)+'-'*50)
for item in texts.items():
  print(item)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Seed sentence:The report sets out a series of recommendations,
--------------------------------------------------
('original model', 'The report sets out a series of recommendations, for which the government is committed to working out "what the best solution to the problem has been".\n')
('wiki model', 'The report sets out a series of recommendations, and says there are many variables that make the assessment challenging – particularly considering that it does not attempt to explain')
('amazon model', 'The report sets out a series of recommendations, which is exactly what we wanted. We believe these would have given us the right level of insight into the')
('cc model', 'The report sets out a series of recommendations, including proposals for action, to encourage action to tackle homelessness, and improve child welfare services throughout Scotland.\n')


In this case too, the situation is similar to the one described above: the original model and the one fine-tuned on `cc_news` are able to produce totally plausible and meaningful output, whereas the other two versions of GPT2 try to complete the sentence by including structures and patterns they have learned in the fine-tuning phase.

As a final consideration it is worth noting that:
*   The original model has high generalization capabilities, since it produces good outputs regardless of the input it is given;
*   The fine-tuned models are very good in the scenarios they have been tuned for, but they struggle when they are presented with new situations;
*   The fine-tuning drives the behaviour of the models: the one trained on Wikipedia texts tends to produce descriptive and informative sentences, the one trained on Amazon reviews usually outputs personal "opinions" in the first person singular and plural ("*I used it*", "*We wanted*", ...), and the one trained on news articles describes realistic situations and proposes information found in newspapers, such as recipes.

This behaviour is entirely expected, since the aim of fine-tuning is to refine a general model so that it solves a specific task, but it could perhaps be mitigated a little by changing the learning parameters in the training phase.

