
# CS 195: Natural Language Processing
## Transfer Learning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb)

## Reference

Hugging Face NLP Course Chapter 1: Transformer Models https://huggingface.co/learn/nlp-course/chapter1/1

Hugging Face NLP Course Chapter 3: Fine-tuning a model with the Trainer API or Keras https://huggingface.co/learn/nlp-course/chapter3/1

Hugging Face NLP Course Chapter 7, Section 5: Summarization https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf

In [None]:
import sys
!{sys.executable} -m pip install --no-cache-dir datasets keras tensorflow sentencepiece transformers rouge_score



## Transfer Learning

**Transfer Learning** is the process of taking a model that was trained (**pre-trained**) on one task and then **fine-tuning** it for another task.

Today we're going to practice fine-tuning a pre-trained **transformer** model - we'll cover transformers in more detail next week, but they work a lot like the other neural network models we've looked at so far.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/pretraining.svg?raw=1" width=700>
    <br />
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/finetuning.svg?raw=1" width=700>
</div>

image source: https://huggingface.co/learn/nlp-course/chapter1/4?fw=tf

## Common pre-trained models

There are a variety of pre-trained models out there
* usually *very large*
* pretrained on *massive amounts of data*

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/model_parameters.png?raw=1" width=800>
</div>

**Encoders:** BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa
* Usually trained on masked input - model tries to predict the missing word in a sequence


**Decoders:** CTRL, GPT, GPT-2, Transformer XL
* Neural language models - usually trying to predict the next word in a sequence

**Encoder-Decoder Models:** BART, mBART, Marian, T5
* full sequence-to-sequence models
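
To make the difference between the encoder and decoder training objectives concrete, here is a minimal sketch that builds toy training examples for each. It uses whole words as tokens and a made-up `[MASK]` placeholder; the helper names are hypothetical and real models work on subword ids, so treat this as an illustration rather than how any particular model is trained.

```python
# Toy illustration of the two pre-training objectives; tokens are plain words here.
tokens = ["the", "ferry", "crosses", "the", "bay"]


def masked_example(tokens, position, mask_token="[MASK]"):
    """Encoder-style objective: hide one token and predict it from both sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target


def next_word_examples(tokens):
    """Decoder-style objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


masked_input, masked_target = masked_example(tokens, position=2)
print(masked_input, "->", masked_target)

for context, target in next_word_examples(tokens):
    print(context, "->", target)
```

An encoder-decoder model combines the two ideas: the encoder reads the whole input, and the decoder produces the output one token at a time.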


## Working Example

We're going to work through our text-to-emoji example, fine-tuning a variant of T5.

### Load and filter our dataset just like before

In [None]:
from datasets import load_dataset


# Define a function to check if 'text' is not None
def is_not_none(example):
    return example['text'] is not None

dataset = load_dataset("KomeijiForce/Text2Emoji",split="train")

# Filter the dataset
dataset = dataset.filter(is_not_none)
dataset

Dataset({
    features: ['text', 'emoji', 'topic'],
    num_rows: 503682
})

### Choosing a sample to work with

Even the smaller transformer models take too long to train on the full dataset during class.

Let's choose a small sample to work with instead.

In [None]:
# Shuffle the dataset
shuffled_dataset = dataset.shuffle(seed=42)

# Select a small sample
sample_size = 5000  # Define your sample size
sample_dataset = shuffled_dataset.select(range(sample_size))

#if you want to use the entire dataset just uncomment the following
#sample_dataset = shuffled_dataset

### Train/test split

Hugging Face datasets actually include a `train_test_split` function for splitting into training and testing sets if you don't already have them split.

In [None]:
dataset_split = sample_dataset.train_test_split(test_size=0.2)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 1000
    })
})
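
For intuition, the split can be pictured as "shuffle, then cut". The sketch below is a plain-Python approximation with a hypothetical helper name; it is not the library's implementation, and the exact rounding and shuffling that `train_test_split` uses may differ.

```python
import random


def simple_train_test_split(examples, test_size=0.2, seed=0):
    """Shuffle a copy of the examples, then cut off a test_size fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# 5000 stand-in examples, matching the sample size used in this notebook
train_rows, test_rows = simple_train_test_split(range(5000), test_size=0.2)
print(len(train_rows), len(test_rows))
```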

### Reminder of what the data looks like

In [None]:
print(dataset_split["train"]["text"][46])
print(dataset_split["train"]["emoji"][46])

Riding a ferry across the bay offers incredible views of the skyline.
⛴🌉🌊👀


### The Tokenizer

Since we are starting from an existing model, we need to prepare our data the same way the data was prepared when that model was trained.

**T5:** an encoder-decoder Transformer architecture suitable for sequence-to-sequence tasks

**mT5:** A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages

**mt5-small:** A small version of mT5, suitable for getting things working before attempting to train on a large model

`mt5-small` uses the SentencePiece tokenizer

In [None]:
from transformers import AutoTokenizer

#uses the sentencepiece tokenizer
model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


### Looking at an example of the tokenization

You'll see that the token ids get returned as `input_ids`

It also includes an `attention_mask`, which tells the model's attention mechanism which positions hold real tokens (1) and which are padding to be ignored (0) - it is all 1s here because a single sequence needs no padding.

In [None]:
inputs = tokenizer(dataset_split["train"]["text"][46])
inputs

{'input_ids': [47368, 347, 259, 262, 100174, 276, 259, 15259, 287, 7662, 259, 5760, 259, 87448, 6179, 304, 287, 20495, 1397, 260, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
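
The attention mask becomes more interesting once sequences of different lengths are batched together. Here is a minimal sketch of how padding and the mask line up. The helper is hypothetical, and `pad_id=0` is an assumption based on the `<pad>` token that T5-style tokenizers use; in practice the tokenizer or a data collator does this for you.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch_ids, batch_mask = pad_batch([[47368, 347, 1], [259, 262, 100174, 276, 1]])
print(batch_ids)
print(batch_mask)
```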

### Converting ids back to tokens

Here's what the tokens look like.

The `▁` (which marks the start of a new word) and `</s>` (which marks the end of the sequence) are hallmarks of the SentencePiece tokenizer algorithm

In [None]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁Rid',
 'ing',
 '▁',
 'a',
 '▁ferr',
 'y',
 '▁',
 'across',
 '▁the',
 '▁bay',
 '▁',
 'offers',
 '▁',
 'incredible',
 '▁views',
 '▁of',
 '▁the',
 '▁sky',
 'line',
 '.',
 '</s>']
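
To see what the `▁` marker is doing, here is a minimal sketch that glues the pieces above back into text by hand. The helper name is hypothetical; in practice `tokenizer.decode` does this (and more) for you.

```python
def pieces_to_text(pieces, eos_token="</s>"):
    """Join SentencePiece-style pieces, turning each word-boundary marker into a space."""
    kept = [piece for piece in pieces if piece != eos_token]
    return "".join(kept).replace("▁", " ").strip()


pieces = ['▁Rid', 'ing', '▁', 'a', '▁ferr', 'y', '▁', 'across', '▁the', '▁bay',
          '▁', 'offers', '▁', 'incredible', '▁views', '▁of', '▁the', '▁sky',
          'line', '.', '</s>']
text = pieces_to_text(pieces)
print(text)
```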

### How does it work on the emojis?

Fortunately, this seems to work pretty well for the emoji output too.

Some emojis may come back as `<unk>`, the token used for anything outside the tokenizer's vocabulary.

In [None]:
target = tokenizer(dataset_split["train"]["emoji"][46])
target

{'input_ids': [259, 2, 241593, 239651, 1], 'attention_mask': [1, 1, 1, 1, 1]}

In [None]:
tokenizer.convert_ids_to_tokens(target.input_ids)

['▁', '<unk>', '🌊', '👀', '</s>']

In [None]:
tokenizer.decode(target.input_ids)

'<unk>🌊👀</s>'

### Let's define a preprocessing function

This will allow us to tokenize both the text and the labels, and to add the token ids from the emojis under the `"labels"` key in the overall data structure, where it will be convenient to have them for training.

In [None]:
max_input_length = 100
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["emoji"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
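
As a rough picture of what `max_length` with `truncation=True` does, the sketch below caps a list of token ids while keeping the end-of-sequence id in the last slot. This is my understanding of the behavior for T5-style tokenizers rather than something taken from the library source; the helper is hypothetical, and `eos_id=1` is assumed from the trailing `1` in the tokenizer outputs above.

```python
def truncate_with_eos(token_ids, max_length, eos_id=1):
    """Keep at most max_length ids in total, reserving the final slot for EOS."""
    return token_ids[: max_length - 1] + [eos_id]


long_ids = list(range(10, 40))   # 30 made-up token ids
short_ids = [5, 6]

truncated = truncate_with_eos(long_ids, max_length=20)
untouched = truncate_with_eos(short_ids, max_length=20)
print(len(truncated), truncated[-1])
print(untouched)
```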



Hugging Face datasets have a `map` method that allows you to apply a preprocessing function like this to every example in the data set.

Notice that we get everything we had before (text, emoji, topic), but now we also have the input_ids (the tokens), the attention mask, and the labels (also token ids).

In [None]:
#turn the tokenized data back into a dataset
tokenized_datasets = dataset_split.map(preprocess_function, batched=True)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

### Grabbing the pre-trained model

As a reminder, `model_checkpoint` was defined earlier - it is `"google/mt5-small"`

Note that this is an encoder-decoder transformer model. The original T5 was pretrained on the 750 GB C4 dataset together with tasks for summarization, translation, question answering, and classification; mT5 was pretrained on mC4 only, so it generally needs fine-tuning before it is useful on a downstream task.

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


### Using a data collator

Hugging Face provides a Data Collator class which is used to collect the training data into batches and dynamically pad them so that each batch is appropriately padded but without an overall fixed length.

With `return_tensors="tf"` we're saying we want the data back in a data structure suitable for use with Keras/TensorFlow.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
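
Here is a minimal plain-Python sketch of the dynamic padding idea: each batch is padded only to its own longest sequence. The `collate` helper and the token ids are made up for illustration; `pad_id=0` and a label padding value of `-100` (which the loss is meant to ignore) are assumptions based on the defaults I believe `DataCollatorForSeq2Seq` uses with this tokenizer.

```python
def collate(batch, pad_id=0, label_pad_id=-100):
    """Pad one batch to its own longest input and longest label sequence."""
    in_len = max(len(ex["input_ids"]) for ex in batch)
    lab_len = max(len(ex["labels"]) for ex in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_in = in_len - len(ex["input_ids"])
        n_lab = lab_len - len(ex["labels"])
        out["input_ids"].append(ex["input_ids"] + [pad_id] * n_in)
        out["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_in)
        out["labels"].append(ex["labels"] + [label_pad_id] * n_lab)
    return out


examples = [
    {"input_ids": [11, 12, 1], "labels": [7, 1]},
    {"input_ids": [21, 22, 23, 24, 1], "labels": [8, 9, 1]},
    {"input_ids": [31, 1], "labels": [5, 1]},
    {"input_ids": [41, 42, 1], "labels": [6, 1]},
]

first = collate(examples[:2])    # padded to width 5
second = collate(examples[2:])   # padded to width 3 only
print(first["input_ids"])
print(second["input_ids"])
```

Padding per batch rather than to one global length keeps short batches small, which saves computation.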

Let's make a version of the dataset where the original text fields are removed so we can use it with the collator.

In [None]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(["text","emoji","topic"])
tokenized_datasets_no_text

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [None]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


### Setting up the optimizer

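When fine-tuning a pre-trained model, you usually want to use a smaller learning rate than you would when training from scratch, so the updates refine the pre-trained weights rather than overwrite them.

Note that we do not specify a loss function - Hugging Face TensorFlow models compute an appropriate loss internally when the batches include `labels`, so `compile` can be called without one.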

*NB:* I'm using values that were in the example on the website (https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf ) for a different dataset - I don't know if these are the best for this problem

In [None]:
from transformers import create_optimizer
import tensorflow as tf

num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16 - can be helpful if running on a GPU
#tf.keras.mixed_precision.set_global_policy("mixed_float16")
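
To get a feel for the schedule, here is a minimal sketch of a learning rate that decays linearly to zero over training. It assumes that `create_optimizer` with `num_warmup_steps=0` and its other defaults produces this kind of linear decay, which is my understanding but worth checking against the library documentation. The function name is hypothetical, and the step count assumes 4000 training examples in batches of 32.

```python
def linear_decay_lr(step, init_lr=5.6e-5, num_train_steps=1000):
    """Learning rate that falls linearly from init_lr to 0 over num_train_steps."""
    step = min(step, num_train_steps)
    return init_lr * (1 - step / num_train_steps)


steps_per_epoch = 4000 // 32           # 125 batches per epoch
num_train_steps = steps_per_epoch * 8  # 8 epochs

for step in (0, 250, 500, 750, num_train_steps):
    print(step, linear_decay_lr(step, num_train_steps=num_train_steps))
```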

In [None]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_train_epochs)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.src.callbacks.History at 0x2c4214e20>

### Saving a copy of the model's weights

This will allow us to load the model later and work with it without completely retraining.

In [None]:
model.save_pretrained("models/emoji-model-v2")

### Reload a saved model

In [None]:
#model = TFAutoModelForSeq2SeqLM.from_pretrained("models/emoji-model-v2")

### Inference

Let's suppose we have an example to get a prediction for. For now, let's grab one from the test set

In [None]:
print( tokenized_datasets["test"]["text"][15] )
print( tokenized_datasets["test"]["emoji"][15] )
print( tokenized_datasets["test"]["input_ids"][15] )

Marvel at the towering cathedral steeples and intricate stained glass windows. This stunning architectural wonder radiates a sense of divine presence and spirituality.
🏙️💒🧚⛪🚄💫🕊️🌸✨
[46577, 344, 287, 288, 176572, 317, 216387, 113489, 104793, 305, 281, 92804, 346, 259, 263, 29967, 27416, 20727, 260, 1494, 259, 263, 59976, 259, 262, 115957, 29100, 79398, 1837, 259, 262, 13336, 304, 64236, 265, 65901, 265, 305, 43498, 2302, 260, 1]


Use the `generate` method to get a prediction sequence from the input IDs.

If you don't already have the tokens, make sure to use your tokenizer first.

In [None]:
prediction = model.generate([tokenized_datasets["test"]["input_ids"][15]], max_length=max_target_length)
tokenizer.convert_ids_to_tokens(prediction[0])

['<pad>', '▁', '✨', '✨', '</s>']

In [None]:
decoded_output = tokenizer.decode(prediction[0], skip_special_tokens=True)
decoded_output

'✨✨'

## Applied Exploration

The applied exploration for this fortnight will be a little different. I want everyone to get some experience fine-tuning an existing model, so this will be the task for the entire fortnight.

Fine-tune an existing model with the following requirements
* Choose a different starting model - you can use any Hugging Face model, but consider starting with a general one like BART or Llama2.
* Choose a different data set - think about something that would be good to include in an application that interests you
* Evaluate how well it performed. For a sequence-to-sequence model, try going back and using ROUGE from Fortnight 1.
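
As a reminder of what ROUGE-1 measures, here is a minimal sketch that scores unigram overlap between a reference and a prediction. It splits on whitespace only and the function name is hypothetical, so its numbers will not always match the `rouge_score` package, which applies its own tokenization and optional stemming.

```python
from collections import Counter


def rouge1(reference, prediction):
    """Unigram-overlap precision, recall, and F1 between two whitespace-tokenized strings."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(p, r, f)
```

Averaging the F1 over the whole test set gives a single number for comparing models or training settings.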

In [None]:
import sys
!{sys.executable} -m pip install --no-cache-dir datasets keras tensorflow sentencepiece transformers rouge_score




In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
import tensorflow as tf
from rouge_score import rouge_scorer


In [None]:
from datasets import load_dataset

# Load the Spanish Alpaca instruction-following dataset
dataset = load_dataset("bertin-project/alpaca-spanish", split="train")

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

# Choose a different starting model (e.g., BART)
model_checkpoint = "facebook/bart-large"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

In [None]:
# Define max input and output lengths
max_input_length = 200
max_output_length = 200

# Preprocessing function
def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["input"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["output"], max_length=max_output_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
# Map preprocessing function to the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Load the pre-trained model
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFBartForConditionalGeneration.

All the layers of TFBartForConditionalGeneration were initialized from the model checkpoint at facebook/bart-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


In [None]:

# Build a collator for this tokenizer/model pair, then prepare the TensorFlow dataset
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets,
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8,  # Adjust batch size as needed
)

TypeError: ignored

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, create_optimizer

# Setting up the optimizer
num_train_epochs = 3
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

# Compile the model
model.compile(optimizer=optimizer)

NameError: ignored

In [None]:
# Train the model
model.fit(tf_train_dataset, epochs=num_train_epochs)


Epoch 1/3


In [None]:
# Save the model's weights
model.save_pretrained("models/translation-model")


In [None]:
# Inference example
example_index = 0  # Adjust as needed
# tokenized_datasets is a single Dataset (loaded with split="train"), so index its columns directly
input_text = tokenized_datasets["input"][example_index]
target_text = tokenized_datasets["output"][example_index]

inputs = tokenizer(input_text, return_tensors="tf", max_length=max_input_length, truncation=True)
prediction = model.generate(inputs["input_ids"], max_length=max_output_length, num_beams=4, length_penalty=2.0, early_stopping=True)
decoded_output = tokenizer.decode(prediction[0], skip_special_tokens=True)

# Print input, target, and output
print("Input Text:", input_text)
print("Target Text:", target_text)
print("Generated Text:", decoded_output)

# Evaluate using Rouge
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(target_text, decoded_output)

print("Rouge Scores:", scores)

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, create_optimizer
from rouge_score import rouge_scorer
import tensorflow as tf

In [None]:
# Load the multilingual translation dataset
dataset = load_dataset("wmt18", "tr-en", split="train[:5%]")

# Choose a pre-trained multilingual model (e.g., mBART)
model_checkpoint = "facebook/mbart-large-cc25"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)


Downloading builder script:   0%|          | 0.00/3.01k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/10.3k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/41.4k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/23.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/38.7M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files: 0it [00:00, ?it/s]

Generating train split:   0%|          | 0/205756 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3007 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3000 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [None]:
# Define max input and output lengths
max_input_length = 200
max_output_length = 200

# Preprocessing function
def preprocess_function(examples):
    # Check the structure of the dataset
    if isinstance(examples["translation"], list):
        # Assuming it's a list of dictionaries, extract the source and target languages
        source_language = [example["tr"] for example in examples["translation"]]
        target_language = [example["en"] for example in examples["translation"]]
    else:
        # If it's a dictionary, directly extract the source and target languages
        source_language = examples["translation"]["tr"]
        target_language = examples["translation"]["en"]

    # Tokenize the source and target languages
    model_inputs = tokenizer(
        source_language,
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        target_language, max_length=max_output_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
# Map preprocessing function to the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
# Data collator: pads each batch dynamically and returns TensorFlow tensors
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

# Build a batched TensorFlow dataset from the tokenized training data
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets, collate_fn=data_collator, shuffle=True, batch_size=8
)


# Setting up the optimizer
num_train_epochs = 5
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

Map:   0%|          | 0/10288 [00:00<?, ? examples/s]

tf_model.h5:   0%|          | 0.00/2.44G [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFMBartForConditionalGeneration.

All the layers of TFMBartForConditionalGeneration were initialized from the model checkpoint at facebook/mbart-large-cc25.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMBartForConditionalGeneration for predictions without further training.


generation_config.json:   0%|          | 0.00/205 [00:00<?, ?B/s]

In [None]:
# Load the pre-trained model
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Compile the model
model.compile(optimizer=optimizer)

# Train the model
model.fit(tf_train_dataset, epochs=num_train_epochs)

# Save the model's weights
model.save_pretrained("models/translation-model")

In [None]:
# Inference example (similar to the previous example)
example_index = 0  # Adjust as needed
# The dataset is Turkish-English, and each row's "translation" field is a dict keyed by language
input_text = tokenized_datasets[example_index]["translation"]["tr"]
target_text = tokenized_datasets[example_index]["translation"]["en"]

inputs = tokenizer(input_text, return_tensors="tf", max_length=max_input_length, truncation=True)
prediction = model.generate(inputs["input_ids"], max_length=max_output_length, num_beams=4, length_penalty=2.0, early_stopping=True)
decoded_output = tokenizer.decode(prediction[0], skip_special_tokens=True)

# Print input, target, and output
print("Input Text:", input_text)
print("Target Text:", target_text)
print("Generated Text:", decoded_output)

# Evaluate using Rouge
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(target_text, decoded_output)

print("Rouge Scores:", scores)