# Lecture 4: Fine-tuning a Language Model using Huggingface
Building is the best way to learn AI/ML.

## Fine-Tuning Our Language Model
Language modeling predicts words in a sentence, and it comes in several types. In causal language modeling, the task is to predict the next token in a sequence of tokens using only the tokens that came before it.
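
As a small illustration of the causal setup, the sketch below lists the (context, next token) pairs that a single sequence yields. The word-level tokens are an assumption made for readability; the model used later works on subword tokens.

```python
# Hypothetical word-level tokens, for illustration only
tokens = ["What", "is", "the", "Grotto", "?"]

# Each pair holds the tokens seen so far and the token to predict next
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(context, "->", target)
```

Every prefix of the sequence serves as a training context, and the model never sees tokens to the right of the one it is predicting.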

### Huggingface
Hugging Face is a community and data science platform for building, training and deploying ML models based on open source software.

### Loading up a dataset
Use the Datasets library, which has three main features:
* An efficient way to load and process data from raw files (CSV/JSON/text) or in-memory data
* A simple way to access and share datasets with the research and practitioner community
* Interoperability with pandas, NumPy, and DL frameworks like PyTorch and TensorFlow

**Def**. The SQuAD dataset (Stanford Question Answering Dataset) consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. Unanswerable questions were added in SQuAD 2.0; the "squad" dataset loaded below is version 1.1, in which every question has an answer.
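
To make the span idea concrete, here is a minimal sketch of one record shaped like the SQuAD features listed further down (`context`, `question`, `answers`). The passage and question are invented for illustration and are not taken from the dataset; the `text` and `answer_start` keys follow the layout of the `answers` field.

```python
# Invented passage and answer, not real SQuAD data
context = ("A speedrun is a playthrough of a video game with the goal of "
           "completing it as fast as possible.")
answer_text = "completing it as fast as possible"

record = {
    "context": context,
    "question": "What is the goal of a speedrun?",
    # The answer is stored as its text plus the character offset where it starts
    "answers": {"text": [answer_text],
                "answer_start": [context.index(answer_text)]},
}

# Recover the answer by slicing the passage at the stored offset
start = record["answers"]["answer_start"][0]
span = record["context"][start:start + len(answer_text)]
print(span)
```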

Fine-tune on SQuAD.

In [3]:
from datasets import load_dataset
dataset = load_dataset("squad")
dataset



DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

We can remove the columns that we are not going to use, and use the map function to append the special token that GPT-2 uses to mark the end of a document.

In [4]:
def add_end_of_text(example: dict) -> dict:
    example["question"] = example["question"] + "<|endoftext|>"
    return example

dataset = dataset.remove_columns(["id", "title", "context", "answers"])
dataset = dataset.map(add_end_of_text)




In [5]:
# Look at the structure of some entries
dataset["train"]["question"][:10]

['To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?<|endoftext|>',
 'What is in front of the Notre Dame Main Building?<|endoftext|>',
 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?<|endoftext|>',
 'What is the Grotto at Notre Dame?<|endoftext|>',
 'What sits on top of the Main Building at Notre Dame?<|endoftext|>',
 'When did the Scholastic Magazine of Notre dame begin publishing?<|endoftext|>',
 "How often is Notre Dame's the Juggler published?<|endoftext|>",
 'What is the daily student paper at Notre Dame called?<|endoftext|>',
 'How many student news papers are found at Notre Dame?<|endoftext|>',
 'In what year did the student paper Common Sense begin publication at Notre Dame?<|endoftext|>']

# Tokenizer
Next, convert the data into a format the model accepts. This is the job of a tokenizer, which prepares the inputs for a model.

A tokenization pipeline in HF comprises several steps:
1. Normalization (any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)
2. Pre-tokenization (splitting the input into words)
3. Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)
4. Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)

For example: Hello how are U tday?
1. hello how are u tday?
2. [hello, how, are, u, tday, ?]
3. [hello, how, are, u, td, ##ay, ?]
4. [CLS, hello, how, are, u, td, ##ay, ?, SEP]
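
The four steps can be sketched in plain Python as follows. This is a toy under stated assumptions, not the Hugging Face implementation: the vocabulary is invented so that the example sentence splits as shown, the subword step is a greedy longest-match in the WordPiece style, and the special tokens are written `[CLS]` and `[SEP]`.

```python
import re
import unicodedata

# Toy vocabulary, invented for this example only
VOCAB = {"hello", "how", "are", "u", "td", "##ay", "?"}

def normalize(text):
    # Step 1: lowercase and strip accents via Unicode decomposition
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def pre_tokenize(text):
    # Step 2: split into words and standalone punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def wordpiece(word):
    # Step 3: greedy longest-match split; continuation pieces carry "##"
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                break
            end -= 1
        if end == start:
            return ["[UNK]"]  # no piece of the vocabulary matches
        pieces.append(piece)
        start = end
    return pieces

def post_process(tokens):
    # Step 4: add the special tokens around the sequence
    return ["[CLS]"] + tokens + ["[SEP]"]

normalized = normalize("Hello how are U tday?")
words = pre_tokenize(normalized)
subwords = [piece for word in words for piece in wordpiece(word)]
final = post_process(subwords)
print(final)
```

Note that `[CLS]`, `[SEP]` and the `##` prefix belong to BERT-style tokenizers; the GPT-2 tokenizer used below follows the same pipeline but with different conventions.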

For tokenization, there are three main subword tokenization algorithms: BPE, WordPiece and Unigram.
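
To give a feel for how one of these algorithms builds its vocabulary, the sketch below runs two merge steps of BPE training on a tiny corpus. The corpus, its word counts and the helper names are assumptions made for illustration; real tokenizers add details such as byte-level handling.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by how often each word occurs
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: each word split into characters, with its frequency
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}

first_merge = most_frequent_pair(corpus)
corpus = merge_pair(corpus, first_merge)
second_merge = most_frequent_pair(corpus)
print(first_merge, second_merge)
```

Each merge adds one new symbol to the vocabulary; training repeats this until the vocabulary reaches the desired size.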

Since tokenization processes are model-specific, if we want to fine-tune the model on new data, we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained. This is done by the AutoTokenizer class.

In [19]:
from transformers import AutoTokenizer, AutoModel


In [7]:
model_checkpoint = "distilgpt2" 
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



Convert a sample sentence to tokens. 

In [9]:
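# The two adjacent string literals are joined with no space between them,
# so the text reads "...atHarvard."; add a space after "at" to separate the words.
sequence = ("This tokenizer is being applied in CS197 at"
            "Harvard.<|endoftext|>")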
tokens = tokenizer.tokenize(sequence)
print(tokens)

['This', 'Ġtoken', 'izer', 'Ġis', 'Ġbeing', 'Ġapplied', 'Ġin', 'ĠCS', '197', 'Ġat', 'Har', 'vard', '.', '<|endoftext|>']


In these models, the space before a word is treated as part of the word, and the tokenizer represents it with the special character Ġ. To convert tokens into numbers, the tokenizer uses a vocabulary, which is the part we download when we instantiate it with the from_pretrained method. We need to use the same vocabulary that was used when the model was pretrained.

In [11]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[1212, 11241, 7509, 318, 852, 5625, 287, 9429, 24991, 379, 13587, 10187, 13, 50256]
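
The lookup in both directions can be sketched with a plain dictionary built from the tokens and ids printed above. This is a simplification: the real tokenizer holds the full vocabulary and maps bytes rather than just replacing Ġ with a space.

```python
# Tokens and ids copied from the outputs above
tokens = ['This', 'Ġtoken', 'izer', 'Ġis', 'Ġbeing', 'Ġapplied', 'Ġin',
          'ĠCS', '197', 'Ġat', 'Har', 'vard', '.', '<|endoftext|>']
ids = [1212, 11241, 7509, 318, 852, 5625, 287, 9429, 24991, 379,
       13587, 10187, 13, 50256]

# A vocabulary is a token -> id mapping; decoding needs the inverse
token_to_id = dict(zip(tokens, ids))
id_to_token = {i: t for t, i in token_to_id.items()}

def toy_decode(id_list):
    # Join the tokens and turn the Ġ marker back into a space
    return "".join(id_to_token[i] for i in id_list).replace("Ġ", " ")

text = toy_decode(ids)
print(text)
```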


In [12]:
sequence = ("This tokenizer is being applied in CS197 at"
            "Harvard.<|endoftext|>")
tokenizer(sequence)

{'input_ids': [1212, 11241, 7509, 318, 852, 5625, 287, 9429, 24991, 379, 13587, 10187, 13, 50256], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Calling the tokenizer directly returns a dictionary with two important items:
1. input_ids: the indices corresponding to each token in the sentence
2. attention_mask: indicates whether a token should be attended to or not
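
The attention mask matters once sequences of different lengths are padded into one batch. The sketch below is a hand-rolled illustration, not the library's padding code; the helper name is hypothetical, and using id 50256 (the end-of-text token) as padding is an assumption.

```python
def pad_batch(batch, pad_id):
    # Pad every sequence on the right to the length of the longest one
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[1212, 11241, 7509], [318, 852]], pad_id=50256)
print(batch)
```

In the single-sentence output above there is nothing to pad, which is why every mask entry is 1.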

In [13]:
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["question"], truncation=True)

tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=['question']
    )


Here we used the Datasets map function. Setting batched=True processes multiple elements of the dataset at once, and num_proc=4 spreads the work across four processes.
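
The batched calling convention can be emulated in a few lines: the function receives a dictionary of columns, each holding up to batch_size values, and returns a dictionary of lists. The helper names below are hypothetical, and the sketch ignores multiprocessing and caching.

```python
def batched_map(columns, fn, batch_size=1000):
    # Feed fn one slice of every column at a time and collect its outputs
    n_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, n_rows, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        for key, values in fn(batch).items():
            out.setdefault(key, []).extend(values)
    return out

def count_words(batch):
    # batch["question"] is a list of strings, not a single string
    return {"n_words": [len(q.split()) for q in batch["question"]]}

data = {"question": [
    "What is the Grotto at Notre Dame?",
    "What is in front of the Notre Dame Main Building?",
    "What is the daily student paper at Notre Dame called?",
]}
result = batched_map(data, count_words, batch_size=2)
print(result)
```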

In [14]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 10570
    })
})

# Data Processing
For causal language modeling (CLM), one data preparation step is to concatenate the tokenized examples and then split the result into chunks of equal size, so that every example has the same length and no padding is needed. We use chunks of block_size = 128 tokens; leftover tokens at the end of a batch that do not fill a whole block are dropped. Because the map call uses batched=True, the function may return a different number of examples than it received, which is what lets us change the number of rows in the dataset.

In [15]:
block_size = 128
def group_texts(examples):
    # Repeat concatenation for input_ids and other keys
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size

    # Populate each of the input_ids and other keys
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)] for k, t in concatenated_examples.items()
    }

    # Add labels because we'll need it as the output
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
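
To see the effect of concatenating and chunking on a small input, here is a standalone sketch of the same idea with a block size of 4. The function name and the toy ids are assumptions; it handles a single list of sequences rather than a dictionary of columns.

```python
def group_into_blocks(sequences, block_size):
    # Concatenate all sequences into one long list of token ids
    flat = [tok for seq in sequences for tok in seq]
    # Keep only as many tokens as fill whole blocks; drop the remainder
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

blocks = group_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input sequences become two blocks, block boundaries ignore where the original sequences ended, and the last two tokens are discarded.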




In [16]:
print(lm_datasets['train']['input_ids'][0])

[2514, 4150, 750, 262, 5283, 5335, 7910, 1656, 287, 1248, 3365, 287, 406, 454, 8906, 4881, 30, 50256, 2061, 318, 287, 2166, 286, 262, 23382, 20377, 8774, 11819, 30, 50256, 464, 32520, 3970, 286, 262, 17380, 2612, 379, 23382, 20377, 318, 13970, 284, 543, 4645, 30, 50256, 2061, 318, 262, 10299, 33955, 379, 23382, 20377, 30, 50256, 2061, 10718, 319, 1353, 286, 262, 8774, 11819, 379, 23382, 20377, 30, 50256, 2215, 750, 262, 3059, 349, 3477, 11175, 286, 23382, 288, 480, 2221, 12407, 30, 50256, 2437, 1690, 318, 23382, 20377, 338, 262, 39296, 1754, 3199, 30, 50256, 2061, 318, 262, 4445, 3710, 3348, 379, 23382, 20377, 1444, 30, 50256, 2437, 867, 3710, 1705, 9473, 389, 1043, 379, 23382, 20377, 30, 50256, 818, 644, 614, 750, 262, 3710, 3348]


Use the tokenizer's decode function to go from the encoded ids back to text.

In [17]:
tokenizer.decode(lm_datasets['train']['input_ids'][0])

"To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?<|endoftext|>What is in front of the Notre Dame Main Building?<|endoftext|>The Basilica of the Sacred heart at Notre Dame is beside to which structure?<|endoftext|>What is the Grotto at Notre Dame?<|endoftext|>What sits on top of the Main Building at Notre Dame?<|endoftext|>When did the Scholastic Magazine of Notre dame begin publishing?<|endoftext|>How often is Notre Dame's the Juggler published?<|endoftext|>What is the daily student paper at Notre Dame called?<|endoftext|>How many student news papers are found at Notre Dame?<|endoftext|>In what year did the student paper"

Make a smaller version of the data so we can fine-tune our model in a reasonable amount of time. 

In [18]:
small_train_dataset = \
    lm_datasets['train'].shuffle(seed=42).select(range(100))
small_eval_dataset = \
    lm_datasets['validation'].shuffle(seed=42).select(range(100))

# Causal language modeling
Define the training arguments and set up our Trainer. The Trainer class provides an API for feature-complete training in PyTorch for most standard use cases.

We will push this model to the Hub, a Hugging Face platform where anyone can share and explore models, datasets and demos.


In [21]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")



In [25]:
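training_args = TrainingArguments(
    f"{model_checkpoint}-squad",    # output directory for checkpoints
    evaluation_strategy="epoch",    # evaluate once per epoch (newer transformers releases call this eval_strategy)
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,              # the model is uploaded manually later on
)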

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = small_train_dataset,
    eval_dataset = small_eval_dataset,
)

trainer.train()

***** Running training *****
  Num examples = 100
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 39

***** Running Evaluation *****
  Num examples = 100
  Batch size = 8

{'eval_loss': 3.7186625003814697, 'eval_runtime': 153.0631, 'eval_samples_per_second': 0.653, 'eval_steps_per_second': 0.085, 'epoch': 1.0}


Evaluate the model. Because we want our model to assign high probabilities to real sentences, we seek the model that assigns the highest probability to the test set. The metric we use is perplexity: the inverse probability of the test set, normalized by the number of words (here, tokens) in it. Equivalently, it is the exponential of the average cross-entropy loss, which is why the code exponentiates eval_loss. A lower perplexity is better:

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
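
The link between the two descriptions of perplexity can be checked on a toy example. The sketch below assumes we are given the probability the model assigned to each correct token; the function name is hypothetical.

```python
import math

def perplexity(token_probs):
    # Average negative log-likelihood per token, then exponentiate
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that spreads its probability evenly over 4 choices at every step
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform)

# The same conversion applied to the epoch-1 eval_loss reported in the log
from_loss = math.exp(3.7186625003814697)
print(f"{from_loss:.2f}")
```

A perplexity of 4 for the uniform case matches the intuition that the model is, on average, as uncertain as if it were choosing among 4 equally likely tokens.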

Upload our final model and tokenizer to the hub.


In [None]:
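# Save the tokenizer locally, then upload both tokenizer and model to the Hub
tokenizer.save_pretrained("gpt2-squad")
tokenizer.push_to_hub("gpt2-squad")
model.push_to_hub("gpt2-squad")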

# Generation with our fine-tuned model
We can now use the fine-tuned model to autocomplete some questions.


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("rajpurkar/gpt2-squad")
tokenizer = AutoTokenizer.from_pretrained("rajpurkar/gpt2-squad")



In [4]:
# Tokenize some text, including some context and the start of a question
start_text = ("A speedrun is a playthrough of a video game, \
or section of a video game, with the goal of \
completing it as fast as possible. Speedruns \
often follow planned routes, which may incorporate sequence \
breaking, and might exploit glitches that allow sections to \
be skipped or completed more quickly than intended. ")

prompt = "What is the"

inputs = tokenizer(
    start_text + prompt,
    add_special_tokens=False,
    return_tensors="pt"
)['input_ids']

In [5]:
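# Pass the input into the model for generation
# prompt_length is the character length of the full input (context + prompt)
prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(
    inputs,
    max_length=100,   # counts the input tokens as well as the new ones
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.9,
    num_return_sequences=1,
)

# Keep the leading space of the continuation so the words do not run together
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]
print(tokenizer.decode(outputs[0]))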


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A speedrun is a playthrough of a video game, or section of a video game, with the goal of completing it as fast as possible. Speedruns often follow planned routes, which may incorporate sequence breaking, and might exploit glitches that allow sections to be skipped or completed more quickly than intended. What is the name of the speedrun in an early video game?<|endoftext|>
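
The sampling arguments passed to generate (temperature, top_k, top_p) each narrow down which token is drawn next. The sketch below shows one common way to combine them on a toy distribution. It is an illustration under assumptions rather than the library's code: the logits are invented, the function names are hypothetical, and the library may differ in details such as tie handling.

```python
import math
import random

def nucleus_candidates(logits, temperature=0.9, top_k=50, top_p=0.95):
    # 1. Temperature: dividing by a value below 1 sharpens the distribution
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # 2. Top-k: keep only the k highest-scoring tokens
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability)
    peak = kept[0][1]
    weights = [(tok, math.exp(score - peak)) for tok, score in kept]
    total = sum(w for _, w in weights)
    probs = [(tok, w / total) for tok, w in weights]
    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return nucleus

def sample_next(logits, rng, **kwargs):
    # Draw one token from the filtered candidates, weighted by probability
    tokens, probs = zip(*nucleus_candidates(logits, **kwargs))
    return rng.choices(tokens, weights=probs, k=1)[0]

# Invented scores for the token that follows "What is the"
toy_logits = {" name": 5.0, " goal": 4.0, " fastest": 1.0, " zebra": -3.0}
candidates = nucleus_candidates(toy_logits, temperature=1.0, top_k=3, top_p=0.95)
choice = sample_next(toy_logits, random.Random(0),
                     temperature=1.0, top_k=3, top_p=0.95)
print([tok for tok, _ in candidates], choice)
```

Here top_k=3 removes " zebra", and top_p=0.95 then removes " fastest" because the two most likely tokens already cover more than 95% of the probability mass.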
