Key Concepts in Fine-Tuning LLMs:

Pre-training:

A model such as GPT (Generative Pre-trained Transformer) or BERT is first trained on a massive corpus of text (books, articles, websites, and so on) without task-specific labels.
The goal of pre-training is for the model to learn general language patterns, structure, and world knowledge (grammar, semantics, facts).
This is done with self-supervised objectives, where the training signal comes from the text itself: predicting the next token in a sequence (GPT-style causal language modeling) or filling in masked-out tokens (BERT-style masked language modeling).
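
To make the two pre-training objectives concrete, here is a minimal sketch in plain Python. It assumes whitespace-split words as tokens (real models use subword tokenizers), and the helper names `next_token_pairs` and `mask_tokens` are hypothetical, chosen only for illustration.

```python
import random

def next_token_pairs(tokens):
    """Build (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; return the masked input and the hidden targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model would be trained to recover this token
        else:
            masked.append(tok)
    return masked, targets

# Assumption: whitespace tokens stand in for real subword tokens
tokens = "the cat sat on the mat".split()

# Causal language modeling: every prefix predicts the token that follows it
for context, target in next_token_pairs(tokens):
    print(context, "->", target)

# Masked language modeling: some positions are hidden and must be filled in
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked, targets)
```

In both cases the targets are taken from the text itself, which is why no hand-written labels are needed at this stage.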

Fine-tuning:

Once the model is pre-trained, fine-tuning adjusts its weights using a smaller, task-specific dataset.
Fine-tuning is usually supervised, meaning it uses labeled data for the target task (e.g., question-answer pairs for a QA task or sentiment labels for sentiment analysis). In the causal language modeling example below, the labels are simply the input tokens themselves.
The model is trained for a few more epochs on this smaller dataset, updating the pre-trained weights to specialize the model for the desired task.

**Installation and Initial Setup**

In [None]:
!pip install transformers datasets

**Loading and Sampling the Dataset**

In [None]:
from datasets import load_dataset
dataset = load_dataset("imdb",split="train[:1%]")
print(dataset[0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

The string train[:1%] is a slicing expression that selects a portion of the train split.

train: Refers to the training portion of the dataset.
[:1%]: Slices the train split to include only the first 1% of the training data (250 of the 25,000 reviews).

This mirrors Python's start:stop slicing, which runs from start up to but not including stop. With percentages, the boundaries are expressed as a share of the split rather than as row indices, so [:1%] takes the first 1% of rows.
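
The arithmetic behind a percentage slice can be sketched in plain Python. The helper `percent_slice_bounds` below is hypothetical and only illustrates the idea; it is not the datasets library's implementation, and it assumes boundaries are rounded to the nearest row.

```python
def percent_slice_bounds(num_rows, start_pct=0, stop_pct=100):
    """Convert percentage boundaries into row indices (hypothetical helper)."""
    start = round(num_rows * start_pct / 100)
    stop = round(num_rows * stop_pct / 100)
    return start, stop

train_rows = 25000  # the IMDB train split contains 25,000 reviews
start, stop = percent_slice_bounds(train_rows, 0, 1)
print(start, stop, stop - start)  # 0 250 250

# The same selection on an ordinary Python list
rows = list(range(train_rows))
subset = rows[start:stop]
print(len(subset))  # 250
```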

**Data Preprocessing**

In [None]:
def preprocess(batch):
    # With batched=True, batch['text'] is a list of review strings
    batch['text'] = [text.replace('\n', '') for text in batch['text']]
    return batch

# Apply preprocessing to the dataset
dataset = dataset.map(preprocess, batched=True)

The map function in the Hugging Face datasets library applies a function (preprocess in this case) to the dataset.

With batched=True, the function receives a batch of examples at a time instead of a single example: each column arrives as a list of values (up to 1,000 examples per batch by default), and the function returns the transformed columns. Processing in batches is typically faster than calling the function once per example.
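
To show what a batched function actually receives, the sketch below imitates the calling pattern in plain Python. `apply_batched` is a hypothetical stand-in written for this illustration, not the library's map, and the three-row dataset is made up.

```python
def apply_batched(columns, fn, batch_size=1000):
    """Call fn on successive column-wise batches and stitch the results together."""
    num_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict mapping column name -> list of values
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        result = fn(batch)
        for name, values in result.items():
            out.setdefault(name, []).extend(values)
    return out

def preprocess(batch):
    batch["text"] = [text.replace("\n", "") for text in batch["text"]]
    return batch

# Made-up rows, used only to exercise the function
data = {"text": ["good\nmovie", "bad\nplot", "fine"], "label": [1, 0, 1]}
cleaned = apply_batched(data, preprocess, batch_size=2)
print(cleaned)  # {'text': ['goodmovie', 'badplot', 'fine'], 'label': [1, 0, 1]}
```

Note that replacing a newline with an empty string joins the neighboring words; replacing it with a space may be preferable when the text contains line breaks between words.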

**Load a pre-trained model and tokenizer for fine-tuning**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# GPT-2 tokenizers have no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

**Tokenizing the Data**

In [None]:
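def tokenize_function(examples):
    # Pad or truncate every review to the model's maximum length (1024 tokens for distilgpt2)
    tokenized = tokenizer(examples['text'], padding='max_length', truncation=True)
    # For causal language modeling the labels are the input IDs; the model shifts them internally
    tokenized['labels'] = tokenized['input_ids'].copy()
    return tokenized

tokenized_data = dataset.map(tokenize_function, batched=True)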
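
Because every review is padded to the maximum length, copying input_ids into labels also includes the padding positions in the loss. One common refinement, sketched here in plain Python, is to set the labels at padded positions to -100, the value that PyTorch's cross-entropy loss ignores by default. The helper `mask_padding_labels` and the token IDs are hypothetical; only 50256 (the GPT-2 end-of-sequence ID, used here as padding) is real. This is an optional variation, not what the cell above does.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Illustrative batch: the first sequence has two padded positions (attention_mask == 0)
batch = {
    "input_ids": [[40, 8871, 50256, 50256], [464, 7110, 318, 922]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 1]],
}
batch["labels"] = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(batch["labels"])  # [[40, 8871, -100, -100], [464, 7110, 318, 922]]
```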

**Configuring Training Parameters**

In [None]:
import os

# Disable Weights & Biases logging so training does not prompt for an API key
os.environ["WANDB_DISABLED"] = "true"


In [None]:
from transformers import TrainingArguments

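training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    report_to="none",                # do not report to external experiment trackers
    evaluation_strategy="epoch",     # evaluate after every epoch (named eval_strategy in newer transformers releases)
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    logging_dir='./logs',
    logging_steps=10,                # log the training loss every 10 steps
    save_total_limit=1               # keep only the most recent checkpoint
)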



**Divide the Dataset into Training and Evaluation Sets**

In [None]:
# Shuffle once (the seed value is arbitrary) so the train and eval slices cannot overlap
shuffled = tokenized_data.shuffle(seed=42)
train_data = shuffled.select(range(int(0.8*len(shuffled))))
eval_data = shuffled.select(range(int(0.8*len(shuffled)), len(shuffled)))
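
When a split is built by slicing a shuffled dataset, both slices need to come from the same permutation. The plain-Python sketch below illustrates why: it compares slicing two independent shuffles with slicing a single one. The seeds and variable names are arbitrary choices for this illustration.

```python
import random

n = 250               # number of examples in the sampled dataset
cut = int(0.8 * n)    # boundary between the train and eval slices

# Two independent shuffles: the eval slice can contain rows that are also in the train slice
first = random.Random(1).sample(range(n), n)
second = random.Random(2).sample(range(n), n)
overlap = set(first[:cut]) & set(second[cut:])

# One shuffle, sliced twice: the two slices are disjoint by construction
order = random.Random(1).sample(range(n), n)
train_idx, eval_idx = order[:cut], order[cut:]

print("shared rows with two shuffles:", len(overlap))
print("shared rows with one shuffle:", len(set(train_idx) & set(eval_idx)))
```

Any rows shared between the two sets leak training data into the evaluation, which makes the validation loss look better than it is.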

**Setting Up the Trainer & Fine-Tuning the Model**


In [None]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data
)

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.0291        | 0.955924        |


TrainOutput(global_step=50, training_loss=1.5496183776855468, metrics={'train_runtime': 2964.3136, 'train_samples_per_second': 0.067, 'train_steps_per_second': 0.017, 'total_flos': 52259350118400.0, 'train_loss': 1.5496183776855468, 'epoch': 1.0})
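
The step count and throughput in the TrainOutput can be cross-checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation (neither is set explicitly in the notebook); the variable names are illustrative.

```python
import math

num_examples = 250        # 1% of the 25,000-review IMDB train split
train_fraction = 0.8      # share of examples kept for training
batch_size = 4            # per_device_train_batch_size
epochs = 1                # num_train_epochs
runtime_s = 2964.3136     # train_runtime reported in the TrainOutput

train_examples = int(train_fraction * num_examples)
steps = math.ceil(train_examples / batch_size) * epochs

print(train_examples, steps)                           # 200 50
print(round(train_examples * epochs / runtime_s, 3))   # 0.067 samples per second
print(round(steps / runtime_s, 3))                     # 0.017 steps per second
```

These values match global_step, train_samples_per_second, and train_steps_per_second in the output above.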

**Save the model and tokenizer for future use**

In [None]:
model.save_pretrained("./model")
tokenizer.save_pretrained("./model")

('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.json',
 './model/merges.txt',
 './model/added_tokens.json',
 './model/tokenizer.json')

**Generate Text from a Prompt to Evaluate the Model**

In [None]:
prompt = "The script"
# Encode the prompt and move it to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Passing the attention mask and pad_token_id explicitly avoids the generation warnings
output = model.generate(**inputs, max_length=15, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))


The script is a bit of a mess, but it's a good one
