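Benefits of fine-tuning

* Transfer learning: reuse the general knowledge a model acquired during pre-training
* Time and resource efficiency: far cheaper than training a model from scratch
* Tailored responses: outputs match the style and domain of the target application
* Task-specific adaptation: the model specializes in the target task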

Pitfalls of fine-tuning

* Overfitting: avoid training on too small a dataset or for too many epochs
* Underfitting: ensure sufficient training and an appropriate learning rate so the model can learn adequately
* Catastrophic forgetting: prevent the model from losing its initial broad knowledge, which would hurt its performance on other NLP tasks
* Data leakage: keep the training and validation datasets separate
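
The data-leakage pitfall can be checked mechanically. The following is a minimal sketch, assuming the examples are plain strings and that a validation example counts as leaked when it matches a training example after lower-casing and whitespace normalization. `normalize` and `find_leakage` are hypothetical helper names, not part of any library.

```python
def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def find_leakage(train_texts, val_texts):
    """Return the validation examples whose normalized form also appears in training."""
    seen = {normalize(t) for t in train_texts}
    return [v for v in val_texts if normalize(v) in seen]


# Toy data (made up for illustration)
train = ["The movie was great.", "Terrible plot."]
val = ["the  movie was GREAT.", "A fresh review."]
leaked = find_leakage(train, val)
print(leaked)
```

Exact-match checks like this only catch verbatim copies; near-duplicates would need a fuzzier comparison.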


Three main approaches to fine-tuning models:

* Self-supervised fine-tuning: the model learns from a large unlabeled dataset by predicting missing words, such as the next word or masked words.
* Supervised fine-tuning: the model is fine-tuned on labeled data from the target task, improving its performance on specific tasks such as sentiment analysis.
* Reinforcement learning from human feedback (RLHF): the model is adjusted based on explicit feedback from humans.

These approaches can also be combined:

* `Hybrid fine tuning`: combining multiple techniques can further enhance model performance.

`Direct Preference Optimization (DPO):` an increasingly popular approach that optimizes large language models directly on `human preferences`. Some of its notable features are:

* Simplicity: DPO focuses directly on aligning model outputs with human preferences and judgments.
* No reward model: DPO requires no reward training, so there is no need to train an additional reward model.
* Faster convergence: DPO can converge faster because it relies on direct feedback.
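
To make the "no reward model" point concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses have already been computed under both the model being trained and a frozen reference model. The function name, argument names, and the value `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favors each response than the reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is zero
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy prefers the chosen response more than the reference does: lower loss
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```

The loss depends only on log-probabilities from the two language models, which is why no separately trained reward model appears anywhere in the computation.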
    

Supervised fine-tuning approaches:

1. Full fine-tuning: all parameters are tuned for the specific task.

2. Parameter-efficient fine-tuning (PEFT): only a small number of parameters are updated or added, leaving most of the original parameters unchanged. This is more efficient than full fine-tuning.
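
A rough sense of the savings can be had with simple arithmetic. The sketch below assumes LoRA, one common PEFT method (the notes above do not say which method is meant), in which the update to a `d_out x d_in` weight matrix is expressed as the product of two low-rank matrices. The rank of 8 and the 768 x 768 layer size are illustrative assumptions.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A, with B: d_out x rank and A: rank x d_in."""
    return rank * (d_out + d_in)


full = full_update_params(768, 768)
lora = lora_update_params(768, 768, rank=8)
print(f"full: {full:,}  low-rank: {lora:,}  ratio: {lora / full:.2%}")
```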

## Pre-training LLMs with Hugging Face

#### Installing Required Libraries

In [4]:
# Quiet install of pinned dependencies (`-y` is not a valid `pip install` option)
# Note: CUDA builds such as `+cu118` are served from the PyTorch wheel index rather than PyPI,
# so this pin may need `--extra-index-url https://download.pytorch.org/whl/cu118`
!pip install -q pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 torch==2.1.0+cu118
!pip install --upgrade pmdarima==2.0.2

In [5]:
# Version specifiers must not contain spaces
!pip install transformers==4.40.0
# Upgrades to the development version from GitHub, overriding the pin above
!pip install -U git+https://github.com/huggingface/transformers
!pip install --user datasets==2.15.0
# Quote specifiers that contain `>` or `*` so the shell does not interpret them
!pip install --user "portalocker>=2.0.0"
!pip install -q -U git+https://github.com/huggingface/accelerate.git


In [6]:
!pip install -q -U accelerate
!pip install --user torch==2.3.0
# Note: upgrading torchvision may replace the pinned torch with the version torchvision requires
!pip install -U torchvision
!pip install --user "protobuf==3.20.*"

In [14]:
import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)
from transformers import pipeline
#from datasets import load_dataset


from tqdm.auto import tqdm
import math
import time
import os

import warnings
def warn(*args,**kwargs):
    pass
warnings.warn = warn
warnings.filterwarnings('ignore')


In [19]:
! pip install datasets

Collecting datasets
  Using cached datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py312-none-any.whl.metadata (7.2 kB)
Downloading datasets-4.0.0-py3-none-any.whl (494 kB)
Downloading multiprocess-0.70.16-py312-none-any.whl (146 kB)
Downloading xxhash-3.5.0-cp312-cp312-macosx_11_0_arm64.whl (30 kB)
Installing collected packages: xxhash, multiprocess, datasets
Successfully installed datasets-4.0.0 multiprocess-0.70.16 xxhash-3.5.0


In [20]:
from datasets import load_dataset

In [21]:
# The Hugging Face package is named `datasets` (plural); `dataset` is an unrelated library
!pip install --user datasets



In [15]:
# Set the environment variable tokenizers_parallelism to 'false'
import os
os.environ["TOKENIZERS_PARALLELISM"] = 'false'

In [16]:
model = AutoModelForCausalLM.from_pretrained('facebook/opt-350m')
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

pine = pipeline(task = "text-generation", model = model, tokenizer = tokenizer)
print(pine("This movie was really")[0]["generated_text"])

Device set to use mps:0
Both `max_new_tokens` (=256) and `max_length`(=21) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


This movie was really fun. I liked it a lot. I'm not sure I would've liked it if it wasn't for the music and the story.
I like the music but I couldn't help but feel a bit underwhelmed by the story.


## Self-supervised training of a BERT model

Self-supervised training of a BERT model means training it on a large corpus of unlabeled text. This section walks through the steps involved in pre-training a BERT model with the masked language modeling (MLM) objective.

For this exercise, we will use the Hugging Face Transformers library, which provides pre-implemented BERT models and tools for pre-training.


* Prepare the training dataset
* Train a tokenizer
* Preprocess the dataset
* Pre-train BERT on an MLM task
* Evaluate the trained model
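
For the last step, a common way to summarize an evaluation loss is perplexity, the exponential of the mean cross-entropy loss. The following is a minimal sketch, assuming the loss is an average negative log-likelihood in nats. `perplexity` is a hypothetical helper name. With an MLM objective the value covers only the masked positions, so it is not directly comparable to the perplexity of a causal language model.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss)


# A loss of 0 means every target token was predicted with probability 1
print(perplexity(0.0))
# A loss of log(50) is as uncertain as a uniform choice among 50 tokens
print(perplexity(math.log(50)))
```

With the Hugging Face `Trainer`, the dictionary returned by `trainer.evaluate()` contains an `eval_loss` entry that can be passed to a function like this one.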



## Importing required datasets

The `WikiText dataset` is a widely used benchmark in natural language processing (NLP). It contains a large amount of text extracted from Wikipedia, a vast online encyclopedia covering a wide range of topics. The articles are preprocessed to remove formatting, hyperlinks, and other metadata, resulting in a clean text corpus.

The WikiText dataset has four configurations, and each is divided into three splits: a training set, a validation set, and a test set. The training set is used for training language models, while the validation and test sets are used for evaluating them. First, let's load the `wikitext-2-raw-v1` configuration.

Note: The original BERT was pre-trained on the Wikipedia and BookCorpus datasets.

In [22]:
## load the datasets
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

In [23]:
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


In [24]:
# check a sample record
dataset['train'][400]

{'text': " When Mason was injured in warm @-@ ups late in the year , Columbus was without an active goaltender on their roster . To remedy the situation , the team signed former University of Michigan goaltender Shawn Hunwick to a one @-@ day , amateur tryout contract . After being eliminated from the NCAA Tournament just days prior , Hunwick skipped an astronomy class and drove his worn down 2003 Ford Ranger to Columbus to make the game . He served as the back @-@ up to Allen York during the game , and the following day , he signed a contract for the remainder of the year . With Mason returning from injury , Hunwick was third on the team 's depth chart when an injury to York allowed Hunwick to remain as the back @-@ up for the final two games of the year . In the final game of the season , the Blue Jackets were leading the Islanders 7 – 3 with 2 : 33 remaining when , at the behest of his teammates , Head Coach Todd Richards put Hunwick in to finish the game . He did not face a shot . 

In [25]:
# Keep a small subset so the example trains quickly
dataset["train"] = dataset["train"].select(range(1000))
dataset["test"] = dataset["test"].select(range(200))

In [26]:
len(dataset["train"])

1000

In [29]:
dataset['test'][100]

{'text': ' Du Fu \'s popularity grew to such an extent that it is as hard to measure his influence as that of Shakespeare in England : it was hard for any Chinese poet not to be influenced by him . While there was never another Du Fu , individual poets followed in the traditions of specific aspects of his work : Bai Juyi \'s concern for the poor , Lu You \'s patriotism , and Mei Yaochen \'s reflections on the quotidian are a few examples . More broadly , Du Fu \'s work in transforming the lǜshi from mere word play into " a vehicle for serious poetic utterance " set the stage for every subsequent writer in the genre . \n'}

In [30]:
# path to save the datasets to text files
output_file_train = "wikitext_dataset_train.txt"
output_file_test = "wikitext_dataset_test.txt"


# Open the output file in the write mode
with open(output_file_train,"w",encoding = "utf-8") as f:

    # iterate over each example in the dataset
    for example in dataset["train"]:
        # write the example text to the file
        f.write(example["text"] + "\n")


# open the output file in write mode
with open(output_file_test,"w",encoding = "utf-8") as f:
    for example in dataset['test']:
        f.write(example['text']+"\n")

In [31]:
# Create a tokenizer from an existing one to reuse its special tokens
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

model_name = "bert-base-uncased"

model = AutoModelForCausalLM.from_pretrained(model_name, is_decoder = True)

### Training a Tokenizer (Optional)

In the previous cell, you created a tokenizer instance from a pre-trained BERT tokenizer. If you want to train the tokenizer on your own dataset, run the code below. This is especially helpful when applying transformers to specialized domains such as medicine, where the vocabulary differs from the general-purpose text that stock tokenizers were built on. (You can skip this step if you do not want to train the tokenizer on your own data.)


In [38]:
def batch_iterator(batch_size=10000):
    # Iterate over the training split; len(dataset) would be the number of splits, not rows
    train_split = dataset["train"]
    for i in tqdm(range(0, len(train_split), batch_size)):
        yield train_split[i : i + batch_size]["text"]


# Create a tokenizer from an existing one to reuse its special tokens
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")


# Train the tokenizer on our own dataset, using the original vocabulary size as the upper bound
bert_tokenizer = bert_tokenizer.train_new_from_iterator(
    text_iterator=batch_iterator(),
    vocab_size=len(bert_tokenizer.get_vocab()),
)



In [39]:
bert_tokenizer.vocab_size

12577

## Pretraining

In this step, we define the configuration of the BERT model and create the model.

## Define the BERT Configuration

Here, we define the configuration settings for a BERT model using `BertConfig`. This includes setting various parameters related to the model's architecture:

* **vocab_size = len(bert_tokenizer.get_vocab())**: Specifies the size of the vocabulary. This number must match the vocabulary size used by the tokenizer (12,577 for the tokenizer trained above; the stock `bert-base-uncased` vocabulary has 30,522 tokens).
* **hidden_size = 768**: Sets the size of the hidden layers.
* **num_hidden_layers = 12**: Determines the number of hidden layers in the transformer model.
* **num_attention_heads = 12**: Sets the number of attention heads in each attention layer.
* **intermediate_size = 3072**: Specifies the size of the "intermediate" (i.e., feed-forward) layer within the transformer.
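
The effect of `vocab_size` on model size can be estimated by hand. The sketch below counts only the embedding block (word, position, and token-type embeddings plus the layer norm), assuming 512 positions and 2 token types, which are the `BertConfig` defaults. `embedding_params` is a hypothetical helper name.

```python
def embedding_params(vocab_size, hidden_size=768, max_positions=512, type_vocab_size=2):
    """Count the parameters in a BERT-style embedding block."""
    word = vocab_size * hidden_size            # one vector per vocabulary entry
    position = max_positions * hidden_size     # one vector per position
    token_type = type_vocab_size * hidden_size # one vector per segment id
    layer_norm = 2 * hidden_size               # scale and bias
    return word + position + token_type + layer_norm


small = embedding_params(12577)   # tokenizer trained on the WikiText subset
stock = embedding_params(30522)   # stock bert-base-uncased vocabulary
print(f"{small:,} vs {stock:,} embedding parameters")
```

A smaller vocabulary shrinks the word-embedding matrix, and the output layer of the MLM head shrinks with it.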

In [40]:
len(bert_tokenizer.get_vocab())

12577

In [41]:
# define the BERT configuration
config = BertConfig(
    vocab_size = len(bert_tokenizer.get_vocab()),
    hidden_size = 768,
    num_hidden_layers = 12,
    num_attention_heads = 12,
    intermediate_size = 3072, # set the intermediate size
    
)

In [42]:
# Create the BERT model for pre-training
model = BertForMaskedLM(config)

In [43]:
model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(12577, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

## Tokenize Dataset Dynamically

#### tokenize function

The tokenize function preprocesses the text data by tokenizing it and formatting it for model training, padding or truncating every example to 512 tokens.
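
Conceptually, fixed-length padding and truncation do the following to each sequence of token ids. This is a simplified sketch rather than the tokenizer's actual implementation: it assumes a padding id of 0 and ignores special tokens, which a real tokenizer keeps in place when it truncates. `pad_or_truncate` is a hypothetical helper name.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a sequence of token ids to exactly max_length entries."""
    ids = list(token_ids[:max_length])          # truncate if too long
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,    # pad if too short
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([2, 17, 9, 3], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=4)
print(short)
print(long)
```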

In [44]:
# Tokenize dataset dynamically
def tokenize_function(examples):
    return bert_tokenizer(examples['text'],truncation=True, padding = "max_length", max_length = 512)


# Tokenize train and test dataset
tokenized_datasets = dataset.map(tokenize_function, batched = True, remove_columns=["text"])

# print tokenized dataset sample
print(tokenized_datasets["train"][0])

# split into training and test sets
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]


{'input_ids': [2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [47]:
#train_dataset[0],test_dataset[0]

## Define the data collator for language modelling

The code below sets up a `DataCollatorForLanguageModeling` from the Hugging Face Transformers library. A data collator is used during training to dynamically create batches of data. For language modeling, particularly for models like BERT that use masked language modeling (MLM), this collator prepares training batches by automatically masking tokens according to a specified probability. Here are the details of the parameters used:

- **tokenizer=bert_tokenizer**: Specifies the tokenizer to be used with the data collator. The `bert_tokenizer` is responsible for tokenizing the text and converting it to the format expected by the model.
- **mlm=True**: Indicates that the data collator should mask tokens for masked language modeling training. Setting this parameter to `True` configures the collator to randomly mask some of the tokens in the input data, which the model will then attempt to predict.
- **mlm_probability=0.15**: Sets the probability with which tokens are selected for masking. A probability of 0.15 means that, on average, 15% of the tokens in any sequence are selected. By default, most of the selected tokens are replaced with the mask token, while the rest are swapped for a random token or left unchanged.
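
The masking step can be sketched in plain Python. This is an illustration of the idea rather than the library's implementation: it assumes the commonly used 80% / 10% / 10% split (mask token, random token, unchanged) and the label value `-100` for positions the loss should ignore. `mask_tokens` and all the ids in the example are hypothetical.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100: position is ignored by the loss
    for i, token in enumerate(input_ids):
        # Never mask special tokens; select the rest with probability mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


ids = [2, 45, 67, 89, 101, 3, 0, 0]  # made-up ids; 2, 3 and 0 play the role of special tokens
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=1000,
                             special_ids={0, 2, 3}, seed=0)
print(masked)
print(labels)
```

Because the masking is redone every time a batch is built, the model sees different masked positions for the same sentence across epochs.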


In [48]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer = bert_tokenizer, mlm = True, mlm_probability = 0.15
)

In [49]:
# Check how collator transforms a sample input data record
data_collator([train_dataset[0]])

{'input_ids': tensor([[2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0,

Now, we train the BERT model using the `Trainer` class. (For a complete list of training arguments, check [here](https://huggingface.co/docs/transformers/v4.33.2/en/main_classes/trainer#transformers.TrainingArguments).)
The training arguments control how the model is trained, evaluated, and saved:

- **output_dir="./trained_model"**: Specifies the directory where the trained model and other output files will be saved.
- **overwrite_output_dir=True**: If set to `True`, this will overwrite the contents of the output directory if it already exists. This is useful when running experiments multiple times.
- **do_eval=True**: Enables evaluation of the model. If `True`, the model will be evaluated at the specified intervals.
- **eval_strategy="epoch"**: Defines when the model should be evaluated. Setting this to "epoch" means the model will be evaluated at the end of each epoch. (Older Transformers releases name this argument `evaluation_strategy`.)
- **learning_rate=5e-5**: Sets the learning rate for training the model. This is a typical learning rate for fine-tuning BERT-like models.
- **num_train_epochs=10**: Specifies the number of training epochs. Each epoch involves a full pass over the training data.
- **per_device_train_batch_size=2**: Sets the batch size for training on each device. This should be set based on the memory capacity of your hardware.
- **save_total_limit=2**: Limits the total number of model checkpoints to be saved. Only the most recent two checkpoints will be kept.
- **logging_steps=20**: Determines how often, in optimizer steps, training information is logged, which helps monitor the training process.
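
These settings determine how long training runs. The sketch below estimates the number of optimizer steps, assuming a single device and no gradient accumulation (neither is stated in the notebook). `total_optimizer_steps` is a hypothetical helper name.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 1,000 training examples, batch size 2, 10 epochs
steps = total_optimizer_steps(num_examples=1000, batch_size=2, num_epochs=10)
log_entries = steps // 20  # one log entry every `logging_steps=20` steps
print(steps, log_entries)
```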


In [None]:
## Define the training arguments
training_args = TrainingArguments(
    output_dir='./trained_model',
    overwrite_output_dir= True,
    do_eval= True,
    eval_strategy='epoch',
    learning_rate=5e-5,
    num_train_epochs=10,
    per_device_train_batch_size=2,
    save_total_limit=2,
    logging_steps=20
)

from transformers import Trainer
# Instantiate the trainer
trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
    eval_dataset = test_dataset
)


# start the pretraining
trainer.train()