# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

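* PEFT technique: LoRA (`LoraConfig` with `task_type="SEQ_CLS"`, rank 8)
* Model: GPT-2 (`gpt2`) with a two-label sequence classification head
* Evaluation approach: accuracy and cross-entropy loss from `Trainer.evaluate()` on the first 500 validation examples
* Fine-tuning dataset: BoolQ (`google/boolq`), using 5,000 shuffled training examples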

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

### loading datasets and tokenizer

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, Trainer, TrainingArguments
import torch
import torch.nn.functional as F
import pandas as pd
import numpy as np

In [2]:
dataset = load_dataset("google/boolq")
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'passage'],
        num_rows: 9427
    })
    validation: Dataset({
        features: ['question', 'answer', 'passage'],
        num_rows: 3270
    })
})

In [3]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

In [4]:
def process_rows(batch, tokenizer):
    tmp_list = []
    label_list = []
    for i in range(len(batch['question'])):
        concatenated =  batch['passage'][i] + '<|endoftext|>' +\
                        batch['question'][i] + '<|endoftext|>' +\
                        'Yes or No?<|endoftext|>'
        tmp_list.append(concatenated)
        
        # Convert the boolean answer to an integer label (True -> 1, False -> 0)
        answer = batch['answer'][i]
        label = int(answer)
        label_list.append(label)
        
    # Tokenize the concatenated text
    tokenized = tokenizer(tmp_list, truncation=True, padding=True, return_tensors="pt")
    tokenized["labels"] = torch.tensor(label_list)
    return tokenized

dataset_train = dataset['train'].map(
    lambda batch: process_rows(batch, tokenizer), batched=True)
dataset_validation = dataset['validation'].map(
    lambda batch: process_rows(batch, tokenizer), batched=True)



In [5]:
print(dataset_train)

Dataset({
    features: ['question', 'answer', 'passage', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 9427
})
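
To make the input format concrete, here is a minimal sketch of the string that `process_rows` assembles for one example before tokenization. The helper names `build_prompt` and `to_label` and the sample passage and question are made up for illustration; only the separator token and the trailing `Yes or No?` cue come from the cell above.

```python
# Illustrative only: mirrors the string concatenation done in process_rows.
EOS = "<|endoftext|>"  # GPT-2's end-of-text token, reused as the pad token above


def build_prompt(passage: str, question: str) -> str:
    """Join passage, question and the answer cue with EOS separators."""
    return passage + EOS + question + EOS + "Yes or No?" + EOS


def to_label(answer: bool) -> int:
    """Map BoolQ's boolean answer to the class id used for training."""
    return int(answer)


# Hypothetical example, not taken from the dataset.
prompt = build_prompt(
    "Water boils at 100 degrees Celsius at sea level.",
    "does water boil at 100 degrees celsius",
)
print(prompt)
print(to_label(True), to_label(False))
```

Because the cue sits at the end of the sequence, the classification head reads its prediction from a position that has seen both the passage and the question.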


### loading foundation model

In [37]:
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

In [38]:
#model = AutoModelForCausalLM.from_pretrained("gpt2")
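# Label ids must match process_rows: 1 = answer True ("yes"), 0 = answer False ("no").
# id2label and label2id have to be inverses of each other.
model = AutoModelForSequenceClassification.from_pretrained('gpt2', 
        num_labels=2,
        id2label={0: "no", 1: "yes"},
        label2id={"no": 0, "yes": 1}
        )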

model.config.pad_token_id = model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [39]:
model

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)

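### evaluating original foundation model output

Evaluate the model on the first 500 validation examples before any fine-tuning. The classification head is still randomly initialized at this point.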

In [40]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [41]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=10,
    per_device_eval_batch_size=5,
    seed=42,
    disable_tqdm=False,
    
)

validation_sample = dataset_validation.select(range(0, 500))

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
    train_dataset=dataset_train,
    eval_dataset = validation_sample,
    #eval_dataset=tokenized_dataset["validation"],
)

In [10]:
trainer.evaluate()
#trainer.evaluate(eval_dataset=validation_sample)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 1.0964258909225464,
 'eval_accuracy': 0.636,
 'eval_runtime': 33.5153,
 'eval_samples_per_second': 14.919,
 'eval_steps_per_second': 2.984}

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [42]:
from peft import LoraConfig, get_peft_model

In [43]:
config = LoraConfig(fan_in_fan_out = True, task_type="SEQ_CLS")
lora_model = get_peft_model(model, config)
lora_model.config.pad_token_id = model.config.eos_token_id

In [44]:
lora_model.print_trainable_parameters()
lora_model

trainable params: 297,984 || all params: 124,737,792 || trainable%: 0.23888830740245906


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Linear(
                in_features=768, out_features=2304, bias=True
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
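
The trainable-parameter count printed above can be checked by hand from the layer shapes. The sketch below assumes LoRA is attached only to `c_attn` in each of the 12 blocks (the printout is truncated, so this is inferred from the visible part) and uses the rank 8 shown in `lora_A` and `lora_B`. The interpretation of the leftover parameters is also an assumption.

```python
# Reproduce the numbers from print_trainable_parameters() using the shapes above.
n_layers = 12             # GPT2Block count (0-11)
d_in, d_out = 768, 2304   # c_attn input and output features
r = 8                     # LoRA rank seen in lora_A / lora_B

lora_per_layer = d_in * r + r * d_out   # lora_A weights + lora_B weights
lora_total = n_layers * lora_per_layer

reported_trainable = 297_984
reported_all = 124_737_792

# Whatever is trainable beyond the LoRA matrices.
remainder = reported_trainable - lora_total
score_head = 768 * 2      # Linear(768 -> 2, bias=False)

print("LoRA parameters:", lora_total)
print("remaining trainable parameters:", remainder)
print("remainder / score head size:", remainder / score_head)
print("trainable %:", 100 * reported_trainable / reported_all)
```

The remainder is an exact multiple of the classification head size, which would be consistent with the `score` layer being counted alongside a trainable copy that PEFT keeps for `SEQ_CLS` tasks, though that reading is not confirmed by the output shown here.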

In [45]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = torch.from_numpy(predictions)
    labels = torch.from_numpy(labels)
    
    loss = F.cross_entropy(predictions, labels)
    accuracy = (torch.argmax(predictions, dim=1) == labels).float().mean()
    
    return {"eval_loss": loss.item(), "eval_accuracy": accuracy.item()}

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    learning_rate=1e-4,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=2,
    weight_decay=0.01,
    warmup_steps=100,
    load_best_model_at_end=True,
    disable_tqdm=False,
)
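
As a rough illustration of what `compute_metrics` reports, the following sketch computes the same two quantities (mean softmax cross-entropy and accuracy) in plain Python. The function name and the toy logits are hypothetical and are not taken from the model.

```python
import math


def cross_entropy_and_accuracy(logits, labels):
    """Mean softmax cross-entropy and accuracy for a list of logit rows."""
    total_loss, correct = 0.0, 0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp over the class logits.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total_loss += log_sum_exp - row[label]
        # Predicted class is the index of the largest logit.
        correct += int(row.index(max(row)) == label)
    n = len(labels)
    return total_loss / n, correct / n


# Hypothetical logits for three examples (columns: class 0, class 1).
toy_logits = [[0.0, 0.0], [2.0, -1.0], [-0.5, 1.5]]
toy_labels = [1, 0, 0]
loss, acc = cross_entropy_and_accuracy(toy_logits, toy_labels)
print(loss, acc)
```

Loss and accuracy can move independently: a confident wrong prediction (the third row) raises the loss far more than an uncertain one (the first row), even though both count as one error.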

In [46]:
validation_sample = dataset_validation.select(range(0, 500))
trainer_sample = dataset_train.shuffle(seed=42).select(range(0, 5000))

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=trainer_sample,
    eval_dataset=validation_sample,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
    
)

In [34]:
# Evaluate the LoRA-wrapped model before training
trainer.evaluate()

{'eval_loss': 1.1966373920440674,
 'eval_accuracy': 0.36800000071525574,
 'eval_runtime': 34.4256,
 'eval_samples_per_second': 14.524,
 'eval_steps_per_second': 2.905}

In [47]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.66,0.642331,0.646
2,0.6542,0.642861,0.654


Checkpoint destination directory ./results/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=2000, training_loss=0.6653919372558593, metrics={'train_runtime': 1774.0147, 'train_samples_per_second': 5.637, 'train_steps_per_second': 1.127, 'total_flos': 4174360350796800.0, 'train_loss': 0.6653919372558593, 'epoch': 2.0})
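
Two quick sanity checks on the training output, sketched under the assumption of a single device and no gradient accumulation (neither is set explicitly in the arguments above). The first reproduces the step count; the second compares the reported validation losses with the loss of a binary classifier that always outputs 50/50.

```python
import math

train_examples = 5000   # size of trainer_sample
batch_size = 5          # per_device_train_batch_size
epochs = 2              # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)

# Cross-entropy of a two-class predictor that assigns probability 0.5 to each class.
chance_loss = math.log(2)
print("chance-level loss:", chance_loss)

# Validation losses copied from the training table above.
reported_val_loss = [0.642331, 0.642861]
margins = [chance_loss - v for v in reported_val_loss]
print("improvement over chance-level loss:", margins)
```

The step count matches `global_step=2000` and the `checkpoint-1000` directory written after the first epoch. The validation loss sits only slightly below the chance-level value, so the fine-tuned model is still far from confident in its predictions.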

### save model

In [49]:
tokenizer.save_pretrained("lora-tokenizer")

('lora-tokenizer/tokenizer_config.json',
 'lora-tokenizer/special_tokens_map.json',
 'lora-tokenizer/vocab.json',
 'lora-tokenizer/merges.txt',
 'lora-tokenizer/added_tokens.json',
 'lora-tokenizer/tokenizer.json')

In [50]:
lora_model.save_pretrained("gpt2-lora")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [51]:
from peft import AutoPeftModelForSequenceClassification, PeftConfig
#config=PeftConfig(lora_model.config)
lora_model_load = AutoPeftModelForSequenceClassification.from_pretrained(
    "gpt2-lora", ignore_mismatched_sizes=True, #config=lora_model.config,
)

lora_model_load.config.pad_token_id = tokenizer.eos_token_id
#lora_model_load.config = lora_model.config

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [52]:
tokenizer_load = AutoTokenizer.from_pretrained("lora-tokenizer")

In [53]:
validation_sample = dataset_validation.select(range(0, 500))
#trainer_sample = dataset_validation.select(range(0, 3000))

# def compute_metrics(eval_pred):
#     predictions, labels = eval_pred
#     predictions = np.argmax(predictions, axis=1)
#     return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=1e-4,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=2,
    weight_decay=0.01,
    #warmup_steps=100,
    load_best_model_at_end=True,
    disable_tqdm=False,
)

trainer = Trainer(
    model=lora_model_load,
    args=training_args,
    #train_dataset=trainer_sample,
    eval_dataset=validation_sample,
    tokenizer=tokenizer_load,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer_load),
    compute_metrics=compute_metrics,
    
)

In [54]:
trainer.evaluate()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.6423311233520508,
 'eval_accuracy': 0.6460000276565552,
 'eval_runtime': 33.4373,
 'eval_samples_per_second': 14.953,
 'eval_steps_per_second': 14.953}

In [56]:
items_for_manual_review = dataset_validation.shuffle(seed=42).select(range(0,10))

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "passage": [item["passage"] for item in items_for_manual_review],
        "question": [item["question"] for item in items_for_manual_review],
        "answer": [item["answer"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Show all the cell
pd.set_option("display.max_colwidth", None)
df

index,passage,question,answer,predictions,labels
0,"The Ranch is an American comedy web television series starring Ashton Kutcher, Danny Masterson, Debra Winger, Elisha Cuthbert, and Sam Elliott that debuted in 2016 on Netflix. The show takes place on the fictional Iron River Ranch in the fictitious small town of Garrison, Colorado; detailing the life of the Bennetts, a dysfunctional family consisting of two brothers, their rancher father, and his divorced wife and local bar owner. While the opening sequence shows scenes from Norwood and Ouray, Colorado and surrounding Ouray and San Miguel Counties, The Ranch is filmed on a sound stage in front of a live audience in Burbank, California. Each season consists of 20 episodes broken up into two parts, each containing 10 episodes.",is garrison from the ranch a real place,False,1,0
1,"Lanugo (/ləˈnjuːɡoʊ/; from Latin lana ``wool'') is very thin, soft, usually unpigmented, downy hair that is sometimes found on the body of a fetal or new-born human. It is the first hair to be produced by the fetal hair follicles, and it usually appears around sixteen weeks of gestation and is abundant by week twenty. It is normally shed before birth, around seven or eight months of gestation, but is sometimes present at birth. It disappears on its own within a few weeks.",are babies in the womb covered in hair,True,1,1
2,"An administrative law judge (ALJ) in the United States is a judge and trier of fact who both presides over trials and adjudicates the claims or disputes (in other words, ALJ-controlled proceedings are bench trials) involving administrative law.",is an administrative law judge a real judge,True,0,1
3,"Plant research continued on the International Space Station. Biomass Production System was used on the ISS Expedition 4. The Vegetable Production System (Veggie) system was later used aboard ISS. Plants tested in Veggie before going into space included lettuce, Swiss chard, radishes, Chinese cabbage and peas. Red Romaine lettuce was grown in space on Expedition 40 which were harvested when mature, frozen and tested back on Earth. Expedition 44 members became the first American astronauts to eat plants grown in space on 10 August 2015, when their crop of Red Romaine was harvested. Since 2003 Russian cosmonauts have been eating half of their crop while the other half goes towards further research. In 2012, a sunflower bloomed aboard the ISS under the care of NASA astronaut Donald Pettit. In January 2016, US astronauts announced that a zinnia had blossomed aboard the ISS.",are there plants on the international space station,True,1,1
4,"HCF (The Hospitals Contribution Fund of Australia) was formed in 1932 to provide health insurance cover to Australians. Since then, it has grown to become one of the country's largest combined registered private health fund and life insurance organisations. HCF is the 3rd largest health insurance company by market share (10.3% in FY2010) and is the largest not-for-profit health fund in Australia.",is hcf a not for profit health fund,True,1,1
5,"Bank and public holidays in Scotland are determined under the Banking and Financial Dealings Act 1971 and the St Andrew's Day Bank Holiday (Scotland) Act 2007. Unlike the rest of United Kingdom, most bank holidays are not recognised as statutory public holidays in Scotland, as most public holidays are determined by local authorities across Scotland. Some of these may be taken in lieu of statutory holidays, while others may be additional holidays, although many companies, including Royal Mail, do not follow all the holidays listed below; and many swap between English and local holidays. Many large shops and supermarkets continue to operate normally during public holidays, especially since there are no restrictions such as Sunday trading rules in Scotland.",does scotland have the same bank holidays as england,False,1,0
6,"Nationals of any country may visit Montenegro without a visa for up to 30 days if they hold a passport with visas issued by Ireland, a Schengen Area member state, the United Kingdom or the United States or if they are permanent residents of those countries. Residents of the United Arab Emirates do not require a visa for up to 10 days, if they hold a return ticket and proof of accommodation.",can i go to montenegro with a schengen visa,True,1,1
7,"The previous major redesign of the iPhone, the 4.7-inch iPhone 6 and 5.5-inch iPhone 6 Plus, resulted in larger screen sizes. However a significant number of customers still preferred the 4-inch screen size of the iPhone 5 and 5S. Apple stated in their event that they sold 30 million 4-inch iPhones in 2015.",is the iphone se before the iphone 6,False,0,0
8,"Trees have a wide variety of sizes and shapes and growth habits. Specimens may grow as individual trunks, multitrunk masses, coppices, clonal colonies, or even more exotic tree complexes. Most champion tree programs focus finding and measuring the largest single-trunk example of each species. There are three basic parameters commonly measured to characterize the size of a single trunk tree: height, girth, and crown spread. Additional details on the methodology of Tree height measurement, Tree girth measurement, Tree crown measurement, and Tree volume measurement are presented in the links herein. A detailed guideline to these basic measurements is provided in The Tree Measuring Guidelines of the Eastern Native Tree Society by Will Blozan.",can a tree have more than one trunk,True,1,1
9,"A SIM lock, simlock, network lock, carrier lock or (master) subsidy lock is a technical restriction built into GSM and CDMA mobile phones by mobile phone manufacturers for use by service providers to restrict the use of these phones to specific countries and/or networks. This is in contrast to a phone (retrospectively called SIM-free or unlocked) that does not impose any SIM restrictions.",does a sim free phone lock to a network,False,0,0
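
The ten rows above are easier to read once tallied. This sketch copies the `predictions` and `labels` columns from the table and counts agreement, treating label 1 (answer True) as the positive class. Ten examples are far too few to draw conclusions, so the tally is only a reading aid.

```python
# Values copied from the predictions and labels columns of the table above.
predictions = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
labels = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
for pred, label in zip(predictions, labels):
    if pred == 1 and label == 1:
        counts["tp"] += 1   # predicted yes, answer yes
    elif pred == 1 and label == 0:
        counts["fp"] += 1   # predicted yes, answer no
    elif pred == 0 and label == 1:
        counts["fn"] += 1   # predicted no, answer yes
    else:
        counts["tn"] += 1   # predicted no, answer no

accuracy = (counts["tp"] + counts["tn"]) / len(labels)
print(counts)
print("accuracy on the 10 reviewed items:", accuracy)
```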


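#### original foundation model eval_accuracy is 63.6%
#### before training, LoRA model eval_accuracy is 36.8%
#### After 2 epochs on 5,000 training examples, the saved model eval_accuracy is 64.6% (65.4% at epoch 2 during training)

The two pre-training numbers differ because the classification head (`score.weight`) is randomly initialized each time the model is loaded, so accuracy before fine-tuning varies from run to run.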