
# 🤗 Fine-Tune the Phi-1.5 LLM on the Alpaca-GPT4 Dataset Using the PEFT/LoRA Method

###  Investigator: Ehsan (Sam) Gharib-Nezhad

[LinkedIn](https://www.linkedin.com/in/ehsan-gharib-nezhad/) &nbsp; 
[GitHub](https://github.com/EhsanGharibNezhad/) &nbsp; 
[Hugging Face](https://huggingface.co/ehsangharibnezhad) &nbsp; 
[Homepage](https://www.nasa.gov/people/ehsan-sam-gharib-nezhad/)





---

# Experiment Design

Fine-tuning a large language model (LLM) means adapting a pre-trained model to a specific task or domain, and the experimental design is the set of decisions and configurations that govern that process. This notebook follows these key considerations:


**1. Task Definition:**
   - Fine-tune an LLM (`microsoft/phi-1_5`) for improved question answering, using LoRA.

**2. Dataset Selection:**
   - Choose a relevant and representative dataset for your task. The dataset should align with the target domain or application. Ensure that the dataset is labeled appropriately for supervised tasks. We used the `vicgalle/alpaca-gpt4` dataset, which consists of instruction-following examples.

**3. Model Selection:**
   - Choose a pre-trained language model that serves as the base for fine-tuning. `microsoft/phi-1_5` was used here.

**4. Tokenization and Input Format:**
   - Decide on tokenization strategies and input format. Tokenize the input data using the same tokenizer that was used during pre-training. Ensure that the input format matches the requirements of the chosen pre-trained model.

**5. Hyperparameter Tuning:**
   - Tune hyperparameters such as the learning rate, batch size, and number of training epochs. These parameters can significantly impact model performance. Grid search or random search can be used for hyperparameter tuning.


**6. Loss Function:**
   - Choose an appropriate loss function for the specific task. Common choices include `Cross-Entropy Loss` for classification tasks.
   
**7. Evaluation Metric:**
   - Define evaluation metrics to assess model performance on the validation or test set. Metrics can vary based on the task and may include accuracy, F1 score, precision, recall, etc.


**8. Domain Adaptation (Optional):**
   - If fine-tuning for a specific domain, consider techniques for domain adaptation to enhance the model's performance on domain-specific data.

**9. Fine-tuning Process:**
   - Carry out the fine-tuning process on a computing infrastructure that supports the required computational resources. Utilize distributed training if necessary.

---

### Install and load libraries

In [1]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

### Importing Dependencies


In [2]:
import torch
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import os

In [3]:
# Make sure your machine uses a GPU for this LLM job
if torch.cuda.is_available():
    print("CUDA is available! You're all set!")
else:
    print("CUDA is not available. You need a GPU machine to run this LLM fine-tuning project.")


CUDA is available! You're all set!


### Login to huggingface_hub

In [4]:
from huggingface_hub import notebook_login
notebook_login()


# Download the base LLM: `microsoft/phi-1_5`


### Model Summary
The language model Phi-1.5 is a Transformer with 1.3 billion parameters. It was trained using the same data sources as phi-1, augmented with a new data source that consists of various NLP synthetic texts. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-1.5 demonstrates a nearly state-of-the-art performance among models with less than 10 billion parameters.

We did not fine-tune Phi-1.5 either for instruction following or through reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more.

For a safer model release, we exclude generic web-crawl data sources such as common-crawl from the training. This strategy prevents direct exposure to potentially harmful online content, enhancing the model's safety without RLHF. However, the model is still vulnerable to generating harmful content. We hope the model can help the research community to further study the safety of language models.

Phi-1.5 can write poems, draft emails, create stories, summarize texts, write Python code (such as downloading a Hugging Face transformer model), etc.

### Intended Uses
Given the nature of the training data, Phi-1.5 is best suited for prompts using the QA format, the chat format, and the code format. Note that Phi-1.5, being a base model, often produces irrelevant text following the main answer. In the following example, we've truncated the answer for illustrative purposes only.



### Ref:
https://huggingface.co/microsoft/phi-1_5

In [5]:
# to download a model, tokenizer, or any other file hosted on the Hugging Face Model Hub.
from huggingface_hub import snapshot_download

In [6]:
repo_id = 'microsoft/phi-1_5'
model_path = "./phi-1_5"

model_path = snapshot_download(repo_id=repo_id,  # Repository to download from the Hub
                               repo_type="model",
                               local_dir=model_path, # Local directory where the files are stored
                               local_dir_use_symlinks=False)



# Deploy the pre-trained model

In [7]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# instantiate a tokenizer for a given pre-trained model
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

# Padding is often necessary when working with sequences of varying lengths, such as sentences in NLP.
tokenizer.pad_token = tokenizer.eos_token

In [8]:
# Load model directly
model = AutoModelForCausalLM.from_pretrained("./phi-1_5",  trust_remote_code=True)



In [9]:
# Print the base model
model

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2048)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2048, out_features=8192, bias=True)
          (fc2): Linear(in_features=8192, out_features=2048, bias=True)
        )
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2048,), e

### Test the pre-trained Phi-1.5 LLM:

Output should be: 
![image](phi1_5_example.jpg)


In [10]:


inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)


def print_prime(n):
   """
   Print all primes between 1 and n
   """
   primes = []
   for num in range(2, n+1):
       is_prime = True
       for i in range(2, int(math.sqrt(num))+1):
           if num % i == 0:
               is_prime = False
               break
       if is_prime:
           primes.append(num)
   print(primes)
   
print_prime(20)
```

Output:
```
[2, 3, 5, 7, 11, 13, 17, 19]
```

Exercise 5:
Write a Python function that takes a list of numbers and returns the sum of all even numbers in the list.

```python
def sum_even(numbers):
   """
   


## Model Quantization 

### Why?
To reduce the precision of numerical values in a model. Instead of using high-precision data types, such as 32-bit floating-point numbers, quantization represents values using lower-precision data types, such as 8-bit integers. This process significantly reduces memory usage and can speed up model execution while maintaining acceptable accuracy.
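
To make the idea concrete, here is a minimal sketch of one simple scheme, absmax quantization to 8-bit integers. It is an illustration only and is not the algorithm that `bitsandbytes` applies in this notebook; the helper names `quantize_absmax_int8` and `dequantize` and the toy weights are assumptions made for this example.

```python
# Minimal absmax quantization sketch (illustrative only; not the bitsandbytes algorithm).

def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.91, -0.07, 0.33]   # toy float32 weights (made-up values)
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# Largest absolute rounding error introduced by quantization
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage per value: 4 bytes for float32 versus 1 byte for int8
bytes_fp32 = 4 * len(weights)
bytes_int8 = 1 * len(weights)

print("quantized:", q)
print("max error:", max_error)
print("bytes fp32 vs int8:", bytes_fp32, bytes_int8)
```

The rounding error of each value is bounded by half the scale factor, which is the accuracy cost paid for the fourfold reduction in storage.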

### Bitsandbytes Library?
The `bitsandbytes` library is integrated with Hugging Face's Transformers library, which makes model quantization more accessible and lets users obtain efficient models with just a few lines of code. The following libraries are installed at the beginning of this notebook for this step:

- `!pip install -q bitsandbytes`
- `!pip install -q accelerate`
- `!pip install -q git+https://github.com/huggingface/transformers.git@main` 



In summary, the code below loads the Phi-1.5 model with 4-bit quantization, using double quantization and the "nf4" quantization type.

Ref: https://huggingface.co/docs/optimum/concept_guides/quantization

In [12]:
# Quantization reduces memory and computational costs by representing weights
# with lower-precision data types; here the weights are loaded in 4-bit precision.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # Load the model weights in 4-bit quantization mode
    bnb_4bit_use_double_quant=True,      # Also quantize the quantization constants (double quantization)
    bnb_4bit_quant_type="nf4",           # NormalFloat4, a 4-bit type designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.float16 # Run computations in 16-bit floating point
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5",
    device_map={"":0},
    trust_remote_code=True,
    quantization_config=bnb_config
)


In [13]:
model

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2048)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (dense): Linear4bit(in_features=2048, out_features=2048, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2048, out_features=8192, bias=True)
          (fc2): Linear4bit(in_features=8192, out_features=2048, bias=True)
        )
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (final_layernor

---

# Fine-Tuning the Pre-Trained Phi-1.5 LLM Using PEFT/LoRA

### PEFT/LoRA: Low-Rank Adaptation of Large Language Models


Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. Fine-tuning large-scale PLMs is often prohibitively costly. In this regard, PEFT methods only fine-tune a small number of (extra) model parameters, thereby greatly decreasing the computational and storage costs. 

Ref: https://github.com/huggingface/peft

![image](LoRA_overview.jpg)

### Figure Legend: LoRA Overview

    1. LoRA can be implemented as an adapter that augments the existing neural network layers.
    It adds a set of trainable parameters (weights) while keeping the original parameters frozen. These trainable parameters have a much lower rank (dimension) than the original network's weight matrices, which is how LoRA simplifies and speeds up adapting the original model to domain-specific tasks. The remaining items describe the components of the LoRA adapter.

    2. The pre-trained parameters of the original model (W) are frozen. During training, these weights are not modified.

    3. Two new matrices, WA and WB, are added alongside the frozen weights. They are low-rank matrices with dimensions d×r and r×d, where 'd' is the dimension of the original frozen weight matrix and 'r' is the chosen low rank. The value of 'r' is always much smaller than 'd', and the smaller the 'r', the faster and cheaper the training. Choosing 'r' is a pivotal decision in LoRA: a lower value gives faster, more cost-effective training but may not yield optimal results, while a higher value increases training time and cost but improves the model's capacity to handle more complex tasks.
    
    4. The input is passed through both the frozen network and the low-rank path (the product of WA and WB), and the two outputs are added to produce the layer's result.
    
    5. This result is compared with the expected result (during training) to compute the loss, and the WA and WB weights are adjusted through backpropagation, as in standard neural network training.

Ref: 
- https://abvijaykumar.medium.com/fine-tuning-llm-parameter-efficient-fine-tuning-peft-lora-qlora-part-1-571a472612c4
- https://arxiv.org/pdf/2106.09685.pdf

### What Libraries? 
   - `!pip install -q loralib`
   - `!pip install -q git+https://github.com/huggingface/peft.git`
   
### More about LoRA parameters?
   - __LoRA Dimension / Rank of Decomposition r__: For each layer to be trained, the d × k weight update matrix ∆W is represented by a low-rank decomposition BA, where B is a d × r matrix and A is an r × k matrix. The rank of decomposition r is << min(d,k). The default of r is 8. A is initialized with random Gaussian values so the initial weight updates have some variation to start with. B is initialized to zero, so ∆W is zero at the beginning of training.
   - __Alpha Parameter for LoRA Scaling lora_alpha__: ∆W is scaled by α / r, where α is a constant. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if the initialization was scaled appropriately.
   
   - __Dropout Probability for LoRA Layers lora_dropout__: Dropout reduces overfitting by randomly ignoring neurons with a given probability during training. The contribution of the selected neurons to downstream activations is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass. The default of lora_dropout is 0.
   - __Bias Type for LoRA bias__: Bias can be 'none', 'all' or 'lora_only'. If 'all' or 'lora_only', the corresponding biases are updated during training; in that case, even with the adapters disabled, the model will not produce the same output as the unadapted base model. The default is 'none'.
   
![image](LoRA_parameters2.jpg)

Ref: 
   - https://medium.com/@manyi.yim/more-about-loraconfig-from-peft-581cf54643db
   - https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
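
The roles of r, lora_alpha, and the zero initialization of B can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the PEFT implementation: the helper names `matvec` and `lora_forward`, the toy dimensions, and the random values are all hypothetical.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 4, 6, 2, 4   # toy sizes; the notebook uses r=16 and lora_alpha=32

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]      # frozen d x k weights
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # r x k, Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # d x r, zero init
x = [random.gauss(0, 1) for _ in range(k)]

# With B = 0 the adapter contributes nothing, so the output equals the base layer's output
out_initial = lora_forward(W, A, B, x, alpha, r)
base_only = matvec(W, x)
print(out_initial == base_only)

# Number of values in a full update matrix versus the two low-rank factors
full_params = d * k
lora_params = r * (d + k)
print(full_params, lora_params)
```

At these toy sizes the saving is small, but the same formulas applied to a 2048 × 2048 layer with r = 16 give 4,194,304 values for a full update versus 65,536 for the two low-rank factors.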

In [14]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=16, 
    lora_alpha=32, 
    # target_modules=["q_proj", "v_proj"], #if you know the target_modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM" 
)



model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 7,864,320 || all params: 1,426,135,040 || trainable%: 0.5514428703750243
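
The trainable-parameter count can be reproduced by hand. The sketch below assumes that adapters were attached only to the `fc1` and `fc2` layers of each of the 24 decoder layers; that assumption is an inference from the printed model, whose attention layers are named `q_proj`, `k_proj`, `v_proj`, and `dense` rather than the `Wqkv` and `out_proj` names that appear in the resolved target_modules. The helper `lora_param_count` is hypothetical.

```python
hidden_size = 2048        # from the printed model architecture
intermediate_size = 8192  # fc1 output width / fc2 input width
num_layers = 24
r = 16                    # LoRA rank used in peft_config


def lora_param_count(in_features, out_features, rank):
    """lora_A holds rank x in_features values, lora_B holds out_features x rank (no biases)."""
    return rank * in_features + out_features * rank


fc1 = lora_param_count(hidden_size, intermediate_size, r)
fc2 = lora_param_count(intermediate_size, hidden_size, r)
total = num_layers * (fc1 + fc2)

# Share of all parameters, using the total reported by print_trainable_parameters()
all_params = 1_426_135_040
trainable_pct = 100 * total / all_params

print(fc1, fc2, total, trainable_pct)
```

Under that assumption the total matches the reported 7,864,320 trainable parameters, which suggests the attention projections were not adapted in this run.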


### Review the Model Architecture:

In [15]:
# Print configuration parameters
print("\nConfiguration Parameters:")
for param, value in peft_config.__dict__.items():
    print(f"{param}: {value}")


Configuration Parameters:
peft_type: LORA
auto_mapping: None
base_model_name_or_path: microsoft/phi-1_5
revision: None
task_type: CAUSAL_LM
inference_mode: False
r: 16
target_modules: {'fc2', 'fc1', 'Wqkv', 'out_proj'}
lora_alpha: 32
lora_dropout: 0.05
fan_in_fan_out: False
bias: none
use_rslora: False
modules_to_save: None
init_lora_weights: True
layers_to_transform: None
layers_pattern: None
rank_pattern: {}
alpha_pattern: {}
megatron_config: None
megatron_core: megatron.core
loftq_config: {}


In [16]:
# model

In [17]:
# Extract embedding components from the model
embed_tokens = model.base_model.model.model.embed_tokens
embed_dropout = model.base_model.model.model.embed_dropout

# Print or use the extracted embedding components
print("Embedding Tokens:", embed_tokens)
print("Embedding Dropout:", embed_dropout)


Embedding Tokens: Embedding(51200, 2048)
Embedding Dropout: Dropout(p=0.0, inplace=False)


In [18]:
# Extract Lora linear layers from the model's first decoder layer
lora_A_fc1 = model.base_model.model.model.layers[0].mlp.fc1.lora_A['default']
lora_B_fc1 = model.base_model.model.model.layers[0].mlp.fc1.lora_B['default']

lora_A_fc2 = model.base_model.model.model.layers[0].mlp.fc2.lora_A['default']
lora_B_fc2 = model.base_model.model.model.layers[0].mlp.fc2.lora_B['default']

# Print or use the extracted Lora linear layers
print("lora_A in fc1:", lora_A_fc1)
print("lora_B in fc1:", lora_B_fc1)

print("lora_A in fc2:", lora_A_fc2)
print("lora_B in fc2:", lora_B_fc2)


lora_A in fc1: Linear(in_features=2048, out_features=16, bias=False)
lora_B in fc1: Linear(in_features=16, out_features=8192, bias=False)
lora_A in fc2: Linear(in_features=8192, out_features=16, bias=False)
lora_B in fc2: Linear(in_features=16, out_features=2048, bias=False)


---

## Load and Tokenize `alpaca-gpt4` dataset

This dataset contains English instruction-following data generated by GPT-4 using Alpaca prompts, intended for fine-tuning LLMs.

The dataset was originally shared in this repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM. The Hugging Face version is just a wrapper for compatibility with Hugging Face's datasets library.

`{'instruction': 'Identify the odd one out.',
  'input': 'Twitter, Instagram, Telegram',
 'output': 'The odd one out is Telegram. Twitter and Instagram are social media platforms mainly for sharing information, images and videos while Telegram is a cloud-based instant messaging and voice-over-IP service.',
 
 'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nIdentify the odd one out.\n\n### Input:\nTwitter, Instagram, Telegram\n\n### Response:\nThe odd one out is Telegram. Twitter and Instagram are social media platforms mainly for sharing information, images and videos while Telegram is a cloud-based instant messaging and voice-over-IP service.'}`


Ref: https://huggingface.co/datasets/vicgalle/alpaca-gpt4


### The parameters passed to the tokenizer function are:
- sample["text"]: The text to be tokenized, accessed from the "text" key of the sample dictionary.
- padding=True: Pads each sequence to the length of the longest sequence in the batch.
- truncation=True: Truncates sequences that exceed the maximum length.
- max_length: Not passed in the cell below, so truncation falls back to the tokenizer's own maximum length; passing a value such as max_length=512 would cap the tokenized sequences at 512 tokens.
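
What padding and truncation do to a batch can be shown without any library. The sketch below is a simplified stand-in for the tokenizer's behavior, not its actual implementation: the helper `pad_and_truncate`, the token IDs, and the choice of right-side padding are assumptions for illustration.

```python
def pad_and_truncate(batch, pad_id, max_length=None):
    """Truncate each sequence to max_length (if given), then right-pad to the longest one."""
    if max_length is not None:
        batch = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three toy sequences of made-up token IDs with different lengths
batch = [[5, 6, 7], [8, 9], [1, 2, 3, 4, 5, 6]]
encoded = pad_and_truncate(batch, pad_id=0, max_length=4)

print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The longest sequence is cut to four tokens and the shorter ones are padded up to that length, so every row in the batch ends up with the same shape.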

In [41]:
# train_ds = load_dataset("vicgalle/alpaca-gpt4", split="train[:100]")
# train_ds

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 100
})

In [42]:
# test_ds = load_dataset("vicgalle/alpaca-gpt4", split="train[100:150]")
# test_ds

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 50
})

In [45]:
# from transformers import AutoTokenizer
# from datasets import load_dataset

# Load your dataset
train_ds = load_dataset("vicgalle/alpaca-gpt4")

# Instantiate the tokenizer
# tokenizer = AutoTokenizer.from_pretrained("vicgalle/alpaca-gpt4")

# Define the tokenization function
def tok(sample):
    model_inps = tokenizer(sample["text"], padding=True, truncation=True)
    return model_inps

# Tokenize the training data
tokenized_training_data = train_ds.map(tok, batched=True)


Map:   0%|          | 0/52002 [00:00<?, ? examples/s]

In [46]:
# train_ds

In [47]:
tokenized_training_data['train']

Dataset({
    features: ['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask'],
    num_rows: 52002
})

In [51]:
# Set the percentages for train, validation, and test splits
train_percentage = 0.8
val_percentage = 0.1
test_percentage = 0.1

# Calculate the number of samples for each split
num_samples = len(tokenized_training_data['train'])
num_train = int(train_percentage * num_samples)
num_val = int(val_percentage * num_samples)
num_test = int(test_percentage * num_samples)

# Split the tokenized training dataset
splits = tokenized_training_data['train'].train_test_split(test_size=(num_val + num_test))
tokenized_train_dataset = splits['train']
val_test_dataset = splits['test']

# Further split the val_test_dataset into validation and test datasets
splits_val_test = val_test_dataset.train_test_split(test_size=(num_test / (num_val + num_test)))
tokenized_val_dataset = splits_val_test['train']
tokenized_test_dataset = splits_val_test['test']
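
The split sizes produced by the cell above can be checked with plain arithmetic. This sketch assumes that a fractional test_size is resolved by rounding up the product of the fraction and the number of rows; that rounding rule is an assumption here, and it does not affect the result because the fraction works out to exactly one half.

```python
import math

num_samples = 52002                      # rows in the tokenized 'train' split
num_val = int(0.1 * num_samples)         # validation rows requested
num_test = int(0.1 * num_samples)        # test rows requested

# First split: hold out validation + test rows, keep the rest for training
holdout = num_val + num_test
train_size = num_samples - holdout

# Second split: divide the held-out rows between validation and test
test_fraction = num_test / holdout
test_size = math.ceil(test_fraction * holdout)   # assumed rounding rule
val_size = holdout - test_size

print(train_size, val_size, test_size)
```

The training size agrees with the 41,602 rows reported for tokenized_train_dataset.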

In [52]:
tokenized_train_dataset

Dataset({
    features: ['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask'],
    num_rows: 41602
})

In [27]:
# # Access a specific split (e.g., 'train')
# split_data = train_ds['train']

# Convert to pandas DataFrame
data_df = tokenized_training_data['train'].to_pandas()

#### Check the dataset

In [29]:
data_df.head()

| | instruction | input | output | text | input_ids | attention_mask |
|---|---|---|---|---|---|---|
| 0 | Give three tips for staying healthy. | | 1. Eat a balanced and nutritious diet: Make su... | Below is an instruction that describes a task.... | [21106, 318, 281, 12064, 326, 8477, 257, 4876,... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 1 | What are the three primary colors? | | The three primary colors are red, blue, and ye... | Below is an instruction that describes a task.... | [21106, 318, 281, 12064, 326, 8477, 257, 4876,... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 2 | Describe the structure of an atom. | | An atom is the basic building block of all mat... | Below is an instruction that describes a task.... | [21106, 318, 281, 12064, 326, 8477, 257, 4876,... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 3 | How can we reduce air pollution? | | There are several ways to reduce air pollution... | Below is an instruction that describes a task.... | [21106, 318, 281, 12064, 326, 8477, 257, 4876,... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 4 | Describe a time when you had to make a difficu... | | As an AI assistant, I do not have my own perso... | Below is an instruction that describes a task.... | [21106, 318, 281, 12064, 326, 8477, 257, 4876,... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |


In [30]:
data = data_df.iloc[0]

# Print the data
for entry in data:
    print(entry)
    print("\n" + "="*50 + "\n")  # Separator line for better readability


Give three tips for staying healthy.





1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.


Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nut

---

## Fine-tune the pre-trained Phi-1.5 model on the `alpaca-gpt4` dataset using the PEFT/LoRA method

### Explanation of key parameters:

- **output_dir**: Directory where the model checkpoints and logs will be saved.

- **per_device_train_batch_size**: Batch size per GPU (or CPU) device during training. It determines the number of training samples processed in a single forward and backward pass.

- **gradient_accumulation_steps**: Number of steps to accumulate gradients before performing a model update. It allows using larger effective batch sizes.

- **learning_rate**: Initial learning rate for the optimizer. It controls the step size during optimization.

- **lr_scheduler_type**: Learning rate scheduler type. Here, it is set to "cosine", which applies a cosine annealing learning rate schedule.

- **save_strategy**: Strategy for saving checkpoints. In this case, it saves at each epoch.

- **logging_steps**: Log metrics every specified number of steps during training.

- **max_steps**: Maximum number of training steps. Training stops when this number is reached; when set to a positive value, it overrides num_train_epochs.

- **num_train_epochs**: Total number of training epochs. It has no effect here because max_steps is set.

- **push_to_hub**: Upload the model checkpoints to the Hugging Face Model Hub.

- **DataCollatorForLanguageModeling**: Prepares batches of data for language modeling tasks. With mlm=False it builds labels for causal (next-token) language modeling rather than masked language modeling.

Ref: https://huggingface.co/docs/transformers/training
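
To see how these settings interact, the following sketch works out the effective batch size and how much of the training set a run of this length covers. The values are the ones used in this notebook; the single-GPU setting (`num_devices = 1`) is an assumption.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
max_steps = 40
num_devices = 1            # assumption: training runs on a single GPU
train_examples = 41602     # rows in tokenized_train_dataset

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total examples processed before training stops at max_steps
examples_seen = effective_batch_size * max_steps

# Portion of one pass over the training set
fraction_of_epoch = examples_seen / train_examples

print(effective_batch_size, examples_seen, round(fraction_of_epoch, 4))
```

With these values the run processes 160 examples, well under one percent of a single epoch, so the reported losses reflect only the very start of fine-tuning.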

In [57]:

training_arguments = TrainingArguments(
        output_dir="phi-1_5-finetuned-vicgalle-alpaca-gpt4",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        evaluation_strategy="steps",
        save_strategy="epoch",
        logging_steps=10,
        max_steps=40,
        num_train_epochs=5,
        report_to="none",
        push_to_hub=True

    )

In [58]:
tokenized_training_data

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask'],
        num_rows: 52002
    })
})

In [59]:
# If you get the following OOM error, try this snippet to empty the CUDA cache:
# ==> OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 22.19 GiB total capacity....

# import torch
# torch.cuda.empty_cache()

In [60]:
trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset, #tokenized_data,
    eval_dataset=tokenized_val_dataset,
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()
# Save the trained model in HuggingFace
model.push_to_hub("ehsangharibnezhad/phi-1_5-finetuned-vicgalle-alpaca-gpt4",
                  use_auth_token=True,
                  commit_message="basic training",
                  private=False)

| Step | Training Loss | Validation Loss |
|---|---|---|
| 10 | 0.8217 | 1.094744 |
| 20 | 0.9821 | 1.098478 |
| 30 | 0.9509 | 1.095425 |
| 40 | 1.0368 | 1.094171 |




CommitInfo(commit_url='https://huggingface.co/ehsangharibnezhad/phi-1_5-finetuned-vicgalle-alpaca-gpt4/commit/d620e6ff2638f9bd728b58164344e8ebdd326ba7', commit_message='basic training', commit_description='', oid='d620e6ff2638f9bd728b58164344e8ebdd326ba7', pr_url=None, pr_revision=None, pr_num=None)

In [64]:
model.config

PhiConfig {
  "_name_or_path": "microsoft/phi-1_5",
  "architectures": [
    "PhiForCausalLM"
  ],
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/phi-1_5--configuration_phi.PhiConfig",
    "AutoModelForCausalLM": "microsoft/phi-1_5--modeling_phi.PhiForCausalLM"
  },
  "bos_token_id": null,
  "embd_pdrop": 0.0,
  "eos_token_id": null,
  "hidden_act": "gelu_new",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "phi",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "num_key_value_heads": 32,
  "partial_rotary_factor": 0.5,
  "qk_layernorm": false,
  "resid_pdrop": 0.0,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.37.0.dev0",
  "use_cache": true,
  "vocab_size": 51200
}

### Results Interpretation: 
The available time and computational resources were not sufficient to fully fine-tune this LLM on this dataset. After 40 training steps, the cross-entropy losses are about 1.04 on the training set and 1.09 on the validation set.

## Evaluate the fine-tuned LLM's performance
Cross-entropy loss is a widely used metric to assess the dissimilarity between the predicted probability distribution and the actual distribution of words in the training data. By minimizing cross-entropy loss, the model learns to make more accurate and contextually relevant predictions, particularly in tasks related to text generation. 
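
As a rough illustration of the metric, the sketch below computes the mean cross-entropy over a few next-token predictions and converts it to perplexity, the exponential of the loss. The probability values are made up for the example, and the helper `cross_entropy` is hypothetical; only the validation loss of 1.094171 comes from the training log above.

```python
import math


def cross_entropy(predicted_probs, target_index):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(predicted_probs[target_index])


# Toy next-token distributions over a 4-token vocabulary, paired with the correct token index
predictions = [
    ([0.7, 0.1, 0.1, 0.1], 0),
    ([0.25, 0.25, 0.25, 0.25], 2),
    ([0.1, 0.2, 0.6, 0.1], 2),
]

losses = [cross_entropy(probs, target) for probs, target in predictions]
mean_loss = sum(losses) / len(losses)
perplexity = math.exp(mean_loss)

# Perplexity implied by the validation loss logged at step 40
reported_val_loss = 1.094171
reported_perplexity = math.exp(reported_val_loss)

print(mean_loss, perplexity, reported_perplexity)
```

Perplexity can be read as the effective number of equally likely tokens the model is choosing among at each position, so lower is better.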

# Save the trained LLMs

In [61]:
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", 
                                             trust_remote_code=True, torch_dtype=torch.float32)
peft_model = PeftModel.from_pretrained(model, 
                                       "ehsangharibnezhad/phi-1_5-finetuned-vicgalle-alpaca-gpt4", 
                                       from_transformers=True)
model = peft_model.merge_and_unload()
model
     

adapter_config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/31.5M [00:00<?, ?B/s]

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2048)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2048, out_features=8192, bias=True)
          (fc2): Linear(in_features=8192, out_features=2048, bias=True)
        )
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2048,), e

In [62]:
model.push_to_hub("ehsangharibnezhad/phi-1_5-finetuned-vicgalle-alpaca-gpt4")


README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/688M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/ehsangharibnezhad/phi-1_5-finetuned-vicgalle-alpaca-gpt4/commit/c3c8322d05fc9fc7580f4ab4947ddadf07c89dce', commit_message='Upload PhiForCausalLM', commit_description='', oid='c3c8322d05fc9fc7580f4ab4947ddadf07c89dce', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [63]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ehsangharibnezhad/phi-1_5-finetuned-vicgalle-alpaca-gpt4", trust_remote_code=True, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
inputs = tokenizer('''Design a pipline to monitor the performace in a tech company ''', 
                       return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=512)
text = tokenizer.batch_decode(outputs)[0]
print(text)

Design a pipline to monitor the performace in a tech company 
Answer: The pipline should be designed to monitor the performance of the company in real-time, with sensors placed at strategic locations to track metrics such as productivity, efficiency, and customer satisfaction. The data collected can be analyzed using machine learning algorithms to identify trends and patterns, and the company can use this information to make informed decisions about how to improve their operations.

Exercise: What are some potential risks of using a pipline to monitor the performance of a tech company? 
Answer: Some potential risks of using a pipline to monitor the performance of a tech company include data breaches, privacy violations, and the potential for the data to be misused or misinterpreted. Additionally, the use of a pipline may create a culture of constant surveillance and micromanagement, which can be detrimental to employee morale and productivity.

Exercise: How can a pipline be used to im

In [46]:
# inputs = tokenizer('''Suggest a tutorial to expert machine learning ''', 
#                    return_tensors="pt", 
#                    return_attention_mask=False)

# outputs = model.generate(**inputs, max_length=512)
# text = tokenizer.batch_decode(outputs)[0]
# print(text)

Natalia sold clips to 48 of her friends in April, and then she sold half as 
many clips in May. How many clips did Natalia sell altogether in April and May? 
Answer: In April, Natalia sold 48 clips because 48/2=<<48/2=24>>24 clips
In May, Natalia sold 24 clips because 24/2=<<24/2=12>>12 clips
Natalia sold 48+24+12=<<48+24+12=72>>72 clips
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clips in total
#### 72 clips in April and May
#### 72 clip
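
The output above reaches the final-answer marker `####` and then keeps repeating it until the length limit cuts the text off. As a minimal sketch of one way to score such generations, the helper below keeps only the first `####` line and pulls the number out of it. The function name `extract_final_answer` is hypothetical and not part of this notebook, and the sketch assumes the answer follows the `#### <number>` pattern seen in the output.

```python
import re
from typing import Optional


def extract_final_answer(generated_text: str) -> Optional[str]:
    """Return the number on the first '#### ...' line, or None if there is none."""
    for line in generated_text.splitlines():
        if line.startswith("####"):
            # Take the first number on the marker line; ignore repeated markers after it.
            match = re.search(r"-?\d[\d,]*(?:\.\d+)?", line)
            if match:
                return match.group(0).replace(",", "")
            return None
    return None


# Sample text copied from the tail of the generation shown above.
sample = (
    "Natalia sold 48+24+12=<<48+24+12=72>>72 clips\n"
    "#### 72 clips in total\n"
    "#### 72 clips in April and May\n"
    "#### 72 clip"
)
print(extract_final_answer(sample))
```

Comparing only the extracted number against the reference answer checks the final result, not the intermediate reasoning, so the steps still need to be inspected separately.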

---

# Key questions:


### 1- Does Phi-1.5 have any unique model components making LoRA difficult?

As discussed in the Hugging Face model cards for microsoft/phi-1_5 and microsoft/phi-1, this model is specialized for basic Python coding. Its training data included subsets of Python code from The Stack v1.2, Q&A content from StackOverflow, competition code from code contests, and synthetic Python textbooks and exercises generated by gpt-3.5-turbo-0301. The model has not undergone instruction fine-tuning, so it is not reliable at responding to instructions and may struggle with, or fail to follow, intricate or nuanced ones.

The dataset used in this project, vicgalle/alpaca-gpt4, is by contrast instruction-based. LoRA adapts the base model to such data by adding small trainable low-rank matrices alongside selected weight matrices while the original weights stay frozen. The architecture printed above consists of standard `Linear` projections (`q_proj`, `k_proj`, `v_proj`, `dense`, `fc1`, `fc2`), which are the kind of module LoRA adapters attach to. This customization does not guarantee full accuracy, since LoRA's effectiveness on a specific instruction dataset can vary.
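
To make the size of a LoRA adapter concrete, the following is a minimal sketch that counts the trainable parameters LoRA would add to the `Linear` layers listed in the architecture printout above. The layer shapes and the count of 24 decoder layers come from that printout; the rank of 16 and the choice of target modules are illustrative assumptions, not the values used in this notebook.

```python
# Layer shapes (in_features, out_features) taken from the PhiForCausalLM printout.
LAYER_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 2048),
    "v_proj": (2048, 2048),
    "dense": (2048, 2048),
    "fc1": (2048, 8192),
    "fc2": (8192, 2048),
}
NUM_DECODER_LAYERS = 24  # "(0-23): 24 x PhiDecoderLayer" in the printout


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """One LoRA adapter holds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


def dense_params(in_features: int, out_features: int) -> int:
    """Frozen weights plus bias of the original Linear layer (bias=True)."""
    return in_features * out_features + out_features


def summarize(target_modules, rank):
    """Return (LoRA parameters, frozen parameters) summed over all decoder layers."""
    lora = sum(lora_params(*LAYER_SHAPES[name], rank) for name in target_modules)
    full = sum(dense_params(*LAYER_SHAPES[name]) for name in target_modules)
    return lora * NUM_DECODER_LAYERS, full * NUM_DECODER_LAYERS


# Hypothetical configuration, chosen only for illustration.
ASSUMED_RANK = 16
ASSUMED_TARGETS = ["q_proj", "k_proj", "v_proj", "dense"]

lora_total, full_total = summarize(ASSUMED_TARGETS, ASSUMED_RANK)
print(f"LoRA trainable parameters: {lora_total:,}")
print(f"Frozen parameters in the same layers: {full_total:,}")
print(f"Ratio: {lora_total / full_total:.2%}")
```

Changing `ASSUMED_RANK` or `ASSUMED_TARGETS` shows how quickly the adapter grows, which is one reason the rank is a sensitive hyperparameter.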

### 2- How can we show progress with/without GPU in that amount of time?

I used quantization to reduce the numerical precision of the model's values: instead of high-precision data types such as 32-bit floating-point numbers, values are represented with lower-precision types such as 8-bit integers (as discussed earlier in this notebook). In addition, the LoRA fine-tuning is limited to a few tens of training steps, with training and validation sets used to monitor the performance of the PEFT/LoRA method. All of these steps were run on a GPU on SageMaker (ml.g5.4xlarge); without GPU access they would be very slow. To fall back to the CPU when no GPU is available, select the device with PyTorch in your Python script or notebook, as follows:


## Code Example

```python
import torch

# Use the GPU when CUDA is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"Using device: {device}")
```



---

### What are the limitations of the current implementation?

- **Limitation with the base model**: phi-1_5 is not a good starting point for fine-tuning on an instruction dataset, since it has not been instruction-tuned.
- **LoRA hyperparameters**: The performance of LoRA is sensitive to hyperparameters, such as the rank of the low-rank decomposition matrices. Choosing appropriate values can be challenging and may require experimentation.
- **Dataset**: The instruction prompts in the dataset may not be sufficient to fine-tune the pre-trained model, and they might not cover all topics. The *alpaca-gpt4* dataset contains 52K instruction-following examples generated by GPT-4 using the same prompts as in Alpaca.

#### Thank you for checking my project and going through it!!!

---