### Gemma2 2B
##### We will use Gemma on the same task as the Llama 3.1 8B notebook (notebook-0.4), but with two differences:
*  No quantization: we will not lower the model precision
*  Loading the model in float16
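
To give a rough sense of what skipping quantization costs in memory, here is a small sketch that estimates the size of the weights at different precisions. The parameter count (about 2.6 billion for Gemma 2 2B) is an assumption used only for illustration, and the estimate ignores activations, optimizer state, and any quantization overhead.

```python
# Rough weight-memory estimate; ignores activations and optimizer state.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}

def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the weights in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

ASSUMED_PARAMS = 2.6e9  # assumption: approximate Gemma 2 2B parameter count

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.2f} GiB")
```

Under these assumptions, float16 weights need about four times the memory of 4-bit weights, which is the trade this notebook accepts in exchange for keeping the precision.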


In [1]:
import pandas as pd 
import numpy as np

### Create a simple dataset 

In [2]:
df = pd.DataFrame({"text" :["Ahmed" , "Aboud" , "Mohammed" , "Moh"]})

In [3]:
df.text = df.text.apply(lambda x: f"what is the first character in this name {x}. first character is  : {x[0]}")

##### Sample

In [4]:
df.sample(1).text.values

array(['what is the first character in this name Mohammed. first character is  : M'],
      dtype=object)

In [5]:
from transformers import (AutoModelForCausalLM, AutoTokenizer , BitsAndBytesConfig)

#### Load the model 

In [7]:
# choose the model
model_id = "google/gemma-2-2b"

# Load the model 
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda:0",
    torch_dtype="float16",
    # quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1


#### Load Tokenizer 

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id


##### SFTTrainer vs Trainer
* Use Trainer: If you have a large dataset and need extensive customization for your training loop or complex training workflows.
* Use SFTTrainer: If you have a pre-trained model and a relatively smaller dataset, and want a simpler and faster fine-tuning experience with efficient memory usage.
###### Source: [Difference between Trainer class and SFTTrainer (Supervised Fine tuning trainer) in Hugging Face?](https://medium.com/@sujathamudadla1213/difference-between-trainer-class-and-sfttrainer-supervised-fine-tuning-trainer-in-hugging-face-d295344d73f7#:~:text=Use%20Trainer%3A%20If%20you%20have%20a%20large%20dataset,and%20faster%20fine-tuning%20experience%20with%20efficient%20memory%20usage.)
* It takes about 2 minutes to read.

##### Convert it to a Dataset
###### [dataset documentation](https://huggingface.co/docs/datasets/v1.2.1/loading_datasets.html)

In [9]:
from datasets import Dataset
train_data = Dataset.from_pandas(df[["text"]])

#### LoRA Config
###### [LoRA Config Documentation](https://opendelta.readthedocs.io/en/latest/modules/deltas.html)

In [10]:
from peft import LoraConfig, PeftConfig


peft_config = LoraConfig(
    lora_alpha=16, # scaling factor: the LoRA update is scaled by lora_alpha / r
    lora_dropout=0, # dropout rate applied inside the LoRA layers
    r=64, # rank of the LoRA matrices; the smaller r is, the fewer trainable parameters LoRA has
    bias="none",
    task_type="CAUSAL_LM",
    target_modules = ["q_proj", "k_proj" , "v_proj" , "o_proj"]
)
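
To make the roles of `r` and `lora_alpha` concrete, here is a minimal NumPy sketch of a LoRA-style linear layer. It is an illustration of the general technique, not the PEFT implementation: the layer sizes are made-up toy values, and the class name `ToyLoraLinear` is hypothetical.

```python
import numpy as np

class ToyLoraLinear:
    """Frozen weight W plus a low-rank update scaled by lora_alpha / r."""

    def __init__(self, in_features, out_features, r, lora_alpha, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(out_features, in_features))  # frozen base weight
        self.A = rng.normal(size=(r, in_features)) * 0.01      # trainable, random init
        self.B = np.zeros((out_features, r))                   # trainable, zero init
        self.scaling = lora_alpha / r

    def forward(self, x):
        base = x @ self.W.T
        update = (x @ self.A.T) @ self.B.T  # low-rank path through A then B
        return base + self.scaling * update

    def trainable_params(self):
        return self.A.size + self.B.size

# Toy sizes (assumption); r and lora_alpha match the config above
layer = ToyLoraLinear(in_features=32, out_features=16, r=64, lora_alpha=16)
x = np.ones((2, 32))
print("scaling:", layer.scaling)
print("trainable parameters:", layer.trainable_params())
```

Because `B` starts at zero, the layer initially behaves exactly like the frozen base layer, and only `A` and `B` would be updated during fine-tuning. The number of trainable parameters grows linearly with `r`.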



#### TrainingArguments
###### [TrainingArguments documentation](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html)

In [11]:
from transformers import TrainingArguments

output_dir="D:/llama-3.1-fine-tuned-model" # Directory where the training output is saved; after saving it takes about 320 MB

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # directory to save and repository id
    num_train_epochs=5,                       # number of training epochs
    logging_steps=1,  # to show training loss log    
    
    fp16=True,
    bf16=False,
    
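    report_to="none",                  # do not report to any logging integration
    eval_strategy="steps",              # run evaluation every eval_steps
    eval_steps=0.1,                     # shows the validation loss in the log
    # a float below 1 is read as a fraction of the total training steps;
    # with only 5 total steps here, that works out to an evaluation at every step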
    
)
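
The step arithmetic behind these arguments can be sketched as follows. This is a simplified model, assuming a single GPU, no gradient accumulation, the default per-device batch size of 8, and that a fractional `eval_steps` is rounded up to a whole number of steps. The helper names are hypothetical.

```python
import math

def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for the whole run (no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

def eval_interval(eval_steps: float, max_steps: int) -> int:
    """Turn eval_steps into a step count; values below 1 are a fraction of max_steps."""
    if eval_steps < 1:
        return math.ceil(max_steps * eval_steps)  # assumption: rounded up
    return int(eval_steps)

# 4 rows in the dataset, assumed default batch size of 8, 5 epochs
max_steps = total_training_steps(num_examples=4, batch_size=8, epochs=5)
print("total steps:", max_steps)
print("evaluate every", eval_interval(0.1, max_steps), "step(s)")
```

With four examples, the whole dataset fits in one batch, so each epoch is a single step and the run has five steps in total.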


#### Trainer by SFTTrainer
###### [Supervised Fine-tuning Trainer documentation](https://huggingface.co/docs/trl/main/en/sft_trainer)

In [13]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config, # LoRA config
    args=training_arguments, # training arguments
    train_dataset=train_data, # training dataset
    tokenizer=tokenizer, # model tokenizer
    dataset_text_field="text", # name of the text column, here "text" from df[["text"]]
    eval_dataset=train_data, # evaluation dataset; we reuse the training data for now
    
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


### Train the model

In [14]:
trainer.train()

It is strongly recommended to train Gemma2 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1 | 6.3072 | 6.154595 |
| 2 | 6.1546 | 6.013565 |
| 3 | 6.0136 | 5.907961 |
| 4 | 5.908 | 5.833333 |
| 5 | 5.8333 | 5.795701 |


TrainOutput(global_step=5, training_loss=6.043327045440674, metrics={'train_runtime': 26.0461, 'train_samples_per_second': 0.768, 'train_steps_per_second': 0.192, 'total_flos': 4428166164480.0, 'train_loss': 6.043327045440674, 'epoch': 5.0})
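
As a quick sanity check on the numbers above, the sketch below recomputes the mean training loss from the logged values and compares each validation loss with the training loss logged at the next step. The values are copied from the log. The near match between the two columns is consistent with the evaluation set being the same four rows as the training set, though that explanation is an interpretation rather than something the log states.

```python
# Values copied from the training log above
train_loss = [6.3072, 6.1546, 6.0136, 5.908, 5.8333]
val_loss = [6.154595, 6.013565, 5.907961, 5.833333, 5.795701]
reported_mean = 6.043327045440674  # training_loss from TrainOutput

# Mean of the per-step training losses
mean_train = sum(train_loss) / len(train_loss)
print(f"mean training loss: {mean_train:.4f} (reported: {reported_mean:.4f})")

# Validation loss after step n vs. training loss logged at step n + 1
gaps = [abs(v - t) for v, t in zip(val_loss[:-1], train_loss[1:])]
print(f"largest gap: {max(gaps):.6f}")
```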

### Create test prompt

In [15]:
df["text"][0]

'what is the first character in this name Ahmed. first character is  : A'

In [16]:
df["text"][0].split(":")[1][1]

'A'

In [19]:
%%time 
# report how long the cell takes to run 

# build a prompt from each row 
for i in range(df.shape[0]):
    prompt = df["text"][i].split(":")[0]
    
    # tokenize the prompt 
    token = tokenizer(prompt , return_tensors="pt")
    
    # split it into input_ids and attention_mask 
    input_ids = token.input_ids
    attention_mask = token.attention_mask
    
    # move the inputs to the GPU 
    input_ids = input_ids.to('cuda')
    attention_mask = attention_mask.to('cuda')
    
    
    # then use the model to generate the output, padding with the tokenizer's own pad token id 
    outputs = model.generate(input_ids = input_ids , attention_mask = attention_mask  , max_length = 17 ,  pad_token_id = tokenizer.pad_token_id  )
    
    # print the expected character, then the output decoded from tokens to readable text 
    print(df["text"][i].split(":")[1][1] , ":", tokenizer.batch_decode(outputs, skip_special_tokens=True))

A : ['what is the first character in this name Ahmed. first character is  A.']
A : ['what is the first character in this name Aboud. first character is  A']
M : ['what is the first character in this name Mohammed. first character is  M.']
M : ['what is the first character in this name Moh. first character is  M.']
CPU times: total: 672 ms
Wall time: 2.89 s
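
Instead of checking the printed lines by eye, the comparison can be scored automatically. The sketch below reuses the four generated strings from the output above. The helper `predicted_char` and the `MARKER` constant are hypothetical names introduced for this example.

```python
# (expected character, generated text) pairs copied from the output above
results = [
    ("A", "what is the first character in this name Ahmed. first character is  A."),
    ("A", "what is the first character in this name Aboud. first character is  A"),
    ("M", "what is the first character in this name Mohammed. first character is  M."),
    ("M", "what is the first character in this name Moh. first character is  M."),
]

MARKER = "first character is"  # text that precedes the answer in every prompt

def predicted_char(generated: str) -> str:
    """Return the first non-space character after the last marker ('' if nothing follows)."""
    tail = generated.rsplit(MARKER, 1)[-1].strip()
    return tail[:1]

correct = sum(predicted_char(text) == expected for expected, text in results)
accuracy = correct / len(results)
print(f"accuracy: {accuracy:.2f}")
```

Note that these are the same four rows the model was trained on, so this measures fit to the training data rather than generalization.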


* The output is good: every generation gives the expected first character.

### Summary 

* As we can see, it does better than Llama 3.1 8B on this task.
##### Reason
* We did not lower the precision.