<a href="https://colab.research.google.com/github/Chienstartup/2024-AI-Mathematical-Olympiad/blob/main/SOTA_model_PEFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Setting Up the Environment</h1>

PyArrow and Datasets are often used together in machine learning pipelines to:

- Load and preprocess large datasets efficiently
- Prepare data for model training with minimal memory overhead
- Share and version datasets easily among team members or the wider community

In [2]:
# Restart the kernel after installing transformers
!pip install transformers==4.33.0
import transformers
print(f"Transformers version: {transformers.__version__}")

Transformers version: 4.33.0


In [1]:
!pip install --upgrade pyarrow datasets -qq

In [None]:
!pip install peft -qq

In [3]:
# Once installed, you need to restart the notebook
!pip install -U bitsandbytes -qq



In [4]:
!pip install accelerate -qq

In [3]:
import numpy as np
import pandas as pd
import os
from datasets import Dataset
import torch

In [4]:
# Authenticate user
from google.colab import auth
from huggingface_hub import login

# Paste your Hugging Face token here (never commit a real token to a public notebook).
hf_token = "<YOUR_HF_TOKEN>"

# Save the token in the os environment
os.environ['HF_TOKEN'] = hf_token

# Login
!huggingface-cli login --token $hf_token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer, BitsAndBytesConfig




- AutoTokenizer:
This class automatically loads the appropriate tokenizer for a given model. It converts text into tokens that the model can understand.
- AutoModelForCausalLM:
This loads a pre-trained causal language model for a given architecture. It's used for tasks like text generation where the model predicts the next token based on previous tokens.
- DataCollatorForSeq2Seq:
This pads the inputs and labels of each batch to a common length for sequence-to-sequence training. It is imported here, but the training cells below use DataCollatorForLanguageModeling instead.
- TrainingArguments:
This class defines various parameters for model training, such as learning rate, batch size, and number of epochs.
- Trainer:
A high-level API that simplifies the training process. It handles the training loop, evaluation, and saving of models.
- BitsAndBytesConfig:
This configures quantization settings for loading a model in 4-bit or 8-bit precision, which reduces memory usage, usually at a small cost in accuracy.

<h1>Loading the Training Dataset</h1>

In [6]:
train_df = pd.read_csv('https://raw.githubusercontent.com/Chienstartup/2024-AI-Mathematical-Olympiad/main/data%20source/train.csv')
train_df.head()

Unnamed: 0,id,problem,answer
0,229ee8,"Let $k, l > 0$ be parameters. The parabola $y ...",52
1,246d26,Each of the three-digits numbers $111$ to $999...,250
2,2fc4ad,Let the `sparkle' operation on positive intege...,702
3,430b63,What is the minimum value of $5x^2+5y^2-8xy$ w...,800
4,5277ed,There exists a unique increasing geometric seq...,211


<h1>Quantization: Reducing Model Memory Usage</h1>

- Precision Setting:

  load_in_4bit = True: Loads the weights in 4-bit precision, significantly reducing memory usage.<br>
  Other options: 8-bit precision (load_in_8bit = True) or no quantization (default).<br>
  4-bit vs 8-bit: 4-bit offers greater memory savings but may slightly reduce accuracy; 8-bit balances memory savings and precision.


- Quantization Type:

  bnb_4bit_quant_type="nf4": Uses NormalFloat4 (NF4), a 4-bit data type designed for normally distributed weights.<br>
  Other options: "fp4" (4-bit floating point).<br>
  NF4 vs FP4: NF4 is suitable for normally distributed weights, FP4 provides a wider dynamic range.


- Computation Data Type:

  bnb_4bit_compute_dtype=torch.bfloat16: Weights are stored in 4-bit but dequantized to bfloat16 for computation.
  bfloat16 vs float16:

  bfloat16 has a larger exponent range (8 exponent bits, the same as float32), which suits the large values and gradients seen in deep learning.<br>
  float16 has more mantissa bits (10 vs 7), so it offers higher precision within its narrower range.<br>
  bfloat16 is typically more stable for training, but it requires hardware support (for example, NVIDIA Ampere or newer GPUs).




- Double Quantization:

  bnb_4bit_use_double_quant=True: Enables double quantization, which also quantizes the quantization constants and further compresses the model.<br>
  Without double quantization, one can use load_in_4bit or load_in_8bit alone.<br>
  Double quantization provides additional memory savings but may slightly increase computational overhead.

In [8]:
quantization_config = BitsAndBytesConfig(
        load_in_4bit = True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )
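To make the memory trade-offs above concrete, the following back-of-the-envelope sketch estimates weight storage at different precisions. The parameter count of about 6.9 billion is taken from the totals printed later in this notebook. The block size of 64 weights per quantization constant and the second-level block of 256 constants are assumptions based on the defaults described in the QLoRA paper, not values read from bitsandbytes.

```python
# Rough estimate of weight memory at different precisions (weights only;
# activations, adapters, and optimizer state are not included).

N_PARAMS = 6.9e9          # approximate parameter count (assumption, see text above)
BLOCK_SIZE = 64           # weights sharing one quantization constant (assumed)
SECOND_LEVEL_BLOCK = 256  # constants sharing one second-level constant (assumed)


def constant_overhead_bits(double_quant: bool) -> float:
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        # One float32 constant per block of weights.
        return 32 / BLOCK_SIZE
    # 8-bit constants, plus one float32 constant per group of 256 constants.
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)


def weight_memory_gb(bits_per_weight: float) -> float:
    """Weight storage in gigabytes (10**9 bytes)."""
    return N_PARAMS * bits_per_weight / 8 / 1e9


for label, bits in [
    ("16-bit (no quantization)", 16.0),
    ("8-bit (constants ignored)", 8.0),
    ("4-bit", 4.0 + constant_overhead_bits(double_quant=False)),
    ("4-bit + double quantization", 4.0 + constant_overhead_bits(double_quant=True)),
]:
    print(f"{label:30s} {bits:6.3f} bits/weight  ~{weight_memory_gb(bits):5.2f} GB")
```

The estimate treats every parameter as quantized. In the model printed later, the embedding layer is a regular Embedding rather than a 4-bit layer, so the real footprint will be somewhat higher.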

#### Downloading the model from Hugging Face can take a while and consume compute credits; check that you have enough compute resources before running the next cell.

In [9]:
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-math-7b-rl",
    device_map = "auto",
    quantization_config=quantization_config
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





### The pipeline is a high-level abstraction that simplifies the process of using pre-trained models for various natural language processing (NLP) tasks.
The main pipeline types include:

- Text Classification (text-classification): Classifies text, such as sentiment analysis.
- Text Generation (text-generation): Generates new text, like stories or dialogues.
- Named Entity Recognition (ner): Identifies entities in text, such as names, places, organizations.
- Question Answering (question-answering): Answers questions based on given context.
- Summarization (summarization): Generates summaries of text.
- Translation (translation): Translates text from one language to another.
- Feature Extraction (feature-extraction): Extracts feature vectors from text.
- Fill-Mask (fill-mask): Fills in masked tokens in sentences.
- Zero-Shot Classification (zero-shot-classification): Performs classification without specific training data.
- Text-to-Text Generation (text2text-generation): Various text-to-text tasks like translation or summarization.
- Conversational (conversational): Engages in conversational interactions.
- Image Classification (image-classification): Classifies images.
- Audio Classification (audio-classification): Classifies audio.
- Automatic Speech Recognition (automatic-speech-recognition): Converts speech to text.
- Visual Question Answering (visual-question-answering): Answers questions based on images.

Using these pipelines allows for quick implementation of complex NLP tasks without needing to delve into the details of the underlying models.

In [10]:
from transformers import pipeline

In [11]:
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-rl")



<h1>Testing the Pretrained DeepSeek Model</h1>

In [12]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype='auto',
    device_map= 'auto',
    max_new_tokens = 4500
)

In [13]:
problem1 = train_df.problem[0]

In [14]:
problem1

'Let $k, l > 0$ be parameters. The parabola $y = kx^2 - 2kx + l$ intersects the line $y = 4$ at two points $A$ and $B$. These points are distance 6 apart. What is the sum of the squares of the distances from $A$ and $B$ to the origin?'

<h1>Assembling the Prompt for the Problem Set</h1>

In [15]:
def generate_prompt(question):

    description = """Below is a math problem you are to solve (positive numerical answer):
    \"{}\"
    To accomplish this, first determine a sympy-based approach for solving the problem by listing each step to take and what functions need to be called in each step.
    Be clear so even an idiot can follow your instructions, and remember, your final answer should be positive integer, not an algebraic expression!
    Write the entire script covering all the steps (use comments and document it well) and print the result.
    After solving the problem, output the final numerical answer within \\boxed{}. You should not repeat the problem or instruction in your answer.
    Approach:"""

    prompt = f""" {description}. Here is the question: {question}"""

    return prompt

In [16]:
prompt_problem = generate_prompt(problem1)

In [17]:
prompt_problem

' Below is a math problem you are to solve (positive numerical answer):\n    "{}"\n    To accomplish this, first determine a sympy-based approach for solving the problem by listing each step to take and what functions need to be called in each step. \n    Be clear so even an idiot can follow your instructions, and remember, your final answer should be positive integer, not an algebraic expression!\n    Write the entire script covering all the steps (use comments and document it well) and print the result. \n    After solving the problem, output the final numerical answer within \\boxed{}. You should not repeat the problem or instruction in your answer.\n    Approach:. Here is the question: Let $k, l > 0$ be parameters. The parabola $y = kx^2 - 2kx + l$ intersects the line $y = 4$ at two points $A$ and $B$. These points are distance 6 apart. What is the sum of the squares of the distances from $A$ and $B$ to the origin?'
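In the output above, the `"{}"` placeholder is still literal and the question is appended after `Approach:`. If the intent was to place the question inside the quotes, one possible variant is sketched below. `generate_prompt_filled` and `PROMPT_TEMPLATE` are hypothetical names, and the doubled braces in `\boxed{{}}` are only there so that `str.format` leaves them as literal braces.

```python
# Hypothetical variant of generate_prompt that substitutes the question
# into the template instead of appending it at the end.

PROMPT_TEMPLATE = """Below is a math problem you are to solve (positive numerical answer):
\"{question}\"
To accomplish this, first determine a sympy-based approach for solving the problem by listing each step to take and what functions need to be called in each step.
Be clear so even an idiot can follow your instructions, and remember, your final answer should be positive integer, not an algebraic expression!
Write the entire script covering all the steps (use comments and document it well) and print the result.
After solving the problem, output the final numerical answer within \\boxed{{}}. You should not repeat the problem or instruction in your answer.
Approach:"""


def generate_prompt_filled(question: str) -> str:
    # Only the template is parsed by str.format, so braces inside the
    # question itself (for example LaTeX fractions) are left untouched.
    return PROMPT_TEMPLATE.format(question=question)


example = generate_prompt_filled("What is $\\frac{6}{2}$?")
print(example)
```

Changing the prompt changes the model's input, so outputs recorded with the original prompt would not carry over to this variant.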

In [18]:
result = pipe(prompt_problem)

Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.


In [19]:
result[0]

{'generated_text': ' Below is a math problem you are to solve (positive numerical answer):\n    "{}"\n    To accomplish this, first determine a sympy-based approach for solving the problem by listing each step to take and what functions need to be called in each step. \n    Be clear so even an idiot can follow your instructions, and remember, your final answer should be positive integer, not an algebraic expression!\n    Write the entire script covering all the steps (use comments and document it well) and print the result. \n    After solving the problem, output the final numerical answer within \\boxed{}. You should not repeat the problem or instruction in your answer.\n    Approach:. Here is the question: Let $k, l > 0$ be parameters. The parabola $y = kx^2 - 2kx + l$ intersects the line $y = 4$ at two points $A$ and $B$. These points are distance 6 apart. What is the sum of the squares of the distances from $A$ and $B$ to the origin?\nTo solve this problem, we first need to find th
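Because the prompt asks for the final answer inside `\boxed{}`, the generated text can be post-processed to pull that number out. The helper below is a hypothetical sketch, not part of the original notebook. It assumes the answer is a plain non-negative integer and takes the last non-empty `\boxed{...}` occurrence, since the prompt itself also contains an empty `\boxed{}`.

```python
import re
from typing import Optional


def extract_boxed_integer(generated_text: str) -> Optional[int]:
    """Return the integer inside the last non-empty \\boxed{...}, or None."""
    # Match \boxed{ digits } while tolerating surrounding whitespace.
    matches = re.findall(r"\\boxed\{\s*(\d+)\s*\}", generated_text)
    return int(matches[-1]) if matches else None


sample = "within \\boxed{}. Approach: ... so the answer is \\boxed{52}."
print(extract_boxed_integer(sample))
```

With the pipeline output above, it could be called as `extract_boxed_integer(result[0]["generated_text"])`.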

<h1>Loading the GSM8K Dataset</h1>
<h3>Source: https://www.kaggle.com/datasets/abdullahmeda/annotated-math-and-gsm8k-datasets</h3>

In [20]:
# Load the CSV file using Pandas
csv_file_path = "https://raw.githubusercontent.com/Chienstartup/2024-AI-Mathematical-Olympiad/main/data%20source/gsm8k-annotated-non-gpt4.csv"
df = pd.read_csv(csv_file_path)

In [21]:
df

Unnamed: 0,question,answer,code_solution,boxed_number
0,Natalia sold clips to 48 of her friends in Apr...,Natalia sold 48/2 = <<48/2=24>>24 clips in May...,```python\n# Step 1: Define the number of clip...,72
1,Weng earns $12 an hour for babysitting. Yester...,Weng earns 12/60 = $<<12/60=0.2>>0.2 per minut...,```python\n# Step 1: Define the hourly wage an...,10
2,Betty is saving money for a new wallet which c...,"In the beginning, Betty has only 100 / 2 = $<<...",```python\n# Step 1: Calculate the initial amo...,5
3,"Julie is reading a 120-page book. Yesterday, s...",Maila read 12 x 2 = <<12*2=24>>24 pages today....,```python\n# Step 1: Define the number of page...,42
4,James writes a 3-page letter to 2 different fr...,He writes each friend 3*2=<<3*2=6>>6 pages a w...,```python\n# Import the necessary libraries (n...,624
...,...,...,...,...
8439,John had a son James when he was 19. James is...,Dora is 12-3=<<12-3=9>>9\nSo James is 9*2=<<9*...,```python\n# Step 1: Calculate Dora's current ...,8
8440,There are some oranges in a basket. Ana spends...,There are 60 minutes in an hour. Ana peels an ...,```python\n# Calculate the number of oranges A...,5
8441,Mark's car breaks down and he needs to get a n...,The discount on the radiator was 400*.8=$<<400...,```python\n# Step 1: Calculate the discount on...,230
8442,"Farmer Brown has 20 animals on his farm, all e...",Let C be the number of chickens.\nThere are 20...,```python\n# Import the necessary libraries\ni...,5


In [24]:
# Convert the DataFrame to a Hugging Face dataset
dataset = Dataset.from_pandas(df)

# Display some information about the dataset
print(dataset)

Dataset({
    features: ['question', 'answer', 'code_solution', 'boxed_number'],
    num_rows: 8444
})


Convert a pandas DataFrame into a Hugging Face Dataset object.

In [25]:
gms8k_dataset = Dataset.from_pandas(df)

In [26]:
from datasets import load_dataset, DatasetDict

In [29]:
# Split the dataset into training (80%) and validation (20%) sets
split_dataset = gms8k_dataset.train_test_split(test_size=0.2)

# Create a DatasetDict to hold the train and validation datasets
dataset_dict = DatasetDict({
    'train': split_dataset['train'],
    'validation': split_dataset['test']
})

In [30]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'code_solution', 'boxed_number'],
        num_rows: 6755
    })
    validation: Dataset({
        features: ['question', 'answer', 'code_solution', 'boxed_number'],
        num_rows: 1689
    })
})
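The split sizes shown above follow from the 80/20 ratio applied to 8,444 rows. Since `train_test_split` shuffles before splitting, passing its `seed` argument makes the split repeatable across runs. The pure-Python sketch below illustrates both points. `split_indices` is a hypothetical helper, and the rounding rule (test size rounded up) is an assumption that happens to reproduce the 6,755 / 1,689 split.

```python
import math
import random


def split_indices(n_rows: int, test_fraction: float, seed: int):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    n_test = math.ceil(n_rows * test_fraction)  # assumed rounding rule
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # same seed -> same order
    return indices[n_test:], indices[:n_test]


train_idx, val_idx = split_indices(8444, 0.2, seed=42)
print(len(train_idx), len(val_idx))
```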

<h1>Processing the Data into a Tokenized DatasetDict</h1>

- tokenizer.pad_token = tokenizer.eos_token:

  This line sets the padding token to be the same as the end-of-sequence token.
  This is needed when the tokenizer does not define a padding token of its own, since padding='max_length' requires one.

- max_length = 600:

  Sets the maximum length for both inputs and labels to 600 tokens.


- model_inputs = tokenizer(example["question"], max_length=max_length, truncation=True, padding='max_length'):

  Processes the "question" field using the tokenizer.<br>
  max_length=max_length: Limits the maximum length.<br>
  truncation=True: Truncates text if it exceeds the maximum length.<br>
  padding='max_length': Pads all sequences to the maximum length.


- labels = tokenizer(example["code_solution"], max_length=max_length, truncation=True, padding='max_length'):

  Tokenizes the "code_solution" field in the same way to produce the labels.

- model_inputs["labels"] = labels["input_ids"]:

  Adds the processed label input IDs to the model inputs.

- return model_inputs:

  Returns the processed model inputs, including input IDs and corresponding labels.



The purpose of this function is to convert raw text data into a format directly usable by the model, including converting text to token IDs and ensuring all inputs and labels have consistent lengths.

In [32]:
# Set the padding token to be the same as the eos token
tokenizer.pad_token = tokenizer.eos_token

def preprocess_function(example, tokenizer=tokenizer):
    max_length = 600  # Set a consistent max_length for both input and labels
    model_inputs = tokenizer(example["question"], max_length=max_length, truncation=True, padding='max_length')
    labels = tokenizer(example["code_solution"], max_length=max_length,truncation=True, padding='max_length')
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
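For a decoder-only model, the labels must line up position by position with `input_ids`, so a common alternative is to concatenate the question and the solution into one sequence and set the label to -100 (the value ignored by the loss) on the question and padding positions. Note also that `DataCollatorForLanguageModeling` with `mlm=False`, used later in this notebook, builds its own labels from `input_ids`. The sketch below uses toy token ids instead of the real tokenizer; `build_causal_lm_example` and all ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def build_causal_lm_example(question_ids, solution_ids, eos_id, pad_id, max_length):
    """Concatenate prompt and target, then mask prompt and padding in the labels."""
    input_ids = (question_ids + solution_ids + [eos_id])[:max_length]
    # Loss is computed only on the solution (and EOS) positions.
    labels = ([IGNORE_INDEX] * len(question_ids) + solution_ids + [eos_id])[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad everything to max_length.
    n_pad = max_length - len(input_ids)
    input_ids = input_ids + [pad_id] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    labels = labels + [IGNORE_INDEX] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = build_causal_lm_example(
    question_ids=[11, 12, 13], solution_ids=[21, 22], eos_id=2, pad_id=0, max_length=8
)
for key, value in example.items():
    print(key, value)
```

A collator that keeps precomputed labels, rather than rebuilding them from `input_ids`, would be needed for this masking to take effect during training.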

- The main purpose of this code is to:

  Apply the preprocessing function to each sample in the dataset.<br>
  Perform this in batch mode for efficiency.<br>
  Remove the original columns, keeping only the preprocessed data.<br>
  Create a new dataset containing the preprocessed data.

In [33]:
processed_datasets = dataset_dict.map(preprocess_function,  batched = True, remove_columns = dataset_dict["train"].column_names )
processed_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 6755
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1689
    })
})

- Purpose of Attention Mask:

  It tells the model which tokens are real input tokens and which are padding tokens.
  For real tokens, the mask value is 1; for padding tokens, it's 0.
  This allows the model to ignore padding tokens and focus only on real input content.

- Why Attention Mask is Needed:

  In batch processing, sequences of different lengths are padded to the same length.
  Attention Mask helps the model identify which parts are real data and which are padding.
  This is crucial for correctly calculating attention and loss.
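As a small illustration of the description above, the following sketch pads two toy sequences to a common length and builds the matching attention masks. The token ids and the `pad_batch` helper are made up for the example; in the notebook the tokenizer produces both fields.

```python
def pad_batch(sequences, pad_id):
    """Right-pad sequences to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7, 8], [9, 10]], pad_id=0)
print(ids)
print(mask)
```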

In [34]:
processed_datasets["train"]

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 6755
})

<h1>PEFT: Parameter-Efficient Fine-Tuning</h1>

In [37]:
from peft import PromptEncoderConfig, PromptEncoderReparameterizationType, IA3Config, PeftModel, get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training

- PromptEncoderConfig:

  Used to configure settings for the prompt encoder.
  Prompt encoding is a technique to adjust model behavior without changing the original model parameters.


- PromptEncoderReparameterizationType:

  Defines the reparameterization type for the prompt encoder.
  Reparameterization can help optimize the training process.


- IA3Config:

  Configuration class for IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations).
  IA3 is a parameter-efficient fine-tuning method.


- PeftModel:

  Base class for PEFT models, used to create various types of PEFT models.


- get_peft_model:

  A function to convert a regular Transformer model into a PEFT model.


- LoraConfig:

  Configuration class for LoRA (Low-Rank Adaptation).
  LoRA is a popular parameter-efficient fine-tuning method.


- TaskType:

  Defines different task types, such as sequence-to-sequence, causal language modeling, etc.


- prepare_model_for_kbit_training:

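  Function to prepare a quantized (k-bit) model for training, for example by freezing the base weights and casting the remaining half-precision parameters (such as layer norms) to float32.
  It is imported here but not called in the cells below.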


<h3>Checking the Model Structure</h3>

In [38]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(102400, 4096)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (nor

<h1>Method 1: IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)</h1>

- Characteristics of the IA3 method:

  It fine-tunes the model by learning small vectors that rescale internal activations, such as the key and value projections and the feed-forward activations.
  "Inhibiting and Amplifying" refers to its ability to suppress or enhance certain internal features of the model.
  This method can effectively adapt to new tasks while adjusting very few parameters.

- Advantages of using IA3:

  High parameter efficiency: Only a small number of additional parameters need to be trained.
  Strong adaptability: Can effectively adjust the model to suit specific tasks.
  Computational efficiency: Training needs far less memory than full-parameter fine-tuning, because only the scaling vectors receive gradients.

- This configuration is particularly suitable for:

  Situations requiring quick adaptation to new tasks.<br>
  Environments with limited computational resources.<br>
  Scenarios where maintaining most of the original model's knowledge is necessary.

In [39]:
config = IA3Config(task_type=TaskType.CAUSAL_LM)

In [40]:
ia3_model = get_peft_model(model, config)

In [41]:
ia3_model

PeftModelForCausalLM(
  (base_model): IA3Model(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(102400, 4096)
        (layers): ModuleList(
          (0-29): 30 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (k_proj): ia3.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.FloatTensor of size 4096x1 (cuda:0)])
              )
              (v_proj): ia3.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (ia3_l): ParameterDict(  (default): Parameter containing: [torch.cuda.FloatTensor of size 4096x1 (cuda:0)])
              )
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (rotary_emb

In [51]:
ia3_model = ia3_model.to(torch.bfloat16)

In [42]:
ia3_model.print_trainable_parameters()

trainable params: 576,000 || all params: 6,910,941,696 || trainable%: 0.0083
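The reported 576,000 trainable parameters can be reproduced with simple arithmetic from the layer sizes printed earlier. The sketch assumes that IA3 adds one scaling vector per targeted layer: one entry per output feature for `k_proj` and `v_proj` (the 4096x1 tensors shown above), and one entry per input feature for the feed-forward `down_proj`. The `down_proj` target is an assumption, since the printed model is truncated before the MLP block.

```python
# Reproduce the IA3 trainable-parameter count from the model dimensions.
HIDDEN_SIZE = 4096         # q/k/v/o projection width (from the printed model)
INTERMEDIATE_SIZE = 11008  # MLP width (from the printed model)
NUM_LAYERS = 30

per_layer = (
    HIDDEN_SIZE            # k_proj: one scale per output feature
    + HIDDEN_SIZE          # v_proj: one scale per output feature
    + INTERMEDIATE_SIZE    # down_proj: one scale per input feature (assumed target)
)
ia3_trainable = per_layer * NUM_LAYERS
print(f"{ia3_trainable:,}")
```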


In [43]:
print(type(ia3_model))

<class 'peft.peft_model.PeftModelForCausalLM'>


In [44]:
from transformers import DataCollatorForLanguageModeling

In [54]:
args = TrainingArguments(
    output_dir = "/content/sample_data/",
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 8,
    logging_steps = 10,
    num_train_epochs = 1,
    bf16=True,
    bf16_full_eval=True,
)
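With `per_device_train_batch_size = 1` and `gradient_accumulation_steps = 8`, gradients from eight samples are accumulated before each optimizer update. The arithmetic below is a rough sketch that assumes a single GPU and the 6,755 training rows shown earlier; the exact step count reported by Trainer may differ slightly depending on how the last partial accumulation window is handled.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1            # assumption: a single GPU
num_train_rows = 6755      # size of the training split shown earlier
num_train_epochs = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
approx_optimizer_steps = (num_train_rows // effective_batch_size) * num_train_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"approximate optimizer steps: {approx_optimizer_steps}")
```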

### Note: newer versions of transformers raise the following ValueError:

ValueError: You cannot perform fine-tuning on purely quantized models.

Workaround used here: !pip install transformers==4.33.0

In [55]:
trainer = Trainer(
    model=ia3_model,
    args=args,
    train_dataset=processed_datasets["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, return_tensors='pt', mlm=False)
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None)


### On an L4 GPU, training takes around 1.5 hours

In [57]:
trainer.train()

<h1>Method 2: QLoRA (Quantized Low-Rank Adaptation)</h1>

QLoRA combines the advantages of quantization (reduced memory usage) with the efficiency of LoRA (parameter-efficient fine-tuning): the base model stays frozen in 4-bit precision and only the low-rank adapter weights are trained, which makes fine-tuning large language models feasible with limited resources.

## Explanation of LoRA vs IA3

  The following code uses the LoRA (Low-Rank Adaptation) technique, which differs from the previously discussed IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations).

- LoRA vs IA3:

  LoRA: Adapts pre-trained models by adding low-rank matrices.

  IA3: Adjusts models by inhibiting and amplifying internal activations.

- Main Advantages of LoRA:

  High parameter efficiency: Adds only a small number of trainable parameters.

  Computational efficiency: Fast training and inference.

  Flexibility: Easy to switch or combine different LoRA adapters.

  In summary, this code sets up a PEFT configuration using LoRA technology for a causal language modeling task, primarily adjusting the projection matrices in the attention mechanism. Compared to IA3, LoRA offers a different approach to parameter-efficient fine-tuning, potentially performing better or being easier to adjust for certain tasks.

In [58]:
# Define the LoRA configuration for a causal language modeling task
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=2,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
)

In [59]:
model.enable_input_require_grads()

In [60]:
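# Note: get_peft_model modifies `model` in place, so the IA3 layers injected in Method 1
# are still present here; reload the base model first to train LoRA in isolation.
# This assignment also shadows the PeftModel class imported from peft.
PeftModel = get_peft_model(model, peft_config)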

In [61]:
PeftModel.enable_input_require_grads()

In [62]:
PeftModel.print_trainable_parameters()

trainable params: 1,966,080 || all params: 6,912,907,776 || trainable%: 0.0284
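The LoRA parameter count above follows directly from the configuration: each targeted projection gains two small matrices, A (r x in_features) and B (out_features x r), and the low-rank update is scaled by lora_alpha / r. A short check, using the dimensions from the printed model:

```python
# Reproduce the LoRA trainable-parameter count from LoraConfig and the model dimensions.
r = 2
lora_alpha = 16
in_features = out_features = 4096  # q/k/v/o projections (from the printed model)
num_target_modules = 4             # q_proj, k_proj, v_proj, o_proj
num_layers = 30

params_per_module = r * in_features + out_features * r  # lora_A + lora_B
lora_trainable = params_per_module * num_target_modules * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"{lora_trainable:,} trainable parameters, scaling = {scaling}")
```

Doubling `r` doubles the number of trainable parameters and, with `lora_alpha` fixed, halves the scaling factor.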


In [63]:
args = TrainingArguments(
    output_dir = "/content/sample_data/",
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 8,
    logging_steps = 10,
    num_train_epochs = 1,
)

In [64]:
trainer = Trainer(
    model = PeftModel,
    args = args,
    train_dataset = processed_datasets["train"],
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, return_tensors = 'pt', mlm=False)
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None)


### On an L4 GPU, training takes around 1 hour 44 minutes

In [66]:
trainer.train()

<h1>Method 3: P-Tuning</h1>

P-tuning is a parameter-efficient fine-tuning technique specifically designed for large language models. Here are the main features and working principles of P-tuning:

- Basic Concept:
  P-tuning focuses on optimizing the model's prompts.
  It replaces manually designed discrete prompts with learned continuous virtual tokens.

- Working Mechanism:
  Adds trainable embedding vectors at the beginning of the input sequence.
  These embedding vectors are called "virtual tokens" and their values are optimized during training.
  Virtual tokens act as a soft prompt, guiding the model to generate task-specific outputs.

- Advantages:
  High parameter efficiency: Only a small number of parameters need to be trained.
  Flexibility: Can adapt to different tasks and domains.
  Performance: Can achieve performance comparable to full model fine-tuning on certain tasks.

- Applications:
  Particularly suitable for few-shot and zero-shot learning scenarios.
  Performs well in various natural language processing tasks such as text classification, named entity recognition, etc.

- Comparison with Other Methods:
  Unlike traditional fine-tuning, P-tuning only modifies the input layer without changing other parts of the model.
  Compared to LoRA, P-tuning focuses on optimizing the input layer rather than the internal weights of the model.

- Implementation:
  No changes to the tokenizer or model architecture are required.
  With Hugging Face's PEFT library, it is configured through PromptEncoderConfig and applied with get_peft_model, as in the cell below.

  P-tuning represents a new approach of adapting to new tasks by optimizing the input rather than directly modifying model parameters. This method achieves efficient task adaptation while keeping most of the model parameters unchanged.





In [67]:
config = PromptEncoderConfig(task_type = TaskType.CAUSAL_LM,
                num_virtual_tokens = 10)

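# Note: `model` still contains the IA3 and LoRA layers injected by the earlier methods
# (visible in the printed structure below); reload the base model to train P-tuning in isolation.
pt_model = get_peft_model(model, config)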

In [68]:
pt_model.print_trainable_parameters()

trainable params: 50,384,896 || all params: 6,963,292,672 || trainable%: 0.7236
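The 50,384,896 trainable parameters are consistent with a prompt encoder made of a 10 x 4096 embedding table followed by a three-layer MLP of width 4096. That structure is an assumption about PEFT's default prompt encoder rather than something shown in this notebook, but it reproduces the printed number exactly:

```python
# Check the P-tuning trainable-parameter count under an assumed encoder structure.
num_virtual_tokens = 10
hidden_size = 4096  # token embedding width (from the printed model)

embedding_params = num_virtual_tokens * hidden_size
# Assumed: three Linear(hidden_size, hidden_size) layers, each with a bias.
mlp_params = 3 * (hidden_size * hidden_size + hidden_size)

ptuning_trainable = embedding_params + mlp_params
print(f"{ptuning_trainable:,}")
```

Under this assumption, almost all trainable parameters sit in the MLP rather than in the virtual-token embeddings themselves.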


In [69]:
pt_model

PeftModelForCausalLM(
  (base_model): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(102400, 4096)
      (layers): ModuleList(
        (0-29): 30 x LlamaDecoderLayer(
          (self_attn): LlamaAttention(
            (q_proj): lora.Linear4bit(
              (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=4096, out_features=2, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=2, out_features=4096, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
              (lora_magnitude_vector): ModuleDict()
            )
            (k_proj): lora.Linear4bit(
              (base_layer): ia3.Linear4bit(
           

In [70]:
args = TrainingArguments(
    output_dir = "/content/sample_data/",
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 8,
    logging_steps = 10,
    num_train_epochs = 1,
)

In [71]:
trainer = Trainer(
    model = pt_model,
    args = args,
    train_dataset = processed_datasets["train"],
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, return_tensors = 'pt', mlm=False)
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None)


### On an L4 GPU, training takes around 1 hour 43 minutes

In [73]:
trainer.train()