<h3>Unsloth Demonstration</h3>

This notebook demonstrates how to use the Unsloth library to fine-tune a model, run inference with it, and save the trained model for later use:

1. Environment Setup:
   - Mounts Google Drive to access project files.
   - Installs the necessary dependencies via requirements.txt.

2. Model and Data Loading:
   - Loads a pretrained model (via `FastLanguageModel`) and its tokenizer.
   - Loads the dataset (data.pkl) and preprocesses it for training, including cleaning and transforming text data.

3. Data Preparation:
   - Constructs training prompts from question texts, wrong answers, and misconceptions.
   - Uses stratified K-fold splitting to hold out data for evaluation; this demonstration trains on the first fold only.

4. Training and Fine-tuning:
   - Fine-tunes the model with PEFT (LoRA) to adapt it to the task efficiently.
   - Uses the generated prompts as input so the model learns to identify and explain misconceptions.

5. Inference:
   - Generates test prompts to evaluate the model's responses on held-out inputs.
   - The trained model can then be used for inference on new data.

6. Model Saving:
   - The trained model is saved so it can be reused for downstream tasks such as inference or further fine-tuning.

<h3>Disclaimer</h3>
This is a personal project, and none of Wells Fargo's equipment was used in the creation of any part of this notebook. The entire notebook was developed in Google Colab using a T4 GPU and Google Vertex, utilizing foundation models from the Model Garden.

<h3>Environment Setup</h3>

In [3]:
import ctypes, gc
from unsloth import FastLanguageModel, is_bfloat16_supported
import polars as pl
from pylatexenc.latex2text import LatexNodes2Text
import torch
import numpy as np
import random, os
import pandas as pd
from google.colab import userdata
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          DataCollatorWithPadding,
                          logging)
from datasets import Dataset
from sklearn.model_selection import StratifiedKFold, train_test_split
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
import joblib
import regex as re
from collections import namedtuple
from tqdm import tqdm
import functools

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [4]:
path = "/content/drive/MyDrive/unsloth_demmo"
data_file_path = f'{path}/data.pkl'
seed = 0
max_seq_length = 1024
model_name = 'Qwen/Qwen2.5-7B-Instruct'
n_splits = 5
key_ = userdata.get('HF_TOKEN')
lora_adapter = f'{path}/lora_adapter'
output_dir='prl90777/qwen_7b_instruct_demmo'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # disable parallelism in the Hugging Face tokenizers library


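The `initialize_seeds(seed)` function improves reproducibility by seeding PyTorch (on the CPU and on all CUDA devices), NumPy, and Python’s `random` module, so that random operations start from the same state in every run. It also switches cuDNN to deterministic mode and disables its benchmark autotuner, so the choice of convolution algorithm does not vary between runs. Some GPU operations can still be non-deterministic, so results should be expected to match closely rather than bit for bit.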

In [5]:
def initialize_seeds(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

initialize_seeds(seed)
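
As a small illustration of what seeding buys, the sketch below uses only Python's `random` module; the helper name `draw` is hypothetical and not part of the notebook. Re-seeding with the same value replays the same sequence, which is the behaviour `initialize_seeds` relies on for each library it seeds.

```python
import random

def draw(seed, n=3):
    """Seed Python's RNG, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(0)
second = draw(0)   # same seed: the sequence is replayed
other = draw(1)    # different seed: a different sequence

print(first == second)
print(first == other)
```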

<h3>Model and Data Loading</h3>
The dataset consists of exam questions with incorrect answers, paired with misconceptions that explain the misunderstanding students may have when solving the problem. For example:

- Question: "Which one of the following calculations would work out the number of hours in 5 weeks?"
- Student Response: "5 × 7 × 12"
- Misconception: "Thinks there are 12 hours in 1 day"

The goal is to train a decoder model on the questions, student responses, and misconceptions to infer the misconception in the test data.


In [6]:
pl.Config.set_fmt_str_lengths(500)

full_df = (joblib.load(data_file_path)
            .select(['QuestionId',
                     'QuestionText',
                     'MisconceptionName',
                     'SubjectName',
                     'wrong_answers',
                     'wrong_index'
                     ])
            .with_columns(
                QuestionText = pl.col('QuestionText')
                                 .map_elements(lambda f:
                                               re.sub(r"(\!\[)|(\]\(\))", "", f),
                                               return_dtype = pl.Utf8
                                               )
            )
     ).head(500)


full_df

QuestionId,QuestionText,MisconceptionName,SubjectName,wrong_answers,wrong_index
i64,str,str,str,str,i64
0,"""  3 × 2+4-5 Where do the brackets need to go to make the answer equal 13 ?""","""Confuses the order of operations, believes addition comes before multiplication ""","""BIDMAS""","""Does not need brackets""",3
1,"""Simplify the following, if possible: m^2+2 m-3/m-3""","""Does not know that to factorise a quadratic expression, to find two numbers that add to give the coefficient of the x term, and multiply to give the non variable term ""","""Simplifying Algebraic Fractions""","""m+1""",0
1,"""Simplify the following, if possible: m^2+2 m-3/m-3""","""Thinks that when you cancel identical terms from the numerator and denominator, they just disappear""","""Simplifying Algebraic Fractions""","""m+2""",1
1,"""Simplify the following, if possible: m^2+2 m-3/m-3""","""Does not know that to factorise a quadratic expression, to find two numbers that add to give the coefficient of the x term, and multiply to give the non variable term ""","""Simplifying Algebraic Fractions""","""m-1""",2
2,"""Tom and Katie are discussing the 5 plants with these heights: 24 cm, 17 cm, 42 cm, 26 cm, 13 cm Tom says if all the plants were cut in half, the range wouldn't change. Katie says if all the plants grew by 3 cm each, the range wouldn't change. Who do you agree with?""","""Believes if you changed all values by the same proportion the range would not change""","""Range and Interquartile Range from a List of Data""","""Only Tom""",0
…,…,…,…,…,…
206,"""If you can connect dots to draw a regular octagon in a ring of equally spaced dots, which other regular shape can you definitely draw?""","""Does not use the associative property of multiplication to find other factors of a number""","""Factors and Highest Common Factor""","""Equilateral triangle""",0
206,"""If you can connect dots to draw a regular octagon in a ring of equally spaced dots, which other regular shape can you definitely draw?""","""Does not use the associative property of multiplication to find other factors of a number""","""Factors and Highest Common Factor""","""Hexagon""",3
206,"""If you can connect dots to draw a regular octagon in a ring of equally spaced dots, which other regular shape can you definitely draw?""","""Does not use the associative property of multiplication to find other factors of a number""","""Factors and Highest Common Factor""","""Pentagon""",2
207,"""x(3 x-5)+6(2 x-3) ≡ P x^2+Q x+R What is the value of R ?""","""Believes you can add or subtract from inside brackets without expanding when solving an equation""","""Expanding Single Brackets""","""-23""",0
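
The cleaning step in the loading cell removes Markdown image delimiters (`![` and `]()`) from the question text while keeping the image description. A minimal sketch of that substitution on its own, using the standard-library `re` module and a made-up sample string (the helper name `strip_image_markup` is hypothetical):

```python
import re

# Same pattern as in the loading cell: match "![" or "]()".
IMAGE_MARKUP = re.compile(r"(\!\[)|(\]\(\))")

def strip_image_markup(text):
    """Remove Markdown image delimiters but keep the alt text."""
    return IMAGE_MARKUP.sub("", text)

# Hypothetical example, not taken from the dataset.
sample = "![A number line from 0 to 1]() Which fraction is marked?"
print(strip_image_markup(sample))
```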


The task calls for a model with strong mathematical and reasoning capabilities, so I selected Qwen2.5-7B-Instruct. To fine-tune it within the memory of a single T4 GPU, I used Unsloth, a library that speeds up LoRA fine-tuning and reduces its memory use. For more information, refer to the [Unsloth Site](https://docs.unsloth.ai/).

In [7]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = None, #None lets Unsloth choose the dtype automatically (float16 on a T4, which lacks bfloat16 support)
    load_in_4bit = True, #4-bit quantization reduces memory usage and can speed up inference, but reduces precision
)


==((====))==  Unsloth 2024.9.post4: Fast Qwen2 patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
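
A rough back-of-the-envelope estimate shows why 4-bit loading matters on this GPU. The sketch assumes about 7.61 billion parameters, the figure published for Qwen2.5-7B (the notebook does not print it), and counts weight storage only, ignoring quantization constants, activations, and optimizer state.

```python
# Assumption: ~7.61e9 parameters (published figure for Qwen2.5-7B).
N_PARAMS = 7.61e9
GIB = 1024 ** 3

def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f"4-bit weights:  {int4_gib:.2f} GiB")
```

Under these assumptions the 16-bit weights alone would nearly fill the 14.748 GB reported for the T4, while the 4-bit weights leave room for LoRA adapters and activations.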


In [8]:

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = seed,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.9.post4 patched 28 layers with 0 QKV layers, 28 O layers and 28 MLP layers.
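
The LoRA settings above determine how many parameters are actually trained: each adapted linear layer gains two matrices of shapes `(in_features, r)` and `(r, out_features)`. The sketch below counts them. The hidden size (3584) and layer count (28) appear in the notebook output; the key/value projection width (512) and MLP width (18944) are assumptions taken from the published Qwen2.5-7B configuration, since the model printout is truncated.

```python
R = 32            # LoRA rank, as passed to get_peft_model
LORA_ALPHA = 16
N_LAYERS = 28
HIDDEN = 3584

# Assumptions from the published Qwen2.5-7B config (not shown in the notebook):
KV_DIM = 512          # 4 key/value heads x 128 dimensions per head
INTERMEDIATE = 18944  # MLP width

# (in_features, out_features) for every targeted module in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """Parameters in the A (in x r) and B (r x out) matrices."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the LoRA update when use_rslora is False

print(f"{total:,}")  # 80,740,352
print(scaling)       # 0.5
```

The total agrees with the "Number of trainable parameters" line in the training log, which suggests the assumed dimensions are the ones in use.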


In [9]:

model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(152064, 3584)
        (layers): ModuleList(
          (0-27): 28 x Qwen2DecoderLayer(
            (self_attn): Qwen2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3584, out_features=3584, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3584, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=3584, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4bit(
       

The `prompt_message_train` function builds the prompts used to train the model to infer a misconception from a question and a wrong answer. Each prompt places the question, the wrong answer, and the misconception under fixed section markers, so the model learns to explain the underlying misunderstanding behind each incorrect response. Refining this format took more time than any other step, and keeping the structure strictly consistent is essential for good training results.

In [10]:
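def prompt_message_train(row):
    # `row` is a batch (a dict of lists) because the function is mapped with batched=True.
    questions = row['QuestionText']
    wrong_answers = row['wrong_answers']
    # The test split has no MisconceptionName column, so fall back to empty strings.
    misconceptions = row.get('MisconceptionName', [''] * len(wrong_answers))
    holder = zip(questions, wrong_answers, misconceptions)
    prompts = []

    for question, wrong_answer, misconception in holder: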
      prompt = f"""Explain the fundamental misunderstanding of this concept in simple, non-specific terms.
                    ### Instruction:
                    {question.strip()}
                    ### Input:
                    {wrong_answer}
                    ### Response:
                    {misconception}
                """
      prompts.append(prompt.strip() + tokenizer.eos_token)

    return {"prompts" :prompts}
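
Because the function above appends the end-of-sequence token to every prompt, test prompts also end with it directly after `### Response:`. One possible variant, sketched below, keeps the token for training examples and leaves the response open for inference. The names `build_prompt` and `EOS` are hypothetical, and the layout here drops the indentation whitespace of the original template, so it would need to be used for both training and inference to stay consistent.

```python
EOS = "<|im_end|>"  # end-of-sequence token seen in the notebook's prompt output

def build_prompt(question, wrong_answer, misconception=None, eos_token=EOS):
    """Build a training prompt, or an open-ended inference prompt when
    no misconception is supplied."""
    header = (
        "Explain the fundamental misunderstanding of this concept "
        "in simple, non-specific terms.\n"
        f"### Instruction:\n{question.strip()}\n"
        f"### Input:\n{wrong_answer}\n"
        "### Response:"
    )
    if misconception is None:
        return header  # inference: the model continues from here
    return f"{header}\n{misconception.strip()}{eos_token}"

train_prompt = build_prompt(
    "How many hours are there in 5 weeks?",
    "5 × 7 × 12",
    "Thinks there are 12 hours in 1 day",
)
test_prompt = build_prompt("How many hours are there in 5 weeks?", "5 × 7 × 12")
print(train_prompt)
print(test_prompt)
```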


In [11]:
hf_df = Dataset.from_polars(full_df)

In [12]:
# Split at the question level so all wrong answers for a question stay in the same fold.
df = (full_df
     .to_pandas()
     .drop_duplicates(subset = ['QuestionId'])
    )

data_dic = {}
stratified_kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
stratify_column = df['SubjectName']
for fold, (train_index, test_index) in enumerate(stratified_kfold.split(df, stratify_column)):
    print(f"This is fold: {fold}")
    # Build the ID sets once instead of rebuilding a list for every filtered row.
    train_ids = set(df['QuestionId'].iloc[train_index].tolist())
    test_ids = set(df['QuestionId'].iloc[test_index].tolist())
    tok_train_df = hf_df.filter(lambda f: f['QuestionId'] in train_ids)
    tok_test_df = (hf_df.filter(lambda f: f['QuestionId'] in test_ids)
                        .remove_columns('MisconceptionName')
                  )
    data_dic[fold] = {
        'train': tok_train_df.map(prompt_message_train, batched=True),
        'test': tok_test_df.map(prompt_message_train, batched=True)
    }
    break  # this demonstration uses the first fold only

This is fold: 0
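
The split above is made over unique question IDs, so every wrong answer belonging to a question lands on the same side of the split. The pure-Python sketch below illustrates that idea on toy rows. It is a simplified, unstratified stand-in for `StratifiedKFold`, and the helper name `question_level_folds` is hypothetical.

```python
import random

def question_level_folds(question_ids, n_splits, seed):
    """Assign each unique question ID to exactly one fold (round-robin
    after shuffling). Returns {fold_number: set_of_question_ids}."""
    unique_ids = sorted(set(question_ids))
    random.Random(seed).shuffle(unique_ids)
    folds = {k: set() for k in range(n_splits)}
    for position, qid in enumerate(unique_ids):
        folds[position % n_splits].add(qid)
    return folds

# Toy rows of (QuestionId, wrong_index); question 1 has three wrong answers.
rows = [(0, 3), (1, 0), (1, 1), (1, 2), (2, 0), (3, 1), (4, 2), (4, 0)]
folds = question_level_folds([qid for qid, _ in rows], n_splits=5, seed=0)

test_ids = folds[0]
train_rows = [row for row in rows if row[0] not in test_ids]
test_rows = [row for row in rows if row[0] in test_ids]

print(len(train_rows), len(test_rows))
```

Filtering rows by ID after splitting, rather than splitting the rows directly, is what prevents the same question from appearing in both the training and test sets.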





In [13]:
data_dic[0]['train'][0]

{'QuestionId': 1,
 'QuestionText': 'Simplify the following, if possible: m^2+2 m-3/m-3',
 'MisconceptionName': 'Does not know that to factorise a quadratic expression, to find two numbers that add to give the coefficient of the x term, and multiply to give the non variable term\n',
 'SubjectName': 'Simplifying Algebraic Fractions',
 'wrong_answers': 'm+1',
 'wrong_index': 0,
 'prompts': 'Explain the fundamental misunderstanding of this concept in simple, non-specific terms.\n                    ### Instruction:\n                    Simplify the following, if possible: m^2+2 m-3/m-3\n                    ### Input:\n                    m+1\n                    ### Response:\n                    Does not know that to factorise a quadratic expression, to find two numbers that add to give the coefficient of the x term, and multiply to give the non variable term<|im_end|>'}

In [14]:
data_dic[0]['test'][0]

{'QuestionId': 0,
 'QuestionText': '\n    3 × 2+4-5\n\nWhere do the brackets need to go to make the answer equal 13 ?',
 'SubjectName': 'BIDMAS',
 'wrong_answers': 'Does not need brackets',
 'wrong_index': 3,
 'prompts': 'Explain the fundamental misunderstanding of this concept in simple, non-specific terms.\n                    ### Instruction:\n                    3 × 2+4-5\n\nWhere do the brackets need to go to make the answer equal 13 ?\n                    ### Input:\n                    Does not need brackets\n                    ### Response:<|im_end|>'}

### Overview of the `SFTTrainer` Configuration

`SFTTrainer` runs supervised fine-tuning on the **LoRA (Low-Rank Adaptation)** model prepared above, with the following configuration:

1. **Model and Tokenizer**:
   - The pre-trained model and tokenizer are used to encode and fine-tune the dataset.
   
2. **Dataset**:
   - **`train_dataset`**: The dataset used for training, which contains the preprocessed prompts.
   - **`dataset_text_field`**: Specifies the field in the dataset that contains the input text for training (`"prompts"`).
   - **`max_seq_length`**: Sets the maximum sequence length for input data.

3. **Parallel Processing**:
   - **`dataset_num_proc`**: The number of processes (2) used for dataset preprocessing to improve performance.

### Training Arguments (via `TrainingArguments`):

1. **`output_dir`**: Directory for saving the model and outputs.
2. **`per_device_train_batch_size`**: Batch size per device (2 samples per training step).
3. **`gradient_accumulation_steps`**: Gradients are accumulated over 6 batches before each optimizer step, giving an effective batch size of 12.
4. **`num_train_epochs`**: Number of passes over the dataset (2 epochs).
5. **`learning_rate`**: Learning rate for the optimizer (`2e-4`).
6. **`fp16` and `bf16`**: Mixed precision settings:
   - **`fp16`** is used if `bfloat16` is not supported.
   - **`bf16`** is enabled if supported for memory efficiency.
7. **`logging_steps`**: Logs progress every 25 steps.
8. **`optim`**: Uses the **`adamw_8bit`** optimizer for memory efficiency.
9. **`weight_decay`**: Adds weight decay of 0.01 to prevent overfitting.
10. **`lr_scheduler_type`**: Uses a linear learning rate scheduler.
11. **`seed`**: Ensures reproducibility with a fixed seed.
12. **`push_to_hub`**: Automatically pushes the model to the Hugging Face hub if enabled.



In [15]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = data_dic[0]['train'],
    dataset_text_field = "prompts",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 6,
        num_train_epochs = 2,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 25,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = seed,
        push_to_hub=True
    ),
)


In [16]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.916 GB of memory reserved.


In [17]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 402 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 6
\        /    Total batch size = 12 | Total steps = 66
 "-____-"     Number of trainable parameters = 80,740,352


Step,Training Loss
25,1.8029
50,1.0341


In [21]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")


444.2157 seconds used for training.
7.4 minutes used for training.
Peak reserved memory = 6.893 GB.
Peak reserved memory for training = 0.977 GB.
Peak reserved memory % of max memory = 46.739 %.
Peak reserved memory for training % of max memory = 6.625 %.


<h3>Inference</h3>

In [19]:
holder = []
values = namedtuple("values", "id wrong_id answers")

FastLanguageModel.for_inference(model)  # switch the model to Unsloth's inference mode

for example in tqdm(data_dic[0]['test']):
    question_id = example['QuestionId']
    wrong = example['wrong_index']
    inputs = tokenizer(example["prompts"], return_tensors="pt").to(device)
    input_length = inputs['input_ids'].shape[-1]
    generated_outputs = model.generate(**inputs, max_new_tokens = max_seq_length, use_cache = True)
    # generate() returns the prompt followed by the continuation; keep only the new tokens.
    generated_tokens = generated_outputs[0][input_length:]
    generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    holder.append(values(id = question_id, wrong_id = wrong, answers = generated_text))



100%|██████████| 98/98 [02:18<00:00,  1.41s/it]
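
The loop above slices off the prompt because a decoder-only model returns the prompt tokens followed by the newly generated ones. The toy sketch below shows the same slicing and special-token filtering with plain lists; the token IDs, the vocabulary, and the helper name `decode_continuation` are all made up for illustration.

```python
# Hypothetical vocabulary; ID 2 plays the role of the end-of-sequence token.
VOCAB = {7: "Question", 9: "Answer", 55: "Thinks", 56: "wrongly", 2: "<eos>"}
SPECIAL_IDS = {2}

def decode_continuation(output_ids, prompt_length):
    """Drop the prompt tokens, skip special tokens, and join the rest."""
    new_ids = output_ids[prompt_length:]
    words = [VOCAB[i] for i in new_ids if i not in SPECIAL_IDS]
    return " ".join(words)

prompt_ids = [7, 9]
output_ids = prompt_ids + [55, 56, 2]  # prompt followed by generated tokens

print(decode_continuation(output_ids, len(prompt_ids)))  # Thinks wrongly
```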


In [34]:
pl.Config.set_tbl_rows(21)

(pl.DataFrame(pd.DataFrame(holder))
.rename({'id':'QuestionId','wrong_id' :'wrong_index', 'answers':  'Pre_MisconceptionName'})
.join(
    full_df.select( ['QuestionId','QuestionText', 'wrong_answers', 'wrong_index', 'MisconceptionName']),
    on = ['QuestionId','wrong_index', ]
    )

).select('QuestionId','QuestionText', 'wrong_answers', 'Pre_MisconceptionName', 'MisconceptionName').sample(10)


QuestionId,QuestionText,wrong_answers,Pre_MisconceptionName,MisconceptionName
i64,str,str,str,str
23,"""This graph shows how far Fido the dog is from his home. What might the negative-sloping section represent? A graph with time (secs) on the horizontal axis and distance (m) on the vertical axis. The graph starts at the origin, travels in a straight line up and right, travels horizontally, then travels in a straight line down and right back to the x-axis. ""","""Fido is accelerating""","""  Believes acceleration is when an object moves downwards""","""Believes the gradient of a distance-time graph is the acceleration"""
168,"""The pictogram shows how many people came to watch a netball match. What is the mean number of people at each game? A pictogram with a key showing a smiley face represents 12 people. Game 1 has 4 smiley faces, Game 2 has 2 smiley faces, Game 3 has 3 smiley faces and Game 4 has 1 smiley face.""","""2.5""","""  Believes that the mean is the total divided by the total""","""When interpreting a pictogram, thinks each symbol stands for 1"""
68,"""One of these equations has no real solutions Which is it?""","""(w-4)^2=0""","""  Thinks an equation with squared term can never have no real solutions""","""Believes 0 has no square root"""
204,"""How do you write this number in words? 204050""","""Two million and four thousand and fifty""","""  When reading a large number, reads all zeros as zero""","""Mistakes the place value column of hundred thousands with millions"""
53,"""In which region would a rectangle belong? Venn diagram with two circles labelled with 'exactly two lines of symmetry' and 'rotational symmetry of order 2'. A is in the only exactly two lines of symmetry region, B is in the intersection, C is in the only rotational symmetry of order 2 region and D is outside the two circles.""","""C""","""  Believes an equilateral triangle has rotational symmetry of order 2""","""Thinks rectangles have line symmetry in the diagonals """
98,"""This is a graph of y=0.4 x-3 A set of axes with the graph y=0.4x-3 drawn on. Use the graph to solve 0.4 x-3=-1""","""x=-3.3""","""  When solving an equation from a graph, solves the original equation instead of substituting the value into the other side of the equation""","""Mixes up the value of two terms when substituting"""
139,"""Alison is expanding these two brackets. What should she get when she multiplies the two terms indicated by the arrows? The two brackets are (5p-4)(2p-3). The arrows are pointing at the -4 in the first bracket and the -3 in the second bracket.""","""-12""","""  When multiplying two negative numbers, thinks the answer will be negative""","""Believes multiplying two negatives gives a negative answer"""
153,"""Solve the equation: 12 d-3=0""","""d=4""","""  Thinks you should multiply when solving a linear equation""","""Swaps the dividend and divisor in order to get an integer answer"""
150,"""Which of the following is not a measure of volume?""","""Cubic metres""","""  Believes all units that are cubed are measures of volume""","""Does not know that cubic means the same as cubed in units"""
9,"""What transformation maps the graph of y=f(x) to the graph of y=f(x-3)""","""Translation by vector [[ -3; 0 ]]""","""  Believes a horizontal translation should be given with the x-coordinate negative""","""Believes f(x) + a translates the function right a units"""
