Install the Unsloth optimization libraries.

In [1]:
%%capture

!pip install unsloth

!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
print("hello")

hello


FastLanguageModel loads pre-trained Llama models through Unsloth.

torch provides deep learning and GPU computation.



max_seq_length: This sets the maximum sequence length for the input tokens. The model can process up to 2048 tokens in a single sequence.



dtype: This is the data type for the model’s tensors. It’s set to None here for automatic detection based on the GPU.



load_in_4bit: This flag is set to True, enabling 4-bit quantization. This is a memory optimization technique: the weights are stored in 4-bit precision, so the model needs far less GPU memory than it would at 16-bit precision.
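As a rough illustration of why 4-bit loading saves memory, the sketch below estimates the size of the weights alone at different precisions. The parameter count of about 3.2 billion is an assumption for a 3B-class model, and weight_memory_gb is a hypothetical helper, not part of Unsloth. Real usage is higher, because some layers stay in higher precision and activations, optimizer state and quantization metadata are not counted.

```python
def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed parameter count for a 3B-class model (illustrative, not measured).
N_PARAMS = 3_200_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint by a factor of four.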

In [3]:
from unsloth import FastLanguageModel

import torch

max_seq_length = 2048 

dtype = None 

load_in_4bit = True # Use 4bit quantization to reduce memory usage.


model, tokenizer = FastLanguageModel.from_pretrained(

    model_name = "unsloth/Llama-3.2-3B-Instruct",

    max_seq_length = max_seq_length,

    dtype = dtype,

    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers via:
`pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"`


LoRA enables fine-tuning by adding and updating only a small number of trainable parameters while keeping the main model frozen. get_peft_model wraps the model with LoRA functionality.



 r = 16: This is the rank of the low-rank decomposition. The value 16 means that LoRA will use a low-rank matrix with rank 16 to approximate the model updates.



 target_modules: These are the specific components of the model where LoRA will be applied. In this case, it's applied to:



q_proj, k_proj, v_proj, o_proj: These represent different projection layers in the attention mechanism.

gate_proj, up_proj, down_proj: These correspond to components of the feed-forward network in the transformer model.



lora_alpha = 16: This is a scaling factor for the LoRA matrices. The LoRA update is multiplied by lora_alpha / r (here 16 / 16 = 1), so it controls the impact of the LoRA updates on the model's output. A higher alpha gives more weight to the LoRA updates, while a smaller alpha reduces their influence.



lora_dropout = 0: This specifies the dropout rate for the LoRA layers. Setting this to 0 means no dropout is applied. Other values are supported, but 0 is the setting Unsloth is optimized for, and it is a reasonable choice when dropout is not needed for regularization.
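To make the effect of r and target_modules concrete, here is a small sketch that counts the trainable parameters LoRA adds. Each adapted linear layer gets two matrices, A (r x in_features) and B (out_features x r). The layer dimensions below are assumptions about the Llama-3.2-3B architecture and are not stated in this notebook; lora_params is a hypothetical helper.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in A (r x in) plus B (out x r) for one adapted linear layer."""
    return r * (in_features + out_features)


# Assumed Llama-3.2-3B dimensions (not taken from this notebook).
HIDDEN = 3072          # model width
INTERMEDIATE = 8192    # feed-forward width
KV_DIM = 8 * 128       # 8 key/value heads of size 128 (grouped-query attention)
NUM_LAYERS = 28
R = 16
LORA_ALPHA = 16

# (in_features, out_features) for every module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, R) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the LoRA update

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling factor: {scaling}")
```

With these assumed dimensions the total agrees with the "Number of trainable parameters = 24,313,856" line that Unsloth prints when training starts.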

In [4]:
model = FastLanguageModel.get_peft_model(

    model,

    r = 16, 

    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",

                      "gate_proj", "up_proj", "down_proj",],

    lora_alpha = 16,

    lora_dropout = 0, 

    bias = "none",  

    use_gradient_checkpointing = "unsloth",

    random_state = 3407,

    use_rslora = False,

    loftq_config = None,
)

Unsloth 2024.10.3 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


<a name="Data"></a>

### Data Prep

We now use the `Llama-3.1` format for conversation-style finetunes. We use our scraped QnA dataset in Hugging Face's normal multiturn format `("role", "content")`.



This imports the get_chat_template function from the unsloth.chat_templates module.

Then, it applies a specific chat template ("llama-3.1") to the tokenizer. The chat template is a predefined structure that formats the conversational data in a way that's suitable for training or fine-tuning the LLaMA model. This ensures the data conforms to the style and format expected by the model for input processing.
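The following is a simplified sketch of what such a template does, written in plain Python so the structure is visible. It is not the real "llama-3.1" template: the actual one also injects the date preamble into the system message, which is left out here, and render_llama31 is a hypothetical name.

```python
def render_llama31(messages, add_generation_prompt=False):
    """Simplified Llama-3.1-style rendering of a list of role/content dicts."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to speak.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo = [
    {"role": "system", "content": "You are an assistant"},
    {"role": "user", "content": "Hi"},
]
print(render_llama31(demo, add_generation_prompt=True))
```

For training data the generation prompt is left off, because the assistant turn is already part of the conversation; for inference it is added so the model continues with a reply.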

In [5]:
from unsloth.chat_templates import get_chat_template



tokenizer = get_chat_template(

    tokenizer,

    chat_template = "llama-3.1",

)



def formatting_prompts_func(examples):
    # Render every conversation in the batch as one training string
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }



from datasets import load_dataset

dataset = load_dataset('json', data_files='/kaggle/input/qna-scraped/processed_data.json', split='train')

Generating train split: 0 examples [00:00, ? examples/s]

Format the dataset into the correct template for the model.

NOTE: this step is possibly unnecessary because our data has already been formatted.

In [6]:
from unsloth.chat_templates import standardize_sharegpt

dataset = standardize_sharegpt(dataset)

dataset = dataset.map(formatting_prompts_func, batched = True,)

Standardizing format:   0%|          | 0/39668 [00:00<?, ? examples/s]

Map:   0%|          | 0/39668 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [7]:
dataset[5]["conversations"]

[{'content': 'You are an assistant', 'role': 'system'},
 {'content': 'Good morning all,\nI\'m working on a project related to the occupations that university graduates are likely to go into. I have a large data set (N in the hundreds of thousands) where the unit of analysis is an individual person. For each person, I have two categorical variables - a code representing the field of their university degree and a code representing their occupation. I\'m looking for a statistically valid way to find out what fields of study and occupations "go together." In other words, what courses of study prepare people for which jobs?\nSo far, I\'ve considered doing this with simple descriptive statistics... pull, say, the top 10 occupations for every subject area while ruling out occupations like cashiers, fast food workers, etc. But it would be great if there was some sort of more rigorous test that could be used for this. Perhaps one regression per occupation with tons of dummy variables representi

And we see how the chat template transformed these conversations.



[Notice] Llama 3.1 Instruct's default chat template adds "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024", so do not be alarmed!

In [8]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGood morning all,\nI\'m working on a project related to the occupations that university graduates are likely to go into. I have a large data set (N in the hundreds of thousands) where the unit of analysis is an individual person. For each person, I have two categorical variables - a code representing the field of their university degree and a code representing their occupation. I\'m looking for a statistically valid way to find out what fields of study and occupations "go together." In other words, what courses of study prepare people for which jobs?\nSo far, I\'ve considered doing this with simple descriptive statistics... pull, say, the top 10 occupations for every subject area while ruling out occupations like cashiers, fast food workers, etc. But it would be great if there was som



### Train the model


In [9]:
from trl import SFTTrainer

from transformers import TrainingArguments, DataCollatorForSeq2Seq

from unsloth import is_bfloat16_supported



trainer = SFTTrainer(

    model=model,  # LLaMA model for fine-tuning

    tokenizer=tokenizer,  # Tokenizer for processing text data

    train_dataset=dataset,  # Training dataset (Q&A pairs)

    dataset_text_field="text",  # Field name in dataset containing text

    max_seq_length=max_seq_length,  # Max sequence length (number of tokens)

    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),  # Collates batches for seq2seq tasks

    dataset_num_proc=2,  # Number of processes to speed up data loading

    packing=False,  # Don't pack short sequences together

    args=TrainingArguments(

        per_device_train_batch_size=4,  # Batch size per device (GPU/CPU)

        gradient_accumulation_steps=4,  # Accumulate gradients for larger effective batch size

        warmup_steps=5,  # Warmup steps to gradually increase the learning rate

        # max_steps=None,  # Set to e.g. 60 to stop training after 60 steps instead of a full epoch

        num_train_epochs = 1, # Train for one full pass over the dataset

        learning_rate=2e-3,  # Initial learning rate

        fp16=not is_bfloat16_supported(),  # Use FP16 if bfloat16 is not supported

        bf16=is_bfloat16_supported(),  # Use bfloat16 if supported by hardware

        logging_steps=1,  # Log metrics every step

        optim="adamw_8bit",  # Optimizer: AdamW with 8-bit precision for memory efficiency

        weight_decay=0.0001,  # Weight decay to prevent overfitting

        lr_scheduler_type="linear",  # Learning rate schedule: linear decay

        seed=3407,  # Set seed for reproducibility

        output_dir="outputs",  # Directory to save model checkpoints and logs





        # Run naming and experiment tracking:

        run_name = "My_Custom_Run_Name",  # A custom name for your run

        report_to = "none",  # Disable WandB (set to 'wandb' if you want to use it)

    ),

)


Map (num_proc=2):   0%|          | 0/39668 [00:00<?, ? examples/s]
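The batch settings above determine the effective batch size and the number of optimizer steps in one epoch. The sketch below works through that arithmetic for the 39,668 examples in this dataset. The rounding rule (batches rounded up, then floor-divided by the accumulation steps) is an assumption about how the Trainer version used here counts steps, and steps_per_epoch is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples, per_device_batch, grad_accum, num_devices=1):
    """Optimizer steps in one epoch under the assumed rounding rule."""
    batches = math.ceil(num_examples / (per_device_batch * num_devices))
    return max(batches // grad_accum, 1)


NUM_EXAMPLES = 39_668
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM  # single GPU
total_steps = steps_per_epoch(NUM_EXAMPLES, PER_DEVICE_BATCH, GRAD_ACCUM)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps for one epoch: {total_steps}")
```

Under these assumptions the result agrees with the "Total batch size = 16 | Total steps = 2,479" line that Unsloth reports when training starts.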

We also use Unsloth's train_on_responses_only function to train only on the assistant outputs and ignore the loss on the user's inputs.



The train_on_responses_only function restricts the loss to the assistant's responses rather than the entire conversation. It locates the boundaries of user input and model responses using the header tokens passed as instruction_part and response_part, and masks every token outside the responses by setting its label to -100, which the loss ignores. The model still reads the user's input as context but is not trained to reproduce it. This is useful when the primary goal is to improve how well the model responds to user queries, as in chatbot development.
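The masking idea can be sketched on toy token IDs as follows. This is an illustration of the technique, not Unsloth's implementation: mask_labels and find_subsequence are hypothetical helpers, and the small integers stand in for real token IDs.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def find_subsequence(seq, sub, start=0):
    """Index of the first occurrence of sub in seq at or after start, else -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_labels(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_subsequence(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_subsequence(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels


USER = [1, 2]        # stands in for the user header tokens
ASSISTANT = [1, 3]   # stands in for the assistant header tokens
EOT = 9              # stands in for the end-of-turn token

ids = [0] + USER + [10, 11, EOT] + ASSISTANT + [20, 21, EOT] \
      + USER + [12, EOT] + ASSISTANT + [22, EOT]
print(mask_labels(ids, USER, ASSISTANT))
```

Only the two assistant turns keep their token IDs as labels; every other position is -100 and contributes nothing to the loss.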

In [10]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(

    trainer,

    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",

    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",

)

Map:   0%|          | 0/39668 [00:00<?, ? examples/s]

Masking verification.



Decoding converts token IDs back to readable text, which lets us verify that the correct data is being passed into the model. First we decode the input_ids of item 5, then its labels with every masked position (-100) replaced by a space. If the masking worked, only the assistant's answer remains visible in the second output.

In [11]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are an assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGood morning all,\nI\'m working on a project related to the occupations that university graduates are likely to go into. I have a large data set (N in the hundreds of thousands) where the unit of analysis is an individual person. For each person, I have two categorical variables - a code representing the field of their university degree and a code representing their occupation. I\'m looking for a statistically valid way to find out what fields of study and occupations "go together." In other words, what courses of study prepare people for which jobs?\nSo far, I\'ve considered doing this with simple descriptive statistics... pull, say, the top 10 occupations for every subject area while ruling out occupations like cashiers, fast food workers, etc. But it would be great if there was som

In [12]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]

tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                                                                                                                           \n\nSounds like multinomial logistic regression is the tool to use.\nThe dependent variable would be "field of work" and the independent variable would be "field of study". With such a large N you can be fairly specific in defining the levels of the variables, but you should probably start with frequency counts of each and then a crosstabulation of the two, not for statistical testing but to see what\'s going on and whether you want to combine some categories of either variable.<|eot_id|>'

In [13]:
#@title Show current memory stats

gpu_stats = torch.cuda.get_device_properties(0)

start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)

max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")

print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
2.635 GB of memory reserved.


And now we train. On a Tesla T4 this should take roughly 10 minutes per 60 steps.

In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 39,668 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 2,479
 "-____-"     Number of trainable parameters = 24,313,856


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers and Unsloth!


Step,Training Loss
1,1.9705
2,2.2577
3,2.4848
4,2.5189
5,2.0442
6,2.0761
7,2.2936
8,2.2267
9,1.8906
10,2.377


KeyboardInterrupt: 

### Inference

We now run the model. TextStreamer prints the output token by token as it is generated.



temperature=1.5: Controls randomness in the generation. Higher values like 1.5 produce more diverse and creative outputs.

min_p=0.1: Filters out tokens whose probability is lower than 0.1 times the probability of the most likely token, so very unlikely tokens are excluded for more coherent results.
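A minimal sketch of how these two settings reshape the next-token distribution, in plain Python. It assumes temperature is applied before the min_p filter, and sample_distribution is a hypothetical helper, not the transformers implementation.

```python
import math


def sample_distribution(logits, temperature=1.0, min_p=0.0):
    """Apply temperature, then drop tokens below min_p * (top probability)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]  # renormalize over surviving tokens


toy_logits = [2.0, 1.0, 0.0, -3.0]
print(sample_distribution(toy_logits, temperature=1.0))
print(sample_distribution(toy_logits, temperature=1.5, min_p=0.1))
```

Raising the temperature flattens the distribution, while min_p removes the long tail relative to the top token, which is why the two are often used together.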

In [15]:
FastLanguageModel.for_inference(model) 

prompt = "What is the difference between data science and data engineering?"

messages = [

    {"role": "user", "content": prompt},

]

inputs = tokenizer.apply_chat_template(

    messages,

    tokenize = True,

    add_generation_prompt = True, # Must add for generation

    return_tensors = "pt", #returns as tensor

).to("cuda") #Uses GPU for inference



from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt = True)

_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, # Limit the generated text to 128 new tokens

                   use_cache = True, temperature = 1.5, min_p = 0.1)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Data engineering is the process of designing an infrastructure that meets the needs of the data and applications. It is a sub-discipline of computer science that focuses on the architecture of data management systems. There is an open area for data engineering, as it is an area of ongoing development, innovation, and learning. Data engineering requires expertise in distributed data structures, and is a large sub-area of data science. It is a natural starting point for many of the areas that data science includes.
Data Science is a much larger area. It is an area where domain knowledge and business knowledge meet with statistical or computer science approaches. So in this broader sense data


SAVING THE MODEL...HOPEFULLY!

In [16]:
# Save the model locally in Kaggle
# model.config.save_pretrained("lora_model_1_epoch")
model.save_pretrained("lora_model_1_epoch")  # Save model to Kaggle's working directory
tokenizer.save_pretrained("lora_model_1_epoch")  # Save tokenizer

('lora_model_1_epoch/tokenizer_config.json',
 'lora_model_1_epoch/special_tokens_map.json',
 'lora_model_1_epoch/tokenizer.json')

USING ZIP AND LOCAL DOWNLOAD(SLOW)

In [17]:
# !zip -r lora_model.zip lora_model/

import shutil

model_directory = "lora_model_1_epoch"
shutil.make_archive(model_directory, 'zip', model_directory)

# from google.colab import files

# files.download("lora_model.zip")

'/kaggle/working/lora_model_1_epoch.zip'
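Before downloading a large archive it can be worth checking that it contains what you expect. The sketch below shows one way to do that with the standard library, using a throwaway directory and placeholder file names (demo_model, adapter_config.json and tokenizer.json are illustrative assumptions, not the actual contents of lora_model_1_epoch).

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a small stand-in for a saved model directory.
    model_dir = Path(tmp) / "demo_model"
    model_dir.mkdir()
    for name in ("adapter_config.json", "tokenizer.json"):
        (model_dir / name).write_text("{}")

    # Same call pattern as above: archive name, format, directory to pack.
    archive_path = shutil.make_archive(str(model_dir), "zip", str(model_dir))

    with zipfile.ZipFile(archive_path) as zf:
        archived_names = sorted(n for n in zf.namelist() if not n.endswith("/"))
        is_intact = zf.testzip() is None  # None means no corrupt members were found

print(archived_names, is_intact)
```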