Choose the sft_env environment (Python 3.10.18) as the notebook kernel.

In [1]:
! pip3 install transformers 



# This explicitly uses the pip associated with Python 3.

In [None]:
import torch # PyTorch library
from transformers import pipeline # Hugging Face Transformers library


In [4]:
device = "mps" if torch.backends.mps.is_available() else ("cuda:0" if torch.cuda.is_available() else "cpu") # Check for MPS, then CUDA, else CPU
dtype = torch.float16 if device == "mps" else torch.float32 # Use float16 for MPS, else float32
print(f"Using device: {device}, dtype: {dtype}") # Print the selected device and data type

Using device: cuda:0, dtype: torch.float32


MPS stands for Metal Performance Shaders, which is a framework developed by Apple to accelerate GPU computations on macOS and iOS devices. In the context of PyTorch, MPS allows you to run tensor operations and deep learning models on Apple GPUs (like those in M1, M2, or M3 chips), offering a hardware-accelerated alternative to CPU execution.

Key Points about MPS in PyTorch:
* Platform-specific: Only available on macOS with Apple Silicon or supported Intel Macs.
* Alternative to CUDA: CUDA is for NVIDIA GPUs, while MPS is for Apple GPUs.
* Improves performance: Using MPS can significantly speed up training and inference compared to CPU.
* Supported in PyTorch: Starting from PyTorch 1.12, MPS support was introduced experimentally.

XetHub is a Git-based data versioning system that Hugging Face has integrated as a storage backend for large models and datasets. With the hf_xet package installed, files are downloaded through this backend, which Hugging Face recommends for better performance than the regular HTTP fallback:

pip install hf_xet

In [5]:
ask_llm = pipeline( # Initialize the text generation pipeline
  task="text-generation", # Specify the task as text generation
  model="Qwen/Qwen2.5-3B-Instruct", # You can replace this with another model if desired
  device=device, # Use the selected device
  torch_dtype=dtype # Use the selected data type
)

print(ask_llm("Who is Scott Lai?")[0]["generated_text"]) # Generate text based on the prompt and print the result 

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model-00002-of-00002.safetensors:  29%|##8       | 629M/2.20G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:  18%|#8        | 734M/3.97G [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cuda:0


Who is Scott Lai? Scott Lai, a native of Taiwan, is an accomplished entrepreneur and investor who has been making waves in the tech industry. Here are some key points about him:

1. Founder: He is the founder of Poshmark, a leading online marketplace for second-hand clothing.

2. Early Career: Before founding Poshmark, Lai worked as a software engineer at Google and eBay.

3. Poshmark: Founded in 2010, Poshmark revolutionized the second-hand clothing market by creating a user-friendly platform where buyers and sellers could easily exchange clothes. The company grew rapidly and went public in 2019.

4. Other Ventures: In addition to Poshmark, Lai has invested in several other startups, including Rent the Runway and Trunk Club.

5. Philanthropy: He is involved in various charitable causes, particularly focusing on education and entrepreneurship.

6. Leadership: Lai has been recognized for his leadership skills and has spoken at numerous conferences and events.

7. Entrepreneurial Spirit:

As the response above shows, the model has no idea who I am.

Let's cook it!

First, let's teach the model who I am. You can use your own personal data to generate records in the exact format used for fine-tuning. ChatGPT can help here: ask it to convert your resume into a trainable JSON format with "prompt" and "completion" fields.
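As a rough illustration of that format, the sketch below writes a tiny training file by hand using only the standard library. The facts, the person, and the file name are hypothetical placeholders, and the layout (a top-level list of objects with "prompt" and "completion" keys) is an assumption about one shape the JSON loader accepts, not necessarily the exact layout of the file used in this notebook.

```python
import json
import os
import tempfile

# Hypothetical facts; replace them with statements taken from your own resume.
facts = [
    ("What is Jane Doe's profession?", "Data Engineer."),
    ("Which programming language does Jane Doe use most?", "Python."),
]

# One record per question/answer pair, using the two keys the notebook expects.
records = [{"prompt": question, "completion": answer} for question, answer in facts]

# Written to a temporary directory here; the file name is a placeholder.
path = os.path.join(tempfile.mkdtemp(), "resume_train.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back to confirm the structure.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded[0])
```

Short, single-fact pairs like these keep each training example well under the 128-token limit used later in the notebook.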

In [None]:
# load data 
from datasets import load_dataset # Hugging Face Datasets library

raw_data = load_dataset('json', data_files = "scott_lai_resume_train.json") # Load the dataset from a JSON file
raw_data # Print the loaded dataset

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 122
    })
})

When you load a local JSON file without specifying any splits, the Hugging Face load_dataset function automatically puts every record into a default split named "train".
Datasets are organized into named splits such as "train", "test", and "validation".
Because the JSON file defines none of these, the whole file is treated as training data.
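If you later want a held-out set for evaluation, the records can be divided before training. The following is a minimal standard-library sketch of that idea, assuming the data is a list of prompt/completion dictionaries; the helper name holdout_split and the 10% fraction are illustrative choices. The datasets library also provides Dataset.train_test_split for the same purpose.

```python
import random

def holdout_split(records, test_fraction=0.1, seed=42):
    """Shuffle a copy of the records and carve off a test portion."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

# Placeholder records with the same size as the dataset in this notebook.
records = [{"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(122)]
splits = holdout_split(records)
print({name: len(rows) for name, rows in splits.items()})
```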

In [None]:
raw_data["train"][0] # Access the first entry in the training split of the dataset

{'prompt': 'What is Scott Lai’s profession?',
 'completion': 'AI Engineer and Data Scientist.'}

As you can see, each record comes back as plain text. For fine-tuning, the model needs small, precise units rather than raw strings, so we apply tokenization to split the text into smaller chunks. Each chunk is called a token, and it is the smallest unit of text that LLMs work with.

In [10]:
from transformers import AutoTokenizer # Import the AutoTokenizer class from the Transformers library

# Load the tokenizer for the specified model
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct"
)

def preprocess(sample): # Preprocess one prompt/completion pair from the dataset
    text = sample['prompt'] + '\n' + sample['completion'] # Concatenate the prompt and completion with a newline
    print(text) # Show the text that is about to be tokenized
    tokenized = tokenizer( # Convert the text into token IDs
        text,
        max_length=128, # Set the maximum length of the tokenized sequence
        truncation=True, # Truncate sequences longer than the maximum length
        padding="max_length" # Pad shorter sequences up to the maximum length
    )

    tokenized['labels'] = tokenized['input_ids'].copy() # Create a copy of the input IDs for the labels
    return tokenized # Return the tokenized sample

data = raw_data.map(preprocess) # Apply the preprocessing function to every sample in the dataset


Map:   0%|          | 0/122 [00:00<?, ? examples/s]

What is Scott Lai’s profession?
AI Engineer and Data Scientist.
How many years of experience does Scott Lai have in generative AI and LLM solutions?
Over 5 years.
What infrastructures is Scott Lai skilled in designing and optimizing?
Scalable ML infrastructures using PyTorch, Hugging Face, and FastAPI on AWS.
What type of workflows is Scott Lai experienced in building?
End-to-end pipelines, scalable microservices, and ETL workflows.
What collaboration experience does Scott Lai have?
Proven track record in cross-functional collaboration and implementing ML and data engineering best practices.
Which skill in Programming & Scripting does Scott Lai have?
Python
Which skill in Programming & Scripting does Scott Lai have?
Rust
Which skill in Programming & Scripting does Scott Lai have?
Node.js
Which skill in Programming & Scripting does Scott Lai have?
HTML
Which skill in Programming & Scripting does Scott Lai have?
CSS
Which skill in Programming & Scripting does Scott Lai have?
JavaScript
W

Qwen2.5-3B-Instruct is a compact yet powerful instruction-tuned language model developed by Alibaba Cloud as part of the Qwen2.5 series. Here's a detailed overview of its features and capabilities:
Model Type: Causal Language Model (decoder-only)
Architecture:
* Transformers with RoPE (Rotary Position Embedding)
* SwiGLU activation
* RMSNorm normalization
* Attention with QKV bias
* Tied word embeddings
Parameters:
* Total: 3.09 billion
* Non-embedding: 2.77 billion
Layers: 36
Attention Heads: 16 for queries, 2 for keys/values (Grouped-Query Attention)
Context Length:
* Input: up to 32,768 tokens
* Output generation: up to 8,192 tokens

The tokenization is done pair by pair. By default, dataset.map() calls the preprocess function once per sample (one prompt-completion pair), so each pair is tokenized separately. Passing batched=True to map() would instead hand the function a batch of samples at a time, and the function would then need to handle lists of prompts and completions.

tokenized['labels'] = tokenized['input_ids'].copy()
This is the standard setup for training causal language models such as Qwen, where the model learns to predict the next token. The labels are therefore a copy of input_ids.
The model shifts the labels by one position internally when it computes the loss, so the prediction at position t is scored against the token at position t+1.
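To make the shift concrete, here is a small pure-Python sketch using a few token IDs from this notebook's output. It also shows an optional variant that the notebook does not apply: replacing the labels at padded positions with -100, the value that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the loss. The helper names are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, but ignore positions that are padding."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

def next_token_pairs(input_ids, labels):
    """List (current token, target token) pairs after the one-position shift."""
    return [(input_ids[t], labels[t + 1])
            for t in range(len(input_ids) - 1)
            if labels[t + 1] != IGNORE_INDEX]

# Four real tokens followed by two padding tokens (151643 is <|endoftext|>).
input_ids = [3838, 374, 9815, 13, 151643, 151643]
attention_mask = [1, 1, 1, 1, 0, 0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)                               # [3838, 374, 9815, 13, -100, -100]
print(next_token_pairs(input_ids, labels))  # [(3838, 374), (374, 9815), (9815, 13)]
```

With the plain copy used in this notebook, the padded positions are also scored, which means part of the reported training loss reflects predicting <|endoftext|> after <|endoftext|>.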

In [11]:
print(data['train'])

Dataset({
    features: ['prompt', 'completion', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 122
})


When you pass text to a Hugging Face tokenizer, it returns a dictionary containing:

* input_ids: The tokenized text converted into numerical IDs (each word/subword → ID).
* attention_mask: A binary mask indicating which tokens are real (1) vs. padded (0), so the model knows to ignore padding.
* (Optional) token_type_ids: Not shown here — used for sentence-pair tasks like NLI.
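The relationship between input_ids and attention_mask can be reproduced by hand. The sketch below is a simplified stand-in for what padding="max_length" does, not the tokenizer's actual implementation; it assumes right-side padding and uses the 14 real token IDs and the pad ID 151643 that appear in this notebook's output. The function name pad_to_length is hypothetical.

```python
def pad_to_length(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids and build the matching attention mask."""
    token_ids = token_ids[:max_length]          # truncation
    n_pad = max_length - len(token_ids)         # how many pad tokens are needed
    return {
        "input_ids": token_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
    }

# The 14 real tokens of the first sample, taken from the notebook output.
real_tokens = [3838, 374, 9815, 444, 2143, 748, 4808, 5267,
               15469, 28383, 323, 2885, 67309, 13]

encoded = pad_to_length(real_tokens, max_length=128, pad_id=151643)
print(len(encoded["input_ids"]))        # 128
print(sum(encoded["attention_mask"]))   # 14
```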

Print one sample

In [12]:
# Access the first sample in the dataset
sample = data['train'][0]
print("Input IDs:", sample['input_ids'])
print("Attention Mask:", sample['attention_mask'])
print("Labels:", sample['labels'])

Input IDs: [3838, 374, 9815, 444, 2143, 748, 4808, 5267, 15469, 28383, 323, 2885, 67309, 13, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151

Print shape or length

In [13]:
print("Length of input_ids:", len(sample['input_ids']))  # Should be 128 (due to max_length)
print("Number of non-padded tokens:", sum(sample['attention_mask']))

Length of input_ids: 128
Number of non-padded tokens: 14


Decode input_ids back to text

In [14]:
# See what the tokens actually represent
decoded = tokenizer.decode(sample['input_ids'], skip_special_tokens=False)
print("Decoded text:", decoded)

Decoded text: What is Scott Lai’s profession?
AI Engineer and Data Scientist.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>

The integer token IDs are stored as plain integers. They are not compressed with variable-length codes such as Huffman coding, at least not during normal model training or inference in Hugging Face Transformers.

## LoRA

Now, let's move on to training.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType # PEFT library for parameter-efficient fine-tuning
from transformers import AutoModelForCausalLM # Import the AutoModelForCausalLM class from the Transformers library
import torch # PyTorch library

In [17]:
model = AutoModelForCausalLM.from_pretrained( # Load the pre-trained model
    "Qwen/Qwen2.5-3B-Instruct",
    device_map = device, # Use the selected device
    torch_dtype = torch.float16 # Use float16 data type
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, # Specify the task type as causal language modeling
    target_modules=["q_proj", "k_proj", "v_proj"] # Apply LoRA to the attention query, key, and value projections
)
model = get_peft_model(model, lora_config) # Apply the LoRA configuration to the model
model.print_trainable_parameters() # Print the number of trainable parameters in the model

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

trainable params: 2,506,752 || all params: 3,088,445,440 || trainable%: 0.0812
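The trainable parameter count can be checked with a little arithmetic. A LoRA adapter on a linear layer adds two small matrices, one of shape r x in_features and one of shape out_features x r. The sketch below assumes a hidden size of 2048 and a head dimension of 128 for Qwen2.5-3B (neither value is stated in this notebook) and PEFT's default rank r=8, since LoraConfig is created without an explicit rank. The 36 layers and 2 key/value heads come from the model overview above.

```python
hidden_size = 2048    # assumed model width for Qwen2.5-3B
head_dim = 128        # assumed: hidden_size / 16 query heads
num_layers = 36       # from the model overview
num_kv_heads = 2      # from the model overview (grouped-query attention)
rank = 8              # PEFT default for LoraConfig.r

kv_dim = num_kv_heads * head_dim  # output width of k_proj and v_proj

# (in_features, out_features) of each targeted projection.
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
}

def lora_params(in_features, out_features, r):
    """Parameters added by one adapter: A is r x in, B is out x r."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, rank) for i, o in target_shapes.values())
total = per_layer * num_layers
all_params = 3_088_445_440  # "all params" printed by print_trainable_parameters()

print(total)                                # 2506752
print(f"{100 * total / all_params:.4f}")    # 0.0812
```

Under these assumptions the result matches the printed figure of 2,506,752 trainable parameters.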


In [None]:
from transformers import TrainingArguments, Trainer # Import the TrainingArguments and Trainer classes from the Transformers library


train_args = TrainingArguments( # Define training arguments
    num_train_epochs = 10, # Go through the dataset from start to finish 10 times
    learning_rate=0.001, # Learning rate for the optimizer
    logging_steps = 25, # Log the training loss every 25 steps
    fp16 = True # Use 16-bit floating point (mixed precision) to speed up training; set to True only on a GPU
)

trainer = Trainer(
    args = train_args,
    model = model, 
    train_dataset=data["train"]
)


In [19]:
trainer.train()

Step,Training Loss
25,4.0148
50,0.4126
75,0.2504
100,0.1978
125,0.1625
150,0.1315


TrainOutput(global_step=160, training_loss=0.8151650987565517, metrics={'train_runtime': 1887.9895, 'train_samples_per_second': 0.646, 'train_steps_per_second': 0.085, 'total_flos': 2602200748523520.0, 'train_loss': 0.8151650987565517, 'epoch': 10.0})

Training took 31 m 28.5 s.
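The numbers in TrainOutput can be reconstructed from the dataset size. The sketch below assumes the default per_device_train_batch_size of 8, a single GPU, and no gradient accumulation, none of which are set explicitly in this notebook.

```python
import math

num_rows = 122             # rows in the training split
epochs = 10                # num_train_epochs
batch_size = 8             # assumed: TrainingArguments default per-device batch size
train_runtime = 1887.9895  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_rows / batch_size)  # the last partial batch still counts
total_steps = steps_per_epoch * epochs
minutes, seconds = divmod(train_runtime, 60)
samples_per_second = num_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(f"{int(minutes)} m {seconds:.1f} s")
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the total comes to 160 optimizer steps, which agrees with global_step=160, and the throughput figures agree with train_samples_per_second and train_steps_per_second in the output.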

In [20]:
# save the model
trainer.save_model("./my-qwen")
tokenizer.save_pretrained("./my-qwen")

('./my-qwen\\tokenizer_config.json',
 './my-qwen\\special_tokens_map.json',
 './my-qwen\\chat_template.jinja',
 './my-qwen\\vocab.json',
 './my-qwen\\merges.txt',
 './my-qwen\\added_tokens.json',
 './my-qwen\\tokenizer.json')

Now let's test it out

In [26]:
ask_llm = pipeline(
  task="text-generation",
  model="./my-qwen",
  tokenizer='./my-qwen',
  device=device,
  torch_dtype=dtype, # Use the selected data type
)

print(ask_llm("Who is Scott Lai?")[0]["generated_text"])

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB. GPU 0 has a total capacity of 11.99 GiB of which 0 bytes is free. Of the allocated memory 25.76 GiB is allocated by PyTorch, and 169.13 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
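A back-of-the-envelope estimate suggests why this load fails. The sketch below only counts the memory needed for the weights, using the parameter count printed earlier and the 11.99 GiB capacity from the error message. It is a rough estimate that ignores activations, the CUDA context, and the fact that the model and trainer from the training cells are most likely still resident on the GPU.

```python
GIB = 1024 ** 3

num_params = 3_088_445_440   # "all params" printed by print_trainable_parameters()
gpu_capacity_gib = 11.99     # total capacity reported in the error message

bytes_per_param = {"float32": 4, "float16": 2}

# Memory needed just to hold the weights in each data type.
estimates = {name: num_params * size / GIB for name, size in bytes_per_param.items()}

for name, gib in estimates.items():
    share = 100 * gib / gpu_capacity_gib
    print(f"{name}: {gib:.2f} GiB of weights ({share:.0f}% of the GPU)")
```

Because dtype resolves to torch.float32 on CUDA in this notebook, the new pipeline asks for nearly the whole card for weights alone, on top of whatever the training run still holds. Plausible remedies, not verified here, are to restart the kernel (or delete the model and trainer objects and call torch.cuda.empty_cache()) before testing, and to load the model with torch_dtype=torch.float16.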