<a href="https://colab.research.google.com/github/Shamanth-KM/phi-demand-intent-lora/blob/main/notebooks/02_finetune_lora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 02 - Fine-Tuning Preparation: Loading and Tokenizing the Sales Notes Dataset

In this notebook, we load the previously generated synthetic dataset and prepare it for fine-tuning.  
We tokenize the sales notes with the Phi-1.5 tokenizer and build a Hugging Face Dataset from them.


In [1]:
# Installing the necessary libraries
!pip install -q --upgrade transformers datasets peft accelerate


In [2]:
# Importing required libraries
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

## Upload Sales Notes CSV

We upload the previously saved 'sales_notes_2000.csv' file containing the synthetic dataset.


In [3]:
# Let's upload the csv file needed
from google.colab import files
uploaded_file = files.upload()

# Get the file name
file_name = list(uploaded_file.keys())[0]

# Read the file to a pandas DataFrame
sales_notes = pd.read_csv(file_name)

# Check the first few rows
sales_notes.head(10)

Saving sales_notes_2000.csv to sales_notes_2000.csv


|   | id | sales_note | label |
|---|----|------------|-------|
| 0 | 1 | Need emergency stock replenishment by friday. | Urgent Need |
| 1 | 2 | They are requesting a repeat of the previous o... | Repeat Order |
| 2 | 3 | Exploring samples for a new product line. | New Product Demand |
| 3 | 4 | Customized dimension required for part 780Y. | Custom Spec |
| 4 | 5 | inventory outage pls expedite shipping if poss... | Urgent Need |
| 5 | 6 | Requestfor custom color matching on containers. | Custom Spec |
| 6 | 7 | running low on supplies again, please arrange ... | Stocking Issue |
| 7 | 8 | Need urgent delivery due to market launch next... | Urgent Need |
| 8 | 9 | demo units requested to evaluate updated models | New Product Demand |
| 9 | 10 | Rush request for product A45 due to customer e... | Urgent Need |


## Convert to Hugging Face Dataset
We convert the DataFrame into a Hugging Face Dataset for easier tokenization and batching.

## Load Phi-1.5 Tokenizer
We load the Phi-1.5 tokenizer and assign it a padding token, since it does not define one by default.

## Tokenize the Dataset
We tokenize the sales notes to create model-readable input tensors.

In [4]:
# Convert it into Hugging Face Dataset format
sales_hf = Dataset.from_pandas(sales_notes)
print(sales_hf)

# Loading the Phi-1.5 tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

# Phi-1.5's tokenizer does not define a pad_token by default.
# Decoder-only models (like GPT, Phi) usually don't need padding when generating text,
# but training a classification model requires padding to batch inputs of different lengths,
# so we reuse the end-of-sequence token as the padding token.
tokenizer.pad_token = tokenizer.eos_token

# Checking on tokenizer
print("Tokenizer loaded. Trying a sample encoding:")
print(tokenizer("There's order request for a old stock - Neural Ninjas"))

# Let's define a tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["sales_note"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

# Applying tokenization
sales_tokenized = sales_hf.map(tokenize_function, batched=True)

# Checking the structure after tokenization
print(sales_tokenized)


Dataset({
    features: ['id', 'sales_note', 'label'],
    num_rows: 2000
})


Tokenizer loaded. Trying a sample encoding:
{'input_ids': [1858, 338, 1502, 2581, 329, 257, 1468, 4283, 532, 47986, 10516, 28121], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Dataset({
    features: ['id', 'sales_note', 'label', 'input_ids', 'attention_mask'],
    num_rows: 2000
})
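
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to a single example. It is an illustration, not the tokenizer's actual implementation. The helper `pad_or_truncate` is hypothetical, the pad id 50256 is an assumption (the usual id of `<|endoftext|>` in GPT-2-style vocabularies; in the notebook, read `tokenizer.pad_token_id` instead), and right-side padding is assumed.

```python
# Sketch of padding="max_length" combined with truncation=True for one example.
# PAD_ID = 50256 is an assumption; prefer tokenizer.pad_token_id in the notebook.
PAD_ID = 50256


def pad_or_truncate(input_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(input_ids[:max_length])   # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)        # padding: fill the remainder on the right
    attention_mask = [1] * len(kept) + [0] * n_pad
    return kept + [pad_id] * n_pad, attention_mask


# Token ids of the sample sentence printed by the tokenizer check above.
sample_ids = [1858, 338, 1502, 2581, 329, 257, 1468, 4283, 532, 47986, 10516, 28121]

padded_ids, mask = pad_or_truncate(sample_ids, max_length=16)
print(padded_ids)  # 12 real tokens followed by 4 pad ids
print(mask)        # 1 for real tokens, 0 for padding
```

Because the pad token is the same as the end-of-sequence token, the attention mask is the only thing that tells the model which positions are padding. With `max_length=128`, a 12-token note like the sample carries 116 padding positions.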


## Encode Labels

We map each demand category to a numeric label for classification.


## Label Mapping

Initially, the sales notes dataset contains human-readable labels (e.g., "Repeat Order", "Urgent Need").  
For training, we map these labels to numeric IDs using a `label_to_id` dictionary.  
Later, during evaluation and inference, we map predictions back to human-readable labels using the inverse `id_to_label` dictionary.

This way the model works with numeric labels internally while its predictions stay interpretable.

In [5]:
# Create label mapping
# Sort the unique labels so the mapping is the same on every run
sales_labels = sorted(set(sales_notes["label"]))
label_to_id = {label: idx for idx, label in enumerate(sales_labels)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

print("Label Mapping:", label_to_id)

# Apply label encoding
def encode_labels(example):
    return {"labels": label_to_id[example["label"]]}

sales_tokenized = sales_tokenized.map(encode_labels)

# Remove old label column
sales_tokenized = sales_tokenized.remove_columns(["label"])

print("Labels encoded successfully!")

Label Mapping: {'Custom Spec': 0, 'New Product Demand': 1, 'Repeat Order': 2, 'Stocking Issue': 3, 'Urgent Need': 4}


Labels encoded successfully!
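
The reverse mapping mentioned above is not exercised in this notebook, so here is a small sketch of how it could look at inference time. The function `decode_predictions` and the example scores are hypothetical; real scores would come from the fine-tuned model's classification head.

```python
# Mapping copied from the "Label Mapping" output above.
label_to_id = {
    "Custom Spec": 0,
    "New Product Demand": 1,
    "Repeat Order": 2,
    "Stocking Issue": 3,
    "Urgent Need": 4,
}
id_to_label = {idx: label for label, idx in label_to_id.items()}


def decode_predictions(logits_batch):
    """Map each row of class scores to the label with the highest score."""
    decoded = []
    for logits in logits_batch:
        # argmax over the class scores of one example
        predicted_id = max(range(len(logits)), key=lambda i: logits[i])
        decoded.append(id_to_label[predicted_id])
    return decoded


# Hypothetical scores for two notes (one row per note, one column per class).
example_logits = [
    [0.1, 0.2, 0.1, 0.3, 2.5],
    [0.0, 1.9, 0.4, 0.2, 0.1],
]
print(decode_predictions(example_logits))  # ['Urgent Need', 'New Product Demand']
```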


## Train-Validation Split

We split the dataset into 60% training and 40% validation sets.

In [6]:
# Split the dataset
sales_split = sales_tokenized.train_test_split(test_size=0.4, seed=42)

train_sales = sales_split["train"]
val_sales = sales_split["test"]

print(f"Dataset split done! Training size: {len(train_sales)}, Validation size: {len(val_sales)}")

Dataset split done! Training size: 1200, Validation size: 800
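
As called here, `train_test_split` shuffles the rows but does not stratify by label, so the class proportions of the two splits can differ somewhat. A quick way to check is to compare label shares per split. The sketch below uses a hypothetical helper, `label_share`, and made-up label ids standing in for `train_sales["labels"]` and `val_sales["labels"]`.

```python
from collections import Counter


def label_share(label_ids, id_to_label):
    """Return {label name: fraction of examples} for one split."""
    counts = Counter(label_ids)
    total = len(label_ids)
    return {id_to_label[i]: counts[i] / total for i in sorted(counts)}


id_to_label = {0: "Custom Spec", 1: "New Product Demand", 2: "Repeat Order",
               3: "Stocking Issue", 4: "Urgent Need"}

# Hypothetical label ids; in the notebook, pass the "labels" column of each split.
train_ids = [0, 1, 2, 3, 4, 4, 2, 1, 0, 3, 4, 4]
val_ids = [4, 0, 1, 2, 3, 4, 2, 0]

for name, ids in [("train", train_ids), ("validation", val_ids)]:
    print(name, label_share(ids, id_to_label))
```

If the shares diverge noticeably, the `stratify_by_column` argument of `train_test_split` is one option, though it expects a `ClassLabel` column rather than plain integers.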


## Summary
- Loaded the sales notes dataset.
- Tokenized the dataset using the Phi-1.5 tokenizer.
- Encoded text labels into numeric form.
- Split the dataset into training and validation sets.

Next, we proceed to model loading, LoRA configuration, and fine-tuning.