<a href="https://colab.research.google.com/github/Monilnarang/Therapy_Esther/blob/main/final_conversation_finetuning_esther_perel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Conversation models finetuning

In this notebook we fine-tune an instruction / chat model to behave like Esther Perel.

Adapted from the Unsloth notebooks. If something is broken, check:
https://unsloth.ai/

### Installation

In [None]:
%%capture
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

### Load base model

In [None]:
from unsloth import FastLanguageModel
import torch
from google.colab import userdata

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    token=userdata.get('HF_ACCESS_TOKEN')
)

==((====))==  Unsloth 2025.5.9: Fast Llama patching. Transformers: 4.52.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

formatted_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted_text)


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
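
To make the structure of the printed prompt easier to see, here is a minimal sketch that assembles the same header and end-of-turn tokens by hand. `format_llama3_chat` is a hypothetical helper, not part of the tokenizer API, and it deliberately omits the "Cutting Knowledge Date" preamble that the real template injects into the system turn.

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Simplified Llama 3 style prompt builder (illustrative only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to answer.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = format_llama3_chat(messages)
print(prompt)
```

In practice `tokenizer.apply_chat_template` should remain the source of truth; the sketch only shows why the prompt ends with an open assistant header when `add_generation_prompt=True`.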




### Add LoRA adapters to the base model and patch with Unsloth

In [None]:
# More info about parameters: https://huggingface.co/docs/peft/v0.11.0/en/package_reference/lora#peft.LoraConfig

target_modules =  ["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]

# Set to True only if new special tokens were added to the tokenizer
train_embeddings = False

if train_embeddings:
    # Training both "lm_head" and "embed_tokens" runs out of memory on Colab:
    # target_modules = target_modules + ["lm_head", "embed_tokens"]
    # so on Colab, if you added new tokens, train only "lm_head" instead
    target_modules = target_modules + ["lm_head"]


model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Rank of the LoRA matrices; according to the LoRA paper, little is lost at relatively low ranks
    target_modules = target_modules, # Modules of the LLM that receive LoRA adapters
    lora_alpha = 16,  # Scales the adapter update (higher means more influence on the base model); 16 is a commonly recommended value
    lora_dropout = 0, # Tutorials often use 0.05, but Unsloth recommends 0 (optimized)
    bias = "none",   # "none" is the optimized setting
    use_gradient_checkpointing = "unsloth", # "unsloth" reduces VRAM usage, useful for very long contexts
    random_state = 3407,
    use_rslora = False, # If True, rank-stabilized LoRA scales by lora_alpha / sqrt(r) instead of lora_alpha / r
    loftq_config = None, # LoftQ initialization is not used
)

Unsloth 2025.5.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
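
As a sanity check on the configuration above, the number of trainable LoRA parameters can be worked out by hand: each adapted linear layer with shape `(d_in, d_out)` gains two matrices of total size `r * (d_in + d_out)`. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, 8 key/value heads of size 128, MLP size 14336, 32 layers). The helper names are hypothetical. The result matches the trainable parameter count that the training log reports later in this notebook.

```python
# Llama 3.1 8B dimensions (assumed from the published model config).
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
INTERMEDIATE = 14336
NUM_LAYERS = 32

# (in_features, out_features) of every projection listed in target_modules.
PROJECTION_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=PROJECTION_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA weights: A is (rank x d_in) and B is (d_out x rank) per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


def lora_scaling(lora_alpha, rank, use_rslora=False):
    """Factor applied to the adapter update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    return lora_alpha / (rank ** 0.5 if use_rslora else rank)


total = lora_param_count(rank=16)
print(f"{total:,}")                               # 41,943,040
print(lora_scaling(16, 16))                       # 1.0
print(lora_scaling(16, 16, use_rslora=True))      # 4.0
```

With `r = 16` and `lora_alpha = 16` the standard scaling factor is exactly 1, so the adapter update is added to the base weights unscaled.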


In [None]:
from huggingface_hub import hf_hub_download
import json
from datasets import Dataset

# Download the file directly
file_path = hf_hub_download(
    repo_id="moniln/Esther_test_1",
    filename="train.jsonl",  # File name inside the dataset repository
    repo_type="dataset"
)

# file_path = hf_hub_download(
#     repo_id="pookie3000/pg_chat",
#     filename="pg_chat_combined.jsonl",  # Check what the file is actually named
#     repo_type="dataset"
# )


# Load JSONL manually (one JSON object per line)
data = []
with open(file_path, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # Skip empty lines
            data.append(json.loads(line))

# Convert to dataset
dataset = Dataset.from_list(data)

print(f"Dataset loaded with {len(dataset)} examples")
print(dataset)

# Check the structure
print("\nFirst example:")
print(dataset[0])



# # Load manually
# with open(file_path, 'r') as f:
#     data = json.load(f)

# # Convert to dataset
# if isinstance(data, list):
#     dataset = Dataset.from_list(data)
# else:
#     dataset = Dataset.from_list([data])

# print(f"Dataset loaded with {len(dataset)} examples")
# print (dataset)


Dataset loaded with 441 examples
Dataset({
    features: ['conversations'],
    num_rows: 441
})

First example:
{'conversations': [{'from': 'human', 'value': 'What is your name?'}, {'from': 'gpt', 'value': "Hello! My name is Esther Perel. It's wonderful to meet you."}]}
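
The records use ShareGPT keys (`from` / `value`, with speakers `human` and `gpt`). The formatted samples printed further down carry `human` and `gpt` in their role headers, while the generation prompt used at inference opens an `assistant` header. If matching headers are wanted, one option is to convert each record to the `role` / `content` layout before templating. This is a sketch under that assumption, not the author's method; `sharegpt_to_messages` and `ROLE_MAP` are hypothetical names.

```python
# Mapping from ShareGPT speaker names to standard chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(conversation):
    """Convert a list of {'from', 'value'} turns to {'role', 'content'} turns."""
    messages = []
    for turn in conversation:
        source = turn["from"]
        if source not in ROLE_MAP:
            # Fail loudly on unknown speakers instead of silently dropping turns.
            raise ValueError(f"Unexpected speaker: {source!r}")
        messages.append({"role": ROLE_MAP[source], "content": turn["value"]})
    return messages


example = {
    "conversations": [
        {"from": "human", "value": "What is your name?"},
        {"from": "gpt", "value": "Hello! My name is Esther Perel. It's wonderful to meet you."},
    ]
}
converted = sharegpt_to_messages(example["conversations"])
print(converted)
```

The converted list can be passed to `tokenizer.apply_chat_template` without a custom key mapping.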


### Create dataset

In [None]:
from unsloth.chat_templates import get_chat_template

# Configure the tokenizer's chat template for ShareGPT-style records
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)

def formatting_prompts_func(examples):
    """Render each conversation in the batch into a single training string."""
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)


In [None]:
for i, sample in enumerate(dataset):
    print(f"\n------ Sample {i + 1} ----")
    print(sample["text"])
    if i > 2:  # Stop after the first four samples
        break



------ Sample 1 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>human<|end_header_id|>

What is your name?<|eot_id|><|start_header_id|>gpt<|end_header_id|>

Hello! My name is Esther Perel. It's wonderful to meet you.<|eot_id|>

------ Sample 2 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>human<|end_header_id|>

What's your name?<|eot_id|><|start_header_id|>gpt<|end_header_id|>

I'm Esther Perel. Thank you for asking - I'm pleased to introduce myself.<|eot_id|>

------ Sample 3 ----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>human<|end_header_id|>

What do you do?<|eot_id|><|start_header_id|>gpt<|end_header_id|>

I'm a psychotherapist and relationship expe

### Train the model


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 4,
        #max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
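
The combination of `warmup_steps = 5`, `learning_rate = 2e-4` and `lr_scheduler_type = "linear"` describes a schedule that ramps up linearly and then decays linearly to zero. The sketch below approximates that shape in plain Python. `linear_warmup_lr` is a hypothetical helper, and the default of 224 total steps is taken from the training log shown further down; the trainer's own scheduler remains authoritative.

```python
def linear_warmup_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=224):
    """Approximate learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Linear decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1, 5, 112, 224):
    print(step, linear_warmup_lr(step))
```

The peak rate is reached at the end of warmup, after which every step lowers it by the same amount.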


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 441 | Num Epochs = 4 | Total steps = 224
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!
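
The figures in the banner above can be reproduced with a little arithmetic: 441 examples at a per-device batch size of 2 give 221 batches per epoch, 4 accumulation steps turn those into 56 optimizer updates, and 4 epochs give 224 total steps. The sketch below shows the calculation. `total_optimizer_steps` is a hypothetical helper, and the rounding is an assumption chosen to match this log; it may differ between trainer versions.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size,
                          grad_accum_steps, num_epochs, num_devices=1):
    """Estimate the number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # A trailing partial accumulation window still triggers one update.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return updates_per_epoch * num_epochs


effective_batch_size = 2 * 4 * 1   # per-device batch x accumulation x GPUs
steps = total_optimizer_steps(441, 2, 4, 4)
print(effective_batch_size, steps)  # 8 224
```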


| Step | Training Loss |
|------|---------------|
| 1    | 3.4306        |
| 2    | 3.2228        |
| 3    | 2.9165        |
| 4    | 2.9418        |
| 5    | 2.546         |
| 6    | 2.6493        |
| 7    | 2.4517        |
| 8    | 2.1708        |
| 9    | 2.3801        |
| 10   | 2.2616        |


### Inference


In [None]:
FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "Who are you and what do you do?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>human<|end_header_id|>

Who are you and what do you do?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'm Esther Perel, a psychotherapist and author, and I've dedicated my work to understanding love, desire, and relationships.<|eot_id|>
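
The streamed output contains the full prompt and special tokens along with the answer. If only the reply is needed, for example to store it or show it in a UI, the decoded text can be trimmed after the fact. The sketch below is one way to do that on a decoded string; `extract_assistant_reply` is a hypothetical helper and assumes the Llama 3 header and end-of-turn markers seen in the output above.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded):
    """Return the text of the last assistant turn without special tokens."""
    _, found, tail = decoded.rpartition(ASSISTANT_HEADER)
    if not found:
        raise ValueError("No assistant header found in decoded text")
    # Keep everything up to the end-of-turn marker, if the model emitted one.
    return tail.split(END_OF_TURN, 1)[0].strip()


decoded = (
    "<|start_header_id|>human<|end_header_id|>\n\n"
    "Who are you and what do you do?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "I'm Esther Perel, a psychotherapist and author, and I've dedicated my work "
    "to understanding love, desire, and relationships.<|eot_id|>"
)
reply = extract_assistant_reply(decoded)
print(reply)
```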


### Save LoRA adapter

This is useful both for inference and for loading the model again later.

In [None]:
model.push_to_hub(
    "moniln/Meta-Llama-3.1-8B-Instruct-Esther-Perel-LORA-wis",
    tokenizer,
    token = userdata.get('HF_ACCESS_TOKEN')
)

### Merge model with LoRA weights and save to GGUF

You can then run inference locally with Ollama or llama.cpp.

##### Popular quantization methods

- **q4_k_m**  
  4-bit quantization. Low memory. Many models in the Ollama library use this quantization by default.
- **q8_0**  
  8-bit quantization. Medium memory.
- **f16**  
  16-bit floating point. Many models are already stored in 16 bit, so no quantization takes place.
- **not_quantized**  
  Often the same as f16.

In [None]:
model.push_to_hub_gguf(
    "Meta-Llama-3.1-8B-q4_k_m-esther-perel-wis-GGUF",
    tokenizer,
    quantization_method = "q4_k_m",
    token = userdata.get('HF_ACCESS_TOKEN')
)
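
To get a feel for the memory trade-off between the quantization methods listed above, the file size can be estimated from the average number of bits stored per weight. The sketch below is a rough back-of-the-envelope calculation: the parameter count of about 8.03 billion and the bits-per-weight figure for `q4_k_m` are approximations and should be read as assumptions, and real GGUF files also contain metadata. `estimated_size_gb` is a hypothetical helper.

```python
# Approximate average bits stored per weight for each method.
BITS_PER_WEIGHT = {
    "q4_k_m": 4.85,  # assumption: rough average, mixes 4-bit and 6-bit blocks
    "q8_0": 8.5,     # 32 int8 weights plus one 16-bit scale per block
    "f16": 16.0,     # plain 16-bit floats, no quantization
}

NUM_PARAMS = 8.03e9  # assumption: approximate size of Llama 3.1 8B


def estimated_size_gb(num_params, method):
    """Estimate the weight storage in gigabytes (10^9 bytes)."""
    return num_params * BITS_PER_WEIGHT[method] / 8 / 1e9


for method in BITS_PER_WEIGHT:
    print(f"{method}: ~{estimated_size_gb(NUM_PARAMS, method):.1f} GB")
```

The estimate is only meant to show the relative ordering: `q4_k_m` needs roughly a third of the space of `f16`, with `q8_0` in between.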

### Load model and saved LoRA adapters

Use this if you want to continue fine-tuning or run inference with the model in safetensors format.

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "moniln/Meta-Llama-3.1-8B-Instruct-Esther-Perel-LORA-wis",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
    load_in_8bit = True,  # Less memory intensive
    device_map = "auto",
    token = userdata.get('HF_ACCESS_TOKEN'),
    # Offload buffers, as suggested in the loading warning
    offload_buffers = True
)



# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = "moniln/Meta-Llama-3.1-8B-Instruct-Esther-Perel-LORA",
#     max_seq_length = 2048,
#     dtype = None,
#     load_in_4bit = True,
#     token=userdata.get('HF_ACCESS_TOKEN')
# )

FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "What is your job?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)