# Fine-Tuning a Conversational LLM for Psychological Dialogue Support

This project fine-tunes a Large Language Model (LLM) to simulate natural, empathetic, and context-aware conversations between a psychologist and a patient. The model learns from a curated dataset of real or synthetic therapy-style question‚Äìanswer exchanges, allowing it to generate emotionally intelligent, coherent, and supportive responses to mental-health-related queries.

Unlike general chatbots trained on open-domain text, this model specializes in therapeutic conversation patterns ‚Äî focusing on reflective listening, validating emotions, and suggesting healthy thought reframing.

The final output is an AI-driven conversational agent that can engage in mental-wellness dialogue, provide psychoeducation, and guide users toward constructive self-reflection ‚Äî without offering clinical diagnosis or treatment.

# Import Libraries

In [1]:
import torch
import os

# Set number of threads (adjust based on your CPU cores)
num_cores = os.cpu_count()  # Get total cores
torch.set_num_threads(num_cores)  # Use all cores
torch.set_num_interop_threads(num_cores)

# For Intel CPUs, these can help too:
os.environ["OMP_NUM_THREADS"] = str(num_cores)
os.environ["MKL_NUM_THREADS"] = str(num_cores)

print(f"Using {num_cores} CPU threads")

Using 8 CPU threads


In [2]:
import pandas as pd
import os
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)

In [4]:
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

from transformers import pipeline
from peft import LoraConfig, get_peft_model
import datasets

from transformers.utils import logging as hf_logging


# Import Dataset and Preprocess it

In [3]:
# Load and preprocess the data
df_fine_tuning = pd.read_csv('train.csv')
df_fine_tuning = df_fine_tuning.dropna()

df_fine_tuning = df_fine_tuning.dropna().apply(lambda x: x.str.strip())
df_fine_tuning = df_fine_tuning.sample(n=10, random_state=42)  # ‚úÖ Decreased from 200 to 10 samples
# Shuffle and split the dataset
train_df = df_fine_tuning.sample(frac=0.8, random_state=42)
eval_df = df_fine_tuning.drop(train_df.index)

#length of each splits dataset
print(f"Length of training dataset: {len(train_df)}")
print(f"Length of evaluation dataset: {len(eval_df)}")

Length of training dataset: 8
Length of evaluation dataset: 2


In [4]:
train_df.head(1)

Unnamed: 0,Context,Response
506,"I‚Äôve been on 0.5 mg of Xanax twice a day for the past month. It hasn't been helping me at all, but when I take 1 mg during a big anxiety attack, it calms me down. I was wondering how I can ask my psychologist to up the dose to 1 mg twice a day without her thinking I'm abusing them. I just have very big anxiety attacks. Should I stay on the 0.5mg and deal with the attacks or should I ask to up the dose? I'm afraid she will take me off them and put me on something else.","Do you think you're abusing xanax?It is a highly addictive drug so maybe one reason you feel compelled to take more is bc you already are addicted.Drugs don't do anything helpful in solving life's problems. ¬† Once the effect wears off, the stressful situation is once again waiting for you to address it.Think over your reason for not directly asking your psychologist about upping your dose.Also, do you ever talk about your life problems with this psychologist or only your need for drugs? ¬† ¬†The more gradual path to a better life is to not need drugs in the first place. This consists of your willingness to face the matters that are creating such terrible feelings inside you."


## Download and test model without any finetuning

In [None]:
# ===== MUST BE AT THE VERY TOP - BEFORE ANY IMPORTS =====
import os

num_cores = os.cpu_count()  # Get all available CPU cores
print(f"üîß Detected {num_cores} CPU cores")

# Set environment variables BEFORE importing torch
os.environ["OMP_NUM_THREADS"] = str(num_cores)
os.environ["MKL_NUM_THREADS"] = str(num_cores)
os.environ["OPENBLAS_NUM_THREADS"] = str(num_cores)
os.environ["VECLIB_MAXIMUM_THREADS"] = str(num_cores)
os.environ["NUMEXPR_NUM_THREADS"] = str(num_cores)
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

# NOW import torch and other libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Set PyTorch threads (must be before any model operations)
torch.set_num_threads(num_cores)
# Remove or comment out this line if it still causes issues:
# torch.set_num_interop_threads(num_cores)

print(f"‚úÖ Configured to use all {num_cores} cores with Intel optimizations")
# ========================================================

model_name = "microsoft/Phi-3-mini-4k-instruct"

# Define quantization configuration for 8-bit loading (better for CPU)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,  # Use 8-bit quantization (better CPU support)
    llm_int8_threshold=6.0  # Helps with outlier features
)

print("üì¶ Loading model (this may take a minute)...")
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",  # Will use CPU automatically
    low_cpu_mem_usage=True,  # Reduces CPU memory spikes during loading
    torch_dtype=torch.float32  # Use float32 for CPU (better compatibility)
)

print("‚úÖ Model loaded successfully!")

# Ask a simple question
question = "What are 2 healthy ways to deal with anxiety?"

messages = [
    {"role": "system", "content": "You are a calm, empathetic assistant that offers short, clear mental wellness advice."},
    {"role": "user", "content": question}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("ü§ñ Generating response (watch Task Manager - CPU should spike!)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        num_beams=1,  # Greedy decoding is faster on CPU
        use_cache=True  # Cache key/value pairs for faster generation
    )

reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print("\n" + "="*60)
print("Bot:", reply.strip())
print("="*60)

üîß Detected 8 CPU cores
‚úÖ Configured to use all 8 cores with Intel optimizations
üì¶ Loading model (this may take a minute)...


`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

‚úÖ Model loaded successfully!
ü§ñ Generating response (watch Task Manager - CPU should spike!)...





Bot: 1. Mindfulness meditation: Practice mindfulness meditation by finding a quiet and comfortable space, closing your eyes, and focusing on your breath. Pay attention to your breathing, and try to clear your mind of any distracting thoughts. This can help you become more aware of your thoughts and emotions, and learn to manage them more effectively.

2. Physical exercise: Engaging in regular physical activity can help reduce anxiety by releasing endorphins, which are the body's natural mood-lifters. Find an activity that you enjoy, such as walking, running, or yoga, and make it a regular part of your routine.

3. T


## Now Finetune the model and see the results

In [5]:
#import Dataset

from datasets import Dataset

train_ds = Dataset.from_pandas(train_df[["Context", "Response"]].reset_index(drop=True))
eval_ds  = Dataset.from_pandas(eval_df[["Context", "Response"]].reset_index(drop=True))

## Download the Model and its tokenizer


In [6]:
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"   # new base
MAX_LEN = 384


In [7]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.truncation_side = "right"
tokenizer.model_max_length = MAX_LEN

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32
)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False  # needed for training

# Correct LoRA targets for Phi-3
target_modules = ["qkv_proj", "o_proj"] # "gate_up_proj", "down_proj" for feedforward layers require more training resources

lora_config = LoraConfig(
    r=1, # (rank): the adapter‚Äôs low-rank size. Higher r ‚áí more capacity, more parameters
    lora_alpha=2, # scaling factor
    target_modules=target_modules, # modules to apply LoRA to
    lora_dropout=0.1, # dropout for regularization
    bias="none", # no bias modification
    task_type="CAUSAL_LM" # task type for causal language modeling
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

trainable params: 589,824 || all params: 3,821,669,376 || trainable%: 0.0154


## Tokenize the dataset for training

In [8]:
# --- System prompt for consistent tone ---
SYSTEM_PROMPT = (
    "You are a calm, empathetic assistant for mental wellbeing. "
    "Validate feelings, be non-judgmental, suggest one small next step. "
    "Do not diagnose. If crisis is indicated, advise contacting local emergency services."
)


def encode_row(example):
    """
    Convert one (Context ‚Üí Response) pair from your dataset
    into tokenized model-ready tensors for fine-tuning a chat model.

    Input:
        example: a dictionary-like object with keys:
                 "Context"  - what the user said (the question)
                 "Response" - what the assistant (therapist) replied

    Output:
        A dictionary containing:
          - input_ids: token IDs of the full conversation
          - attention_mask: mask for real vs padded tokens
          - labels: same as input_ids but with prompt tokens masked as -100
                    (so loss is only computed on assistant‚Äôs reply)
    """
    # -------------------------------------------------------------------------
    # 1Ô∏è‚É£ Build the "full conversation" message list (system + user + assistant)
    # -------------------------------------------------------------------------
    # SYSTEM_PROMPT provides consistent tone/behavior.
    # The user and assistant parts come from your dataset row.
    messages_full = [
        {"role": "system",    "content": SYSTEM_PROMPT},      # defines model personality
        {"role": "user",      "content": example["Context"]}, # user question/input
        {"role": "assistant", "content": example["Response"]} # correct reply to learn
    ]

    # Convert that structured list into plain text formatted for Phi-3.
    # Example output:
    #   <|system|>You are calm...
    #   <|user|>I feel anxious
    #   <|assistant|>That‚Äôs understandable...
    text_full = tokenizer.apply_chat_template(
        messages_full,
        tokenize=False,             # return as string, not token IDs yet
        add_generation_prompt=False # don't append an empty assistant header
    )

    # -------------------------------------------------------------------------
    # 2Ô∏è‚É£ Build the "prompt-only" version (system + user only, no assistant text)
    # -------------------------------------------------------------------------
    # This helps us identify how long the prompt is in tokens,
    # so we can later mask that region in the labels.
    messages_prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": example["Context"]}
    ]

    # Setting add_generation_prompt=True tells the tokenizer to append
    # the "assistant" header ‚Äî basically where generation will begin.
    prompt_only = tokenizer.apply_chat_template(
        messages_prompt,
        tokenize=False,
        add_generation_prompt=True
    )

    # -------------------------------------------------------------------------
    # 3Ô∏è‚É£ Tokenize both versions (full and prompt)
    # -------------------------------------------------------------------------
    # Convert the text into token IDs that the model understands.
    # We truncate to MAX_LEN (to fit model context) and pad shorter ones.
    # return_tensors="pt" gives PyTorch tensors directly.
    enc_full = tokenizer(
        text_full,
        truncation=True,
        max_length=MAX_LEN,
        padding="max_length",
        return_tensors="pt"
    )

    enc_prompt = tokenizer(
        prompt_only,
        truncation=True,
        max_length=MAX_LEN,
        padding="max_length",
        return_tensors="pt"
    )

    # -------------------------------------------------------------------------
    # 4Ô∏è‚É£ Extract token IDs and attention masks from encodings
    # -------------------------------------------------------------------------
    input_ids = enc_full["input_ids"][0]          # the actual tokens (numbers)
    attn_mask = enc_full["attention_mask"][0]     # 1 = real token, 0 = padding

    # -------------------------------------------------------------------------
    # 5Ô∏è‚É£ Create labels for training (same as input_ids initially)
    # -------------------------------------------------------------------------
    labels = input_ids.clone()

    # -------------------------------------------------------------------------
    # 6Ô∏è‚É£ Mask out the prompt tokens (system + user)
    # -------------------------------------------------------------------------
    # We compute how many tokens belong to the prompt part.
    # We use the attention mask of the "prompt-only" encoding to count them.
    prompt_len = int((enc_prompt["attention_mask"][0]).sum().item())

    # For all tokens that belong to the system+user part,
    # we set label = -100 so the loss is ignored on them.
    # Only the assistant's part will be used for loss calculation.
    labels[:prompt_len] = -100

    # -------------------------------------------------------------------------
    # 7Ô∏è‚É£ Return the dictionary that the Trainer expects
    # -------------------------------------------------------------------------
    return {
        "input_ids": input_ids,           # tokenized full conversation
        "attention_mask": attn_mask,      # mask for real tokens vs padding
        "labels": labels                  # same as input_ids but masked
    }


In [9]:
train_tokenized = train_ds.map(encode_row)
eval_tokenized  = eval_ds.map(encode_row)

Map:   0%|          | 0/8 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [10]:
# set format for PyTorch
cols = ["input_ids","attention_mask","labels"]
train_tokenized.set_format(type="torch", columns=cols)
eval_tokenized.set_format(type="torch", columns=cols)

In [11]:
# CHEKC IF GPU IS AVAILABLE
import torch
print("CUDA available:", torch.cuda.is_available())
print("Current device:", torch.cuda.current_device() if torch.cuda.is_available() else "CPU only")


CUDA available: False
Current device: CPU only


In [12]:
# --- Training args 
hf_logging.set_verbosity_info()
datasets.logging.set_verbosity_info()

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,  # ‚úÖ Reduced from 16 to 2 for more visible steps
    per_device_eval_batch_size=2,   # ‚úÖ Reduced for consistency
    gradient_accumulation_steps=1,  
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=1,                # ‚úÖ Log every step
    logging_first_step=True, 
    logging_strategy="steps", 
    dataloader_pin_memory=False,
    disable_tqdm=False,             # ‚úÖ Keep progress bars enabled
    report_to="none",
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
)

# # --- Print step counts before training ---
# num_samples = len(train_tokenized)
# batch_size = args.per_device_train_batch_size
# grad_accum = args.gradient_accumulation_steps
# epochs = args.num_train_epochs

# steps_per_epoch = (num_samples + (batch_size * grad_accum) - 1) // (batch_size * grad_accum)
# total_steps = steps_per_epoch * epochs

# print(f"üìä Dataset size: {num_samples} samples")
# print(f"üß© Effective batch size: {batch_size * grad_accum}")
# print(f"üîÅ Steps per epoch: {steps_per_epoch}")
# print(f"‚è±Ô∏è Total training steps: {total_steps}\n")

# --- Train ---
trainer.train()

PyTorch: setting up devices
The following columns in the Training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: Response, Context. If Response, Context are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 8
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 12
  Number of trainable parameters = 589,824


Epoch,Training Loss,Validation Loss
1,5.8423,4.41518
2,6.5284,4.318367
3,6.3829,4.270855


The following columns in the Evaluation set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: Response, Context. If Response, Context are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 2

***** Running Evaluation *****
  Num examples = 2
  Batch size = 2
  Batch size = 2
Saving model checkpoint to ./results\checkpoint-4
Saving model checkpoint to ./results\checkpoint-4
loading configuration file config.json from cache at C:\Users\Tashfeen Ahmed\.cache\huggingface\hub\models--microsoft--Phi-3-mini-4k-instruct\snapshots\0a67737cc96d2554230f90338b163bc6380a2a85\config.json
Model config Phi3Config {
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "dtype": 

TrainOutput(global_step=12, training_loss=6.7316296100616455, metrics={'train_runtime': 3220.6364, 'train_samples_per_second': 0.007, 'train_steps_per_second': 0.004, 'total_flos': 205876340195328.0, 'train_loss': 6.7316296100616455, 'epoch': 3.0})

In [None]:
# --- Save LoRA adapter (small) ---
os.makedirs("./lora_finetuned_model", exist_ok=True)
trainer.model.save_pretrained("./lora_finetuned_model") #saves the finetuned parameters lora adapters
tokenizer.save_pretrained("./lora_finetuned_model")
print("‚úÖ LoRA adapter saved to ./lora_finetuned_model")

loading configuration file config.json from cache at C:\Users\Tashfeen Ahmed\.cache\huggingface\hub\models--microsoft--Phi-3-mini-4k-instruct\snapshots\0a67737cc96d2554230f90338b163bc6380a2a85\config.json
Model config Phi3Config {
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 4096,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "partial_rotary_factor": 1.0,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0

‚úÖ LoRA adapter saved to ./lora_finetuned_model


: 

# Load the fine tuned model and test it

In [6]:
# --- Load the fine-tuned LoRA model and test it ---

# ===== CPU Optimization - Use all cores =====
import os
num_cores = os.cpu_count()
print(f"üîß Configuring to use all {num_cores} CPU cores...")

# Set environment variables for Intel CPU optimization
os.environ["OMP_NUM_THREADS"] = str(num_cores)
os.environ["MKL_NUM_THREADS"] = str(num_cores)
os.environ["OPENBLAS_NUM_THREADS"] = str(num_cores)
os.environ["VECLIB_MAXIMUM_THREADS"] = str(num_cores)
os.environ["NUMEXPR_NUM_THREADS"] = str(num_cores)
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import torch
torch.set_num_threads(num_cores)

print(f"‚úÖ CPU configured with Intel optimizations")
# ============================================

SYSTEM_PROMPT = (
    "You are a calm, empathetic assistant for mental wellbeing. "
    "Validate feelings, be non-judgmental, suggest one small next step. "
    "Do not diagnose. If crisis is indicated, advise contacting local emergency services."
)
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

print("üì¶ Loading base model (float32, no quantization)...")
# Load base model without quantization
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32
)

print("üîß Loading LoRA adapter...")
# Load LoRA adapter on top of base model
finetuned_model = PeftModel.from_pretrained(base_model, "./lora_finetuned_model")
finetuned_tokenizer = AutoTokenizer.from_pretrained("./lora_finetuned_model")

print("‚úÖ Fine-tuned model loaded successfully!")

# Test with the SAME question as the base model
test_question = "What are 2 healthy ways to deal with anxiety?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": test_question}
]

# Prepare input
prompt = finetuned_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = finetuned_tokenizer(prompt, return_tensors="pt")

print("ü§ñ Generating fine-tuned response...")
# Generate response
with torch.no_grad():
    outputs = finetuned_model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=finetuned_tokenizer.eos_token_id,
        eos_token_id=finetuned_tokenizer.eos_token_id,
        num_beams=1,
        use_cache=True
    )

# Decode and print response
response = finetuned_tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print("\n" + "=" * 60)
print("üß† FINE-TUNED MODEL RESPONSE")
print("=" * 60)
print(f"Question: {test_question}")
print(f"\nResponse: {response.strip()}")
print("=" * 60)

üîß Configuring to use all 8 CPU cores...
‚úÖ CPU configured with Intel optimizations
üì¶ Loading base model (float32, no quantization)...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

üîß Loading LoRA adapter...
‚úÖ Fine-tuned model loaded successfully!
‚úÖ Fine-tuned model loaded successfully!
ü§ñ Generating fine-tuned response...
ü§ñ Generating fine-tuned response...

üß† FINE-TUNED MODEL RESPONSE
Question: What are 3 healthy ways to deal with anxiety?

Response: Dealing with anxiety can be challenging, but there are healthy ways to manage it. Here are three strategies that might help:


1. Mindfulness and Meditation: Engaging in mindfulness practices can help you stay grounded in the present moment and reduce the impact of anxious thoughts. Try incorporating meditation into your daily routine, even if it's just for a few minutes. There are many free resources available online that guide you through meditation for anxiety relief.


2. Physical Activity: Exercise is a powerful tool for reducing anxiety. It helps release endorphins, which are chemicals in the brain that act as natural painkillers and mood

üß† FINE-TUNED MODEL RESPONSE
Question: What are 3 heal