To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance to train reasoning model.** GRPO notebook is inspired by [@shxf0072](https://x.com/shxf0072/status/1886085377146180091), [@Teknium1](https://x.com/Teknium1/status/1885077369142337550), [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb)

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
# Normally using pip install unsloth is enough

# Temporarily as of Jan 31st 2025, Colab has some issues with Pytorch
# Using pip install unsloth will take 3 minutes, whilst the below takes <1 minute:
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth
!pip install unsloth

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [4]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import json
!pip install datasets
from datasets import Dataset

# Load training data from Google Drive
with open('/content/drive/My Drive/training_data.jsonl', 'r') as file:
    data = [json.loads(line) for line in file]

# Prepare the dataset
texts = [entry['prompt'] + entry['completion'] for entry in data]
dataset = Dataset.from_dict({'text': texts})



In [6]:
# Configure the tokenizer with chat template

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)


In [7]:
def formatting_prompts_func(examples):
    texts = examples["text"]
    formatted_texts = [tokenizer.apply_chat_template(
        [{"role": "user", "content": text}],
        tokenize=False,
        add_generation_prompt=False
    ) for text in texts]
    return {"text": formatted_texts}

# Apply formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/66018 [00:00<?, ? examples/s]

In [8]:
# Training setup
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=10,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/66018 [00:00<?, ? examples/s]

In [9]:
# Train the model
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 66,018 | Num Epochs = 1 | Total steps = 10
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mmaggie-tao1982[0m ([33mmaggie-tao1982-institution[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.1244
2,2.0654
3,2.0147
4,2.0251
5,2.1671
6,2.0154
7,1.9809
8,1.9308
9,1.9721
10,1.9327


In [10]:
# Enable inference mode
FastLanguageModel.for_inference(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

In [11]:
template = """
You are a supportive assistant. Provide helpful and empathetic advice to the user.
User query: {user_query}
Response:
"""

In [12]:
def generate_response(input_text, model, tokenizer):
    if not input_text.strip():
        return "Please provide a valid query."

    # Define a custom prompt template
    template = """
    You are a supportive assistant. Provide a single, helpful, and empathetic response to the user's query.
    User query: {user_query}
    Response:
    """
    prompt = template.format(user_query=input_text)

    # Prepare input
    inputs = tokenizer(
        prompt,
        return_tensors="pt"
    ).to("cuda")

    # Generate response
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_new_tokens=250,  # Limit response length
        temperature=0.7,     # Lower temperature for more focused responses
        top_p=0.9,       # Use nucleus sampling
        repetition_penalty=1.1,
        num_return_sequences=1  # Ensure a single response is generated
    )

    # Decode and return response
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # Post-process and validate response
    if "I'm sorry" not in response:  # Example of a simple validation
        return response.strip()
    else:
        return "I couldn't provide a helpful response. Please try again."


In [None]:
# Example usage
response = generate_response("I'm feeling depressed, what shall I do?", model, tokenizer)
print(response)

You are a supportive assistant. Provide a single, helpful, and empathetic response to the user's query.
    User query: I'm feeling depressed, what shall I do?
    Response:
     Hi there! It sounds like you're having a rough day. Is there anything in particular that is making you feel this way? Sometimes just talking about how we're feeling can help us gain some perspective on things. If you'd rather not talk right now, it might be helpful to try writing down your thoughts or doing something relaxing, such as taking a bath or listening to music. Remember that everyone goes through tough times sometimes, but they don't last forever. Take care of yourself!


# LLM model eval

## MMLU benchmarks


In [None]:
%%capture
!pip install lm-eval
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

In [None]:
#wrapped it in `lm_eval.models.huggingface.HFLM(pretrained=my_model)` first, then lm_eval can recognize the model
%%capture
lm_model = HFLM(model)



In [None]:
tasks = ["mmlu_philosophy"]
# Evaluate
result_philosophy = evaluator.simple_evaluate(
    model=lm_model,
    tasks=tasks,
    batch_size=8)

print(result_philosophy)

README.md:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

mmlu_no_train.py:   0%|          | 0.00/5.86k [00:00<?, ?B/s]

data.tar:   0%|          | 0.00/166M [00:00<?, ?B/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

100%|██████████| 311/311 [00:00<00:00, 685.22it/s]
Running loglikelihood requests: 100%|██████████| 1244/1244 [01:24<00:00, 14.78it/s]


{'results': {'mmlu_philosophy': {'alias': 'philosophy', 'acc,none': 0.6945337620578779, 'acc_stderr,none': 0.026160584450140453}}, 'group_subtasks': {'mmlu_philosophy': []}, 'configs': {'mmlu_philosophy': {'task': 'mmlu_philosophy', 'task_alias': 'philosophy', 'tag': 'mmlu_humanities_tasks', 'dataset_path': 'hails/mmlu_no_train', 'dataset_name': 'philosophy', 'dataset_kwargs': {'trust_remote_code': True}, 'test_split': 'test', 'fewshot_split': 'dev', 'doc_to_text': '{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:', 'doc_to_target': 'answer', 'unsafe_code': False, 'doc_to_choice': ['A', 'B', 'C', 'D'], 'description': 'The following are multiple choice questions (with answers) about philosophy.\n\n', 'target_delimiter': ' ', 'fewshot_delimiter': '\n\n', 'fewshot_config': {'sampler': 'first_n'}, 'num_fewshot': 0, 'metric_list': [{'metric': 'acc', 'aggregation': 'mean', 'higher_is_better': True}], 'output_type': 'multiple_choice', '

In [None]:
tasks = ["mmlu_moral_scenarios"]

# Evaluate
result_moral_scenarios = evaluator.simple_evaluate(
    model=lm_model,
    tasks=tasks,
    batch_size=8)

print(result_moral_scenarios)

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

100%|██████████| 895/895 [00:03<00:00, 224.67it/s]
Running loglikelihood requests: 100%|██████████| 3580/3580 [04:37<00:00, 12.90it/s]


{'results': {'mmlu_moral_scenarios': {'alias': 'moral_scenarios', 'acc,none': 0.2737430167597765, 'acc_stderr,none': 0.014912413096372432}}, 'group_subtasks': {'mmlu_moral_scenarios': []}, 'configs': {'mmlu_moral_scenarios': {'task': 'mmlu_moral_scenarios', 'task_alias': 'moral_scenarios', 'tag': 'mmlu_humanities_tasks', 'dataset_path': 'hails/mmlu_no_train', 'dataset_name': 'moral_scenarios', 'dataset_kwargs': {'trust_remote_code': True}, 'test_split': 'test', 'fewshot_split': 'dev', 'doc_to_text': '{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:', 'doc_to_target': 'answer', 'unsafe_code': False, 'doc_to_choice': ['A', 'B', 'C', 'D'], 'description': 'The following are multiple choice questions (with answers) about moral scenarios.\n\n', 'target_delimiter': ' ', 'fewshot_delimiter': '\n\n', 'fewshot_config': {'sampler': 'first_n'}, 'num_fewshot': 0, 'metric_list': [{'metric': 'acc', 'aggregation': 'mean', 'higher_is_better': Tru

In [None]:
tasks = ["mmlu_logical_fallacies"]

# Evaluate
result_logical_fallacies = evaluator.simple_evaluate(
    model=lm_model,
    tasks=tasks,
    batch_size=8)

print(result_logical_fallacies)

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

100%|██████████| 163/163 [00:00<00:00, 194.04it/s]
Running loglikelihood requests: 100%|██████████| 652/652 [00:43<00:00, 14.90it/s]


{'results': {'mmlu_logical_fallacies': {'alias': 'logical_fallacies', 'acc,none': 0.7239263803680982, 'acc_stderr,none': 0.03512385283705048}}, 'group_subtasks': {'mmlu_logical_fallacies': []}, 'configs': {'mmlu_logical_fallacies': {'task': 'mmlu_logical_fallacies', 'task_alias': 'logical_fallacies', 'tag': 'mmlu_humanities_tasks', 'dataset_path': 'hails/mmlu_no_train', 'dataset_name': 'logical_fallacies', 'dataset_kwargs': {'trust_remote_code': True}, 'test_split': 'test', 'fewshot_split': 'dev', 'doc_to_text': '{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:', 'doc_to_target': 'answer', 'unsafe_code': False, 'doc_to_choice': ['A', 'B', 'C', 'D'], 'description': 'The following are multiple choice questions (with answers) about logical fallacies.\n\n', 'target_delimiter': ' ', 'fewshot_delimiter': '\n\n', 'fewshot_config': {'sampler': 'first_n'}, 'num_fewshot': 0, 'metric_list': [{'metric': 'acc', 'aggregation': 'mean', 'higher_

In [None]:
tasks = ["mmlu_nutrition"]

# Evaluate
result_nutrition = evaluator.simple_evaluate(
    model=lm_model,
    tasks=tasks,
    batch_size=8)

print(result_nutrition)

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

100%|██████████| 306/306 [00:00<00:00, 693.02it/s]
Running loglikelihood requests: 100%|██████████| 1224/1224 [01:23<00:00, 14.70it/s]


{'results': {'mmlu_nutrition': {'alias': 'nutrition', 'acc,none': 0.7124183006535948, 'acc_stderr,none': 0.02591780611714716}}, 'group_subtasks': {'mmlu_nutrition': []}, 'configs': {'mmlu_nutrition': {'task': 'mmlu_nutrition', 'task_alias': 'nutrition', 'tag': 'mmlu_other_tasks', 'dataset_path': 'hails/mmlu_no_train', 'dataset_name': 'nutrition', 'dataset_kwargs': {'trust_remote_code': True}, 'test_split': 'test', 'fewshot_split': 'dev', 'doc_to_text': '{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:', 'doc_to_target': 'answer', 'unsafe_code': False, 'doc_to_choice': ['A', 'B', 'C', 'D'], 'description': 'The following are multiple choice questions (with answers) about nutrition.\n\n', 'target_delimiter': ' ', 'fewshot_delimiter': '\n\n', 'fewshot_config': {'sampler': 'first_n'}, 'num_fewshot': 0, 'metric_list': [{'metric': 'acc', 'aggregation': 'mean', 'higher_is_better': True}], 'output_type': 'multiple_choice', 'repeats': 1, '

In [None]:
tasks = ["mmlu_sociology"]

# Evaluate
result_sociology = evaluator.simple_evaluate(
    model=lm_model,
    tasks=tasks,
    batch_size=8)

print(result_sociology)

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

100%|██████████| 201/201 [00:00<00:00, 716.77it/s]
Running loglikelihood requests: 100%|██████████| 804/804 [00:54<00:00, 14.79it/s]


{'results': {'mmlu_sociology': {'alias': 'sociology', 'acc,none': 0.8258706467661692, 'acc_stderr,none': 0.026814951200421603}}, 'group_subtasks': {'mmlu_sociology': []}, 'configs': {'mmlu_sociology': {'task': 'mmlu_sociology', 'task_alias': 'sociology', 'tag': 'mmlu_social_sciences_tasks', 'dataset_path': 'hails/mmlu_no_train', 'dataset_name': 'sociology', 'dataset_kwargs': {'trust_remote_code': True}, 'test_split': 'test', 'fewshot_split': 'dev', 'doc_to_text': '{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:', 'doc_to_target': 'answer', 'unsafe_code': False, 'doc_to_choice': ['A', 'B', 'C', 'D'], 'description': 'The following are multiple choice questions (with answers) about sociology.\n\n', 'target_delimiter': ' ', 'fewshot_delimiter': '\n\n', 'fewshot_config': {'sampler': 'first_n'}, 'num_fewshot': 0, 'metric_list': [{'metric': 'acc', 'aggregation': 'mean', 'higher_is_better': True}], 'output_type': 'multiple_choice', 'rep

In [None]:
tasks = ["mmlu_medical_genetics"]

# Evaluate
result_medical_genetics = evaluator.simple_evaluate(
    model=lm_model,
    tasks=tasks,
    batch_size=8)

print(result_medical_genetics)

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

100%|██████████| 100/100 [00:00<00:00, 383.64it/s]
Running loglikelihood requests: 100%|██████████| 400/400 [00:29<00:00, 13.60it/s]


{'results': {'mmlu_medical_genetics': {'alias': 'medical_genetics', 'acc,none': 0.78, 'acc_stderr,none': 0.041633319989322626}}, 'group_subtasks': {'mmlu_medical_genetics': []}, 'configs': {'mmlu_medical_genetics': {'task': 'mmlu_medical_genetics', 'task_alias': 'medical_genetics', 'tag': 'mmlu_other_tasks', 'dataset_path': 'hails/mmlu_no_train', 'dataset_name': 'medical_genetics', 'dataset_kwargs': {'trust_remote_code': True}, 'test_split': 'test', 'fewshot_split': 'dev', 'doc_to_text': '{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:', 'doc_to_target': 'answer', 'unsafe_code': False, 'doc_to_choice': ['A', 'B', 'C', 'D'], 'description': 'The following are multiple choice questions (with answers) about medical genetics.\n\n', 'target_delimiter': ' ', 'fewshot_delimiter': '\n\n', 'fewshot_config': {'sampler': 'first_n'}, 'num_fewshot': 0, 'metric_list': [{'metric': 'acc', 'aggregation': 'mean', 'higher_is_better': True}], 'outpu

In [None]:
tasks = ["mmlu_human_sexuality"]

# Evaluate
result_human_sexuality = evaluator.simple_evaluate(
    model=lm_model,
    tasks=tasks,
    batch_size=8)

print(result_human_sexuality)

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

100%|██████████| 131/131 [00:00<00:00, 616.50it/s]
Running loglikelihood requests: 100%|██████████| 524/524 [00:35<00:00, 14.69it/s]


{'results': {'mmlu_human_sexuality': {'alias': 'human_sexuality', 'acc,none': 0.7175572519083969, 'acc_stderr,none': 0.03948406125768362}}, 'group_subtasks': {'mmlu_human_sexuality': []}, 'configs': {'mmlu_human_sexuality': {'task': 'mmlu_human_sexuality', 'task_alias': 'human_sexuality', 'tag': 'mmlu_social_sciences_tasks', 'dataset_path': 'hails/mmlu_no_train', 'dataset_name': 'human_sexuality', 'dataset_kwargs': {'trust_remote_code': True}, 'test_split': 'test', 'fewshot_split': 'dev', 'doc_to_text': '{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:', 'doc_to_target': 'answer', 'unsafe_code': False, 'doc_to_choice': ['A', 'B', 'C', 'D'], 'description': 'The following are multiple choice questions (with answers) about human sexuality.\n\n', 'target_delimiter': ' ', 'fewshot_delimiter': '\n\n', 'fewshot_config': {'sampler': 'first_n'}, 'num_fewshot': 0, 'metric_list': [{'metric': 'acc', 'aggregation': 'mean', 'higher_is_better':

## Long Context Understanding: ms_marco

In [1]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%%capture
!pip install transformers peft torch

In [3]:
!pip install -U bitsandbytes



In [4]:
!pip install --force-reinstall bitsandbytes

Collecting bitsandbytes
  Using cached bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting torch<3,>=2.0 (from bitsandbytes)
  Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting numpy>=1.17 (from bitsandbytes)
  Using cached numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting filelock (from torch<3,>=2.0->bitsandbytes)
  Using cached filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting typing-extensions>=4.10.0 (from torch<3,>=2.0->bitsandbytes)
  Using cached typing_extensions-4.13.1-py3-none-any.whl.metadata (3.0 kB)
Collecting networkx (from torch<3,>=2.0->bitsandbytes)
  Using cached networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch<3,>=2.0->bitsandbytes)
  Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec (from torch<3,>=2.0->bitsandbytes)
  Using cached fsspec-2025.3.2-py3-none-any.whl.metadata (11 k

In [5]:
%%capture
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. load trained model and tokenizer
model_path = "/content/drive/My Drive/LLM_project_health_chatbot"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [6]:
!pip install datasets
from datasets import load_dataset  # Import the load_dataset function

ds2 = load_dataset("microsoft/ms_marco", "v1.1")

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.w

README.md:   0%|          | 0.00/9.48k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/21.4M [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/175M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/10047 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/82326 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/9650 [00:00<?, ? examples/s]

In [7]:
print(ds2)  # print out dataset info
print(ds2.keys())
print(ds2["train"][0]) #print out the train info

DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 10047
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 82326
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 9650
    })
})
dict_keys(['validation', 'train', 'test'])
{'answers': ['Results-Based Accountability is a disciplined way of thinking and taking action that communities can use to improve the lives of children, youth, families, adults and the community as a whole.'], 'passages': {'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], 'passage_text': ["Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win luc

In [13]:
# Use 1% of the dataset
train_data = ds2['train'].shuffle(seed=42).select(range(int(1 * len(ds2['train']))))
test_data = ds2['test'].shuffle(seed=42).select(range(int(0.01 * len(ds2['test']))))
validation_data = ds2['validation'].shuffle(seed=42).select(range(int(0.01 * len(ds2['validation']))))

In [14]:
!pip install rouge-score
from tqdm import tqdm
import torch
from rouge_score import rouge_scorer

def evaluate_model_rouge2(model, tokenizer, test_data, max_samples=200):
    model.eval()
    device = model.device
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    rouge1_scores, rouge2_scores, rougeL_scores = [], [], []
    total = min(len(test_data), max_samples)

    for i in tqdm(range(total)):
        sample = test_data[i]
        question = sample["query"]

        if len(sample["answers"]) == 0:
            continue

        reference = sample["answers"][0]

        inputs = tokenizer(
            question,
            return_tensors="pt",
            max_length=512,
            truncation=True
        ).to(device)

        with torch.no_grad():
            outputs = model.generate(
                input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=256,
                pad_token_id=tokenizer.eos_token_id
            )

        prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

        scores = scorer.score(reference, prediction)
        rouge1_scores.append(scores["rouge1"].fmeasure)
        rouge2_scores.append(scores["rouge2"].fmeasure)
        rougeL_scores.append(scores["rougeL"].fmeasure)

    return {
        "ROUGE-1": sum(rouge1_scores) / total,
        "ROUGE-2": sum(rouge2_scores) / total,
        "ROUGE-L": sum(rougeL_scores) / total
    }



In [15]:
## run evalution
results = evaluate_model_rouge2(model, tokenizer, test_data)
print(results)

100%|██████████| 96/96 [39:09<00:00, 24.47s/it]

{'ROUGE-1': 0.07585937719572741, 'ROUGE-2': 0.018890239531918398, 'ROUGE-L': 0.06318884979482155}





In [13]:
# Define the path in Google Drive where you want to save the model
drive_path = '/content/drive/My Drive/LLM_project_health_chatbot'
# Save the model and tokenizer
model.save_pretrained(drive_path)
tokenizer.save_pretrained(drive_path)

('/content/drive/My Drive/LLM_project_health_chatbot/tokenizer_config.json',
 '/content/drive/My Drive/LLM_project_health_chatbot/special_tokens_map.json',
 '/content/drive/My Drive/LLM_project_health_chatbot/tokenizer.json')

# RAG

In [None]:
!pip install langchain-huggingface # Install the langchain-huggingface package
from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings # Now this import should work

Collecting langchain-huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Downloading langchain_huggingface-0.1.2-py3-none-any.whl (21 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.1.2


In [None]:
# Load and process documents for RAG
# Define embedding model
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-5.4.0-py3-none-any.whl (302 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/302.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m297.0/302.3 kB[0m [31m10.4 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.3/302.3 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pypdf
Successfully installed pypdf-5.4.0


In [None]:
import os # Import the os module
from PyPDF2 import PdfReader
from google.colab import drive
from langchain.document_loaders import PyPDFLoader # Import PyPDFLoader
from langchain.docstore.document import Document # Import Document class

def extract_text_from_pdf(pdf_path):
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    text = []
    for page in documents:
        text.append(page.page_content)
    return "\n".join(text)

In [None]:
def load_professional_documents(directory):
    documents = []
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            pdf_path = os.path.join(directory, filename)
            text = extract_text_from_pdf(pdf_path)
            documents.append(Document(page_content=text)) # Now Document is defined
    return documents

In [None]:
# Function to prepare FAISS index
def prepare_faiss_index(documents, embedding_model_name='sentence-transformers/all-mpnet-base-v2'):
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
    return FAISS.from_documents(documents, embeddings)

In [None]:
def generate_rag_response(input_text, vector_db, model, tokenizer, context_limit=1000):
    if not input_text.strip():
        return "Please provide a valid query."

    # Retrieve relevant documents
    retrieved_docs = vector_db.as_retriever(search_type="similarity", search_kwargs={'k': 4}).get_relevant_documents(input_text)

    # Concatenate retrieved documents and limit context length
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])
    if len(context) > context_limit:
        context = context[:context_limit] + "..."

    # Define a refined prompt template
    template = """
    You are a supportive assistant. Provide a single, concise, and empathetic response based on the context.

    ### Context from Retrieved Documents:
    {context}

    ### User query:
    {user_query}

    ### Response:
    """
    prompt = template.format(context=context, user_query=input_text)

    # Prepare input
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate a single response
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_new_tokens=250,  # Limit response length
        temperature=0.7,     # Lower temperature for more focused responses
        repetition_penalty=1.1,
        top_p=0.9,       # Use nucleus sampling
        num_return_sequences=1  # Ensure a single response is generated
    )

    # Decode and return the single response
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # Validate and format response
    return response.strip()


In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.7/30.7 MB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


In [None]:
# Example usage
pdf_directory = '/content/drive/My Drive/RAG docs'
# Load and index professional documents
documents = load_professional_documents(pdf_directory)
vector_db = prepare_faiss_index(documents)

In [None]:
# Example usage
user_question = "I'm feeling depressed, what shall I do?"
response = generate_rag_response(user_question, vector_db, model, tokenizer)

# Print the interaction
print("### User query:")
print(user_question)
print("\n### Response:")
print(response)

# Simulated Feedback and Rating
print("\n### Response rating:")
print("1=Not helpful, 2=Slightly helpful, 3=Moderately helpful, 4=Very Helpful, 5=Extremely Helpful")
print("\nOverall Rating: Moderately helpful")

print("\n### Feedback:")
print("Thank you for your response! I'll consider talking to someone and trying journaling as you suggested.")

### User query:
I'm feeling depressed, what shall I do?

### Response:
You are a supportive assistant. Provide a single, concise, and empathetic response based on the context.

    ### Context from Retrieved Documents:
    
https://www.openbookpublishers.com
©2025 Michael Briant; Foreword ©Rowan Williams
This work is licensed under a Creative Commons Attribution-NonCommercial-
NoDerivatives 4.0 International (CC BY-NC-ND 4.0). This license allows you to share, 
copy, distribute and transmit the work for non-commercial purposes, providing 
attribution is made to the author (but not in any way that suggests that he endorses 
you or your use of the work). If you remix, transform, or build upon the material, you 
may not distribute the modified material. Attribution should include the following 
information:
Michael Briant, Troubled People, Troubled World: Psychotherapy, Ethics, and Society. 
Cambridge, UK: Open Book Publishers, 2025,  https://doi.org/10.11647/OBP.0416
Further details abou