To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

[NEW] Llama-3.1 8b, 70b & 405b are trained on a crazy 15 trillion tokens with 128K long context lengths!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers via:
`pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"`


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.10.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
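
The number of trainable LoRA parameters can be worked out by hand. The sketch below assumes the published Llama-3.1-8B shapes (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128), which are not stated in this notebook, and `lora_params` is a hypothetical helper written only for this illustration:

```python
# LoRA adds two small matrices to each adapted linear layer:
# A with shape (r, in_features) and B with shape (out_features, r),
# so each adapted layer gains r * (in_features + out_features) parameters.
R = 16          # LoRA rank, as passed to get_peft_model above
HIDDEN = 4096   # assumed Llama-3.1-8B hidden size
MLP = 14336     # assumed intermediate (MLP) size
KV = 8 * 128    # assumed key/value projection width (8 KV heads x 128 dims)
LAYERS = 32     # matches "patched 32 layers" in the output above

# (in_features, out_features) for every module listed in target_modules
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(r, shapes, layers):
    """Count trainable LoRA parameters across all transformer layers."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * layers

total = lora_params(R, SHAPES, LAYERS)
print(f"{total:,}")  # 41,943,040
```

Raising `r` scales this count linearly, which is the main trade-off between adapter capacity and memory use.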


<a name="Data"></a>
### Data Prep
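We now use the [mental health counseling conversations dataset](https://huggingface.co/datasets/Amod/mental_health_counseling_conversations) from Amod, formatted with the prompt template of the [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html) (a cleaned 52K version is available from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned)). The data is split 80/20 into training and test sets. You can replace this code section with your own data prep.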

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
# Load necessary libraries
import pandas as pd
from datasets import Dataset
from sklearn.model_selection import train_test_split
# Define the instruction template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


# EOS token setup
EOS_TOKEN = tokenizer.eos_token

# Function to dynamically generate instructions
def generate_instruction(user_input, instruction=None):
    # If no specific instruction is provided, default to therapist role
    if instruction is None:
        return "You are a therapist. Respond with advice and guidance based on the patient's input and in the same language of the patient's input."
    else:
        return instruction  # You can pass a custom instruction here

# Function to format the DataFrame into Alpaca-style prompts with dynamic instructions
def format_dataset(df, custom_instruction=None):
    formatted_data = {
        "instruction": [],
        "input": [],
        "output": []
    }

    for index, row in df.iterrows():
        # Generate the instruction dynamically
        instruction = generate_instruction(row['Context'], custom_instruction)
        user_input = row['Context']
        response = row['Response']

        # Ensure instruction is based on therapist perspective
        formatted_data["instruction"].append(instruction)
        formatted_data["input"].append(user_input)
        formatted_data["output"].append(response)

    return pd.DataFrame(formatted_data)


# Load the dataset
whole_df = pd.read_json("hf://datasets/Amod/mental_health_counseling_conversations/combined_dataset.json", lines=True)

# Split the dataset (80% for training, 20% for testing)
df, test_df = train_test_split(whole_df, test_size=0.2, random_state=42)

# Reset index to keep things clean
df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

# Call the formatting function, allowing for dynamic instructions
custom_instruction = None
formatted_df = format_dataset(df,custom_instruction)

# Convert the formatted DataFrame into a Huggingface Dataset object
dataset = Dataset.from_pandas(formatted_df)



# Function to add Alpaca-style format and EOS token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Apply the formatting to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True)


Map:   0%|          | 0/2809 [00:00<?, ? examples/s]
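
To make the difference between a training example and an inference prompt concrete, here is a small standalone sketch. The EOS string below is a placeholder assumption (in the notebook it comes from `tokenizer.eos_token`), and `build_training_text` and `build_inference_prompt` are hypothetical helper names:

```python
# Placeholder EOS string for illustration only; the notebook reads the real
# value from tokenizer.eos_token.
EOS_TOKEN = "<|end_of_text|>"

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def build_training_text(instruction, user_input, response):
    # Training text contains the target response followed by EOS,
    # so the model learns where a response ends.
    return alpaca_prompt.format(instruction, user_input, response) + EOS_TOKEN

def build_inference_prompt(instruction, user_input):
    # Inference prompt leaves the response slot empty and has no EOS,
    # so the model continues writing after "### Response:".
    return alpaca_prompt.format(instruction, user_input, "")

train_text = build_training_text(
    "You are a therapist.", "I'm stressed, help me out", "Let's look at what is causing the stress."
)
infer_prompt = build_inference_prompt("You are a therapist.", "I'm stressed, help me out")
print(train_text)
```

Appending EOS to an inference prompt would tell the model the sequence is already finished, which is why only the training text carries it.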

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 120 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove `max_steps`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 3,
        gradient_accumulation_steps = 2,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 120,
        learning_rate = 1e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.05, # Increased regularization to reduce overfitting
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        eval_steps = 10,
    ),
)

Map (num_proc=2):   0%|          | 0/2809 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
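
It helps to check how much of the dataset a capped run actually covers. A rough sketch of the arithmetic, using the values from the trainer configuration and the dataset size from the `Map` output; the steps-per-epoch figure is approximate because the exact count depends on how the trainer handles the last partial batch:

```python
num_examples = 2809               # dataset size shown in the Map output
per_device_batch_size = 3         # per_device_train_batch_size
gradient_accumulation_steps = 2
num_devices = 1                   # a single T4 GPU
max_steps = 120

# Examples consumed per optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices

# Examples consumed over the whole capped run
examples_seen = effective_batch * max_steps

# Fraction of one epoch that the capped run covers
epoch_fraction = examples_seen / num_examples

# Approximate number of optimizer steps needed for one full epoch
approx_steps_per_epoch = num_examples / effective_batch

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen: {examples_seen} ({epoch_fraction:.1%} of one epoch)")
print(f"About {approx_steps_per_epoch:.0f} steps would cover one epoch")
```

With these settings the run sees roughly a quarter of the training set, so a full epoch needs about four times as many steps.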


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
11.549 GB of memory reserved.


In [None]:
import os
os.environ['WANDB_MODE'] = 'disabled'


In [None]:
trainer_stats = trainer.train()

**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers and Unsloth!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,809 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 3 | Gradient Accumulation steps = 2
\        /    Total batch size = 6 | Total steps = 120
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.6465
2,2.6536
3,2.5768
4,2.2796
5,2.449
6,2.9616
7,2.0677
8,2.0342
9,2.0821
10,1.9295


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

987.3075 seconds used for training.
16.46 minutes used for training.
Peak reserved memory = 11.549 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 78.309 %.
Peak reserved memory for training % of max memory = 0.0 %.
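
The figures above follow from simple arithmetic on three numbers. One plausible reason for the `0.0 GB` training figure is that `torch.cuda.max_memory_reserved()` reports the peak since the process started (or since the peak statistics were last reset), so a baseline recorded after an earlier training run in the same session already equals the peak. A sketch with the values printed above:

```python
max_memory = 14.748        # GB, total memory of the Tesla T4
start_gpu_memory = 11.549  # GB, baseline recorded before trainer.train()
used_memory = 11.549       # GB, peak recorded after trainer.train()

# Memory attributed to training is the growth of the peak over the baseline
used_for_training = round(used_memory - start_gpu_memory, 3)

# Both figures expressed as a share of total GPU memory
used_percentage = round(used_memory / max_memory * 100, 3)
training_percentage = round(used_for_training / max_memory * 100, 3)

print(f"Peak reserved memory for training = {used_for_training} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {training_percentage} %.")
```

Recording the baseline in a fresh session, before any training has run, gives a more informative difference.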


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "", # instruction
        "I'm stressed, help me out", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "", # instruction
        "I'm stressed, help me out",
        "", # output- leave this blank for generation!

    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 400)


In [None]:
# Continuous inference (no conversation history is kept between turns)
from transformers import TextStreamer

# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
text_streamer = TextStreamer(tokenizer)

while True:
    inpt = input()
    # Type "exit" or "quit" to leave the loop
    if inpt.strip().lower() in ("exit", "quit"):
        break
    inputs = tokenizer(
    [
        alpaca_prompt.format(
            "", # instruction
            inpt,
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 400)


# Evaluation


In [None]:
df

Unnamed: 0,Context,Response
0,We don't have sex a lot. I cheat when we argue...,"\nHello, and thank you for your question. The ..."
1,I don't know how to notice or express my feeli...,"Well, then give yourself some credit for notic..."
2,I start counseling/therapy in a few days (I'm ...,"People do cry in therapy sometimes, but it's n..."
3,I don't know how to tell someone how I feel ab...,"""Practice makes perfect""!Simply by expressing ..."
4,My daughter seemed to be developing at a norma...,Good for you to know your daughter's friendshi...
...,...,...
2804,I went to my ex-boyfriend to reach out to one ...,Your compassionate reach out to the friend is ...
2805,"Last year, I just always felt hopeless. I don'...",I am so sorry about your loss. Losing someone...
2806,I recently went through a divorce. My ex-husba...,"Unfortunately, I can't tell you what your sist..."
2807,My grandson's step-mother sends him to school ...,Absolutely not! It is never in a child's best ...


In [None]:
test_df

Unnamed: 0,Context,Response
0,I've hit my head on walls and floors ever sinc...,The best way to handle anxiety of this level i...
1,Over a year ago I had a female friend. She tur...,We women really do tend to struggle with the c...
2,"My long-distance girlfriend is in a sorority, ...",You may already be doing as much as possible f...
3,Cheating is something unacceptable for me but ...,It is completely understandable that you are s...
4,I have twin toddlers. I experienced a death of...,"First, let me say that you are a survivor and ..."
...,...,...
698,We have been together over a year. We spend ti...,"Hello, and thank you for your question. I am v..."
699,I read that you should ignore them and they ha...,It is not correct because someone who is narci...
700,I'm depressed. I have been for years. I hide i...,"Hi Georgia, There's a really good lesson here...."
701,I was violently raped by another women who was...,I'm sorry for your suffering.There are therapy...


In [None]:
df, test_df = train_test_split(whole_df, test_size=0.2, random_state=42)

In [None]:
# Evaluation


# 10 rows only
test_df = test_df.head(10)
# Function to format the test dataset (without including the expected response)
def format_test_dataset(df, custom_instruction=None):
    formatted_data = {
        "instruction": [],
        "input": [],
        "output": []
    }

    for index, row in df.iterrows():
        # Generate the instruction dynamically (therapist perspective)
        instruction = generate_instruction(row['Context'], custom_instruction)
        user_input = row['Context']
        response = ""  # Leave blank for test/evaluation

        # Add the formatted data
        formatted_data["instruction"].append(instruction)
        formatted_data["input"].append(user_input)
        formatted_data["output"].append(response)  # Empty string

    return pd.DataFrame(formatted_data)

# Call the formatting function for the test dataset
formatted_test_df = format_test_dataset(test_df, custom_instruction = None)


model = FastLanguageModel.for_inference(model)  # Enable faster inference
model = model.to('cuda')

import re

# Function to extract and clean the "Response" part from the generated text
def extract_response(generated_text):
    try:
        # Find the start of the response section
        response_start = generated_text.find("### Response:")

        if response_start != -1:
            response_start += len("### Response:")
            response = generated_text[response_start:].strip()
        else:
            response = generated_text.strip()

        # Remove occurrences of '://'
        response = response.replace('\n://', '').replace('://', '').strip()

        # Remove any extra spaces or unwanted special characters
        response = re.sub(r"\s+", " ", response).strip()

        return response
    except Exception as e:
        print(f"Error extracting response: {e}")
        return generated_text.strip()



# Updated generate_responses function
def generate_responses(model, tokenizer, df, batch_size=8):
    generated_responses = []

    for i in range(0, len(df), batch_size):
        batch_df = df.iloc[i:i+batch_size]
        instructions = batch_df['instruction'].tolist()
        inputs = batch_df['input'].tolist()

        # Format the prompts for the model
        prompts = [alpaca_prompt.format(instruction, user_input,'') for instruction, user_input in zip(instructions, inputs)]

        # Tokenize the input
        inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to("cuda")

        # Generate responses in batches
        outputs = model.generate(**inputs, max_new_tokens=400)

        # Decode and process the responses
        for output in outputs:
            generated_text = tokenizer.decode(output, skip_special_tokens=True)
            # Format the generated response
            response = extract_response(generated_text)
            generated_responses.append(response)

        # Optional: Log progress
        if i % 100 == 0:
            print(f"Processed {i}/{len(df)} rows")

        # Clear GPU cache periodically
        torch.cuda.empty_cache()

    return generated_responses


# Generate predictions for the test dataset and clean up the responses
test_responses = generate_responses(model, tokenizer, formatted_test_df)

# Add the generated responses to the DataFrame for comparison
formatted_test_df['generated_response'] = test_responses


Processed 0/5 rows
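
To see what the response extraction does to one decoded output, here is a standalone sketch. The sample text is made up, and `split_response` is a hypothetical, simplified stand-in for `extract_response` above (it omits the `://` cleanup):

```python
import re

RESPONSE_MARKER = "### Response:"

def split_response(generated_text):
    """Return only the text after the response marker, with whitespace collapsed."""
    _, found, tail = generated_text.partition(RESPONSE_MARKER)
    # Fall back to the full text when the marker is missing
    response = tail if found else generated_text
    return re.sub(r"\s+", " ", response).strip()

# Made-up decoded output: the prompt is echoed back before the generated answer
sample = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\nYou are a therapist.\n\n"
    "### Input:\nI'm stressed, help me out\n\n"
    "### Response:\nTry a short walk\nand some slow breathing."
)

print(split_response(sample))  # Try a short walk and some slow breathing.
```

Without this step, the echoed instruction and input would be scored against the reference answers and pull the metrics down.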


In [None]:
test_responses

['Hi, it sounds like you have a lot of anxiety and you have a history of trauma. I would recommend that you work with a therapist who can help you to process the trauma and also to help you with your anxiety. A therapist can help you to understand what is going on and also to help you to change your behavior.',
 'I would suggest that you try to explain to your girlfriend that you are not attracted to women, and that you have no interest in being friends with the other woman. If your girlfriend continues to be suspicious, then it might be a good idea to seek counseling for the two of you.',
 "There is no easy answer to this question. The relationship is a two way street and both of you have to be willing to make changes. If you are feeling that your girlfriend is changing and not for the better, then you need to have a conversation with her. Let her know how you feel and why. If you are feeling that your girlfriend is not listening to you, then you need to decide what you want to do. Do

In [None]:
!pip install evaluate
!pip install rouge_score

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.0/84.0 kB 3.3 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=14de0d3348042399f71f7844e7f9ea11c8dcf8c2bb4a6969a902077d5bee35ff
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
from evaluate import load

# Load ROUGE evaluation metrics
rouge = load('rouge')

# Compute ROUGE scores
rouge_score = rouge.compute(predictions=test_responses, references=test_df['Response'].tolist())

print("ROUGE Score:", rouge_score)


ROUGE Score: {'rouge1': 0.3154497935058252, 'rouge2': 0.06216951765474543, 'rougeL': 0.15110666680191107, 'rougeLsum': 0.1516970506530294}
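
The `rouge1` value is a unigram-overlap score. Below is a simplified sketch of the idea, assuming lowercase alphanumeric tokenization and F1 aggregation. `rouge1_f1` is a hypothetical helper, not the `evaluate` implementation, which may apply further options such as stemming:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep only alphanumeric word pieces
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1_f1(prediction, reference):
    """F1 over unigrams shared by prediction and reference (counts are clipped)."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "Talk to a therapist about your anxiety",
    "A therapist can help with your anxiety",
)
print(round(score, 4))  # 0.5714
```

Because the score only counts shared words, two answers that give the same advice in different words can still score low, which is one reason to report an embedding-based metric alongside it.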


In [None]:
!pip install bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.1/61.1 kB 3.4 MB/s eta 0:00:00
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


In [None]:
# Compute BERTScore
bertscore = load('bertscore')
bertscore_results = bertscore.compute(predictions=test_responses, references=test_df['Response'].tolist(), lang = 'en')


print("BERTScore:", bertscore_results)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore: {'precision': [0.8905251026153564, 0.8775880336761475, 0.8559650182723999, 0.8401226997375488, 0.8670701384544373], 'recall': [0.8430310487747192, 0.7995772361755371, 0.8315000534057617, 0.8259008526802063, 0.8353158235549927], 'f1': [0.8661274313926697, 0.836768388748169, 0.8435551524162292, 0.8329510688781738, 0.8508968353271484], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.44.2)'}
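
BERTScore returns one precision, recall, and F1 value per example rather than a single aggregate. To get one summary number per metric, the lists can be averaged. A small sketch using the values printed above:

```python
from statistics import mean

# Per-example values copied from the BERTScore output above
bertscore_results = {
    "precision": [0.8905251026153564, 0.8775880336761475, 0.8559650182723999,
                  0.8401226997375488, 0.8670701384544373],
    "recall": [0.8430310487747192, 0.7995772361755371, 0.8315000534057617,
               0.8259008526802063, 0.8353158235549927],
    "f1": [0.8661274313926697, 0.836768388748169, 0.8435551524162292,
           0.8329510688781738, 0.8508968353271484],
}

# Average each metric over the evaluated examples
summary = {name: mean(values) for name, values in bertscore_results.items()}

for name, value in summary.items():
    print(f"mean {name}: {value:.4f}")
```

With only five examples the averages are noisy, so they are best read as a sanity check rather than a benchmark result.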


**[NOTE]** Evaluation scores will be low if the decoded model output still contains the instruction and user input along with the response. Split the output and keep only the response part (as `extract_response` does) before comparing it with the true responses.

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Strongly NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
save_path = '/content/drive/My Drive/Model'

# Merge to 16bit
if True: model.save_pretrained_merged(save_path, tokenizer, save_method="merged_16bit")

# if False: model.push_to_hub_merged("hf/Llama_mentalchatbot", tokenizer, save_method = "merged_16bit", token = "")

# # Merge to 4bit
# if False: model.save_pretrained_merged("/content/Mentalchatbot", tokenizer, save_method = "merged_4bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# # Just LoRA adapters
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

Mounted at /content/drive


Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 4.64 out of 12.67 RAM for saving.


 34%|███▍      | 11/32 [00:00<00:01, 15.26it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:38<00:00,  3.07s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving /content/drive/My Drive/Model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving /content/drive/My Drive/Model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving /content/drive/My Drive/Model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving /content/drive/My Drive/Model/pytorch_model-00004-of-00004.bin...
Done.
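
Before pointing vLLM at the saved folder, it can be worth checking that every shard actually reached Google Drive. The sketch below assumes the `pytorch_model-0000N-of-0000M.bin` naming shown in the log above (or the equivalent `model-0000N-of-0000M.safetensors` names). `find_missing_shards` is a hypothetical helper written for this example, not part of Unsloth, and it only inspects file names, not file contents.

```python
import re
from pathlib import Path

# Matches sharded checkpoint names such as pytorch_model-00001-of-00004.bin
SHARD_PATTERN = re.compile(
    r"^(?:pytorch_model|model)-(\d{5})-of-(\d{5})\.(?:bin|safetensors)$"
)

def find_missing_shards(save_dir):
    """Return the sorted shard indices that are absent from save_dir.

    Hypothetical helper: the expected shard count is read from the
    "-of-0000M" part of the file names that are present.
    """
    found, total = set(), 0
    for path in Path(save_dir).iterdir():
        match = SHARD_PATTERN.match(path.name)
        if match:
            found.add(int(match.group(1)))
            total = max(total, int(match.group(2)))
    return sorted(set(range(1, total + 1)) - found)

# Example usage (save_path is the folder used in the cell above):
# missing = find_missing_shards(save_path)
# print("All shards present" if not missing else f"Missing shards: {missing}")
```

An empty list means no shard named in the files is missing. A folder holding a single unsharded checkpoint also returns an empty list, since there are no shard names to compare.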


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
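
To get a feel for the trade-off between these methods, here is a rough size estimate. The bits-per-weight figures are approximate assumptions for illustration (they are not reported by Unsloth and vary by model architecture), and `estimate_gguf_size_gb` is a hypothetical helper that ignores metadata overhead.

```python
# Approximate bits per weight for each quantization method.
# These values are assumptions for illustration and vary between models.
APPROX_BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q5_k_m": 5.7,
    "q4_k_m": 4.85,
}

def estimate_gguf_size_gb(num_parameters, quantization_method):
    """Rough GGUF file size in GB (10^9 bytes), ignoring metadata overhead."""
    bits = APPROX_BITS_PER_WEIGHT[quantization_method]
    return num_parameters * bits / 8 / 1e9

# Compare the methods for a model with roughly 8 billion parameters
for method in ("f16", "q8_0", "q5_k_m", "q4_k_m"):
    size_gb = estimate_gguf_size_gb(8e9, method)
    print(f"{method:>7}: ~{size_gb:.1f} GB")
```

Treat the numbers as a guide for checking free disk space before exporting, not as exact file sizes.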

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).
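
If you are not sure where the export ended up, a small sketch like the one below lists the `.gguf` files in a folder together with their sizes. The folder to search is an assumption (the current working directory here), and `list_gguf_files` is a hypothetical helper, not an Unsloth function.

```python
from pathlib import Path

def list_gguf_files(directory="."):
    """Return (file name, size in GB) for each .gguf file directly inside directory."""
    return [
        (path.name, path.stat().st_size / 1e9)
        for path in sorted(Path(directory).glob("*.gguf"))
    ]

# Search the current working directory; change this if you saved elsewhere
for name, size_gb in list_gguf_files("."):
    print(f"{name}: {size_gb:.2f} GB")
```

The path of the file you pick can then be passed to `llama.cpp` through its `-m` model option, or opened from the model picker of a UI such as GPT4All.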

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

And we're done! If you have any questions about Unsloth, find any bugs, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>