# FINE TUNING 
As mentioned before, fine-tuning large language models (LLMs) adapts a model to specific tasks, domains, or user requirements. Depending on the size of the model and the fine-tuning dataset, the process can take a significant amount of time and demand high-performance GPUs to handle the computational load. Fortunately, there are ways to make the task more computationally efficient. One family of methods is Parameter-Efficient Fine-Tuning (PEFT), which trains only a small subset of parameters while keeping the weights of the original pretrained LLM frozen, significantly reducing the computational cost.

One PEFT method is LoRA, which stands for Low-Rank Adaptation. This technique introduces trainable rank decomposition matrices (matrices A and B in the image below) within the transformer architecture, which reduces the number of trainable parameters for the downstream task while keeping the pretrained weights frozen. The method assumes that large matrices, especially in high-dimensional spaces, often carry redundant information. Hence, a much smaller, more "parameter-efficient" pair of matrices can capture the important attributes of the data during training.

![LoRA_diagram.png](./LoRA_diagram.png "LoRA_diagram.png")

So we should be able to go ahead and start fine-tuning the model, right? Unfortunately, there is another step we must consider before proceeding. In most fine-tuning cases, you may be limited to a single GPU, so we need to make the fine-tuning method even more efficient. This is where Quantized Low-Rank Adaptation (QLoRA) comes in.

QLoRA extends LoRA by quantizing the weights of the pretrained LLM to 4-bit precision. Parameters of trained models are typically stored in a 16-bit or 32-bit floating-point format; QLoRA compresses them to a 4-bit format. This reduces the memory footprint of the LLM, making it possible to fine-tune it on a single GPU.
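To make the memory saving concrete, here is a rough back-of-the-envelope sketch. The parameter count is an assumed round number of one billion chosen for illustration, not the exact size of any model, and the estimate covers only the stored weights (no activations, gradients, or optimizer state).

```python
# Rough weight-memory estimate at different precisions.
# n_params is an assumed round number for illustration, not an exact count.
n_params = 1_000_000_000


def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("32-bit", 32), ("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions, going from 32 bits to 4 bits per weight shrinks the stored weights by a factor of eight.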

## Pre-requisites

Before continuing, you need a Hugging Face account. If you head to https://huggingface.co/, you should be able to create one.

Next you will need access to Llama 3.2 1B, which is the model we will use for this task. Use the link: https://huggingface.co/meta-llama/Llama-3.2-1B. We are going to fine-tune a base model (a model that does not understand instructions) to understand instructions!

Once you reach the website, complete the required form (Do not mention that you are affiliated with Accenture! Use a random university maybe)

Once you have your Hugging Face account, create an access token to use. Head to your profile on the top right of the page and select "Access Tokens". Once created, you can store it in a notepad on your local machine.

## Install and import libraries
Let's install and import the required dependencies:

In [None]:
!pip install transformers datasets bitsandbytes peft trl accelerate torch
# !pip install "unsloth[cu124-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting peft
  Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Collecting trl
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Downloading huggingface_hub-0.29.2-py3-none-any.whl.metadata (13 kB)
Collecting accelerate
  Downloading accelerate-1.4.0-py3-none-any.whl.metadata (19 kB)
Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting transformers
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)

In [None]:
%restart_python

In [None]:
import os
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset, Dataset, load_from_disk  # load datasets from Hugging Face or from disk
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForSeq2Seq)
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig


2025-03-10 22:04:39.050290: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1741644279.059846    2360 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1741644279.063345    2360 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-10 22:04:39.076696: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Update the package manager in the OS and install the libaio-dev package

**Note: This is only required if you do not have the libaio package. Run it anyway and see what happens.**

In [None]:
command = "apt update"
os.system(command=command)





Get:1 https://repos.azul.com/zulu/deb stable InRelease [5,289 B]
Get:2 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB]
Get:3 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB]
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease [1,581 B]
Get:5 https://repos.azul.com/zulu/deb stable/main amd64 Packages [378 kB]
Get:6 https://repos.azul.com/zulu/deb stable/main arm64 Packages [233 kB]
Get:7 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Get:8 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [841 kB]
Get:9 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
Get:10 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1,808 kB]
Get:11 http://security.ubuntu.com/ubuntu noble-security/main amd64 Components [10.1 kB]
Get:12 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1,063 kB]
Get:13 http://security.ubuntu.com/ubuntu noble-security/unive

W: https://repos.azul.com/zulu/deb/dists/stable/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.


0

In [None]:
command_2 = "apt-get -y install libaio-dev g++"
os.system(command=command_2)

## Clear the GPU memory (just in case)

In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import gc
gc.collect()
torch.cuda.empty_cache()

## Assign environment variables

In [None]:
## Add your HF token 
os.environ['HF_TOKEN'] = "Insert HF Token"

## Reduce VRAM usage by reducing fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
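## Assign where to cache downloaded models and datasets
## (the Hugging Face libraries read these when they are imported, so set them before those imports if they do not take effect)
os.environ["HF_HOME"] = "/dbfs/huggingface/"
os.environ["HF_DATASETS_CACHE"] = "/dbfs/huggingface/.datasets_cache"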

## Just make sure you release unused memory in the GPU
torch.cuda.empty_cache()

## Load the dataset

In [None]:

# load the preprocessed datasets using the load_from_disk function
training_dataset = load_from_disk("training_data").rename_columns({"prompt": "text"})
evaluation_dataset = load_from_disk("evaluation_data").rename_columns({"prompt": "text"})

## Load the model 
We will use the bitsandbytes library to create a configuration that loads a quantized version of the Llama 3.2 1B base model. It's important to remember that although we quantize the model's weights to 4 bits, the LoRA matrices are kept in 16-bit precision so that the learning process does not lose detail. In the forward pass, the frozen quantized weights are dequantized on the fly to the compute dtype, and the adapter output is added on top.
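To give a feel for what storing weights in 4 bits and dequantizing them on the fly means, here is a minimal NumPy sketch. It uses a simple absmax scheme with evenly spaced levels, which is an assumption made for readability; it is not the NF4 data type or the double quantization selected in the configuration below, and the function names are hypothetical.

```python
import numpy as np


def quantize_absmax_4bit(weights: np.ndarray):
    """Map float weights to signed integer codes in [-7, 7] plus one float scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return codes, scale


def dequantize(codes: np.ndarray, scale: float, dtype=np.float16) -> np.ndarray:
    """Recover approximate weights in the chosen compute dtype."""
    return (codes.astype(np.float32) * scale).astype(dtype)


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=64).astype(np.float32)  # made-up weights

codes, scale = quantize_absmax_4bit(w)
w_hat = dequantize(codes, scale)

max_error = float(np.max(np.abs(w - w_hat.astype(np.float32))))
print("distinct codes used:", len(np.unique(codes)))
print("max abs error:", max_error)
```

Each weight is stored as one of at most 15 integer codes, and the rounding error is bounded by roughly half the scale. The 16-bit LoRA matrices are what let training compensate for this loss of precision.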

In [None]:
use_4bit = True
# Compute dtype for 4-bit base models. 
bnb_4bit_compute_dtype = torch.float16
# Quantization process (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_double_quant = True

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    bnb_4bit_use_double_quant=use_double_quant,
)
## model name
model_name = "meta-llama/Llama-3.2-1B"
## download the model and load it in 4-bit format
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    token=os.environ['HF_TOKEN'],
    attn_implementation="flash_attention_2",
    use_cache=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, token = os.environ['HF_TOKEN'])

**Let's play with the base model a little and see what kind of outputs we get!**

In [None]:
text = "Who is at risk for Lymphocytic Choriomeningitis (LCM), list out in bullet points?"
tokenizer.pad_token_id = tokenizer.eos_token_id
inputs = tokenizer(text, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=True, temperature= 0.2)
outputs = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Who is at risk for Lymphocytic Choriomeningitis (LCM), list out in bullet points? What are the symptoms of LCM?
What are the treatment options for LCM?
What is the prognosis for LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the complications of LCM?
What are the

Notice that when we ask the base model who is at risk for LCM, it does not answer the question. It simply continues the text by predicting the most likely next words, which here produces more questions and then repeats itself. We want to leverage its ability to learn patterns and make it understand questions in a domain. In our case, that means answering specific medical questions completely.

## QLoRA variables




Now that we understand what LoRA is, let's dive into some practical aspects of it. When we fine-tune a language model with QLoRA, two new hyperparameters come into play:

1. Rank (r)

2.  Alpha (α)


Let's suppose that our original weight matrix W is 10,000 x 10,000. With LoRA, we leave W frozen and learn an update to it, expressed as the product of two smaller matrices, A and B, with dimensions 10,000 x 8 and 8 x 10,000, respectively. Multiplying A and B gives a matrix with the same 10,000 x 10,000 shape as W, which is added to W, but we only train the 160,000 values in A and B instead of the 100,000,000 values in W. In this example, 8 is the rank of the LoRA fine-tuning. We can choose any rank when setting up A and B. A higher rank means a greater number of trainable parameters in our model, making fine-tuning more memory intensive.

The alpha parameter is a scaling factor applied to the product of the matrices A and B: the update is multiplied by alpha divided by the rank. A common choice is to set alpha to twice the rank.
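The arithmetic above can be checked with a short NumPy sketch. The sizes 10,000 and 8 come from the example; the small 64 x 64 demo matrix, the random values, and the choice to zero-initialise one factor (so that the update starts at zero) are assumptions made for illustration.

```python
import numpy as np

d, r = 10_000, 8   # matrix size and rank from the example above
alpha = 2 * r      # alpha set to twice the rank

# Parameter counts: full matrix versus the two low-rank factors
full_params = d * d
lora_params = d * r + r * d
print(full_params, lora_params, full_params // lora_params)

# Small demo of the scaled update (64 x 64 keeps it cheap to run)
d_small = 64
rng = np.random.default_rng(0)
W = rng.normal(size=(d_small, d_small))     # frozen pretrained weights
A = rng.normal(size=(d_small, r)) * 0.01    # trainable factor
B = np.zeros((r, d_small))                  # trainable factor, zero at the start

delta_W = (alpha / r) * (A @ B)             # low-rank update, same shape as W
W_adapted = W + delta_W

# After training B is no longer zero; the update still has rank at most r
B_trained = rng.normal(size=(r, d_small))
update_rank = np.linalg.matrix_rank(A @ B_trained)
print(delta_W.shape, update_rank)
```

With these numbers the two factors hold 625 times fewer values than the full matrix, and however the factors are trained, the update they produce can never exceed rank 8.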

In [None]:
################################################################################
# QLoRA parameters
################################################################################
# LoRA attention dimension for the model. This is the rank of the LoRA projected matrix.
lora_r = 24
# Alpha parameter for LoRA scaling: a scaling factor that controls the magnitude of the weight changes added to the base model when fine-tuning.
lora_alpha = 48
# Dropout probability for LoRA layers
lora_dropout = 0.1
################################################################################

Let's create the LoRA configuration that we will add to our model.

1. lora_alpha = the alpha parameter

2. lora_dropout = the dropout parameter

3. task_type = causal language model, since the model is autoregressive

4. target_modules = the linear projection layers of the model that receive LoRA adapters


In [None]:
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
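As a rough check on how many parameters this configuration trains, the sketch below counts only the LoRA matrices. It assumes the standard Llama layer shapes and takes the dimensions from the `LlamaConfig` printed later in this notebook (`hidden_size`, `intermediate_size`, `num_hidden_layers`, `num_key_value_heads`, `head_dim`); the helper name `lora_param_count` is hypothetical. The number reported by the notebook itself may differ if other parameters are also left trainable.

```python
# Dimensions taken from the LlamaConfig printed later in this notebook
hidden_size = 2048
intermediate_size = 8192
num_hidden_layers = 16
num_key_value_heads = 8
head_dim = 64
rank = 24  # lora_r

kv_size = num_key_value_heads * head_dim  # output width of k_proj and v_proj

# (in_features, out_features) of each targeted linear layer, assuming standard Llama shapes
module_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_size),
    "v_proj": (hidden_size, kv_size),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """One LoRA pair holds rank * in_features + out_features * rank values."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, rank) for i, o in module_shapes.values())
total = per_layer * num_hidden_layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters add a little under 17 million trainable values, a small fraction of the base model.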

## Model & Tokenizer configuration

#### Model configuration involves adding a pad token to both the tokenizer and the model. This is to ensure that all input sequences in a batch have the same length. The tokenizer pads shorter inputs up to the length of the longest input in the batch.
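The padding behaviour described above can be sketched without any library. The function name `pad_batch`, the pad id of 0, and the token ids are made up for illustration; in the notebook the tokenizer does this work.

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad every sequence to the length of the longest one in the batch.

    Returns the padded token ids and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads = [pad_id] * n_pad
        if padding_side == "right":
            input_ids.append(list(seq) + pads)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
        else:
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


batch = [[11, 12, 13, 14], [21, 22], [31]]  # made-up token ids
ids, mask = pad_batch(batch, pad_id=0)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why reusing the end-of-sequence token as the pad token is workable.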

In [None]:
# The key/value cache only speeds up token-by-token generation; it is not needed during training, so we turn it off.
model.config.use_cache = False
# Add new beginning-of-sequence and end-of-sequence tokens to the tokenizer, and reuse the end-of-sequence token for padding
tokenizer_special_tokens_map = {'bos_token': '<|im_start|>',
 'eos_token': '<|im_end|>'}
tokenizer.add_special_tokens(tokenizer_special_tokens_map)
tokenizer.pad_token = tokenizer.eos_token 
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "right" 

# Resize the model's embedding matrix to match the new tokenizer size and mirror the special tokens in the model config
model.resize_token_embeddings(len(tokenizer))
model.config.eos_token = '<|im_end|>'
model.config.bos_token = '<|im_start|>'
model.config.eos_token_id = tokenizer.eos_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.pad_token_id = model.config.eos_token_id

# Add the LoRA adapter to the model architecture
model.add_adapter(peft_config, "adapter 1")


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


#### Let's check how many parameters we are actually fine-tuning!

In [None]:
print(sum(p.numel() for p in model.parameters() if p.requires_grad))

LlamaConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "meta-llama/Llama-3.2-1B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token": "<|im_start|>",
  "bos_token_id": 128256,
  "eos_token": "<|im_end|>",
  "eos_token_id": 128257,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pad_token_id": 128257,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modul

## Model training arguments

Let's define the training arguments for training our model:

1. **num_train_epochs:** The number of times the model goes through the entire training dataset.

2. **bf16/fp16:** The 16-bit floating-point format (bfloat16 or float16) to use for mixed-precision training.

3. **per_device_train_batch_size:** The number of samples processed per GPU in a single forward/backward pass. Larger batch sizes generally lead to more stable training, but they are limited by GPU memory. We can use gradient accumulation to effectively train on larger batches over multiple forward/backward passes before updating the model.

4. **per_device_eval_batch_size:** The batch size used for evaluation. It is usually the same as the training batch size.

5. **gradient_accumulation_steps:** Instead of updating the A and B matrices after each batch of data, the gradients from this many batches are accumulated before a single weight update is performed.

6. **max_grad_norm:** Limits the maximum norm of the gradients during training (gradient clipping) to prevent exploding gradients. Common values are 1, 3, 5, 8, 10.

7. **learning_rate:** Controls the size of the adjustments made to the A and B parameters (weights) at each optimization step. Too low and training is slow and may get stuck in local minima. Too high and training may become unstable or diverge, which degrades performance.

8. **weight_decay:** A regularization term added to the loss computation that penalizes large weights. It encourages the model to learn simpler and more generalizable features. A value of 0.01 usually works well.

9. **optim:** The optimizer updates the trainable parameters during training to minimize the loss function. In practice, 8-bit AdamW is strongly recommended: it performs about as well as its 32-bit version while using less memory. There are also paged versions, in which parts of the optimizer state are moved automatically between CPU and GPU.

10. **lr_scheduler_type:** Changes the learning rate during training, typically starting with a higher learning rate for rapid initial progress and decreasing it in later stages. Linear and cosine schedulers are the two most common options.

11. **warmup_steps:** The initial training period in which the learning rate ramps up from a low value, so that the newly added parameters (like the LoRA matrices) adjust gradually before training at the full learning rate.

12. **group_by_length:** Groups sequences of similar length into the same batch. This reduces the amount of padding, which saves memory and speeds up training.

13. **logging_steps:** Log every X update steps.
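Two of these settings are easier to see with numbers. The sketch below computes the effective batch size and a linear learning-rate schedule with warmup, using the batch size, accumulation steps, learning rate, and warmup steps from the settings cell in this notebook. The single-GPU assumption, the total of 100 optimizer steps, and the helper name `linear_schedule_lr` are illustrative assumptions, not values from the training run.

```python
per_device_train_batch_size = 5
gradient_accumulation_steps = 2
num_devices = 1  # assumption: a single GPU

# Number of samples that contribute to each weight update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print("effective batch size:", effective_batch_size)


def linear_schedule_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


peak_lr = 4e-4
warmup_steps = 2
total_steps = 100  # hypothetical number of optimizer steps

for step in (0, 1, 2, 51, 100):
    print(step, linear_schedule_lr(step, peak_lr, warmup_steps, total_steps))
```

With a warmup of only two steps the learning rate reaches its peak almost immediately, so nearly all of training happens on the decaying part of the schedule.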


In [None]:
output_dir = "./medical_model_results"
# Number of training epochs
num_train_epochs = 2

# Enable fp16/bf16 mixed-precision training. bf16 needs a GPU that supports it (for example an A100); otherwise keep fp16 True and bf16 False
fp16 = True
bf16 = False

# Batch size per GPU for training, i.e. how many samples the model goes through in a single forward and backward pass. The question-answer pairs are long, so keep this small enough to fit in GPU memory
per_device_train_batch_size = 5

# Batch size per GPU for evaluation
per_device_eval_batch_size = 5

# Number of batches to accumulate gradients over before each weight update
gradient_accumulation_steps = 2

# Enable gradient checkpointing (trades extra compute for lower memory; it only takes effect if passed to the training configuration)
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping). Limits the maximum magnitude of the gradient vector during training to prevent exploding gradients.
max_grad_norm = 1

# Initial learning rate (AdamW optimizer)
learning_rate = 4e-4

# Weight decay to apply to all layers except bias/LayerNorm weights. This is a regularization parameter that discourages large weights
weight_decay = 0.01

# Optimizer to use: 8-bit AdamW
optim = "adamw_8bit"
# optim = "paged_adamw_32bit"
# Learning rate schedule, how the learning rate decays over time
lr_scheduler_type = "linear"

# Steps for a linear warmup (from 0 to the learning rate)
warmup_steps = 2

# Log every X update steps
logging_steps = 3

# Group samples of similar length into the same batch
group_by_length = True

## Save checkpoints in the safetensors format (the training configuration sets save_safetensors=True directly)
safe_tensors = True

### Keep overwriting the output directory
overwrite_output_dir = True

## Create a training configuration using the training arguments:

In [None]:

training_args = SFTConfig(
        learning_rate= learning_rate,
        lr_scheduler_type= lr_scheduler_type,
        eval_strategy = "steps",
        per_device_train_batch_size= per_device_train_batch_size,
        per_device_eval_batch_size= per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        num_train_epochs= num_train_epochs,
        fp16= fp16,
        bf16= bf16,
        logging_steps= logging_steps,
        optim= optim,
        weight_decay= weight_decay,
        warmup_steps= warmup_steps,
        output_dir=output_dir,
        seed=0, 
        group_by_length= group_by_length, 
        do_eval = True, 
        overwrite_output_dir = overwrite_output_dir,
        max_grad_norm = max_grad_norm, 
        save_safetensors = True)



## Model Training 

We will use the SFTTrainer class to train the model based on the training configuration from the previous step

In [None]:
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument in recent TRL versions
    train_dataset=training_dataset,
    eval_dataset=evaluation_dataset,
    args=training_args)

trainer.train()



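| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 3 | 1.8081 | 2.440445 |
| 6 | 1.6439 | 1.827389 |
| 9 | 1.7122 | 1.727247 |
| 12 | 1.6047 | 1.655102 |
| 15 | 1.6285 | 1.610006 |
| 18 | 1.5667 | 1.575418 |
| 21 | 1.6817 | 1.552109 |
| 24 | 1.6568 | 1.536861 |
| 27 | 1.623 | 1.527722 |
| 30 | 1.6197 | 1.51746 |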


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

com.databricks.backend.common.rpc.CommandCancelledException

## Save the Model locally

In [None]:
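# Save the trained adapter weights and the tokenizer to the output directory, so they can be loaded from there for inference
trainer.save_model(output_dir)
# Save the trainer state (step counter and logged metrics)
trainer.save_state()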

## Clear the GPU memory

In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import gc
gc.collect()
torch.cuda.empty_cache()

## Import the model and infer

In [None]:
from peft import PeftModel, PeftConfig
# new_model = "meta-llama/Llama-3.2-1B"
# base_model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     low_cpu_mem_usage=True,
#     return_dict=True,
#     torch_dtype=torch.float16,
#     device_map= "auto",
# )

# ### merge the trained adapter with the base model 
# trial_model = PeftModel.from_pretrained(base_model, new_model)
# trial_model = trial_model.merge_and_unload()

output_dir = "./medical_model_results"
tokenizer = AutoTokenizer.from_pretrained(output_dir)
trial_model = AutoModelForCausalLM.from_pretrained(output_dir, load_in_4bit=True, device_map="auto")

In [None]:
text = "can you explain what haemetoma is and how it is treated?"
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = trial_model.generate(**inputs,temperature=0.1, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


can you explain what haemetoma is and how it is treated? i have a 2 year old dog that has a history of hemangiosarcoma and has been treated with a combination of chemotherapy and surgery. he has a history of a large hemangiosarcoma on his right side that was removed in 2011. he has also had a large hemangiosarcoma on his left side that was removed in 2012. he has had a large hemangiosarcoma on his right side that was removed in 2013. he has had a large hemangiosarcoma on his left side that was removed in 2014. he has had a large hemangiosarcoma on his right side that was removed in 2015. he has had a large hemangiosarcoma on his left side that was removed in 2016. he has had a large hemangiosarcoma on his right side that was removed in 2017. he has had a large hemangiosarcoma on his left
