## This notebook loads the **Llama 3.1** base model and the trained **LoRA** adapter
This exploratory LoRA adapter was trained with love by **David E. Girges** <br>

### Install Unsloth and Necessary Libraries

In [None]:
%%capture
!pip install "unsloth[colab-new]@git+https://github.com/unslothai/unsloth.git"
!pip install bitsandbytes

### Imports and Optional Hugging Face Login

In [3]:
from unsloth import FastLanguageModel
import torch
from transformers import AutoTokenizer, TextStreamer # TextStreamer is for streaming output
from huggingface_hub import login # Optional, for Hugging Face Hub authentication

# login()  # Uncomment only if the adapter repository is private or gated



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


### Define Model and LoRA Adapter Parameters

In [5]:

base_model_name = "unsloth/Meta-Llama-3.1-8B-Instruct"

max_seq_length = 2048
dtype = None
load_in_4bit = True


lora_adapter_path_on_hub = "DavidElks/LamaFineTuned"


print(f"Will load LoRA adapter from: {lora_adapter_path_on_hub}")

Will load LoRA adapter from: DavidElks/LamaFineTuned
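
A rough way to see why `load_in_4bit = True` matters on a small GPU is to estimate the memory needed for the weights alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: the parameter count of 8e9 is an assumption read off the "8B" in the model name, and real usage is higher because of activations, the KV cache, and any layers kept in higher precision.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Assumption: nominal parameter count implied by "8B" in the model name.
N_PARAMS = 8e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone would exceed the 14.741 GB reported for the Tesla T4 in the load log, while the 4-bit weights leave room for inference.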


### Load the Base Model using Unsloth

In [6]:
print(f"Loading base model: {base_model_name}...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
print("Base model loaded successfully.")

Loading base model: unsloth/Meta-Llama-3.1-8B-Instruct...
==((====))==  Unsloth 2025.5.8: Fast Llama patching. Transformers: 4.52.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Base model loaded successfully.


### Applying LoRA Adapter from Hugging Face Hub

In [7]:
print(f"Applying LoRA adapter from: {lora_adapter_path_on_hub}...")
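try:
    # Attach the adapter weights from the Hub to the already-loaded base model
    model.load_adapter(lora_adapter_path_on_hub)
    print("LoRA adapter applied successfully.")
except Exception as e:
    print(f"Error applying LoRA adapter: {e}")
    print("Please ensure the `lora_adapter_path_on_hub` is correct and the model is public or you are logged in.")
    raise  # Stop here instead of silently continuing with the base model only

# Switch the model to Unsloth's inference mode
FastLanguageModel.for_inference(model)
print("Unsloth fast inference kernels enabled.")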

Applying LoRA adapter from: DavidElks/LamaFineTuned...


adapter_config.json:   0%|          | 0.00/881 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

LoRA adapter applied successfully.
Unsloth fast inference kernels enabled.
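
Conceptually, a LoRA adapter adds a low-rank correction to a frozen weight matrix: `W' = W + (alpha / r) * B @ A`. The toy sketch below illustrates that formula in plain Python. The matrices, `alpha`, and `r` are hypothetical values chosen for illustration; the rank and scaling of the adapter used in this notebook are not shown here, and `load_adapter` typically keeps the adapter as separate layers rather than merging it as this sketch does.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the effective weight with LoRA applied."""
    delta = matmul(B, A)   # low-rank update with the same shape as W
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

# Hypothetical toy values: a 2x2 frozen weight and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]       # shape (d_out, r)
A = [[0.5, -0.5]]        # shape (r, d_in)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

Because only `A` and `B` are stored, the adapter file can be far smaller than the base model, which is consistent with the 168M adapter download above versus the 5.96G base checkpoint.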


### Prepare Model and Tokenizer for Inference

In [8]:
model.eval()

from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

print("Model and tokenizer are ready for inference.")

Model and tokenizer are ready for inference.
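
To make the role of the chat template concrete, here is a simplified sketch of the kind of markup a Llama 3 style template produces from a list of messages. The function `render_llama3_prompt` is hypothetical and written only for illustration; the template actually installed by `get_chat_template` may differ, for example by adding a system header, so inspect `tokenizer.chat_template` for the exact form.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Simplified Llama 3 style chat markup (assumption: no system or date header)."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model that it should answer next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = render_llama3_prompt([{"role": "user", "content": "Hello"}])
print(prompt)
```

This is what `add_generation_prompt = True` controls in `apply_chat_template`: the prompt ends with an open assistant turn so that generation continues as the assistant's reply.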


### Run Inference with an Example Prompt

In [12]:
# A sample question from the training dataset
user_question = "What is the efficacy of metformin in treating polycystic ovary syndrome (PCOS) compared to lifestyle modifications?"

messages = [
    {"role": "user", "content": user_question},
]


inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda" if torch.cuda.is_available() else "cpu") # Send inputs to the GPU if available

# For streaming the output token by token (like ChatGPT)
text_streamer = TextStreamer(tokenizer, skip_prompt = True, skip_special_tokens = True)

print(f"\nUser Question: {user_question}")
print("Generating response (streaming):")

# Generate the response
with torch.no_grad(): # Disable gradient calculations for inference
    outputs = model.generate(
        input_ids = inputs,
        streamer = text_streamer,
        max_new_tokens = 512,
        use_cache = True,
        do_sample = True, # Make sampling explicit; temperature and top_p only apply when sampling
        temperature = 0.6,
        top_p = 0.9,
        eos_token_id = tokenizer.eos_token_id,
        pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    )

print("\n--- Generation Complete ---")




User Question: What is the efficacy of metformin in treating polycystic ovary syndrome (PCOS) compared to lifestyle modifications?
Generating response (streaming):
Metformin is effective in treating PCOS compared to lifestyle modifications. Metformin significantly reduces androgen levels and improves clinical and biochemical parameters.

--- Generation Complete ---
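
The `temperature` and `top_p` arguments shape how each next token is drawn. The sketch below is a minimal pure-Python illustration of temperature scaling followed by nucleus (top-p) sampling over a toy logit vector. It is not the library's implementation; the function `sample_top_p` and the logits are hypothetical.

```python
import math
import random

def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample one index using temperature scaling and nucleus (top-p) filtering."""
    # Lower temperature sharpens the distribution before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

rng = random.Random(0)             # seeded for repeatability
logits = [4.0, 3.0, 0.0, -2.0]     # hypothetical scores for a 4-token vocabulary
draws = [sample_top_p(logits, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

With these toy logits, the two most likely tokens already cover more than 90% of the probability mass, so the two unlikely tokens are never sampled.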
