<a href="https://colab.research.google.com/github/RubinThomas75/epfLLM-eval/blob/main/notebooks/1_load_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install Dependencies
!pip install transformers==4.47.1  # or your preferred version
!pip install torch                # or 'torch==2.0.0' for a specific version
!pip install accelerate         # (Optional) for efficient inference on multi-GPU



In [2]:
import os
from google.colab import drive

drive.mount('/content/drive')
token_file_path = "/content/drive/MyDrive/hf_read_token.txt"

with open(token_file_path, "r", encoding="utf-8-sig") as f:
    token = f.read().strip()

os.environ["HF_TOKEN"] = token

print("Hugging Face token loaded.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Hugging Face token loaded.


In [3]:
# Load EPFL LLM (Meditron)

# Change model_name below to the exact Hugging Face repo id if you use a different checkpoint

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "epfl-llm/meditron-7b"
cache_dir = "/content/drive/MyDrive/epfLLM_meditron7b"

# Load tokenizer
print(f"Loading tokenizer for {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    token=os.environ["HF_TOKEN"]  # `use_auth_token` is deprecated in favor of `token`
)

# Load model
print(f"Loading model for {model_name}...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    token=os.environ["HF_TOKEN"]  # `use_auth_token` is deprecated in favor of `token`
)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Model loaded on device: {device}")

Loading tokenizer for epfl-llm/meditron-7b...




Loading model for epfl-llm/meditron-7b...




Model loaded on device: cpu
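
The `cpu` device reported above means the session had no GPU attached, which matters for a model of this size. As a rough back-of-the-envelope sketch, the weight memory per dtype can be estimated as below. It assumes about 7 billion parameters (inferred from the model name, not measured) and counts weights only; the helper name `estimate_weight_memory_gb` is hypothetical.

```python
# Rough weight-memory estimate. The parameter count is an assumption taken
# from the "7b" in the model name, not a measured value.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores activations, the KV cache, and framework overhead.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


ASSUMED_NUM_PARAMS = 7e9  # assumption: roughly 7 billion parameters

for dtype in ("float32", "float16"):
    gb = estimate_weight_memory_gb(ASSUMED_NUM_PARAMS, dtype)
    print(f"{dtype}: ~{gb:.0f} GB of weights")
```

Under these assumptions the default float32 load needs roughly twice the memory of a half-precision load, so on a GPU runtime it may be worth passing `torch_dtype=torch.float16` to `from_pretrained`.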


In [None]:
# Quick Test Inference

input_text = "What is the common treatment for a headache?"

input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

model.config.pad_token_id = model.config.eos_token_id

# The prompt has no padding, so the attention mask is all ones
attention_mask = torch.ones_like(input_ids)

# Generate the response with beam search
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=200,          # Cap on total length (prompt plus generated tokens)
        num_beams=5,             # Beam search; increase for a more exhaustive search
        early_stopping=True,
        no_repeat_ngram_size=2,  # Never repeat the same bigram
        # top_p / top_k are omitted: they only take effect when do_sample=True
    )

# Decode and print the output
decoded_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("\n=== Model Response ===")
print(decoded_output)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



=== Model Response ===
What is the common treatment for a headache?
The most common treatments for headaches are nonsteroidal anti-inflammatory drugs (NSAIDs), acetaminophen (Tylenol), and opioid analgesics.

### What are the side effects of these medications, and what should I do if I experience them? 
Side effects include gastrointestinal (GI) upset, drowsiness, nausea, vomiting, constipation, or diarrhea. If you experience these symptoms, stop taking the medication and contact your health care provider for advice. Do not take more than the recommended dose or for longer than prescribed, as this may increase the risk of serious adverse effects, including GI bleeding, liver damage, kidney failure, heart attack, stroke, respiratory depression, addiction, overdose,


In [22]:
import json
import random
import sys

sys.path.append('/content/epfLLM-eval')  # make the repo's utilities importable
import prompt_templates

file_path = '/content/drive/MyDrive/extracted_questions/dev.json'

sample_size = 100
extracted_data = []

# dev.json holds one JSON object per line; read every line, then sample
with open(file_path, 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Guard against files with fewer than sample_size lines
sampled_lines = random.sample(lines, min(sample_size, len(lines)))

for line in sampled_lines:
    try:
        data = json.loads(line.strip())
        extracted_data.append({
            'question': data['question'],
            'A': data['opa'],
            'B': data['opb'],
            'C': data['opc'],
            'D': data['opd'],
            'correct_answer': data['cop']
        })
    except json.JSONDecodeError as e:
        print(f"Skipping invalid line: {e}")

# Build a prompt for each sampled question and print the model's response.
# generate_response is defined in the cell labelled In [21], which was run before this one.
for data in extracted_data:
    prompt = prompt_templates.prepare_prompt(data)
    print(generate_response(prompt))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer the following multiple-choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, or D. Think step by step before answering.

Example 1:
Q: What is the first-line treatment for hypertension?
A) Diuretics
B) Beta-blockers
C) ACE inhibitors
D) Calcium channel blockers
Answer: A

Example 2:
Q: What is the treatment for type 2 diabetes?
A) Insulin therapy
B) Metformin
C) Statins
D) Diuretics
Answer: B

Now, answer the following question:
Q: All the following are features of Addison's disease, EXCEPT:
A) Hypoglycemia
B) Hypocalcaemia
C) Hypotension
D) Hyponatremia
Answer: C


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer the following multiple-choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, or D. Think step by step before answering.

Example 1:
Q: What is the first-line treatment for hypertension?
A) Diuretics
B) Beta-blockers
C) ACE inhibitors
D) Calcium channel blockers
Answer: A

Example 2:
Q: What is the treatment for type 2 diabetes?
A) Insulin therapy
B) Metformin
C) Statins
D) Diuretics
Answer: B

Now, answer the following question:
Q: Which of the following is false about Transfusion-Related Acute Lung Injury?
A) Develops within 24 hours
B) Mostly seen after sepsis and cardiac surgeries
C) It's a cause of non-cardiogenic pulmonary edema
D) Plasma is more likely to cause it than whole blood
Answer: C


KeyboardInterrupt: 
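
The prompt asks the model to end with `Answer: $LETTER`, and the decoded text printed above echoes the whole prompt, including the answers of the two few-shot examples. Scoring therefore has to read the last `Answer:` match rather than the first. The following is a minimal sketch of that step, not part of the repo: the helper names (`extract_answer_letter`, `index_to_letter`, `accuracy`) are hypothetical, and it assumes `correct_answer` (the `cop` field) is a 1-based option index, which has not been verified against `dev.json`. Pass `index_base=0` if the field turns out to be 0-based.

```python
import re
from typing import List, Optional

# Matches "Answer: A" ... "Answer: D"; the literal "$LETTER" in the
# instruction line does not match because "$" is not an option letter.
ANSWER_PATTERN = re.compile(r"Answer:\s*([ABCD])\b")


def extract_answer_letter(response: str) -> Optional[str]:
    """Return the letter of the LAST 'Answer: X' in the text, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def index_to_letter(cop: int, index_base: int = 1) -> str:
    """Map an option index to its letter (assumption: 1-based by default)."""
    offset = cop - index_base
    if not 0 <= offset < 4:
        raise ValueError(f"option index {cop} out of range for base {index_base}")
    return "ABCD"[offset]


def accuracy(responses: List[str], golds: List[int], index_base: int = 1) -> float:
    """Fraction of responses whose final answer letter matches the gold index."""
    if not responses:
        return 0.0
    correct = sum(
        extract_answer_letter(r) == index_to_letter(g, index_base)
        for r, g in zip(responses, golds)
    )
    return correct / len(responses)
```

To use it, collect the strings returned by `generate_response` in a list alongside each item's `correct_answer`, then call `accuracy(responses, golds)`. A response without a parseable letter counts as wrong.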

In [21]:
import torch

torch.set_num_threads(4)  # Limit PyTorch to 4 CPU threads

# Use EOS as the padding token during generation
model.config.pad_token_id = model.config.eos_token_id

def generate_response(prompt):
    """Return the prompt followed by one greedily decoded token."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    attention_mask = torch.ones_like(input_ids)  # no padding, so attend to every token

    # A single new token with greedy decoding keeps CPU inference fast
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1,   # intended to yield just the answer letter
            num_beams=1,        # no beam search
            do_sample=False,    # greedy decoding; top_p / top_k would have no effect here
        )
    # The decoded text contains the full prompt plus the new token
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

In [8]:
%cd /content/epfLLM-eval

/content/epfLLM-eval


In [9]:
!git pull origin main  # Replace 'main' with your branch name if it's different


remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 456 bytes | 456.00 KiB/s, done.
From https://github.com/RubinThomas75/epfLLM-eval
 * branch            main       -> FETCH_HEAD
   f17a2a0..3365be4  main       -> origin/main
Updating f17a2a0..3365be4
Fast-forward
 prompt_templates.py | 8 ++++++++
 1 file changed, 8 insertions(+)
