# Baseline Evaluation: Zero-Shot Llama-3-8B

This notebook evaluates the **baseline performance** of Llama-3-8B on clinical trial outcome prediction, without any fine-tuning.

## Overview

We'll test how well a vanilla language model can predict clinical trial outcomes using only its pre-trained knowledge. This establishes our baseline to compare against fine-tuned performance.

## What We're Testing

- **Model:** Llama-3-8B (zero-shot, no fine-tuning)
- **Task:** Binary prediction (will the trial succeed? YES = 1, NO = 0)
- **Test Set:** 206 held-out questions
- **Evaluation:** Accuracy, confusion matrix, error analysis
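
The evaluation items above can be sketched with pandas. This is a minimal illustration, not the notebook's own analysis code: `summarize_predictions` is a hypothetical helper, the column names follow the `baseline_results.csv` layout used later in this notebook, and the toy rows are made up for demonstration.

```python
import pandas as pd

def summarize_predictions(results: pd.DataFrame) -> dict:
    """Accuracy, confusion matrix and YES rates, computed on parsed rows only."""
    parsed = results.dropna(subset=["predicted_answer"]).copy()
    parsed["predicted_answer"] = parsed["predicted_answer"].astype(int)
    confusion = pd.crosstab(
        parsed["true_answer"], parsed["predicted_answer"],
        rownames=["true"], colnames=["predicted"],
    ).reindex(index=[0, 1], columns=[0, 1], fill_value=0)
    return {
        "accuracy": float((parsed["true_answer"] == parsed["predicted_answer"]).mean()),
        "confusion": confusion,
        # An optimistic bias shows up as predicted_yes_rate > true_yes_rate
        "predicted_yes_rate": float((parsed["predicted_answer"] == 1).mean()),
        "true_yes_rate": float((parsed["true_answer"] == 1).mean()),
    }

# Toy data for illustration only (NOT the notebook's results)
toy = pd.DataFrame({
    "true_answer":      [0, 1, 1, 0, 1, 0],
    "predicted_answer": [1, 1, 1, 0, 1, None],  # last response was unparseable
})
summary = summarize_predictions(toy)
print(summary["confusion"])
print({k: v for k, v in summary.items() if k != "confusion"})
```

On real results, the same helper could be applied to `pd.read_csv("baseline_results.csv")` once the notebook has produced that file.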

## Requirements

- ⚠️ **GPU Required** - Use Google Colab with T4 runtime (free)
- Hugging Face account with Llama-3 access ([request here](https://huggingface.co/meta-llama/Meta-Llama-3-8B))
- Test dataset from notebook 01

## Expected Results

- **Baseline accuracy:** ~55-58% (slightly better than random)
- **Common issue:** Optimistic bias (predicts success too often)
- **Runtime:** ~18-23 minutes (the generation loop alone took about 4 minutes for 206 questions in the recorded run)
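
With only 206 test questions, "slightly better than random" deserves a quick sanity check. The sketch below computes a 95% Wilson score interval for an accuracy estimate. `wilson_interval` is a hypothetical helper, and the counts (116 correct out of 206) are the ones reported in the results cell of this notebook.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

low, high = wilson_interval(116, 206)
print(f"accuracy = {116 / 206:.1%}, 95% interval = [{low:.1%}, {high:.1%}]")
```

For these counts the interval runs from roughly 49% to 63%, so it includes 50%. A baseline in this range is hard to distinguish from a coin flip on a test set of this size, which is worth keeping in mind when comparing against the fine-tuned model.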

## Output Files

- `baseline_results.csv` - All baseline predictions with correctness labels

---

**Note:** Don't run this on your laptop unless you have a dedicated GPU. Use Google Colab!

Let's establish our baseline! 📊

# Install required packages

In [None]:
!pip install -q transformers accelerate bitsandbytes huggingface_hub

# Login to Hugging Face
from huggingface_hub import login
login()  # This will prompt you to enter your token

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from tqdm import tqdm
import re

[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m60.7/60.7 MB[0m [31m19.3 MB/s[0m eta [36m0:00:00[0m
[?25h

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Load test data and model

In [None]:
test_df = pd.read_csv("clinical_test.csv")
print(f"Loaded {len(test_df)} test samples")

model_name = "meta-llama/Meta-Llama-3-8B"
print(f"\nLoading {model_name}...")

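# Configure 8-bit quantization to save memory.
# BitsAndBytesConfig has no 8-bit compute-dtype option (bnb_4bit_compute_dtype
# applies to 4-bit loading only), so only the 8-bit flag is set here.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)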

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("✅ Model loaded!")

Loaded 206 test samples

Loading meta-llama/Meta-Llama-3-8B...



✅ Model loaded!
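
To see why 8-bit loading matters on a T4, here is a back-of-the-envelope estimate of the memory taken by the weights alone. The parameter count (about 8.03 billion) and the 16 GB nominal T4 memory are assumptions stated for illustration; activations, the KV cache and framework overhead come on top, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 8.03e9   # approximate parameter count of Llama-3-8B (assumption)
T4_MEMORY_GIB = 16  # nominal T4 memory; the usable amount is somewhat lower

for bits in (16, 8, 4):
    gib = weight_memory_gib(N_PARAMS, bits)
    share = gib / T4_MEMORY_GIB
    print(f"{bits:>2}-bit weights: {gib:5.2f} GiB ({share:.0%} of a T4)")
```

Under these assumptions, 16-bit weights nearly fill the card by themselves, while 8-bit weights leave about half of it free for inference.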


# Function to create prompt & extract prediction

In [None]:
def create_baseline_prompt(question):
    return f"""You are evaluating clinical trial outcomes. Based on the question below, predict whether the outcome will be YES (1) or NO (0).

Question: {question}

Respond with only a single digit: 0 or 1.
Answer:"""


def extract_prediction(text):
    # Look for 0 or 1 in the response
    text = text.strip()
    match = re.search(r'\b[01]\b', text)
    if match:
        return int(match.group())
    return None

# Run baseline evaluation
predictions = []
correct = 0
total = 0
unparseable = 0

print("\nüîç Running baseline evaluation (this takes ~2-3 minutes)...\n")



üîç Running baseline evaluation (this takes ~10-15 minutes)...
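
Before trusting the parsed predictions, it helps to check how `extract_prediction` treats a few typical responses. The function below is a compact copy of the one defined above, and the sample strings are invented for illustration.

```python
import re

def extract_prediction(text):
    # Same logic as the notebook's function: first standalone 0 or 1 wins
    match = re.search(r'\b[01]\b', text.strip())
    return int(match.group()) if match else None

samples = ["1", " 0\n", "1. The trial", "Yes", "10", "0 or 1"]
for s in samples:
    print(repr(s), "->", extract_prediction(s))
```

Two behaviours are worth noting. A digit that is part of a longer number (such as `10`) is not matched because of the word boundaries, and when a response repeats both options (`0 or 1`), the first digit is taken as the answer. Inspecting the `raw_response` column is a reasonable way to spot such cases.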



# Generate predictions

In [6]:
for idx, row in tqdm(test_df.iterrows(), total=len(test_df)):
    prompt = create_baseline_prompt(row['Question'])

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False,  # greedy decoding; temperature only applies when sampling, so it is not passed
            pad_token_id=tokenizer.eos_token_id
        )

    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    prediction = extract_prediction(response)

    # Record every response; unparseable ones keep predicted_answer=None and count as incorrect
    is_correct = bool(prediction is not None and prediction == row['Answer'])
    predictions.append({
        'Question': row['Question'],
        'true_answer': row['Answer'],
        'predicted_answer': prediction,
        'correct': is_correct,
        'raw_response': response.strip()
    })

    if prediction is None:
        unparseable += 1
    else:
        total += 1
        correct += int(is_correct)

# Calculate accuracy over successfully parsed responses only
baseline_accuracy = correct / total if total > 0 else 0

  0%|          | 0/206 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
100%|██████████| 206/206 [03:45<00:00,  1.09s/it]
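
The accuracy above divides by the number of parsed responses, so unparseable outputs are left out of the denominator. A stricter variant counts them as wrong. The sketch below shows both side by side. `accuracy_report` is a hypothetical helper, and the first set of counts is invented to show how the two numbers can diverge.

```python
def accuracy_report(correct: int, parsed: int, unparseable: int) -> dict:
    """Compare accuracy over parsed responses with accuracy over all samples."""
    n = parsed + unparseable
    return {
        "accuracy_parsed_only": correct / parsed if parsed else 0.0,
        "accuracy_all_samples": correct / n if n else 0.0,
        "parse_rate": parsed / n if n else 0.0,
    }

# Invented counts: 20 of 100 responses could not be parsed
print(accuracy_report(correct=50, parsed=80, unparseable=20))

# Counts from the recorded run in this notebook: every response parsed
print(accuracy_report(correct=116, parsed=206, unparseable=0))
```

When every response parses, as in the recorded run, the two definitions agree. Reporting the parse rate alongside accuracy keeps the comparison with a fine-tuned model fair if one model produces more unparseable outputs than the other.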


# Print results and save to CSV

In [8]:
print("\n" + "="*80)
print("BASELINE EVALUATION RESULTS")
print("="*80)
print(f"Model: {model_name}")
print(f"Total test samples: {len(test_df)}")
print(f"Successfully parsed: {total}")
print(f"Unparseable responses: {unparseable}")
print(f"Correct predictions: {correct}")
print(f"Baseline Accuracy: {baseline_accuracy:.1%}")
print("="*80)

# Save results
results_df = pd.DataFrame(predictions)
results_df.to_csv("baseline_results.csv", index=False)
print("\n✅ Results saved to baseline_results.csv")

# Show some examples
print("\n📋 Sample predictions:")
display_df = results_df[['Question', 'true_answer', 'predicted_answer', 'correct']].head(10)
print(display_df)

# Additional statistics
print("\n" + "="*80)
print("DETAILED STATISTICS")
print("="*80)
if total > 0:
    print("Answer distribution in predictions:")
    print(results_df[results_df['predicted_answer'].notna()]['predicted_answer'].value_counts())
    print("\nTrue answer distribution:")
    print(test_df['Answer'].value_counts())


BASELINE EVALUATION RESULTS
Model: meta-llama/Meta-Llama-3-8B
Total test samples: 206
Successfully parsed: 206
Unparseable responses: 0
Correct predictions: 116
Baseline Accuracy: 56.3%

✅ Results saved to baseline_results.csv

📋 Sample predictions:
                                            Question  true_answer  \
0  Will Pharnext announce positive topline result...            0   
1  Will Sarepta Therapeutics release results from...            1   
2  Will EIP Pharma (CervoMed) complete its Phase ...            1   
3  Will argenx SE receive FDA approval for VYVGAR...            1   
4  Will Capricor Therapeutics report top-line dat...            0   
5  Will the FDA approve AbbVie's Rinvoq (upadacit...            1   
6  Will Reata Pharmaceuticals' Skyclarys (omavelo...            1   
7  Will Biogen begin patient screening for its Ph...            0   
8  Will AbbVie's Rinvoq (upadacitinib) receive FD...            0   
9  Will UCB's Bimzelx (bimekizumab-bkzx) be comme...  