# Evaluation Notebook - LLAMA, Gemini & ML Summarization

This notebook evaluates LLAMA, Gemini, and ML (BART-base) summarization outputs using both generation and retrieval metrics.

## Evaluation Plan

### Generation Metrics (Record-Level)
- **BLEU Score**: Token-based n-gram overlap
- **ROUGE-L Score**: Longest common subsequence-based F1 score
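- **BERTScore**: Semantic similarity from contextual embeddings (Bio_ClinicalBERT was the intended model; the run recorded in section 3.3 falls back to `bert-base-uncased`)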
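
To make the ROUGE-L definition concrete, here is a minimal sketch of an LCS-based F1 on a toy pair. It assumes plain whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package used later in this notebook; the names `lcs_length`, `rouge_l_f1`, and the toy sentences are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, prediction):
    """F1 of LCS-based precision and recall (whitespace tokens, no stemming)."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy sentences, not taken from the evaluation data
toy_reference = "patient reports watery diarrhea for three days"
toy_prediction = "patient has watery diarrhea for two days"
toy_rouge_l = rouge_l_f1(toy_reference, toy_prediction)
print(f"Toy ROUGE-L F1: {toy_rouge_l:.4f}")
```

Five of the seven tokens on each side form the common subsequence, so precision and recall are both 5/7 here.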

### Retrieval Metrics (Record-Level)
- **Precision@5**: Relevance of top 5 retrieved chunks
- **Recall@10**: Coverage of relevant chunks in top 10
- **MRR**: Mean Reciprocal Rank of first relevant chunk

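A chunk counts as relevant when the cosine similarity between its ClinicalBERT embedding and the gold summary's embedding meets a threshold. The per-record helper defaults to 0.70; the runs recorded in this notebook use 0.85.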
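
As a rough illustration of how the three retrieval metrics follow from thresholded similarities, the sketch below works through one ranked list by hand. The similarity values, the threshold of 0.75, and the function name `toy_retrieval_metrics` are made up for the example; recall is taken relative to the relevant chunks found among all retrieved chunks, which is an assumption of this sketch rather than a corpus-wide recall.

```python
def toy_retrieval_metrics(similarities, threshold):
    """Return (precision@5, recall@10, reciprocal rank) for one ranked list."""
    relevance = [1 if s >= threshold else 0 for s in similarities]
    if not relevance:
        return 0.0, 0.0, 0.0

    # Precision@5: share of the top 5 chunks that are relevant
    precision_5 = sum(relevance[:5]) / min(5, len(relevance))

    # Recall@10: relevant chunks in the top 10 over all relevant retrieved chunks
    total_relevant = sum(relevance)
    recall_10 = sum(relevance[:10]) / total_relevant if total_relevant else 0.0

    # Reciprocal rank: 1 / rank of the first relevant chunk (0 if none)
    reciprocal_rank = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            reciprocal_rank = 1.0 / rank
            break

    return precision_5, recall_10, reciprocal_rank


# Hypothetical similarities for 12 retrieved chunks, in ranked order
toy_similarities = [0.60, 0.90, 0.88, 0.50, 0.86, 0.40,
                    0.30, 0.87, 0.20, 0.10, 0.91, 0.10]
toy_p5, toy_r10, toy_rr = toy_retrieval_metrics(toy_similarities, threshold=0.75)
print(toy_p5, toy_r10, toy_rr)
```

Here three of the top five chunks are relevant, four of the five relevant chunks fall in the top ten, and the first relevant chunk sits at rank 2.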


In [41]:
import os
import pandas as pd
import numpy as np
from pathlib import Path

# For BLEU score
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
nltk.download('punkt', quiet=True)

# For ROUGE score: Try importing, if fails, inform user gracefully
try:
    from rouge_score import rouge_scorer
except ModuleNotFoundError:
    print(
        "\nWARNING: 'rouge_score' module not found. "
        "ROUGE evaluation will be unavailable. "
        "To install, run: pip install rouge-score"
    )
    rouge_scorer = None

# For BERTScore
try:
    from bert_score import score as bert_score_func
except ModuleNotFoundError:
    print(
        "\nWARNING: 'bert_score' module not found. "
        "BERTScore evaluation will be unavailable. "
        "To install, run: pip install bert-score"
    )
    bert_score_func = None

# For embeddings (ClinicalBERT)
from transformers import AutoTokenizer, AutoModel
import torch

# For ChromaDB (to extract retrieved chunks)
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# For cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

print("Libraries imported (with best effort).")


Libraries imported (with best effort).


## 1. Load Evaluation Data


In [42]:
# Load the evaluation datasets
llama_df = pd.read_csv("/home/root495/Inexture/CDSS-RAG/data/processed/conversation_summary_using_llama.csv")
gemini_df = pd.read_csv("/home/root495/Inexture/CDSS-RAG/data/processed/conversation_summary_using_gemini.csv")
ml_df = pd.read_csv("/home/root495/Inexture/CDSS-RAG/data/processed/conversation_summary_using_ml.csv")

print(f"LLAMA dataset shape: {llama_df.shape}")
print(f"Gemini dataset shape: {gemini_df.shape}")
print(f"ML dataset shape: {ml_df.shape}")
print(f"\nLLAMA columns: {llama_df.columns.tolist()}")
print(f"Gemini columns: {gemini_df.columns.tolist()}")
print(f"ML columns: {ml_df.columns.tolist()}")

# Rename ml_summary to rag_summary for consistency in ML dataset
ml_df = ml_df.rename(columns={'ml_summary': 'rag_summary'})

# Display first few rows
print("\nLLAMA dataset preview:")
llama_df.head()


LLAMA dataset shape: (15, 3)
Gemini dataset shape: (15, 3)
ML dataset shape: (15, 3)

LLAMA columns: ['conversation', 'summary', 'rag_summary']
Gemini columns: ['conversation', 'summary', 'rag_summary']
ML columns: ['conversation', 'summary', 'ml_summary']

LLAMA dataset preview:


| | conversation | summary | rag_summary |
|---|---|---|---|
| 0 | Doctor: Hello? Hi. Um, should we start? Yeah, ... | 3/7 hx of diarrhea, mainly watery. No blood in... | 3-day history of watery diarrhea, occurring 6-... |
| 1 | Doctor: Hello? Patient: Hello. Can you hear me... | 4/7 hx of dry itchy skin, mainly on chest and ... | 4-day history of itchy and sore skin, mainly a... |
| 2 | Doctor: Hello? Patient: Hello. Doctor: Hello t... | Headache on left side. Started few hours ago, ... | The patient, a female, presents with a 4-hour ... |
| 3 | Doctor: Alex. Ohh. Hello? Hi, can you hear me?... | 4/7 hx of generally unwell, mainly sore throat... | 4-day history of feeling unwell with initial s... |
| 4 | Doctor: Hello? Patient: Doctor: . Good morning... | 2/7 ago developed lower abdo pain/suprapubic p... | 2-day history of gradual onset lower abdominal... |


## 2. Extract Retrieved Chunks from ChromaDB

Since the CSV files don't contain the `retrieved_chunks` column, we need to re-query ChromaDB for each conversation to get the retrieved chunks.


In [None]:
# Initialize ChromaDB retriever (same as in summarization notebooks)
embedding_model = HuggingFaceEmbeddings(
    model_name="emilyalsentzer/Bio_ClinicalBERT"
)

# Load Chroma DB - using the same path as in summarization notebooks
chroma_db = Chroma(
    persist_directory="/home/root495/Inexture/CDSS-RAG/notebooks/chroma_store",
    embedding_function=embedding_model
)

print("ChromaDB loaded successfully!")

def extract_retrieved_chunks(conversation, k=15):
    """
    Extract top k retrieved chunks for a given conversation.
    We retrieve k=15 to enable meaningful Recall@10 calculation 
    (need more than 10 chunks to measure recall).
    """
    try:
        # Retrieve top k chunks
        docs = chroma_db.similarity_search(conversation, k=k)
        # Extract the page_content (chunk text) from each document
        chunks = [doc.page_content for doc in docs]
        return chunks
    except Exception as e:
        print(f"Error retrieving chunks: {e}")
        return []

# Test retrieval
test_chunks = extract_retrieved_chunks(llama_df.iloc[0]['conversation'], k=15)
print(f"Test: Retrieved {len(test_chunks)} chunks")
if test_chunks:
    print(f"First chunk preview: {test_chunks[0][:200]}...")


ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


ChromaDB loaded successfully!
Test: Retrieved 15 chunks
First chunk preview: a 61-year-old female presented to hospital with a 2-week history of profound diarrhea and vomiting. the patient also complained of dull abdominal pain that temporarily resolved with bowel movements. s...


In [None]:
# Extract retrieved chunks for RAG-based models (LLAMA and Gemini)
# Retrieve 15 chunks to enable meaningful Recall@10 calculation
print("Extracting retrieved chunks for LLAMA dataset...")
llama_retrieved_chunks = []
for idx, row in llama_df.iterrows():
    chunks = extract_retrieved_chunks(row['conversation'], k=15)
    llama_retrieved_chunks.append(chunks)
    if (idx + 1) % 5 == 0:
        print(f"Processed {idx + 1}/{len(llama_df)} conversations")

llama_df['retrieved_chunks'] = llama_retrieved_chunks

print("\nExtracting retrieved chunks for Gemini dataset...")
gemini_retrieved_chunks = []
for idx, row in gemini_df.iterrows():
    chunks = extract_retrieved_chunks(row['conversation'], k=15)
    gemini_retrieved_chunks.append(chunks)
    if (idx + 1) % 5 == 0:
        print(f"Processed {idx + 1}/{len(gemini_df)} conversations")

gemini_df['retrieved_chunks'] = gemini_retrieved_chunks

# ML model has no retrieval component - skip chunk extraction
print("\nSkipping chunk extraction for ML dataset (no retrieval component)")

print("\nRetrieved chunks extracted successfully!")
print(f"LLAMA - Sample chunks count: {len(llama_df.iloc[0]['retrieved_chunks'])}")
print(f"Gemini - Sample chunks count: {len(gemini_df.iloc[0]['retrieved_chunks'])}")
print(f"ML - N/A (no retrieval component)")


Extracting retrieved chunks for LLAMA dataset...
Processed 5/15 conversations
Processed 10/15 conversations
Processed 15/15 conversations

Extracting retrieved chunks for Gemini dataset...
Processed 5/15 conversations
Processed 10/15 conversations
Processed 15/15 conversations

Skipping chunk extraction for ML dataset (no retrieval component)

Retrieved chunks extracted successfully!
LLAMA - Sample chunks count: 15
Gemini - Sample chunks count: 15
ML - N/A (no retrieval component)


## 3. Compute Generation Metrics (Record-Level)

### 3.1 BLEU Score


In [45]:
import nltk
nltk.download('punkt_tab')

def compute_bleu_score(reference, prediction):
    """Compute sentence-level BLEU score"""
    # Tokenize reference and prediction
    reference_tokens = nltk.word_tokenize(reference.lower())
    prediction_tokens = nltk.word_tokenize(prediction.lower())
    
    # Use smoothing to avoid zero scores
    smoothing = SmoothingFunction().method1
    score = sentence_bleu([reference_tokens], prediction_tokens, smoothing_function=smoothing)
    return score

# Test BLEU
test_ref = llama_df.iloc[0]['summary']
test_pred = llama_df.iloc[0]['rag_summary']
test_bleu = compute_bleu_score(test_ref, test_pred)
print(f"Test BLEU score: {test_bleu:.4f}")


Test BLEU score: 0.1247


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/root495/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### 3.2 ROUGE-L Score


In [46]:
# Import and initialize ROUGE scorer from the official rouge_score package
from rouge_score import rouge_scorer

rouge_scorer_instance = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

def compute_rouge_score(reference, prediction):
    """Compute ROUGE-L F1 score"""
    scores = rouge_scorer_instance.score(reference, prediction)
    return scores['rougeL'].fmeasure

# Test ROUGE
test_rouge = compute_rouge_score(test_ref, test_pred)
print(f"Test ROUGE-L score: {test_rouge:.4f}")


Test ROUGE-L score: 0.3253


### 3.3 BERTScore using ClinicalBERT


In [47]:
# BERTScore with a biomedical encoder, falling back to a general-purpose one.
print("Loading Bio_ClinicalBERT (huggingface model) for BERTScore...")
from bert_score import score as bert_score

# bert-score looks up a default layer for each model_type; a model missing from its
# built-in table raises KeyError unless num_layers is passed explicitly.
# This cell tries PubMedBERT (a biomedical model, not Bio_ClinicalBERT itself) and
# falls back to 'bert-base-uncased' on KeyError, which is what the recorded run uses.
# `clinical_bert_model` holds the model_type string that ends up being used.
try:
    clinical_bert_model = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
    print(f"Trying BERTScore with model_type={clinical_bert_model} ...")
    P, R, F1 = bert_score(
        [test_pred],
        [test_ref],
        model_type=clinical_bert_model,
        lang='en',
        verbose=False
    )
except KeyError as e:
    print(f"Model {clinical_bert_model} not available in BERTScore. Falling back to 'bert-base-uncased'.")
    clinical_bert_model = "bert-base-uncased"
    P, R, F1 = bert_score(
        [test_pred],
        [test_ref],
        model_type=clinical_bert_model,
        lang='en',
        verbose=False
    )

test_bertscore = F1.item()
print(f"Test BERTScore (F1): {test_bertscore:.4f}")


Loading Bio_ClinicalBERT (huggingface model) for BERTScore...
Trying BERTScore with model_type=microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext ...
Model microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext not available in BERTScore. Falling back to 'bert-base-uncased'.
Test BERTScore (F1): 0.6575


In [48]:
def compute_all_generation_metrics(df, model_name="Model"):
    """Compute BLEU, ROUGE-L, and BERTScore for all rows in the dataframe"""
    print(f"Computing generation metrics for {model_name}...")
    
    bleu_scores = []
    rouge_scores = []
    
    # Compute BLEU and ROUGE row by row
    for idx, row in df.iterrows():
        reference = row['summary']
        prediction = row['rag_summary']
        
        bleu = compute_bleu_score(reference, prediction)
        rouge = compute_rouge_score(reference, prediction)
        
        bleu_scores.append(bleu)
        rouge_scores.append(rouge)
        
        if (idx + 1) % 5 == 0:
            print(f"  Processed {idx + 1}/{len(df)} rows (BLEU/ROUGE)")
    
    # Compute BERTScore in batch (more efficient)
    print(f"  Computing BERTScore for {model_name}...")
    references = df['summary'].tolist()
    predictions = df['rag_summary'].tolist()
    
    P, R, F1 = bert_score_func(
        predictions,
        references,
        model_type=clinical_bert_model,
        lang='en',
        verbose=True
    )
    
    bertscores = F1.tolist()
    
    # Add to dataframe
    df['bleu_score'] = bleu_scores
    df['rouge_score'] = rouge_scores
    df['bertscore'] = bertscores
    
    print(f"Generation metrics computed for {model_name}!")
    print(f"  Mean BLEU: {np.mean(bleu_scores):.4f}")
    print(f"  Mean ROUGE-L: {np.mean(rouge_scores):.4f}")
    print(f"  Mean BERTScore: {np.mean(bertscores):.4f}")
    
    return df

# Compute generation metrics for all datasets
llama_df = compute_all_generation_metrics(llama_df.copy(), "LLAMA")
gemini_df = compute_all_generation_metrics(gemini_df.copy(), "Gemini")
ml_df = compute_all_generation_metrics(ml_df.copy(), "ML")


Computing generation metrics for LLAMA...
  Processed 5/15 rows (BLEU/ROUGE)
  Processed 10/15 rows (BLEU/ROUGE)
  Processed 15/15 rows (BLEU/ROUGE)
  Computing BERTScore for LLAMA...
calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 8.91 seconds, 1.68 sentences/sec
Generation metrics computed for LLAMA!
  Mean BLEU: 0.0810
  Mean ROUGE-L: 0.2641
  Mean BERTScore: 0.6233
Computing generation metrics for Gemini...
  Processed 5/15 rows (BLEU/ROUGE)
  Processed 10/15 rows (BLEU/ROUGE)
  Processed 15/15 rows (BLEU/ROUGE)
  Computing BERTScore for Gemini...
calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 9.30 seconds, 1.61 sentences/sec
Generation metrics computed for Gemini!
  Mean BLEU: 0.0791
  Mean ROUGE-L: 0.3227
  Mean BERTScore: 0.6613
Computing generation metrics for ML...
  Processed 5/15 rows (BLEU/ROUGE)
  Processed 10/15 rows (BLEU/ROUGE)
  Processed 15/15 rows (BLEU/ROUGE)
  Computing BERTScore for ML...
calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 5.83 seconds, 2.57 sentences/sec
Generation metrics computed for ML!
  Mean BLEU: 0.0046
  Mean ROUGE-L: 0.0797
  Mean BERTScore: 0.4348


## 4. Compute Retrieval Metrics (Record-Level)

### 4.1 Load ClinicalBERT for Embeddings


In [49]:
# Load ClinicalBERT model and tokenizer for embeddings
print("Loading ClinicalBERT model for embeddings...")
clinical_bert_model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(clinical_bert_model_name)
# Use a distinct name for the encoder: `clinical_bert_model` already holds the
# BERTScore model_type string read by compute_all_generation_metrics, and
# overwriting it here would break that function if its cell were re-run.
clinical_bert_encoder = AutoModel.from_pretrained(clinical_bert_model_name)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
clinical_bert_encoder = clinical_bert_encoder.to(device)
clinical_bert_encoder.eval()

print(f"ClinicalBERT loaded on {device}")

def get_clinical_bert_embedding(text, max_length=512):
    """Get the [CLS] embedding for a text using ClinicalBERT (truncated to max_length tokens)"""
    # Tokenize and encode
    encoded = tokenizer(
        text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    ).to(device)
    
    # Get embeddings
    with torch.no_grad():
        outputs = clinical_bert_encoder(**encoded)
        # Use CLS token embedding (first token)
        embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    
    return embedding[0]

# Test embedding
test_embedding = get_clinical_bert_embedding("Test medical text")
print(f"Test embedding shape: {test_embedding.shape}")


Loading ClinicalBERT model for embeddings...
ClinicalBERT loaded on cpu
Test embedding shape: (768,)


### 4.2 Compute Retrieval Metrics


In [50]:
def compute_retrieval_metrics(summary_text, retrieved_chunks, threshold=0.70):
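    """
    Compute Precision@5, Recall@10, and reciprocal rank for one record.
    
    A chunk is relevant if the cosine similarity between its ClinicalBERT
    embedding and the gold summary's embedding is >= threshold.
    
    Parameters:
    - summary_text: Gold summary text (reference)
    - retrieved_chunks: List of retrieved chunk texts, in ranked order
    - threshold: Cosine similarity threshold for relevance (default 0.70)
    
    Returns:
    - precision_5: Relevant chunks in the top 5, divided by min(5, number retrieved)
    - recall_10: Relevant chunks in the top 10, divided by the relevant chunks
      among ALL retrieved chunks (not among the whole corpus)
    - mrr: Reciprocal rank of the first relevant chunk (0.0 if none);
      averaging this over records gives the Mean Reciprocal Rank
    """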
    if len(retrieved_chunks) == 0:
        return 0.0, 0.0, 0.0
    
    # Embed the gold summary
    summary_embedding = get_clinical_bert_embedding(summary_text)
    
    # Embed all retrieved chunks
    chunk_embeddings = []
    for chunk in retrieved_chunks:
        chunk_emb = get_clinical_bert_embedding(chunk)
        chunk_embeddings.append(chunk_emb)
    
    chunk_embeddings = np.array(chunk_embeddings)
    
    # Compute cosine similarity between summary and each chunk
    similarities = cosine_similarity(
        summary_embedding.reshape(1, -1),
        chunk_embeddings
    )[0]
    
    # Determine relevance (1 if similarity >= threshold, 0 otherwise)
    relevance = (similarities >= threshold).astype(int)
    
    # Precision@5: relevant chunks in top 5 / 5
    top_5_relevant = sum(relevance[:5])
    precision_5 = top_5_relevant / min(5, len(retrieved_chunks))
    
    # Recall@10: relevant chunks in top 10 / total relevant chunks
    top_10_relevant = sum(relevance[:10])
    total_relevant = sum(relevance)  # Total relevant chunks in all retrieved
    if total_relevant == 0:
        recall_10 = 0.0
    else:
        recall_10 = top_10_relevant / total_relevant
    
    # MRR: 1 / rank of first relevant chunk
    first_relevant_rank = None
    for i, rel in enumerate(relevance, start=1):
        if rel == 1:
            first_relevant_rank = i
            break
    
    if first_relevant_rank is not None:
        mrr = 1.0 / first_relevant_rank
    else:
        mrr = 0.0
    
    return precision_5, recall_10, mrr

# Test retrieval metrics
test_summary = llama_df.iloc[0]['summary']
test_chunks = llama_df.iloc[0]['retrieved_chunks']
test_prec, test_recall, test_mrr = compute_retrieval_metrics(test_summary, test_chunks)
print(f"Test Precision@5: {test_prec:.4f}")
print(f"Test Recall@10: {test_recall:.4f}")
print(f"Test MRR: {test_mrr:.4f}")


Test Precision@5: 1.0000
Test Recall@10: 0.6667
Test MRR: 1.0000


In [None]:
def compute_all_retrieval_metrics(df, model_name="Model", threshold=0.80):
    """Compute retrieval metrics for all rows in the dataframe"""
    print(f"Computing retrieval metrics for {model_name} (threshold={threshold})...")
    
    precision_5_scores = []
    recall_10_scores = []
    mrr_scores = []
    
    for idx, row in df.iterrows():
        summary = row['summary']
        retrieved_chunks = row['retrieved_chunks']
        
        # Compute retrieval metrics using ClinicalBERT embeddings
        # Returns precision@5, recall@10, mrr
        prec_5, recall_10, mrr = compute_retrieval_metrics(summary, retrieved_chunks, threshold=threshold)
        
        # Use the precision value from compute_retrieval_metrics (uses ClinicalBERT)
        precision_5_scores.append(prec_5)
        recall_10_scores.append(recall_10)
        mrr_scores.append(mrr)
        
        if (idx + 1) % 5 == 0:
            print(f"  Processed {idx + 1}/{len(df)} rows")
    
    # Add to dataframe
    df['precision_5'] = precision_5_scores
    df['recall_10'] = recall_10_scores
    df['mrr'] = mrr_scores
    
    print(f"Retrieval metrics computed for {model_name}!")
    print(f"  Mean Precision@5: {np.mean(precision_5_scores):.4f}")
    print(f"  Mean Recall@10: {np.mean(recall_10_scores):.4f}")
    print(f"  Mean MRR: {np.mean(mrr_scores):.4f}")
    
    return df

# Compute retrieval metrics for RAG-based models only (LLAMA and Gemini)
# Both use the same relevance threshold (0.85) so their scores are comparable
llama_df = compute_all_retrieval_metrics(llama_df.copy(), "LLAMA", threshold=0.85)
gemini_df = compute_all_retrieval_metrics(gemini_df.copy(), "Gemini", threshold=0.85)

# ML model has no retrieval component - set retrieval metrics to NaN (Not Applicable)
print("\nML model: No retrieval component - setting retrieval metrics to NaN")
ml_df['precision_5'] = np.nan
ml_df['recall_10'] = np.nan
ml_df['mrr'] = np.nan
print("  ML retrieval metrics set to NaN (not applicable)")


Computing retrieval metrics for LLAMA (threshold=0.85)...
  Processed 5/15 rows
  Processed 10/15 rows
  Processed 15/15 rows
Retrieval metrics computed for LLAMA!
  Mean Precision@5: 0.8133
  Mean Recall@10: 0.6721
  Mean MRR: 0.8929
Computing retrieval metrics for Gemini (threshold=0.85)...
  Processed 5/15 rows
  Processed 10/15 rows
  Processed 15/15 rows
Retrieval metrics computed for Gemini!
  Mean Precision@5: 0.8133
  Mean Recall@10: 0.6721
  Mean MRR: 0.8929

ML model: No retrieval component - setting retrieval metrics to NaN
  ML retrieval metrics set to NaN (not applicable)


## 5. Store Results in DataFrames

The datasets now include all computed metrics. Let's verify the columns and prepare for saving.


In [None]:
# Verify columns in all dataframes
print("LLAMA DataFrame columns:")
print(llama_df.columns.tolist())
print("\nGemini DataFrame columns:")
print(gemini_df.columns.tolist())
print("\nML DataFrame columns:")
print(ml_df.columns.tolist())

# Display summary statistics
print("\n=== LLAMA Metrics Summary ===")
print(llama_df[['bleu_score', 'rouge_score', 'bertscore', 'precision_5', 'recall_10', 'mrr']].describe())

print("\n=== Gemini Metrics Summary ===")
print(gemini_df[['bleu_score', 'rouge_score', 'bertscore', 'precision_5', 'recall_10', 'mrr']].describe())

print("\n=== ML Metrics Summary ===")
print("Note: Retrieval metrics (precision_5, recall_10, mrr) are NaN for ML (no retrieval component)")
print(ml_df[['bleu_score', 'rouge_score', 'bertscore', 'precision_5', 'recall_10', 'mrr']].describe())

# Display first row with all metrics
print("\n=== Sample Row (LLAMA) ===")
print(llama_df[['conversation', 'summary', 'rag_summary', 'bleu_score', 'rouge_score', 'bertscore', 'precision_5', 'recall_10', 'mrr']].iloc[0].to_dict())


LLAMA DataFrame columns:
['conversation', 'summary', 'rag_summary', 'retrieved_chunks', 'bleu_score', 'rouge_score', 'bertscore', 'precision_5', 'recall_10', 'mrr']

Gemini DataFrame columns:
['conversation', 'summary', 'rag_summary', 'retrieved_chunks', 'bleu_score', 'rouge_score', 'bertscore', 'precision_5', 'recall_10', 'mrr']

ML DataFrame columns:
['conversation', 'summary', 'rag_summary', 'retrieved_chunks', 'bleu_score', 'rouge_score', 'bertscore', 'precision_5', 'recall_10', 'mrr']

=== LLAMA Metrics Summary ===
       bleu_score  rouge_score  bertscore  precision_5  recall_10        mrr
count   15.000000    15.000000  15.000000    15.000000  15.000000  15.000000
mean     0.081028     0.264110   0.623323     0.813333   0.672076   0.892857
std      0.045909     0.063014   0.052074     0.297289   0.072075   0.283473
min      0.014512     0.155125   0.490559     0.000000   0.500000   0.142857
25%      0.044171     0.220990   0.598854     0.700000   0.641026   1.000000
50%      0.0

### 5.1 Save Evaluated Files

Note: Each `retrieved_chunks` value is a list of strings, which a CSV cell cannot hold directly. Each list is therefore serialized to a JSON string before saving and can be restored with `json.loads` when the file is read back.
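
A small round-trip sketch of this serialization is shown here. It uses the standard-library `csv` module and an in-memory buffer instead of pandas and a file on disk, purely so it runs without dependencies; the chunk texts and variable names are hypothetical. With pandas, the equivalent restore step would be applying `json.loads` to the column after reading the CSV.

```python
import csv
import io
import json

# Hypothetical chunks containing characters that are awkward in CSV cells
toy_chunks = ["chunk with, a comma", 'chunk with "quotes"', "multi\nline chunk"]

# Write: one CSV cell holding the whole list as a JSON string
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["retrieved_chunks"])
writer.writerow([json.dumps(toy_chunks)])

# Read back: parse the cell with json.loads to recover the list
buffer.seek(0)
rows = list(csv.reader(buffer))
restored_chunks = json.loads(rows[1][0])

print(restored_chunks == toy_chunks)
```

JSON escapes the embedded newline and quotes, so the list survives the CSV round trip unchanged.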


In [None]:
import json

# Convert retrieved_chunks (list) to JSON string for CSV storage
def chunks_to_json_string(chunks):
    """Convert list of chunks to JSON string for CSV storage"""
    return json.dumps(chunks) if isinstance(chunks, list) else chunks

# Prepare dataframes for saving (convert chunks to JSON strings)
llama_df_save = llama_df.copy()
gemini_df_save = gemini_df.copy()
ml_df_save = ml_df.copy()

llama_df_save['retrieved_chunks'] = llama_df_save['retrieved_chunks'].apply(chunks_to_json_string)
gemini_df_save['retrieved_chunks'] = gemini_df_save['retrieved_chunks'].apply(chunks_to_json_string)

# ML has no retrieved_chunks column - set to empty list for consistency
if 'retrieved_chunks' not in ml_df_save.columns:
    ml_df_save['retrieved_chunks'] = [json.dumps([]) for _ in range(len(ml_df_save))]
else:
    ml_df_save['retrieved_chunks'] = ml_df_save['retrieved_chunks'].apply(chunks_to_json_string)

# Save evaluated files
output_dir = Path("/home/root495/Inexture/CDSS-RAG/data/processed")
output_dir.mkdir(parents=True, exist_ok=True)

llama_output_path = output_dir / "evaluated_llama.csv"
gemini_output_path = output_dir / "evaluated_gemini.csv"
ml_output_path = output_dir / "evaluated_ml.csv"

llama_df_save.to_csv(llama_output_path, index=False)
gemini_df_save.to_csv(gemini_output_path, index=False)
ml_df_save.to_csv(ml_output_path, index=False)

print(f"Saved evaluated LLAMA results to: {llama_output_path}")
print(f"Saved evaluated Gemini results to: {gemini_output_path}")
print(f"Saved evaluated ML results to: {ml_output_path}")
print(f"\nAll files contain {len(llama_df_save.columns)} columns")


Saved evaluated LLAMA results to: /home/root495/Inexture/CDSS-RAG/data/processed/evaluated_llama.csv
Saved evaluated Gemini results to: /home/root495/Inexture/CDSS-RAG/data/processed/evaluated_gemini.csv
Saved evaluated ML results to: /home/root495/Inexture/CDSS-RAG/data/processed/evaluated_ml.csv

All files contain 10 columns


## 6. Final Model Comparison

Compute mean values for all metrics and create a comparison summary.


In [None]:
# Compute mean values for all metrics
comparison_metrics = {
    'Model': ['LLAMA', 'Gemini', 'ML'],
    'Mean BLEU': [
        llama_df['bleu_score'].mean(),
        gemini_df['bleu_score'].mean(),
        ml_df['bleu_score'].mean()
    ],
    'Mean ROUGE-L': [
        llama_df['rouge_score'].mean(),
        gemini_df['rouge_score'].mean(),
        ml_df['rouge_score'].mean()
    ],
    'Mean BERTScore': [
        llama_df['bertscore'].mean(),
        gemini_df['bertscore'].mean(),
        ml_df['bertscore'].mean()
    ],
    'Mean Precision@5': [
        llama_df['precision_5'].mean(),
        gemini_df['precision_5'].mean(),
        np.nan  # ML has no retrieval component
    ],
    'Mean Recall@10': [
        llama_df['recall_10'].mean(),
        gemini_df['recall_10'].mean(),
        np.nan  # ML has no retrieval component
    ],
    'Mean MRR': [
        llama_df['mrr'].mean(),
        gemini_df['mrr'].mean(),
        np.nan  # ML has no retrieval component
    ]
}

comparison_df = pd.DataFrame(comparison_metrics)
print("=== Model Comparison Summary ===")
print(comparison_df.to_string(index=False))

# Calculate differences
print("\n=== Difference (Gemini - LLAMA) ===")
differences = {
    'Metric': ['BLEU', 'ROUGE-L', 'BERTScore', 'Precision@5', 'Recall@10', 'MRR'],
    'Difference': [
        gemini_df['bleu_score'].mean() - llama_df['bleu_score'].mean(),
        gemini_df['rouge_score'].mean() - llama_df['rouge_score'].mean(),
        gemini_df['bertscore'].mean() - llama_df['bertscore'].mean(),
        gemini_df['precision_5'].mean() - llama_df['precision_5'].mean(),
        gemini_df['recall_10'].mean() - llama_df['recall_10'].mean(),
        gemini_df['mrr'].mean() - llama_df['mrr'].mean()
    ]
}
diff_df = pd.DataFrame(differences)
print(diff_df.to_string(index=False))


=== Model Comparison Summary ===
 Model  Mean BLEU  Mean ROUGE-L  Mean BERTScore  Mean Precision@5  Mean Recall@10  Mean MRR
 LLAMA   0.081028      0.264110        0.623323          0.813333        0.672076  0.892857
Gemini   0.079112      0.322685        0.661262          0.813333        0.672076  0.892857
    ML   0.004600      0.079684        0.434818               NaN             NaN       NaN

=== Difference (Gemini - LLAMA) ===
     Metric  Difference
       BLEU   -0.001915
    ROUGE-L    0.058575
  BERTScore    0.037940
Precision@5    0.000000
  Recall@10    0.000000
        MRR    0.000000


In [None]:
# Display detailed comparison with standard deviations
print("\n=== Detailed Comparison with Standard Deviations ===")
print("\nGeneration Metrics:")
print(f"BLEU Score:")
print(f"  LLAMA:  {llama_df['bleu_score'].mean():.4f} ± {llama_df['bleu_score'].std():.4f}")
print(f"  Gemini: {gemini_df['bleu_score'].mean():.4f} ± {gemini_df['bleu_score'].std():.4f}")
print(f"  ML:     {ml_df['bleu_score'].mean():.4f} ± {ml_df['bleu_score'].std():.4f}")

print(f"\nROUGE-L Score:")
print(f"  LLAMA:  {llama_df['rouge_score'].mean():.4f} ± {llama_df['rouge_score'].std():.4f}")
print(f"  Gemini: {gemini_df['rouge_score'].mean():.4f} ± {gemini_df['rouge_score'].std():.4f}")
print(f"  ML:     {ml_df['rouge_score'].mean():.4f} ± {ml_df['rouge_score'].std():.4f}")

print(f"\nBERTScore:")
print(f"  LLAMA:  {llama_df['bertscore'].mean():.4f} ± {llama_df['bertscore'].std():.4f}")
print(f"  Gemini: {gemini_df['bertscore'].mean():.4f} ± {gemini_df['bertscore'].std():.4f}")
print(f"  ML:     {ml_df['bertscore'].mean():.4f} ± {ml_df['bertscore'].std():.4f}")

print("\nRetrieval Metrics:")
print(f"Precision@5:")
print(f"  LLAMA:  {llama_df['precision_5'].mean():.4f} ± {llama_df['precision_5'].std():.4f}")
print(f"  Gemini: {gemini_df['precision_5'].mean():.4f} ± {gemini_df['precision_5'].std():.4f}")
print(f"  ML:     N/A (no retrieval component)")

print(f"\nRecall@10:")
print(f"  LLAMA:  {llama_df['recall_10'].mean():.4f} ± {llama_df['recall_10'].std():.4f}")
print(f"  Gemini: {gemini_df['recall_10'].mean():.4f} ± {gemini_df['recall_10'].std():.4f}")
print(f"  ML:     N/A (no retrieval component)")

print(f"\nMRR:")
print(f"  LLAMA:  {llama_df['mrr'].mean():.4f} ± {llama_df['mrr'].std():.4f}")
print(f"  Gemini: {gemini_df['mrr'].mean():.4f} ± {gemini_df['mrr'].std():.4f}")
print(f"  ML:     N/A (no retrieval component)")



=== Detailed Comparison with Standard Deviations ===

Generation Metrics:
BLEU Score:
  LLAMA:  0.0810 ± 0.0459
  Gemini: 0.0791 ± 0.0411
  ML:     0.0046 ± 0.0058

ROUGE-L Score:
  LLAMA:  0.2641 ± 0.0630
  Gemini: 0.3227 ± 0.0571
  ML:     0.0797 ± 0.0336

BERTScore:
  LLAMA:  0.6233 ± 0.0521
  Gemini: 0.6613 ± 0.0346
  ML:     0.4348 ± 0.0320

Retrieval Metrics:
Precision@5:
  LLAMA:  0.8133 ± 0.2973
  Gemini: 0.8133 ± 0.2973
  ML:     N/A (no retrieval component)

Recall@10:
  LLAMA:  0.6721 ± 0.0721
  Gemini: 0.6721 ± 0.0721
  ML:     N/A (no retrieval component)

MRR:
  LLAMA:  0.8929 ± 0.2835
  Gemini: 0.8929 ± 0.2835
  ML:     N/A (no retrieval component)


### 6.1 Save Comparison Results


In [56]:
# Save comparison summary
comparison_output_path = output_dir / "model_comparison_summary.csv"
comparison_df.to_csv(comparison_output_path, index=False)
print(f"Saved comparison summary to: {comparison_output_path}")

print("\n=== Evaluation Complete ===")
print(f"✓ Generated metrics for {len(llama_df)} LLAMA samples")
print(f"✓ Generated metrics for {len(gemini_df)} Gemini samples")
print(f"✓ Generated metrics for {len(ml_df)} ML samples")
print(f"✓ Saved evaluated files with all metrics")
print(f"✓ Created model comparison summary")


Saved comparison summary to: /home/root495/Inexture/CDSS-RAG/data/processed/model_comparison_summary.csv

=== Evaluation Complete ===
✓ Generated metrics for 15 LLAMA samples
✓ Generated metrics for 15 Gemini samples
✓ Generated metrics for 15 ML samples
✓ Saved evaluated files with all metrics
✓ Created model comparison summary


## Summary

This evaluation notebook successfully computed:

### Generation Metrics
- ✅ **BLEU Score**: Token-based n-gram overlap (sentence-level)
- ✅ **ROUGE-L Score**: Longest common subsequence-based F1 score
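- ✅ **BERTScore**: Semantic similarity from contextual embeddings; the recorded run used `bert-base-uncased` after the biomedical model raised a KeyError in bert-score, not Bio_ClinicalBERT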

### Retrieval Metrics
- ✅ **Precision@5**: Relevance of top 5 retrieved chunks (LLAMA & Gemini only)
- ✅ **Recall@10**: Coverage of relevant chunks in top 10 (LLAMA & Gemini only)
- ✅ **MRR**: Mean Reciprocal Rank of first relevant chunk (LLAMA & Gemini only)
- ⚠️ **Note**: The ML model has no retrieval component, so its retrieval metrics are set to NaN (not applicable)

### Output Files
- `data/processed/evaluated_llama.csv` - LLAMA evaluation results with all metrics
- `data/processed/evaluated_gemini.csv` - Gemini evaluation results with all metrics
- `data/processed/evaluated_ml.csv` - ML evaluation results with all metrics
- `data/processed/model_comparison_summary.csv` - Summary comparison between all models (LLAMA, Gemini, ML)

### Notes
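- Relevance threshold: 0.85 cosine similarity between ClinicalBERT [CLS] embeddings of the gold summary and each retrieved chunk
- Retrieved chunks extracted from ChromaDB by re-querying for each conversation (k=15)
- Retrieval metrics depend only on the conversation and the gold summary, not on the generated summary, so LLAMA and Gemini report identical retrieval scores when evaluated on the same conversations
- All metrics computed at record level and aggregated to mean values for comparison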
- ML model uses BART-base for zero-shot summarization
