In [18]:
!pip install nltk bert-score
!pip install rouge-metric



In [7]:
import pandas as pd

# Load the saved CSV file
longformer_generated_summaries = pd.read_csv("Longformer_soap_generated_summaries.csv")

# Verify the data
print(longformer_generated_summaries.head(2))

                                               input  \
0  Good afternoon, champ, how you holding up? Goo...   
1  What brings you in here today? Hi, I'm um, I'm...   

                                              output  \
0  Subjective:\n- Symptoms: Lower back pain, radi...   
1  Subjective:\n- Presenting with dry cough for 1...   

                                   generated_summary  
0  A 75-year-old man is experiencing chronic lowe...  
1  , but after that it seemed to clear up a bit. ...  


In [24]:
print(longformer_generated_summaries['input'].iloc[50])

Good morning, young lady, how old are you? Good morning, doctor. I'm thirteen. Good, and what seems to be the problem today? Mom, can you explain for me? Guest_family: Well, if you look, doctor, her back posture is very rounded. I think, it's rounding about the thoracic spine. Is there a family history of this problem? Guest_family: Yes, on my side, my aunt and grandfather had, um, kyphosis. Yes, that's what this is. This is thoracic kyphosis to be specific. Has she seen another doctor for this? Guest_family: Yes, we saw another orthopedist. What did they recommend? Guest_family: They recommended we come in for further observation, so we're here for a second opinion. Good, is there any back pain, numbness or tingling? No, I don't have any of that. Is there any weakness, numbness or tingling in your legs and arms, my dear? No, I'm very strong, especially for my age. Are you going to the bathroom with no problem? Yes, doctor, everything is regular there.


In [25]:
print(longformer_generated_summaries['generated_summary'].iloc[50])

A 13-year-old girl comes to see her mother's doctor because she is concerned about her back


In [26]:
from tqdm import tqdm
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score
from rouge_metric import PyRouge

def clean_text(text):
    """Clean and normalize text."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    return ' '.join(text.strip().lower().split())  # Lowercase, strip spaces, normalize.

def compute_bleu_scores(reference, candidate):
    """Compute BLEU-1 and BLEU-2 scores."""
    try:
        smoothing_function = SmoothingFunction().method1
        # Compute BLEU-1
        bleu1 = sentence_bleu(
            [reference.split()],
            candidate.split(),
            weights=(1.0, 0, 0, 0),  # Only unigrams
            smoothing_function=smoothing_function
        )
        # Compute BLEU-2
        bleu2 = sentence_bleu(
            [reference.split()],
            candidate.split(),
            weights=(0.5, 0.5, 0, 0),  # Unigrams and bigrams only
            smoothing_function=smoothing_function
        )
        return bleu1 * 100, bleu2 * 100  # Convert to percentages
    except Exception as e:
        print(f"BLEU Error: {e}")
        print(f"Reference: '{reference[:50]}...'")
        print(f"Candidate: '{candidate[:50]}...'")
        return 0.0, 0.0

def compute_rouge_l(reference, candidate):
    """Compute ROUGE-L score."""
    rouge = PyRouge(rouge_n=(1, 2), rouge_l=True, rouge_w=False,
                    rouge_w_weight=1.2, rouge_s=False, rouge_su=False, skip_gap=4)
    try:
        scores = rouge.evaluate([candidate], [[reference]])
        return scores['rouge-l']['f'] * 100  # Convert to percentage
    except Exception as e:
        print(f"ROUGE-L Error: {e}")
        print(f"Reference: '{reference[:50]}...'")
        print(f"Candidate: '{candidate[:50]}...'")
        return 0.0

def compute_bert_score_batched(references, candidates, batch_size=32):
    """Compute BERTScore in batches, loading the scoring model only once."""
    from bert_score import BERTScorer  # local import keeps this function self-contained

    # The module-level score() builds the model on every call; a BERTScorer
    # instance loads roberta-large once and is reused for every batch.
    scorer = BERTScorer(lang="en")
    all_P, all_R, all_F1 = [], [], []
    for i in range(0, len(references), batch_size):
        batch_refs = references[i:i + batch_size]
        batch_cands = candidates[i:i + batch_size]
        try:
            P, R, F1 = scorer.score(batch_cands, batch_refs, verbose=False)
            all_P.extend([p * 100 for p in P.tolist()])  # Convert to percentages
            all_R.extend([r * 100 for r in R.tolist()])  # Convert to percentages
            all_F1.extend([f * 100 for f in F1.tolist()])  # Convert to percentages
        except Exception as e:
            print(f"BERTScore Error in batch {i}: {e}")
            # Pad failed batches with zeros so the lists stay aligned with the DataFrame rows
            batch_len = len(batch_refs)
            all_P.extend([0.0] * batch_len)
            all_R.extend([0.0] * batch_len)
            all_F1.extend([0.0] * batch_len)
    return all_P, all_R, all_F1

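def evaluate_summaries(df):
    """Score each generated summary against its reference and add the metrics as columns.

    Expects 'output' (reference) and 'generated_summary' (candidate) columns.
    """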
    bleu1_scores, bleu2_scores, rouge_l_scores = [], [], []
    print("Computing BLEU and ROUGE-L scores...")
    
    with tqdm(total=len(df), desc="Processing Rows", unit="row") as pbar:
        for _, row in df.iterrows():
            reference = clean_text(row['output'])
            candidate = clean_text(row['generated_summary'])
            
            if not reference or not candidate:
                print(f"Empty text - Reference: '{reference}', Candidate: '{candidate}'")
                bleu1_scores.append(0.0)
                bleu2_scores.append(0.0)
                rouge_l_scores.append(0.0)
            else:
                bleu1, bleu2 = compute_bleu_scores(reference, candidate)
                bleu1_scores.append(bleu1)
                bleu2_scores.append(bleu2)
                rouge_l_scores.append(compute_rouge_l(reference, candidate))
            
            pbar.update(1)
    
    print("\nComputing BERTScore...")
    references = [clean_text(text) for text in df['output'].tolist()]
    candidates = [clean_text(text) for text in df['generated_summary'].tolist()]
    bert_p, bert_r, bert_f1 = compute_bert_score_batched(references, candidates)
    
    # Add all scores to DataFrame
    df['bleu1'] = bleu1_scores
    df['bleu2'] = bleu2_scores
    df['rouge_l'] = rouge_l_scores
    df['bert_p'] = bert_p
    df['bert_r'] = bert_r
    df['bert_f1'] = bert_f1
    
    # Print evaluation metrics
    print("\nEvaluation Metrics (in percentages):")
    print("Average BLEU-1:", df['bleu1'].mean(), "%")
    print("Average BLEU-2:", df['bleu2'].mean(), "%")
    print("Average ROUGE-L:", df['rouge_l'].mean(), "%")
    print("Average BERT P:", df['bert_p'].mean(), "%")
    print("Average BERT R:", df['bert_r'].mean(), "%")
    print("Average BERT F1:", df['bert_f1'].mean(), "%")
    
    # Print standard deviations
    print("\nStandard Deviations (in percentages):")
    print("BLEU-1 Std:", df['bleu1'].std(), "%")
    print("BLEU-2 Std:", df['bleu2'].std(), "%")
    print("ROUGE-L Std:", df['rouge_l'].std(), "%")
    print("BERT F1 Std:", df['bert_f1'].std(), "%")
    print("BERT P Std:", df['bert_p'].std(), "%")
    print("BERT R Std:", df['bert_r'].std(), "%")
    
    
    return df

# Score the Longformer summaries and save the per-row metrics
longformer_generated_summaries = evaluate_summaries(longformer_generated_summaries)
longformer_generated_summaries.to_csv("longformer_evaluation_results.csv", index=False)
print("\nResults saved to 'longformer_evaluation_results.csv'")


Computing BLEU and ROUGE-L scores...


Processing Rows: 100%|██████████| 100/100 [00:00<00:00, 578.92row/s]


Computing BERTScore...



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluation Metrics (in percentages):
Average BLEU-1: 2.1845065608225953 %
Average BLEU-2: 1.1891421036571728 %
Average ROUGE-L: 6.762804383330671 %
Average BERT P: 85.6003046631813 %
Average BERT R: 79.89022243022919 %
Average BERT F1: 82.63759052753448 %

Standard Deviations (in percentages):
BLEU-1 Std: 3.6347510655731963 %
BLEU-2 Std: 2.562401702906771 %
ROUGE-L Std: 5.85447961411285 %
BERT F1 Std: 2.877933082281974 %
BERT P Std: 2.944374474744506 %
BERT R Std: 3.044152994052751 %

Results saved to 'longformer_evaluation_results.csv'
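
One plausible reading of the low BLEU figures, offered as a sketch rather than a diagnosis: BLEU multiplies the clipped n-gram precision by a brevity penalty, exp(1 - r/c), whenever the candidate (length c) is no longer than the reference (length r). The summary printed for row 50 is a single clause, while the references are multi-section SOAP notes, so the penalty alone may explain much of the gap. The helper `bleu1` and the toy strings below are illustrative and not part of the notebook; smoothing is left out because it only matters when a precision is zero.

```python
import math
from collections import Counter

def bleu1(reference: str, candidate: str) -> tuple[float, float, float]:
    """Return (clipped unigram precision, brevity penalty, BLEU-1) for one pair."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0, 0.0, 0.0
    ref_counts = Counter(ref_tokens)
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(cand_tokens)
    r, c = len(ref_tokens), len(cand_tokens)
    # Brevity penalty: 1 for candidates longer than the reference, exp(1 - r/c) otherwise
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return precision, bp, precision * bp

# Toy strings (made up, not taken from the dataset): a 10-token candidate
# that is fully contained in a 50-token reference.
candidate = "patient reports lower back pain radiating to the left leg"
reference = candidate + " " + " ".join(f"detail{i}" for i in range(40))

precision, bp, bleu = bleu1(reference, candidate)
print(f"precision={precision:.2f}  brevity_penalty={bp:.4f}  BLEU-1={bleu * 100:.2f}%")
```

Every candidate token appears in the reference here, yet BLEU-1 lands below 2% because the reference is five times longer. Comparing candidate and reference lengths in the real data would show whether the same effect applies.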


### LLM as a judge

In [5]:
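import os
from huggingface_hub import login

# Read the Hugging Face token from the environment instead of hardcoding it.
# A token committed to a notebook should be treated as leaked and revoked.
login(token=os.environ["HF_TOKEN"])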

In [3]:
# Install Hugging Face Transformers
!pip install transformers
!pip install sacremoses
!pip install bitsandbytes accelerate

Collecting transformers
  Downloading transformers-4.51.2-py3-none-any.whl.metadata (38 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2024.11.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading transformers-4.51.2-py3-none-any.whl (10.4 MB)
Downloading huggingface_hub-0.30.2-py3-none-any.whl (481 kB)
Downloading regex-2024.11.6-cp310-cp310-manylinux_2_17_x86_

In [8]:
import torch
import pandas as pd
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

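# Step 1: Use 8-bit quantization for efficient inference
# (BitsAndBytesConfig has no `bnb_8bit_compute_dtype` option; only the 4-bit path
# takes a compute dtype, so that unrecognized keyword is dropped here.)
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True
)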

# Step 2: Load tokenizer & model (Gemma 3 1B Instruction-tuned)
model_id = "google/gemma-3-1b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
).eval()

# Step 3: Use your existing DataFrame
df = longformer_generated_summaries

# Step 4: Define the evaluation prompt
def create_prompt(input_text, output_text, summary):
    return f"""You are a helpful clinical NLP evaluation assistant.

Input Text:
{input_text}

Reference Summary:
{output_text}

Generated Summary:
{summary}

Evaluate the generated summary using the following criteria:
1. Does it capture the main ideas of the reference summary? (Yes/No)
2. Is it coherent and logically structured? (Yes/No)
3. Are there factual inaccuracies or important omissions? (List any)
4. Rate the summary from 1 to 5 based on how well it captures the reference summary.

Please give your evaluation in this format:
- Captures main ideas: [Yes/No]
- Coherence: [Yes/No]
- Issues: [Write here or 'None']
- Score: [1-5]
"""

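# Step 5: Run evaluation
results = []
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    prompt = create_prompt(row['input'], row['output'], row['generated_summary'])
    # Caution: with right-side truncation (the usual default), only the first 1024 tokens
    # are kept, so a long transcript can push the generated summary and the
    # instructions out of the prompt.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)

    # Use max_new_tokens instead of max_length (which also counts the prompt) to avoid the OOM error
    outputs = model.generate(**inputs, max_new_tokens=150)
    # outputs[0] holds prompt + continuation, so the decoded text starts with the prompt itself
    eval_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    results.append(eval_response)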

# Step 6: Save evaluations to the DataFrame
df['evaluation_gemma'] = results

# Optional: Save to file
df.to_csv("evaluated_summaries_gemma.csv", index=False)
print(df[['input', 'generated_summary', 'evaluation_gemma']].head())


100%|██████████| 100/100 [48:30<00:00, 29.11s/it]

                                               input  \
0  Good afternoon, champ, how you holding up? Goo...   
1  What brings you in here today? Hi, I'm um, I'm...   
2  Do you have any known allergies to medications...   
3  How may I help you today? Yeah I've had, a fev...   
4  It sounds like that you're experiencing some c...   

                                   generated_summary  \
0  A 75-year-old man is experiencing chronic lowe...   
1  , but after that it seemed to clear up a bit. ...   
2  The individual in question is a patient who ha...   
3  that someone else could be infected. But, yeah...   
4  into this further, including running some test...   

                                    evaluation_gemma  
0  You are a helpful clinical NLP evaluation assi...  
1  You are a helpful clinical NLP evaluation assi...  
2  You are a helpful clinical NLP evaluation assi...  
3  You are a helpful clinical NLP evaluation assi...  
4  You are a helpful clinical NLP evaluation assi..
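
Every `evaluation_gemma` entry above begins with the prompt, because the decoded sequence contains the prompt followed by the model's continuation. The sketch below shows one way to recover the judge's verdict from such a string: cut everything up to the last line of the format template, then pull the four fields out with regular expressions. The helpers `strip_echoed_prompt` and `parse_judge_reply`, the marker string, and the sample reply are assumptions for illustration; the reply is made up and is not model output from this run. Slicing off the prompt's token ids before decoding would be a cleaner alternative when the generation loop can be rerun.

```python
import re

# Last line of the format template in the prompt; text after it is the continuation
TEMPLATE_END = "- Score: [1-5]"

def strip_echoed_prompt(decoded: str, marker: str = TEMPLATE_END) -> str:
    """Return the text after the last occurrence of marker (whole string if absent)."""
    return decoded.rpartition(marker)[2]

def parse_judge_reply(reply: str) -> dict:
    """Extract the four requested fields; fields that are missing come back as None."""
    patterns = {
        "captures_main_ideas": r"Captures main ideas:[ \t]*\[?[ \t]*(Yes|No)\b",
        "coherence": r"Coherence:[ \t]*\[?[ \t]*(Yes|No)\b",
        "issues": r"Issues:[ \t]*(.+)",
        "score": r"Score:[ \t]*\[?[ \t]*([1-5])\b",
    }
    parsed = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, reply, flags=re.IGNORECASE)
        parsed[field] = match.group(1).strip() if match else None
    if parsed["score"] is not None:
        parsed["score"] = int(parsed["score"])
    return parsed

# Hypothetical decoded string: the tail of the prompt followed by an invented reply
decoded = (
    "Please give your evaluation in this format:\n"
    "- Captures main ideas: [Yes/No]\n"
    "- Coherence: [Yes/No]\n"
    "- Issues: [Write here or 'None']\n"
    "- Score: [1-5]\n"
    "\n"
    "- Captures main ideas: No\n"
    "- Coherence: Yes\n"
    "- Issues: Omits the Borax detail\n"
    "- Score: 2\n"
)

parsed = parse_judge_reply(strip_echoed_prompt(decoded))
print(parsed)
```

Stripping the prompt first matters: parsed directly, the template line `- Score: [1-5]` would itself match the score pattern and yield a spurious 1. Replies that do not follow the format return `None` for the affected fields, which makes it easy to count how often the judge ignored the requested layout.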




In [11]:
print(df['evaluation_gemma'].iloc[5])

You are a helpful clinical NLP evaluation assistant.

Input Text:
Hi there! What brings you in today? Guest_family: I think my baby got into the ant bait. I am not sure if he consumed any of it but he was under the counter and it was in his hands. What kind ant bait did he get into? Guest_family: It was the one with Borax in it. Do you have a picture of it? Guest_family: Yes. It is in my phone.

Reference Summary:
Subjective:
- Concern that the baby may have gotten into ant bait containing Borax.
- Uncertainty about whether the baby consumed any of it.
- The baby was found under the counter with the ant bait in his hands.

Objective:
- No measurable or observable data provided.

Assessment:
- No clinician's interpretation or diagnosis provided.

Plan:
- No specific actions, medications, tests, follow-up, or patient education provided.

Generated Summary:
A guest's infant son was found to be under the kitchen counter, having ingested some type of

Evaluate the generated summary using th