In [1]:
!pip install -q nltk bert-score rouge-metric

In [2]:
import pandas as pd

# Load the saved CSV file
generated_summaries_with_reduced_text = pd.read_csv("summarization_results.csv")

# Verify the data
print(generated_summaries_with_reduced_text.head())

          note_id                                              input  \
0  16002318-DS-17  <SEX> F <SERVICE> SURGERY <ALLERGIES> Iodine /...   
1   15638884-DS-4  <SEX> M <SERVICE> MEDICINE <ALLERGIES> Augment...   
2  12435705-DS-14  <SEX> M <SERVICE> MEDICINE <ALLERGIES> ibuprof...   
3   12413577-DS-4  <SEX> F <SERVICE> OBSTETRICS/GYNECOLOGY <ALLER...   
4  17967161-DS-29  <SEX> M <SERVICE> SURGERY <ALLERGIES> lisinopr...   

                                              target  input_tokens  \
0  This is a ___ yo F admitted to the hospital af...          1195   
1  Mr. ___ is a ___ yo man with CAD with prior MI...          3496   
2  Mr. ___ is a ___ w/ Ph+ve ALL on dasatanib and...          5591   
3  On ___, Ms. ___ was admitted to the gynecology...          1119   
4  Mr. ___ underwent an angiogram on ___ which sh...          3307   

   target_tokens                                       reduced_text  \
0             75  <|begin_of_text|><SEX> F <SERVICE> SURGERY <AL...   
1   

In [4]:
generated_summaries_with_reduced_text.nunique()

note_id              100
input                100
target               100
input_tokens         100
target_tokens         91
reduced_text         100
importance_scores    100
generated_summary    100
bleu1                 99
bleu2                 99
rouge_l               99
bert_p               100
bert_r               100
bert_f1              100
dtype: int64
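
The counts above show 99 distinct values for bleu1, bleu2 and rouge_l across 100 rows, so one value in each column occurs twice. One possible cause (an assumption, not something the output confirms) is the 0.0 fallback that the scoring code uses for empty text or errors. A minimal sketch for finding which values repeat; `repeated_values` and the sample list are hypothetical and are not taken from the results file:

```python
from collections import Counter

def repeated_values(scores):
    """Return {value: count} for every score that occurs more than once."""
    counts = Counter(scores)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical scores for illustration only.
example_scores = [4.2, 0.0, 7.9, 0.0, 1.3]
print(repeated_values(example_scores))
```

On the real data this could be called as `repeated_values(generated_summaries_with_reduced_text['bleu1'].tolist())`.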

In [5]:
generated_summaries_with_reduced_text['input'].iloc[20]

"<SEX> F <SERVICE> MEDICINE <ALLERGIES> Penicillins <ATTENDING> ___. <CHIEF COMPLAINT> DKA <MAJOR SURGICAL OR INVASIVE PROCEDURE> None <HISTORY OF PRESENT ILLNESS> ___ year old female with a history of type 1 diabetes mellitus presents with DKA secondary to mechanical failure of her insulin pump. She awoke around 3: 00 this morning with kinking of the tubing of her insulin pump. She has had problems before with her pump and she thought she had resolved it. Her glucometer measured a blood sugar>600 (the limits of her monitor). She also had some confusion and was persistently thirsty despite drinking large amounts of water. Mild nausea without vomiting or abdominal pain. She came to the ED. In the ED, initial vs were: T98.6, ___, BP 119/86, RR 18, O2 Sat 100% on RA. She had a 28 point gap. She was started on an insulin drip improvement in her sugars, started on fluids, and admitted. Review of systems: (+) Per HPI (-) Denies fever, chills, night sweats, recent weight loss or gain. Denies 

In [7]:
generated_summaries_with_reduced_text['target'].iloc[36]

"The patient was admitted to the hospital for elective repair of a ventral hernia. He was taken to the operating room where he underwent a ventral hernia repair with mesh overlay and lysis of adhesion. A thoracic epidural catheter was placed prior to the procedure for pain management. The operative course was stable with minimal blood loss. The patient was extubated after the procedure and monitored in the recovery room. His post-operative course was stable. The patient resumed a regular diet on the operative day. His foley catheter was removed on POD #1 and the patient voided without difficulty. He was transitioned from intravenous analgesia to an oral pain regimen and his epidural catheter was removed. He was reported to have a white blood cell count of 20 on POD #1. The patient's hemodynamic status remained stable and there were no signs of wound infection. On POD #2, his white blood cell count remained elevated to 22 and he underwent a chest x-ray which revealed a right-sided conso

In [6]:
generated_summaries_with_reduced_text['generated_summary'].iloc[36]

'consciousness = stable <CONCLUSION>\n \n\nNote:\n\nThe final answer is:\n\n\n**Patient Profile:** Male\n\n**Chief Complaints**: Chest pain and shortness of breath\n\n**Diagnosis**: Myocardial Infarction (MI)\n\n**Treatment**: Aspirin treatment and admission to Cardiac Unit\n\n\n**Physical Examination Findings**\n\n* Pulse: Regular, 70 beats/min\n* Blood Pressure: 120/80 mmHg\n* Cardiovascular Exam Revealed:\n\t+ Regular rhythm\n\t+ Normal Apical Impulse\n\t+ Normal Heart Sounds\n\t+ No Significant Murmur\n\t+ Edematous extremities\n* Lung Function Tests:\n\t+ CTP scan shows scattered wheezes but no rales or rhonchi\n\t+ Absence of pneumonia signs\n* Vital Signs:\n\t+ Temperature: Not mentioned\n\t+ Respiratory Rate: Scattered\n\t+ Oxygen Saturation: Not mentioned\n\t+ Other vital signs within normal limits except'
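
The target for row 36 describes a ventral hernia repair, while the generated summary for the same row describes chest pain and a myocardial infarction, so very few words can overlap. A rough sketch of the clipped unigram precision that underlies BLEU-1 makes this concrete. The two strings are short, punctuation-stripped excerpts of the outputs above, and `clipped_unigram_precision` is a hypothetical helper; NLTK's `sentence_bleu` additionally applies a brevity penalty and smoothing, so its numbers will differ.

```python
from collections import Counter

def clipped_unigram_precision(reference, candidate):
    """Fraction of candidate tokens that also occur in the reference.

    Each candidate token is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    ref_counts = Counter(reference.lower().split())
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    cand_counts = Counter(cand_tokens)
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)

# Simplified excerpts of the target and generated summary for row 36.
reference = "the patient was admitted to the hospital for elective repair of a ventral hernia"
candidate = "chief complaints chest pain and shortness of breath"

# Only "of" is shared, so 1 of the 8 candidate tokens matches.
print(clipped_unigram_precision(reference, candidate))  # 0.125
```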

In [3]:
from tqdm import tqdm
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score
from rouge_metric import PyRouge

def clean_text(text):
    """Clean and normalize text."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    return ' '.join(text.strip().lower().split())  # Lowercase, strip spaces, normalize.

def compute_bleu_scores(reference, candidate):
    """Compute BLEU-1 and BLEU-2 scores."""
    try:
        smoothing_function = SmoothingFunction().method1
        # Compute BLEU-1
        bleu1 = sentence_bleu(
            [reference.split()],
            candidate.split(),
            weights=(1.0, 0, 0, 0),  # Only unigrams
            smoothing_function=smoothing_function
        )
        # Compute BLEU-2
        bleu2 = sentence_bleu(
            [reference.split()],
            candidate.split(),
            weights=(0.5, 0.5, 0, 0),  # Unigrams and bigrams only
            smoothing_function=smoothing_function
        )
        return bleu1 * 100, bleu2 * 100  # Convert to percentages
    except Exception as e:
        print(f"BLEU Error: {e}")
        print(f"Reference: '{reference[:50]}...'")
        print(f"Candidate: '{candidate[:50]}...'")
        return 0.0, 0.0

def compute_rouge_l(reference, candidate):
    """Compute ROUGE-L score."""
    rouge = PyRouge(rouge_n=(1, 2), rouge_l=True, rouge_w=False,
                    rouge_w_weight=1.2, rouge_s=False, rouge_su=False, skip_gap=4)
    try:
        scores = rouge.evaluate([candidate], [[reference]])
        return scores['rouge-l']['f'] * 100  # Convert to percentage
    except Exception as e:
        print(f"ROUGE-L Error: {e}")
        print(f"Reference: '{reference[:50]}...'")
        print(f"Candidate: '{candidate[:50]}...'")
        return 0.0

def compute_bert_score_batched(references, candidates, batch_size=32):
    """Compute BERTScore in batches."""
    all_P, all_R, all_F1 = [], [], []
    for i in range(0, len(references), batch_size):
        batch_refs = references[i:i + batch_size]
        batch_cands = candidates[i:i + batch_size]
        try:
            P, R, F1 = score(batch_cands, batch_refs, lang="en", verbose=False)
            all_P.extend([p * 100 for p in P.tolist()])  # Convert to percentages
            all_R.extend([r * 100 for r in R.tolist()])  # Convert to percentages
            all_F1.extend([f * 100 for f in F1.tolist()])  # Convert to percentages
        except Exception as e:
            print(f"BERTScore Error in batch {i}: {e}")
            batch_len = len(batch_refs)
            all_P.extend([0.0] * batch_len)
            all_R.extend([0.0] * batch_len)
            all_F1.extend([0.0] * batch_len)
    return all_P, all_R, all_F1

def evaluate_summaries(df):
    """Score each generated summary against its target and add the metric columns to df."""
    bleu1_scores, bleu2_scores, rouge_l_scores = [], [], []
    print("Computing BLEU and ROUGE-L scores...")
    
    with tqdm(total=len(df), desc="Processing Rows", unit="row") as pbar:
        for _, row in df.iterrows():
            reference = clean_text(row['target'])
            candidate = clean_text(row['generated_summary'])
            
            if not reference or not candidate:
                print(f"Empty text - Reference: '{reference}', Candidate: '{candidate}'")
                bleu1_scores.append(0.0)
                bleu2_scores.append(0.0)
                rouge_l_scores.append(0.0)
            else:
                bleu1, bleu2 = compute_bleu_scores(reference, candidate)
                bleu1_scores.append(bleu1)
                bleu2_scores.append(bleu2)
                rouge_l_scores.append(compute_rouge_l(reference, candidate))
            
            pbar.update(1)
    
    print("\nComputing BERTScore...")
    references = [clean_text(text) for text in df['target'].tolist()]
    candidates = [clean_text(text) for text in df['generated_summary'].tolist()]
    bert_p, bert_r, bert_f1 = compute_bert_score_batched(references, candidates)
    
    # Add all scores to DataFrame
    df['bleu1'] = bleu1_scores
    df['bleu2'] = bleu2_scores
    df['rouge_l'] = rouge_l_scores
    df['bert_p'] = bert_p
    df['bert_r'] = bert_r
    df['bert_f1'] = bert_f1
    
    # Print evaluation metrics
    print("\nEvaluation Metrics (in percentages):")
    print("Average BLEU-1:", df['bleu1'].mean(), "%")
    print("Average BLEU-2:", df['bleu2'].mean(), "%")
    print("Average ROUGE-L:", df['rouge_l'].mean(), "%")
    print("Average BERT P:", df['bert_p'].mean(), "%")
    print("Average BERT R:", df['bert_r'].mean(), "%")
    print("Average BERT F1:", df['bert_f1'].mean(), "%")
    
    # Print standard deviations
    print("\nStandard Deviations (in percentages):")
    print("BLEU-1 Std:", df['bleu1'].std(), "%")
    print("BLEU-2 Std:", df['bleu2'].std(), "%")
    print("ROUGE-L Std:", df['rouge_l'].std(), "%")
    print("BERT F1 Std:", df['bert_f1'].std(), "%")
    print("BERT P Std:", df['bert_p'].std(), "%")
    print("BERT R Std:", df['bert_r'].std(), "%")
    
    return df

AGTD_Summaries_mimic_iv_bhc_100 = evaluate_summaries(generated_summaries_with_reduced_text)
AGTD_Summaries_mimic_iv_bhc_100.to_csv("AGTD_evaluation_results.csv", index=False)
print("\nResults saved to 'AGTD_evaluation_results.csv'")

Computing BLEU and ROUGE-L scores...
Processing Rows: 100%|██████████| 100/100 [00:02<00:00, 48.67row/s]

Computing BERTScore...

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluation Metrics (in percentages):
Average BLEU-1: 4.49476491641967 %
Average BLEU-2: 1.324374356577883 %
Average ROUGE-L: 5.577647681622525 %
Average BERT P: 80.32063806056976 %
Average BERT R: 78.61822867393494 %
Average BERT F1: 79.43593049049377 %

Standard Deviations (in percentages):
BLEU-1 Std: 5.477810946592718 %
BLEU-2 Std: 2.0429039129104316 %
ROUGE-L Std: 2.9416886652155423 %
BERT F1 Std: 1.7633774807706055 %
BERT P Std: 2.384023658631612 %
BERT R Std: 2.068661411975958 %

Results saved to 'evaluation_results.csv'
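
The ROUGE-L figure reported above is based on the longest common subsequence (LCS) of reference and candidate tokens. A minimal pure-Python sketch of that idea follows, assuming whitespace tokenization and a plain F1 (equal weight on precision and recall). `lcs_length` and `rouge_l_f1` are hypothetical helpers, and PyRouge may tokenize and weight precision and recall differently, so this is not expected to reproduce its numbers exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """LCS-based F1 between two whitespace-tokenized strings (0.0 to 1.0)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)      # share of the reference that is covered
    precision = lcs / len(cand)  # share of the candidate that is in order
    return 2 * precision * recall / (precision + recall)

# LCS is "the cat on mat" (4 tokens): recall 4/6, precision 4/4.
print(rouge_l_f1("the cat sat on the mat", "the cat on mat"))
```

Because the LCS rewards only in-order matches, a generated summary that drifts to a different clinical picture than its target scores close to zero regardless of fluency.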
