# **Medical Report Summarization with BART**

---



**Task:** Create a Python script that extracts and summarizes medical reports using BERT, based on a given patient ID. The script should search for the patient ID in a dataset of medical reports, extract the relevant information, and generate a concise summary of the report utilizing BERT for Natural Language Processing.

In [61]:
import pandas as pd
from transformers import pipeline
import matplotlib.pyplot as plt
from IPython.display import display

In [62]:
df = pd.read_excel('/content/medical_records.xlsx')
df.head()

Unnamed: 0,Patient ID,Name,Age,Gender,Diagnosis,Medical History,Prescribed Medication,Lab Test Results,Doctor's Notes
0,1,Jaime Lynch,47,Male,Hypertension,Chronic back pain,Atorvastatin,Elevated blood sugar,Clearly describe tree onto situation middle ea...
1,2,Luke Martin,98,Male,Allergy,Allergic to penicillin; Obesity; Family histor...,Ibuprofen,Elevated blood sugar,Factor everything program never.\nMeet Mrs tho...
2,3,Dr. Laura Moody DDS,55,Other,Heart Disease,Alcoholic; Obesity,Ibuprofen,High cholesterol; Normal ECG; Positive COVID-1...,Manage shake visit.\nStudent maintain whole ap...
3,4,Dennis Valentine MD,94,Male,Migraine,Alcoholic,Antihistamine,High cholesterol; Positive COVID-19 test; Elev...,Night finally heart coach so. Again marriage i...
4,5,Anne Gonzalez,79,Male,Arthritis,Allergic to penicillin; Smoker; Alcoholic,Metformin,Normal ECG; Normal ECG; Normal ECG,Seem these have above. Hear mention final behi...


In [63]:
df = df.drop(columns=["Doctor's Notes"])
df.head()

Unnamed: 0,Patient ID,Name,Age,Gender,Diagnosis,Medical History,Prescribed Medication,Lab Test Results
0,1,Jaime Lynch,47,Male,Hypertension,Chronic back pain,Atorvastatin,Elevated blood sugar
1,2,Luke Martin,98,Male,Allergy,Allergic to penicillin; Obesity; Family histor...,Ibuprofen,Elevated blood sugar
2,3,Dr. Laura Moody DDS,55,Other,Heart Disease,Alcoholic; Obesity,Ibuprofen,High cholesterol; Normal ECG; Positive COVID-1...
3,4,Dennis Valentine MD,94,Male,Migraine,Alcoholic,Antihistamine,High cholesterol; Positive COVID-19 test; Elev...
4,5,Anne Gonzalez,79,Male,Arthritis,Allergic to penicillin; Smoker; Alcoholic,Metformin,Normal ECG; Normal ECG; Normal ECG


In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Patient ID             5000 non-null   int64 
 1   Name                   5000 non-null   object
 2   Age                    5000 non-null   int64 
 3   Gender                 5000 non-null   object
 4   Diagnosis              5000 non-null   object
 5   Medical History        5000 non-null   object
 6   Prescribed Medication  4386 non-null   object
 7   Lab Test Results       5000 non-null   object
dtypes: int64(2), object(6)
memory usage: 312.6+ KB


### Dataset Summary:

This dataset contains **5000 records** (rows) with the following 8 columns:

1. **Patient ID**: Unique integer identifier for each patient (ranging from 1 to 5000).
2. **Name**: Pseudonymized name of the patient.
3. **Age**: The patient's age (ranging from 1 to 99 years).
4. **Gender**: The gender of the patient (e.g., "Male", "Female", "Other").
5. **Diagnosis**: The diagnosis provided to the patient, stored as text.
6. **Medical History**: Medical history of the patient, stored as text.
7. **Prescribed Medication**: Medications prescribed to the patient (not all records have this field populated).
8. **Lab Test Results**: Results of lab tests for the patient, stored as text.

#### Descriptive Statistics:
- **Patient ID**: Range from 1 to 5000.
- **Age**: Mean age is approximately 50.3 years, with values ranging from 1 to 99. The 25th percentile is 26, the 50th percentile (median) is 50, and the 75th percentile is 75.
  
#### Missing Data:
- **Prescribed Medication** has missing values in 614 records (4386 non-null entries out of 5000).

In [65]:
df.describe()

Unnamed: 0,Patient ID,Age
count,5000.0,5000.0
mean,2500.5,50.3104
std,1443.520003,28.513363
min,1.0,1.0
25%,1250.75,26.0
50%,2500.5,50.0
75%,3750.25,75.0
max,5000.0,99.0


In [66]:
df.isnull().sum()

Unnamed: 0,0
Patient ID,0
Name,0
Age,0
Gender,0
Diagnosis,0
Medical History,0
Prescribed Medication,614
Lab Test Results,0


In [67]:
df['Prescribed Medication']

Unnamed: 0,Prescribed Medication
0,Atorvastatin
1,Ibuprofen
2,Ibuprofen
3,Antihistamine
4,Metformin
...,...
4995,Metformin
4996,Paracetamol
4997,
4998,Atorvastatin


**Handling missing values**

In [68]:
df['Prescribed Medication'] = df['Prescribed Medication'].fillna('Not Available')

In [69]:
df.isnull().sum()

Unnamed: 0,0
Patient ID,0
Name,0
Age,0
Gender,0
Diagnosis,0
Medical History,0
Prescribed Medication,0
Lab Test Results,0


In [70]:
df['Gender'] = df['Gender'].str.lower()
df.head()

Unnamed: 0,Patient ID,Name,Age,Gender,Diagnosis,Medical History,Prescribed Medication,Lab Test Results
0,1,Jaime Lynch,47,male,Hypertension,Chronic back pain,Atorvastatin,Elevated blood sugar
1,2,Luke Martin,98,male,Allergy,Allergic to penicillin; Obesity; Family histor...,Ibuprofen,Elevated blood sugar
2,3,Dr. Laura Moody DDS,55,other,Heart Disease,Alcoholic; Obesity,Ibuprofen,High cholesterol; Normal ECG; Positive COVID-1...
3,4,Dennis Valentine MD,94,male,Migraine,Alcoholic,Antihistamine,High cholesterol; Positive COVID-19 test; Elev...
4,5,Anne Gonzalez,79,male,Arthritis,Allergic to penicillin; Smoker; Alcoholic,Metformin,Normal ECG; Normal ECG; Normal ECG


**Removing special characters**

In [71]:
df['Medical History'] = df['Medical History'].str.lower().str.replace(r'[^a-zA-Z\s]', '', regex=True)
df['Diagnosis'] = df['Diagnosis'].str.lower().str.replace(r'[^a-zA-Z\s]', '', regex=True)

In [72]:
df.head()

Unnamed: 0,Patient ID,Name,Age,Gender,Diagnosis,Medical History,Prescribed Medication,Lab Test Results
0,1,Jaime Lynch,47,male,hypertension,chronic back pain,Atorvastatin,Elevated blood sugar
1,2,Luke Martin,98,male,allergy,allergic to penicillin obesity family history ...,Ibuprofen,Elevated blood sugar
2,3,Dr. Laura Moody DDS,55,other,heart disease,alcoholic obesity,Ibuprofen,High cholesterol; Normal ECG; Positive COVID-1...
3,4,Dennis Valentine MD,94,male,migraine,alcoholic,Antihistamine,High cholesterol; Positive COVID-19 test; Elev...
4,5,Anne Gonzalez,79,male,arthritis,allergic to penicillin smoker alcoholic,Metformin,Normal ECG; Normal ECG; Normal ECG
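
To see what the cleaning expression above does to a single value, the sketch below applies the same pattern with Python's `re` module instead of the pandas `.str` accessor. The helper name `clean_field` is hypothetical and not part of the notebook. Note that the pattern removes the `;` separators along with every digit, so list boundaries inside a field are lost and a removed digit can leave a double space behind.

```python
import re

def clean_field(text: str) -> str:
    """Lower-case the text and drop every character that is not a letter or whitespace."""
    return re.sub(r"[^a-zA-Z\s]", "", text.lower())

# Value taken from the Medical History column shown in the table output
print(clean_field("Allergic to penicillin; Smoker; Alcoholic"))
# prints: allergic to penicillin smoker alcoholic

# Hypothetical value (not from the dataset): the digit is removed too,
# which leaves two consecutive spaces in the result
print(repr(clean_field("Type 2 diabetes")))
```

If the separators matter for later steps, one option is to keep `;` in the allowed character set; whether that is desirable depends on how the cleaned text is used downstream.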


In [73]:
df1 = df.copy()
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Patient ID             5000 non-null   int64 
 1   Name                   5000 non-null   object
 2   Age                    5000 non-null   int64 
 3   Gender                 5000 non-null   object
 4   Diagnosis              5000 non-null   object
 5   Medical History        5000 non-null   object
 6   Prescribed Medication  5000 non-null   object
 7   Lab Test Results       5000 non-null   object
dtypes: int64(2), object(6)
memory usage: 312.6+ KB


**Extract relevant information for a given patient ID**

In [74]:
def extract_patient_data(patient_id):
    try:
        patient_id_int = int(patient_id)
    except ValueError:
        return "Invalid Patient ID input.", None

    if patient_id_int < 1 or patient_id_int > 5000:
        return "Patient ID out of range. Please enter a number between 1 and 5000.", None

    patient_data = df1[df1['Patient ID'] == patient_id_int]
    if patient_data.empty:
        return "Patient ID not found.", None

    # Extract all columns from the first matched row
    patient_data = patient_data.iloc[0].to_dict()
    return None, patient_data
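
The function above returns an `(error, data)` pair in which exactly one element is `None`. The following is a minimal sketch of that contract on a tiny illustrative frame. The name `lookup_patient` and the `records` frame are assumptions made for this example; unlike the notebook's function, this variant takes the DataFrame as an argument and skips the hard-coded 1 to 5000 range check.

```python
import pandas as pd

def lookup_patient(records: pd.DataFrame, patient_id):
    """Return (error_message, patient_dict); exactly one of the two is None."""
    try:
        patient_id_int = int(patient_id)
    except (TypeError, ValueError):
        return "Invalid Patient ID input.", None

    match = records[records["Patient ID"] == patient_id_int]
    if match.empty:
        return "Patient ID not found.", None
    # First matching row as a plain {column: value} dictionary
    return None, match.iloc[0].to_dict()

# Tiny illustrative frame, not the real dataset
records = pd.DataFrame({"Patient ID": [1, 2], "Diagnosis": ["flu", "migraine"]})

print(lookup_patient(records, "2"))    # successful lookup: error is None
print(lookup_patient(records, "abc"))  # non-numeric input: data is None
print(lookup_patient(records, 99))     # unknown ID: data is None
```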

In [75]:
def preprocess_text(text):
    # Placeholder for text preprocessing; currently it only trims surrounding whitespace
    return text.strip()
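
As written, `preprocess_text` only trims surrounding whitespace. A slightly fuller version might look like the sketch below, assuming that collapsing whitespace runs and tidying the `;` separators is the kind of cleanup wanted here. The name `normalize_text` is hypothetical.

```python
import re

def normalize_text(text: str) -> str:
    """Trim the text, collapse whitespace runs, and tidy ';' separators."""
    text = re.sub(r"\s+", " ", text).strip()   # newlines, tabs, repeated spaces -> one space
    text = re.sub(r"\s*;\s*", "; ", text)      # exactly one space after each ';'
    return text.rstrip("; ")                   # drop a dangling separator at the end

print(normalize_text("  High cholesterol ;Normal ECG;\n"))
# prints: High cholesterol; Normal ECG
```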

**Summarization using BART (the pre-trained `facebook/bart-large-cnn` checkpoint; T5 or BERTSUM would be alternatives)**

In [78]:
def summarize_text(text):
    """
    Summarizes the input text using the BART model.
    """
    # Preprocess the text before summarizing
    text = preprocess_text(text)

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    input_length = len(text.split())

    # Handle very short text inputs
    if input_length < 10:
        return text  # Return the original text if it's too short to summarize

    # Cap the summary at 50 new tokens, or at half the input's word count if that is smaller
    max_new_tokens = min(50, input_length // 2)
    summary = summarizer(text, max_new_tokens=max_new_tokens, do_sample=False)

    return summary[0]['summary_text']
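
`summarize_text` builds a new pipeline on every call, which reloads the model each time. One way to avoid that is to cache the pipeline, as in the sketch below. The names `get_summarizer`, `summarize_text_cached`, and `fake_summarizer`, as well as the optional `summarizer` parameter, are assumptions made for this example; the call into the pipeline uses the same arguments as the notebook.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_summarizer():
    """Build the BART summarization pipeline on first use and reuse it afterwards."""
    from transformers import pipeline  # imported lazily because loading the model is slow
    return pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_text_cached(text, summarizer=None, min_words=10, max_tokens=50):
    """Summarize text, returning it unchanged when it has fewer than min_words words."""
    text = text.strip()
    n_words = len(text.split())
    if n_words < min_words:
        return text
    if summarizer is None:
        summarizer = get_summarizer()
    # Same cap as the notebook: 50 new tokens or half the word count, whichever is smaller
    budget = min(max_tokens, n_words // 2)
    result = summarizer(text, max_new_tokens=budget, do_sample=False)
    return result[0]["summary_text"]

# Hypothetical stand-in so the control flow can be checked without downloading the model
def fake_summarizer(text, **kwargs):
    return [{"summary_text": f"budget={kwargs['max_new_tokens']}"}]

print(summarize_text_cached("chronic back pain"))            # short input is returned as-is
print(summarize_text_cached("word " * 30, fake_summarizer))  # prints: budget=15
```

Passing the summarizer in as an argument also makes it easy to swap in a different model or a test double.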

**Generate summary for all textual fields**

In [79]:
def generate_patient_summary(patient_data):
    """
    Generates meaningful summaries in sentence form for specific fields.
    """
    fields_to_summarize = ['Diagnosis', 'Medical History', 'Prescribed Medication', 'Lab Test Results']
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    summarized_data = {}
    for field in fields_to_summarize:
        text = patient_data[field]
        input_length = len(text.split())

        # Avoid summarization for very short texts
        if input_length < 10:
            summarized_data[field] = text  # Use original text if it's too short
        else:
            # Cap the summary at 50 new tokens, or at half the word count if that is smaller
            max_new_tokens = min(50, input_length // 2)
            summary = summarizer(text, max_new_tokens=max_new_tokens, do_sample=False)
            summarized_data[field] = summary[0]['summary_text']

    # Construct meaningful sentence outputs
    meaningful_summary = {
        "Diagnosis": f"The patient has been diagnosed with: {summarized_data['Diagnosis']}.",
        "Medical History": f"Medical history includes: {summarized_data['Medical History']}.",
        "Prescribed Medication": f"Prescribed medications are: {summarized_data['Prescribed Medication']}.",
        "Lab Test Results": f"Recent lab test results show: {summarized_data['Lab Test Results']}."
    }

    return meaningful_summary
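
The sentence templates above always append a period, so a field value or model output that already ends with punctuation would produce a doubled period. A small helper can guard against that; the sketch below is one possible approach, and the name `to_sentence` is hypothetical.

```python
def to_sentence(prefix: str, body: str) -> str:
    """Join a fixed prefix and a field value into one sentence with a single final period."""
    body = body.strip().rstrip(".;, ")  # remove trailing punctuation and spaces
    return f"{prefix} {body}."

print(to_sentence("Medical history includes:", "alcoholic chronic back pain"))
# prints: Medical history includes: alcoholic chronic back pain.
print(to_sentence("Recent lab test results show:", "High cholesterol; Normal ECG."))
# prints: Recent lab test results show: High cholesterol; Normal ECG.
```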

**Function to display patient info and summary**

In [80]:
def display_patient_info(patient_data, summarized_data):
    print("\n--- Complete Patient Information ---\n")
    # Convert patient data to DataFrame
    patient_info_df = pd.DataFrame([patient_data])
    patient_info_df.reset_index(drop=True, inplace=True)  # Renumber the index from 0 (this does not hide it)
    display(patient_info_df.style.set_table_styles(
        [{'selector': 'th', 'props': [('font-size', '14px'), ('text-align', 'center')]},
         {'selector': 'td', 'props': [('font-size', '12px')]}]
    ))

    print("\n--- Summarized Information ---\n")
    # Create a DataFrame for summarized sentences
    summary_df = pd.DataFrame(list(summarized_data.items()), columns=["Field", "Summary"])
    summary_df.reset_index(drop=True, inplace=True)  # Renumber the index from 0 (this does not hide it)
    display(summary_df.style.set_table_styles(
        [{'selector': 'th', 'props': [('font-size', '14px'), ('text-align', 'center')]},
         {'selector': 'td', 'props': [('font-size', '12px'), ('text-align', 'left'), ('word-wrap', 'break-word')]}]
    ))

**Main function for user interaction**

In [81]:
def main():
    while True:
        patient_id = input("Please enter the Patient ID : ")
        error, patient_data = extract_patient_data(patient_id)

        if error:
            print(error)  # Print the error message if there's an issue
        else:
            summarized_data = generate_patient_summary(patient_data)
            display_patient_info(patient_data, summarized_data)

        # Ask user if they want to search for another patient or end the process
        continue_search = input("Do you want to search for another patient? (yes to continue / no to end): ").strip().lower()

        if continue_search != 'yes':
            print("Exiting the program. Goodbye!")
            break

In [82]:
if __name__ == "__main__":
    main()

Please enter the Patient ID : 675

--- Complete Patient Information ---



Unnamed: 0,Patient ID,Name,Age,Gender,Diagnosis,Medical History,Prescribed Medication,Lab Test Results
0,675,Brianna Carter,93,female,flu,alcoholic chronic back pain,Atorvastatin,High cholesterol; Low hemoglobin; Positive COVID-19 test



--- Summarized Information ---



Unnamed: 0,Field,Summary
0,Diagnosis,The patient has been diagnosed with: flu.
1,Medical History,Medical history includes: alcoholic chronic back pain.
2,Prescribed Medication,Prescribed medications are: Atorvastatin.
3,Lab Test Results,Recent lab test results show: High cholesterol; Low hemoglobin; Positive COVID-19 test.


Do you want to search for another patient? (yes to continue / no to end): yes
Please enter the Patient ID : 8930
Patient ID out of range. Please enter a number between 1 and 5000.
Do you want to search for another patient? (yes to continue / no to end): yes
Please enter the Patient ID : 6

--- Complete Patient Information ---



Unnamed: 0,Patient ID,Name,Age,Gender,Diagnosis,Medical History,Prescribed Medication,Lab Test Results
0,6,Dana Pearson,17,male,flu,smoker no previous conditions alcoholic,Insulin,High cholesterol; Normal blood pressure



--- Summarized Information ---



Unnamed: 0,Field,Summary
0,Diagnosis,The patient has been diagnosed with: flu.
1,Medical History,Medical history includes: smoker no previous conditions alcoholic.
2,Prescribed Medication,Prescribed medications are: Insulin.
3,Lab Test Results,Recent lab test results show: High cholesterol; Normal blood pressure.


Do you want to search for another patient? (yes to continue / no to end): no
Exiting the program. Goodbye!


### Why BART Over BERT:

I used **BART** instead of **BERT** because BART is an encoder-decoder model built for text generation tasks such as **summarization**, whereas BERT is an encoder-only model aimed at text understanding and is not designed to generate text. BART pairs a BERT-style bidirectional encoder with a GPT-style autoregressive decoder, which suits it to producing concise summaries.


### Task Summary:

The script performs the following:
1. **Input**: User provides a Patient ID.
2. **Search**: Looks up the dataset for the given ID.
3. **Data Extraction**: Retrieves the patient's medical details.
4. **Summarization**: Uses **BART** to generate summaries of key fields (Diagnosis, Medical History, Medication, Lab Results).
5. **Output**: Displays both full patient info and summarized results.

Given a Patient ID, the script returns the full record together with a short summary of each key field. BART is invoked only for fields of at least 10 words; shorter fields are passed through unchanged.
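
The five steps listed above can be sketched end to end without pandas or a model, which makes the control flow easy to test. In this sketch the dataset is a plain dictionary keyed by Patient ID and the summarizer is passed in as a function; `run_lookup`, `records`, and `identity` are hypothetical names, and `identity` merely stands in for the BART call.

```python
FIELDS = ["Diagnosis", "Medical History", "Prescribed Medication", "Lab Test Results"]

def run_lookup(records, patient_id, summarize):
    """Search, extract, and summarize one patient; returns (error, record, summaries)."""
    try:
        key = int(patient_id)                    # 1. validate the input
    except (TypeError, ValueError):
        return "Invalid Patient ID input.", None, None

    record = records.get(key)                    # 2-3. search and extract
    if record is None:
        return "Patient ID not found.", None, None

    summaries = {field: summarize(record[field]) for field in FIELDS}  # 4. summarize
    return None, record, summaries               # 5. the caller displays both

# Illustrative in-memory data (values copied from the sample output for patient 6)
records = {
    6: {
        "Diagnosis": "flu",
        "Medical History": "smoker no previous conditions alcoholic",
        "Prescribed Medication": "Insulin",
        "Lab Test Results": "High cholesterol; Normal blood pressure",
    }
}

def identity(text):
    """Trivial stand-in for the summarizer: returns the text unchanged."""
    return text

error, record, summaries = run_lookup(records, "6", identity)
print(error, summaries["Diagnosis"])
```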