# ML-Based CDSS Summarization

This notebook generates medical summaries using a pre-trained ML model (BART-large-cnn) for comparison with RAG-based approaches.

## Approach
- **Model**: `facebook/bart-large-cnn` (BART-large fine-tuned for news summarization on CNN/DailyMail, roughly 400M parameters)
- **Method**: Zero-shot summarization with a medical guidance prefix prepended to the input (no fine-tuning on medical data)
- **Input**: Doctor-patient conversations with medical summarization instruction
- **Output**: Medical summaries (text format, comparable with RAG)


In [11]:
import pandas as pd
import numpy as np
from transformers import pipeline
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")


Libraries imported successfully!


## 1. Load Conversation Data


In [12]:
# Load the conversation data
df = pd.read_csv("/home/root495/Inexture/CDSS-RAG/data/raw/conversation_summary.csv")

# Extract first 15 rows (same as other notebooks)
df = df.head(15)

print(f"Loaded {len(df)} conversations")
print(f"Columns: {df.columns.tolist()}")
print(f"\nFirst conversation preview:")
print(df['conversation'].iloc[0][:200] + "...")


Loaded 15 conversations
Columns: ['conversation', 'summary']

First conversation preview:
Doctor: Hello? Hi. Um, should we start? Yeah, okay. Hello how um. Good morning sir, how can I help you this morning? Patient: Hello, how are you? Patient: Oh hey, um, I've just had some diarrhea for t...
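
The preview shows each conversation stored as one flat string with `Doctor:` and `Patient:` markers. A minimal sketch of splitting such a string into speaker turns for inspection follows; `split_turns` and `SPEAKER_PATTERN` are hypothetical names, and the sketch assumes those two markers are the only speaker labels in the data.

```python
import re

# Assumption: "Doctor:" and "Patient:" are the only speaker markers in the data.
SPEAKER_PATTERN = re.compile(r"\b(Doctor|Patient):\s*")

def split_turns(conversation: str) -> list[tuple[str, str]]:
    """Split a flat transcript into (speaker, utterance) pairs."""
    # With one capture group, re.split yields
    # [preamble, speaker, text, speaker, text, ...].
    parts = SPEAKER_PATTERN.split(conversation)
    turns = []
    for speaker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip empty turns such as "Patient: Doctor: ..."
            turns.append((speaker, text))
    return turns

sample = (
    "Doctor: Good morning sir, how can I help you this morning? "
    "Patient: Hello, how are you? "
    "Patient: Oh hey, um, I've just had some diarrhea."
)
turns = split_turns(sample)
for speaker, text in turns:
    print(f"{speaker:8s}| {text}")
```

Applied to the notebook's data, this could be called as `split_turns(df['conversation'].iloc[0])` to see how many turns a conversation contains before it is truncated for the model.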


## 2. Initialize ML Model (BART-large-cnn)


In [13]:
# Initialize BART-large-cnn model for summarization
# This model is specifically trained for summarization tasks
# Using large-cnn instead of base for better summarization quality
print("Loading BART-large-cnn model...")
print("Note: First time loading will download the model (~1.6GB)")

summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=-1  # -1 runs on CPU; set to 0 to use the first GPU if one is available
)

print("Model loaded successfully!")
print(f"Model: facebook/bart-large-cnn (trained specifically for summarization)")


Loading BART-large-cnn model...
Note: First time loading will download the model (~1.6GB)


Model loaded successfully!
Model: facebook/bart-large-cnn (trained specifically for summarization)
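
The cell above hard-codes `device=-1` (CPU). As a small sketch, the device index could instead be chosen at runtime. `pick_device` is a hypothetical helper, and the sketch assumes PyTorch is the backend in use.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when PyTorch reports one, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # PyTorch not installed, fall back to the CPU index
    return 0 if torch.cuda.is_available() else -1

device = pick_device()
print(f"pipeline device argument: {device}")
```

The result can then be passed as `device=device` in the `pipeline(...)` call.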


## 3. Generate Summaries


In [14]:
# Test the model on first conversation
test_conversation = df['conversation'].iloc[0]
print("Testing model on first conversation...")
print(f"Conversation length: {len(test_conversation)} characters")

# BART-large-cnn has a max positional embedding of 1024, so truncate input if too long
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
MAX_INPUT_TOKENS = 1024

# Add a prefix to guide the model to extract medical information
# This helps the model understand it should create a medical summary, not just generic text
MEDICAL_PREFIX = "Summarize this doctor-patient conversation into a concise medical note focusing on symptoms, history, examination findings, diagnosis, and treatment plan: "

# Tokenize and truncate if necessary (accounting for prefix)
prefix_tokens = tokenizer.encode(MEDICAL_PREFIX, add_special_tokens=False)
available_tokens = MAX_INPUT_TOKENS - len(prefix_tokens) - 10  # Leave some buffer
tokens = tokenizer.encode(test_conversation, truncation=True, max_length=available_tokens, add_special_tokens=False)
truncated_text = tokenizer.decode(tokens, skip_special_tokens=True)

# Combine prefix with conversation
input_text = MEDICAL_PREFIX + truncated_text

# Generate a summary; decoding parameters are set explicitly so they can be tuned
try:
    test_summary = summarizer(
        input_text,
        max_length=150,    # Upper bound on summary length, in tokens
        min_length=50,     # Lower bound on summary length, in tokens
        do_sample=True,    # Sample tokens, so the output can vary between runs
        temperature=0.7,   # Values below 1.0 sharpen the sampling distribution
        num_beams=4,       # With do_sample=True this is beam-search sampling
        early_stopping=True
    )
    print(f"\nGenerated summary:")
    print(test_summary[0]['summary_text'])
    print(f"\nSummary length: {len(test_summary[0]['summary_text'])} characters")
except Exception as e:
    print(f"Error generating summary: {e}")
    import traceback
    traceback.print_exc()


Testing model on first conversation...
Conversation length: 8396 characters

Generated summary:
Doctor: Good morning sir, how can I help you this morning? Patient: Oh hey, um, I've just had some diarrhea for the last three days. Doctor: What do you mean by diarrhea? Do you mean you're going to the toilet more often? Or are your stools more loose. Patient: Yeah, so it's like loose and watery stool, going to toilet quite often, uh and like some pain in my stomach.

Summary length: 370 characters
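
The truncation step in the cell above reserves room for the prefix and a 10-token buffer inside the 1024-token limit. The arithmetic can be sketched in isolation as follows. `conversation_budget` and `truncate_tokens` are hypothetical helpers, and whitespace splitting stands in for the BART tokenizer, so the counts are illustrative only.

```python
MAX_INPUT_TOKENS = 1024  # BART positional limit, as in the cell above
BUFFER_TOKENS = 10       # same safety margin as the cell above

def conversation_budget(prefix_token_count: int,
                        max_input_tokens: int = MAX_INPUT_TOKENS,
                        buffer_tokens: int = BUFFER_TOKENS) -> int:
    """Tokens left for the conversation after reserving prefix and buffer."""
    budget = max_input_tokens - prefix_token_count - buffer_tokens
    if budget <= 0:
        raise ValueError("prefix and buffer leave no room for the conversation")
    return budget

def truncate_tokens(tokens: list, budget: int) -> tuple[list, int]:
    """Keep the first `budget` tokens and report how many were dropped."""
    kept = tokens[:budget]
    return kept, len(tokens) - len(kept)

# Stand-in tokenizer: whitespace splitting (an assumption for this sketch).
prefix_tokens = "Summarize this doctor-patient conversation:".split()
conversation_tokens = ("Doctor: Hello? Patient: Hello. " * 400).split()

budget = conversation_budget(len(prefix_tokens))
kept, dropped = truncate_tokens(conversation_tokens, budget)
print(f"budget={budget}, kept={len(kept)}, dropped={dropped}")
```

Reporting the dropped count makes it visible how much of a long conversation never reaches the model. Because truncation keeps only the beginning, later parts of a consultation may be lost.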


In [15]:
# Generate summaries for all 15 conversations
import traceback
from transformers import AutoTokenizer

ml_summaries = []

print("Generating summaries for all conversations...")

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
MAX_INPUT_TOKENS = 1024

# Prefix prepended to every conversation to steer the output toward a medical note
MEDICAL_PREFIX = "Summarize this doctor-patient conversation into a concise medical note focusing on symptoms, history, examination findings, diagnosis, and treatment plan: "

# The prefix is constant, so compute the conversation token budget once
prefix_tokens = tokenizer.encode(MEDICAL_PREFIX, add_special_tokens=False)
available_tokens = MAX_INPUT_TOKENS - len(prefix_tokens) - 10  # Leave some buffer

for idx, conversation in enumerate(df['conversation']):
    try:
        # Truncate the conversation to the token budget
        tokens = tokenizer.encode(conversation, truncation=True, max_length=available_tokens, add_special_tokens=False)
        truncated_text = tokenizer.decode(tokens, skip_special_tokens=True)

        # Combine prefix with conversation
        input_text = MEDICAL_PREFIX + truncated_text

        # Generate summary (note: max_length is 200 here, 150 in the single-conversation test)
        result = summarizer(
            input_text,
            max_length=200,   # Upper bound on summary length, in tokens
            min_length=50,    # Lower bound on summary length, in tokens
            do_sample=True,   # Sample tokens, so the output can vary between runs
            temperature=0.7,  # Values below 1.0 sharpen the sampling distribution
            num_beams=4,      # With do_sample=True this is beam-search sampling
            early_stopping=True
        )
        ml_summaries.append(result[0]['summary_text'])

        if (idx + 1) % 5 == 0:
            print(f"  Processed {idx + 1}/{len(df)} conversations")
    except Exception as e:
        print(f"  Error processing conversation {idx + 1}: {e}")
        traceback.print_exc()
        ml_summaries.append("")  # Empty string keeps rows aligned with df on error

# Count only non-empty summaries so failed conversations are not reported as successes
succeeded = sum(1 for s in ml_summaries if s)
print(f"\nGenerated {succeeded} summaries successfully!")


Generating summaries for all conversations...
  Processed 5/15 conversations
  Processed 10/15 conversations
  Processed 15/15 conversations

Generated 15 summaries successfully!
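
The loop above appends an empty string when a conversation fails, so the list length alone does not distinguish successes from failures. A minimal sketch of a breakdown, using a hypothetical `summarize_outcomes` helper and made-up example strings:

```python
def summarize_outcomes(summaries: list[str]) -> dict[str, int]:
    """Count total, non-empty (succeeded), and empty (failed) summaries."""
    total = len(summaries)
    failed = sum(1 for s in summaries if not s.strip())
    return {"total": total, "succeeded": total - failed, "failed": failed}

# Illustrative input only; in the notebook this would be ml_summaries.
example = ["Patient reports diarrhea.", "", "Headache on left side."]
outcome = summarize_outcomes(example)
print(outcome)
```

Calling `summarize_outcomes(ml_summaries)` after the loop would show whether any rows need to be retried before the results are saved.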


In [16]:
# Add ml_summary column to dataframe
df['ml_summary'] = ml_summaries

# Display sample results
print("Sample results:")
print(f"\nConversation 1 - Gold Summary:")
print(df['summary'].iloc[0][:200] + "...")
print(f"\nConversation 1 - ML Summary:")
print(df['ml_summary'].iloc[0][:200] + "...")


Sample results:

Conversation 1 - Gold Summary:
3/7 hx of diarrhea, mainly watery. No blood in stool. Opening bowels x6/day. Associated LLQ pain - crampy, intermittent, nil radiation. Also vomiting - mainly bilous. No blood in vomit. Fever on first...

Conversation 1 - ML Summary:
Doctor: Hello, how are you? Patient: Oh hey, um, I've just had some diarrhea for the last three days. Doctor: And when you say diarrhea, what'd you mean by diarrhea? Do you mean you're going to the to...
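
In the sample above, the ML summary reproduces dialogue turns nearly verbatim, while the gold summary is a condensed clinical note. One rough way to quantify this, sketched here under the assumption that word trigram overlap is an adequate proxy for copying, is the fraction of summary trigrams that also occur in the source conversation. `word_ngrams` and `copy_rate` are hypothetical helpers, and the strings below are illustrative, not taken from the dataset.

```python
import re

def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Lowercased word n-grams; punctuation other than apostrophes is ignored."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def copy_rate(source: str, summary: str, n: int = 3) -> float:
    """Fraction of the summary's n-grams that appear verbatim in the source."""
    summary_ngrams = word_ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    source_ngrams = set(word_ngrams(source, n))
    copied = sum(1 for gram in summary_ngrams if gram in source_ngrams)
    return copied / len(summary_ngrams)

source = "Doctor: How can I help? Patient: I've had diarrhea for three days."
copied_summary = "Patient: I've had diarrhea for three days."
note_summary = "3/7 hx of watery diarrhea."

print(copy_rate(source, copied_summary))  # every trigram is copied
print(copy_rate(source, note_summary))    # no trigram is copied
```

A value near 1.0 suggests the output is an extract of the transcript rather than a rewritten note. In the notebook, the rate could be computed per row from the `conversation` and `ml_summary` columns.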


## 4. Save Results


In [17]:
# Save to processed directory
output_path = Path("/home/root495/Inexture/CDSS-RAG/data/processed/conversation_summary_using_ml.csv")

# Ensure directory exists
output_path.parent.mkdir(parents=True, exist_ok=True)

# Save dataframe
df.to_csv(output_path, index=False)

print(f"Results saved to: {output_path}")
print(f"Total conversations processed: {len(df)}")
print(f"Columns saved: {df.columns.tolist()}")
print("\nFile saved successfully!")


Results saved to: /home/root495/Inexture/CDSS-RAG/data/processed/conversation_summary_using_ml.csv
Total conversations processed: 15
Columns saved: ['conversation', 'summary', 'ml_summary']

File saved successfully!
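
The save step above uses an absolute path that exists only on one machine. A sketch of building the same `data/processed` location relative to a project root follows; `processed_csv_path` is a hypothetical helper, and a temporary directory stands in for the real project root.

```python
import tempfile
from pathlib import Path

def processed_csv_path(project_root: Path,
                       filename: str = "conversation_summary_using_ml.csv") -> Path:
    """Return <project_root>/data/processed/<filename>, creating the directory."""
    out_dir = project_root / "data" / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    path = processed_csv_path(Path(tmp))
    path.write_text("conversation,summary,ml_summary\n", encoding="utf-8")
    relative = path.relative_to(tmp)
    exists = path.exists()

print(relative.as_posix(), exists)
```

In the notebook, the returned path could be passed directly to `df.to_csv(path, index=False)`.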


In [18]:
df

Unnamed: 0,conversation,summary,ml_summary
0,"Doctor: Hello? Hi. Um, should we start? Yeah, ...","3/7 hx of diarrhea, mainly watery. No blood in...","Doctor: Hello, how are you? Patient: Oh hey, u..."
1,Doctor: Hello? Patient: Hello. Can you hear me...,"4/7 hx of dry itchy skin, mainly on chest and ...","Patient: I have like a sore, and a red skin. I..."
2,Doctor: Hello? Patient: Hello. Doctor: Hello t...,"Headache on left side. Started few hours ago, ...",Doctor: Hello there. How can I help you this a...
3,"Doctor: Alex. Ohh. Hello? Hi, can you hear me?...","4/7 hx of generally unwell, mainly sore throat...",Alex has been feeling under the weather for th...
4,Doctor: Hello? Patient: Doctor: . Good morning...,2/7 ago developed lower abdo pain/suprapubic p...,Tim has been experiencing pain in the lower pa...
5,Doctor: Doctor: Hello? Patient: Hello there. D...,"2/5 hx of SOB, worsening over the past 2/7. Fe...","Doctor: Hello, can you hear me OK? Patient: Ye..."
6,Doctor: Hello? Patient: Hello? Doctor: Hello? ...,5/7 hx of generally unwell with cough and cold...,Doctor: I'm sorry to. Patient: And I haven't b...
7,"Patient: OK. Ohh, OK. Doctor: Hello? Patient: ...","3/7 hx of dry itchy skin, mainly on the hands ...","Itha, 26, has been suffering from dry and itch..."
8,Patient: Hello? Doctor: Hello? Doctor: Hello? ...,3/7 hx of dysuria and suprapubic pain. Nil fre...,"Jessica Smith, 19, says she has a pain in her ..."
9,Doctor: hello can you hear me ok? Patient: Hel...,"1/52 hx of dysuria, frequency and suprapubic p...","Amanda Jackson, 19, says she has been experien..."
