# Medical Transcription NLP Pipeline
## AI System for Medical Transcription, Summarization & Sentiment Analysis

This notebook demonstrates a complete NLP pipeline for:
1. **Named Entity Recognition (NER)** - Extract symptoms, treatments, diagnoses
2. **Medical Summarization** - Convert transcripts to structured reports
3. **Sentiment & Intent Analysis** - Detect patient emotions and communication intents
4. **SOAP Note Generation** - Create clinical documentation

---

## Setup & Imports

In [2]:
import json
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.insert(0, str(Path.cwd() / 'src'))

# Import modules
from src.medical_ner import MedicalNER, KeywordExtractor
from src.medical_summarizer import MedicalSummarizer
from src.sentiment_intent import SentimentIntentAnalyzer
from src.soap_generator import SOAPNoteGenerator

print("✓ All modules imported successfully!")

✓ All modules imported successfully!


## Load Sample Transcript

We'll use the physician-patient conversation about a car accident and whiplash injury.

In [3]:
# Load sample transcript
with open('data/sample_transcript.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

transcript_full = data['transcript_full']
transcript_sample = data['transcript_sample']
patient_dialogue = data['patient_dialogue_sample']

print("Full Transcript (first 500 characters):")
print(transcript_full[:500] + "...\n")
print(f"Total length: {len(transcript_full)} characters")

Full Transcript (first 500 characters):
Physician: Good morning, Ms. Jones. How are you feeling today?

Patient: Good morning, doctor. I'm doing better, but I still have some discomfort now and then.

Physician: I understand you were in a car accident last September. Can you walk me through what happened?

Patient: Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in ...

Total length: 3137 characters


---
# 1. Medical NLP Summarization

Extract medical details from the transcript using NER and keyword extraction.

## 1.1 Named Entity Recognition (NER)

Extract: **Symptoms, Treatment, Diagnosis, Prognosis**

In [4]:
# Initialize NER
ner = MedicalNER()

# Extract entities from full transcript
entities = ner.extract_medical_entities(transcript_full)

print("="*70)
print("EXTRACTED MEDICAL ENTITIES")
print("="*70)
print(json.dumps(entities, indent=2))

EXTRACTED MEDICAL ENTITIES
{
  "Patient_Name": "Jones",
  "Date_of_Incident": "September 1st",
  "Symptoms": [
    "Back pain",
    "Head impact",
    "Hit My Head",
    "Neck And Back Almost Right Away",
    "Neck And Back Pain",
    "Neck pain",
    "Occasional Backaches",
    "Occasional backache",
    "Trouble Sleeping",
    "Trouble sleeping",
    "Whiplash"
  ],
  "Diagnosis": "Whiplash injury",
  "Treatment": [
    "10 physiotherapy sessions",
    "Painkillers",
    "Physical Examination",
    "X-rays"
  ],
  "Current_Status": "Occasional backaches",
  "Prognosis": "Full recovery within six months of the accident"
}
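
The symptom list above contains entries that differ only in capitalization ("Trouble Sleeping" and "Trouble sleeping"). A minimal post-processing sketch, assuming the extractor returns plain strings, is shown below. `dedupe_case_insensitive` is a hypothetical helper, not part of `src.medical_ner`.

```python
def dedupe_case_insensitive(items):
    """Drop entries that differ only by case or surrounding whitespace.

    Hypothetical helper (not part of src.medical_ner). Keeps the first
    spelling seen for each normalized form and preserves input order.
    """
    seen = set()
    unique = []
    for item in items:
        # Collapse runs of whitespace and compare case-insensitively.
        key = " ".join(item.split()).casefold()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Symptom list copied from the output above.
symptoms = [
    "Back pain", "Head impact", "Hit My Head",
    "Neck And Back Almost Right Away", "Neck And Back Pain", "Neck pain",
    "Occasional Backaches", "Occasional backache",
    "Trouble Sleeping", "Trouble sleeping", "Whiplash",
]
deduped = dedupe_case_insensitive(symptoms)
print(deduped)
```

This only merges case variants. Near-duplicates such as "Occasional Backaches" and "Occasional backache" would need lemmatization or fuzzy matching, which this sketch does not attempt.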


### Test with Sample Transcript

In [5]:
# Test with shorter sample
print("Sample Input:")
print(transcript_sample)
print("\n" + "="*70)

sample_entities = ner.extract_medical_entities(transcript_sample)
print("\nExtracted Entities:")
print(json.dumps(sample_entities, indent=2))

Sample Input:
Doctor: How are you feeling today?
Patient: I had a car accident. My neck and back hurt a lot for four weeks.
Doctor: Did you receive treatment?
Patient: Yes, I had ten physiotherapy sessions, and now I only have occasional back pain.


Extracted Entities:
{
  "Patient_Name": "Janet Jones",
  "Date_of_Incident": "Unknown",
  "Symptoms": [
    "Back pain",
    "Neck pain"
  ],
  "Diagnosis": "Whiplash injury",
  "Treatment": [
    "Physiotherapy",
    "Physiotherapy Sessions"
  ],
  "Current_Status": "Only have occasional back pain",
  "Prognosis": "Good prognosis"
}


## 1.2 Keyword Extraction

Identify important medical phrases.

In [6]:
# Extract keywords
kw_extractor = KeywordExtractor()
keywords = kw_extractor.extract_keywords(transcript_full, top_n=10)

print("Top Medical Keywords/Phrases:")
print("="*70)
for i, keyword in enumerate(keywords, 1):
    print(f"{i}. {keyword}")

Top Medical Keywords/Phrases:
1. full recovery
2. physical examination
3. car accident
4. emergency
5. steering wheel
6. range of movement
7. whiplash injury
8. neck and back pain
9. painkillers


## 1.3 Text Summarization

Convert the transcript into a structured medical report.

In [7]:
# Initialize summarizer
summarizer = MedicalSummarizer()

# Create structured summary
structured_summary = summarizer.create_structured_summary(transcript_full, entities)

print(structured_summary)

Device set to use cpu


MEDICAL CONSULTATION SUMMARY
Patient: Jones
Date of Incident: September 1st

CHIEF COMPLAINT:
  Back pain, Head impact, Hit My Head

DIAGNOSIS:
  Whiplash injury

TREATMENT PROVIDED:
  - 10 physiotherapy sessions
  - Painkillers
  - Physical Examination
  - X-rays

CURRENT STATUS:
  Occasional backaches

PROGNOSIS:
  Full recovery within six months of the accident



### JSON Format Summary

In [8]:
# Generate JSON summary
json_summary = summarizer.generate_json_summary(entities)

print("JSON Summary:")
print(json.dumps(json_summary, indent=2))

JSON Summary:
{
  "Patient_Name": "Jones",
  "Date_of_Incident": "September 1st",
  "Symptoms": [
    "Back pain",
    "Head impact",
    "Hit My Head",
    "Neck And Back Almost Right Away",
    "Neck And Back Pain",
    "Neck pain",
    "Occasional Backaches",
    "Occasional backache",
    "Trouble Sleeping",
    "Trouble sleeping",
    "Whiplash"
  ],
  "Diagnosis": "Whiplash injury",
  "Treatment": [
    "10 physiotherapy sessions",
    "Painkillers",
    "Physical Examination",
    "X-rays"
  ],
  "Current_Status": "Occasional backaches",
  "Prognosis": "Full recovery within six months of the accident"
}


---
# 2. Sentiment & Intent Analysis

Detect patient emotions and communication intents.

## 2.1 Single Utterance Analysis

In [9]:
# Initialize sentiment analyzer
sentiment_analyzer = SentimentIntentAnalyzer()

# Analyze sample patient dialogue
print(f"Patient Dialogue: \"{patient_dialogue}\"\n")

analysis = sentiment_analyzer.analyze(patient_dialogue)

print("Analysis Results:")
print(json.dumps(analysis, indent=2))

Device set to use cpu


Patient Dialogue: "I'm a bit worried about my back pain, but I hope it gets better soon."

Analysis Results:
{
  "Sentiment": "Reassured",
  "Intent": "Seeking reassurance"
}


## 2.2 Multiple Test Cases

In [10]:
# Test multiple utterances
test_cases = [
    "I'm a bit worried about my back pain, but I hope it gets better soon.",
    "I had a car accident. My neck and back hurt a lot for four weeks.",
    "That's a relief! Thank you, doctor.",
    "I'm doing better, but I still have some discomfort now and then.",
    "Yes, I had ten physiotherapy sessions, and now I only have occasional back pain."
]

print("Sentiment & Intent Analysis for Multiple Utterances:")
print("="*70)

for i, text in enumerate(test_cases, 1):
    result = sentiment_analyzer.analyze(text)
    print(f"\n{i}. Text: \"{text}\"")
    print(f"   Sentiment: {result['Sentiment']}")
    print(f"   Intent: {result['Intent']}")

Sentiment & Intent Analysis for Multiple Utterances:

1. Text: "I'm a bit worried about my back pain, but I hope it gets better soon."
   Sentiment: Reassured
   Intent: Seeking reassurance

2. Text: "I had a car accident. My neck and back hurt a lot for four weeks."
   Sentiment: Anxious
   Intent: Reporting symptoms

3. Text: "That's a relief! Thank you, doctor."
   Sentiment: Reassured
   Intent: Expressing gratitude

4. Text: "I'm doing better, but I still have some discomfort now and then."
   Sentiment: Anxious
   Intent: Reporting symptoms

5. Text: "Yes, I had ten physiotherapy sessions, and now I only have occasional back pain."
   Sentiment: Anxious
   Intent: Reporting symptoms
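
The sentiment model named later in this notebook (`distilbert-base-uncased-finetuned-sst-2-english`) is a binary positive/negative classifier, while the output uses three labels. The notebook does not show how the two are bridged. One plausible approach, sketched here under that assumption, is to treat low-confidence predictions as Neutral. The function name and the 0.75 threshold are illustrative, not taken from `src.sentiment_intent`.

```python
def map_sst2_label(label, score, neutral_below=0.75):
    """Map a binary SST-2 prediction onto Anxious / Neutral / Reassured.

    Assumption: `label` is "POSITIVE" or "NEGATIVE" and `score` is the
    classifier's confidence in that label. The 0.75 cut-off is
    illustrative and not taken from the notebook's own code.
    """
    if score < neutral_below:
        return "Neutral"
    return "Reassured" if label.upper() == "POSITIVE" else "Anxious"


print(map_sst2_label("NEGATIVE", 0.98))
print(map_sst2_label("POSITIVE", 0.91))
print(map_sst2_label("POSITIVE", 0.62))
```

A binary model has no way to express mixed feelings, which may explain why the first test case ("a bit worried ... but I hope it gets better") is labeled Reassured.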


## 2.3 Full Conversation Analysis

Analyze all patient utterances in the transcript.

In [11]:
# Analyze entire conversation
conversation_analysis = sentiment_analyzer.analyze_conversation(transcript_full)

print("Patient Utterance Analysis:")
print("="*70)

for i, analysis in enumerate(conversation_analysis, 1):
    print(f"\nUtterance {i}:")
    print(f"Text: {analysis['Text']}")
    print(f"Sentiment: {analysis['Sentiment']}")
    print(f"Intent: {analysis['Intent']}")
    print("-" * 70)

Patient Utterance Analysis:

Utterance 1:
Text: Good morning, doctor.
Sentiment: Reassured
Intent: General communication
----------------------------------------------------------------------

Utterance 2:
Text: Yes, it was on September 1st, around 12:30 in the afternoon.
Sentiment: Anxious
Intent: General communication
----------------------------------------------------------------------

Utterance 3:
Text: Yes, I always do.
Sentiment: Reassured
Intent: General communication
----------------------------------------------------------------------

Utterance 4:
Text: At first, I was just shocked.
Sentiment: Neutral
Intent: General communication
----------------------------------------------------------------------

Utterance 5:
Text: Yes, I went to Moss Bank Accident and Emergency.
Sentiment: Anxious
Intent: General communication
----------------------------------------------------------------------

Utterance 6:
Text: The first four weeks were rough.
Sentiment: Anxious
Intent: General 
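
Per-utterance labels are easier to review when aggregated. The sketch below counts labels across a conversation, assuming each record has the `Sentiment` and `Intent` keys used in the loop above. `summarize_conversation` is a hypothetical helper, not part of `SentimentIntentAnalyzer`.

```python
from collections import Counter


def summarize_conversation(analysis):
    """Count how often each sentiment and intent label occurs.

    Assumes `analysis` is a list of dicts with "Sentiment" and "Intent"
    keys, the same shape as the records printed above.
    """
    return {
        "sentiment_counts": dict(Counter(item["Sentiment"] for item in analysis)),
        "intent_counts": dict(Counter(item["Intent"] for item in analysis)),
    }


# Three records copied from the output above (utterances 1, 2 and 4).
example = [
    {"Text": "Good morning, doctor.",
     "Sentiment": "Reassured", "Intent": "General communication"},
    {"Text": "Yes, it was on September 1st, around 12:30 in the afternoon.",
     "Sentiment": "Anxious", "Intent": "General communication"},
    {"Text": "At first, I was just shocked.",
     "Sentiment": "Neutral", "Intent": "General communication"},
]
summary = summarize_conversation(example)
print(summary)
```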

---
# 3. SOAP Note Generation (Bonus)

Convert the transcript into a structured SOAP (Subjective, Objective, Assessment, Plan) note.

In [12]:
# Initialize SOAP generator
soap_generator = SOAPNoteGenerator()

# Generate SOAP note
soap_note = soap_generator.generate_soap_note(transcript_full, entities)

print("SOAP Note (JSON Format):")
print(json.dumps(soap_note, indent=2))

SOAP Note (JSON Format):
{
  "Subjective": {
    "Chief_Complaint": "Back pain, Head impact, Hit My Head",
    "History_of_Present_Illness": "Car accident last september. can you walk me through what happened?\n\npatient: yes, it was on september 1st, around 12:30 in the afternoon. i was driving from cheadle hulme to manchester when i had to stop in traffic First four weeks were rough. my neck and back pain were really bad\u2014i had trouble sleeping and had to take painkillers regularly Currently experiencing occasional backaches."
  },
  "Objective": {
    "Physical_Exam": "Full range of motion in cervical and lumbar spine, no tenderness.",
    "Observations": "No tenderness."
  },
  "Assessment": {
    "Diagnosis": "Whiplash injury",
    "Severity": "Mild, improving"
  },
  "Plan": {
    "Treatment": "Continue physiotherapy as needed, use analgesics for pain relief.",
    "Follow_Up": "Patient to return if pain worsens or persists beyond six months."
  }
}
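
The `History_of_Present_Illness` field above includes part of the physician's question and a literal `patient:` label, which suggests the text was not split by speaker before extraction. A minimal sketch of such a split follows, assuming every turn starts at the beginning of a line with `Physician:`, `Doctor:` or `Patient:` as in the sample transcripts. The helper names are hypothetical and not part of `src.soap_generator`.

```python
import re

# Speaker labels seen in the sample transcripts; extend as needed.
SPEAKER_RE = re.compile(r"^(Physician|Doctor|Patient):\s*", re.MULTILINE)


def split_by_speaker(transcript):
    """Return a list of (speaker, utterance) tuples in transcript order."""
    turns = []
    matches = list(SPEAKER_RE.finditer(transcript))
    for i, match in enumerate(matches):
        # A turn runs until the next speaker label or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        text = " ".join(transcript[match.end():end].split())
        turns.append((match.group(1), text))
    return turns


def patient_only(transcript):
    """Join only the patient's utterances, e.g. as input for the HPI field."""
    return " ".join(
        text for speaker, text in split_by_speaker(transcript) if speaker == "Patient"
    )


sample = (
    "Doctor: How are you feeling today?\n"
    "Patient: I had a car accident. My neck and back hurt a lot for four weeks.\n"
    "Doctor: Did you receive treatment?\n"
    "Patient: Yes, I had ten physiotherapy sessions, and now I only have occasional back pain.\n"
)
print(patient_only(sample))
```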


## SOAP Note - Formatted Text

In [13]:
# Format as readable text
soap_text = soap_generator.format_soap_note_text(soap_note, entities['Patient_Name'])

print(soap_text)

SOAP NOTE
Patient: Jones
Date: December 18, 2025

SUBJECTIVE:
  Chief Complaint: Back pain, Head impact, Hit My Head
  HPI: Car accident last september. can you walk me through what happened?

patient: yes, it was on september 1st, around 12:30 in the afternoon. i was driving from cheadle hulme to manchester when i had to stop in traffic First four weeks were rough. my neck and back pain were really bad—i had trouble sleeping and had to take painkillers regularly Currently experiencing occasional backaches.

OBJECTIVE:
  Physical Exam: Full range of motion in cervical and lumbar spine, no tenderness.
  Observations: No tenderness.

ASSESSMENT:
  Diagnosis: Whiplash injury
  Severity: Mild, improving

PLAN:
  Treatment: Continue physiotherapy as needed, use analgesics for pain relief.
  Follow-Up: Patient to return if pain worsens or persists beyond six months.



---
# Complete Pipeline Demo

Process a new transcript through all stages.

In [14]:
# Complete pipeline function
def process_medical_transcript(transcript):
    """
    Complete NLP pipeline for medical transcript analysis
    """
    results = {}
    
    print("Processing Medical Transcript...\n")
    
    # Step 1: NER
    print("1️⃣ Extracting Medical Entities...")
    ner = MedicalNER()
    entities = ner.extract_medical_entities(transcript)
    results['entities'] = entities
    print(f"   ✓ Found {len(entities.get('Symptoms', []))} symptoms")
    
    # Step 2: Keywords
    print("2️⃣ Extracting Keywords...")
    kw_extractor = KeywordExtractor()
    keywords = kw_extractor.extract_keywords(transcript)
    results['keywords'] = keywords
    print(f"   ✓ Extracted {len(keywords)} key phrases")
    
    # Step 3: Summarization
    print("3️⃣ Generating Summary...")
    summarizer = MedicalSummarizer()
    summary = summarizer.generate_json_summary(entities)
    results['summary'] = summary
    print("   ✓ Summary generated")
    
    # Step 4: Sentiment & Intent
    print("4️⃣ Analyzing Sentiment & Intent...")
    analyzer = SentimentIntentAnalyzer()
    sentiment_results = analyzer.analyze_conversation(transcript)
    results['sentiment_intent'] = sentiment_results
    print(f"   ✓ Analyzed {len(sentiment_results)} utterances")
    
    # Step 5: SOAP Note
    print("5️⃣ Generating SOAP Note...")
    soap_gen = SOAPNoteGenerator()
    soap = soap_gen.generate_soap_note(transcript, entities)
    results['soap_note'] = soap
    print("   ✓ SOAP note created")
    
    print("\n✅ Processing Complete!\n")
    
    return results

# Run pipeline
pipeline_results = process_medical_transcript(transcript_full)

Processing Medical Transcript...

1️⃣ Extracting Medical Entities...
   ✓ Found 11 symptoms
2️⃣ Extracting Keywords...
   ✓ Extracted 9 key phrases
3️⃣ Generating Summary...


Device set to use cpu


   ✓ Summary generated
4️⃣ Analyzing Sentiment & Intent...


Device set to use cpu


   ✓ Analyzed 12 utterances
5️⃣ Generating SOAP Note...
   ✓ SOAP note created

✅ Processing Complete!



## Display All Results

In [15]:
print("="*70)
print("COMPLETE ANALYSIS RESULTS")
print("="*70)

print("\n📋 EXTRACTED ENTITIES:")
print(json.dumps(pipeline_results['entities'], indent=2))

print("\n🔑 KEY MEDICAL PHRASES:")
for i, kw in enumerate(pipeline_results['keywords'], 1):
    print(f"   {i}. {kw}")

print("\n💭 SENTIMENT & INTENT:")
for i, item in enumerate(pipeline_results['sentiment_intent'][:3], 1):
    print(f"   {i}. {item['Sentiment']} | {item['Intent']}")

print("\n📄 SOAP NOTE:")
print(json.dumps(pipeline_results['soap_note'], indent=2))

COMPLETE ANALYSIS RESULTS

📋 EXTRACTED ENTITIES:
{
  "Patient_Name": "Jones",
  "Date_of_Incident": "September 1st",
  "Symptoms": [
    "Back pain",
    "Head impact",
    "Hit My Head",
    "Neck And Back Almost Right Away",
    "Neck And Back Pain",
    "Neck pain",
    "Occasional Backaches",
    "Occasional backache",
    "Trouble Sleeping",
    "Trouble sleeping",
    "Whiplash"
  ],
  "Diagnosis": "Whiplash injury",
  "Treatment": [
    "10 physiotherapy sessions",
    "Painkillers",
    "Physical Examination",
    "X-rays"
  ],
  "Current_Status": "Occasional backaches",
  "Prognosis": "Full recovery within six months of the accident"
}

🔑 KEY MEDICAL PHRASES:
   1. full recovery
   2. physical examination
   3. car accident
   4. emergency
   5. steering wheel
   6. range of movement
   7. whiplash injury
   8. neck and back pain
   9. painkillers

💭 SENTIMENT & INTENT:
   1. Reassured | General communication
   2. Anxious | General communication
   3. Reassured | Gen

---
# Questions & Answers

## Q1: How should ambiguous or missing medical data be handled?

**Answer:**
- **Rule-based fallbacks**: Use pattern matching and medical knowledge bases
- **Context inference**: Extract information from surrounding sentences
- **Default values**: Provide "Unknown" or "Not specified" for missing data
- **Confidence scores**: Assign confidence levels to extractions
- **Human-in-the-loop**: Flag uncertain extractions for review
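
The "default values" and "human-in-the-loop" strategies above can be combined in a small post-processing step. The sketch below is illustrative only: `apply_defaults` is a hypothetical helper, and the field names are taken from the JSON printed earlier in this notebook.

```python
REQUIRED_FIELDS = ("Patient_Name", "Date_of_Incident", "Diagnosis",
                   "Current_Status", "Prognosis")
LIST_FIELDS = ("Symptoms", "Treatment")


def apply_defaults(entities, placeholder="Unknown"):
    """Fill absent or empty fields and list which ones need human review.

    Hypothetical helper: field names follow the JSON printed earlier in
    this notebook; the function itself is not part of src.medical_ner.
    """
    cleaned = dict(entities)
    needs_review = []
    for field in REQUIRED_FIELDS:
        value = cleaned.get(field)
        # Treat missing, empty and placeholder values the same way.
        if not value or str(value).strip().lower() in {"unknown", "not specified"}:
            cleaned[field] = placeholder
            needs_review.append(field)
    for field in LIST_FIELDS:
        if not cleaned.get(field):
            cleaned[field] = []
            needs_review.append(field)
    return cleaned, needs_review


cleaned, needs_review = apply_defaults(
    {"Patient_Name": "Jones", "Symptoms": ["Neck pain"]}
)
print(needs_review)
```

Returning the review list separately keeps the entity dictionary in its original shape, so downstream steps such as SOAP generation would not need to change.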

In [16]:
# Example: Handling missing data
ambiguous_transcript = "Patient feels unwell. Some pain mentioned."

ambiguous_entities = ner.extract_medical_entities(ambiguous_transcript)
print("Handling Ambiguous Data:")
print(json.dumps(ambiguous_entities, indent=2))

Handling Ambiguous Data:
{
  "Patient_Name": "feels unwell",
  "Date_of_Incident": "Unknown",
  "Symptoms": [],
  "Diagnosis": "Whiplash injury",
  "Treatment": [],
  "Current_Status": "Recovering well",
  "Prognosis": "Good prognosis"
}
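
Note that the output above does not follow the "default values" strategy: the transcript mentions neither a name nor whiplash, yet the extractor returns "feels unwell" as the patient name and "Whiplash injury" as the diagnosis. A crude grounding check can catch the second kind of error by confirming that each extracted value actually occurs in the transcript. This is a sketch under that assumption; `unsupported_fields` is a hypothetical helper.

```python
def unsupported_fields(entities, transcript,
                       fields=("Diagnosis", "Prognosis", "Current_Status")):
    """Return fields whose extracted value never appears in the transcript.

    A crude grounding check: case-insensitive substring match only, so
    paraphrased but legitimate values would also be flagged.
    """
    text = transcript.casefold()
    flagged = []
    for field in fields:
        value = entities.get(field)
        if isinstance(value, str) and value != "Unknown" and value.casefold() not in text:
            flagged.append(field)
    return flagged


# Transcript and values copied from the cell above.
transcript = "Patient feels unwell. Some pain mentioned."
entities = {
    "Patient_Name": "feels unwell",
    "Diagnosis": "Whiplash injury",
    "Current_Status": "Recovering well",
    "Prognosis": "Good prognosis",
}
print(unsupported_fields(entities, transcript))
```

A substring check cannot catch the wrong patient name, because "feels unwell" does occur in the text. That case would need a separate rule, for example requiring a capitalized token after a title such as "Ms." or "Mr.".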


## Q2: Which pre-trained NLP models suit medical summarization?

**Recommended Models:**

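1. **BioBERT** - BERT further pre-trained on biomedical literature
2. **ClinicalBERT** - Adapted to clinical notes
3. **SciBERT** - Pre-trained on scientific papers
4. **BART-large-cnn** - General-purpose abstractive summarization (used here)
5. **BioGPT** - Generative model for biomedical text
6. **PubMedBERT** - Pre-trained on PubMed abstracts

The BERT-family models (1, 2, 3 and 6) are encoders, so they suit entity extraction, classification, or extractive summarization. BART and BioGPT can generate summary text directly.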

**Current Implementation:**

In [17]:
print("Current Models:")
print("- NER: spaCy + Rule-based extraction")
print("- Summarization: facebook/bart-large-cnn")
print("- Sentiment: distilbert-base-uncased-finetuned-sst-2-english")
print("\nFor production, consider:")
print("- BioBERT for medical NER")
print("- ClinicalBERT for clinical notes")
print("- Fine-tuning on medical datasets (MIMIC-III, i2b2)")

Current Models:
- NER: spaCy + Rule-based extraction
- Summarization: facebook/bart-large-cnn
- Sentiment: distilbert-base-uncased-finetuned-sst-2-english

For production, consider:
- BioBERT for medical NER
- ClinicalBERT for clinical notes
- Fine-tuning on medical datasets (MIMIC-III, i2b2)


## Q3: How would you fine-tune BERT for medical sentiment detection?

**Approach:**
1. Start with BioBERT or ClinicalBERT
2. Create labeled dataset with medical sentiments
3. Fine-tune on healthcare-specific sentiment data
4. Use datasets like Medical Sentiment Analysis corpus
5. Validate on clinical conversations
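
Step 2 above (building a labeled dataset) can be sketched without any ML dependencies. The example below encodes the three sentiment classes used in this notebook and makes a reproducible train/validation split. The labels attached to the example sentences are illustrative annotations, not ground truth, and `build_splits` is a hypothetical helper.

```python
import random

LABEL2ID = {"Anxious": 0, "Neutral": 1, "Reassured": 2}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def build_splits(examples, val_fraction=0.2, seed=13):
    """Encode (text, label) pairs and split them into train/validation lists.

    Sketch only: a real project would stratify by label and hold out whole
    conversations so that one patient never appears in both splits.
    """
    encoded = [{"text": text, "label": LABEL2ID[label]} for text, label in examples]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(encoded)
    n_val = max(1, int(len(encoded) * val_fraction))
    return encoded[n_val:], encoded[:n_val]


# Illustrative annotations; not taken from a real labeled corpus.
examples = [
    ("I'm a bit worried about my back pain.", "Anxious"),
    ("That's a relief! Thank you, doctor.", "Reassured"),
    ("Yes, it was on September 1st.", "Neutral"),
    ("The first four weeks were rough.", "Anxious"),
    ("I'm doing better now.", "Reassured"),
]
train, val = build_splits(examples)
print(len(train), len(val))
```

With Hugging Face Transformers, mappings like `LABEL2ID` and `ID2LABEL` would typically be passed to `AutoModelForSequenceClassification.from_pretrained` together with `num_labels=3`, so that predictions decode back to readable class names.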

## Q4: Which datasets could be used to train a healthcare sentiment model?

**Recommended Datasets:**
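- **MIMIC-III**: De-identified clinical notes (no sentiment labels, so annotation would be needed)
- **i2b2 Challenges**: Annotated clinical text for medical NLP tasks
- **Medical Dialog Dataset**: Doctor-patient conversations
- **EmotionLines**: Emotion-labeled dialogue (general domain, not medical)
- **Custom annotation**: Label patient utterances with the target classes (Anxious/Neutral/Reassured)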

---
# Save Results

In [18]:
# Save all results to JSON
output_path = Path('output')
output_path.mkdir(exist_ok=True)

# Prepare output data
output_data = {
    'entities': pipeline_results['entities'],
    'keywords': pipeline_results['keywords'],
    'summary': pipeline_results['summary'],
    'sentiment_intent': pipeline_results['sentiment_intent'],
    'soap_note': pipeline_results['soap_note']
}

# Save to file
with open(output_path / 'analysis_results.json', 'w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=2, ensure_ascii=False)

print("✓ Results saved to: output/analysis_results.json")

✓ Results saved to: output/analysis_results.json


---
# Conclusion

This notebook demonstrated:

✅ **Medical NER** - Extracting symptoms, treatments, diagnoses  
✅ **Keyword Extraction** - Identifying important medical phrases  
✅ **Text Summarization** - Creating structured medical reports  
✅ **Sentiment Analysis** - Detecting patient emotions (Anxious/Neutral/Reassured)  
✅ **Intent Detection** - Understanding patient communication goals  
✅ **SOAP Note Generation** - Automated clinical documentation  

**Next Steps:**
- Fine-tune models on medical datasets
- Implement confidence scoring
- Add multilingual support
- Deploy as REST API
- Integrate with EHR systems