# Medical NLP Pipeline - Data Exploration

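This notebook runs the medical NLP pipeline end to end on a sample doctor-patient transcript: text preprocessing, entity extraction, medical summary generation, sentiment and intent analysis, and SOAP note generation. It then processes a transcript from a file and exercises individual components in isolation.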

## 1. Setup

In [1]:
import sys
import os
import json
import numpy as np
from dotenv import load_dotenv

# Add parent directory to path
sys.path.append('..')

# Load environment variables from project root
load_dotenv(dotenv_path=os.path.join('..', '.env'))
load_dotenv()  # Also try current directory

# Import pipeline
from src.pipeline import MedicalNLPPipeline


def convert_to_json_serializable(obj):
    """
    Convert NumPy types and other non-serializable types to native Python types.
    
    Args:
        obj: Object to convert (can be dict, list, NumPy type, etc.)
        
    Returns:
        JSON-serializable version of the object
    """
    # NumPy scalars are not JSON-serializable; map each kind to its Python equivalent
    if isinstance(obj, np.bool_):
        return bool(obj)
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    # Recurse into containers (tuples become lists, as JSON has no tuple type)
    if isinstance(obj, dict):
        return {key: convert_to_json_serializable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [convert_to_json_serializable(item) for item in obj]
    return obj

## 2. Sample Transcript

In [2]:
sample_transcript = """
Doctor: Good morning, Mr. Carter. How are you feeling today?

Patient: Good morning, doctor. I'm feeling better than before, but I still get some discomfort every now and then.

Doctor: I understand you slipped and fell at work last October. Can you walk me through what happened?

Patient: Yes, it was on October 14th, around 9:00 in the morning. I was working in the warehouse in Stockport, and the floor had a wet patch I didn’t notice. My foot slipped, and I fell sideways onto my right shoulder.

Doctor: That sounds like a painful fall. Were you able to break the fall with your hands?

Patient: Not really. It happened too fast.

Doctor: What did you feel immediately after the incident?

Patient: At first, I was just startled. Then I felt a sharp pain in my right shoulder and upper arm. I couldn’t lift my arm properly right after the fall.

Doctor: Did you seek medical attention at that time?

Patient: Yes, I went to Stepping Hill Hospital’s Urgent Care. They examined me. They did an X-ray to rule out fractures, which came back clear. They gave me a sling and some painkillers.

Doctor: How did things progress after that?

Patient: The first few weeks were difficult. I had limited movement, and the pain made simple tasks hard. I started physiotherapy about two weeks after the injury and had eight sessions, which really helped with mobility and strength.

Doctor: That makes sense. Are you still experiencing pain now?

Patient: It’s not constant, but I do get mild aches if I lift something heavy or sleep on my right side. Definitely not as bad as before.

Doctor: That’s good to hear. Have you noticed any other effects, like anxiety about returning to work or difficulty focusing?

Patient: No, nothing like that. I feel normal otherwise, and I’m not nervous about being in the warehouse again.

Doctor: And how has this impacted your daily life? Work, hobbies, anything like that?

Patient: I took about ten days off work, but after that, I was able to go back gradually. I avoided lifting for a bit, but overall it hasn’t stopped me from doing my usual activities.

Doctor: That’s encouraging. Let’s go ahead and do a physical examination to check your mobility and any lingering pain.

Doctor: Everything looks good. Your shoulder has a full range of movement, and there’s no tenderness or signs of ongoing injury. Your strength and joint stability appear normal.

Patient: That’s a relief!

Doctor: Yes, your recovery so far has been very positive. Given your progress, I’d expect you to make a full recovery within twenty-three months of the injury. There are no signs of long-term damage or instability.

Patient: That’s great to hear. So, I don’t need to worry about this affecting me in the future?

Doctor: That’s right. I don’t anticipate any long-term impact on your work or daily life. If anything changes or you start experiencing increased pain, you can always come back for a follow-up. But at this point, you’re on track for a complete recovery.

Patient: Thank you, doctor. I appreciate it.

Doctor: You’re very welcome, Mr. Carter. Take care, and don’t hesitate to reach out if you need anything.





"""

print("Sample Transcript:")
print(sample_transcript)

Sample Transcript:

Doctor: Good morning, Mr. Carter. How are you feeling today?

Patient: Good morning, doctor. I'm feeling better than before, but I still get some discomfort every now and then.

Doctor: I understand you slipped and fell at work last October. Can you walk me through what happened?

Patient: Yes, it was on October 14th, around 9:00 in the morning. I was working in the warehouse in Stockport, and the floor had a wet patch I didn’t notice. My foot slipped, and I fell sideways onto my right shoulder.

Doctor: That sounds like a painful fall. Were you able to break the fall with your hands?

Patient: Not really. It happened too fast.

Doctor: What did you feel immediately after the incident?

Patient: At first, I was just startled. Then I felt a sharp pain in my right shoulder and upper arm. I couldn’t lift my arm properly right after the fall.

Doctor: Did you seek medical attention at that time?

Patient: Yes, I went to Stepping Hill Hospital’s Urgent Care. They examine

## 3. Initialize Pipeline

In [3]:
# Initialize with LLM (if API key available)
pipeline = MedicalNLPPipeline(config_path='../config/config.yaml', use_llm=True)

# Or without LLM (local models only)
# pipeline = MedicalNLPPipeline(config_path='../config/config.yaml', use_llm=False)

INFO:src.pipeline:Initializing Medical NLP Pipeline
INFO:src.pipeline:✓ Configuration loaded from ../config/config.yaml
INFO:src.pipeline:Loading models and components...
INFO:src.medical_nlp.entity_extractor:Loading NER model: d4data/biomedical-ner-all
INFO:src.medical_nlp.entity_extractor:Using device: cpu
INFO:src.medical_nlp.medical_summarizer:Groq client initialized
INFO:src.sentiment.sentiment_analyzer:Loading sentiment model: distilbert-base-uncased-finetuned-sst-2-english
INFO:src.sentiment.sentiment_analyzer:Using device: cpu
INFO:src.sentiment.intent_detector:Loading intent model: facebook/bart-large-mnli
INFO:src.sentiment.intent_detector:Using device: cpu
INFO:src.soap.soap_generator:Groq client initialized for SOAP generation
INFO:src.pipeline:✓ Pipeline initialization complete
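
The `use_llm=True` path only makes sense when an API key is present in the environment. A minimal sketch of choosing the flag automatically follows. The variable name `GROQ_API_KEY` is an assumption (the logs show a Groq client, but the notebook does not show which variable the pipeline reads), and `has_api_key` is a hypothetical helper, not part of the pipeline.

```python
import os


def has_api_key(env=None, key_name="GROQ_API_KEY"):
    """Return True if `key_name` is set to a non-empty value.

    `key_name` is an assumption; adjust it to whatever the pipeline reads.
    """
    env = os.environ if env is None else env
    return bool(env.get(key_name, "").strip())


# Decide once, then pass the flag to the pipeline constructor
use_llm = has_api_key()
print("LLM enabled:", use_llm)
```

The result could then be passed as `MedicalNLPPipeline(config_path='../config/config.yaml', use_llm=use_llm)` instead of switching between the two commented variants by hand.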


## 4. Process Transcript

In [4]:
results = pipeline.process(sample_transcript)

INFO:src.pipeline:
INFO:src.pipeline:Starting Medical Transcript Analysis

INFO:src.pipeline:[1/6] Text Preprocessing...
INFO:src.pipeline:  ✓ Identified 14 doctor utterances
INFO:src.pipeline:  ✓ Identified 12 patient utterances
INFO:src.pipeline:
[2/6] Medical Entity Extraction...
INFO:src.medical_nlp.entity_extractor:Extracting medical entities...
INFO:src.pipeline:  ✓ Extracted 3 symptoms
INFO:src.pipeline:  ✓ Extracted 7 treatments
INFO:src.pipeline:  ✓ Extracted 2 body parts
INFO:src.pipeline:  ✓ Extracted 3 symptoms
INFO:src.pipeline:  ✓ Extracted 7 treatments
INFO:src.pipeline:  ✓ Extracted 2 body parts
INFO:src.pipeline:
[3/6] Temporal Information Extraction...
INFO:src.pipeline:  ✓ Extracted 0 durations
INFO:src.pipeline:  ✓ Extracted 0 quantities
INFO:src.pipeline:
[4/6] Medical Summary Generation...
INFO:src.medical_nlp.medical_summarizer:Using LLM for medical information extraction...
INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 2

## 5. View Results

In [5]:
print("\n" + "="*80)
print("MEDICAL SUMMARY")
print("="*80)

# Convert NumPy types to native Python types for JSON serialization
medical_summary_serializable = convert_to_json_serializable(results['medical_summary'])
print(json.dumps(medical_summary_serializable, indent=2))


MEDICAL SUMMARY
{
  "Patient_Name": {
    "value": "Mr. Carter",
    "confidence": 1.0,
    "source": "stated"
  },
  "Symptoms": [
    {
      "value": "Right shoulder pain",
      "confidence": 1.0,
      "source": "stated"
    },
    {
      "value": "Discomfort",
      "confidence": 0.8,
      "source": "stated"
    },
    {
      "value": "Mild aches",
      "confidence": 0.8,
      "source": "stated"
    }
  ],
  "Diagnosis": {
    "value": "Right shoulder injury, likely a soft tissue injury or strain",
    "confidence": 0.7,
    "source": "inferred",
    "reasoning": "The patient had a fall, experienced pain and limited movement, and was given a sling and painkillers. The X-ray ruled out fractures, suggesting a soft tissue injury."
  },
  "Treatment": [
    {
      "value": "Sling",
      "confidence": 1.0,
      "source": "stated"
    },
    {
      "value": "Painkillers",
      "confidence": 1.0,
      "source": "stated"
    },
    {
      "value": "8 physiotherapy sessions",
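
Each summary entry carries a `confidence` and a `source`, which makes it easy to pick out the fields a clinician should double-check. The sketch below collects entries that are inferred or fall below a threshold. The helper name `flag_for_review` and the 0.8 threshold are assumptions chosen for illustration; the data is an excerpt of the summary printed above.

```python
def flag_for_review(summary, min_confidence=0.8):
    """Collect (field, value, reason) for entries that are inferred or below the threshold."""
    flagged = []
    for field, entry in summary.items():
        # A field holds either one entry (dict) or a list of entries
        items = entry if isinstance(entry, list) else [entry]
        for item in items:
            if not isinstance(item, dict):
                continue
            reasons = []
            if item.get("source") == "inferred":
                reasons.append("inferred")
            if item.get("confidence", 1.0) < min_confidence:
                reasons.append("low confidence")
            if reasons:
                flagged.append((field, item.get("value"), ", ".join(reasons)))
    return flagged


# Excerpt of the medical summary shown above
summary = {
    "Patient_Name": {"value": "Mr. Carter", "confidence": 1.0, "source": "stated"},
    "Symptoms": [
        {"value": "Right shoulder pain", "confidence": 1.0, "source": "stated"},
        {"value": "Discomfort", "confidence": 0.8, "source": "stated"},
    ],
    "Diagnosis": {
        "value": "Right shoulder injury, likely a soft tissue injury or strain",
        "confidence": 0.7,
        "source": "inferred",
    },
}

for field, value, reason in flag_for_review(summary):
    print(f"{field}: {value} ({reason})")
```

With this excerpt only the diagnosis is flagged, since it is both inferred and below the threshold.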

In [6]:
print("\n" + "="*80)
print("SENTIMENT & INTENT ANALYSIS")
print("="*80)

# Convert NumPy types to native Python types for JSON serialization
sentiment_serializable = convert_to_json_serializable(results['sentiment_analysis'])
print(json.dumps(sentiment_serializable, indent=2))


SENTIMENT & INTENT ANALYSIS
{
  "overall_sentiment": "Reassured",
  "sentiment_score": 0.952,
  "confidence": 0.583,
  "intent": "Reporting Symptoms",
  "intent_confidence": 0.992,
  "all_intents": [
    {
      "intent": "reporting symptoms",
      "confidence": 0.992
    },
    {
      "intent": "providing medical history",
      "confidence": 0.988
    },
    {
      "intent": "seeking reassurance",
      "confidence": 0.987
    },
    {
      "intent": "expressing concern",
      "confidence": 0.983
    },
    {
      "intent": "asking questions",
      "confidence": 0.952
    },
    {
      "intent": "confirming understanding",
      "confidence": 0.937
    }
  ]
}
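
All six intent scores are above 0.93 and do not sum to 1, so they appear to be independent per-label scores rather than one probability distribution (an inference from the output; the notebook does not confirm how the detector scores labels). Under that assumption, the gap between the top two intents says more than the top score alone. A small sketch, with `top_intent_margin` as a hypothetical helper:

```python
def top_intent_margin(intents):
    """Return (top intent, confidence gap to the runner-up), or (intent, None) if only one."""
    ranked = sorted(intents, key=lambda item: item["confidence"], reverse=True)
    if len(ranked) < 2:
        return ranked[0]["intent"], None
    top, runner_up = ranked[0], ranked[1]
    return top["intent"], round(top["confidence"] - runner_up["confidence"], 3)


# Values copied from the output above
all_intents = [
    {"intent": "reporting symptoms", "confidence": 0.992},
    {"intent": "providing medical history", "confidence": 0.988},
    {"intent": "seeking reassurance", "confidence": 0.987},
    {"intent": "expressing concern", "confidence": 0.983},
    {"intent": "asking questions", "confidence": 0.952},
    {"intent": "confirming understanding", "confidence": 0.937},
]

intent, margin = top_intent_margin(all_intents)
print(intent, margin)
```

A margin this small suggests treating the reported intent as one of several near-equal candidates.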


In [7]:
print("\n" + "="*80)
print("SOAP NOTE")
print("="*80)

# Convert NumPy types to native Python types for JSON serialization
soap_note_serializable = convert_to_json_serializable(results['soap_note'])
print(json.dumps(soap_note_serializable, indent=2))


SOAP NOTE
{
  "Subjective": {
    "Chief_Complaint": "Right shoulder pain and discomfort after a fall",
    "History_of_Present_Illness": "The patient reported a fall at work on October 14th, landing on his right shoulder, with initial sharp pain and limited arm movement. He was treated with a sling and painkillers, and underwent 8 physiotherapy sessions, which improved his mobility and strength. He currently experiences mild aches with heavy lifting or sleeping on his right side.",
    "Review_of_Systems": "No anxiety, difficulty focusing, or other systemic symptoms reported"
  },
  "Objective": {
    "Physical_Exam": "Full range of motion in the right shoulder, no tenderness, normal strength and joint stability",
    "Vital_Signs": "Not mentioned",
    "Observations": "No signs of ongoing injury or long-term damage"
  },
  "Assessment": {
    "Diagnosis": "Right shoulder injury, likely a soft tissue injury or strain",
    "Severity": "Mild, with significant improvement since the ini
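
The SOAP note is a two-level dictionary (section, then field), so it can be turned into a plain-text note with a short loop. The sketch below assumes every section maps to a flat dictionary of strings, as in the output above; `render_soap` is a hypothetical helper.

```python
def render_soap(note):
    """Render a {section: {field: text}} SOAP dictionary as indented plain text."""
    lines = []
    for section, fields in note.items():
        lines.append(section.upper())
        for name, value in fields.items():
            # Field names use underscores in the JSON; show them as words
            lines.append(f"  {name.replace('_', ' ')}: {value}")
    return "\n".join(lines)


# One section copied from the output above
note = {
    "Objective": {
        "Physical_Exam": "Full range of motion in the right shoulder, no tenderness, normal strength and joint stability",
        "Vital_Signs": "Not mentioned",
        "Observations": "No signs of ongoing injury or long-term damage",
    }
}

print(render_soap(note))
```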

## 6. Process Full Transcript from File

In [8]:
# Process a transcript file and write the analysis JSON to the output directory
output_path = pipeline.process_file(
    '../data/input/sample_transcript.txt',
    '../data/output'
)

print(f"Results saved to: {output_path}")

INFO:src.pipeline:Processing file: ../data/input/sample_transcript.txt
INFO:src.pipeline:
INFO:src.pipeline:Starting Medical Transcript Analysis

INFO:src.pipeline:[1/6] Text Preprocessing...
INFO:src.pipeline:  ✓ Identified 14 doctor utterances
INFO:src.pipeline:  ✓ Identified 12 patient utterances
INFO:src.pipeline:
[2/6] Medical Entity Extraction...
INFO:src.medical_nlp.entity_extractor:Extracting medical entities...
INFO:src.pipeline:  ✓ Extracted 3 symptoms
INFO:src.pipeline:  ✓ Extracted 7 treatments
INFO:src.pipeline:  ✓ Extracted 2 body parts
INFO:src.pipeline:  ✓ Extracted 3 symptoms
INFO:src.pipeline:  ✓ Extracted 7 treatments
INFO:src.pipeline:  ✓ Extracted 2 body parts
INFO:src.pipeline:
[3/6] Temporal Information Extraction...
INFO:src.pipeline:  ✓ Extracted 0 durations
INFO:src.pipeline:  ✓ Extracted 0 quantities
INFO:src.pipeline:
[4/6] Medical Summary Generation...
INFO:src.medical_nlp.medical_summarizer:Using LLM for medical information extraction...
INFO:httpx:HTTP Re

Results saved to: ../data/output\analysis_20251214_224849.json
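
The saved path mixes `/` and `\`, which is what `os.path.join` typically produces on Windows when the directory argument already uses forward slashes. The sketch below builds the same filename with `pathlib` so the separators stay consistent. The `analysis_<YYYYMMDD>_<HHMMSS>.json` pattern is inferred from the output above, and `analysis_path` is a hypothetical helper, not the pipeline's own code.

```python
from datetime import datetime
from pathlib import PurePosixPath


def analysis_path(output_dir, when):
    """Build '<output_dir>/analysis_<YYYYMMDD>_<HHMMSS>.json' with forward slashes."""
    return PurePosixPath(output_dir) / f"analysis_{when:%Y%m%d_%H%M%S}.json"


# Timestamp taken from the run shown above
print(analysis_path("../data/output", datetime(2025, 12, 14, 22, 48, 49)))
```

`PurePosixPath` is used here so the printed form is the same on every platform; code that actually touches the filesystem would normally use `pathlib.Path`.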


In [9]:
# Load and view full results
with open(output_path, 'r') as f:
    full_results = json.load(f)

print(json.dumps(full_results, indent=2))

{
  "metadata": {
    "timestamp": "2025-12-14T22:48:49.583401",
    "pipeline_version": "1.0.0",
    "llm_used": true,
    "overall_confidence": 0.91
  },
  "medical_summary": {
    "Patient_Name": {
      "value": "Mr. Carter",
      "confidence": 1.0,
      "source": "stated"
    },
    "Symptoms": [
      {
        "value": "Right shoulder pain",
        "confidence": 1.0,
        "source": "stated"
      },
      {
        "value": "Discomfort",
        "confidence": 0.8,
        "source": "stated"
      },
      {
        "value": "Mild aches",
        "confidence": 0.8,
        "source": "stated"
      }
    ],
    "Diagnosis": {
      "value": "Right shoulder injury, likely a soft tissue injury or strain",
      "confidence": 0.7,
      "source": "inferred",
      "reasoning": "The patient had a fall, experienced pain and limited movement, and was given a sling and painkillers. The X-ray ruled out fractures, suggesting a soft tissue injury."
    },
    "Treatment": [
      {
  

## 7. Explore Individual Components

In [10]:
# Text preprocessing
from src.preprocessing.text_processor import TextProcessor

processor = TextProcessor(pipeline.config)
processed = processor.process(sample_transcript)

print("Doctor utterances:", len(processed['doctor_utterances']))
print("Patient utterances:", len(processed['patient_utterances']))

Doctor utterances: 14
Patient utterances: 12
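
The internals of `TextProcessor` are not shown in this notebook. As a rough illustration of how speaker-prefixed lines can be separated, the sketch below uses a regular expression and returns the same two keys as `processed`. It assumes each utterance sits on one line starting with `Doctor:` or `Patient:`; `split_by_speaker` is a hypothetical helper.

```python
import re

# One utterance per line, prefixed by the speaker label
SPEAKER_RE = re.compile(r"^(Doctor|Patient):\s*(.+)$", re.MULTILINE)


def split_by_speaker(transcript):
    """Group utterances by speaker, keyed like the TextProcessor output."""
    utterances = {"doctor_utterances": [], "patient_utterances": []}
    for speaker, text in SPEAKER_RE.findall(transcript):
        utterances[f"{speaker.lower()}_utterances"].append(text.strip())
    return utterances


demo = """
Doctor: Good morning, Mr. Carter. How are you feeling today?

Patient: Good morning, doctor. I'm feeling better than before, but I still get some discomfort every now and then.

Doctor: Did you seek medical attention at that time?
"""

parts = split_by_speaker(demo)
print("Doctor utterances:", len(parts["doctor_utterances"]))
print("Patient utterances:", len(parts["patient_utterances"]))
```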


In [11]:
# Entity extraction
from src.medical_nlp.entity_extractor import EntityExtractor

extractor = EntityExtractor(pipeline.config)
entities = extractor.extract_entities(sample_transcript)

print("Symptoms:", entities['symptoms'])
print("Treatments:", entities['treatments'])
print("Body parts:", entities['body_parts'])

INFO:src.medical_nlp.entity_extractor:Loading NER model: d4data/biomedical-ner-all
INFO:src.medical_nlp.entity_extractor:Using device: cpu
INFO:src.medical_nlp.entity_extractor:Extracting medical entities...


Symptoms: [{'value': 'Discomfort', 'confidence': 1.0, 'source': 'extracted'}, {'value': 'Pain', 'confidence': 1.0, 'source': 'extracted'}, {'value': 'Ache', 'confidence': 1.0, 'source': 'extracted'}]
Treatments: [{'value': 'Wet Patch', 'confidence': 0.9829999804496765, 'label': 'Therapeutic_procedure'}, {'value': 'Lift', 'confidence': 0.4000000059604645, 'label': 'Therapeutic_procedure'}, {'value': 'Sling', 'confidence': 0.7860000133514404, 'label': 'Therapeutic_procedure'}, {'value': 'Ph', 'confidence': 1.0, 'label': 'Therapeutic_procedure'}, {'value': 'Ys', 'confidence': 0.875, 'label': 'Therapeutic_procedure'}, {'value': 'Iotherapy', 'confidence': 0.9599999785423279, 'label': 'Therapeutic_procedure'}, {'value': 'X', 'confidence': 0.9739999771118164, 'label': 'Diagnostic_procedure'}, {'value': 'Ray', 'confidence': 0.9909999966621399, 'label': 'Diagnostic_procedure'}, {'value': 'Examination', 'confidence': 0.6639999747276306, 'label': 'Diagnostic_procedure'}]
Body parts: [{'value': 'R
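
The treatments list contains fragments such as `Ph`, `Ys`, `Iotherapy` and `X`, `Ray`, which look like subword pieces of "physiotherapy" and "X-ray" reported as separate entities. The Hugging Face token-classification pipeline has an `aggregation_strategy` parameter for grouping such pieces; whether `EntityExtractor` uses it is not shown here. As a library-independent sketch, fragments can also be merged afterwards when character offsets are available. The `start` and `end` fields below are an assumption (they do not appear in the output above), and `merge_fragments` is a hypothetical helper.

```python
def merge_fragments(text, entities):
    """Merge same-label entities that are not separated by whitespace in `text`."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if prev is not None and prev["label"] == ent["label"]:
            gap = text[prev["end"]:ent["start"]]
            if not any(ch.isspace() for ch in gap):
                # Same word (possibly hyphenated): extend the previous span
                prev["end"] = max(prev["end"], ent["end"])
                prev["confidence"] = min(prev["confidence"], ent["confidence"])
                continue
        merged.append(dict(ent))
    for ent in merged:
        # Read the surface form back from the source text
        ent["value"] = text[ent["start"]:ent["end"]]
    return merged


text = "I started physiotherapy after an X-ray."
p, x = text.index("physiotherapy"), text.index("X-ray")
# Illustrative fragments; offsets are assumed, confidences are rounded from the output above
fragments = [
    {"start": p, "end": p + 2, "label": "Therapeutic_procedure", "confidence": 1.0},
    {"start": p + 2, "end": p + 4, "label": "Therapeutic_procedure", "confidence": 0.875},
    {"start": p + 4, "end": p + 13, "label": "Therapeutic_procedure", "confidence": 0.96},
    {"start": x, "end": x + 1, "label": "Diagnostic_procedure", "confidence": 0.974},
    {"start": x + 2, "end": x + 5, "label": "Diagnostic_procedure", "confidence": 0.991},
]

for ent in merge_fragments(text, fragments):
    print(ent["value"], ent["label"], ent["confidence"])
```

Taking the minimum confidence of the merged pieces is a conservative choice; averaging would be an equally reasonable alternative.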