# Clinical NLP Assignment
## Medical Code Extraction from Clinical Reports

This notebook extracts structured data from clinical reports, including:
- Patient Information
- ICD-10 Diagnosis Codes
- CPT Procedure Codes

**Approach:**
- Build a vector database of ICD and CPT code embeddings using sentence transformers
- Use retrieval-based prediction to match clinical text to medical codes
- Extract structured data and write it to a JSON file


## 1. Setup and Imports


In [None]:
import sys
import os
import json
import pandas as pd
import numpy as np
from pathlib import Path

# Make the assignment modules importable (for notebook execution)
current_dir = os.getcwd()
assignment_dir = os.path.join(current_dir, 'clinical_nlp_assignment')
if 'clinical_nlp_assignment' not in current_dir and os.path.exists(assignment_dir):
    # Running from the parent directory: add the assignment directory
    sys.path.insert(0, assignment_dir)
else:
    # Already inside the assignment directory (or it was not found): add the current directory
    sys.path.insert(0, current_dir)

from vector_db import CodeVectorDB
from extractor import ClinicalReportExtractor
import config
import utils

print("All imports successful!")


## 2. Build Vector Database for ICD and CPT Codes

This step creates embeddings for every ICD and CPT code in the reference Excel files. It needs to run only once, and again whenever the code reference files are updated.
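
One way to avoid rebuilding on every run is to compare file modification times. The sketch below is illustrative only: `needs_rebuild`, `reference_files`, and `index_file` are hypothetical names, and it assumes the vector database is persisted to a file on disk, which this notebook does not confirm.

```python
from pathlib import Path

def needs_rebuild(reference_files, index_file):
    """Return True if the index is missing or older than any reference file.

    Hypothetical helper: assumes the vector index is stored as a single file.
    """
    index_path = Path(index_file)
    if not index_path.exists():
        return True
    index_mtime = index_path.stat().st_mtime
    # Rebuild if any reference file was modified after the index was written
    return any(Path(ref).stat().st_mtime > index_mtime for ref in reference_files)
```

The build cell could then be wrapped in `if needs_rebuild([...], ...):`, with the actual paths taken from wherever the project stores them.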


In [None]:
# Initialize vector database
vector_db = CodeVectorDB()

# Build ICD code database
print("Building ICD vector database...")
icd_count = vector_db.build_icd_database()
print(f"Successfully processed {icd_count} ICD codes\n")

# Build CPT code database
print("Building CPT vector database...")
cpt_count = vector_db.build_cpt_database()
print(f"Successfully processed {cpt_count} CPT codes\n")

print("Vector databases built successfully!")


## 3. Load Clinical Reports

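Extract the text of the PDF file containing the clinical reports. Here the whole PDF is read into a single string and split into reports in the next section; in a real scenario, you would process each report as a separate document.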


In [None]:
# Initialize extractor
extractor = ClinicalReportExtractor()

# Extract text from PDF
print(f"Reading PDF from: {config.INPUT_DATA_PDF}")
clinical_text = extractor.extract_from_pdf(config.INPUT_DATA_PDF)

print(f"Extracted {len(clinical_text)} characters from PDF")
print(f"\nFirst 500 characters:\n{clinical_text[:500]}")


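## 4. Split Text into Individual Reports

If the PDF contains multiple reports, the extracted text must be split into one segment per report. The split below is a simple heuristic: it breaks the text wherever two or more blank lines occur in a row and discards segments of 200 characters or fewer.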
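
Blank-line splitting can fail when PDF extraction collapses or adds whitespace. If every report opens with a recognisable header line, splitting on that header may be more robust. The sketch below assumes a header of `Patient Name:`; that marker and the `split_on_header` helper are hypothetical and should be adapted to the actual report layout.

```python
import re

# Hypothetical header marker (a regex); replace with the line that opens each report
REPORT_HEADER = r"Patient Name:"

def split_on_header(text, header=REPORT_HEADER):
    """Split text so that each segment starts at a line beginning with `header`."""
    # The zero-width lookahead keeps the header line inside the segment it opens.
    # Splitting on an empty match requires Python 3.7 or later.
    parts = re.split(rf"(?m)^(?={header})", text)
    return [p.strip() for p in parts if p.strip()]

sample = (
    "Patient Name: A\nDiagnosis: ...\n\n"
    "Patient Name: B\nDiagnosis: ...\n"
)
print(len(split_on_header(sample)))
```

Any text before the first header (a cover page, for example) comes back as its own segment, so a length filter like the one in `split_reports` may still be useful.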


In [None]:
import re

def split_reports(text, min_length=200):
    """Split text into individual clinical reports."""
    # Split on runs of two or more blank lines (three or more newlines)
    segments = re.split(r'\n\n\n+', text)

    # Drop very short segments (likely headers or stray fragments, not reports)
    return [s.strip() for s in segments if len(s.strip()) > min_length]

# Split reports
reports = split_reports(clinical_text)
print(f"Found {len(reports)} potential clinical reports")

# If no segment passes the length filter, fall back to the entire text
if not reports:
    reports = [clinical_text]
    print("Using entire document as single report")

for i, report in enumerate(reports[:4], 1):  # Preview up to 4 reports
    print(f"\nReport {i} length: {len(report)} characters")
    print(f"First 200 chars: {report[:200]}...")


## 5. Extract Structured Data from Each Report

For each clinical report, extract:
- Patient information
- ICD-10 diagnosis codes
- CPT procedure codes
- Clinical sections
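
Codes that are written out explicitly in a report can be picked up with regular expressions before any retrieval is run. The sketch below is not the `ClinicalReportExtractor` implementation; `find_explicit_codes` and both patterns are assumptions based on the general shape of the codes (ICD-10-CM: a letter, a digit, an alphanumeric character, then optionally a dot and up to four more characters; Category I CPT: five digits).

```python
import re

# Assumed code shapes; these patterns are illustrative and will produce false positives
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")
CPT_PATTERN = re.compile(r"\b[0-9]{5}\b")

def find_explicit_codes(text):
    """Return explicit ICD-10 and CPT code candidates in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving order
    icd = list(dict.fromkeys(ICD10_PATTERN.findall(text)))
    cpt = list(dict.fromkeys(CPT_PATTERN.findall(text)))
    return {"icd_codes": icd, "cpt_codes": cpt}

note = "Assessment: E11.9 type 2 diabetes; I10 hypertension. Procedure: 99213 office visit."
print(find_explicit_codes(note))
```

Because any five-digit number (a ZIP code, a record number) matches the CPT pattern, candidates found this way should be checked against the reference code lists before being accepted.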


In [None]:
# Extract data from all reports
extracted_data = []

for i, report_text in enumerate(reports[:4], 1):  # Process up to 4 reports
    print(f"\nProcessing Report {i}...")
    
    try:
        result = extractor.extract_from_report(report_text, report_id=f"report_{i}")
        extracted_data.append(result)
        
        print(f"  Patient ID: {result['patient_info'].get('patient_id', 'N/A')}")
        print(f"  ICD Codes found: {len(result['icd_codes'])}")
        print(f"  CPT Codes found: {len(result['cpt_codes'])}")
        
        if result['icd_codes']:
            print(f"  Top ICD: {result['icd_codes'][0]['code']} - {result['icd_codes'][0]['description'][:50]}...")
        if result['cpt_codes']:
            print(f"  Top CPT: {result['cpt_codes'][0]['code']} - {result['cpt_codes'][0]['description'][:50]}...")
            
    except Exception as e:
        # Log the failure with a full traceback and continue with the next report
        print(f"  Error processing report {i}: {e}")
        import traceback
        traceback.print_exc()

print(f"\n\nSuccessfully processed {len(extracted_data)} of {len(reports[:4])} reports")


## 6. Display Extracted Data


In [None]:
def truncate(text, limit=80):
    """Shorten text to `limit` characters, adding '...' only if something was cut."""
    return text if len(text) <= limit else text[:limit] + "..."

# Display formatted results
for i, data in enumerate(extracted_data, 1):
    print(f"\n{'='*80}")
    print(f"REPORT {i}")
    print(f"{'='*80}")

    print("\nPatient Information:")
    for key, value in data['patient_info'].items():
        if value:
            print(f"  {key}: {value}")

    print(f"\nICD-10 Diagnosis Codes ({len(data['icd_codes'])}):")
    for icd in data['icd_codes']:
        print(f"  {icd['code']}: {truncate(icd['description'])} (confidence: {icd['confidence']:.3f})")

    print(f"\nCPT Procedure Codes ({len(data['cpt_codes'])}):")
    for cpt in data['cpt_codes']:
        print(f"  {cpt['code']}: {truncate(cpt['description'])} (confidence: {cpt['confidence']:.3f})")


## 7. Save Results to JSON


In [None]:
# Save extracted data to JSON file
output_file = config.JSON_OUTPUT_FILE
utils.save_json(extracted_data, output_file)

print(f"Extracted data saved to: {output_file}")
print(f"\nJSON file size: {os.path.getsize(output_file)} bytes")
print(f"Number of reports processed: {len(extracted_data)}")

# Display first report as JSON preview
if extracted_data:
    print("\nPreview of first report (formatted JSON):")
    print(json.dumps(extracted_data[0], indent=2, ensure_ascii=False)[:1000] + "...")


## 8. Validation and Quality Checks


In [None]:
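# Validate extracted data
HIGH_CONFIDENCE = 0.7  # similarity score above which a match counts as high confidence

print("Validation Results:\n")

for i, data in enumerate(extracted_data, 1):
    icd_codes, cpt_codes = data['icd_codes'], data['cpt_codes']
    # Show ✓ only when at least one code was found, so empty results stand out
    icd_mark = "✓" if icd_codes else "✗"
    cpt_mark = "✓" if cpt_codes else "✗"
    print(f"Report {i}:")
    print(f"  {icd_mark} ICD codes: {len(icd_codes)}")
    print(f"  {cpt_mark} CPT codes: {len(cpt_codes)}")

    # Count high-confidence matches
    high_conf_icd = [icd for icd in icd_codes if icd['confidence'] > HIGH_CONFIDENCE]
    high_conf_cpt = [cpt for cpt in cpt_codes if cpt['confidence'] > HIGH_CONFIDENCE]

    print(f"  High confidence ICD (>{HIGH_CONFIDENCE}): {len(high_conf_icd)}")
    print(f"  High confidence CPT (>{HIGH_CONFIDENCE}): {len(high_conf_cpt)}")
    print()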


## Summary

This notebook demonstrates:

1. **Vector Database Construction**: Created embeddings for ICD and CPT codes using sentence transformers
2. **Clinical Text Extraction**: Extracted text from PDF clinical reports
3. **Information Extraction**: Combined regex pattern matching with embedding-based retrieval to extract:
   - Patient demographics
   - ICD-10 diagnosis codes
   - CPT procedure codes
   - Clinical report sections
4. **Retrieval-Based Prediction**: Used semantic search to match clinical text with appropriate medical codes
5. **Structured Output**: Generated JSON file with all extracted data

### Technical Approach:
- **Embeddings**: Sentence Transformers (all-MiniLM-L6-v2) for semantic similarity
- **Code Matching**: Cosine similarity between clinical text embeddings and code embeddings
- **Pattern Matching**: Regex patterns to identify explicit codes and clinical entities
- **Confidence Scoring**: Similarity scores to rank and filter results
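
The matching and scoring steps above can be sketched with plain Python. This is a minimal illustration, not the `CodeVectorDB` implementation: `cosine_similarity`, `top_k_codes`, and the toy three-dimensional vectors are assumptions, standing in for the 384-dimensional embeddings that all-MiniLM-L6-v2 produces.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_codes(query_vec, code_vecs, k=3, min_score=0.0):
    """Rank codes by similarity to the query and keep the best k above min_score."""
    scored = [
        {"code": code, "confidence": cosine_similarity(query_vec, vec)}
        for code, vec in code_vecs.items()
    ]
    scored = [s for s in scored if s["confidence"] >= min_score]
    return sorted(scored, key=lambda s: s["confidence"], reverse=True)[:k]

# Toy vectors standing in for real code embeddings (placeholder code names)
code_vecs = {
    "CODE_A": [1.0, 0.0, 0.0],
    "CODE_B": [0.0, 1.0, 0.0],
    "CODE_C": [1.0, 1.0, 0.0],
}
query = [0.9, 0.1, 0.0]  # stand-in for an embedded clinical sentence
for match in top_k_codes(query, code_vecs, k=2, min_score=0.5):
    print(match["code"], round(match["confidence"], 3))
```

Here the similarity score doubles as the confidence value, so the `min_score` cut-off plays the same role as the 0.7 threshold used in the validation cell.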

### Output:
- JSON file with structured data ready for downstream processing
- Confidence scores for each extracted code
- Source attribution (NLP retrieval vs explicit code)
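
Source attribution could be recorded by tagging each code when explicit and retrieved results are merged. The sketch below is hypothetical: the `merge_codes` helper, the `source` field name and its values, and the choice to give explicit codes a confidence of 1.0 are all assumptions rather than the extractor's documented behaviour.

```python
def merge_codes(explicit_codes, retrieved_codes):
    """Combine explicit and retrieved codes, tagging each with its source."""
    merged = {}
    # Codes written out in the report are taken at face value (assumed confidence 1.0)
    for code in explicit_codes:
        merged[code] = {"code": code, "confidence": 1.0, "source": "explicit"}
    # Retrieved codes are added only if the same code was not already found explicitly
    for item in retrieved_codes:
        merged.setdefault(item["code"], {**item, "source": "retrieval"})
    return sorted(merged.values(), key=lambda c: c["confidence"], reverse=True)

explicit = ["I10"]
retrieved = [
    {"code": "I10", "confidence": 0.82},
    {"code": "E11.9", "confidence": 0.74},
]
for entry in merge_codes(explicit, retrieved):
    print(entry)
```

Letting the explicit entry win on duplicates keeps one record per code while preserving how it was found.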
