# Automated Medical Transcript Processing with ICD-10 Code Extraction

Medical transcripts contain detailed information about patient encounters, but extracting structured data from these narratives is time-consuming. This project uses the OpenAI API to automatically extract key information from medical transcripts and map treatments to ICD-10 codes for billing and documentation purposes.

## Dataset Overview
The dataset contains anonymized medical transcriptions across various specialties.

**transcriptions.csv**
- `medical_specialty`: The medical specialty associated with each transcription
- `transcription`: Detailed medical transcription texts with case insights
## Sample Row
| `medical_specialty` | `transcription` |
|---------------------|-----------------|
| `"Allergy / Immunology"` | SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies when she lived in Seattle but she thinks they are worse here.  In the past, she has tried Claritin, and Zyrtec.  Both worked for short time but then seemed to lose effectiveness.  She has used Allegra also.  She used that last summer and she began using it again two weeks ago.  It does not appear to be working very well.  She has used over-the-counter sprays but no prescription nasal sprays.  She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals:  Weight was 130 pounds and blood pressure 124/78.,HEENT:  Her throat was mildly erythematous without exudate.  Nasal mucosa was erythematous and swollen.  Only clear drainage was seen.  TMs were clear.,Neck:  Supple without adenopathy.,Lungs:  Clear.,ASSESSMENT:,  Allergic rhinitis.,PLAN:,1.  She will try Zyrtec instead of Allegra again.  Another option will be to use loratadine.  She does not think she has prescription coverage so that might be cheaper.,2.  Samples of Nasonex two sprays in each nostril given for three weeks.  A prescription was written as well.  |


## Import Required Libraries

We import the libraries this project needs: pandas for data manipulation, the OpenAI client for API access, and `json` for parsing the structured arguments the API returns.

In [None]:
# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json

## Initialize OpenAI Client

We create an OpenAI client instance that handles all API requests throughout the project. Called with no arguments, `OpenAI()` reads the API key from the `OPENAI_API_KEY` environment variable.

In [None]:
# Initialize the OpenAI client
client = OpenAI()
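
If the environment variable is missing, the failure can surface later and less clearly than it needs to. A minimal sketch of checking for the key up front follows; the helper name `require_api_key` is hypothetical and is not part of the OpenAI library.

```python
import os


def require_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Illustrative helper only; `env` defaults to os.environ and can be
    replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before creating the client.")
    return key


# Usage (assumes the variable is set in your shell):
# client = OpenAI(api_key=require_api_key())
```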

## Load and Explore the Dataset

We load the medical transcriptions dataset and display the first few rows to understand the data structure and content. This helps us verify the data quality and identify the information we need to extract.

In [None]:
# Load the data
df = pd.read_csv("data/transcriptions.csv")
df.head()

## Dataset Statistics and Quality Check

Before processing, we examine the dataset dimensions, check for missing values, and understand the distribution of medical specialties. This ensures data quality and helps identify potential issues.

In [None]:
# Display dataset information
print(f"Dataset shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nMedical specialties distribution:\n{df['medical_specialty'].value_counts()}")

## Define Information Extraction Function

This function uses OpenAI's function calling feature to extract structured data from medical transcripts, specifically the patient's age and the recommended treatment or procedure. The prompts instruct the model to return both fields and to use 'Unknown' when the transcript lacks the information; the schema itself does not enforce this, so downstream code should tolerate a missing field.

In [None]:
# Define function to extract age and recommended treatment/procedure
def extract_info_with_openai(transcription):
    """Extracts age and recommended treatment from a transcription using OpenAI."""
    # System and user messages must be separate dicts; duplicate keys in a
    # single dict would silently overwrite the system prompt.
    messages = [
        {
            "role": "system",
            "content": "You are a healthcare professional extracting patient data. Always return both the age and recommended treatment. If the information is missing, still create the field and specify 'Unknown'."
        },
        {
            "role": "user",
            "content": f"Please extract and return both the patient's age and recommended treatment from the following transcription. Transcription: {transcription}."
        }
    ]
    function_definition = [
        {
            'type': 'function',
            'function': {
                'name': 'extract_medical_data',
                'description': 'Get the age and recommended treatment from the input text. Always return both age and recommended treatment.',
                'parameters': {
                    'type': 'object',
                    'properties': {
                        'Age': {
                            'type': 'integer',
                            'description': 'Age of the patient'
                        },
                        'Recommended Treatment/Procedure': {
                            'type': 'string',
                            'description': 'Recommended treatment or procedure for the patient'
                        }
                    }
                }
            }
        }
    ]
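    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=function_definition,
        # Force the model to call the extraction function rather than reply in free text
        tool_choice={"type": "function", "function": {"name": "extract_medical_data"}}
    )
    # The function arguments arrive as a JSON string; parse them into a dict
    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)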
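
The return line above indexes `tool_calls[0]` directly, so it raises if the model replies without a tool call or with malformed JSON. The sketch below shows one way to parse the message defensively and fall back to defaults. The helper `parse_tool_arguments` and the `DEFAULTS` mapping are illustrative assumptions, and the `SimpleNamespace` objects only mimic the shape of a chat completion message.

```python
import json
from types import SimpleNamespace

# Assumed defaults: no age, and the 'Unknown' marker used elsewhere in this notebook.
DEFAULTS = {"Age": None, "Recommended Treatment/Procedure": "Unknown"}


def parse_tool_arguments(message, defaults=DEFAULTS):
    """Return the tool-call arguments as a dict, falling back to defaults."""
    result = dict(defaults)
    tool_calls = getattr(message, "tool_calls", None)
    if not tool_calls:
        return result  # the model answered without calling the tool
    try:
        parsed = json.loads(tool_calls[0].function.arguments)
    except (json.JSONDecodeError, TypeError):
        return result  # arguments were not valid JSON
    if isinstance(parsed, dict):
        result.update(parsed)
    return result


# Stand-in objects that mimic the attributes accessed above.
fake_call = SimpleNamespace(function=SimpleNamespace(arguments='{"Age": 23}'))
print(parse_tool_arguments(SimpleNamespace(tool_calls=[fake_call])))
print(parse_tool_arguments(SimpleNamespace(tool_calls=None)))
```

Inside `extract_info_with_openai`, the last line could then become `return parse_tool_arguments(response.choices[0].message)`.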

## Define ICD-10 Code Retrieval Function

This function takes a treatment or procedure name and asks the OpenAI API for the corresponding ICD-10 codes. ICD-10 is a standardized coding system used internationally for recording diagnoses and supporting billing. The temperature is set to 0.3 to reduce variation between runs; a low temperature makes output more repeatable but does not by itself make the returned codes correct.

In [None]:
# Define function to get ICD-10 codes for treatments
def get_icd_codes(treatment):
    """Retrieves ICD codes for a given treatment using OpenAI."""
    # Skip the API call when no treatment was extracted
    if treatment == 'Unknown':
        return 'Unknown'
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Provide the ICD codes for the following treatment or procedure: {treatment}. Return the answer as a list of codes. Please only include the codes and no other information."
        }],
        temperature=0.3  # low temperature for more repeatable output
    )
    # The reply is free text containing the list of codes
    return response.choices[0].message.content
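
Because `get_icd_codes` returns the model's reply as free text, the codes may arrive as a bracketed list, a comma-separated line, or a sentence. A rough sketch of pulling out code-shaped tokens with a regular expression follows. The pattern only approximates the ICD-10-CM layout and does not check that a code actually exists; `extract_icd10_codes` is a hypothetical helper name.

```python
import re

# Approximate ICD-10-CM shape: a letter, a digit, an alphanumeric character,
# then an optional dot followed by up to four alphanumeric characters.
ICD10_PATTERN = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def extract_icd10_codes(text):
    """Return unique ICD-10-shaped codes found in text, in order of appearance."""
    seen = []
    for code in ICD10_PATTERN.findall(text.upper()):
        if code not in seen:
            seen.append(code)
    return seen


print(extract_icd10_codes("The codes are J30.9 and J45.909 (also j30.9)."))
# → ['J30.9', 'J45.909']
```

Storing the parsed list alongside the raw reply keeps the original output available for review while making per-code analysis possible.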

## Initialize Data Storage

We create an empty list that will store the processed medical data. Each transcript will be processed and added to this list as a structured dictionary containing age, treatment, specialty, and ICD codes.

In [None]:
# Start an empty list to store processed data
processed_data = []

## Process All Transcripts

This is the main processing loop. For each transcript in the dataset, we extract patient age and treatment using OpenAI, retrieve the corresponding ICD-10 codes, and compile everything into a structured format. The loop handles each transcript sequentially and builds a complete dataset with all extracted information.

In [None]:
# Process each row in the DataFrame
for _, row in df.iterrows():
    medical_specialty = row['medical_specialty']
    extracted_data = extract_info_with_openai(row['transcription'])
    # Fall back to 'Unknown' when the model omitted the treatment field
    treatment = extracted_data.get("Recommended Treatment/Procedure", "Unknown")
    icd_code = get_icd_codes(treatment)
    extracted_data["Medical Specialty"] = medical_specialty
    extracted_data["ICD Code"] = icd_code

    # Append the extracted information as a new row in the list
    processed_data.append(extracted_data)

# Convert the list to a DataFrame
df_structured = pd.DataFrame(processed_data)
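
The loop makes two API calls per transcript, so a single transient failure such as a rate limit or dropped connection would abort the whole run. One possible mitigation is a small retry wrapper with exponential backoff. The sketch below is an assumption about how this could be done, not part of the original notebook; it catches every exception for brevity, and in practice it would be narrowed to the client library's rate-limit and connection errors.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))


# Usage inside the loop (assumes the functions defined earlier in this notebook):
# extracted_data = call_with_retries(extract_info_with_openai, row['transcription'])
```

The `sleep` parameter exists so the wrapper can be exercised without real delays.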

## Display Processed Results

We view the structured data to verify that information was correctly extracted. This table shows patient ages, treatments, medical specialties, and ICD-10 codes in an organized format ready for further analysis or export.

In [None]:
# Display the structured results
df_structured.head(10)

## Data Quality Analysis

We analyze how well the extraction process performed by checking for missing or unknown values. This helps us understand the completeness of our automated extraction and identify areas where the AI struggled to find information in the transcripts.

In [None]:
# Check extraction quality
print("Extraction Quality Summary")
print("="*50)
print(f"Total transcripts processed: {len(df_structured)}")
print(f"\nMissing age information: {df_structured['Age'].isna().sum()}")
print(f"Unknown treatments: {(df_structured['Recommended Treatment/Procedure'] == 'Unknown').sum()}")
print(f"Unknown ICD codes: {(df_structured['ICD Code'] == 'Unknown').sum()}")
print(f"\nExtraction success rate: {((len(df_structured) - (df_structured['ICD Code'] == 'Unknown').sum()) / len(df_structured) * 100):.2f}%")

## Patient Age Distribution Analysis

Understanding the age distribution of patients in the dataset provides insights into the demographic characteristics. We calculate basic statistics and visualize the age distribution to identify patterns and trends.

In [None]:
import matplotlib.pyplot as plt

# Keep only numeric ages; non-numeric values such as 'Unknown' become NaN and are dropped
valid_ages = pd.to_numeric(df_structured['Age'], errors='coerce').dropna()

print("Age Statistics:")
print(f"Mean age: {valid_ages.mean():.1f} years")
print(f"Median age: {valid_ages.median():.1f} years")
print(f"Age range: {valid_ages.min():.0f} - {valid_ages.max():.0f} years")
print(f"Standard deviation: {valid_ages.std():.1f} years")

# Create age distribution plot
plt.figure(figsize=(10, 6))
plt.hist(valid_ages, bins=20, edgecolor='black', alpha=0.7)
plt.xlabel('Patient Age (years)')
plt.ylabel('Frequency')
plt.title('Distribution of Patient Ages in Medical Transcripts')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## Treatment Analysis by Medical Specialty

We examine how treatments vary across medical specialties by taking the most frequent treatment string in each specialty. Because treatments are free text, the mode is based on exact string matches, and 'Unknown' can be the most common value.

In [None]:
# Analyze treatments by specialty
specialty_treatment = df_structured.groupby('Medical Specialty')['Recommended Treatment/Procedure'].apply(lambda x: x.mode()[0] if len(x.mode()) > 0 else 'None')

print("Most Common Treatment by Specialty:")
print("="*70)
for specialty, treatment in specialty_treatment.items():
    print(f"{specialty}: {treatment}")
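
Exact-match counting treats "Zyrtec" and "zyrtec " as different treatments and lets 'Unknown' win the mode. A possible refinement, sketched here with the standard library only, normalizes case and whitespace and skips unknown values. The function name `most_common_treatment` and the sample records are illustrative.

```python
from collections import Counter, defaultdict


def most_common_treatment(records, unknown="Unknown"):
    """Map each specialty to its most frequent known treatment (case-insensitive)."""
    counts = defaultdict(Counter)
    for specialty, treatment in records:
        if not isinstance(treatment, str):
            continue  # skip missing values such as NaN
        label = treatment.strip().lower()
        if label and label != unknown.lower():
            counts[specialty][label] += 1
    # Specialties with only unknown treatments never enter `counts`
    return {spec: counter.most_common(1)[0][0] for spec, counter in counts.items()}


sample = [
    ("Allergy / Immunology", "Zyrtec"),
    ("Allergy / Immunology", "zyrtec "),
    ("Allergy / Immunology", "Unknown"),
]
print(most_common_treatment(sample))  # → {'Allergy / Immunology': 'zyrtec'}
```

With the notebook's DataFrame, the records could be supplied as `zip(df_structured['Medical Specialty'], df_structured['Recommended Treatment/Procedure'])`.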

## Most Frequent ICD-10 Codes

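We identify the most commonly assigned ICD-10 codes in the dataset. Each `ICD Code` cell holds the model's full response, which may list several codes, so the counts below are per response string rather than per individual code. This information helps characterize the procedures being performed and can support resource planning and billing review.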

In [None]:
# Count ICD code frequency (excluding Unknown)
icd_counts = df_structured[df_structured['ICD Code'] != 'Unknown']['ICD Code'].value_counts().head(10)

print("Top 10 Most Frequent ICD-10 Codes:")
print("="*50)
for code, count in icd_counts.items():
    print(f"{code}: {count} occurrences")

# Visualize top ICD codes
plt.figure(figsize=(12, 6))
icd_counts.plot(kind='barh')
plt.xlabel('Frequency')
plt.ylabel('ICD-10 Code')
plt.title('Top 10 Most Frequent ICD-10 Codes')
plt.tight_layout()
plt.show()

## Export Processed Data

Finally, we save the structured data to a CSV file. This file can be used for insurance claims processing, medical billing, further analysis, or integration with electronic health record systems.

In [None]:
# Export to CSV
output_filename = 'processed_medical_transcripts.csv'
df_structured.to_csv(output_filename, index=False)
print(f"Processed data exported to: {output_filename}")
print(f"Total records: {len(df_structured)}")

## API Cost Estimation

We estimate the API costs for processing the entire dataset. This helps in budgeting and understanding the operational costs of running this automated extraction system at scale.

In [None]:
# Estimate API costs (approximate)
# GPT-4o-mini pricing: $0.150 per 1M input tokens, $0.600 per 1M output tokens
# Rough estimates for demonstration purposes

num_transcripts = len(df)
avg_transcript_length = df['transcription'].str.len().mean()
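# ~4 characters per token; the x2 is an upper bound, since the second call sends only the treatment text
estimated_input_tokens = num_transcripts * (avg_transcript_length / 4) * 2  # 2 API calls per transcript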
estimated_output_tokens = num_transcripts * 100 * 2  # Rough estimate

input_cost = (estimated_input_tokens / 1_000_000) * 0.150
output_cost = (estimated_output_tokens / 1_000_000) * 0.600
total_cost = input_cost + output_cost

print("Estimated API Costs:")
print("="*50)
print(f"Input tokens: {estimated_input_tokens:,.0f}")
print(f"Output tokens: {estimated_output_tokens:,.0f}")
print(f"Estimated total cost: ${total_cost:.4f}")
print(f"Cost per transcript: ${total_cost/num_transcripts:.6f}")
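
The estimate above charges the full transcript to both API calls, although the ICD lookup sends only the extracted treatment. A slightly tighter sketch separates the two calls. It reuses the prices and the four-characters-per-token heuristic from the cell above; the default `avg_treatment_chars=200` is an assumed placeholder, and prompt and tool-schema overhead is ignored.

```python
def estimate_cost(num_transcripts, avg_transcript_chars, avg_treatment_chars=200,
                  output_tokens_per_call=100, input_price=0.150, output_price=0.600,
                  chars_per_token=4):
    """Estimate total USD cost for the two-call pipeline.

    Prices are USD per 1M tokens, taken from the comment in the cell above.
    """
    # Call 1 sends the transcript; call 2 sends only the extracted treatment.
    input_tokens = num_transcripts * (avg_transcript_chars + avg_treatment_chars) / chars_per_token
    # Both calls are assumed to produce the same number of output tokens.
    output_tokens = num_transcripts * output_tokens_per_call * 2
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Usage with the notebook's variables:
# estimate_cost(len(df), df['transcription'].str.len().mean())
```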