![electronic_medical_records](electronic_medical_records.png)

Medical professionals often summarize patient encounters in transcripts written in natural language, which include details about symptoms, diagnoses, and treatments. These transcripts can be reused for other medical documentation, such as insurance claims, but because they are densely packed with medical information, extracting the key data accurately can be challenging.

You and your team at Lakeside Healthcare Network have decided to leverage the OpenAI API to automatically extract medical information from these transcripts and automate the matching with the appropriate ICD-10 codes. ICD-10 codes are a standardized system used worldwide for diagnosing and billing purposes, such as insurance claims processing.

## The Data
The dataset contains anonymized medical transcriptions categorized by specialty.

## transcriptions.csv
| Column     | Description              |
|------------|--------------------------|
| `"medical_specialty"` | The medical specialty associated with each transcription.  |
| `"transcription"` | Detailed medical transcription texts, with insights into the medical case. |


## Before you start

In order to complete the project you will need to create a developer account with OpenAI and store your API key as a secure environment variable. Instructions for these steps are outlined below.

### Create a developer account with OpenAI

1. Go to the [API signup page](https://platform.openai.com/signup). 

2. Create your account (you'll need to provide your email address and your phone number).

3. Go to the [API keys page](https://platform.openai.com/account/api-keys). 

4. Create a new secret key.

<img src="images/openai-new-secret-key.png" width="200">

5. **Take a copy of it**. (If you lose it, delete the key and create a new one.)

### Add a payment method

OpenAI sometimes provides free credits for the API, but this can vary depending on geography. You may need to add debit/credit card details. 

**This project should cost less than 10 US cents with GPT-3.5-Turbo (but if you rerun tasks, you will be charged every time).**

1. Go to the [Payment Methods page](https://platform.openai.com/account/billing/payment-methods).

2. Click Add payment method.

<img src="images/openai-add-payment-method.png" width="200">

3. Fill in your card details.

### Add an environment variable with your OpenAI key

1. In the workbook, click on "Environment" in the left sidebar.

2. Click on the plus button next to "Environment variables" to add environment variables.

3. In the "Name" field, type "OPENAI_API_KEY". In the "Value" field, paste in your secret key.

<img src="images/datalab-env-var-details.png" width="500">

4. Click "Create", and the following pop-up window will appear. Click "Connect", then wait 5-10 seconds for the kernel to restart, or restart it manually in the Run menu.

<img src="images/connect-integ.png" width="500">
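Once the variable is saved, a notebook cell can read it and stop early with a clear message if it is missing. The following is a minimal sketch; the helper name `require_env` is hypothetical and not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it and restart the kernel."
        )
    return value


# Example usage (assumes the variable was created as described above):
# api_key = require_env("OPENAI_API_KEY")
```

Failing at this point is usually easier to debug than an authentication error raised later by the API client.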

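### Or: use the Google Gemini API

Create an API key in [Google AI Studio](https://aistudio.google.com/apikey) and store it as an environment variable named "GEMINI_API_KEY", following the same steps as above. The code in this notebook uses the Gemini API.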


You have been provided with an anonymized dataset of medical transcriptions organized by specialty, `transcriptions.csv`.
- Provider options: Grok, OpenAI, Gemini, or Together AI (this notebook uses Gemini).
- Use the chosen LLM API to extract "age", "medical_specialty", and a new data field that stores the recommended treatment from each transcription.
- Match each recommended treatment with the corresponding International Classification of Diseases (ICD) code, and save your answers in a pandas DataFrame named `df_structured`.

In [1]:
# Import the necessary libraries
import os
import pandas as pd
from google import genai
from google.genai import types
from typing_extensions import TypedDict
import json

In [2]:
# Load api key
gemini_api_key = os.environ.get('GEMINI_API_KEY')

# Create client
client = genai.Client(api_key=gemini_api_key)

In [3]:
class Transcription(TypedDict):
    age: int
    medical_speciality: str
    treatment: str
    icd: str


model_config = types.GenerateContentConfig(
    temperature=0.1,
    top_p=1,
    max_output_tokens=250,
    response_mime_type="application/json",
    response_schema=Transcription)
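The response schema describes the fields the model is asked to return, but it can still be worth checking a parsed response locally before using it. The sketch below mirrors the fields of `Transcription` (including the notebook's `medical_speciality` spelling) in a separate class; `ExpectedFields` and `schema_problems` are hypothetical names used only for illustration.

```python
from typing import TypedDict, get_type_hints


class ExpectedFields(TypedDict):
    # Mirrors the notebook's Transcription schema
    age: int
    medical_speciality: str
    treatment: str
    icd: str


def schema_problems(record: dict) -> list:
    """List the missing or wrongly typed fields in a parsed response."""
    problems = []
    for field, expected_type in get_type_hints(ExpectedFields).items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems
```

An empty list means the record has every expected field with the expected type; anything else can be logged or sent back for another attempt.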

In [7]:
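def generate_transcription(transcript) -> str:
    """
    Use an LLM to extract structured fields from a medical transcript.
    Args:
        transcript: medical text, or a row of values that contains it
    Return:
        str: JSON string with the extracted fields
    """
    instructions = """
                You are a professional medical transcription agent with 10 years of experience.

                Task:
                    Extract the age from the text,
                    extract the recommended treatment and the medical specialty,
                    and generate the corresponding International Classification of Diseases (ICD) code.

                Transcript:
    """
    # Flatten a row of values into a single string
    if not isinstance(transcript, str):
        transcript = " ".join(str(value) for value in transcript)

    # Append the transcript to the instructions. Calling instructions.join(transcript)
    # would instead insert the instructions between every element of the transcript.
    prompt = instructions + transcript
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        config=model_config,
        contents=prompt
    )

    return response.text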
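`generate_transcription` returns the raw response text, which is expected to be a JSON string. With `max_output_tokens=250`, a long answer may be cut off before the JSON is complete, so parsing defensively can be useful. A minimal sketch, with `parse_model_json` as a hypothetical helper name:

```python
import json
from typing import Optional


def parse_model_json(raw) -> Optional[dict]:
    """Parse a JSON object from model output; return None if it is missing or malformed."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: truncated or invalid JSON; TypeError: raw is None
        return None
    # Only accept a JSON object, since the schema describes a single record
    return parsed if isinstance(parsed, dict) else None
```

Rows that come back as `None` can then be retried or reviewed by hand instead of stopping the whole run.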

In [5]:
# Load the data
df = pd.read_csv("data/transcriptions.csv")
df

| Unnamed: 0 | medical_specialty | transcription |
|------------|-------------------|---------------|
| 0 | Allergy / Immunology | SUBJECTIVE:, This 23-year-old white female pr... |
| 1 | Orthopedic | CHIEF COMPLAINT:, Achilles ruptured tendon.,H... |
| 2 | Bariatrics | PREOPERATIVE DIAGNOSIS: , Morbid obesity.,POST... |
| 3 | Cardiovascular / Pulmonary | PREOPERATIVE DIAGNOSES,Airway obstruction seco... |
| 4 | Urology | CHIEF COMPLAINT:, Urinary retention.,HISTORY ... |


In [None]:
# Collect the JSON strings returned by the model
collector = []

# Loop through the DataFrame rows
for _, row in df.iterrows():
    # Pass the row values (specialty and transcription) to the LLM
    mod_trans = generate_transcription(row.values)
    collector.append(mod_trans)

['Allergy / Immunology'
 'SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies when she lived in Seattle but she thinks they are worse here.  In the past, she has tried Claritin, and Zyrtec.  Both worked for short time but then seemed to lose effectiveness.  She has used Allegra also.  She used that last summer and she began using it again two weeks ago.  It does not appear to be working very well.  She has used over-the-counter sprays but no prescription nasal sprays.  She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals:  Weight was 130 pounds and blood pressure 124/78.,HEENT:  Her throat was mildly erythematous without exudate.  Nasal mucosa was erythematous and swollen.  Only clear drainage was seen.  TMs were clear.,Neck:  Su
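The loop above makes one API call per row, so a single transient failure (a timeout or a rate limit, for example) would stop the whole run. One common pattern is to retry with exponential backoff. The sketch below is generic and not tied to any SDK: `call_with_retries` is a hypothetical helper, and catching every `Exception` is an assumption made because the specific errors the client raises are not covered here.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff when it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception as error:  # assumption: any error is worth retrying
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, then 4x, ... before the next try
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Example usage inside the loop (assumes generate_transcription is defined):
# mod_trans = call_with_retries(generate_transcription, row.values)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.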

In [None]:
import json

extracted_data = []

for item_str in collector:
    # Parse the JSON string into a Python dictionary
    item_dict = json.loads(item_str)

    # Extract the desired fields
    # You can choose which fields you want to extract
    extracted_fields = {
        "age": item_dict.get("age"),
        "speciality": item_dict.get("medical_speciality"),
        "treatment": item_dict.get("treatment"),
        "icd_code": item_dict.get("icd")
    }
    extracted_data.append(extracted_fields)

[{'age': 23,
  'speciality': 'Allergy / Immunology',
  'treatment': 'Zyrtec, loratadine, Nasonex',
  'icd_code': 'J30.9'},
 {'age': 41,
  'speciality': 'Orthopedic',
  'treatment': 'Operative fixation',
  'icd_code': 'S86.01'},
 {'age': 30,
  'speciality': 'Bariatrics',
  'treatment': 'Laparoscopic antecolic antegastric Roux-en-Y gastric bypass with EEA anastomosis',
  'icd_code': 'E66.9'},
 {'age': 50,
  'speciality': 'Laryngology and Thoracic Surgery',
  'treatment': 'Neck exploration, tracheostomy, urgent flexible bronchoscopy via tracheostomy site, removal of foreign body, tracheal metallic stent material, dilation distal trachea, placement of #8 Shiley single cannula tracheostomy tube',
  'icd_code': 'J95.5'},
 {'age': 66,
  'speciality': 'Urology',
  'treatment': 'Flomax and Proscar, self-catheterization',
  'icd_code': 'N40'}]
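Because the ICD codes are generated by the model, a quick format check can catch obviously malformed values. The sketch below tests only the general shape of an ICD-10-CM code (a letter, a digit, one more alphanumeric character, then an optional dot with up to four alphanumeric characters). That pattern is an assumption about the code format; it does not confirm that a code exists or that it fits the diagnosis. `looks_like_icd10` is a hypothetical helper name.

```python
import re

# Assumed general shape of an ICD-10-CM code, e.g. "N40" or "S86.01"
ICD10_SHAPE = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")


def looks_like_icd10(code: str) -> bool:
    """Return True if the string has the general shape of an ICD-10-CM code."""
    return ICD10_SHAPE.fullmatch(code.strip().upper()) is not None


# Codes taken from the output above
codes = ["J30.9", "S86.01", "E66.9", "J95.5", "N40"]
suspicious = [code for code in codes if not looks_like_icd10(code)]
```

Codes that fail the check are candidates for manual review; codes that pass still need to be verified against an official ICD-10 reference.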

In [22]:
# Convert the extracted records to a pandas DataFrame
llm_df = pd.DataFrame(extracted_data)

# Concatenate with the original DataFrame, column-wise
df_structured = pd.concat([df, llm_df], axis=1)

# Save the dataset
df_structured.to_csv('data/final_transcription.csv', index=False)

In [24]:
# View Dataset
pd.read_csv('data/final_transcription.csv')

| Unnamed: 0 | medical_specialty | transcription | age | speciality | treatment | icd_code |
|------------|-------------------|---------------|-----|------------|-----------|----------|
| 0 | Allergy / Immunology | SUBJECTIVE:, This 23-year-old white female pr... | 23 | Allergy / Immunology | Zyrtec, loratadine, Nasonex | J30.9 |
| 1 | Orthopedic | CHIEF COMPLAINT:, Achilles ruptured tendon.,H... | 41 | Orthopedic | Operative fixation | S86.01 |
| 2 | Bariatrics | PREOPERATIVE DIAGNOSIS: , Morbid obesity.,POST... | 30 | Bariatrics | Laparoscopic antecolic antegastric Roux-en-Y g... | E66.9 |
| 3 | Cardiovascular / Pulmonary | PREOPERATIVE DIAGNOSES,Airway obstruction seco... | 50 | Laryngology and Thoracic Surgery | Neck exploration, tracheostomy, urgent flexibl... | J95.5 |
| 4 | Urology | CHIEF COMPLAINT:, Urinary retention.,HISTORY ... | 66 | Urology | Flomax and Proscar, self-catheterization | N40 |
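The final table holds both the original `medical_specialty` column and the model's `speciality` value, and the two do not always agree (row 3 above). A small comparison can flag such rows for review. This is a minimal sketch using the five pairs shown in the table; `specialty_mismatches` is a hypothetical helper name.

```python
def specialty_mismatches(pairs):
    """Return (row index, original, extracted) for rows where the specialties differ."""
    mismatches = []
    for index, (original, extracted) in enumerate(pairs):
        # Compare case-insensitively and ignore surrounding whitespace
        if original.strip().casefold() != extracted.strip().casefold():
            mismatches.append((index, original, extracted))
    return mismatches


# (medical_specialty, speciality) pairs copied from the table above
pairs = [
    ("Allergy / Immunology", "Allergy / Immunology"),
    ("Orthopedic", "Orthopedic"),
    ("Bariatrics", "Bariatrics"),
    ("Cardiovascular / Pulmonary", "Laryngology and Thoracic Surgery"),
    ("Urology", "Urology"),
]

flagged = specialty_mismatches(pairs)
```

For these five rows, only row 3 is flagged. Whether the dataset label or the model's value is the better one is a judgment that still needs a human reviewer.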
