# Medical Use Case
## Extracting ICD Codes from Clinical Text

ICD codes, or International Classification of Diseases codes, are a standardized system of alphanumeric codes used to represent medical diagnoses and procedures. They play an essential role in medical billing and claims processing. Here’s an overview of how ICD codes are applied in medical claims:

1. Claim Submission
When healthcare providers perform services or procedures, they create medical claims that detail the services, diagnoses, and other relevant information. ICD codes are included in these claims to precisely describe the patient’s condition.

2. Insurance Processing
Health insurance companies rely on ICD codes to process claims efficiently and assess the medical necessity and coverage of services provided. The codes help insurers understand the patient’s condition and the healthcare services received.

3. Claim Adjudication
Insurance companies use the ICD codes on claims to compare against their policies and determine reimbursement. Claims may be accepted, denied, or adjusted based on the information provided by these codes.

4. Billing and Reimbursement
Healthcare providers use ICD codes to bill insurance companies for the services they deliver. Reimbursement amounts often correlate with the ICD codes, the type of procedure, and additional factors.

5. Medical Necessity
ICD codes are crucial in establishing the necessity of a procedure or service. Insurance companies review these codes to determine if the treatment aligns with the patient’s diagnosis and adheres to accepted medical guidelines.

6. Fraud Detection
Insurance companies utilize ICD codes to identify possible fraud or abuse. Discrepancies, such as services that don’t align with diagnosis codes or seem excessive, may prompt further review or investigation.

7. Data Analysis and Research
ICD codes also support healthcare data analysis at a broader level. Researchers, public health officials, and policymakers use ICD data to track health trends, conduct studies, and inform healthcare decisions and policy.

8. Diagnosis Coding
ICD codes ensure an accurate representation of a patient’s medical condition. Each diagnosis, symptom, or complaint is assigned a specific ICD code, which facilitates clear communication between providers and insurers about the reasons for a patient’s treatment.

Accurate ICD coding is essential for proper reimbursement and regulatory compliance. Medical coders assign these codes from patient records and clinical documentation, which standardizes communication between providers and insurers and streamlines claims processing.

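### Download ICD codes from https://www.cms.gov/medicare/coding-billing/icd-10-codes#CodeFiles

The loader below defaults to a file named `icd10cm_codes_2025.txt` in the working directory, with one code and its description per line.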

In [1]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
import stanza

In [2]:
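# Load ICD-10-CM data
def load_icd10cm_data(filepath='icd10cm_codes_2025.txt'):
    # The file has no header row; each line is read whole into column 0
    data = pd.read_csv(filepath, sep="\t", header=None)
    # The first whitespace-separated token is the code
    data['IcdCodes'] = data[0].apply(lambda x: x.split()[0])
    # The remaining tokens form the description
    data['IcdDescription'] = data[0].apply(lambda x: " ".join(x.split()[1:]))
    # Drop the raw line now that it has been split
    data = data.drop([0], axis=1)
    return data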
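
To make the parsing step concrete, here is a minimal sketch that applies the same split to in-memory lines instead of the CMS file. The two sample lines reuse codes and descriptions that appear in the output at the end of this notebook; the exact column spacing and the helper name `parse_icd_line` are assumptions made for illustration.

```python
def parse_icd_line(line):
    """Split one 'CODE   description' line into (code, description)."""
    # Hypothetical helper: the first whitespace-separated token is the code,
    # and the remaining tokens are rejoined into the description.
    tokens = line.split()
    return tokens[0], " ".join(tokens[1:])


# Sample lines; the spacing is assumed, not copied from the CMS file.
sample_lines = [
    "N189    Chronic kidney disease, unspecified",
    "B20     Human immunodeficiency virus [HIV] disease",
]

parsed = [parse_icd_line(line) for line in sample_lines]
for code, description in parsed:
    print(code, "->", description)
```

Rejoining the tokens with a single space also collapses any repeated whitespace inside the description, which is the same behavior as `load_icd10cm_data`.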

In [3]:
# Initialize model and embeddings
def load_model_and_embeddings(data, model_name='all-MiniLM-L6-v2'):
    model = SentenceTransformer(model_name)
    # Pass a plain list of strings rather than a pandas Series
    embeddings = model.encode(data['IcdDescription'].tolist(), convert_to_tensor=True)
    return model, embeddings

In [4]:
# Initialize Stanza pipeline
def initialize_stanza_pipeline():
    stanza.download('en', package='mimic', processors={'ner': 'i2b2'})
    return stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})

In [5]:
# Extract NER entities
def extract_problem_entities(nlp, text):
    doc = nlp(text)
    ner_list = [ent.text for ent in doc.entities if ent.type == 'PROBLEM']
    print("ner_list: ", ner_list)
    return ner_list
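
An NER pass over a longer note can return the same problem more than once, or with different capitalization. The sketch below is not part of the original notebook; it shows one possible way to remove such repeats before matching, and `dedupe_entities` is a hypothetical helper name.

```python
def dedupe_entities(entities):
    """Drop repeated entities, ignoring case and surrounding whitespace."""
    seen = set()
    unique = []
    for ent in entities:
        cleaned = ent.strip()
        key = cleaned.lower()
        # Skip empty strings and anything already seen (case-insensitive)
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


deduped = dedupe_entities(["diabetes", "Diabetes ", "HIV", "", "chronic kidney"])
print(deduped)
```

The first spelling of each entity is kept and the original order is preserved, so the printed results stay in the order the problems appear in the note.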

In [6]:
def predict_icd_codes(ner_list, model, embeddings, data, top_n=2, threshold=0.5):
    # Nothing to match if no PROBLEM entities were extracted
    if not ner_list:
        print("No entities to match.")
        return

    # Encode the extracted entities as a tensor
    embedding_ner = model.encode(ner_list, convert_to_tensor=True)

    # Cosine similarity between every entity and every ICD description
    res = util.cos_sim(embedding_ner, embeddings)

    # Move to CPU and convert to numpy
    res = res.cpu().numpy()

    # Display results
    for element, ner_item in zip(res, ner_list):
        # Indices of the top `n` scores, highest first, and the scores themselves
        top_indices = element.argsort()[::-1][:top_n]
        top_scores = element[top_indices]

        print(f"\nNER Entity: {ner_item}")
        print("Top ICD Code Matches:")

        matched = False
        for idx, score in zip(top_indices, top_scores):
            if score >= threshold:
                icd_code = data.iloc[idx]['IcdCodes']
                icd_description = data.iloc[idx]['IcdDescription']
                print(f"- ICD Code: {icd_code}, Description: {icd_description}, Similarity Score: {score * 100:.2f}%")
                matched = True

        # Report once per entity if nothing cleared the threshold
        if not matched:
            print("- No matches with a high enough similarity score.")
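
The ranking step can be illustrated without loading the model. The sketch below computes cosine similarity by hand on made-up three-dimensional vectors (real sentence embeddings have far more dimensions), then keeps the top `n` candidates at or above the threshold. The vectors, labels, and function names are illustrative assumptions, not values from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query, candidates, top_n=2, threshold=0.5):
    """Return up to top_n (label, score) pairs scoring at or above threshold."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in candidates.items()]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(label, score) for label, score in scored[:top_n] if score >= threshold]


# Toy vectors standing in for embeddings; the values are made up.
candidates = {
    "code_a": [1.0, 0.0, 0.0],
    "code_b": [0.6, 0.8, 0.0],
    "code_c": [0.0, 0.0, 1.0],
}
matches = top_matches([1.0, 0.0, 0.0], candidates)
print(matches)
```

Raising `threshold` trades recall for precision: fewer entities receive a suggested code, but the suggestions that remain are closer in meaning to the extracted text.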


In [None]:
# Main function to run the process
def main():
    # Load ICD-10-CM data
    data = load_icd10cm_data()

    # Load model and embeddings
    model, embeddings = load_model_and_embeddings(data)

    # Initialize Stanza pipeline
    nlp = initialize_stanza_pipeline()

    # Input text for diagnosis extraction (the output shown below comes from this short example)
    text = """He is suffering from chronic kidney.
    he is also suffering from diabetes
    he is also suffering from HIV"""

    # Alternative, longer clinical note:
    # text = """The patient presents with persistent lower back pain, reporting a dull ache that occasionally radiates down the left leg. Pain is aggravated by prolonged sitting and alleviated somewhat by standing and gentle stretching. No history of recent trauma or injury. Patient reports difficulty sleeping due to discomfort, but no significant weakness, numbness, or loss of bowel/bladder control. Physical examination reveals tenderness along the lumbar spine with limited range of motion in flexion. Straight leg raise test is positive on the left side, suggesting possible nerve root irritation. Plan includes imaging to assess for disc pathology, NSAIDs for pain management, and referral to physical therapy for core strengthening and flexibility exercises. Follow-up in two weeks to evaluate progress."""

    # Extract entities and predict ICD codes
    ner_list = extract_problem_entities(nlp, text)
    predict_icd_codes(ner_list, model, embeddings, data, top_n=2, threshold=0.5)

In [8]:
# Run the main function
if __name__ == "__main__":
    main()

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json: 392kB [00:00, 21.1MB/s]                    
2024-11-12 08:49:14 INFO: Downloaded file to C:\Users\cheta\stanza_resources\resources.json
2024-11-12 08:49:14 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package        |
------------------------------------
| tokenize        | mimic          |
| pos             | mimic_charlm   |
| lemma           | mimic_nocharlm |
| depparse        | mimic_charlm   |
| ner             | i2b2           |
| forward_charlm  | mimic          |
| backward_charlm | mimic          |
| pretrain        | mimic          |

2024-11-12 08:49:14 INFO: File exists: C:\Users\cheta\stanza_resources\en\tokenize\mimic.pt
2024-11-12 08:49:14 INFO: File exists: C:\Users\cheta\stanza_resources\en\pos\mimic_charlm.pt
2024-11-12 08:49:14 INFO: File exists: C:\Users\cheta\stanza_resources\en\lemma\mimic_nocharlm.pt
2024-11-12 08

ner_list:  ['chronic kidney', 'diabetes', 'HIV']

NER Entity: chronic kidney
Top ICD Code Matches:
- ICD Code: N189, Description: Chronic kidney disease, unspecified, Similarity Score: 88.52%
- ICD Code: N181, Description: Chronic kidney disease, stage 1, Similarity Score: 78.16%

NER Entity: diabetes
Top ICD Code Matches:
- ICD Code: E089, Description: Diabetes mellitus due to underlying condition without complications, Similarity Score: 73.68%
- ICD Code: R7303, Description: Prediabetes, Similarity Score: 73.65%

NER Entity: HIV
Top ICD Code Matches:
- ICD Code: B20, Description: Human immunodeficiency virus [HIV] disease, Similarity Score: 74.56%
- ICD Code: Z21, Description: Asymptomatic human immunodeficiency virus [HIV] infection status, Similarity Score: 65.49%
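
The codes above are printed as they appear in the CMS file, without a decimal point. In the conventional ICD-10-CM display form, a period follows the third character whenever more characters are present (for example `N18.9`). A minimal sketch of that conversion, with `add_icd_dot` as a hypothetical helper name:

```python
def add_icd_dot(code):
    """Insert the conventional period after the third character of a code."""
    code = code.strip().upper()
    # Three-character codes such as B20 have no decimal part
    if len(code) <= 3:
        return code
    return code[:3] + "." + code[3:]


formatted = [add_icd_dot(c) for c in ["N189", "R7303", "B20", "Z21"]]
print(formatted)
```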
