# Tutorial: Standardize Vocabularies with Cocoon

This guide shows how to use Cocoon to standardize procedure concepts. We'll use synthetic procedure data from Synthea.

## Installation of Cocoon and Configuration of OpenAI API Key

To install Cocoon, run the cell below. To obtain an OpenAI API key, see Cocoon's GitHub page at https://github.com/Cocoon-Data-Transformation/cocoon/tree/main#openai-api-key.


In [1]:
! pip install cocoon_data==0.1.16

Collecting cocoon_data
  Downloading cocoon_data-0.1.4-py3-none-any.whl (94 kB)
Collecting faiss-cpu (from cocoon_data)
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Collecting openai==0.27.8 (from cocoon_data)
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets->cocoon_data)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: faiss-cpu, jedi, openai, cocoon_data

In [2]:
from cocoon_data import *

In [None]:
# set up your api key
openai.api_key = ""

# test if it works
response = openai.ChatCompletion.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "user", "content": "hello!"},
    ],
)

print(response['choices'][0]['message']["content"])

Hello! How can I assist you today?
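
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported as an environment variable named `OPENAI_API_KEY`. The variable name and the helper `load_api_key` are illustrative and not part of Cocoon.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error.

    `env_var` is an assumed name; use whatever variable you actually export.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage in the notebook (after `from cocoon_data import *`):
# openai.api_key = load_api_key()
```

With this in place, the `openai.api_key = ""` line can read the key at runtime, so the notebook itself never contains the secret.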


## Downloading Files and Starting Standardization

Begin by downloading the required files.

Note that the embedded standardized vocabulary for procedures is large (>1 GB), so the download will take a while.


In [5]:
import requests

# Dictionary of file names and their URLs
files = {
    "procedures.csv": "https://raw.githubusercontent.com/Cocoon-Data-Transformation/cocoon/main/files/procedures.csv",
    "embedded_procedures.csv": "https://raw.githubusercontent.com/Cocoon-Data-Transformation/cocoon/main/files/embedded_procedures.csv",
    "procedure_synthea_matched.csv": "https://raw.githubusercontent.com/Cocoon-Data-Transformation/cocoon/main/files/procedure_synthea_matched.csv"
}


# Loop through the files dictionary
for file_name, url in files.items():
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open file in binary write mode
        with open(file_name, "wb") as file:
            file.write(response.content)
        print(f"{file_name} downloaded successfully.")
    else:
        print(f"Failed to download {file_name}.")


procedures.csv downloaded successfully.
embedded_procedures.csv downloaded successfully.
procedure_synthea_matched.csv downloaded successfully.


In [7]:
import requests

# Dictionary of file names and their URLs
files = {
    "Procedure.zip": "https://www.dropbox.com/scl/fi/429whiqpja71bv24y4vl4/Procedure.zip?rlkey=0i06yz4a4mr5r9xp9dieczqxk&dl=1",
    "Procedure_index.zip": "https://www.dropbox.com/scl/fi/dojmjkglajqu45lpt296i/Procedure_index.zip?rlkey=gfszj659m4s9rpnaj5kksz4k6&dl=1"
}


# Loop through the files dictionary
for file_name, url in files.items():
    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open file in binary write mode
        with open(file_name, "wb") as file:
            file.write(response.content)
        print(f"{file_name} downloaded successfully.")
    else:
        print(f"Failed to download {file_name}.")


Procedure.zip downloaded successfully.
Procedure_index.zip downloaded successfully.
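
Because the vocabulary archive is larger than 1 GB, holding the whole response in memory (as `response.content` does) can be costly. The sketch below is one possible alternative that copies the download to disk in chunks. It uses the standard library's `urllib.request` rather than `requests`, and the helper name `download_streaming` and the 1 MiB default chunk size are assumptions for illustration.

```python
import os
import shutil
import urllib.request


def download_streaming(url: str, dest: str, chunk_size: int = 1 << 20) -> int:
    """Copy `url` to `dest` chunk by chunk and return the bytes written.

    `urlopen` raises `urllib.error.HTTPError` for HTTP error responses,
    so no explicit status-code check is needed here.
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads at most `chunk_size` bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(dest)


# Usage with the dictionary defined above:
# for file_name, url in files.items():
#     size = download_streaming(url, file_name)
#     print(f"{file_name} downloaded successfully ({size} bytes).")
```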


In [8]:
!unzip Procedure.zip
!unzip Procedure_index.zip

Archive:  Procedure.zip
  inflating: Procedure.csv           
Archive:  Procedure_index.zip
  inflating: Procedure.index         


## Build Embeddings for Concepts

Creating text embeddings helps in finding concepts that are similar in meaning.

However, these embeddings don't explain how the concepts are related, and the raw results can be overwhelming.
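
To make "similar in meaning" concrete, here is a toy sketch of how closeness between embeddings is commonly measured with cosine similarity. The three-dimensional vectors and the concept name "Radiography of clavicle" are made up for illustration; real embeddings have many more dimensions, and Cocoon's own similarity metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; the values are invented for this example.
toy_embeddings = {
    "Clavicle X-ray": [0.9, 0.1, 0.0],
    "Radiography of clavicle": [0.8, 0.2, 0.1],
    "Suture open wound": [0.0, 0.2, 0.9],
}

query = toy_embeddings["Clavicle X-ray"]
for name, vector in toy_embeddings.items():
    print(f"{name}: {cosine_similarity(query, vector):.3f}")
```

The two imaging concepts score close to 1, while the unrelated suture concept scores near 0. The score says the concepts are close, but not whether one is broader, narrower, or equivalent to the other.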

In [9]:
# # Embed the vocabularies from OHDSI Athena
# df = pd.read_csv(YOUR_PATH_TO_THE_VOCABULARY)
# df = df[df["standard_concept"] == "S"]
# Procedure_df = df[df["domain_id"] == "Procedure"].copy()
# # add a label column for embedding
# Procedure_df["label"] = Procedure_df["concept_name"]
# embed_labels(Procedure_df, './Procedure.csv')

# load the embedded standardized vocabularies
# this step will take 30 seconds
reference_df = pd.read_csv('./Procedure.csv')

In [10]:
# load the embedding into an index, which will help us find the semantically similar concepts
# this step will take 10 minutes
# index = load_embedding(reference_df, label_embedding='embedding')

# or you can load the existing index from disk
index = faiss.read_index('./Procedure.index')

In [11]:
# load your procedures data from synthea
procedure_df = pd.read_csv('./procedures.csv')

In [12]:
# create a label column that will be used to store the procedure description
procedure_df["label"] = procedure_df["DESCRIPTION"]
# build embedding for your file
embed_labels(procedure_df, './embedded_procedures.csv')

All labels already embedded.


| Unnamed: 0 | label | index_ids | CODE | DESCRIPTION | embedding |
|---|---|---|---|---|---|
| 0 | Admission to orthopedic department | [4] | 305428000 | Admission to orthopedic department | [0.027575980871915817, 0.008947665803134441, -... |
| 1 | Clavicle X-ray | [3] | 168594001 | Clavicle X-ray | [0.003449691692367196, 0.00801424216479063, 0.... |
| 2 | Injection of tetanus antitoxin | [7] | 384700001 | Injection of tetanus antitoxin | [-0.03375847265124321, 0.015600106678903103, -... |
| 3 | Medication Reconciliation (procedure) | [0, 1, 2, 8, 9] | 430193006 | Medication Reconciliation (procedure) | [0.02277933992445469, 0.02867008186876774, 0.0... |
| 4 | Suture open wound | [5, 6] | 288086009 | Suture open wound | [0.0003375147352926433, 0.0011483834823593497,... |
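
In the preview above, the `index_ids` column appears to list, for each distinct label, the rows of the input file that carry it, so that each distinct description only needs to be embedded once. This is an inference from the output, not documented behavior of `embed_labels`. A minimal sketch of that grouping, with a hypothetical helper name and a row order reconstructed from the `index_ids` values:

```python
from collections import defaultdict


def group_row_ids_by_label(labels):
    """Map each distinct label to the list of row positions where it occurs."""
    groups = defaultdict(list)
    for row_id, label in enumerate(labels):
        groups[label].append(row_id)
    return dict(groups)


# Row order inferred from the `index_ids` column in the preview.
descriptions = [
    "Medication Reconciliation (procedure)",  # row 0
    "Medication Reconciliation (procedure)",  # row 1
    "Medication Reconciliation (procedure)",  # row 2
    "Clavicle X-ray",                         # row 3
    "Admission to orthopedic department",     # row 4
    "Suture open wound",                      # row 5
    "Suture open wound",                      # row 6
    "Injection of tetanus antitoxin",         # row 7
    "Medication Reconciliation (procedure)",  # row 8
    "Medication Reconciliation (procedure)",  # row 9
]

groups = group_row_ids_by_label(descriptions)
for label, row_ids in sorted(groups.items()):
    print(label, row_ids)
```

Ten input rows collapse to five distinct labels, which is why the embedded file has only five rows.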


In [13]:
# use the embedding to search for the semantically similar concepts in the standard vocabulary
procedure_synthea_embedded_label = pd.read_csv('./embedded_procedures.csv')
D, I = df_search(procedure_synthea_embedded_label, index)
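
The names `D` and `I` follow the FAISS convention, where `index.search` returns a matrix of distances and a matrix of indices into the indexed vectors. Assuming `df_search` passes these through unchanged (an assumption, since its internals are not shown here), the brute-force sketch below illustrates what such a search computes. It uses squared L2 distance, as a flat FAISS L2 index does; the actual metric stored in `Procedure.index` may differ.

```python
def top_k_search(queries, reference, k):
    """Brute-force nearest-neighbour search over small lists of vectors.

    Returns (D, I): for each query, the k smallest squared L2 distances and
    the positions of the matching reference vectors, nearest first.
    """
    D, I = [], []
    for q in queries:
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, r)), idx)
            for idx, r in enumerate(reference)
        )[:k]
        D.append([dist for dist, _ in scored])
        I.append([idx for _, idx in scored])
    return D, I


# Toy data, invented for illustration.
toy_reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
toy_queries = [[0.9, 0.1]]

toy_D, toy_I = top_k_search(toy_queries, toy_reference, k=2)
print(toy_I)  # row positions in the reference list, nearest first
print(toy_D)  # the corresponding squared distances
```

Row `i` of `I` holds the positions in `reference_df` of the candidates for input row `i`, which is how the matches can be looked up and displayed.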

In [14]:
# below are the matched results using embedding
display_matches(reference_df=reference_df, input_df=procedure_synthea_embedded_label, I=I)

[Interactive widget output: paginated table of input labels and their matches, page 1 of 5]

## Build Relations Between Concepts

Embedding matches alone can be overwhelming to interpret.

To make them easier to judge, we build explicit relationships between source and target concepts that show how the concepts are related.

In [16]:
# relation-based entity matching, which further helps you judge whether these concepts are truly related, and in what way
# this will take a while
entity_relation_match_cluster(input_df = procedure_synthea_embedded_label,
                        attributes = ["DESCRIPTION"],
                        I = I,
                        refernece_df = reference_df,
                        label = "label")

procedure_synthea_embedded_label.to_csv('./procedure_synthea_matched.csv', index=False)

In [17]:
procedure_synthea_matched = pd.read_csv('./procedure_synthea_matched.csv')

# write the report to an HTML file
clusters = compute_cluster(procedure_synthea_embedded_label)
final_html = generate_report_for_cluster(procedure_synthea_embedded_label, clusters)

with open("./standardization_report.html", "w") as file:
    file.write(final_html)

[Interactive widget output: paginated report, page 1 of 5, beginning with an "Input Data" table]