# MedCAT Tutorial

This tutorial demonstrates how to use MedCAT to extract medical information for Clinical Trials and Consumer Health Search tasks.

Source: https://github.com/CogStack/MedCATtutorials

## 1. MedCAT Installation

Step 1: Clone MedCAT Repo

<code>git clone https://github.com/CogStack/MedCAT.git </code>

Step 2: Create a MedCAT Environment

<code>python3 -m venv medcat</code>

Step 3: Activate the Virtual Environment (Linux/macOS)

<code>source medcat/bin/activate</code>

Step 4: Install Jupyter and ipykernel

<code>pip install jupyter ipykernel</code>

Step 5: Add the Virtual Environment as a Kernel

<code>python -m ipykernel install --user --name=medcat</code>

### 1.1 Install the Requirements

In [None]:
import os

# Replace the placeholder below with the path to the cloned MedCAT directory, which contains 'requirements.txt'
medcat_directory_path = 'change_with_github_clone_path'

# Change the current working directory to the MedCAT directory
os.chdir(medcat_directory_path)


# Install the packages listed in 'requirements.txt'
!pip install -r requirements.txt
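
After the install finishes, it may be worth confirming that the active kernel can actually see the package before moving on. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, not part of MedCAT.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# 'medcat' is the import name used later in this tutorial
if not is_installed("medcat"):
    print("medcat was not found; check that the notebook is running on the 'medcat' kernel")
```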

### 1.2 Import the MedCAT Library

In [None]:
# Import the required libraries
import json
import os
import pickle
import xml.etree.ElementTree as ET

import numpy as np
import pandas as pd
from spacy import displacy

from medcat.cat import CAT


### 1.3 Import the Modelpack

In this tutorial, we used the SNOMED International model pack (full SNOMED model pack trained on MIMIC-III). To download this model, sign in with your NIH profile/UMLS license using the following link: <link>https://uts.nlm.nih.gov/uts/login?service=https:%2F%2Fmedcat.rosalind.kcl.ac.uk%2Fauth-callback</link>

Alternatively, use the model pack below, which does not require a license.

Download from: <link>https://medcat.rosalind.kcl.ac.uk/media/medmen_wstatus_2021_oct.zip</link>

In [7]:
# Download the models and required data
!wget https://medcat.rosalind.kcl.ac.uk/media/medmen_wstatus_2021_oct.zip

'wget' is not recognized as an internal or external command,
operable program or batch file.
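
The error above shows that `wget` is not available in a default Windows shell. A portable alternative is to download the archive with Python's standard library. The sketch below is one way to do that; `download_file` is a hypothetical helper name, and the URL is the one given above.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str) -> Path:
    """Download `url` to `destination`, skipping the download if the file exists."""
    target = Path(destination)
    if target.exists():
        return target
    # Stream the response to disk so the whole archive is not held in memory
    with urllib.request.urlopen(url) as response, open(target, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target


# Usage (uncomment to download the model pack):
# download_file(
#     "https://medcat.rosalind.kcl.ac.uk/media/medmen_wstatus_2021_oct.zip",
#     "medmen_wstatus_2021_oct.zip",
# )
```

The returned path can then be assigned to `model_pack_path`.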


In [10]:
model_pack_path = "change_to_modelpack_zip_path"

### 1.4 Loading the MedCAT modelpack

In [None]:
# Load model pack and Create CAT - the main class from medcat used for concept annotation
cat = CAT.load_model_pack(model_pack_path)
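
Because `model_pack_path` starts out as a placeholder string, a quick existence check before loading can produce a clearer error than a failure inside the loader. This is a minimal sketch; `resolve_model_pack` is a hypothetical helper, not part of MedCAT.

```python
from pathlib import Path


def resolve_model_pack(path: str) -> Path:
    """Return the model pack path, raising early if the file is missing."""
    candidate = Path(path).expanduser()
    if not candidate.is_file():
        raise FileNotFoundError(f"Model pack not found: {candidate}")
    return candidate


# Usage, with CAT imported and model_pack_path set as in the cells above:
# cat = CAT.load_model_pack(str(resolve_model_pack(model_pack_path)))
```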

# 2. Using MedCAT to extract information in clinical trials and consumer health searches

## 2.1 Clinical Trial Task

### Load the data

In [15]:
# Replace 'DATA_DIR' with the path to the directory containing your data files
DATA_DIR = "change_to_data_directory"

In [28]:
# Clinical Trials data
clinical_trials_data_path = os.path.join(DATA_DIR, 'ct-topics2022.xml')

In [29]:
# Read and Parse XML Data
tree = ET.parse(clinical_trials_data_path)
root = tree.getroot()

# Extract data and convert to DataFrame
data = []
for topic in root.findall('topic'):
    number = topic.get('number') 
    description = topic.text.strip() 

    # Append the extracted data as a dictionary to the 'data' list
    data.append({'Number': number, 'Description': description})

# Create a DataFrame from the extracted data
clinical_df = pd.DataFrame(data) 
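
The loop above assumes that each `<topic>` element carries a `number` attribute and holds the description directly as its text. The sketch below runs the same logic on a small inline document so the expected shape is visible. The sample XML is illustrative and not copied from `ct-topics2022.xml`, the root tag name `topics` is an assumption, and `parse_topics` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the topics file; the real file may differ in detail
SAMPLE_XML = """
<topics>
  <topic number="1">
    A 19-year-old male came to clinic with some sexual concern.
  </topic>
  <topic number="2"></topic>
</topics>
"""


def parse_topics(xml_text: str) -> list:
    """Return one {'Number', 'Description'} dict per <topic> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for topic in root.findall("topic"):
        # topic.text is None for an empty element, so fall back to ""
        description = (topic.text or "").strip()
        rows.append({"Number": topic.get("number"), "Description": description})
    return rows


rows = parse_topics(SAMPLE_XML)
print(rows)
```

The resulting list can be passed to `pd.DataFrame` exactly as in the cell above. The `or ""` guard is the only behavioral difference: it avoids an `AttributeError` when a topic has no text.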

In [30]:
# Show the first 5 rows of the DataFrame for a quick overview of the data
top_5_df = clinical_df.head(5)
print(top_5_df)
 

  Number                                        Description
0      1  A 19-year-old male came to clinic with some se...
1      2  A 32-year-old woman comes to the hospital with...
2      3  A 51-year-old man comes to the office complain...
3      4  A 66-year-old woman comes to the office due to...
4      5  A 23-year-old man comes to the emergency depar...


### Annotate Description with MedCAT

In [31]:
# We will now proceed to annotate the medical entities present in the description

# Extract the description from the 1st row
text = clinical_df.iloc[0]['Description']

# Display the description from the 1st row
print("Description from the 1st row:")
print()
print(text)


# Annotate the description with MedCAT
annotated_text = cat(text)

# Display the annotated medical entities
print()
print("Annotated Medical Entities:")
displacy.render(annotated_text, style='ent')

Description from the 1st row:

A 19-year-old male came to clinic with some sexual concern.  He recently engaged in a relationship and is worried about the satisfaction of his girlfriend. He has a "baby face" according to his girlfriend's statement and he is not as muscular as his classmates.  On physical examination, there is some pubic hair and poorly developed secondary sexual characteristics. He is unable to detect coffee smell during the examination, but the visual acuity is normal. Ultrasound reveals the testes volume of 1-2 ml. The hormonal evaluation showed serum testosterone level of 65 ng/dL with low levels of GnRH.

Annotated Medical Entities:


### Extract the Annotated Entities in JSON Format

We can extract the annotated entities in a structured JSON format for further analysis or processing.

In [32]:
# Extract annotated entities in JSON format
annotated_entities_json = cat.get_entities(text)

# Prettify the annotations for better readability
prettified_json = json.dumps(annotated_entities_json, indent=4)

# Display the annotated entities in a readable JSON format
print()
print("Annotated Medical Entities (JSON format):")
print(prettified_json)



Annotated Medical Entities (JSON format):
{
    "entities": {
        "2": {
            "pretty_name": "year",
            "cui": "C0439234",
            "type_ids": [
                "T079"
            ],
            "types": [
                "Temporal Concept"
            ],
            "source_value": "year-old",
            "detected_name": "year~old",
            "acc": 0.6044525851765753,
            "context_similarity": 0.6044525851765753,
            "start": 5,
            "end": 13,
            "icd10": [],
            "ontologies": [],
            "snomed": [],
            "id": 2,
            "meta_anns": {
                "Status": {
                    "value": "Affirmed",
                    "confidence": 0.9990621209144592,
                    "name": "Status"
                }
            }
        },
        "4": {
            "pretty_name": "Male population group",
            "cui": "C0025266",
            "type_ids": [
                "T098"
            ],
    

### Filter Medical Entities with Concept Type 'Clinical Attribute'

In [33]:
# In this step, we filter the annotated medical entities and keep only those
# whose first concept type is 'Clinical Attribute'.

# Initialize a dictionary to store the filtered entities
filtered_entities_dict = dict()

# Iterate through each medical entity annotation
medical_entities_dict = annotated_entities_json["entities"]
for key, entity in medical_entities_dict.items():
    # Check if the entity's first concept type is 'Clinical Attribute'
    if entity['types'][0] == 'Clinical Attribute':
        # Add the entity to the filtered dictionary
        filtered_entities_dict[key] = entity

# Prettify the filtered annotations for better readability
prettified_filtered_entities_json = json.dumps(filtered_entities_dict, indent=4)
print(prettified_filtered_entities_json)

{}
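
The empty dictionary above indicates that no entity in this description had 'Clinical Attribute' as its first type. To see which types are actually present before choosing a filter, one option is to tally the `types` field. The sketch below does that and also filters by membership in `types` rather than by first position. The sample entities are trimmed and illustrative: the first mirrors fields from the output above, the second is hypothetical. `count_types` and `filter_by_type` are hypothetical helper names.

```python
from collections import Counter

# Trimmed, illustrative entities in the shape returned by cat.get_entities
sample_entities = {
    "2": {"pretty_name": "year", "types": ["Temporal Concept"]},
    # Hypothetical entry, added only to exercise the filter
    "9": {"pretty_name": "example attribute", "types": ["Clinical Attribute"]},
}


def count_types(entities: dict) -> Counter:
    """Tally how often each semantic type name occurs across the entities."""
    return Counter(t for entity in entities.values() for t in entity.get("types", []))


def filter_by_type(entities: dict, wanted: str) -> dict:
    """Keep entities whose `types` list contains `wanted` anywhere."""
    return {k: e for k, e in entities.items() if wanted in e.get("types", [])}


print(count_types(sample_entities))
print(filter_by_type(sample_entities, "Clinical Attribute"))
```

In the notebook, both helpers would be called on `annotated_entities_json["entities"]`.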


## 2.2 Consumer Health Search Task

In [34]:
# File path for the consumer health search (CHS) topics data
chs_topics_path = os.path.join(DATA_DIR, 'topics.txt')

In [35]:
# Step 1: Read and Parse XML Data
tree = ET.parse(chs_topics_path)
root = tree.getroot()

# Extract Data and Convert to DataFrame
chs_patients_data = []

# Iterate over each 'topic' element in the XML data
for topic in root.findall('topic'):
    topic_id = topic.find('id').text.strip()
    query = topic.find('query').text.strip()
    
    # Append the extracted data as a dictionary to the chs_patients_data list
    chs_patients_data.append({'Number': topic_id, 'Description': query})
    
# Create a DataFrame from the extracted data in chs_patients_data list
chs_df = pd.DataFrame(chs_patients_data)

# Display the full DataFrame
print(chs_df)
# Display top 5 rows
top_5_chs_df = chs_df.head(5)
print(top_5_chs_df)

   Number                                        Description
0       1  What are the most common chronic diseases? Wha...
1       8         best apps daily activity exercise diabetes
2      22             my risk for developing type 2 diabetes
3      35  Is a ketogenic / keto diet suitable for people...
4      45                             Can diabetes be cured?
5      51  What is holistic medicine and what does it inc...
6      52  What are the most common mental health problem...
7      53                             what is psychotherapy?
8      54  What does multiple sclerosis diagnosis include...
9      55                  How to manage multiple sclerosis?
10     57                   multiple sclerosis stages phases
11     58                 Risk to develop multiple sclerosis
12     59    long-term effects likelihood multiple sclerosis
13     62  disclosing multiple sclerosis at work, how wil...
14     63          Will multiple sclerosis affect my career?
15     68               

In [37]:
# Step 2: Annotate Description with MedCAT

# We will now proceed to annotate the medical entities present in the description

# Extract the description from the 1st row
chs_text = chs_df.iloc[0]['Description']

# Display the description from the 1st row
print("Description from the 1st row:")
print()
print(chs_text)


# Annotate the description with MedCAT
chs_annotated_text = cat(chs_text)

# Display the annotated medical entities
print()
print("Annotated Medical Entities:")
displacy.render(chs_annotated_text, style='ent')

Description from the 1st row:

What are the most common chronic diseases? What effects do chronic diseases have for the society and the individual?

Annotated Medical Entities:
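
The cell above annotates only the first row. To annotate every query, one approach is to map an annotation function over the whole column. The sketch below takes the annotator as an argument so it can be tried without a loaded model; `annotate_all` and `stub_annotate` are hypothetical names, and the stub only mimics the `{"entities": {...}}` shape.

```python
from typing import Callable, Dict, Iterable, List


def annotate_all(texts: Iterable[str], annotate: Callable[[str], Dict]) -> List[Dict]:
    """Apply `annotate` to each text and return the results in input order."""
    return [annotate(text) for text in texts]


def stub_annotate(text: str) -> Dict:
    """Stand-in annotator for a dry run; returns an empty entity set."""
    return {"entities": {}}


queries = ["Can diabetes be cured?", "what is psychotherapy?"]
results = annotate_all(queries, stub_annotate)
print(len(results))
```

In the notebook, the call would look like `annotate_all(chs_df['Description'], cat.get_entities)`.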


In [38]:
# Extract medical entities from 'chs_text' using MedCAT and display the results in prettified JSON format.
chs_annotated_text_json = cat.get_entities(chs_text)
chs_prettified_json = json.dumps(chs_annotated_text_json, indent=4)
print(chs_prettified_json)


{
    "entities": {
        "0": {
            "pretty_name": "Common (qualifier value)",
            "cui": "C0205214",
            "type_ids": [
                "T081"
            ],
            "types": [
                "Quantitative Concept"
            ],
            "source_value": "common",
            "detected_name": "common",
            "acc": 0.20545164847206543,
            "context_similarity": 0.20545164847206543,
            "start": 18,
            "end": 24,
            "icd10": [],
            "ontologies": [],
            "snomed": [],
            "id": 0,
            "meta_anns": {
                "Status": {
                    "value": "Other",
                    "confidence": 0.9999736547470093,
                    "name": "Status"
                }
            }
        },
        "2": {
            "pretty_name": "Chronic disease",
            "cui": "C0008679",
            "type_ids": [
                "T047"
            ],
            "types": [
          

### Print CUI and Corresponding Entity Names from Annotated Text

In [39]:
# Loop through the annotations in 'chs_annotated_text_json' and print each CUI and its entity name.

for annotation in chs_annotated_text_json['entities'].values():
    print(annotation['cui'], annotation['pretty_name'])
    print()

C0205214 Common (qualifier value)

C0008679 Chronic disease

C1280500 Effect

C0008679 Chronic disease

C0027361 Persons

C0030705 Patients
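
The listing repeats `C0008679` because the concept is mentioned twice in the query. When only the distinct concepts matter, the mentions can be counted per CUI. The sketch below uses the pairs printed above as input data.

```python
from collections import Counter

# (cui, pretty_name) pairs copied from the output above
mentions = [
    ("C0205214", "Common (qualifier value)"),
    ("C0008679", "Chronic disease"),
    ("C1280500", "Effect"),
    ("C0008679", "Chronic disease"),
    ("C0027361", "Persons"),
    ("C0030705", "Patients"),
]

# Counter keeps first-seen order, so concepts appear in order of first mention
cui_counts = Counter(mentions)
for (cui, name), count in cui_counts.items():
    print(cui, name, count)
```

In the notebook, `mentions` could be built from `chs_annotated_text_json['entities'].values()` instead of being typed in by hand.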



### Convert CUI to its preferred name

We can convert a CUI to its preferred name by looking it up in the <code>cdb.cui2preferred_name</code> mapping of <code>cat</code>.

In [41]:
# Convert a given CUI back to its preferred name.
cui = "C0205214"
preferred_name = cat.cdb.cui2preferred_name[cui]

# Print the preferred name associated with the CUI.
print(preferred_name)

Common (qualifier value)


### Retrieving All Names and Semantic Types for a Given CUI

We can retrieve all concept names associated with a CUI from the concept database, along with its semantic type IDs, by looking it up in the <code>cdb.cui2names</code> and <code>cdb.cui2type_ids</code> mappings of <code>cat</code>.

In [42]:
# Retrieve all names associated with the CUI from the concept database.
all_names = cat.cdb.cui2names[cui]

# Print the list of names associated with the given CUI.
print(all_names)

{'commonplace', 'conventional', 'prevalent', 'common~values', 'commonly', 'commonest', 'quite~common', 'common', 'common~value', 'less~common'}


In [43]:
# Retrieve the type_ids associated with the CUI from the concept database.
type_ids = cat.cdb.cui2type_ids[cui]

# Print the list of type_ids associated with the given CUI.
print(type_ids)

{'T081'}
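
The three lookups above can be combined into one summary per CUI. The sketch below takes the mappings as arguments so it can run without a loaded model; in the notebook they would be `cat.cdb.cui2preferred_name`, `cat.cdb.cui2names`, and `cat.cdb.cui2type_ids`. `describe_cui` is a hypothetical helper name, and the stand-in dictionaries hold a subset of the values shown in the outputs above.

```python
def describe_cui(cui, cui2preferred_name, cui2names, cui2type_ids):
    """Collect the preferred name, all names, and type IDs for one CUI."""
    return {
        "cui": cui,
        # .get avoids a KeyError when a CUI is missing from a mapping
        "preferred_name": cui2preferred_name.get(cui),
        "names": sorted(cui2names.get(cui, set())),
        "type_ids": sorted(cui2type_ids.get(cui, set())),
    }


# Stand-in mappings holding values shown in the outputs above
preferred = {"C0205214": "Common (qualifier value)"}
names = {"C0205214": {"common", "commonly", "prevalent"}}
type_ids = {"C0205214": {"T081"}}

summary = describe_cui("C0205214", preferred, names, type_ids)
print(summary)
```

Sorting the sets gives a stable order, which makes the output easier to compare between runs.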
