In [1]:
from IPython.display import display, Markdown

### SNOMED methods example

In [2]:
display(Markdown("""
## Begin

Ensure the methods are on the Python path
"""))

import os, sys
sys.path.insert(0,'/home/aliencat/samora/gloabl_files')
sys.path.insert(0,'/data/AS/Samora/gloabl_files')
sys.path.insert(0,'/home/jovyan/work/gloabl_files')
sys.path.insert(0, '/home/cogstack/samora/_data/gloabl_files')


## Begin

Ensure the methods are on the Python path

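The path setup above adds every candidate directory whether or not it exists on the current machine. As a minimal sketch, the helper below prepends only directories that are present. `add_existing_paths` is a hypothetical name for illustration and is not part of `snomed_methods`.

```python
import os
import sys
import tempfile

def add_existing_paths(candidates, path_list=None):
    """Prepend each candidate directory that exists; return the ones added."""
    # Hypothetical helper: operates on sys.path unless another list is given.
    path_list = sys.path if path_list is None else path_list
    added = []
    for candidate in candidates:
        if os.path.isdir(candidate) and candidate not in path_list:
            path_list.insert(0, candidate)
            added.append(candidate)
    return added

# Demonstration on a throwaway list so sys.path is left untouched.
demo_paths = []
with tempfile.TemporaryDirectory() as existing_dir:
    missing_dir = os.path.join(existing_dir, "does_not_exist")
    added_dirs = add_existing_paths([existing_dir, missing_dir], demo_paths)
    existing_was_added = added_dirs == [existing_dir]
```

Printing the returned list is a quick way to see which environment the notebook is running in.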

In [3]:
from snomed_methods import snomed_methods_v1

display(Markdown("""

Import module

"""))



Import module



Ensure the RF2 SNOMED files are in the folder specified in snomed_methods_v1.py

If using MedCAT, ensure the MedCAT path is set and your dev environment is configured (defaults to dh-cap02)




In [4]:
snomed_relations_obj = snomed_methods_v1.snomed_relations(medcat=True)


display(Markdown("""

Initialise the snomed methods object

"""))





Initialise the snomed methods object



In [5]:
display(Markdown("""

Define your starting point SNOMED cui code.
"""))

outcome_variable_cui_for_filter = '399187006'  # HFE

print(outcome_variable_cui_for_filter)




Define your starting point SNOMED cui code.


399187006


In [6]:
filter_root_cui = outcome_variable_cui_for_filter
print(filter_root_cui)

399187006


#### Expanding from the starting SNOMED code to find related codes in the SNOMED tree:

In [7]:
retrieved_codes_snomed_tree, retrieved_names_snomed_tree = snomed_relations_obj.recursive_code_expansion(filter_root_cui, n_recursion = 10, debug=False)

display(Markdown("""

n_recursion is the number of cycles of searching for a code's parents and children, adding them to a set, and then searching for the parents and children of each of those codes.
Higher recursion means more exploration and more codes, but also higher odds of unrelated concepts being returned.

"""))

Retrieving 399187006 with recursion 10


100%|██████████| 10/10 [00:00<00:00, 11.69it/s]




n_recursion is the number of cycles of searching for a code's parents and children, adding them to a set, and then searching for the parents and children of each of those codes.
Higher recursion means more exploration and more codes, but also higher odds of unrelated concepts being returned.
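To make the effect of `n_recursion` concrete, here is a minimal sketch of parent/child expansion on a toy graph. It is an illustration under assumptions, not the implementation inside `snomed_methods_v1`: `expand_codes`, `toy_parents` and `toy_children` are hypothetical names, and the letters stand in for SNOMED codes.

```python
def expand_codes(root, parents, children, n_recursion):
    """Collect codes reachable from root via parent/child links in n cycles."""
    found = {root}
    frontier = {root}
    for _ in range(n_recursion):
        next_frontier = set()
        for code in frontier:
            # Look both up (parents) and down (children) from each code.
            neighbours = set(parents.get(code, ())) | set(children.get(code, ()))
            next_frontier |= neighbours - found
        if not next_frontier:
            break  # nothing new was found, so further cycles change nothing
        found |= next_frontier
        frontier = next_frontier
    return found

# Toy hierarchy: A is the parent of B and C, B is the parent of D,
# and X is a second, unrelated parent of C.
toy_children = {"A": ["B", "C"], "B": ["D"], "X": ["C"]}
toy_parents = {"B": ["A"], "C": ["A", "X"], "D": ["B"]}

one_cycle = expand_codes("B", toy_parents, toy_children, 1)
three_cycles = expand_codes("B", toy_parents, toy_children, 3)
print(sorted(one_cycle))     # ['A', 'B', 'D']
print(sorted(three_cycles))  # ['A', 'B', 'C', 'D', 'X']
```

After one cycle only the direct parent and child of `B` are returned. By the third cycle the walk has gone up to `A`, down to the sibling `C`, and up again to `X`, which has no direct relationship to `B`. This is the drift towards unrelated concepts that a high `n_recursion` can cause.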



In [8]:
retrieved_codes_snomed_tree[0:5], len(retrieved_codes_snomed_tree), len(retrieved_names_snomed_tree)

([66576001, 6160004, 401119001, 143101000119101, 399187006], 21, 20)

In [9]:
retrieved_names_snomed_tree[0:3]

display(Markdown("""

Let's examine some of the identified codes' names.
"""))



Let's examine some of the identified codes' names.


In [10]:
retrieved_codes_snomed_tree[0:3]

[66576001, 6160004, 401119001]

#### Let's try an additional method to find related codes.

Here we will attempt to get related codes from the context similarity in MedCAT's concept database (CDB). In other words, which concepts occurred in a similar context in the training data for our CDB? **This method may not work if the concept did not receive training in the initial base model**, because such a concept has no context vector(s).
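Since the method depends on the concept having a context vector, it can help to check for one before querying. The sketch below assumes the CDB exposes a mapping of the form `{cui: {context_type: vector}}` (in MedCAT v1 this is believed to be `cdb.cui2context_vectors`, but treat that as an assumption and verify it against your installed version). `has_context_vector` is a hypothetical helper, and the toy dictionary uses placeholder CUIs rather than real data.

```python
import numpy as np

def has_context_vector(cui2context_vectors, cui, context_type):
    """Return True if the CUI has a trained vector for the given context type."""
    return context_type in cui2context_vectors.get(cui, {})

# Placeholder stand-in for the CDB mapping; the CUIs and vectors are made up.
toy_context_vectors = {
    "cui_trained": {"xxxlong": np.zeros(4), "long": np.zeros(4)},
    "cui_untrained": {},
}

trained_ok = has_context_vector(toy_context_vectors, "cui_trained", "xxxlong")
untrained_ok = has_context_vector(toy_context_vectors, "cui_untrained", "xxxlong")
unknown_ok = has_context_vector(toy_context_vectors, "cui_missing", "xxxlong")
```

If the check fails for the starting CUI, the SNOMED tree expansion or the name-embedding approach may be a better starting point.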

In [11]:
retrieved_codes_medcat_cdb, retrieved_names_medcat_cdb  = snomed_relations_obj.get_medcat_cdb_most_similar(filter_root_cui, context_type = 'xxxlong', type_id_filter=[], topn=50)

In [12]:
retrieved_names_medcat_cdb[0:5]

['Hemochromatosis (disorder)',
 'Hereditary hemochromatosis (disorder)',
 'Juvenile hemochromatosis (disorder)',
 'Hereditary spherocytosis (disorder)',
 'Spherocytosis (finding)']

## An additional method 



In this method we calculate an embedding for each SNOMED term from its name, using a large language model (Gatortron OG) trained on clinical text. We then calculate an embedding for our term of choice.
With these embedding vectors we can measure their cosine similarity and return a list of the most similar terms.
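As a small worked illustration of the measure itself: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction only and not on magnitude. The toy two-dimensional vectors below are made up for illustration and are stored with shape `(1, d)` to mirror the precomputed embeddings. `cosine` is a hypothetical helper name.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors of shape (d,) or (1, d)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings; real ones have far more dimensions.
toy_embeddings = {
    "query": np.array([[1.0, 0.0]]),
    "same_direction": np.array([[2.0, 0.0]]),
    "orthogonal": np.array([[0.0, 3.0]]),
    "diagonal": np.array([[1.0, 1.0]]),
}

query = toy_embeddings["query"]
ranked = sorted(
    ((term, cosine(query, vec)) for term, vec in toy_embeddings.items() if term != "query"),
    key=lambda pair: pair[1],
    reverse=True,
)
for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

The vector pointing the same way as the query scores 1 even though it is twice as long, the diagonal one scores about 0.707, and the orthogonal one scores 0.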

In [13]:
import pickle



# Load the dictionary back from the file
with open('/home/cogstack/samora/_data/gloabl_files/gatortron/precomputed_sname_gatortron_base_embedding_dict.pkl', 'rb') as file:
    loaded_dict = pickle.load(file)

# Print the number of entries in the loaded dictionary
print(len(loaded_dict.keys()))


7311327


In [14]:
list(loaded_dict.keys())[0:3]

['neoplasm~of~anterior~surface~of~epiglottis~diagnosis',
 'neoplasm',
 'neoplasm~of']

In [15]:
loaded_dict.get('hemochromatosis')

array([[ 0.13535264,  0.05197329, -0.02210324, ...,  0.01287067,
        -0.5452818 , -0.14283289]], dtype=float32)

In [23]:
import random

display(Markdown("""

Scoring the full list takes a long time (approx. 1 h), so we randomly sample keys as an example.

"""))


# Get a list of all keys in the dictionary
all_keys = list(loaded_dict.keys())

# Select 100,000 random keys
selected_keys = random.sample(all_keys, 100000)

# Create a new dictionary with only the selected keys
filtered_dict = {key: loaded_dict[key] for key in selected_keys}

# Now, filtered_dict contains only 100,000 randomly selected key-value pairs from loaded_dict
#print(filtered_dict)




Scoring the full list takes a long time (approx. 1 h), so we randomly sample keys as an example.
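Besides sampling, another option for reducing the run time is to score all terms in one matrix operation instead of one term at a time. The sketch below is an assumption about how this could be done with NumPy and has not been timed on the real dictionary. `top_n_similar` is a hypothetical name, and the toy vectors are made up.

```python
import numpy as np

def top_n_similar(target_vector, term_vectors, n=5):
    """Return the n (term, cosine similarity) pairs most similar to the target."""
    terms = list(term_vectors)
    if not terms:
        return []
    # Stack the (1, d) embeddings into a single (num_terms, d) matrix.
    matrix = np.vstack([np.asarray(term_vectors[t]).reshape(1, -1) for t in terms])
    target = np.asarray(target_vector).reshape(-1)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    # Guard against zero-length vectors to avoid division by zero.
    scores = matrix @ target / np.where(norms == 0, 1.0, norms)
    n = min(n, len(terms))
    # argpartition finds the top n without sorting every score.
    top = np.argpartition(-scores, n - 1)[:n]
    top = top[np.argsort(-scores[top])]
    return [(terms[i], float(scores[i])) for i in top]

toy_vectors = {
    "a": np.array([[1.0, 0.0, 0.0]]),
    "b": np.array([[0.0, 1.0, 0.0]]),
    "c": np.array([[1.0, 1.0, 0.0]]),
    "d": np.array([[-1.0, 0.0, 0.0]]),
}
top_two = top_n_similar(np.array([[1.0, 0.0, 0.0]]), toy_vectors, n=2)
print(top_two)
```

Stacking every embedding into one matrix needs enough memory to hold the whole matrix at once, so for the full dictionary it may be necessary to process the terms in chunks and merge the per-chunk top results.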



In [24]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

def find_most_similar(target_vector, term_vectors, n=5):
    """
    Find the n most similar vectors to the target_vector from the given term_vectors.

    Parameters:
    - target_vector: The vector for which similarity is to be calculated.
    - term_vectors: A dictionary of term vectors.
    - n: The number of most similar vectors to retrieve (default is 5).

    Returns:
    - A list of tuples, each containing (term, similarity_score), sorted by similarity_score in descending order.
    """
    similarities = {}
    
    # Reshape target_vector to 2D array
    target_vector = target_vector.reshape(1, -1)
    
    for term, vector in tqdm(term_vectors.items()):
        # Reshape vector to 2D array
        vector = vector.reshape(1, -1)
        
        # Calculate cosine similarity
        similarity_score = cosine_similarity(target_vector, vector)[0, 0]
        similarities[term] = similarity_score
    
    # Sort terms by similarity in descending order
    sorted_terms = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
    
    # Return the top n most similar vectors with their terms
    top_n_similarities = sorted_terms[:n]
    
    return top_n_similarities

# Example usage:
# Assuming loaded_dict is a dictionary of term vectors
# loaded_dict = {'term1': np.array([[0.1, 0.2, 0.3]]), 'term2': np.array([[0.4, 0.5, 0.6]])}

target_vector = loaded_dict.get('hemochromatosis')
result = find_most_similar(target_vector, filtered_dict, n=50)

# Print the result
for term, similarity_score in result:
    print(f'Term: {term}, Similarity Score: {similarity_score}')


100%|██████████| 100000/100000 [00:41<00:00, 2389.30it/s]


Term: erythrocytosis, Similarity Score: 0.8896450400352478
Term: latent~haemochromatosis, Similarity Score: 0.8856067657470703
Term: acquired~intolerance, Similarity Score: 0.8817482590675354
Term: hhaemochromatosis, Similarity Score: 0.8780232667922974
Term: polycythemia~due, Similarity Score: 0.8698443174362183
Term: hematobia, Similarity Score: 0.867199182510376
Term: failure, Similarity Score: 0.8663843870162964
Term: septic, Similarity Score: 0.8660054206848145
Term: hemoglobins, Similarity Score: 0.8609825372695923
Term: pulmonators, Similarity Score: 0.8604337573051453
Term: nicobar, Similarity Score: 0.8598579168319702
Term: hyperammonemia, Similarity Score: 0.8569173812866211
Term: secondary~cholangitis, Similarity Score: 0.8568036556243896
Term: anemias, Similarity Score: 0.8564382791519165
Term: hyperalphalipoproteinemia, Similarity Score: 0.8559831380844116
Term: anthracycline, Similarity Score: 0.855939507484436
Term: hypothalamic~hypothyroidism, Similarity Score: 0.855540