# SNOMED

SNOMED CT is a standardised clinical terminology consisting of more than 350,000 unique concepts. It is owned, maintained and distributed by SNOMED International.

Please visit https://www.snomed.org/ for further information about the SNOMED CT products and services that SNOMED International offers.

--------

All raw files from SNOMED can be found [here](data/snomed)

# Part 1: Preprocessing SNOMED CT for MedCAT

Once you have downloaded a SNOMED release of interest, store the zipped folder containing that release in the current Colab working directory.

The folder name should look like: `SnomedCT_InternationalRF2_PRODUCTION_20210131T120000Z.zip`


### Import required packages

In [1]:
import zipfile
from medcat.utils.preprocess_snomed import Snomed

### Load the data
Please see the section: [Access to SNOMED CT release files](#access_to_snomed_ct) for how to retrieve the zipped SNOMED CT release.

In [24]:
# Assign a path to the zipped SNOMED CT release download. (skip this step if the folder is not zipped)
snomed_path = "uk_sct2cl_35.2.0_20221123000001Z.zip"  # Enter your zipped Snomed folder here
snomed_folder = snomed_path[:-4]  # The unzipped SNOMED CT folder path

In [23]:
with zipfile.ZipFile(snomed_path, 'r') as zip_ref:
    zip_ref.extractall(snomed_folder)
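
Re-running the extraction cell unpacks the archive again each time. The following is a minimal sketch of a guard that validates the archive and skips extraction when the target folder already exists. The helper name `extract_release` is hypothetical and not part of MedCAT, and the demo archive is a throwaway file, not a real SNOMED release.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_release(zip_path, target_dir=None):
    """Unzip a release archive unless the target folder already exists."""
    zip_path = Path(zip_path)
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a readable zip archive")
    # Default to the archive name without its ".zip" suffix
    target = Path(target_dir) if target_dir else zip_path.with_suffix("")
    if not target.exists():
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(target)
    return target


# Demonstration on a tiny throwaway archive (not a real SNOMED release)
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "demo_release.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("Snapshot/Terminology/placeholder.txt", "id\teffectiveTime\n")

demo_folder = extract_release(demo_zip)
print(sorted(p.name for p in demo_folder.rglob("*.txt")))
```

With a real download, `snomed_folder = str(extract_release(snomed_path))` would stand in for the two cells above.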

### Preprocess the release for MedCAT

In [25]:
# Initialise
snomed = Snomed(snomed_folder)

In [26]:
### Skip this step if your version of snomed is not the UK extension released >2021.
### Note: this step will only work with MedCAT v1.2.7+

snomed.uk_ext = True

#### Create a SNOMED DataFrame

We first preprocess SNOMED to fit the following format:


|cui|name|ontologies|name_status|description_type_ids|type_ids|
|--|--|--|:--:|:--:|--|
|101009|Quilonia ethiopica (organism)|SNOMED|P|organism|81102976|
.
.
.

`cui` - The concept unique identifier, this is simply the `SCTID`.

`name` - The name of the concept. The status of the name is given in `name_status`.

`ontologies` - Always SNOMED. Alternatively you can change it to your specific edition.

`name_status` - The Fully Specified Name (FSN) is denoted with a `P` - Primary Name. Each concept must be assigned exactly one Primary Name, and Primary Names should be unique across all SCTIDs/cuis to avoid confusion. A synonym or other description type is represented as an `A` - Alternative Name. Alternative Names can be enriched with all possible names and abbreviations for a concept of interest.

`description_type_ids` - These are processed to be the Semantic Tags of the concept.

`type_ids` - This is simply a short numeric hash of the Semantic Tag (for example, `81102976` for `organism`).
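
To make the last two columns concrete, here is a minimal sketch of how a semantic tag can be pulled from the trailing parentheses of an FSN and turned into a stable number. The regular expression and the SHA-256 modulo scheme are assumptions made for illustration. MedCAT's own implementation may differ, so the numbers produced here are not guaranteed to match the `type_ids` shown in this notebook.

```python
import hashlib
import re

# Matches the final parenthesised group of an FSN, e.g. "(organism)"
SEMANTIC_TAG = re.compile(r"\(([^()]+)\)\s*$")


def semantic_tag(fsn):
    """Return the semantic tag at the end of a fully specified name, or None."""
    match = SEMANTIC_TAG.search(fsn)
    return match.group(1) if match else None


def tag_hash(tag, digits=8):
    """Illustrative stable numeric hash of a tag (assumed scheme, not MedCAT's API)."""
    digest = hashlib.sha256(tag.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10 ** digits


fsn = "Quilonia ethiopica (organism)"
tag = semantic_tag(fsn)
print(tag, tag_hash(tag))
```

Because the hash depends only on the tag text, every concept with the same semantic tag shares the same `type_ids` value, which is consistent with the `finding` rows in the output further down.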




In [27]:
# Create SNOMED DataFrame 
df = snomed.to_concept_df()

In [28]:
df.head()

Unnamed: 0,cui,name,name_status,ontologies,description_type_ids,type_ids
0,181000000101,Patient National Health Service number unknown...,P,SNOMED-CT,finding,67667581
1,191000000104,Patient NHS no ? correct (finding),P,SNOMED-CT,finding,67667581
2,201000000102,FMed 136A-ask for service PH (finding),P,SNOMED-CT,finding,67667581
3,231000000108,Patient deregistered - medical record envelope...,P,SNOMED-CT,finding,67667581
4,241000000104,Medical record envelope kept for medicolegal r...,P,SNOMED-CT,finding,67667581


In [29]:
# inspect
df[df['cui'] == '101009']

Unnamed: 0,cui,name,name_status,ontologies,description_type_ids,type_ids
62430,101009,Quilonia ethiopica (organism),P,SNOMED-CT,organism,81102976
419247,101009,Quilonia ethiopica,A,SNOMED-CT,organism,81102976


In [30]:
# Optional - Create a SCTID to FSN dictionary
primary_names_only = df[df["name_status"] == 'P']
sctid2name = dict(zip(primary_names_only['cui'], primary_names_only['name']))
del primary_names_only

In [31]:
# Test with example SCTID
sctid2name['101009']

'Quilonia ethiopica (organism)'
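
The SCTID to FSN dictionary silently keeps only the last name if a concept has more than one `P` row. A minimal sketch of a check for the "exactly one Primary Name per concept" rule is shown below. It works on plain dictionaries in the same shape as the DataFrame rows, and the helper name `primary_name_problems` is hypothetical.

```python
from collections import Counter

# Illustrative rows in the same shape as the concept DataFrame
rows = [
    {"cui": "101009", "name": "Quilonia ethiopica (organism)", "name_status": "P"},
    {"cui": "101009", "name": "Quilonia ethiopica", "name_status": "A"},
    {"cui": "191000000104", "name": "Patient NHS no ? correct (finding)", "name_status": "P"},
]


def primary_name_problems(rows):
    """Return {cui: number_of_P_names} for cuis without exactly one 'P' name."""
    primary_counts = Counter(r["cui"] for r in rows if r["name_status"] == "P")
    all_cuis = {r["cui"] for r in rows}
    return {
        cui: primary_counts.get(cui, 0)
        for cui in all_cuis
        if primary_counts.get(cui, 0) != 1
    }


print(primary_name_problems(rows))  # → {}
```

For the real data, `primary_name_problems(df.to_dict("records"))` would run the same check over every row of the DataFrame.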

#### SNOMED Relationships

In [32]:
all_snomed_relationships = snomed.list_all_relationships()

In [33]:
# List of the SCTID of all snomed relationships
all_snomed_relationships

['116680003',
 '260686004',
 '363704007',
 '363699004',
 '405814001',
 '363714003',
 '363698007',
 '116676008',
 '363713009',
 '246501002',
 '418775008',
 '408731000',
 '363589002',
 '408730004',
 '408732007',
 '246090004',
 '408729009',
 '363701004',
 '272741003',
 '405813007',
 '425391005',
 '424361007',
 '363700003',
 '260507000',
 '405815000',
 '424226004',
 '424244007',
 '363709002',
 '363703001',
 '246454002',
 '246513007',
 '370135005',
 '363702006',
 '116686009',
 '42752001',
 '263502005',
 '47429007',
 '118169006',
 '370133003',
 '118168003',
 '118171006',
 '410675002',
 '246075003',
 '424876005',
 '405816004',
 '246093002',
 '255234002',
 '371881003',
 '419066007',
 '704319004',
 '370131001',
 '719722006',
 '370130000',
 '704327008',
 '370132008',
 '370134009',
 '260870009',
 '704321009',
 '118170007',
 '704324001',
 '719715003',
 '718497002',
 '363710007',
 '704325000',
 '116680003',
 '363698007',
 '363703001',
 '260686004',
 '363714003',
 '246075003',
 '272741003',
 '246454

In [34]:
# Use the SCTID-to-name dictionary to inspect the FSN (fully specified name) of each relationship:
for sctid in all_snomed_relationships:
    print(sctid2name[sctid])

Is a (attribute)
Method (attribute)
Procedure site (attribute)
Direct device (attribute)
Procedure site - Indirect (attribute)
Interprets (attribute)
Finding site (attribute)
Associated morphology (attribute)
Has interpretation (attribute)
Technique (attribute)
Finding method (attribute)
Temporal context (attribute)
Associated procedure (attribute)
Procedure context (attribute)
Subject relationship context (attribute)
Associated finding (attribute)
Finding context (attribute)
Direct substance (attribute)
Laterality (attribute)
Procedure site - Direct (attribute)
Using access device (attribute)
Using substance (attribute)
Direct morphology (attribute)
Access (attribute)
Procedure device (attribute)
Using device (attribute)
Using energy (attribute)
Indirect morphology (attribute)
Has intent (attribute)
Occurrence (attribute)
Revision status (attribute)
Pathological process (attribute)
Has focus (attribute)
Has specimen (attribute)
Due to (attribute)
Clinical course (attribute)
Associated

In [35]:
# save a specific relationship to json
# In this example we save the "Is a (attribute)" hierarchical relationship.
snomed.relationship2json("116680003", "ISA_relationship.json")
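
Once a hierarchical relationship has been saved, a common next step is to collect every descendant of a concept. The sketch below assumes the saved JSON maps each parent SCTID to a list of its child SCTIDs. Check this against your own file, since the layout is not confirmed here. The identifiers in the toy hierarchy are made up.

```python
import json
from collections import deque


def all_descendants(parent2children, root):
    """Collect every concept reachable from `root` via parent -> children links."""
    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in parent2children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy hierarchy with made-up identifiers, standing in for the
# contents of "ISA_relationship.json"
toy = json.loads('{"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}')
print(sorted(all_descendants(toy, "A")))  # → ['B', 'C', 'D']
```

The `seen` set matters because SNOMED CT is a polyhierarchy: a concept such as `D` above can be reached through more than one parent and should only be counted once.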

#### Mappings to inbuilt external terminologies 

Create dictionary maps that add extra information to the MedCAT concept database.

##### ICD-10
For the SNOMED to ICD-10 mapping, read more about Map Blocks, Map Groups and Map Priorities to follow the correct official mapping methodology.

In [36]:
# ICD-10
icd_df = snomed.map_snomed2icd10()

In [37]:
icd_df.head()

Unnamed: 0,id,effectiveTime,active,moduleId,refsetId,referencedComponentId,mapGroup,mapPriority,mapRule,mapAdvice,mapTarget,correlationId,mapCategoryId
0,49b07dca-0559-59d9-8c85-3198c941a813,20130731,1,449080006,447562003,10000006,1,1,True,ALWAYS R07.4,R07.4,447561005,447637006
1,2c141272-bbc0-5c34-aa48-72f2dc370ead,20150131,1,449080006,447562003,10001005,1,1,True,ALWAYS A41.9,A41.9,447561005,447637006
2,47cf847c-c85c-5979-8129-98475d4ab274,20190731,1,449080006,447562003,10007009,1,1,True,ALWAYS Q87.1 | POSSIBLE REQUIREMENT FOR ADDITI...,Q87.1,447561005,447637006
3,4cd08e27-0417-5c5d-8873-6a72d130e918,20140731,1,449080006,447562003,1001000119102,1,1,True,ALWAYS I26.9 | MAPPED FOLLOWING WHO GUIDANCE,I26.9,447561005,447637006
4,ca792560-1589-5f24-9126-b040e8ea2b2d,20180731,1,449080006,447562003,10017004,1,1,True,ALWAYS K03.0,K03.0,447561005,447637006


In [38]:
sctid2icd10 = {k: g["mapTarget"].tolist() for k,g in icd_df[['referencedComponentId',
                                                              'mapTarget']].groupby("referencedComponentId")}

In [39]:
# View the SNOMED to ICD-10 map structure.
# Each SCTID maps to a list of ICD-10 codes, e.g. '10000006': ['R07.4']
sctid2icd10


{'10000006': ['R07.4'],
 '10001005': ['A41.9'],
 '10007009': ['Q87.1'],
 '1001000119102': ['I26.9'],
 '10017004': ['K03.0'],
 '100191000119105': ['N42.8'],
 '100211000119106': ['R25.2'],
 '1002229008': ['L75.2'],
 '1002253002': [''],
 '100231000119101': ['I31.8'],
 '1003002': ['Z60.9'],
 '10033001': ['Q79.6'],
 '1003322006': ['Q93.5'],
 '1003324007': ['Q74.2'],
 '1003330007': ['Q70.3'],
 '1003337005': ['Q72.8'],
 '1003339008': ['Q71.8'],
 '1003358004': ['Q93.5'],
 '1003364006': ['Q93.5'],
 '1003367004': ['E72.1'],
 '1003368009': ['E72.1'],
 '1003369001': ['Q11.2', 'Q87.8'],
 '1003370000': ['Q11.2', 'Q87.8'],
 '1003371001': ['Q73.8'],
 '1003372008': ['Q11.2', 'Q18.8'],
 '1003373003': ['Q02', 'Q04.8'],
 '1003374009': ['Q04.3'],
 '1003375005': ['E75.2'],
 '1003376006': ['Q92.8'],
 '1003377002': ['Q92.8'],
 '1003378007': ['Q07.8', 'E34.9'],
 '1003379004': ['Q78.0'],
 '1003380001': ['Q93.5'],
 '1003381002': ['L67.8', 'D70'],
 '1003384005': ['Q76.4'],
 '1003385006': ['F89', 'P04.3'],
 '10033
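
The flat lists above drop the `mapGroup` and `mapPriority` columns, which the official mapping methodology relies on. A minimal sketch of a richer structure that keeps them, ordered by group and then priority, follows. The rows use the refset column names from `icd_df`, but the concept identifiers and group values are made up for illustration, and dropping rows with an empty `mapTarget` is a choice made here, not an official rule.

```python
from collections import defaultdict

# Illustrative rows; identifiers and group values are made up
map_rows = [
    {"referencedComponentId": "cui_a", "mapGroup": 2, "mapPriority": 1, "mapTarget": "Q87.8"},
    {"referencedComponentId": "cui_a", "mapGroup": 1, "mapPriority": 1, "mapTarget": "Q11.2"},
    {"referencedComponentId": "cui_b", "mapGroup": 1, "mapPriority": 1, "mapTarget": ""},
]


def build_icd10_map(rows):
    """Group map targets per concept, ordered by map group then priority."""
    grouped = defaultdict(list)
    for row in rows:
        target = row["mapTarget"]
        if not target:  # skip rows with no target code
            continue
        grouped[row["referencedComponentId"]].append(
            {
                "code": target,
                "group": int(row["mapGroup"]),
                "priority": int(row["mapPriority"]),
            }
        )
    for entries in grouped.values():
        entries.sort(key=lambda e: (e["group"], e["priority"]))
    return dict(grouped)


rich_map = build_icd10_map(map_rows)
print(rich_map)
```

With the real data, `build_icd10_map(icd_df.to_dict("records"))` would produce the same structure for every mapped concept.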

##### OPCS
Office of Population Censuses and Surveys


__Note:__ Only the SNOMED UK extension edition contains this information.
Skip this step if your version is not a UK extension.

In [None]:
opcs_df = snomed.map_snomed2opcs4()

In [None]:
opcs_df.head()

### Save for MedCAT

In [None]:
# Save to CSV for medcat CDB creation
df.to_csv("preprocessed_snomed.csv", index=False)

--------

# Part 2: Create a MedCAT CDB using SNOMED CT release files


These steps are also in the [create_cdb.py](medcat/1_create_model/create_cdb/create_cdb.py)

In [None]:
# Import required packages
from medcat.cdb import CDB
from medcat.config import Config
from medcat.cdb_maker import CDBMaker

#### Create concept database (cdb)

In [None]:
# First initialise the default configuration
config = Config()
config.general['spacy_model'] = 'en_core_web_md'
maker = CDBMaker(config)

In [None]:
# Create a list of the CSV files that will be used to build our CDB
csv_path = ['preprocessed_snomed.csv']

# Create your CDB
cdb = maker.prepare_csvs(csv_path, full_build=True)
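
Building a CDB from a full SNOMED release can take a while, so it can be worth checking the CSV header before starting. The sketch below uses only the standard library. Treating `cui` and `name` as the minimum required columns is an assumption of this example, and the helper name `missing_columns` is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = {"cui", "name"}  # assumed minimum for this sketch


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return required column names absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return sorted(set(required) - set(header))


# In-memory sample with the same header as the preprocessed file
sample = io.StringIO(
    "cui,name,name_status,ontologies,description_type_ids,type_ids\n"
    "101009,Quilonia ethiopica (organism),P,SNOMED-CT,organism,81102976\n"
)
print(missing_columns(sample))  # → []
```

For the file written earlier, open it with `open("preprocessed_snomed.csv", newline="")` and pass the file object to `missing_columns`.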

### Inspect your cdb

In [None]:
print(cdb.name2cuis['epilepsy'])

In [None]:
print(cdb.cui2preferred_name['84757009'])

In [None]:
print(cdb.cui2names['84757009'])

#### Enrich with extra information and mapping

Mapping was created in [Mappings to inbuilt external terminologies](https://colab.research.google.com/drive/1yesqjMQwQH20Kl9w7siRGVaSWU0uI84W#scrollTo=Mappings_to_inbuilt_external_terminologies).
Here we use [ICD-10](https://colab.research.google.com/drive/1yesqjMQwQH20Kl9w7siRGVaSWU0uI84W#scrollTo=ICD_10) as an example.

In [None]:
cdb.addl_info['cui2icd10'] = sctid2icd10

### Save your new SNOMED cdb

__Tip:__ It is good practice to include the SNOMED release edition file name in the name of the saved file.

In [None]:
cdb.save("SNOMED_cdb.dat")
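
Following the tip above, here is a minimal sketch of deriving the output name from the release archive so the saved CDB records which edition it was built from. The helper name `cdb_filename` and the naming pattern are illustrative choices, not a MedCAT convention.

```python
from pathlib import Path


def cdb_filename(release_zip, prefix="SNOMED_cdb"):
    """Build an output name that embeds the release archive's base name."""
    release = Path(release_zip).name
    if release.lower().endswith(".zip"):
        release = release[:-4]
    return f"{prefix}_{release}.dat"


print(cdb_filename("uk_sct2cl_35.2.0_20221123000001Z.zip"))
# → SNOMED_cdb_uk_sct2cl_35.2.0_20221123000001Z.dat
```

The result can then be passed to `cdb.save(...)` in place of the fixed `"SNOMED_cdb.dat"` name.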