# Working with UMLS

This notebook walks you through how to:
1) Download a specific version of UMLS

2) Load the MRCONSO.RRF file from the downloaded zip archive into a pandas DataFrame, which you can then manipulate

__Note:__ Keep in mind that the UMLS file sets are very large!

## Part 1: Downloading UMLS

In [None]:
import os
from umls_downloader import download_umls

In [None]:
# Get this from https://uts.nlm.nih.gov/uts/edit-profile
api_key = ''
version = '2022AA' # Change this to the UMLS version that you require

In [None]:
path = download_umls(version=version, api_key=api_key)
print(path) # This is where the UMLS files are now saved
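
If you would rather not paste the API key into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `load_api_key` and the variable name `UMLS_API_KEY` are assumptions made for this example, not something the notebook or `umls_downloader` requires.

```python
import os


def load_api_key(env_var="UMLS_API_KEY", environ=None):
    """Return the UTS API key stored in an environment variable.

    `env_var` is a hypothetical name chosen for this sketch.
    `environ` defaults to os.environ and can be replaced for testing.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your UTS API key."
        )
    return key


# Usage (assumes the variable was set before starting Jupyter):
# api_key = load_api_key()
```

This keeps the key out of the saved notebook, which helps if the notebook is shared or committed to version control.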

## Part 2: Working with UMLS

The part of UMLS that we require is stored in the MRCONSO.RRF file. The file layout is as follows:

__Concept Names and Sources (File = MRCONSO.RRF)__

|Col.|Description|
|---|---|
|CUI|	Unique identifier for concept|
|LAT|	Language of term|
|TS|	Term status|
|LUI|	Unique identifier for term|
|STT|	String type|
|SUI|	Unique identifier for string|
|ISPREF|	Atom status - preferred (Y) or not (N) for this string within this concept|
|AUI|	Unique identifier for atom - variable length field, 8 or 9 characters|
|SAUI|	Source asserted atom identifier [optional]|
|SCUI|	Source asserted concept identifier [optional]|
|SDUI|	Source asserted descriptor identifier [optional]|
|SAB|	Abbreviated source name (SAB). Maximum field length is 20 alphanumeric characters. Two source abbreviations are assigned. Root Source Abbreviation (RSAB): short form, no version information; for example, AI/RHEUM, 1993, has an RSAB of "AIR". Versioned Source Abbreviation (VSAB): includes version information; for example, AI/RHEUM, 1993, has a VSAB of "AIR93". Official source names, RSABs, and VSABs are included on the UMLS Source Vocabulary Documentation page.|
|TTY|	Abbreviation for term type in source vocabulary, for example PN (Metathesaurus Preferred Name) or CD (Clinical Drug). Possible values are listed on the Abbreviations Used in Data Elements page.|
|CODE|	Most useful source asserted identifier (if the source vocabulary has more than one identifier), or a Metathesaurus-generated source entry identifier (if the source vocabulary has none)|
|STR|	String|
|SRL|	Source restriction level|
|SUPPRESS|	Suppressible flag. Values = O, E, Y, or N. O: All obsolete content, whether obsolesced by the source or by NLM. This includes all atoms having obsolete TTYs, and other atoms becoming obsolete that have not acquired an obsolete TTY (e.g. RxNorm SCDs no longer associated with current drugs, LNC atoms derived from obsolete LNC concepts). E: Non-obsolete content marked suppressible by an editor. These do not have a suppressible SAB/TTY combination. Y: Non-obsolete content deemed suppressible during inversion. These can be determined by a specific SAB/TTY combination explicitly listed in MRRANK. N: None of the above. Default suppressibility as determined by NLM (i.e., no changes at the Suppressibility tab in MetamorphoSys) should be used by most users, but may not be suitable in some specialized applications. See the MetamorphoSys Help page for information on how to change the SAB/TTY suppressibility to suit your requirements. NLM strongly recommends that users not alter editor-assigned suppressibility, and MetamorphoSys cannot be used for this purpose.|
|CVF|	Content View Flag. Bit field used to flag rows included in Content View. This field is a varchar field to maximize the number of bits available for use.|
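
To make the layout concrete, here is a minimal sketch that splits one pipe-delimited row into the 18 named fields from the table. The sample row is made up for illustration and is not real UMLS data; the helper name `parse_mrconso_line` is likewise hypothetical.

```python
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def parse_mrconso_line(line):
    """Split one MRCONSO.RRF line into a dict keyed by column name."""
    fields = line.rstrip("\r\n").split("|")
    # Every row ends with a trailing "|", so the final piece is empty.
    if fields and fields[-1] == "":
        fields = fields[:-1]
    if len(fields) != len(MRCONSO_COLUMNS):
        raise ValueError(
            f"expected {len(MRCONSO_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(MRCONSO_COLUMNS, fields))


# Made-up row in MRCONSO format (not real UMLS content).
sample = (
    "C9999999|ENG|P|L9999999|PF|S9999999|Y|A99999999||X001||"
    "EXAMPLE|PT|X001|Example term|0|N||\n"
)
record = parse_mrconso_line(sample)
print(record["CUI"], record["SAB"], record["STR"])
```

Optional fields such as SAUI, SDUI and CVF simply come back as empty strings when the source leaves them blank.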

In [None]:
import zipfile
import pandas as pd

In [None]:
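# Read MRCONSO.RRF straight from the downloaded zip archive, one line at a time
umls_rows = []
with zipfile.ZipFile(path) as zip_file:
    with zip_file.open("MRCONSO.RRF", mode="r") as file:
        for line in file:
            # Fields are pipe-delimited and each line ends with "|\n",
            # so the last split element is just the newline and is dropped
            umls_rows.append(line.decode('UTF-8').split('|')[:-1])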
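
Because the file is very large, it may help to filter rows while reading instead of keeping every line in memory. The following is a sketch under that assumption: `iter_mrconso_rows` is a hypothetical helper, and filtering on English (`LAT == "ENG"`) is only an example criterion.

```python
import io
import zipfile


def iter_mrconso_rows(zip_path, member="MRCONSO.RRF", language="ENG"):
    """Yield rows (lists of 18 fields) whose LAT column equals `language`.

    `zip_path` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_path) as zip_file:
        with zip_file.open(member, mode="r") as raw:
            # Decode lazily; only "\n" is treated as a line terminator.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
            for line in text:
                # Drop the empty piece after the trailing "|".
                fields = line.rstrip("\r\n").split("|")[:-1]
                # LAT is the second column of MRCONSO.RRF.
                if len(fields) > 1 and fields[1] == language:
                    yield fields


# Usage with the `path` returned by download_umls:
# umls_rows = list(iter_mrconso_rows(path))
```

The resulting list has the same shape as `umls_rows`, so it can be passed to `pd.DataFrame` in the same way, just with fewer rows.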

In [None]:
columns = [
    "CUI",
    "LAT",
    "TS",
    "LUI",
    "STT",
    "SUI",
    "ISPREF",
    "AUI",
    "SAUI",
    "SCUI",
    "SDUI",
    "SAB",
    "TTY",
    "CODE",
    "STR",
    "SRL",
    "SUPPRESS",
    "CVF",   
]

In [None]:
umls_df = pd.DataFrame(columns=columns, data=umls_rows)

In [None]:
umls_df.head()

Feel free to now manipulate the dataframe as you would like!
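
As one example of such a manipulation, the sketch below picks a single preferred English name per concept. The filter (`TS == "P"`, `STT == "PF"`, `ISPREF == "Y"`, not suppressed) is a common heuristic rather than anything this notebook prescribes, and `preferred_english_names` is a hypothetical helper name.

```python
import pandas as pd


def preferred_english_names(df):
    """Return a Series mapping each CUI to one preferred English string."""
    mask = (
        (df["LAT"] == "ENG")        # English terms only
        & (df["TS"] == "P")         # preferred term for the concept
        & (df["STT"] == "PF")       # preferred form of that term
        & (df["ISPREF"] == "Y")     # preferred atom for that string
        & (df["SUPPRESS"] == "N")   # not flagged as suppressible
    )
    # Keep the first matching row per CUI and index the names by CUI.
    return df.loc[mask].drop_duplicates("CUI").set_index("CUI")["STR"]


# Usage with the dataframe built above:
# names = preferred_english_names(umls_df)
# names.head()
```

Adjust or drop individual conditions if your application needs suppressed content or other languages.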