## Analyze Morbidmap content - v1

The goal of this notebook is to analyze the contents of the OMIM files morbidmap and mimTitles in order to create a gold standard list of diseases that should be represented in Mondo with 'has material basis in germline mutation in' some GENE. The diseases in this list can be used to compare results across the successive transformations of the OMIM content and to confirm that the final representation is correct.

To download these files, request an API key from OMIM (https://omim.org/contact#) and then create the files using `python -m omim2obo` based on the instructions in the [README](https://github.com/monarch-initiative/omim) in the omim repo.


For this analysis, the working assumption is that the gene associations to add into Mondo are:
- those that have a Phenotype Mapping key value of 3 and there is only one Phenotype to Gene Relationship for the given OMIM Phenotype MIM

**OR**

- there is a digenic association


See https://omim.org/help/faq#1_6 for more details on what the Phenotype mapping key values mean and on the additional formatting ([], {}, ?) found in phenotype labels. See https://omim.org/help/faq#1_3 for information on what the Prefix values in the mimTitles file mean.


**TODO**: The working assumption needs to be confirmed.

The results of this analysis under the working assumptions above are at [OMIM Disease-Gene Issues](https://docs.google.com/document/d/1cLfBgPIZWiN5LX-E-xwSyBeFdT-vw0JuSfSM7HL3_hc/edit?usp=sharing)

### Imports

In [1]:
# Imports
import pandas as pd
import re

# Set the display option to show full column width
pd.set_option('display.max_colwidth', None)

### Read in data file

In [2]:
# Read in file. This version of morbidmap.tsv was downloaded on 29-Oct-2024
# NOTE: You will need to follow the instructions in the README to get the morbidmap file. 
# IMPORTANT: The morbidmap file must NOT be posted publicly in this repo!

df = pd.read_csv('../../data/morbidmap.tsv', sep='\t')
df.head()

Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location
0,"17,20-lyase deficiency, isolated, 202110 (3)","CYP17A1, CYP17, P450C17",609300,10q24.32
1,"17-alpha-hydroxylase/17,20-lyase deficiency, 202110 (3)","CYP17A1, CYP17, P450C17",609300,10q24.32
2,"2,4-dienoyl-CoA reductase deficiency, 616034 (3)","NADK2, C5orf33, DECRD",615787,5p13.2
3,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301,10q26.13
4,"3-M syndrome 1, 273750 (3)","CUL7, 3M1",609577,6p21.1


### Process file to parse out phenotype mim number from Phenotype column

In [3]:
# Parse out phenotype mim number from Phenotype column

# Define the regex pattern
pattern = r'(.*), (\d{6})\s*(?:\((\d+)\))?' # Regex based on existing pattern in code, https://github.com/monarch-initiative/omim/blob/main/omim2obo/parsers/omim_txt_parser.py#L328

# Use .str.extract() to apply the pattern and store matches in new columns
df[['p_label', 'p_mim', 'p_mapping_key']] = df['Phenotype'].str.extract(pattern)

df.head()

Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location,p_label,p_mim,p_mapping_key
0,"17,20-lyase deficiency, isolated, 202110 (3)","CYP17A1, CYP17, P450C17",609300,10q24.32,"17,20-lyase deficiency, isolated",202110,3
1,"17-alpha-hydroxylase/17,20-lyase deficiency, 202110 (3)","CYP17A1, CYP17, P450C17",609300,10q24.32,"17-alpha-hydroxylase/17,20-lyase deficiency",202110,3
2,"2,4-dienoyl-CoA reductase deficiency, 616034 (3)","NADK2, C5orf33, DECRD",615787,5p13.2,"2,4-dienoyl-CoA reductase deficiency",616034,3
3,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301,10q26.13,2-methylbutyrylglycinuria,610006,3
4,"3-M syndrome 1, 273750 (3)","CUL7, 3M1",609577,6p21.1,3-M syndrome 1,273750,3
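
To make the behaviour of the pattern concrete, the sketch below applies the same regex to one label taken from the output above and to one invented label that carries no phenotype MIM number. The second string is an assumption for illustration only, not a real morbidmap row. It suggests that a row without a six-digit phenotype MIM comes back entirely as missing values, which matters for the counting steps that follow.

```python
import pandas as pd

# Same pattern as in the cell above
PATTERN = r'(.*), (\d{6})\s*(?:\((\d+)\))?'

samples = pd.Series([
    "17,20-lyase deficiency, isolated, 202110 (3)",      # taken from the output above
    "Hypothetical phenotype without a MIM number (2)",   # invented for illustration
])

parsed = samples.str.extract(PATTERN)
parsed.columns = ['p_label', 'p_mim', 'p_mapping_key']

# Row 0 is split into label, phenotype MIM and mapping key.
# Row 1 has no ", <six digits>" part, so the pattern does not match
# and all three columns are missing.
print(parsed)
```

Because the leading `(.*)` is greedy, commas inside the label itself (as in "17,20-lyase deficiency, isolated") stay in `p_label`, and only the last ", " followed by six digits is treated as the phenotype MIM.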


In [4]:
# Convert type of p_mapping_key to a string

df['p_mapping_key'] = df['p_mapping_key'].astype(str)

# Check that each value is now a string
print(df['p_mapping_key'].apply(type).unique())

[<class 'str'>]
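
One caveat, stated as an assumption rather than something verified on this file: depending on the pandas version, `astype(str)` may turn a missing mapping key into the literal string `'nan'`, which then counts as a distinct value. A minimal sketch of an alternative, using toy values, keeps missing keys missing by converting to a nullable integer type instead.

```python
import numpy as np
import pandas as pd

# Toy values standing in for df['p_mapping_key']; the NaN mimics a row
# where the regex found no mapping key.
keys = pd.Series(['3', np.nan, '1'], name='p_mapping_key')

# Convert to numbers, then to pandas' nullable integer dtype so that
# missing values remain <NA> instead of becoming text.
keys_int = pd.to_numeric(keys, errors='coerce').astype('Int64')

print(keys_int.isna().sum())    # number of rows without a mapping key
print((keys_int == 3).sum())    # number of rows with mapping key 3
```

With this variant, later comparisons would use the integer `3` rather than the string `'3'`.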


### Get all rows where the p_mim value occurs only 1 time in the dataframe and has p_mapping_key='3' or rows where the p_label contains the word 'digenic'


In [5]:
# Step 1: Filter for rows where p_mim occurs only once and p_mapping_key is 3
unique_p_mim = df['p_mim'].value_counts()[df['p_mim'].value_counts() == 1].index
# print(len(unique_p_mim))

filtered_unique_df = df[(df['p_mim'].isin(unique_p_mim)) & (df['p_mapping_key'] == '3')]
# print(len(filtered_unique_df['p_mim']))
# print(filtered_unique_df.nunique())

# Step 2: Filter for rows where p_label contains the word 'digenic'
digenic_p_mim = df[df['p_label'].str.contains('digenic', case=False, na=False)]['p_mim'].unique()
# print(len(digenic_p_mim))

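# Combine the single-occurrence and digenic p_mim values.
# NOTE: filtered_unique_df (mapping key 3 only) is not used here, so p_mim_to_keep covers every p_mapping_key value.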
p_mim_to_keep = set(unique_p_mim).union(digenic_p_mim)

# Step 3: Filter the original dataframe to keep all rows for those p_mim values (p_mim_to_keep)
unique_and_pkey3_or_digenic_filtered_df = df[df['p_mim'].isin(p_mim_to_keep)]

unique_and_pkey3_or_digenic_filtered_df.head()

Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location,p_label,p_mim,p_mapping_key
2,"2,4-dienoyl-CoA reductase deficiency, 616034 (3)","NADK2, C5orf33, DECRD",615787,5p13.2,"2,4-dienoyl-CoA reductase deficiency",616034,3
3,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301,10q26.13,2-methylbutyrylglycinuria,610006,3
4,"3-M syndrome 1, 273750 (3)","CUL7, 3M1",609577,6p21.1,3-M syndrome 1,273750,3
5,"3-M syndrome 2, 612921 (3)","OBSL1, KIAA0657, 3M2",610991,2q35,3-M syndrome 2,612921,3
6,"3-M syndrome 3, 614205 (3)","CCDC8, 3M3",614145,19q13.32,3-M syndrome 3,614205,3
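
The selection logic can be sketched on a small invented table (all labels and MIM numbers below are hypothetical). The first set follows the cell above, which keeps every single-occurrence p_mim regardless of mapping key. The second set is a stricter variant that also requires mapping key 3, closer to the working assumption stated at the top of the notebook.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'p_label': ['Disease A', 'Disease B', 'Disease B, digenic',
                'Disease C', 'Disease C variant', 'Disease D'],
    'p_mim': ['100001', '100002', '100002', '100003', '100003', '100004'],
    'p_mapping_key': ['3', '3', '3', '3', '3', '2'],
})

# p_mim values that occur exactly once
counts = toy['p_mim'].value_counts()
single_p_mim = counts[counts == 1].index

# p_mim values with at least one label containing 'digenic'
is_digenic = toy['p_label'].str.contains('digenic', case=False, na=False)
digenic_p_mim = toy.loc[is_digenic, 'p_mim'].unique()

# As in the cell above: single occurrence OR digenic, any mapping key
keep_any_key = set(single_p_mim).union(digenic_p_mim)

# Stricter variant: single occurrence AND mapping key 3, OR digenic
single_key3 = toy.loc[
    toy['p_mim'].isin(single_p_mim) & (toy['p_mapping_key'] == '3'), 'p_mim'
]
keep_key3 = set(single_key3).union(digenic_p_mim)

print(sorted(keep_any_key))
print(sorted(keep_key3))
```

In this toy data the two sets differ only by '100004', the single-occurrence entry with mapping key 2.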


In [6]:
unique_and_pkey3_or_digenic_filtered_df.nunique()
# NOTE: Values in unique_and_pkey3_or_digenic_filtered_df include all p_mapping_key values. This can be filtered out later.

Phenotype                               6386
Gene/Locus And Other Related Symbols    4656
MIM Number                              4656
Cyto Location                            839
p_label                                 6383
p_mim                                   6353
p_mapping_key                              4
dtype: int64

In [7]:
# Spot check data for rows that should and should not be included in unique_and_pkey3_or_digenic_filtered_df
# 100100 - expect in df, 613659 - not expected in df, 601067 - expect in df

p_mim_list = ['100100', '613659', '601067']

# Filter the DataFrame to get rows where p_mim is in p_mim_list
rows_with_p_mim = unique_and_pkey3_or_digenic_filtered_df[unique_and_pkey3_or_digenic_filtered_df['p_mim'].isin(p_mim_list)]

rows_with_p_mim.head()

Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location,p_label,p_mim,p_mapping_key
6256,"Prune belly syndrome, 100100 (3)","CHRM3, PBS, EGBRS",118494,1q43,Prune belly syndrome,100100,3
7305,"Usher syndrome, type 1D, 601067 (3)","CDH23, USH1D, DFNB12, PITA5",605516,10q22.1,"Usher syndrome, type 1D",601067,3
7306,"Usher syndrome, type 1D/F digenic, 601067 (3)","CDH23, USH1D, DFNB12, PITA5",605516,10q22.1,"Usher syndrome, type 1D/F digenic",601067,3
7307,"Usher syndrome, type 1D/F digenic, 601067 (3)","PCDH15, DFNB23, USH1F",605514,10q21.1,"Usher syndrome, type 1D/F digenic",601067,3


### Filter out the rows where the disease is digenic (p_label contains 'digenic' for all unique p_mim values)

In [8]:
# Select all rows in unique_and_pkey3_or_digenic_filtered_df where, for a given p_mim, all of its p_labels contain the word 'digenic'.
# p_mim 601067 should therefore NOT be selected here, since only 2 of its 3 p_label values contain the word 'digenic'.

# Step 1: Identify p_mim values where all associated p_label values contain 'digenic'
all_digenic_p_mim = unique_and_pkey3_or_digenic_filtered_df.groupby('p_mim').filter(lambda x: x['p_label'].str.contains('digenic', case=False).all())['p_mim'].unique()

# Step 2: Filter the DataFrame to include only rows with these p_mim values.
# .copy() makes this an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning.
filtered_digenic_df = unique_and_pkey3_or_digenic_filtered_df[unique_and_pkey3_or_digenic_filtered_df['p_mim'].isin(all_digenic_p_mim)].copy()

# Add a new column 'p_mim_count' with the count of each p_mim occurrence
filtered_digenic_df['p_mim_count'] = filtered_digenic_df.groupby('p_mim')['p_mim'].transform('count')

# Sort by p_mim_count
filtered_digenic_df = filtered_digenic_df.sort_values(by='p_mim_count')


filtered_digenic_df.head(len(filtered_digenic_df))

# NOTE: Amongst these results, each unique p_mim should occur in exactly 2 rows, based on our understanding of digenic.
# Otherwise ask OMIM about the p_mim values that have only one row, or more than 2 rows, with a p_label that contains the word 'digenic'.


# !! QUESTION FOR OMIM: Ask OMIM about p_mim with count of 1 and p_label contains 'digenic'


Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location,p_label,p_mim,p_mapping_key,p_mim_count
274,"?Facioscapulohumeral muscular dystrophy 3, digenic, 619477 (3)","LRIF1, RIF1, FSHD3",615354,1p13.3,"?Facioscapulohumeral muscular dystrophy 3, digenic",619477,3,1
571,"?Proteasome-associated autoinflammatory syndrome 3 and digenic forms, 617591 (3)","PSMB4, PRAAS3",602177,1q21.3,?Proteasome-associated autoinflammatory syndrome 3 and digenic forms,617591,3,1
707,"AMED syndrome, digenic, 619151 (3)","ADH5, FDH, AMEDS, BMFS7",103710,4q23,"AMED syndrome, digenic",619151,3,1
1115,"Atrial standstill, digenic (GJA5/SCN5A), 108770 (3)","GJA5, CX40, ATFB11",121013,1q21.2,"Atrial standstill, digenic (GJA5/SCN5A)",108770,3,1
2772,"Dyskeratosis congenita, digenic, 620040 (3)","TYMS, TS, TMS, DKCD",188350,18p11.32,"Dyskeratosis congenita, digenic",620040,3,1
3068,"Facioscapulohumeral muscular dystrophy 2, digenic, 158901 (3)","SMCHD1, KIAA0650, BAMS",614982,18p11.32,"Facioscapulohumeral muscular dystrophy 2, digenic",158901,3,1
3069,"Facioscapulohumeral muscular dystrophy 4, digenic, 619478 (3)","DNMT3B, ICF1, FSHD4",602900,20q11.21,"Facioscapulohumeral muscular dystrophy 4, digenic",619478,3,1
6246,"Proteasome-associated autoinflammatory syndrome 1 and digenic forms, 256040 (3)","PSMB8, LMP7, RING10, JMP, NKJO, ALDD, PRAAS1",177046,6p21.32,Proteasome-associated autoinflammatory syndrome 1 and digenic forms,256040,3,1
602,"?Roifman-Chitayat syndrome, digenic, 613328 (3)","KNSTRN, C15orf23, SKAP, ROCHIS",614718,15q15.1,"?Roifman-Chitayat syndrome, digenic",613328,3,2
603,"?Roifman-Chitayat syndrome, digenic, 613328 (3)","PIK3CD, APDS, IMD14A, IMD14B, ROCHIS",602839,1p36.22,"?Roifman-Chitayat syndrome, digenic",613328,3,2
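
As a sketch of the same idea on invented data, the "every label for this p_mim is digenic" test can also be written with a grouped `transform('all')`, which yields a boolean mask aligned with the original rows. The labels and MIM numbers below are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'p_label': ['Disease X, digenic', 'Disease X, digenic',
                'Disease Y', 'Disease Y, digenic',
                'Disease Z, digenic'],
    'p_mim': ['200001', '200001', '200002', '200002', '200003'],
})

# True for each row whose label mentions 'digenic'
is_digenic = toy['p_label'].str.contains('digenic', case=False, na=False)

# True for each row whose p_mim group consists only of digenic labels
all_digenic = is_digenic.groupby(toy['p_mim']).transform('all')

# .copy() so that the new column is added to an independent DataFrame
digenic_only = toy[all_digenic].copy()
digenic_only['p_mim_count'] = digenic_only.groupby('p_mim')['p_mim'].transform('count')

print(digenic_only)
```

Here '200002' is excluded because only one of its two labels is digenic, and '200003' shows up with a count of 1, the kind of entry flagged above as a question for OMIM.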


### Create a dataframe of unique p_mim values where digenic entries (all labels for a phenotype mim contain digenic) are not included


In [9]:
# Now, let's filter out values in filtered_digenic_df from unique_and_pkey3_or_digenic_filtered_df so that the dataframe should 
# have only those unique p_mim rows where the p_label does not include 'digenic' and contains unique p_mim values.
# NOTE: We know (29-Oct-2024) that this will have some duplicate p_mim values, e.g. Usher syndrome, type 1D (601067), since not all of the 
# p_labels for p_mim 601067 contain the word 'digenic'.

# Perform a left merge with an indicator to identify rows that are only in unique_and_pkey3_or_digenic_filtered_df
merged_df = unique_and_pkey3_or_digenic_filtered_df.merge(filtered_digenic_df, on=['Phenotype', 'Gene/Locus And Other Related Symbols', 'MIM Number', 'Cyto Location', 'p_label', 'p_mim', 'p_mapping_key'], 
                      how='left', indicator=True)

# Filter out rows that appear in both DataFrames
unique_pmim_df = merged_df[merged_df['_merge'] == 'left_only'].drop(columns=['_merge'])

# Get a count of how often the p_mim occurs in the dataframe
unique_pmim_df['p_mim_count']= unique_pmim_df.groupby('p_mim')['p_mim'].transform('count')


unique_pmim_df.head(len(unique_pmim_df))

Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location,p_label,p_mim,p_mapping_key,p_mim_count
0,"2,4-dienoyl-CoA reductase deficiency, 616034 (3)","NADK2, C5orf33, DECRD",615787,5p13.2,"2,4-dienoyl-CoA reductase deficiency",616034,3,1
1,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301,10q26.13,2-methylbutyrylglycinuria,610006,3,1
2,"3-M syndrome 1, 273750 (3)","CUL7, 3M1",609577,6p21.1,3-M syndrome 1,273750,3,1
3,"3-M syndrome 2, 612921 (3)","OBSL1, KIAA0657, 3M2",610991,2q35,3-M syndrome 2,612921,3,1
4,"3-M syndrome 3, 614205 (3)","CCDC8, 3M3",614145,19q13.32,3-M syndrome 3,614205,3,1
...,...,...,...,...,...,...,...,...
6396,"{Warfarin sensitivity}, 301052 (3)","F9, HEMB, THPH8",300746,Xq27.1,{Warfarin sensitivity},301052,3,1
6397,"{West nile virus, susceptibility to}, 610379 (3)","CCR5, CMKBR5, CCCKR5, IDDM22",601373,3p21.31,"{West nile virus, susceptibility to}",610379,3,1
6398,"{Wilms tumor 6, susceptibility to}, 616806 (3)","REST, NRSF, WT6, GINGF5, HGF5, DFNA27",600571,4q12,"{Wilms tumor 6, susceptibility to}",616806,3,1
6399,"{Wilms tumor susceptibility-5}, 601583 (3)","POU6F2, WTSL, WT5",609062,7p14.1,{Wilms tumor susceptibility-5},601583,3,1
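
The merge-with-indicator step is an anti-join: keep the rows of the left DataFrame that have no match in the right one. A minimal sketch on invented data follows, together with an `isin`-based variant that should give the same rows whenever the removal is keyed on p_mim alone, as appears to be the case here.

```python
import pandas as pd

# Invented rows for illustration only
left = pd.DataFrame({
    'p_mim': ['300001', '300002', '300003'],
    'p_label': ['Disease A', 'Disease B, digenic', 'Disease C'],
})
right = left[left['p_label'].str.contains('digenic')]

# Anti-join via merge: rows present only in `left` are tagged 'left_only'
merged = left.merge(right, on=['p_mim', 'p_label'], how='left', indicator=True)
left_only = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Shorter variant: drop every row whose p_mim appears in `right`
via_isin = left[~left['p_mim'].isin(right['p_mim'])]

print(list(left_only['p_mim']))
print(list(via_isin['p_mim']))
```

One practical difference is that `merge` returns a fresh 0-based index, while the `isin` variant keeps the original row index.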


In [10]:
unique_pmim_df.nunique()

Phenotype                               6375
Gene/Locus And Other Related Symbols    4647
MIM Number                              4647
Cyto Location                            839
p_label                                 6372
p_mim                                   6342
p_mapping_key                              4
p_mim_count                                4
dtype: int64

In [11]:
# Spot check data - Let's now use unique_pmim_df to find any p_mim values that occur >1

# Find rows where the p_mim value occurs more than once
rows_with_duplicate_pmim_df = unique_pmim_df[unique_pmim_df['p_mim'].duplicated(keep=False)]

# Sort by p_mim_count ascending, then by p_mim, to group duplicates together
rows_with_duplicate_pmim_df = rows_with_duplicate_pmim_df.sort_values(by=['p_mim_count', 'p_mim']).reset_index(drop=True)

rows_with_duplicate_pmim_df.head(len(rows_with_duplicate_pmim_df))

# !! QUESTION - Ask OMIM about these entries where only one label contains 'digenic'

Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location,p_label,p_mim,p_mapping_key,p_mim_count
0,"Methylmalonic aciduria and homocystinuria, cblC type, digenic, 277400 (3)","PRDX1, PRXI, PAGA, NKEFA",176763,1p34.1,"Methylmalonic aciduria and homocystinuria, cblC type, digenic",277400,3,2
1,"Methylmalonic aciduria and homocystinuria, cblC type, 277400 (3)",MMACHC,609831,1p34.1,"Methylmalonic aciduria and homocystinuria, cblC type",277400,3,2
2,"Insulin resistance, severe, digenic, 604367 (3)","PPARG, PPARG1, PPARG2, CIMT1, GLM1",601487,3p25.2,"Insulin resistance, severe, digenic",604367,3,2
3,"Lipodystrophy, familial partial, type 3, 604367 (3)","PPARG, PPARG1, PPARG2, CIMT1, GLM1",601487,3p25.2,"Lipodystrophy, familial partial, type 3",604367,3,2
4,"Microphthalmia, isolated, with coloboma 6, 613703 (3)","GDF3, KFS3, MCOPCB6, MCOP7",606522,12p13.31,"Microphthalmia, isolated, with coloboma 6",613703,3,2
5,"Microphthalmia with coloboma 6, digenic, 613703 (3)","GDF6, MCOP4, KFS1, MCOPCB6, LCA17, SYNS4",601147,8q22.1,"Microphthalmia with coloboma 6, digenic",613703,3,2
6,"[Bombay phenotype], 616754 (3)","FUT1, H, HH",211100,19q13.33,[Bombay phenotype],616754,3,2
7,"[Bombay phenotype, digenic], 616754 (3)","FUT2, SE, B12QTL1",182100,19q13.33,"[Bombay phenotype, digenic]",616754,3,2
8,"Cardiomyopathy, familial hypertrophic, 192600 (3)","CAV3, LQT9, MPDT, RMD2",601253,3p25.3,"Cardiomyopathy, familial hypertrophic",192600,3,3
9,"Cardiomyopathy, hypertrophic, 1, 192600 (3)","MYH7, CMH1, MPD1, CMD1S, CMYO7A, CMYO7B",160760,14q11.2,"Cardiomyopathy, hypertrophic, 1",192600,3,3


In [12]:
rows_with_duplicate_pmim_df.nunique()

Phenotype                               44
Gene/Locus And Other Related Symbols    50
MIM Number                              50
Cyto Location                           46
p_label                                 42
p_mim                                   11
p_mapping_key                            2
p_mim_count                              3
dtype: int64
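
For reference, a small sketch on invented MIM numbers of how `duplicated(keep=False)` relates to the `p_mim_count` column: it flags every row of a repeated p_mim, so the flagged rows are exactly those with a count above 1, and the remaining rows are the ones kept in the next step.

```python
import pandas as pd

# Invented MIM numbers for illustration only
toy = pd.DataFrame({
    'p_mim': ['400001', '400002', '400002', '400003', '400003', '400003'],
})
toy['p_mim_count'] = toy.groupby('p_mim')['p_mim'].transform('count')

# keep=False marks all members of a duplicated group, not just the repeats
dupes = (
    toy[toy['p_mim'].duplicated(keep=False)]
    .sort_values(['p_mim_count', 'p_mim'])
    .reset_index(drop=True)
)

# Rows whose p_mim occurs exactly once
singles = toy[toy['p_mim_count'] == 1]

print(dupes)
print(singles)
```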

In [13]:
# Get rows where the value for p_mim_count is 1
unique_pmim_df = unique_pmim_df[unique_pmim_df['p_mim_count'] == 1]
print(len(unique_pmim_df))

unique_pmim_df.head()

6331


Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location,p_label,p_mim,p_mapping_key,p_mim_count
0,"2,4-dienoyl-CoA reductase deficiency, 616034 (3)","NADK2, C5orf33, DECRD",615787,5p13.2,"2,4-dienoyl-CoA reductase deficiency",616034,3,1
1,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301,10q26.13,2-methylbutyrylglycinuria,610006,3,1
2,"3-M syndrome 1, 273750 (3)","CUL7, 3M1",609577,6p21.1,3-M syndrome 1,273750,3,1
3,"3-M syndrome 2, 612921 (3)","OBSL1, KIAA0657, 3M2",610991,2q35,3-M syndrome 2,612921,3,1
4,"3-M syndrome 3, 614205 (3)","CCDC8, 3M3",614145,19q13.32,3-M syndrome 3,614205,3,1


In [14]:
unique_pmim_df.nunique()

Phenotype                               6331
Gene/Locus And Other Related Symbols    4627
MIM Number                              4627
Cyto Location                            837
p_label                                 6330
p_mim                                   6331
p_mapping_key                              4
p_mim_count                                1
dtype: int64

In [15]:
# Save to file
# unique_pmim_df.to_csv('unique_pmim_df.tsv', sep='\t', index=False)

# NOTE: there are other columns that unique_pmim_df should probably be filtered on, e.g. only those rows with mapping_key=3
# and only those rows where the p_label is not enclosed in {} or [] and is not prefixed with '?'. See https://omim.org/help/faq#1_6

# !! unique_pmim_df --> Need to filter out p_mim values that are actually for genes and not phenotypes !!
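
A minimal sketch of the additional filtering mentioned in the note above, on a few labels taken from earlier outputs plus one invented row. It assumes that a label starting with `?`, `[` or `{` is one of the marked cases described in the OMIM FAQ and that only mapping key 3 should be kept. Both are assumptions to be confirmed, not settled rules of this analysis.

```python
import pandas as pd

toy = pd.DataFrame({
    'p_label': [
        '3-M syndrome 1',
        '{Warfarin sensitivity}',
        '[Bombay phenotype]',
        '?Roifman-Chitayat syndrome, digenic',
        'Hypothetical disease',               # invented for illustration
    ],
    'p_mapping_key': ['3', '3', '3', '3', '2'],
})

# True when the label starts with one of the markers ?, [ or {
# (str.match anchors the pattern at the start of the string)
has_marker = toy['p_label'].str.match(r'[?\[{]', na=False)

# Keep unmarked labels with mapping key 3 only
strict = toy[(toy['p_mapping_key'] == '3') & ~has_marker]

print(strict)
```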

### Filter out p_mim values from unique_pmim_df where the p_mim value is a Gene MIM identifier

In [16]:
# Read in mimTitles.tsv in order to filter out p_mim values from unique_pmim_df that are actually Gene MIM identifiers
mimTitles_df = pd.read_csv('../../data/mimTitles.tsv', sep='\t')

mimTitles_df.head()

Unnamed: 0,Prefix,MIM Number,Preferred Title; symbol,Alternative Title(s); symbol(s),Included Title(s); symbols
0,,100050,"AARSKOG SYNDROME, AUTOSOMAL DOMINANT",,
1,Percent,100070,"AORTIC ANEURYSM, FAMILIAL ABDOMINAL, 1; AAA1","ANEURYSM, ABDOMINAL AORTIC; AAA;; ABDOMINAL AORTIC ANEURYSM",
2,Number Sign,100100,PRUNE BELLY SYNDROME; PBS,"ABDOMINAL MUSCLES, ABSENCE OF, WITH URINARY TRACT ABNORMALITY AND CRYPTORCHIDISM;; EAGLE-BARRETT SYNDROME; EGBRS",
3,,100200,ABDUCENS PALSY,,
4,Number Sign,100300,ADAMS-OLIVER SYNDROME 1; AOS1,"AOS;; ABSENCE DEFECT OF LIMBS, SCALP, AND SKULL;; CONGENITAL SCALP DEFECTS WITH DISTAL LIMB REDUCTION ANOMALIES;; APLASIA CUTIS CONGENITA WITH TERMINAL TRANSVERSE LIMB DEFECTS","APLASIA CUTIS CONGENITA, CONGENITAL HEART DEFECT, AND FRONTONASAL CYSTS, INCLUDED"


In [17]:
# Filter out all Gene MIM values from unique_pmim_df_copy. See https://omim.org/help/faq#1_3

# TODO: Change this to merge unique_pmim_df_copy with mimTitles_df so the mimTitles_df['Prefix'] value is in the final dataframe

# Make a copy of unique_pmim_df to work with further
unique_pmim_df_copy = unique_pmim_df.copy()
print('Length - unique_pmim_df_copy: ', len(unique_pmim_df_copy))
print('\nData types for unique_pmim_df_copy:\n', unique_pmim_df_copy.dtypes)

# Change datatype of p_mim to string in order to filter
unique_pmim_df_copy['p_mim'] = unique_pmim_df_copy['p_mim'].astype(str)

# Step 1: Get MIM Numbers in mimTitles_df where Prefix is 'Asterisk'
asterisk_mim_numbers = mimTitles_df[mimTitles_df['Prefix'] == 'Asterisk']['MIM Number'].tolist()
asterisk_mim_numbers = [str(mim) for mim in asterisk_mim_numbers]
print('\nLength asterisk_mim_numbers: ', len(asterisk_mim_numbers))

# Step 2: Filter unique_pmim_df_copy to remove rows with matching p_mim values
unique_pmim_df_copy = unique_pmim_df_copy[~unique_pmim_df_copy['p_mim'].isin(asterisk_mim_numbers)]


unique_pmim_df_copy.head(10)

Length - unique_pmim_df_copy:  6331

Data types for unique_pmim_df_copy:
 Phenotype                               object
Gene/Locus And Other Related Symbols    object
MIM Number                               int64
Cyto Location                           object
p_label                                 object
p_mim                                   object
p_mapping_key                           object
p_mim_count                              int64
dtype: object

Length asterisk_mim_numbers:  17403


Unnamed: 0,Phenotype,Gene/Locus And Other Related Symbols,MIM Number,Cyto Location,p_label,p_mim,p_mapping_key,p_mim_count
0,"2,4-dienoyl-CoA reductase deficiency, 616034 (3)","NADK2, C5orf33, DECRD",615787,5p13.2,"2,4-dienoyl-CoA reductase deficiency",616034,3,1
1,"2-methylbutyrylglycinuria, 610006 (3)","ACADSB, SBCAD",600301,10q26.13,2-methylbutyrylglycinuria,610006,3,1
2,"3-M syndrome 1, 273750 (3)","CUL7, 3M1",609577,6p21.1,3-M syndrome 1,273750,3,1
3,"3-M syndrome 2, 612921 (3)","OBSL1, KIAA0657, 3M2",610991,2q35,3-M syndrome 2,612921,3,1
4,"3-M syndrome 3, 614205 (3)","CCDC8, 3M3",614145,19q13.32,3-M syndrome 3,614205,3,1
5,"3-Methylcrotonyl-CoA carboxylase 1 deficiency, 210200 (3)","MCCC1, MCCA",609010,3q27.1,3-Methylcrotonyl-CoA carboxylase 1 deficiency,210200,3,1
6,"3-Methylcrotonyl-CoA carboxylase 2 deficiency, 210210 (3)","MCCC2, MCCB",609014,5q13.2,3-Methylcrotonyl-CoA carboxylase 2 deficiency,210210,3,1
7,"3-hydroxyacyl-CoA dehydrogenase deficiency, 231530 (3)","HADHSC, SCHAD, HHF4",601609,4q25,3-hydroxyacyl-CoA dehydrogenase deficiency,231530,3,1
8,"3-hydroxyisobutryl-CoA hydrolase deficiency, 250620 (3)",HIBCH,610690,2q32.2,3-hydroxyisobutryl-CoA hydrolase deficiency,250620,3,1
9,"3-methylglutaconic aciduria with deafness, encephalopathy, and Leigh-like syndrome, 614739 (3)","SERAC1, MEGDEL",614725,6q25.3,"3-methylglutaconic aciduria with deafness, encephalopathy, and Leigh-like syndrome",614739,3,1
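
The TODO in the cell above suggests carrying the `Prefix` value into the final DataFrame. One possible shape for that, sketched on toy data: rename the mimTitles columns before merging so they do not collide with the gene `MIM Number` column, then filter on the merged prefix. The column name `p_mim_prefix` and the MIM number 999999 are invented for this sketch.

```python
import pandas as pd

# Toy stand-ins for unique_pmim_df_copy and mimTitles_df
phenos = pd.DataFrame({
    'p_label': ['Prune belly syndrome', 'Hypothetical gene-like entry'],
    'p_mim': ['100100', '999999'],        # 999999 is invented
})
mim_titles = pd.DataFrame({
    'Prefix': ['Number Sign', 'Asterisk'],
    'MIM Number': [100100, 999999],
})

# Build a lookup keyed on p_mim as a string, matching the dtype of phenos['p_mim']
prefix_lookup = (
    mim_titles[['MIM Number', 'Prefix']]
    .rename(columns={'MIM Number': 'p_mim', 'Prefix': 'p_mim_prefix'})
    .assign(p_mim=lambda d: d['p_mim'].astype(str))
)

# Left merge keeps every phenotype row and adds its prefix where one exists
annotated = phenos.merge(prefix_lookup, on='p_mim', how='left')

# Drop entries whose p_mim is a gene entry (Prefix 'Asterisk');
# rows with no prefix are kept, as with the isin-based filter above
phenotypes_only = annotated[annotated['p_mim_prefix'] != 'Asterisk']

print(phenotypes_only)
```

Keeping the prefix in the output would make it possible to inspect the remaining entries by type afterwards without going back to mimTitles.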


In [18]:
unique_pmim_df_copy.nunique()

Phenotype                               6328
Gene/Locus And Other Related Symbols    4626
MIM Number                              4626
Cyto Location                            837
p_label                                 6327
p_mim                                   6328
p_mapping_key                              4
p_mim_count                                1
dtype: int64

In [19]:
# Save to file
unique_pmim_df_copy.to_csv('unique_pmim_df-no-gene-entries.tsv', sep='\t', index=False)