# Exploring EBI gene2pheno data for parser

## Downloading data

The latest data (generated on the fly) can be downloaded from:
* the webpage https://www.ebi.ac.uk/gene2phenotype/download: currently manual clicks only. Includes an "all data" file!
* the API at https://www.ebi.ac.uk/gene2phenotype/api/: use the `panel/{name}/download/` endpoint. Has an "all data" option! Works both programmatically and manually
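
A minimal sketch of building a download URL from the `panel/{name}/download/` endpoint pattern. `panel_download_url` is a hypothetical helper written for this notebook, not part of any G2P client library. The only panel name confirmed here is `all` (it is the one loaded later in this notebook); whether other names need encoding is an assumption.

```python
from urllib.parse import quote

# Base URL copied from the API bullet above
G2P_API_BASE = "https://www.ebi.ac.uk/gene2phenotype/api/"


def panel_download_url(panel_name: str) -> str:
    """Build the download URL for one panel (hypothetical helper)."""
    # quote() percent-encodes characters such as spaces; assumption: the API
    # accepts percent-encoded panel names
    return f"{G2P_API_BASE}panel/{quote(panel_name)}/download/"


all_data_url = panel_download_url("all")
print(all_data_url)
```

The resulting string can be passed straight to `pd.read_csv(..., dtype=str)`.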

Static release files can be downloaded from the FTP site http://ftp.ebi.ac.uk/pub/databases/gene2phenotype/ (in the G2P_data_downloads subfolder). **These releases will include the "all data" file starting in Aug 2025.** The FTP site is updated monthly, according to the [README](http://ftp.ebi.ac.uk/pub/databases/gene2phenotype/README). However, the FTP site is sometimes slow on my computer (Firefox browser).


I'm using all data (either downloading the "all data" file or all the panel/subset files). 

## Load data 

In [1]:
## import packages for exploring here

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pathlib
import pandas as pd
import glob
import numpy as np
from pprint import pprint


## for notebook viewing/debugging...
pd.options.display.max_columns = None

In [2]:
## pandas downloading file from URL: works, a little slow

df_url_all = pd.read_csv("https://www.ebi.ac.uk/gene2phenotype/api/panel/all/download/", 
                         dtype=str)

In [3]:
df_url_all.columns = df_url_all.columns.str.replace(" ", "_")

df_url_all.info(memory_usage="deep")
## the numbers are the same as df_merged and df_option_all

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3831 entries, 0 to 3830
Data columns (total 24 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   g2p_id                              3831 non-null   object
 1   gene_symbol                         3831 non-null   object
 2   gene_mim                            3828 non-null   object
 3   hgnc_id                             3831 non-null   object
 4   previous_gene_symbols               3534 non-null   object
 5   disease_name                        3831 non-null   object
 6   disease_mim                         2956 non-null   object
 7   disease_MONDO                       3320 non-null   object
 8   allelic_requirement                 3831 non-null   object
 9   cross_cutting_modifier              477 non-null    object
 10  confidence                          3831 non-null   object
 11  variant_consequence                 3812 non-null   obje

## DEFUNCT: compare subset files vs "all" link

### merging subset files

In [2]:
# ## adjust based on where files are stored
# base_file_path = pathlib.Path.home().joinpath("Desktop", "EBIgene2pheno_files", 
#                                               "From_Website", "2025-07-30")

# ## uses pathlib's Path.glob, which produces a generator. 
# ## cast into list so parser code can check if paths were actually matched or not
# all_file_paths = list(base_file_path.glob("*.csv"))

These files come from the webpage (manual click downloads) or API.

In [38]:
# ## using generator expression (think list/dict comprehension) within pd.concat to load files 1 at a time
# ## ingesting all columns as str for now
# df_merged = pd.concat((pd.read_csv(f, dtype=str) for f in all_file_paths), ignore_index=True)

# ## make column names snake-case - usable with itertuples later
# df_merged.columns = df_merged.columns.str.replace(" ", "_")

In [39]:
# ## checking for duplicates - expected, since rows appear in multiple panels

# n_duplicates_column_combo = df_merged[df_merged.duplicated(subset=["g2p_id"], keep=False)].shape

# n_duplicates_all_columns = df_merged[df_merged.duplicated(keep=False)].shape

# ## for testing
# # n_duplicates_all_columns = (1, 1)

# if n_duplicates_column_combo != n_duplicates_all_columns: 
#     raise AssertionError("The data format has changed, and the assumptions about duplicates/key columns may " \
#                           "no longer hold. Re-explore the data and adjust the parser.")
    
# print(n_duplicates_column_combo)

(2066, 23)


In [40]:
# df_merged.drop_duplicates(inplace=True, ignore_index=True)

# df_merged.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3718 entries, 0 to 3717
Data columns (total 23 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   g2p_id                              3718 non-null   object
 1   gene_symbol                         3718 non-null   object
 2   gene_mim                            3715 non-null   object
 3   hgnc_id                             3718 non-null   object
 4   previous_gene_symbols               3433 non-null   object
 5   disease_name                        3718 non-null   object
 6   disease_mim                         2949 non-null   object
 7   disease_MONDO                       2196 non-null   object
 8   allelic_requirement                 3718 non-null   object
 9   cross_cutting_modifier              464 non-null    object
 10  confidence                          3718 non-null   object
 11  variant_consequence                 3700 non-null   obje

### "All data" manual download from webpage

In [41]:
# option_all_path = pathlib.Path.home().joinpath("Desktop", "EBIgene2pheno_files", 
#                                               "From_Website", "G2P_all_2025-07-30.csv")

# df_option_all = pd.read_csv(option_all_path, dtype=str)
# df_option_all.columns = df_option_all.columns.str.replace(" ", "_")

# df_option_all.info(memory_usage="deep")
# ## the numbers are the same as df_merged

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3718 entries, 0 to 3717
Data columns (total 23 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   g2p_id                              3718 non-null   object
 1   gene_symbol                         3718 non-null   object
 2   gene_mim                            3715 non-null   object
 3   hgnc_id                             3718 non-null   object
 4   previous_gene_symbols               3433 non-null   object
 5   disease_name                        3718 non-null   object
 6   disease_mim                         2949 non-null   object
 7   disease_MONDO                       2196 non-null   object
 8   allelic_requirement                 3718 non-null   object
 9   cross_cutting_modifier              464 non-null    object
 10  confidence                          3718 non-null   object
 11  variant_consequence                 3700 non-null   obje

In [42]:
# ## does the "all" file have duplicates? - no! yay!

# df_option_all[df_option_all.duplicated(subset=["g2p_id"], keep=False)]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_support,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,review


In [46]:
# ## check if these dataframes are identical or not

# ## first sort them the same way
# df_merged.sort_values(by="g2p_id", inplace=True, ignore_index=True)
# df_option_all.sort_values(by="g2p_id", inplace=True, ignore_index=True)

# ## these are identical!!!
# df_merged.equals(df_option_all)

True

### "All data" download from API url

In [3]:
# ## pandas downloading file from URL: works, a little slow

# df_url_all = pd.read_csv("https://www.ebi.ac.uk/gene2phenotype/api/panel/all/download/", 
#                          dtype=str)

In [4]:
# df_url_all.columns = df_url_all.columns.str.replace(" ", "_")

# df_url_all.info(memory_usage="deep")
# ## the numbers are the same as df_merged and df_option_all

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3720 entries, 0 to 3719
Data columns (total 23 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   g2p_id                              3720 non-null   object
 1   gene_symbol                         3720 non-null   object
 2   gene_mim                            3717 non-null   object
 3   hgnc_id                             3720 non-null   object
 4   previous_gene_symbols               3435 non-null   object
 5   disease_name                        3720 non-null   object
 6   disease_mim                         2954 non-null   object
 7   disease_MONDO                       2224 non-null   object
 8   allelic_requirement                 3720 non-null   object
 9   cross_cutting_modifier              468 non-null    object
 10  confidence                          3720 non-null   object
 11  variant_consequence                 3702 non-null   obje

In [5]:
# ## does the "all" file have duplicates? - no! yay!

# df_url_all[df_url_all.duplicated(subset=["g2p_id"], keep=False)]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_support,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,panel,comments,date_of_last_review,review


### Check if dataframes are identical to each other

In [None]:
# ## check if sliced dataframe is identical to others or not

# ## first sort them the same way
# df_url_all.sort_values(by="g2p_id", inplace=True, ignore_index=True)

# ## these are identical!!!
# ## meaning this also doesn't have duplicates
# df_option_all.equals(df_url_all)

<div class="alert alert-block alert-success">


Conclusions:
* The "all data" file does contain all of the data, without duplicates. So it is nice to use!
* Retrieving the "all data" file through the API link is the easiest way to set up a common ingest pipeline

</div>

### original: 1 dataframe for each subset

In [4]:
# ## Construct df for each file

# cancer = pd.read_csv(base_file_path.joinpath('G2P_Cancer_2025-03-13.csv'), 
#                       dtype={"gene mim": str, "hgnc id": str, "disease mim": str})
# cardiac = pd.read_csv(base_file_path.joinpath('G2P_Cardiac_2025-03-13.csv'), 
#                       dtype={"gene mim": str, "hgnc id": str, "disease mim": str})
# developmental = pd.read_csv(base_file_path.joinpath('G2P_DD_2025-03-13.csv'), 
#                       dtype={"gene mim": str, "hgnc id": str, "disease mim": str})
# eye = pd.read_csv(base_file_path.joinpath('G2P_Eye_2025-03-13.csv'), 
#                       dtype={"gene mim": str, "hgnc id": str, "disease mim": str})
# hearing = pd.read_csv(base_file_path.joinpath('G2P_Hearing loss_2025-03-13.csv'), 
#                       dtype={"gene mim": str, "hgnc id": str, "disease mim": str})
# skeletal = pd.read_csv(base_file_path.joinpath('G2P_Skeletal_2025-03-13.csv'), 
#                       dtype={"gene mim": str, "hgnc id": str, "disease mim": str})
# skin = pd.read_csv(base_file_path.joinpath('G2P_Skin_2025-03-13.csv'), 
#                       dtype={"gene mim": str, "hgnc id": str, "disease mim": str})

# ## saving for possible future use
# # unique_filename_substring = [
# #     'Cancer',
# #     'Cardiac',
# #     'DD',
# #     'Eye',
# #     'Hearing loss',
# #     'Skeletal',
# #     'Skin'
# # ]

## Code chunks to explore column data

In [4]:
## useful function
def check_if_contains(df, column_name, patterns):
    for i in patterns:
        temp = df[df[column_name].str.contains(pat=i, na=False)]
        if temp.size > 0:
            print(f'"{i}"')
            print(temp.shape)

Reviewing individual dataframes, values within each column

In [5]:
df_url_all

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_support,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,additional_mined_publications,panel,comments,date_of_last_review,review
0,G2P00001,HMX1,142992,5017,H6; NKX5-3,HMX1-related oculoauricular syndrome,612109,MONDO:0012802,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,HP:0000007; HP:0000356; HP:0000482; HP:0000505...,18423520; 25574057; 29140751,21417677; 31469246; 35946463,DD; Eye,,2019-09-26 16:23:46+00:00,
1,G2P00002,SLX4,613278,23845,BTBD12; FANCP; KIAA1784; KIAA1987,SLX4-related Fanconi anemia,613951,MONDO:0013499,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,HP:0000007; HP:0000028; HP:0000085; HP:0000125...,21240275; 21240277,21476996; 23033263; 26841305; 30047418; 347541...,DD,,2025-01-28 23:09:54+00:00,
2,G2P00003,ARG1,608313,663,,ARG1-related argininemia,207800,MONDO:0008814,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,HP:0000007; HP:0000737; HP:0000752; HP:0001249...,10502833; 1463019; 1598908; 2365823; 7649538,15565656; 19052914; 19562505; 19936428; 213103...,DD,,2015-07-22 16:14:07+00:00,
3,G2P00004,ATR,601215,882,FRP1; MEC1; SCKL; SCKL1,ATR-related Seckel syndrome,210600,MONDO:0008869,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,,HP:0000007; HP:0000028; HP:0000047; HP:0000175...,10889046; 12640452; 23111928; 30199583,10631133; 15279811; 21669506; 23144622; 272354...,DD; Skeletal,,2025-01-27 14:24:27+00:00,
4,G2P00005,FANCB,300515,3583,FAAP95; FAB; FLJ34064,FANCB-related Fanconi anemia,300514,MONDO:0010351,monoallelic_X_hemizygous,,definitive,absent gene product,,loss of function,inferred,,,HP:0000083; HP:0000100; HP:0000119; HP:0000707...,15502827; 16679491; 21910217; 22052692; 236135...,24416387; 31351673,DD; Skin,,2024-08-20 14:13:58+00:00,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3826,G2P01393,CABP2,607314,1385,DFNB93,CABP2-related deafness,614899,MONDO:0013963,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,,HP:0000007,22981119,,Ear,,2025-04-28 09:36:23+00:00,
3827,G2P01854,PAX3,606597,8617,HUP2; PAX-3; WS1,PAX3-related craniofacial-deafness-hand syndrome,122880,MONDO:0007395,monoallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,HP:0000006; HP:0000272; HP:0000316; HP:0000327...,6859126,,Ear,,2025-04-28 10:19:36+00:00,
3828,G2P01857,SIX1,601205,10887,DFNA23,SIX1-related deafness,605192,MONDO:0011519,monoallelic_autosomal,,definitive,altered gene product structure,inframe_deletion; inframe_insertion; missense_...,undetermined,inferred,,,HP:0000006; HP:0000405; HP:0000407,10777717,,Ear,,2025-04-28 10:21:02+00:00,
3829,G2P01852,SOX10,602229,11190,DOM; SOX-10; WS2E; WS4,SOX10-related Kallmann syndrome with deafness,,,monoallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,,23643381,,Ear,,2025-04-28 10:21:28+00:00,


In [6]:
df_url_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3831 entries, 0 to 3830
Data columns (total 24 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   g2p_id                              3831 non-null   object
 1   gene_symbol                         3831 non-null   object
 2   gene_mim                            3828 non-null   object
 3   hgnc_id                             3831 non-null   object
 4   previous_gene_symbols               3534 non-null   object
 5   disease_name                        3831 non-null   object
 6   disease_mim                         2956 non-null   object
 7   disease_MONDO                       3320 non-null   object
 8   allelic_requirement                 3831 non-null   object
 9   cross_cutting_modifier              477 non-null    object
 10  confidence                          3831 non-null   object
 11  variant_consequence                 3812 non-null   obje

In [12]:
## regularly used to review

# ## all values in MONDO column have correct prefix
# df_url_all[~ df_url_all["disease_MONDO"].str.contains("MONDO:", na=True)]

df_url_all["allelic_requirement"].value_counts()
# df_url_all[df_url_all["comments"].notna()]
# df_url_all["comments"].unique()[0:20]

df_url_all[df_url_all["review"].notna()]

allelic_requirement
biallelic_autosomal           2002
monoallelic_autosomal         1585
monoallelic_X_hemizygous       176
monoallelic_X_heterozygous      59
mitochondrial                    7
monoallelic_Y_hemizygous         1
monoallelic_X                    1
Name: count, dtype: int64

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_support,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,additional_mined_publications,panel,comments,date_of_last_review,review


In [19]:
## check if disease OMIM ID column has non-numeric values (aka other IDs)
df_url_all[df_url_all["disease_mim"].str.contains(r'\D', na=False)]

## disease_name
## checked : (does it have IDs: no), ; (does it have lists - a little but not going to split)
# df_url_all[df_url_all["disease_name"].str.contains(";", na=False)]

## rows with no disease ID
# df_url_all[df_url_all["disease_mim"].isna() & 
#            df_url_all["disease_MONDO"].isna()]


## publications column doesn't have prefixes
check_if_contains(df_url_all, "publications", ["PMID", ":", "_"])


## digging into new column molecular_mechanism_support
# df_url_all["molecular_mechanism_support"].value_counts()
# df_url_all[df_url_all["molecular_mechanism_support"] == "evidence"]

## digging into changed column molecular_mechanism_categorisation
# df_url_all["molecular_mechanism_categorisation"].value_counts()
# df_url_all[df_url_all["molecular_mechanism_categorisation"].notna()]

## digging into molecular_mechanism_evidence
# df_url_all["molecular_mechanism_evidence"].nunique()
# df_url_all["molecular_mechanism_evidence"].unique()[0:20]
# df_url_all[df_url_all["molecular_mechanism_evidence"].str.contains("39315527", na=False)]

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_support,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,additional_mined_publications,panel,comments,date_of_last_review,review


### DEFUNCT - Old review chunks

In [5]:
## checking that each file's columns are the same: they are
# cancer.columns == skin.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [8]:
# ## was looking for duplicate records within 1 dataframe?

# skin[skin.duplicated(subset=['g2p id'], keep=False)].sort_values(by=['gene symbol', 'disease name'])

# skin[skin.duplicated(subset=['gene symbol', 'disease name', 'allelic requirement', 'molecular mechanism'], keep=False)].sort_values(by=['gene symbol', 'disease name'])

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review


Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review


In [353]:
# hearing

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review
0,G2P03582,MYO6,600970,7605,DFNA22; DFNB37; KIAA0389,MYO6-related nonsyndromic genetic hearing loss,,,biallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,splice_donor_variant; splice_acceptor_variant;...,undetermined,inferred,,,18348273; 23485424; 25999546; 12687499; 24105371,Hearing loss,,2024-11-28 14:52:17+00:00
1,G2P03583,MYO6,600970,7605,DFNA22; DFNB37; KIAA0389,MYO6-related nonsyndromic genetic hearing loss,,,monoallelic_autosomal,,definitive,decreased gene product level; altered gene pro...,splice_donor_variant; splice_acceptor_variant;...,undetermined,inferred,,,18348273; 23485424; 25999546; 24105371,Hearing loss,,2024-11-28 14:47:17+00:00


## Notes on parsing data to create documents 

In [16]:
df_url_all.head()

Unnamed: 0,g2p_id,gene_symbol,gene_mim,hgnc_id,previous_gene_symbols,disease_name,disease_mim,disease_MONDO,allelic_requirement,cross_cutting_modifier,confidence,variant_consequence,variant_types,molecular_mechanism,molecular_mechanism_support,molecular_mechanism_categorisation,molecular_mechanism_evidence,phenotypes,publications,additional_mined_publications,panel,comments,date_of_last_review,review
0,G2P00001,HMX1,142992,5017,H6; NKX5-3,HMX1-related oculoauricular syndrome,612109,MONDO:0012802,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,HP:0000007; HP:0000356; HP:0000482; HP:0000505...,18423520; 25574057; 29140751,21417677; 31469246; 35946463,DD; Eye,,2019-09-26 16:23:46+00:00,
1,G2P00002,SLX4,613278,23845,BTBD12; FANCP; KIAA1784; KIAA1987,SLX4-related Fanconi anemia,613951,MONDO:0013499,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,HP:0000007; HP:0000028; HP:0000085; HP:0000125...,21240275; 21240277,21476996; 23033263; 26841305; 30047418; 347541...,DD,,2025-01-28 23:09:54+00:00,
2,G2P00003,ARG1,608313,663,,ARG1-related argininemia,207800,MONDO:0008814,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,,HP:0000007; HP:0000737; HP:0000752; HP:0001249...,10502833; 1463019; 1598908; 2365823; 7649538,15565656; 19052914; 19562505; 19936428; 213103...,DD,,2015-07-22 16:14:07+00:00,
3,G2P00004,ATR,601215,882,FRP1; MEC1; SCKL; SCKL1,ATR-related Seckel syndrome,210600,MONDO:0008869,biallelic_autosomal,,strong,absent gene product,,loss of function,inferred,,,HP:0000007; HP:0000028; HP:0000047; HP:0000175...,10889046; 12640452; 23111928; 30199583,10631133; 15279811; 21669506; 23144622; 272354...,DD; Skeletal,,2025-01-27 14:24:27+00:00,
4,G2P00005,FANCB,300515,3583,FAAP95; FAB; FLJ34064,FANCB-related Fanconi anemia,300514,MONDO:0010351,monoallelic_X_hemizygous,,definitive,absent gene product,,loss of function,inferred,,,HP:0000083; HP:0000100; HP:0000119; HP:0000707...,15502827; 16679491; 21910217; 22052692; 236135...,24416387; 31351673,DD; Skin,,2024-08-20 14:13:58+00:00,


### Gene subject section

**"gene_symbol"**: str. "HGNC-assigned gene symbol" (according to Data_download_format txt file)

**"gene_mim"** (a few NA): OMIM ID for gene. NodeNorm does recognize these. => force dtype to be str (easier to use if need be) 

**"hgnc_id"**: no NA so easier to use => force dtype to be str (easier to use, add prefix)

**"previous_gene_symbols"**: "; "-delimited. 
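
A rough sketch of how the gene columns could be turned into prefixed IDs and lists. The helper names are hypothetical, and the prefixes `HGNC` and `OMIM` are my assumption (the notes only say "add prefix"). The example values are from the SLX4 row shown above.

```python
def add_prefix(value, prefix):
    """Return 'prefix:value', or None when the value is missing."""
    # value != value is True only for NaN (how pandas marks missing strings)
    if value is None or value != value:
        return None
    return f"{prefix}:{value}"


def split_delimited(value, delimiter="; "):
    """Split a delimited string into a list; missing values give an empty list."""
    if value is None or value != value:
        return []
    return [part.strip() for part in value.split(delimiter) if part.strip()]


# Values copied from the SLX4 row in the dataframe preview
print(add_prefix("23845", "HGNC"))    # assumed prefix
print(add_prefix("613278", "OMIM"))   # assumed prefix
print(split_delimited("BTBD12; FANCP; KIAA1784; KIAA1987"))
```

Reading every column as `str` matters here: a numeric dtype would turn an ID such as `23845` into `23845.0` once the column contains NA.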

### Disease object section

**"disease_name"**: I found cases where name seemed to be "; "-delimited

**"disease_mim"** (some NA): OMIM ID for disease ("or highly similiar" according to Data_download_format txt file). (UPDATE 2025-12: no longer find orphanet IDs mixed in (raw data prefix is "Orphanet:")) => force dtype to be str (easier to use, add prefix). 

**"disease_MONDO"** (some NA): MONDO ID for disease ("or highly similiar" according to Data_download_format txt file). Already Translator-formatted CURIE

### Association section

**Format updates since notebook originally written:**

**Nov 2025**

Added new column "additional mined publications"

**Aug 2025**

added/augmented notes for "variant types", "molecular mechanism evidence", "date of last review"

analyzed format changes, took notes on:
- old "molecular_mechanism_categorisation" column -> now "molecular_mechanism_support"
- new "molecular_mechanism_categorisation" content
- new "review" column

Filename change: DataDownloadFormat => Data_download_format 

---

**Content**

**"g2p_id"**: resource's unique stable ID for each record/row (according to Data_download_format txt file, G2P webpage for single record) => use to generate links to edge info. Current format: https://www.ebi.ac.uk/gene2phenotype/lgd/G2P03386
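
A small sketch of generating the record link from a `g2p_id`, assuming the URL format above stays stable. `g2p_record_url` is a hypothetical helper name.

```python
# URL stem taken from the "Current format" example above
G2P_RECORD_BASE = "https://www.ebi.ac.uk/gene2phenotype/lgd/"


def g2p_record_url(g2p_id: str) -> str:
    """Build the single-record webpage link for one G2P ID."""
    return f"{G2P_RECORD_BASE}{g2p_id}"


print(g2p_record_url("G2P03386"))
```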


**"allelic_requirement"**: required genotype (according to G2P webpage for single record). str (categorical). [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) says these are synonyms of HPO "mode of inheritance terms", which seem to be from the [inheritance part of the ontology](https://hpo.jax.org/browse/term/HP:0000005). I've included a mapping table in reference section.


**"cross_cutting_modifier"** (lots of NA!): additional info relevant to gene-disease inheritance (according to G2P webpage for single record). Can be "; "-delimited, otherwise would be categorical. [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) says these are HPO inheritance qualifier terms "when available" ([this part of ontology](https://hpo.jax.org/browse/term/HP:0034335)). I didn't try mapping these.


**"confidence"**: confidence that the association is real (according to G2P webpage for panel). [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) has definitions of terms. str (categorical)


**"variant_consequence"** (some NA): consequences of reported variants on product (protein or RNA) (according to G2P webpage for single record). Can be "; "-delimited, otherwise would be categorical. [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) has definitions of terms and SO mappings. 


**"variant_types"** (lots of NA!): associated with gene-disease pair (according to G2P webpage for single record). Can be "; "-delimited, otherwise would be categorical. [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) has definitions of terms and SO mappings (large table!). **>100 UNIQUE VALUES (SINGLE, MIX OF TERMS). I don't think using this as a qualifier is a good idea - it's confusing**


**"molecular_mechanism"**: mechanism of how the gene's variants cause disease (according to G2P webpage for single record). str (categorical). [G2P website](https://www.ebi.ac.uk/gene2phenotype/about/terminology) has definitions of terms


**"molecular_mechanism_support"** (new Aug 2025): seems to say how the molecular mechanism was decided. **Doesn't seem to show up on single-record webpage** (compared rows with value "evidence" / no "molecular_mechanism_categorisation" value to their webpages; the webpages don't have the string "evidence"). Possible values: "inferred" or "evidence". str (categorical). 


**"molecular_mechanism_categorisation"** (new Aug 2025; **VERY SPARSE**): more detailed mechanism info, structured (displayed in single-record webpage as a table). 
- key: term from the ["molecular mechanism synopsis" section of terminology page](https://www.ebi.ac.uk/gene2phenotype/about/terminology#mechanism-synopsis-section). Separator between key and value is ":" 
- value: either 'evidence' or 'inferred' (no missing values, usually "evidence"). 
- Can have multiple key-value pairs ("; "-delimited). Each key-value pair becomes one row in molecular mechanism -> categorisation section of single-record webpage. 
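
One possible way to parse a "molecular_mechanism_categorisation" value into key-value pairs, following the structure described in the bullets above. `parse_categorisation` is a hypothetical helper, and the example string uses placeholder keys rather than real terminology terms.

```python
def parse_categorisation(value):
    """Turn 'key: support; key: support' into a {key: support} dict."""
    if not isinstance(value, str):  # NaN / None -> nothing to parse
        return {}
    pairs = {}
    for item in value.split("; "):
        # Split on the LAST ':' because the support value ('evidence' or
        # 'inferred') never contains one, while a key conceivably could
        key, sep, support = item.rpartition(":")
        if sep:
            pairs[key.strip()] = support.strip()
    return pairs


# Placeholder keys, NOT real G2P terms
example = "placeholder term one: evidence; placeholder term two: inferred"
print(parse_categorisation(example))
```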


**"molecular_mechanism_evidence"** (**VERY SPARSE**): "types of evidence available to support reported mechanism", according to the Data_download_format txt file. Complicated structure; displayed in single-record webpages as a table.
- `publication_ID -> functional studies`. In the functional studies section:
  - key: seems to be from a limited vocab ("function", "functional_alteration", "models", "rescue"). Separator between key and value is ": ".
  - value: perhaps from limited vocab (specific to each key)
  - can have multiple key-value pairs ("; "-delimited). Become different bullet points in functional studies column in single-record webpage table
- can have info from multiple publications (" & "-delimited). Become different rows in single-record webpage table
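
A sketch of parsing the nested "molecular_mechanism_evidence" structure described above. It assumes the separators appear literally as `" & "`, `" -> "`, `"; "` and `": "`; the arrow separator in particular is my reading of the notes and should be checked against real values. The publication IDs and values in the example are made up.

```python
def parse_mechanism_evidence(value, arrow=" -> "):
    """Return {publication_id: [(key, value), ...]} for one cell."""
    if not isinstance(value, str):  # NaN / None -> nothing to parse
        return {}
    parsed = {}
    for block in value.split(" & "):          # one block per publication
        publication, sep, studies = block.partition(arrow)
        if not sep:
            continue                          # block doesn't match the assumed layout
        pairs = []
        for item in studies.split("; "):      # one item per functional study
            key, colon, detail = item.partition(": ")
            if colon:
                pairs.append((key.strip(), detail.strip()))
        parsed[publication.strip()] = pairs
    return parsed


# Invented example: IDs and values are placeholders
example = "11111111 -> function: value_a; rescue: value_b & 22222222 -> models: value_c"
print(parse_mechanism_evidence(example))
```

A list of tuples is used instead of a dict so that a repeated key within one publication would not be silently dropped.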


**"phenotypes"** (some NA): HPO IDs (already Translator-formatted CURIE) for phenotypes reported by the publications (according to G2P webpage for single record). Can be "; "-delimited
- based on G2P website, these are organized/assigned to specific gene-disease records...so I'm keeping as a part of the gene-disease association, rather than creating diff objects for gene-phenotype (if I did this, it would be a separate function)


**"publications"** (some NA): reviewed to create this row/association (according to Data_download_format file). These are PMIDs with no prefix (based on G2P webpage for single record). Can be "; "-delimited


**"additional_mined_publications"** (some NA): mined from PubMed but not reviewed yet (according to Data_download_format file). These are PMIDs with no prefix (based on G2P webpage for single record). Can be "; "-delimited


**"panel"**: the G2P panels this record is assigned to (according to Data_download_format txt file). Can be "; "-delimited, otherwise would be categorical.
  - map (str.replace) values to match G2P webpage for single record
    - "DD" = "Developmental disorders"
    - add " disorders" to end: "Cancer", "Cardiac", "Eye", "Skeletal", "Skin"
    - keep as-is: "Hearing loss"
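
A minimal sketch of the panel mapping listed above, using an exact-match dict instead of `str.replace`. `PANEL_LABELS` and `map_panels` are hypothetical names. Panel names not in the dict are passed through unchanged.

```python
# Mapping taken from the bullets above
PANEL_LABELS = {
    "DD": "Developmental disorders",
    "Cancer": "Cancer disorders",
    "Cardiac": "Cardiac disorders",
    "Eye": "Eye disorders",
    "Skeletal": "Skeletal disorders",
    "Skin": "Skin disorders",
    "Hearing loss": "Hearing loss",
}


def map_panels(value):
    """Split a '; '-delimited panel string and map each name to its label."""
    if not isinstance(value, str):  # NaN / None -> no panels
        return []
    # .get(name, name) leaves unknown panel names as they are
    return [PANEL_LABELS.get(name, name) for name in value.split("; ")]


print(map_panels("DD; Eye"))
```

Exact matching avoids accidental partial replacements, at the cost of having to update the dict when a new panel appears.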
    
    
**"comments"** (**VERY SPARSE**): additional comments from curation team (according to Data_download_format txt file). Appears to be free-text (sometimes short, sometimes very long) => strip whitespace


**"date_of_last_review"**: date the record was modified or reviewed (according to Data_download_format txt file). In single-record webpage, this is called "Last Updated On". 

**"review"** (new Aug 2025): indicates when record is under review (according to Data_download_format txt file). Don't know what the possible values are (boolean?) because I've only seen the column be all NA so far.

---

__Allelic requirement terms__

Mapping issues in HP (**bolded**) fixed: https://github.com/obophenotype/human-phenotype-ontology/issues/11243. Should match [2024 GenCC paper](https://www.gimjournal.org/article/S1098-3600(23)01045-6/fulltext) on harmonizing terminology


| G2P | HPO name | HPO ID | Exact synonym? | 
| :- | :- | :- | :- |
| biallelic_autosomal | Autosomal recessive inheritance | HP:0000007 | Yes |
| monoallelic_autosomal | Autosomal dominant inheritance | HP:0000006 | Yes |
| biallelic_PAR | Pseudoautosomal recessive inheritance | HP:0034341 | Yes |
| monoallelic_PAR | Pseudoautosomal dominant inheritance | HP:0034340 | Yes |
| mitochondrial | Mitochondrial inheritance | HP:0001427 | Yes |
| monoallelic_Y_hemizygous | Y-linked inheritance | HP:0001450 | Yes |
| **monoallelic_X** | X-linked inheritance | HP:0001417 | Yes |
| **monoallelic_X_hemizygous** | X-linked recessive inheritance | HP:0001419 | Yes |
| **monoallelic_X_heterozygous** | X-linked dominant inheritance | HP:0001423 | Yes |
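
The table above could be kept as a static dict, roughly like this. `ALLELIC_REQUIREMENT_TO_HPO` is a hypothetical name; the contents are copied from the table.

```python
# G2P allelic requirement -> (HPO ID, HPO name), copied from the table above
ALLELIC_REQUIREMENT_TO_HPO = {
    "biallelic_autosomal": ("HP:0000007", "Autosomal recessive inheritance"),
    "monoallelic_autosomal": ("HP:0000006", "Autosomal dominant inheritance"),
    "biallelic_PAR": ("HP:0034341", "Pseudoautosomal recessive inheritance"),
    "monoallelic_PAR": ("HP:0034340", "Pseudoautosomal dominant inheritance"),
    "mitochondrial": ("HP:0001427", "Mitochondrial inheritance"),
    "monoallelic_Y_hemizygous": ("HP:0001450", "Y-linked inheritance"),
    "monoallelic_X": ("HP:0001417", "X-linked inheritance"),
    "monoallelic_X_hemizygous": ("HP:0001419", "X-linked recessive inheritance"),
    "monoallelic_X_heterozygous": ("HP:0001423", "X-linked dominant inheritance"),
}

hpo_id, hpo_name = ALLELIC_REQUIREMENT_TO_HPO["monoallelic_X_hemizygous"]
print(hpo_id, hpo_name)
```

Indexing with `[...]` rather than `.get()` raises a `KeyError` on an unmapped value, which would flag a new G2P term early.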


---

__Would involve making new columns/fields__

Would do value-mapping using static dicts? These would be brittle and take work to maintain.

* "allelic requirement": create new columns for HPO mapping IDs/labels (reference above)
* "variant consequence": create new column(s) for SO mapping (value is ID or labels, or a list of dict for ID/label pairs)
* "variant types": create new column(s) for SO mapping
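
A minimal sketch of the static-dict idea for "allelic requirement", using only the HPO terms from the table above. The names `ALLELIC_REQ_TO_HPO` and `map_allelic_requirement` are hypothetical, and returning `None` for unmapped values is an assumption rather than a settled design.

```python
# Hypothetical static mapping, copied from the allelic requirement table above.
ALLELIC_REQ_TO_HPO = {
    "biallelic_autosomal": ("HP:0000007", "Autosomal recessive inheritance"),
    "monoallelic_autosomal": ("HP:0000006", "Autosomal dominant inheritance"),
    "biallelic_PAR": ("HP:0034341", "Pseudoautosomal recessive inheritance"),
    "monoallelic_PAR": ("HP:0034340", "Pseudoautosomal dominant inheritance"),
    "mitochondrial": ("HP:0001427", "Mitochondrial inheritance"),
    "monoallelic_Y_hemizygous": ("HP:0001450", "Y-linked inheritance"),
    "monoallelic_X": ("HP:0001417", "X-linked inheritance"),
    "monoallelic_X_hemizygous": ("HP:0001419", "X-linked recessive inheritance"),
    "monoallelic_X_heterozygous": ("HP:0001423", "X-linked dominant inheritance"),
}


def map_allelic_requirement(value):
    """Return {"id", "label"} for a G2P allelic requirement, or None if unmapped."""
    if not isinstance(value, str):
        return None  # assumption: missing values (NaN/None) stay unmapped
    hit = ALLELIC_REQ_TO_HPO.get(value.strip())
    if hit is None:
        return None  # assumption: unknown terms are surfaced as None, not an error
    return {"id": hit[0], "label": hit[1]}
```

Returning `None` for unknown values makes new G2P terms easy to spot during a parser run, which is one way to soften the maintenance concern.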

## DEFUNCT - Issues merging data into 1 dataframe

If I load all the data into 1 dataframe, there will likely be duplicate rows because a gene-disease pair can be in several panels (i.e., the disease falls into multiple categories). One sign of this is `panel` being a "; "-delimited string.

Dropping duplicates will be needed.

However, what if other column values differ between files (like "; "-delimited strings having a diff order of elements)? Especially wondering about `panel`. 

Exploring this:
* Add all dataframes together into a big dataframe
* how many duplicates are there based on the columns that matter for unique row: 'g2p id'
  * other important set: 'gene symbol', 'disease name', 'allelic requirement', 'molecular mechanism'
* VS how many duplicates are there considering all columns? 

If the duplicate numbers differ between the two, then some other column's values are contributing to "unique" rows between diff files. (probably the all-columns duplicate count < the subset duplicate count)
* If the problem is "; "-delimited string order, then can reorder them (split, sort, join. Can you do this column-wise rather than iterating?). Then can look for duplicate rows.
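
A possible split/sort/join sketch, applied to a whole column at once. The helper name `sort_delimited` and the toy values are hypothetical. `Series.map` still loops in Python under the hood, so this avoids an explicit row loop but is not truly vectorized.

```python
import pandas as pd


def sort_delimited(series, sep="; "):
    """Sort the elements of each delimited string; missing values pass through."""
    return series.map(
        lambda v: sep.join(sorted(v.split(sep))) if isinstance(v, str) else v
    )


# Toy data: same panels in a different order, plus a missing value.
toy = pd.DataFrame({"panel": ["DD; Eye; Skin", "Skin; DD; Eye", None]})
toy["panel"] = sort_delimited(toy["panel"])
```

After this step, rows that differed only in element order become identical strings and can be caught by a plain duplicate check.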

If the numbers are the same, then we're good! No other columns matter for unique row, and the duplicates can be dropped (based on all columns). 
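
The comparison described above can be sketched on toy data. The IDs and panel values below are made up for illustration and are not real G2P records.

```python
import pandas as pd

# Toy stand-in for the concatenated panel files (made-up values).
toy = pd.DataFrame({
    "g2p id": ["G2P00001", "G2P00001", "G2P00002", "G2P00002", "G2P00003"],
    "panel": ["DD; Eye", "DD; Eye", "DD; Skin", "Skin; DD", "Eye"],
})

# Rows that share a 'g2p id' with at least one other row.
n_dup_by_id = int(toy.duplicated(subset=["g2p id"], keep=False).sum())

# Rows that are identical to another row across all columns.
n_dup_all_cols = int(toy.duplicated(keep=False).sum())

# If the counts differ, some other column (here 'panel' order) is making
# otherwise-duplicate records look unique.
counts_match = n_dup_by_id == n_dup_all_cols
```

In this toy case the counts differ (4 vs 2) because one record has its `panel` elements in a different order, which is exactly the situation the check is meant to detect.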

In [None]:
# everything = pd.concat(objs=[cancer, cardiac, developmental, eye, hearing, skeletal, skin], ignore_index=True)

In [10]:
# everything.shape

# cancer.shape[0] + cardiac.shape[0] + \
# developmental.shape[0] + eye.shape[0] + \
# hearing.shape[0] + skeletal.shape[0] + \
# skin.shape[0]

(4721, 21)

4721

In [11]:
# everything[everything.duplicated(subset=['g2p id'], keep=False)].sort_values(by=['gene symbol', 'disease name']).shape


# everything[everything.duplicated(subset=['gene symbol', 'disease name', 'allelic requirement', 'molecular mechanism'], keep=False)].sort_values(by=['gene symbol', 'disease name']).shape

# everything[everything.duplicated(keep=False)].shape

(1935, 21)

(1935, 21)

(1935, 21)

In [12]:
# everything[everything["g2p id"] == "G2P00650"]

Unnamed: 0,g2p id,gene symbol,gene mim,hgnc id,previous gene symbols,disease name,disease mim,disease MONDO,allelic requirement,cross cutting modifier,confidence,variant consequence,variant types,molecular mechanism,molecular mechanism categorisation,molecular mechanism evidence,phenotypes,publications,panel,comments,date of last review
839,G2P00650,AAAS,605378,13666,,AAAS-related chalasia-addisonianism-alacrima s...,231550,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0002459; HP:0011463; HP:0003487; HP:0001278...,11062474; 11701718; 18628786; 11159947; 15173230,DD; Eye; Skin,,2025-01-27 14:40:58+00:00
3104,G2P00650,AAAS,605378,13666,,AAAS-related chalasia-addisonianism-alacrima s...,231550,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0002459; HP:0011463; HP:0003487; HP:0001278...,11062474; 11701718; 18628786; 11159947; 15173230,DD; Eye; Skin,,2025-01-27 14:40:58+00:00
4198,G2P00650,AAAS,605378,13666,,AAAS-related chalasia-addisonianism-alacrima s...,231550,,biallelic_autosomal,,definitive,absent gene product,,loss of function,inferred,,HP:0002459; HP:0011463; HP:0003487; HP:0001278...,11062474; 11701718; 18628786; 11159947; 15173230,DD; Eye; Skin,,2025-01-27 14:40:58+00:00


So...the numbers are the same! This means duplicate records/rows from diff files are completely identical, and can be dropped easily (based on all columns)

## Other parser code ideas

* each row => turn into 1 association object
  * document/object MUST have `_id` key (primary ID for BioThings database - Elasticsearch / MongoDB?) 
    * use 'g2p id' for `_id`
    * unique 'gene symbol', 'disease name', 'allelic requirement', 'molecular mechanism' combo
    * "G2P records are Locus-Genotype-Mechanism-Disease-Evidence (LGMDE) threads describing gene-disease associations" -> this is what a unique row is from resource POV (ref: homepage table "Total LGMDE Records" info (i) button)
* sometimes disease doesn't have any IDs for it => still make document/object but won't use in Translator unless it has ID
  * all hearing records (2) have this problem.
* str split method will create single-element arrays if no delimiter found
* parser feeds into upload step -> function should be generator to iterate through all stuff -> for each, yield the document(s) as dict/json
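
A rough sketch of the generator idea from the list above, assuming the data is already in a dataframe of strings. The function name `iter_documents` is hypothetical, only `panel` is split here as an example, and dropping missing values is an assumption about what the documents should look like.

```python
import pandas as pd


def iter_documents(df):
    """Yield one dict per row, using 'g2p id' as the document `_id`."""
    for row in df.to_dict(orient="records"):
        # Assumption: drop missing values so documents only carry populated fields.
        doc = {key.replace(" ", "_"): val for key, val in row.items() if pd.notna(val)}
        doc["_id"] = doc.pop("g2p_id")
        # "; "-delimited string -> list; no delimiter gives a one-element list.
        if "panel" in doc:
            doc["panel"] = doc["panel"].split("; ")
        yield doc


# Toy row based on the G2P00650 record shown earlier (most columns omitted).
toy = pd.DataFrame({
    "g2p id": ["G2P00650"],
    "gene symbol": ["AAAS"],
    "panel": ["DD; Eye; Skin"],
    "comments": [None],
})
docs = list(iter_documents(toy))
```

Because it is a generator, the upload step can consume documents one at a time without holding every dict in memory.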

Confidence is important to filter on for Translator! => will still make the document/object for low-confidence records, but they shouldn't be used in Translator!
* don't use "limited", "disputed", "refuted": Translator doesn't handle negation ("refuted") or low/conflicting evidence ("disputed" or "limited")
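
One way this filter could look downstream of the parser. The names `EXCLUDED_CONFIDENCE` and `usable_in_translator` are hypothetical, and treating a document with no confidence value as usable is an assumption.

```python
# Confidence values listed above as not usable in Translator.
EXCLUDED_CONFIDENCE = {"limited", "disputed", "refuted"}


def usable_in_translator(doc):
    """True when the document's confidence is not in the excluded set."""
    # Assumption: a missing "confidence" key counts as usable.
    return doc.get("confidence") not in EXCLUDED_CONFIDENCE
```

Keeping the check separate from the parser means every record still becomes a document, and only the Translator-facing step decides what to skip.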