About 34 genes are missing from a table in AGORA. That table is built from the gene_info dataset. The genes reportedly exist in the harmonized_targets dataset but are absent from gene_info.

Steps:
- verify the genes exist in the harmonized_targets dataset
- check for the presence of these genes in the gene_info dataset

*Because this is a current problem in AGORA, we need to look at the live data.
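
The two steps above amount to an anti-join: keep the rows of one table that have no partner in the other. Here is a minimal sketch on toy data, using pandas `merge` with `indicator=True`. The frames, the gene names, and the helper `missing_from` are hypothetical and exist only for illustration; the real datasets are loaded from Synapse in the cells that follow.

```python
import pandas as pd

# Hypothetical toy data standing in for harmonized_targets and gene_info.
harmonized_toy = pd.DataFrame({
    "hgnc_symbol": ["GENE_A", "GENE_B", "GENE_C"],
    "ensembl_gene_id": ["ID_A", "ID_B", "ID_C"],
})
gene_info_toy = pd.DataFrame({
    "hgnc_symbol": ["GENE_A", "GENE_B", "GENE_D"],
    "ensembl_gene_id": ["ID_A", "ID_B", "ID_D"],
})

def missing_from(left, right, keys):
    """Return rows of `left` whose `keys` have no match in `right`."""
    merged = left.merge(
        right[keys].drop_duplicates(),  # avoid row fan-out from duplicate keys
        on=keys,
        how="left",
        indicator=True,                 # adds a `_merge` column
    )
    unmatched = merged["_merge"] == "left_only"
    return merged.loc[unmatched, list(left.columns)].reset_index(drop=True)

missing = missing_from(harmonized_toy, gene_info_toy,
                       ["hgnc_symbol", "ensembl_gene_id"])
print(missing)
```

Compared with comparing sets of concatenated strings, this keeps the unmatched rows as a DataFrame, so their other columns stay available for inspection.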

In [1]:
import pandas as pd
import synapseclient

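# Connect to Synapse (no credentials are passed to login here)
syn = synapseclient.Synapse()
syn.login(silent=True)

# Fetch the live files: harmonized_targets and gene_info
harmonized = syn.get('syn12540368')
gene_info = syn.get('syn12548902')

# Load the downloaded files into DataFrames (CSV and JSON records)
harmonized = pd.read_csv(harmonized.path)
gene_info = pd.read_json(gene_info.path, orient='records')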

print(harmonized.shape)
print(gene_info.shape)

(693, 13)
(57169, 16)


In [2]:
# Build a "symbol-ensembl_id" key so both identifiers can be compared at once
# (the key is NaN for any row where either identifier is missing)
harmonized['combined'] = harmonized['hgnc_symbol'] + "-" + harmonized['ensembl_gene_id']
gene_info['combined'] = gene_info['hgnc_symbol'] + "-" + gene_info['ensembl_gene_id']

# Unique values per dataset: combined key, symbol only, Ensembl ID only
harmonized_combined_set = set(harmonized['combined'])
gene_info_combined_set = set(gene_info['combined'])

harmonized_symbol_set = set(harmonized['hgnc_symbol'])
gene_info_symbol_set = set(gene_info['hgnc_symbol'])

harmonized_ensembl_set = set(harmonized['ensembl_gene_id'])
gene_info_ensembl_set = set(gene_info['ensembl_gene_id'])

# Values present in both datasets
intersection_combined = harmonized_combined_set.intersection(gene_info_combined_set)
intersection_symbol = harmonized_symbol_set.intersection(gene_info_symbol_set)
intersection_ensembl = harmonized_ensembl_set.intersection(gene_info_ensembl_set)

In [3]:
harmonized_combined_set.difference_update(intersection_combined)
print(harmonized_combined_set)
print(len(harmonized_combined_set))

{'TARDBP-ENSG00000120948.10', 'SULT1A3-ENSG00000261052.1', 'PGM5-ENSG00000154330.7', 'SERPINI1-ENSG00000163536.7', 'FCERG1-ENSG00000158869', 'SPATA22-ENSG00000141255.7', 'BLOC1S1-ENSG00000135441.3', 'MANF-ENSG00000145050.11', 'GCKR-ENSG00000084734.4', 'LRRC8D-ENSG00000171492.10', 'MT-ND3-ENSG00000198840.2', 'DLGAP2-ENSG00000282318', 'GRAMD3-ENSG00000155324', 'PRPF31-ENSG00000105618.9', 'KIAA1045-ENSG00000122733', 'CSFR1-ENSG00000182578', 'DDAH2-ENSG00000225635', 'MRPS18B-ENSG00000223775', 'CELF1-ENSG00000149187.12', 'NBAS-ENSG00000151779.8', 'DAXX-ENSG00000229396', 'SiGLEC8-ENSG00000105366', 'SYNGR4-ENSG00000105467.3', 'TOMM70A-ENSG00000154174', 'ATP5O-ENSG00000241837', 'RP11-894P9.1-ENSG00000246451', 'AKR1D1-ENSG00000122787.9', 'RTFDC1-ENSG00000022277', 'RAPGEF2-ENSG00000109756.4', 'LPPR4-ENSG00000117600', 'SEPT11-ENSG00000186522', 'EXOC3L2-ENSG00000130201', 'SLC25A19-ENSG00000125454.7', 'C4A-ENSG00000227746', 'SPTB-ENSG00000070182.13', 'GUCY1B3-ENSG00000061918', 'LAMA4-ENSG0000011276

In [4]:
harmonized_symbol_set.difference_update(intersection_symbol)
print(harmonized_symbol_set)
print(len(harmonized_symbol_set))

{'LPPR4', 'CSFR1', 'GRAMD3', 'GUCY1B3', 'TOMM70A', 'EXOC3L2', 'ATP5O', 'KIAA1045', 'MT-ND3', 'FCERG1', 'SiGLEC8', 'RP11-894P9.1', 'RTFDC1'}
13
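
At least one symbol in this output, `SiGLEC8`, has unusual capitalization, so part of the symbol mismatch may be a formatting difference and not a truly missing gene. The sketch below tests that idea on toy values. It assumes gene_info stores symbols in upper case, which has not been verified here. The contents of `gene_info_symbols_toy` and the helper `normalize_symbol` are hypothetical.

```python
def normalize_symbol(symbol: str) -> str:
    """Trim whitespace and upper-case a gene symbol before comparing."""
    return symbol.strip().upper()

# A few symbols copied from the unmatched set above.
unmatched_symbols = {"SiGLEC8", "CSFR1", "MT-ND3"}

# Hypothetical stand-in for the gene_info symbols (assumed upper case).
gene_info_symbols_toy = {"SIGLEC8", "CSF1R"}

# Symbols that match once case and whitespace are normalized.
recovered = {s for s in unmatched_symbols
             if normalize_symbol(s) in gene_info_symbols_toy}
print(recovered)
```

Symbols that still fail to match after normalization would need a different explanation, such as a misspelling or a renamed gene.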


In [5]:
harmonized_ensembl_set.difference_update(intersection_ensembl)
print(harmonized_ensembl_set)
print(len(harmonized_ensembl_set))

{'ENSG00000163536.7', 'ENSG00000125454.7', 'ENSG00000084734.4', 'ENSG00000225635', 'ENSG00000070182.13', 'ENSG00000105618.9', 'ENSG00000154330.7', 'ENSG00000223775', 'ENSG00000151779.8', 'ENSG00000112769.13', 'ENSG00000135441.3', 'ENSG00000227746', 'ENSG00000105467.3', 'ENSG00000109756.4', 'ENSG00000282318', 'ENSG00000171492.10', 'ENSG00000145050.11', 'ENSG00000120948.10', 'ENSG00000122787.9', 'ENSG00000198840.2', 'ENSG00000141255.7', 'ENSG00000261052.1', 'ENSG00000130201', 'ENSG00000229396', 'ENSG00000149187.12'}
25
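
Many of the unmatched IDs above carry a version suffix such as `.7` or `.10`. One hypothesis worth testing is that gene_info stores unversioned IDs, so some of these mismatches come from formatting, not from absent genes. This is an assumption, not something the output above confirms. A minimal sketch with a hypothetical helper `strip_ensembl_version` and toy gene_info values:

```python
def strip_ensembl_version(gene_id: str) -> str:
    """Drop a trailing '.<digits>' version from an Ensembl gene ID, if present."""
    base, sep, version = gene_id.partition(".")
    return base if sep and version.isdigit() else gene_id

# A few IDs copied from the unmatched set above.
unmatched = {"ENSG00000163536.7", "ENSG00000225635", "ENSG00000120948.10"}

# Hypothetical stand-in for the gene_info IDs, assumed to be unversioned.
gene_info_ids_toy = {"ENSG00000163536", "ENSG00000120948"}

normalized = {strip_ensembl_version(g) for g in unmatched}
still_missing = normalized - gene_info_ids_toy
print(sorted(still_missing))
```

On the real data, the same helper could be applied with `harmonized['ensembl_gene_id'].map(strip_ensembl_version)` before rebuilding the sets. Whatever remains unmatched afterwards is the better candidate list for genes actually missing from gene_info.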
