## Generate genome-wide gene id list

**Input:** approved gene symbols from HGNC

**Output:** gene ID map with locus type, plus a map from current to previous symbols

Notes:
- Data were downloaded from HGNC; every gene in the downloaded file has status "Approved".
- The `locus_type` column indicates whether the gene is a *gene with protein product* or belongs to a different category, such as *readthrough*.
- Some genes may be missing an Entrez or Ensembl ID.

In [1]:
import pandas as pd
import os
import re

# Build normalized paths under the third-party data dir (env var 3RD_PARTY_DIR) and the local data dir
def get_data_path(folders, fname):
    return os.path.normpath(os.path.join(os.environ['3RD_PARTY_DIR'], *folders, fname))

def get_local_data_path(folders, fname):
    return os.path.normpath(os.path.join('../local_data', *folders, fname))

file_hgnc_all = get_data_path(['hgnc'], 'non_alt_loci_set_20_05_20.txt')
file_map = get_local_data_path(['processed'], 'HGNC_gene_id_map.csv')

In [2]:
hgnc_all = pd.read_csv(file_hgnc_all, sep='\t', low_memory=False)
# Sanity check: every row in the download should have status "Approved"
assert (hgnc_all.status == 'Approved').all()

In [3]:
hgnc_all[:1]

Unnamed: 0,hgnc_id,symbol,name,locus_group,locus_type,status,location,location_sortable,alias_symbol,alias_name,...,kznf_gene_catalog,mamit-trnadb,cd,lncrnadb,enzyme_id,intermediate_filament_db,rna_central_ids,lncipedia,gtrnadb,agr
0,HGNC:5,A1BG,alpha-1-B glycoprotein,protein-coding gene,gene with protein product,Approved,19q13.43,19q13.43,,,...,,,,,,,,,,HGNC:5


In [4]:
# hgnc_all.columns

In [4]:
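# Clean and print stats
# Keep only the identifier columns plus locus type and chromosomal location
id_map = hgnc_all[['hgnc_id', 'symbol', 'entrez_id', 'ensembl_gene_id', 'prev_symbol', 'ccds_id', 'locus_type', 'location']]
id_map = id_map.rename(columns={'ensembl_gene_id':'ensembl_id'})
# Strip the 'HGNC:' prefix, e.g. 'HGNC:5' -> '5' (the id stays a string)
id_map['hgnc_id'] = id_map['hgnc_id'].apply(lambda x: x.split(':')[1])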

print('Num genes: ', id_map.shape[0])
protein_coding = id_map[id_map.locus_type == 'gene with protein product']
print('Protein coding genes: ' + str(protein_coding.shape[0]))
print('Protein coding genes w/out entrez_id:', protein_coding[protein_coding.entrez_id.isna()].shape[0])
print('Protein coding genes w/out ensembl_id:', protein_coding[protein_coding.ensembl_id.isna()].shape[0])
id_map[:2]

Num genes:  41950
Protein coding genes: 19199
Protein coding genes w/out entrez_id: 0
Protein coding genes w/out ensembl_id: 38


Unnamed: 0,hgnc_id,symbol,entrez_id,ensembl_id,prev_symbol,ccds_id,locus_type,location
0,5,A1BG,1.0,ENSG00000121410,,CCDS12976,gene with protein product,19q13.43
1,37133,A1BG-AS1,503538.0,ENSG00000268895,NCRNA00181|A1BGAS|A1BG-AS,,"RNA, long non-coding",19q13.43
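
The `entrez_id` values above print as floats (`1.0`, `503538.0`), which is typically what pandas produces when an integer column contains missing values. A minimal sketch, assuming integer ids are wanted in the saved file, is to cast the column to pandas' nullable `Int64` dtype. The toy frame and its gene names below are illustrative only and are not taken from the HGNC file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for id_map; symbols and the missing value are made up for illustration.
toy = pd.DataFrame({
    'symbol': ['GENE_A', 'GENE_B', 'GENE_C'],
    'entrez_id': [1.0, 503538.0, np.nan],
})

# The nullable integer dtype stores missing values as <NA> instead of forcing floats.
toy['entrez_id'] = toy['entrez_id'].astype('Int64')

print(toy['entrez_id'].dtype)
print(toy.to_csv(index=False))
```

Whether this cast belongs in the notebook depends on how downstream code reads the file; a plain `pd.read_csv` would infer floats again for a column with empty cells unless a dtype is passed.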


In [5]:
# Write the map without the DataFrame index column
id_map.to_csv(file_map, index=False)
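
The saved file keeps previous symbols as a pipe-delimited string in `prev_symbol` (for example `NCRNA00181|A1BGAS|A1BG-AS` for A1BG-AS1). One possible way to derive a long-form map from current to previous symbols is to split and explode that column. This is a sketch on a toy frame, not the notebook's own code, and the name `symbol_map` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for id_map, using the two rows shown in the output above.
toy = pd.DataFrame({
    'symbol': ['A1BG', 'A1BG-AS1'],
    'prev_symbol': [np.nan, 'NCRNA00181|A1BGAS|A1BG-AS'],
})

# One row per (current symbol, previous symbol) pair; genes without
# previous symbols are dropped.
symbol_map = (
    toy.dropna(subset=['prev_symbol'])
       .assign(prev_symbol=lambda df: df['prev_symbol'].str.split('|'))
       .explode('prev_symbol')
       .reset_index(drop=True)[['symbol', 'prev_symbol']]
)

print(symbol_map)
```

A long-form table like this makes it straightforward to look up the current symbol for an outdated one with a merge on `prev_symbol`.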