# Convert HGNC identifiers to Entrez Gene identifiers

2017-06-09

In [1]:
import pandas as pd
import numpy as np

## Read source HGNC data

**Data Source**

1. Go to the "Complete Dataset Download Links" section on the following website: www.genenames.org/cgi-bin/statistics
2. Download the "Complete HGNC dataset (txt)" file, called `hgnc_complete_set.txt`
3. Rename the file to `hgnc_id_map.tsv`

In [2]:
hgnc_raw = pd.read_csv("../data/maps/hgnc_id_map.tsv", sep='\t', low_memory=False)

In [3]:
hgnc_raw.shape

(42215, 49)
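
Only a handful of these columns are needed for the identifier mapping built later in this notebook. As a minimal sketch, assuming the same tab-separated layout, `usecols` can restrict the read to those columns. The in-memory `toy_tsv` string below is a made-up stand-in for `hgnc_id_map.tsv`, so the example runs without the real file.

```python
import io

import pandas as pd

# Made-up two-row stand-in for hgnc_id_map.tsv (values taken from the preview of the real file).
toy_tsv = (
    "hgnc_id\tsymbol\tname\tstatus\tentrez_id\n"
    "HGNC:5\tA1BG\talpha-1-B glycoprotein\tApproved\t1\n"
    "HGNC:37133\tA1BG-AS1\tA1BG antisense RNA 1\tApproved\t503538\n"
)

# Columns assumed to be the only ones needed for the mapping.
wanted = ["hgnc_id", "symbol", "status", "entrez_id"]

# usecols skips every other column while parsing.
hgnc_subset = pd.read_csv(io.StringIO(toy_tsv), sep="\t", usecols=wanted)
print(hgnc_subset.shape)
```

With the real file, the `io.StringIO(toy_tsv)` argument would be replaced by the path used in the cell above.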

In [4]:
hgnc_raw.head(2)

Unnamed: 0,hgnc_id,symbol,name,locus_group,locus_type,status,location,location_sortable,alias_symbol,alias_name,...,merops,imgt,iuphar,kznf_gene_catalog,mamit-trnadb,cd,lncrnadb,enzyme_id,intermediate_filament_db,rna_central_ids
0,HGNC:5,A1BG,alpha-1-B glycoprotein,protein-coding gene,gene with protein product,Approved,19q13.43,19q13.43,,,...,I43.950,,,,,,,,,
1,HGNC:37133,A1BG-AS1,A1BG antisense RNA 1,non-coding RNA,"RNA, long non-coding",Approved,19q13.43,19q13.43,FLJ23569,,...,,,,,,,,,,URS00007E4F6E


---

## Convert to Entrez space

All genes in Rephetio are identified by Entrez Gene IDs, so the HGNC IDs need to be mapped to Entrez space.

In [5]:
small = (hgnc_raw
    [["hgnc_id", "symbol", "status", "entrez_id"]]
    # Drop the 5-character "HGNC:" prefix and keep the numeric part as an integer
    .assign(hgnc_id = lambda df: df["hgnc_id"].str[5:].astype(int))
)
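
The `.str[5:]` slice assumes that every identifier starts with the prefix `HGNC:`. A slightly more defensive variant, sketched below, extracts the digits with a regular expression and raises an error if any value does not match. The helper name `parse_hgnc_id` and the `sample` frame are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy stand-in for hgnc_raw; IDs and symbols copied from the preview above.
sample = pd.DataFrame({
    "hgnc_id": ["HGNC:5", "HGNC:37133", "HGNC:24086"],
    "symbol": ["A1BG", "A1BG-AS1", "A1CF"],
})


def parse_hgnc_id(ids: pd.Series) -> pd.Series:
    """Hypothetical helper: convert strings like 'HGNC:5' to the integer 5."""
    # expand=False returns a Series because the pattern has a single group.
    digits = ids.str.extract(r"^HGNC:(\d+)$", expand=False)
    if digits.isnull().any():
        raise ValueError("unexpected HGNC ID format")
    return digits.astype(int)


parsed = sample.assign(hgnc_id=parse_hgnc_id(sample["hgnc_id"]))
print(parsed)
```

The slice is shorter and works as long as the file keeps its current format; the regular expression only adds an explicit check that it does.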

In [6]:
small.head()

Unnamed: 0,hgnc_id,symbol,status,entrez_id
0,5,A1BG,Approved,1.0
1,37133,A1BG-AS1,Approved,503538.0
2,24086,A1CF,Approved,29974.0
3,7,A2M,Approved,2.0
4,27057,A2M-AS1,Approved,144571.0


### Are any HGNC genes missing Entrez ids?

In [7]:
small.isnull().sum()

hgnc_id         0
symbol          0
status          0
entrez_id    1288
dtype: int64

In [8]:
small.groupby("status").apply(lambda df: df["entrez_id"].isnull().sum())

status
Approved            127
Entry Withdrawn    1161
dtype: int64

Most of the missing Entrez IDs (1161 of 1288) belong to withdrawn HGNC entries; only 127 approved genes lack an Entrez ID.
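
The per-status counts above could be followed up by computing the fraction of missing IDs in each status group and listing the approved genes that have no Entrez ID. The sketch below does this on a made-up four-row frame; `toy`, `GENE_X`, and `GENE_Y` are placeholders, not real HGNC records.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same columns as `small`; GENE_X and GENE_Y are placeholders.
toy = pd.DataFrame({
    "hgnc_id": [5, 7, 100, 200],
    "symbol": ["A1BG", "A2M", "GENE_X", "GENE_Y"],
    "status": ["Approved", "Approved", "Approved", "Entry Withdrawn"],
    "entrez_id": [1.0, 2.0, np.nan, np.nan],
})

is_missing = toy["entrez_id"].isnull()

# Fraction of rows without an Entrez ID, per status.
missing_fraction = is_missing.groupby(toy["status"]).mean()

# Approved genes that still lack an Entrez ID.
approved_missing = toy[(toy["status"] == "Approved") & is_missing]

print(missing_fraction)
print(approved_missing)
```

Applied to `small` instead of `toy`, the second result would list the approved genes that cannot be mapped.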

In [9]:
small["status"].value_counts()

Approved           41054
Entry Withdrawn     1161
Name: status, dtype: int64

In [10]:
small.to_csv("hgnc_entrez_map.tsv", sep='\t', index=False)
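
The cell above writes the table as is, so rows without an Entrez ID are included and `entrez_id` is stored as a float (for example `1.0`), because the column contains missing values. If only mapped genes are needed downstream, which this notebook does not state, one option is to drop the unmapped rows and cast the column to an integer before writing. The sketch below uses a made-up `toy` frame and writes to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

# Made-up frame shaped like `small`; GENE_X is a placeholder, not a real record.
toy = pd.DataFrame({
    "hgnc_id": [5, 37133, 100],
    "symbol": ["A1BG", "A1BG-AS1", "GENE_X"],
    "status": ["Approved", "Approved", "Entry Withdrawn"],
    "entrez_id": [1.0, 503538.0, np.nan],
})

# Keep only rows with an Entrez ID, then store the ID as an integer.
mapped = toy.dropna(subset=["entrez_id"]).astype({"entrez_id": int})

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
mapped.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```

Keeping the unmapped rows, as the original cell does, preserves the full HGNC list; the choice depends on how the map is used.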