# Getting gene name synonyms


To generate gene objects in the platform ETL, we need a comprehensive collection of gene name synonyms and aliases. The same dataset will also be used to ground the entities found in the ePMC ML results.

**Sources to be considered:**

* HGNC complete gene set. [link](ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/json/hgnc_complete_set.json)
* NCBI gene info [link](ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz)

**Process:**

* Read both datasets and convert them to pandas dataframes.
* Re-shape each table so that three columns are preserved: `Ensembl`, holding the Ensembl gene id; `alias`, holding any alternative name/identifier/accession under which the gene is mentioned in the dataset; and `alias_type`, describing the type of the alias, e.g. `hgnc_id`, `synonym`, `symbol` etc.
* Not all NCBI genes are mapped to Ensembl. Genes for which only an `hgnc` identifier is available are mapped to an Ensembl identifier using the HGNC dataset.
* Genes in the HGNC dataset without Ensembl mappings are discarded.
* Genes in the NCBI dataset that could not be mapped to Ensembl gene ids are discarded.
* Concatenate the two datasets and remove duplicates.
* Identifiers and accessions can also be removed if only names and symbols are needed.
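
The re-shaping step can be illustrated on a toy table. This is a minimal sketch and not the notebook's own implementation, which loops over the rows: the `wide` frame and the choice of `melt` plus `explode` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical wide table: one row per gene, one column per alias type.
# Cells may hold a single string or a list of strings.
wide = pd.DataFrame({
    'Ensembl': ['ENSG00000121410', 'ENSG00000141736'],
    'symbol': ['A1BG', 'ERBB2'],
    'alias_symbol': [[], ['HER2', 'NEU']],
})

# Melt to long format, then expand list-valued cells to one row per alias.
long_df = (
    wide
    .melt(id_vars='Ensembl', var_name='alias_type', value_name='alias')
    .explode('alias')            # an empty list becomes NaN
    .dropna(subset=['alias'])    # drop genes without a value for that type
    [['Ensembl', 'alias', 'alias_type']]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(long_df)
```

Each row of the result is one (gene, alias, alias type) link, which is the shape used throughout the rest of this notebook.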

**Results:**

* The HGNC dataset contains 39k unique genes with 435k synonyms/cross references, of which 174k are synonyms.
* The NCBI dataset contains 44k unique genes, of which 35k could be mapped to Ensembl. There are 430k links, of which 290k are alternative names/synonyms.
* 6731 of the NCBI genes without Ensembl mapping could be rescued through their HGNC identifiers, together with 49k synonyms/cross refs.
* About 2.4k genes (9095 - 6731 = 2364 HGNC ids) still could not be mapped, with 20k synonyms/cross refs.
* In the merged dataset there are synonyms/cross refs for 42k Ensembl genes. 
* The total number of unique links: 750k, the number of unique aliases: 257k.


**Alias types:** The list below shows all types of alternative names/synonyms/labels/accessions/identifiers:

Identifiers/accessions:

* `vega_id`
* `refseq_accession`
* `hgnc_id`
* `rgd_id`
* `omim_id`
* `mgd_id`
* `ucsc_id`
* `uniprot_ids`
* `ccds_id`
* `ena`
* `entrez_id`
* `imgt/gene-db_id`
* `mirbase_id`
* `ensembl_id`

Synonyms:

* `symbol`
* `name`
* `alias_symbol`
* `prev_name`
* `prev_symbol`
* `alias_name`
* `description`
* `symbol_from_nomenclature_authority`
* `full_name_from_nomenclature_authority`
* `other_designations`
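
The split between the two groups can be expressed as a simple membership test. The sketch below uses plain Python tuples instead of a dataframe; the name `SYNONYM_TYPES` is an assumption introduced here, and the sample rows are copied from the table head shown further down.

```python
# Synonym-like alias types, copied from the list above.
SYNONYM_TYPES = {
    'symbol', 'name', 'alias_symbol', 'prev_name', 'prev_symbol', 'alias_name',
    'description', 'symbol_from_nomenclature_authority',
    'full_name_from_nomenclature_authority', 'other_designations',
}

# A few (Ensembl, alias, alias_type) rows.
rows = [
    ('ENSG00000121410', 'OTTHUMG00000183507', 'vega_id'),
    ('ENSG00000121410', '5', 'hgnc_id'),
    ('ENSG00000121410', 'A1BG', 'symbol'),
    ('ENSG00000121410', 'alpha-1-B glycoprotein', 'name'),
]

# Everything that is not a synonym type is treated as an identifier/accession.
synonyms = [r for r in rows if r[2] in SYNONYM_TYPES]
identifiers = [r for r in rows if r[2] not in SYNONYM_TYPES]

print([alias for _, alias, _ in synonyms])
# ['A1BG', 'alpha-1-B glycoprotein']
```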


**Head looks like this:**

```
Ensembl          alias                   alias_type
ENSG00000121410  OTTHUMG00000183507      vega_id
ENSG00000121410  NM_130786               refseq_accession
ENSG00000121410  5                       hgnc_id
ENSG00000121410  69417                   rgd_id
ENSG00000121410  138670                  omim_id
ENSG00000121410  A1BG                    symbol
ENSG00000121410  alpha-1-B glycoprotein  name
ENSG00000121410  2152878                 mgd_id
ENSG00000121410  uc002qsd.5              ucsc_id
```

In [48]:
%%bash

# Fetching entrez -> ensembl mapping file & filter for human
curl -s ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz \
    | gunzip \
    | awk '$1 == "#tax_id" || $1 == 9606' \
    | gzip > gene2ensembl.human.gz
    

# Fetch gene info file:
wget 'ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz'
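
For readers who prefer to stay in Python, the `awk` filter above can be approximated as follows. This is only a sketch: `keep_human` is a hypothetical helper, and the toy lines are made up and carry fewer columns than the real `gene2ensembl` file.

```python
def keep_human(lines, tax_id='9606'):
    """Yield the header line and every line whose first column equals tax_id."""
    for line in lines:
        first = line.split('\t', 1)[0]
        if first == '#tax_id' or first == tax_id:
            yield line


# Toy input; the non-human row uses placeholder values.
example = [
    '#tax_id\tGeneID\tEnsembl_gene_identifier\n',
    '9606\t1\tENSG00000121410\n',
    '10090\t0\tENSMUSG_EXAMPLE\n',
]

human_lines = list(keep_human(example))
print(len(human_lines))  # 2: the header and the human row
```

On the real file, the same generator could be fed from `gzip.open(path, 'rt')`.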

In [140]:
import pandas as pd
import json
import gzip
import numpy as np
from collections import defaultdict


def get_gene_id(values):
    """
    Based on a list of xdb references we get a list of
    gene IDs
    """
    xrefs = defaultdict(list)
    for gene_id in values.split('|'):
        vals = gene_id.split(':')
        xrefs[vals[0]].append(vals[-1])
    return xrefs
        

def parse_ncbi_table(row):
    """
    This function parses all possible name/synonym/symbol/xref for a given ensembl gene id
    """
    
    # This list will contain the parsed data:
    rows = []
    
    # xref parsing:
    xrefs = get_gene_id(row['dbXrefs'])
    
    # Prefer Ensembl ids; fall back to HGNC ids when no Ensembl xref is given:
    if 'Ensembl' in xrefs:
        id_source = 'Ensembl'
        ids = xrefs['Ensembl']
    elif 'HGNC' in xrefs:
        id_source = 'HGNC'
        ids = [f'HGNC:{x}' for x in xrefs['HGNC']]
    else:
        ids = []
        
    # Parsing Ensembl gene ids:
    for identifier in ids:

        # Adding entrez gene id:
        rows.append({
            id_source: identifier,
            'alias': row["GeneID"],
            'alias_type': 'entrez_id'
        })
        
        # Parse dbxrefs:
        for xref in row['dbXrefs'].split('|'):
            rows.append({
                id_source: identifier,
                'alias': xref.split(':')[-1],
                'alias_type': f"{xref.split(':')[0].lower().replace('mim','omim')}_id"
            })

        # all values:
        for key in ['Symbol', 'description', 'Symbol_from_nomenclature_authority', 'Full_name_from_nomenclature_authority', 'Other_designations']:
            for value in row[key].split('|'):
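                # Excluding values that contain the gene ID itself
                # (the original tested an undefined `gene_id`; `identifier` is assumed to be the intent):
                if identifier in value:
                    continue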

                rows.append({
                    id_source: identifier,
                    'alias': value,
                    'alias_type': key.lower()
                })
        
    return rows


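def parse_hgnc_table(row):
    """
    This function collects all identifiers/names/symbols of an HGNC record that has an Ensembl gene id
    """
    data = []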

    # Only considering ensembl gene id:
    gene_id = row['ensembl_gene_id']
    
    if 'ENSG' not in str(gene_id):
        return data
    
    # fields to check
    fields = ['vega_id', 'refseq_accession', 'hgnc_id', 'rgd_id', 'omim_id', 'symbol', 'name',
             'mgd_id', 'ucsc_id', 'uniprot_ids', 'ccds_id', 'alias_symbol', 'prev_name', 
              'prev_symbol', 'ena', 'orphanet', 'alias_name']

    # Looping through all relevant fields and extract data:
    for field in fields:
        
        # No data in the field, skip
        if isinstance(row[field], float) and np.isnan(row[field]):
            continue

        # Under one field, there might be multiple values:
        if isinstance(row[field], list):
            for f in row[field]:
                data.append({
                    'Ensembl': gene_id,
                    'alias': f.split(':')[-1] if '_id' in field else f,
                    'alias_type': field.lower()
                })

        # String fields are simply picked:
        elif isinstance(row[field], str):
            data.append({
                'Ensembl': gene_id,
                'alias': row[field].split(':')[-1] if '_id' in field else row[field],
                'alias_type': field.lower()
            })

    return data

##
## Reading HGNC data:
##

# List of gene names from HGNC website: 
complete_set = 'ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/json/hgnc_complete_set.json'
withdrawn = 'ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/json/withdrawn.json'

# Loading the dataset:
hgnc_df = pd.read_json(complete_set)
hgnc_df = pd.DataFrame(hgnc_df.loc['docs'].response)
hgnc_df = hgnc_df.loc[hgnc_df.alias != '-']

##
## Parsing HGNC data:
##
data = []
for i, row in hgnc_df.iterrows():
    data += parse_hgnc_table(row)

parsed_hgnc = pd.DataFrame(data)

##
## Reading NCBI data
##
ncbi_df = pd.read_csv('Homo_sapiens.gene_info.gz', compression='infer', sep='\t')

##
## Parsing NCBI data
##

data = []

for i, row in ncbi_df.iterrows():
    data += parse_ncbi_table(row)

parsed_ncbi = pd.DataFrame(data)


# List of alias types that are not identifiers or accessions:
aliases = [
    'symbol',
    'name',
    'alias_symbol',
    'prev_name',
    'prev_symbol',
    'alias_name',
    'description',
    'symbol_from_nomenclature_authority',
    'full_name_from_nomenclature_authority',
    'other_designations'
]

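|    | Ensembl         | alias           | alias_type | HGNC |
|---:|:----------------|:----------------|:-----------|:-----|
|  0 | ENSG00000121410 | 1               | entrez_id  |      |
|  1 | ENSG00000121410 | 138670          | omim_id    |      |
|  2 | ENSG00000121410 | 5               | hgnc_id    |      |
|  3 | ENSG00000121410 | ENSG00000121410 | ensembl_id |      |
|  4 | ENSG00000121410 | A1BG            | symbol     |      |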
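
To make the xref parsing concrete, here is `get_gene_id` applied to a single value. The input string is an assumed example in the shape of the NCBI `dbXrefs` column, chosen to match the rows shown above; the function body is repeated so that the snippet runs on its own.

```python
from collections import defaultdict


def get_gene_id(values):
    """Group the pipe-separated xrefs by source database."""
    xrefs = defaultdict(list)
    for gene_id in values.split('|'):
        vals = gene_id.split(':')
        # First token is the source, last token is the bare identifier.
        xrefs[vals[0]].append(vals[-1])
    return xrefs


# Assumed example value of the dbXrefs column:
db_xrefs = 'MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410'

parsed = dict(get_gene_id(db_xrefs))
print(parsed)
# {'MIM': ['138670'], 'HGNC': ['5'], 'Ensembl': ['ENSG00000121410']}

# The alias type is derived from the source name, with 'mim' renamed to 'omim':
alias_types = [f"{x.split(':')[0].lower().replace('mim', 'omim')}_id"
               for x in db_xrefs.split('|')]
print(alias_types)
# ['omim_id', 'hgnc_id', 'ensembl_id']
```

Note that the HGNC prefix appears twice in the raw value, which is why only the last token is kept as the identifier.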


In [190]:
print(f'HGNC lookups: {len(parsed_hgnc)}')
print(f'HGNC aliases: {len(parsed_hgnc.loc[parsed_hgnc.alias_type.isin(aliases)])}')
print(f'HGNC unique genes: {len(parsed_hgnc.Ensembl.unique())}')
print('')
print(f'NCBI lookups: {len(parsed_ncbi)}')
print(f'NCBI aliases: {len(parsed_ncbi.loc[parsed_ncbi.alias_type.isin(aliases)])}')
print(f'NCBI unique Ensembl genes: {len(parsed_ncbi.Ensembl.unique())}')
print(f'NCBI hgnc without ensembl: {len(parsed_ncbi.loc[parsed_ncbi.Ensembl.isna()].HGNC.unique())}')
print('')


HGNC lookups: 435141
HGNC aliases: 173880
HGNC unique genes: 39440

NCBI lookups: 428493
NCBI aliases: 287469
NCBI unique Ensembl genes: 35084
NCBI hgnc without ensembl: 9095



In [193]:
##
## Looking up Ensembl gene id where only hgnc is given:
##
# Take an explicit copy so that the column assignment below does not act on a slice of parsed_ncbi:
ncbi_hgnc = parsed_ncbi.loc[parsed_ncbi.Ensembl.isna()].copy()
print(f'Number of lookups without Ensembl id: {len(ncbi_hgnc)}')
ncbi_hgnc['HGNC'] = ncbi_hgnc['HGNC'].str.replace('HGNC:', '', regex=False)

# Prepare hgnc table for the merge:
hgnc_ensembl_lookup = (
    parsed_hgnc
    .loc[parsed_hgnc.alias_type=='hgnc_id']
    .rename(columns={'alias':'HGNC'})
    .drop('alias_type', axis =1)
)

# Doing the lookup by applying an inner join:
hgnc_looked_up = (
    ncbi_hgnc[['alias','alias_type','HGNC']]
    .merge(hgnc_ensembl_lookup, on='HGNC', how='inner')
    .drop('HGNC', axis=1)
    .drop_duplicates()
)
print(f'Number of lookups mapped: {len(hgnc_looked_up)}')
print(f'Number of unique hgnc ids looked up: {len(hgnc_looked_up.loc[hgnc_looked_up.alias_type=="hgnc_id"].alias.unique())}')
print(f'Number of ensembl ids mapped: {len(hgnc_looked_up.Ensembl.unique())}')



Number of lookups without Ensembl id: 68230
Number of lookups mapped: 49475
Number of unique hgnc ids looked up: 6731
Number of ensembl ids mapped: 6731
  


In [214]:
##
## Merging tables:
##
merged = pd.concat([parsed_hgnc, hgnc_looked_up, parsed_ncbi.loc[parsed_ncbi.Ensembl.notna(), ['Ensembl', 'alias', 'alias_type']]])
merged.drop_duplicates(inplace=True)

# merged_alias = merged.loc[(merged.alias_type.isin(aliases))
#            & (merged.alias != '-'),['Ensembl','alias']].drop_duplicates()
print(f'Number of total links: {len(merged)}')
print(f'Number of unique Ensembl genes: {len(merged.Ensembl.unique())}')
print(f'Number of unique aliases: {len(merged.loc[(merged.alias_type.isin(aliases)),["Ensembl","alias"]].drop_duplicates())}')


Number of total links: 749858
Number of unique Ensembl genes: 41873
Number of unique aliases: 257074
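
The HGNC rescue and the final concatenation can be reproduced on a toy example. This is a sketch under stated assumptions: the frames below are tiny stand-ins that reuse the notebook's variable names, and the values `A1B`, `UNMAPPED` and the HGNC id `999999` are made up.

```python
import pandas as pd

# Toy HGNC-derived table, already keyed by Ensembl id.
parsed_hgnc = pd.DataFrame({
    'Ensembl': ['ENSG00000121410', 'ENSG00000121410'],
    'alias': ['5', 'A1BG'],
    'alias_type': ['hgnc_id', 'symbol'],
})

# Toy NCBI-derived rows that only carry an HGNC id.
ncbi_hgnc = pd.DataFrame({
    'HGNC': ['5', '5', '999999'],
    'alias': ['A1BG', 'A1B', 'UNMAPPED'],
    'alias_type': ['symbol', 'other_designations', 'symbol'],
})

# HGNC id -> Ensembl id lookup table.
lookup = (
    parsed_hgnc
    .loc[parsed_hgnc.alias_type == 'hgnc_id', ['Ensembl', 'alias']]
    .rename(columns={'alias': 'HGNC'})
)

# Inner join: rows whose HGNC id has no Ensembl mapping are dropped.
rescued = ncbi_hgnc.merge(lookup, on='HGNC', how='inner').drop(columns='HGNC')

# Concatenate and remove links that are present in both sources.
merged = (
    pd.concat([parsed_hgnc, rescued], ignore_index=True)
    [['Ensembl', 'alias', 'alias_type']]
    .drop_duplicates()
)

print(len(rescued), len(merged))
```

The row with the unknown HGNC id disappears in the inner join, and the `A1BG` symbol link, which both sources provide, is kept only once.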


In [171]:
# merged.loc[merged.Ensembl == 'ENSG00000136244']
# [x for x in merged.alias_type.unique().tolist() if '_id' not in x ]


|        | Ensembl         | alias                        |
|-------:|:----------------|:-----------------------------|
|      5 | ENSG00000121410 | A1BG                         |
|      6 | ENSG00000121410 | alpha-1-B glycoprotein       |
|     14 | ENSG00000268895 | A1BG-AS1                     |
|     15 | ENSG00000268895 | A1BG antisense RNA 1         |
|     17 | ENSG00000268895 | FLJ23569                     |
|    ... | ...             | ...                          |
| 427240 | ENSG00000286710 | uncharacterized LOC116435278 |
| 427609 | ENSG00000227066 | LOC117779438                 |
| 427610 | ENSG00000227066 | uncharacterized LOC117779438 |
| 427717 | ENSG00000048545 | LOC118142757                 |


In [195]:
merged.loc[merged.alias == 'DLC1']

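|       | Ensembl         | alias | alias_type                         |
|------:|:----------------|:------|:-----------------------------------|
| 83465 | ENSG00000164741 | DLC1  | symbol                             |
| 83515 | ENSG00000008226 | DLC1  | alias_symbol                       |
| 90466 | ENSG00000088986 | DLC1  | alias_symbol                       |
| 98736 | ENSG00000164741 | DLC1  | symbol_from_nomenclature_authority |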


In [202]:
parsed_hgnc.loc[parsed_hgnc.alias_type == 'symbol'].groupby('Ensembl').count().sort_values('alias')


| Ensembl         | alias | alias_type |
|:----------------|------:|-----------:|
| ENSG00000000003 |     1 |          1 |
| ENSG00000231537 |     1 |          1 |
| ENSG00000231538 |     1 |          1 |
| ENSG00000231540 |     1 |          1 |
| ENSG00000231541 |     1 |          1 |
| ...             |   ... |        ... |
| ENSG00000172000 |     1 |          1 |
| ENSG00000172005 |     1 |          1 |
| ENSG00000171984 |     1 |          1 |
| ENSG00000230417 |     2 |          2 |


In [206]:
x = (merged
    .merge(parsed_hgnc.loc[parsed_hgnc.alias_type == 'symbol',['Ensembl','alias']]
           .rename(columns={'alias':'hgnc_symbol'}), on='Ensembl', how='left'))

x.head()

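|    | Ensembl         | alias              | alias_type       | hgnc_symbol |
|---:|:----------------|:-------------------|:-----------------|:------------|
|  0 | ENSG00000121410 | OTTHUMG00000183507 | vega_id          | A1BG        |
|  1 | ENSG00000121410 | NM_130786          | refseq_accession | A1BG        |
|  2 | ENSG00000121410 | 5                  | hgnc_id          | A1BG        |
|  3 | ENSG00000121410 | 69417              | rgd_id           | A1BG        |
|  4 | ENSG00000121410 | 138670             | omim_id          | A1BG        |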


In [211]:
label = 'HER2'

x.loc[x.alias == label]

|       | Ensembl         | alias | alias_type   | hgnc_symbol |
|------:|:----------------|:------|:-------------|:------------|
| 99282 | ENSG00000141736 | HER2  | alias_symbol | ERBB2       |
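
For grounding, a label usually has to be resolved to all of its candidate genes at once. The sketch below builds a plain dictionary index for that purpose; `build_index` is a hypothetical helper that is not part of the notebook, and the rows are copied from the outputs above.

```python
from collections import defaultdict

# (Ensembl, alias, alias_type) rows copied from the outputs above.
rows = [
    ('ENSG00000164741', 'DLC1', 'symbol'),
    ('ENSG00000008226', 'DLC1', 'alias_symbol'),
    ('ENSG00000088986', 'DLC1', 'alias_symbol'),
    ('ENSG00000141736', 'HER2', 'alias_symbol'),
]


def build_index(rows):
    """Map each alias to the set of Ensembl gene ids it points to."""
    index = defaultdict(set)
    for ensembl, alias, _ in rows:
        index[alias].add(ensembl)
    return index


index = build_index(rows)
print(sorted(index['HER2']))  # ['ENSG00000141736']
print(len(index['DLC1']))     # 3
```

A label such as `HER2` resolves to a single gene, while `DLC1` returns three candidates and needs further disambiguation.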


In [212]:
x.loc[x.Ensembl == 'ENSG00000141736']

|       | Ensembl         | alias                             | alias_type       | hgnc_symbol |
|------:|:----------------|:----------------------------------|:-----------------|:------------|
| 99264 | ENSG00000141736 | OTTHUMG00000179300                | vega_id          | ERBB2       |
| 99265 | ENSG00000141736 | NM_004448                         | refseq_accession | ERBB2       |
| 99266 | ENSG00000141736 | 3430                              | hgnc_id          | ERBB2       |
| 99267 | ENSG00000141736 | 2561                              | rgd_id           | ERBB2       |
| 99268 | ENSG00000141736 | 164870                            | omim_id          | ERBB2       |
| 99269 | ENSG00000141736 | ERBB2                             | symbol           | ERBB2       |
| 99270 | ENSG00000141736 | erb-b2 receptor tyrosine kinase 2 | name             | ERBB2       |
| 99271 | ENSG00000141736 | 95410                             | mgd_id           | ERBB2       |
| 99272 | ENSG00000141736 | uc002hso.4                        | ucsc_id          | ERBB2       |
| 99273 | ENSG00000141736 | P04626                            | uniprot_ids      | ERBB2       |


In [215]:
merged.alias_type.unique()

array(['vega_id', 'refseq_accession', 'hgnc_id', 'rgd_id', 'omim_id',
       'symbol', 'name', 'mgd_id', 'ucsc_id', 'uniprot_ids', 'ccds_id',
       'alias_symbol', 'prev_name', 'prev_symbol', 'ena', 'alias_name',
       'entrez_id', 'description', 'symbol_from_nomenclature_authority',
       'full_name_from_nomenclature_authority', 'other_designations',
       'imgt/gene-db_id', 'mirbase_id', 'ensembl_id'], dtype=object)

In [218]:
merged.to_csv('target_mapping.tsv.gz', sep='\t', compression='infer', index=False)
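
One point to keep in mind when the file is read back: the `alias` column mixes numeric identifiers with free text, so pandas type inference may not return the values exactly as they were written. The sketch below, on a small made-up frame and an in-memory buffer, shows the difference between inferred types and `dtype=str`.

```python
import io
import pandas as pd

# Small sample whose aliases happen to be all numeric.
df = pd.DataFrame({
    'Ensembl': ['ENSG00000121410', 'ENSG00000121410'],
    'alias': ['5', '138670'],
    'alias_type': ['hgnc_id', 'omim_id'],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, sep='\t', index=False)

# Read once with type inference and once forcing every column to text.
buffer.seek(0)
inferred = pd.read_csv(buffer, sep='\t')
buffer.seek(0)
as_text = pd.read_csv(buffer, sep='\t', dtype=str)

print(inferred.alias.tolist())  # [5, 138670]
print(as_text.alias.tolist())   # ['5', '138670']
```

Passing `dtype=str` keeps identifiers comparable to string labels, which matters when the table is used for exact-match lookups.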

In [1]:
import pandas as pd


tm_df = pd.read_csv('target_mapping.tsv.gz', sep='\t', compression='infer')

tm_df.head()

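|    | Ensembl         | alias              | alias_type       |
|---:|:----------------|:-------------------|:-----------------|
|  0 | ENSG00000121410 | OTTHUMG00000183507 | vega_id          |
|  1 | ENSG00000121410 | NM_130786          | refseq_accession |
|  2 | ENSG00000121410 | 5                  | hgnc_id          |
|  3 | ENSG00000121410 | 69417              | rgd_id           |
|  4 | ENSG00000121410 | 138670             | omim_id          |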


In [8]:
# List of alias types that are not identifiers or accessions:
aliases = [
    'symbol',
    'name',
    'alias_symbol',
    'prev_name',
    'prev_symbol',
    'alias_name',
    'description',
    'symbol_from_nomenclature_authority',
    'full_name_from_nomenclature_authority',
    'other_designations'
]

a_types  = (
    tm_df.loc[(tm_df.Ensembl == 'ENSG00000142208') &
         (tm_df.alias_type.isin(aliases))]
    .groupby(['Ensembl','alias'])
    .apply(lambda x: x.alias_type.tolist())
)

a_types.name = 'alias_types'
a_types_df = pd.DataFrame(a_types)


In [12]:
print(a_types_df.reset_index().to_markdown(index=False))

| Ensembl         | alias                                              | alias_types                                                      |
|:----------------|:---------------------------------------------------|:-----------------------------------------------------------------|
| ENSG00000142208 | AKT                                                | ['alias_symbol']                                                 |
| ENSG00000142208 | AKT serine/threonine kinase 1                      | ['name', 'description', 'full_name_from_nomenclature_authority'] |
| ENSG00000142208 | AKT1                                               | ['symbol', 'symbol_from_nomenclature_authority']                 |
| ENSG00000142208 | AKT1m                                              | ['other_designations']                                           |
| ENSG00000142208 | PKB                                                | ['alias_symbol']                                                 |
| ENSG00000142208 | 

In [31]:
ambiguous_hgnc = (
    tm_df
    .loc[tm_df.alias_type == 'hgnc_id',['Ensembl','alias']]
    .drop_duplicates()
    .groupby('alias')
    .agg(count=('Ensembl', 'count'), ensemblIds =('Ensembl',lambda x: x.tolist()))
    .query('count > 1')
    .reset_index()
)

In [39]:
(
    tm_df
    .loc[tm_df.alias_type == 'entrez_id',['Ensembl','alias']]
    .drop_duplicates()
    .groupby('alias')
    .agg(count=('Ensembl', 'count'), ensemblIds =('Ensembl',lambda x: x.tolist()))
    .query('count > 1')
    .reset_index()
)

|    | alias     | count | ensemblIds                         |
|---:|:----------|------:|:-----------------------------------|
|  0 | 100506736 |     2 | [ENSG00000205045, ENSG00000286065] |
|  1 | 100652739 |     2 | [ENSG00000223701, ENSG00000268592] |
|  2 | 101927815 |     2 | [ENSG00000254319, ENSG00000277526] |
|  3 | 11046     |     2 | [ENSG00000130958, ENSG00000285269] |
|  4 | 11068     |     2 | [ENSG00000114395, ENSG00000271858] |
|  5 | 1124      |     2 | [ENSG00000106069, ENSG00000285162] |
|  6 | 114794    |     2 | [ENSG00000166897, ENSG00000243902] |
|  7 | 1201      |     2 | [ENSG00000188603, ENSG00000261832] |
|  8 | 132989    |     2 | [ENSG00000163633, ENSG00000285458] |
|  9 | 135250    |     2 | [ENSG00000164520, ENSG00000285991] |


In [40]:
ambiguous_label = (
    tm_df
    .loc[tm_df.alias_type.isin(aliases),['Ensembl','alias']]
    .drop_duplicates()
    .groupby('alias')
    .agg(count=('Ensembl', 'count'), ensemblIds =('Ensembl',lambda x: x.tolist()))
    .query('count > 1')
    .reset_index()
)

len(ambiguous_label)

4437

In [43]:
ambiguous_label.loc[ambiguous_label['count'] > 2]

|      | alias                                              | count | ensemblIds                                         |
|-----:|:---------------------------------------------------|------:|:---------------------------------------------------|
|    2 | -                                                  | 12647 | [ENSG00000229129, ENSG00000236136, ENSG0000026...  |
|    3 | 1,4-alpha-D-glucan glucanohydrolase 1              |     3 | [ENSG00000237763, ENSG00000174876, ENSG0000018...  |
|    4 | 1,4-cineole 2-exo-monooxygenase                    |     3 | [ENSG00000255974, ENSG00000197408, ENSG0000016...  |
|    6 | 1-acylglycerophosphocholine O-acyltransferase      |     5 | [ENSG00000111684, ENSG00000087253, ENSG0000015...  |
|    7 | 1-acylglycerophosphoethanolamine O-acyltransfe...  |     3 | [ENSG00000111684, ENSG00000143797, ENSG0000017...  |
|  ... | ...                                                |   ... | ...                                                |
| 4386 | zinc finger protein 114 pseudogene                 |     5 | [ENSG00000278720, ENSG00000234837, ENSG0000022...  |
| 4392 | zinc finger protein 20                             |     3 | [ENSG00000132010, ENSG00000197961, ENSG0000018...  |
| 4409 | zinc finger protein 479 pseudogene                 |     3 | [ENSG00000197990, ENSG00000241149, ENSG0000018...  |
| 4422 | zinc finger protein 839 pseudogene                 |     3 | [ENSG00000205663, ENSG00000234449, ENSG0000028...  |
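
The first row above shows that the `-` placeholder is counted as a label shared by 12647 genes. A possible refinement, sketched here on a toy frame, is to drop the placeholder before looking for ambiguous labels; the frame `tm` is a stand-in for `tm_df`, and counting with `nunique` instead of `count` is a choice made for this example.

```python
import pandas as pd

# Toy stand-in for tm_df, restricted to the two columns that matter here.
tm = pd.DataFrame({
    'Ensembl': ['ENSG00000164741', 'ENSG00000008226', 'ENSG00000088986',
                'ENSG00000141736', 'ENSG00000229129', 'ENSG00000236136'],
    'alias': ['DLC1', 'DLC1', 'DLC1', 'HER2', '-', '-'],
})

ambiguous = (
    tm
    .loc[tm.alias != '-']   # drop the '-' placeholder first
    .drop_duplicates()
    .groupby('alias')
    .agg(count=('Ensembl', 'nunique'),
         ensemblIds=('Ensembl', lambda x: x.tolist()))
    .query('count > 1')     # keep labels that point to several genes
    .reset_index()
)

print(ambiguous)
```

Only `DLC1` remains in the result: `HER2` maps to a single gene and the placeholder rows are filtered out before grouping.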
