# Instructions

Set `my_genes_file` to the path of your gene list, a plain-text file with one gene ID per line. Then run the entire notebook; the ID mapping file is written alongside the input file (the input path with a suffix appended).
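
As an illustration of the expected input format, the sketch below writes a small file with one ID per line to a temporary location and reads it back the same way this notebook does. The IDs are placeholders, and `example_file`, `example_table`, and `example_ids` are illustrative names that are not used elsewhere in the notebook.

```r
# Write a throwaway input file: one (placeholder) ID per line, no header.
example_file <- tempfile(fileext = ".txt")
writeLines(c("ID_0001", "ID_0002", "ID_0003"), example_file)

# Read it back as a one-column table; the column is named V1 by default.
example_table <- read.table(example_file, header = FALSE, stringsAsFactors = FALSE)
example_ids <- example_table[["V1"]]
print(example_ids)
```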

## Important note
* Only 3 external databases are allowed for mapping at once
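
One way to stay under this limit is to split a longer attribute list into batches of at most three and run one query per batch. The sketch below shows only the splitting step in base R. It assumes every attribute listed counts as an external database; `external_attributes`, `batch_size`, and `attribute_batches` are illustrative names.

```r
# Hypothetical attribute list; assume each entry counts as an external database.
external_attributes <- c("entrezgene", "refseq_peptide", "uniprot_swissprot", "hpa", "pdb")
batch_size <- 3

# Assign each attribute to a batch number (1, 1, 1, 2, 2) and split on it.
batch_index <- ceiling(seq_along(external_attributes) / batch_size)
attribute_batches <- split(external_attributes, batch_index)
print(attribute_batches)
```

Each element of `attribute_batches` could then be combined with the input ID attribute and passed to a separate query.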

## Example

```
my_genes_file = "160426-recon3_enst.txt"
my_selected_outputs = c('external_gene_name',
                        'ensembl_transcript_id',
                        'ensembl_peptide_id',
                        'entrezgene',
                        'refseq_peptide',
                        'uniprot_swissprot'
                       )
```

## Other useful outputs

See http://www.ensembl.org/biomart/martview for all possible attributes.

```
'ensembl_gene_id', # ENSG IDs
'hpa', # Human Protein Atlas
'pdb', # PDB
```

## Caveats

* Why don't all my genes map to all other IDs?
    * See: http://pre.ensembl.org/Help/Faq?id=476
    * Example: if you map transcript IDs to both gene and protein IDs, only the transcript IDs that successfully map to protein IDs are returned.
    * To make sure everything maps, request only gene-related IDs or only protein-related IDs in a single query.
    * See the "Unsuccessful mappings" section at the bottom of this notebook for IDs that fail to map.
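
The advice above can be followed by running one query for gene-level IDs and one for protein-level IDs, then joining the two results on the input ID. The sketch below uses small made-up data frames in place of real `getBM` results (the IDs are placeholders, not real Ensembl identifiers). `merge(..., all.x = TRUE)` keeps transcripts that have no protein mapping and fills the gap with `NA`.

```r
# Made-up stand-ins for two separate query results, keyed by transcript ID.
gene_level <- data.frame(
  ensembl_transcript_id = c("T1", "T2", "T3"),
  external_gene_name    = c("GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)
protein_level <- data.frame(
  ensembl_transcript_id = c("T1", "T3"),
  ensembl_peptide_id    = c("P1", "P3"),
  stringsAsFactors = FALSE
)

# Left join: every transcript from gene_level is kept, mapped to a protein or not.
combined <- merge(gene_level, protein_level, by = "ensembl_transcript_id", all.x = TRUE)
print(combined)
```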

# Input

In [65]:
my_genes_file = "/home/nathan/projects/iAB-RBC-283/data_frames/160520-ensp_homology.txt"
my_gene_id_type = 'ensembl_peptide_id'
my_selected_outputs = c(#'entrezgene', 
#                         'external_gene_name',
                        'ensembl_peptide_id',
                        'refseq_peptide'
                        #'uniprot_swissprot',
                        #'uniprot_sptrembl'
                       )

# Code

In [3]:
# Loading biomaRt
options(useHTTPS=FALSE)
library("biomaRt")

# date format is yymmdd (e.g. 160428 = 2016-04-28)
short_date = format(Sys.time(), "%y%m%d")

In [4]:
hs_genes = useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="www.ensembl.org")

In [66]:
my_genes_table = read.table(my_genes_file, header=FALSE)
my_genes = my_genes_table[['V1']]

In [69]:
my_genes_mapped <- getBM(attributes = my_selected_outputs,
                         filters = my_gene_id_type,
                         values = my_genes,
                         mart = hs_genes)

In [70]:
write.csv(my_genes_mapped, file=paste(my_genes_file,'.ensp-refp.out',sep=''))

# Unsuccessful mappings

Note that IDs with no mapping do not appear in the output file.

In [24]:
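# Input IDs that returned no rows. Compare against the column that holds the
# input ID type, which must therefore be listed in my_selected_outputs.
setdiff(my_genes, my_genes_mapped[[my_gene_id_type]])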
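
To see how `setdiff` reports unmapped IDs, here is a minimal sketch on made-up data. It assumes the mapped table contains a column holding the input ID type (here `ensembl_peptide_id`); the IDs and the names `input_ids`, `mapped`, and `unmapped` are placeholders.

```r
# Placeholder input IDs and a made-up mapping result.
input_ids <- c("P1", "P2", "P3", "P4")
mapped <- data.frame(
  ensembl_peptide_id = c("P1", "P3", "P3"),   # P3 maps to two RefSeq entries
  refseq_peptide     = c("R1", "R3a", "R3b"),
  stringsAsFactors = FALSE
)

# IDs present in the input but absent from the mapped column.
unmapped <- setdiff(input_ids, mapped$ensembl_peptide_id)
print(unmapped)
```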