# Get Ensembl Gene List

This notebook demonstrates how to get a list of all human genes in the Ensembl database using the `pyensembl` package.

In [1]:
import pandas as pd
from pyensembl import EnsemblRelease

ENSEMBL_RELEASE = 97

# release 97 uses human reference genome GRCh38
data = EnsemblRelease(ENSEMBL_RELEASE)

Retrieve all gene records and count them.

In [2]:
human_genes = data.genes()
len(human_genes)

60617

Inspect the first record to see what information is stored for each gene.

In [3]:
human_genes[0]

Gene(gene_id='ENSG00000000003', gene_name='TSPAN6', biotype='protein_coding', contig='X', start=100627109, end=100639991, strand='-', genome='GRCh38')
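
As a small illustration of working with these fields, the sketch below computes a gene's genomic span from `start` and `end`. `GeneRecord` and `gene_length` are hypothetical names introduced here; `GeneRecord` is only a stand-in that mirrors the fields printed above, not the `pyensembl` `Gene` class. The `+ 1` assumes the coordinates are 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical stand-in mirroring the fields printed above."""
    gene_id: str
    gene_name: str
    biotype: str
    contig: str
    start: int
    end: int
    strand: str


def gene_length(gene: GeneRecord) -> int:
    # Assumption: coordinates are 1-based and inclusive, hence the +1.
    return gene.end - gene.start + 1


# Values copied from the record shown above.
tspan6 = GeneRecord(
    "ENSG00000000003", "TSPAN6", "protein_coding",
    "X", 100627109, 100639991, "-",
)
print(gene_length(tspan6))
```

Under that assumption, the TSPAN6 record spans 12883 bases. The same function should work on any object exposing integer `start` and `end` attributes.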

Reformat the `human_genes` list into a pandas `DataFrame`.

In [4]:
# Flatten each Gene object into a tuple of the fields to keep.
human_genes_tuples = [(x.gene_id, x.gene_name, x.biotype, x.contig, x.start, x.end, x.strand) for x in human_genes]
human_genes_table = pd.DataFrame.from_records(human_genes_tuples, columns=["id", "symbol", "biotype", "chr", "start", "end", "strand"])
# Sanity check: every gene must start at or before its end coordinate.
assert (human_genes_table["start"] <= human_genes_table["end"]).all()

human_genes_table.head()

| | id | symbol | biotype | chr | start | end | strand |
|---|---|---|---|---|---|---|---|
| 0 | ENSG00000000003 | TSPAN6 | protein_coding | X | 100627109 | 100639991 | - |
| 1 | ENSG00000000005 | TNMD | protein_coding | X | 100584936 | 100599885 | + |
| 2 | ENSG00000000419 | DPM1 | protein_coding | 20 | 50934867 | 50958555 | - |
| 3 | ENSG00000000457 | SCYL3 | protein_coding | 1 | 169849631 | 169894267 | - |
| 4 | ENSG00000000460 | C1orf112 | protein_coding | 1 | 169662007 | 169854080 | + |
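
Once the genes are in a table, ordinary pandas operations can be used to summarize or filter them. The sketch below rebuilds only the five rows shown above so that it runs without `pyensembl`; the variable names (`sample`, `per_chr`, `chr1_minus`, `by_symbol`) are introduced here for illustration. It assumes chromosome names are stored as strings, as in the output above.

```python
import pandas as pd

# The five rows shown above, copied by hand so the example is self-contained.
rows = [
    ("ENSG00000000003", "TSPAN6", "protein_coding", "X", 100627109, 100639991, "-"),
    ("ENSG00000000005", "TNMD", "protein_coding", "X", 100584936, 100599885, "+"),
    ("ENSG00000000419", "DPM1", "protein_coding", "20", 50934867, 50958555, "-"),
    ("ENSG00000000457", "SCYL3", "protein_coding", "1", 169849631, 169894267, "-"),
    ("ENSG00000000460", "C1orf112", "protein_coding", "1", 169662007, 169854080, "+"),
]
columns = ["id", "symbol", "biotype", "chr", "start", "end", "strand"]
sample = pd.DataFrame.from_records(rows, columns=columns)

# Number of genes per chromosome.
per_chr = sample["chr"].value_counts()

# Genes on the reverse strand of chromosome 1 (note: "1" is a string).
chr1_minus = sample[(sample["chr"] == "1") & (sample["strand"] == "-")]

# Index by gene symbol for quick lookups of a single gene.
by_symbol = sample.set_index("symbol")
print(by_symbol.loc["DPM1", "id"])
```

The same expressions should apply unchanged to the full `human_genes_table`, since it uses the same column names.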
