2018-Challenge

Rethinking Genome Annotation

Our imagination of the genome has matured tremendously. Your challenge is to create a data-driven visual representation of our current understanding of the human genome, 20 years after its initial publication. Do this using either the data we have curated for you from human chromosome 20 or human genetic material of your choice of similar magnitude.

Background

What does it mean to be human? Nearly 20 years since the conclusion of the human genome project, we know a lot more about gene function than we once did. For tens of thousands of genes, this amounts to a huge volume of information. How can we improve on how we associate the existing information for a gene? Consider a gene's sequence, function, regulation, pathways it participates in, and its associated protein(s).

Gene annotation and prediction is the process of predicting a gene's function or family from its nucleotide or amino acid sequence. It is one of the most important steps in studying the metabolism, phylogeny, and overall genomic properties of a sequenced species.

The first two draft sequences of the human genome were published in February of 2001. Three years from now will mark the twentieth anniversary of this accomplishment that like no other has shaped the landscape of bioinformatics, computational biology and molecular medicine. In 2001, Celera - a private company founded three years earlier to commercialize genome information - published an iconic poster summarizing their version of the genome. It is still fascinating today. This poster is significant, not so much for its interpretable content, but for the unique perspective it gives us on the entirety of information that constitutes our molecular identity. The details are rich and, in fact, surprisingly "modern": features like CpG islands and SNP density, and exon transcripts colour coded by Gene Ontology functional category, are shown for the forward and reverse strands, accurately plotted on the nucleotide backbone at about 500 kb per centimetre. This was computed from GFF records with Josep Abril's gff2ps software.

But we know so much more today. While the Celera map showed us the genome of one Caucasian male, the number of sequenced genomes has exploded - we envisioned the 1,000 genomes project (2008, completed 2012), quickly set our sights on 100,000 genomes (2012, almost completed), and as of today more than 500,000 human genomes have been sequenced overall. We have sequenced cancers and genetic diseases. We have sequenced representatives of virtually all ethnicities on the planet. We have even sequenced Neanderthals and Denisovans, and we have sequenced other species far and wide to acquire a sense of where we humans fit into the landscape of evolution. We have annotated the contents of the genome in the ENCODE project. We have built databases that carefully dissect all proteins into their domains, such as InterPro. We have started to outline how things work together in functional networks such as the STRING data, or in modules as published by KEGG, and we are beginning to translate our insights into actionable information for medicine, at the OICR, at Sick Kids' TCAG.

Deliverables: What you'll be judged on

Prototype of your visualization
- Should be clearly data driven - or the path to making it data driven should be clear
Code
- Quality and structure
Documentation
- Should include your sources
- Architecture - get from data to visualization in a sensible way
Presentation
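
One way to read the "data to visualization" item above is as a small pipeline: load gene coordinates, map base-pair positions to screen positions, and emit a drawing. The sketch below is only an illustration of that idea and is not the repository's starter code. The names `scale` and `genes_to_svg`, the gene names and coordinates, and the round chromosome length are all hypothetical placeholders.

```python
from xml.sax.saxutils import escape


def scale(pos, chrom_len, width):
    """Map a base-pair position to an x coordinate in pixels."""
    return pos / chrom_len * width


def genes_to_svg(genes, chrom_len, width=800, height=60):
    """Draw each gene as a rectangle on a horizontal chromosome backbone."""
    mid = height // 2
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f'<line x1="0" y1="{mid}" x2="{width}" y2="{mid}" stroke="black"/>',
    ]
    for name, start, end in genes:
        x = scale(start, chrom_len, width)
        # Enforce a minimum width so short genes stay visible at this scale.
        w = max(scale(end, chrom_len, width) - x, 1.0)
        parts.append(
            f'<rect x="{x:.1f}" y="{mid - 10}" width="{w:.1f}" height="20" fill="steelblue">'
            f"<title>{escape(name)}</title></rect>"
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical input: (name, start, end) tuples with made-up coordinates.
GENES = [("GENE_A", 1_000_000, 1_500_000), ("GENE_B", 30_000_000, 30_200_000)]
CHROM_LEN = 64_000_000  # assumed round figure for illustration, not a reference value

svg = genes_to_svg(GENES, CHROM_LEN)
```

Writing `svg` to a file with an `.svg` extension gives something a browser can open. Keeping the coordinate mapping in its own function makes it easy to swap a linear layout for a circular one later without touching the data loading.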

Data

We have cleaned up the "wild" data for chromosome 20 for you to use. Download our txt files and you'll find they contain (tab delimited) the HUGO gene symbol and IDs that allow you to find annotations for each gene via crossRef, InterPro domains, STRING DB, GO annotations, etc. We'll also provide sample scripts showing how the data was prepared, which you can adapt.
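
As a rough illustration of working with tab-delimited files like these, the sketch below reads rows into dictionaries and indexes them by gene symbol. It assumes the file has a header row, which has not been checked against the actual data files. The column names `sym`, `start`, and `end`, the helper names `read_tsv` and `index_by`, and the sample rows are hypothetical.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row) for row in reader]


def index_by(rows, key):
    """Index rows by the value in column `key`, skipping rows where it is empty."""
    return {row[key]: row for row in rows if row.get(key)}


# Made-up sample standing in for a real data file; column names are assumptions.
SAMPLE = "sym\tstart\tend\nGENE_A\t1000\t5000\nGENE_B\t8000\t9500\n"

rows = read_tsv(io.StringIO(SAMPLE))
genes = index_by(rows, "sym")
print(genes["GENE_A"]["start"])
```

For a file on disk, pass an open handle instead of the in-memory sample, for example `open(path, newline="")`. All values come back as strings, so coordinates need an explicit `int()` conversion before plotting.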

Starter code

We will provide some starter code to give you some direction. You don't need to use it! Supported languages are R, Python, and JavaScript. Find these on our GitHub.

tad-cloud
.gitignore
Chr20FuncIntx.tsv
Chr20GOslimData.tsv
Chr20GWAStraits.tsv
Chr20GeneData.tsv
README-DATA
README.md
chr20_data.tsv
circosDraw-cloudMockup.svg
circosDraw.svg
circosStarter.py
dank_TADs_boi.txt
genes_by_TAD.txt
genomePlotBasic.R
genomePlotDemo.R
genomePlotFunctions.R
genomePlotFunctions.py
genomePlotIntermediate.R
goddamned_GO_data.txt
linDraw.svg
linearStarter.py
mart_export.txt
new_chr20_data.tsv
new_tad_positions.txt
prepCh20Data.py
prepGeneData.py
prepTADData.py
prepareGenomeData.R
setUp.py
setUpCh.py
suggestions.txt
svgCols.txt
tad_GO.rar
tads.txt
wordGenerator.py
