# Insulin gene assembly challenge

Student name: 

## Introduction

One of the classic stories in early molecular genetics is the race to clone and sequence the human insulin gene. Multiple teams with different strategies raced toward the answer, since the winner was sure to secure fame and fortune with this important piece of genetic information in hand. It really is a great story, and you can read more about it [here](https://www.gene.com/stories/cloning-insulin) if you are interested.

Today we will undertake our own race to "clone" insulin, but with a modern bioinformatics twist.

One of the common challenges in computational biology is genome assembly. We don't yet have the technology to sequence entire genomes in one shot (although some people are [trying to get there](https://en.wikipedia.org/wiki/Nanopore_sequencing)), so we have to settle for breaking the genome into very small fragments, sequencing those fragments, and then trying to put everything back together using algorithms.

Below is an idealized diagram of how this process works. It is often called "shotgun" sequencing because the genome is broken up randomly into many very small fragments.

![](resources/Shotgun_sequencing_lg.jpg)
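The fragmentation step in the diagram can be mimicked in a few lines of Python. This is only a rough sketch: `simulate_reads`, its parameters, and `toy_genome` are made-up names for illustration, and it is not how the reads for this challenge were generated.

```python
import random

def simulate_reads(sequence, n_reads=50, min_len=4, max_len=8, seed=0):
    """Cut random-length substrings out of `sequence` to mimic shotgun reads.

    Assumes len(sequence) >= max_len.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    reads = []
    for _ in range(n_reads):
        length = rng.randint(min_len, max_len)            # random read length
        start = rng.randint(0, len(sequence) - length)    # random start position
        reads.append(sequence[start:start + length])
    return reads

# A made-up toy sequence, not real insulin data
toy_genome = "GATTACAGGCTTACCGATAGCTTAGGCATCGA"
toy_reads = simulate_reads(toy_genome)
print(toy_reads[:5])
```

Because start positions are random, some reads will be duplicates and some regions may be covered more heavily than others.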

At the most basic level, these small fragments are put back together by identifying overlapping sequences present in two or more fragments and then merging these into larger fragments. This process is repeated again and again to form larger and larger pieces of the genome called contigs.
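As a rough illustration of a single merge step, here is a minimal sketch that looks for the longest suffix of one fragment that matches a prefix of another. The function names and the `min_overlap` parameter are hypothetical, and this is just one possible approach, not the expected solution.

```python
def overlap_length(left, right, min_overlap=1):
    """Return the length of the longest suffix of `left` that equals a prefix of `right`."""
    max_possible = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches
    for size in range(max_possible, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0

def merge(left, right, min_overlap=1):
    """Join `left` and `right` at their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]  # keep the shared bases only once

print(merge("ATGGC", "GGCTA"))  # ATGGCTA
print(merge("AAAA", "CCCC"))    # None
```

Note that this only checks one orientation (`left` followed by `right`). A short `min_overlap` makes chance matches more likely, so the threshold is worth experimenting with.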

## The challenge

Today, it is up to you to assemble the insulin gene using Python. Remember, fame and fortune are on the line. The winning group takes all. 

Your deliverable is a string with your best guess at the insulin gene sequence based on the reads you are provided. The group whose assembled sequence is closest to the actual (reference) sequence will be crowned the winner and be awarded the patent rights for insulin production.
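How "closest" is measured is not specified here. If you want a rough way to compare two candidate assemblies against each other, one option is the similarity ratio from Python's built-in `difflib` module. The `similarity` helper and the example strings below are assumptions for illustration only, not the official scoring method.

```python
from difflib import SequenceMatcher

def similarity(candidate, reference):
    """Return a score between 0 and 1, where 1.0 means the strings are identical."""
    # autojunk=False turns off a heuristic that can skip very common characters
    # in long sequences, which matters for a four-letter DNA alphabet
    return SequenceMatcher(None, candidate, reference, autojunk=False).ratio()

print(similarity("ACGT", "ACGT"))  # 1.0
print(similarity("ACGT", "ACGA"))  # 0.75
```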

## Rules

1. You may only use the reads provided to you as your input data
2. You may *not* install any additional packages or libraries
3. You may use Google and online resources to help you but the final code must be your own
4. This notebook must produce your final answer

### About the dataset

- All reads come directly from the insulin gene and do not contain errors
- The insulin gene is fragmented randomly, so reads may be of different lengths
- Reads may not be unique (duplicates of the same sequence may exist)
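Since duplicates carry no new information, one optional preprocessing step is to drop them, along with reads that sit entirely inside a longer read. The sketch below uses hypothetical helper names and a made-up toy list. Whether this cleanup helps depends on your assembly strategy.

```python
def unique_reads(reads):
    """Remove exact duplicates while keeping the original order."""
    return list(dict.fromkeys(reads))

def drop_contained(reads):
    """Remove duplicates and reads that appear entirely inside another read."""
    unique = unique_reads(reads)
    return [
        read for read in unique
        if not any(read != other and read in other for other in unique)
    ]

example_reads = ["ACGT", "CG", "ACGT", "GTTA"]
print(unique_reads(example_reads))    # ['ACGT', 'CG', 'GTTA']
print(drop_contained(example_reads))  # ['ACGT', 'GTTA']
```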

## Getting started

Below I have provided some code to help you get going a bit faster. All it does is read the simulated reads into a list of strings. If you would like to check out the code yourself, it is located in `resources/resources.py`.

In [7]:
from resources import resources

reads = resources.READS

# Print the first two reads as a quick sanity check
for i in range(2):
    print(f'Read {i}: {reads[i]}')

Read 0: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Read 1: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA


Additionally, here is some pseudocode to help you start thinking about your approach. Once you have an idea in mind, I recommend testing it on some example data first and then moving on to implementing it on the full data set.

### Pseudocode example

```
assembled_gene = reads[0]

for read in reads[1:]:
    if assembled_gene overlaps with read:
        # append only the part of the read that is not already in assembled_gene
        assembled_gene = assembled_gene + non-overlapping part of read
```

Some questions to ask yourself:
- How will I locate overlaps between reads?
- How will I combine reads that have overlaps?
- How will I deal with non-overlapping reads?
- Which part of a read needs to overlap in order to combine?

### Test example

Here's how I might start testing my code. First, take a phrase or your name and break it down into small overlapping parts.

In [None]:
name = 'EthanHolleman'

ethan_reads = []

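# Slide a 4-character window along the name.
# The "+ 1" makes sure the final window ('eman') is included.
for i in range(len(name) - 4 + 1):
    ethan_reads.append(name[i:i+4])
ethan_reads

['Etha', 'than', 'hanH', 'anHo', 'nHol', 'Holl', 'olle', 'llem', 'lema', 'eman']

Now shuffle these up and try to put them back together. Your output will differ, since the order is random.

In [None]:
import random

random.shuffle(ethan_reads)
ethan_reads

['nHol', 'hanH', 'anHo', 'llem', 'eman', 'olle', 'Etha', 'Holl', 'lema', 'than']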
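Once you have a candidate assembly for a test like this, a quick sanity check is to confirm that every read appears somewhere inside it. The helpers below are a hypothetical sketch, shown on a handful of the shuffled reads. Passing this check is necessary but does not prove the assembly is correct.

```python
def missing_reads(candidate, reads):
    """Return the reads that do not appear anywhere in the candidate assembly."""
    return [read for read in reads if read not in candidate]

def covers_all_reads(candidate, reads):
    """True if every read is a substring of the candidate assembly."""
    return not missing_reads(candidate, reads)

some_reads = ['nHol', 'hanH', 'Etha', 'olle', 'lema']

print(covers_all_reads('EthanHolleman', some_reads))  # True
print(missing_reads('EthanHoll', some_reads))         # ['olle', 'lema']
```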

## Workspace

Use the space beyond this cell in the notebook to prototype and work on your solution. Remember, run your code often to make sure any changes you make don't break things!

### Saving your work

If you are using this notebook within a Binder, your work will not be saved after you close the Binder. To avoid losing any of your work, make sure you save by exporting the notebook to your local machine. Please ask if you have any questions about this.