# Lecture 2 - DNA sequencing and assembly

In this tutorial, we will use DNA sequencing data for the pathogen *Mycoplasma pneumoniae*.

The data was downloaded from the European Nucleotide Archive, run accession [DRR040043](https://www.ebi.ac.uk/ena/browser/view/DRR040043?show=reads).

## Exercise 1 - Assembly with SPAdes

If you open the project description, you will see that this is a paired-end library sequenced with Illumina MiSeq, and that there are two associated FASTQ files available for download. Begin by downloading these two files.

> Alternatively, you can also find those files in this repository under `files/reads/`.

Now try one of the two following options to assemble your reads:

- Using Galaxy -> go to [this notebook](./SPADES_galaxy.ipynb)
- Using the command line -> go to [this notebook](./SPADES_terminal.ipynb)

## Exercise 2 - Biopython

If everything went well in the previous step, you will have your genome assembled into contigs/scaffolds in the form of FASTA files. 

> If not, don't worry, you can find expected files under `files/assembled/`.

You will now learn to use the [Biopython](https://biopython.org/) library to work with different kinds of files.

> Please check the documentation of the [SeqIO](https://biopython.org/wiki/SeqIO) module that we will be using today.

## 2.1 Reading FASTQ files

Let's start by loading one of the FASTQ files. 

> Note that we need to use the `gzip` module because we are reading a compressed file.

In [None]:
from Bio import SeqIO
import gzip

with gzip.open('files/reads/DRR040043_1.fastq.gz', 'rt') as handle:
    reads = list(SeqIO.parse(handle, 'fastq'))

> **Note**: Depending on how much RAM you have available, the code above may run out of memory and crash your Python kernel. If that happens, run the code below to load just the first 1000 reads into memory:

In [None]:
from Bio import SeqIO
import gzip

with gzip.open('files/reads/DRR040043_1.fastq.gz', 'rt') as handle:
    reads = []
    for i, read in enumerate(SeqIO.parse(handle, 'fastq')):
        if i == 1000:
            break
        reads.append(read)

**Q:** How many reads are in the file?

In [None]:
# type your code here...

Solution (click to expand):

In [None]:
len(reads)

**Q:** What is the sequence of the first read?

In [None]:
# type your code here...

Solution (click to expand):

In [None]:
read = reads[0]
print(read.seq)

**Q:** What is the average length of the reads?

In [None]:
# type your code here...

Solution (click to expand):

In [None]:
lengths = [len(read) for read in reads]
avg_length = sum(lengths) / len(lengths)
min_length = min(lengths)
max_length = max(lengths)

print(f'min: {min_length} mean: {avg_length} max: {max_length}')
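
The solution above needs the whole `reads` list in memory. If that is a problem, the same statistics can be accumulated while iterating over the parser, one record at a time. The following is a minimal sketch of that idea: the function name `stream_read_stats` and the toy FASTQ data are made up for illustration and are not part of the tutorial files.

```python
from io import StringIO

from Bio import SeqIO

# Toy FASTQ data (hypothetical reads) standing in for the real file
toy_fastq = (
    "@read1\nACGTAC\n+\nIIIIII\n"
    "@read2\nACGT\n+\nIIII\n"
    "@read3\nAC\n+\nII\n"
)


def stream_read_stats(handle):
    """Return (number of reads, mean read length) without storing the reads."""
    n_reads = 0
    total_length = 0
    # SeqIO.parse yields one record at a time, so memory use stays small
    for record in SeqIO.parse(handle, "fastq"):
        n_reads += 1
        total_length += len(record)
    mean_length = total_length / n_reads if n_reads else 0.0
    return n_reads, mean_length


n_reads, mean_length = stream_read_stats(StringIO(toy_fastq))
print(f"reads: {n_reads} mean length: {mean_length}")
```

To try this on the real data, pass the handle returned by `gzip.open('files/reads/DRR040043_1.fastq.gz', 'rt')` instead of the `StringIO` object.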

## 2.2 Reading FASTA files

Now let's read the FASTA file with the assembled genome and check how many contigs we have:

In [None]:
contigs = list(SeqIO.parse('files/assembled/contigs.fasta', 'fasta'))

print('final number of contigs:', len(contigs))

**Exercise:** Print the length of all the contigs.

In [None]:
# type your code here...

Solution (click to expand):

In [None]:
lengths = [len(x) for x in contigs]
print(lengths)

Now let's look at the length distribution using a more convenient visualization:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

lengths = [len(x) for x in contigs]
labels = [f'{x/1000:.1f} kb' if x > 10000 else '' for x in lengths]
plt.pie(lengths, radius=2, wedgeprops={'width':1}, startangle=90, labels=labels);

You can see that, in fact, more than 90% of the assembled genome is composed of contigs longer than 10 kb.
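
The fraction read off the chart can also be checked numerically. The sketch below computes the share of assembled bases that falls in contigs above a length threshold. The helper `fraction_in_long_contigs` and the list `toy_lengths` are hypothetical, chosen only for illustration; with the real assembly you would pass the `lengths` list computed earlier.

```python
def fraction_in_long_contigs(lengths, min_length=10_000):
    """Fraction of all assembled bases found in contigs longer than min_length."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    long_total = sum(length for length in lengths if length > min_length)
    return long_total / total


# Made-up contig lengths (in bp), not taken from the real assembly
toy_lengths = [400_000, 250_000, 120_000, 8_000, 5_000, 2_000]

fraction = fraction_in_long_contigs(toy_lengths)
print(f"{fraction:.1%} of assembled bases are in contigs longer than 10 kb")
```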

**Exercise**: Write a new FASTA file that only contains the contigs longer than 10 kb.

In [None]:
# type your code here...

Solution (click to expand):

In [None]:
long_contigs = [contig for contig in contigs if len(contig) > 10000]

SeqIO.write(long_contigs, 'long_contigs.fasta', 'fasta')
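
`SeqIO.write` returns the number of records it wrote, which gives a quick way to confirm that the output file contains what you expect. The sketch below writes a few toy records to a temporary file and reads them back. The record IDs, sequences, and file name are invented for illustration; with the real data you would use `long_contigs` and `'long_contigs.fasta'`.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy contigs (hypothetical IDs and sequences) standing in for long_contigs
toy_contigs = [
    SeqRecord(Seq("ACGT" * 5), id="toy_contig_1", description=""),
    SeqRecord(Seq("GGCC" * 3), id="toy_contig_2", description=""),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_contigs.fasta")

    # SeqIO.write returns how many records were written
    n_written = SeqIO.write(toy_contigs, out_path, "fasta")

    # Read the file back to compare against what we intended to write
    reloaded = list(SeqIO.parse(out_path, "fasta"))

reloaded_ids = [record.id for record in reloaded]
reloaded_lengths = [len(record) for record in reloaded]
print(n_written, reloaded_ids, reloaded_lengths)
```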