# Lecture 2 - DNA sequencing and assembly

In this tutorial, we will use DNA sequencing data for the pathogen *Mycoplasma pneumoniae*.

The data was downloaded from the European Nucleotide Archive, run accession [DRR040043](https://www.ebi.ac.uk/ena/browser/view/DRR040043?show=reads).

## Exercise 1 - Assembly with SPAdes

If you open the project description, you will see that this is a paired-end library sequenced with Illumina MiSeq, and that there are two associated FASTQ files for download. Begin by downloading these two files.

> Alternatively, you can also find those files in this repository under `files/reads/`.
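
Because the library is paired-end, every read in the first file should have a mate in the second, so both files should hold the same number of records. The sketch below is one quick way to check this after downloading. It uses only the standard library and assumes plain four-line FASTQ records (no wrapped sequence lines); the function names are made up for illustration.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def same_read_count(forward_path, reverse_path):
    """Return True when both mate files hold the same number of records."""
    return count_fastq_records(forward_path) == count_fastq_records(reverse_path)
```

For example, `same_read_count('files/reads/DRR040043_1.fastq.gz', 'files/reads/DRR040043_2.fastq.gz')` should return `True` for an intact download. The name of the second file is an assumption, by analogy with the first.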

Now try one of the two following options to assemble your reads:

- Using Galaxy -> go to [this notebook](./SPADES_galaxy.ipynb)
- Using the command line -> go to [this notebook](./SPADES_terminal.ipynb)

## Exercise 2 - Biopython

If everything went well in the previous step, you will have your genome assembled into contigs/scaffolds in the form of FASTA files. 

> If not, don't worry, you can find the expected files under `files/assembled/`.
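
A common first look at an assembly is its N50: the length of the contig at which the running total, taken from the longest contig downwards, reaches half of the assembly size. The helper below is a minimal sketch of that calculation on a plain list of contig lengths; `n50` is a name chosen here, not part of Biopython or SPAdes.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that takes the running total to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    return 0
```

The lengths could be collected with `[len(rec) for rec in SeqIO.parse('files/assembled/contigs.fasta', 'fasta')]`. SPAdes normally writes `contigs.fasta` and `scaffolds.fasta`, but whether the repository uses those exact names under `files/assembled/` is an assumption.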

You will now learn to use the [Biopython](https://biopython.org/) library to work with different kinds of files.

> Please check the documentation of the [SeqIO](https://biopython.org/wiki/SeqIO) module that we will be using today.

### 2.1 Loading FASTQ files

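```python
from Bio import SeqIO
import gzip

# Open the gzipped FASTQ file in text mode and index every record by read ID.
# Note that this keeps all reads in memory at once.
with gzip.open('files/reads/DRR040043_1.fastq.gz', 'rt') as handle:
    reads = SeqIO.parse(handle, 'fastq')
    reads_dict = SeqIO.to_dict(reads)
```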

```python
# Print the first record; each value is a SeqRecord, not a bare sequence
for read_id, record in reads_dict.items():
    print(record)
    break
```

```text
ID: DRR040043.1
Name: DRR040043.1
Description: DRR040043.1 1/1
Number of features: 0
Per letter annotation for: phred_quality
Seq('CTATAGCCGTTTTCCCCATCCTTGGNAAAANTAAAGCGATGGTTAGTTAACTCA...ATT')
```

```python
# Length of every read, in bases
read_sizes = [len(x) for x in reads_dict.values()]
```
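
A list of read lengths is easier to interpret once it is summarized. The sketch below, using only the standard library, is one possible way to do that; `summarize_lengths` is a hypothetical helper name, and it expects a non-empty list of integers such as `read_sizes`.

```python
from collections import Counter
from statistics import mean, median


def summarize_lengths(lengths):
    """Return basic summary statistics for a non-empty list of read lengths."""
    counts = Counter(lengths)
    return {
        "n_reads": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        # (length, number of reads) for the most frequent read length
        "most_common": counts.most_common(1)[0],
    }
```

Calling `summarize_lengths(read_sizes)` then gives a quick overview of how uniform the read lengths are.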

```python
print(list(reads_dict.values())[10])
```

```text
ID: DRR040043.11
Name: DRR040043.11
Description: DRR040043.11 11/1
Number of features: 0
Per letter annotation for: phred_quality
Seq('GGTGATGAAAAGACCTTTGACGGACTTGATTATCTACCTAAAAACATTACCAAG...TTA')
```
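
The `phred_quality` annotation reported for each record holds one integer score per base, where a score Q corresponds to an error probability of 10^(-Q/10). As a rough sketch of how these scores might be summarized, the two helpers below work on a plain list of integers; their names are chosen here for illustration and are not Biopython functions.

```python
def mean_phred(qualities):
    """Arithmetic mean of a non-empty list of Phred scores."""
    return sum(qualities) / len(qualities)


def expected_errors(qualities):
    """Expected number of wrong bases: sum of 10 ** (-Q / 10) over the read."""
    return sum(10 ** (-q / 10) for q in qualities)
```

In Biopython, the scores of a FASTQ record are available as `record.letter_annotations['phred_quality']`, so a call such as `mean_phred(record.letter_annotations['phred_quality'])` can be applied to any value of `reads_dict`.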
