# Genomics Algorithmics project

## Development of a mapping solution over a reference genome for sequencing data

### Nicolas Parisot & Sergio Peignier

## Reading sequencing data with the *Biopython* package

### Complete genome, chromosome by chromosome, of *Plasmodium falciparum*

With *Biopython*, each of the 15 *Plasmodium falciparum* chromosomes is read from the reference FASTA file and stored in a Python list named **chromosomes**, as a `(sequence, identifier)` tuple whose sequence is a Biopython `Seq` object. Mapping operations will be performed on these chromosomes.

In [1]:
from Bio import SeqIO

chromosomes = []

for record in SeqIO.parse("SEQUENCES/P_fal_genome.fna", format="fasta"):
    chromosomes.append((record.seq, record.id))

[(Seq('TGAACCCTaaaacctaaaccctaaaccctaaaccctgaaccctaaaccctgaac...agg'), 'NC_004325.2'), (Seq('aaccctaaaccctaaaccctaaaccctaaaccctaaaccctaaacctaaaccct...TCA'), 'NC_037280.1'), (Seq('TAAACCCTAAATCTCTAAACCCTAAAGCTATACCTAAACCCTGAAGGTTATACC...TCA'), 'NC_000521.4'), (Seq('aaccctaaaccctgaaccctaaaccctaaaccctgaaccctgaaccctaaaccc...tta'), 'NC_004318.2'), (Seq('ctaaaccctgaaccctaaaccctgaaccctaaaccctaaaccctgaaccctaaa...ggt'), 'NC_004326.2'), (Seq('taaaccctaaaccctgaacctaaccctgaaccctaaaccctgaaccctaaaccc...tca'), 'NC_004327.3'), (Seq('aaaccctaaaccctgaaccctgaaccctaaaccctgaaccctaaaccctgaacc...tca'), 'NC_004328.3'), (Seq('aacctaaaccctaaaccctaaaccctgaaccctaaaccctaaaccctgaaccct...agg'), 'NC_004329.3'), (Seq('aaccctgaaccctaaaccctaaaccctaaaccctgaaccctaaaccctaaaccc...gtt'), 'NC_004330.2'), (Seq('taaaccctgaaccctaaaccctgaaccctaaaccactaaccctaaaccctgaac...GTT'), 'NC_037281.1'), (Seq('aaccctaaaccctgaaccctgaaccctgaaccctgaaccctaaaccctgaaccc...TAG'), 'NC_037282.1'), (Seq('ctgaaccctaaaccctaaaccctaaacctaaaccctgaaccctaaac

The number and lengths of the chromosomes are given below:

In [2]:
print(f"Chromosomes number: {len(chromosomes)}")
print("Chromosomes length, from 1 to 15:")
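# Each element of `chromosomes` is a (sequence, identifier) tuple; number them from 1.
for i, (sequence, _identifier) in enumerate(chromosomes, start=1):
    print(f"Length of the {i:>2} chromosome: {len(sequence):<7} pairs of bases")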

Chromosomes number: 15
Chromosomes length, from 1 to 15:
Length of the  1 chromosome: 640851  pairs of bases
Length of the  2 chromosome: 947102  pairs of bases
Length of the  3 chromosome: 1067971 pairs of bases
Length of the  4 chromosome: 1200490 pairs of bases
Length of the  5 chromosome: 1343557 pairs of bases
Length of the  6 chromosome: 1418242 pairs of bases
Length of the  7 chromosome: 1445207 pairs of bases
Length of the  8 chromosome: 1472805 pairs of bases
Length of the  9 chromosome: 1541735 pairs of bases
Length of the 10 chromosome: 1687656 pairs of bases
Length of the 11 chromosome: 2038340 pairs of bases
Length of the 12 chromosome: 2271494 pairs of bases
Length of the 13 chromosome: 2925236 pairs of bases
Length of the 14 chromosome: 3291936 pairs of bases
Length of the 15 chromosome: 34250   pairs of bases


### *Reads* acquired from *P. falciparum* genome sequencing
In the same way, the *reads* are extracted with *Biopython* and stored in a Python list named **reads**. All the *reads* have the same length: 100 nucleotides.

In [3]:
reads = []

for record in SeqIO.parse("SEQUENCES/P_fal_reads.fq", format="fastq"):
    reads.append((record.seq, record.id))

In [4]:
print(f"Number of reads: {len(reads)}")
print(f"Reads length: {len(reads[0][0])}")

Number of reads: 1500000
Reads length: 100
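
The cell above only measures the first read. A minimal sketch of how the "same length" claim could be checked over the whole list is given here; `read_length_distribution` and the toy data are illustrative names, not part of the project, and the function assumes the `(sequence, identifier)` tuple layout used for **reads**.

```python
from collections import Counter


def read_length_distribution(reads):
    """Count how many reads have each length.

    `reads` is assumed to be a list of (sequence, identifier) tuples.
    """
    return Counter(len(sequence) for sequence, _identifier in reads)


# Toy data standing in for the real reads (hypothetical identifiers).
toy_reads = [("ACGT", "read_1"), ("TTGA", "read_2"), ("GGCC", "read_3")]

lengths = read_length_distribution(toy_reads)
all_same_length = len(lengths) == 1  # a single key means every read has the same length
print(lengths, all_same_length)
```

On the real data, calling `read_length_distribution(reads)` should yield a single key if every read really has the same length.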


As we can see, high-throughput sequencing generated 1.5 million *reads*, each 100 nucleotides long. To refine the mapping, the reads will be divided into even smaller fragments, called **k-mers**.
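
As an illustration of this splitting step, here is a small sketch. The function name `split_into_kmers`, the value of `k`, and the choice between non-overlapping and overlapping k-mers are assumptions; the text above does not specify them. Each k-mer is returned with its offset in the read, which a mapper would typically need to place the read back on the genome.

```python
def split_into_kmers(read, k, step=None):
    """Return a list of (offset, k-mer) pairs extracted from `read`.

    By default the k-mers do not overlap (step == k); pass step=1 for
    every overlapping k-mer. Trailing bases that do not fill a whole
    k-mer are dropped. Works on str and on any sliceable sequence.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    step = k if step is None else step
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return [(i, read[i:i + k]) for i in range(0, len(read) - k + 1, step)]


# Toy read; the real reads are 100 nucleotides long.
toy_read = "ACGTACGTAC"
non_overlapping = split_into_kmers(toy_read, k=4)
overlapping = split_into_kmers("ACGTAC", k=4, step=1)
print(non_overlapping)
print(overlapping)
```

With a 100-nucleotide read and a hypothetical `k = 20`, the default setting would give five non-overlapping k-mers at offsets 0, 20, 40, 60 and 80.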

_____________________

# DC3 on genome

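The following commands run DC3, a suffix array construction algorithm, on each chromosome and store the result on disk.
The results are saved in the `.npy` format, one of the best ways to store arrays on disk (https://stackoverflow.com/questions/9619199/best-way-to-preserve-numpy-arrays-on-disk).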

In [10]:
from y_dc3 import dc3
import Chromosome

for i, chromosome in enumerate(chromosomes):
    name = f"P_fal_chromosome_{i+1}"
    chromosome_file = Chromosome.Chromosome(name)
    # The DC3 result is converted with list() before export: a workaround, not ideal.
    chromosome_file.export_dc3_result(list(dc3(chromosome)))
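
To make concrete what is computed and stored, here is a toy sketch. It builds a suffix array with a naive sort (not DC3, which reaches the same result in linear time) and round-trips it through a `.npy` file with plain NumPy. The function `naive_suffix_array` and the file name are hypothetical; the exact output of `dc3` and the way `Chromosome.export_dc3_result` writes its file are not shown here, so this is only an assumption of what they resemble.

```python
import os
import tempfile

import numpy as np


def naive_suffix_array(text):
    """Return the start positions of the suffixes of `text` in sorted order.

    This is a simple reference construction, far too slow for a whole
    chromosome; DC3 computes the same ordering in linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


suffix_array = naive_suffix_array("banana")

# Save and reload the array in .npy format inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_suffix_array.npy")  # hypothetical file name
    np.save(path, np.asarray(suffix_array, dtype=np.int32))
    loaded = np.load(path)

print(suffix_array)
print(loaded.dtype, loaded.tolist())
```

Storing the positions as a fixed-width integer array keeps the file compact; `int32` is enough here because every chromosome listed above is shorter than 2^31 bases.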
