# Genomics Algorithmics project

## Development of a solution for mapping sequencing data onto a reference genome

### Nicolas Parisot & Sergio Peignier

## Reading sequencing data with the *Biopython* package

### Complete genome, chromosome by chromosome, of *Plasmodium falciparum*

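With *Biopython*, each of the 15 *Plasmodium falciparum* chromosomes is read from the reference FASTA file and stored as a `Chromosome` object in the Python list **chromosomes**. The inverse complement of each chromosome is stored in the same way in a second list, **chromosomes_inv**. Mapping operations will be performed on these chromosomes.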

In [1]:
from Bio import SeqIO
from Chromosome import Chromosome
from dc3 import dc3
import bwt
import tools

In [2]:
# Forward strands and their inverse-complement counterparts
chromosomes: list[Chromosome] = []
chromosomes_inv: list[Chromosome] = []

for i, record in enumerate(SeqIO.parse("SEQUENCES/P_fal_genome.fna", format="fasta")):
    # Forward strand, named after its 1-based position in the FASTA file
    name = f"P_fal_chromosome_{i+1}"
    chromo = Chromosome(name, record.seq, record.id)
    chromosomes.append(chromo)
    # Inverse strand, flagged by a minus sign in its name
    name_inversed_comp = f"P_fal_chromosome_-{i+1}"
    inversed_comp_seq = tools.inverse_sequence(record.seq)
    chromo_inv = Chromosome(name_inversed_comp, inversed_comp_seq, record.id)
    chromosomes_inv.append(chromo_inv)
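
The implementation of `tools.inverse_sequence` is not shown in this notebook. Judging from the variable names, it probably returns the reverse complement of the sequence. Under that assumption, a minimal sketch could look like the following, where `reverse_complement` is a hypothetical helper and not the project's function:

```python
# Hypothetical helper: illustrates a reverse complement,
# it is NOT the project's tools.inverse_sequence.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the string.

    Symbols outside A/C/G/T (for example N) are left unchanged.
    """
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCCN"))  # NGGCAT
```

Biopython `Seq` objects also provide a `reverse_complement()` method, which may be preferable when the sequence is already a `Seq`.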

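The number of chromosomes and their lengths are printed below. Lengths are taken from `DNA_dol`, which presumably holds the sequence followed by a `$` terminator; if so, each value is one more than the number of base pairs.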

In [23]:
print(f"Chromosomes number: {len(chromosomes)}")
print("Chromosomes length, from 1 to 15:")
for i, chromo in enumerate(chromosomes, start=1):
    print(f"Length of the chromosome {i:>2} : {len(chromo.DNA_dol):<7} pairs of bases")
for i, chromo in enumerate(chromosomes_inv, start=1):
    print(f"Length of the chromosome -{i:>2} : {len(chromo.DNA_dol):<7} pairs of bases")

Chromosomes number: 15
Chromosomes length, from 1 to 15:
Length of the chromosome  1 : 640852  pairs of bases
Length of the chromosome  2 : 947103  pairs of bases
Length of the chromosome  3 : 1067972 pairs of bases
Length of the chromosome  4 : 1200491 pairs of bases
Length of the chromosome  5 : 1343558 pairs of bases
Length of the chromosome  6 : 1418243 pairs of bases
Length of the chromosome  7 : 1445208 pairs of bases
Length of the chromosome  8 : 1472806 pairs of bases
Length of the chromosome  9 : 1541736 pairs of bases
Length of the chromosome 10 : 1687657 pairs of bases
Length of the chromosome 11 : 2038341 pairs of bases
Length of the chromosome 12 : 2271495 pairs of bases
Length of the chromosome 13 : 2925237 pairs of bases
Length of the chromosome 14 : 3291937 pairs of bases
Length of the chromosome 15 : 34251   pairs of bases
Length of the chromosome - 1 : 640852  pairs of bases
Length of the chromosome - 2 : 947103  pairs of bases
Length of the chromosome - 3 : 1067972 pairs of bases
...

### *Reads* acquired from *P. falciparum* genome sequencing

In the same way, *reads* are extracted with *Biopython* and stored in the Python list **reads** as `(sequence, identifier)` tuples. All the *reads* have the same length: 100 nucleotides.

In [3]:
reads = []

for record in SeqIO.parse("SEQUENCES/P_fal_reads.fq", format="fastq"):
    reads.append((record.seq, record.id))

In [4]:
print(f"Number of reads: {len(reads)}")
print(f"Reads length: {len(reads[0][0])}")

Number of reads: 1500000
Reads length: 100


As we can see, high-throughput sequencing generated 1.5 million *reads*, each 100 nucleotides long. To refine the mapping, the reads will be divided into smaller fragments called **k-mers**.
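
As an illustration of this splitting step, here is a minimal sketch. The function name `split_into_kmers`, the value of `k` and the choice between overlapping and non-overlapping k-mers are all assumptions, since the notebook does not fix them at this point.

```python
from typing import List, Optional, Tuple


def split_into_kmers(read: str, k: int, step: Optional[int] = None) -> List[Tuple[int, str]]:
    """Return (offset, k-mer) pairs taken from `read`.

    With the default step (equal to k) the k-mers do not overlap;
    step=1 yields every overlapping k-mer. A trailing fragment
    shorter than k is dropped.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if step is None:
        step = k
    return [(start, read[start:start + k])
            for start in range(0, len(read) - k + 1, step)]


print(split_into_kmers("ACGTACGTAC", 4))
# [(0, 'ACGT'), (4, 'ACGT')]
```

Keeping the offset of each k-mer inside its read makes it possible to check later that k-mers from the same read map to consistent positions on a chromosome. With these settings, a 100-nucleotide read and `k=20` give 5 non-overlapping k-mers.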

_____________________

# DC3 on genome

The following commands compute the DC3 result (the suffix table) of every chromosome and store it on disk.
Results are saved in `.npy` files, one of the best ways to persist arrays (https://stackoverflow.com/questions/9619199/best-way-to-preserve-numpy-arrays-on-disk).

In [6]:
for chromo in chromosomes+chromosomes_inv:
    print(chromo.file_name)
    if chromo.suffix_table is None:
        dc3result = dc3(chromo.DNA_dol)
        chromo.export_dc3_result(dc3result)

P_fal_chromosome_1
P_fal_chromosome_2
P_fal_chromosome_3
P_fal_chromosome_4
P_fal_chromosome_5
P_fal_chromosome_6
P_fal_chromosome_7
P_fal_chromosome_8
P_fal_chromosome_9
P_fal_chromosome_10
P_fal_chromosome_11
P_fal_chromosome_12
P_fal_chromosome_13
P_fal_chromosome_14
P_fal_chromosome_15
P_fal_chromosome_-1
P_fal_chromosome_-2
P_fal_chromosome_-3
P_fal_chromosome_-4
P_fal_chromosome_-5
P_fal_chromosome_-6
P_fal_chromosome_-7
P_fal_chromosome_-8
P_fal_chromosome_-9
P_fal_chromosome_-10
P_fal_chromosome_-11
P_fal_chromosome_-12
P_fal_chromosome_-13
P_fal_chromosome_-14
P_fal_chromosome_-15
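
To make the DC3 step concrete, here is a minimal sketch on a toy sequence. It assumes that the DC3 result is a suffix array, that is, the start positions of all suffixes in lexicographic order. DC3 is replaced here by a naive sort, which gives the same array but is far too slow for a real chromosome. The name `naive_suffix_array`, the toy sequence and the file name are illustrative only, and whether `export_dc3_result` relies on `np.save` is an assumption.

```python
import os
import tempfile

import numpy as np


def naive_suffix_array(text: str) -> np.ndarray:
    """Sort suffix start positions lexicographically.

    Roughly O(n^2 log n): acceptable for a toy string only.
    """
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return np.array(order, dtype=np.int64)


text = "ACAACG$"  # "$" sorts before A, C, G and T in ASCII
suffix_array = naive_suffix_array(text)
print(suffix_array.tolist())  # [6, 2, 0, 3, 1, 4, 5]

# Round trip through a .npy file, as a stand-in for the export step
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_suffix_array.npy")
    np.save(path, suffix_array)
    reloaded = np.load(path)

print(np.array_equal(suffix_array, reloaded))  # True
```

Caching the array on disk means the expensive construction only has to run once per chromosome.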


The DC3 loop above only builds a suffix table when `chromo.suffix_table` is `None`, that is, the first time the DC3 result is generated.
With every suffix table available, the BWT of each sequence can now be computed.
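
As a reminder of how a BWT can be derived from a suffix table, here is a minimal sketch on a toy sequence. It assumes the text ends with a unique `$` terminator and that the suffix table lists suffix start positions in sorted order. The function `bwt_from_suffix_array` is hypothetical and may differ from the project's `bwt.bwt`.

```python
from typing import List


def bwt_from_suffix_array(text: str, suffix_array: List[int]) -> str:
    """Build the BWT: for each sorted suffix, take the character before it.

    For the suffix starting at position 0, index -1 wraps around to the
    last character of the text, which is the "$" terminator.
    """
    return "".join(text[position - 1] for position in suffix_array)


text = "ACAACG$"
# Naive suffix array, fine for a toy string (DC3 does this in linear time)
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
toy_bwt = bwt_from_suffix_array(text, suffix_array)
print(toy_bwt)  # GC$AAAC
```

Because each output character is a single lookup, the BWT costs only one pass over the suffix table once that table exists.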

In [7]:
for chromo in chromosomes+chromosomes_inv:
    print(chromo.file_name)
    if chromo.bwt is None:
        bwt_result = bwt.bwt(str(chromo.DNA_dol),chromo.suffix_table) #TODO: avoid str() use
        chromo.export_bwt_result(bwt_result) #TODO 2: verify result


P_fal_chromosome_1
P_fal_chromosome_2
P_fal_chromosome_3
P_fal_chromosome_4
P_fal_chromosome_5
P_fal_chromosome_6
P_fal_chromosome_7
P_fal_chromosome_8
P_fal_chromosome_9
P_fal_chromosome_10
P_fal_chromosome_11
P_fal_chromosome_12
P_fal_chromosome_13
P_fal_chromosome_14
P_fal_chromosome_15
P_fal_chromosome_-1
P_fal_chromosome_-2
P_fal_chromosome_-3
P_fal_chromosome_-4
P_fal_chromosome_-5
P_fal_chromosome_-6
P_fal_chromosome_-7
P_fal_chromosome_-8
P_fal_chromosome_-9
P_fal_chromosome_-10
P_fal_chromosome_-11
P_fal_chromosome_-12
P_fal_chromosome_-13
P_fal_chromosome_-14
P_fal_chromosome_-15
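
The next cell builds a rank matrix from each BWT. Its layout is not described in this notebook. A common choice, assumed in this sketch, is to store for every symbol the number of its occurrences in each prefix of the BWT. The function `build_rank_matrix` is hypothetical and may not match `bwt.create_rank_mat`.

```python
from typing import Dict, List


def build_rank_matrix(bwt_string: str) -> Dict[str, List[int]]:
    """Return rank[symbol][i] = occurrences of symbol in bwt_string[:i]."""
    alphabet = sorted(set(bwt_string))
    rank = {symbol: [0] * (len(bwt_string) + 1) for symbol in alphabet}
    for position, current in enumerate(bwt_string):
        for symbol in alphabet:
            # Carry the previous count forward, adding 1 for the symbol just read
            rank[symbol][position + 1] = rank[symbol][position] + (symbol == current)
    return rank


rank = build_rank_matrix("GC$AAAC")
print(rank["A"])  # [0, 0, 0, 0, 1, 2, 3, 3]
print(rank["C"])  # [0, 0, 1, 1, 1, 1, 1, 2]
```

With such a table, counting the occurrences of a symbol in any BWT interval takes two lookups. The price is memory: one integer per symbol and per BWT position, which is significant for chromosomes of several million bases.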


In [3]:
for chromo in chromosomes+chromosomes_inv:
    print(chromo.file_name)
    if chromo.rank_mat is None:
        print(len(chromo.bwt))
        rank_mat = bwt.create_rank_mat(chromo.bwt)
        print("Done !")
        chromo.export_rank_matrix_result(rank_mat) 
        print("Exported !")

P_fal_chromosome_1
640852
Done !
Exported !
P_fal_chromosome_2
947103
Done !
Exported !
P_fal_chromosome_3
1067972
Done !
Exported !
P_fal_chromosome_4
1200491
Done !
Exported !
P_fal_chromosome_5
1343558
Done !
Exported !
P_fal_chromosome_6
1418243
Done !
Exported !
P_fal_chromosome_7
1445208
Done !
Exported !
P_fal_chromosome_8
1472806
Done !
Exported !
P_fal_chromosome_9
1541736
Done !
Exported !
P_fal_chromosome_10
1687657
Done !
Exported !
P_fal_chromosome_11
2038341
Done !
Exported !
P_fal_chromosome_12
2271495
Done !
Exported !
P_fal_chromosome_13
2925237
Done !
Exported !
P_fal_chromosome_14
3291937
Done !
Exported !
P_fal_chromosome_15
34251
Done !
Exported !
P_fal_chromosome_-1
640852
Done !
Exported !
P_fal_chromosome_-2
947103
Done !
Exported !
P_fal_chromosome_-3
1067972
Done !
Exported !
P_fal_chromosome_-4
1200491
Done !
Exported !
P_fal_chromosome_-5
1343558
Done !
Exported !
P_fal_chromosome_-6
1418243
Done !
Exported !
P_fal_chromosome_-7
1445208
Done !
Exported !
P_f