# Genomics Algorithmics project

## Development of a mapping solution over a reference genome for sequencing data

### Nicolas Parisot & Sergio Peignier

## Reading sequencing data with the *Biopython* package

### Genome and inverted genome of *Plasmodium falciparum*

With *Biopython*, each of the 15 *Plasmodium falciparum* chromosomes is read from the FASTA file and stored as a `Chromosome` object in a Python list named **chromosomes**. The inverted version of each chromosome is stored in a second list, **chromosomes_inv**. Mapping operations will be performed on these chromosomes.

In [1]:
from Bio import SeqIO
from tqdm import tqdm
from Chromosome import Chromosome
from dc3 import dc3
import bwt
import tools
import mapping
import importlib

In [2]:
chromosomes:list[Chromosome] = []
chromosomes_inv:list[Chromosome] = []

with tqdm(total=30, desc="Chromosomes importation") as pbar:
    for i,record in enumerate(SeqIO.parse("SEQUENCES/P_fal_genome.fna", format="fasta")):
        # Normal chromosomes
        name = f"P_fal_chromosome_{i+1}"
        chromo = Chromosome(name,record.seq,record.id)
        chromosomes.append(chromo)
        pbar.update(1)
        
        # Inverted chromosomes
        name_inversed_comp = f"P_fal_chromosome_-{i+1}"
        inversed_comp_seq = tools.inverse_sequence(record.seq)
        chromo_inv = Chromosome(name_inversed_comp,inversed_comp_seq,record.id)
        chromosomes_inv.append(chromo_inv)
        pbar.update(1)

Chromosomes importation: 100%|██████████| 30/30 [00:20<00:00,  1.44it/s]
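
The import cell above relies on `tools.inverse_sequence`, whose implementation is not shown in this notebook. The variable name `inversed_comp_seq` suggests that it returns the reverse complement of the sequence, but this is an assumption. Under that assumption, a minimal sketch could look like the following (`reverse_complement` is a hypothetical name, not the project's function):

```python
# Hypothetical stand-in for tools.inverse_sequence; the real function may differ.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A<->T, C<->G, then reversed)."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCC"))  # GGCAT
```

Mapping reads against both the forward chromosomes and their reverse complements makes it possible to place reads sequenced from either DNA strand.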


The number and the lengths of the chromosomes are printed below:

In [None]:
print(f"Number of chromosomes: {len(chromosomes)}")
print("Chromosome lengths, from 1 to 15:")
# enumerate(..., start=1) gives 1-based chromosome numbers directly
for i, chromo in enumerate(chromosomes, start=1):
    print(f"Length of chromosome {i:>2} : {len(chromo.DNA_dol):<7} base pairs")
for i, chromo in enumerate(chromosomes_inv, start=1):
    print(f"Length of chromosome -{i:>2} : {len(chromo.DNA_dol):<7} base pairs")

The objective of this project is to **map** short DNA sequences (1,500,000 reads, each 100 nucleotides long) onto all the chromosomes we just imported. In other words, we need to find where each read is located in the genome.

Given the length of the genome and the number of reads to map, this problem cannot be solved with "naive" methods. Instead, we use a **string-search** algorithm based on the Burrows-Wheeler Transform (BWT) (source: https://www.molgen.mpg.de/3708260/bwt_fm.pdf). The reads are mapped against the BWT of every chromosome. To that end, we need to compute the BWT of very large strings, which can be very time-consuming. The **DC3 algorithm** computes the suffix array of a string in time linear in its length, and we then use this suffix array to compute the BWT.
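
As a toy illustration of how a BWT is derived from a suffix array, the sketch below replaces DC3 with a naive sort of the suffixes. This is a deliberate swap: naive sorting is only practical for very short strings. The function names and the example string are illustrative and are not taken from the project's `dc3` or `bwt` modules.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Sort suffix start positions lexicographically (toy inputs only, not DC3)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, suffix_array: list[int]) -> str:
    """BWT[i] is the character that precedes the i-th smallest suffix.

    For the suffix starting at position 0, text[-1] wraps around to the
    final character, which is the "$" terminator.
    """
    return "".join(text[pos - 1] for pos in suffix_array)


text = "ACAACG$"  # "$" sorts before "A", "C", "G", "T"
suffix_array = naive_suffix_array(text)
print(suffix_array)                               # [6, 2, 0, 3, 1, 4, 5]
print(bwt_from_suffix_array(text, suffix_array))  # GC$AAAC
```

Once the suffix array is known, the BWT costs a single pass over it, which is why an efficient suffix-array construction is the critical step for chromosome-sized strings.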

_____________________

# DC3 on genome

The following cell computes the suffix array of each chromosome with the DC3 algorithm, unless a suffix table is already available for that chromosome, and exports the result.

The results are stored as `.npy` files, one of the most efficient ways to persist NumPy arrays on disk (https://stackoverflow.com/questions/9619199/best-way-to-preserve-numpy-arrays-on-disk).

In [3]:
for chromo in chromosomes+chromosomes_inv:
    print(chromo.file_name)
    if chromo.suffix_table is None:
        dc3result = dc3(chromo.DNA_dol)
        chromo.export_dc3_result(dc3result)

P_fal_chromosome_1
P_fal_chromosome_2
P_fal_chromosome_3
P_fal_chromosome_4
P_fal_chromosome_5
P_fal_chromosome_6
P_fal_chromosome_7
P_fal_chromosome_8
P_fal_chromosome_9
P_fal_chromosome_10
P_fal_chromosome_11
P_fal_chromosome_12
P_fal_chromosome_13
P_fal_chromosome_14
P_fal_chromosome_15
P_fal_chromosome_-1
P_fal_chromosome_-2
P_fal_chromosome_-3
P_fal_chromosome_-4
P_fal_chromosome_-5
P_fal_chromosome_-6
P_fal_chromosome_-7
P_fal_chromosome_-8
P_fal_chromosome_-9
P_fal_chromosome_-10
P_fal_chromosome_-11
P_fal_chromosome_-12
P_fal_chromosome_-13
P_fal_chromosome_-14
P_fal_chromosome_-15


This cell creates (or imports) the BWT of every chromosome.

In [None]:
for chromo in chromosomes+chromosomes_inv:
    print(chromo.file_name)
    if chromo.bwt is None:
        bwt_result = bwt.bwt(str(chromo.DNA_dol),chromo.suffix_table)
        bwt_result = bwt_result.replace("$", "")
        chromo.export_bwt_result(bwt_result)


Next, we compute or import the rank matrix of each chromosome. Rank matrices are large Python dictionaries that store, for a given chromosome, the rank array of each of its four nucleotides "A", "C", "G", "T" and of the end-of-string character "$".

Rank matrices are essential to the *string_search* function of the mapping module. Each rank lookup takes $O(1)$ time, so the rows matching a pattern of length $m$ are found in $O(m)$ steps, independently of the chromosome length (again, see https://www.molgen.mpg.de/3708260/bwt_fm.pdf).
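
To make the role of the rank arrays concrete, here is a minimal sketch of the backward search on a toy string. It is not the project's `bwt.create_rank_mat` or `mapping.string_search`: all function names are hypothetical, the suffix array is built with a naive sort, and the "$" terminator is kept inside the BWT, which is an assumption of this sketch.

```python
from collections import Counter


def build_rank_table(bwt: str) -> dict[str, list[int]]:
    """rank[c][i] = number of occurrences of c in bwt[:i]."""
    rank = {c: [0] * (len(bwt) + 1) for c in set(bwt)}
    for i, ch in enumerate(bwt):
        for c in rank:
            rank[c][i + 1] = rank[c][i] + (c == ch)
    return rank


def first_column_starts(bwt: str) -> dict[str, int]:
    """starts[c] = index of the first sorted suffix that begins with c."""
    counts = Counter(bwt)
    starts, total = {}, 0
    for c in sorted(counts):
        starts[c] = total
        total += counts[c]
    return starts


def backward_search(pattern, bwt, rank, starts, suffix_array):
    """Return the sorted start positions of pattern in the original text."""
    lo, hi = 0, len(bwt)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):  # one O(1) update per pattern character
        if ch not in rank:
            return []
        lo = starts[ch] + rank[ch][lo]
        hi = starts[ch] + rank[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


text = "ACAACG$"
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])  # naive, toy only
bwt = "".join(text[i - 1] for i in suffix_array)
rank = build_rank_table(bwt)
starts = first_column_starts(bwt)

print(backward_search("AC", bwt, rank, starts, suffix_array))  # [0, 3]
print(backward_search("GG", bwt, rank, starts, suffix_array))  # []
```

The loop touches each pattern character once and performs two table lookups per character, which is where the $O(m)$ cost comes from. The price is memory: one rank array per symbol, each as long as the BWT.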

In [None]:
for chromo in chromosomes+chromosomes_inv:
    print(chromo.file_name)
    if chromo.rank_mat is None:
        print(len(chromo.bwt))
        rank_mat = bwt.create_rank_mat(chromo.bwt)
        print("Done !")
        chromo.export_rank_matrix_result(rank_mat) 
        print("Exported !")

## Now we process all the reads

### *Reads* acquired from *P. falciparum* genome sequencing
In the same way, *reads* are extracted with *Biopython* and stored in a Python list named **reads**. All the *reads* have the same length: 100 nucleotides.

In [3]:
reads = []
# Track progress with a single bar over the expected 1,500,000 reads
with tqdm(total=1500000, desc="Reads importation") as pbar:
    for record in SeqIO.parse("SEQUENCES/P_fal_reads.fq", format="fastq"):
        reads.append((record.seq, record.id))
        pbar.update(1)

Reads importation: 100%|██████████| 1500000/1500000 [00:31<00:00, 48006.86it/s]


In [None]:
print(f"Number of reads: {len(reads)}")
print(f"Reads length: {len(reads[0][0])}")

In [8]:
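importlib.reload(mapping)  # pick up the latest edits to the mapping module

def test():
    """Cut each of the first 150000 reads into k-mers (second argument: 20) and search them in chromosome 1."""
    with tqdm(total=150000, desc="Reads importation") as pbar:
        for i in range(150000):
            k_reads = mapping.cut_read_to_kmer(reads[i][0],20)
            for k_read in k_reads:
                mapping.string_search(k_read,chromosomes[0])
            pbar.update(1)

# Profile the search to see where the time is spent
import cProfile
cProfile.run("test()")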

Reads importation: 100%|██████████| 150000/150000 [01:03<00:00, 2352.04it/s]

         120462219 function calls (120312219 primitive calls) in 63.777 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.125    1.125   63.777   63.777 4192026678.py:3(test)
   316593    0.423    0.000    3.174    0.000 <__array_function__ internals>:177(sort)
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1053(_handle_fromlist)
        1    0.000    0.000   63.777   63.777 <string>:1(<module>)
 12986101    5.442    0.000    6.852    0.000 Seq.py:1683(__init__)
 13438142    3.814    0.000    5.409    0.000 Seq.py:406(__len__)
 25222202   15.705    0.000   27.075    0.000 Seq.py:410(__getitem__)
        1    0.000    0.000    0.000    0.000 __init__.py:48(create_string_buffer)
        1    0.000    0.000    0.000    0.000 _monitor.py:94(report)
        1    0.000    0.000    0.000    0.000 _weakrefset.py:111(remove)
        2    0.000    0.000    0.000    0.000 _weakrefset.py:17(__i


