# Jupyter Notebook GenoRobotics Full Pipeline

## Imports

In [18]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [19]:
import os

from lib.consensus.consensus import run_consensus
from lib.identification.identification import run_identification

## Define Your File and Folder Paths

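- Set the "input_fastq_filename" variable to the name of your input FASTQ file. The file is read from the assets/input directory.

- Modify the "output_base_dir" variable to point to the directory where you want the output files to be saved. Results are written to a subfolder named after the input file.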

In [20]:
input_fastq_filename = "rbcL_Qiagen_tomato_5000.fastq"
input_fastq_path = os.path.join("assets","input", input_fastq_filename)
base_name = os.path.splitext(input_fastq_filename)[0]

output_base_dir = os.path.join('assets','output')
output_dir = os.path.join(output_base_dir, base_name)
os.makedirs(output_dir, exist_ok=True)
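
Before running the pipeline, it can help to confirm that the input file actually exists. The sketch below wraps the path setup from the cell above in a hypothetical helper, `build_paths` (not part of the GenoRobotics library), which raises early if the FASTQ file is missing.

```python
import os


def build_paths(input_fastq_filename, input_dir, output_base_dir):
    """Hypothetical helper: build the input/output paths used by the notebook.

    Raises FileNotFoundError if the FASTQ file is not in input_dir.
    """
    input_fastq_path = os.path.join(input_dir, input_fastq_filename)
    if not os.path.isfile(input_fastq_path):
        raise FileNotFoundError(f"Input FASTQ not found: {input_fastq_path}")

    # Strip the extension, e.g. "reads.fastq" -> "reads"
    base_name = os.path.splitext(input_fastq_filename)[0]

    # One output subfolder per input file
    output_dir = os.path.join(output_base_dir, base_name)
    os.makedirs(output_dir, exist_ok=True)
    return input_fastq_path, base_name, output_dir
```

With the layout used in this notebook, the call would be `build_paths("rbcL_Qiagen_tomato_5000.fastq", os.path.join("assets", "input"), os.path.join("assets", "output"))`.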

## Run Preprocessing (Optional)

In [21]:
# preprocessing()

## Run Consensus Sequence Generation

Select which consensus sequence generation method you want to use by setting the "consensus_method" argument of run_consensus to either:

- "80_20_best_sequence" (used below)

- "80_20_longest_sequence"

In [22]:
# Choose one of the following consensus methods:
# - "80_20_best_sequence"
# - "80_20_longest_sequence"

# If you're on Windows and have to use WSL (Windows Subsystem for Linux), set wsl to True
wsl = True

run_consensus(input_name=base_name,
              input_fastq_path=input_fastq_path,
              consensus_method="80_20_best_sequence",
              wsl=wsl)

Running consensus pipeline... 

Running consensus pipeline with 80_20_best_sequence method...
Minimap2 alignment took 0.58 seconds.
Total Racon iterations took 24.42 seconds.
Total time taken for the consensus pipeline: 24.99 seconds.
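
A misspelled method name is easy to miss until the pipeline fails. As a minimal sketch, the method string could be checked before it is passed to run_consensus. The helper `check_consensus_method` and the tuple `KNOWN_CONSENSUS_METHODS` are hypothetical; the two names are copied from the comments in the consensus cell, and the library itself may accept others.

```python
# Method names taken from the comments in the consensus cell (assumption:
# this list may be incomplete compared to what the library accepts).
KNOWN_CONSENSUS_METHODS = ("80_20_best_sequence", "80_20_longest_sequence")


def check_consensus_method(method):
    """Return the method name unchanged if it is known, else raise ValueError."""
    if method not in KNOWN_CONSENSUS_METHODS:
        raise ValueError(
            f"Unknown consensus method {method!r}; "
            f"expected one of {KNOWN_CONSENSUS_METHODS}"
        )
    return method
```

It could then be used as `consensus_method=check_consensus_method("80_20_best_sequence")` in the call to run_consensus.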


## Run Identification of Consensus Sequence

In [26]:
# Choose the database matching the gene you're trying to identify: matK, rbcL, psbA-trnH or ITS
db = "matK"

run_identification(base_name, db=db)

Running consensus pipeline... 

Querying  matK[All Fields] AND (is_nuccore[filter] AND "750"[SLEN] : "1500"[SLEN]))
Downloading from 0 to 10000 : 0.0%
Downloading from 10000 to 20000 : 8.329099374484636%
Downloading from 20000 to 30000 : 16.658198748969273%
Downloading from 30000 to 40000 : 24.98729812345391%
Downloading from 40000 to 50000 : 33.316397497938546%
Downloading from 50000 to 60000 : 41.64549687242318%
Downloading from 60000 to 70000 : 49.97459624690782%
Downloading from 70000 to 80000 : 58.30369562139246%
Downloading from 80000 to 90000 : 66.63279499587709%
Downloading from 90000 to 100000 : 74.96189437036173%
Downloading from 100000 to 110000 : 83.29099374484636%
Downloading from 110000 to 120000 : 91.620093119331%
Downloading from 120000 to 130000 : 99.94919249381564%
You can find identification output at assets\output\blastn\rbcL_Qiagen_tomato_5000
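
The download log above suggests that records are fetched in batches of 10,000 and that the percentage is the batch start divided by the total number of matching records. The sketch below reproduces that pattern under those assumptions. It is not the library's implementation: `batch_ranges` is a hypothetical helper, and the total of 120061 is inferred from the logged percentages rather than stated anywhere in the notebook.

```python
def batch_ranges(total, batch_size=10000):
    """Yield (start, end, percent) for each download batch.

    percent is the progress at the start of the batch, which is why the
    last logged value stays below 100.
    """
    for start in range(0, total, batch_size):
        end = start + batch_size  # not clamped to total, as in the log
        yield start, end, start / total * 100


# Assumption: total inferred from the logged percentages
for start, end, percent in batch_ranges(120061):
    print(f"Downloading from {start} to {end} : {percent}%")
```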


In [27]:
from Bio import SeqIO

count = sum(1 for _ in SeqIO.parse("matK_sequences.fasta", "fasta"))
print(count)


120048
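
As a cross-check that does not require Biopython, the same count can be obtained from the file directly: every FASTA record begins with a header line starting with `>`, so counting those lines gives the number of records. `count_fasta_records` is a hypothetical helper written for illustration.

```python
def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling `count_fasta_records("matK_sequences.fasta")` should agree with the SeqIO-based count for a well-formed file.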
