# Jupyter Notebook GenoRobotics Full Pipeline

## Imports

In [1]:
import os

from lib.consensus.consensus import run_consensus
from lib.identification.identification import run_identification

## Define Your File and Folder Paths

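- Modify the "input_fastq_filename" variable to the name of your input FASTQ file, which is expected in the "assets/input" directory.

- Modify the "output_base_dir" variable to point to the directory where you want the output files to be saved. A subfolder named after the input file is created inside it.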

In [2]:
input_fastq_filename = "rbcL_Qiagen_tomato_5000.fastq"
input_fastq_path = f"assets/input/{input_fastq_filename}"
base_name = os.path.splitext(input_fastq_filename)[0]

output_base_dir = "assets/output"
output_dir = os.path.join(output_base_dir, base_name)
os.makedirs(output_dir, exist_ok=True)
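To make the path logic above easier to check before running the pipeline, here is a minimal sketch of how the output folder is derived from the input filename. The helper name `derive_output_dir` is hypothetical and not part of the `lib` package; it only repeats the `os.path` calls used in the cell above.

```python
import os


def derive_output_dir(input_fastq_filename, output_base_dir="assets/output"):
    """Return the per-sample output folder (hypothetical helper, not in lib)."""
    # splitext removes only the last extension: "x.fastq" -> "x"
    base_name = os.path.splitext(input_fastq_filename)[0]
    return os.path.join(output_base_dir, base_name)


print(derive_output_dir("rbcL_Qiagen_tomato_5000.fastq"))
```

Note that `os.path.splitext` strips only the final extension, so a compressed file such as `reads.fastq.gz` would yield a folder named `reads.fastq`.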

## Run Preprocessing (Optional)

In [3]:
# preprocessing()

## Run Consensus Sequence Generation

Select the consensus sequence generation method by setting the "consensus_method" variable to one of:

- "majority" (default)

- "consensus"

- "consensus_with_ambiguities"
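A small guard can catch a mistyped method name before the pipeline starts. The sketch below is an assumption: the notebook does not show how `run_consensus` receives the method, so `check_consensus_method` is a hypothetical standalone helper that only validates the value against the three options listed above.

```python
# The three method names listed in this notebook.
VALID_CONSENSUS_METHODS = ("majority", "consensus", "consensus_with_ambiguities")


def check_consensus_method(consensus_method="majority"):
    """Return the method name if it is valid, else raise ValueError (hypothetical helper)."""
    if consensus_method not in VALID_CONSENSUS_METHODS:
        raise ValueError(
            f"Unknown consensus_method {consensus_method!r}; "
            f"expected one of {VALID_CONSENSUS_METHODS}"
        )
    return consensus_method


consensus_method = check_consensus_method("majority")
```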

In [4]:
run_consensus(input_fastq_filename, input_fastq_path, output_dir)

Running read alignment with minimap2 on top 20% sequences...


ERROR:root:Error: [M::mm_idx_gen::0.024*1.28] collected minimizers
[M::mm_idx_gen::0.037*1.87] sorted minimizers
[M::main::0.037*1.87] loaded/built the index for 1000 target sequence(s)
[M::mm_mapopt_update::0.038*1.84] mid_occ = 608
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1000
[M::mm_idx_stat::0.039*1.82] distinct minimizers: 61942 (78.57% are singletons); average occurrences: 3.831; average spacing: 2.966; total length: 703802
[M::worker_pipeline::2.814*2.69] mapped 1000 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -x ava-ont assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_top20.fastq assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_top20.fastq
[M::main] Real time: 2.819 sec; CPU: 7.575 sec; Peak RSS: 0.068 GB



Generating consensus sequence with racon on top 20% sequences...


ERROR:root:Error: [racon::Polisher::initialize] loaded target sequences 0.007548 s
[racon::Polisher::initialize] loaded sequences 0.007444 s
[racon::Polisher::initialize] loaded overlaps 0.588765 s
[racon::Polisher::initialize] aligning overlaps [=>                  ] 0.006336 s
[racon::Polisher::initialize] aligning overlaps [==>                 ] 0.010554 s
[racon::Polisher::initialize] aligning overlaps [===>                ] 0.014701 s
[racon::Polisher::initialize] aligning overlaps [====>               ] 0.018939 s
[racon::Polisher::initialize] aligning overlaps [=====>              ] 0.023358 s
[racon::Polisher::initialize] transformed data into windows 0.009004 s
[racon::Polisher::polish] generating consensus [=>                  ] 1.250673 s
[racon::Polisher::polish] generating consensus [==>                 ] 1.811734 s
[racon::Polisher::polish] generating consensus [===>                ] 1.937921 s
[racon::Polisher::polish] generating consensus [====>               ] 1.952718

Multiple sequences found in assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_top20_consensus.fasta. Selecting the best alignment...


ERROR:root:Error: [M::mm_idx_gen::0.001*8.10] collected minimizers
[M::mm_idx_gen::0.002*4.99] sorted minimizers
[M::main::0.002*4.97] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.002*4.82] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.002*4.73] distinct minimizers: 130 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.031; total length: 784
[M::worker_pipeline::0.026*2.28] mapped 4000 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -x map-ont assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_top20_consensus.fasta assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_remaining80.fastq
[M::main] Real time: 0.027 sec; CPU: 0.060 sec; Peak RSS: 0.006 GB



Running read alignment with minimap2 on remaining 80% sequences...
Generating final consensus sequence with racon...


ERROR:root:Error: [racon::Polisher::initialize] loaded target sequences 0.000078 s
[racon::Polisher::initialize] loaded sequences 0.018144 s
[racon::Polisher::initialize] loaded overlaps 0.005549 s
[racon::Polisher::initialize] aligning overlaps [=>                  ] 0.025948 s
[racon::Polisher::initialize] aligning overlaps [==>                 ] 0.035553 s
[racon::Polisher::initialize] aligning overlaps [===>                ] 0.044434 s
[racon::Polisher::initialize] aligning overlaps [====>               ] 0.051820 s
[racon::Polisher::initialize] aligning overlaps [=====>              ] 0.058610 s
[racon::Polisher::initialize] transformed data into windows 0.001633 s
[racon::Polisher::polish] generated consensus 8.859059 s
[racon::Polisher::] total = 9.009265 s



Minimap2 alignment took 0.07 seconds.
Total Racon iterations took 14.68 seconds.
Total time taken for the pipeline: 14.76 seconds.
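The `ERROR:root:Error:` prefixes in the output above wrap what appear to be ordinary progress messages: minimap2 and racon both write their status lines to stderr even when they succeed. The sketch below shows one way a wrapper could choose the log level from the exit code instead of from the stream. It is an illustration under that assumption, not the code used in `lib.consensus`; `run_and_log` is a hypothetical name, and a short Python one-liner stands in for the external tool.

```python
import logging
import subprocess
import sys


def run_and_log(cmd):
    """Run cmd and log its stderr at INFO on success, ERROR on failure.

    Returns the logging level that was chosen and the captured stdout.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    level = logging.INFO if result.returncode == 0 else logging.ERROR
    if result.stderr:
        logging.log(level, result.stderr.strip())
    return level, result.stdout


# Stand-in for a tool that reports progress on stderr but exits with code 0.
fake_tool = [sys.executable, "-c", "import sys; sys.stderr.write('[M::main] done\\n')"]
level, _ = run_and_log(fake_tool)
```

With this approach, a successful run is logged at INFO level, and only a non-zero exit code produces an ERROR entry.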


## Run Identification of Consensus Sequence

In [90]:
run_identification()