# Jupyter Notebook GenoRobotics Full Pipeline

## Imports

In [2]:
import os

from lib.consensus.consensus import run_consensus
from lib.identification.identification import run_identification

## Define Your File and Folder Paths

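- Set the "input_fastq_filename" variable to the name of your input FASTQ file. The file is read from the "assets/input" directory.

- Set the "output_base_dir" variable to the directory where you want the output files to be saved. Results are written to a subdirectory named after the input file (without its extension).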

In [3]:
input_fastq_filename = "rbcL_Qiagen_tomato_5000.fastq"
input_fastq_path = f"assets/input/{input_fastq_filename}"
base_name = os.path.splitext(input_fastq_filename)[0]

output_base_dir = "assets/output"
output_dir = os.path.join(output_base_dir, base_name)
os.makedirs(output_dir, exist_ok=True)
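Before running the pipeline, it can help to confirm that the input path resolves to a readable FASTQ file. The following is a minimal sketch, not part of the GenoRobotics library: `count_fastq_reads` is a hypothetical helper, and it assumes an uncompressed FASTQ file with exactly four lines per record.

```python
import os


def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file.

    Hypothetical helper (not part of lib.consensus). Assumes the
    standard four-lines-per-record FASTQ layout.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input FASTQ not found: {path}")
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(
            f"{path} has {n_lines} lines, which is not a multiple of 4"
        )
    return n_lines // 4
```

Calling `count_fastq_reads(input_fastq_path)` after the cell above would raise early if the file name is mistyped, rather than failing later inside the pipeline.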

## Run Preprocessing (Optional)

In [3]:
# preprocessing()

## Run Consensus Sequence Generation

Select which consensus sequence generation method you want to use by setting the "consensus_method" argument of run_consensus to either:

- "80_20_best_sequence"

- "80_20_longest_sequence"
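The `80_20` prefix in the method names used in the cell below, together with the log messages ("top 20% sequences", "remaining 80% sequences"), suggests that the reads are split into two groups before polishing. In the run shown here, minimap2 reports 1000 sequences in the first step and 4000 in the later mapping step, which is consistent with a 20/80 split of 5000 reads. How the library ranks reads is not shown in this notebook, so the sketch below is only an illustration: it assumes ranking by mean Phred quality (Phred+33 encoding), and `mean_phred` and `split_top_fraction` are hypothetical names.

```python
def mean_phred(quality_string, offset=33):
    """Arithmetic mean of the Phred scores in one FASTQ quality string.

    Assumption: Phred+33 encoding. This is a simple average of scores,
    not an average of the underlying error probabilities.
    """
    if not quality_string:
        return 0.0
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def split_top_fraction(records, fraction=0.2):
    """Split (header, sequence, quality) tuples into (top, remaining).

    Assumption: reads are ranked by mean Phred quality, best first.
    The real pipeline may use a different ranking criterion.
    """
    ranked = sorted(records, key=lambda rec: mean_phred(rec[2]), reverse=True)
    n_top = round(len(ranked) * fraction) if ranked else 0
    return ranked[:n_top], ranked[n_top:]
```

With five reads and the default `fraction=0.2`, the single read with the highest mean quality lands in the first group and the other four in the second.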

In [5]:
# choose a consensus method between the following:
# - "80_20_best_sequence"
# - "80_20_longest_sequence"

run_consensus(input_name=input_fastq_filename,
              input_fastq_path=input_fastq_path,
              output_dir=output_dir,
              consensus_method="80_20_best_sequence")

Running read alignment with minimap2 on top 20% sequences...


ERROR:root:Error: [M::mm_idx_gen::0.026*1.21] collected minimizers
[M::mm_idx_gen::0.035*1.67] sorted minimizers
[M::main::0.035*1.67] loaded/built the index for 1000 target sequence(s)
[M::mm_mapopt_update::0.037*1.64] mid_occ = 608
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1000
[M::mm_idx_stat::0.038*1.62] distinct minimizers: 61942 (78.57% are singletons); average occurrences: 3.831; average spacing: 2.966; total length: 703802
[M::worker_pipeline::2.796*2.69] mapped 1000 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -x ava-ont assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_top20.fastq assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_top20.fastq
[M::main] Real time: 2.801 sec; CPU: 7.524 sec; Peak RSS: 0.069 GB



Generating consensus sequence with racon on top 20% sequences...


ERROR:root:Error: [racon::Polisher::initialize] loaded target sequences 0.007265 s
[racon::Polisher::initialize] loaded sequences 0.007781 s
[racon::Polisher::initialize] loaded overlaps 0.622514 s
[racon::Polisher::initialize] aligning overlaps [=>                  ] 0.006859 s
[racon::Polisher::initialize] aligning overlaps [==>                 ] 0.011121 s
[racon::Polisher::initialize] aligning overlaps [===>                ] 0.015240 s
[racon::Polisher::initialize] aligning overlaps [====>               ] 0.019874 s
[racon::Polisher::initialize] aligning overlaps [=====>              ] 0.024352 s
[racon::Polisher::initialize] transformed data into windows 0.008841 s
[racon::Polisher::polish] generating consensus [=>                  ] 1.242651 s
[racon::Polisher::polish] generating consensus [==>                 ] 1.796944 s
[racon::Polisher::polish] generating consensus [===>                ] 1.943133 s
[racon::Polisher::polish] generating consensus [====>               ] 1.959451

Multiple sequences found in assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_top20_consensus.fasta. Selecting the best alignment...


ERROR:root:Error: [M::mm_idx_gen::0.001*8.96] collected minimizers
[M::mm_idx_gen::0.001*5.79] sorted minimizers
[M::main::0.001*5.75] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*5.54] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.001*5.39] distinct minimizers: 130 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.031; total length: 784
[M::worker_pipeline::0.025*2.30] mapped 4000 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -x map-ont assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_top20_consensus.fasta assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000.fastq_remaining80.fastq
[M::main] Real time: 0.025 sec; CPU: 0.057 sec; Peak RSS: 0.006 GB



Running read alignment with minimap2 on remaining 80% sequences...
Generating final consensus sequence with racon...


ERROR:root:Error: [racon::Polisher::initialize] loaded target sequences 0.000167 s
[racon::Polisher::initialize] loaded sequences 0.017978 s
[racon::Polisher::initialize] loaded overlaps 0.005554 s
[racon::Polisher::initialize] aligning overlaps [=>                  ] 0.025392 s
[racon::Polisher::initialize] aligning overlaps [==>                 ] 0.033861 s
[racon::Polisher::initialize] aligning overlaps [===>                ] 0.042363 s
[racon::Polisher::initialize] aligning overlaps [====>               ] 0.049628 s
[racon::Polisher::initialize] aligning overlaps [=====>              ] 0.056248 s
[racon::Polisher::initialize] transformed data into windows 0.001893 s
[racon::Polisher::polish] generated consensus 8.840039 s
[racon::Polisher::] total = 8.985954 s



Minimap2 alignment took 0.07 seconds.
Total Racon iterations took 14.95 seconds.
Total time taken for the pipeline: 15.02 seconds.
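The `ERROR:root:Error:` prefixes in the output above precede ordinary progress messages: minimap2 and racon both write their status lines to standard error, and the run itself completes with timings reported. How `lib.consensus` invokes these tools is not shown in this notebook, so the following is only a sketch of one way to separate progress output from real failures. `run_tool` is a hypothetical helper that logs stderr at INFO level when the command succeeds and at ERROR level only when the exit code is non-zero.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def run_tool(cmd):
    """Run an external command given as a list of arguments; return stdout.

    Hypothetical helper: stderr is treated as progress output on success
    and as an error message only when the command exits non-zero.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        logger.error("%s exited with code %d:\n%s",
                     cmd[0], result.returncode, result.stderr)
        raise subprocess.CalledProcessError(
            result.returncode, cmd, output=result.stdout, stderr=result.stderr
        )
    if result.stderr:
        logger.info("%s messages:\n%s", cmd[0], result.stderr)
    return result.stdout
```

With this approach, a successful minimap2 or racon call would no longer appear as an error in the notebook output, while a genuine failure would still be logged and raised.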


## Run Identification of Consensus Sequence

In [90]:
run_identification()