# Jupyter Notebook GenoRobotics Full Pipeline

## Imports

In [1]:
import os

from lib.consensus.consensus import run_consensus
from lib.identification.identification import run_identification

## Define Your File and Folder Paths

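- Set the "input_fastq_filename" variable to the name of your input FASTQ file. The file is expected in the "assets/input" directory.

- Set the "output_base_dir" variable to the directory where you want the output files to be saved. A subdirectory named after the input file (without its extension) is created inside it.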

In [2]:
input_fastq_filename = "rbcL_Qiagen_tomato_5000.fastq"
input_fastq_path = f"assets/input/{input_fastq_filename}"
base_name = os.path.splitext(input_fastq_filename)[0]

output_base_dir = "assets/output"
output_dir = os.path.join(output_base_dir, base_name)
os.makedirs(output_dir, exist_ok=True)
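
Before starting the pipeline, it can help to confirm that the input path really points to a readable FASTQ file. The following is a minimal sketch, not part of the GenoRobotics library: `check_fastq_input` is a hypothetical helper, and the "first line starts with `@`" test is only a rough sanity check for an uncompressed FASTQ file.

```python
import os


def check_fastq_input(path):
    """Hypothetical helper: return `path` if it looks like a usable FASTQ file.

    Raises FileNotFoundError if the file is missing and ValueError if it is
    empty or its first line does not start with '@' (the FASTQ header marker).
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input FASTQ not found: {path}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"Input FASTQ is empty: {path}")
    with open(path, "r") as handle:
        first_line = handle.readline()
    if not first_line.startswith("@"):
        raise ValueError(f"File does not look like FASTQ: {path}")
    return path
```

In this notebook it could be called as `check_fastq_input(input_fastq_path)` right after the paths are defined, so that a wrong filename fails early rather than inside the external tools.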

## Run Preprocessing (Optional)

In [3]:
# preprocessing()

## Run Consensus Sequence Generation

Select the consensus sequence generation method by setting the "consensus_method" argument of run_consensus to either:

- "80_20_best_sequence"

- "80_20_longest_sequence"
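
The `80_20_*` method names used in the cell below, together with the `_top20.fastq` and `_remaining80.fastq` files that appear in the log output, suggest that the reads are split into a top 20% and a remaining 80%. The sketch below illustrates one way such a split could work. It is an assumption, not the library's implementation: the ranking criterion (mean Phred quality, Phred+33 encoding) and the helper names `parse_fastq`, `mean_phred`, and `split_top_fraction` are all hypothetical.

```python
def parse_fastq(lines):
    """Group FASTQ text lines into (header, sequence, quality) records."""
    lines = [line.rstrip("\n") for line in lines]
    records = []
    # Each FASTQ record spans four lines: header, sequence, '+', quality.
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        records.append((header, seq, qual))
    return records


def mean_phred(qual):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - 33 for c in qual) / len(qual)


def split_top_fraction(records, fraction=0.2):
    """Rank reads by mean quality (an assumed criterion) and split them.

    Returns (top, remaining), where `top` holds the best `fraction` of reads.
    """
    ranked = sorted(records, key=lambda r: mean_phred(r[2]), reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    return ranked[:n_top], ranked[n_top:]
```

With 5000 input reads and `fraction=0.2`, this split yields 1000 and 4000 reads, which is consistent with the sequence counts minimap2 reports in the output below.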

In [5]:
# Choose one of the following consensus methods:
# - "80_20_best_sequence"
# - "80_20_longest_sequence"

# input_name is the FASTQ filename without its extension,
# which was already computed above as base_name.
run_consensus(input_name=base_name,
              input_fastq_path=input_fastq_path,
              output_dir=output_dir,
              consensus_method="80_20_best_sequence")

Running consensus pipeline...
Running consensus pipeline with 80_20_best_sequence method...
Running read alignment with minimap2 on top 20% sequences...


ERROR:root:Error: [M::mm_idx_gen::0.023*1.20] collected minimizers
[M::mm_idx_gen::0.035*1.79] sorted minimizers
[M::main::0.035*1.79] loaded/built the index for 1000 target sequence(s)
[M::mm_mapopt_update::0.037*1.76] mid_occ = 608
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1000
[M::mm_idx_stat::0.037*1.74] distinct minimizers: 61942 (78.57% are singletons); average occurrences: 3.831; average spacing: 2.966; total length: 703802
[M::worker_pipeline::2.806*2.69] mapped 1000 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -x ava-ont assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000_top20.fastq assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000_top20.fastq
[M::main] Real time: 2.810 sec; CPU: 7.556 sec; Peak RSS: 0.074 GB



Generating consensus sequence with racon on top 20% sequences...


ERROR:root:Error: [racon::Polisher::initialize] loaded target sequences 0.007301 s
[racon::Polisher::initialize] loaded sequences 0.007621 s
[racon::Polisher::initialize] loaded overlaps 0.593083 s
[racon::Polisher::initialize] aligning overlaps [=>                  ] 0.006839 s
[racon::Polisher::initialize] aligning overlaps [==>                 ] 0.011122 s
[racon::Polisher::initialize] aligning overlaps [===>                ] 0.015304 s
[racon::Polisher::initialize] aligning overlaps [====>               ] 0.019999 s
[racon::Polisher::initialize] aligning overlaps [=====>              ] 0.024304 s
[racon::Polisher::initialize] transformed data into windows 0.009494 s
[racon::Polisher::polish] generating consensus [=>                  ] 1.236418 s
[racon::Polisher::polish] generating consensus [==>                 ] 1.787041 s
[racon::Polisher::polish] generating consensus [===>                ] 1.912120 s
[racon::Polisher::polish] generating consensus [====>               ] 1.926793

Multiple sequences found in assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000_top20_consensus.fasta. Selecting the best alignment...


ERROR:root:Error: [M::mm_idx_gen::0.001*8.12] collected minimizers
[M::mm_idx_gen::0.001*5.26] sorted minimizers
[M::main::0.001*5.24] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*5.08] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.001*4.97] distinct minimizers: 130 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.031; total length: 784
[M::worker_pipeline::0.025*2.27] mapped 4000 sequences
[M::main] Version: 2.22-r1101
[M::main] CMD: minimap2 -x map-ont assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000_top20_consensus.fasta assets/output/rbcL_Qiagen_tomato_5000/rbcL_Qiagen_tomato_5000_remaining80.fastq
[M::main] Real time: 0.026 sec; CPU: 0.058 sec; Peak RSS: 0.006 GB



Running read alignment with minimap2 on remaining 80% sequences...
Generating final consensus sequence with racon...


ERROR:root:Error: [racon::Polisher::initialize] loaded target sequences 0.000067 s
[racon::Polisher::initialize] loaded sequences 0.017888 s
[racon::Polisher::initialize] loaded overlaps 0.005623 s
[racon::Polisher::initialize] aligning overlaps [=>                  ] 0.024927 s
[racon::Polisher::initialize] aligning overlaps [==>                 ] 0.033647 s
[racon::Polisher::initialize] aligning overlaps [===>                ] 0.041678 s
[racon::Polisher::initialize] aligning overlaps [====>               ] 0.048782 s
[racon::Polisher::initialize] aligning overlaps [=====>              ] 0.055204 s
[racon::Polisher::initialize] transformed data into windows 0.001582 s
[racon::Polisher::polish] generated consensus 8.850429 s
[racon::Polisher::] total = 8.992642 s



Deleting intermediate files...
Minimap2 alignment took 0.07 seconds.
Total Racon iterations took 14.62 seconds.
Total time taken for the pipeline: 14.68 seconds.
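
Note that the `ERROR:root:Error:` lines above do not indicate a failure: minimap2 and racon write their progress messages to stderr, and the pipeline appears to log that stream at the ERROR level. A minimal sketch of an alternative is shown below, where stderr is logged as INFO when the command exits with code 0 and as ERROR otherwise. The helpers `stderr_log_level` and `run_tool` are hypothetical and are not part of the library.

```python
import logging
import subprocess


def stderr_log_level(returncode):
    """Pick a log level for a tool's stderr based on its exit code."""
    return logging.INFO if returncode == 0 else logging.ERROR


def run_tool(cmd):
    """Run an external command and log its stderr at a fitting level.

    `cmd` is a list such as ["minimap2", "-x", "map-ont", ref, reads].
    Returns the subprocess.CompletedProcess so callers can read stdout.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.stderr:
        logging.log(
            stderr_log_level(result.returncode),
            "%s stderr:\n%s",
            cmd[0],
            result.stderr.strip(),
        )
    return result
```

With this approach, progress output from a successful run would appear as `INFO` messages (if the logging level is set to show them), and only a non-zero exit code would produce an `ERROR` entry.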


## Run Identification of Consensus Sequence

In [90]:
run_identification()