# RespiCoV sequencing analysis by reference genome

Analyze FASTQ file(s) from nanopore sequencing and compare them to a set of known reference genomes. Intended for use with the [RespiCoV](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0264855) sequencing protocol. Relevant reference genomes are found through prior analysis (e.g. RCMatchPrimers and BLAST). Specific reference sources include:
 * Rhinoviruses: [Enterovirus species taxonomy](https://ictv.global/report/chapter/picornaviridae/picornaviridae/enterovirus)
 * SARS-CoV-2: [Pango lineages](https://cov-lineages.org/lineage_list.html)

The notebook is run here on my first RespiCoV sequencing attempt, where I knew (from the gel) that I had a lot of mis-priming and where my flow cell was nearly exhausted.

**Goals:**
 * Identify best target species/type match(es) in each input sample.
 * Support both ligation and transposase ("rapid") chemistry, i.e. primer-independent.
 * Support multiple pooled samples per barcode for pool demultiplexing.
 * Run quickly on a single machine and scale linearly with the product of input and reference sequences.

**TODO:**
 * Make this do something
 
**Non-goals / future work elsewhere:**
 * Study PCR efficiency and mis-priming (see RCMatchPrimers).
 * Identify new species/types (relies on prior analysis to populate set of relevant reference genomes).
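
The first goal above can be pictured as a per-read reduction: keep the highest-scoring reference for each read, then count reads per reference. The following is a minimal sketch, assuming each hit has been flattened to a `(read_id, reference_name, score)` tuple. That tuple shape and the `best_reference_per_read` and `tally_references` helpers are illustrative assumptions, not part of RCUtils. The sample values are copied from the barcode07 output shown later in this notebook, with read ids abbreviated.

```python
from collections import Counter

def best_reference_per_read(hits):
    """Map each read id to its highest-scoring (reference_name, score)."""
    best = {}
    for read_id, reference_name, score in hits:
        # Keep a hit only if it beats the best score seen so far for this read
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (reference_name, score)
    return best

def tally_references(best):
    """Count how many reads were assigned to each reference."""
    return Counter(reference_name for reference_name, _ in best.values())

# Hypothetical flattened hits; values copied from the barcode07 output
sample_hits = [
    ("6082e1b1", "Rhinovirus-C1", 335.0),
    ("6082e1b1", "Rhinovirus-A23", 195.0),
    ("6082e1b1", "Rhinovirus-A56", 193.0),
    ("7f2f0cf8", "Rhinovirus-A56", 315.0),
    ("7f2f0cf8", "Rhinovirus-A23", 263.0),
    ("7f2f0cf8", "Rhinovirus-C1", 195.0),
]

best = best_reference_per_read(sample_hits)
counts = tally_references(best)
print(counts)
```

Ties keep the first hit seen. A real assignment would probably also want a minimum score, or a margin over the runner-up, before trusting a read's best match.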


## Initialization and configuration

In [1]:
from Bio.Seq import Seq
from Bio import SeqIO
import matplotlib_inline.backend_inline
import os
import gzip
import pandas as pd
import RCUtils

# Get high-dpi output for retina displays
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

fastQBaseDir = "../20221204_2344_MN41817_FAV39017_1bf37150/fastq_pass/"
genomeDir = "refseq"

# Read in all the reference genomes (GenBank files in genomeDir)
genomes = []
for file in sorted(f for f in os.listdir(genomeDir) if f.endswith(".gb")):
    for genome in SeqIO.parse(os.path.join(genomeDir, file), "gb"):
        # Use the file name (minus extension) as a more descriptive name.
        # str.removesuffix requires Python 3.9 or later.
        genome.name = file.removesuffix(".gb")
        genomes.append(genome)
print(f"Read {len(genomes)} genomes")


Read 4 genomes


In [2]:
genomes[0]

SeqRecord(seq=Seq('CCAAAGTAGTTGGTCCCGTCCCGCATGCAACTTAGAAGCTTTGCACAAAGACCA...TAG'), id='DQ473497.1', name='Rhinovirus-A23', description='Rhinovirus A23, complete genome', dbxrefs=[])
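
Displaying `genomes[0]` shows only one record. To check every loaded reference at a glance, something like the following could be used. It is a sketch that relies only on the `name`, `id`, and `seq` attributes visible in the output above. The `summarize_references` helper is hypothetical, and the stand-in records (placeholder names, ids, and short sequences) exist only so the snippet runs without the GenBank files.

```python
from types import SimpleNamespace

def summarize_references(records):
    """Return (name, id, sequence length) for each record, sorted by name."""
    return sorted((r.name, r.id, len(r.seq)) for r in records)

# Stand-in records for illustration only; in the notebook, pass `genomes` instead
demo = [
    SimpleNamespace(name="ref-B", id="X2", seq="ACGTACGT"),
    SimpleNamespace(name="ref-A", id="X1", seq="ACGT"),
]

for name, rec_id, length in summarize_references(demo):
    print(f"{name:10s} {rec_id:6s} {length:6d} bp")
```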

## Read 

In [3]:
import RCUtils

# For each read in barcode07 with at least one hit, print the read and its hits
for read in RCUtils.getReads(fastQBaseDir, "barcode07"):
    hits = RCUtils.seqMatch(read, genomes)
    if hits:
        print(f'read {read.id} len={len(read.seq)}')
        for hit in hits:
            print(f'  {hit.target.name} [{hit.targetStart},{hit.targetEnd}]{hit.strand} read [{hit.readStart},{hit.readEnd}] score={hit.score}')

Reading: ../20221204_2344_MN41817_FAV39017_1bf37150/fastq_pass/barcode07/FAV39017_pass_barcode07_1d0a44b7_0.fastq.gz
read 6082e1b1-de83-41be-8e44-a5834f145c04 len=902
  Rhinovirus-C1 [26,423]+ read [(67, 462)] score=335.0
  Rhinovirus-A23 [84,483]+ read [(67, 467)] score=195.0
  Rhinovirus-A56 [128,554]+ read [(34, 462)] score=193.0
read 7f2f0cf8-a6e0-4f09-9517-63dc3d9559d9 len=800
  Rhinovirus-A56 [554,159]- read [(383, 774)] score=315.0
  Rhinovirus-A23 [478,84]- read [(383, 774)] score=263.0
  Rhinovirus-C1 [423,26]- read [(383, 774)] score=195.0
read 585e18bf-27e7-42f4-9115-eb94f21b7072 len=507
  Rhinovirus-C1 [421,26]- read [(68, 456)] score=297.0
  Rhinovirus-A23 [476,84]- read [(68, 456)] score=155.0
  Rhinovirus-A56 [552,159]- read [(68, 456)] score=146.0
read 10e40a60-5fc4-47cc-bc67-59f1e022a87b len=503
  Rhinovirus-C1 [26,421]+ read [(65, 452)] score=302.0
  Rhinovirus-A56 [144,552]+ read [(49, 452)] score=175.0
  Rhinovirus-A23 [84,476]+ read [(65, 452)] score=172.0
read e5d