# Off Target Analysis

Dig into some off-target PCR priming sequences to see if I can understand their cause. Questions to study include:
 * Do there tend to be many different off-target binding sites or just a few?
 * Do there tend to be some very closely aligned human matches for the primers that could be avoided by better assay design?
 * Do human matches tend to vary with the genome of the individual being sampled?

## Setup

In [2]:
%load_ext autoreload
%autoreload 1
%aimport RCUtils

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [57]:
from Bio import Align
from Bio import SeqIO
import matplotlib_inline.backend_inline
import os
import pandas as pd
import RCUtils
import random

MIN_SPAN = 150

qPCRprimers = RCUtils.readPrimers("qPCRPrimers.fasta", display=True)
ONTBarcodes = list(SeqIO.parse("ONTBarcodes.fasta", format="fasta"))

# Return a list of all primers that match the given primerIds
def getPrimersById(primerIds):
    return [primer for primer in qPCRprimers if primer.id in primerIds]

def flipSeqForBackwardsPrimer(hit, seq):
    # If we have a fwd primer in reverse orientation or a reverse primer in forward orientation
    # then we need to reverse complement the sequence to be consistent.
    if (hit.rev == hit.primer.id.endswith("-f")):
        return seq.reverse_complement()
    else:
        return seq

def stripBarcodes(read):
    #RCUtils.primer_hits_to_print = 5
    hits = RCUtils.computePrimerHits(read, ONTBarcodes)
    #RCUtils.primer_hits_to_print = 0
    for hit in hits:
        print(f" {hit.primer.id} {hit.start}-{hit.end} mr={hit.mr}")

    #aligner = Align.PairwiseAligner(mode='local', match_score=1, mismatch_score=-1, gap_score=-1)
    #a = aligner.align(read.seq, ONTBarcodes[9].seq, strand="-")[0]
    #print(f" Barcode at {a.coordinates[0][0]}-{a.coordinates[0][-1]} mr={a.score/len(a.query)}")
    #print(a)
    return read


# TODO: Filter out the barcodes / adapters
def getAmplicons(read, hits):
    # See if there may be a partial hit at the beginning of the read
    # Note that since we're studying off-target amplification here, there's no guarantee that the forward
    # primer is actually bound to the start of the amplicon.
    if (len(hits) > 0):
        seq = read.seq[:hits[0].end]
        if (len(seq) > MIN_SPAN):
            yield flipSeqForBackwardsPrimer(hits[0], seq)

    # Look for a pair of primers with enough space between them to potentially be an amplicon
    for i in range(0, len(hits) - 1):
        seq = read.seq[hits[i].start:hits[i+1].end]
        if (len(seq) > MIN_SPAN):
            yield flipSeqForBackwardsPrimer(hits[i], seq)

    # See if there may be a partial hit at the end of the read
    if (len(hits) > 0):
        seq = read.seq[hits[-1].start:]
        if (len(seq) > MIN_SPAN):
            yield flipSeqForBackwardsPrimer(hits[-1], seq)


Reading primers: qPCRPrimers.fasta
  ENTng-f (2 variations)
  ENTng-r
  ENTng-p (8 variations)
  ENTrc-f1
  ENTrc-f2
  ENTrc-r
  HRVma-f
  HRVma-r
  HRVma-p
  HRVkaV-fo (2 variations)
  HRVkaV-fi
  HRVkaV-r (768 variations)
  HRVka5-f
  HRVka5-ro
  HRVka5-ri
Read 791 primers


## Analyze the S59 HRV-Ka5 matches from RVP1/NB05

In [60]:
assayPrimers = getPrimersById(["HRVka5-f","HRVka5-ro"])
dirs = ["../RVP1/RVP1a-mixed/20230713_1659_MN41817_APC648_641ecf93/fastq_pass/barcode05/",
        "../RVP1/RVP1b-mixed/20230714_0739_MN41817_APC774_a5efdaf2/fastq_pass/barcode05/"]

def primerName(hit):
    return hit.primer.id + (" rev" if hit.rev else "")

reads = 0
hitCount = 0
readsWithHit = 0
readsWithTwoOrMoreHits = 0

ampliconSets = []

aligner = Align.PairwiseAligner(mode='local', match_score=1, mismatch_score=-1, gap_score=-1)
MATCH_THRESHOLD = 0.5

for read in RCUtils.readFastQDirs(dirs):
    reads = reads + 1
    print (f"Read {read.id} length {len(read.seq)}")
    strippedRead = stripBarcodes(read)
    hits = RCUtils.computePrimerHits(strippedRead, assayPrimers)
    for hit in hits:
        print (f" {primerName(hit)} {hit.start}-{hit.end} mr={hit.mr}")
        #print ("  " + read.seq[hit.start:hit.end])

    hitCount = hitCount + len(hits)
    if (len(hits) > 0):
        readsWithHit = readsWithHit + 1
    if (len(hits) > 1):
        readsWithTwoOrMoreHits = readsWithTwoOrMoreHits + 1

    amplicons = getAmplicons(strippedRead, hits)
    for amplicon in amplicons:
        found = False
        for i in range(0, len(ampliconSets)):
            (setSeq, count) = ampliconSets[i]
            mr = aligner.score(amplicon, setSeq) / min(len(amplicon), len(setSeq))
            print (f" amplicon mr={mr}")
            if (mr >= MATCH_THRESHOLD):
                # Keep the longest amplicon as representative of the set
                if (len(amplicon) > len(setSeq)):
                    setSeq = amplicon
                ampliconSets[i] = (setSeq, count + 1)
                found = True
                break
        if (not found):
            ampliconSets.append((amplicon, 1))
    print ("  spans: " + " ".join([str(len(amplicons)) for amplicon in amplicons]))
    print("")


print(f"{reads} reads")
print(f"{hitCount} hits")
print(f"{readsWithHit} reads with hits")
print(f"{readsWithTwoOrMoreHits} reads with two or more hits")
print(f"{len(ampliconSets)} amplicon sets")
for (amplicon, count) in ampliconSets:
    print (f"  {count}x {amplicon}")


Read dd9634e0-5fd3-47c6-830a-3d2d4528ee9a length 474
 NB05f 37-62 mr=0.92
 NB05r 450-473 mr=0.92
 HRVka5-ro 68-92 mr=0.96
  spans: 

Read 02c542a2-0ab6-4fc0-8943-d8e6e73437a9 length 517
 NB05f 37-59 mr=0.83
 NB05r 470-495 mr=0.96
 HRVka5-f rev 441-462 mr=1.0
 amplicon mr=0.6231527093596059
  spans: 

Read 78ab60d2-596d-41e6-9a2c-7afbe5454e25 length 491
 NB05f 34-58 mr=1.0
 HRVka5-f 66-87 mr=1.0
 HRVka5-ro rev 436-459 mr=1.0
 amplicon mr=0.7837150127226463
  spans: 

Read e8652454-d656-484e-b9bc-c3073f3fdccd length 483
 NB05f 34-58 mr=1.0
 HRVka5-f 66-87 mr=0.81
 HRVka5-ro rev 433-456 mr=1.0
 amplicon mr=0.7717948717948718
  spans: 

Read 55673050-c97b-448b-9437-6cb27bb6bcc4 length 472
 NB05f 34-56 mr=0.88
 HRVka5-f 64-85 mr=0.95
 HRVka5-ro rev 421-446 mr=0.91
 amplicon mr=0.7198952879581152
  spans: 

Read 9c8791e0-fbf6-4ee2-8bc6-f313400f91c3 length 511
 NB05f 33-56 mr=0.88
 NB05r 467-491 mr=1.0
 HRVka5-ro 64-87 mr=0.96
 HRVka5-f rev 438-459 mr=0.86
 amplicon mr=0.7544303797468355
  sp