## Adapter trimming and barcode demultiplexing with Porechop

Some ONT applications can take advantage of pooling (multiplexing) different samples, using barcoded sequence adapters to reduce the cost of the sequencing run. As a consequence, the output of the run is a mixture of reads, each tagged with the adapter of its sample. NanoDJ relies on [Porechop](https://github.com/rrwick/Porechop) to find and trim ONT adapters and to demultiplex barcoded ONT reads. Porechop commands and options are listed in its usage message:

In [None]:
!porechop -h

A basic Porechop command requires an input file (<font color='blue'>-i</font> option) and an output filename (after the '<font color='blue'>></font>' symbol). Porechop finds the adapters and writes the trimmed reads to the output.

Demultiplexing is done with the <font color='blue'>-b</font> BARCODE_DIR option instead of defining an output file for the trimmed reads. Reads are distributed into different bins (files) according to their barcodes, and these files are placed in the BARCODE_DIR directory. The user can also control the minimum match percentage for barcodes by changing the threshold (<font color='blue'>--barcode_threshold</font>), and can add further options as shown in the Porechop usage message.
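As a minimal sketch of the two modes described above, the helper below only assembles the argument list for a trimming run or a demultiplexing run; it does not run Porechop. The function name `build_porechop_command`, the file names `reads.fastq` and `trimmed.fastq`, and the threshold value of 80 are illustrative assumptions. The sketch uses Porechop's `-o` option to name the output file rather than the shell redirection mentioned above.

```python
import shlex


def build_porechop_command(input_path, output_path=None, barcode_dir=None,
                           barcode_threshold=None, threads=None):
    """Return a porechop argument list (hypothetical helper, not part of NanoDJ)."""
    # Trimming writes one output file; demultiplexing writes one file per bin.
    if (output_path is None) == (barcode_dir is None):
        raise ValueError("give exactly one of output_path or barcode_dir")
    cmd = ["porechop", "-i", input_path]
    if barcode_dir is not None:
        cmd += ["-b", barcode_dir]
    else:
        cmd += ["-o", output_path]
    if barcode_threshold is not None:
        cmd += ["--barcode_threshold", str(barcode_threshold)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Trimming only (file names are placeholders)
trim_cmd = build_porechop_command("reads.fastq", output_path="trimmed.fastq")

# Demultiplexing with a stricter barcode threshold (80 is an arbitrary example)
demux_cmd = build_porechop_command(
    "data/porechop/1_out.fastq",
    barcode_dir="data/porechop/demultiplexed",
    barcode_threshold=80,
    threads=2,
)

# Print shell-safe command lines that could be pasted after '!' in a cell
print(" ".join(shlex.quote(arg) for arg in trim_cmd))
print(" ".join(shlex.quote(arg) for arg in demux_cmd))
```

The resulting lists can be passed to `subprocess.run` on a system where Porechop is installed.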

Demultiplexing can also be performed by Albacore since version 1.0. In that case, the two algorithms frequently disagree on the most appropriate bin for a read. Porechop can perform its own demultiplexing on the Albacore output, placing the reads on which the two tools disagree in a bin called 'none'.
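The rule for disagreeing reads can be illustrated with a small sketch. This is not Porechop's implementation, only a toy model of the outcome described above: a read keeps its barcode when both tools agree and goes to 'none' otherwise. The function name `consensus_bin` and the read identifiers and calls are made up for the example.

```python
def consensus_bin(albacore_bin, porechop_bin):
    """Return the shared bin if both tools agree, otherwise 'none' (toy model)."""
    return albacore_bin if albacore_bin == porechop_bin else "none"


# Hypothetical per-read calls: (Albacore bin, Porechop bin)
calls = {
    "read_1": ("BC01", "BC01"),  # agreement: stays in BC01
    "read_2": ("BC02", "BC03"),  # disagreement: goes to 'none'
    "read_3": ("none", "BC01"),  # only one tool found a barcode: 'none'
}

bins = {read_id: consensus_bin(a, p) for read_id, (a, p) in calls.items()}
print(bins)
```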


In [None]:
#Porechop example data: download link available on GitHub
!porechop -i data/porechop/1_out.fastq -b data/porechop/demultiplexed --threads 2

The following code builds a plot that shows the run yield for each barcode. The x and y axes show the total number of reads and the total number of bases, respectively. The size of each point corresponds to the average read length of the reads for that specific barcode.
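To make the three plotted quantities concrete, here is a small worked sketch on made-up read lengths. The helper names `summarize_lengths` and `marker_size` and the example lengths are illustrative; the scaling rule `3.2**(n/1000)` is the one used by the plotting code in this notebook.

```python
def summarize_lengths(read_lengths):
    """Return (number of reads, total bases, average read length)."""
    n_reads = len(read_lengths)
    total_bases = sum(read_lengths)
    # Guard against an empty bin so the average never divides by zero
    avg_length = total_bases / n_reads if n_reads else 0.0
    return n_reads, total_bases, avg_length


def marker_size(avg_length):
    """Point size that grows exponentially with the average read length."""
    return 3.2 ** (avg_length / 1000)


# Made-up read lengths for a single barcode
example_lengths = [1200, 800, 3000, 1000]
n_reads, total_bases, avg_length = summarize_lengths(example_lengths)
size = marker_size(avg_length)

print(n_reads, total_bases, avg_length, round(size, 2))
```

Because the size is exponential in the average length, a difference of 1000 bases in average read length multiplies the point size by 3.2.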

In [None]:
#!/usr/bin/python

import sys
import matplotlib
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
from Bio import SeqIO

In [None]:
matplotlib.rcParams['figure.figsize'] = (15, 10)

barcode_dir = 'data/porechop/demultiplexed/'
#Modify this list with an element per barcode (6 in the example)
barcode_files = ['BC01', 'BC02', 'BC03', 'BC04', 'BC05', 'BC06']

avg_lenghts = []
bases_cnt = []
read_cnt = []

#Bin files parser
def read_bin(filename):
    read_lenghts = []
    count = 0
    for seq_record in SeqIO.parse(filename, 'fastq'):
        read_lenghts.append(len(seq_record.seq))
        count += 1

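    #An empty bin has no reads: record an average of 0 instead of dividing by zero
    avg_lenghts.append(sum(read_lenghts)/len(read_lenghts) if read_lenghts else 0)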
    bases_cnt.append(sum(read_lenghts))
    read_cnt.append(count)

for bin_file in barcode_files:
    read_bin(barcode_dir + str(bin_file) + '.fastq')

#One colour per barcode (6 + none colour in the example)
colors = ['#F15854', '#5DA5DA', '#FAA43A', '#60BD68', '#F17CB0', '#B276B2', '#DECF3F']

#Size of each point depending on the average read lengths
s = [3.2**(n/1000) for n in avg_lenghts]

#Draw the points and annotations
fig, ax = plt.subplots()
for index, barcode in enumerate(barcode_files, start=0):
    ax.scatter(read_cnt[index],bases_cnt[index],s=s[index],c=colors[index],label=barcode, alpha=0.7, linewidth=3)
    ax.annotate(int(avg_lenghts[index]), (read_cnt[index] - 7,bases_cnt[index] - 10**4.5))

#Legend
handles = [mpatches.Patch(color=color, label=barcode) for color, barcode in zip(colors, barcode_files)]
#The font size is set through prop; fontsize is ignored when prop is given
ax.legend(handles=handles, loc=4, prop={'size': 15}, frameon=True)

#Title and axes text
plt.title('Yield per barcode and average read length', fontsize=18)
plt.xlabel('Number of reads', fontsize=12)
plt.ylabel('Total number of bases', fontsize=12)

ax.grid(True)

plt.show()