# Processing Mapping Sequencing data

© 2022 Tom Röschinger. This work is licensed under a <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License CC-BY 4.0</a>. All code contained herein is licensed under an <a href="https://opensource.org/licenses/MIT">MIT license</a>.

***

The first step is to process the sequencing data for the mapping step, in which we map each promoter variant to the random barcode in the same DNA construct. Because of the size of the DNA element, we usually obtain paired-end sequencing data. Here we walk through how we process this data.

In [1]:
# Import packages
using wgregseq, FASTX, DataFrames, CSV, BioSequences, CairoMakie

# Set plotting style
wgregseq.plotting_style.default_makie!()

For this notebook, we provide a subset of the data that is small enough to process interactively. To process full files, we created scripts that are explained below. The files are stored in the following directory: `data/sequencing/20220514_mapping/`. This data was obtained on a HiSeq machine and was demultiplexed in advance.



## Quality filtering

The first step is to filter out low-quality reads. We use the software [`fastp`](https://github.com/OpenGene/fastp) (version 0.2.23), which is run from the command line. Here we show how the command can be used. In this run, read 1 contains the sequence of the promoter variant, while read 2 contains the corresponding barcode. Each promoter variant is 160 bp long, and read 1 has 11 extra bases trailing the promoter variant; these are trimmed so that the read contains only the desired sequence. The random barcode is 20 bp long, and read 2 also has 11 trailing bases, which are trimmed as well.
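
To make the trimming arithmetic concrete, here is a minimal sketch on made-up placeholder reads. The helper `trim_tail` is hypothetical and only mimics the effect of removing a fixed number of bases from the end of a read; it is not part of `fastp` or `wgregseq`.

```julia
# Hypothetical helper: drop the last n characters of a read
trim_tail(seq::AbstractString, n::Integer) = chop(seq, tail=n)

promoter_len = 160  # promoter variant length, from the text
barcode_len = 20    # barcode length, from the text
n_trailing = 11     # extra bases at the end of each read, from the text

# Placeholder reads with the untrimmed lengths (not real data)
read1 = repeat("A", promoter_len) * repeat("N", n_trailing)
read2 = repeat("C", barcode_len) * repeat("N", n_trailing)

trimmed1 = trim_tail(read1, n_trailing)
trimmed2 = trim_tail(read2, n_trailing)

println(length(read1), " -> ", length(trimmed1))  # 171 -> 160
println(length(read2), " -> ", length(trimmed2))  # 31 -> 20
```

Under these assumptions, untrimmed reads of 171 and 31 bases are reduced to exactly the promoter and barcode lengths.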



In [3]:
# Find the directory containing this notebook
dir = @__DIR__

# Define input directory (inside the parent of the notebook directory)
indir = joinpath(dirname(dir), "data/sequencing/20220514_mapping/")
# Define output directory
outdir = joinpath(dirname(dir), "data/filtered_sequencing/20220514_mapping/")

# Define file names for input files and output files
READ1 = indir * "110_R1_subset.fastq"
READ2 = indir * "110_R2_subset.fastq"
OUT1 = outdir * "110_R1_subset.fastq"
OUT2 = outdir * "110_R2_subset.fastq"
html = outdir * "110_fastp_report.html"
JSON = outdir * "110_fastp_report.json"

# Run fastp command
run(
    `bash -c "source activate fastp; fastp --in1 $READ1 --in2 $READ2 --out1 $OUT1 --out2 $OUT2 --trim_tail1 '11' --trim_tail2 '11' --verbose --disable_length_filtering --html $html --json $JSON --thread '6' "`
)

[12:02:53] start to load data of read1 
[12:02:53] start to load data of read2 
[12:02:53] Read2: loading completed with 101 packs 
[12:02:53] Read1: loading completed with 101 packs 
[12:02:53] thread 6 data processing completed 
[12:02:53] thread 6 finished 
[12:02:53] thread 5 data processing completed 
[12:02:53] thread 5 finished 
[12:02:53] thread 3 data processing completed 
[12:02:53] thread 3 finished 
[12:02:53] thread 4 data processing completed 
[12:02:53] thread 4 finished 
[12:02:53] thread 1 data processing completed 
[12:02:53] thread 1 finished 
[12:02:53] thread 2 data processing completed 
[12:02:53] thread 2 finished 
[12:02:53] /Users/tomroeschinger/git/1000_genes_ecoli/data/filtered_sequencing/20220514_mapping/110_R2_subset.fastq writer finished 
[12:02:53] /Users/tomroeschinger/git/1000_genes_ecoli/data/filtered_sequencing/20220514_mapping/110_R1_subset.fastq writer finished 
[12:02:53] start to generate reports
 
Read1 before filtering:
total reads: 100000
total

Process(`bash -c "source activate fastp; fastp --in1 /Users/tomroeschinger/git/1000_genes_ecoli/data/sequencing/20220514_mapping/110_R1_subset.fastq --in2 /Users/tomroeschinger/git/1000_genes_ecoli/data/sequencing/20220514_mapping/110_R2_subset.fastq --out1 /Users/tomroeschinger/git/1000_genes_ecoli/data/filtered_sequencing/20220514_mapping/110_R1_subset.fastq --out2 /Users/tomroeschinger/git/1000_genes_ecoli/data/filtered_sequencing/20220514_mapping/110_R2_subset.fastq --trim_tail1 '11' --trim_tail2 '11' --verbose --disable_length_filtering --html /Users/tomroeschinger/git/1000_genes_ecoli/data/filtered_sequencing/20220514_mapping/110_fastp_report.html --json /Users/tomroeschinger/git/1000_genes_ecoli/data/filtered_sequencing/20220514_mapping/110_fastp_report.json --thread '6' "`, ProcessExited(0))

The output is two `fastq` files that contain sequences passing the quality filters. The last 11 bases were cut from both reads, such that the reads only contain the promoter or barcode of interest. We can have a brief look at the sequencing files.

In [60]:
open(OUT1) do file
    lines = readlines(file)
    display(lines[1:8])
end

8-element Vector{String}:
 "@VH00472:8:AAAYTWTM5:1:1101:19670:1000 1:N:0:ATGGCT"
 "GGCGACTGCCGTTTGATCAGTCATGTTTTAA" ⋯ 98 bytes ⋯ "ATACTGGTTCTCCACAAGGGATGCAAAAGAA"
 "+"
 ";;CCCCCCCCCCCCCCCCCCCCCCCCCCCCC" ⋯ 98 bytes ⋯ "CCCCCCCCC;;CCCC;CC-CCCCCCC;CC-C"
 "@VH00472:8:AAAYTWTM5:1:1101:20125:1000 1:N:0:ATGGCT"
 "CAGCCGGGCGAAGATATAGCCAAAACGGCGG" ⋯ 98 bytes ⋯ "ATGACTACATCGCCAGGCGGCATCCCCACCG"
 "+"
 "-CCCCCCCCCCCC;CCCCCCCCCCCCCCCCC" ⋯ 98 bytes ⋯ "CCCCCCCCCCCCCCCC;;CCC;CCCCCC-CC"

In [62]:
open(OUT2) do file
    lines = readlines(file)
    display(lines[1:8])
end

8-element Vector{String}:
 "@VH00472:8:AAAYTWTM5:1:1101:19670:1000 2:N:0:ATGGCT"
 "CCCCACACCCTGGTGAGAGC"
 "+"
 "CCCCCCCCCCC;CCC;CCCC"
 "@VH00472:8:AAAYTWTM5:1:1101:20125:1000 2:N:0:ATGGCT"
 "TTGCACTTAACTTGGTCACC"
 "+"
 "CCCCCCCCCCCCCCCCCCCC"
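
Each record in these files spans four lines: a header starting with `@`, the sequence, a `+` separator, and a quality string with one character per base. As a minimal sketch of how the quality string can be read, the snippet below decodes it into Phred scores. It assumes the common Phred+33 encoding, which the notebook does not state explicitly, and the helper names `phred_scores` and `error_probability` are hypothetical.

```julia
# Decode a FASTQ quality string, ASSUMING Phred+33 encoding
phred_scores(qual::AbstractString) = [Int(c) - 33 for c in qual]

# Error probability implied by a Phred score Q: p = 10^(-Q/10)
error_probability(q::Integer) = 10.0^(-q / 10)

# Quality string of the first barcode read shown above
qual = "CCCCCCCCCCC;CCC;CCCC"
scores = phred_scores(qual)

println(unique(scores))   # distinct scores in this read: [34, 26]
println(minimum(scores))  # lowest score in this read: 26
println(error_probability(minimum(scores)))
```

Under this assumption, `C` corresponds to a score of 34 and `;` to a score of 26.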

In [71]:
# Define arrays
promoters = String[]
barcodes = String[]

# Open file for promoters
open(OUT1) do file
    # Read lines
    lines = readlines(file)
    # Iterate through lines
    for (i, line) in enumerate(lines)
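        # The sequence is the second line of each four-line FASTQ record
        if i % 4 == 2
            # Add promoter to the list
            push!(promoters, line)
        end
    end
end

# Open file for barcodes
open(OUT2) do file
    # Read lines
    lines = readlines(file)
    # Iterate through lines
    for (i, line) in enumerate(lines)
        # Keep only the sequence line of each record
        if i % 4 == 2
            # Add barcode to the list
            push!(barcodes, line)
        end
    end
end

# Combine barcodes and promoters into one table; rows are paired by read order
df_map = DataFrame(barcode=barcodes, promoter=promoters)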

Row,barcode,promoter
,String,String
1,CCCCACACCCTGGTGAGAGC,GGCGACTGCCGTTTGATCAGTCATGTTTTAAACTGAGGCACATCAACGCCCTATGGCTCGTAACGCCAACCTTTTGCGGAAGCGGCTTCTGCTCGAATCCGAAATAATTTTGTAGTTTGATCGCGCTAAATACTGGTTCTCCACAAGGGATGCAAAAGAA
2,TTGCACTTAACTTGGTCACC,CAGCCGGGCGAAGATATAGCCAAAACGGCGGAAGCGCTGGCTATATGGTTCTTGCAGGTATCCATAGTCATGATTCACATGCGCGCGATATTGCCGTTCAATATAAGCCCGCCGCAGAAACGACGATTTATGACTACATCGCCAGGCGGCATCCCCACCG
3,GGATTGGCTAATTGTAGCGT,CCGTGCACAACAATGTCCTGGCAAAAGTCTTACTGTGACGGAAAACGAACGCCACGCAAACCTGACCGCACAAATGGGGAGTGCTTTTCTGTGCTTAGCGGTTAGAAAAGCCTTATGACTATTTCTGCAGTTTACAATGTTGGAGATATTAATAAGTCTG
4,CACGTAAATCCTACTGGAAT,GCCTGGATAACAATGTCATAGCAAAAGTCTTATTGTGAGGGAAATCGCACACCACGCGACGCTGACCGCACAAAAGGGGAGGGCTTGTCTGTGCTTAGCGGTTAGAAGAGTCTTGAAGATATCTGGAGTTTAACATGTTAGAGTTATTAAAAAGTCTGGG
5,CTGCAGGCGGTTTATGGGTG,GCCAGCGTAACTATGTCCTGGCCAAAGTCTTATTGTGCCGGAAAACGGACGCCACGTAAAGCAGACCGCACAAAAGGGGTGTGCTGTTTTGTGCTTAGCGGTTAGAATAGTCTGATGACTATATCTGGAATTGCCCACGTTAGAGTTATTAACAAGTCTG
6,CTCACACTGTACCGGGTGGC,CATGCCAACAGTGCCCGGAGAGGCTAAATCGTGCCAGATGGCCATGCCCAGCTCTGCTAACACCATATAGCCGCCTGTGTTGTAATGATAACGTTTCGCGGCTATTCATGAGTGGTCTACAGCCACGATTAGCCCCCGTGGTCTTGTCAGGTGCATACCT
7,ATCGCCTACAGAATCTAGCA,GCACCCTATGCGCTATCTTCCCGAACCCGGATCACACTCTGCTGCCCGGTATGTTCGTGAGTGCACTTCTGGAAGAGGGTCTTATTCCAAATGCTATTTAAGGCCCGCAACAGGGGGTAATCCGTACGCCGCGTGGCGATGACACCCTACTGGTAGCTGG
8,GCGACACCGTAAGTTCATGG,CGAGGATGTGTTGGCGCGTATCTTGCGCTTCGTGTTTGGTTGTTCGTGTGAAATTTTCGTGAATTTACACGCGATAAATTTACATACAGTTGTGAATGTATGTACCATAACACAACGATAATATAAACGCAGCATTGAGTTTATTAACTTATGACCATTG
9,TTAAACCGATATGTTGGTAG,GCGGAAGGTAACTTCTTCTGCAAAATCATTGATTACGTAAAATTAATGTTCTATCACTGGTTTGCTTAAAATTTAAACACTTGAAAGTGTAATCACCGTCCGCATATACTAAACATTAGTAAAAAACTCCCGCCTTTAGGCGGGAGTTTCTATTAAATTA
10,GTGATTTCGTCTAAAGTCTT,CGTGGCTGGACATTAACAATGCTTAAGCCAGACTAGTCTCATGTCTATATCTGTGGTTGACCATGTTCGGGTTATTAAAAAGTCTGGGGCCCTGCATGGGTCTGTCTATTGCGTAAGGACAGTAGCAAGGGCGATAACACCCATAAACCGCCTGCAGGAA
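
Building the table this way pairs read 1 and read 2 purely by their position in the two files. A minimal sketch of a sanity check for that assumption is to compare the read identifiers in the headers. The helper `read_id` is hypothetical, and the headers are the ones displayed earlier in this notebook.

```julia
# Hypothetical helper: the part of a FASTQ header before the first space
# identifies the cluster and is shared by both mates of a pair
read_id(header::AbstractString) = first(split(header, ' '))

# First two headers from each filtered file, as displayed above
headers_r1 = [
    "@VH00472:8:AAAYTWTM5:1:1101:19670:1000 1:N:0:ATGGCT",
    "@VH00472:8:AAAYTWTM5:1:1101:20125:1000 1:N:0:ATGGCT",
]
headers_r2 = [
    "@VH00472:8:AAAYTWTM5:1:1101:19670:1000 2:N:0:ATGGCT",
    "@VH00472:8:AAAYTWTM5:1:1101:20125:1000 2:N:0:ATGGCT",
]

# Pairing by position is only valid if the identifiers agree record by record
in_sync = length(headers_r1) == length(headers_r2) &&
    all(read_id(a) == read_id(b) for (a, b) in zip(headers_r1, headers_r2))

println(in_sync)
```

For full files, the same comparison could be applied to every header line (lines with `i % 4 == 1`) before the two sequence lists are combined.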


In [75]:
# Count how often each unique barcode-promoter pair occurs
df_map_unique = combine(groupby(df_map, [:barcode, :promoter]), nrow => :counts)


Row,barcode,promoter
,String,String
1,CCCCACACCCTGGTGAGAGC,GGCGACTGCCGTTTGATCAGTCATGTTTTAAACTGAGGCACATCAACGCCCTATGGCTCGTAACGCCAACCTTTTGCGGAAGCGGCTTCTGCTCGAATCCGAAATAATTTTGTAGTTTGATCGCGCTAAATACTGGTTCTCCACAAGGGATGCAAAAGAA
2,TTGCACTTAACTTGGTCACC,CAGCCGGGCGAAGATATAGCCAAAACGGCGGAAGCGCTGGCTATATGGTTCTTGCAGGTATCCATAGTCATGATTCACATGCGCGCGATATTGCCGTTCAATATAAGCCCGCCGCAGAAACGACGATTTATGACTACATCGCCAGGCGGCATCCCCACCG
3,GGATTGGCTAATTGTAGCGT,CCGTGCACAACAATGTCCTGGCAAAAGTCTTACTGTGACGGAAAACGAACGCCACGCAAACCTGACCGCACAAATGGGGAGTGCTTTTCTGTGCTTAGCGGTTAGAAAAGCCTTATGACTATTTCTGCAGTTTACAATGTTGGAGATATTAATAAGTCTG
4,CACGTAAATCCTACTGGAAT,GCCTGGATAACAATGTCATAGCAAAAGTCTTATTGTGAGGGAAATCGCACACCACGCGACGCTGACCGCACAAAAGGGGAGGGCTTGTCTGTGCTTAGCGGTTAGAAGAGTCTTGAAGATATCTGGAGTTTAACATGTTAGAGTTATTAAAAAGTCTGGG
5,CTGCAGGCGGTTTATGGGTG,GCCAGCGTAACTATGTCCTGGCCAAAGTCTTATTGTGCCGGAAAACGGACGCCACGTAAAGCAGACCGCACAAAAGGGGTGTGCTGTTTTGTGCTTAGCGGTTAGAATAGTCTGATGACTATATCTGGAATTGCCCACGTTAGAGTTATTAACAAGTCTG
6,CTCACACTGTACCGGGTGGC,CATGCCAACAGTGCCCGGAGAGGCTAAATCGTGCCAGATGGCCATGCCCAGCTCTGCTAACACCATATAGCCGCCTGTGTTGTAATGATAACGTTTCGCGGCTATTCATGAGTGGTCTACAGCCACGATTAGCCCCCGTGGTCTTGTCAGGTGCATACCT
7,ATCGCCTACAGAATCTAGCA,GCACCCTATGCGCTATCTTCCCGAACCCGGATCACACTCTGCTGCCCGGTATGTTCGTGAGTGCACTTCTGGAAGAGGGTCTTATTCCAAATGCTATTTAAGGCCCGCAACAGGGGGTAATCCGTACGCCGCGTGGCGATGACACCCTACTGGTAGCTGG
8,GCGACACCGTAAGTTCATGG,CGAGGATGTGTTGGCGCGTATCTTGCGCTTCGTGTTTGGTTGTTCGTGTGAAATTTTCGTGAATTTACACGCGATAAATTTACATACAGTTGTGAATGTATGTACCATAACACAACGATAATATAAACGCAGCATTGAGTTTATTAACTTATGACCATTG
9,TTAAACCGATATGTTGGTAG,GCGGAAGGTAACTTCTTCTGCAAAATCATTGATTACGTAAAATTAATGTTCTATCACTGGTTTGCTTAAAATTTAAACACTTGAAAGTGTAATCACCGTCCGCATATACTAAACATTAGTAAAAAACTCCCGCCTTTAGGCGGGAGTTTCTATTAAATTA
10,GTGATTTCGTCTAAAGTCTT,CGTGGCTGGACATTAACAATGCTTAAGCCAGACTAGTCTCATGTCTATATCTGTGGTTGACCATGTTCGGGTTATTAAAAAGTCTGGGGCCCTGCATGGGTCTGTCTATTGCGTAAGGACAGTAGCAAGGGCGATAACACCCATAAACCGCCTGCAGGAA
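
To illustrate what the grouping step computes, here is a minimal sketch that counts unique pairs with a plain dictionary instead of `DataFrames`. The sequences are short made-up placeholders, not real data, and the variable names are hypothetical.

```julia
# Toy barcode-promoter pairs (made-up short sequences, not real data)
pairs_seen = [
    ("AAAA", "GGCC"),
    ("AAAA", "GGCC"),
    ("TTTT", "CCGG"),
    ("AAAA", "GGCA"),  # same barcode, different promoter sequence
]

# Count occurrences of each unique (barcode, promoter) pair
counts = Dict{Tuple{String,String},Int}()
for pair in pairs_seen
    counts[pair] = get(counts, pair, 0) + 1
end

for (pair, n) in counts
    println(pair, " => ", n)
end
```

In this toy example the barcode `AAAA` appears with two different promoter sequences, so it contributes two rows with separate counts, just as it would in the grouped table.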
