# Create key to match sequences to barcodes

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-by 4.0 License](https://creativecommons.org/licenses/by/4.0/). 

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [Reg-Seq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

In [1]:
import regseq.create_key
import pandas as pd
import numpy as np

After sequencing the prepared library, the data has to be brought into the right format for the next steps. This includes merging paired-end reads using the software `FLASH` and removing reads with low quality scores using `FastX`. We won't go into detail on how to use these tools here. For a tutorial on using them, as well as links to downloadable files, refer to the *Computational Analysis of the "Mapping" Run (Building the Codex)* chapter of the [Reg-Seq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki/3.-Sequencing).


In this notebook we perform the mapping step on the quality-filtered sequences. The mapping is used in the next step to count sequences in the DNA and RNA datasets obtained from growth experiments (more on this in the notebook `4_1_match_data.ipynb`). To build it, we need to find the barcode and the sequences we generated in previous steps, and filter for unique barcode/sequence mappings. First, we check the length of the sequences and discard those whose length differs from the expected length, since this is likely caused by insertions or deletions. We also discard sequences with unresolved bases, i.e., an `N` in the sequence. <br><br>
Sequencing errors can lead to ambiguous mappings. Therefore, if a barcode maps to multiple sequences which are very similar, the most counted sequence is taken as the consensus sequence. Finally, we compare the sequences to the wild type genome to identify which gene each sequence belongs to.

In this notebook we use a file we obtained after processing experimental data using the software discussed above. 

In [2]:
data_file = "../data/sequencing_data/mappingseqs.fastq"
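For readers unfamiliar with the FASTQ format, the following is a minimal sketch of how the sequences can be pulled out of such a file. It assumes the common layout in which every record spans exactly four lines (header, bases, separator, quality string). The helper `read_fastq_sequences` and the two example records are made up for illustration and are not part of `regseq`.

```python
import io


def read_fastq_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record.

    Assumes unwrapped records: header, bases, '+', quality string.
    """
    for line_number, line in enumerate(handle):
        # The second line of every record (index 1 modulo 4) holds the bases.
        if line_number % 4 == 1:
            yield line.strip()


# Tiny in-memory example with two made-up records.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACNT\n+\nIIII\n"
)
sequences = list(read_fastq_sequences(example))
print(sequences)
```

To try this on a real file, pass an open file handle instead of the `io.StringIO` object.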

There will be some number of base pairs after the barcode in each sequence. For our sequencing setup we have either 24 bp or 20 bp at the end of the sequences: if the total sequence length is 299 bp, the trailing sequence is 24 bp long, and if the total length is 295 bp, it is 20 bp long. Additionally, there are 20 bp preceding the sequence.
<br><br>
As a first step, we check that the sequences have one of the two lengths. If the sequences from your experiment have different lengths, simply change the arguments of the function below.

In [3]:
correct_seq = regseq.create_key.check_length(data_file, optimal_lengths=[299, 295], trailing_lengths=[24, 20], starting_length=20)

optimal length is 295
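To make the length check concrete, here is a minimal sketch of the idea, not the `regseq` implementation. The helper `filter_and_trim` is hypothetical: it keeps reads whose length is one of the accepted values, drops reads containing an `N`, and trims the leading and trailing bases described above. The real function also reports an "optimal length", which this sketch does not attempt to reproduce.

```python
import pandas as pd


def filter_and_trim(reads, lengths=(299, 295), trailing=(24, 20), start=20):
    """Keep reads of an accepted length without 'N' and trim both ends.

    Hypothetical helper for illustration only.
    """
    # Map each accepted total length to its trailing length.
    trailing_for_length = dict(zip(lengths, trailing))
    kept = {}
    for index, read in enumerate(reads):
        if len(read) not in trailing_for_length or "N" in read:
            continue  # wrong length (likely an indel) or unresolved base
        end = len(read) - trailing_for_length[len(read)]
        kept[index] = read[start:end]
    return pd.Series(kept, dtype=object)


# Made-up reads: two valid lengths, one too short, one containing an N.
reads = ["A" * 295, "C" * 299, "G" * 290, "A" * 100 + "N" + "A" * 194]
trimmed = filter_and_trim(reads)
print(trimmed.str.len())
```

Both accepted lengths end up at 255 bp after trimming (295 - 20 - 20 and 299 - 20 - 24), and the original read positions are kept as the index.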


The output of this function is a pandas Series containing the sequences that have the required length, with the extra bp at the beginning and end removed. Let's have a look at the output.

In [4]:
# Increase displaying width
pd.options.display.max_colwidth = 350
print("Sequence length: {}\n".format(len(correct_seq.iloc[0])))
correct_seq.head()

Sequence length: 255



1     ACGTACGAAATCGACGCCGCCACCACGACTGCCACCGAGGATGAGAACTGCCTTAGACGTAAAAGCGCCCATAAGGACACCTTGATTAATTGTGTAACATGCATTGCAAACCTGTTTTAACTATCTGTTAAGAGGTTTTGTAATGGTCACTATAAAACAGCCGGGATTCAGTGATTGAACTATTAGGCTTCTCCTCAGCGTTAATCACTGGCCGTCGTTTTACATGACTGACTGACACCGGTCGACTCATCGAAT
5     ACGGTCGATAGCGGCACCGATACTACTGCTGCCACCGAGGATGAGAACTGTCTTACCTGTAAAAGCGCCCATCAGCACCCCTTGATTTATTATGTAACATCCATTACAAACCTAATTTAACTTCCTGTCAATAGGTTTTGCAATGGTCACTAAAAAATAACCGGGATTCAGTGATTGAACTATTAGGCTTCTCCTCAGCGTTAATCACTGGCCGTCGTTTTACATGACTGACTGAGTTCACCGGTTCAATATTTG
17    AAGTACGATAGCGGCACCGACACCAAGACTGCTACCGAGGATGAGAACTGTCTCACATGTCAAAGCGCCCATAAGGATTCCTTGATTTCTTATGTAACATGCATTACAAAACTGTTTTAAAGTTCTGACAACACGTTTTGTAATGGTCACTTAAAAACAACCGGGATTCAGTGATTGAACTATTAGGCTTCTCCTCAGCGTTAATCACTGGCCGTCGTTTTACATGACTGACTGATTCTAGGCTTGAAAGCTAGG
25    ACGTACGATAGCGGCACCGAAACCATGCCCGGCACCGAAGAGGAGAACTGTCTTACCTGTAAAAACGCTCATCAGGACTCCTTGATTTATTATGTGACACACCTTACAACACTGTTTTGACTTTCTGTCAACAGGTTTGGTAACGGTCACACAAAAACAACCGGGAGTCAGTGATTGAACTATTAGGCTGCACCTCAGCGTTAATCAC

Looking at the sequences, we can see that they all have the same length. The initial 20 bp and the 20 bp at the end were removed, hence the length of 255 bp instead of 295 bp.

Using the sequences with the right length, we stitch the part of the sequence we are interested in together with the barcode. There are still some bases between the sequences we generated and the barcode, which are removed in the next step. This is needed for the following step, which looks for unique barcode/sequence mappings.

In [5]:
stitched = regseq.create_key.stitch_barcode_sequence(correct_seq)
stitched.head()

1     ACGTACGAAATCGACGCCGCCACCACGACTGCCACCGAGGATGAGAACTGCCTTAGACGTAAAAGCGCCCATAAGGACACCTTGATTAATTGTGTAACATGCATTGCAAACCTGTTTTAACTATCTGTTAAGAGGTTTTGTAATGGTCACTATAAAACAGCACCGGTCGACTCATCGAAT
5     ACGGTCGATAGCGGCACCGATACTACTGCTGCCACCGAGGATGAGAACTGTCTTACCTGTAAAAGCGCCCATCAGCACCCCTTGATTTATTATGTAACATCCATTACAAACCTAATTTAACTTCCTGTCAATAGGTTTTGCAATGGTCACTAAAAAATAAGTTCACCGGTTCAATATTTG
17    AAGTACGATAGCGGCACCGACACCAAGACTGCTACCGAGGATGAGAACTGTCTCACATGTCAAAGCGCCCATAAGGATTCCTTGATTTCTTATGTAACATGCATTACAAAACTGTTTTAAAGTTCTGACAACACGTTTTGTAATGGTCACTTAAAAACAATTCTAGGCTTGAAAGCTAGG
25    ACGTACGATAGCGGCACCGAAACCATGCCCGGCACCGAAGAGGAGAACTGTCTTACCTGTAAAAACGCTCATCAGGACTCCTTGATTTATTATGTGACACACCTTACAACACTGTTTTGACTTTCTGTCAACAGGTTTGGTAACGGTCACACAAAAACAACGGATCTAACCCCCGACTTT
29    ACGTACGATAGCGGCCCCGATATCACGACTGCCACCGAGGAGGAGAACTGTTTTACCTGGAAGTGCGCCCATAAGGATTCTTAGACCTATTATGTAACATGCATCACAAAACTGTTTTAACTTTCTGGCAACCGGTTTTGTAATTGTCACAAAAAAGCAAGCACACTTCTCGCGTGAAAG
dtype: object

We are now left with sequences of length 180 bp, of which 160 bp are the sequences we generated and 20 bp are the barcodes. Next, we check that the barcode/sequence mappings are unique.
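The stitching can be pictured as simple string slicing. The sketch below assumes that the generated sequence occupies the first 160 bp of the trimmed read and that the barcode occupies the last 20 bp, which is consistent with the lengths stated here but is not taken from the `regseq` source. The function `stitch` and the example read are made up.

```python
import pandas as pd

SEQUENCE_LENGTH = 160  # generated sequence, as stated in the text
BARCODE_LENGTH = 20    # barcode, as stated in the text


def stitch(read):
    """Join the first 160 bp and the last 20 bp, dropping the bases in between.

    The positions are an assumption made for this illustration.
    """
    return read[:SEQUENCE_LENGTH] + read[-BARCODE_LENGTH:]


# Made-up 255 bp read: 160 bp sequence, 75 bp spacer, 20 bp barcode.
trimmed_example = pd.Series({1: "A" * 160 + "G" * 75 + "C" * 20})
stitched_example = trimmed_example.apply(stitch)
print(len(stitched_example.iloc[0]))
```

Splitting a stitched string back into its parts is then a matter of taking `s[:160]` for the sequence and `s[160:]` for the barcode.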

In [6]:
barcodes, counts, seq_tag_df = regseq.create_key.check_barcode_uniqueness(stitched)

Since sequencing errors can cause a barcode to be mapped to supposedly different sequences, we check the sequences for this possibility.

In [7]:
seq_tag_df = regseq.create_key.check_rare_barcode_errors(barcodes, counts, seq_tag_df)

number of good sequencing counts 10450
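The two steps above can be sketched as follows. This is an illustration of the idea only: the helper `resolve_barcodes`, the mismatch threshold `max_mismatches`, the choice to sum the counts, and the choice to drop barcodes that map to dissimilar sequences are all assumptions and may differ from what `regseq` does.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def resolve_barcodes(pairs, max_mismatches=2):
    """Map each barcode to a consensus sequence and a read count.

    pairs is an iterable of (barcode, sequence) tuples. A barcode is kept
    only if all of its sequences are within max_mismatches (a hypothetical
    threshold) of the most counted one.
    """
    by_barcode = defaultdict(Counter)
    for barcode, sequence in pairs:
        by_barcode[barcode][sequence] += 1

    resolved = {}
    for barcode, seq_counts in by_barcode.items():
        consensus, _ = seq_counts.most_common(1)[0]
        if all(hamming(consensus, s) <= max_mismatches for s in seq_counts):
            resolved[barcode] = (consensus, sum(seq_counts.values()))
    return resolved


# Made-up reads: "AAAA" has one near-duplicate, "CCCC" is ambiguous.
pairs = (
    [("AAAA", "ACGTACGT")] * 3
    + [("AAAA", "ACGTACGA")]
    + [("CCCC", "ACGTACGT"), ("CCCC", "TTTTTTTT")]
    + [("GGGG", "GGGGCCCC")]
)
resolved = resolve_barcodes(pairs)
print(resolved)
```

In this toy input, the barcode `AAAA` is rescued because its second sequence differs by a single base, while `CCCC` is discarded because its two sequences are too different to be explained by a sequencing error.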


Having the unique barcode/sequence mapping in hand, we can check which gene each sequence originated from. For this we need the file we generated in the first step of the protocol, the list of gene names and corresponding wild type sequences. We choose the gene which is most similar to the observed sequence, i.e., has the fewest mutations, and assign the sequence to that gene. Let's look at the output of the function.

In [8]:
wildtypefile = '../data/prior_designs/wtsequences.csv'
df = regseq.create_key.detect_genes(seq_tag_df, wildtypefile)
df.head()

Unnamed: 0,tag,seq,count,gene,nmut
0,ATTCGTGAACGTGACCATTA,ACGTACGGTAGCGGCACCAATACCACGTCTGCCACCGAGGATGAGACCTGTCATACTCGTAAAAGCGCCCATAGGGACTCCTTGATTTATTATGTAACATGCATTACAAAACTGTTTTAATTATCTTTCAACAGGTTTTGTAATTATCACTAAAAAACAA,9,bdcR,13
1,GGACCCCTGAATTTAAATGT,ACCTACGATAGCGGCACCGATACCACGACTGCCCCCGAGGATGGGAACTGACTTGCATGTAAAAGCGCCCATAAGGACTCCTTGATTTATTTTGTAACATGGATTATAAAACTGTTTTAGCTTTCTGACAACAGGTATTGTAATGGACACTAAAAAACAA,9,bdcR,13
2,AGTCACCATTTATGGGTCAA,ACGTATGATAGTGTCGCCGATACCATGACTTCCACCGACGGGAAGAACTGTCTTGCCTGTAAAAGCGCCAATAAGGTCTCCTCGATTTAGTATGTATCATGCATTATAAAACTGTTCTAACTTCCTGCCAACAGGATTTGTAATGCTCACTAAAAAACAA,8,bdcR,22
3,TTGCTTTTCGTGGTTTTAGC,ACGTACGCTAGCGGCATCTGGACCACCACTGCCACCGAGCATGAGAGCCGTCTTACCTGTAAAATCGCCCATAAGCACTCCTTGATTTATGATGTTACATGCATTACAAAACTGTTTTAACTTTCTGACAACACGTTTTGTAATGGTCACTAAAAAACAA,8,bdcR,15
4,TACGTGAATGCTACTTACCT,CCGTACGATAGCGGCACCGATACCGCGACTGCCGCCGAGAATGTAAACTGTCTTACCTGTAAAAGCGCCCATAAGAACTCCTTGATTTATTATGTAACTTGCATTACAAAACTGTTTTAACTTTCTGTCAACGTGTTTTGTAATGGTCACTAAAAAACAA,7,bdcR,10


The table contains the information about barcode and mutated sequence, the gene, how often the unique barcode/sequence combination was found, and the number of mutations compared to the wildtype genome.
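Assigning a sequence to its closest gene amounts to counting mismatches against every wild type sequence and taking the minimum. The sketch below shows this with made-up gene names and short made-up sequences. The helper `closest_gene` is hypothetical and assumes all sequences have the same length.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_gene(sequence, wildtypes):
    """Return (gene, nmut) for the wild type with the fewest mismatches.

    wildtypes maps gene names to wild type sequences.
    """
    gene = min(
        wildtypes,
        key=lambda name: count_mismatches(sequence, wildtypes[name]),
    )
    return gene, count_mismatches(sequence, wildtypes[gene])


# Made-up wild type sequences, not taken from wtsequences.csv.
wildtypes = {"geneA": "ACGTACGTAC", "geneB": "TTGACCGTAA"}
gene, nmut = closest_gene("ACGTACCTAC", wildtypes)
print(gene, nmut)
```

The second return value plays the role of the `nmut` column in the table above.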

The steps above are combined into a single function. Below you can find its docstring, followed by an example call.

In [9]:
?regseq.create_key.key_barcode_sequence

Signature:
regseq.create_key.key_barcode_sequence(
    data_file,
    output_path,
    wildtypefile='../data/prior_designs/wtsequences.csv',
    genes=None,
)
Docstring:
Go through functions to create unique map of barcode to sequence and gene in wiltype.

The sequences are checked for correct lengths, to exlude insertion and deletion events.
Then, created sequences and barcodes are extracted (removing overhangs) and unique barcode/
sequence maps are found. Possible sequencing errors that lead to false negatives in uniqueness
are considered. Sequences are compared to gene sequences in wildtype.

Parameters
----------
data_file : str
    Path to f

We have to define a path where the keys are stored. We are using the directory `"../data/barcode_keys/"` in this repo. The function below will store one file for every gene observed in the sequence data.

In [10]:
output_path="../data/barcode_keys/"

In [11]:
regseq.create_key.key_barcode_sequence(data_file, output_path)

optimal length is 295
number of good sequencing counts 10450


The example file we provided contains sequences from one gene only, so there will be only one file in the folder.

In [12]:
!ls ../data/barcode_keys/

bdcR_barcode_key.csv
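Writing one key file per gene can be sketched with a pandas `groupby`. The file name pattern follows the `bdcR_barcode_key.csv` output shown above. The helper `write_keys_per_gene`, the example table, and the use of a temporary directory are assumptions made for this illustration and are not taken from `regseq`.

```python
import os
import tempfile

import pandas as pd


def write_keys_per_gene(df, output_path):
    """Write one CSV per gene and return the sorted file names.

    Hypothetical helper; expects a 'gene' column in df.
    """
    written = []
    for gene, group in df.groupby("gene"):
        file_name = os.path.join(output_path, "{}_barcode_key.csv".format(gene))
        group.to_csv(file_name)
        written.append(os.path.basename(file_name))
    return sorted(written)


# Made-up table with the same columns as the output of detect_genes.
example = pd.DataFrame(
    {
        "tag": ["AAAA", "CCCC", "GGGG"],
        "seq": ["ACGTACGT", "ACGTACGA", "TTGACCGT"],
        "count": [3, 2, 1],
        "gene": ["geneA", "geneA", "geneB"],
        "nmut": [1, 0, 2],
    }
)

# A temporary directory keeps the example from touching ../data/barcode_keys/.
with tempfile.TemporaryDirectory() as tmp_dir:
    files = write_keys_per_gene(example, tmp_dir)
print(files)
```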


Finally, here are the versions of the packages used in this notebook. To display the versions, we are using the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark). (This will already be installed if you use the environment we prepared.)

## Computing Environment

In [13]:
%load_ext watermark
%watermark -v -p regseq,numpy,pandas

CPython 3.6.9
IPython 7.13.0

regseq 0.0.2
numpy 1.18.1
pandas 1.0.3
