# Create key to match sequences to barcodes

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-by 4.0 License](https://creativecommons.org/licenses/by/4.0/). 

In [1]:
import regseq.create_key
import pandas as pd

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [RegSeq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

In this notebook we perform the mapping step on the quality-score-filtered sequences. To do so, we extract the barcodes and the sequences generated in previous steps, and filter for unique barcode-to-sequence mappings. First, we check the length of each sequence and discard sequences whose length differs from the consensus, since this is likely caused by insertions or deletions. We also discard sequences with unresolved bases, i.e., an `N` in the sequence.

Sequencing errors can lead to ambiguous mappings. Therefore, if a barcode maps to multiple sequences that are very similar, the most counted sequence is taken as the consensus sequence. Finally, we compare the sequences to the wild type genome to identify which gene each sequence belongs to.

In [2]:
data_file = "../data/sequencing_data/mappingseqs.fastq"
output_path = "../data/barcode_keys/"

We start by checking the sequences for the right length.

In [3]:
correct_seq = regseq.create_key.check_length(data_file)

optimal length is 295
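
To make the filtering idea concrete, here is a minimal sketch in plain Python. It is not the implementation of `regseq.create_key.check_length`; the helper name `filter_by_length` is hypothetical, and it assumes that the consensus length is simply the most frequent read length.

```python
from collections import Counter


def filter_by_length(sequences):
    """Keep reads of the most common length that contain no 'N'."""
    # Assumption: the consensus length is the most frequent read length.
    lengths = Counter(len(seq) for seq in sequences)
    consensus_length = lengths.most_common(1)[0][0]
    kept = [
        seq for seq in sequences
        if len(seq) == consensus_length and "N" not in seq
    ]
    return consensus_length, kept


# Toy reads: one has a deletion, one contains an unresolved base.
reads = ["ACGTAC", "ACGTAG", "ACGTA", "ACNTAC", "TCGTAC"]
consensus_length, kept = filter_by_length(reads)
print(consensus_length, kept)
```

In this toy example the read `ACGTA` is dropped for its length and `ACNTAC` for its unresolved base.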


Using the sequences of the correct length, we stitch the part of each sequence we are interested in together with its barcode.

In [4]:
stitched = regseq.create_key.stitch_barcode_sequence(correct_seq)

Find unique barcode/sequence relations.

In [5]:
barcodes, counts, seq_tag_df = regseq.create_key.check_barcode_uniqueness(stitched)
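
As an illustration of what a unique barcode/sequence relation means, the following sketch counts barcode/sequence pairs and separates barcodes that map to exactly one sequence from those that map to several. The function names and the toy data are hypothetical, and the real `check_barcode_uniqueness` may organize its results differently.

```python
from collections import Counter, defaultdict


def count_barcode_sequences(pairs):
    """Count each (barcode, sequence) pair and group the counts by barcode."""
    pair_counts = Counter(pairs)
    by_barcode = defaultdict(dict)
    for (barcode, sequence), n in pair_counts.items():
        by_barcode[barcode][sequence] = n
    return dict(by_barcode)


def split_unique(by_barcode):
    """Separate barcodes with one sequence from barcodes with several."""
    unique = {
        bc: next(iter(seqs))
        for bc, seqs in by_barcode.items()
        if len(seqs) == 1
    }
    ambiguous = {bc: seqs for bc, seqs in by_barcode.items() if len(seqs) > 1}
    return unique, ambiguous


# Toy data: barcode "CCCG" was observed with two different sequences.
pairs = [
    ("AAAT", "ACGT"),
    ("AAAT", "ACGT"),
    ("CCCG", "TTGA"),
    ("CCCG", "TTGC"),
    ("CCCG", "TTGA"),
]
by_barcode = count_barcode_sequences(pairs)
unique, ambiguous = split_unique(by_barcode)
print(unique)
print(ambiguous)
```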

Check for possible sequencing errors in the mapping.

In [6]:
seq_tag_df = regseq.create_key.check_rare_barcode_errors(barcodes, counts, seq_tag_df)

number of good sequencing counts 10450
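
The idea of taking the most counted sequence as consensus can be sketched as follows. This is an assumption-laden illustration, not the logic of `check_rare_barcode_errors`: the helper names and the `max_mismatches` threshold are hypothetical, and similarity is measured here by the number of mismatched positions.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def resolve_barcode(seq_counts, max_mismatches=1):
    """Return the most counted sequence if all others are close to it.

    seq_counts maps each sequence seen with one barcode to its read count.
    max_mismatches is a hypothetical threshold for "very similar".
    """
    consensus = max(seq_counts, key=seq_counts.get)
    for seq in seq_counts:
        if hamming(seq, consensus) > max_mismatches:
            # Genuinely different sequences share this barcode.
            return None
    return consensus


likely_error = resolve_barcode({"TTGA": 5, "TTGC": 1})
true_conflict = resolve_barcode({"TTGA": 5, "ACCT": 4})
print(likely_error, true_conflict)
```

In the first call the rare variant differs by one base and is treated as a sequencing error; in the second call the two sequences are unrelated, so the barcode would be discarded.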


Find the gene relating to each sequence and store the result.

In [7]:
wildtypefile='../data/prior_designs/wtsequences.csv'
df = regseq.create_key.detect_genes(seq_tag_df, wildtypefile)
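
One way to picture the comparison against wild type sequences is to assign each sequence to the wild type it differs from the least. The sketch below does this with toy data; the gene names, the sequences, and the helper `assign_gene` are hypothetical, and `detect_genes` may use a different criterion.

```python
def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_gene(sequence, wildtypes):
    """Return the gene whose wild type sequence has the fewest mismatches."""
    # Only compare against wild types of the same length.
    candidates = {
        gene: wt for gene, wt in wildtypes.items() if len(wt) == len(sequence)
    }
    if not candidates:
        return None
    return min(candidates, key=lambda gene: mismatches(sequence, candidates[gene]))


# Hypothetical wild type sequences, much shorter than real promoters.
wildtypes = {"geneA": "ACGTACGT", "geneB": "TTGACCAA"}
first = assign_gene("ACGTACCT", wildtypes)
second = assign_gene("TTGACGAA", wildtypes)
print(first, second)
```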

Store results as one file per gene.

In [14]:
genes = ['bdcR']
# If no genes are given, write a key file for every gene found in the data.
if genes is None:
    genes = df["gene"].unique()
for gene in genes:
    # Select the rows for this gene and drop the now redundant gene column.
    genedf = df.loc[df["gene"] == gene]
    genedf.drop(['gene'], axis=1).to_csv(output_path + gene + "_barcode_key.csv", index=False)

The steps above are combined into a single function, `key_barcode_sequence`. Below you can find its docstring.

In [2]:
?ck.key_barcode_sequence

Signature:
ck.key_barcode_sequence(
    data_file,
    output_path,
    wildtypefile='../data/prior_designs/wtsequences.csv',
    genes=None,
)
Docstring:
Go through functions to create unique map of barcode to sequence and gene in wildtype.

The sequences are checked for correct lengths, to exclude insertion and deletion events.
Then, created sequences and barcodes are extracted (removing overhangs) and unique barcode/
sequence maps are found. Possible sequencing errors that lead to false negatives in uniqueness
are considered. Sequences are compared to gene sequences in wildtype.

Parameters
----------
data_file : str
    Path to file containing sequencing data.
ou

Finally, here are the versions of the packages used in this notebook. To display the versions, we use the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark). (It is already installed if you use the environment we prepared.)

## Computing Environment

In [5]:
%load_ext watermark
%watermark -v -p regseq

CPython 3.6.9
IPython 7.13.0

regseq 0.0.2
