# Create key to match sequences to barcodes

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-by 4.0 License](https://creativecommons.org/licenses/by/4.0/). 

In [6]:
import regseq.create_key as ck

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [RegSeq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

In this notebook we perform the mapping step on the quality-filtered sequences. To do so, we extract the barcode and the sequence generated in previous steps from each read and keep only unique barcode-to-sequence mappings. First, we check the length of the sequences and discard those whose length differs from the consensus length, since this is likely caused by insertions or deletions. We also discard sequences with unresolved bases, i.e., an `N` in the sequence. <br><br>
Sequencing errors can lead to ambiguous mappings. Therefore, if a barcode maps to multiple sequences that are very similar, the most counted sequence is taken as the consensus sequence. Finally, we compare the sequences to the wild-type genome to identify which gene each sequence belongs to. 
<br><br>
All these steps are performed by functions in the `regseq.create_key` module and are combined in the function `regseq.create_key.key_barcode_sequence`. 
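
To make the filtering and consensus steps described above concrete, here is a minimal, self-contained sketch on toy reads. It is not the `regseq` implementation: the function names `filter_reads` and `map_barcodes`, the assumption that the barcode sits at the end of each read, and the `barcode_length` parameter are all hypothetical and chosen for illustration only. The sketch also skips the similarity check between competing sequences and simply takes the most counted sequence per barcode.

```python
from collections import Counter, defaultdict


def filter_reads(reads):
    """Keep reads that have the most common length and contain no 'N'."""
    length_counts = Counter(len(read) for read in reads)
    consensus_length = length_counts.most_common(1)[0][0]
    kept = [
        read for read in reads
        if len(read) == consensus_length and "N" not in read
    ]
    return consensus_length, kept


def map_barcodes(reads, barcode_length):
    """Map each barcode to its most counted sequence.

    Assumption (illustration only): the barcode is the last
    `barcode_length` bases of the read and the rest is the sequence.
    """
    counts = defaultdict(Counter)
    for read in reads:
        sequence, barcode = read[:-barcode_length], read[-barcode_length:]
        counts[barcode][sequence] += 1
    # For every barcode, take the sequence that was observed most often.
    return {bc: seqs.most_common(1)[0][0] for bc, seqs in counts.items()}


# Toy reads: one is too short, one contains an unresolved base.
toy_reads = ["ACGTAA", "ACGTAA", "ACCTAA", "ACGTA", "ANGTCC", "TTGTCC"]
consensus_length, good_reads = filter_reads(toy_reads)
barcode_map = map_barcodes(good_reads, barcode_length=2)
print(consensus_length, len(good_reads), barcode_map)
```

In this toy example the barcode `AA` is seen with two different sequences, and the one counted twice is kept as the consensus.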

In [7]:
?ck.key_barcode_sequence

Signature:
ck.key_barcode_sequence(
    data_file,
    output_file,
    wildtypefile='../data/test_data/wtsequences.csv',
)
Docstring:
Go through functions to create unique map of barcode to sequence and gene in wildtype.

The sequences are checked for correct lengths, to exclude insertion and deletion events.
Then, created sequences and barcodes are extracted (removing overhangs) and unique barcode/
sequence maps are found. Possible sequencing errors that lead to false negatives in uniqueness
are considered. Sequences are compared to gene sequences in wildtype.

Parameters
----------
data_file : str
    Path to file containing sequencing data.
output_file : str
    Path to file where results are stored.

Returns
-------

The only inputs we need to provide are the file containing the sequence reads and the path where the results are stored; `wildtypefile` defaults to `'../data/test_data/wtsequences.csv'`. The results are stored in `.csv` files, one for every gene. **CHECK IF THIS IS WHAT WE WANT**

Here is an example for gene `bdcR`.

In [8]:
data_file = "../data/sequencing_data/mappingseqs.fastq"
output_path = "../data/test_data/"

In [9]:
ck.key_barcode_sequence(data_file, output_path)

optimal length is 295
number of good sequencing counts 10450

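To check what was written, one option is to list the `.csv` files in the output directory and count their rows. The sketch below uses only the Python standard library. The helper name `summarize_csv_outputs` is hypothetical, and the assumption that each file starts with a header row is ours, since the exact file names and column layout produced by `regseq` are not shown here.

```python
import csv
from pathlib import Path


def summarize_csv_outputs(output_dir):
    """Return {file name: number of data rows} for each .csv file in output_dir."""
    summary = {}
    for path in sorted(Path(output_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first row is a header, so it is not counted.
        summary[path.name] = max(len(rows) - 1, 0)
    return summary


# Example usage with the output path from above:
# for name, n_rows in summarize_csv_outputs("../data/test_data/").items():
#     print(name, n_rows)
```

Note that `../data/test_data/` also holds the default wild-type file `wtsequences.csv`, so it would appear in this listing alongside the per-gene results.
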

Finally, here are the versions of the packages used in this notebook. To display the versions, we use the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark). (This will already be installed if you use the environment we prepared.)

## Computing Environment

In [5]:
%load_ext watermark
%watermark -v -p regseq

CPython 3.6.9
IPython 7.13.0

regseq 0.0.2
