# Create key to match sequences to barcodes

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-by 4.0 License](https://creativecommons.org/licenses/by/4.0/). 

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [Reg-Seq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

In [11]:
import regseq.create_key
import pandas as pd
import numpy as np

After sequencing the prepared library, the data has to be brought into the right format for the next steps. This includes merging paired-end reads using the software `FLASH` and removing reads with low quality scores using `FastX`. We won't go into detail on how to use the software here. For a tutorial on using the software, as well as links to downloadable files, refer to the *Computational Analysis of the "Mapping" Run (Building the Codex)* chapter of the [Reg-Seq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki/3.-Sequencing).

**Update 06/24/2022**
We have added bash scripts to the repository. They require the installation of [fastp](https://github.com/OpenGene/fastp). The script `process_mapping_sequencing.sh` filters the raw sequencing files and merges the paired-end reads. Put the raw sequencing files in `fastq` format in the folder `/RegSeq/data/sequencing_data/`, or change the paths in the bash scripts appropriately.

In this notebook we perform the mapping step on the quality-score-filtered sequences. The mapping is used in the next step to count sequences in the DNA and RNA datasets obtained from growth experiments (more on this in the notebook `4_1_match_data.ipynb`). To build it, we need to find the barcodes and the sequences we generated in previous steps, and filter for unique barcode/sequence mappings. First, we check the length of the sequences and discard those whose length deviates from the consensus, since this is likely caused by insertions or deletions. We also discard sequences with unresolved bases, i.e., an `N` in the sequence. <br><br>
Sequencing errors can lead to ambiguous mappings. Therefore, if a barcode maps to multiple sequences which are very similar, the most frequently counted sequence is taken as the consensus sequence. Finally, we compare the sequences to the wild-type genome to identify which gene each sequence belongs to.
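As a small illustration of the `N` filter described above, here is a minimal sketch on made-up toy reads. It is not the code used inside `regseq.create_key`; the variable names and the reads are assumptions chosen for the example.

```python
import pandas as pd

# Toy reads (made up for illustration, much shorter than real reads)
reads = pd.Series(["ACGTACGT", "ACGNACGT", "TTGACCAA", "NNGACCAA"])

# Flag reads that contain an unresolved base `N`
has_n = reads.str.contains("N", regex=False)

# Keep only the fully resolved reads
clean_reads = reads[~has_n].reset_index(drop=True)
print(clean_reads.tolist())
```

Only the reads without any `N` remain in `clean_reads`.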

In this notebook we use a file we obtained after processing experimental data using the software discussed above. 

In [60]:
data_file = "../data/sequencing_data/processed_sequencing/dna_mapping_sequences.txt"

There will be some number of base pairs after the barcode in each sequence. For our sequencing setup we have either 24 bp or 20 bp at the end of the sequences. If the total sequence length is 299 bp, the trailing sequence is 24 bp long; if the total length is 295 bp, the trailing sequence is 20 bp long. Additionally, there are 20 bp preceding the sequence.
<br><br>
As a first step, we check that the sequences have one of these two lengths. If the sequences from your experiment have different lengths, simply change the arguments of the function below.

In [61]:
def check_length(
    input_file_name, 
    barcode_length=20, 
    sequence_length=160, 
    optimal_lengths= np.array([299, 295]), 
    trailing_lengths= np.array([24, 20]),
    starting_length=20
):
    """
    Check length of sequences.
    
    Return sequences that have 24 or 20 trailing bp, and have 20 bp at the start. Sequences
    with varying length have insertions and deletions and are filtered out.
    Parameters:
    ----------
    input_file_name : str
        file name for the fastq sequencing
    barcode_length : int, default 20
    sequence_length : int, default 160
    optimal_lengths : array-like, default np.array([299, 295])
        Numpy array containing possible lengths of sequences
    trailing_lengths : array-like, default np.array([24, 20])
        Numpy array containing possible number of trailing bp after barcode
    starting_length : int, default 20
        Number of bp preceding the sequence of interest
    
    Return
    ------
    sliceddf : Pandas DataFrame
        DataFrame containing only the mutated sequence and the barcode
    
    """
    # Accept lists or numpy arrays for the length arguments
    if isinstance(trailing_lengths, list):
        trailing_lengths = np.array(trailing_lengths)
    elif not isinstance(trailing_lengths, np.ndarray):
        raise RuntimeError("`trailing_lengths` has to be a numpy array or list.")

    if isinstance(optimal_lengths, list):
        optimal_lengths = np.array(optimal_lengths)
    elif not isinstance(optimal_lengths, np.ndarray):
        raise RuntimeError("`optimal_lengths` has to be a numpy array or list.")
        
    # Load data: one sequence per line, no header
    df = pd.read_csv(input_file_name, header=None, index_col=None, names=['seq'])

    # Find all lengths
    lengths = df['seq'].apply(len)

    # Find the correct length by finding the most common length
    lengthsmax = lengths.value_counts().index[0]
    print('optimal length is ' + str(lengthsmax))

    # Find all sequences with correct length.
    goodlength = (lengths == lengthsmax)
    df = df.loc[goodlength]
    ind = np.where(optimal_lengths == lengthsmax)[0]
    if len(ind) == 0:
        raise ValueError('Sequence length not in the list of required lengths.')
    else:
        sliceddf = df['seq'].str.slice(starting_length, -trailing_lengths[ind[0]])
        
    return sliceddf.reset_index(drop=True)

In [62]:
correct_seq = check_length(data_file, optimal_lengths=[299, 295], trailing_lengths=[24, 20], starting_length=20)
correct_seq

optimal length is 295


0           TAGGGAGTGAACGTCATCCGTCGCCGGAAAACGTTGTACTGTCAGTCAACGGAGCCCGTTCTATAACGGGCTCTTCCGCCCGCCTTAATGATAAAATTTCGACATTGCCCCTGAAAAAGGCGCGGGACTATACCCTTTTTCTCTTTCTCGTGTGCGGTTATTCCACAGCTCTATGAGGTGTATTAGGCTTCTCCTCAGCGACGCTCACTGGCCGTCGTTTTACATGACTGACTGACGTCCCGCATGGAGATTTCC
1           AGTTCTTAACAATGCCAAATCCCCAGTTCTCACCGCAAAATTATTTGTCGTTATGCTTTAAATGTTTTGTTTTACACTTTATCAAGCGTAACTATCACTCCGCGGCATAACTACCTCGGTCAAAGACCTCGGAGCGTGCAGGCTGGCGGTAAGCTTTACGCTATGGGCATTCCCGTACGATATTAGGCTTCTCCTCAGCGGCGCTCACTGGCCGTCGTTTTACATGACTGACTGGCAATTATCACAGCACTACCG
2           CACAACAGGTATTCTCTTTCATCTTTTGTCAACCATTCACAGCGCAAATATACGCCTTTTTTTGTGATCAGTCCGGCCTTTTTCGATCTTTATACTTCTATGGTAGTAGCTCAGTTGCGTAGATTTCATACATCACGAAAAGCGATGCACGGAATCGAACCTATGGTCATTCCCGTACGATATTAGGCTTCTCCTCAGCGGCTCTCACTGGCCGTCGTTTTACATGACTGACTGAGCTACGGGGAGCGCGTCCTG
3           TGTCTAATAATCGGCTTATGCCCGATGATATTCCTTTCATCGGGCTATTTATCCGTTACTGCTGTCTCTCTCTCCCAACCCTACCCCCTCCGTCTTATGAACTAGACTTGTTACAGTTATAGCATTCCGGAGCTGGCGAATCATGATCCATACGGTTGGACTCTGGTCATTCCCGTACTATATT

The output of this function is a pandas Series containing the sequences that have the required length, excluding the extra bp at the beginning and end of the sequences. Let's have a look at the output.

In [63]:
# Increase displaying width
pd.options.display.max_colwidth = 350
print("Sequence length: {}\n".format(len(correct_seq.iloc[0])))
correct_seq.head()

Sequence length: 255



0    TAGGGAGTGAACGTCATCCGTCGCCGGAAAACGTTGTACTGTCAGTCAACGGAGCCCGTTCTATAACGGGCTCTTCCGCCCGCCTTAATGATAAAATTTCGACATTGCCCCTGAAAAAGGCGCGGGACTATACCCTTTTTCTCTTTCTCGTGTGCGGTTATTCCACAGCTCTATGAGGTGTATTAGGCTTCTCCTCAGCGACGCTCACTGGCCGTCGTTTTACATGACTGACTGACGTCCCGCATGGAGATTTCC
1    AGTTCTTAACAATGCCAAATCCCCAGTTCTCACCGCAAAATTATTTGTCGTTATGCTTTAAATGTTTTGTTTTACACTTTATCAAGCGTAACTATCACTCCGCGGCATAACTACCTCGGTCAAAGACCTCGGAGCGTGCAGGCTGGCGGTAAGCTTTACGCTATGGGCATTCCCGTACGATATTAGGCTTCTCCTCAGCGGCGCTCACTGGCCGTCGTTTTACATGACTGACTGGCAATTATCACAGCACTACCG
2    CACAACAGGTATTCTCTTTCATCTTTTGTCAACCATTCACAGCGCAAATATACGCCTTTTTTTGTGATCAGTCCGGCCTTTTTCGATCTTTATACTTCTATGGTAGTAGCTCAGTTGCGTAGATTTCATACATCACGAAAAGCGATGCACGGAATCGAACCTATGGTCATTCCCGTACGATATTAGGCTTCTCCTCAGCGGCTCTCACTGGCCGTCGTTTTACATGACTGACTGAGCTACGGGGAGCGCGTCCTG
3    TGTCTAATAATCGGCTTATGCCCGATGATATTCCTTTCATCGGGCTATTTATCCGTTACTGCTGTCTCTCTCTCCCAACCCTACCCCCTCCGTCTTATGAACTAGACTTGTTACAGTTATAGCATTCCGGAGCTGGCGAATCATGATCCATACGGTTGGACTCTGGTCATTCCCGTACTATATTAGGCTTCTCCTCAGCGGCGCTCACTGGC

Looking at the sequences, we can see that they all have the same length. The initial 20 bp and the 20 bp at the end were removed, hence the length of 255 bp instead of 295 bp.
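The trimming arithmetic can be checked with a short sketch. It uses placeholder reads and a hypothetical helper `trim_read` (not part of `regseq`), and only assumes the two layouts given in the text: 299 bp reads with 24 trailing bp and 295 bp reads with 20 trailing bp, each with 20 leading bp.

```python
# Layouts described in the text: total read length -> trailing bp after the barcode
trailing_by_total = {299: 24, 295: 20}
starting_length = 20  # bp preceding the sequence of interest


def trim_read(read):
    """Remove the leading and trailing bp from a read of known total length."""
    trailing = trailing_by_total[len(read)]
    return read[starting_length:len(read) - trailing]


for total in trailing_by_total:
    toy_read = "A" * total  # placeholder read, not real data
    print(total, "->", len(trim_read(toy_read)))
```

Both layouts leave 255 bp after trimming, which matches the sequence length printed above.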

Using the sequences with the right length, we now stitch the part of the sequence we are interested in together with the barcode. There are still some bases between the sequences we generated and the barcode, and these are removed in this step. This is needed for the following step, in which we look for unique barcode/sequence mappings.

In [64]:
correct_seq = regseq.create_key.stitch_barcode_sequence(correct_seq)
correct_seq.head()

0    TAGGGAGTGAACGTCATCCGTCGCCGGAAAACGTTGTACTGTCAGTCAACGGAGCCCGTTCTATAACGGGCTCTTCCGCCCGCCTTAATGATAAAATTTCGACATTGCCCCTGAAAAAGGCGCGGGACTATACCCTTTTTCTCTTTCTCGTGTGCGGTTACGTCCCGCATGGAGATTTCC
1    AGTTCTTAACAATGCCAAATCCCCAGTTCTCACCGCAAAATTATTTGTCGTTATGCTTTAAATGTTTTGTTTTACACTTTATCAAGCGTAACTATCACTCCGCGGCATAACTACCTCGGTCAAAGACCTCGGAGCGTGCAGGCTGGCGGTAAGCTTTACGCAATTATCACAGCACTACCG
2    CACAACAGGTATTCTCTTTCATCTTTTGTCAACCATTCACAGCGCAAATATACGCCTTTTTTTGTGATCAGTCCGGCCTTTTTCGATCTTTATACTTCTATGGTAGTAGCTCAGTTGCGTAGATTTCATACATCACGAAAAGCGATGCACGGAATCGAACGCTACGGGGAGCGCGTCCTG
3    TGTCTAATAATCGGCTTATGCCCGATGATATTCCTTTCATCGGGCTATTTATCCGTTACTGCTGTCTCTCTCTCCCAACCCTACCCCCTCCGTCTTATGAACTAGACTTGTTACAGTTATAGCATTCCGGAGCTGGCGAATCATGATCCATACGGTTGGATGAACCTGACTCCGTGTGGC
4    TCTGGGCAACGTTATGAAGGTGACGGATTCATATATCAATTAATTTTTTAACGCCATTGTAAAACTGCCGTTTTACCTCGTTTACAACGCGTGCGCTGGACATTACCCTCCACCTCTGCGATTTATCATCGCAACCGCACGACTCGGGGCGCCGTTCTGCTTCAAGGCCGTAGCTAATCA
dtype: object

We are left with sequences of length 180 bp, of which 160 bp are the sequences we generated and 20 bp are the barcodes. Next, we check that the barcode/sequence mappings are unique.
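Judging from the outputs above, the stitched sequence consists of the first 160 bp followed by the last 20 bp of the trimmed read. The sketch below reproduces that idea on a toy read. It is an assumption based on those outputs, not the implementation of `regseq.create_key.stitch_barcode_sequence`, and `stitch_sketch` is a hypothetical name.

```python
import pandas as pd


def stitch_sketch(seqs, sequence_length=160, barcode_length=20):
    """Join the first `sequence_length` bp and the last `barcode_length` bp."""
    return seqs.str.slice(0, sequence_length) + seqs.str.slice(-barcode_length)


# Toy read: 160 bp "sequence", 75 bp spacer, 20 bp "barcode" (255 bp in total)
toy = pd.Series(["A" * 160 + "G" * 75 + "C" * 20])
stitched_toy = stitch_sketch(toy)

# Split the stitched read back into its two parts
toy_sequence = stitched_toy.str.slice(0, 160)
toy_barcode = stitched_toy.str.slice(-20)
print(len(stitched_toy.iloc[0]), toy_barcode.iloc[0])
```

The spacer between the two parts is dropped, so the stitched toy read is 180 bp long.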

In [65]:
# `correct_seq` holds the stitched sequences from the previous cell
barcodes, counts, seq_tag_df = regseq.create_key.check_barcode_uniqueness(correct_seq)

Sequencing errors can make a single barcode appear to map to several slightly different sequences, so we check the mappings for this possibility.

In [66]:
seq_tag_df = regseq.create_key.check_rare_barcode_errors(barcodes, counts, seq_tag_df)

number of good sequencing counts 2283046.0
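The idea of taking the most frequently counted sequence as the consensus for a barcode can be sketched with pandas on toy data. This is only an illustration with made-up values and column names. It leaves out the check that the competing sequences are very similar, and it is not the implementation of `check_rare_barcode_errors`.

```python
import pandas as pd

# Toy barcode/sequence pairs; all values are made up
toy = pd.DataFrame({
    "barcode": ["AAAA", "AAAA", "AAAA", "CCCC", "CCCC"],
    "seq":     ["ACGT", "ACGT", "ACGA", "TTGA", "TTGA"],
})

# Count how often each barcode/sequence pair occurs
pair_counts = toy.groupby(["barcode", "seq"]).size().reset_index(name="counts")

# For each barcode keep the most frequently observed sequence
consensus = (
    pair_counts.sort_values("counts", ascending=False)
    .drop_duplicates("barcode")
    .sort_values("barcode")
    .reset_index(drop=True)
)
print(consensus)
```

Here the barcode `AAAA` is seen with two sequences that differ in one base, and the sequence observed twice is kept.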


With the unique barcode/sequence mapping in hand, we can check which gene each sequence originated from. For this we need the file we generated in the first step of the protocol, the list of gene names and corresponding wild-type sequences. We choose the gene which is most similar to the observed sequence, i.e., has the fewest mutations, and assign the sequence to that gene. Let's look at the output of the function.

In [67]:
wildtypefile = '../data/prior_designs/wtsequences.csv'
df = regseq.create_key.detect_genes(seq_tag_df, wildtypefile)
df.head()

In [None]:
df.to_csv("just_store.csv")

The table contains the information about barcode and mutated sequence, the gene, how often the unique barcode/sequence combination was found, and the number of mutations compared to the wildtype genome.
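Assigning a sequence to the gene with the fewest mutations amounts to a nearest-neighbour search under the Hamming distance. The following sketch shows this on toy data. The gene names, sequences, and the helper `count_mutations` are hypothetical, and the actual `detect_genes` function may work differently.

```python
def count_mutations(seq, wildtype):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq) != len(wildtype):
        raise ValueError("Sequences must have the same length.")
    return sum(a != b for a, b in zip(seq, wildtype))


# Hypothetical wild-type sequences (toy data, not taken from wtsequences.csv)
wildtypes = {"geneA": "ACGTACGTAC", "geneB": "TTGACCAATG"}

observed = "ACGTACCTAC"  # differs from geneA at a single position

# Distance of the observed sequence to every wild-type sequence
distances = {gene: count_mutations(observed, wt) for gene, wt in wildtypes.items()}

# The gene with the fewest mutations is assigned to the sequence
best_gene = min(distances, key=distances.get)
print(best_gene, distances[best_gene])
```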

The steps above are combined into a single function. Below you can find the docstring.

?regseq.create_key.key_barcode_sequence

We have to define a path where the keys are stored. We are using the folder `"../data/barcode_keys/"` in this repo. The function below will store a file for every gene observed in the sequence data.

output_path="../data/barcode_keys/"

regseq.create_key.key_barcode_sequence(data_file, output_path)

The example file we provided contains sequences from one gene only, so there will only be one file in the folder.

!ls ../data/barcode_keys/

In the following steps we will use these keys to count sequences from growth experiments.

Finally, here are the versions of packages used in this notebook. To display the versions, we are using the Jupyter Lab extension `watermark`, which can be found [here](https://github.com/rasbt/watermark). (This will already be installed if you use the environment we prepared.)

## Bash scripts

For large sequencing files memory can become an issue and running notebooks is inefficient. For that reason we use bash scripts for the analysis of large 

## Computing Environment

%load_ext watermark
%watermark -v -p regseq,numpy,pandas