# Match mRNA and DNA counts

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-by 4.0 License](https://creativecommons.org/licenses/by/4.0/). 

In [1]:
import regseq.match_data
import pandas as pd

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [Reg-Seq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

Previously, we processed the sequencing data from the library and prepared the barcode keys. The library is then grown under various growth conditions, and RNA and DNA are sequenced from the cells. The next step is to use the barcode key to count unique sequences in both the RNA and DNA datasets, and then combine the counts for each sequence across the two datasets for further analysis.


We store the data in a table that contains both RNA (`ct_1`) and DNA (`ct_0`) counts for each sequence (`seq`), as well as total counts (`ct`). The resulting table has the following format:

|ct|ct_0|ct_1|seq|
|----|----|----|----|
|10|5|5|AAACAAAAAAAC...|
|2|2|0|AAACAAAAAATC...|

where the `seq` column contains the full sequence, shortened here for display purposes. The module `regseq.match_data` provides the function `regseq.match_data.combine_counts`, which reads the necessary files using `pandas` and performs the matching step, producing a table in the format shown above. The filenames are passed to the function as strings.
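To make the table format concrete, here is a minimal sketch of the counting-and-combining idea on toy data. It is not the implementation of `regseq.match_data.combine_counts`: the function `toy_combine_counts`, the toy barcode key, and the toy reads are all hypothetical, and this version aggregates by mapped sequence, which may differ from how the real function organizes its rows.

```python
from collections import Counter

# Hypothetical toy barcode key: barcode -> sequence (stand-in for the real key file)
toy_key = {
    "AAAT": "AAACAAAAAAAC",
    "CCGT": "AAACAAAAAATC",
}

# Hypothetical barcode reads from DNA and mRNA sequencing
dna_reads = ["AAAT", "AAAT", "CCGT", "GGGG"]  # "GGGG" is not in the key
mrna_reads = ["AAAT", "AAAT", "AAAT"]


def toy_combine_counts(mrna_reads, dna_reads, key):
    """Count reads per mapped sequence; return rows with ct, ct_0, ct_1, seq."""
    # Reads whose barcode is missing from the key are skipped
    dna_counts = Counter(key[b] for b in dna_reads if b in key)
    mrna_counts = Counter(key[b] for b in mrna_reads if b in key)
    rows = []
    for seq in sorted(set(dna_counts) | set(mrna_counts)):
        ct_0 = dna_counts[seq]   # DNA count (0 if the sequence was never seen)
        ct_1 = mrna_counts[seq]  # mRNA count (0 if the sequence was never seen)
        rows.append({"ct": ct_0 + ct_1, "ct_0": ct_0, "ct_1": ct_1, "seq": seq})
    return rows


toy_table = toy_combine_counts(mrna_reads, dna_reads, toy_key)
for row in toy_table:
    print(row)
```

A sequence seen only in the DNA reads still gets a row, with `ct_1` equal to zero, which mirrors the second row of the example table.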

In [2]:
?regseq.match_data.combine_counts

Signature:
regseq.match_data.combine_counts(
    mRNA_file,
    DNA_file,
    tag_key_file,
    output_file,
)
Docstring:
Compute counts for sequences from mRNA and DNA.

Parameters
----------
mRNA_file : str
    Path of file for mRNA sequencing
DNA_file : str
    Path of file for DNA sequencing
tag_key_file : str
    Path of file for barcode/sequence mapping
output_file : str
    Path of file constructed for output
File:      ~/git/RegSeq/regseq/match_data.py
Type:      function


For demonstration purposes, let's use the same gene we used in the previous step, `ykgE`. The barcode key for this gene was created in `3_1_create_keys.ipynb` using an example library, and stored in the `../data/barcode_keys/` folder in this repo. If you have libraries and DNA and RNA counts for other genes, you can use those files instead. Make sure the variables below point to the correct paths.

In [3]:
# Barcode key
tag_key_file = "../data/barcode_keys/ykgE_barcode_key.csv"

# RNA dataset
mRNA_file = "../data/sequencing_data/ykgE/BI106_mRNA_101"

# DNA dataset
DNA_file = "../data/sequencing_data/ykgE/BI105_DNA_101"

# Path to store result
output_file = "../data/sequencing_data/ykgE_dataset_combined.csv"
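Before running the matching step, it can help to confirm that the input paths actually exist. The helper below is a hypothetical convenience and not part of `regseq`; it only uses the standard library, and the paths are the ones defined above.

```python
from pathlib import Path


def find_missing_inputs(paths):
    """Return the entries of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Input paths from this notebook (relative to the notebook's folder)
input_files = [
    "../data/barcode_keys/ykgE_barcode_key.csv",
    "../data/sequencing_data/ykgE/BI106_mRNA_101",
    "../data/sequencing_data/ykgE/BI105_DNA_101",
]

missing = find_missing_inputs(input_files)
if missing:
    print("Missing input files:", missing)
```

An empty `missing` list means all three inputs were found.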

Now that we have defined all files, we can use the function to combine counts.

In [4]:
regseq.match_data.combine_counts(mRNA_file, DNA_file, tag_key_file, output_file)

Let's have a look at the output file.

In [5]:
pd.read_csv(output_file).head()

|Unnamed: 0|ct|ct_0|ct_1|seq|
|----|----|----|----|----|
|0|1.0|1.0|0.0|ACAATTTCACCATAAAATGTCGGCGTTGCCGAAAGAAATAAAATGAGGTATTGCATTTGACGTTTGGATGAAAGATTTTCATTTGTCCTACAATTGCGGGGTGGTATGTGGCTAGCCCATTAAAAAAGAACGCCATATTTATTGATGATTGACACCGCGGGAGAGCCTCGCGTATCCCTC|
|1|1.0|1.0|0.0|ACGAATTCCCCATAAGAAGTAAGCGATGCAGAAAGAAATAAAATTAGTTATCGCATTGGGGGTTTGCATCAAAGATTATCATTTGTCATACAGATGAGGGGGGGTATGTTGCTAGTCACTTAAACAAGAACGCCCTAGTTATTGATGAATGATCCTCCGGGGATCCATGGTCATTCGGTG|
|2|1.0|1.0|0.0|ACGAATTCCCCATAAGAAGTAAGCGATGCAGAAAGAAATAAAATTAGTTATCGCATTGGGGGTTTGCATCAAAGATTATCATTTGTCATACAGATGAGGGGGGGTATGTTGCTAGTCACTTAAACAAGAACGCCCTAGTTATTGATGAATGATCCTCCGGTATTACGGTACGAGATTGCT|
|3|2.0|2.0|0.0|ACGACTTGCCCAATAAATGTGAGCGTTGCCAAAAGGAATACAATGAGTTATTTCATTTGACGTTTGGGTGAAAGATTATCATTTGTCATACAAATGAAGGCTGGTATGTCGCTAGCCTATTAAAAAAGAACGCCATATATATTGGTCATTGATCCCGCGGAACTCTCACTCTGCTGTACG|
|4|2.0|2.0|0.0|ACGACTTGCCCAATAAATGTGAGCGTTGCCAAAAGGAATACAATGAGTTATTTCATTTGACGTTTGGGTGAAAGATTATCATTTGTCATACAAATGAAGGCTGGTATGTCGCTAGCCTATTAAAAAAGAACGCCATATATATTGGTCATTGATCCCGCGGTACCCGTGTTCGTAACCCCT|


The output file contains a table with RNA (`ct_1`) and DNA (`ct_0`) counts for each sequence (`seq`), as well as total counts (`ct`).
This is all the information needed in the following step, where statistical inference is used to determine the effect of mutations on the expression of the gene.
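As a quick sanity check on a table in this format, one can verify that the total counts match the sum of the DNA and RNA counts. The sketch below assumes that `ct` is defined as `ct_0 + ct_1`, as the example rows suggest, and uses a small hypothetical CSV string in place of the real output file. Passing `index_col=0` to `pd.read_csv` also reads the unnamed first column as the index, so it does not appear as `Unnamed: 0`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the output file, using the same column layout
toy_csv = io.StringIO(
    ",ct,ct_0,ct_1,seq\n"
    "0,10.0,5.0,5.0,AAACAAAAAAAC\n"
    "1,2.0,2.0,0.0,AAACAAAAAATC\n"
)

# index_col=0 treats the unnamed first column as the row index
df = pd.read_csv(toy_csv, index_col=0)

# Assumption: total counts equal DNA counts plus RNA counts in every row
is_consistent = bool((df["ct"] == df["ct_0"] + df["ct_1"]).all())
print(is_consistent)
```

To run the same check on real data, replace `toy_csv` with the path stored in `output_file`.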

## Computing Environment

Finally, here are the versions of the packages used in this notebook. To display the versions, we use the IPython extension `watermark`, available [on GitHub](https://github.com/rasbt/watermark). (It is already installed if you use the environment we prepared.)

In [6]:
%load_ext watermark
%watermark -v -p regseq

CPython 3.6.9
IPython 7.13.0

regseq 0.0.2
