# Match mRNA and DNA counts

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-BY 4.0 License](https://creativecommons.org/licenses/by/4.0/).

In [2]:
import regseq.match_data as md
import pandas as pd

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [RegSeq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

After obtaining quality-filtered sequencing files, split by biological replicate, by DNA vs. RNA, and by growth condition, we need to put the data into a format suitable for further statistical inference. To do so, we store the data in a table that contains both RNA and DNA counts for each sequence. The final table has the following format:

|ct|ct_0|ct_1|seq|
|----|----|----|----|
|10|5|5|AAACAAAAAAAC...|
|2|2|0|AAACAAAAAATC...|

where the `seq` column contains the full sequence, shortened here for display. In the third step of the protocol, we generated a unique mapping of barcodes to sequences, which we use in this step. The module `regseq.match_data` contains the functions that perform the matching; they are combined in `regseq.match_data.combine_counts`. Let's have a look at its docstring.

In [3]:
?md.combine_counts

Signature: md.combine_counts(mRNA_file, DNA_file, tag_key_file, output_file)
Docstring:
Compute counts for sequences from mRNA and DNA.

Parameters
----------
mRNA_file : str
    Path of file for mRNA sequencing
DNA_file : str
    Path of file for DNA sequencing
tag_key_file : str
    Path of file for barcode/sequence mapping
output_file : str
    Path of file constructed for output
File:      ~/git/RegSeq/regseq/match_data.py
Type:      function


For demonstration purposes, let's use the same gene we used in the previous step, `bdcR`.

In [4]:
tag_key_file = "../data/test_data/bdcR_barcode_key.csv"
mRNA_file = "../data/sequencing_data/BI94_102_mRNA"
DNA_file = "../data/sequencing_data/BI95_102_DNA"
output_file = "../data/sequencing_data/bdcRAnaerodataset"

In [5]:
md.combine_counts(mRNA_file, DNA_file, tag_key_file, output_file)
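
To make the matching step more concrete, here is a minimal sketch of the idea. It is not the implementation of `combine_counts`. It assumes that each read has already been reduced to its barcode, that the barcode key is available as a plain dictionary, and that `ct_0` holds the DNA counts and `ct_1` the mRNA counts (this notebook does not state which is which). The function name `combine_counts_sketch` and its arguments are hypothetical.

```python
from collections import Counter


def combine_counts_sketch(dna_barcodes, rna_barcodes, barcode_to_seq):
    """Tally DNA and mRNA reads per sequence (illustration only).

    dna_barcodes, rna_barcodes : iterables of barcode strings, one per read
    barcode_to_seq : dict mapping each barcode to its sequence
    """
    dna_counts = Counter(dna_barcodes)
    rna_counts = Counter(rna_barcodes)

    per_seq = {}
    for barcode in set(dna_counts) | set(rna_counts):
        seq = barcode_to_seq.get(barcode)
        if seq is None:
            continue  # barcode is not in the key, so the read cannot be assigned
        counts = per_seq.setdefault(seq, {"ct_0": 0, "ct_1": 0})
        counts["ct_0"] += dna_counts[barcode]  # assumption: ct_0 = DNA
        counts["ct_1"] += rna_counts[barcode]  # assumption: ct_1 = mRNA

    # One row per sequence, with ct as the total of both counts.
    return [
        {"ct": c["ct_0"] + c["ct_1"], "ct_0": c["ct_0"], "ct_1": c["ct_1"], "seq": seq}
        for seq, c in sorted(per_seq.items())
    ]
```

The returned list of rows can be passed to `pd.DataFrame` to obtain a table with the `ct`, `ct_0`, `ct_1`, and `seq` columns described earlier.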

**Note:** the output currently includes a `gene` column. Since each file contains only one gene, this column is redundant.

In [7]:
pd.read_csv(output_file).head()

Unnamed: 0,ct,ct_0,ct_1,gene,seq
0,3.0,1.0,2.0,bdcR,AAATACGATAGCGGCATCGATTCAACGACTTGCACCGAGGATGTGAACTGTCATATCTGAAAAAGCGCCCATAAGGACTCCTTGATTTATTATGTAATATGCATTACAAAACTGTTTTAACTTTTTGCCGACAGGTTTTGCAATGGTAAATAAAACACAATTCCCGAACCTGGCTGATCC
1,2.0,2.0,0.0,bdcR,AACTAAGAGAGGGGCACCGATACCACGACTGACACCGAGGATGCGAACTCTCTTAGTTGTAAAAGCGCCCATAAGGAGTCCTTGATTTATTTTGTAACATGCATTACAAAACTGTTTTAACTTTCTGTCAACAGGTTTTGCAATGGGCACTGAACCGTAAAGGCTTGGGGTGTCCGGAGG
2,1.0,1.0,0.0,bdcR,AACTAAGAGAGGGGCACCGATACCACGACTGACACCGAGGATGCGAACTCTCTTAGTTGTAAAAGCGCCCATAAGGAGTCCTTGATTTATTTTGTAACATGCATTACAAAACTGTTTTAACTTTCTGTCAACAGGTTTTGCAATGGGCACTGAACCGTAAATGACGGCGGTATTGTCAAC
3,3.0,3.0,0.0,bdcR,AACTAAGAGAGGGGCACCGATACCACGACTGACACCGAGGATGCGAACTCTCTTAGTTGTAAAAGCGCCCATAAGGAGTCCTTGATTTATTTTGTAACATGCATTACAAAACTGTTTTAACTTTCTGTCAACAGGTTTTGCAATGGGCACTGAACCGTAACACCATCGCTTAAATCTTTA
4,4.0,4.0,0.0,bdcR,AACTAAGAGAGGGGCACCGATACCACGACTGACACCGAGGATGCGAACTCTCTTAGTTGTAAAAGCGCCCATAAGGAGTCCTTGATTTATTTTGTAACATGCATTACAAAACTGTTTTAACTTTCTGTCAACAGGTTTTGCAATGGGCACTGAACCGTAACATAGTATAGTCGATTTTAT
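
The output above carries two columns that are not part of the target format: `Unnamed: 0`, which is presumably the row index written to the CSV, and `gene`, which is constant within a file. A minimal sketch for tidying the table follows. The helper name `tidy_counts` is hypothetical, and the consistency check assumes that `ct` equals `ct_0 + ct_1` in every row, as it does in the rows shown.

```python
import pandas as pd


def tidy_counts(df):
    """Drop redundant columns and verify that ct equals ct_0 + ct_1."""
    # errors="ignore" keeps the helper usable if a column is already absent.
    tidy = df.drop(columns=["Unnamed: 0", "gene"], errors="ignore")

    # Assumption: ct is the total of the two per-library counts.
    consistent = (tidy["ct"] == tidy["ct_0"] + tidy["ct_1"]).all()
    if not consistent:
        raise ValueError("ct does not equal ct_0 + ct_1 in every row")
    return tidy
```

With the file written above, this could be called as `tidy_counts(pd.read_csv(output_file))`.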


Finally, here are the versions of the packages used in this notebook. To display them, we use the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark). (It is already installed if you use the environment we prepared.)

## Computing Environment 

In [5]:
%load_ext watermark
%watermark -v -p regseq

CPython 3.6.9
IPython 7.13.0

regseq 0.0.2
