# Inference

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-BY 4.0 License](https://creativecommons.org/licenses/by/4.0/).

In [1]:
import regseq.inference
from mpathic import learn_model
import pandas as pd

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [Reg-Seq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

In the previous step, the sequence counts from DNA and RNA measurements were combined into a single table for each gene considered in the experiment. This information is needed to compute how the mutated bases changed the expression of the gene. To infer how mutations change expression, we can perform several types of inference, which are discussed here. Detailed descriptions of the inference methods can be found [here](https://github.com/RPGroup-PBoC/RegSeq/blob/master/Wiki4_equations.html).

## Simple Least Squares

At this point, we have the data needed to infer statistically how each observed mutation affected the expression of the gene. The `regseq` package includes a simple linear regression that estimates the effect of a mutation at each position, independent of which base it was mutated to. This inference is very fast but not very accurate, so it is best used as a quick preview of the patterns that more sophisticated algorithms will resolve.
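The idea of a per-position regression can be sketched as follows. This is a minimal illustration, not the implementation in `regseq.inference.lin_reg`: the function names `mutation_matrix` and `position_effects`, the toy sequences, and the use of a generic per-sequence expression value are all assumptions made for this example.

```python
import numpy as np


def mutation_matrix(seqs, wildtype):
    """Return a 0/1 matrix: entry (i, j) is 1 if sequence i differs from wild type at position j."""
    return np.array([[float(b != w) for b, w in zip(s, wildtype)] for s in seqs])


def position_effects(seqs, wildtype, expression):
    """Least-squares estimate of the expression change caused by a mutation at each position.

    The identity of the mutated base is ignored; only mutated vs. not mutated matters.
    """
    X = mutation_matrix(seqs, wildtype)
    # Add an intercept column so the wild-type expression level is fitted as well.
    X = np.column_stack([np.ones(len(seqs)), X])
    coef, *_ = np.linalg.lstsq(X, np.asarray(expression, dtype=float), rcond=None)
    return coef[1:]  # drop the intercept, keep one effect per position


# Toy data (hypothetical): mutating position 0 raises expression, position 2 lowers it.
wildtype = "ACGT"
seqs = ["ACGT", "TCGT", "ACTT", "TCTT", "AGGT", "ACGA"]
expression = [1, 3, 0, 2, 1, 1]
print(position_effects(seqs, wildtype, expression))
```

Because each position gets a single coefficient, the model has far fewer parameters than a full matrix with one entry per base, which is why it is fast but coarse.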

In [2]:
?regseq.inference.lin_reg

Signature:
regseq.inference.lin_reg(
    inputname,
    outputname,
    wildtypefile,
    old_format=False,
    gene=None,
)
Docstring: <no docstring>
File:      ~/git/RegSeq/regseq/inference.py
Type:      function


The input is the combined data set of mRNA and DNA counts, which was created in the previous step in the notebook `4_1_match_data.ipynb`. We also need to give a path to the location where the results of the inference will be stored. Finally, the location of the file containing the wild-type sequences is passed as `wildtypefile`, which matters when a different set of genes is used. Here we use the file generated in the first step of this protocol.

In [3]:
inputname = "../data/sequencing_data/ykgE_dataset_combined.csv"
outputname = "../data/inference_results/ykgE_LS_mut_inf.txt"

Now we can just run the function and the results will be stored. If one has a lot of data in hand, one can simply iterate through the files and perform the inference.

In [4]:
regseq.inference.lin_reg(inputname, outputname, wildtypefile='../data/prior_designs/wtsequences.csv')
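Iterating over several data sets can be sketched like this. The helper names `inference_paths` and `run_all` are hypothetical, and the sketch assumes that every gene follows the same file naming pattern as the `ykgE` files above. The inference function is passed in as an argument so the loop itself does not depend on `regseq`.

```python
def inference_paths(gene,
                    data_dir="../data/sequencing_data",
                    out_dir="../data/inference_results"):
    """Build the input and output paths for one gene (naming pattern assumed from ykgE)."""
    inputname = f"{data_dir}/{gene}_dataset_combined.csv"
    outputname = f"{out_dir}/{gene}_LS_mut_inf.txt"
    return inputname, outputname


def run_all(genes, infer, wildtypefile="../data/prior_designs/wtsequences.csv"):
    """Call `infer` once per gene with the same arguments used for a single gene above."""
    for gene in genes:
        inputname, outputname = inference_paths(gene)
        infer(inputname, outputname, wildtypefile=wildtypefile)


# Usage in this notebook (requires regseq and the data files to be present):
# run_all(["ykgE"], regseq.inference.lin_reg)
```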

## Inference using mpathic

To perform more accurate inferences, we use the `mpathic` package, specifically its `learn_model` module. We can either use a least squares approach (`lm='LS'`) or maximize the mutual information between model predictions and measurements (`lm='IM'`). In both cases, the inference returns an energy matrix.

First let's look at the parameters of `learn_model.main`.

In [5]:
?learn_model.main

Signature:
learn_model.main(
    df,
    lm='IM',
    modeltype='MAT',
    LS_means_std=None,
    db=None,
    iteration=30000,
    burnin=1000,
    thin=10,
    runnum=0,
    initialize='LS',
    start=0,
    end=None,
    ...

To use this function we need to load the combined data set. Then we need to define a file path where the inference results are stored.

In [6]:
# combined data set
file = "../data/sequencing_data/ykgE_dataset_combined.csv"

df = pd.read_csv(file)

# output files
db = "../data/inference_results/ykgE_dataset_db"
out = "../data/inference_results/ykgE_LS_mut.csv"

In [7]:
ls_df = learn_model.main(
    df=df,
    lm='LS',
    modeltype='MAT',
    LS_means_std=None,
    db=db,
    iteration=30000,
    burnin=1000,
    thin=10,
    runnum=0,
    initialize='LS',
    start=0,
    end=None,
    foreground=1,
    background=0,
    alpha=0,
    pseudocounts=1,
    test=False,
    drop_library=False,
    verbose=False,
)
ls_df.to_csv(out, index=False)

A more accurate inference can be performed with Markov chain Monte Carlo (MCMC) methods. We use the same function as in the last step, but change a few arguments to make it suitable for an MCMC inference: `lm='IM'` selects information maximization, and `initialize='rand'` starts from a random initial model. Below is an example with far fewer steps than an actual inference would have. Note that there are only 10 warm-up steps (`burnin`) and only 1000 steps in the inference (`iteration`). Usually you want around 1000 warm-up steps and at least 10000 iteration steps so that the inference can converge. That many steps require a lot of computational power, and we do not recommend running them here in the notebook. In the Reg-Seq wiki we explain how to use AWS instances to perform high-power computations in the cloud.

In [8]:
file = "../data/sequencing_data/ykgE_dataset_combined.csv"
df = pd.read_csv(file)
db = "../data/inference_results/ykgE_db"
out = "../data/inference_results/ykgE_MCMC_mut.csv"
df

| Unnamed: 0 | ct | ct_0 | ct_1 | seq |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | ACAATTTCACCATAAAATGTCGGCGTTGCCGAAAGAAATAAAATGA... |
| 1 | 1.0 | 1.0 | 0.0 | ACGAATTCCCCATAAGAAGTAAGCGATGCAGAAAGAAATAAAATTA... |
| 2 | 1.0 | 1.0 | 0.0 | ACGAATTCCCCATAAGAAGTAAGCGATGCAGAAAGAAATAAAATTA... |
| 3 | 2.0 | 2.0 | 0.0 | ACGACTTGCCCAATAAATGTGAGCGTTGCCAAAAGGAATACAATGA... |
| 4 | 2.0 | 2.0 | 0.0 | ACGACTTGCCCAATAAATGTGAGCGTTGCCAAAAGGAATACAATGA... |
| ... | ... | ... | ... | ... |
| 2585 | 1.0 | 1.0 | 0.0 | TTGTTTTCGCCATAAATTGTGAGCGATGCCGTAAGAAACAAAATTA... |
| 2586 | 2.0 | 1.0 | 1.0 | TTGTTTTTCCCAGAAAATGTAAGTCACGTCGACAGAAATAAAATTA... |
| 2587 | 1.0 | 1.0 | 0.0 | TTGTTTTTCCCAGAAAATGTAAGTCACGTCGACAGAAATAAAATTA... |
| 2588 | 1.0 | 1.0 | 0.0 | TTGTTTTTCCCAGAAAATGTAAGTCACGTCGACAGAAATAAAATTA... |
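Before starting a long inference, it can be worth checking that the table has the expected shape. The sketch below is a hypothetical helper (`check_dataset` is not part of `regseq` or `mpathic`). It assumes that `ct` is the sum of `ct_0` and `ct_1`, which is consistent with the rows shown above, and that all sequences have the same length.

```python
import pandas as pd

REQUIRED_COLUMNS = ("ct", "ct_0", "ct_1", "seq")


def check_dataset(df):
    """Run basic sanity checks on a combined count table and return the sequence length."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Assumption: the total count is the sum of the two per-bin counts.
    if not (df["ct"] == df["ct_0"] + df["ct_1"]).all():
        raise ValueError("ct does not equal ct_0 + ct_1 in every row")
    # Assumption: a matrix model needs all sequences to have the same length.
    lengths = df["seq"].str.len().unique()
    if len(lengths) != 1:
        raise ValueError("sequences differ in length")
    return int(lengths[0])


# Toy table (hypothetical values) with the same columns as the data set above.
toy = pd.DataFrame({
    "ct": [1.0, 2.0],
    "ct_0": [1.0, 1.0],
    "ct_1": [0.0, 1.0],
    "seq": ["ACGT", "TCGT"],
})
print(check_dataset(toy))
```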


In [9]:
mcmc_df = learn_model.main(
    df=df,
    lm='IM',
    modeltype='MAT',
    LS_means_std=None,
    db=db,
    iteration=1000,
    burnin=10,
    thin=10,
    runnum=0,
    initialize='rand',
    start=0,
    end=None,
    foreground=1,
    background=0,
    alpha=0,
    pseudocounts=1,
    test=False,
    drop_library=False,
    verbose=True,
)
mcmc_df.to_csv(out, index=False)

 [-----------------100%-----------------] 1000 of 1000 complete in 77.8 sec
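The roles of `iteration`, `burnin`, and `thin` can be illustrated with a toy sampler. The sketch below is a generic Metropolis sampler for a one-dimensional standard normal, not the sampler that `mpathic` uses. The function name `metropolis`, the `step` and `seed` parameters, and the convention of keeping every `thin`-th state after the first `burnin` steps are assumptions made for illustration.

```python
import numpy as np


def metropolis(log_prob, x0, iteration, burnin, thin, step=1.0, seed=0):
    """Random-walk Metropolis sampler that discards burn-in steps and thins the chain."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_prob(x0)
    kept = []
    for i in range(iteration):
        proposal = x + step * rng.normal()
        lp_new = log_prob(proposal)
        # Accept the proposal with probability min(1, p(proposal) / p(x)).
        if np.log(rng.uniform()) < lp_new - lp:
            x, lp = proposal, lp_new
        # Discard the first `burnin` steps, then keep every `thin`-th state.
        if i >= burnin and (i - burnin) % thin == 0:
            kept.append(x)
    return np.array(kept)


# Start far from the mode so that the burn-in phase has something to discard.
samples = metropolis(lambda x: -0.5 * x**2, x0=5.0,
                     iteration=10000, burnin=1000, thin=10)
print(len(samples), samples.mean(), samples.std())
```

With these settings, 900 of the 10000 visited states are retained. Shortening `burnin` risks keeping states that still reflect the starting point, and a small `iteration` leaves too few samples to estimate anything reliably, which is why the short run above is only a demonstration.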

The results of these inferences are energy matrices. In the following notebooks we go through three ways of displaying them: information footprints, logos, and the energy matrices themselves. We also use these methods to find sites where mutations significantly change gene expression, which can then be identified as activator or repressor binding sites.
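As a minimal sketch of what an energy matrix encodes, the example below scores a sequence by summing one matrix entry per position. It assumes an additive matrix stored as a NumPy array with one row per position and columns ordered A, C, G, T. This layout and the function name `sequence_energy` are assumptions for illustration and may differ from the table written by `learn_model.main`.

```python
import numpy as np

BASES = "ACGT"  # assumed column order of the matrix


def sequence_energy(seq, matrix):
    """Sum the matrix entries selected by the base found at each position of `seq`."""
    matrix = np.asarray(matrix, dtype=float)
    if matrix.shape != (len(seq), len(BASES)):
        raise ValueError("matrix must have one row per position and four columns")
    columns = [BASES.index(base) for base in seq]
    return float(matrix[np.arange(len(seq)), columns].sum())


# Toy matrix (hypothetical values) for a sequence of length 3.
toy_matrix = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
]
print(sequence_energy("CAG", toy_matrix))
```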

## Computing Environment

In [10]:
%load_ext watermark
%watermark -v -p regseq

CPython 3.6.9
IPython 7.16.1

regseq 0.0.4
