# Inference

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All content in this notebook is licensed under a [CC-BY 4.0 License](https://creativecommons.org/licenses/by/4.0/).

In [1]:
import regseq.inference
from mpathic import learn_model
import pandas as pd

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [RegSeq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

## Simple Least Squares

At this point, we have the data needed to infer the effect of each observed mutation on the expression of the gene. The `regseq` package includes a simple linear regression that estimates the effect of a mutation at each position, independent of the mutated base. This inference is very fast but not very accurate, so it is best used as a preview of the patterns one can expect from more sophisticated algorithms.
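To make the idea of a per-position regression concrete, here is a minimal sketch on toy data. It is not the implementation used by `regseq.inference.lin_reg`; the helper names (`mutation_indicator`, `per_position_effects`), the toy sequences, and the choice of an ordinary least squares fit with an intercept are all assumptions made for illustration.

```python
import numpy as np


def mutation_indicator(seqs, wildtype):
    """Return an (n_seqs, n_positions) 0/1 matrix: 1 where a base differs from wild type."""
    return np.array([[int(b != w) for b, w in zip(s, wildtype)] for s in seqs])


def per_position_effects(seqs, wildtype, expression):
    """Fit expression ~ baseline + sum of per-position effects by ordinary least squares."""
    X = mutation_indicator(seqs, wildtype)
    # Prepend a column of ones so the fit includes a baseline (wild-type) term.
    X = np.column_stack([np.ones(len(seqs)), X])
    coef, *_ = np.linalg.lstsq(X, np.asarray(expression, dtype=float), rcond=None)
    return coef[0], coef[1:]


# Hypothetical toy data: a 4 bp wild type and a handful of mutants.
wildtype = "ACGT"
seqs = ["ACGT", "TCGT", "AGGT", "ACTT", "ACGA", "TGGT", "GCGA"]

# Noise-free expression values generated from known effects, so the fit can be checked.
true_effects = np.array([0.5, -1.0, 0.0, 2.0])
expression = 1.0 + mutation_indicator(seqs, wildtype) @ true_effects

baseline, effects = per_position_effects(seqs, wildtype, expression)
print(baseline, effects)
```

Because the indicator only records whether a position is mutated, "TCGT" and "GCGT" would be treated identically at the first position. That is what makes this kind of fit fast, and also what limits its accuracy.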

In [2]:
?regseq.inference.lin_reg

The input is the combined data set of mRNA and DNA counts produced in the previous step.

In [3]:
inputname = "../data/sequencing_data/bdcR_Anaerodataset_combined.csv"
outputname = "../data/inference_results/bdcR_Anaero_LS_mut_inf.txt"

Now we can run the function, and the results will be stored in the output file. With many data sets in hand, one can simply iterate over the files and perform the inference on each.

In [4]:
regseq.inference.lin_reg(inputname, outputname)
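Iterating over many files could look like the sketch below. The helpers `build_jobs` and `run_all` are hypothetical and not part of `regseq`. The file-naming pattern is inferred from the single example above, and `geneX` is a made-up second data set. The demo uses empty placeholder files and a stand-in function so that it runs without any sequencing data.

```python
from pathlib import Path
import tempfile


def build_jobs(data_dir, out_dir,
               in_suffix="dataset_combined.csv",
               out_suffix="_LS_mut_inf.txt"):
    """Pair every combined data file with an output path that follows the same naming."""
    jobs = []
    for path in sorted(Path(data_dir).glob("*" + in_suffix)):
        prefix = path.name[: -len(in_suffix)]  # e.g. "bdcR_Anaero"
        jobs.append((str(path), str(Path(out_dir) / (prefix + out_suffix))))
    return jobs


def run_all(jobs, infer):
    """Call infer(inputname, outputname) for every job."""
    for inputname, outputname in jobs:
        infer(inputname, outputname)


# Demo with placeholder files and a stand-in for the inference function.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["bdcR_Anaerodataset_combined.csv",
                 "geneX_Anaerodataset_combined.csv"]:
        (Path(tmp) / name).touch()
    demo_jobs = build_jobs(tmp, tmp)
    called = []
    run_all(demo_jobs, lambda i, o: called.append((Path(i).name, Path(o).name)))

print(called)
```

In the notebook, one would pass `regseq.inference.lin_reg` as `infer`, since it takes an input path and an output path as in the cell above.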

## Inference using mpathic

For more accurate inference, we use the `mpathic` package, in particular its `learn_model` module. We can either use a least squares approach (`lm='LS'`) or maximize information (`lm='IM'`). In both cases, the inference returns an energy matrix.

First, let's look at the parameters that `learn_model.main` takes.

In [5]:
?learn_model.main
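As a reminder of what an energy matrix represents, the sketch below scores a sequence by summing one matrix entry per position. The data layout (a list of per-position dictionaries), the function name `sequence_energy`, and the numbers are illustrative assumptions and do not reflect the exact format returned by `mpathic`.

```python
def sequence_energy(matrix, seq):
    """Sum the matrix entry selected by the base at each position of seq."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match the number of matrix rows")
    return sum(row[base] for row, base in zip(matrix, seq))


# Hypothetical 3-position matrix; each row maps a base to its energy contribution.
toy_matrix = [
    {"A": 0.0, "C": 0.5, "G": -0.25, "T": 0.25},
    {"A": 0.25, "C": 0.0, "G": 0.5, "T": -0.25},
    {"A": -0.5, "C": 0.25, "G": 0.0, "T": 0.75},
]

wildtype_energy = sequence_energy(toy_matrix, "ACG")
mutant_energy = sequence_energy(toy_matrix, "GCT")
print(wildtype_energy, mutant_energy)
```

Unlike the per-position regression above, a matrix model assigns a separate value to each base at each position, so different mutations at the same site can have different effects.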

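We start with the least squares fit (`lm='LS'`) and save the resulting matrix to a CSV file.

In [6]:
# Combined mRNA/DNA count table from the previous step
file = "../data/sequencing_data/bdcR_Anaerodataset_combined.csv"
df = pd.read_csv(file)
# Path passed to `learn_model.main` as `db`
db = "../data/inference_results/bdcR_Anaero_dataset_db"
# Output file for the inferred matrix
out = "../data/inference_results/bdcR_Anaero_LS_mut.csv"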

In [7]:
ls_df = learn_model.main(
    df=df,
    lm='LS',
    modeltype='MAT',
    LS_means_std=None,
    db=db,
    iteration=30000,
    burnin=1000,
    thin=10,
    runnum=0,
    initialize='LS',
    start=0,
    end=None,
    foreground=1,
    background=0,
    alpha=0,
    pseudocounts=1,
    test=False,
    drop_library=False,
    verbose=False,
)
ls_df.to_csv(out, index=False)

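Next, we run the information-maximization inference (`lm='IM'`), which uses MCMC and starts from a random initialization (`initialize='rand'`).

In [13]: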
file = "../data/aphAAnaerodataset_alldone_with_large.csv"
df = pd.read_csv(file)
db = "../data/inference_results/bdcR_Anaero_dataset_db"
out = "../data/inference_results/bdcR_Anaero_MCMC_mut.csv"

In [18]:
mcmc_df = learn_model.main(
    df=df,
    lm='IM',
    modeltype='MAT',
    LS_means_std=None,
    db=db,
    iteration=1000,
    burnin=10,
    thin=10,
    runnum=0,
    initialize='rand',
    start=0,
    end=None,
    foreground=1,
    background=0,
    alpha=0,
    pseudocounts=1,
    test=False,
    drop_library=False,
    verbose=True,
)
mcmc_df.to_csv(out, index=False)

 [-----------------100%-----------------] 1000 of 1000 complete in 118.6 sec

## Computing Environment

In [24]:
%load_ext watermark
%watermark -v -p regseq

CPython 3.6.9
IPython 7.13.0

regseq 0.0.2
