# Internal documentation
This document gives a tour of the code in the `MDSINE2` package and how it interacts with the `MDSINE2_Paper` package. The first part gives a high-level tour of the `MDSINE2` package. The next part shows how to run selected analyses with the code. The last part shows where frequently used functions are implemented in the code.

### Table of contents
* [High level tour](#highleveltour)
    * [Datatypes](#datatypes)
        - [`md2.Taxon` and `md2.OTU`](#taxonandotu)
        - [`md2.TaxaSet`](#taxaset)
        - [`md2.Study` and `md2.Subject`](#studyandsubject)
        - [`md2.Perturbations` and `md2.BasePerturbation`](#perturbationsandbaseperturbatio)
        - [Everything is connected to each other through the `Study` class](#everythingisconnected)
    * [`DataNode` object]()
    * [`md2.Clustering` object]()
    * [`md2.Interactions` object]()
    * [`md2.BaseMCMC` object]()
* [Common functionality](#commonfunctionality)
    * [Reading in the Gibson dataset](#readinginthegibsondataset)
    * [Retrieve trace from disk](#retrievetracefromdisk)
    * [Get statistics of a trace](#getstatisticsofatrace)
    * [Defining the parameters of the model](#definingtheparametersofthemodel)
    * [Bayes factors](#bayesfactors)
    * [Condensing fixed cluster interactions in perturbations into cluster-cluster interactions](#condensingfixedcluster)
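    * [Forward simulating](#forwardsimulating)
    * [Make a name for a taxon](#makingnamefortaxon)
    * [Generating cluster assignments](#clusterassignments)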

# High level tour <a class="anchor" id="highleveltour"></a>
The purpose of this section is to take a deep dive into the core code of MDSINE2.

In [1]:
import mdsine2 as md2
import numpy as np

## Datatypes <a class="anchor" id="datatypes"></a>
Data is parsed into various python objects depending on the type of data:

### `md2.Taxon` and `md2.OTU` <a class="anchor" id="taxonandotu"></a>
The first datatypes are the `md2.Taxon` (location in code: `md2.pylab.base.Taxon`) and `md2.OTU` (location in code: `md2.pylab.base.OTU`) objects. These objects contain the OTU and Taxon specific data and can be referenced with internal parameters such as:
 - `taxonomy` (`dict`: `str` -> `str`): This dictionary maps the taxonomic level to the taxonomic classification of the OTU/taxon.
 - `sequence` (`str`): This is the sequence of the taxa/OTU

An `OTU` is distinct from a `Taxon` object in the sense that an OTU is composed of multiple `Taxon`s and contains all of the information for each one. Additionally, you can set the consensus taxonomy into the `taxonomy` parameter and the consensus sequence into the `sequence` parameter. These two objects are treated identically in the code.
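
As a rough illustration of how an OTU could pool its member taxa, the sketch below computes a consensus taxonomy by majority vote at each level. The `SimpleTaxon` class and the `consensus_taxonomy` function are hypothetical stand-ins, not the `md2.OTU` implementation, and majority vote is only one possible consensus rule.

```python
from collections import Counter
from dataclasses import dataclass, field

LEVELS = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

@dataclass
class SimpleTaxon:
    """Hypothetical stand-in for a taxon: a name, a sequence, and a taxonomy dict."""
    name: str
    sequence: str
    taxonomy: dict = field(default_factory=dict)

def consensus_taxonomy(taxa):
    """Majority vote per level; 'NA' when no label wins a strict majority."""
    consensus = {}
    for level in LEVELS:
        votes = Counter(t.taxonomy.get(level, 'NA') for t in taxa)
        label, count = votes.most_common(1)[0]
        consensus[level] = label if count > len(taxa) / 2 else 'NA'
    return consensus

# Toy members (sequences are placeholders, not real data)
members = [
    SimpleTaxon('taxon_a', 'ACGT', {'family': 'Bacteroidaceae', 'genus': 'Bacteroides', 'species': 'ovatus'}),
    SimpleTaxon('taxon_b', 'ACGA', {'family': 'Bacteroidaceae', 'genus': 'Bacteroides', 'species': 'NA'}),
    SimpleTaxon('taxon_c', 'ACGC', {'family': 'Bacteroidaceae', 'genus': 'Bacteroides', 'species': 'fragilis'}),
]
otu_taxonomy = consensus_taxonomy(members)
print(otu_taxonomy['genus'], otu_taxonomy['species'])
```

The members agree on the genus, so it is kept, while the species level has no majority and falls back to `'NA'`.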

### `md2.TaxaSet` <a class="anchor" id="taxaset"></a>
The next datatype is the `md2.TaxaSet` (location in code: `md2.pylab.base.TaxaSet`) object. This object works as an aggregator of the `md2.OTU` and `md2.Taxon` objects. The primary use of this object is to reference all of the `OTU` and `Taxon` objects and to keep them in order. You can "get" the individual taxa by indexing it directly:

In [38]:
# Get the TaxaSet object for the UC gibson dataset (these commands are the same for any dataset)
study = md2.dataset.load_gibson(dset='uc') # description of the `Study` object is below
taxa = study.taxa

# Index a taxon directly
taxon = taxa['ASV_1']
print(taxon)

# subfields
print('taxonomy')
print(type(taxon.taxonomy))
print(taxon.taxonomy.keys())
print('\nsequence')
print(type(taxon.sequence))
print(taxon.sequence)

Taxon
	id: 1910077543704
	idx: 0
	name: ASV_1
	taxonomy:
		kingdom: Bacteria
		phylum: Bacteroidetes
		class: Bacteroidia
		order: Bacteroidales
		family: Bacteroidaceae
		genus: Phocaeicola
		species: NA
taxonomy
<class 'dict'>
dict_keys(['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'asv'])

sequence
<class 'str'>
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGG


In the same way that we index by name, we can retrieve the same taxon by its Python ID (`id`) or by its index (`idx`), which is the position of the taxon/OTU in the `TaxaSet` ordering:

In [25]:
taxon = taxa[0]
print('Retrieving with index')
print(taxon)

ID = taxon.id
print('Retrieving with python id')
taxon = taxa[ID]
print(taxon)

Retrieving with index
Taxon
	id: 1910050152688
	idx: 0
	name: ASV_1
	taxonomy:
		kingdom: Bacteria
		phylum: Bacteroidetes
		class: Bacteroidia
		order: Bacteroidales
		family: Bacteroidaceae
		genus: Phocaeicola
		species: NA
Retrieving with python id
Taxon
	id: 1910050152688
	idx: 0
	name: ASV_1
	taxonomy:
		kingdom: Bacteria
		phylum: Bacteroidetes
		class: Bacteroidia
		order: Bacteroidales
		family: Bacteroidaceae
		genus: Phocaeicola
		species: NA


In addition to indexing the taxa from the `TaxaSet` object, we can also iterate over the taxa in order:

In [26]:
# Print the first 4 taxa
for taxon in taxa:
    if taxon.idx > 3:
        break
    print(taxon)

Taxon
	id: 1910050152688
	idx: 0
	name: ASV_1
	taxonomy:
		kingdom: Bacteria
		phylum: Bacteroidetes
		class: Bacteroidia
		order: Bacteroidales
		family: Bacteroidaceae
		genus: Phocaeicola
		species: NA
Taxon
	id: 1910050152576
	idx: 1
	name: ASV_2
	taxonomy:
		kingdom: Bacteria
		phylum: Bacteroidetes
		class: Bacteroidia
		order: Bacteroidales
		family: Bacteroidaceae
		genus: Bacteroides
		species: ovatus/fragilis
Taxon
	id: 1910050017120
	idx: 2
	name: ASV_3
	taxonomy:
		kingdom: Bacteria
		phylum: Verrucomicrobia
		class: Verrucomicrobiae
		order: Verrucomicrobiales
		family: Akkermansiaceae
		genus: Akkermansia
		species: muciniphila
Taxon
	id: 1910050017176
	idx: 3
	name: ASV_4
	taxonomy:
		kingdom: Bacteria
		phylum: Bacteroidetes
		class: Bacteroidia
		order: Bacteroidales
		family: Bacteroidaceae
		genus: Bacteroides
		species: NA


Another useful tool for using `TaxaSet` is reverse indexing. This is where we take the name or the Python ID of the taxon and we want to get the index:

In [27]:
# Get a random taxon
taxon = taxa['ASV_111']
print(taxon)
name = taxon.name
ID = taxon.id

print(taxa.names.index[name])
print(taxa.ids.index[ID])

Taxon
	id: 1910051233184
	idx: 110
	name: ASV_111
	taxonomy:
		kingdom: Bacteria
		phylum: Firmicutes
		class: Clostridia
		order: Clostridiales
		family: Lachnospiraceae
		genus: Blautia
		species: NA
110
110


We use the `TaxaSet` object to read from a taxonomic table (`md2.TaxaSet.parse`) and to write its internal state to a table (`md2.TaxaSet.write_taxonomy_to_csv`). The order of the taxa is __strictly dependent on their order in this object__ - all other classes (`md2.Clustering`, `md2.Study`, etc.) depend on this ordering.
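
To make the lookup behavior concrete, here is a minimal sketch of an ordered container that supports indexing by position or by name, iteration in order, and reverse indexing. `OrderedTaxa` is a hypothetical class written for this illustration and is not how `md2.TaxaSet` is implemented.

```python
class OrderedTaxa:
    """Hypothetical ordered container with lookup by position or by name."""

    def __init__(self, names):
        self.names = list(names)
        # Reverse index: name -> position in the ordering
        self.index = {name: i for i, name in enumerate(self.names)}

    def __getitem__(self, key):
        # Integers are positions; anything else is treated as a name
        if isinstance(key, int):
            return self.names[key]
        return self.names[self.index[key]]

    def __iter__(self):
        return iter(self.names)

    def __len__(self):
        return len(self.names)

taxa = OrderedTaxa(['ASV_1', 'ASV_2', 'ASV_3'])
print(taxa[0], taxa['ASV_3'], taxa.index['ASV_3'])
```

Keeping a single ordered list plus a dictionary built from it means the forward and reverse lookups cannot disagree, which is the property the other classes rely on.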

### `md2.Study` and `md2.Subject` <a class="anchor" id="studyandsubject"></a>

The next objects are the `md2.Study` (location in code: `mdsine2.pylab.base.Study`) and `md2.Subject` (location in code: `mdsine2.pylab.base.Subject`) objects. These objects read in the reads and qPCR measurements for each subject, reference the `md2.TaxaSet` and `md2.Perturbations` objects (the latter is described below), and are what we use to generate abundance matrices.

A `md2.Subject` object contains all of the information regarding a single subject in a dataset, such as:
 * `times` (`np.ndarray`): This stores the timepoints contained for this subject in order. These times map to the qPCR and reads for those time points.
 * `reads` (`dict`: `float` -> `np.ndarray`): This maps the timepoint to an array of counts for each taxa. The order of the array corresponds to the order of the taxa in the `md2.TaxaSet` object.
 * `qpcr` (`dict`: `float` -> `md2.qPCRdata`): Maps the timepoint to the qPCR measurement. The `md2.qPCRdata` object (location in code: `mdsine2.pylab.base.qPCRData`) stores the information for the triplicate qPCR measurements. 
 * `perturbations` (`md2.Perturbations`): References all of the perturbations in the study

In [29]:
# Get the subject at index 4 (just an example)
subject = study[4]

print('subject name:', subject.name)
for i, t in enumerate(subject.times):
    if i > 3:
        break
    print('\nTime {}'.format(t))
    print('reads', subject.reads[t]) # These are the reads
    print(len(subject.reads[t]), len(subject.taxa)) # These are equal
    print('qPCR:', subject.qpcr[t].data) # These are the triplicate qPCR measurements

subject name: 9

Time 0.0
reads [150 343  21 ...   0   0   0]
1473 1473
qPCR: [ 7158224.75044306 21449824.45947887  7026214.74028283]

Time 0.5
reads [927 411  21 ...   0   0   0]
1473 1473
qPCR: [3.65801801e+09 4.40176703e+09 6.46110522e+09]

Time 1.0
reads [36037  2819    32 ...     0     0     0]
1473 1473
qPCR: [1.02195918e+10 1.72985110e+10 1.51350301e+10]

Time 1.5
reads [28263  5178   119 ...     0     0     0]
1473 1473
qPCR: [3.06884858e+10 7.59445116e+10 6.77454856e+10]


Given that we have all of this information for the subject, we can create matrices of the counts, relative abundances, and absolute abundances (multiplying the relative abundance by the qPCR mean). This is done with the `matrix` and `df` functions. The `matrix` function builds the abundance table with one row per taxon (in `TaxaSet` order) and one column per timepoint (in time order). It returns three types of abundances:
1. `'raw'`: These are the raw counts for each taxa
2. `'rel'`: These are the relative abundances for each taxa
3. `'abs'`: These are the absolute abundances for each taxa
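
The relationship between the three abundance types can be sketched with plain NumPy. The numbers below are made up, and the variable names are illustrative rather than taken from the `matrix` implementation; the only rule used is the one stated above (relative abundance times the mean of the qPCR replicates).

```python
import numpy as np

# Toy data: 3 taxa (rows) x 2 timepoints (columns) of read counts
counts = np.array([[150.0, 900.0],
                   [300.0, 400.0],
                   [ 50.0, 700.0]])
# Triplicate qPCR measurements, one row per timepoint
qpcr = np.array([[1.0e7, 2.0e7, 3.0e7],
                 [4.0e9, 5.0e9, 6.0e9]])

raw = counts
rel = counts / counts.sum(axis=0, keepdims=True)  # each column sums to 1
abs_abund = rel * qpcr.mean(axis=1)               # scale each column by its mean qPCR

print(abs_abund[:, 0])
```

Each column of the absolute abundance matrix sums to the mean qPCR measurement of that timepoint.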

The `df` function just wraps the `matrix` function and adds the taxa names as the index and the times as the columns for a `pandas.DataFrame`:

In [30]:
print('absolute abundance')
print(subject.matrix()['abs'])
df = subject.df()['abs']
df.head()

absolute abundance
[[3.10227753e+05 1.35872110e+08 9.55686446e+09 ... 5.56451703e+10
  4.31226057e+10 2.92904389e+10]
 [7.09387463e+05 6.02410327e+07 7.47587227e+08 ... 1.25140774e+10
  2.37771086e+10 1.84579855e+10]
 [4.34318855e+04 3.07800897e+06 8.48626863e+06 ... 1.27296188e+09
  3.43757415e+09 7.08885867e+09]
 ...
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]


Unnamed: 0,0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,...,58.5,59.0,59.5,60.0,60.5,61.0,62.0,63.0,64.0,64.5
ASV_1,310227.753446,135872100.0,9556864000.0,29279020000.0,40204050000.0,37082540000.0,575440600.0,10467940000.0,62303510000.0,4522686000.0,...,43168840000.0,131050200000.0,13553880000.0,13373350000.0,26618410000.0,14183620000.0,39050770000.0,55645170000.0,43122610000.0,29290440000.0
ASV_2,709387.462879,60241030.0,747587200.0,5364143000.0,16377910000.0,21281360000.0,327216200.0,7426126000.0,47966940000.0,4036937000.0,...,48737060000.0,102512800000.0,15229940000.0,6159113000.0,15035700000.0,2482822000.0,7512273000.0,12514080000.0,23777110000.0,18457990000.0
ASV_3,43431.885482,3078009.0,8486269.0,123277900.0,7966482000.0,32275390000.0,342646200.0,13936190000.0,44092410000.0,8937856000.0,...,47334250.0,0.0,16954460.0,11893460.0,7704687.0,45265670.0,21408300.0,1272962000.0,3437574000.0,7088859000.0
ASV_4,31022.775345,1172575.0,1591175.0,0.0,6819158.0,0.0,0.0,3589581.0,16593250.0,0.0,...,120487200.0,176320500.0,39560400.0,12459820.0,24398180.0,9430347.0,23447180.0,60858740.0,66490800.0,53667340.0
ASV_5,33090.960368,1612290.0,0.0,24862770.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,8495329.0,25682290.0,0.0,0.0,0.0,0.0,0.0


We can also produce these matrices clustered at a specific taxonomic level. We can only do this if taxonomic information is available for the taxa. Note that the `index_formatter` can be any format you want, as long as it only references taxonomic levels at or above the level `taxlevel` you are aggregating to:

In [31]:
df, _ = subject.cluster_by_taxlevel(dtype='abs', taxlevel='family', index_formatter='%(order)s %(family)s')
df.head()

Unnamed: 0,0.0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,...,58.5,59.0,59.5,60.0,60.5,61.0,62.0,63.0,64.0,64.5
Bacteroidaceae Bacteroidia,3034027.0,350013600.0,12682990000.0,44540620000.0,74306660000.0,79349130000.0,1222880000.0,25616690000.0,156901600000.0,13318000000.0,...,112599600000.0,277537300000.0,35341970000.0,21947400000.0,48461200000.0,17492540000.0,48656980000.0,72543610000.0,78997710000.0,57420470000.0
Akkermansiaceae Verrucomicrobiae,76522.85,4690299.0,8486269.0,152284500.0,8292097000.0,33549100000.0,356085300.0,14464580000.0,45826410000.0,9285970000.0,...,47334250.0,0.0,16954460.0,20388790.0,33386980.0,45265670.0,21408300.0,1272962000.0,3437574000.0,7088859000.0
Sutterellaceae Betaproteobacteria,31022.78,879431.1,96796500.0,1714495000.0,5126302000.0,6476278000.0,73441970.0,786118300.0,6508702000.0,745609300.0,...,1811611000.0,18875110000.0,3816367000.0,2541802000.0,5849142000.0,651825600.0,6680408000.0,16223920000.0,13594040000.0,13038780000.0
Enterobacteriaceae Gammaproteobacteria,310227.8,5276587.0,16442150.0,89091590.0,0.0,11778110.0,522630.1,33024150.0,240602100.0,4552805.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7607342.0,0.0,0.0
Porphyromonadaceae Bacteroidia,221295.8,26089790.0,211626300.0,2056358000.0,4677943000.0,4285549000.0,66075370.0,498951800.0,8010392000.0,998815300.0,...,10921300000.0,32301920000.0,2681226000.0,2888412000.0,4670325000.0,1971697000.0,7286976000.0,7982637000.0,6559317000.0,4238527000.0


The `md2.Study` object is an object that references multiple `Subject` objects and contains all of the information used in a study. Most of the major commands are done through the `Study` object so that all of the various data structures stay consistent with each other. You can index a subject both with its index and its name:

In [32]:
subj = study[2]
print(subj.name)
subj = study[subj.name]
print(subj.name)

7
7


It is with the `Study` object that we do the majority of the parsing of the tables (`md2.Study.parse`). We can also remove certain data from the study manually:
* Remove timepoints with `study.pop_times`
* Remove subjects with `study.pop_subject`
* Remove taxa with `study.pop_taxa`

In [44]:
study = md2.dataset.load_gibson(dset='uc')
print('removing subject 10')
print('number of subjects:', len(study))
study.pop_subject('10')
print('number of subjects:', len(study))

subject = study['6']

print('Removing the timepoint 1 from all subjects')
print(subject.times)
study.pop_times(1)
print(subject.times)



removing subject 10
number of subjects: 5
number of subjects: 4
Removing the timepoint 1 from all subjects
[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5  5.   6.   7.   8.
  9.  10.  11.  14.  16.  18.  21.  21.5 22.  22.5 23.  23.5 24.  25.
 28.  28.5 29.  29.5 30.  30.5 31.  31.5 32.  33.  35.  35.5 36.  36.5
 37.  37.5 38.  39.  42.  42.5 43.  43.5 44.  44.5 45.  45.5 46.  47.
 50.  50.5 51.  51.5 52.  52.5 53.  54.  57.  57.5 58.  58.5 59.  59.5
 60.  60.5 61.  62.  63.  64.  64.5]
[ 0.   0.5  1.5  2.   2.5  3.   3.5  4.   4.5  5.   6.   7.   8.   9.
 10.  11.  14.  16.  18.  21.  21.5 22.  22.5 23.  23.5 24.  25.  28.
 28.5 29.  29.5 30.  30.5 31.  31.5 32.  33.  35.  35.5 36.  36.5 37.
 37.5 38.  39.  42.  42.5 43.  43.5 44.  44.5 45.  45.5 46.  47.  50.
 50.5 51.  51.5 52.  52.5 53.  54.  57.  57.5 58.  58.5 59.  59.5 60.
 60.5 61.  62.  63.  64.  64.5]


We can also create numpy matrices and `DataFrame`s of the data at the study level using `study.matrix` and `study.df`. These work the same as the `matrix` and `df` functions of the `Subject` object, but they aggregate the different subjects together in a way that you specify. See the documentation for details.

Another key feature of the `Study` object is the `study.normalize_qpcr` and `study.denormalize_qpcr` functions. These are used to rescale the data either higher or lower. This is done for numerical reasons during inference. We learn the parameters in normalized space, and then we denormalize them with the function `mdsine2.run.denormalize_parameters`.
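
The sketch below shows why parameters learned in a rescaled space can be mapped back afterwards. It assumes generalized Lotka-Volterra dynamics and a single multiplicative scaling factor on the abundances; `glv_rhs` and `scale` are hypothetical names, and this is not a reproduction of what `denormalize_parameters` does internally.

```python
import numpy as np

def glv_rhs(x, growth, interactions):
    """Right-hand side of generalized Lotka-Volterra dynamics: dx/dt = x * (r + A @ x)."""
    return x * (growth + interactions @ x)

scale = 1e-9  # hypothetical normalization factor applied to the abundances
growth = np.array([0.5, 0.3])
interactions = np.array([[-2e-10, 1e-11],
                         [-5e-11, -1e-10]])
x = np.array([3e9, 1e9])

# The same system expressed in normalized space
x_norm = scale * x
interactions_norm = interactions / scale  # growth rates are unaffected by the rescaling

# The dynamics agree once the derivative is mapped back to the original units
original = glv_rhs(x, growth, interactions)
recovered = glv_rhs(x_norm, growth, interactions_norm) / scale
print(np.allclose(original, recovered))
```

Under these assumptions, only the interaction terms change with the scale, so denormalizing amounts to multiplying them by the scaling factor.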

### `md2.Perturbations` and `md2.BasePerturbation` <a class="anchor" id="perturbationsandbaseperturbatio"></a>
The last core data structures used to wrap the raw data are the `md2.Perturbations` (location in code: `mdsine2.pylab.base.Perturbations`) and `md2.BasePerturbation` (location in code: `mdsine2.pylab.base.BasePerturbation`) objects. These record which subjects receive which perturbation, and when.

The `BasePerturbation` class is analogous to the `Subject` class in the sense that it stores the information needed for a single perturbation. Internal parameters include:
 * `name` (`str`): This is the name of the perturbation
 * `starts` (`dict`: `str` -> `float`): This maps the name of the subject that has this perturbation to the start time of the perturbation
 * `ends` (`dict`: `str` -> `float`): This maps the name of the subject that has this perturbation to the end time of the perturbation
 
The `Perturbations` class is analogous to the `Study` class in the sense that it agglomerates many perturbations. You can index the perturbations by their name, Python ID, or their index.
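
A minimal sketch of the per-subject `starts` and `ends` mappings is given below. `SimplePerturbation` and its `is_active` method are hypothetical, the times are illustrative, and treating the interval as closed on both ends is an assumption made only for this example.

```python
from dataclasses import dataclass, field

@dataclass
class SimplePerturbation:
    """Hypothetical stand-in holding per-subject start and end times."""
    name: str
    starts: dict = field(default_factory=dict)  # subject name -> start time
    ends: dict = field(default_factory=dict)    # subject name -> end time

    def is_active(self, subjname, time):
        """True if `subjname` receives this perturbation at `time`."""
        if subjname not in self.starts:
            return False
        return self.starts[subjname] <= time <= self.ends[subjname]

pert = SimplePerturbation(
    'perturbation_1',
    starts={'6': 21.5, '7': 21.5},
    ends={'6': 28.5, '7': 28.5})

print(pert.is_active('6', 25.0), pert.is_active('6', 30.0), pert.is_active('8', 25.0))
```

Keying the times by subject name allows different subjects to receive the same perturbation over different windows, or not at all.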
 

### Everything is connected to each other through the `Study` class <a class="anchor" id="everythingisconnected"></a>

The `Study` object connects all of the aggregator classes together:
 * `study.perturbations` -> `Perturbations`
 * `study.taxa` -> `TaxaSet`
 
Each subject also exposes `.perturbations` and `.taxa`, but as a `@property` that forwards to the parent study, so a subject always refers to the same objects as the `Study`:
```python
    @property
    def perturbations(self) -> Perturbations:
        return self.parent.perturbations

    @property
    def taxa(self) -> TaxaSet:
        return self.parent.taxa
```

When we make changes to the data as a whole, 99% of the time this is done with the `Study` object so that everything can stay consistent. For example, when we remove taxa from the inference, the `Study` object removes the reads at the respective indices in each subject, tells the `TaxaSet` object to remove that from the set, and tells it to reorder the indices of the taxa so that they are sequential again.
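
The bookkeeping described above can be sketched as follows. `pop_taxa` here is a hypothetical free function working on plain lists and dictionaries with made-up counts; it only illustrates why the names and every read vector have to be updated together.

```python
def pop_taxa(names, reads_by_time, to_remove):
    """Remove taxa by name and drop the matching entries from every read vector."""
    remove = set(to_remove)
    keep = [i for i, name in enumerate(names) if name not in remove]
    new_names = [names[i] for i in keep]
    new_reads = {t: [counts[i] for i in keep] for t, counts in reads_by_time.items()}
    return new_names, new_reads

names = ['ASV_1', 'ASV_2', 'ASV_3', 'ASV_4']
reads = {0.0: [150, 343, 21, 15], 0.5: [927, 411, 21, 8]}

names, reads = pop_taxa(names, reads, ['ASV_2'])
# Rebuild the name -> index mapping so that the indices are sequential again
new_index = {name: i for i, name in enumerate(names)}
print(names, reads[0.0], new_index['ASV_3'])
```

After the removal, `ASV_3` moves from index 2 to index 1, and every read vector is shortened at the same position.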



## `DataNode` object

## `md2.Clustering` object

## `md2.Interactions` object

## `md2.BaseMCMC` object

# Common functionality <a class="anchor" id="commonfunctionality"></a>
If you are looking for a command that is not listed here and you do not know where it lives in the code, look in the `mdsine2.__init__` file. The module that a function or class is imported from tells you its location in the code.

The `import` statements in the `__init__` file allow the user to access functions/classes directly from the imported package instead of having to go throughout all of the submodules. For example, instead of loading a `Study` object as
```python
study = md2.pylab.base.Study(...)
```
we can import it directly from the `mdsine2` package:
```python
study = md2.Study(...)
```
We can do this because it is imported in the `__init__` file. 

We can see the location in the code where the `Study` object is implemented by looking in the `__init__` file:
```python
from .pylab.base import ..., Study, ...
```
Note that there are many objects/functions that are imported on the same line.

To find the documentation of the different classes and functions, look at the autogenerated HTML files in the `docs` folder. Functions and classes appear in the docs at the same location as in the code, so if you do not know the location, look at the `__init__` file first.
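
Python can also report where a re-exported name is defined, without opening the `__init__` file. The example uses the standard library `json` package, which re-exports `JSONDecoder` from a submodule in the same way; the same attributes should work on a name such as `md2.Study`, although that is not run here.

```python
import inspect
import json

# `json.JSONDecoder` is importable from the package, but it is defined in a submodule
defining_module = json.JSONDecoder.__module__
print(defining_module)

# `inspect.getmodule` returns the module object itself
module = inspect.getmodule(json.JSONDecoder)
print(module.__name__)
```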

In [1]:
import mdsine2 as md2
from mdsine2.names import STRNAMES

### Reading in the Gibson dataset <a class="anchor" id="readinginthegibsondataset"></a>
    
This will automatically try to download the dataset from GitHub. If there is no internet connection, you can pass a path with the parameter `load_local`, and the files will be read from that path instead.

* Location in MDSINE2: `MDSINE2.dataset.load_gibson`

*  Command: `study = md2.dataset.load_gibson(...)`

* Example:

   ```python
   # Load healthy
   healthy = md2.dataset.load_gibson(dset='healthy')
   # Load uc
   uc = md2.dataset.load_gibson(dset='uc')
   ```


### Retrieve trace from disk <a class="anchor" id="retrievetracefromdisk"></a>
    
Get the Gibbs samples of a variable from disk. There are three options:
1. `section='posterior'` : Gets only the Gibbs samples post-burnin
2. `section='burnin'` : Gets only the Gibbs samples used for burnin
3. `section='entire'` : Gets all of the Gibbs samples (`burnin` + `posterior`)
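
The three sections amount to slicing the chain at the burn-in boundary. The helper below is a hypothetical sketch on a plain list and does not read anything from disk.

```python
def select_section(trace, burnin, section='posterior'):
    """Slice a sequence of samples by section; `burnin` is the number of burn-in samples."""
    if section == 'posterior':
        return trace[burnin:]
    if section == 'burnin':
        return trace[:burnin]
    if section == 'entire':
        return trace[:]
    raise ValueError('Unknown section: {}'.format(section))

samples = list(range(10))  # stand-in for 10 Gibbs samples
posterior = select_section(samples, burnin=4, section='posterior')
burnin_part = select_section(samples, burnin=4, section='burnin')
print(len(burnin_part), len(posterior))
```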

* Location in MDSINE2: `MDSINE2.pylab.inference.Tracer.get_trace`

* Example:
    Load using `Tracer` object

    ```python
    mcmc = md2.BaseMCMC.load('path/to/mcmc.pkl')

    # Get the growth parameters from disk of the posterior
    trace = mcmc.tracer.get_trace(name=STRNAMES.GROWTH_VALUE, section='posterior')
    ```

    Load directly from the object

    ```python
    mcmc = md2.BaseMCMC.load('path/to/mcmc.pkl')
    growth = mcmc.graph[STRNAMES.GROWTH_VALUE]

    # Get the growth parameters from disk from the posterior
    trace = growth.get_trace_from_disk(section='posterior')
    ```

    Note that these two are the exact same - the function `get_trace_from_disk` internally calls `tracer.get_trace`
    and passes in its own name. You can call this function for any variable you are tracing during inference
            

### Get statistics of a trace <a class="anchor" id="getstatisticsofatrace"></a>
    
Calculate some statistics of a trace. 

* Automatically calculates and returns a dictionary of:
    * `'mean'`
    * `'median'`
    * `'25th percentile'`
    * `'75th percentile'`

* Location in MDSINE2: `MDSINE2.pylab.variables.summary`

* Command: `md2.summary`

* Example:

    Using objects directly
    ```python
    mcmc = md2.BaseMCMC.load('path/to/mcmc.pkl')
    processvar = mcmc.graph[STRNAMES.PROCESSVAR]
    summ = md2.summary(processvar, section='posterior')
    # Get the mean
    mean = summ['mean']
    ```

    From raw numpy files
    ```python
    mcmc = md2.BaseMCMC.load('path/to/mcmc.pkl')
    processvar = mcmc.graph[STRNAMES.PROCESSVAR]
    trace = processvar.get_trace_from_disk(section='posterior')
    summ = md2.summary(trace)
    # Get the mean
    mean = summ['mean']
    ```

    Note that these two calls are exactly the same - if you pass in a variable with a trace, it will automatically get the trace from disk by calling the function `get_trace_from_disk`. You can specify the section to retrieve using the `section` parameter.

* Handling NaNs
    By default, `md2.summary` ignores NaNs when calculating the statistics by using functions such as `numpy.nanmean`. To set the NaNs to 0 instead, pass `set_nan_to_0=True`.

    - Example:
        ```python
        mcmc = md2.BaseMCMC.load('path/to/mcmc.pkl')
        interactions = mcmc.graph[STRNAMES.INTERACTIONS_OBJ]
        ```

        Ignores the NaNs:

        ```python
        trace = md2.summary(interactions)
        ```

        Sets the NaNs to zero:

        ```python
        trace = md2.summary(interactions, set_nan_to_0=True)
        ```
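
The difference between the two NaN policies can be seen on a tiny trace. `summarize` is a hypothetical re-implementation written for illustration, returning the same dictionary keys listed above; it is not the code of `md2.summary`.

```python
import numpy as np

def summarize(trace, set_nan_to_0=False):
    """Summary statistics over the first (sample) axis of a trace."""
    trace = np.asarray(trace, dtype=float)
    if set_nan_to_0:
        trace = np.where(np.isnan(trace), 0.0, trace)
    return {
        'mean': np.nanmean(trace, axis=0),
        'median': np.nanmedian(trace, axis=0),
        '25th percentile': np.nanpercentile(trace, 25, axis=0),
        '75th percentile': np.nanpercentile(trace, 75, axis=0),
    }

# Three samples of a single value; one of the samples is NaN
trace = np.array([1.0, np.nan, 3.0])
mean_ignoring = summarize(trace)['mean']
mean_zeroed = summarize(trace, set_nan_to_0=True)['mean']
print(mean_ignoring, mean_zeroed)
```

Ignoring the NaN averages only the two defined samples, while setting it to 0 averages over all three, so the second mean is smaller.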

### Defining the parameters of the model <a class="anchor" id="definingtheparametersofthemodel"></a>

* Location in MDSINE2: `MDSINE2.config`

* Command:
    - Logging: `logging = md2.LoggingConfig(...)`
        - This is used to define the logging level and the format to log with.
        - This automatically writes all of the logging to a file that you can view later
    - MDSINE2 parameters: `params = md2.MDSINE2ModelConfig(...)`
        - Defines the parameters to run the MDSINE2 model
    - Negative Binomial dispersion parameters: `params = md2.config.NegBinConfig(...)`
        - Defines the parameters to learn the Negative binomial dispersion parameters

### Bayes factors <a class="anchor" id="bayesfactors"></a>

Generate the bayes factors after the chains have run

- Interactions

    * Location in MDSINE2: `MDSINE2.util.generate_interation_bayes_factors_posthoc`

    * Command: `bf = md2.generate_interation_bayes_factors_posthoc(...)`

    * Example:

    ```python
    mcmc = md2.BaseMCMC.load('path/to/mcmc.pkl')
    bf = md2.generate_interation_bayes_factors_posthoc(mcmc=mcmc, section='posterior')
    ```

- Perturbations

    * Location in MDSINE2: `MDSINE2.util.generate_perturbation_bayes_factors_posthoc`

    * Command: `bf = md2.generate_perturbation_bayes_factors_posthoc(...)`

    * Example:

    ```python
    mcmc = md2.BaseMCMC.load('path/to/mcmc.pkl')
    perturbation = mcmc.graph.perturbations[name_of_perturbation]
    bf = md2.generate_perturbation_bayes_factors_posthoc(
        mcmc=mcmc, perturbation=perturbation, section='posterior')
    ```
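
For intuition, a Bayes factor for an on/off indicator is the posterior odds of inclusion divided by the prior odds. The sketch below assumes a Beta prior on the inclusion probability, so that the prior odds are `prior_a / prior_b`; the function name and arguments are hypothetical, and this is not a copy of the posthoc functions above.

```python
def indicator_bayes_factor(indicator_trace, prior_a, prior_b):
    """Bayes factor for inclusion: posterior odds divided by prior odds.

    Assumes the inclusion probability has a Beta(prior_a, prior_b) prior,
    so the prior odds of inclusion are prior_a / prior_b.
    """
    n_on = sum(1 for z in indicator_trace if z)
    n_off = len(indicator_trace) - n_on
    if n_off == 0:
        return float('inf')
    posterior_odds = n_on / n_off
    prior_odds = prior_a / prior_b
    return posterior_odds / prior_odds

# The indicator is on in 75 of 100 posterior samples
trace = [True] * 75 + [False] * 25
bf = indicator_bayes_factor(trace, prior_a=1.0, prior_b=4.0)
print(bf)
```

With posterior odds of 3 and prior odds of 0.25, the Bayes factor is 12: the data favor inclusion more strongly than the raw frequency suggests because the prior was skeptical.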

### Condensing fixed cluster interactions in perturbations into cluster-cluster interactions <a class="anchor" id="condensingfixedcluster"></a>

- Interactions

    * Location in MDSINE2: `MDSINE2.util.condense_fixed_clustering_interaction_matrix`

    * Command: `M = md2.condense_fixed_clustering_interaction_matrix(...)`

    * Example:

    Generate cluster-cluster interactions for each Gibbs step

    ```python
    mcmc = md2.BaseMCMC.load('path/to/fixed/clustering/mcmc.pkl')
    clustering = mcmc.graph[STRNAMES.CLUSTERING_OBJ]
    M = mcmc.graph[STRNAMES.INTERACTIONS_OBJ].get_trace_from_disk(
        section='posterior') # (n_gibbs, n_taxa, n_taxa)
    M_condense = md2.condense_fixed_clustering_interaction_matrix(
        M, clustering=clustering) # (n_gibbs, n_clusters, n_clusters)
    ```

    Generate expected cluster-cluster interactions

    ```python
    mcmc = md2.BaseMCMC.load('path/to/fixed/clustering/mcmc.pkl')
    clustering = mcmc.graph[STRNAMES.CLUSTERING_OBJ]
    M = md2.summary(mcmc.graph[STRNAMES.INTERACTIONS_OBJ],
                    set_nan_to_0=True, section='posterior')['mean'] # (n_taxa, n_taxa)
    M_condense = md2.condense_fixed_clustering_interaction_matrix(
        M, clustering=clustering) # (n_clusters, n_clusters)
    ```

    Generate bayes factors of the cluster-cluster interactions

    ```python
    mcmc = md2.BaseMCMC.load('path/to/fixed/clustering/mcmc.pkl')
    clustering = mcmc.graph[STRNAMES.CLUSTERING_OBJ]
    bf = md2.generate_interation_bayes_factors_posthoc(
        mcmc=mcmc, section='posterior') # (n_taxa, n_taxa)
    bf_condensed = md2.condense_fixed_clustering_interaction_matrix(
        bf, clustering=clustering) # (n_clusters, n_clusters)
    ```

    Note that the function can be fed any `np.ndarray` as long as the last 2 dimensions have the shape `(n_taxa, n_taxa)`


- Perturbations

    * Location in MDSINE2: `MDSINE2.util.condense_fixed_clustering_perturbation`

    * Command: `M = md2.condense_fixed_clustering_perturbation(...)`
    
    * Example:
    
    Generate cluster perturbations for each Gibbs step

    ```python
    mcmc = md2.BaseMCMC.load('path/to/fixed/clustering/mcmc.pkl')
    clustering = mcmc.graph[STRNAMES.CLUSTERING_OBJ]
    perturbation = mcmc.graph.perturbations[name_of_perturbation]
    M = perturbation.get_trace_from_disk(section='posterior') # (n_gibbs, n_taxa)
    M_condense = md2.condense_fixed_clustering_perturbation(
        M, clustering=clustering) # (n_gibbs, n_clusters)
    ```
    
    Generate expected cluster perturbation values

    ```python
    mcmc = md2.BaseMCMC.load('path/to/fixed/clustering/mcmc.pkl')
    clustering = mcmc.graph[STRNAMES.CLUSTERING_OBJ]
    perturbation = mcmc.graph.perturbations[name_of_perturbation]
    M = md2.summary(perturbation, set_nan_to_0=True, section='posterior')['mean'] # (n_taxa, )
    M_condense = md2.condense_fixed_clustering_perturbation(
        M, clustering=clustering) # (n_clusters, )
    ```

    Generate bayes factors of the perturbation

    ```python
    mcmc = md2.BaseMCMC.load('path/to/fixed/clustering/mcmc.pkl')
    clustering = mcmc.graph[STRNAMES.CLUSTERING_OBJ]
    perturbation = mcmc.graph.perturbations[name_of_perturbation]
    bf = md2.generate_perturbation_bayes_factors_posthoc(
        mcmc=mcmc, perturbation=perturbation, section='posterior') # (n_taxa, )
    bf_condensed = md2.condense_fixed_clustering_perturbation(
        bf, clustering=clustering) # (n_clusters, )
    ```

    Note that the function can be fed any `np.ndarray` as long as the last dimension has the shape `(n_taxa, )`
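
The idea behind condensing can be sketched as follows: with a fixed clustering, every taxon in a cluster shares the same values, so one representative taxon per cluster is enough. The `condense` function below is a hypothetical illustration that picks the first taxon of each cluster; it takes a plain list of cluster labels instead of a `Clustering` object and is not the library implementation.

```python
import numpy as np

def condense(values, assignments, ndim):
    """Keep one representative taxon per cluster along the last `ndim` axes.

    values: array whose last `ndim` axes (1 or 2) each have length n_taxa
    assignments: cluster label of each taxon, in taxon order
    """
    assignments = np.asarray(assignments)
    # Index of the first taxon seen in each cluster, ordered by cluster label
    _, reps = np.unique(assignments, return_index=True)
    if ndim == 1:
        return values[..., reps]
    return values[..., reps[:, None], reps[None, :]]

assignments = np.array([0, 0, 1, 1])  # two clusters of two taxa each
C = np.array([[0.0, 0.5],
              [-2.0, 0.0]])           # toy cluster-cluster interactions
# Expand to taxon level: every taxon pair inherits the value of its cluster pair
M = C[assignments[:, None], assignments[None, :]]  # (n_taxa, n_taxa)

M_condense = condense(M, assignments, ndim=2)             # (n_clusters, n_clusters)
stacked = condense(np.stack([M, 2 * M]), assignments, 2)  # (n_gibbs, n_clusters, n_clusters)
p_condense = condense(np.array([0.3, 0.3, -1.0, -1.0]), assignments, ndim=1)
print(M_condense.shape, stacked.shape, p_condense)
```

Indexing with `...` in front is what lets the same function handle a single matrix and a whole trace of matrices.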

### Forward simulating <a class="anchor" id="forwardsimulating"></a>

Forward simulate a dynamical system

* Location in MDSINE2: 
    - Definition of the model: `MDSINE2.model.gLVDynamicsSingleClustering`
    - Forward simulation: `MDSINE2.pylab.dynamics.integrate`
* Command:
    - Definition of the model: `md2.gLVDynamicsSingleClustering`
    - Forward simulation: `md2.integrate`
    
* Examples:
    
    Forward simulate the dynamical system in each Gibbs step:
    ```python
    mcmc = md2.BaseMCMC.load('path/to/mcmc.pkl')

    # Get the initial conditions from a subject
    subj = md2.Subject.load('path/to/subject.pkl')
    initial_conditions = subj.matrix()['abs'][:, 0]
    
    pred_matrix = md2.gLVDynamicsSingleClustering.forward_sim_from_chain(
        mcmc=mcmc, subjname=subj.name, initial_conditions=initial_conditions,
        times=subj.times, simulation_dt=0.01, section='posterior')
    ```
    
    Initial conditions can come from anywhere; they just need to be a numpy array with `n_taxa` elements. We do need to pass in a subject name `subjname` so the chain knows which perturbations to use. This function is a high-level wrapper around `gLVDynamicsSingleClustering` and `md2.integrate`. We can also forward simulate a single Gibbs step with finer control:

    Forward simulate for a single Gibbs step:
    ```python
    
    # Initialize and run the dynamics object
    dyn = md2.gLVDynamicsSingleClustering(
        # np.ndarray(n_taxa), growth parameters
        growth=...,
        # np.ndarray(n_taxa, n_taxa), Interaction matrix (with self-interactions on diagonal)
        interactions=..., 
        # iterable(np.ndarray(n_taxa)), perturbation effects for each perturbation
        perturbations=...,
        # iterable(np.ndarray(float)), start time for each perturbation
        perturbation_starts=...,
        # iterable(np.ndarray(float)), end time for each perturbation
        perturbation_ends=...)
    
    ret = md2.integrate(
        dyn, 
        # np.ndarray(n_taxa, 1) initial abundances
        initial_conditions=..., 
        dt=...,
        n_days=...,
        # If you want to return only certain timepoints
        subsample=True, times=...)
    ```
    
    You can find examples of forward simulation in `MDSINE2_Paper/forward_sim.py`
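
To show what forward simulation computes, here is a self-contained sketch of generalized Lotka-Volterra integration in plain NumPy. `simulate_glv` is a hypothetical function; the multiplicative (log-space) Euler step and the way perturbations scale the growth rate are assumptions made for this illustration, not a description of `md2.integrate`.

```python
import numpy as np

def simulate_glv(x0, growth, interactions, dt, n_days,
                 perturbations=(), starts=(), ends=()):
    """Integrate dx/dt = x * (r * (1 + sum of active perturbations) + A @ x).

    Uses a multiplicative (log-space) Euler step, which keeps abundances positive.
    Returns the time grid and an array of shape (n_taxa, n_steps + 1).
    """
    n_steps = int(round(n_days / dt))
    times = np.arange(n_steps + 1) * dt
    traj = np.empty((len(x0), n_steps + 1))
    traj[:, 0] = x0
    for k in range(n_steps):
        x, t = traj[:, k], times[k]
        effect = np.zeros(len(x0))
        for pert, start, end in zip(perturbations, starts, ends):
            if start <= t < end:
                effect = effect + pert
        rate = growth * (1.0 + effect) + interactions @ x
        traj[:, k + 1] = x * np.exp(dt * rate)
    return times, traj

# Single taxon with logistic growth: the carrying capacity is -r / a = 1.0
times, traj = simulate_glv(
    x0=np.array([0.1]), growth=np.array([1.0]),
    interactions=np.array([[-1.0]]), dt=0.01, n_days=20)

# Same system with a perturbation that reverses the growth rate for the whole run
_, traj_pert = simulate_glv(
    x0=np.array([0.1]), growth=np.array([1.0]),
    interactions=np.array([[-1.0]]), dt=0.01, n_days=20,
    perturbations=[np.array([-2.0])], starts=[0.0], ends=[20.0])
print(traj[0, -1], traj_pert[0, -1])
```

Without the perturbation the taxon settles at its carrying capacity, and with it the abundance decays, which is the qualitative behavior a perturbation term is meant to capture.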


### Make a name for a taxon <a class="anchor" id="makingnamefortaxon"></a>

You can specify a format in which to write the label of taxa using the function `md2.taxaname_formatter`. With this, you specify what taxonomic information you want where and it will automatically generate the label. This is how the visualization heatmaps generate the labels on the Axes:

* Location in MDSINE2: `mdsine2.pylab.base.taxaname_formatter`
* Command: `mdsine2.taxaname_formatter`
* See documentation for examples and details
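
The format strings used in this document, such as `'%(order)s %(family)s'`, follow the shape of Python's dictionary-based `%` formatting. The snippet below applies that built-in formatting to the taxonomy of `ASV_1` shown earlier; whether `taxaname_formatter` uses the built-in operator internally or supports additional keys is not assumed here.

```python
taxonomy = {
    'kingdom': 'Bacteria', 'phylum': 'Bacteroidetes', 'class': 'Bacteroidia',
    'order': 'Bacteroidales', 'family': 'Bacteroidaceae',
    'genus': 'Phocaeicola', 'species': 'NA',
}

# Each %(level)s placeholder is replaced by the value stored under that key
label = '%(order)s %(family)s' % taxonomy
print(label)
```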

### Generating cluster assignments <a class="anchor" id="clusterassignments"></a>

To get the clustering order and to generate the cluster assignments after inference, use the function `md2.generate_cluster_assignments_posthoc`. Note, if you just load a `BaseMCMC` object and get the cluster object, the only time where it is in the right configuration is when we are running with fixed clustering. If we are not running with fixed clustering, the current assignments __are wrong__. You need to call this function to generate the right posterior cluster assignments.
* Location in MDSINE2: `mdsine2.util.generate_cluster_assignments_posthoc`
* Command: `mdsine2.generate_cluster_assignments_posthoc`
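
One reason a single stored assignment is not representative is that cluster labels can change from sample to sample. A label-independent summary is the fraction of samples in which two taxa share a cluster. The sketch below computes that co-clustering frequency on a made-up trace; it is an illustration of the concept and an assumption about the kind of quantity a posthoc consensus is built from, not the algorithm used by `generate_cluster_assignments_posthoc`.

```python
def cocluster_frequency(assignment_trace):
    """Fraction of samples in which each pair of taxa shares a cluster."""
    n_samples = len(assignment_trace)
    n_taxa = len(assignment_trace[0])
    freq = [[0.0] * n_taxa for _ in range(n_taxa)]
    for labels in assignment_trace:
        for i in range(n_taxa):
            for j in range(n_taxa):
                if labels[i] == labels[j]:
                    freq[i][j] += 1.0 / n_samples
    return freq

# Cluster labels of 3 taxa over 4 posterior samples (made-up values)
trace = [[0, 0, 1], [1, 1, 0], [0, 0, 0], [0, 1, 1]]
freq = cocluster_frequency(trace)
print(freq[0][1], freq[0][2], freq[1][2])
```

The first two samples use different labels for the same partition, yet both count toward the same pairwise frequency.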

