# Network Optimization and Causal Analysis for Perturb-seq (NOCAP)



## Outline:
1. Network creation
2. Linear model analysis
3. Non-Linear (Hill) model analysis (in progress)
4. Future directions

# Introduction

Our aim is to build a robust tool for modeling, analysis, and causal inference of Perturb-seq data for biosystems design applications.

## 1. Network creation

### Generate a network from the BioCyc database

E. coli regulatory database --> output file --> directed graph with edge polarity (activation/inhibition)

In [1]:
import utility

fname = "ECOLI-regulatory-network.txt"
# Parse the regulatory file into the full E. coli network
ecoli_network = utility.parse_regulation_file(fname)
# Simple summary of the graph
print(f"E. coli regulatory network: {len(ecoli_network.nodes)} nodes and {len(ecoli_network.edges)} edges.")

# Reuse the parsed graph under the shorter name used in later cells
network = ecoli_network

E. coli regulatory network: 3042 nodes and 9678 edges.




View of the entire network (from EcoCyc):

<img src="ecoli_viz.png" alt="Ecoli GRN" width="800"/>

### Generate a subnetwork from desired target gene

Include the target gene, its descendants, and their ancestors (potential confounders).

In [3]:
# Extract the subnetwork around the target gene
subnetwork = utility.get_subgraph_from_nodes(network, ['lacI'])

# Node and edge counts
print(len(subnetwork.nodes))
print(len(subnetwork.edges))

108
536
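
The selection rule above (target gene, its descendants, and the ancestors of those nodes) can be sketched in plain Python. This is only an illustration on a toy edge list and does not reproduce `utility.get_subgraph_from_nodes`, whose internals are not shown here. The names `reachable`, `select_subnetwork`, and the toy genes are hypothetical, and the sketch assumes that ancestors of the target itself are also kept.

```python
from collections import deque


def reachable(adj, start):
    """Return every node reachable from `start` by following `adj` (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def select_subnetwork(edges, target):
    """Keep the target, its descendants, and the ancestors of all kept nodes."""
    children, parents = {}, {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
        parents.setdefault(v, []).append(u)
    keep = {target} | reachable(children, target)  # target + descendants
    for node in list(keep):
        keep |= reachable(parents, node)  # ancestors (potential confounders)
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return keep, sub_edges


# Toy network (hypothetical gene names); E -> F is unrelated to target T.
toy_edges = [("T", "A"), ("A", "B"), ("C", "B"), ("D", "C"), ("E", "F"), ("U", "T")]
nodes, sub_edges = select_subnetwork(toy_edges, "T")
print(sorted(nodes), len(sub_edges))
```

Here `C` and `D` are kept because they are upstream of the descendant `B`, while `E` and `F` are dropped.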


### Adjust subnetwork for analysis*

Current causal inference algorithms require a directed *acyclic* graph (DAG), so any cycles in the subnetwork must be removed before analysis.

*Algorithms for cyclic graphs are ongoing work.
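
One simple way to obtain a DAG is to drop the edges that close a cycle during a depth-first search. The sketch below is an assumption about how the adjustment could be done, not a description of the method used in this project; which edges to remove is ultimately a biological modeling decision. The function name `remove_back_edges` and the toy edges are hypothetical.

```python
def remove_back_edges(edges):
    """Return (kept_edges, removed_edges) after dropping DFS back edges.

    Removing every back edge of a depth-first search leaves an acyclic graph.
    Uses recursion, so very deep graphs may need an iterative version.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = dict.fromkeys(adj, WHITE)
    removed = []

    def visit(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY:  # v is on the current path: (u, v) closes a cycle
                removed.append((u, v))
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    for node in adj:
        if color[node] == WHITE:
            visit(node)
    dropped = set(removed)
    kept = [e for e in edges if e not in dropped]
    return kept, removed


# Toy graph with one feedback loop: a -> b -> c -> a
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
dag_edges, removed_edges = remove_back_edges(toy_edges)
print("kept:", dag_edges)
print("removed:", removed_edges)
```

The result depends on the order in which nodes are visited, so a different edge of the same loop could be removed if the edge list were ordered differently.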

## 2. Linear models

### Linear structural causal models

Each node $i$ has an associated bias term $\epsilon_i$.

Each edge $i \rightarrow j$ has an associated weight $\beta_{ij}$, which is $>0$ for activation and $<0$ for inhibition.

This gives a linear form for the value of each node (gene $x_j$): $x_j = \sum_{i \in \mathrm{pa}(j)} \beta_{ij} x_i + \epsilon_j$, where $\mathrm{pa}(j)$ is the set of parents (direct regulators) of node $j$.

These equations can be constructed automatically from a DAG.

In [4]:
# create set of linear equations from a DAG
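
As a minimal sketch of this step, the snippet below turns a signed edge list into one linear equation per node, together with the sign constraint on each weight. The edge format `(source, target, sign)`, the function name `build_equations`, and the toy genes are assumptions made for illustration; the project's own representation may differ.

```python
def build_equations(signed_edges):
    """Build parent lists, sign constraints, and one equation string per node."""
    parents, signs = {}, {}
    for src, dst, sign in signed_edges:
        parents.setdefault(src, [])
        parents.setdefault(dst, []).append(src)
        # Activation ("+") constrains the weight to be positive, inhibition ("-") negative
        signs[(src, dst)] = ">0" if sign == "+" else "<0"
    equations = {}
    for node, pa in parents.items():
        terms = [f"beta[{p}->{node}]*{p}" for p in pa]
        terms.append(f"eps[{node}]")
        equations[node] = f"{node} = " + " + ".join(terms)
    return parents, signs, equations


# Toy DAG: A activates B, A inhibits C, B activates C
toy_edges = [("A", "B", "+"), ("A", "C", "-"), ("B", "C", "+")]
parents, signs, equations = build_equations(toy_edges)
for eq in equations.values():
    print(eq)
```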

### Simulating data from a model

Once we have a model and a set of parameters ($\beta$ and $\epsilon$), we can simulate data.

In [6]:
# simulate data and show example
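
A minimal simulation sketch: nodes are visited in topological order so that every parent is sampled before its children. The text above does not specify a noise model, so this example assumes independent Gaussian noise added to each node on top of the bias term $\epsilon$. The function name `simulate`, the toy DAG, and all parameter values are hypothetical.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def simulate(parents, beta, eps, n_samples, noise_sd=1.0, seed=0):
    """Draw samples from a linear SCM with additive Gaussian noise (assumed)."""
    rng = random.Random(seed)
    # TopologicalSorter takes node -> predecessors, so parents come first
    order = list(TopologicalSorter(parents).static_order())
    samples = []
    for _ in range(n_samples):
        x = {}
        for node in order:
            mean = eps[node] + sum(beta[(p, node)] * x[p] for p in parents[node])
            x[node] = mean + rng.gauss(0.0, noise_sd)
        samples.append(x)
    return samples


# Toy DAG: A -> B, A -> C, B -> C
parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
beta = {("A", "B"): 2.0, ("A", "C"): -1.0, ("B", "C"): 0.5}
eps = {"A": 1.0, "B": 0.5, "C": 0.0}

samples = simulate(parents, beta, eps, n_samples=5000)
print(samples[0])
```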

### Fitting a model to data

Suppose we have a dataset and a model, and we want to fit the parameters ($\beta$ and $\epsilon$). 

We can also examine the quality of the fit.

In [2]:
# fit model (frequentist regression)
# show results / evaluation
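
Because each node is a linear function of its parents, one way to fit the model is ordinary least squares, node by node, regressing each gene on its parents. The sketch below does this for a single node in plain Python so that it has no dependencies; in practice a numerical library would normally be used. The data are synthetic, and `solve`, `fit_node`, and all parameter values are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_node(samples, node, parent_names):
    """OLS fit of one node on its parents: returns (eps_hat, beta_hat, R^2)."""
    rows = [[1.0] + [s[p] for p in parent_names] for s in samples]
    y = [s[node] for s in samples]
    k = len(rows[0])
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    pred = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return coef[0], dict(zip(parent_names, coef[1:])), 1.0 - ss_res / ss_tot


# Synthetic data from a known model: C = -1.0 + 1.5*A - 0.8*B + noise
rng = random.Random(0)
samples = []
for _ in range(5000):
    a = 1.0 + rng.gauss(0, 1)
    b = 0.5 + 2.0 * a + rng.gauss(0, 1)
    c = -1.0 + 1.5 * a - 0.8 * b + rng.gauss(0, 1)
    samples.append({"A": a, "B": b, "C": c})

eps_hat, beta_hat, r2 = fit_node(samples, "C", ["A", "B"])
print(f"eps_hat={eps_hat:.2f}, beta_hat={beta_hat}, R^2={r2:.2f}")
```

The recovered coefficients can be compared against the values used to generate the data, and $R^2$ gives a first indication of fit quality.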

### Computing average treatment effect (ATE)

Suppose we want to predict how intervening on X (gene A) affects Y (gene B).

This is given by the average treatment effect: $E[Y \mid do(X = x)] - E[Y]$

In [3]:
# compute ATE
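
The sketch below estimates this quantity in two ways on a toy linear model: by Monte Carlo simulation with and without the intervention, and by the closed form that holds for linear models, where the effect is the sum over directed paths from X to Y of the products of edge weights, multiplied by $x - E[X]$. The graph (with a confounder `Z` and a mediator `M`), the parameter values, the unit-variance Gaussian noise, and the function names are all assumptions for illustration.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy linear SCM: Z confounds X and Y; X affects Y directly and through M
PARENTS = {"Z": [], "X": ["Z"], "M": ["X"], "Y": ["Z", "X", "M"]}
BETA = {("Z", "X"): 1.0, ("Z", "Y"): 1.0, ("X", "Y"): 0.5,
        ("X", "M"): 2.0, ("M", "Y"): -1.0}
EPS = {"Z": 1.0, "X": 0.0, "M": 0.0, "Y": 0.0}


def mean_of(target, n, do=None, seed=0):
    """Monte Carlo estimate of E[target], optionally under do(node = value)."""
    do = do or {}
    rng = random.Random(seed)
    order = list(TopologicalSorter(PARENTS).static_order())
    total = 0.0
    for _ in range(n):
        x = {}
        for node in order:
            if node in do:
                x[node] = do[node]  # hard intervention: parents are ignored
            else:
                x[node] = (EPS[node]
                           + sum(BETA[(p, node)] * x[p] for p in PARENTS[node])
                           + rng.gauss(0.0, 1.0))
        total += x[target]
    return total / n


def total_effect(source, target):
    """Sum over directed paths of the product of edge weights (DAG only)."""
    if source == target:
        return 1.0
    return sum(w * total_effect(child, target)
               for (parent, child), w in BETA.items() if parent == source)


n = 20000
ate_mc = mean_of("Y", n, do={"X": 3.0}, seed=1) - mean_of("Y", n, seed=2)
ate_linear = total_effect("X", "Y") * (3.0 - mean_of("X", n, seed=3))
print(f"Monte Carlo ATE: {ate_mc:.2f}, path-based ATE: {ate_linear:.2f}")
```

With these parameters the total effect of X on Y is $0.5 + 2.0 \times (-1.0) = -1.5$ and $E[X] = 1$, so both estimates should be close to $-3$ for $do(X = 3)$.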

How well does our estimate do?

In [4]:
# predicted vs actual ATE

### Counterfactual reasoning

Suppose we have observed data and have fit a model to it.

A "soft intervention" corresponds to rewiring the network.

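For a linear model, a counterfactual for one observed sample can be computed with the standard three steps: abduction (recover each node's residual from the observation), action (modify the model), and prediction (recompute the values with the same residuals). The sketch below assumes that "rewiring" means changing an edge weight, which is one possible reading of a soft intervention. The chain `A -> B -> C`, the parameter values, and the function name `counterfactual` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

PARENTS = {"A": [], "B": ["A"], "C": ["B"]}
BETA = {("A", "B"): 2.0, ("B", "C"): -1.0}
EPS = {"A": 0.0, "B": 1.0, "C": 0.5}


def counterfactual(observed, new_beta):
    """Counterfactual values for one sample under modified edge weights."""
    order = list(TopologicalSorter(PARENTS).static_order())
    # 1. Abduction: residual = observed value minus the model's prediction
    resid = {
        node: observed[node]
        - (EPS[node] + sum(BETA[(p, node)] * observed[p] for p in PARENTS[node]))
        for node in order
    }
    # 2. Action: replace the selected edge weights (the "rewiring")
    beta_cf = {**BETA, **new_beta}
    # 3. Prediction: propagate through the modified model with the same residuals
    x = {}
    for node in order:
        x[node] = (EPS[node]
                   + sum(beta_cf[(p, node)] * x[p] for p in PARENTS[node])
                   + resid[node])
    return x


observed = {"A": 1.0, "B": 3.5, "C": -2.0}
# What would this sample have looked like if the A -> B edge had been weaker?
cf = counterfactual(observed, {("A", "B"): 0.5})
print(cf)
```

Passing an empty dictionary for `new_beta` reproduces the observed sample, which is a useful sanity check.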

### Examples for FadR, TyrR, LacI

## 3. Non-linear models (Hill equations)
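
For reference, the standard Hill functions for activation and repression are sketched below. How they will be wired into the model is still in progress, so this shows only the functional form; the parameter names `k` (half-saturation level) and `n` (Hill coefficient) are assumed here.

```python
def hill_activation(x, k, n):
    """Fraction of maximal activation at regulator level x."""
    return x**n / (k**n + x**n)


def hill_repression(x, k, n):
    """Fraction of maximal expression remaining at repressor level x."""
    return k**n / (k**n + x**n)


# At x == k both functions are at half of their maximum
print(hill_activation(2.0, k=2.0, n=3), hill_repression(2.0, k=2.0, n=3))
```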

- Technical noise with SERGIO
- Missing batch effects (ChiRho)
- microSPLiT: can it measure small RNA?
- Rate equation for small RNA: first order?
- Falsification: can we test whether the conditional independencies in the data match the conditional independencies implied by the model?
  - Model repair: can we "fix" a misspecified model once we have falsified it (Eliater)?


## Discussion