#  Introduction to constraint-based modeling using COBREXA.jl

## Part 3: In-silico gene knockouts

### Instructor:
* Miguel Ponce de León (Barcelona Supercomputing Center)
* Contact: miguel.ponce@bsc.es


Install COBREXA and the GLPK solver if they are not installed yet, and load them

In [None]:
import Pkg
Pkg.add("COBREXA")
using COBREXA

Pkg.add("GLPK")
using GLPK

Let's download and open the big model again

In [None]:
path_to_model = "data/E_coli_iJO1366.json"
model = load_model(StandardModel, path_to_model);

## Inspecting gene-reaction associations

Main reference is this:
https://lcsb-biocore.github.io/COBREXA.jl/stable/examples/07_gene_deletion/

Each reaction has a gene association or gene reaction rule, which dictates the gene products that
need to be available so that the reaction can carry flux.


Pick a gene of interest

`gene = model.genes["b0720"]`

Inspect the reactions associated with b0720



In [None]:
genes(model)

gene = model.genes["b0720"]

# gene-reaction rule of PFK in DNF (reused in the next cell)
gene_rules_dnf = reaction_gene_association(model, "PFK")

# pick an arbitrary reaction (the 400th one) and look at its rule
rxn = collect(values(model.reactions))[400]

rxn.grr

The result is in DNF for (computational) simplicity; the rules can be
converted e.g. to Strings which are more suitable for reading:

In [None]:
COBREXA._unparse_grr(String, gene_rules_dnf)
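
To make the DNF representation more concrete, here is a minimal sketch in plain Julia that does not use COBREXA at all. It assumes a rule is stored as a vector of complexes, each complex being a vector of gene IDs; the helper name `grr_to_string` and the example rule are made up for illustration.

```julia
# A gene-reaction rule in DNF: the outer vector is an OR over complexes,
# each inner vector is an AND over the genes that form one complex.
# (Invented rule, not taken from the model.)
toy_grr = [["g1", "g2"], ["g3"]]

# Hypothetical helper: render a DNF rule as a human-readable string.
function grr_to_string(grr)
    complexes = ["(" * join(c, " and ") * ")" for c in grr]
    return join(complexes, " or ")
end

grr_to_string(toy_grr)  # "(g1 and g2) or (g3)"
```

Reading the output from left to right gives the biological meaning: the reaction works if both `g1` and `g2` are present, or if `g3` is present.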

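We could knock out genes by iterating over the reactions and evaluating their DNF rules by hand.
For simplicity, the knockout is also available as a ready-made modification (`knockout`), which is used in the exercise below. First, let's look up the name of our gene of interest: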

In [None]:
gene_name(model, "b0720")

### Exercise 3.1: Single knock out study.

Documentation: [https://cobrapy.readthedocs.io/en/latest/deletions.html#Knocking-out-single-genes-and-reactions](https://cobrapy.readthedocs.io/en/latest/deletions.html#Knocking-out-single-genes-and-reactions)

We will use gene b0720 as an example

1. COBREXA can find the reactions that must be disabled when a gene is knocked out, as follows:

```
flux_dict = flux_balance_analysis_dict(
    model,
    GLPK.Optimizer,
    modifications = [knockout("b0720")],
)
```

(This code knocks out the gene b0720, re-runs the FBA and stores the new solution in `flux_dict`. Because the knockout is passed as a modification of the optimization problem, `model` itself is not changed, so we don't need to care about restoring the knocked-out gene afterwards.)

2. Check the growth value (Hint: `flux_dict["BIOMASS_Ec_iJO1366_core_53p95M"]`)
3. Is the gene predicted as essential or non-essential?
4. Go to the Ecocyc database and check the in vivo experimental result for the knockout of b0720 by accessing the following link:
* [https://ecocyc.org/gene?orgid=ECOLI&id=EG10402](https://ecocyc.org/gene?orgid=ECOLI&id=EG10402)

Is b0720 essential or not?

In [None]:
## TODO
## Write your code below




...the model is still feasible (so it can "sustain itself"), but growth is
basically zero.

## Systems-wide knock out study of *E. coli*.
    
COBREXA has a function that lets us run the single-gene knockouts for a whole list of genes.

The function is called `screen`. It allows us to run many analyses on a model in parallel,
with several optimizations for distributed processing (e.g., data are
only moved once).
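
Conceptually, a screen applies one analysis function to the model once per argument tuple. The sketch below mimics only that idea in plain, sequential Julia; `toy_screen` and the toy data are assumptions for illustration and say nothing about how COBREXA implements `screen` or distributes the work.

```julia
# Conceptual stand-in for a screen (NOT COBREXA's implementation): apply
# `analysis` to the model together with each argument tuple, in order.
toy_screen(model, args, analysis) = [analysis(model, a...) for a in args]

# Toy "model": a dictionary of numbers. Toy "analysis": look one value up.
toy_model = Dict("a" => 1.0, "b" => 2.0, "c" => 3.0)

# One 1-tuple per analysis, built the same way as `tuple.(genes(model))`.
toy_args = tuple.(["a", "c"])

results = toy_screen(toy_model, toy_args, (m, key) -> 2 * m[key])
# results == [2.0, 6.0]
```

Each element of `args` is a tuple because an analysis may take several arguments; with a single gene per analysis, the tuples simply have one element.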

We can screen through all genes. As a first try (limited to the first 10 genes to keep it quick), one could simply write something like:

In [None]:
knockout_fluxes = screen(
    model, # the model which we process
    args = tuple.(genes(model)[1:10]), # all argument lists for the analyses
    analysis = (model, gene) -> # the analysis function ("lambda") that we want to run on the cluster for each item from the argument list
        flux_balance_analysis_dict(model, GLPK.Optimizer, modifications = [knockout(gene)]),
)

...but that might be slow for larger numbers of genes, and we would like to
add some special handling for knockouts where there is no feasible solution
(in which case `flux_balance_analysis_dict` returns `nothing`).

First, let's use COBREXA's parallelization capabilities to make this faster.
We will use the Distributed package to run this over multiple processes.
For technical reasons, instead of doing this in the notebook, we will use the following script:

`julia  --project=./  run_single_gene_deletion.jl`

The script will:
1. Load the model
2. Screen each gene and perform the KO experiments
3. Gather all the results in a DataFrame
4. Save the results in CSV format to the file `out/ko_report.csv`

We can inspect the script code below to see what it does

## Loading the needed modules and running the experiment
```
import Pkg
Pkg.add(["COBREXA", "GLPK", "Distributed", "DataFrames", "CSV"])

using COBREXA, GLPK
using Distributed
using DataFrames, CSV

# Loading the model
path_to_model = "data/E_coli_iJO1366.json"
model = load_model(StandardModel, path_to_model);

# only add processes if you are sure that you have sufficient resources available!
n_workers = 4  # named `n_workers` so that it does not shadow `Distributed.nprocs`
addprocs(n_workers)

# only necessary if you added the extra processes
@everywhere using COBREXA, GLPK

# Knock out every gene and post-process the results a little: keep only the
# biomass flux, and report 0.0 growth when there is no feasible solution.
knockout_fluxes = screen(
    model,
    args = tuple.(genes(model)),
    analysis = (model, gene) -> begin
        res = flux_balance_analysis_dict(model, GLPK.Optimizer, modifications = [knockout(gene)])
        if !isnothing(res)
            gene => res["BIOMASS_Ec_iJO1366_core_53p95M"]
        else
            gene => 0.0
        end
    end,
    workers = workers(),
)
```

If you restricted `args` to a subset of genes while testing (e.g., the first 50 genes), remove that limit once everything works to get a complete result.

## Let's create a CSV with a report, as always
```
df = DataFrame(gene = first.(knockout_fluxes))
df.name = gene_name.(Ref(model), df.gene)
df.fluxes = last.(knockout_fluxes)
```

Typically we want to mark the genes that changed something. Let's mark the
genes that are required for growth as essential, and the ones that reduce the
growth somehow (but not fatally) as interesting.


```
best_result = maximum(last.(knockout_fluxes))
essential_threshold = 0.01 * best_result
df.essential = df.fluxes .<= essential_threshold
df.interesting = (df.fluxes .< best_result * 0.999) .&& .!df.essential
df

mkpath("out")  # make sure the output directory exists
CSV.write("out/ko_report.csv", df);
``` 

The idea is that if a gene KO reduces growth below 1% of the best growth rate observed in the screen (our stand-in for the wild-type growth rate), we consider that gene essential.
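
The thresholding can be tried out on a few invented numbers before applying it to the real screen. The growth rates below are made up; the cutoffs mirror the ones used in the code above (1% and 99.9% of the best growth).

```julia
# Invented growth rates for five hypothetical gene knockouts.
toy_growth = [0.98, 0.0, 0.5, 0.98, 0.005]

best = maximum(toy_growth)               # reference growth rate
essential = toy_growth .<= 0.01 * best   # (almost) no growth left
# reduced growth, but not fatal; note the parentheses around the comparison
interesting = (toy_growth .< 0.999 * best) .& .!essential

essential    # [false, true, false, false, true]
interesting  # [false, false, true, false, false]
```

The parentheses matter: in Julia `&` binds more tightly than `<`, so leaving them out would change the meaning of the expression.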

## How well do the in-silico knockouts compare to real measurements?

Since there are existing measurements of what happens with real *E. coli* after
knockouts, we can look at our results as predictions, and compare them to the
ground truth with the usual statistical means.

First, let's read the experimentally verified "lethal" knockout genes from
the supplied JSON data file.

In [None]:
using CSV
using DataFrames
using JSON

Once the list of lethal gene knockouts is loaded and matched against our predictions, we can count:
* True/False positives
* True/False negatives

In [None]:
# Reading list of in-vivo essential genes in M9 media
ex_lethal = JSON.parsefile("data/m9_invivo_lethals.json")

# Reading in-silico gene deletion results
df_ko = DataFrame( CSV.File( "out/ko_report_presolved.csv" ) );

# Comparing predictions and experiments
df_ko.invivo_essential = in.(df_ko.gene, Ref(ex_lethal))
df_ko.insilico_essential = df_ko.essential;

In [None]:
# "positive" means "essential"; predictions are in silico, the truth is in vivo
TP = count(df_ko.insilico_essential .& df_ko.invivo_essential);
TN = count(.!df_ko.insilico_essential .& .!df_ko.invivo_essential);
FP = count(df_ko.insilico_essential .& .!df_ko.invivo_essential);
FN = count(.!df_ko.insilico_essential .& df_ko.invivo_essential);
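
As a sanity check on the definitions, the same kind of counting can be run on a handful of made-up genes. Here "positive" means "essential", so a false positive is a gene predicted essential in silico that is not essential in vivo. The two label vectors below are invented for illustration.

```julia
# Invented labels for six hypothetical genes.
predicted = [true, true, false, false, true, false]  # in silico essential?
observed  = [true, false, false, true, true, false]  # in vivo essential?

toy_tp = count(predicted .& observed)      # predicted essential, is essential
toy_tn = count(.!predicted .& .!observed)  # predicted viable, is viable
toy_fp = count(predicted .& .!observed)    # predicted essential, is viable
toy_fn = count(.!predicted .& observed)    # predicted viable, is essential

(toy_tp, toy_tn, toy_fp, toy_fn)  # (2, 2, 1, 1)
```

The four counts always add up to the number of genes, which is a quick way to catch a mislabeled category.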

### Exercise 3.3:
Complete the following table using the values from Exercise 3.2 (*E. coli*)

| In vivo \ In silico        | in silico lethal | in silico non-lethal |
| -------------------------- |:----------------:| --------------------:|
| <b>in vivo lethal</b>      |               ?  |                   ?  |
| <b>in vivo non-lethal</b>  |               ?  |                   ?  |

Tip for creating a matrix in Julia:

```
matrix = [ 
    A B
    C D
]
```


### Exercise 3.4:
Access the following link:

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

Get the formulas and calculate the following metrics:
* sensitivity
* specificity
* precision
* accuracy

In [None]:
## TODO
## Write your code below


sensitivity = TP / (TP + FN)

# TODO: compute the other metrics (specificity, precision, accuracy)

## Doing the knockouts manually, the hard way

Now, let's have a look at how the knockouts are computed.

Each reaction has a Gene-Reaction Rule (GRR) that marks the genes required
for it to actually work in the organism. These are generally any Boolean
expressions, but in COBREXA we tend to store them in disjunctive normal form
(DNF, see https://en.wikipedia.org/wiki/Disjunctive_normal_form) which
closely corresponds to the biological meaning of gene units that form
interchangeable complexes. You can access them using the `grr` field in
Reaction structures:

In [None]:
model.reactions["RNDR2"].grr

Here, the reaction can be catalyzed by either of 2 alternative enzyme
complexes: the first is built from the products of genes
`b2234`, `b2235`, and `b2582`; the second uses
`b3781` instead of `b2582`.

We may list all GRRs simply by iterating through the model reactions:

In [None]:
[rid => r.grr for (rid,r) in model.reactions]

It is often interesting to ask which reactions may depend on which gene; we
can make a convenience function for that:

In [None]:
# `gene in complex` matches whole gene IDs; `contains` would also match substrings
reactions_of_gene(model, gene) =
  [rid for (rid,r) in model.reactions if !isnothing(r.grr) && any(complex -> gene in complex, r.grr)]

reactions_of_gene(model, "b1064")

Using the vector notation is quite convenient for creating lists that allow
us to get an overview of the situation:

In [None]:
gene_name.(Ref(model), genes(model)) .=> reactions_of_gene.(Ref(model), genes(model))
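
The role of `Ref` in such broadcasts may not be obvious: it tells Julia to treat the wrapped object as a single value instead of trying to iterate over it. A minimal sketch with a dictionary standing in for the model (the data and the helper `toy_lookup` are invented for illustration):

```julia
# Toy stand-in for the model: maps gene IDs to reaction IDs (invented data).
toy_model = Dict("g1" => ["R1", "R2"], "g2" => ["R2"], "g3" => String[])

# Hypothetical lookup that takes the whole "model" plus a single gene.
toy_lookup(m, gene) = m[gene]

gene_ids = ["g1", "g2", "g3"]

# `Ref(toy_model)` is passed unchanged to every call, so only `gene_ids`
# is iterated; `.=>` then pairs each input with its output.
overview = gene_ids .=> toy_lookup.(Ref(toy_model), gene_ids)

overview[1]  # "g1" => ["R1", "R2"]
```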

Now, given a set of genes that we want to knock out, we can manually find if
a given reaction will still work or not. Let's try on RNDR1:

In [None]:
grr = model.reactions["RNDR1"].grr

ko_genes = ["b2234"]

We can transform the `grr` to a form where it says which genes are present
and which genes are not:

In [None]:
grr_available = map(c -> map(!in(ko_genes), c), grr)

To determine if the reaction _can_ work, at least one ("any") of the
complexes must have all gene products available:

In [None]:
any(all, grr_available)

Since `b2234` is essential for RNDR1 (it needs to be present in all complexes
that may run the reaction), the reaction is effectively disabled by knocking
out `b2234`.

What if we knock out `b2582`?

In [None]:
ko_genes = ["b2582"]
grr_available = map(c -> map(!in(ko_genes), c), grr)

any(all, grr_available)

...the reaction may still work with just `b2582` knocked out.

However, if we knock out both alternative genes, the reaction will cease to work again:

In [None]:
ko_genes = ["b2582", "b3781"]
grr_available = map(c -> map(!in(ko_genes), c), grr)
any(all, grr_available)

We can formalize the knockout evaluation in a function:

In [None]:
function is_reaction_knocked_out(model, reaction, ko_genes)
    grr = model.reactions[reaction].grr
    if isnothing(grr)
        return false # reactions without a gene-reaction rule happen spontaneously and cannot be knocked out
    end
    grr_available = map(c -> map(!in(ko_genes), c), grr)
    !any(all, grr_available)
end
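
The same logic can be checked without a model. The sketch below is a self-contained variant that works directly on a rule in DNF (a vector of complexes, each a vector of gene IDs) or on `nothing`; the name `toy_is_knocked_out` and the gene IDs are invented for illustration.

```julia
# Self-contained variant of the knockout check on a bare DNF rule.
function toy_is_knocked_out(grr, ko_genes)
    isnothing(grr) && return false  # no rule: cannot be knocked out
    # the reaction survives if any complex has all of its genes available
    survives = any(complex -> all(g -> !(g in ko_genes), complex), grr)
    return !survives
end

# Invented rule: two complexes that differ only in their last gene.
toy_rule = [["gA", "gB", "gC"], ["gA", "gB", "gD"]]

toy_is_knocked_out(toy_rule, ["gA"])        # true: gA is in every complex
toy_is_knocked_out(toy_rule, ["gC"])        # false: second complex is intact
toy_is_knocked_out(toy_rule, ["gC", "gD"])  # true: both complexes are broken
toy_is_knocked_out(nothing, ["gA"])         # false: no rule to break
```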

Now, we can manually modify the model to disable the reactions that would be
knocked out by a certain gene combination:

In [None]:
ko_genes = ["b2582", "b3781"]
for (rid, r) = model.reactions
    if is_reaction_knocked_out(model, rid, ko_genes)
        r.lb = r.ub = 0.0
    end
end
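
Note that this loop changes the bounds stored in `model` itself, so any later analysis on the same object will include the knockout. One way to keep the wild type around, sketched here on a toy dictionary of `[lb, ub]` bounds (invented data, not the COBREXA model structure), is to work on a `deepcopy`:

```julia
# Toy stand-in for reaction bounds: reaction ID => [lb, ub] (invented values).
wild_type = Dict("R1" => [-10.0, 10.0], "R2" => [0.0, 5.0])

# A deep copy also duplicates the inner vectors; a shallow `copy` would
# share them, and the change below would leak into `wild_type`.
mutant = deepcopy(wild_type)
mutant["R1"] .= 0.0   # "knock out" R1 in the copy only

wild_type["R1"]  # [-10.0, 10.0]
mutant["R1"]     # [0.0, 0.0]
```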

Does the model still grow?

In [None]:
sol = flux_balance_analysis_dict(model, GLPK.Optimizer)
sol["BIOMASS_Ec_iJO1366_core_53p95M"]

...so it seems that the combination of the 2 genes was not essential at all.

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*