# Integrating transcriptomics data to reconstruct cancer cell line models
**Authors:**
Thierry D.G.A Mondeel, Stefania Astrologo, Ewelina Weglarz-Tomczak & Hans V. Westerhoff <br/>
University of Amsterdam <br/>
2016 - 2018

In this part of the tutorial we will make use of the previously introduced human metabolic network (Recon 3) and apply cancer cell line transcriptomics to constrain the fluxes. 

**Objectives**
- Learn about ER+ cancer cell lines
- See an example of how (transcriptomics) data can be integrated with the human metabolic reconstruction
- Investigate if the data integrated network can teach you anything new about the cell lines

<span style="color:red">**Preliminary question:**</span>
**What do you think**: will the integration of transcriptomics and the metabolic network teach us more than the transcriptomics alone?

## Setting up the Python environment
<span style="color:red">**Assignment:**</span> Execute the cell below to set up our Python environment

In [2]:
# FBA tools
import cobra
from cobra.flux_analysis import pfba

# Pandas tables
import numpy as np
import pandas as pd # for tables
pd.set_option('display.max_colwidth', None) # don't constrain the content of the tables (None replaces the deprecated -1)
pd.options.display.max_rows = 9999

# import plotting capabilities
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import Range1d
output_notebook() # Run once to get inline figures
import matplotlib.pyplot as plt

import matplotlib
%matplotlib inline
from utils import show_map
import escher
map_loc = './maps/escher_map_RECON3D_energy_metabolism.json' # the escher map used below

import pickle

# required functions for the radar plot
from utils.file_list_function import file_list
from utils.flux_pattern_function import flux_pattern
from utils.df_plot_function import df_plot
from utils.flux_pie_plot_function import flux_pie_plot

# Venn plots
from utils import venn

# show all output in each cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr" # all, last_expr

WT_r3_orig = cobra.io.load_json_model('models/Recon3D_301/Recon3DModel_301_simple_medium.json')
WT_r3 = WT_r3_orig.copy()

## Introduction to ER+ Breast Cancer 
<span style="color:red">**Assignment (5 min):**</span> Read the following general overview about ER+ cancer subtypes and the 4 cell lines we will be investigating.



> <span style="color:blue">**About Breast Cancer**:</span>

>According to the **World Health Organization**, breast cancer is the most common cancer among women worldwide, claiming the lives of hundreds of thousands of women each year and affecting countries at all levels of modernization.

>There are **four** main intrinsic or **molecular subtypes** of breast cancer that are based on the genes a cancer cell expresses. 
>Breast cancer is a heterogeneous disease that can be classified using a variety of clinical and pathological features. Classification may help in prognostication and targeting of treatment to those most likely to benefit. 

>- **Luminal 1** (Luminal A) breast cancer is hormone-receptor positive ER+ or PR+ (estrogen-receptor and/or progesterone-receptor positive), HER2 negative (HER2 -), and has low levels of the protein Ki-67, which helps control how fast cancer cells grow. Luminal 1 cancers are low-grade, tend to grow slowly and have the best prognosis.

>- **Luminal 2** (Luminal B) breast cancer is hormone-receptor positive (estrogen-receptor and/or progesterone-receptor positive), and either HER2 positive or HER2 negative with high levels of Ki-67. Luminal 2 cancers generally grow slightly faster than luminal A cancers and their prognosis is slightly worse.

>- **HER2-enriched** breast cancer is hormone-receptor negative (estrogen-receptor and progesterone-receptor negative) and HER2 positive. HER2-enriched cancers tend to grow faster than luminal cancers and can have a worse prognosis, but they are often successfully treated with targeted therapies aimed at the HER2 protein, such as Herceptin (chemical name: trastuzumab), Perjeta (chemical name: pertuzumab), Tykerb (chemical name: lapatinib), and Kadcyla (chemical name: T-DM1 or ado-trastuzumab emtansine).

>- **Triple-negative/basal-like** (ER- /PR- / HER2-) breast cancer is hormone-receptor negative (estrogen-receptor and progesterone-receptor negative) and HER2 negative. This type of cancer is more common in women with BRCA1 gene mutations. Researchers aren’t sure why, but this type of cancer also is more common among younger and African-American women.

><img src="images/BC_subtypes.png" width="600" height="600" align="center"/><br/>
>[Click here to see the original Figure](https://doi.org/10.1371/journal.pmed.1000279.g001)

>Among the different molecular subtypes of breast cancer, **ER+ breast cancer comprises ~75% of all breast cancers**. Thus, the ER status has become the most important discriminator of breast cancer molecular subtypes, resulting in primary treatment options through targeting the estrogen synthesis or the ER functions. This kind of treatment is generally known as _**endocrine therapy**_.

>--------

> <span style="color:blue">**About endocrine therapy in ER+ BC**:</span>

> Hormonal therapy medicines treat hormone-receptor-positive breast cancers in two ways:
- <span style="color:blue">by **lowering** the amount of the hormone estrogen in the body;</span>
- <span style="color:blue">by **blocking** the action of estrogen on breast cancer cells.</span>


>The main types of **endocrine therapy** are the following: 
- Selective estrogen-receptor modulators **(SERMs)**: <span style="color:red">tamoxifen</span>, Evista, Fareston
- Selective estrogen-receptor degraders or downregulators **(SERDs)**: <span style="color:orange">fulvestrant</span>
- Aromatase inhibitors **(AIs)**: anastrozole, exemestane, letrozole.

>--------
> <span style="color:blue">**About Breast Cancer Resistance**:</span>

> Although endocrine therapy has dramatically improved survival in breast cancer patients over the past several decades, **resistance** to these therapies remains one of the major causes of breast cancer mortality today. Late recurrence and death from estrogen receptor positive (ER+) breast cancer can occur for at least 20 years after the original diagnosis even after 5 years of adjuvant endocrine therapy. 

>Identifying mechanisms of resistance and strategies by which to combat these mechanisms is paramount to patient survival. [(Mills et al., 2018)](https://www.sciencedirect.com/science/article/pii/S147148921830002X?via%3Dihub)

>--------

> <span style="color:blue">**About** **E**ndocrine **T**herapy-**R**esistant (ETR) cell-lines:</span>

> Although endocrine therapies are all designed to block oestrogen-driven proliferation, the development of resistance may follow distinct routes and generate a different phenotype for each agent. To test this hypothesis, a series of cell-lines resistant to single agents (**E**ndocrine **T**herapy-**R**esistant cell-lines) were developed by Magnani's lab at Imperial College London. ETR cell-lines may help to understand the connection between the acquisition of drug resistance and breast cancer progression, particularly metastatic development.

>**MCF7** is an ER+ breast cancer cell-line (**estradiol-dependent**) that is sensitive to any treatment targeting the estrogen receptor network.

> <span style="color:red">**MCF7-T**</span> is a tamoxifen-resistant cell-line derived from MCF7.

> <span style="color:orange">**MCF7-F**</span> is a fulvestrant-resistant cell-line derived from MCF7.

> <span style="color:blueviolet">**LTED**</span> is a resistant cell-line that recapitulates aromatase inhibitor resistance (**estradiol-independent**).

<img src="images/Cell_lines.png" width="600" height="600" align="center"/>

### The RNA-seq data on Endocrine Therapy-Resistant  (ETR) cell-lines

>Here we will use _**transcriptomics data**_ from [Nguyen et al. 2015](https://www.nature.com/articles/ncomms10044) for four breast cancer cell lines: **MCF7**, <span style="color:red">**MCF7-T**</span>, <span style="color:orange">**MCF7-F**</span> and <span style="color:blueviolet">**LTED**</span>


Let's first have a look at the dataset

In [None]:
dataset = pd.read_pickle('models/Recon3D_301/Recon3_cell_line_models/data/total_dataset')
print('The dataset tracks:', len(dataset), 'genes that encode metabolic enzymes.')
dataset.head()

<span style="color:red">**Question:**</span> We might expect that the resistance mechanisms lead to changes in metabolism and therefore metabolic enzyme expression.

Let's investigate if this is true, below.  

<span style="color:red">**Assignment:**</span> We might expect that non-resistant cancer cells (MCF7) have downregulated most of their metabolism, focusing solely on glycolysis. 

Based on the histogram below, roughly how many genes are not expressed in MCF7? Note that you will not see the exact number of zeros, because values are grouped into bins.

In [None]:
sub_dataset = dataset[dataset < 1] # zoom in on low expression genes
sub_dataset['MCF7'].plot(kind='hist',xlim=[0,1],bins=25,alpha=0.3,figsize=(12,5))

<span style="color:red">**Assignment:**</span> Now confirm your answer above by looking at the percentage of actual zeros below. Is the percentage surprising to you? Might these all be the same enzymes since all four cell lines are cancer cell lines?

In [None]:
(dataset['MCF7'] < 1e-3).sum(axis=0)/len(dataset['MCF7'])*100

Surprisingly, MCF7 expresses about 80% of all the metabolic enzymes known to occur in human metabolism. Let's look at the other cell lines, which have acquired resistance. We might expect these cell lines to have focused completely on resistance pathways and therefore to have downregulated most of the rest of metabolism. Let's look. 

In [None]:
(dataset < 1e-3).sum(axis=0)/len(dataset)*100

Not at all! All four cell lines have downregulated only 20% of metabolism. 

<span style="color:red">**Assignment:**</span> Below we plot the overlap of expressed (>0) genes between all cell lines as a Venn diagram. Are the "expressed enzymes" all the same enzymes or are there lots of different genes turned off in each cell line?

In [None]:
labels = venn.get_labels([dataset[dataset['MCF7'] > 0].index.values,
                              dataset[dataset['MCF7_F'] > 0].index.values,        
                              dataset[dataset['MCF7_T'] > 0].index.values, 
                              dataset[dataset['LTED'] > 0].index.values ],
                              fill=['number'])
    
fig, ax = venn.venn4(labels, names=['MCF7','MCF7_F','MCF7_T','LTED'], legend=True, textsize=25, )

Let's investigate if the 80% that is expressed is expressed in different amounts. The bar plot below is useful to see expression level differences between the cell lines for highly expressed genes. 

In [None]:
subdataset = dataset[(dataset.T > 300).any()] # zoom in on genes with reasonably high transcript levels
print('Looking only at the',len(subdataset),'genes that show relatively high transcript levels out of a total of', len(dataset),'genes')
subdataset.plot(kind='bar',figsize=(15,5))

<span style="color:red">**Assignment:**</span> Are there quantitative differences between these highly expressed genes?

<span style="color:red">**Assignment:**</span> The plot below zooms in on genes not expressed in some cell lines but expressed in others. Is this more convincing?

In [None]:
# zoom in on genes that are not expressed in some cell lines and some expression in others
subdataset = dataset[(dataset.T == 0).any() & (dataset.T > 1).any()] 
print('Looking only at the',len(subdataset),'genes that are not expressed in at least one cell line but expressed (>1) in another, out of a total of', len(dataset),'genes')
subdataset.plot(kind='bar',figsize=(15,5))

<span style="color:red">**Assignment:**</span> Do the transcriptomics values plotted above convince you that there might be some differences between the cell lines?

---

So most of metabolism is quite similar between the cell lines, but there are individual genes with major quantitative differences between them.

Let's now investigate if these differences matter for the flux pattern capabilities of the cell lines...

## Mapping transcriptomic data to Recon3D
Our approach for integrating the transcriptomics data with the metabolic network is based on [Damiani et al. 2018](https://www.biorxiv.org/content/early/2018/01/30/256644). This approach was originally devised to deal with single cell transcriptomics but is here modified to apply across different cell lines. We will explain the details below, but beware as the details can be tricky to understand if you are new to flux balance analysis. Don't despair! You do not have to understand it 100%. 

### The approach in a nutshell (NOTE THAT THIS PART IS OPTIONAL TO UNDERSTAND)
The key idea is to normalize each gene's transcript level in a given cell line by the sum of its levels across all four cell lines. This leads to a transcriptomics "score" between 0 and 1: 0 if the transcript level is 0 in a particular cell line and 1 if that cell line is the only one with any expression. 

We then take the unconstrained Recon 3 model and determine for each reaction what the maximal flux is that it can sustain on the defined medium. That maximal flux is then multiplied by the transcriptomics score and this product is set as the maximal flux bound in the model.

**The outcome of this is that each reaction will have a flux bound that is proportional (i.e. linearly related) to the transcriptomics score.** 
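As a rough illustration of the scaling described above, the sketch below computes the score and the resulting flux bound for a single gene in plain Python. The expression values, the `max_flux` value and the function names are made up for illustration; they are not taken from the dataset or from the code that was used to build the cell line models.

```python
# Hypothetical transcript levels of one gene in the four cell lines.
expression = {"MCF7": 25.0, "MCF7_T": 25.0, "MCF7_F": 0.0, "LTED": 50.0}

def transcript_scores(levels):
    """Normalize each level by the sum across all cell lines (score in [0, 1])."""
    total = sum(levels.values())
    if total == 0:
        # Gene not detected in any cell line: score everything 0 (an assumption).
        return {line: 0.0 for line in levels}
    return {line: value / total for line, value in levels.items()}

def scaled_bound(score, max_flux):
    """Flux bound that is proportional to the transcriptomics score."""
    return score * max_flux

scores = transcript_scores(expression)
max_flux = 8.0  # assumed maximal flux of the reaction in the unconstrained model
bounds = {line: scaled_bound(score, max_flux) for line, score in scores.items()}

print(scores)  # {'MCF7': 0.25, 'MCF7_T': 0.25, 'MCF7_F': 0.0, 'LTED': 0.5}
print(bounds)  # {'MCF7': 2.0, 'MCF7_T': 2.0, 'MCF7_F': 0.0, 'LTED': 4.0}
```

Note that the scores of one gene sum to 1 across the cell lines, so a cell line can only get a high bound at the expense of the others.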

<span style="color:red">**Discuss:**</span> Can you think of an argument for why this approach of assuming a linear relationship between flux and transcript levels is reasonable? 

<span style="color:red">**Discuss:**</span> Can you  think of a reason for why our approach is probably inaccurate? How would you go about improving this if you were in the business of systems biology research?



### The details: two thorny problems (feel free to skip if you're in a hurry)
What the above fails to mention is that we (1) need to account for isoenzymes and enzyme complexes and (2) that we need to deal with completely untranscribed genes which can wreak havoc on our model. 

- (1) The former is dealt with by summing up transcriptomic scores across isoenzymes for each reaction and taking the minimum transcriptomics score for components of an enzyme complex for each reaction. 
- (2) The latter is solved by not setting a reaction's bound to zero when no transcripts are detected, but rather to a very low flux bound of 1e-3. This way we will still be able to see bottlenecks (their flux will hit the 1e-3 bound) while ensuring that the model keeps functioning at all times (no pathways will be fully blocked).
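The two rules above can be sketched as follows. This is a minimal illustration that follows the text literally (sum over isoenzymes, minimum within a complex, floor of 1e-3). The gene names, the scores and the list-of-lists representation of the gene rules are hypothetical, and whether the summed score should additionally be capped at 1 is not stated in the text, so no cap is applied here.

```python
MIN_BOUND = 1e-3  # floor used in the text for reactions without detected transcripts

def reaction_score(isoenzymes, gene_scores):
    """Sum over isoenzymes of the minimum score within each enzyme complex.

    `isoenzymes` is a list of complexes; each complex is a list of gene ids.
    A single-gene enzyme is simply a complex with one member.
    """
    return sum(min(gene_scores[g] for g in members) for members in isoenzymes)

def flux_bound(score, max_flux, floor=MIN_BOUND):
    """Scale the maximal flux by the score, but never go below the floor."""
    return max(score * max_flux, floor)

# Hypothetical gene scores for one cell line.
gene_scores = {"geneA": 0.5, "geneB": 0.25, "geneC": 0.0, "geneD": 0.125}

# Reaction 1: (geneA AND geneB) OR geneD -> min(0.5, 0.25) + 0.125 = 0.375
r1 = reaction_score([["geneA", "geneB"], ["geneD"]], gene_scores)
# Reaction 2: geneA AND geneC -> the complex is limited by the undetected geneC
r2 = reaction_score([["geneA", "geneC"]], gene_scores)

print(r1, flux_bound(r1, max_flux=8.0))  # 0.375 3.0
print(r2, flux_bound(r2, max_flux=8.0))  # 0.0 0.001
```

The second reaction shows the point of the floor: a single missing complex subunit would otherwise block the reaction entirely.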

<span style="color:red">**Note:**</span> Many problems remain with this approach: for example, it ignores metabolite concentrations and enzyme kinetics (kcat and K_m parameters).

## Analyzing pre-prepared cancer cell line models
We applied the approach detailed above and saved the constrained cell lines for you. We will analyze them below.  

## Load the models
Start by loading the models using the cell below. This will take a minute or so. 

In [5]:
files = ["LTED_FPKM.json", "MCF7_FPKM.json", "MCF7_T_FPKM.json", "MCF7_F_FPKM.json"]

LTED_orig = cobra.io.load_json_model('./models/Recon3D_301/Recon3_cell_line_models/LTED_FPKM.json')
MCF7_orig = cobra.io.load_json_model('./models/Recon3D_301/Recon3_cell_line_models/MCF7_FPKM.json')
MCF7_F_orig = cobra.io.load_json_model('./models/Recon3D_301/Recon3_cell_line_models/MCF7_F_FPKM.json')
MCF7_T_orig = cobra.io.load_json_model('./models/Recon3D_301/Recon3_cell_line_models/MCF7_T_FPKM.json')

# Save unaltered copies of each model
LTED = LTED_orig.copy()
MCF7 = MCF7_orig.copy()
MCF7_F = MCF7_F_orig.copy()
MCF7_T = MCF7_T_orig.copy()

## Analyzing the growth rates of the cell lines
The growth rates of the four cell lines have been experimentally determined (See the green curves in the plot below). 

<span style="color:red">**Assignment:**</span> Based on the experimental data (below), which cell line grows the fastest? Which the slowest?

<img src="images/Growth_rate_cell_lines.png" align="center"/>

<span style="color:red">**Assignment:**</span> The cell below prints the growth rate (calculated using Flux Balance Analysis) for the unconstrained Recon 3 model and each of the four cell lines. Do the results make sense? Do you notice anything interesting? What about the agreement with the experimental data?

In [6]:
growth_rate_dict = {}
for model_name,model in [('WT Recon 3',WT_r3), ('LTED',LTED), ('MCF7',MCF7), ('MCF7_F',MCF7_F), ('MCF7_T',MCF7_T)]:
    sol = model.optimize()
    growth_rate_dict[model_name] = round(sol.fluxes['biomass_reaction'],5)

pd.DataFrame.from_dict(growth_rate_dict,orient='index', columns=['Growth rate'])

| | Growth rate |
|---|---|
| WT Recon 3 | 0.16889 |
| LTED | 0.04002 |
| MCF7 | 0.06915 |
| MCF7_F | 0.05797 |
| MCF7_T | 0.05719 |


**Possible observations**
- All cell lines' growth rates are predicted to be lower than the unconstrained Recon 3 model, i.e. the transcriptomics constraints are actually constraining the flux patterns.
- The cell lines are not all the same metabolically (based on the applied methodology). In fact, these changes are bigger than you would expect assuming that the resistance mechanism is very simple.
- The (non-resistant) MCF7 cell line outperforms the other cell lines.
- The resistant cell lines are predicted to grow more slowly than MCF7 in these simulations, in which no drug is present. **This matches the experimental data!**
- Computationally, LTED is predicted to grow the slowest, whereas experimentally MCF7_F was the slowest. **This does not match the experimental data!**

<span style="color:red">**Assignment:**</span> Could you have gotten this insight, into the potential effect of transcript limitations on growth rate, from just the transcriptomics? This is the power of 'Systems Thinking'.

##  More detailed analysis of flux distribution across metabolic subsystems
Recon 3 comes annotated with a so-called "metabolic subsystem" for many reactions. These subsystems reflect the major pathway that a reaction is a part of, e.g.: glycolysis, the TCA cycle, cholesterol synthesis etc.

The cell below will perform flux balance analysis on all four cell lines and plot the total flux running through a selected number of subsystems. This will give an indication of the metabolic pathways each cell line is using to grow.

<span style="color:red">**Assignment:**</span> Execute the cell below (wait up to a minute or so for the calculations to complete) and investigate the output. Do you observe any interesting patterns?

In [None]:
# list of model's files
files  = file_list('models/Recon3D_301/Recon3_cell_line_models/')

f_pattern = flux_pattern(files, analysis = 'FBA')

target_ss = [
    ['Glycolysis/gluconeogenesis','Central Carbon Metabolism'],
    ['Citric acid cycle','Central Carbon Metabolism'],
    ['Oxidative phosphorylation','Central Carbon Metabolism'],
    ['Pentose phosphate pathway','Central Carbon Metabolism'],
    ['Pyruvate metabolism','Central Carbon Metabolism'],
    ['Fatty acid oxidation','Peripheral metabolism'],
    ['Squalene and cholesterol synthesis','Peripheral metabolism'],]

df  = df_plot(target_ss, f_pattern)
flux_pie_plot(df)

**Observe:** LTED and MCF7_F seem to be very similar, and so do MCF7 and MCF7_T. This suggests that tamoxifen resistance has very little metabolic effect. 

Note as well that LTED and MCF7_F seem to have severely reduced oxidative phosphorylation. This relates to the Warburg effect (see below).

## WT Recon 3 vs. cell lines: The Warburg effect
Finally, we will use FVA to ascertain essentiality of certain reactions for the optimal growth condition in each cell line and investigate the Warburg effect. 

### Cancer, a-socialism and the Warburg effect
One way to think of cancer cells is that they behave a-socially, i.e. like unicellular organisms within a multicellular one, forming tumors. 

It has been observed that such cells, in contrast to healthy, social cells, are characterized by growing on a low ATP/glucose ratio (the Warburg effect): Glucose + 2 ADP => 2 lactate + 2 ATP (fermentation) instead of: Glucose + 32 ADP => 6 CO2 + 32 ATP (respiration). 

[See this Wiki page about the Warburg effect](https://en.wikipedia.org/wiki/Warburg_effect) and the [original publication by Otto Warburg (1956)](http://science.sciencemag.org/content/123/3191/309).
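To make the difference in ATP/glucose ratio concrete, the small calculation below uses only the two ATP yields quoted above (2 for fermentation, 32 for respiration). The ATP demand is an arbitrary illustrative number, not a measured value.

```python
# ATP produced per glucose, taken from the two overall reactions above.
ATP_PER_GLUCOSE = {"fermentation": 2, "respiration": 32}

def glucose_needed(atp_demand, route):
    """Glucose required to meet a given ATP demand via one route."""
    return atp_demand / ATP_PER_GLUCOSE[route]

atp_demand = 64.0  # arbitrary illustrative demand
ferment = glucose_needed(atp_demand, "fermentation")
respire = glucose_needed(atp_demand, "respiration")
ratio = ferment / respire

print(ferment, respire, ratio)  # 32.0 2.0 16.0
```

A purely fermenting cell therefore needs 16 times more glucose than a purely respiring one to make the same amount of ATP, which is why glucose uptake and lactate secretion are useful signatures of the Warburg effect.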

### Applying FVA to the cell lines
Remember that with FVA (Flux Variability Analysis) we can find out the range of flux through a reaction that is possible while the cell still grows optimally! This will allow us to see if, for instance, oxygen uptake is essential for these cancer cells.

<span style="color:red">**Assignment:**</span> The cell below performs FVA for a variety of reactions representing: oxygen uptake, proton pumping through the ATPase (final step in respiration), glycolysis (PGK/PYK), glucose uptake, non-essential amino acids (glutamate and glutamine uptake), essential amino acids (phenylalanine), lactate production. 

Each of these reactions, for each cell line and the unconstrained Recon 3, is associated with an interval that gives the lowest and highest flux possible while the cell still achieves its optimal growth rate. 

Inspect the output. Pay particular attention to see if you can spot presence or absence of the Warburg effect. 

Then, read our observations further down the page.

In [None]:
interesting_rxns = ['EX_o2_e',
                    'ATPS4mi','PGK','PYK', # ATP production
                    'EX_glc_D_e','EX_glu_L_e','EX_gln_L_e','EX_phe_L_e', # carbon sources
                    'EX_lac_L_e'
                   ]

df = pd.DataFrame()
d = {}

modelnames = ['WT Recon 3','MCF7','MCF7_F','MCF7_T','LTED']
for i,model in enumerate([WT_r3, MCF7, MCF7_F, MCF7_T, LTED]):
    
    model.reactions.EX_lac_D_e.upper_bound = 0 # keep one lactate exit
    
    fvasol = cobra.flux_analysis.flux_variability_analysis(model,reaction_list=interesting_rxns,
                                                           fraction_of_optimum=1)
    
    parsed_res = {r:[round(fvasol.loc[r]['minimum'],3),round(fvasol.loc[r]['maximum'],3)] for r in fvasol.index }
    d[modelnames[i]] = parsed_res
    
df = pd.DataFrame(d,index=interesting_rxns,columns=modelnames)
df.to_csv('FVA_comparison_cell_lines_vs_WT.csv',sep='\t')
df

**Possible observations:**
- Positive controls: some oxygen and essential amino acid uptake is needed for growth in all cell lines (see O2 and phenylalanine rows). This may be seen by the negative lower bounds.

- Cancer cell lines shift toward fermentation (Warburg effect): In the WT the optimal growth rate requires proton pumping (ATPS4mi). The optimal growth rate does not depend on proton pumping in the 4 cell lines, suggesting a shift from respiration to fermentation. This can be seen from the zero lower bound for ATPS4mi. 
- All 4 cell lines tend to secrete more lactate than the WT Recon 3 (see the lactate upper bound).
- Components of the ATPase are not expressed in MCF7 and MCF7_T: note that the upper bound on ATPS4mi in MCF7 and MCF7_T equals 0.001, which is the minimum reaction bound we enforced during the transcriptomic mapping. This is therefore likely due to one or more components of the complex not being expressed. 
- MCF7 and MCF7_T require glucose uptake to grow optimally, like the WT but to a greater extent. LTED and MCF7_F do not. This also indicates that MCF7 and MCF7_T are likely to ferment at least partly, whereas MCF7_F and LTED tend to make at least partial use of respiration.

<span style="color:red">**Scientific hypothesis from this work:**</span>
Like typical tumor cells, resistant cells remain "Warburg type", but with differences.
The cells resistant to estrogen-receptor drugs may be druggable metabolically.

# Wrap up
Hopefully this tutorial taught you something about the power of systems biology, systems and deep thinking. You now grasp the basics of flux balance/variability analysis, have experienced using Python programming for science and have seen one way of integrating real data in a computational model. Hopefully, you are convinced that $$\text{data} + \text{model} > \text{data alone}$$

This is only one (limited) example of the power of computational systems biology. The ultimate challenge is for you to now translate this to your own problem of interest! Hopefully this tutorial inspired you to do so.