# Caffeine Proteomics Public Data

This notebook processes the proteomics data for 4 chemostat conditions from [Schmidt et al. 2016](https://www.nature.com/articles/nbt.3418).

Benjamín J. Sánchez, 2020-01-22

### 1. Loading Data

In [1]:
import pandas as pd

# Extract of supplementary table 5:
data = pd.read_csv("s05_proteomics_data_raw.csv", index_col=0)

# Molecular weights:
MW = data["molecular_weight"] # Da = g/mol
MW = MW / 1000 # kDa = g/mmol
print(MW)

# Rest of the data (abundances + uncertainties):
data = data.iloc[:,1:]
print(data)

P0A8T7    155.045008
P0A8V2    150.520276
P36683     93.420946
P15254    141.295898
P09831    163.176315
             ...    
P0ACS2     17.121072
P0AA97     20.845493
P0AB83     23.529402
P23862     20.344805
P77433     25.178923
Name: molecular_weight, Length: 2058, dtype: float64
        chemostat_0.5_mean  chemostat_0.35_mean  chemostat_0.2_mean  \
P0A8T7                4780                 3900                3477   
P0A8V2                5245                 4388                3860   
P36683               15733                20261               16410   
P15254                2285                 1730                1468   
P09831                2321                 1959                1771   
...                    ...                  ...                 ...   
P0ACS2                   1                    1                   0   
P0AA97                   1                    1                   1   
P0AB83                   4                    4                   1   
P23862

In [2]:
# Cell volumes (fL/cell) (extract of supplementary table 23):
cell_volumes = pd.read_csv("s23_cell_volume.csv", index_col=0)
print(cell_volumes)

                cell_volume
chemostat_0.12     1.901568
chemostat_0.2      2.080800
chemostat_0.35     2.398575
chemostat_0.5      2.692500


### 2. Convert Units

The uncertainty columns are given as coefficients of variation (%), so we first convert them to the same units as the mean values (molecules/cell):

In [3]:
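# DataFrame.iteritems() was removed in pandas 2.0, so iterate over the column names instead:
for col_name in data.columns:
    if col_name.endswith("uncertainty"):
        mean_name = col_name.replace("uncertainty", "mean")
        # CV (%) / 100 * mean = absolute uncertainty (molecules/cell)
        data[col_name] = data[col_name] / 100 * data[mean_name]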

print(data)

        chemostat_0.5_mean  chemostat_0.35_mean  chemostat_0.2_mean  \
P0A8T7                4780                 3900                3477   
P0A8V2                5245                 4388                3860   
P36683               15733                20261               16410   
P15254                2285                 1730                1468   
P09831                2321                 1959                1771   
...                    ...                  ...                 ...   
P0ACS2                   1                    1                   0   
P0AA97                   1                    1                   1   
P0AB83                   4                    4                   1   
P23862                   8                    8                   8   
P77433                   0                    3                   1   

        chemostat_0.12_mean  chemostat_0.5_uncertainty  \
P0A8T7                 3000                   414.9040   
P0A8V2                 3455    
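
As a quick check of the conversion above, here is a minimal sketch for a single protein. The function name `cv_to_absolute` is hypothetical, and the CV of 8.68% is an assumption: it is back-calculated from the printed result for P0A8T7 (414.9040 / 4780 * 100), not read from the raw table.

```python
def cv_to_absolute(mean, cv_percent):
    """Convert a coefficient of variation (%) into an absolute uncertainty
    expressed in the same units as the mean (here molecules/cell)."""
    return cv_percent / 100 * mean


# P0A8T7 at chemostat_0.5: mean copy number taken from the table above.
mean_copies = 4780
# Assumption: CV back-calculated from the printed output, not from the raw data.
cv_percent = 8.68

abs_uncertainty = cv_to_absolute(mean_copies, cv_percent)
print(abs_uncertainty)  # approximately 414.904 molecules/cell
```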

Now everything is in molecules/cell. To convert to mmol/gDW, we apply two steps:

```
1. Abundance [mmol/cell] = Abundance [molecules/cell] / Na [molecules/mol] * 1000 [mmol/mol]
2. Abundance [mmol/gDW] = Abundance [mmol/cell] / ( cell volume [fL/cell] * cell density [g/fL] * dry content [gDW/g] )
```

Where Na is Avogadro's number = 6.022e+23. Cell volumes for all conditions are available in [Volkmer et al. 2011](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0023126).

> TODO: Cell volume measurements in that reference are quite variable (Table 1), so that variability could be propagated into the uncertainty.
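
One possible way to approach this TODO, sketched under strong assumptions: if the errors in copy number and cell volume are independent and small, first-order error propagation for a quotient says that the relative uncertainties add in quadrature. The function name `combined_cv` and both input values are illustrative placeholders, not measurements from either reference.

```python
import math


def combined_cv(cv_copies_percent, cv_volume_percent):
    """Relative uncertainty (%) of copies / volume, assuming independent errors
    and first-order (linear) error propagation."""
    return math.hypot(cv_copies_percent, cv_volume_percent)


# Both values are hypothetical and only illustrate the arithmetic.
cv_copies = 8.68   # % uncertainty of a protein's copy number
cv_volume = 10.0   # % uncertainty of the cell volume (placeholder)

total_cv = combined_cv(cv_copies, cv_volume)
print(f"{total_cv:.2f} %")
```

Cell density and dry content are treated as exact constants here; if they were given uncertainties too, their relative errors would enter the same quadrature sum.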

Additional assumptions:
* Cell density is constant = 1.105 [g/mL] = 1.105e-12 [g/fL] ([Martinez-Salas et al. 1981](https://jb.asm.org/content/147/1/97.long))
* Water content is constant = 0.70 [g/g] ([Feijó Delgado et al. 2013](https://doi.org/10.1371/journal.pone.0067590)), i.e. the dry content = 0.3 [gDW/g]

> TODO: How much do these assumptions affect the final simulation results?
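
The two conversion steps can be sketched for a single value as follows. The function and constant names are hypothetical; the constants are the ones stated above, and the inputs are the P0A8T7 copy number and the cell volume for chemostat_0.5 from the tables in this notebook.

```python
AVOGADRO = 6.022e23        # molecules/mol (value used in this notebook)
CELL_DENSITY = 1.105e-12   # g/fL
DRY_CONTENT = 0.3          # gDW/g, i.e. 1 - water content of 0.70


def molecules_per_cell_to_mmol_per_gdw(copies, cell_volume_fl):
    """Convert an abundance in molecules/cell to mmol/gDW."""
    # Step 1: molecules/cell -> mmol/cell
    mmol_per_cell = copies / AVOGADRO * 1000
    # Step 2: divide by the dry weight of one cell (gDW/cell)
    gdw_per_cell = cell_volume_fl * CELL_DENSITY * DRY_CONTENT
    return mmol_per_cell / gdw_per_cell


# P0A8T7 at chemostat_0.5: 4780 molecules/cell, cell volume 2.6925 fL/cell
abundance = molecules_per_cell_to_mmol_per_gdw(4780, 2.6925)
print(f"{abundance:.6e}")  # should be close to 8.892992e-06 mmol/gDW
```

Because the result is inversely proportional to both the cell density and the dry content, a relative change in either assumption shifts every abundance by the same factor.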

In [4]:
# Convert values to mmol/cell:
data = data / 6.022e+23 * 1000

# Iterate through the columns and divide by the corresponding cell volume (fL/cell), to get mmol/fL:
for col_name in data.columns:
    chemo_name = col_name.replace("_uncertainty", "").replace("_mean", "")
    data[col_name] = data[col_name] / cell_volumes.loc[chemo_name, "cell_volume"]

# Finally, convert to mmol/gDW:
# NOTE: despite its name, water_content holds the dry content (gDW/g) = 1 - 0.70
water_content = 0.3  # gDW/g
cell_density = 1.105e-12  # g/fL
data = data / cell_density / water_content
print(data)

        chemostat_0.5_mean  chemostat_0.35_mean  chemostat_0.2_mean  \
P0A8T7        8.892992e-06         8.144924e-06        8.370474e-06   
P0A8V2        9.758105e-06         9.164083e-06        9.292502e-06   
P36683        2.927059e-05         4.231392e-05        3.950517e-05   
P15254        4.251148e-06         3.613005e-06        3.534040e-06   
P09831        4.318124e-06         4.091258e-06        4.263477e-06   
...                    ...                  ...                 ...   
P0ACS2        1.860459e-09         2.088442e-09        0.000000e+00   
P0AA97        1.860459e-09         2.088442e-09        2.407384e-09   
P0AB83        7.441834e-09         8.353768e-09        2.407384e-09   
P23862        1.488367e-08         1.670754e-08        1.925907e-08   
P77433        0.000000e+00         6.265326e-09        2.407384e-09   

        chemostat_0.12_mean  chemostat_0.5_uncertainty  \
P0A8T7         7.902875e-06               7.719117e-07   
P0A8V2         9.101478e-06    

### 3. Data Validation

Before this study, the assumption had been that the average _E. coli_ cell weighs 1 pgDW. Let's see how close we are to that with the cell volumes and assumptions used above:

In [5]:
# DataFrame.iteritems() was removed in pandas 2.0; items() yields (column name, Series) pairs:
for (c, vol) in cell_volumes.items():
    print(vol * cell_density * water_content * 1e12)  # fL/cell * g/fL * gDW/g * pgDW/gDW = pgDW/cell

chemostat_0.12    0.630370
chemostat_0.2     0.689785
chemostat_0.35    0.795128
chemostat_0.5     0.892564
Name: cell_volume, dtype: float64


We see that the new values (0.63 to 0.89 pgDW/cell) are slightly smaller, but of the same order of magnitude.

Finally, let's check how much protein in total we are adding to the models. For that we need the molecular weights (g/mmol) of each protein:

In [6]:
for (col_name, col_data) in data.items():
    prot_fraction = sum(col_data * MW)  # mmol/gDW * g/mmol = g/gDW
    print(col_name + ": " + str(prot_fraction))

chemostat_0.5_mean: 0.257483673026852
chemostat_0.35_mean: 0.2574833494824215
chemostat_0.2_mean: 0.2574833826556106
chemostat_0.12_mean: 0.2574840524040743
chemostat_0.5_uncertainty: 0.017108912891197606
chemostat_0.35_uncertainty: 0.01123831773827171
chemostat_0.2_uncertainty: 0.01452525742959928
chemostat_0.12_uncertainty: 0.010512903022899124


We see that:
* All means add up to reasonable fractions g(prot)/gDW (roughly half of the protein content in _E. coli_).
* The mass-weighted average uncertainty is below 10% of the mean in all 4 conditions (roughly 4% to 7%).
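
The second point can be checked by dividing each summed uncertainty by the corresponding summed mean. This is a minimal sketch that reuses the totals printed above; the dictionary layout and variable names are hypothetical. Note that the ratio of the two sums is a mass-weighted average of the per-protein coefficients of variation, not a propagated error of the total.

```python
# (summed mean, summed uncertainty) in g/gDW, copied from the output above
totals = {
    "chemostat_0.5": (0.257483673026852, 0.017108912891197606),
    "chemostat_0.35": (0.2574833494824215, 0.01123831773827171),
    "chemostat_0.2": (0.2574833826556106, 0.01452525742959928),
    "chemostat_0.12": (0.2574840524040743, 0.010512903022899124),
}

# Relative uncertainty (%) of the total protein mass per condition
relative = {cond: unc / mean * 100 for cond, (mean, unc) in totals.items()}

for cond, pct in relative.items():
    print(f"{cond}: {pct:.1f}%")
```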

### 4. Data Export

In [7]:
data.to_csv("s05_proteomics_data_processed.csv")