# Check distribution of protein copy counts for normal tissue

We want to know why many cis proteins don't show a significant change even when the arm is deleted. It's possible that they have low expression to begin with, so the change from the deletion isn't noticeable.

Unfortunately, we can only get relative expression levels from the CPTAC data, not absolute expression levels. So instead of using CPTAC data, we're going to work with tissue-specific absolute expression data from this paper: Wang D, Eraslan B, Wieland T, et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol Syst Biol. 2019;15(2):e8503. Published 2019 Feb 18. doi:10.15252/msb.20188503

We downloaded the specific table, Table 5, from https://www.embopress.org/action/downloadSupplement?doi=10.15252%2Fmsb.20188503&file=msb188503-sup-0007-TableEV5.zip

The protein copy count data is strongly right skewed. This notebook investigates the skew and determines which data transformations to use.

In [1]:
import pandas as pd
import numpy as np
import os
import altair as alt
import cnvutils
import scipy.stats
from toolz.curried import pipe

In [2]:
alt.data_transformers.disable_max_rows()

def json_dir(data, data_dir):
    os.makedirs(data_dir, exist_ok=True)
    return pipe(data, alt.to_json(filename=os.path.join(data_dir, "{prefix}-{hash}.{extension}")) )

alt.data_transformers.register("json_dir", json_dir)
alt.data_transformers.enable("json_dir", data_dir="plot_data")

DataTransformerRegistry.enable('json_dir')

## Plot distributions

To determine which proteins naturally have "low" expression, we want to see what kind of distribution the protein copy counts follow for each tissue type. The plots below show that the distributions are all strongly right skewed.

In [3]:
expr = cnvutils.get_normal_expr_table()

In [4]:
# Reshape to long format: one row per (gene, tissue type) pair.
expr_df_long = expr.drop(
    columns=["Gene_ID", "Protein_ID"]
).melt(
    id_vars="Gene_name",
    var_name="tissue_type",
    value_name="prot_copy_count"
)

alt.Chart(expr_df_long).mark_boxplot().encode(
    x="prot_copy_count",
    y="tissue_type"
).properties(
    width=800
)

### Plot with a log scale

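A log scale compresses the long right tail, which should help with the right skew. We add 1 to every count first so that zero counts stay defined, since log(0) is undefined.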
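As a rough check of how much a log transform helps, the sketch below compares sample skewness before and after log10(x + 1). The counts here are synthetic log-normal values chosen only for illustration; they are an assumption, not the Wang et al. data, so the printed numbers say nothing about the real tissues.

```python
import numpy as np
import scipy.stats

# Synthetic stand-in for protein copy counts (NOT the real data).
# A log-normal sample has a long right tail, similar in shape to the boxplots.
rng = np.random.default_rng(0)
counts = rng.lognormal(mean=8.0, sigma=2.0, size=5000)

# Sample skewness: values well above 0 indicate a long right tail.
skew_raw = scipy.stats.skew(counts)
skew_log = scipy.stats.skew(np.log10(counts + 1))

print(f"skewness of raw counts:       {skew_raw:.2f}")
print(f"skewness of log10(count + 1): {skew_log:.2f}")
```

On data shaped like this, the skewness after the transform should sit close to zero, while the raw counts have a large positive skewness.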

In [5]:
expr_plus1 = expr_df_long.assign(prot_copy_count=expr_df_long["prot_copy_count"] + 1)

alt.Chart(expr_plus1).mark_boxplot().encode(
    x=alt.X(
        "prot_copy_count",
        scale=alt.Scale(
            type="log"
        )
    ),
    y="tissue_type"
).properties(
    width=800
)

In [6]:
expr_plus1_log10 = expr_plus1.assign(prot_copy_count=np.log10(expr_plus1["prot_copy_count"]))

alt.Chart(expr_plus1_log10).mark_bar().encode(
    x=alt.X(
        "prot_copy_count",
        bin=alt.Bin(step=0.25)
    ),
    y="count()",
    row="tissue_type"
).properties(
    width=800
)

### Conclusion: Transform to log(x + 1), and exclude zeros

Based on the plots above, it looks like our best option is to transform the counts to log10(x + 1) and exclude zeros. From a biological perspective, I'm fine with excluding the zeros, because proteins that aren't expressed seem to be a different class from proteins that just have low expression. It is important to remember, though, that some proteins with zero copies may simply have been expressed at levels too low to be detected. Nevertheless, the huge number of proteins with zero copies suggests that they aren't just a few escaping detection.
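To make the zero handling concrete, here is a minimal sketch on a toy table. The gene names and counts are made up for illustration. It shows that a raw count of 0 maps to exactly 0 under log10(x + 1), so keeping transformed values greater than 0 drops the zero-count proteins and nothing else. It also computes the fraction of zeros per tissue, which is one way to judge how much data the filter removes.

```python
import numpy as np
import pandas as pd

# Toy long-format table with hypothetical genes and counts.
toy = pd.DataFrame({
    "Gene_name": ["A", "B", "C", "D"],
    "tissue_type": ["Brain", "Brain", "Colon", "Colon"],
    "prot_copy_count": [0.0, 999.0, 0.0, 99999.0],
})

# log10(x + 1) sends a raw count of 0 to log10(1) = 0.
toy_log = toy.assign(prot_copy_count=np.log10(toy["prot_copy_count"] + 1))

# Keeping values > 0 therefore removes exactly the zero-count rows.
toy_nonzero = toy_log[toy_log["prot_copy_count"] > 0]

# Fraction of zero-count proteins in each tissue.
zero_frac = toy["prot_copy_count"].eq(0).groupby(toy["tissue_type"]).mean()

print(toy_nonzero)
print(zero_frac)
```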

In [7]:
expr_transf = expr_plus1_log10[expr_plus1_log10["prot_copy_count"] > 0]

alt.Chart(expr_transf).mark_bar().encode(
    x=alt.X(
        "prot_copy_count",
        bin=alt.Bin(step=0.25)
    ),
    y="count()"
).properties(
    width=800
)

## Calculate "low" cutoff for each tissue type

In [8]:
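# Use the first quartile of the transformed counts as the "low" cutoff per tissue.
# Select the numeric column explicitly so the string column Gene_name is not passed to quantile().
low_cutoffs = expr_transf.groupby("tissue_type")[["prot_copy_count"]].quantile(0.25)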
low_cutoffs

| tissue_type | prot_copy_count |
| --- | --- |
| Adrenal gland | 3.639864 |
| Appendix | 3.580872 |
| Brain | 3.593914 |
| Colon | 3.513825 |
| Duodenum | 3.851583 |
| Endometrium | 3.617557 |
| Esophagus | 3.797054 |
| Fallopian tube | 3.501528 |
| Fat | 3.843306 |
| Gallbladder | 3.787582 |
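One way these cutoffs might be used is sketched below on a toy table with hypothetical values on the log10(x + 1) scale. Each tissue's first-quartile cutoff is joined back onto its rows, and proteins below it are flagged. The column names `low_cutoff` and `is_low`, and the choice of a strict `<` comparison, are assumptions made for illustration rather than the notebook's actual downstream method. The last line back-transforms the cutoffs to raw copy counts using x = 10**y - 1.

```python
import pandas as pd

# Toy transformed data (hypothetical values on the log10(x + 1) scale).
toy = pd.DataFrame({
    "Gene_name": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "tissue_type": ["Brain"] * 4 + ["Colon"] * 4,
    "prot_copy_count": [2.0, 4.0, 5.0, 6.0, 3.0, 3.5, 4.5, 5.5],
})

# First quartile per tissue, computed as in the cell above.
cutoffs = (
    toy.groupby("tissue_type")["prot_copy_count"]
    .quantile(0.25)
    .rename("low_cutoff")
)

# Attach each tissue's cutoff to its rows and flag proteins below it.
flagged = toy.join(cutoffs, on="tissue_type")
flagged["is_low"] = flagged["prot_copy_count"] < flagged["low_cutoff"]

# Back-transform the cutoffs to raw copy counts: x = 10**y - 1.
cutoffs_raw = 10 ** cutoffs - 1

print(flagged)
print(cutoffs_raw)
```

Applied to the real table, the same back-transform would turn the Adrenal gland cutoff of 3.639864 into a raw count of roughly 4,364 copies.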
