# Goals and take-home points

- Understand the basics of data manipulation and visualization using Python
- Describe the role of the four proteomic data processing steps
- Connect the different steps of data processing to biological interpretations

# Introduction

This is a proteomics dataset of epithelial cells, comparing cells infected by Herpesvirus with control cells.
Each of the two conditions has 4 replicates: two acquired in an experiment in July (1.1 and 1.2) and two in September (2.1 and 2.2).

In order to compare these two conditions, the data need to be transformed to ensure that there is no bias from sample injection and that we can use proper statistics. We also need to replace missing values with values that represent background noise; otherwise we cannot estimate the enrichment of proteins detected in only one condition (e.g. row 11).

This is how we proceed:

- First, we transform protein values into logarithmic values. Raw protein values (like most other -omics values) have a positively skewed, approximately log-normal distribution, which means that their logarithms are approximately normally distributed. This helps with normalization, because we are then working with symmetric data distributions.

- Normalization is performed by centering each data distribution on its central value, i.e. the median or the mean. If the distributions have different widths, they should also be corrected for the slope of their correlation, but this is usually unnecessary if you use the same instrumentation for all samples.

- Data imputation is performed by replacing missing values with random values that represent background noise. Given that the data are normally distributed, moving 2-3 standard deviations to the left of the distribution's center places the imputed values among the lowest detectable values.

- Finally, a t-test is applied because the replicates are too few to consider non-parametric statistics. However, it is still important to assess whether we should use a homoscedastic (two-sample equal variance) or heteroscedastic (two-sample unequal variance) test. Whether the two samples have equal or unequal variance can be determined with an F-test.
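
To make the first step more concrete, here is a minimal sketch of why the log transform helps. It does not use our dataset: the intensities are simulated from a log-normal distribution (an assumption made purely for illustration), and skewness is used as a rough measure of asymmetry.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated "raw" intensities (NOT the real data): log-normal by construction,
# i.e. their log2 values are drawn from a normal distribution
raw = 2 ** rng.normal(loc=20, scale=2, size=5000)
logged = np.log2(raw)

raw_skew = stats.skew(raw)     # strongly positive: long right tail
log_skew = stats.skew(logged)  # close to 0: roughly symmetric

print(f"skewness of raw values:  {raw_skew:.2f}")
print(f"skewness of log2 values: {log_skew:.2f}")
```

A skewness near 0 after the transform is what allows us to summarize each sample by its mean and standard deviation in the later steps.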

## Importing modules and data

We will import the "Raw data" Excel sheet, which is stored in the file `Raw_data.xlsx`. To do this, we will use the pandas module and require **this notebook and the Excel file to be in the same folder.**

In [None]:
# Import all modules
import pandas as pd
import numpy as np
from numpy.random import uniform
import scipy.stats as stats
import math
import seaborn as sns
import matplotlib.pyplot as plt 

In [None]:
raw_df = pd.read_excel("Raw_data.xlsx",index_col = 0)

We imported the raw data using the pandas `read_excel` function with the file name and with `index_col = 0`. This last argument makes the zeroth column in the sheet the dataframe index, which allows us to access cells of interest using protein accession numbers.
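
As a small illustration of what `index_col = 0` buys us, the sketch below builds a hypothetical three-protein table (the accessions, gene symbols and values are made up) and reads it with `pd.read_csv`, which accepts the same `index_col` argument as `read_excel`.

```python
import io
import pandas as pd

# Hypothetical table standing in for the Excel sheet (made-up values)
csv_text = """Accession,Gene Symbol,Ctrl 1.1,HSV 1.1
P00001,GENE1,1200.5,900.0
P00002,GENE2,300.0,
P00003,GENE3,,4500.0
"""
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The first column is now the index, so rows can be selected by accession
print(toy_df.loc["P00002"])

# A single cell: one protein in one sample
print(toy_df.loc["P00003", "HSV 1.1"])
```

Empty cells are read as NaN, which is how missing protein measurements show up in the dataframe.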

In [None]:
raw_df

As you can see, not all columns are numerical. To facilitate our data analysis, we will **separate the values (numerical) from the metadata (categorical)**. We will put the numerical data in `val_df` and the metadata in `meta_df`

In [None]:
val_cols = ["Ctrl 1.1","Ctrl 1.2", "Ctrl 2.1", "Ctrl 2.2", "HSV 1.1", "HSV 1.2", 
            "HSV 2.1", "HSV 2.2"]
meta_cols = ["Description","Gene Symbol", "Organism"]
val_df = raw_df[val_cols]
meta_df = raw_df[meta_cols]

To visualize the underlying distribution of the protein abundance values, **we will be using violin plots**. In these plots, the width of the shape is an estimate of how common a value is in the data. I have defined a function that makes these violin plots below

In [None]:
def violin_plotter(df,title,group_names,ylabel):
    """
    This function makes violin plot for the different groups present in a DataFrame
    
    Parameters
    --------------
    df : pd.DataFrame
        Dataframe where the rows are proteins, columns are the conditions and the
        cells are the abundance values
    title : str
        Title for the violin plot
    group_names : list
        List of the names of the different groups. This list MUST be the same size
        as the number of groups (columns)
    ylabel : str
        Label for the y-axis
        
    Returns
    --------------
    None
    """
    df_unstack = df.unstack().reset_index()
    g = sns.violinplot(data=df_unstack,x="level_0",y=0)
    plt.title(title)
    plt.xlabel("Groups")
    plt.ylabel(ylabel)
    g.set_xticks(range(len(df.columns)))
    g.set_xticklabels(group_names, size=10, rotation=30)
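
If you are wondering where `x="level_0"` and `y=0` in the function come from, the sketch below shows what `unstack().reset_index()` does to a tiny made-up dataframe. The column names `level_0`, `level_1` and `0` are the defaults pandas assigns when the index and columns have no names.

```python
import pandas as pd

# Tiny made-up dataframe: two proteins, two samples
toy = pd.DataFrame(
    {"Ctrl 1.1": [1.0, 2.0], "HSV 1.1": [3.0, 4.0]},
    index=["P1", "P2"],
)

# Wide -> long: one row per (sample, protein) pair
long_df = toy.unstack().reset_index()
print(long_df)
```

In the long table, `level_0` holds the sample name, `level_1` the protein, and `0` the value, which is the layout seaborn expects for grouped plots.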

In [None]:
group_names=["Control 1.1","Control 1.2","Control 2.1","Control 2.2",
            "HSV 1.1","HSV 1.2","HSV 2.1","HSV 2.2"]
violin_plotter(val_df,"Raw values",group_names,"Raw values")

As you can see, most of the raw values are close to 0 on this scale, but a few are much higher. This long right tail is why we say that **our data follow a positively skewed (log-skewed) distribution**.

### Log2

We will make the protein abundance data symmetrical by computing the log2 of all values that are not NaN. To do this, we will use the `log2` function in the numpy module.

In [None]:
log2_vals = np.log2(val_df)
violin_plotter(log2_vals,"Log2 violin plot",group_names,"Log2 values")

Now, at first glance, the values are within the same order of magnitude.
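
Here is a quick sketch, on made-up numbers, of how `np.log2` treats a dataframe that contains missing values: NaN entries simply stay NaN, so nothing has to be filtered beforehand. (A raw value of exactly 0 would become `-inf`, so this assumes missing measurements are stored as NaN rather than 0.)

```python
import numpy as np
import pandas as pd

# Made-up raw values with one missing measurement
toy = pd.DataFrame({"Ctrl 1.1": [1024.0, np.nan, 8.0]}, index=["P1", "P2", "P3"])

toy_log2 = np.log2(toy)  # applied element-wise, NaN is propagated
print(toy_log2)
```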

### Normalize

Now that the data is symmetrical, we will center the values for all conditions around their mean. As a result, values that are close to the mean will be around 0. To do this, for every column `col` in `val_cols` (our list with the names of the columns containing numerical values) we will calculate the mean and subtract that value from the log2 values calculated in the previous step.

In [None]:
norm_vals = pd.DataFrame(index = log2_vals.index, columns = val_cols)
for col in val_cols:
    col_mean = log2_vals[col].mean()
    #print(col_mean)
    norm_vals[col] = log2_vals[col]-col_mean
    
violin_plotter(norm_vals,"Normalized violin plot",group_names,"Normalized values")
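
The loop above can also be written without a loop, because pandas subtracts a per-column summary from every row automatically. The sketch below shows this on made-up log2 values, together with the median-centered alternative mentioned in the introduction. It is only an illustration, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Made-up log2 values for two samples, including one missing value
toy_log2 = pd.DataFrame({
    "Ctrl 1.1": [10.0, 12.0, np.nan, 14.0],
    "HSV 1.1": [11.0, 13.0, 15.0, 21.0],
})

# mean() and median() work column by column and skip NaN by default
mean_centered = toy_log2 - toy_log2.mean()
median_centered = toy_log2 - toy_log2.median()

print(mean_centered)
print(median_centered)
```

The median is less sensitive to a few extreme proteins (such as the value 21 in the second column), which is one reason it is sometimes preferred as the central value.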

### Impute

The NaN values that we have in our data will make downstream statistics difficult. We will replace them assuming that they are just random noise. As such, they will be...

In [None]:
from numpy.random import uniform
impute_vals = pd.DataFrame(index = norm_vals.index, columns = val_cols)
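for col in val_cols:
    col_data = norm_vals[col]
    col_sd = col_data.std()
    # Draw one random value per protein so that missing values in the same column
    # are not all replaced by the identical number
    noise = pd.Series(uniform(size=len(col_data))*-0.3 - 2.5*col_sd, index=col_data.index)
    impute_vals[col] = col_data.mask(col_data.isnull(), noise)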
    
violin_plotter(impute_vals,"Imputed violin plot",group_names,"Imputed values")
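
The sketch below repeats the imputation idea on a single made-up column so that you can inspect the numbers. It uses a seeded NumPy generator (`np.random.default_rng`, an assumption made here for reproducibility; the cell above uses `numpy.random.uniform`) and assumes the column is already centered on 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the imputed values are reproducible

# Made-up normalized column (mean 0) with two missing values
col = pd.Series([-1.0, 0.5, np.nan, 1.5, np.nan, -1.0], name="Ctrl 1.1")
col_sd = col.std()

# One independent draw per row: 2.5 SD below the column mean (0),
# shifted further left by a uniform offset between 0 and 0.3
noise = pd.Series(-2.5 * col_sd - 0.3 * rng.uniform(size=len(col)), index=col.index)

# Only the NaN positions are filled; measured values are left untouched
imputed = col.fillna(noise)
print(imputed)
```

The imputed values all sit in the far left tail of the column, which is where signals close to the detection limit would be expected.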

### Generating the values for the volcano plot

Now that we have transformed the data to be approximately normally distributed and with no missing values, we will identify the proteins that are significantly differentially expressed between the control and HSV treated groups. To do this we will:
1. Calculate the average protein abundance within a condition (control or HSV)
2. Calculate the fold change of average HSV expression compared to the control group
    - **NOTE:** log2(HSV/control) == log2(HSV) - log2(control)
3. Determine which proteins have statistically similar variances (homoscedastic) or unequal variances (heteroscedastic)
    - To do this we calculate the variance of each protein in each condition and perform an F-test
4. Calculate the p-value of an independent t-test (which tests whether the means of the two groups differ significantly) for all proteins. 
    - **NOTE:** the `stats.ttest_ind` function requires defining if the two samples have equal variance, so we are using the p-value from step 3 to determine if a protein has equal variance in both conditions.
5. Calculate the score, which is the protein abundance fold-change times the -log10(ttest pvalue)
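
Before running these steps on the whole table, it may help to see them for a single protein. The sketch below uses eight made-up normalized values (four per condition) and a two-sided F-test with a 0.05 cutoff. Both the numbers and the cutoff are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Made-up normalized log2 abundances for one protein
ctrl = np.array([-0.2, 0.1, 0.0, 0.3])
hsv = np.array([1.9, 2.4, 2.1, 2.6])

# Steps 1-2: log2 fold change is a difference of means on the log scale
log2_fc = hsv.mean() - ctrl.mean()

# Step 3: two-sided F-test on the ratio of sample variances
f_stat = ctrl.var(ddof=1) / hsv.var(ddof=1)
f_cdf = stats.f.cdf(f_stat, dfn=len(ctrl) - 1, dfd=len(hsv) - 1)
f_pval = 2 * min(f_cdf, 1 - f_cdf)
equal_var = f_pval >= 0.05  # no evidence that the variances differ

# Step 4: Student's t-test if variances look equal, Welch's t-test otherwise
ttest = stats.ttest_ind(ctrl, hsv, equal_var=equal_var)

# Step 5: score combines effect size and significance
neg_log10_p = -np.log10(ttest.pvalue)
score = log2_fc * neg_log10_p

print(f"log2 FC = {log2_fc:.2f}, F-test p = {f_pval:.3f}, "
      f"-log10(p) = {neg_log10_p:.2f}, score = {score:.2f}")
```

Because the score multiplies the fold change by -log10(p), it is large and positive for proteins that are strongly and reliably enriched in HSV, and large and negative for proteins that are depleted.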

In [None]:
# Separate the control from the HSV
control_cols = ["Ctrl 1.1","Ctrl 1.2", "Ctrl 2.1", "Ctrl 2.2"]
exp_cols = ["HSV 1.1", "HSV 1.2", "HSV 2.1", "HSV 2.2"]
# Initiate an empty dataframe
volcano_df = pd.DataFrame(index = impute_vals.index,columns=["ctrl_avg","HSV_avg",
                                                             "HSV/ctrl","Ftest","Neg_pval",
                                                            "score"])
# Fill in the average and fold change columns
volcano_df["ctrl_avg"] = impute_vals[control_cols].mean(axis=1)
volcano_df["HSV_avg"] = impute_vals[exp_cols].mean(axis=1)
volcano_df["HSV/ctrl"] = volcano_df["HSV_avg"]-volcano_df["ctrl_avg"]

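# Determine if the variance of protein values are the same between 
# Control and HSV groups using a two-sided F test
control_var = impute_vals[control_cols].var(axis=1)
exp_var = impute_vals[exp_cols].var(axis=1)
f_stat = control_var/exp_var
# With 4 replicates per group, both degrees of freedom are 4-1 = 3.
# The two-sided p-value is twice the smaller tail of the F distribution
f_cdf = f_stat.apply(stats.f.cdf,dfn=3,dfd=3)
volcano_df["Ftest"] = 2*np.minimum(f_cdf,1-f_cdf)

# Using that F test, conduct a standard ttest for proteins with the same variance
# or Welch's t-test otherwise
for index in volcano_df.index:
    # A small F-test p-value means the variances differ, so equal variance is
    # only assumed when the p-value is at least 0.05
    equal_var = volcano_df.loc[index,"Ftest"] >= 0.05
    ttest = stats.ttest_ind(impute_vals.loc[index,control_cols],
                               impute_vals.loc[index,exp_cols],equal_var=equal_var)
    volcano_df.at[index,"Neg_pval"] = -math.log10(ttest.pvalue)
# Store the -log10 p-values as floats rather than generic Python objects
volcano_df["Neg_pval"] = volcano_df["Neg_pval"].astype(float)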

# Calculate protein scores
volcano_df["score"] = volcano_df["HSV/ctrl"]*volcano_df["Neg_pval"]    

sns.histplot(volcano_df["score"],kde=True)

### Generating figures

Now we will generate the figures from the data we have

#### Correlation

The `corr()` method will compute pairwise Pearson correlation coefficients between all column pairs.

The `seaborn.clustermap()` function will conduct hierarchical clustering on the Pearson coefficients computed above, and will plot those coefficients as a heatmap, showing their clustering as well.

In [None]:
corr_df = impute_vals.corr()
sns.clustermap(corr_df,vmin=0,vmax=1)
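
To get a feeling for what a batch effect looks like in a correlation matrix, here is a sketch on simulated data (not our dataset). Four replicates share the same biological signal, but each acquisition batch adds its own protein-specific deviation. The sample names, noise levels and seed are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_proteins = 200

signal = rng.normal(size=n_proteins)              # shared by all samples
batch_1 = rng.normal(scale=0.5, size=n_proteins)  # deviation specific to batch 1
batch_2 = rng.normal(scale=0.5, size=n_proteins)  # deviation specific to batch 2

def replicate(batch):
    """Simulate one sample: signal + batch deviation + small measurement noise."""
    return signal + batch + rng.normal(scale=0.1, size=n_proteins)

toy = pd.DataFrame({
    "Ctrl 1.1": replicate(batch_1),
    "Ctrl 1.2": replicate(batch_1),
    "Ctrl 2.1": replicate(batch_2),
    "Ctrl 2.2": replicate(batch_2),
})
toy_corr = toy.corr()  # pairwise Pearson correlation between columns

# Average correlation of sample pairs from the same batch vs different batches
within = np.mean([toy_corr.loc["Ctrl 1.1", "Ctrl 1.2"],
                  toy_corr.loc["Ctrl 2.1", "Ctrl 2.2"]])
between = np.mean([toy_corr.loc[a, b]
                   for a in ("Ctrl 1.1", "Ctrl 1.2")
                   for b in ("Ctrl 2.1", "Ctrl 2.2")])
print(f"within-batch r = {within:.2f}, between-batch r = {between:.2f}")
```

When samples correlate more strongly within a batch than between batches, hierarchical clustering tends to group them by batch rather than by condition.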

**Question:** Is there a batch effect?

#### Volcano plot

In [None]:
def volcano_plotter(df,title,ylabel,xlabel):
    """
    This function makes a volcano plot from a volcano dataframe
    
    Parameters
    -------------
    df : pd.DataFrame
        Volcano dataframe. MUST contain the columns: 'HSV/ctrl', 'Neg_pval'
        and 'Organism'
    title, ylabel, xlabel : str
        Names for the title, y-label and x-label respectively
    """
    # Calculate the bonferroni corrected p-value threshold for significance
    a_bonf = -1*np.log10(0.05/df.shape[0])
    
    # Classify the proteins based on their significance and fold change
    v_hue = pd.Series(data="no significance",index=df.index,name="hue")
    for i in df.index:
        fc = abs(df.loc[i,"HSV/ctrl"])
        pval = df.loc[i,"Neg_pval"]
        if pval>a_bonf and fc>0.5:
            v_hue[i]="significant change"
        elif pval>a_bonf and fc<0.5:
            v_hue[i] = "no change"
        elif pval<a_bonf and fc>0.5:
            v_hue[i] = "not significant change"
    df_merge = df.merge(v_hue,left_index=True,right_index=True)
    sns.scatterplot(data=df_merge,x="HSV/ctrl",y="Neg_pval",hue="hue",style="Organism",palette="colorblind")
    plt.legend(bbox_to_anchor=(1.25, 1), borderaxespad=0)
    plt.axvline(-0.5,linestyle = '--',lw = 0.5)
    plt.axvline(0.5,linestyle = '--',lw = 0.5)
    plt.axhline(a_bonf,ls = '--',lw = 0.5)
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    plt.title(title)


In [None]:
volcano_df["Organism"] = meta_df["Organism"]
volcano_plotter(volcano_df,"Volcano plot","-log10 p-value","log2 HSV/control")
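
The horizontal line in the volcano plot is the Bonferroni-corrected significance threshold, expressed on the -log10 scale. The sketch below works through the arithmetic for a hypothetical table of 2000 proteins (the real number is `df.shape[0]`).

```python
import numpy as np

alpha = 0.05
n_tests = 2000  # hypothetical number of proteins, one t-test each

bonf_p = alpha / n_tests             # corrected p-value cutoff
bonf_neg_log10 = -np.log10(bonf_p)   # the same cutoff on the volcano plot's y-axis

print(f"p-value cutoff: {bonf_p:.1e}")
print(f"-log10 cutoff:  {bonf_neg_log10:.2f}")
```

The more proteins are tested, the higher this line sits, so fewer points end up above it.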

#### Correlation protein abundance

In [None]:
sns.scatterplot(data=volcano_df,x="ctrl_avg",y="HSV_avg",hue="score",style="Organism",palette="vlag")
ax = plt.gca()
xpoints = ypoints = ax.get_xlim()
ax.plot(xpoints,ypoints,linestyle='--', color='k', lw=1, scalex=False, scaley=False)
plt.title("Protein abundance correlation")
plt.ylabel("Average HSV abundance")
plt.xlabel("Average control abundance")
plt.legend(bbox_to_anchor=(1.25, 1), borderaxespad=0)

**Question:** Describe the correlation of protein abundance between the control and HSV groups.

#### Biomarker

In [None]:
sns.scatterplot(data=volcano_df,x="HSV/ctrl",y="HSV_avg",hue="score",style="Organism",palette="vlag")
ax = plt.gca()
xpoints = ypoints = ax.get_xlim()
ax.plot(xpoints,ypoints,linestyle='--', color='k', lw=1, scalex=False, scaley=False)
plt.title("Fold change versus HSV abundance")
plt.ylabel("Average HSV abundance")
plt.xlabel("Log2 HSV/ctrl")
plt.legend(bbox_to_anchor=(1.25, 1), borderaxespad=0)

## GO term enrichment analysis

For GO term enrichment analysis we will use the [GOrilla](https://cbl-gorilla.cs.technion.ac.il/) server. GOrilla, unlike many other GO term enrichment analysis tools, does not require setting a significance threshold. Instead, you can provide a ranked list of protein IDs and it will identify the GO terms that are enriched in the top proteins. 

To use this tool, use the code below to rank the proteins by their score, write the ranked list to a text file, and paste that list into the website.

In [None]:
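# Sort the proteins from highest to lowest score before exporting them
ranked_list = volcano_df["score"].astype(float).sort_values(ascending=False).index

file_name="score_rank.txt"
# The with-statement closes the file once all lines are written
with open(file_name,"w") as file:
    for protein in ranked_list:
        file.write(protein+"\n")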
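
To check what ends up in the text file, here is a small sketch on three made-up scores. It sorts them from highest to lowest, writes the accessions to a temporary file (the path and file name are hypothetical), and reads them back.

```python
import os
import tempfile
import pandas as pd

# Made-up scores indexed by hypothetical accession numbers
toy_scores = pd.Series({"P1": 0.4, "P2": 7.9, "P3": -3.2})

# Highest score first, so the most HSV-enriched proteins top the list
ranked = toy_scores.sort_values(ascending=False).index.tolist()

out_path = os.path.join(tempfile.mkdtemp(), "score_rank_toy.txt")
with open(out_path, "w") as handle:
    handle.write("\n".join(ranked) + "\n")

# Read the file back to confirm there is one accession per line, in rank order
with open(out_path) as handle:
    read_back = handle.read().split()
print(read_back)
```

Sorting in descending order puts proteins enriched in HSV at the top. Sorting with `ascending=True` instead would rank the proteins depleted in HSV first.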