# Compute p-values from mass spectrometry data

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-by 4.0 License](https://creativecommons.org/licenses/by/4.0/). 

In [1]:
# Import basic modules
import glob
import numpy as np
import pandas as pd
import os

import regseq.utils

For a detailed explanation of the steps leading to this notebook, as well as the experimental context, refer to the [Reg-Seq wiki](https://github.com/RPGroup-PBoC/RegSeq/wiki).

Previously, we analyzed sequencing datasets obtained from Reg-Seq experiments and used information footprints to identify significant binding sites. We also used the package `logomaker` to create sequence logos of significant binding sites. We used the function `regseq.find_region.find_region` to identify significant binding sites, which we are going to use again here. To identify which proteins bind to the proposed binding sites, we performed mass spectrometry experiments. Here we compute p-values for the enrichment of all transcription factors observed in those experiments.

First, we load all the file names for protein groups, which are found in the folder `/data/massspec/filtered/`. In the repo you will find two files for *ykgE*, which we use to demonstrate the procedure. The files contain the normalized heavy-to-light (H/L) ratios for all identified proteins.

In [2]:
# Find all files for our gene of interest.
# glob returns files in arbitrary order, so sort to make indexing reproducible.
all_filtered = sorted(glob.glob('../data/massspec/filtered/*'))
ykgE_list = [x for x in all_filtered if "ykgE" in x]

Let's have a look at one of these files.

In [3]:
pd.read_csv(ykgE_list[1])

| Unnamed: 0.1 | Unnamed: 0 | Protein names | Peptide counts (razor+unique) | Ratio H/L normalized |
|---|---|---|---|---|
| 0 | 736 | Uncharacterized HTH-type transcriptional regul... | 2 | 75.362 |
| 1 | 406 | DNA-binding protein HU-beta | 3 | 1.8557 |
| 2 | 472 | DNA gyrase subunit B | 16 | 1.1209 |
| 3 | 88 | DNA topoisomerase 1 | 33 | 1.018 |
| 4 | 471 | DNA gyrase subunit A | 7 | 0.77814 |
| 5 | 410 | Deoxyribose operon repressor | 1 | 0.68261 |
| 6 | 279 | DNA-directed RNA polymerase subunit beta | 16 | 0.65153 |
| 7 | 574 | DNA topoisomerase 3 | 9 | 0.60886 |
| 8 | 277 | DNA-directed RNA polymerase subunit beta | 18 | 0.57989 |
| 9 | 836 | Penicillin-binding protein activator LpoA | 3 | 0.40466 |
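H/L ratios are easier to compare on a logarithmic scale, where enrichment and depletion are symmetric around zero. The following is a minimal sketch of that transformation, using a few rows copied from the table above rather than the file itself; the column name `log2 ratio` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Rows copied from the table shown above (first name truncated as displayed).
example = pd.DataFrame(
    {
        "Protein names": [
            "Uncharacterized HTH-type transcriptional regul...",
            "DNA-binding protein HU-beta",
            "DNA gyrase subunit B",
            "DNA topoisomerase 1",
        ],
        "Ratio H/L normalized": [75.362, 1.8557, 1.1209, 1.018],
    }
)

# Log-transform so that a ratio of 1 (no enrichment) maps to 0.
# "log2 ratio" is a hypothetical column name, not part of the data files.
example["log2 ratio"] = np.log2(example["Ratio H/L normalized"])

# Rank proteins from most to least enriched.
ranked = example.sort_values("log2 ratio", ascending=False).reset_index(drop=True)
print(ranked[["Protein names", "log2 ratio"]])
```

In this file, a single protein stands far above the rest, which is the kind of outlier the p-value calculation is meant to flag.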


To compute the p-values and identify the protein with the smallest p-value in each experiment, we use the function `regseq.utils.cox_mann_p_values`, which computes p-values following [Cox and Mann (2008)](https://www.nature.com/articles/nbt.1511). As input, we give the list of files to consider and the path where the output is stored (`output_name`). The output is a text file that contains the protein with the lowest p-value for each input file.


In [4]:
output_name = '../data/massspec/pval.txt'
regseq.utils.cox_mann_p_values(ykgE_list, output_file=output_name)
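For intuition about what such a p-value measures, here is a minimal sketch of the outlier test usually described for this method ("significance A"): log ratios are compared to the median, scaled by a robust, asymmetric spread estimated from percentiles. The function name `significance_a` is hypothetical, and whether `regseq.utils.cox_mann_p_values` implements exactly this calculation is an assumption, so the numbers it produces need not match the output file.

```python
import math

import numpy as np


def significance_a(ratios):
    """Outlier p-values for H/L ratios (illustrative, not the regseq code)."""
    log_r = np.log2(np.asarray(ratios, dtype=float))
    # Percentiles that correspond to -1, 0 and +1 standard deviations
    # of a normal distribution give a robust center and spread.
    r_lo, r_mid, r_hi = np.percentile(log_r, [15.87, 50.0, 84.13])
    # Distance from the median in units of the spread on the matching side.
    z = np.where(
        log_r > r_mid,
        (log_r - r_mid) / (r_hi - r_mid),
        (r_mid - log_r) / (r_mid - r_lo),
    )
    # Tail probability of a standard normal beyond z.
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])


# Ratios copied from the table shown earlier in this notebook.
ratios = [75.362, 1.8557, 1.1209, 1.018, 0.77814,
          0.68261, 0.65153, 0.60886, 0.57989, 0.40466]
p_values = significance_a(ratios)
print(p_values)
```

Using percentiles instead of the mean and standard deviation keeps a single extreme ratio from inflating the spread estimate and masking itself.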

Let's have a look at the results for the two files we used here.

In [5]:
with open(output_name) as f:
    for line in f:
        print(line.strip())

ykgE_22Apr_2019.csv,p_val
Cation transport regulator ChaB,1.5229411396102862e-05
ykgE_may5_2019.csv,p_val
Uncharacterized HTH-type transcriptional regulator YieP,1.7079120002726256e-12
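The output alternates between a header line naming the file and a line with the top protein and its p-value. A minimal sketch of collecting these pairs into a table follows; it assumes this alternating layout holds for the whole file, and it parses the lines printed above from a string, whereas in practice they would be read from `output_name`.

```python
import io

import pandas as pd

# The lines printed above; in practice, read them from `output_name`.
text = """ykgE_22Apr_2019.csv,p_val
Cation transport regulator ChaB,1.5229411396102862e-05
ykgE_may5_2019.csv,p_val
Uncharacterized HTH-type transcriptional regulator YieP,1.7079120002726256e-12
"""

lines = [line.strip() for line in io.StringIO(text) if line.strip()]

records = []
# Assumed layout: "<file name>,p_val" followed by "<protein>,<p-value>".
for header, result in zip(lines[0::2], lines[1::2]):
    file_name = header.rsplit(",", 1)[0]
    # Split on the last comma, since protein names may contain commas.
    protein, p_val = result.rsplit(",", 1)
    records.append({"file": file_name, "protein": protein, "p_val": float(p_val)})

summary = pd.DataFrame(records)
print(summary)
```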


We found proteins with very small p-values here. However, we have not yet confirmed that these proteins are binding to DNA. We will do this in the next step.

Finally, here are the versions of the packages used in this notebook. To display the versions, we use the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark).

## Computing Environment

In [7]:
%load_ext watermark
%watermark -v -p pandas,numpy,regseq

CPython 3.6.9
IPython 7.13.0

pandas 1.0.3
numpy 1.18.1
regseq 0.0.2
