# Compute p-values from mass spectrometry data

In [1]:
#import basic modules
import glob
import numpy as np
import pandas as pd
import os

import regseq.utils

First, we load the file names of all protein group files. These files contain the normalized heavy-to-light ratios for all identified proteins.

In [2]:
allnames = glob.glob('../data/massspec/proteinGroups*.txt')
# Exclude a file that produces an error message
allnames.remove("../data/massspec/proteinGroups_Oct5v2.txt")

We can also load in the protein group files for only the targets we will be summarizing.

In [3]:
all_filtered = glob.glob('../data/massspec/filtered/*')
all_filtered

['../data/massspec/filtered/ykgE_may5_2019.csv',
 '../data/massspec/filtered/cpxR_1Apr_19.csv',
 '../data/massspec/filtered/aphA_May_19.csv',
 '../data/massspec/filtered/rspA_22Apr_2019.csv',
 '../data/massspec/filtered/idnK_1Apr_19.csv',
 '../data/massspec/filtered/rspA_may5_2019.csv',
 '../data/massspec/filtered/rhle_Mar9_2018.csv',
 '../data/massspec/filtered/ybjx_9Mar_18.csv',
 '../data/massspec/filtered/leuABCD_9May_19.csv',
 '../data/massspec/filtered/ykgE_22Apr_2019.csv',
 '../data/massspec/filtered/phnA_Mar9_2018.csv']
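
Each filtered file name starts with the target gene or operon, followed by an underscore and a date label. As a minimal sketch, assuming that naming pattern holds for every file, the target label can be recovered from a path as follows. The helper `target_from_path` is hypothetical and not part of `regseq`.

```python
import os


def target_from_path(path):
    """Return the label that precedes the first underscore in the file name."""
    return os.path.basename(path).split("_", 1)[0]


# Two of the paths listed above, used here as examples.
example_paths = [
    "../data/massspec/filtered/ykgE_may5_2019.csv",
    "../data/massspec/filtered/leuABCD_9May_19.csv",
]
targets = [target_from_path(p) for p in example_paths]
print(targets)
```

Note that some targets, such as ykgE and rspA, appear in more than one file, so the label alone does not identify a file uniquely.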

We will format an output dataframe that contains the mean value and variance for the most highly enriched protein and for all background proteins.

In [4]:
#create a dataframe for pvals
out_pval = pd.DataFrame(columns=['pval'])

# Set the display format used when printing dataframes.
pd.set_option('display.max_colwidth', 999)
pd.set_option('display.float_format', '{:10,.9f}'.format)

We will loop through all enriched proteins displayed in the figures in the Reg-Seq paper. The following function stores the results in the file given by `output_file`.

In [5]:
output_name = '../data/massspec/pval.txt'
regseq.utils.cox_mann_p_values(all_filtered,output_file=output_name)
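
The source of `regseq.utils.cox_mann_p_values` is not shown in this notebook. Assuming the name refers to the "significance A" outlier test that Cox and Mann described for MaxQuant, the idea can be sketched as below: take log ratios, estimate a robust spread from the 15.87th, 50th and 84.13th percentiles, and convert each protein's distance from the median into a one-sided tail probability. The function `significance_a` and the synthetic ratios are hypothetical illustrations, and the actual `regseq` implementation may differ.

```python
import math

import numpy as np


def significance_a(ratios):
    """One-sided outlier p-values for ratios, using percentile-based spreads."""
    log_r = np.log2(np.asarray(ratios, dtype=float))
    # Robust analogues of (mean - sd), mean and (mean + sd) of a normal distribution.
    r_lo, r_med, r_hi = np.percentile(log_r, [15.87, 50.0, 84.13])
    p_values = []
    for r in log_r:
        if r >= r_med:
            z = (r - r_med) / (r_hi - r_med)  # distance in upper-side spreads
        else:
            z = (r_med - r) / (r_med - r_lo)  # distance in lower-side spreads
        # Probability of lying at least z standard deviations out on one side.
        p_values.append(0.5 * math.erfc(z / math.sqrt(2.0)))
    return np.array(p_values)


# Synthetic example: 201 background ratios between 0.5 and 2, plus one enriched protein.
background = 2.0 ** np.linspace(-1.0, 1.0, 201)
ratios = np.append(background, 64.0)
p_values = significance_a(ratios)
print(p_values[-1])
```

In this sketch a protein at the median gets a p-value of 0.5, and the value shrinks as the ratio moves away from the bulk of the distribution.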

In [6]:
with open(output_name) as f:
    for line in f:
        print(line.strip())

ykgE_may5_2019.csv,p_val
Uncharacterized HTH-type transcriptional regulator YieP,1.7079120002726256e-12
cpxR_1Apr_19.csv,p_val
Transcriptional regulatory protein CpxR,0.012627564862595015
aphA_May_19.csv,p_val
Deoxyribose operon repressor,1.2154400378041871e-08
rspA_22Apr_2019.csv,p_val
Glycerol-3-phosphate regulon repressor,0.00019362933512586566
idnK_1Apr_19.csv,p_val
Uncharacterized HTH-type transcriptional regulator YgbI,0.0001822896318279769
rspA_may5_2019.csv,p_val
Deoxyribose operon repressor,3.745348588798884e-34
rhle_Mar9_2018.csv,p_val
Glycerol-3-phosphate regulon repressor,0.000358827346436375
ybjx_9Mar_18.csv,p_val
DNA-binding protein StpA,0.005292003097828884
leuABCD_9May_19.csv,p_val
Uncharacterized HTH-type transcriptional regulator YgbI,0.00818124122601652
ykgE_22Apr_2019.csv,p_val
Cation transport regulator ChaB,1.5229411396102862e-05
phnA_Mar9_2018.csv,p_val
Uncharacterized HTH-type transcriptional regulator YciT,1.1746117315592258e-05
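
The output file alternates between a header line (`<file name>,p_val`) and a result line (`<protein name>,<p-value>`). As a minimal sketch, assuming that layout, the lines can be collected into one record per protein, which makes sorting and filtering easier. The helper `parse_pval_lines` is hypothetical and not part of `regseq`.

```python
def parse_pval_lines(lines):
    """Turn alternating header/result lines into a list of records."""
    records = []
    current_file = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so commas inside protein names are preserved.
        name, value = line.rsplit(",", 1)
        if value == "p_val":
            current_file = name  # header line: remember which file follows
        else:
            records.append(
                {"file": current_file, "protein": name, "pval": float(value)}
            )
    return records


# First four lines of the output shown above.
sample = [
    "ykgE_may5_2019.csv,p_val",
    "Uncharacterized HTH-type transcriptional regulator YieP,1.7079120002726256e-12",
    "cpxR_1Apr_19.csv,p_val",
    "Transcriptional regulatory protein CpxR,0.012627564862595015",
]
records = parse_pval_lines(sample)
for rec in sorted(records, key=lambda r: r["pval"]):
    print(rec["file"], rec["protein"], rec["pval"])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame` if a table is preferred.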


Finally, here are the versions of the packages used in this notebook. To display the versions, we use the IPython extension `watermark`, which can be found [here](https://github.com/rasbt/watermark).

## Computing Environment

In [7]:
%load_ext watermark
%watermark -v -p jupyterlab,pandas,numpy,regseq

CPython 3.6.9
IPython 7.13.0

jupyterlab not installed
pandas 1.0.3
numpy 1.18.1
regseq 0.0.2
