## Exploration
This notebook contains the code to load the data and perform the recommended filtering steps on lymphocyte data: B cells and T cells from two patients, collected at two time points. We use this data to show how the proteome varies between collection times.

### Access data
First, we import our data package. Then we can create a proteomic dataset object and store it as <code>prot</code>. You can select versions and subjects with <code>load_dataset</code>.

Calling <code>head</code> shows the first several lines of the dataframe, which provides an idea of the type of data present and the structure of the dataframe.

In [1]:
import longitudinalCLL
prot = longitudinalCLL.get_proteomic()

prot.load_dataset(version='July_noMBR_FP', subjects=[])  # Or pass subjects=[1] for just subject 1

prot.data_raw.head()

Protein ID,Subject1_B_cells_062920_C_10,Subject1_B_cells_062920_C_11,Subject1_B_cells_062920_C_12,Subject1_B_cells_062920_C_13,Subject1_B_cells_062920_C_9,Subject1_B_cells_072920_C_4,Subject1_B_cells_072920_C_5,Subject1_B_cells_072920_C_6,Subject1_B_cells_072920_C_8,Subject1_B_cells_072920_C_9,...,Subject2_T_cells_062920_F_12,Subject2_T_cells_062920_F_13,Subject2_T_cells_062920_F_14,Subject2_T_cells_062920_F_9,Subject2_T_cells_072920_F_1,Subject2_T_cells_072920_F_3,Subject2_T_cells_072920_F_4,Subject2_T_cells_072920_F_5,Subject2_T_cells_072920_F_6,Subject2_T_cells_072920_F_8
A0A0B4J2D5,3203277.5,4697996.0,7346776.5,8472867.0,4280919.5,0.0,4787781.5,2153860.2,5444238.5,5514300.5,...,0.0,4905873.0,0.0,3405266.0,4656720.0,1673629.9,6060600.5,2407679.5,7791855.0,5542659.5
A0AVT1,1917388.2,3033529.2,3773018.8,1865758.6,5191332.5,3332031.5,2464089.0,2290868.2,2486001.2,2596365.2,...,1091121.4,630647.25,576871.9,0.0,0.0,0.0,0.0,827356.1,1260652.0,0.0
A0FGR8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1514363.4,866311.94,0.0,0.0,0.0,0.0,0.0,0.0,653698.6,879074.56
A6NHR9,2650513.5,0.0,3788095.8,0.0,0.0,1845555.5,933499.44,1958162.5,1604476.4,418360.38,...,0.0,0.0,0.0,0.0,430484.97,0.0,0.0,0.0,538303.6,948190.4
A8K2U0,0.0,0.0,0.0,1149867.8,0.0,0.0,0.0,0.0,2272656.2,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



Here we define the list of cell types that the functions should look for, based on the sample naming pattern. The only requirement is that each string occurs in the names of all replicates you want to include in that group and in no other samples.

In [2]:
cell_types=['Subject1_B_cells_062920', 'Subject1_B_cells_072920',
            'Subject1_T_cells_062920', 'Subject1_T_cells_072920',
            'Subject2_B_cells_062920', 'Subject2_B_cells_072920',
            'Subject2_T_cells_062920', 'Subject2_T_cells_072920']
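
To illustrate the naming requirement described above, here is a minimal sketch of substring-based group selection. The helper `columns_for` and the short column list are hypothetical and are not part of `longitudinalCLL`; the package's own matching logic may differ.

```python
import pandas as pd

# A few sample names following the naming pattern used in this notebook.
columns = pd.Index([
    'Subject1_B_cells_062920_C_10',
    'Subject1_B_cells_062920_C_11',
    'Subject1_B_cells_072920_C_4',
    'Subject2_T_cells_062920_F_9',
])

def columns_for(cell_type, columns):
    """Return the sample names that contain the cell_type substring (hypothetical helper)."""
    return [c for c in columns if cell_type in c]

# A specific string selects only the replicates of one group.
print(columns_for('Subject1_B_cells_062920', columns))

# A string that is too short also matches the other time point,
# which would merge two groups into one.
print(columns_for('Subject1_B_cells', columns))
```

Including the collection date in each string is what keeps the two time points in separate groups here.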

### Filter data 
Next, we select the proteins that are measured in at least three samples from each group, so that later calculations can proceed without imputation or special handling of zeros.

In [3]:
indices = prot.check_n_of_each_type(cell_types=cell_types, null_value=0)
prot.data_frame = prot.data_frame[indices]
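
For readers who want to see the filtering idea without the package, the following is a rough pandas sketch of a "measured in at least three samples from each group" filter. It is not the implementation of `check_n_of_each_type`; the function name `at_least_n_per_group`, the `min_count` parameter, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def at_least_n_per_group(df, cell_types, min_count=3, null_value=0):
    """Boolean mask: True for rows with at least min_count measured values in every group.

    Hypothetical helper; a value counts as measured if it is neither
    null_value nor NaN.
    """
    keep = pd.Series(True, index=df.index)
    for cell_type in cell_types:
        # Columns belonging to this group are found by substring match.
        group_cols = [c for c in df.columns if cell_type in c]
        measured = (df[group_cols] != null_value) & df[group_cols].notna()
        keep &= measured.sum(axis=1) >= min_count
    return keep

# Toy data: two groups ('A_' and 'B_') with three replicates each.
toy = pd.DataFrame(
    {
        'A_1': [5.0, 0.0], 'A_2': [4.0, 3.0], 'A_3': [6.0, 0.0],
        'B_1': [2.0, 1.0], 'B_2': [3.0, 1.0], 'B_3': [1.0, 1.0],
    },
    index=['P1', 'P2'],
)

mask = at_least_n_per_group(toy, cell_types=['A_', 'B_'])
filtered = toy[mask]  # P2 is dropped: only one measured value in group 'A_'
print(filtered)
```

Requiring the threshold in every group, rather than across all samples, means a protein seen only in one cell type or one time point is removed.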

In [4]:
print("Total groups identified:")
print(prot.data_raw.shape[0])

print("Proteins identified in at least 3 of each cell type:\t")
print(prot.data_frame.shape[0])

Total groups identified:
2426
Proteins identified in at least 3 of each cell type:	
887


### Normalize
Before any analysis, we log normalize and median normalize across runs. We do this after filtering for consistently expressed proteins so that proteins identified in only some samples do not throw off the normalization. This reduces batch effects, though it may not entirely remove them.

In [5]:
prot.normalize()

prot.data_frame

Protein ID,Subject1_B_cells_062920_C_10,Subject1_B_cells_062920_C_11,Subject1_B_cells_062920_C_12,Subject1_B_cells_062920_C_13,Subject1_B_cells_062920_C_9,Subject1_B_cells_072920_C_4,Subject1_B_cells_072920_C_5,Subject1_B_cells_072920_C_6,Subject1_B_cells_072920_C_8,Subject1_B_cells_072920_C_9,...,Subject2_T_cells_062920_F_12,Subject2_T_cells_062920_F_13,Subject2_T_cells_062920_F_14,Subject2_T_cells_062920_F_9,Subject2_T_cells_072920_F_1,Subject2_T_cells_072920_F_3,Subject2_T_cells_072920_F_4,Subject2_T_cells_072920_F_5,Subject2_T_cells_072920_F_6,Subject2_T_cells_072920_F_8
O00148,1.376479,1.340402,1.281538,1.110298,0.827271,1.387920,1.070221,1.180213,0.915809,0.973494,...,1.208520,0.820813,1.445078,1.133851,1.295829,1.125747,1.076853,1.063122,1.208865,1.176501
O00151,-0.016176,0.035770,-0.107583,-0.033451,-0.122543,0.642699,0.635634,0.546499,0.179025,0.421700,...,-0.905244,-2.216417,-1.336944,-1.758952,,-1.610953,-1.486349,-1.004640,-2.027082,-1.413533
O00170,-1.794057,-2.402547,-1.803541,,,-1.696609,-1.764992,-2.035327,-1.878247,-1.283572,...,-1.275011,-1.551711,-1.764623,-1.904828,-1.107308,-1.086965,-1.045284,-1.433503,-1.129767,-1.110887
O00231,-1.809268,-2.044939,-2.671723,,-1.681016,,-1.683055,,-1.950543,-1.739249,...,-1.805573,-2.134510,-2.222441,-1.673983,-1.603100,-1.648746,-1.372268,-1.959888,-1.550092,-1.660486
O00303,-1.152235,-2.291264,-1.539569,-2.802766,-1.759188,-1.105277,-1.832747,-0.756486,-1.715554,-1.300011,...,-0.511477,-2.620680,-1.726668,-1.177003,-2.046698,-1.711648,-1.323695,-1.201397,-1.120975,-0.831181
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Q96Q89,-1.579776,-1.663152,,-1.032686,,6.273895,0.542648,,,0.296398,...,3.482063,-0.630322,5.411662,-0.486205,1.157400,1.918461,0.378184,0.346815,-0.821644,
P60983,0.442218,0.182954,0.132332,-0.115793,0.031849,0.083846,-0.096756,0.405767,-0.403734,-0.629704,...,0.969553,1.096736,0.996800,0.701211,0.329537,0.668325,0.066140,1.033020,-0.087159,0.148615
P62891,-1.872248,-1.729424,-1.822848,-1.287767,-1.553814,,-0.915629,-0.672013,-0.996686,-0.854449,...,-0.721069,,-1.024867,-2.807583,-0.934545,,-0.461915,,-0.524303,
Q6BDS2,4.237580,2.437860,,,2.533748,2.225110,3.448527,2.595640,1.158721,3.297262,...,4.171521,4.251520,,4.317260,2.440964,0.977445,1.720658,,1.971442,0.739808
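
As a rough illustration of the normalization step, the sketch below log-transforms intensities and then centers each run on its median. The exact procedure inside `prot.normalize()` is not shown in this notebook, so the log base (2), the treatment of zeros as missing values, and the function name `log_median_normalize` are all assumptions.

```python
import numpy as np
import pandas as pd

def log_median_normalize(df, null_value=0):
    """Log2-transform intensities, then subtract each run's (column's) median.

    Hypothetical helper, not the package implementation.
    """
    values = df.replace(null_value, np.nan)  # treat zeros as missing
    logged = np.log2(values)
    return logged - logged.median(axis=0)    # median skips NaN by default

# Toy intensities: three proteins measured in two runs.
toy = pd.DataFrame(
    {'run_1': [4.0, 16.0, 64.0], 'run_2': [8.0, 0.0, 32.0]},
    index=['P1', 'P2', 'P3'],
)

normalized = log_median_normalize(toy)
print(normalized)
```

After centering, each run has a median of zero on the log scale, which is consistent with the mix of positive and negative values and the missing entries seen in the normalized table above.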
