# Preliminary Proteomic Data Analyses

Using [data from Skyline](https://github.com/RobertsLab/project-oyster-oa/blob/master/notebooks/2017-03-14-Skyline-Test-Run.ipynb), I will analyze and visualize my data. I will first assess which proteins are differentially present in samples across sites and inside or outside of eelgrass beds. Then, I will visualize my data in three different ways.

### Data Exploration

First, I want to see what data I'm working with. I will work with the data for [average peak area](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Skyline_20170314/Oyster-AverageArea-Proteinbased.csv) for proteins across samples.

![screen shot 2017-03-21 at 6 38 09 pm](https://cloud.githubusercontent.com/assets/22335838/24178348/ce486518-0e65-11e7-92a5-999f56a87274.png)

My "Row Labels" column is a list of protein IDs. The other columns pertain to the mass spectrometry sample IDs. To make this usable, I will use an R script to add more informative column names. Additionally, I need to remove two columns relating to sample O107. There was a bubble in the sample column, which led to [poor mass spectrometer readings](https://yaaminiv.github.io/Mass-Spec-Updates/). I reran the sample twice, so I need to use those reruns for analysis instead of the initial poor runs.

Here's the [R script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DNR-Reformat-Preliminary-Data.R) I used to reformat my data.

The data can be found [here](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/Oyster-AverageAdjustedMergedArea.csv).
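The reformatting step boils down to two operations: dropping the unusable O107 runs and giving the remaining columns friendlier names. Here is a minimal Python sketch of that idea. It is not the linked R script, and the column headers (`O01_run1`, `O107_run1`, `O107_rerun1`), the protein labels, and the name mapping are all hypothetical placeholders, since the raw Skyline headers aren't reproduced in this notebook.

```python
import csv
import io

# Hypothetical raw export: real Skyline headers and values are not shown here.
raw_csv = """Row Labels,O01_run1,O107_run1,O107_rerun1
protein_1,100.0,5.0,90.0
protein_2,200.0,7.0,180.0
"""

# Assumed names: the run to discard and friendlier labels for the rest.
bad_runs = {"O107_run1"}
new_names = {
    "Row Labels": "proteins",
    "O01_run1": "sampleA",
    "O107_rerun1": "sampleB",
}


def reformat(text, drop, rename):
    """Drop unwanted columns, then rename the columns that remain."""
    reader = csv.DictReader(io.StringIO(text))
    kept = [name for name in reader.fieldnames if name not in drop]
    header = [rename.get(name, name) for name in kept]
    rows = [
        {rename.get(name, name): row[name] for name in kept}
        for row in reader
    ]
    return header, rows


header, rows = reformat(raw_csv, bad_runs, new_names)
print(header)
for row in rows:
    print(row)
```

Keeping the drop set and the rename mapping as separate inputs makes it easy to reuse the same function for the maximum peak area file.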

In [8]:
!head /Users/yaaminivenkataraman/Documents/School/project-oyster-oa/analyses/DNR_Preliminary_Analyses_20170321/Oyster-AverageAdjustedMergedArea.csv

"","averageAreaAdjusted.proteins","bareCaseInlet","bareFidalgoBay","bareWillapaBay","bareSkokomishRiver","barePortGamble","eelgrassCaseInlet","eelgrassFidalgoBay","eelgrassWillapaBay","eelgrassSkokomishRiver","eelgrassPortGamble"
"1","CHOYP_14332.1.2|m.5643",3293219.556,1726699.5,2716275.545,13976900.44,2243615.1,4909780.357,3073463.222,1943599.9,5308420.417,1708631.273
"2","CHOYP_14332.2.2|m.61737",81172.33333,101834.6667,2605333.25,143657,94883.2778,883664.4286,47877.6,859956.1429,7453332.5,181803
"3","CHOYP_1433E.1.2|m.3638",9599115.391,22026926.25,5985057.944,29630617.52,2209625.5,2888550.125,24439963.26,17286007.32,49787884.24,4732493.542
"4","CHOYP_1433E.2.2|m.63376",9599115.391,22026926.25,5985057.944,29630617.52,2209625.5,2888550.125,24439963.26,17286007.32,49787884.24,4732493.542
"5","CHOYP_1433G.1.2|m.8906",4875530.333,2873835,4671105.667,20887905.17,3481416.667,5906856.286,4141798.333,2896918.8,6873987.556,271279.25
"6","CHOYP_1433G.2.2|m.63450",10460977.8,7264823.438,

Using the same R script, I reformatted my maximum peak area data. The reformatted data can be found [here](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/Oyster-MaxAdjustedMergedArea.csv).

### Ratio Analysis

I will now use basic ratios to determine which proteins are potentially differentially expressed between eelgrass and bare patches, and between the five different sites.

Within the formatted data files, I averaged protein expression across the five sites, separately for bare patches and eelgrass patches. I then took the ratio of eelgrass:bare for each protein and sorted the data from smallest to largest ratio.
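The same calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how I produced the tables in this section. It uses the first two rows of the reformatted average-area file shown earlier; the function and variable names are my own.

```python
from statistics import mean

# First two rows of the reformatted average-area file.
# Each entry holds (bare values, eelgrass values), in the site order
# Case Inlet, Fidalgo Bay, Willapa Bay, Skokomish River, Port Gamble.
areas = {
    "CHOYP_14332.1.2|m.5643": (
        [3293219.556, 1726699.5, 2716275.545, 13976900.44, 2243615.1],
        [4909780.357, 3073463.222, 1943599.9, 5308420.417, 1708631.273],
    ),
    "CHOYP_14332.2.2|m.61737": (
        [81172.33333, 101834.6667, 2605333.25, 143657, 94883.2778],
        [883664.4286, 47877.6, 859956.1429, 7453332.5, 181803],
    ),
}


def eelgrass_to_bare(bare, eelgrass):
    """Ratio of the mean eelgrass area to the mean bare area."""
    return mean(eelgrass) / mean(bare)


# Sort proteins from smallest to largest eelgrass:bare ratio.
ratios = sorted(
    (eelgrass_to_bare(bare, eelgrass), protein)
    for protein, (bare, eelgrass) in areas.items()
)

for ratio, protein in ratios:
    print(f"{protein}\t{ratio:.3f}")
```

A ratio below 1 means the protein had a higher average area in bare patches, and a ratio above 1 means it was higher in eelgrass.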

**Average area**:

![average1](https://cloud.githubusercontent.com/assets/22335838/24216092/2eb7143e-0ef8-11e7-80c1-07ba1ded3fc0.png)
![average2](https://cloud.githubusercontent.com/assets/22335838/24216093/2eb7b362-0ef8-11e7-8c7f-ad17d9a800e8.png)

**Maximum area**:

![max1](https://cloud.githubusercontent.com/assets/22335838/24216097/306f1dda-0ef8-11e7-8096-665217a846d1.png)
![max2](https://cloud.githubusercontent.com/assets/22335838/24216096/306b3e86-0ef8-11e7-9234-67481b239536.png)

There's quite a range of eelgrass:bare ratios. Keeping this in mind, I'll proceed with my entire dataset for my initial visualizations. After seeing what I get, I'll pare down the dataset based on my ratio analysis.

### Nonmetric Multidimensional Scaling Plot

Based on my data exploration, it seems like most of my differences between eelgrass and bare patches are driven by site-specific interactions. To better visualize this, I'll first create an NMDS plot.

[R Script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DNR-Preliminary-Data-Analysis.R)

![preliminary NMDS](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/preliminaryNMDS.png)

I couldn't figure out how to add a legend properly, so the preliminary figure is missing one. However, I did color each site differently. Bare patches are squares, and eelgrass patches are circles. Without knowing which site is which, the points don't seem to be clustering together in any way that makes sense.

I showed this plot to Emma, and she said there was something weird about the axes. She's going to try to recreate my plot. After I get her feedback, I'll modify my NMDS.
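For context while troubleshooting: an NMDS ordinates samples from a matrix of pairwise dissimilarities, so it helps to sanity-check that input on its own. The sketch below computes a Bray-Curtis dissimilarity between two samples in plain Python. Bray-Curtis is an assumption on my part (it is a common choice for abundance-type data); this notebook doesn't state which measure the R script uses. The two vectors are the `bareCaseInlet` and `eelgrassCaseInlet` values for the first two proteins in the reformatted file.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: sum|a - b| / sum(a + b), ranging from 0 to 1."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same proteins")
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    return numerator / denominator


# Average areas for the first two proteins in the reformatted file.
bare_case_inlet = [3293219.556, 81172.33333]
eelgrass_case_inlet = [4909780.357, 883664.4286]

distance = bray_curtis(bare_case_inlet, eelgrass_case_inlet)
print(f"Bray-Curtis dissimilarity: {distance:.3f}")
```

Because the measure sums raw areas, the most abundant proteins dominate the result, which is worth keeping in mind when interpreting the ordination.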

### Heatmap

My next step was to create a heatmap to visualize differences in protein expression across sites and eelgrass conditions.

[R Script](https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/DNR_Preliminary_Analyses_20170321/DNR-Preliminary-Data-Analysis.R)

![preliminary heatmap](https://raw.githubusercontent.com/RobertsLab/project-oyster-oa/master/analyses/DNR_Preliminary_Analyses_20170321/preliminaryHeatmap.png)

From the heatmap, it's apparent that most of the proteins expressed are consistent between sites. There are only certain proteins on the "edges" of the heatmap that are different. It is likely that these proteins are the same as those identified by my ratio analysis. Either way, I'll need to pare down my dataset to get anything meaningful.
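One way to pare down the dataset is to keep only proteins whose eelgrass:bare ratio departs from 1 by some fold-change cutoff in either direction. The sketch below shows that filter in Python. The 2-fold cutoff, the protein labels, and the ratio values are all hypothetical placeholders for illustration, not results from my data.

```python
import math

# Hypothetical eelgrass:bare ratios, for illustration only.
ratios = {
    "protein_A": 0.25,
    "protein_B": 1.1,
    "protein_C": 3.0,
    "protein_D": 0.9,
}


def pare_down(ratio_by_protein, fold_change=2.0):
    """Keep proteins at least `fold_change` higher in either condition.

    Working on the log2 scale treats a ratio of 0.5 and a ratio of 2
    as equally large changes.
    """
    cutoff = math.log2(fold_change)
    return [
        protein
        for protein, ratio in ratio_by_protein.items()
        if abs(math.log2(ratio)) >= cutoff
    ]


kept = pare_down(ratios)
print(kept)
```

The cutoff is a judgment call; a stricter value leaves fewer proteins to plot, and a looser one keeps more of the "consistent" proteins that wash out the heatmap.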