# Scatter Plots and Trend/Regression Lines

**NOTE: This is an R Notebook**

Reference Site: http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/
 
  * [Local Mirror](http://indigo.sgn.missouri.edu/static/mirror_sites/zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/)


**Current ggplot2 Reference**
  * http://docs.ggplot2.org/current/
    * [Local Mirror](https://indigo.sgn.missouri.edu/static/mirror_sites/docs.ggplot2.org/current/)

This lab plots simple graphs that show the linear relationships between different sets of variables. 
The dataset used in the notebook contains nutrition data for a range of food items. 
Some of the variables in the dataset have linear relationships with one another.

Read the USDA dataset from "/dsa/data/all_datasets/USDA.csv" into a data frame called _USDA_.

In [None]:
USDA = read.csv("/dsa/data/all_datasets/USDA.csv",header=TRUE,sep=",")
head(USDA)

In [None]:
str(USDA)

In [None]:
cor(USDA[,3:16])

The NA entries in the output above are caused by NA (not available) values in the dataset. 
We must identify the NA values and fill in the missing data before we can compute the correlations. 
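
To see why a single missing value is enough to produce `NA`, here is a minimal sketch on made-up vectors (`x` and `y` are illustrative only, not USDA columns). It also shows the `use = "complete.obs"` argument of `cor()`, which drops incomplete pairs instead of filling them. This is an alternative shown for comparison, not the approach this notebook takes.

```r
# Toy vectors, made up for illustration (not taken from the USDA data)
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 8, 10)

# With the default settings, one NA makes the whole result NA
r_default <- cor(x, y)

# Drop incomplete pairs before computing the correlation
r_complete <- cor(x, y, use = "complete.obs")

print(r_default)
print(r_complete)
```

Dropping rows and filling values can give different correlations, so it is worth knowing which of the two a given analysis used.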

**How To:** Find the number of rows that contain NA values.

In [None]:
# Review this documentation
help(complete.cases)

In [None]:
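# complete.cases() returns TRUE for rows with no NA values.
# "!" is the logical NOT, so it flips TRUE to FALSE and selects the incomplete rows.
# dim() then reports how many such rows (and columns) there are.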
dim(USDA[!complete.cases(USDA),])
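
As a quick illustration of what `complete.cases()` returns, the following sketch uses a small made-up data frame (`toy` and its columns are hypothetical, not part of the USDA data).

```r
# Toy data frame with two incomplete rows (illustrative values only)
toy <- data.frame(
  id    = 1:4,
  fat   = c(1.2, NA, 3.4, 0.5),
  sugar = c(10, 12, NA, 8)
)

# One logical value per row: TRUE when the row has no NA
ok <- complete.cases(toy)
print(ok)

# Rows that contain at least one NA
incomplete <- toy[!ok, ]
print(nrow(incomplete))

# Number of NA values in each column
print(colSums(is.na(toy)))
```

`colSums(is.na(...))` is a handy companion check because it shows which columns the missing values come from.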

**How To:** Fill the missing NA values in the dataset.

In [None]:
## Using the zoo library
library(zoo)

# Last Observation Carried Forward
help(na.locf)

In [None]:
USDA=na.locf(USDA)
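
To make the idea of "last observation carried forward" concrete, here is a minimal base-R sketch. The helper `locf_sketch()` is hypothetical and written only for illustration; it is not how `zoo::na.locf()` is implemented, and `na.locf()` has extra options (for example, for leading NA values) that are described in its help page.

```r
# Hypothetical helper: fill each NA with the most recent non-NA value.
# Leading NAs have nothing to carry forward, so they stay NA here.
locf_sketch <- function(x) {
  # For every position, count how many non-NA values have been seen so far
  idx <- cumsum(!is.na(x))
  # Prepend NA so that a count of 0 maps to NA, then look up by count
  c(NA, x[!is.na(x)])[idx + 1]
}

v <- c(NA, 1, NA, NA, 4, NA)
filled <- locf_sketch(v)
print(filled)
```

Carrying values forward assumes that neighbouring rows are related, so the filled values depend on the row order of the data.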

In [None]:
# Convert every column to numeric (lapply returns a list of columns)
USDA <- lapply(USDA, as.numeric)
# Re-form the list into a data frame
USDA <- as.data.frame(USDA)
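
When converting columns with `as.numeric()`, it helps to know how it treats different input types. The sketch below uses made-up values (none are from the USDA data) to show the behaviour for factors and for text that is not a number.

```r
# A factor whose labels look like numbers (illustrative values)
f <- factor(c("10", "20", "10"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(f)
print(codes)

# Converting to character first recovers the numeric labels
values <- as.numeric(as.character(f))
print(values)

# Text that cannot be parsed as a number becomes NA (with a warning)
mixed <- suppressWarnings(as.numeric(c("3.5", "abc")))
print(mixed)
```

Checking `str()` on the data frame before and after the conversion is a simple way to confirm the columns ended up as intended.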

In [None]:
# Correlation
cor(USDA[,3:16])

**How To:** Find the correlations between the variables in the dataset and identify the pairs with a correlation greater than 0.5. 

In [None]:
## Correlation function again, along with a filter 

cor(USDA[,3:16])>0.5

The pairs of variables with a positive correlation greater than 0.5 are:

* Calories & Total Fat
* Calories & Saturated Fat
* Total Fat & Saturated Fat
* Carbohydrate & Sugar
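
Reading pairs off a large TRUE/FALSE matrix by eye is error-prone. As one possible way to list them, the sketch below works on a made-up data frame (`toy` and its columns `a`, `b`, `c` are hypothetical) and keeps only the upper triangle so that each pair appears once and the diagonal is skipped.

```r
# Toy data: b rises with a, while c falls as a rises (illustrative only)
toy <- data.frame(a = 1:10, b = (1:10)^2, c = -(1:10))
cm <- cor(toy)

# TRUE only above the diagonal and only where the correlation exceeds 0.5
keep <- cm > 0.5 & upper.tri(cm)

# Row and column positions of the TRUE entries
idx <- which(keep, arr.ind = TRUE)

# One row per qualifying pair, with its correlation
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

The same pattern can be applied to `cor(USDA[,3:16])`, and the threshold can be changed to suit the question being asked.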

In [None]:
library(ggplot2)
library(gridExtra)

**How To:** Fit a linear regression model, `lm()`, between Calories and TotalFat.

In [None]:
# Fit a (l)inear (m)odel with Calories as the response and TotalFat as the predictor

lm(Calories ~ TotalFat, data = USDA)
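
Printing the result of `lm()` shows the coefficients, but the fitted object holds more. Here is a minimal sketch on made-up data (`toy`, `x`, and `y` are hypothetical) that stores the fit and then reads the coefficients, the R-squared value, and a prediction from it.

```r
# Toy data with a roughly linear relationship (illustrative values only)
toy <- data.frame(x = 1:5, y = c(3.1, 4.9, 7.2, 8.8, 11.0))

# Store the fitted model so it can be inspected afterwards
fit <- lm(y ~ x, data = toy)

# Intercept and slope of the fitted line
print(coef(fit))

# Proportion of variance in y explained by x
print(summary(fit)$r.squared)

# Predicted y for a new x value
pred <- predict(fit, newdata = data.frame(x = 6))
print(pred)
```

The same calls work on a model stored from `lm(Calories ~ TotalFat, data = USDA)`.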

## Regression Lines

Let's have ggplot2 add the regression lines.

In [None]:
# Create a scatterplot with Calories on the x axis and TotalFat on the y axis,
# coloring each point by SaturatedFat
p1 <- ggplot(USDA, aes(Calories, TotalFat, color = SaturatedFat)) + geom_point()
# Add a line that approximates the trend: method = lm uses the linear model,
# and se = FALSE hides the confidence band
p1 <- p1 + geom_smooth(method = lm, se = FALSE, color = "orange")
p1
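
To connect the plotted line with the earlier `lm()` call, the sketch below uses made-up data (`toy`, `x`, and `y` are hypothetical) and draws the line twice: once with `geom_smooth(method = lm)` and once from the coefficients of a separate `lm()` fit using `geom_abline()`. Under these assumptions the dashed line should sit on top of the orange one.

```r
library(ggplot2)

# Toy data, made up for illustration (not the USDA data)
toy <- data.frame(x = 1:5, y = c(3.1, 4.9, 7.2, 8.8, 11.0))

# Fit the model directly so its coefficients are available
fit <- lm(y ~ x, data = toy)

p_toy <- ggplot(toy, aes(x, y)) +
  geom_point() +
  # Line computed by ggplot2 from the same linear model
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "orange") +
  # Line drawn from the coefficients of the separate lm() fit
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], linetype = "dashed")
p_toy
```

One visible difference is that `geom_abline()` extends across the whole panel, while `geom_smooth()` only spans the range of the data.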

In [None]:
# Create the same scatterplot with Calories on the x axis and TotalFat on the y axis
p1 <- ggplot(USDA, aes(Calories, TotalFat, color = SaturatedFat)) + geom_point()
# Add a trend line with geom_smooth()
p1 <- p1 + geom_smooth(se = FALSE, color = "orange")
p1

## <span style="background:yellow">YOUR TURN</span>

#### 1) What is the difference between the last two plots above, and which part of the code changed?



#### 2) Choose another pair of variables with high correlation. Plot the data with a regression line.


In [None]:
# Add your code under this comment.
# -----------------------------------------






## Grid Plotting

Plot a grid of scatter plots for the 4 correlated pairs identified above.

Build the plots in memory, then pass all of the graphics to the grid layout manager.

In [None]:
# Scatterplot of Calories (x) against TotalFat (y), colored by SaturatedFat
p1 <- ggplot(USDA, aes(Calories, TotalFat, color = SaturatedFat)) + geom_point()
# Add a linear regression line without the confidence band
p1 <- p1 + geom_smooth(method = lm, se = FALSE, color = "orange")

# Create it with Calories, SaturatedFat
p2 <- ggplot(USDA, aes(Calories, SaturatedFat,color=Cholesterol)) + geom_point() 
p2 <- p2 + geom_smooth(method = lm, se = FALSE)


# Create it with TotalFat, SaturatedFat
p3 <- ggplot(USDA, aes(TotalFat, SaturatedFat,color=Cholesterol)) + geom_point() 
p3 <- p3 + geom_smooth(method = lm, se = FALSE)

# Create it with Carbohydrate, Sugar
p4 <- ggplot(USDA, aes(Carbohydrate, Sugar,color=Calories)) + geom_point() 
p4 <- p4 + geom_smooth(method = lm, se = FALSE)

grid.arrange(p1, p2, p3, p4)
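
When there are many pairs, the plots can also be built in a loop and handed to the layout as a list. The sketch below is one possible way to do that on made-up data (`toy`, `var_pairs`, and `plots` are hypothetical names). It assumes a ggplot2 version that supports the `.data` pronoun inside `aes()`, and it uses the `grobs` and `ncol` arguments of `grid.arrange()`.

```r
library(ggplot2)
library(gridExtra)

# Toy data, made up for illustration (not the USDA data)
toy <- data.frame(a = 1:10, b = (1:10)^2, c = 10:1)

# Each entry names the x and y column for one plot
var_pairs <- list(c("a", "b"), c("a", "c"))

# Build one scatterplot with a regression line per pair
plots <- lapply(var_pairs, function(v) {
  ggplot(toy, aes(.data[[v[1]]], .data[[v[2]]])) +
    geom_point() +
    geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
    labs(x = v[1], y = v[2])
})

# Lay the plots out side by side in two columns
grid.arrange(grobs = plots, ncol = 2)
```

Setting `ncol` (or `nrow`) controls the shape of the grid instead of leaving it to the default layout.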

**Activity 6:** Create a scatter plot of Calories against TotalFat. Draw the points in green and make them transparent. Adjust the breaks for the x and y axes. 

In [None]:
# Scatterplot with green, mostly transparent points

pp1 <- ggplot(USDA, aes(Calories, TotalFat)) + geom_point(color = "green", alpha = 1/10)
# Add the black-and-white theme
pp1 <- pp1 + theme_bw()
# Set the positions of the tick marks (breaks) on both axes
pp1 <- pp1 + scale_x_continuous(breaks = seq(0, 1000, 100)) + scale_y_continuous(breaks = seq(-10, 110, 10))
# Make sure the axes cover at least these ranges
pp1 <- pp1 + expand_limits(x = c(0, 1000), y = c(-10, 110))
# Add a title and a linear regression line
pp1 <- pp1 + labs(title = "USDA") + geom_smooth(method = lm, se = FALSE, color = "orange")
pp1
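
The `breaks` argument takes a plain numeric vector of tick positions, which `seq(from, to, by)` generates. As a small check, the sketch below prints the two vectors used in the cell above (`x_breaks` and `y_breaks` are hypothetical names introduced here).

```r
# Tick positions for the x axis: every 100 from 0 to 1000
x_breaks <- seq(0, 1000, 100)
print(x_breaks)

# Tick positions for the y axis: every 10 from -10 to 110
y_breaks <- seq(-10, 110, 10)
print(y_breaks)

# How many ticks each axis will show
print(length(x_breaks))
print(length(y_breaks))
```

Breaks only decide where ticks are drawn. The visible range is a separate setting, which is why the cell above also calls `expand_limits()`.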

In [None]:
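# Same plot as the previous cell, with one more layer added at the end

pp1 <- ggplot(USDA, aes(Calories, TotalFat)) + geom_point(color = "green", alpha = 1/10)
# Add the black-and-white theme
pp1 <- pp1 + theme_bw()
# Set the positions of the tick marks (breaks) on both axes
pp1 <- pp1 + scale_x_continuous(breaks = seq(0, 1000, 100)) + scale_y_continuous(breaks = seq(-10, 110, 10))
# Make sure the axes cover at least these ranges
pp1 <- pp1 + expand_limits(x = c(0, 1000), y = c(-10, 110))
# Add a title and a linear regression line
pp1 <- pp1 + labs(title = "USDA") + geom_smooth(method = lm, se = FALSE, color = "orange")
# Add a second linear fit with stat_smooth(), this time keeping its default confidence band
pp1 + stat_smooth(method = "lm")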

## <span style="background:yellow">YOUR TURN</span>

Explore the data for negative correlations.

Run the cell below to see the raw correlations.


In [None]:
# Potentially alter this cell to filter into True/False
cor(USDA[,3:16])

#### 1) Plot the two variables that are most strongly negatively correlated.

In [None]:
# Add your code under this comment.
# -----------------------------------------






#### 2) Repeat, adding a regression element.

In [None]:
# Add your code under this comment.
# -----------------------------------------






# SAVE YOUR NOTEBOOK -- Then "Close & Halt" the notebook.