## Module 4 Practice - Histograms

This practice notebook has exercises for plotting histograms in R using the ggplot and plot.ly libraries.

A histogram is a bar chart where each bar (rectangle) represents the count of data items that fall into the corresponding bin on the x axis.
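
To make the binning idea concrete, here is a minimal base R sketch that counts values per bin without drawing anything. The `calories` vector and the 20-unit bin edges are made-up illustration values, not taken from the USDA file.

```r
# Toy data (hypothetical values, not from the USDA file)
calories <- c(12, 18, 25, 31, 33, 47, 52, 58, 59, 74)

# Bin edges every 20 units; right = FALSE makes each bin [lower, upper)
breaks <- seq(0, 80, by = 20)
bins <- cut(calories, breaks = breaks, right = FALSE)

# The bar heights of a histogram are simply the counts per bin
bin_counts <- table(bins)
print(bin_counts)
```

Each entry of `bin_counts` is the height one bar would have if these bins were drawn as a histogram.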

In ggplot, histograms can be plotted either by modifying `geom_bar`, or simply by using the `geom_histogram()` geometry.

We will read a USDA dataset to plot some histogram examples. 

In [None]:
usda_data = read.csv("/dsa/data/all_datasets/USDA.csv")
head(usda_data)

In [None]:
summary(usda_data)

#### A little data carpentry!

Remove the rows that have NA values in the `Calories` variable.

In [None]:
# Keep only the rows where Calories is not NA
usda_data <- usda_data[!is.na(usda_data$Calories), ]
summary(usda_data)

### Qplot in ggplot

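`qplot()` is a convenient wrapper around ggplot that creates a number of different plot types using a consistent calling scheme, similar to the base graphics capability of R.
The name is short for **quick plot**. Note that `qplot()` is deprecated as of ggplot2 3.4.0, so recent versions print a deprecation warning when it is called.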

In the plot below, a histogram is requested by passing the string **`histogram`** to the **`geom`** parameter.
`binwidth` tells ggplot to form bins of the specified width.
With a `binwidth` of 10, each bin in the plot below covers a 10-calorie range on the x axis (by default ggplot centers the bins on multiples of the bin width, for example 45 to 55),
and the data items falling within each range are counted and drawn as the frequency of the corresponding bin.

In [None]:
library(ggplot2)
qplot(Calories, data=usda_data, geom="histogram",binwidth=10)
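
To check which ranges the bins actually cover, the computed bin table can be inspected instead of reading it off the picture. The sketch below uses a small made-up data frame (`toy` is a hypothetical name) and ggplot2's `layer_data()` helper; it assumes ggplot2 is installed and keeps the default bin alignment.

```r
library(ggplot2)

# Toy data (hypothetical values, not from the USDA file)
toy <- data.frame(Calories = c(50, 52, 58, 61, 75, 78, 79))

p <- ggplot(toy, aes(x = Calories)) + geom_histogram(binwidth = 10)

# layer_data() returns the data computed for the first layer,
# including the lower edge, upper edge, and count of every bin
bin_table <- layer_data(p)[, c("xmin", "xmax", "count")]
print(bin_table)
```

Every row of `bin_table` is one bar: `xmax - xmin` equals the requested `binwidth`, and the counts add up to the number of rows in the data.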

The **`weight`** aesthetic, when used with histograms or bar charts, creates weighted histograms and bar charts. The height of each bar then no longer represents the count of observations in the bin, but the sum of some other variable over those observations.

In [None]:
library(ggplot2)
qplot(Calories, data=usda_data, geom="histogram", 
      weight=Protein, binwidth=10, ylab = "Protein") 
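
As a rough illustration of what the weighting computes, the base R sketch below derives both bar heights by hand for a tiny made-up table. The data values and the 20-calorie bins are assumptions chosen for readability; this mirrors the idea, not ggplot's internal code.

```r
# Toy data (hypothetical values, not from the USDA file)
toy <- data.frame(
  Calories = c(12, 18, 25, 31, 33, 47),
  Protein  = c(1.0, 2.5, 0.5, 4.0, 3.0, 6.0)
)

# Assign each row to a 20-calorie bin
toy$bin <- cut(toy$Calories, breaks = seq(0, 60, by = 20), right = FALSE)

# Unweighted bar height: number of rows per bin
counts <- tapply(toy$Calories, toy$bin, length)

# Weighted bar height: sum of Protein per bin
protein_sums <- tapply(toy$Protein, toy$bin, sum)

print(counts)
print(protein_sums)
```

The x axis is still binned by `Calories` in both cases; only the quantity accumulated within each bin changes.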

## <span style="background:yellow">Your Turn</span>

#### 1) Change the plot above to weight the histogram by `SaturatedFat`


In [None]:
# 1) Add your code below this comment
# ---------------------------------









#### 2) Put the two histograms (protein and fat weighted) onto a grid, then provide a comparative analysis of the distribution of calories in the USDA data.

In [None]:
# 2) Add your code below this comment
# ---------------------------------








### Layered Grammar of ggplot

We can use the ggplot syntax instead of qplot to create plots that follow the layered grammar convention of ggplot.
The same histogram can also be plotted like this:

In [None]:
library(ggplot2)
ggplot(usda_data, aes(x=Calories)) + geom_histogram(binwidth=10, fill="lightblue") + ylab("Frequency")
    


**Note**: We altered the color of the histogram bars by specifying the `fill` attribute in the `geom_histogram()` layer.

### Density Curve on Histogram

A density curve, which estimates the probability density function of a variable, can be plotted on top of that variable's histogram.
The density is overlaid on the histogram as a semi-transparent density plot.
The `alpha` value controls the level of transparency, as shown in the example below.

This shows the layered structure of ggplot, where two layers (histogram and density) are drawn on the same plot.
<span style="background:yellow">Note</span>: **..density..** is a derived variable computed by the ggplot library on the fly. ggplot2 3.4.0 and later prefer the spelling `after_stat(density)`; the `..density..` form still works there but emits a deprecation warning.

In [None]:
# Histogram with density plot
ggplot(usda_data, aes(x=Calories)) + 
 geom_histogram(aes(y=..density..),  # this ..density.. is a derived variable, computed during call to geom_histogram()
                colour="black", fill="lightblue", binwidth=10) +
 geom_density(alpha=.2, fill="red") 
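
For a histogram to share a y axis with a density curve, its bar heights have to be rescaled from counts to densities. The base R sketch below shows that rescaling on made-up values (the vector and the bin width of 20 are illustration choices): each height becomes count / (n × bin width), so the bar areas sum to 1.

```r
# Toy data (hypothetical values, not from the USDA file)
calories <- c(12, 18, 25, 31, 33, 47, 52, 58, 59, 74)
binwidth <- 20

# Compute the bins without drawing the plot
h <- hist(calories, breaks = seq(0, 80, by = binwidth), plot = FALSE)

# Density height = count / (number of observations * bin width)
manual_density <- h$counts / (length(calories) * binwidth)

# Bar areas (height * width) add up to 1, like a probability density
total_area <- sum(manual_density * binwidth)

print(manual_density)
print(total_area)
```

`manual_density` matches the `h$density` values that `hist()` reports, which is the same kind of quantity the derived density variable supplies to the y aesthetic.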

## <span style="background:yellow">Your Turn</span>

#### 3) Provide side-by-side density histograms of the Protein-weighted calories and the Total Fat-weighted calories.


In [None]:
# 3) Add your code below this comment
# ---------------------------------








### scale_fill_gradient


We can use `scale_fill_gradient` to fill bars with colors according to their frequency.
In the plot below, blue bars mark the calorie bins that contain the most food items,
and tan bars mark bins that contain very few items in the dataset.

In [None]:
# Create a plot variable 'p' from the data;
# the aesthetic mapping aes() positions data elements on the x axis by their Calories field
p <- ggplot(usda_data, aes(x=Calories))

# Note that ..count.. is a variable that is computed on the fly (implicitly)
# This ..count.. is then used to look up the appropriate color from the scale_fill_gradient
p + geom_histogram(aes(fill = ..count..), binwidth=10) +
  scale_fill_gradient("Count", low = "tan", high = "blue")
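
The count-to-color lookup can be approximated by hand. The base R sketch below rescales some made-up counts to the range 0 to 1 and interpolates between the two end colors with `colorRamp()`. It is only an approximation: `colorRamp()` interpolates in RGB here, whereas `scale_fill_gradient` interpolates in the Lab color space by default, so the intermediate shades will differ somewhat.

```r
# Hypothetical bin counts to color
counts <- c(5, 40, 120, 300)

# Rescale counts to [0, 1]: the smallest count maps to 0, the largest to 1
position <- (counts - min(counts)) / (max(counts) - min(counts))

# colorRamp() returns a function that interpolates between the two colors
ramp <- colorRamp(c("tan", "blue"))
rgb_values <- ramp(position)   # one row of R, G, B values per count

# Convert the RGB rows to hex color strings
bar_colors <- rgb(rgb_values, maxColorValue = 255)
print(bar_colors)
```

The smallest count receives the `low` color, the largest receives the `high` color, and everything in between gets a blend proportional to its position in the count range.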

## <span style="background:yellow">Your Turn</span>

#### 4) Repeat the above plot using a tan-to-red gradient.


In [None]:
# 4) Add your code below this comment
# ---------------------------------










#### 5) Repeat the above plot using a yellow-to-red gradient.


In [None]:
# 5) Add your code below this comment
# ---------------------------------









#### 6) Which color gradient scheme (tan - blue, tan - red, yellow - red) do you feel most accurately conveys the transition in values?



#### 7) What effect of the human visual system do you feel is influencing your answer above?


# SAVE YOUR NOTEBOOK, then File > "Close and Halt"