# Module 5 Lab

This lab covers plotting histograms, line graphs and heat maps. 
Histograms and line graphs help reveal trends in data, while heat maps offer more insight when looking for patterns across two variables. 

The motor vehicle theft data used in this notebook are incidents reported in Chicago. 
It has 3 variables:
  * _Date_ (the date and time the incident was reported), 
  * _Latitude_ and _Longitude_ (coordinates where the incident occurred).

We will read the data from the thefts CSV file at "/dsa/data/all_datasets/mvt.csv" into a dataframe called `mvt_data`.

In [None]:
mvt_data=read.csv("/dsa/data/all_datasets/mvt.csv")
str(mvt_data)

We can see that the variable Date is stored as text in the original dataset: a factor in R versions before 4.0, or a character vector in R 4.0 and later, where `read.csv` no longer converts strings to factors by default. 
To make this more useful, we are going to extract the Date, Day-of-Week, and Hour from the Date variable.
We do this by converting it to a POSIX date-time object, from which the date and time parts can be extracted. 
To convert the `Date` variable we can use the `strptime` function (string parse time), which returns a `POSIXlt` object. 
Then we create a variable "Day" to save the weekdays and a variable "Hour" to save the hours. 
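
As a small illustration of the conversion described above, the sketch below parses two made-up timestamps. The values in `sample_dates` are hypothetical and only assumed to follow the same month/day/year hour:minute layout as the lab file; the `tz = "UTC"` argument is added here just to make the result reproducible.

```r
# Hypothetical sample timestamps (not read from the lab file)
sample_dates <- c("12/31/12 23:15", "1/1/13 0:05")

# strptime() returns a POSIXlt object; the time zone is fixed for reproducibility
parsed <- strptime(sample_dates, format = "%m/%d/%y %H:%M", tz = "UTC")

class(parsed)      # "POSIXlt" "POSIXt"
parsed$hour        # hour of day for each timestamp: 23 and 0
weekdays(parsed)   # day names in the session's locale
```

Because the result is a `POSIXlt` object, components such as `$hour` can be read directly, which is what the next cell relies on.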

In [None]:
# The format specified should match the input date format to extract respective information.
# This converts the Date variable from its string format to a POSIX Date/Time object 
mvt_data$Date=strptime(mvt_data$Date,format="%m/%d/%y %H:%M")

# Then we extract Date / Day / Time parts

# To find out the day of week, use weekdays()
mvt_data$Day = weekdays(mvt_data$Date)

# mvt_data$Date$hour will extract the hour from Date
mvt_data$Hour = mvt_data$Date$hour
str(mvt_data)

**External Reference:**

- [POSIXlt](http://www.cyclismo.org/tutorial/R/time.html)
- [strptime](http://rfunction.com/archives/1912)


#### We can examine a histogram using the R `table()` function.

In [None]:
# The distribution of vehicle thefts by day of the week. 

table(mvt_data$Day)

Notice that the table above is in alphabetical order.
A histogram of the `Day` variable will also be in alphabetical order rather than in the order of the weekdays. 
If we want to generate a histogram that is intuitive to us (humans), we will want the days in the chronological order we are accustomed to.
We will see how to do this later.

In [None]:
library(ggplot2)

# Our data, with a univariate aesthetic
ggplot(mvt_data,aes(x=Day))+
   geom_bar()+                 # using the bar geometry
   xlab("Day of the week") + 
   ylab("Total Motor Vehicle thefts")

## <span style="background:yellow">Your Turn</span>

 1. Display a tabular histogram of the motor vehicle thefts by hour.  
 2. Then, display the histogram as a bar plot.


In [None]:
# 1) Write your code below this comment
# ------------------------------------------------



In [None]:
# 2) Write your code below this comment
# ------------------------------------------------




#### Converting Factors to Ordered Factors

To fix the day ordering from alphabetic to expected order, 
we will convert the **Day** to be _ordered factors_. 
This is a very common need with data sets that have nominal variables that are typically ordered by humans, such as days of the week, months of the year, etc.

We can save the **Day** variable as an ordered factor, 
after which ggplot() will plot the days in the listed order instead of alphabetically.


**Convert** "Day" in "mvt_data" into a factor with ordered values. 
First we will define the ordered levels as a vector of values.
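
To see the effect in isolation, here is a minimal sketch on a toy vector. The `days` vector and the names `week_levels` and `days_ordered` are hypothetical and are not part of the lab data.

```r
# Hypothetical toy vector of day names (not taken from the lab data)
days <- c("Monday", "Sunday", "Friday", "Monday", "Sunday", "Monday")

# Tabulating the plain character vector sorts the names alphabetically
table(days)

# Declaring the levels explicitly fixes the order; values not listed would become NA
week_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")
days_ordered <- factor(days, levels = week_levels, ordered = TRUE)

# The table now follows the declared order and shows zero counts for absent days
table(days_ordered)
```

Note that any value missing from `levels` is silently turned into `NA`, so the spelling of each level has to match the data exactly.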

In [None]:
# Assign the levels as
# c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
#   review c() function: http://astrostatistics.psu.edu/su07/R/html/base/html/c.html
mvt_data$Day = factor(mvt_data$Day,ordered=TRUE,
                      levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))

Then, we can plot a histogram and get our distribution of automotive thefts over days of the week in a natural order.

In [None]:
library(ggplot2)

# Same histogram as above
ggplot(mvt_data,aes(x=Day))+
   geom_bar()+
   xlab("Day of the week") + 
   ylab("Total Motor Vehicle thefts")

While looking at the univariate distribution is interesting, 
bringing additional information into the picture adds to the information value of the plot.

In [None]:
# Distribution of vehicle thefts by day of the week and hour. 
# The table function allows us to specify two columns of data 
table(mvt_data$Day,mvt_data$Hour)

**NOTE**: Since we have updated Day to be an ordered factor, the rows of the table now follow the order of the factor levels.

We can store this table as a dataframe to plot a line graph and a heat map later.

The dataframe has 168 observations, one for each combination of day of the week and hour of the day (24*7=168). 
 * The first variable gives the day of the week. 
 * The second variable gives the hour of the day. 
 * The third variable, "Freq", gives the total crime count for that day and hour.
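
The structure described above can be sketched on a toy example. The `toy` data frame below is hypothetical, with two days and two hours instead of seven and twenty-four.

```r
# Hypothetical toy observations: a day label and an hour for each incident
toy <- data.frame(
  Day  = factor(c("Sat", "Sat", "Sun", "Sun", "Sun"), levels = c("Sat", "Sun")),
  Hour = c(0, 0, 0, 1, 1)
)

# Two-way table of counts, then flattened to one row per (Day, Hour) cell
toy_counts <- as.data.frame(table(toy$Day, toy$Hour))

toy_counts          # columns are named Var1, Var2 and Freq
nrow(toy_counts)    # 2 days * 2 hours = 4 rows, including the zero-count cell
```

Every combination of levels gets a row, even when its count is zero, which is why the lab data yields exactly 24*7 rows.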

In [None]:
# Convert the contingency table returned by table() into a dataframe.
DayHourCounts = as.data.frame(table(mvt_data$Day,mvt_data$Hour))


str(DayHourCounts)

**Notice** that `Var1` is _Day-of-Week_ and `Var2` is the _Hour-of-Day_.
We will convert `Var2` from a factor into a numeric variable in the next step.


**External Reference:**

- [as.data.frame](https://www.r-bloggers.com/converting-a-list-to-a-data-frame/)
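
The conversion in the next cell goes through `as.character()` first. A minimal sketch of why, using a hypothetical factor named `hours` whose labels are hour values:

```r
# Hypothetical factor whose labels are hour values, similar to Var2
hours <- factor(c("0", "10", "23"))

# as.numeric() on a factor returns the internal level codes, not the labels
as.numeric(hours)                 # 1 2 3

# Converting to character first recovers the values the labels represent
as.numeric(as.character(hours))   # 0 10 23
```

Skipping the `as.character()` step would shift every hour by one in this lab, since the level codes start at 1 while the hours start at 0.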

In [None]:
# Convert Var2 in DayHourCounts into numbers. 
# Create a new variable called Hour and assign the numeric form of Var2 to it. 
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
str(DayHourCounts)

### Visualizing our table of motor vehicle thefts 

Plot a line graph from **DayHourCounts** with **Hour** on the x-axis and **Freq** on the y-axis. Draw one line for each day, giving each line a different color. 

In [None]:
## Our geometry is now a line:
ggplot(DayHourCounts,aes(x=Hour,y=Freq))+
   geom_line(aes(group=Var1,color=Var1))

In the plot above, we are essentially trying to visualize three elements of data:
 * Day of Week
 * Hour of Day
 * Frequency of thefts


## <span style="background:yellow">Your Turn</span>

Consider the data in the plot above.
We see a peak in the 8am - 9am time frame.  

Answer and elaborate on these questions in the box below:

  * Do you feel this data spike at 8-9am is an accurate reflection of when cars are being stolen in Chicago?
  * Explain, why or why not.


## <span style="background:yellow">Your Turn</span>

 1. Convert the plot above to use _Hour-of-Day_ as the trends.
 2. Then render the trends as points, not lines.



In [None]:
# 1) Write your code below this comment
# ------------------------------------------------








In [None]:
# 2) Write your code below this comment
# ------------------------------------------------









## Heatmaps

Heatmaps provide a visualization of a bivariate frequency or density.
Heatmaps encode the frequency or density into a matrix, defined by two variables which are discretely partitioned.
Ordinal, or ordered nominal, values are a natural discrete partition of the space. 

In the case of our current data frame, those variables are _Day-of-Week_ and _Hour-of-Day_.
The result, conceptually, is a grid of values, essentially a matrix.
In this case, we are using the _Day-of-Week_ and _Hour-of-Day_ as positions, and by using a **tile** geometry we are establishing a gridded rendering.
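
To make the "grid of values" idea concrete, the sketch below reshapes a small table of counts into a matrix with base R's `xtabs()`. The `toy_counts` data frame and the name `count_grid` are hypothetical; the lab itself plots directly from the long-format dataframe and does not need this step.

```r
# Hypothetical long-format counts: one row per (day, hour) cell
toy_counts <- data.frame(
  Day  = factor(c("Sat", "Sun", "Sat", "Sun"), levels = c("Sat", "Sun")),
  Hour = c(0, 0, 1, 1),
  Freq = c(2, 1, 0, 2)
)

# xtabs() sums Freq within each Day x Hour cell, giving the grid a heatmap colors in
count_grid <- xtabs(Freq ~ Day + Hour, data = toy_counts)

count_grid        # rows are days, columns are hours
dim(count_grid)   # 2 2
```

Each cell of `count_grid` corresponds to one tile in a heatmap, with the cell value mapped to the fill color.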



Then we can plot a heatmap from **DayHourCounts** with **Hour** on the x-axis and **Day** (stored as `Var1`) on the y-axis. 
We will use a gradient to represent the frequency of thefts, setting _white_ as the low color and _red_ as the high color to fill the tiles. 

In [None]:
# Again, keep the days in weekday order, just as we did above.
DayHourCounts$Var1 = factor(DayHourCounts$Var1,ordered=TRUE,
                            levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))


ggplot(DayHourCounts,aes(x=Hour,y=Var1))+
    geom_tile(aes(fill=Freq)) +                 # Using tiles instead of points
    # Note that we are filling each tile / rectangle with a gradient from white to red.
    scale_fill_gradient(name="Total MV thefts",low="white",high="red") +
    theme(axis.title.y=element_blank())

# The fill parameter corresponds to the color of rectangles for the total crime for a particular hour. 

The result is that for each hour and each day there is a rectangle. 
The color intensity of each rectangle shows the number of crimes that happened in that hour on that day. 

According to the legend, a darker red color corresponds to more crimes. 
So, from the plot it is evident that more crimes occur around midnight on weekends. 
Friday nights are also when many thefts happen. 
Color schemes should be changed according to the problem. 
Different color schemes are helpful in different situations.

Please recall the cultural meanings of colors, as referenced in Day 2.


## <span style="background:yellow">Your Turn</span>

 1. Pivot the heatmap to be _Days-of-Week_ across the bottom, and _Hour-of-Day_ along the side.
 2. Change the plot to use a gray gradient.


In [None]:
# 1) Write your code below this comment
# ------------------------------------------------








In [None]:
# 2) Write your code below this comment
# ------------------------------------------------








# SAVE YOUR NOTEBOOK, and then "File > Close and Halt"