# Exploration of Data
## Data Science 350

This notebook contains an introduction to the methods data scientists use to prepare and explore a data set. Data scientists spend a lot of time manipulating data to clean and prepare it (**data munging**) and to understand the relationships among the variables in the data (**data exploration**).

These lessons are divided into several parts. In each part you will learn how to use the visualization tools available in R.

- **Summarizing and manipulating data**:
  * How large is it?
  * What columns are of interest?
  * Missing data?
- **Overview of ggplot2**, the package we will use extensively in this course for plotting.
- **Overview of plot types** is a review of creating basic plot types used to construct visualizations.
- **Using Aesthetics** is an overview of how to project additional plot dimensions using plot aesthetics.
- **Faceted plotting**, also known as conditioned plotting or lattice plotting, introduces a powerful method for visualizing higher dimensional data.

## About the data set

This data set is from the [University of California, Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Automobile). The data was compiled by Jeffrey C. Schlimmer from the following sources:

- 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook. 
- Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038 
- Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037


## Load and examine the data set

Let's get started. The function shown in the cell below loads the data from the .csv file. Some minimal clean up is performed on the data. Rows with missing values are dropped and some columns are converted from strings containing numbers to numeric data. 

Execute the code in this cell to load the data into your notebook. **Make sure you have the .csv file in your working directory!**

In [None]:
read.auto = function(file = 'Automobile price data _Raw_.csv'){
  ## Read the csv file
  auto.price <- read.csv(file, header = TRUE, 
                      stringsAsFactors = FALSE)

  ## Coerce some character columns to numeric
  numcols <- c('price', 'bore', 'stroke', 'horsepower', 'peak.rpm')
  auto.price[, numcols] <- lapply(auto.price[, numcols], as.numeric)

  ## Remove cases or rows with missing values. In this case we keep the 
  ## rows which do not have nas. 
  auto.price[complete.cases(auto.price), ]
}
auto.price = read.auto()

Notice that there are some missing values in the numeric columns. These are removed using the `complete.cases` function which returns a logical `TRUE` for rows without missing data.
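The coercion and filtering steps can be tried on a tiny example first. The sketch below is a minimal illustration, not the course data: the data frame `toy` and its values are made up, and it assumes missing entries appear as a non-numeric marker such as `'?'`.

```r
# Toy data frame (hypothetical values); '?' marks a missing price
toy <- data.frame(make = c('audi', 'bmw', 'dodge'),
                  price = c('13950', '?', '6377'),
                  stringsAsFactors = FALSE)

# as.numeric() turns non-numeric strings into NA (the warning is suppressed here)
toy$price <- suppressWarnings(as.numeric(toy$price))

# complete.cases() is TRUE only for rows with no NA in any column
keep <- complete.cases(toy)
toy.clean <- toy[keep, ]

print(keep)
print(toy.clean)
```

Only the rows flagged `TRUE` survive the subsetting, which is the same pattern used in the last line of `read.auto`.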

Next, have a first look at a summary of the columns in this data set by executing the code in the cell below.

In [None]:
str(auto.price)

Some key things to notice here are:

- There are 195 rows or cases in the data set.
- The data set has 26 variables or columns.
- Columns have a type of either `chr` (character), `num` (numeric or floating point number), or `int` (integer). 

### Summary statistics

You can also learn a lot about a data set by looking at statistical summaries. The R `summary` function does just this. In this case you will only look at a subset of columns to keep the results manageable. Among other statistics, the summary reports the mean, defined as:

$$mean = \mu = \frac{1}{n} \sum_i x_i$$

****
**Note:** `summary` uses rounding conventions which are different from other R functions. Watch out!
****

Execute the code in the cell below and note the result.

In [None]:
auto.sub = auto.price[, c('wheel.base', 'curb.weight', 'engine.size', 'horsepower', 'city.mpg', 'price')]

summary(auto.sub)

This summary provides some useful information:

- The minimum and maximum values of each variable are displayed.
- The quartiles are shown: first, second (median), and third. Note that the distance from the median to the third quartile differs from the distance from the first quartile to the median, indicating the distributions are skewed.
- The mean is displayed. Compare the mean and the median, noting that they are often quite different. Again this is a sign of a skewed distribution.

Next, compute the standard deviation of the numeric columns. The standard deviation is a measure of the dispersion of a distribution defined as:

$$sd = \sigma = \sqrt{\frac{1}{n-1} \sum_i (x_i - \mu)^2}$$

The code below applies the `sd` function to each of the columns of the data frame `auto.sub`. Since a data frame is a list, the `lapply` function is used to iterate the `sd` function over the list.  

Execute this code and examine the result.

In [None]:
lapply(auto.sub, sd)
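As a check on the definition, the standard deviation can be computed by hand and compared with `sd`. This is a minimal sketch that assumes the built-in `mtcars` data frame as a stand-in for `auto.sub`; it also shows `sapply`, which returns a named numeric vector rather than a list.

```r
# Built-in mtcars data stands in for auto.sub (an assumption for this sketch)
x <- mtcars$mpg
n <- length(x)

# Sample standard deviation from the definition, with the n - 1 denominator
sd.manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# sapply() simplifies the per-column results to a named numeric vector
sd.cols <- sapply(mtcars[, c('mpg', 'hp', 'wt')], sd)

print(all.equal(sd.manual, sd(x)))
print(sd.cols)
```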

### Examine the data frames

Next, let's look at the first and last few lines of the data frame using the `head` and `tail` functions. Execute the code in the cells below to examine the first and last few rows of the data frame.

In [None]:
head(auto.price, n = 10)

In [None]:
tail(auto.price)

### Frequency tables

The summary techniques we have used so far are suitable only for numeric variables. Categorical variables have values which are typically unordered. Therefore, we need a method to summarize categorical variables based on counts.

The R `table` function lets you create a frequency table. The frequency table shows the count of each category of the variable. Execute the following code to show the number of cars in the data set by unique manufacturer.

In [None]:
table(auto.price$make)

Examine this data. Notice that some manufacturers, like alfa-romero and chevrolet, have only three car types in the sample. Toyota has the most cars in this sample.

Or, you can look at the frequency of a combination of two categorical variables. Execute the code in the cell below and examine the results:

In [None]:
table(auto.price$make, auto.price$fuel.type)

Notice there are quite a few zeros in the frequency tables. 
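Counts in a two-way table are often easier to compare as proportions. The sketch below is an illustration only, assuming the built-in `mtcars` data in place of the auto data; `prop.table` with `margin = 1` converts each row of counts to proportions that sum to 1.

```r
# Built-in mtcars data stands in for the auto data (an assumption for this sketch)
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)
print(counts)

# Row-wise proportions: each row sums to 1
props <- prop.table(counts, margin = 1)
print(round(props, 2))

# Number of empty cells in the two-way table
print(sum(counts == 0))
```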

With more than two table variables, multiple tables are displayed. Execute the code in the cell below and examine the result. 

In [None]:
table(auto.price$make, auto.price$fuel.type, auto.price$aspiration)

**Your turn!** In the cell below, create and execute the code to create a frequency table of autos by body type and number of doors.

### Covariance and correlation

Now, you will work with the correlations between the numeric variables. Covariance is defined as follows:

$$cov(x,y) = E((x - \mu_x) (y - \mu_y))$$

Pearson correlation is similar to covariance, but with normalization by the product of the standard deviations, and is defined as follows:

$$cor(x,y) = \frac{E((x - \mu_x) (y - \mu_y))}{\sigma_x \sigma_y}$$

There are also non-parametric measures of correlation. One widely used non-parametric method is Kendall's rank correlation. The Kendall correlation compares every pair of observations, counting the pairs in which the two variables are ordered the same way (concordant) and the pairs in which they are ordered in opposite ways (discordant). As a rank or non-parametric method, Kendall correlation is robust to outliers. Kendall's rank correlation can be computed as follows:

$$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2}$$

Execute the code in the cell below to compute the covariance matrix for a subset of the data frame.

In [None]:
cov(auto.sub)

Next, compute the correlation matrix by executing the code in the cell below. 

In [None]:
cor(auto.sub)

The covariance is a bit hard to interpret since the variables are not normalized. The numeric ranges of the variables are quite different, leading to a large range of covariance values.

The correlation matrix is a bit easier to understand, since there is a normalization of the values. The diagonal values of the correlation matrix are all 1.0, as a result of the normalization. The magnitudes of the correlation values between the variables exhibit a significant range. 

Some pairs of variables are weakly correlated, with small magnitudes. For example, **wheel base** and **horsepower** are weakly correlated. Other pairs of variables are strongly correlated. For example, **engine size** and **curb weight** are strongly correlated.
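The definitions above can be verified numerically. The following is a minimal sketch, assuming the built-in `mtcars` data for the Pearson check and two small made-up vectors without ties for the Kendall check. With ties, the `method = 'kendall'` option of `cor` applies a tie correction, so the simple formula would not match exactly.

```r
# Pearson correlation is the covariance divided by the product of the
# standard deviations (mtcars stands in for the auto data)
x <- mtcars$wt
y <- mtcars$mpg
cor.manual <- cov(x, y) / (sd(x) * sd(y))
print(all.equal(cor.manual, cor(x, y)))

# Kendall's tau on small tie-free toy vectors (hypothetical values)
a <- c(1, 2, 3, 4, 5)
b <- c(2, 1, 4, 3, 5)

# Enumerate all pairs of observations; a pair is concordant when both
# variables move in the same direction, discordant when they move oppositely
idx <- combn(length(a), 2)
s <- sign(a[idx[1, ]] - a[idx[2, ]]) * sign(b[idx[1, ]] - b[idx[2, ]])
tau.manual <- (sum(s > 0) - sum(s < 0)) / choose(length(a), 2)

print(tau.manual)
print(cor(a, b, method = 'kendall'))
```

Here 8 of the 10 pairs are concordant and 2 are discordant, giving a tau of 0.6.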

## Exploration and Visualization of Data

### Why visualization?

Visualization is an essential method in any data scientist’s toolbox. Visualization is a key first step in the exploration of most data sets. As a general rule, you should never start creating models until you have examined the data and understand the relationships. Otherwise, you risk wasting your time creating models blindly. Visualization is also a powerful tool for presentation of results and for determining sources of problems with analytics. 

The concepts of exploring a data set visually were pioneered by John Tukey in the 1960s and 1970s. Tukey consolidated his many ideas on data exploration into a book in the late 1970s, ***John Tukey, Exploratory Data Analysis, 1977, Addison-Wesley***.

![](img/Tukey.jpg)

Bill Cleveland documented his seminal work in visualization of complex data sets in his book, ***William S. Cleveland, Visualizing Data, 1993, Hobart Press***.

![](img/Cleveland.jpg)

The key concept of exploratory data analysis (EDA), or visual exploration of data, is to understand the relationships in the data set. Specifically, using visualization when you approach a new data set, you can:

- Explore complex data sets, using visualization to develop understanding of the inherent relationships.
- Use different chart types to create multiple views of data to highlight different aspects of the inherent relationships.
- Use plot aesthetics to project multiple dimensions. 
- Apply conditioning or faceting methods to project multiple dimensions.




### Why ggplot2?

The R ggplot2 package is a powerful visualization and presentation graphics package. With ggplot2 you can explore complex relationships in a data set. Furthermore, you can use ggplot2 to create presentation graphics for your colleagues, your boss and your instructor!

The ggplot2 package has excellent documentation. The documentation includes numerous examples. The ggplot2 documentation has a [comprehensive index](http://docs.ggplot2.org/current/). The [ggplot2 Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) is another excellent resource.

The [R Graphics Cookbook](http://shop.oreilly.com/product/0636920023135.do) by Winston Chang, published by O'Reilly, is an excellent resource for learning ggplot2 in depth.

![](img/Chang.jpg)

### Basic chart types

Now that you have loaded and had a first look at the data, it's time to get to work making some charts.

There are innumerable chart types that are used for data exploration. In this tutorial you will work with the most used chart types:

- **Bar plots**
- **Histograms**
- **Box plots**
- **Kernel Density Estimation Plots**
- **Violin plots**
- **Scatter plots**
- **2d Kernel Density Plots**
- **Hexbin plots**
- **Heat maps**
- **Line plots**

The following exercises give you some practice using these chart types. 

### Bar plots

Bar plots are used to display the counts or frequency of unique values of a categorical variable. The height of the bar represents the count for each unique category of the variable. 

The code below uses ggplot2 to create the bar plot. The `ggplot` function defines the data frame and, within `aes`, the column(s) to be plotted. The plot type is defined by the function `geom_bar`. The `+` operator chains the functions.

Execute this code to create the bar plot.

In [None]:
require(ggplot2)
ggplot(auto.price, aes(body.style)) + ## Specify the data frame and columns. Note the + chain operator
     geom_bar()  ## Function for the plot type

From this display you can see that sedan is the most frequent body type and convertible the least frequent.

It would be easier to understand this plot if the frequencies of the categories were ordered. Further, if there are a large number of categories, rotating the text would help quite a bit. The code in the cell below performs both of these operations. Notice that an additional chaining operator, `+`, is used to add the additional text theme attribute.

Execute the code in the cell below to create an ordered bar plot.

In [None]:
ggplot(auto.price, aes(x=reorder(body.style,body.style, function(x) -length(x)))) + ## Function sorts the bars by count
                         geom_bar() + 
                         theme(axis.text.x = element_text(angle = 90, hjust = 1)) ## Theme function to control the axis text
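To see what the `reorder` call does on its own, it helps to try it outside the plot. The sketch below uses a short made-up vector of body styles (an assumption, not the course data). `reorder` computes a score per category with the supplied function, here the negative count, and sorts the factor levels by ascending score, so the most frequent category comes first.

```r
# Hypothetical body styles: sedan x3, hatchback x2, wagon x1
style <- c('sedan', 'wagon', 'sedan', 'hatchback', 'sedan', 'hatchback')

# Score each category by minus its count; levels are sorted by ascending score
ordered.style <- reorder(style, style, function(x) -length(x))
print(levels(ordered.style))

# The same ordering seen through a sorted frequency table
print(sort(table(style), decreasing = TRUE))
```

Since ggplot2 draws bars in factor-level order, this is what puts the tallest bar on the left.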

**Your turn!** In the cell below, create and execute the code to create a bar chart of the number of doors. 

### Histograms

Histograms are related to bar plots. Histograms are used for numeric variables. Whereas a bar plot shows the counts of unique categories, a histogram shows the number of data values in each bin. The bins divide the range of the variable into segments of equal width. The vertical axis of the histogram shows the count of data values within each bin.

Execute the code in the cell below to create a histogram using the `geom_histogram` function. 

In [None]:
ggplot(auto.price, aes(price)) + geom_histogram() + 
       ggtitle('Histogram of auto price')

The histogram shows that the prices of automobiles are concentrated at the lower end, with only a few high priced autos forming a long tail to the right.
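The binning behind a histogram can be reproduced in base R. This is a minimal sketch, assuming the built-in `mtcars` data in place of the auto data and an arbitrary choice of 5 equal-width bins. `cut` assigns each value to a bin, `table` counts the values per bin, and the counts agree with those computed by `hist`.

```r
# Built-in mtcars data stands in for the auto data (an assumption)
x <- mtcars$hp
nbins <- 5  # arbitrary number of bins for illustration

# Equal-width bin edges spanning the range of the data
breaks <- seq(min(x), max(x), length.out = nbins + 1)

# Assign each value to a bin and count the values per bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
print(counts)

# hist() computes the same counts when given the same breaks
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```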

### Box plots

Box plots, also known as box and whisker plots, were introduced by John Tukey in 1970. Box plots are another way to visualize the distribution of data values. In this respect, box plots are comparable to histograms, but are quite different in presentation.

A diagram showing the interpretation of a box plot is shown in the figure below.

![Interpretation of a box plot](BoxPlot.png)

On a box plot the median value is shown with a dark bar. The inner two quartiles of data values are contained within the 'box'. The 'whiskers' enclose the majority of the data, extending to the most extreme values that lie within 1.5 times the interquartile range of the edges of the box. Outliers are shown by symbols beyond the whiskers.

Several box plots can be stacked along an axis. The data are divided using a 'group by' operation, and the box plots for each group are stacked next to each other. In this way, the box plot allows you to display two dimensions of your data set. 
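The quantities drawn on a box plot can be computed directly. The sketch below is illustrative only: it assumes the built-in `mtcars` data and base R's `boxplot.stats`, which defines the box edges as Tukey's hinges. ggplot2 may compute the box edges with a slightly different quantile rule, so its values can differ a little on small samples.

```r
# Built-in mtcars data stands in for the auto data (an assumption)
x <- mtcars$hp

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
bs <- boxplot.stats(x, coef = 1.5)
hinges <- bs$stats[c(2, 4)]
iqr <- diff(hinges)

# Fences at 1.5 times the interquartile range beyond the box edges
lower.fence <- hinges[1] - 1.5 * iqr
upper.fence <- hinges[2] + 1.5 * iqr

# Points beyond the fences are drawn as outliers
outliers <- x[x < lower.fence | x > upper.fence]
print(bs$stats)
print(outliers)
```

The whiskers end at the most extreme data values inside the fences, not at the fences themselves.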

Execute the code in the cell below to create a box plot of the price of the automobiles.

In [None]:
ggplot(auto.price, aes(x = factor(0), y = price)) + geom_boxplot()

Box plots are most useful when the data are subdivided using another variable. In ggplot2 you can add an x variable of type factor to perform this group by operation. Execute the code below to display the box plots of price grouped by the `fuel.type` variable.

In [None]:
ggplot(auto.price, aes(x = factor(fuel.type), y = price)) + geom_boxplot() + 
    xlab('Fuel type') + ggtitle('Price by Fuel Type')

Examine these box plots and note the following:

- The median price of gas autos is less than the median price for diesel autos.
- There is considerable overlap between the interquartile ranges of the price distributions of gas and diesel autos. 

**Your turn!** In the cell below create and execute the code to display a box plot of auto price grouped by body style. 

### Kernel density estimation plots

You have used two methods for visualizing distributions. Now, you will look at another method, kernel density estimation. Kernel density estimation uses a moving density kernel to average the density of the distribution. Execute the code in the cell below to compute and plot a kernel density estimate of auto price.

In [None]:
ggplot(auto.price, aes(price)) + geom_density()

Examine this plot. Notice that most of the density lies on the left or low price side, with a long tail extending to the right. In other words, the distribution is right-skewed.
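A kernel density estimate can also be computed without plotting, using the base R `density` function. This is a minimal sketch, assuming the built-in `mtcars` data as a stand-in for auto price and the default Gaussian kernel. It checks two properties: the estimated density integrates to approximately 1, and for data with a long right tail the mean exceeds the median.

```r
# Built-in mtcars data stands in for the auto data (an assumption)
x <- mtcars$hp

# Kernel density estimate on an evenly spaced grid (Gaussian kernel by default)
kde <- density(x)

# Approximate the area under the curve with a Riemann sum
dx <- diff(kde$x[1:2])
area <- sum(kde$y) * dx
print(area)

# A long right tail pulls the mean above the median
print(c(mean = mean(x), median = median(x)))
```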

### Violin plots

Violin plots are a useful, and relatively new, plot type. The violin plot combines some attributes of a kernel density plot and the box plot. The violin plot shows a pair of symmetric kernel density plots. Like a box plot, the violin plot can be conditioned on a categorical (factor) variable.

The code in the cell below creates a violin plot of auto price, grouped by fuel type. The `draw_quantiles` argument adds lines showing the quartiles. Additionally, `trim = TRUE` trims the tails of each violin to the range of the data. Execute this code to create the plot.

In [None]:
ggplot(auto.price, aes(x = factor(fuel.type), y = price)) + 
    geom_violin(trim = TRUE, draw_quantiles = c(0.25, 0.5, 0.75)) +
    xlab('Fuel type')  + ggtitle('Price by Fuel Type')

The violin plot shows much the same information as the box plots you created earlier. However, the violin plot shows some additional details of the distributions. 

**Your turn!** In the cell below create and execute the code to display a violin plot of auto price grouped by body style.

### Scatter plots

A basic scatter plot shows the relationship between two variables. Points are plotted at the x-y coordinates of each value pair.

Execute the code in the cell below to create a scatter plot of city MPG on the horizontal axis vs. auto price on the vertical axis. 

In [None]:
ggplot(auto.price, aes(x = city.mpg, y = price)) + geom_point() + 
    xlab('City MPG') + ylab('Price') + 
    ggtitle('Relationship between City MPG and Price')

The scatter plot shows the relationship between price and city MPG. The relationship is rather complex, with several regions, which seem isolated from one another. 

**Your turn!** In the cell below create and execute the code to display a scatter plot of city mpg vs. horsepower.

### 2D kernel density plot

A 2 dimensional kernel density plot works much like the 1 dimensional version we have already explored. A useful application is to overlay the 2D kernel density plot on a scatter plot. Execute the code below to create the plot.

In [None]:
ggplot(auto.price, aes(city.mpg, price)) + geom_point() + 
    geom_density2d() +
    xlab('City MPG') + ylab('Price') +
    ggtitle('Relationship between City MPG and Price')

The 2D kernel density estimate is shown using contours in this case. Other options are available. 

The 2D kernel density plot highlights the fact that the relationship between city MPG and price has multiple clusters of concentration. The groups of outliers are even more prominent than before.

### Hexbin plot

When you have a large number of data points to plot, over-plotting becomes a significant problem. When over-plotting occurs it is impossible to tell how many points are on top of each other. Several methods are available to deal with severe over-plotting. You have already explored one method for dealing with high point density, contour plotting.

Now, you will explore hexbin plots. Hexbin plots show point density within hexagonal regions. The plot area can be divided into a selected number of bins both vertically and horizontally. Very high point density can be displayed with a hexbin plot.

Execute the code below to create a hexbin plot of auto price vs. city MPG.

In [None]:
ggplot(auto.price, aes(city.mpg, price)) + 
    stat_binhex(bins = 10) +
    xlab('City MPG') + ylab('Price') +
    ggtitle('Relationship between City MPG and Price')

Examine this plot noticing the following:

- There are a large number of autos with low price and fuel economy of about 30 MPG. Some cells to the left at low price also show a concentration of autos.
- Some outliers are clearly visible in the lower right and upper left. Some of these cells contain only one data point. 

**Your turn!** In the cell below create and execute the code to display a hexbin plot of city mpg vs. horsepower.

In [None]:
ggplot(auto.price, aes(city.mpg, horsepower)) + 
    stat_binhex(bins = 10) +
    xlab('City MPG') + ylab('Horsepower') +
    ggtitle('Relationship between City MPG and Horsepower')

### Heat maps

A heat map or raster plot allows you to visualize data which is naturally arranged on a regular grid or lattice. The heat map uses colors to show the values, or heat, on the grid. Data arranged on a lattice is fairly common in many areas, including spatial analysis, image analysis, and correlation analysis.

You will now create a heat map using some sample data. These data were collected from eruptions of the Old Faithful geyser. Execute the code below to plot a heat map for these data.

In [None]:
str(faithfuld)
ggplot(faithfuld, aes(waiting, eruptions)) +
 geom_point(aes(size = density), alpha = 0.3) +
 ggtitle('Eruption time vs. waiting time for Old Faithful geyser') +
 xlab('Waiting time') + ylab('Eruption time')

Examine the plot. Notice that there appear to be two clusters, but otherwise, the plot above is hard to interpret. This is largely a result of over-plotting. 

Next, execute the code in the cell below to display these data with a heat or raster plot. 

In [None]:
ggplot(faithfuld, aes(waiting, eruptions)) +
 geom_raster(aes(fill = density)) +
 ggtitle('Eruption time vs. waiting time for Old Faithful geyser') +
 xlab('Waiting time') + ylab('Eruption time')

The heat map clearly shows the two clusters of eruption and waiting times.

### Line plots

Line plots are used to show the relationship between variables with ordered values. For example, line plots are used for time series data.

As a first step, you will create a pair of variables with ordered values by executing the code in the cell below.

In [None]:
x = seq(1, 100, length.out = 100)
df = data.frame(x = x, y = x^2)

Now, create a line plot of these data by executing the code below. 

In [None]:
ggplot(df, aes(x, y)) + geom_line() + ggtitle('Line plot of x vs. y')

## View multiple dimensions with aesthetics

You have explored some basic plot types. All the plots you have created are projected onto a two dimensional surface. Thus, you have been limited to visualization of only two dimensions of the data set. 

Now, you will use plot aesthetics to view additional dimensions of the data set. Specific aesthetics are used to project additional dimensions onto a two dimensional plot surface. Additionally, careful choice of plot aesthetics can highlight certain aspects of a data set you wish to present.

There are many possible aesthetics one can use for visualization. Specific examples you will work with here include:

- Color
- Transparency
- Size
- Marker shape
- Aspect ratio
- Aesthetics specific to plot type

Using aesthetics with a bit of cleverness can lead to interesting insights about a data set. Don't overdo it! Too many colors, shapes and sizes can be confusing, obfuscating rather than enlightening.

### Color

Color can be used to view an additional dimension on a scatter plot. A categorical factor can be shown using different colors for the different categories. 

*****
**Caution!** Make sure you choose a palette with distinctive colors for each category. Too many categories using similar colors are invariably hard to see and impossible to interpret.
****

Execute the code in the cell below to display **fuel.type** as a factor using the ``color`` argument in `aes`.

In [None]:
ggplot(auto.price, aes(city.mpg, price)) + geom_point(aes(color = factor(fuel.type))) + 
    xlab('City MPG') + ylab('Price') + 
    ggtitle('Relationship between City MPG and Price, \n with gas and diesel fuel shown')

You can now see the separation between the gas and diesel autos. In the above plot we can see that a diesel car is generally more costly at a given fuel efficiency. You can also see that all of the least efficient and most expensive cars have gas engines.

### Transparency

In the scatter plot above, notice there are areas on the plot with dense clusters of points. In particular, there are overlapping points for gas cars with city mpg between 19 and 31 (mpg is apparently reported in integer units). 

Because of the 'over-plotting' effect it is hard to see just how many cars are represented in some regions of the plot. Transparency is a useful aesthetic to overcome over-plotting. Transparency is defined on a scale from 0 (completely transparent, or invisible) to 1 (perfectly opaque). By choosing reasonable transparency values, you can get a feel for how many points are clustered in a small area of a plot.

The code in the cell below follows the aforementioned recipe. A transparency argument, ``alpha``, has been added to the `geom_point` function. Execute this code and examine the result. 

In [None]:
ggplot(auto.price, aes(city.mpg, price)) + geom_point(aes(color = factor(fuel.type)), alpha = 0.3) + 
    xlab('City MPG') + ylab('Price') + 
    ggtitle('Relationship between City MPG and Price, \n with gas and diesel fuel shown')

With the transparency set to 0.3, you can now better see the density of autos in areas of the plot with significant over-plotting. Notice how changing this aesthetic changes your view of the distribution of data on the city mpg versus price projection. Most gas cars are clustered in a fairly small range of price and fuel efficiency.
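Why does transparency reveal density? A rough calculation helps. The sketch below assumes standard 'over' alpha compositing of identically colored markers, which is an assumption about how the graphics device blends them: each added marker lets a fraction `1 - alpha` of what lies beneath show through, so `k` stacked markers have a combined opacity of `1 - (1 - alpha)^k`.

```r
# Combined opacity of k overlapping markers, each drawn with the same alpha
# (assumes standard 'over' alpha compositing)
alpha <- 0.3
k <- 1:6
opacity <- 1 - (1 - alpha)^k

print(data.frame(points = k, opacity = round(opacity, 3)))
```

With `alpha = 0.3`, a single point is faint, two overlapping points are about half opaque, and six overlapping points are close to solid, so darker regions indicate more points.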

### Marker size

Another aesthetic which can project an additional dimension onto a two dimensional plot is marker size. Execute the code in the cell below which uses the size of the marker to indicate engine size. 

In [None]:
ggplot(auto.price, aes(city.mpg, price)) + geom_point(aes(color = factor(fuel.type), size = engine.size), alpha = 0.3) + 
    xlab('City MPG') + ylab('Price') + 
    ggtitle('Relationship between City MPG and Price, \n with gas and diesel fuel shown, \n with marker area indicating engine size')

Examine the above plot noticing the different marker sizes. With larger markers, setting transparency is more necessary than ever. 

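You may notice that the difference in engine size is not always apparent in the above plot. Using marker size can be a bit tricky. Another option is to use the square of the value being projected to determine marker size. Squaring exaggerates the differences between small and large values, so the corresponding differences in marker size are easier to see. By default, ggplot2 maps the `size` aesthetic to marker area, so with the squared value the marker radius grows roughly in proportion to engine size.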

Execute the code in the cell below which uses the square of engine size to determine marker size.  

In [None]:
auto.price$engine.size2 = auto.price$engine.size^2
ggplot(auto.price, aes(city.mpg, price)) + geom_point(aes(color = factor(fuel.type), size = engine.size2), alpha = 0.3) + 
    xlab('City MPG') + ylab('Price') + 
    ggtitle('Relationship between City MPG and Price, \n with gas and diesel fuel shown, \n with marker radius indicating engine size')

Examine the above plot. Notice that the differences in engine size are much more apparent when the squared value determines marker size. The fact that the most expensive and least fuel efficient cars have the largest engines is now quite apparent.

### Marker shape

Marker shape can be used to project an additional dimension onto a two dimensional plotting surface. Typically, different marker shapes are used to indicate the categories of a categorical variable. As with other aesthetics, don't overdo it. Too many similar markers on a chart will only obfuscate the relationships you are trying to show.

Execute the code in the cell below which uses different marker shapes to indicate standard and turbo aspiration cars. 

In [None]:
ggplot(auto.price, aes(city.mpg, price)) + geom_point(aes(color = factor(fuel.type), size = engine.size2, 
                                                      shape = factor(aspiration)), alpha = 0.3) + 
    xlab('City MPG') + ylab('Price') + 
    ggtitle('Relationship between City MPG and Price, \n with gas and diesel fuel shown, \n with marker radius indicating engine size \n and shape showing aspiration')

The above plot projects five dimensions onto the two dimensional plot surface: price, city mpg, engine size, aspiration type, and fuel type. It takes some time and effort to fully understand all of the information provided. 

Some conclusions we can draw from the chart above include:

- The most expensive cars have larger, standard aspiration, gas engines. 
- Most diesel cars use turbo aspiration and these tend to be more expensive.
- Turbo aspiration gas engine cars are in the low range of fuel efficiency, but in the middle of the price range. 
- Some lower cost cars have large gas engines. 

**Your turn!** In the cell below create and execute the code to display a scatter plot of city mpg vs. horsepower, using a low alpha value, color for aspiration, shape for fuel type, and size for engine size.

### Aspect ratio

The aspect ratio of a plot is the ratio of the horizontal dimension to the vertical dimension. Changes in aspect ratio of a plot can have a significant effect on which aspects of a data set are most noticeable. The correct aspect ratio can highlight important aspects of a relationship. On the other hand, choosing a poor aspect ratio can obfuscate important aspects of the data.

In the example below, a time series of sunspot intensity is plotted. The base R `options` function, together with the `repr.plot.width` and `repr.plot.height` options provided by the `repr` package, is a handy way to set the height and width of a plot. Initially we create a plot of the sunspot activity with an aspect ratio of 4:3.

Execute the code below and examine the result. 

In [None]:
require(datasets)
require(repr)
options(repr.plot.width=8, repr.plot.height=6)
plot(sunspot.month, main = 'Monthly sunspot intensity', ylab = 'Sunspot intensity')

The periodic nature of the well-known sunspot cycle is clearly visible in the plot. But, how does our perception of these data change with the changing aspect ratio? 

Execute the code below to create the same plot, but with a larger aspect ratio: 16:5.

In [None]:
options(repr.plot.width=8, repr.plot.height=2.5)
plot(sunspot.month, main = 'Monthly sunspot intensity', ylab = 'Sunspot intensity')

Compare this plot to the previous one. Notice that in the plot with the larger aspect ratio you can now see that the sunspot intensity increases quickly, and dies off slowly for most cycles. This behavior was not visible before.

## Plot specific aesthetics 

You have used some fairly general aesthetics to bring forth relationships in the dataset. You will now explore some additional aesthetics, which tend to be more chart type specific. 

You have already investigated some plot specific aesthetics:

- Using a conditioning variable with a box plot or violin plot.
- Adding quantile lines to a violin plot.

### Histogram bins

As another example, you will examine the effect of changing the number of bins in a histogram. The code below plots the histogram of auto price using different numbers of bins: 20 bins and 50 bins.

This is also the first time you have placed multiple plots on the same figure. The first line of code sets the size of the figure. The `grid.arrange` function from the `gridExtra` package is used to arrange the two plots, one above the other.

Execute the code below to create the two histograms and compare the  results. 

In [None]:
options(repr.plot.width=6, repr.plot.height=6)
bw1 = (max(auto.price$price) - min(auto.price$price))/20
bw2 = (max(auto.price$price) - min(auto.price$price))/50
require(gridExtra)
p1 = ggplot(auto.price, aes(price)) + geom_histogram(binwidth = bw1) + 
        ggtitle('Histogram of auto price with 20 bins')
p2 = ggplot(auto.price, aes(price)) + geom_histogram(binwidth = bw2) + 
        ggtitle('Histogram of auto price with 50 bins')
grid.arrange(p1, p2, nrow = 2)

Notice that the histogram with 20 bins gives a smoother result, which may be better for obtaining an overall impression of the distribution. The histogram with 50 bins shows more detail, particularly for the outliers. 

### KDE operator bandwidth

The bandwidth or span of the kernel density estimator can be adjusted to sharpen or further smooth the density estimate. The `adjust` argument multiplies the default bandwidth of the density kernel, as shown in the code below. Execute the code below and compare the result to the previous example.

In [None]:
ggplot(auto.price, aes(price)) + geom_density(adjust = 1/5)

Compare this plot to the previously created KDE plot. Note the higher level of detail shown. However, it may be hard to tell if the curve shown represents important aspects of the data, or just noise. 
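The effect of `adjust` on the bandwidth can be inspected numerically with the base R `density` function. This is a minimal sketch, assuming the built-in `mtcars` data in place of auto price, and assuming that `geom_density` relies on the same default bandwidth rule (`bw.nrd0`) as `density`.

```r
# Built-in mtcars data stands in for the auto data (an assumption)
x <- mtcars$hp

# Default bandwidth chosen by the bw.nrd0 rule of thumb
bw.default <- density(x)$bw

# adjust multiplies the default bandwidth; 1/5 gives a kernel one fifth as wide
bw.narrow <- density(x, adjust = 1/5)$bw

print(c(default = bw.default, narrow = bw.narrow))
print(all.equal(bw.narrow, bw.default / 5))
```

A narrower kernel averages over fewer neighboring points, which is why the resulting curve shows more detail and more noise.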

### Interpolation for heat map

Notice that the heat map you previously plotted has a jagged or pixelated appearance. Performing interpolation on the plot can smooth the appearance. Execute the code in the cell below and compare the result to the previous heat map.

In [None]:
ggplot(faithfuld, aes(waiting, eruptions)) +
 geom_raster(aes(fill = density), interpolate = TRUE) +
 ggtitle('Eruption time vs. waiting time for Old Faithful geyser') +
 xlab('Waiting time') + ylab('Eruption time')

The interpolated image now has a smoother appearance. 

## Multi-axis and Facet Plots

You have explored basic plot types and used aesthetics to project multiple dimensions onto basic plot types. Multi-axis plots project a subset of the dimensions of a high-dimensional data set onto the two dimensional plot surface. Displaying a subset of dimensions in two dimensions projects the marginal distribution of the data onto the plot surface. 

Now, you will investigate the use of multiple plot axes and a technique known as faceting. Multiple plot axes can be placed on the two dimensional plot surface. Faceting involves using the values of categorical variables, or ranges of numeric variables, to create a grid of related plots. Faceting goes by several other monikers, including conditioned plotting or lattice plotting. In effect, the faceting or conditioning projects the conditional distribution of the data set. 

We will now explore several aspects of multi-axis plots: 

- Pairwise scatter plots
- Correlation plots
- Faceted plots


### Pairwise scatter plots

Pairwise plots, also known as plot matrices, allow you to view multiple dimensions of a data set on one plot. A scatter plot is shown of each variable versus each of the other variables, arranged in an array. Another type of plot, such as a KDE plot, can be shown for each variable on the diagonal. Optionally, regression lines can be added to each scatter plot. 

R and the available packages contain several good options for scatter plot matrices. The code to create a pairwise scatter plot for certain variables in the auto price data set is shown in the cell below. The recipe is:

- Use the `scatterplotMatrix` function from the `car` package.
- Use the formula notation to define the columns you wish to plot. The formula is defined:
$$\sim var1 + var2 + \ldots$$
- Specify the data frame. 

The scatter plot matrix contains a number of elements, which are useful for understanding the relationships in the data sets. 

- There are KDE plots for each variable along the diagonal of the matrix. 
- A pair of scatter plots for each variable pair is displayed. These two scatter plots show the axes in both possible orientations.
- A linear regression line is shown in green on each scatter plot. This regression line gives an indication of trend.
- A nonlinear regression line with confidence intervals is shown in red. The nonlinear regression line can highlight more complex relationships in the data set. The confidence intervals give some indication of the likely range of regression lines, given the uncertainty in the data.  

In [None]:
options(repr.plot.width=8, repr.plot.height=8)
require(car)
scatterplotMatrix(~ wheel.base + curb.weight + engine.size + horsepower + city.mpg + price, data = auto.price)

There is a great deal of information on the above plot. Examine this plot, noticing at least the following:

- The KDE plots of the variables generally show most of the values concentrated at the lower end, with a long tail toward higher values (a right skew).
- Several of the variables have a fairly linear relationship. For example, the relationships between wheel base, curb weight and engine size.
- Other relationships are clearly nonlinear. For example, the relationships between price and city MPG, and between city MPG and horsepower. 
- Other relationships are less clear cut. For example, the relationship between price and horsepower, or price and wheel base. 
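
As one of the other options for scatter plot matrices, base R provides the `pairs` function. The following is a minimal sketch, assuming the built-in `mtcars` data set as a stand-in for `auto.price`; the choice of columns is illustrative only. Note that `pairs` does not draw KDE plots on the diagonal by default.

```r
# mtcars ships with R and stands in for auto.price here.
vars = mtcars[, c('wt', 'disp', 'hp', 'mpg')]

# panel.smooth draws the points of each panel plus a lowess smooth.
pairs(vars, panel = panel.smooth,
      main = 'Scatter plot matrix of mtcars variables')
```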

### Correlation plots

The correlation structure of a multivariate data set can provide some useful insight. However, it is hard to derive much insight by just looking at the matrix of numbers. A visualization can be useful in getting an overall impression of which variables have significant correlation. 

R and its packages contain many functions for displaying a correlation matrix. A common method is to display the correlation matrix as a heat map. This is particularly useful if you have a large number of variables in your data set. See, for example, the `corrgram` package.
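
A correlation heat map can also be sketched with ggplot2 alone. The example below is a minimal illustration, assuming the built-in `mtcars` data set as a stand-in for `auto.price`; the names `C` and `C.long` are illustrative only. The matrix is reshaped to long form so that each cell becomes one tile.

```r
library(ggplot2)

# mtcars ships with R and stands in for auto.price here.
C = cor(mtcars[, c('wt', 'disp', 'hp', 'mpg')], method = 'pearson')

# Reshape the matrix to long form: one row per pair of variables.
C.long = as.data.frame(as.table(C))
names(C.long) = c('var1', 'var2', 'correlation')

# One tile per pair, colored on a diverging scale centered at zero.
p = ggplot(C.long, aes(var1, var2, fill = correlation)) +
        geom_tile() +
        scale_fill_gradient2(low = 'firebrick3', mid = 'white', high = 'navy',
                             midpoint = 0, limits = c(-1, 1)) +
        ggtitle('Correlation heat map')
print(p)
```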

We will look at other methods. The code below uses the `plotcorr` function from the `ellipse` package to plot the correlation ellipses. The length of the minor axis is computed as $1 - correlation$. A correlation of 1 gives a line, with the minor axis set to 0. Correlation of zero results in a circle. The intensity of the color used on the plot indicates the magnitude of the correlation. Additionally, orientation of the ellipse is used to highlight the positive or negative correlation value. For positive correlation the ellipse tilts right, with the opposite being true for negative correlation. 

In [None]:
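options(repr.plot.width=6, repr.plot.height=6)
library(ellipse)
# Pearson correlation matrix of the numeric columns of interest
R = cor(auto.price[, c('wheel.base', 'curb.weight', 'engine.size', 'horsepower', 'city.mpg', 'price')], method = 'pearson')
print(R)
# plotcorr recycles col across the cells, so index an 11-color diverging
# palette by correlation: -1 maps to color 1, 0 to color 6, and 1 to color 11
pal = colorRampPalette(c("firebrick3", "white", "navy"))(11)
plotcorr(R, col = pal[round(5 * R) + 6])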

The correlation between some of these variables is strongly positive, shown as narrow ellipses tilting right. Variable pairs with strong negative correlation are shown as narrow ellipses tilting left. The diagonal elements are all shown as lines.

Another way to display a correlation matrix is as a series of dots, with the size of the dots proportional to the correlation and the color indicating the sign. In the code below, the `corrplot` function from the `corrplot` package is used to plot the correlation matrix. 

In [None]:
library(corrplot)
corrplot(R, method="circle", type='lower')

The size and color intensity of the dot indicates the magnitude of the correlation. Blue dots indicate positive correlation and red dots indicate negative correlation. 

The information shown in this plot is the same as the previous plot. Which type of plot you prefer is purely subjective. 

**Your turn!** Compute a correlation matrix using the Kendall rank method. Plot and print the correlation matrix and compare to the correlation matrix computed using the Pearson method. 

### Facet plots

You have explored using plot aesthetics to project additional dimensions onto the two dimensional plot surface. You have also used scatter plot matrices to examine the relationships among several variables. Now, you will look at using facet plots or conditioned plots. 

A faceted or conditioned plot is composed of a set of sub-plots, defined by one or more conditioning variables. The data in each sub-plot are a subset selected by the values of the conditioning variable. This conditioning operation is, in effect, a **group-by** operation. 
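
The group-by idea can be illustrated without any plotting. The following is a minimal sketch, assuming the built-in `mtcars` data set as a stand-in for `auto.price`, with `cyl` playing the role of the conditioning variable. Each group produced by `split` corresponds to the data that would appear in one sub-plot.

```r
# mtcars ships with R; cyl plays the role of the conditioning variable.
groups = split(mtcars$mpg, mtcars$cyl)

# Each element holds the data for one sub-plot of a faceted plot.
group.sizes = sapply(groups, length)
group.means = sapply(groups, mean)
print(group.sizes)
print(round(group.means, 1))
```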

The ggplot2 package contains the `facet_grid` function, used to define the grid on which the sub-plots are created. The `facet_grid` function uses an R formula object to specify the conditioning variables that define the rows and columns of the grid. The general form of this formula is:

$$RowVariables \sim ColumnVariables$$

A conditioned plot with a single column, but multiple rows, is therefore defined:

$$RowVariables \sim\ .$$

Or, a conditioned plot with a single row, but multiple columns, is defined:

$$.\ \sim ColumnVariables$$

You can use multiple variables to condition rows and columns, using the $+$ symbol as the operator:

$$RowVar1 + RowVar2 + \ldots \sim ColVar1 + ColVar2 + \ldots$$

Like all good things in visualization, there are practical limits. Creating a large grid of sub-plots, using multiple conditioning variables, quickly becomes confusing to look at. 
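
To get a feel for how quickly the grid grows, the sketch below conditions the rows on two variables and the columns on one. It is a minimal illustration, assuming the built-in `mtcars` data set as a stand-in for `auto.price`; the names `n.rows` and `n.cols` are illustrative only.

```r
library(ggplot2)

# mtcars ships with R and stands in for auto.price here.
# Rows are conditioned on am and vs, columns on cyl.
p = ggplot(mtcars, aes(wt, mpg)) +
        geom_point() +
        facet_grid(am + vs ~ cyl)
print(p)

# The grid has one row per observed (am, vs) combination
# and one column per value of cyl.
n.rows = nrow(unique(mtcars[, c('am', 'vs')]))
n.cols = length(unique(mtcars$cyl))
print(n.rows * n.cols)
```

Even with three small conditioning variables the number of panels multiplies, and some panels contain very few points.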

Give this a try. Execute the code below to create a single row of histograms conditioned on the type of drive wheels of the auto.

In [None]:
library(repr)
options(repr.plot.width=8, repr.plot.height=3)
bw1 = (max(auto.price$price) - min(auto.price$price))/30
ggplot(auto.price, aes(price)) + 
        geom_histogram(binwidth = bw1) + 
        facet_grid(. ~ drive.wheels) +
        ggtitle('Histogram of auto price conditioned on drive wheels')

The three histograms are displayed in the grid, one for each value of the conditioning variable, drive wheels. The strip label at the top of each plot displays the value of the conditioning variable. 

From these plots we can see that the price range of the 4wd cars overlaps that of the fwd cars, and the rwd cars have the greatest range of prices. 

Try another example. In this case, there are two conditioning variables, fuel type and aspiration, defining a two dimensional grid. Fuel type conditions the rows. Aspiration conditions the columns. Execute this code and examine the results.

In [None]:
options(repr.plot.width=8, repr.plot.height=8)
bw1 = (max(auto.price$price) - min(auto.price$price))/30
ggplot(auto.price, aes(price)) + 
        geom_histogram(binwidth = bw1) + 
        facet_grid(fuel.type ~ aspiration) +
        ggtitle('Histogram of auto price conditioned on fuel type and aspiration')

You can see the two dimensional grid of the conditioned histograms. The strip labels along the top show the values of aspiration. The strip labels on the right show the values of fuel type. The combination of aspiration and fuel type defines the conditioning grid. 

Here is another example. The code in the cell below creates a grid of scatter plots, with drive wheels as the column (horizontal) conditioning variable and body style as the row (vertical) conditioning variable. Execute this code and examine the result.

In [None]:
options(repr.plot.width=8, repr.plot.height=11)
ggplot(auto.price, aes(city.mpg, price)) + geom_point(aes(color = factor(fuel.type), size = engine.size2, 
                                                      shape = factor(aspiration)), alpha = 0.3) + 
    facet_grid(body.style ~ drive.wheels) +
    xlab('City MPG') + ylab('Price') + 
    ggtitle(paste('Relationship between City MPG and Price,',
                  'with gas and diesel fuel shown,',
                  'with marker radius indicating engine size',
                  'and shape showing aspiration', sep = '\n'))

This plot projects seven dimensions of the data set onto the two dimensional plot surface. Study this plot carefully, thinking about what you can learn.

**Your turn!** Create a new conditioned scatter plot using drive wheels as the column (horizontal), and body type and number of doors for the row (vertical) conditioning variables. Use shape to distinguish aspiration and color to distinguish fuel type. 

## Summary

You have covered a lot of ground in this tutorial. You have explored many of the relationships in the auto price data set by creating multiple views to visualize these data. 

Specifically, in this tutorial you have:

- Examined the uses of and methods for creation of several basic plot types.
- Used plot aesthetics to project additional dimensions onto a two dimensional plot surface, or to highlight aspects of the relationships in the data.
- Used multi-axis and faceted plots to project additional dimensions onto the two dimensional plot surface. 

Finally, **have fun exploring your data!**

#### Copyright 2017, Stephen Elston. All rights reserved. 