## Multivariate Analysis Example 2 

-----

### Data Set: King County, WA housing data

This lab illustrates the summarization of data. 
In this case, the data is wider and has many columns compared to the prior datasets.

The dataset contains information on house sale prices for King County, Washington, which includes Seattle. 
It has over 21k rows of data to play with. 
It has 21 columns: 2 non-predictor variables (id and date, which we will exclude), 18 predictor ("independent") variables, and 1 response ("dependent") variable, price.

### Loading the data...

In [None]:
housing_prices <- read.csv("/dsa/data/all_datasets/house_sales_in_king_county/kc_house_data.csv")
head(housing_prices)

In [None]:
# We can examine the structure of dataframe as follows
str(housing_prices)

The `str()` function gave us an overall sense of the data. 

__We see that we have 21613 observations and 21 variables.__

For this data set, we can see various numeric and integer components in our multivariate data. 
Each line in the structure above is a vector component; therefore, the vector structure is:  
*(id, date, price, bedrooms, ... , sqft_living15, sqft_lot15)*

We will not worry about *id*, as it is just a record identifier. 


Let's dig deep into the data by doing some univariate analysis just like what we did in our prior module.
First, run summary() on all variables...

In [None]:
summary(housing_prices)

Each variable has a different scale of values. 
Some range from 0 to 1 and some vary over long ranges. 
*bedrooms* looks interesting with a maximum value of 33. 
There may be outliers; we will look into that shortly. 

For data sets such as this, you need to apply some cultural and domain understanding to the data.

For instance, we see *zipcode* is being treated as a numerical value.
However, we know zipcodes are actually buckets of an area, i.e., **factors**.

In another example, the *yr_renovated* has a **min** and **median** value of 0; and it has **mean value of 84.4**.
From this, we should probably surmise that **yr_renovated** defaults to 0 if the property has not been renovated.

All of these things should be kept in mind as we begin to try modelling our data.
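
As a minimal sketch of how those two observations might be acted on, the cell below uses a tiny made-up data frame. The name `toy`, its values, and the derived `renovated` column are illustrative assumptions, not part of the King County file, and treating 0 as "never renovated" is the surmise from above rather than a documented fact.

```r
# Toy data, made up for illustration only (not rows from kc_house_data.csv)
toy <- data.frame(
  zipcode      = c(98178, 98125, 98178, 98028),
  yr_renovated = c(0, 1991, 0, 2005)
)

# Treat zipcode as a category (factor) rather than a number
toy$zipcode <- factor(toy$zipcode)

# Assumption: 0 means "never renovated", so flag it explicitly ...
toy$renovated <- toy$yr_renovated > 0

# ... and recode 0 as NA so it no longer drags the mean toward zero
toy$yr_renovated[toy$yr_renovated == 0] <- NA

str(toy)
mean(toy$yr_renovated, na.rm = TRUE)  # mean over renovated homes only
```

Whether to recode the zeros or keep a separate yes/no flag depends on the model you plan to fit; the sketch shows both so you can compare.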

### Univariate analysis

Let's plot histograms for price and all 18 predictor variables and look into their distributions.

`gridExtra`: This R library helps you arrange multiple grid-based plots on a page, and draw tables. 
We are plotting 19 different histograms (price plus the 18 predictors) and arranging them in a grid. 

`ggplot2`: ggplot2 is a commonly used package for visualization. 
It takes care of many of the fiddly details that make plotting a hassle (like drawing legends).

__Reference__: https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html  
__Reference__: http://docs.ggplot2.org/dev/vignettes/qplot.html  


In [None]:
require(gridExtra)
require(ggplot2)

## grid.arrange(x1,x2,x3...xn,ncol=x,nrow=y)
## The command will arrange the plots x1,x2....xn in the desired layout of specified rows and columns

# The number of bins should be chosen as appropriate. If you are not sure, trial and error is the best way to find
# the right number. All bins span an equal-width range of values, so fewer bins means wider bins.

# In the case of price, I am going to divide by 1000 to get the price in $1000's 

grid.arrange(qplot(housing_prices$price/1000,bins = 20,xlab='price ($k)'),
             qplot(housing_prices$bedrooms,bins = 5,xlab='bedrooms'),
             qplot(housing_prices$bathrooms,bins = 5,xlab='bathrooms'),
             qplot(housing_prices$sqft_living,bins = 25,xlab='sqft_living'),
             qplot(housing_prices$sqft_lot,bins = 25,xlab='sqft_lot'),
             qplot(housing_prices$floors,bins = 4,xlab='floors'),
             qplot(housing_prices$waterfront,bins = 4,xlab='waterfront'),
             qplot(housing_prices$view,bins = 4,xlab='view'),
             qplot(housing_prices$condition,bins = 10,xlab='condition'),
             qplot(housing_prices$grade,bins = 10,xlab='grade'),
             qplot(housing_prices$sqft_above,bins = 25,xlab='sqft_above'),
             qplot(housing_prices$sqft_basement,bins = 25,xlab='sqft_basement'),
             qplot(housing_prices$yr_built,bins = 10,xlab='yr_built'),
             qplot(housing_prices$yr_renovated,bins = 10,xlab='yr_renovated'),
             qplot(housing_prices$lat,bins = 20,xlab='lat'),
             qplot(housing_prices$long,bins = 20,xlab='long'),
             qplot(housing_prices$sqft_living15,bins = 25,xlab='sqft_living15'),
             qplot(housing_prices$sqft_lot15,bins = 25,xlab='sqft_lot15'),
             qplot(housing_prices$zipcode,bins = 10,xlab='zipcode'),
             ncol = 3)

Now let's look at a few of the variables as tables of counts, rather than as histograms, using the *table* function.

__Reference__: https://www.r-bloggers.com/r-function-of-the-day-table/
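
For a quick feel of what `table()` returns, here is a small sketch on a made-up vector (the `bedrooms` values below are hypothetical and not taken from the housing data).

```r
# Made-up bedroom counts for six hypothetical houses
bedrooms <- c(3, 2, 3, 4, 3, 2)

# table() counts how many times each distinct value occurs
counts <- table(bedrooms)
counts

# Individual counts can be looked up by value, using the value as a name
counts[["3"]]
```

Because the result is indexed by the distinct values, rare values (such as a single unusually large bedroom count) stand out immediately.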


In [None]:
table(housing_prices$bedrooms)

In [None]:
table(housing_prices$bathrooms)

In [None]:
table(housing_prices$floors)

In [None]:
table(housing_prices$view)

In [None]:
table(housing_prices$yr_renovated)

### Observations based on histograms

* Bedrooms: The bedrooms variable appears to have an outlier (33), although it could be a valid value. 

* Year renovated: Not many houses have been renovated. Most of the renovated houses are from the 80s.

We have to identify independent variables that are related to our response variable price. 
To do this, we will look for bivariate relationships. 

Culturally, we expect the number of bedrooms to be a major driver of the price of a house. 
So, let's look at this and test our expectation with a scatter plot of price and bedrooms. 
We will add a regression line to our scatter plot as well, 
so we can gauge the direction and strength of the linear relationship between the variables.



In [None]:
# Plot housing prices, use the bedrooms as the x-axis and the price as the y-axis
ggplot(housing_prices, aes(x = bedrooms, y = price/1000)) +  # The plus sign lets R know that the command will continue
# Add a X axis label
 xlab("Bedrooms") +
# Add a Y axis label
 ylab("Price ($K)") +
# set the data plotting to be points
 geom_point() +
# add the smooth geometry element with a linear model, i.e., using the lm()
 geom_smooth(method = lm)

This is our basic plot.  
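
The regression line shows the trend visually; the correlation coefficient itself comes from `cor()`. The sketch below, on a made-up five-row data frame (`toy` and its values are assumptions for illustration), shows how the two are related for a single predictor.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  bedrooms = c(1, 2, 3, 4, 5),
  price    = c(150, 210, 240, 330, 370)  # in $1000s
)

# Pearson correlation coefficient between the two variables
r <- cor(toy$bedrooms, toy$price)

# Slope of the least-squares line, the same kind of line geom_smooth(method = lm) draws
slope <- coef(lm(price ~ bedrooms, data = toy))[["bedrooms"]]

# For simple linear regression: slope = r * sd(y) / sd(x)
all.equal(slope, r * sd(toy$price) / sd(toy$bedrooms))
```

In other words, the slope tells you how much price changes per bedroom, while the correlation tells you how tightly the points cluster around that line.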

Something that often helps in plots is to bring another feature of the data into the plot via use of colors.

We will re-do the plot, using some variables as colors.

In [None]:
# Plot housing prices, use the bedrooms as the x-axis and the price as the y axis
ggplot(housing_prices, aes(x = bedrooms, y = price/1000)) +  #The plus sign lets R know that the command will continue
# Add a X axis label
 xlab("Bedrooms") +
# Add a Y axis label
 ylab("Price ($K)") +
# This next line allows us to view some variables as colors
# set the data plotting to be points with an aesthetic of colour=view
 geom_point(aes(colour = view)) + 
# add the smooth geometry element with a linear model, i.e., using the lm()
 geom_smooth(method = lm)

#### Try me:
Try changing the `(colour = view)` to use a different variable, such as `bathrooms` or `floors`.

---
"33 bedrooms" looks like an outlier, because its price is similar to what a 4-bedroom house will cost. 
Look at that particular record in dataset by running the cell below. 
It just has 1.75 bathrooms and 1620 sqft_living. 

In [None]:
housing_prices[housing_prices$bedrooms == 33,]

Looks like the observation is an outlier (and most likely an error). Let's remove it from the dataset and then repeat our plot!

In [None]:
# Keep every row whose bedrooms value is not 33
housing_prices <- housing_prices[!housing_prices$bedrooms %in% c(33), ]
ggplot(housing_prices, aes(x=bedrooms,y=price/1000)) +  
 xlab("Bedrooms") + ylab("Price ($K)") +  geom_point(aes(colour = view)) + geom_smooth(method=lm)

Let's generate scatter plots of price and some other independent variables... 

In [None]:
library(gridExtra)
library(ggplot2)

ggplot(housing_prices, aes(x=bedrooms,y=price/1000)) + xlab("Bedrooms")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm)

grid.arrange(
    
    ggplot(housing_prices, aes(x=bathrooms,y=price/1000)) + xlab("Bathrooms")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=sqft_living,y=price/1000)) + xlab("sqft_living")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=floors,y=price/1000)) + xlab("floors")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=condition,y=price/1000)) + xlab("condition")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=grade,y=price/1000)) + xlab("grade")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=sqft_above,y=price/1000)) + xlab("sqft_above")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=sqft_basement,y=price/1000)) + xlab("sqft_basement")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=yr_built,y=price/1000)) + xlab("yr_built")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=yr_renovated,y=price/1000)) + xlab("yr_renovated")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=view,y=price/1000)) + xlab("view")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=sqft_lot15,y=price/1000)) + xlab("sqft_lot15")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    ggplot(housing_prices, aes(x=sqft_living15,y=price/1000)) + xlab("sqft_living15")+ ylab("Price ($K)") + 
    geom_point(aes(colour = factor(view))) + geom_smooth(method=lm),
    
    
    ncol=2)

### Observations:

Which variables seem to have no correlation to price?  
These are the flat lines, where price does not grow.
 * floors
 * condition
 * yr_built

Which variables seem to have the strongest correlation to price?
 1. sqft_living
 2. bathrooms
 3. sqft_above
 4. bedrooms

Let's confirm this with a correlation matrix.
First, we must down-select to a purely numeric data frame; otherwise we get an error such as:  
```
Error in cor(housing_prices): 'x' must be numeric
```
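
One possible way to down-select without typing column names is to test each column with `is.numeric()`. This is only a sketch on a made-up data frame (`toy`, its columns, and its values are assumptions); the cell that follows picks columns by name instead.

```r
# Toy data, made up for illustration only; 'date' is deliberately non-numeric
toy <- data.frame(
  date     = c("2014-10-13", "2014-12-09", "2015-02-25"),
  price    = c(200000, 350000, 500000),
  bedrooms = c(2, 3, 3),
  stringsAsFactors = FALSE
)

# TRUE for each column that is numeric, FALSE otherwise
numeric_cols <- sapply(toy, is.numeric)

# Keep only the numeric columns; drop = FALSE keeps the result a data frame
toy_numeric <- toy[, numeric_cols, drop = FALSE]

cor(toy_numeric)  # now works, because every column is numeric
```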

In [None]:
# Select the numeric columns of interest by name (each column listed once)
hp <- housing_prices[c("price","bedrooms","floors","condition","yr_built","sqft_living","bathrooms","sqft_above","sqft_basement","grade")]
cor(hp)

Looking down the first column, we can see the most correlated variables to price are:
 1. sqft_living
 2. grade
 3. sqft_above
 4. bathrooms
 
Note that bedrooms, which we traditionally think of as a large driver of home price, has only a 0.315 correlation value.

Let's drop some columns from our working data frame and get a big picture!


In [None]:
# setting a column to NULL removes it from the data frame
hp$condition <- NULL
hp$yr_built <- NULL
plot(hp)

#
# NOTE: This cell may take a minute to complete running.
# 

## Linear Regression Model



#### Using linear model solver, _lm()_


`lm()` is the function used to fit linear models.  An object of class "lm" is a list containing at least the following components: coefficients, residuals, fitted.values, rank, weights, df.residual, call, terms, contrasts, xlevels, offset, y, x, model, and na.action.

`lm(LHS ~ RHS)` is the model to compute.
  *  Left-hand-side (LHS) is the dependent variable
  *  Right-hand-side (RHS) are the independent variables (predictors)
  
Our model will be:
```R
price ~ ?
```
We will start with just two predictors, the variables with the highest correlation to price, as we found previously.
 1. sqft_living
 2. grade


In [None]:
# fit variable will hold a statistical model
fit <- lm(price ~ sqft_living + grade, data=hp)
summary(fit) # show results of analysis

**NOTE** The R-squared value is 0.5345.

__Reference/Reading__: [$R^2$](https://en.wikipedia.org/wiki/Coefficient_of_determination#As_squared_correlation_coefficient)
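
To make the $R^2$ number less of a black box, the sketch below recomputes it by hand on a made-up data frame (`toy`, `toy_fit`, and all values are assumptions for illustration, not the housing data). It uses the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$ and also checks the squared-correlation form described in the reading above.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  sqft_living = c(1000, 1500, 1800, 2400, 3000, 3500),
  grade       = c(6, 7, 7, 8, 9, 10),
  price       = c(250, 340, 420, 510, 690, 800)  # in $1000s
)

toy_fit <- lm(price ~ sqft_living + grade, data = toy)

# Residual sum of squares: variation the model does not explain
ss_res <- sum(residuals(toy_fit)^2)

# Total sum of squares: variation of price around its mean
ss_tot <- sum((toy$price - mean(toy$price))^2)

r2_manual <- 1 - ss_res / ss_tot

# Both checks should report TRUE
all.equal(r2_manual, summary(toy_fit)$r.squared)
all.equal(cor(fitted(toy_fit), toy$price)^2, summary(toy_fit)$r.squared)
```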


What if we add the next predictor, sqft_above?

In [None]:
# fit variable will hold a statistical model
fit2 <- lm(price ~ sqft_living + grade + sqft_above, data=hp)
summary(fit2) # show results of analysis

**NOTE** The R-squared value is 0.5411.

What if we add the next predictor, bathrooms?

In [None]:
# fit variable will hold a statistical model
fit3 <- lm(price ~ sqft_living + grade + sqft_above + bathrooms, data=hp)
summary(fit3) # show results of analysis

We can see that as we add predictors (independent variables) with lower correlations to price, we get diminishing returns on the R-squared ($R^2$) measure of goodness of fit.
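
One caveat worth keeping in mind: plain $R^2$ never decreases when a predictor is added, even a useless one, which is why `summary()` also reports an adjusted $R^2$. The sketch below illustrates this on simulated data (the names `toy`, `x1`, `noise`, `small`, and `big` are assumptions for illustration, and the data are randomly generated, not housing data).

```r
set.seed(1)  # make the simulated data reproducible
n <- 50

# x1 truly drives y; 'noise' is unrelated to y by construction
toy <- data.frame(x1 = rnorm(n), noise = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

small <- lm(y ~ x1, data = toy)
big   <- lm(y ~ x1 + noise, data = toy)

# Plain R-squared for the nested models
r2 <- c(small = summary(small)$r.squared, big = summary(big)$r.squared)

# Adjusted R-squared penalizes each extra predictor
adj <- c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared)

r2
adj

# F-test comparing the nested models
anova(small, big)
```

Comparing the adjusted values, or the `anova()` F-test, is one hedged way to judge whether an extra predictor earns its place in the model.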

Let's finish up this lab with some visualization of the multiple regression models.

Look at our original model, `fit`

In [None]:
fit

In [None]:
require(ggplot2)

########################
# adapted from: 
# https://susanejohnston.wordpress.com/2012/08/09/a-quick-and-easy-function-to-plot-lm-results-in-r/
########################

    # Note that fit is an object with various things, such as a model.
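ggplot(hp, 
       aes(
                # fitted(fit) is the model's predicted price for each row, i.e.,
                # intercept + b1*sqft_living + b2*grade using the fitted coefficients
                # (aes() replaces the deprecated aes_string())
                x = fitted(fit), 
                y = price
       ) # end of aes
  ) + 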
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
                     "Intercept =",signif(fit$coef[[1]],5 ),
                     " Slope =",signif(fit$coef[[2]], 5),
                     " P =",signif(summary(fit)$coef[2,4], 5)
                ) # end of title string concatenations
       , x = "sqft_living + grade"
      ) 

#### Ponder the few changes that are needed to plot the `fit2` or `fit3` models.

Feel free to give it a try!

# Save your notebook