# Analysis of House Prices in Ames, IA
### Regularized Regression, Cross-Validation, and Feature Selection
##### Grant Nikseresht, Yuqing Zhao, Yue Ning

This notebook is a glimpse into the R workflow that Yuqing, Yue, and I (Grant) used to analyze the [Ames, IA housing dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) featured on Kaggle. Our analysis was done as part of our final project for Applied Stats (MATH 564) at IIT.

The goal of the Kaggle challenge was to predict the selling price of a home given a number of its attributes. Selecting which features to use in our model was the primary challenge in analyzing this dataset, where many of the features are collinear, trivial, or uncorrelated with the dependent variable. 

For our course project, we decided to compare different methods for feature selection including manual selection, stepwise regression, and regularized regression. In this notebook, we'll explore the dataset, implement several regression methods with cross-validation, and compare some results. 


In [None]:
library(rpart)
library(caret)
library(leaps)
library(glmnet)
library(ggplot2)
library(tabplot)
library(reshape2)
options(warn=-1) 

## Exploring the Data

We performed some preprocessing of the original Kaggle data and stored it in a file in the `data` folder. Let's load it into a dataframe and pull the index out to prepare it for analysis.

In [None]:
train_df <- read.csv("./data/train_processed.csv")
train_df <- train_df[,-c(1)]

Our processed dataset now contains 1457 observations and 51 explanatory variables. All missing and nonsensical values have been removed. Let's take a look at the data using some handy R functions.

In [None]:
dim(train_df)
summary(train_df)

Let's take a quick look at the distribution of the log of selling price (`SalePrice` was already log-transformed during preprocessing). We use both built-in R functions like `boxplot` and the more extensive plotting package `ggplot2`.

In [None]:
boxplot(train_df$SalePrice)
# Refer to the column by name inside aes() rather than via train_df$...
ggplot(data=train_df, aes(x=SalePrice)) + 
  geom_histogram(col="red", aes(fill=..count..)) +
  scale_fill_gradient("Count", low = "green", high = "red")+
  labs(title="Histogram for SalePrice") +
  labs(x="SalePrice", y="Count")+
  theme(plot.title = element_text(hjust = 0.5))

Let's take a 10,000-foot view of the explanatory variables as well using `tabplot`. A few are shown below, but feel free to try this on other variables. This view is useful for seeing how homogeneous or incomplete a dataset is.

In [None]:
tableplot(train_df[,30:34])
tableplot(train_df[,41:45])
tableplot(train_df[,5:9])

Here's an example of an issue in the data that required some preprocessing. Each component of the house generally had a few associated explanatory variables. For instance, several variables each provide similar information about basements and garages. The `xtabs` calls below build contingency tables that let us gauge how much these factor variables overlap.

In [None]:
xtabs(~GarageQual+GarageCond+GarageFinish, data=train_df)
xtabs(~BsmtCond+BsmtFinType1, data=train_df)
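
Reading a contingency table by eye gets harder as the number of levels grows. One way to boil a two-way table down to a single number is Cramér's V, which we did not use in our project; the sketch below is only an illustration on made-up factors, and `cramers_v` is a hypothetical helper name.

```r
# Cramér's V: 0 means no association, 1 means one factor determines the other.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Drop empty rows/columns so unused factor levels do not distort the result
  tab <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Toy data (made up for illustration, not taken from the Ames dataset)
set.seed(1)
qual  <- factor(sample(c("Fa", "TA", "Gd"), 200, replace = TRUE))
cond  <- qual                                              # redundant copy
other <- factor(sample(c("A", "B"), 200, replace = TRUE))  # unrelated factor

cramers_v(qual, cond)   # perfectly redundant factors give 1
cramers_v(qual, other)  # unrelated factors give a value near 0
```

On the real data, something like `cramers_v(train_df$GarageQual, train_df$GarageCond)` should work the same way, assuming both columns are factors.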

Perfectly collinear or homogeneous variables can sabotage a regression model, so we're only going to keep one of the variables for garage and one for basement. The cell below also drops `Exterior2nd`.

In [None]:
# Drop the redundant columns by name
drop_cols <- c("GarageFinish", "Exterior2nd", "GarageCond", "BsmtFinType1")
train_df <- train_df[, !(colnames(train_df) %in% drop_cols)]
dim(train_df)

Here's a function we'll use at the end to compute our error. We're using root mean squared logarithmic error (RMSLE) to compare models. Since our dependent variable is already log-transformed, this reduces to the ordinary RMSE computed on the log scale.

$RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^n(\hat{Y}_i - Y_i)^2}$

In [None]:
rmsle <- function(yhat, y) {
  n <- length(yhat)
  return(sqrt((1/n)*sum((yhat-y)^2)))
}
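
To see why working on the log scale matters, here's a small sketch with made-up dollar prices (not values from the dataset). It repeats the `rmsle` definition so the cell runs on its own, and shows that the log-scale error depends only on the ratio between prediction and truth.

```r
rmsle <- function(yhat, y) {
  n <- length(yhat)
  sqrt((1 / n) * sum((yhat - y)^2))
}

# Made-up dollar prices, purely illustrative
actual    <- c(100000, 200000, 400000)
predicted <- actual * 1.10   # every prediction is 10% too high

# On the log scale the error depends only on predicted / actual,
# so a cheap and an expensive home contribute equally
log_scale_error <- rmsle(log(predicted), log(actual))
log_scale_error              # equals log(1.10)

# On the dollar scale the same function is plain RMSE,
# which is dominated by the most expensive home
dollar_scale_error <- rmsle(predicted, actual)
dollar_scale_error
```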

## Analysis

Since the only labeled data we have access to is the training set, we're going to use cross-validation to estimate how well our model will generalize to new data. 

For this, we're going to use the `caret` package, which provides a common interface for cross-validating a variety of modeling methods. The control parameter initialized below tells later calls to `caret` what settings to use for cross-validation. We used 10 folds in our analysis, but we'll only use 3 here so training finishes in a reasonable time. Feel free to increase the number if you're trying this on your own.

In [None]:
set.seed(564)
controlParameter <- trainControl(method = "cv",
                                  number = 3,
                                  savePredictions = TRUE)
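
If you're curious what k-fold cross-validation is doing conceptually, here's a rough base-R sketch on synthetic data. It is not how `caret` is implemented internally; the data frame `toy` and its columns are invented for illustration.

```r
set.seed(564)

# Synthetic data standing in for the housing data (illustrative only)
n <- 150
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.5)

k <- 3
# Randomly assign each row to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(fold) {
  train_rows <- toy[fold_id != fold, ]   # fit on the other k - 1 folds
  held_out   <- toy[fold_id == fold, ]   # score on the fold left out
  fit  <- lm(y ~ x1 + x2, data = train_rows)
  pred <- predict(fit, newdata = held_out)
  sqrt(mean((pred - held_out$y)^2))
})

mean(fold_rmse)  # comparable in spirit to the RMSE column caret reports
sd(fold_rmse)    # comparable in spirit to the RMSESD column
```

Each observation is held out exactly once, so every fold's error is measured on data the model did not see during fitting.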

In [None]:
lm_ols <- train(SalePrice~.,
                data = train_df,
                method='lm',
                trControl=controlParameter)
ols_fit <- lm_ols$finalModel

In [None]:
summary(ols_fit)
plot(ols_fit)

In [None]:
lm_cats <- train(SalePrice~TotBathrooms+SaleCondition+GarageArea+
                   KitchenQual+GrLivArea+TotalBsmtSF+OverallCond+OverallQual+
                   BldgType+Condition1+MSZoning,
                 data=train_df, 
                 method="lm",
                 trControl=controlParameter)
cats_fit <- lm_cats$finalModel

In [None]:
summary(cats_fit)
plot(cats_fit)

In [None]:
lm_forward <- train(SalePrice~., 
                    data=train_df,
                    method='leapForward',
                    trControl=controlParameter,
                    tuneGrid = expand.grid(nvmax = seq(1, 180, 1)))
fwd_fit <- lm_forward$finalModel

In [None]:
lm_forward$results[lm_forward$bestTune[1,1],]

In [None]:
lm_backward <- train(SalePrice~., 
                     data=train_df,
                     method='leapBackward', 
                     trControl=controlParameter, 
                     tuneGrid = expand.grid(nvmax = seq(1, 180, 1)))
bwd_fit <- lm_backward$finalModel

In [None]:
lambdas <- 10^seq(-1, -5, length.out = 100) # This NaNs after like 400
ridgeGrid <- expand.grid(alpha=0, lambda=lambdas)
lm_ridge <- train(SalePrice~.,
                  data=train_df,
                  method='glmnet',
                  trControl=controlParameter,
                  tuneGrid=ridgeGrid)
ridge_fit <- lm_ridge$finalModel
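
Setting `alpha=0` in `glmnet` gives ridge regression. To build some intuition for what the penalty does, here's a minimal sketch using the textbook closed-form ridge estimate on synthetic data. Note that `glmnet` standardizes predictors and scales its penalty differently, so the `lambda` values here are not comparable to the grid above; `ridge_coef` is a hypothetical helper.

```r
set.seed(564)

# Synthetic, standardized predictors with two nearly collinear columns
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # almost a copy of x1
x3 <- rnorm(n)
X  <- scale(cbind(x1, x2, x3))
y  <- as.numeric(2 * x1 + x3 + rnorm(n))
y  <- y - mean(y)                # centering removes the need for an intercept

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

penalties <- c(0, 1, 10, 100)
coefs <- sapply(penalties, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", penalties)
round(coefs, 3)

# The overall size of the coefficient vector shrinks as lambda grows
coef_norms <- sqrt(colSums(coefs^2))
coef_norms
```

With `lambda = 0` this is ordinary least squares, where the two near-duplicate columns can receive unstable coefficients; increasing the penalty pulls the whole coefficient vector toward zero.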

In [None]:
lambdas <- 10^seq(-2, -5, length.out = 300) # Opt lambda probably between .00001 and .01
lassoGrid <- expand.grid(alpha=1, lambda=lambdas)
lm_lasso <- train(SalePrice~.,
                  data=train_df,
                  method='glmnet',
                  trControl=controlParameter,
                  tuneGrid=lassoGrid)
lasso_fit <- lm_lasso$finalModel

In [None]:
# Reuses the `lambdas` grid defined in the LASSO cell above
elasGrid <- expand.grid(alpha=seq(0, 1, length.out=21), lambda=lambdas)
lm_elas <- train(SalePrice~.,
                 data=train_df,
                 method='glmnet',
                 trControl=controlParameter,
                 tuneGrid=elasGrid)
elas_fit <- lm_elas$finalModel

In [None]:
treeGrid <- expand.grid(cp=10^seq(-5,-3, length=101))
tree_cp <- train(SalePrice~.,
                 data=train_df,
                 method='rpart',
                 trControl=controlParameter,
                 tuneGrid=treeGrid)
# Zeroing in on the optimal value
treeFineGrid <- expand.grid(cp=seq(0.0002,.0004, length=101))
tree_cp <- train(SalePrice~.,
                 data=train_df,
                 method='rpart',
                 trControl=controlParameter,
                 tuneGrid=treeFineGrid)
tree_fit <- tree_cp$finalModel

In [None]:
lm_ols_pred <- predict(lm_ols,train_df)
lm_cats_pred <- predict(lm_cats,train_df)
lm_forward_pred <- predict(lm_forward,train_df)
lm_backward_pred <- predict(lm_backward,train_df)
lm_lasso_pred <- predict(lm_lasso,train_df)
lm_ridge_pred <- predict(lm_ridge,train_df)
lm_elas_pred <- predict(lm_elas,train_df)
tree_cp_pred <- predict(tree_cp,train_df)

In [None]:
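# Use prediction minus actual for every model so residual signs are comparable
# (lm's own $residuals are actual minus fitted, the opposite sign)
ols_res <- lm_ols_pred - train_df$SalePrice
cats_res <- lm_cats_pred - train_df$SalePrice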
fwd_res <- lm_forward_pred - train_df$SalePrice
bwd_res <- lm_backward_pred - train_df$SalePrice
lasso_res <- lm_lasso_pred - train_df$SalePrice
ridge_res <- lm_ridge_pred - train_df$SalePrice
elas_res <- lm_elas_pred - train_df$SalePrice
tree_res <- tree_cp_pred - train_df$SalePrice

In [None]:
# Training-set error: these predictions are on the same data the models were fit to
lm_rmsle <- rmsle(lm_ols_pred, train_df$SalePrice)
lm_cats_rmsle <- rmsle(lm_cats_pred, train_df$SalePrice)
lm_forward_rmsle <- rmsle(lm_forward_pred, train_df$SalePrice)
lm_backward_rmsle <- rmsle(lm_backward_pred, train_df$SalePrice)
lm_lasso_rmsle <- rmsle(lm_lasso_pred, train_df$SalePrice)
lm_ridge_rmsle <- rmsle(lm_ridge_pred, train_df$SalePrice)
lm_elas_rmsle <- rmsle(lm_elas_pred, train_df$SalePrice)
tree_cp_rmsle <- rmsle(tree_cp_pred, train_df$SalePrice)

In [None]:
rmsle_scores <- c(lm_rmsle, lm_cats_rmsle, lm_forward_rmsle,
                  lm_backward_rmsle, lm_ridge_rmsle, lm_lasso_rmsle,
                  lm_elas_rmsle, tree_cp_rmsle)
names(rmsle_scores) <- c("OLS_Full", "OLS_Manual", "OLS_Forward",
                         "OLS_Backward", "Ridge", "LASSO",
                         "Elastic","Tree_CP")

In [None]:
best_lm_ols <- lm_ols$results[as.numeric(rownames(lm_ols$bestTune)),]
best_lm_cats <- lm_cats$results[as.numeric(rownames(lm_cats$bestTune)),]
best_lm_forward <- lm_forward$results[as.numeric(rownames(lm_forward$bestTune)),]
best_lm_backward <- lm_backward$results[as.numeric(rownames(lm_backward$bestTune)),]
best_lm_ridge <- lm_ridge$results[as.numeric(rownames(lm_ridge$bestTune)),]
best_lm_lasso <- lm_lasso$results[as.numeric(rownames(lm_lasso$bestTune)),]
best_lm_elastic <- lm_elas$results[as.numeric(rownames(lm_elas$bestTune)),]
best_tree_cp <- tree_cp$results[as.numeric(rownames(tree_cp$bestTune)),]
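
The eight lines above all follow the same pattern, so they could be folded into a small helper. Here's a sketch; `best_cv_row` is a hypothetical name, and the two mock objects only mimic the `results` and `bestTune` fields used above rather than being real `caret` fits.

```r
# Hypothetical helper: return the row of `results` selected by `bestTune`,
# using the same row-name indexing as the cell above
best_cv_row <- function(model) {
  model$results[as.numeric(rownames(model$bestTune)), ]
}

# Mock objects with made-up numbers (not real caret output)
mock_a <- list(
  results  = data.frame(nvmax = 1:3, RMSE = c(0.20, 0.15, 0.17),
                        RMSESD = c(0.02, 0.01, 0.03)),
  bestTune = data.frame(nvmax = 2, row.names = "2")
)
mock_b <- list(
  results  = data.frame(cp = c(0.001, 0.01), RMSE = c(0.22, 0.25),
                        RMSESD = c(0.02, 0.02)),
  bestTune = data.frame(cp = 0.001, row.names = "1")
)

models <- list(Forward = mock_a, Tree = mock_b)
best   <- lapply(models, best_cv_row)

# One row per method with its best cross-validated RMSE and spread
cv_summary <- data.frame(
  method  = names(best),
  rmse    = sapply(best, function(b) b$RMSE),
  rmse_sd = sapply(best, function(b) b$RMSESD),
  row.names = NULL
)
cv_summary
```

With the real fits, something like `lapply(list(OLS_Full = lm_ols, Ridge = lm_ridge), best_cv_row)` should give the same rows as the individual assignments.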

In [None]:
# Collect the best cross-validated RMSE and its standard deviation per method
best_rows <- list(best_lm_ols, best_lm_cats, best_lm_forward, best_lm_backward,
                  best_lm_ridge, best_lm_lasso, best_lm_elastic, best_tree_cp)
cv_results <- data.frame(method = names(rmsle_scores),
                         rmse = sapply(best_rows, function(b) b$RMSE),
                         rmse_sd = sapply(best_rows, function(b) b$RMSESD))

In [None]:
ggplot(cv_results, aes(x=method, y=rmse)) + 
         geom_dotplot(binaxis = 'y', stackdir = 'center') +
         geom_errorbar(aes(ymin=rmse-rmse_sd, ymax=rmse+rmse_sd), width=.2,
                                  position=position_dodge(.0)) +
         xlab("Method") +
         ylab("Cross-Validation RMSE")

In [None]:
residuals <- data.frame(id = seq(1, length(ols_res)),
                        OLS_Full=ols_res,
                        OLS_Manual=cats_res,
                        OLS_Forward=fwd_res,
                        OLS_Backward=bwd_res,
                        Ridge=ridge_res,
                        LASSO=lasso_res,
                        Elastic=elas_res,
                        Tree=tree_res)
res_melt <- melt(residuals, id.vars = "id")

In [None]:
ggplot(res_melt, aes(x=id, y=value, color=variable)) + 
  geom_point(alpha=0.3, size=0.75) +
  # The points are mapped to colour (not fill), so the legend title goes here
  scale_colour_manual(name="Model",
                      values=c("red", "blue", "green", "orange",
                               "gray", "brown", "black", "purple")) +
  xlab("Observation Index") +
  ylab("Residual Value")