# INFO F422 - Statistical Foundations of Machine Learning
## Project "House Prices: Advanced Regression Techniques"

    Erica Berghman
    Master 1 - Brussels Engineer School

## Abstract

## 1. Introduction 

* with dataset description, goals, and an overview of the report structure 

Starting from a data set with 81 criteria about houses and their selling price, the goal is to create a model capable of predicting the price of other houses given some of these criteria. A good model is one that has been refined multiple times. This report presents the methodology used to construct a model for this particular problem. It is based on the methodology of Chapter 6 of the syllabus.

## 2. Preprocessing 

In [40]:
dataSample = 400
mean = TRUE       # TRUE: replace the NA values with the mean; FALSE: use the median
set.seed(2)

source("function/replaceNA.R")
# Hide warnings
options(warn=-1)

### 2.1 Preprocessing the data set

In order to get a model, the data must first be preprocessed. We read the given data and take a random sample of 400 houses out of the 1460. There are 81 criteria.

In [41]:
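data<-read.csv("input/train.csv", stringsAsFactors=TRUE)   # text columns must be factors for the next step (not the default since R 4.0)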
data.sample<-data[sample(nrow(data),dataSample),]
#dim(data.sample)
#data[1:2,]

The categorical (factor) criteria are removed, except for CentralAir, Street and LotShape, which are one-hot encoded and kept.

In [42]:
factor_variables<-which(sapply(data.sample[1,],class)=="factor")
data.sample.nofactor<-data.sample[,-factor_variables]
data.sample.factor<-data.sample[,factor_variables]
#summary(data.sample.factor)

library(dummies)
variable_to_keep<-c("CentralAir", "Street", "LotShape")
data_factor_onehot <- dummy.data.frame(data.sample.factor[,variable_to_keep], sep="_")
data.nofactor.extended<-cbind(data.sample.nofactor,data_factor_onehot)
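
As a hedged illustration of the one-hot encoding step, the sketch below reproduces the idea in base R with `model.matrix` instead of `dummy.data.frame`. The toy values and the helper name `one_hot` are assumptions made for this example, and rows containing NA would need to be handled beforehand, since `model.matrix` drops them under the default `na.action`.

```r
# Toy data frame with two categorical columns (values made up for illustration)
toy <- data.frame(
  CentralAir = factor(c("Y", "N", "Y")),
  Street     = factor(c("Pave", "Pave", "Grvl"))
)

# Hypothetical helper: one indicator column per level of each factor.
# "- 1" removes the intercept so that no level is absorbed into a baseline.
one_hot <- function(df) {
  encoded <- lapply(names(df), function(name) {
    column <- df[[name]]
    m <- model.matrix(~ column - 1)
    colnames(m) <- paste(name, levels(column), sep = "_")
    m
  })
  as.data.frame(do.call(cbind, encoded))
}

toy_onehot <- one_hot(toy)
print(toy_onehot)
```

Each row has exactly one 1 per original factor, and the column names follow a `name_level` pattern similar to the one requested with `sep="_"` above.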


#### Missing data 
The missing values (NA) are replaced by an estimate of these values (e.g. the mean or the median of the column).

In [43]:
if (mean) {
    data_preprocessed<-data.frame(apply(data.nofactor.extended,2,replace_na_with_mean_value)) 
} else {
    data_preprocessed<-data.frame(apply(data.nofactor.extended,2,replace_na_with_median_value))
}
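
The helpers `replace_na_with_mean_value` and `replace_na_with_median_value` come from `function/replaceNA.R`, which is not reproduced in this report. The sketch below shows what such helpers might look like; the names `impute_mean_sketch` and `impute_median_sketch` and the toy data are hypothetical stand-ins, not the actual file contents.

```r
# Hypothetical stand-ins for the helpers sourced from function/replaceNA.R
impute_mean_sketch <- function(vec) {
  vec[is.na(vec)] <- mean(vec, na.rm = TRUE)    # fill NA with the column mean
  vec
}

impute_median_sketch <- function(vec) {
  vec[is.na(vec)] <- median(vec, na.rm = TRUE)  # fill NA with the column median
  vec
}

# Toy numeric data with one missing value per column (made up for illustration)
toy <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))

# Same calling pattern as in the cell above: apply the helper column by column
toy_filled <- data.frame(apply(toy, 2, impute_mean_sketch))
print(toy_filled)
```

The median variant is less sensitive to extreme values, which may matter for skewed criteria such as surface areas.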

## 3. Feature selection 
Methodology and main results

The text must contain the list of selected variables and the motivation of their choice. The use of formulas, tables and pseudo-code to describe the feature selection procedure is encouraged. 

#### Redundant and irrelevant features 

The "Id" column, which carries no information about the price, is deleted.

In [81]:
data_preprocessed<-data_preprocessed[,setdiff(colnames(data_preprocessed),"Id")]

The redundant criteria (linear combinations of other criteria, or criteria whose absolute correlation with another criterion exceeds 0.99) are deleted.

In [45]:
library(caret)
library(ggplot2)
library(lattice)

linearCombo.idx <- findLinearCombos(data_preprocessed)$remove
if (!is.null(linearCombo.idx)) data_preprocessed<-data_preprocessed[,-linearCombo.idx]

correlation.matrix <- cor(data_preprocessed)
correlation.matrix[upper.tri(correlation.matrix)] <- 0
diag(correlation.matrix) <- 0
data.uncorrelated <- data_preprocessed[,!apply(correlation.matrix,2,function(x) any(abs(x) > 0.99))]
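
To make the correlation-based pruning easier to follow, here is a small sketch on made-up data (the columns `x1`, `x2`, `x3` are assumptions for illustration). Zeroing the upper triangle and the diagonal means each pair of columns is inspected only once, so only one member of a highly correlated pair is dropped.

```r
set.seed(1)
# Toy data: x3 is an exact linear function of x1, x2 is independent noise
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$x3 <- 2 * toy$x1 + 1

cm <- cor(toy)
cm[upper.tri(cm)] <- 0   # keep each pair only once (lower triangle)
diag(cm) <- 0            # ignore the correlation of a column with itself

# A column is dropped when it is highly correlated with a later column
keep <- !apply(cm, 2, function(x) any(abs(x) > 0.99))
print(names(toy)[keep])
```

With this layout, the earlier column of a correlated pair is the one removed, and the later one is kept.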

In [46]:
X <- data.uncorrelated[,setdiff(colnames(data.uncorrelated),"SalePrice")]
Y <- data.uncorrelated[,"SalePrice"]
X <- data.frame(X)
#Y <- data.frame(Y)
X.scale <- data.frame(scale(X))
Y.scale <- scale(Y)

N<-nrow(X)    #Number of examples
n<-ncol(X)    #Number of input variables
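
Since the target is standardised with `scale`, predictions made on that scale have to be converted back before they can be compared with real prices. A minimal sketch of that back-transformation, using made-up values and the attributes that `scale` stores on its result:

```r
# Toy target values (made up for illustration)
y <- c(100, 200, 300, 400)
y_scaled <- scale(y)

# scale() keeps the centring and scaling constants as attributes
center <- attr(y_scaled, "scaled:center")
spread <- attr(y_scaled, "scaled:scale")

# Invert the standardisation: multiply by the scale, then add the centre back
y_back <- as.numeric(y_scaled) * spread + center
print(y_back)
```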

Two feature selection methods are implemented in the featureSelection file:

**1. Filter method using correlation with the variable to predict**

   It creates a subset of features by removing from the full feature set those least likely to help predict the target variable (SalePrice). It is robust to overfitting and computationally cheap. However, it may select redundant variables, since the interactions between variables are not taken into account.

**2. Wrapper method**

   It is an iterative method in which a candidate subset of variables is evaluated by the learning algorithm and then modified according to the result. This is repeated until the best subset found is reached.

The filter method is used to select a first "big" set of features, which is then refined by the wrapper method. This makes it possible to combine the advantages of both methods and obtain a good subset in a reasonable computation time.
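
The combination of the two methods can be sketched as follows. This is only an illustration under stated assumptions, not the content of `function/featureSelection.R`: the function names `filter_rank`, `cv_mse` and `forward_wrapper`, the linear model used as learner, the 5-fold cross-validation and the stopping rule are all hypothetical choices, and the data are synthetic.

```r
set.seed(42)
# Synthetic data: only x1 and x2 drive y (assumption for illustration)
n_obs <- 120
X_toy <- data.frame(matrix(rnorm(n_obs * 6), ncol = 6))
names(X_toy) <- paste0("x", 1:6)
y_toy <- 3 * X_toy$x1 - 2 * X_toy$x2 + rnorm(n_obs, sd = 0.5)

# Filter step: rank features by absolute correlation with the target
filter_rank <- function(X, y, n_keep) {
  scores <- abs(cor(X, y))[, 1]
  names(sort(scores, decreasing = TRUE))[seq_len(n_keep)]
}

# k-fold cross-validated mean squared error of a linear model
cv_mse <- function(X, y, features, k = 5) {
  folds <- rep(seq_len(k), length.out = nrow(X))
  errors <- sapply(seq_len(k), function(fold) {
    test <- folds == fold
    train_df <- data.frame(X[!test, features, drop = FALSE], y = y[!test])
    fit <- lm(y ~ ., data = train_df)
    pred <- predict(fit, newdata = X[test, features, drop = FALSE])
    mean((y[test] - pred)^2)
  })
  mean(errors)
}

# Wrapper step: greedy forward selection, stopping when no
# remaining candidate improves the cross-validated error
forward_wrapper <- function(X, y, candidates) {
  selected <- character(0)
  best_err <- Inf
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    errs <- sapply(remaining, function(f) cv_mse(X, y, c(selected, f)))
    if (min(errs) >= best_err) break
    best_err <- min(errs)
    selected <- c(selected, remaining[which.min(errs)])
  }
  selected
}

candidates <- filter_rank(X_toy, y_toy, n_keep = 4)    # first "big" set
selected <- forward_wrapper(X_toy, y_toy, candidates)  # refined subset
print(selected)
```

The filter step is cheap because it needs one correlation per feature, whereas the wrapper step refits the model for every candidate at every round, which is why it is applied only to the pre-filtered set.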

In [91]:
source("function/featureSelection.R")

In [None]:
features.filtre <- filtre(X.scale,Y.scale)

In [None]:
features.mrmr <- mrmr(X.scale, Y.scale)

In [68]:
features.pca <- pca(X.scale, Y.scale)
#features.pca <- features.pca[,features.pca$nbfeatures]
#colnames(X[features.pca])
#colnames(X[features.pca[1:18]])

 [1] "# features:  1   CV error =  0.2335   std dev =  0.1119"     
 [2] "# features:  2   CV error =  0.2357   std dev =  0.1134"     
 [3] "# features:  3   CV error =  0.2074   std dev =  0.0958"     
 [4] "# features:  4   CV error =  0.2101   std dev =  0.0961"     
 [5] "# features:  5   CV error =  0.2136   std dev =  0.0981"     
 [6] "# features:  6   CV error =  0.2219   std dev =  0.1057"     
 [7] "# features:  7   CV error =  0.219   std dev =  0.119"       
 [8] "# features:  8   CV error =  0.212   std dev =  0.108"       
 [9] "# features:  9   CV error =  0.2147   std dev =  0.1119"     
[10] "# features:  10   CV error =  0.2182   std dev =  0.1125"    
[11] "# features:  11   CV error =  0.206   std dev =  0.0964"     
[12] "# features:  12   CV error =  0.2133   std dev =  0.1024"    
[13] "# features:  13   CV error =  0.1939   std dev =  0.0794"    
[14] "# features:  14   CV error =  0.2088   std dev =  0.1123"    
[15] "# features:  15   CV error =  0.2095   std

In [19]:
features.wrapper <- wrapper(X.scale, Y.scale)
colnames(X[features.wrapper])

[1] "Round  1  ; Selected feature:  4  ; CV error =  0.3546  ; std dev =  0.1468"
[1] "Round  2  ; Selected feature:  12  ; CV error =  0.2557  ; std dev =  0.111"
[1] "Round  3  ; Selected feature:  13  ; CV error =  0.2035  ; std dev =  0.0739"
[1] "Round  4  ; Selected feature:  9  ; CV error =  0.1711  ; std dev =  0.062"
[1] "Round  5  ; Selected feature:  7  ; CV error =  0.1569  ; std dev =  0.0602"
[1] "Round  6  ; Selected feature:  2  ; CV error =  0.1489  ; std dev =  0.0534"
[1] "Round  7  ; Selected feature:  19  ; CV error =  0.1434  ; std dev =  0.0492"
[1] "Round  8  ; Selected feature:  11  ; CV error =  0.1374  ; std dev =  0.0436"
[1] "Round  9  ; Selected feature:  20  ; CV error =  0.1352  ; std dev =  0.0431"
[1] "Round  10  ; Selected feature:  24  ; CV error =  0.1335  ; std dev =  0.0422"
[1] "Round  11  ; Selected feature:  31  ; CV error =  0.132  ; std dev =  0.0449"
[1] "Round  12  ; Selected feature:  38  ; CV error =  0.1309  ; std dev =  0.0416"
[1] "Rou

In [None]:
features.wrapper.pca <- wrapper(X.pca, Y.pca)
colnames(X[features.wrapper.pca])

## 4. Model selection
Methodology and main results

For the learning method, the only packages that may be used are those seen during the exercise classes : stats, nnet, tree, lazy, and e1071, for linear models, neural networks, decision trees, nearest neighbours and SVM, respectively.

The accuracy of the regression models during the selection process should be assessed by using the root mean squared error between the logarithm of the
predicted value and the logarithm of the observed sale price.
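
A minimal sketch of this metric in R, assuming strictly positive prices; the function name `rmse_log` and the example values are made up for illustration:

```r
# Root mean squared error between the logarithms of predicted and observed prices
rmse_log <- function(predicted, observed) {
  sqrt(mean((log(predicted) - log(observed))^2))
}

observed  <- c(200000, 150000, 320000)   # made-up sale prices
predicted <- c(210000, 140000, 300000)   # made-up model outputs
print(rmse_log(predicted, observed))
```

Working on logarithms means the error depends on the ratio between predicted and observed prices, so a given relative error counts the same for cheap and expensive houses.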

The text must mention the different (and at least three) models which have been taken into consideration and the procedure used for model assessment and selection. The use of formulas, tables and pseudo-code to describe the feature selection procedure is encouraged.

## 5. Ensemble techniques: Combination of models strategy
Methodology and main results

The text should mention the different models taken into consideration as well as the techniques used for the combination.

## 6. Discussion and conclusion
Summary of your work, and discussion of what worked well, not well, why, what insights you got from the analyses you made. 