# INFO-F-422 Statistical foundations of machine learning. Kaggle project report. Aldar Saranov.

## Introduction
This project implements an apartment price predictor for the Kaggle competition. It is written in R and divided into several parts: feature selection, model selection, model combination, model evaluation and result production.

## Feature selection
Many features (attributes) are expected to contribute to the predicted value, but some of them are uninformative or redundant. To select the useful features one can apply either **filter methods** (which score each feature independently) or **wrapper methods** (which evaluate subsets of features). For the sake of simplicity we use a filter method. As the measure of feature usefulness we use information gain (IG), which is based on the notion of entropy. Entropy characterizes the degree of disorder of a system.

$$H = - \sum_{i=1}^{n} p_i \log_b (p_i), \quad \text{where } p_i \text{ is the probability of the } i\text{-th outcome}$$
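
As a minimal sketch of the formula above, entropy can be estimated from the empirical frequencies of a sample. The helper name `entropy_of` and the toy vectors are assumptions made for illustration; they are not part of the project code or of FSelector.

```r
# Shannon entropy of a discrete sample, estimated from empirical frequencies.
# `entropy_of` is a hypothetical helper name used only for this illustration.
entropy_of <- function(x, base = 2) {
  p <- as.numeric(table(x)) / length(x)  # empirical probabilities p_i
  p <- p[p > 0]                          # skip empty levels: 0 * log(0) is taken as 0
  -sum(p * log(p, base = base))
}

# Toy samples (made up for illustration)
fair_coin <- c("H", "T", "H", "T")
constant  <- c("H", "H", "H", "H")

print(entropy_of(fair_coin))  # 1 bit: two equally likely outcomes
print(entropy_of(constant))   # 0 bits: the outcome is certain
```

With `base = 2` the result is measured in bits; another base only rescales the value.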

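The information gain of a transition from state 1 to state 2 is the resulting reduction in entropy. For feature selection, state 1 is the target before the feature is known and state 2 is the target once the feature value is known.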

$$IG(1\rightarrow2) = H(1) - H(2)$$
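
The difference above can be illustrated on toy data. This sketch takes $H(2)$ to be the entropy of the target within each feature value, weighted by group size; that reading is an assumption. The names `conditional_entropy` and `info_gain` and the data are hypothetical. FSelector may discretize continuous attributes before computing its own score, so its numbers are not expected to match a hand computation like this one.

```r
# Entropy in bits, estimated from empirical frequencies (hypothetical helper)
entropy_of <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# H(2): entropy of the target that remains after grouping by the feature,
# computed as per-group entropies weighted by relative group size
conditional_entropy <- function(target, feature) {
  groups  <- split(target, feature)
  weights <- sapply(groups, length) / length(target)
  sum(weights * sapply(groups, entropy_of))
}

# IG(1 -> 2) = H(1) - H(2)
info_gain <- function(target, feature) {
  entropy_of(target) - conditional_entropy(target, feature)
}

# Toy data (made up for illustration)
price_band <- c("low", "low", "high", "high")
quality    <- c("poor", "poor", "good", "good")  # determines the band exactly
street     <- c("a", "b", "a", "b")              # unrelated to the band

print(info_gain(price_band, quality))  # 1: the feature removes all uncertainty
print(info_gain(price_band, street))   # 0: the feature removes none
```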

After computing this value for every feature separately, we can rank the features and keep the most important ones.

**Warning!** To run the code, download the whole repository including the data files, correct the working directory passed to `setwd()`, and install the libraries used in the Jupyter notebook.
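
The ranking step can be sketched in base R without loading any data. The gain values below are a handful copied from the output table in this report, and the column name `attr_importance` is assumed to mirror the one FSelector uses for its result. The value of `k` is an arbitrary illustrative choice.

```r
# A few gains copied from the output table of this report
gains <- data.frame(
  attr_importance = c(0.256616139, 0.111111546, 0.004013414,
                      0.456972744, 0.492201116),
  row.names = c("MSSubClass", "MSZoning", "Street",
                "Neighborhood", "OverallQual")
)

# Rank from most to least informative; drop = FALSE keeps the data frame shape
ranked <- gains[order(gains$attr_importance, decreasing = TRUE), , drop = FALSE]

# Option 1: keep the k best features (k chosen only for illustration)
k <- 3
top_k <- head(row.names(ranked), k)

# Option 2: keep every feature whose gain exceeds a threshold (0.1 in this report)
above_threshold <- row.names(ranked)[ranked$attr_importance > 0.1]

print(top_k)            # "OverallQual" "Neighborhood" "MSSubClass"
print(above_threshold)  # "OverallQual" "Neighborhood" "MSSubClass" "MSZoning"
```

A fixed `k` gives a predictable model size, while a threshold adapts the number of features to how informative they are.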

In [2]:
library(FSelector)   #load the feature-selection library

setwd('D:/kaggle')  #TO-MODIFY sets the default folder; adjust it to your own directory path

#takes a dataframe and returns the features with satisfactory information gain
feature_selector <- function(data, threshold = 0.1)
{
  #calculate the information gain of each feature with respect to SalePrice
  gains <- information.gain(SalePrice ~ ., data)

  #select every feature with IG higher than the threshold
  keep <- gains[, 1] > threshold

  #result dataframe
  result <- data.frame(feature_name = row.names(gains)[keep],
                       feature_gain = gains[keep, 1])

  #print the selected features
  print(result)

  return(result)
}

#load the train data
train = read.csv("./train.csv", header = TRUE)

#do feature selection
features = feature_selector(train)

    feature_name feature_gain
1             Id  0.000000000
2     MSSubClass  0.256616139
3       MSZoning  0.111111546
4    LotFrontage  0.154759204
5        LotArea  0.144319711
6         Street  0.004013414
7          Alley  0.027327713
8       LotShape  0.059230763
9    LandContour  0.017587338
10     Utilities  0.001103294
11     LotConfig  0.020610863
12     LandSlope  0.005923644
13  Neighborhood  0.456972744
14    Condition1  0.044644384
15    Condition2  0.010496430
16      BldgType  0.039645124
17    HouseStyle  0.096955141
18   OverallQual  0.492201116
19   OverallCond  0.085166120
20     YearBuilt  0.292232271
21  YearRemodAdd  0.211247883
22     RoofStyle  0.029062897
23      RoofMatl  0.013542868
24   Exterior1st  0.143090420
25   Exterior2nd  0.140845800
26    MasVnrType  0.102788416
27    MasVnrArea  0.103144856
28     ExterQual  0.283742148
29     ExterCond  0.020143031
30    Foundation  0.193365132
31      BsmtQual  0.287010471
32      BsmtCond  0.050151243
33  BsmtEx

## References
[GitHub repository][1]

[1]: https://github.com/ElderMayday/kaggle

In [3]:
train = read.csv("./train.csv", header = TRUE)
print(train)

   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
1   1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
2   2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
3   3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
4   4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
5   5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
6   6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
7   7         20       RL          75   10084   Pave  <NA>      Reg         Lvl
8   8         60       RL          NA   10382   Pave  <NA>      IR1         Lvl
9   9         50       RM          51    6120   Pave  <NA>      Reg         Lvl
10 10        190       RL          50    7420   Pave  <NA>      Reg         Lvl
   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
1     AllPub    Inside       Gtl      Collg

In [4]:
version

               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          3.2                         
year           2016                        
month          10                          
day            31                          
svn rev        71607                       
language       R                           
version.string R version 3.3.2 (2016-10-31)
nickname       Sincere Pumpkin Patch       