# INFO-F-422 Statistical foundations of machine learning. Kaggle project report. Aldar Saranov.

# Abstract


## Introduction
This project implements apartment price prediction for a Kaggle competition. It is written in R and divided into several parts: feature selection, model selection, model combination, model evaluation and result production. The report contains the main snippets of the project; the rest can be found in the [GitHub repository][1].


[1]: https://github.com/ElderMayday/kaggle

## Feature selection
Many features (attributes) are expected to contribute to the predicted value, but some of them are uninformative or redundant. To select the useful features one can apply either **filter methods** (which score each feature independently) or **wrapper methods** (which evaluate subsets of features). For simplicity we use a filter method. As the measure of feature usefulness we use information gain (IG), which is based on the notion of entropy. Entropy characterizes the degree of disorder of a system.

$$H = - \sum_{i=1}^{n} p_i \log_b (p_i), \quad \text{where } p_i \text{ is the probability of the } i\text{-th outcome}$$
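
As a quick illustration of the formula, the following base R sketch estimates the entropy of a discrete sample from its empirical frequencies. The helper name `entropy` and the choice of base 2 (entropy in bits) are assumptions made for this example, not part of the project code.

```r
# Shannon entropy of a discrete sample (illustrative helper, not from the project).
# The base of the logarithm defaults to 2, so the result is measured in bits.
entropy <- function(x, base = 2) {
  p <- table(x) / length(x)      # empirical probability of each outcome
  p <- p[p > 0]                  # skip empty levels to avoid 0 * log(0)
  -sum(p * log(p, base = base))
}

h_balanced <- entropy(c("a", "a", "b", "b"))  # two equally likely outcomes
h_constant <- entropy(c("a", "a", "a", "a"))  # a single outcome, no disorder

print(h_balanced)
print(h_constant)
```

A sample with two equally likely outcomes has an entropy of one bit, while a constant sample has zero entropy.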

The information gain of a transition from state 1 to state 2 is the resulting decrease in entropy:

$$IG(1\rightarrow2) = H(1) - H(2)$$
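
One common way to read this definition for a single feature is to take state 1 as the target before the feature is known and state 2 as the target after the data are split by the feature's values. The sketch below follows that reading in base R. The helpers `entropy` and `info_gain` and the toy vectors are hypothetical, and the result is not guaranteed to match what a library such as FSelector reports on real data.

```r
# Entropy in bits of a discrete sample (illustrative helper).
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a discrete feature x with respect to a discrete target y:
# H(y) minus the weighted average entropy of y within each value of x.
info_gain <- function(x, y) {
  groups  <- split(y, x)                          # target values per feature value
  weights <- sapply(groups, length) / length(y)   # share of rows in each group
  h_after <- sum(weights * sapply(groups, entropy))
  entropy(y) - h_after
}

# Toy data, invented for this example only.
price_class <- c("low", "low", "high", "high")
quality     <- c("poor", "poor", "good", "good")  # separates the classes perfectly
coin        <- c("h", "t", "h", "t")              # carries no information

ig_quality <- info_gain(quality, price_class)
ig_coin    <- info_gain(coin, price_class)

print(ig_quality)
print(ig_coin)
```

A feature that separates the classes perfectly removes all of the entropy, so its gain equals the entropy of the target. A feature that is independent of the target has a gain of zero.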

After computing this value for every feature separately we can rank the features and keep the most important ones.

**Warning!** To run the code, download the whole repository with the data files, correct the folder path passed to `setwd()`, and install the libraries used by the Jupyter notebook.
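
The ranking step itself needs only base R. The sketch below orders a small table of gains and keeps the top `k` entries. The four gain values are copied from the output listed further down in this section; the variable names and the choice `k = 2` are arbitrary assumptions for illustration.

```r
# A few (feature, gain) pairs taken from the output reported in this section,
# deliberately listed out of order.
gains <- data.frame(
  feature_name = c("GrLivArea", "OverallQual", "GarageCars", "Neighborhood"),
  feature_gain = c(0.3825791, 0.4922011, 0.3281534, 0.4569727),
  stringsAsFactors = FALSE
)

# Rank by decreasing gain and keep the k best features (k chosen arbitrarily).
k      <- 2
ranked <- gains[order(gains$feature_gain, decreasing = TRUE), ]
top_k  <- head(ranked, k)

print(top_k$feature_name)
```

The project code below uses a fixed threshold on the gain rather than a fixed `k`; both are ways of cutting the same ranked list.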

In [1]:
library(FSelector)   #load the feature-selection library

setwd('D:/kaggle')  #TO-MODIFY: set the working directory to your local copy of the repository

#takes dataframe and shows the features with satisfactory information gain
feature_selector <- function(data)
{
  #calculate the information gain of each feature
  features <- information.gain(SalePrice~., data)
  
  #result dataframe
  result = data.frame()
  
  row_names = row.names(features)
  
  #select every feature with IG higher than 0.1
  for (i in seq_len(nrow(features)))
  {
    if (features[i, 1] > 0.1)
    {
      result <- rbind(result, data.frame("feature_name"= row_names[i], "feature_gain" = features[i, 1]))
    }
  }
  
  #order the features
  result = result[rev(order(result$feature_gain)),]

  #print the selected features
  print(result)
  
  return(result)
}

#load the train data
train = read.csv("./train.csv", header = TRUE)

#do feature selection
features = feature_selector(train)

"package 'FSelector' was built under R version 3.3.3"

   feature_name feature_gain
6   OverallQual    0.4922011
5  Neighborhood    0.4569727
22    GrLivArea    0.3825791
31   GarageCars    0.3281534
32   GarageArea    0.3210374
7     YearBuilt    0.2922323
15     BsmtQual    0.2870105
13    ExterQual    0.2837421
24  KitchenQual    0.2770460
18  TotalBsmtSF    0.2765359
29  GarageYrBlt    0.2569508
1    MSSubClass    0.2566161
30 GarageFinish    0.2501103
20    X1stFlrSF    0.2394652
23     FullBath    0.2339366
28   GarageType    0.2144176
8  YearRemodAdd    0.2112479
27  FireplaceQu    0.1963499
14   Foundation    0.1933651
25 TotRmsAbvGrd    0.1731205
21    X2ndFlrSF    0.1728490
3   LotFrontage    0.1547592
26   Fireplaces    0.1531060
16 BsmtFinType1    0.1519456
4       LotArea    0.1443197
33  OpenPorchSF    0.1438957
9   Exterior1st    0.1430904
10  Exterior2nd    0.1408458
19    HeatingQC    0.1329305
17   BsmtFinSF1    0.1155452
2      MSZoning    0.1111115
12   MasVnrArea    0.1031449
11   MasVnrType    0.1027884
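
A selected feature list like the one above can be turned into a model formula for later stages. The sketch below uses the base R function `reformulate` with the three top-ranked names from the output. It is only one possible way to pass the selection on and is not necessarily how the rest of the project consumes it; the variable names are assumptions.

```r
# The three highest-ranked features from the output above.
selected <- c("OverallQual", "Neighborhood", "GrLivArea")

# as.character() guards against the names being stored as a factor,
# which is the default for data.frame() in older R versions.
model_formula <- reformulate(as.character(selected), response = "SalePrice")

print(model_formula)
```

The resulting formula can be passed to any modelling function that accepts the formula interface, with `SalePrice` as the response.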
