In [1]:
%%bash

wget -qO ../../datasets/house_prices.csv "https://github.com/INRIA/scikit-learn-mooc/raw/master/datasets/house_prices.csv"

In [1]:
import pandas as pd
ames_housing = pd.read_csv("../../datasets/house_prices.csv", na_values="?")
ames_housing = ames_housing.drop(columns="Id")

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [5]:
data.head()

,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Normal
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,5,2007,WD,Normal
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,2,2006,WD,Abnorml
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,,,,0,12,2008,WD,Normal


Use `data.info()` and `data.head()` to check the dataframe information.

The "Non-Null Count" column shows that some columns contain missing data. A column without missing data has 1,460 non-null samples, so any column with fewer than 1,460 non-null entries contains missing values.

The "Dtype" column gives the data type of each column. We observe three different data types: "object", "int64", and "float64".

"float64" columns typically represent numerical data. "object" columns usually store string labels for categorical values. "int64" columns can represent either numerical quantities or categorical codes. To decide whether such a column should be treated as numerical or categorical, one often needs to look at its first few values or at the dataset documentation.

Therefore this dataset has both numerical (e.g. "LotFrontage") and categorical (e.g. "BldgType") features.
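
The checks described above can also be done programmatically rather than by reading the `data.info()` output. The following is a minimal sketch on a toy dataframe: the name `toy` and its values are made up for illustration, and the same calls would apply to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "BldgType": ["1Fam", "1Fam", "TwnhsE", "1Fam"],
})

# Number of missing values per column, keeping only columns with at least one.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Names of the columns that pandas stores with a numeric dtype.
numeric_columns = toy.select_dtypes("number").columns.tolist()
print(numeric_columns)
```

Keep in mind that a numeric dtype only says how the values are stored, not whether the column is a true quantity.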

In [6]:
data.shape

(1460, 79)

From the original dataframe, we should exclude the column "SalePrice" because it corresponds to the target that we want to predict.

The resulting dataframe named "data" has 79 columns, as shown by the `data.shape` attribute or by computing `len(data.columns)`.

You can select only the columns with data type `int64` or `float64`. The following code filters these columns:

In [10]:
data_numbers = data.select_dtypes(['integer', 'float'])
print(len(data_numbers.columns))
data_numbers.info()

36
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 36 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   LotFrontage    1201 non-null   float64
 2   LotArea        1460 non-null   int64  
 3   OverallQual    1460 non-null   int64  
 4   OverallCond    1460 non-null   int64  
 5   YearBuilt      1460 non-null   int64  
 6   YearRemodAdd   1460 non-null   int64  
 7   MasVnrArea     1452 non-null   float64
 8   BsmtFinSF1     1460 non-null   int64  
 9   BsmtFinSF2     1460 non-null   int64  
 10  BsmtUnfSF      1460 non-null   int64  
 11  TotalBsmtSF    1460 non-null   int64  
 12  1stFlrSF       1460 non-null   int64  
 13  2ndFlrSF       1460 non-null   int64  
 14  LowQualFinSF   1460 non-null   int64  
 15  GrLivArea      1460 non-null   int64  
 16  BsmtFullBath   1460 non-null   int64  
 17  BsmtHalfBath   1460 non-null   int64  
 18  FullB

In this case, 36 columns will be available. Programmatically, we can also access the number of columns with:

In [29]:
len(data_numbers.columns)

36
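
Not every column with a numeric dtype is a true quantity. One rough way to spot integer columns that may encode categories is to count their distinct values. The sketch below uses a toy dataframe with made-up values, and the threshold `max_levels` is a hypothetical choice, not a rule; the dataset documentation remains the reliable source.

```python
import pandas as pd

# Toy frame with made-up values, standing in for a few integer columns.
toy = pd.DataFrame({
    "OverallQual": [7, 6, 7, 7, 8, 5],
    "OverallCond": [5, 8, 5, 5, 5, 5],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
})

# Count distinct values in each integer column.
n_unique = toy.select_dtypes("integer").nunique()

# Hypothetical threshold: flag integer columns with few distinct values
# as candidates for a closer look.
max_levels = 4
candidates = n_unique[n_unique <= max_levels].index.tolist()
print(candidates)
```

A low count is only a hint: a flagged column still has to be checked by hand before deciding how to treat it.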

"OverallQual" and "OverallCond" are ordinal categorical variables. While technically "YearBuilt" is more a date than a quantity, it is fine for machine learning models to treat it as a quantity because the year of construction is directly related to the age of the house.

Now create a predictive model that uses the numerical columns as input data. The predictive model should be a pipeline composed of a standard scaler, a mean imputer (cf. `sklearn.impute.SimpleImputer(strategy="mean")`) and a `sklearn.linear_model.LogisticRegression`.

The accuracy score obtained by 5-fold cross-validation of this pipeline is the following:

In [22]:
%%time
numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

data_numerical = data[numerical_features]

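# Scale first, then impute: StandardScaler ignores NaN when fitting and
# keeps them in its output, so the imputer fills them afterwards.
model = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy='mean'),
    LogisticRegression(),
)
# cross_validate uses 5-fold cross-validation by default and scores a
# classifier with its default metric, accuracy.
cv_results_num = cross_validate(model, data_numerical, target)
cv_results_num['test_score'].mean()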

CPU times: user 171 ms, sys: 677 ms, total: 848 ms
Wall time: 90.5 ms


0.8952054794520548

Instead of solely using the numerical columns, let us build a pipeline that can process both the numerical and categorical features together as follows:

- numerical features should be processed as previously;
- the left-out columns should be treated as categorical variables using a `sklearn.preprocessing.OneHotEncoder`;
- prior to one-hot encoding, insert the `sklearn.impute.SimpleImputer(strategy="most_frequent")` transformer to replace missing values by the most-frequent value in each column.

In [26]:
%%time
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

categorical_features = data.columns.difference(numerical_features)

categorical_processor = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore')
)
numerical_processor = make_pipeline(StandardScaler(),
    SimpleImputer(strategy='mean')
)
preprocessor = make_column_transformer(
    (numerical_processor, numerical_features),
    (categorical_processor, categorical_features)
)
model = make_pipeline(preprocessor, 
    LogisticRegression(max_iter=1000))
cv_results_all = cross_validate(model, data, target, 
    error_score='raise')
cv_results_all['test_score'].mean()

CPU times: user 640 ms, sys: 218 µs, total: 640 ms
Wall time: 639 ms


0.9191780821917808
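
To see what the categorical branch of the preprocessor does, here is a small sketch on toy data. The column values are made up, and `"Dirt"` is a hypothetical category that appears only at transform time; it illustrates how `handle_unknown='ignore'` encodes an unseen category as a row of zeros.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training and test values for a single categorical column.
train = pd.DataFrame({"Alley": ["Grvl", np.nan, "Grvl", "Pave"]}, dtype=object)
test = pd.DataFrame({"Alley": [np.nan, "Dirt"]}, dtype=object)

categorical_processor = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
categorical_processor.fit(train)

# Categories learned on the training data, after imputation.
categories = categorical_processor[-1].categories_[0].tolist()

# The missing value is replaced by the most frequent training category,
# while the unseen category "Dirt" becomes an all-zero row.
encoded = categorical_processor.transform(test).toarray()
print(categories)
print(encoded)
```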

Let us now define a **substantial** improvement or deterioration as a change in the mean test score of **at least three times the standard deviation** of the cross-validated test scores of the model using both categorical and numerical features. Here the change is the **difference of the mean test scores** between the model using numerical and categorical features together and the model using only numerical features.

We can now check the improvement:

In [27]:
cv_results_all['test_score'].mean() - cv_results_num['test_score'].mean()

0.023972602739726012

and compare it with three times the standard deviation of the test scores of the model using all features:

In [28]:
3 * cv_results_all['test_score'].std()

0.02876712328767124
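
The comparison can be wrapped in a small helper. This is a sketch only: the function name `is_substantial` and the per-fold scores below are hypothetical, chosen for illustration, and are not the scores obtained above.

```python
import numpy as np


def is_substantial(scores_ref, scores_new, n_std=3):
    """Compare the mean score difference against n_std standard deviations.

    The standard deviation is taken over the scores of the new model,
    following the definition given in the text.
    """
    scores_ref = np.asarray(scores_ref, dtype=float)
    scores_new = np.asarray(scores_new, dtype=float)
    difference = scores_new.mean() - scores_ref.mean()
    threshold = n_std * scores_new.std()
    return bool(abs(difference) >= threshold), difference, threshold


# Hypothetical per-fold accuracies, not the ones computed in this notebook.
scores_num = [0.88, 0.90, 0.89, 0.91, 0.90]
scores_all = [0.90, 0.94, 0.92, 0.91, 0.93]

substantial, difference, threshold = is_substantial(scores_num, scores_all)
print(substantial, difference, threshold)
```

With the values reported above, the difference (about 0.024) is smaller than three standard deviations (about 0.029), so under this definition the gain from adding the categorical features does not count as substantial.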