In [23]:
import pandas as pd

from sklearn.compose import make_column_selector, ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

In [4]:
ames_housing = pd.read_csv("../datasets/ames_housing_no_missing.csv")

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

We have not covered regression problems yet. Therefore, we convert the regression target into a classification target: predicting whether or not a house is expensive. "Expensive" is defined as a sale price strictly greater than $200,000.
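The thresholding step can be sketched on a few made-up prices (the values below are invented for illustration and are not taken from the dataset). Note that a price of exactly 200,000 falls in the "not expensive" class, and that the class proportions give a quick view of how balanced the resulting problem is.

```python
import pandas as pd

# Hypothetical sale prices, invented for illustration only.
prices = pd.Series([150_000, 200_000, 250_000, 320_000], name="SalePrice")

# Strictly greater than 200,000 -> class 1 ("expensive"), otherwise class 0.
toy_target = (prices > 200_000).astype(int)
print(toy_target.tolist())

# Share of each class in the binarized target.
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```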

### **Question 1**
Use the `data.info()` and `data.head()` commands to examine the columns of the dataframe. The dataset contains:

In [7]:
data.head()

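| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | Grvl | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | Gd | MnPrv | Shed | 0 | 2 | 2008 | WD | Normal |
| 1 | 20 | RL | 80.0 | 9600 | Pave | Grvl | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | Gd | MnPrv | Shed | 0 | 5 | 2007 | WD | Normal |
| 2 | 60 | RL | 68.0 | 11250 | Pave | Grvl | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | Gd | MnPrv | Shed | 0 | 9 | 2008 | WD | Normal |
| 3 | 70 | RL | 60.0 | 9550 | Pave | Grvl | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | Gd | MnPrv | Shed | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 60 | RL | 84.0 | 14260 | Pave | Grvl | IR1 | Lvl | AllPub | FR2 | ... | 0 | 0 | Gd | MnPrv | Shed | 0 | 12 | 2008 | WD | Normal |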
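As a minimal sketch of what `data.info()` reports, the snippet below builds a hypothetical three-column frame (column names borrowed from the dataset, values illustrative) and prints its summary: one line per column with the non-null count and the dtype. The `dtype_kinds` dictionary is a helper introduced here for illustration only.

```python
import pandas as pd

# Hypothetical miniature frame; values are illustrative, not the full dataset.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "Street": ["Pave", "Pave", "Pave"],
})

# info() prints the index range, then each column's non-null count and dtype.
toy.info()

# One-letter dtype kind per column: "i" for integers, "f" for floats.
dtype_kinds = {col: dtype.kind for col, dtype in toy.dtypes.items()}
print(dtype_kinds)
```

On the real dataframe, the same call also shows at a glance whether any column still contains missing values.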


### **Question 2**
How many features are available to predict whether or not a house is expensive?

In [8]:
data.shape

(1460, 79)

### **Question 3**
How many features are represented with numbers?

In [11]:
categorical_selector = make_column_selector(dtype_include="object")
numerical_selector = make_column_selector(dtype_exclude="object")
categorical_columns = categorical_selector(data)
numerical_columns = numerical_selector(data)

In [12]:
len(numerical_columns)

36
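The dtype-based selection can be tried on a small frame. The sketch below uses a hypothetical toy dataframe and selects numeric columns with `dtype_include=np.number` (a variation on the `dtype_exclude="object"` selector used above), then cross-checks the result against pandas' own `select_dtypes`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical miniature frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "Street": ["Pave", "Pave", "Pave"],
})

# The selector is a callable: applied to a dataframe, it returns column names.
numeric_selector = make_column_selector(dtype_include=np.number)
numeric_cols = numeric_selector(toy)
print(numeric_cols)

# The same selection expressed directly with pandas.
numeric_cols_pandas = toy.select_dtypes(include=np.number).columns.tolist()
print(len(numeric_cols), len(numeric_cols_pandas))
```

A dtype only says how a column is stored, so a count obtained this way can include integer-coded categories as well as true quantities.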

### **Question 4**
Refer to the dataset description for the meaning of each column.

Among the following columns, which columns express a quantitative numerical value (excluding ordinal categories)?

We consider the following numerical columns:

In [16]:
numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]
categorical_features = data.drop(columns=numerical_features).columns.tolist()
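Because `categorical_features` is built as the complement of `numerical_features`, it also picks up number-typed columns such as `MSSubClass`, which the dataset description defines as a dwelling-type code rather than a quantity. A minimal sketch on a hypothetical toy frame (the `toy_*` names are introduced here for illustration):

```python
import pandas as pd

# Hypothetical miniature frame: one integer-coded category, one true
# quantity, and one text category. Values are illustrative.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 70],
    "LotArea": [8450, 9600, 9550],
    "Street": ["Pave", "Pave", "Pave"],
})

# Only the true quantity is listed as numerical.
toy_numerical = ["LotArea"]

# Everything else, whatever its dtype, ends up in the categorical list.
toy_categorical = toy.drop(columns=toy_numerical).columns.tolist()
print(toy_categorical)
```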

### **Question 5**
Now create a predictive model that uses these numerical columns as input data. Your predictive model should be a pipeline composed of a `sklearn.preprocessing.StandardScaler` to scale the numerical data and a `sklearn.linear_model.LogisticRegression`.

What is the accuracy score obtained by 10-fold cross-validation (you can set the parameter `cv=10` when calling `cross_validate`) of this pipeline?

In [22]:
model = make_pipeline(StandardScaler(), LogisticRegression())
cv_results = cross_validate(model, data[numerical_features], target, cv=10)
print(f"Accuracy: {cv_results['test_score'].mean():.3f} ± {cv_results['test_score'].std():.3f}")


Accuracy: 0.892 ± 0.013


### **Question 6**
Instead of solely using the numerical columns, let us build a pipeline that can process both the numerical and categorical features together as follows:

- numerical features should be processed as previously done with a `StandardScaler`;
- the left-out columns should be treated as categorical variables using a `sklearn.preprocessing.OneHotEncoder`. To avoid any issue with rare categories that could only be present during the prediction, you can pass the parameter `handle_unknown="ignore"` to the `OneHotEncoder`.
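The effect of `handle_unknown="ignore"` can be seen on a tiny example. In this sketch the category `"Dirt"` is invented to play the role of a value that was never seen during `fit`: instead of raising an error, the encoder outputs a row of zeros for it.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen at fit time.
train = np.array([["Pave"], ["Grvl"]], dtype=object)

# "Dirt" is a made-up category that the encoder has never seen.
new = np.array([["Pave"], ["Dirt"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned categories are sorted, so the columns are ["Grvl", "Pave"].
print(encoder.categories_[0].tolist())

# The known category gets a single 1; the unknown one is encoded as all zeros.
encoded = encoder.transform(new).toarray()
print(encoded)
```

With the default `handle_unknown="error"`, the same `transform` call would raise a `ValueError`, which is why the option matters when a rare category appears only in a test fold.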

One way to compare two models is by comparing the cross-validation test scores of both models fold-to-fold, i.e. counting the number of folds where one model has a better test score than the other. Let's compare the model using all features with the model consisting of only numerical features. Select the range of folds where the former has a better test score than the latter:

In [24]:

numerical_processor = StandardScaler()
categorical_processor = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer([
    ('num', numerical_processor, numerical_features),
    ('cat', categorical_processor, categorical_features)
])
cat_num_model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
cat_num_model

In [25]:
cat_num_cv = cross_validate(cat_num_model, data, target, cv=10)
cat_num_cv

{'fit_time': array([0.08507752, 0.09173369, 0.07701659, 0.06505919, 0.07206535,
        0.06685185, 0.07406735, 0.07206655, 0.06906295, 0.06706071]),
 'score_time': array([0.0026238 , 0.01000857, 0.00800753, 0.00800705, 0.00800753,
        0.00900745, 0.0090096 , 0.01000857, 0.00800705, 0.00900841]),
 'test_score': array([0.95890411, 0.90410959, 0.89041096, 0.92465753, 0.9109589 ,
        0.93835616, 0.90410959, 0.91780822, 0.92465753, 0.89726027])}

In [27]:
n_better = sum(cat_num_cv["test_score"] > cv_results["test_score"])
n_folds = len(cv_results["test_score"])
print(f"Using all features produced {n_better} out of {n_folds} folds better than only numerical features!")

Using all features produced 9 out of 10 folds better than only numerical features!
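A fold-to-fold count is only meaningful when both models are scored on the same splits. With an integer `cv` and a classifier, `cross_validate` uses a non-shuffled stratified K-fold, so the two calls above share their splits. The sketch below makes this explicit by passing one `StratifiedKFold` object to both calls; it runs on synthetic data from `make_classification`, and the feature subsets are arbitrary stand-ins for "numerical only" versus "all features".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate the comparison procedure.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# One splitter shared by both evaluations: fold i is identical in both runs.
cv = StratifiedKFold(n_splits=10)

model_few = make_pipeline(StandardScaler(), LogisticRegression())
model_all = make_pipeline(StandardScaler(), LogisticRegression())

# First model sees only the first three columns; second sees all ten.
scores_few = cross_validate(model_few, X[:, :3], y, cv=cv)["test_score"]
scores_all = cross_validate(model_all, X, y, cv=cv)["test_score"]

# Element-wise comparison, then count the folds won by the larger model.
n_better = int((scores_all > scores_few).sum())
print(f"{n_better} out of {len(scores_all)} folds")
```

Ties count as "not better" with the strict `>` comparison, which matches the counting used in the cell above.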
