## Wrap Up

In [8]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.compose import make_column_transformer

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Open the dataset ames_housing_no_missing.csv
ames_housing = pd.read_csv("/content/drive/MyDrive/DataSets/House_prices/ames_housing_no_missing.csv")

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

`ames_housing` is a pandas DataFrame. The column `"SalePrice"` contains the target variable.

We have not encountered any regression problem yet. Therefore, we convert the regression target into a classification target: predicting whether or not a house is expensive. "Expensive" is defined as a sale price strictly greater than $200,000.
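To make the thresholding concrete, here is a minimal sketch on a handful of made-up prices. The values and the names `prices` and `is_expensive` are hypothetical and are not taken from the dataset; only the comparison and the cast mirror the cell above.

```python
import pandas as pd

# Hypothetical sale prices, for illustration only (not from the Ames dataset).
prices = pd.Series([150_000, 200_000, 200_001, 325_000], name="SalePrice")

# Strictly greater than the threshold -> 1 ("expensive"), otherwise 0.
# A house sold for exactly 200,000 is therefore labeled 0.
is_expensive = (prices > 200_000).astype(int)
print(is_expensive.tolist())  # [0, 0, 1, 1]
```

Running `target.mean()` on the real target in the same way would give the fraction of houses labeled expensive, which is a useful baseline to keep in mind when reading accuracy scores.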

In [5]:
# We consider the following numerical columns:
numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

In [7]:
data_numerical = data[numerical_features]

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_results_num = cross_validate(model, data_numerical, target, cv=10)
test_score_num = cv_results_num["test_score"]
test_score_num.mean()

0.891780821917808

In [9]:
# Treat every column not listed in numerical_features as categorical
categorical_features = data.columns.difference(numerical_features)

categorical_processor = OneHotEncoder(handle_unknown="ignore")
numerical_processor = StandardScaler()

preprocessor = make_column_transformer(
    (categorical_processor, categorical_features),
    (numerical_processor, numerical_features),
)
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1_000))
cv_results_all = cross_validate(model, data, target, cv=10)
test_score_all = cv_results_all["test_score"]
test_score_all.mean()

0.9171232876712327

From this analysis, we observe that the model using both numerical and categorical features reaches a higher mean test accuracy (about 0.917) than the model using only the numerical features (about 0.892).
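A difference in mean scores does not say how consistent the improvement is across folds. One possible check, sketched below, is to compare the two score arrays fold by fold. The arrays here are placeholder values chosen for illustration, not the results of this notebook; in practice `test_score_num` and `test_score_all` from the cells above would take their place, which is valid because both calls use the same `cv=10` splits.

```python
import numpy as np

# Placeholder per-fold accuracies, for illustration only. In the notebook,
# these would be test_score_num and test_score_all from cross_validate.
scores_num = np.array([0.88, 0.90, 0.89, 0.91, 0.87])
scores_all = np.array([0.91, 0.90, 0.92, 0.93, 0.90])

# Count the folds where the model using all features is strictly better.
n_better = int((scores_all > scores_num).sum())
print(f"All-features model is better on {n_better} of {len(scores_all)} folds")
```

If the model with all features wins on nearly every fold, the gain is less likely to be an artifact of one favorable split.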