# Scaling model data

Before moving on to model improvements, we will first scale the numerical data used for the model in the previous notebook. This is a necessary step for some models, and it can also improve the performance of others. We will use the `StandardScaler` from `sklearn` to scale the data.

Additionally, we will apply a logarithmic transformation to a few variables. This is common practice when dealing with skewed data, and it can improve the performance of some models.

In [216]:
import pickle
import pathlib

import numpy as np
import pandas as pd

In [217]:
DATA_DIR = pathlib.Path.cwd().parent / 'data'
print(DATA_DIR)

c:\Users\felip\OneDrive\Documentos\GitHub\AmesHousingDataset\data


In [218]:
model_data_path = DATA_DIR / 'processed' / 'ames_model_data.pkl'

In [219]:
with open(model_data_path, 'rb') as file:
    data = pickle.load(file)

In [220]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2877 entries, 0 to 2929
Columns: 165 entries, Lot.Frontage to Exterior_Other
dtypes: bool(2), float64(34), int64(12), uint8(117)
memory usage: 1.4 MB


### Let's import StandardScaler and create an instance of it:

In [221]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

The StandardScaler scales each feature to have a mean of 0 and a standard deviation of 1 by subtracting the feature's mean and dividing by its standard deviation. Each feature is scaled independently, which puts all of them on a comparable scale. This matters for models that are sensitive to feature magnitudes, such as regularized linear models, and much less for others, such as decision trees.
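
The standardization described above can be checked by hand on a small array. The sketch below uses made-up values (not the Ames data) and compares a manual `(x - mean) / std` computation with the output of `StandardScaler`, assuming the population standard deviation (`ddof=0`), which is what the scaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values, not taken from the Ames data).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 600.0]])

# Manual standardization: subtract the column mean and divide by the
# column standard deviation (population formula, ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation performed by StandardScaler.
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Each column is handled separately, so the very different magnitudes of the two toy columns end up on the same scale.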

We want to use the StandardScaler on continuous variables only, so we will first adapt the list of the continuous variables from previous notebooks:

In [222]:
continuous_variables = [
    'Lot.Frontage',
    'Lot.Area',
    'Mas.Vnr.Area',
    'BsmtFin.SF.1',
    'BsmtFin.SF.2',
    'Bsmt.Unf.SF',
    'Total.Bsmt.SF',
    'X1st.Flr.SF',
    'X2nd.Flr.SF',
    'Low.Qual.Fin.SF',
    'Gr.Liv.Area',
    'Garage.Area',
    'Wood.Deck.SF',
    'Open.Porch.SF',
    'Enclosed.Porch',
    'X3Ssn.Porch',
    'Screen.Porch',
    'Pool.Area',
    'Misc.Val',
]

In [223]:
data_cont = data.loc[:, continuous_variables]

In [224]:
data_cont.describe()

statistic,Lot.Frontage,Lot.Area,Mas.Vnr.Area,BsmtFin.SF.1,BsmtFin.SF.2,Bsmt.Unf.SF,Total.Bsmt.SF,X1st.Flr.SF,X2nd.Flr.SF,Low.Qual.Fin.SF,Gr.Liv.Area,Garage.Area,Wood.Deck.SF,Open.Porch.SF,Enclosed.Porch,X3Ssn.Porch,Screen.Porch,Pool.Area,Misc.Val
count,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0
mean,69.202989,10171.366354,102.87626,445.111922,50.076121,562.832812,1058.020855,1163.242614,336.9496,4.584637,1504.776851,474.941606,94.297185,47.546403,22.634341,2.610358,16.262426,2.284672,51.354536
std,21.204969,7833.442896,179.732526,456.415687,169.983156,440.58575,439.000776,389.081826,429.84432,45.759563,504.110021,214.027308,126.993526,66.613621,63.912202,25.321811,56.53912,35.922368,571.419703
min,21.0,1470.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,60.0,7500.0,0.0,0.0,0.0,222.0,796.0,880.0,0.0,0.0,1132.0,326.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,68.0,9490.0,0.0,374.0,0.0,468.0,992.0,1088.0,0.0,0.0,1452.0,480.0,0.0,27.0,0.0,0.0,0.0,0.0,0.0
75%,79.0,11600.0,166.0,735.0,0.0,808.0,1309.0,1392.0,708.0,0.0,1750.0,576.0,168.0,70.0,0.0,0.0,0.0,0.0,0.0
max,313.0,215245.0,1600.0,5644.0,1526.0,2336.0,6110.0,5095.0,2065.0,1064.0,5642.0,1488.0,1424.0,742.0,1012.0,508.0,576.0,800.0,17000.0


In [225]:
scaler.fit(data_cont)

In [226]:
scaled_data_cont = scaler.transform(data_cont)

### Create a new DataFrame with the scaled data:

In [227]:
scaled_data_cont = pd.DataFrame(scaled_data_cont, columns=data_cont.columns, index=data_cont.index)

scaled_data_cont.describe()

statistic,Lot.Frontage,Lot.Area,Mas.Vnr.Area,BsmtFin.SF.1,BsmtFin.SF.2,Bsmt.Unf.SF,Total.Bsmt.SF,X1st.Flr.SF,X2nd.Flr.SF,Low.Qual.Fin.SF,Gr.Liv.Area,Garage.Area,Wood.Deck.SF,Open.Porch.SF,Enclosed.Porch,X3Ssn.Porch,Screen.Porch,Pool.Area,Misc.Val
count,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0
mean,5.0629570000000006e-17,-1.111381e-16,3.087169e-17,-4.6924960000000007e-17,-2.469735e-18,-1.284262e-16,1.234867e-16,1.555933e-16,7.347461000000001e-17,-3.272399e-17,1.827604e-16,-5.433417000000001e-17,4.6924960000000007e-17,-1.9757880000000002e-17,-4.1985490000000004e-17,2.2227610000000002e-17,-3.087169e-18,-2.4697350000000002e-17,7.409205e-18
std,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174,1.000174
min,-2.273588,-1.11099,-0.5724848,-0.9754032,-0.2946458,-1.277687,-2.410485,-2.131651,-0.7840238,-0.1002071,-2.322867,-2.219456,-0.7426644,-0.713888,-0.354209,-0.1031053,-0.2876814,-0.06361131,-0.08988746
25%,-0.434077,-0.34108,-0.5724848,-0.9754032,-0.2946458,-0.7737247,-0.5969611,-0.7281035,-0.7840238,-0.1002071,-0.7396037,-0.696021,-0.7426644,-0.713888,-0.354209,-0.1031053,-0.2876814,-0.06361131,-0.08988746
50%,-0.05674134,-0.08699684,-0.5724848,-0.1558323,-0.2946458,-0.21528,-0.1504151,-0.1934187,-0.7840238,-0.1002071,-0.1047113,0.02363845,-0.7426644,-0.3084951,-0.354209,-0.1031053,-0.2876814,-0.06361131,-0.08988746
75%,0.4620952,0.1824079,0.3512704,0.6352509,-0.2946458,0.5565541,0.5718049,0.5880438,0.8633705,-0.1002071,0.4865322,0.4722573,0.5804676,0.3371308,-0.354209,-0.1031053,-0.2876814,-0.06361131,-0.08988746
max,11.49916,26.1838,8.33118,11.39267,8.684275,4.025267,11.50991,10.10698,4.020876,23.15581,8.208411,4.734137,10.47245,10.42691,15.48277,19.96214,9.901726,22.21051,29.66575
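
The table above reports a standard deviation of 1.000174 rather than exactly 1. A likely explanation is that `StandardScaler` normalizes with the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`); the two differ by a factor of $\sqrt{n/(n-1)}$. The sketch below illustrates this on a synthetic column (hypothetical values) with the same number of rows as the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 2877  # number of rows reported by data.info() above

# Synthetic column (hypothetical values) with the same number of rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(loc=50.0, scale=10.0, size=n)})

scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

pandas_std = scaled["x"].std()            # sample std, ddof=1 (pandas default)
population_std = scaled["x"].std(ddof=0)  # population std, used by the scaler
expected_ratio = np.sqrt(n / (n - 1))

print(round(pandas_std, 6), round(population_std, 6), round(expected_ratio, 6))
```

The difference is negligible at this sample size and does not affect the models.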


### Finally, join the scaled data with the categorical variables again:

In [228]:
data.loc[:, continuous_variables] = scaled_data_cont

In [229]:
data.describe()

statistic,Lot.Frontage,Lot.Area,Lot.Shape,Land.Slope,Overall.Qual,Overall.Cond,Mas.Vnr.Area,Exter.Qual,Exter.Cond,BsmtFin.SF.1,...,Exterior_BrkFace,Exterior_CemntBd,Exterior_HdBoard,Exterior_MetalSd,Exterior_Plywood,Exterior_Stucco,Exterior_VinylSd,Exterior_Wd Sdng,Exterior_WdShing,Exterior_Other
count,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,...,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0
mean,5.0629570000000006e-17,-1.111381e-16,0.403198,0.052138,5.112965,4.570386,3.087169e-17,1.595759,1.911366,-4.6924960000000007e-17,...,0.029892,0.043796,0.15259,0.150156,0.075773,0.014599,0.355926,0.139381,0.019117,0.004519
std,1.000174,1.000174,0.57179,0.246096,1.392445,1.101427,1.000174,0.578013,0.366542,1.000174,...,0.17032,0.204676,0.359654,0.357287,0.264681,0.11996,0.478876,0.346404,0.136961,0.06708
min,-2.273588,-1.11099,0.0,0.0,0.0,0.0,-0.5724848,0.0,0.0,-0.9754032,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.434077,-0.34108,0.0,0.0,4.0,4.0,-0.5724848,1.0,2.0,-0.9754032,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.05674134,-0.08699684,0.0,0.0,5.0,4.0,-0.5724848,2.0,2.0,-0.1558323,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.4620952,0.1824079,1.0,0.0,6.0,5.0,0.3512704,2.0,2.0,0.6352509,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
max,11.49916,26.1838,3.0,2.0,9.0,8.0,8.33118,3.0,4.0,11.39267,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Now for the logarithmic transformation.

First we have to check the skewness of the numerical variables to select the ones that need to be transformed. To do that, we will use the `skew()` function from `scipy.stats`. The skewness of a distribution measures its asymmetry around the mean (a distribution is symmetric if it looks the same to the left and right of the center point). With its default settings, `skew()` calculates it as:

$$\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\sigma^3}$$

Where $x_i$ is the $i$-th observation, $\bar{x}$ is the mean, $\sigma$ is the population standard deviation (computed with $n$ in the denominator) and $n$ is the number of observations.
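
As a quick check of the definition, the sketch below computes the skewness from the second and third central moments and compares it with `scipy.stats.skew`, whose default (`bias=True`) is the ratio $m_3 / m_2^{3/2}$ with both moments averaged over $n$. The sample values are illustrative only.

```python
import numpy as np
from scipy.stats import skew

# Small right-skewed sample (illustrative values only).
x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 20.0])

deviations = x - x.mean()
m2 = np.mean(deviations ** 2)  # second central moment (population variance)
m3 = np.mean(deviations ** 3)  # third central moment
manual_skewness = m3 / m2 ** 1.5

print(np.isclose(manual_skewness, skew(x)))
```

A positive value indicates a longer right tail, which is the case for most of the area-type variables in this dataset.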

In [230]:
from scipy.stats import skew

In [231]:
skewness = data.select_dtypes(include=[np.number]).apply(skew)

In [232]:
skewness

Lot.Frontage         1.717644
Lot.Area            13.164608
Lot.Shape            1.253920
Land.Slope           5.094773
Overall.Qual         0.245395
                      ...    
Exterior_Stucco      8.094122
Exterior_VinylSd     0.601820
Exterior_Wd Sdng     2.082431
Exterior_WdShing     7.023431
Exterior_Other      14.775393
Length: 163, dtype: float64

The columns with a skewness value greater than 5 will be selected for transformation.

In [233]:
skewed_cols = skewness[skewness > 5].index

print(f"{len(skewed_cols)} out of {len(data.columns)} columns are skewed.")
print(skewed_cols)

54 out of 165 columns are skewed.
Index(['Lot.Area', 'Land.Slope', 'Low.Qual.Fin.SF', 'Functional',
       'X3Ssn.Porch', 'Pool.Area', 'Misc.Val', 'MS.SubClass_85',
       'MS.SubClass_190', 'MS.SubClass_Other', 'MS.Zoning_RH',
       'Land.Contour_Low', 'Lot.Config_FR2', 'Lot.Config_FR3',
       'Neighborhood_BrDale', 'Neighborhood_ClearCr', 'Neighborhood_IDOTRR',
       'Neighborhood_MeadowV', 'Neighborhood_NPkVill', 'Neighborhood_NoRidge',
       'Neighborhood_SWISU', 'Neighborhood_StoneBr', 'Neighborhood_Timber',
       'Neighborhood_Veenker', 'Bldg.Type_2fmCon', 'Bldg.Type_Twnhs',
       'House.Style_1.5Unf', 'House.Style_2.5Fin', 'House.Style_2.5Unf',
       'House.Style_SFoyer', 'Roof.Style_Other', 'Mas.Vnr.Type_Other',
       'Foundation_Other', 'Bsmt.Qual_Fa', 'Bsmt.Qual_NA', 'Bsmt.Cond_NA',
       'Bsmt.Exposure_NA', 'BsmtFin.Type.1_NA', 'BsmtFin.Type.2_ALQ',
       'BsmtFin.Type.2_BLQ', 'BsmtFin.Type.2_LwQ', 'BsmtFin.Type.2_NA',
       'Garage.Type_Basment', 'Garage.Type_Car

We will create a logarithmic transformer using `FunctionTransformer` from `sklearn.preprocessing`. This transformer applies $\log(1 + x)$ to the selected variables; adding 1 to each value before taking the logarithm avoids errors when a variable contains 0.

The values are first clipped to the range [0, 1e10]. The lower bound prevents NaN results, which `np.log1p` returns for inputs below -1, and the upper bound caps extremely large values. Note that the continuous variables were already standardized above, so their below-average values are negative and are all mapped to 0 by the clip.

In [234]:
from sklearn.preprocessing import FunctionTransformer

def log_transform(X):
    return np.log1p(X.clip(lower=0, upper=1e10))

log_transformer = FunctionTransformer(func=log_transform, validate=False)
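
To make the behavior of `log_transform` concrete, the sketch below applies the same function to a few illustrative values: a negative number, zero, a dummy-style 1, and a larger value.

```python
import numpy as np
import pandas as pd

def log_transform(X):
    # Same function as in the cell above: clip to [0, 1e10], then log(1 + x).
    return np.log1p(X.clip(lower=0, upper=1e10))

# Illustrative values: a negative number, zero, a dummy-style 1, a large value.
sample = pd.Series([-0.5, 0.0, 1.0, 999.0])
result = log_transform(sample)

print(result.round(4).tolist())  # [0.0, 0.0, 0.6931, 6.9078]
```

The negative input and zero both map to 0, so any ordering among negative values is lost. A binary 1 maps to $\log 2 \approx 0.693$, which only rescales a dummy variable.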

In [235]:
data.loc[:, skewed_cols] = log_transformer.transform(data.loc[:, skewed_cols])

In [236]:
data.describe()

statistic,Lot.Frontage,Lot.Area,Lot.Shape,Land.Slope,Overall.Qual,Overall.Cond,Mas.Vnr.Area,Exter.Qual,Exter.Cond,BsmtFin.SF.1,...,Exterior_BrkFace,Exterior_CemntBd,Exterior_HdBoard,Exterior_MetalSd,Exterior_Plywood,Exterior_Stucco,Exterior_VinylSd,Exterior_Wd Sdng,Exterior_WdShing,Exterior_Other
count,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,...,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0,2877.0
mean,5.0629570000000006e-17,0.13188,0.403198,0.034539,5.112965,4.570386,3.087169e-17,1.595759,1.911366,-4.6924960000000007e-17,...,0.020737,0.043796,0.15259,0.150156,0.075773,0.010124,0.355926,0.139381,0.01326,0.003136
std,1.000174,0.277639,0.57179,0.158852,1.392445,1.101427,1.000174,0.578013,0.366542,1.000174,...,0.118103,0.204676,0.359654,0.357287,0.264681,0.083191,0.478876,0.346404,0.094971,0.046539
min,-2.273588,0.0,0.0,0.0,0.0,0.0,-0.5724848,0.0,0.0,-0.9754032,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.434077,0.0,0.0,0.0,4.0,4.0,-0.5724848,1.0,2.0,-0.9754032,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.05674134,0.0,0.0,0.0,5.0,4.0,-0.5724848,2.0,2.0,-0.1558323,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.4620952,0.167553,1.0,0.0,6.0,5.0,0.3512704,2.0,2.0,0.6352509,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
max,11.49916,3.302621,3.0,1.098612,9.0,8.0,8.33118,3.0,4.0,11.39267,...,0.693359,1.0,1.0,1.0,1.0,0.693359,1.0,1.0,0.693359,0.693359
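
In the table above, `Lot.Area` has a minimum, first quartile and median of 0. Because the column was standardized before the log step, every below-average value was negative and was clipped to 0. One possible alternative, shown here only as a sketch and not as the method used in this notebook, is to apply `log1p` to the raw values first and standardize afterwards. The data below is synthetic (a hypothetical `area` column drawn from a log-normal distribution), not the Ames data.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

# Synthetic, strictly positive, right-skewed column (hypothetical stand-in
# for a raw variable such as a lot area; not the actual Ames values).
rng = np.random.default_rng(42)
raw = pd.DataFrame({"area": rng.lognormal(mean=9.0, sigma=0.5, size=1000)})

# Order used in this notebook: standardize, then clip at 0 and take log1p.
scaled_first = StandardScaler().fit_transform(raw)
scale_then_log = np.log1p(np.clip(scaled_first, 0, 1e10))

# Alternative order: log1p on the raw values, then standardize.
log_then_scale = StandardScaler().fit_transform(np.log1p(raw))

share_zero = float((scale_then_log == 0).mean())
print(f"share of values collapsed to 0 (scale, then log): {share_zero:.2f}")
print(f"skewness, raw: {skew(raw['area']):.2f}")
print(f"skewness, log then scale: {skew(log_then_scale[:, 0]):.2f}")
```

With the alternative order no values are clipped, and the transformed column keeps a mean of 0 and unit variance while being far less skewed.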


Confirming that there are still no missing values:

In [237]:
missing_counts = data.isnull().sum()
missing_counts[missing_counts > 0]

Series([], dtype: int64)

### Saving the data for the next notebook:

In [238]:
model_data_scaled_path = DATA_DIR / 'processed' / 'ames_model_data_scaled.pkl'

In [239]:
with open(model_data_scaled_path, 'wb') as file:
    pickle.dump(data, file)