<a href="https://colab.research.google.com/github/Mouhsine22/Houses-sales-forecasting/blob/main/Feature_Engineering_with_Open_Source.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering with Open-Source

In this notebook, we reproduce the feature engineering pipeline from notebook 2 (02-Machine-Learning-Pipeline-Feature-Engineering), replacing the manually created functions with open-source classes wherever possible, to see the value they bring.

## Reproducibility: Setting the seed
To ensure reproducibility between runs of the same notebook, and also between the research and production environments, it is extremely important to set the seed in every step that involves randomness.

In [None]:
!pip install feature-engine

Collecting feature-engine
  Downloading feature_engine-1.1.2-py2.py3-none-any.whl (180 kB)
Collecting statsmodels>=0.11.1
  Downloading statsmodels-0.12.2-cp37-cp37m-manylinux1_x86_64.whl (9.5 MB)
Installing collected packages: statsmodels, feature-engine
  Attempting uninstall: statsmodels
    Found existing installation: statsmodels 0.10.2
    Uninstalling statsmodels-0.10.2:
      Successfully uninstalled statsmodels-0.10.2
Successfully installed feature-engine-1.1.2 statsmodels-0.12.2


In [None]:
# data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# for saving the pipeline
import joblib

# from scikit-learn 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, Binarizer

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)

from feature_engine.encoding import (
    RareLabelEncoder,
    MeanEncoder
)

from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer,
)

from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [None]:
# load dataset
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Training_projects/Houses_sales/datasets/train.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


## Separate dataset into train and test

It is important to separate our data into training and testing sets.

When we engineer features, some techniques learn parameters from the data. These parameters must be learned only from the train set and then applied unchanged to the test set. This avoids over-fitting and keeps the test set a fair estimate of performance on unseen data.

Our feature engineering techniques will learn:

- mean
- mode
- exponents for the yeo-johnson
- category frequency
- and category to number mappings

from the train set.

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**

In [None]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), # predictive variables
    data['SalePrice'], # target
    test_size=0.1, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

X_train.shape, X_test.shape

((1314, 79), (146, 79))
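
To illustrate why the seed matters, here is a minimal NumPy sketch of a seeded shuffle split. It is not scikit-learn's implementation; `seeded_split` is a hypothetical helper introduced only for this illustration. The same seed reproduces the same test rows, while a different seed almost certainly does not.

```python
import numpy as np

def seeded_split(n_rows, test_size, seed):
    """Hypothetical helper: return (train_idx, test_idx) from a seeded shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))
    # the first n_test shuffled positions form the test set, the rest the train set
    return shuffled[n_test:], shuffled[:n_test]

train_a, test_a = seeded_split(1460, 0.1, seed=0)
train_b, test_b = seeded_split(1460, 0.1, seed=0)
train_c, test_c = seeded_split(1460, 0.1, seed=1)

same_seed_matches = np.array_equal(test_a, test_b)
different_seed_matches = np.array_equal(test_a, test_c)
print(len(train_a), len(test_a), same_seed_matches, different_seed_matches)
```

The row indices chosen here will differ from those picked by `train_test_split`, since the two use different random generators; only the principle is the same.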

## Feature Engineering

In the following cells, we will engineer the variables of the House Price Dataset so that we tackle:

1. Missing values
2. Temporal variables
3. Non-Gaussian distributed variables
4. Categorical variables: remove rare labels
5. Categorical variables: convert strings to numbers
6. Standardise the variables so that they are on a comparable scale

### Target

We apply the logarithm to the target.

In [None]:
y_train = np.log(y_train)
y_test = np.log(y_test)
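
Because the target is now on the log scale, any prediction made on that scale has to be passed through the exponential to get back to a price. A small sketch of the round trip, using the first three sale prices displayed earlier:

```python
import numpy as np
import pandas as pd

# first three SalePrice values from the data preview
prices = pd.Series([208500, 181500, 223500], name="SalePrice")

log_prices = np.log(prices)      # what the notebook stores in y_train / y_test
recovered = np.exp(log_prices)   # inverse transformation, back to the price scale

print(log_prices.round(3).tolist())
print(recovered.round(0).tolist())
```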

### Missing values

#### Categorical variables

- In variables with a large proportion of missing data, we will replace the missing values with the string "Missing".
- In variables with few null observations, we will replace the missing values with the most frequent category.

In [None]:
# Let's identify the categorical variables : 

cat_vars = [var for var in data.columns if data[var].dtype == 'O']

# MSSubClass is stored as a number, but it is categorical by definition
cat_vars.append('MSSubClass')
cat_vars

['MSZoning',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'SaleType',
 'SaleCondition',
 'MSSubClass']

In [None]:
# Cast all variables as categorical (this is needed for MSSubClass)
X_train[cat_vars] = X_train[cat_vars].astype('O')
X_test[cat_vars] = X_test[cat_vars].astype('O')

In [None]:
# Let's identify categorical variables with null values

cat_vars_with_na = [
    var for var in cat_vars if X_train[var].isnull().sum() > 0
]

X_train[cat_vars_with_na].isnull().mean().sort_values(ascending = False)

PoolQC          0.995434
MiscFeature     0.961187
Alley           0.938356
Fence           0.814307
FireplaceQu     0.472603
GarageCond      0.056317
GarageQual      0.056317
GarageFinish    0.056317
GarageType      0.056317
BsmtFinType2    0.025114
BsmtExposure    0.025114
BsmtFinType1    0.024353
BsmtCond        0.024353
BsmtQual        0.024353
MasVnrType      0.004566
Electrical      0.000761
dtype: float64

In [None]:
# variables to impute with the string 'Missing'
with_string_missing = [
    var for var in cat_vars_with_na if X_train[var].isnull().mean() > 0.1
]

# variables to impute with the most frequent category
# (<= so that a variable with exactly 10% missing is not left out of both lists)
with_frequent_category = [
    var for var in cat_vars_with_na if X_train[var].isnull().mean() <= 0.1
]

In [None]:
with_string_missing

['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

In [None]:
with_frequent_category

['MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond']

In [None]:
# Let's replace missing values with 'Missing'

# Set up the class
categorical_imputer_missing = CategoricalImputer(imputation_method='missing', variables=with_string_missing)

# fit
categorical_imputer_missing.fit(X_train)

CategoricalImputer(fill_value='Missing', ignore_format=False,
                   imputation_method='missing', return_object=False,
                   variables=['Alley', 'FireplaceQu', 'PoolQC', 'Fence',
                              'MiscFeature'])

In [None]:
categorical_imputer_missing.imputer_dict_

{'Alley': 'Missing',
 'Fence': 'Missing',
 'FireplaceQu': 'Missing',
 'MiscFeature': 'Missing',
 'PoolQC': 'Missing'}

In [None]:
X_train = categorical_imputer_missing.transform(X_train)
X_test = categorical_imputer_missing.transform(X_test)

In [None]:
# Let's replace missing values with the most frequent category

# Set up the class
categorical_imputer_frequent = CategoricalImputer(imputation_method='frequent', variables=with_frequent_category)

# fit
categorical_imputer_frequent.fit(X_train)

CategoricalImputer(fill_value='Missing', ignore_format=False,
                   imputation_method='frequent', return_object=False,
                   variables=['MasVnrType', 'BsmtQual', 'BsmtCond',
                              'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
                              'Electrical', 'GarageType', 'GarageFinish',
                              'GarageQual', 'GarageCond'])

In [None]:
categorical_imputer_frequent.imputer_dict_

{'BsmtCond': 'TA',
 'BsmtExposure': 'No',
 'BsmtFinType1': 'Unf',
 'BsmtFinType2': 'Unf',
 'BsmtQual': 'TA',
 'Electrical': 'SBrkr',
 'GarageCond': 'TA',
 'GarageFinish': 'Unf',
 'GarageQual': 'TA',
 'GarageType': 'Attchd',
 'MasVnrType': 'None'}

In [None]:
X_train = categorical_imputer_frequent.transform(X_train)
X_test = categorical_imputer_frequent.transform(X_test)
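
As a rough plain-pandas sketch of what the two imputers do (not feature-engine's actual implementation), the toy example below fills one variable with the string "Missing" and another with the most frequent category. The toy frames are made up for illustration; the important point is that the most frequent category is learned from the train set only.

```python
import numpy as np
import pandas as pd

# toy data, invented for this illustration
train = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan, np.nan],
    "GarageType": ["Attchd", "Attchd", np.nan, "Detchd"],
})
test = pd.DataFrame({
    "Alley": [np.nan, "Pave"],
    "GarageType": [np.nan, "Detchd"],
})

# parameter learned from the train set only
most_frequent = train["GarageType"].mode()[0]

fill_values = {"Alley": "Missing", "GarageType": most_frequent}
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)

print(test_imputed)
```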

In [None]:
X_train[cat_vars].head()

Unnamed: 0,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,FireplaceQu,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition,MSSubClass
930,RL,Pave,Missing,IR1,HLS,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,1Story,Gable,CompShg,VinylSd,VinylSd,,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Missing,Attchd,Fin,TA,TA,Y,Missing,Missing,Missing,WD,Normal,20
656,RL,Pave,Missing,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,Gable,CompShg,HdBoard,HdBoard,BrkFace,Gd,TA,CBlock,TA,TA,No,ALQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Missing,Attchd,RFn,TA,TA,Y,Missing,MnPrv,Missing,WD,Normal,20
45,RL,Pave,Missing,Reg,Lvl,AllPub,Inside,Gtl,NridgHt,Norm,Norm,TwnhsE,1Story,Hip,CompShg,MetalSd,MetalSd,BrkFace,Ex,TA,PConc,Ex,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Ex,Typ,Gd,Attchd,RFn,TA,TA,Y,Missing,Missing,Missing,WD,Normal,120
1348,RL,Pave,Missing,IR3,Low,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1Story,Gable,CompShg,VinylSd,VinylSd,,Gd,TA,PConc,Gd,TA,Gd,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,Fa,Attchd,RFn,TA,TA,Y,Missing,Missing,Missing,WD,Normal,20
55,RL,Pave,Missing,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,Gable,CompShg,HdBoard,Plywood,BrkFace,TA,TA,CBlock,TA,TA,No,BLQ,Unf,GasA,Gd,Y,SBrkr,TA,Typ,Gd,Attchd,RFn,TA,TA,Y,Missing,Missing,Missing,WD,Normal,20


In [None]:
X_train[cat_vars].isnull().sum()

MSZoning         0
Street           0
Alley            0
LotShape         0
LandContour      0
Utilities        0
LotConfig        0
LandSlope        0
Neighborhood     0
Condition1       0
Condition2       0
BldgType         0
HouseStyle       0
RoofStyle        0
RoofMatl         0
Exterior1st      0
Exterior2nd      0
MasVnrType       0
ExterQual        0
ExterCond        0
Foundation       0
BsmtQual         0
BsmtCond         0
BsmtExposure     0
BsmtFinType1     0
BsmtFinType2     0
Heating          0
HeatingQC        0
CentralAir       0
Electrical       0
KitchenQual      0
Functional       0
FireplaceQu      0
GarageType       0
GarageFinish     0
GarageQual       0
GarageCond       0
PavedDrive       0
PoolQC           0
Fence            0
MiscFeature      0
SaleType         0
SaleCondition    0
MSSubClass       0
dtype: int64

#### Numerical variables

To engineer missing values in numerical variables, we will:
- replace the missing values in the original variable with the median

In [None]:
# Let's identify the numerical variables

num_vars = [ var for var in X_train.columns if var not in cat_vars ]

len(num_vars)

35

In [None]:
# Let's detect variables with null values

num_vars_with_null = [ var for var in num_vars if X_train[var].isnull().sum() > 0 ]
X_train[num_vars_with_null].isnull().mean()

LotFrontage    0.177321
MasVnrArea     0.004566
GarageYrBlt    0.056317
dtype: float64

In [None]:
# Let's replace missing data with the median

# set up the imputer
median_imputer = MeanMedianImputer(
    imputation_method = 'median',
    variables = num_vars_with_null
)

# fit the imputer
median_imputer.fit(X_train)

MeanMedianImputer(imputation_method='median',
                  variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])

In [None]:
# the stored parameters
median_imputer.imputer_dict_

{'GarageYrBlt': 1980.0, 'LotFrontage': 69.0, 'MasVnrArea': 0.0}

In [None]:
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)

In [None]:
# check that we have no more missing values in the engineered variables
X_train[num_vars_with_null].isnull().sum()

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64
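
The same idea can be sketched in plain pandas, under the assumption that the imputer simply stores the train-set median per variable and reuses it at transform time. The numbers below are invented for illustration.

```python
import numpy as np
import pandas as pd

# toy data, invented for this illustration
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 1000.0]})

# the median is learned from the train set (NaN values are skipped)
medians = train.median()

train_imputed = train.fillna(medians)
test_imputed = test.fillna(medians)   # the test set reuses the train median

print(medians.to_dict())
print(test_imputed["LotFrontage"].tolist())
```

Note that the extreme test value of 1000.0 has no influence on the imputation value, because the median was never computed on the test set.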

In [None]:
X_train.isnull().sum() > 0

MSSubClass       False
MSZoning         False
LotFrontage      False
LotArea          False
Street           False
                 ...  
MiscVal          False
MoSold           False
YrSold           False
SaleType         False
SaleCondition    False
Length: 79, dtype: bool

### Temporal variables

In this section, we will capture the time elapsed between the year variables and the year in which the house was sold.

In [None]:
def elapsed_years(df, var):
  # Capture the difference between the year variable
  # and the year in which the house was sold
  df = df.copy()  # avoid modifying the caller's dataframe in place
  df[var] = df['YrSold'] - df[var]
  return df

In [None]:
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
  X_train = elapsed_years(X_train, var)
  X_test = elapsed_years(X_test, var)
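
A small worked example of the elapsed-time calculation, using the year values of the first two houses from the data preview. `elapsed_years_copy` is a hypothetical variant written for this sketch; it handles several variables at once and leaves the input untouched.

```python
import pandas as pd

def elapsed_years_copy(df, variables, reference="YrSold"):
    """Hypothetical variant: return a copy with each year replaced by the years elapsed until the sale."""
    out = df.copy()
    for var in variables:
        out[var] = out[reference] - out[var]
    return out

houses = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "YearRemodAdd": [2003, 1976],
    "YrSold": [2008, 2007],
})

aged = elapsed_years_copy(houses, ["YearBuilt", "YearRemodAdd"])
print(aged)
```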

In [None]:
# now we drop YrSold
drop_features = DropFeatures(features_to_drop=['YrSold'])
drop_features.fit(X_train)

DropFeatures(features_to_drop=['YrSold'])

In [None]:
X_train = drop_features.transform(X_train)
X_test = drop_features.transform(X_test)

### Numerical variable transformation

In the previous notebook, we observed that the numerical variables are not normally distributed.

We will apply the logarithm to the positive numerical variables in order to get a more Gaussian-like distribution.

#### Logarithmic transformation

In [None]:
log_transformer = LogTransformer(variables= ['LotFrontage', '1stFlrSF', 'GrLivArea'])
log_transformer.fit(X_train)

LogTransformer(base='e', variables=['LotFrontage', '1stFlrSF', 'GrLivArea'])

In [None]:
X_train = log_transformer.transform(X_train)
X_test = log_transformer.transform(X_test)
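
To see the effect of the logarithm on a right-skewed variable, the sketch below draws a synthetic log-normal sample (not the house data) and compares the skewness before and after the transformation. The logarithm is only defined for strictly positive values, which is why it is applied to a selected set of variables.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed, strictly positive variable (an assumption for this sketch)
rng = np.random.default_rng(0)
areas = pd.Series(rng.lognormal(mean=7.0, sigma=0.5, size=1000))

skew_before = areas.skew()
skew_after = np.log(areas).skew()

print(round(skew_before, 2), round(skew_after, 2))
```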

#### Yeo-Johnson transformation

In [None]:
yeo_transformer = YeoJohnsonTransformer(
    variables=['LotArea', 'BsmtFinSF1']
    )
yeo_transformer.fit(X_train)


YeoJohnsonTransformer(variables=['LotArea', 'BsmtFinSF1'])

In [None]:
X_train = yeo_transformer.transform(X_train)
X_test = yeo_transformer.transform(X_test)
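
For reference, here is a NumPy sketch of the Yeo-Johnson formula for a fixed exponent. This is only the transformation itself: the transformer above additionally estimates the exponent for each variable from the train set, which this sketch does not attempt. `yeo_johnson` is a hypothetical helper name.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Hypothetical helper: apply the Yeo-Johnson transformation with a fixed exponent."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # non-negative values
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # negative values
    if lmbda != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

values = np.array([-2.0, 0.0, 3.0])
identity = yeo_johnson(values, 1.0)   # an exponent of 1 leaves the data unchanged
log_like = yeo_johnson(values, 0.0)   # an exponent of 0 gives log(x + 1) for x >= 0
print(identity, log_like)
```

Unlike the logarithm, the formula is defined for zero and negative values, which makes it usable on variables such as `BsmtFinSF1` that contain zeros.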

#### Binarize skewed variables

In [None]:
skewed = [
    'BsmtFinSF2', 'LowQualFinSF', 'EnclosedPorch',
    '3SsnPorch', 'ScreenPorch', 'MiscVal'
]

In [None]:
binarizer = SklearnTransformerWrapper(
    transformer = Binarizer(threshold = 0), variables=skewed
)

binarizer.fit(X_train)

SklearnTransformerWrapper(transformer=Binarizer(copy=True, threshold=0),
                          variables=['BsmtFinSF2', 'LowQualFinSF',
                                     'EnclosedPorch', '3SsnPorch',
                                     'ScreenPorch', 'MiscVal'])

In [None]:
X_train = binarizer.transform(X_train)
X_test = binarizer.transform(X_test)

In [None]:
X_train[skewed].head()

Unnamed: 0,BsmtFinSF2,LowQualFinSF,EnclosedPorch,3SsnPorch,ScreenPorch,MiscVal
930,0,0,0,0,0,0
656,0,0,0,0,0,0
45,0,0,0,0,0,0
1348,0,0,0,0,0,0
55,0,0,0,1,0,0
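
With a threshold of 0, binarizing amounts to flagging whether a value is greater than zero. A plain-pandas sketch on invented values:

```python
import pandas as pd

# toy data, invented for this illustration
porch = pd.DataFrame({
    "3SsnPorch": [0, 0, 320],
    "ScreenPorch": [0, 120, 0],
})

# 1 where the value is strictly greater than 0, otherwise 0
flags = (porch > 0).astype(int)
print(flags)
```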


### Encoding Categorical variables

#### Apply mappings for Quality variables

These are variables whose values have an inherent order, related to quality. For more information, check the Kaggle website.

In [None]:
# re-map strings to numbers which determine quality

qual_mappings = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

qual_vars = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
             'HeatingQC', 'KitchenQual', 'FireplaceQu',
             'GarageQual', 'GarageCond',
            ]

for var in qual_vars: 
  X_train[var] = X_train[var].map(qual_mappings)
  X_test[var] = X_test[var].map(qual_mappings)

In [None]:
exposure_mappings = {'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

var = 'BsmtExposure'

X_train[var] = X_train[var].map(exposure_mappings)
X_test[var] = X_test[var].map(exposure_mappings)

In [None]:
finish_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}

finish_vars = ['BsmtFinType1', 'BsmtFinType2']

for var in finish_vars:
    X_train[var] = X_train[var].map(finish_mappings)
    X_test[var] = X_test[var].map(finish_mappings)

In [None]:
garage_mappings = {'Missing': 0, 'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

var = 'GarageFinish'

X_train[var] = X_train[var].map(garage_mappings)
X_test[var] = X_test[var].map(garage_mappings)

In [None]:
fence_mappings = {'Missing': 0, 'NA': 0, 'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4}

var = 'Fence'

X_train[var] = X_train[var].map(fence_mappings)
X_test[var] = X_test[var].map(fence_mappings)

In [None]:
#check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

[]
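
The check for missing values after the re-mapping matters because `Series.map` returns NaN for any label that is not a key in the dictionary. The sketch below shows this with a made-up label, `'Excellent'`, which does not occur in the dataset.

```python
import pandas as pd

qual_mappings = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

# 'Excellent' is a hypothetical label that the mapping does not cover
quality = pd.Series(['Gd', 'TA', 'Missing', 'Excellent'])

mapped = quality.map(qual_mappings)
unmapped_labels = quality[mapped.isnull()].unique().tolist()

print(mapped.tolist())
print(unmapped_labels)
```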

#### Removing Rare Labels

For the remaining categorical variables, we will group the categories that are present in less than 1% of the observations. That is, all values of categorical variables that are shared by less than 1% of houses will be replaced by the string "Rare".

In [None]:
# capture all quality variables

qual_vars  = qual_vars + finish_vars + ['BsmtExposure','GarageFinish','Fence']

# capture the remaining categorical variables
# (those that we did not re-map)

cat_others = [
    var for var in cat_vars if var not in qual_vars
]

len(cat_others)

30

In [None]:
cat_others

['MSZoning',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'Foundation',
 'Heating',
 'CentralAir',
 'Electrical',
 'Functional',
 'GarageType',
 'PavedDrive',
 'PoolQC',
 'MiscFeature',
 'SaleType',
 'SaleCondition',
 'MSSubClass']

In [None]:
rare_encoder = RareLabelEncoder(tol=0.01, n_categories=1, variables=cat_others)

# fit 
rare_encoder.fit(X_train)

# the common labels are stored, we can save the class
# and then use it later :)
rare_encoder.encoder_dict_

{'Alley': Index(['Missing', 'Grvl', 'Pave'], dtype='object'),
 'BldgType': Index(['1Fam', 'TwnhsE', 'Duplex', 'Twnhs', '2fmCon'], dtype='object'),
 'CentralAir': Index(['Y', 'N'], dtype='object'),
 'Condition1': Index(['Norm', 'Feedr', 'Artery', 'RRAn', 'PosN'], dtype='object'),
 'Condition2': Index(['Norm'], dtype='object'),
 'Electrical': Index(['SBrkr', 'FuseA', 'FuseF'], dtype='object'),
 'Exterior1st': Index(['VinylSd', 'HdBoard', 'Wd Sdng', 'MetalSd', 'Plywood', 'CemntBd',
        'BrkFace', 'Stucco', 'WdShing', 'AsbShng'],
       dtype='object'),
 'Exterior2nd': Index(['VinylSd', 'Wd Sdng', 'HdBoard', 'MetalSd', 'Plywood', 'CmentBd',
        'Wd Shng', 'BrkFace', 'Stucco', 'AsbShng'],
       dtype='object'),
 'Foundation': Index(['PConc', 'CBlock', 'BrkTil', 'Slab'], dtype='object'),
 'Functional': Index(['Typ', 'Min2', 'Min1', 'Mod'], dtype='object'),
 'GarageType': Index(['Attchd', 'Detchd', 'BuiltIn', 'Basment'], dtype='object'),
 'Heating': Index(['GasA', 'GasW'], dtype='obj

In [None]:
X_train = rare_encoder.transform(X_train)
X_test = rare_encoder.transform(X_test)
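
A rough plain-pandas sketch of the grouping, assuming the encoder keeps the categories whose train-set frequency reaches the tolerance and replaces everything else with "Rare". The toy data is invented; note that a category never seen in the train set is also grouped as "Rare".

```python
import pandas as pd

# toy train variable: 'FV' appears in only 0.5% of the rows
train = pd.Series(['RL'] * 190 + ['RM'] * 9 + ['FV'] * 1)
test = pd.Series(['RL', 'FV', 'C (all)'])

tol = 0.01

# frequent categories are learned from the train set only
freqs = train.value_counts(normalize=True)
frequent = freqs[freqs >= tol].index

train_grouped = train.where(train.isin(frequent), 'Rare')
test_grouped = test.where(test.isin(frequent), 'Rare')

print(sorted(frequent))
print(test_grouped.tolist())
```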

#### Encoding of categorical variables

Next, we need to transform the strings of the categorical variables into numbers.

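We will do this in a way that captures the monotonic relationship between the label and the target: each category is replaced by the mean of the target for that category, learned from the train set.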

In [None]:
# set up the encoder
mean_enc = MeanEncoder(variables=cat_others)

# create the mappings
mean_enc.fit(X_train, y_train)

MeanEncoder(ignore_format=False,
            variables=['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
                       'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood',
                       'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
                       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                       'MasVnrType', 'Foundation', 'Heating', 'CentralAir',
                       'Electrical', 'Functional', 'GarageType', 'PavedDrive',
                       'PoolQC', 'MiscFeature', 'SaleType', 'SaleCondition',
                       'MSSubClass'])

In [None]:
# in the encoder dict we see the target mean assigned to each
# category for each of the selected variables

mean_enc.encoder_dict_

{'Alley': {'Grvl': 11.663519817498301,
  'Missing': 12.038406180311078,
  'Pave': 11.984130657712763},
 'BldgType': {'1Fam': 12.046824240810961,
  '2fmCon': 11.733137481504304,
  'Duplex': 11.78004472951676,
  'Twnhs': 11.7865406100787,
  'TwnhsE': 12.056996438382402},
 'CentralAir': {'N': 11.49406690941563, 'Y': 12.063052200932379},
 'Condition1': {'Artery': 11.73590145246739,
  'Feedr': 11.824047662716152,
  'Norm': 12.040771322467638,
  'PosN': 12.2379893814225,
  'RRAn': 12.078025567272704,
  'Rare': 12.08412477057493},
 'Condition2': {'Norm': 12.024920419457883, 'Rare': 11.924600839547407},
 'Electrical': {'FuseA': 11.664987014291636,
  'FuseF': 11.548283987570134,
  'Rare': 11.363218337390071,
  'SBrkr': 12.063722636077063},
 'Exterior1st': {'AsbShng': 11.519159716351393,
  'BrkFace': 12.092217449966205,
  'CemntBd': 12.199015917008579,
  'HdBoard': 11.9423973526522,
  'MetalSd': 11.853362022522271,
  'Plywood': 12.046414730190149,
  'Rare': 11.910985466784075,
  'Stucco': 11.889

In [None]:
X_train = mean_enc.transform(X_train)
X_test = mean_enc.transform(X_test)
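
In plain pandas, mean encoding can be sketched as a group-by on the train set followed by a `map`. The values and the column name `log_price` below are invented for illustration; this is a simplified picture of the idea, not feature-engine's code.

```python
import pandas as pd

# toy train set, invented for this illustration
train = pd.DataFrame({
    "CentralAir": ["Y", "Y", "N", "N"],
    "log_price": [12.2, 12.0, 11.4, 11.6],
})
test = pd.Series(["N", "Y", "Y"], name="CentralAir")

# mean of the target per category, learned from the train set
means = train.groupby("CentralAir")["log_price"].mean()

# each label is replaced by the learned mean
encoded = test.map(means)

print(means.round(2).to_dict())
print(encoded.round(2).tolist())
```

A category that is absent from the train set would come out as NaN here, which is one reason the rare labels are grouped before this step.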

In [None]:
# check absence of na in the train set
[var for var in X_train.columns if X_train[var].isnull().sum() > 0]

[]

In [None]:
# check absence of na in the test set
[var for var in X_test.columns if X_test[var].isnull().sum() > 0]

[]

### Feature Scaling

In [None]:
# standardisation: with the StandardScaler from sklearn

# set up the scaler
scaler = StandardScaler()

# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)

# transform train and test sets
X_train = pd.DataFrame(
    scaler.transform(X_train),
    columns=X_train.columns
)

X_test = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_train.columns
)
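
What the scaler learns and applies can be sketched by hand: the mean and the population standard deviation (`ddof=0`) come from the train set, and the same two numbers are used to transform the test set. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# toy data, invented for this illustration
train = pd.DataFrame({"GrLivArea": [1000.0, 1500.0, 2000.0]})
test = pd.DataFrame({"GrLivArea": [1500.0, 2500.0]})

# parameters learned from the train set only
mean = train.mean()
std = train.std(ddof=0)   # population standard deviation

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled["GrLivArea"].round(3).tolist())
print(test_scaled["GrLivArea"].round(3).tolist())
```

The scaled test set does not generally have zero mean and unit variance, and that is expected: it is expressed in units of the train-set distribution.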

In [None]:
X_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition
0,0.137204,0.37393,0.292473,0.0,0.061804,0.209216,1.118073,3.45641,0.027597,-0.362327,-0.23762,1.116014,0.220878,0.091881,0.292811,-0.231797,1.364489,-0.525713,-1.146681,-1.017645,-0.514776,-0.139266,1.097043,1.162007,-0.736309,-0.564348,1.049588,-0.233275,1.087629,0.652094,-0.034059,1.298047,1.179815,-0.778386,-0.313783,-0.360329,2.009086,0.926204,0.10818,0.889627,0.271163,0.312725,0.88295,-0.801608,-0.136399,0.055347,-0.812813,-0.239358,0.794471,-0.764776,0.163141,-0.208108,0.738317,0.285857,0.253411,-0.949119,-1.007522,0.441233,-1.134978,1.514749,1.657603,0.636604,0.105448,0.110706,0.301182,0.041811,-0.434758,-0.407342,-0.124322,-0.29164,-0.066932,-0.067729,-0.458978,0.200264,-0.198889,0.247659,-0.243271,-0.132037
1,0.137204,0.37393,0.249604,0.0,0.061804,0.209216,1.118073,-0.006188,0.027597,-0.362327,-0.23762,-0.499968,0.220878,0.091881,0.292811,-0.231797,-0.796546,1.256964,0.398692,-1.017645,-0.514776,-0.139266,-0.474637,-0.361723,0.817224,-0.269402,1.049588,-0.233275,-0.68019,-0.82016,-0.034059,-0.630312,0.696712,0.880083,-0.313783,-0.360329,-0.730439,-0.012892,0.10818,0.889627,0.271163,0.312725,-0.160622,-0.801608,-0.136399,-0.936868,1.118536,-0.239358,-1.024232,1.221224,0.163141,-0.208108,0.738317,-0.943654,0.253411,-0.949119,-1.007522,0.441233,0.81024,0.280787,-1.023124,-0.748372,0.105448,0.110706,0.301182,-0.744168,-0.701348,-0.407342,-0.124322,-0.29164,-0.066932,-0.067729,2.057423,0.200264,-0.198889,0.617458,-0.243271,-0.132037
2,0.632674,0.37393,-0.265678,0.0,0.061804,0.209216,-0.734044,-0.006188,0.027597,-0.362327,-0.23762,1.946465,0.220878,0.091881,0.423772,-0.231797,2.084834,-0.525713,-1.04804,-0.872482,1.999022,-0.139266,-0.991995,-1.027098,0.817224,1.685978,2.788753,-0.233275,1.087629,2.124349,-0.034059,-0.630312,1.179815,0.537323,-0.313783,-0.360329,1.658391,1.576522,0.10818,0.889627,0.271163,0.312725,1.445019,-0.801608,-0.136399,0.589756,1.118536,-0.239358,0.794471,-0.764776,-1.061818,-0.208108,2.237772,-0.328899,0.253411,0.592466,1.20633,0.441233,-1.010815,0.280787,0.31724,0.478587,0.105448,0.110706,0.301182,0.796351,0.513117,-0.407342,-0.124322,-0.29164,-0.066932,-0.067729,-0.458978,0.200264,-0.198889,-1.601339,-0.243271,-0.132037
3,0.137204,0.37393,0.117329,0.0,0.061804,0.209216,1.499217,1.683125,0.027597,-0.362327,-0.23762,0.204734,0.220878,0.091881,0.292811,-0.231797,0.644144,-0.525713,-0.916519,-0.678933,-0.514776,-0.139266,1.097043,1.162007,-0.736309,-0.564348,1.049588,-0.233275,1.087629,0.652094,-0.034059,2.262227,1.179815,1.279086,-0.313783,-0.360329,-1.204106,0.962585,0.10818,0.889627,0.271163,0.312725,0.942618,-0.801608,-0.136399,0.112079,1.118536,-0.239358,0.794471,-0.764776,0.163141,-0.208108,0.738317,-0.943654,0.253411,0.592466,0.099404,0.441233,-0.845265,0.280787,0.31724,0.190437,0.105448,0.110706,0.301182,2.415468,-0.331085,-0.407342,-0.124322,-0.29164,-0.066932,-0.067729,-0.458978,0.200264,-0.198889,0.617458,-0.243271,-0.132037
4,0.137204,0.37393,1.270593,0.0,0.061804,0.209216,1.118073,-0.006188,0.027597,-0.362327,-0.23762,-0.499968,0.220878,0.091881,0.292811,-0.231797,-0.076201,-0.525713,0.234291,1.014625,-0.514776,-0.139266,-0.474637,-0.166643,0.817224,0.921304,-0.689578,-0.233275,-0.68019,-0.82016,-0.034059,-0.630312,0.213609,0.578208,-0.313783,-0.360329,0.836305,0.832976,0.10818,-0.155029,0.271163,0.312725,0.79349,-0.801608,-0.136399,-0.02971,-0.812813,-0.239358,0.794471,-0.764776,0.163141,-0.208108,-0.761139,0.285857,0.253411,0.592466,1.20633,0.441233,0.603302,0.280787,0.31724,0.478587,0.105448,0.110706,0.301182,-0.744168,-0.701348,-0.407342,8.043631,-0.29164,-0.066932,-0.067729,-0.458978,0.200264,-0.198889,0.247659,-0.243271,-0.132037


In [None]:
X_test.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition
0,0.137204,0.37393,0.117329,0.0,0.061804,0.209216,1.118073,-0.006188,0.027597,3.668887,-0.23762,0.695106,0.220878,0.091881,0.292811,-0.231797,-0.076201,-2.30839,0.431572,0.433977,1.999022,-0.139266,-1.163236,-0.405179,-0.736309,-0.564348,1.049588,-0.233275,1.087629,-0.82016,-0.034059,-0.630312,-0.269493,1.15809,-0.313783,-0.360329,0.565313,2.220018,0.10818,-1.199685,0.271163,0.312725,2.585161,-0.801608,-0.136399,1.673789,1.118536,-0.239358,2.613174,-0.764776,1.388101,4.34945,-0.761139,1.515367,-6.38803,2.134052,0.652867,0.441233,0.10665,0.280787,0.31724,0.05101,0.105448,0.110706,0.301182,-0.744168,-0.701348,2.454942,-0.124322,-0.29164,-0.066932,-0.067729,-0.458978,0.200264,-0.198889,-1.231539,-0.243271,-1.772795
1,-0.929219,0.37393,0.537969,0.0,0.061804,0.209216,-0.734044,-0.006188,0.027597,-0.362327,-0.23762,-0.499968,-3.81364,0.091881,0.292811,-1.749912,-0.076201,1.256964,0.924777,1.595273,-0.514776,-0.139266,-1.163236,-1.138078,-0.736309,-0.564348,-0.689578,-0.233275,-0.68019,-0.82016,-0.034059,-0.630312,0.213609,0.468609,2.028891,2.775245,-0.750934,-0.57453,0.10818,-1.199685,0.271163,-2.832787,-0.458815,0.619245,-0.136399,0.276105,1.118536,-0.239358,-1.024232,-0.764776,0.163141,-0.208108,-2.260595,-0.943654,0.253411,2.134052,0.652867,0.441233,1.472441,-0.953174,-1.023124,-1.082997,0.105448,0.110706,0.301182,-0.744168,-0.701348,2.454942,-0.124322,-0.29164,-0.066932,-0.067729,2.057423,0.200264,-0.198889,0.617458,-0.243271,-0.132037
2,-0.929219,0.37393,0.117329,0.0,0.061804,0.209216,1.118073,-3.545431,0.027597,0.09761,-0.23762,-1.114647,0.220878,0.091881,0.292811,-1.749912,-0.796546,-1.417052,0.727495,1.740435,-0.514776,-0.139266,-0.991995,-1.027098,-0.791414,0.315027,-0.689578,-0.233275,-0.68019,-0.82016,-0.034059,-0.630312,-0.752596,0.077188,-0.313783,-0.360329,-0.099642,-0.795093,0.10818,-1.199685,0.271163,0.312725,-0.390429,-0.288267,-0.136399,-0.537531,1.118536,-0.239358,-1.024232,-0.764776,0.163141,-0.208108,0.738317,-0.943654,0.253411,0.592466,0.652867,-1.483364,1.224116,-0.953174,-1.023124,-0.562469,0.105448,0.110706,0.301182,-0.744168,-0.701348,2.454942,-0.124322,-0.29164,-0.066932,-0.067729,-0.458978,0.200264,-0.198889,0.247659,-0.243271,-0.132037
3,1.377887,0.37393,0.691481,0.0,0.061804,0.209216,-0.734044,-0.006188,0.027597,-0.362327,-0.23762,0.724271,0.220878,0.091881,0.292811,1.282259,0.644144,-0.525713,-0.193153,0.385589,1.999022,-0.139266,0.129777,-0.166643,0.817224,1.068777,-0.689578,-0.233275,-0.68019,0.652094,-0.034059,-0.630312,0.213609,0.45348,-0.313783,-0.360329,0.456005,0.230409,0.10818,0.889627,0.271163,0.312725,0.133698,1.183003,-0.136399,1.019541,-0.812813,-0.239358,0.794471,1.221224,1.388101,-0.208108,-0.761139,0.900612,0.253411,0.592466,0.652867,0.441233,0.065263,1.514749,0.31724,0.148609,0.105448,0.110706,0.301182,1.519452,1.031486,-0.407342,-0.124322,-0.29164,-0.066932,-0.067729,-0.458978,0.200264,-0.198889,-1.231539,-0.243271,-0.132037
4,-0.978044,-1.947321,-3.579893,0.0,0.061804,0.209216,-0.734044,-0.006188,0.027597,-0.362327,-0.23762,-1.532774,0.220878,0.091881,-3.058187,1.282259,-0.076201,-0.525713,0.069889,0.772688,-0.514776,-0.139266,-0.474637,-0.405179,0.817224,1.516657,-0.689578,-0.233275,-0.68019,-0.82016,-0.034059,-0.630312,-1.235699,-1.345139,-0.313783,-0.360329,-0.097365,-1.213479,0.10818,-1.199685,0.271163,0.312725,-2.355661,0.497785,-0.136399,-0.827817,-0.812813,-0.239358,-1.024232,1.221224,0.163141,-0.208108,-0.761139,-0.328899,0.253411,-0.949119,-1.007522,-1.483364,0.396364,-0.953174,-1.023124,-0.971455,0.105448,0.110706,0.301182,-0.744168,-0.701348,-0.407342,-0.124322,-0.29164,-0.066932,-0.067729,-0.458978,0.200264,-0.198889,-1.231539,-0.243271,-0.954821


In [None]:
# let's now save the train and test sets for the next notebook!

path_dataset = "/content/drive/MyDrive/Colab Notebooks/Training_projects/Houses_sales/dataset_after_feature_eng/"
X_train.to_csv(path_dataset+'FE_xtrain.csv', index=False)
X_test.to_csv(path_dataset+'FE_xtest.csv', index=False)

y_train.to_csv(path_dataset+'FE_ytrain.csv', index=False)
y_test.to_csv(path_dataset+'FE_ytest.csv', index=False)