## Feature Engineering

This notebook brings together various feature engineering techniques to give you an idea of the end-to-end pipeline for building machine learning models.

We will:
- build a Lasso regression model
- use Feature-engine for the feature engineering steps
- set up an entire engineering and prediction pipeline using a Scikit-learn Pipeline

===================================================================================================



We will use the House Prices dataset.

In [1]:
!pip install feature_engine

Collecting feature_engine
  Downloading feature_engine-1.1.2-py2.py3-none-any.whl (180 kB)
Installing collected packages: feature-engine
Successfully installed feature-engine-1.1.2


## House Prices dataset

In [2]:
import warnings 
warnings.filterwarnings('ignore')

from math import sqrt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# for the model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# for feature engineering
from sklearn.preprocessing import StandardScaler
from feature_engine import imputation as mdi
from feature_engine import discretisation as dsc
from feature_engine import encoding as ce

pd.set_option('display.max_columns', None)

In [None]:
'''
Download dataset
'''
!wget https://hr-projects-assets-prod.s3.amazonaws.com/cnjqbce2gfg/ba26e5aee7029d68d6adc1ab20cf3c54/houseprice.csv

### Load Datasets

In [3]:
'''
load dataset, file : houseprice.csv
'''

data = pd.read_csv('houseprice.csv')

### Types of variables 

Go ahead and find out what types of variables there are in this dataset

In [7]:
'''
inspect the size of the dataset (rows, columns)
'''
data.shape

(1460, 81)

There is a mixture of categorical and numerical variables. Numerical variables are those of type **int** and **float**, and categorical variables are those of type **object**.

Id is a unique identifier for each house, so it is not a variable that we can use as a predictor.

#### Find categorical variables

In [6]:
'''
find categorical variables
'''
print(data.select_dtypes(include=['object']).columns.tolist())
categorical = data.select_dtypes(include=['object']).columns.tolist()

print('There are {} categorical variables'.format(len(categorical)))

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
There are 43 categorical variables


In [8]:
data[categorical].head()

Unnamed: 0,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,FireplaceQu,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
0,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,,Attchd,RFn,TA,TA,Y,,,,WD,Normal
1,RL,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,Gable,CompShg,MetalSd,MetalSd,,TA,TA,CBlock,Gd,TA,Gd,ALQ,Unf,GasA,Ex,Y,SBrkr,TA,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal
2,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Mn,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal
3,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,Gable,CompShg,Wd Sdng,Wd Shng,,TA,TA,BrkTil,TA,Gd,No,ALQ,Unf,GasA,Gd,Y,SBrkr,Gd,Typ,Gd,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml
4,RL,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal


#### Find temporal variables

There are a few variables in the dataset that are temporal. They indicate the year in which something happened. We shouldn't use these variables straightaway for model building. We should instead transform them to capture some sort of time information. Let's inspect these temporal variables:


In [21]:
'''
make a list of the numerical variables first
'''

numerical = data.select_dtypes(exclude=['object']).columns.tolist()
print(numerical)
'''
list of variables that contain year information
'''
df_years=(data.filter(regex=("Year.*|Yr.*")))
year_vars = list(df_years.columns)

year_vars

['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice']


['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

In [22]:
data[year_vars].head()

Unnamed: 0,YearBuilt,YearRemodAdd,GarageYrBlt,YrSold
0,2003,2003,2003.0,2008
1,1976,1976,1976.0,2007
2,2001,2002,2001.0,2008
3,1915,1970,1998.0,2006
4,2000,2000,2000.0,2008


These variables correspond to the years in which the house was built or remodeled, the garage was built, or the house was sold. It would be more informative to capture the time elapsed between, for example, the year the house was built and the year it was sold. We will do that in the feature engineering section later.

We have another temporal variable: MoSold, which indicates the month in which the house was sold. Let's inspect whether the house price varies with the time of year in which it is sold:

In [None]:
'''
plot median house price per month in which it was sold
'''



### Q1. Does the price seem to vary depending on the month in which the house is sold?

    A. No
    B. Yes
    
    Assign A or B to q1.

In [None]:
# Replace "X" with A or B
q1 = str("X")

In [None]:
'''
Please run this cell to submit your answer for evaluation
'''

# write the chosen option to the submission file
with open("submit0.txt", "w") as file:
    file.write(q1)
    file.write("\n")

#### Find discrete variables

To identify discrete variables, we will select from all the numerical ones, those that contain a finite and small number of distinct values.

In [None]:
'''
visualise the values of the discrete variables
'''
discrete = []

# Code

print('There are {} discrete variables'.format(len(discrete)))
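
A minimal sketch of the selection rule described above, on a made-up frame. The cut-off `max_distinct` is a hypothetical choice (the text only says "small"), and the toy columns merely borrow names from the dataset.

```python
import pandas as pd

# Toy stand-in for the dataset; values are invented for illustration.
toy = pd.DataFrame({
    'Id': range(1, 9),
    'OverallQual': [5, 6, 7, 5, 6, 7, 8, 5],
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    'YrSold': [2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009],
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

numerical = toy.select_dtypes(exclude=['object']).columns.tolist()
year_vars = ['YrSold']
max_distinct = 5  # hypothetical threshold for "a small number of distinct values"

# Numerical variables with few distinct values, skipping the identifier,
# the target and the year variables.
discrete = [
    var for var in numerical
    if toy[var].nunique() < max_distinct
    and var not in year_vars + ['Id', 'SalePrice']
]
print(discrete)
```

A larger threshold would likely be needed on the full dataset, since it has far more rows than this toy frame.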

#### Continuous variables

In [None]:
'''
find continuous variables
let's remember to skip the Id variable and the target variable SalePrice
which are both also numerical'''

numerical = []

print('There are {} numerical and continuous variables'.format(len(numerical)))
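
One way to read the instruction in this cell is as a set difference: continuous variables are the numerical ones that are neither discrete, nor temporal, nor the identifier or target. The lists below are hypothetical and shortened; only the filtering logic is the point.

```python
# Hypothetical, shortened variable lists for illustration.
numerical_all = ['Id', 'LotArea', 'OverallQual', 'YearBuilt',
                 'YrSold', 'GrLivArea', 'SalePrice']
discrete = ['OverallQual']
year_vars = ['YearBuilt', 'YrSold']

# Everything numerical that is not discrete, temporal, the Id or the target.
excluded = set(discrete + year_vars + ['Id', 'SalePrice'])
continuous = [var for var in numerical_all if var not in excluded]
print(continuous)
```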

Now that we have inspected the different types of variables in the house price dataset, let's move on to understand the types of problems that these variables have.

### Types of problems within the variables

#### Missing values

In [None]:
'''
Find variables with NA and the percentage of NA
'''

# Code
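
As a rough sketch of this step, the fraction of missing values per column can be obtained with `isnull().mean()`. The frame below is made up; on the real data the same two lines would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries; values are invented.
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'Alley': [np.nan, np.nan, np.nan, 'Grvl'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Fraction of missing values per variable, keeping only variables with NA.
na_fraction = toy.isnull().mean()
na_fraction = na_fraction[na_fraction > 0].sort_values(ascending=False)
print(na_fraction)
```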

#### Outliers and distributions

In [None]:
'''
let's make boxplots to visualise outliers in the continuous variables 
and histograms to get an idea of the distribution
'''

#code
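
A boxplot typically flags points beyond 1.5 times the inter-quartile range from the quartiles. The sketch below computes those boundaries numerically on a made-up series; the 1.5 factor is the usual boxplot convention, assumed here rather than taken from the notebook.

```python
import pandas as pd

# Toy continuous variable with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 50], name='toy_variable')

first_quartile = values.quantile(0.25)
third_quartile = values.quantile(0.75)
iqr = third_quartile - first_quartile

# Boundaries beyond which a boxplot would draw a point as an outlier.
lower = first_quartile - 1.5 * iqr
upper = third_quartile + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```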

#### Outliers in discrete variables

Now, let's identify outliers in the discrete variables.
**Discrete variables can be pre-processed / engineered as if they were categorical**. 

In [None]:
'''
outliers in discrete variables
'''

# Code
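
If discrete variables are treated like categorical ones, one reasonable reading of "outlier" is a value that appears in only a small share of the rows. The sketch below assumes that definition and reuses 0.05 as the cut-off, matching the `tol` given for the RareLabelEncoder later in this notebook; the data is made up.

```python
import pandas as pd

# Toy discrete variable: the value 3 appears in only 1 of 25 rows.
toy_discrete = pd.Series([1] * 12 + [2] * 12 + [3], name='toy_variable')

tol = 0.05  # assumed cut-off for a "rare" value

# Share of rows taken by each distinct value.
frequencies = toy_discrete.value_counts(normalize=True)
rare_values = frequencies[frequencies < tol].index.tolist()
print(rare_values)
```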

### Monotonicity between discrete variables and target values

In [None]:
'''
plot the median sale price per value of the discrete
variable
'''


Some of the discrete variables show some sort of monotonic relationship and some don't.
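
Besides plotting, monotonicity can be checked numerically on the per-value medians. The helper `is_monotonic` below is a hypothetical name, and the frame is invented so that one variable is monotonic and the other is not.

```python
import pandas as pd

# Toy data, constructed so that OverallQual is monotonic and MoSold is not.
toy = pd.DataFrame({
    'OverallQual': [4, 5, 6, 4, 5, 6],
    'MoSold':      [1, 6, 12, 6, 12, 1],
    'SalePrice':   [100_000, 150_000, 200_000, 110_000, 160_000, 210_000],
})

def is_monotonic(df, var, target='SalePrice'):
    """Return True if the median target rises or falls steadily with var."""
    medians = df.groupby(var)[target].median()  # index is sorted by var
    return bool(medians.is_monotonic_increasing or medians.is_monotonic_decreasing)

for var in ['OverallQual', 'MoSold']:
    print(var, is_monotonic(toy, var))
```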

#### Number of labels: cardinality

Let's go ahead now and examine the cardinality of our categorical variables. That is, the number of different labels.

In [None]:
'''
plot number of categories per categorical variable
'''

# Code
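
Cardinality can be counted with `nunique()` before plotting it. A minimal sketch on an invented frame with two categorical columns:

```python
import pandas as pd

# Toy categorical columns; labels borrowed from the dataset for readability.
toy = pd.DataFrame({
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
    'Neighborhood': ['CollgCr', 'Veenker', 'Crawfor', 'NoRidge'],
})

# Number of distinct labels per variable, highest first.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```

On the real data, `data[categorical].nunique()` would give the same count per categorical variable, and calling `.plot.bar()` on the result should draw it.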

Most of the variables contain only a few labels, so we do not have to deal with high cardinality.

### Separate train and test set

In [None]:
'''
Split data into train and test set
'''

X_train, X_test, y_train, y_test = train_test_split(data.drop(['Id', 'SalePrice'], axis=1),
                                                    data['SalePrice'],
                                                    test_size=0.2,
                                                    random_state=42)

X_train.shape, X_test.shape

**Now we will move on and engineer the features of this dataset, the most important part of this course.**

### Temporal variables 

First, we will create those temporal variables.

In [None]:
'''
function to calculate elapsed time
'''

def elapsed_years(df, var):
    # capture the difference between the year the house was sold
    # and the year variable, i.e. the years elapsed at sale time
    df[var] = df['YrSold'] - df[var]
    return df

In [None]:
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    X_train = elapsed_years(X_train, var)
    X_test = elapsed_years(X_test, var)

In [None]:
X_train[['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']].head()

Instead of the "year", we now have the number of **years that passed** between the year the house was built or remodeled (or the garage was built) and the year the house was sold. Next, we drop the YrSold variable from the datasets, because we have already used it to compute these elapsed times.

In [None]:
'''
drop YrSold
'''
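
The order matters here: YrSold has to be dropped only after the elapsed times have been computed from it. A small sketch of both steps on three rows taken from the preview of the year variables shown earlier (the frame name `toy` is an assumption):

```python
import pandas as pd

# Three rows from the year-variable preview above.
toy = pd.DataFrame({
    'YearBuilt':    [2003, 1976, 1915],
    'YearRemodAdd': [2003, 1976, 1970],
    'YrSold':       [2008, 2007, 2006],
})

# Replace each year by the number of years elapsed at the time of sale.
for var in ['YearBuilt', 'YearRemodAdd']:
    toy[var] = toy['YrSold'] - toy[var]

# YrSold is no longer needed once the elapsed times are computed.
toy = toy.drop(columns=['YrSold'])
print(toy)
```

In the notebook itself the drop would be applied to both `X_train` and `X_test`.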


In [None]:
'''
capture the column names for later use in the notebook
'''
final_columns = X_train.columns

### Missing data imputation
#### Continuous variables

In [None]:
'''
print variables with missing data
keep in mind that now that we created those new temporal variables, we
are going to treat them as numerical and continuous:
'''

'''
remove YrSold from the variable list
because it is no longer in our dataset
'''


'''
examine percentage of missing values
'''


In [None]:
'''
print variables with missing data
'''



In [None]:
'''
we will treat discrete variables as if they were categorical;
to do so with Feature-engine,
we need to re-cast them as object
'''
X_train[discrete] = X_train[discrete].astype('O')
X_test[discrete] = X_test[discrete].astype('O')

## Putting it all together

Create a pipeline using the following parameters:

   ### Pipeline
    
       # missing data imputation
       
       AddMissingIndicator : variables = 'LotFrontage', 'MasVnrArea',  'GarageYrBlt'
       MeanMedianImputer   : imputation_method='median',
                              variables= 'LotFrontage', 'MasVnrArea',  'GarageYrBlt'
       CategoricalImputer  : variables= categorical
       
       
       # categorical encoding
       
       RareLabelEncoder    : tol=0.05, n_categories=6, variables= categorical+discrete
       OrdinalEncoder      : encoding_method='ordered', variables=categorical+discrete
       
       # discretisation + encoding 
       
       EqualFrequencyDiscretiser : q=5, return_object=True, variables=numerical
       OrdinalEncoder      : encoding_method='ordered', variables=numerical
       
       # feature Scaling
       StandardScaler
       
       # regression
       Lasso : random_state=42

In [None]:
'''
Create Pipeline
'''

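# step names other than 'lasso' are illustrative choices
house_pipe = Pipeline([

    # missing data imputation
    ('missing_ind', mdi.AddMissingIndicator(
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),
    ('imputer_num', mdi.MeanMedianImputer(
        imputation_method='median',
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),
    ('imputer_cat', mdi.CategoricalImputer(variables=categorical)),

    # categorical encoding
    ('rare_label_enc', ce.RareLabelEncoder(
        tol=0.05, n_categories=6, variables=categorical + discrete)),
    ('categorical_enc', ce.OrdinalEncoder(
        encoding_method='ordered', variables=categorical + discrete)),

    # discretisation + encoding
    ('discretisation', dsc.EqualFrequencyDiscretiser(
        q=5, return_object=True, variables=numerical)),
    ('encoding', ce.OrdinalEncoder(
        encoding_method='ordered', variables=numerical)),

    # feature scaling
    ('scaler', StandardScaler()),

    # regression; the name 'lasso' is used later through named_steps
    ('lasso', Lasso(random_state=42)),
])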

In [None]:
'''
fit the pipeline on the training set only
'''
house_pipe.fit(X_train, y_train)

# let's get the predictions for the test set
X_test_preds = house_pipe.predict(X_test)

In [None]:
# a peek into the prediction values
X_test_preds

In [None]:
MSE_ = mean_squared_error(y_test, X_test_preds)
RMSE_ = sqrt(mean_squared_error(y_test, X_test_preds))
r2_ = r2_score(y_test, X_test_preds)

print('test mse: {}'.format(MSE_))
print('test rmse: {}'.format(RMSE_))
print('test r2: {}'.format(r2_))

In [None]:
'''
Please run this cell to submit your answer for evaluation
'''

# write RMSE, MSE and R2 to the submission file, one value per line
with open("submit2.txt", "w") as file:
    file.write(str(RMSE_))
    file.write("\n")
    file.write(str(MSE_))
    file.write("\n")
    file.write(str(r2_))
    file.write("\n")

In [None]:
'''
plot predictions vs real value
'''

plt.scatter(y_test,X_test_preds)
plt.xlabel('True Price')
plt.ylabel('Predicted Price')

In [None]:
'''
explore the importance of the features
'''
importance = pd.Series(np.abs(house_pipe.named_steps['lasso'].coef_))
importance.index = list(final_columns)+['LotFrontage_na', 'MasVnrArea_na',  'GarageYrBlt_na']
importance.sort_values(inplace=True, ascending=False)
importance.plot.bar(figsize=(18,6))