In [1]:
import pandas as pd

#### Data Description
LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

TotalBsmtSF: Total square feet of basement area

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

Fireplaces: Number of fireplaces

PoolArea: Pool area in square feet

GarageCars: Size of garage in car capacity

WoodDeckSF: Wood deck area in square feet

ScreenPorch: Screen porch area in square feet

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

Condition1: Proximity to various conditions
	
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad

Heating: Type of heating
		
       Floor	Floor Furnace
       GasA	Gas forced warm air furnace
       GasW	Gas hot water or steam heat
       Grav	Gravity furnace	
       OthW	Hot water or steam heat other than gas
       Wall	Wall furnace

Street: Type of road access to property

       Grvl	Gravel	
       Pave	Paved

CentralAir: Central air conditioning

       N	No
       Y	Yes

Foundation: Type of foundation
		
       BrkTil	Brick & Tile
       CBlock	Cinder Block
       PConc	Poured Concrete
       Slab	Slab
       Stone	Stone
       Wood	Wood

ExterQual: Evaluates the quality of the material on the exterior 
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
		
ExterCond: Evaluates the present condition of the material on the exterior
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

BsmtQual: Evaluates the height of the basement

       Ex	Excellent (100+ inches)	
       Gd	Good (90-99 inches)
       TA	Typical (80-89 inches)
       Fa	Fair (70-79 inches)
       Po	Poor (<70 inches)
       NA	No Basement
		
BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness
       NA	No Basement
	
BsmtExposure: Refers to walkout or garden level walls

       Gd	Good Exposure
       Av	Average Exposure (split levels or foyers typically score average or above)	
       Mn	Minimum Exposure
       No	No Exposure
       NA	No Basement
	
BsmtFinType1: Rating of basement finished area

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinished
       NA	No Basement

KitchenQual: Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor

FireplaceQu: Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace

In [2]:
url = '/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_7 Supervised_ML/Data/housing_iteration_4_classification/housing_iteration_4_classification.csv'

In [3]:
data = pd.read_csv(url)

In [4]:
data.head()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive,...,CentralAir,Foundation,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,KitchenQual,FireplaceQu
0,8450,65.0,856,3,0,0,2,0,0,0,...,Y,PConc,Gd,TA,Gd,TA,No,GLQ,Gd,
1,9600,80.0,1262,3,1,0,2,298,0,0,...,Y,CBlock,TA,TA,Gd,TA,Gd,ALQ,TA,TA
2,11250,68.0,920,3,1,0,2,0,0,0,...,Y,PConc,Gd,TA,Gd,TA,Mn,GLQ,Gd,TA
3,9550,60.0,756,3,1,0,3,0,0,0,...,Y,BrkTil,TA,TA,TA,Gd,No,ALQ,Gd,Gd
4,14260,84.0,1145,4,1,0,3,192,0,0,...,Y,PConc,Gd,TA,Gd,TA,Av,GLQ,Gd,TA


## Split X and y


In [5]:
y = data.pop('Expensive')

In [6]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Expensive, dtype: int64

In [7]:
X = data.copy()

In [8]:
X.head()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,MSZoning,...,CentralAir,Foundation,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,KitchenQual,FireplaceQu
0,8450,65.0,856,3,0,0,2,0,0,RL,...,Y,PConc,Gd,TA,Gd,TA,No,GLQ,Gd,
1,9600,80.0,1262,3,1,0,2,298,0,RL,...,Y,CBlock,TA,TA,Gd,TA,Gd,ALQ,TA,TA
2,11250,68.0,920,3,1,0,2,0,0,RL,...,Y,PConc,Gd,TA,Gd,TA,Mn,GLQ,Gd,TA
3,9550,60.0,756,3,1,0,3,0,0,RL,...,Y,BrkTil,TA,TA,TA,Gd,No,ALQ,Gd,Gd
4,14260,84.0,1145,4,1,0,3,192,0,RL,...,Y,PConc,Gd,TA,Gd,TA,Av,GLQ,Gd,TA


## Splitting into train and test datasets

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [11]:
X_train.isna().sum()

LotArea           0
LotFrontage     217
TotalBsmtSF       0
BedroomAbvGr      0
Fireplaces        0
PoolArea          0
GarageCars        0
WoodDeckSF        0
ScreenPorch       0
MSZoning          0
Condition1        0
Heating           0
Street            0
CentralAir        0
Foundation        0
ExterQual         0
ExterCond         0
BsmtQual         28
BsmtCond         28
BsmtExposure     28
BsmtFinType1     28
KitchenQual       0
FireplaceQu     547
dtype: int64

## Categorical Encoding Manual Approach (without using pipelines)

### 2.1. Replacing NaNs

We will need two different strategies to deal with missing values in numerical and categorical features.

#### 2.1.1. Replacing NaNs in categorical features

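In our preprocessing pipeline for numerical features, we imputed the mean in place of NaNs. Categorical features pose a problem: they don’t have a “mean”. Here, we will replace their NaNs with a string that marks them: “N_A”. It is not an elegant solution, but it will allow us to move forward. Note that, according to the data description above, NA in the basement and fireplace columns stands for "No Basement" or "No Fireplace", so many of these NaNs are likely a category of their own rather than truly missing values.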

In [12]:
from sklearn.impute import SimpleImputer

# select the non-numerical columns
X_train_cat = X_train.select_dtypes(exclude='number')

# define the imputer to use 'N_A' as the replacement value
cat_imputer = SimpleImputer(strategy='constant',
                            fill_value='N_A').set_output(transform='pandas')

# fit and transform
X_cat_imputed = cat_imputer.fit_transform(X_train_cat)
X_cat_imputed

Unnamed: 0,MSZoning,Condition1,Heating,Street,CentralAir,Foundation,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,KitchenQual,FireplaceQu
254,RL,Norm,GasA,Pave,Y,CBlock,TA,Gd,TA,TA,No,Rec,TA,N_A
1066,RL,Norm,GasA,Pave,Y,PConc,Gd,TA,Gd,TA,No,Unf,TA,TA
638,RL,Feedr,GasA,Pave,Y,CBlock,TA,TA,Fa,TA,No,Unf,TA,N_A
799,RL,Feedr,GasA,Pave,Y,BrkTil,TA,TA,Gd,TA,No,ALQ,Gd,TA
380,RL,Norm,GasA,Pave,Y,BrkTil,TA,TA,TA,TA,No,LwQ,Gd,Gd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,RL,Norm,GasA,Pave,Y,PConc,Gd,TA,Gd,TA,No,GLQ,Gd,Gd
1130,RL,Norm,GasA,Pave,Y,BrkTil,TA,TA,TA,TA,No,BLQ,Gd,TA
1294,RL,Norm,GasA,Pave,Y,CBlock,TA,TA,TA,TA,No,Rec,TA,N_A
860,RL,Norm,GasA,Pave,Y,BrkTil,Gd,TA,TA,TA,No,Unf,Gd,Gd


#### 2.1.2. Replacing NaNs in numerical features with the mean

In [13]:
# select the numerical columns
X_train_numbers = X_train.select_dtypes(include='number')

# define the imputer to use the column mean as the replacement value
num_imputer = SimpleImputer(strategy='mean').set_output(transform='pandas')

# fit and transform
X_num_imputed = num_imputer.fit_transform(X_train_numbers)
X_num_imputed.head()

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
254,8400.0,70.0,1314.0,3.0,0.0,0.0,1.0,250.0,0.0
1066,7837.0,59.0,799.0,3.0,1.0,0.0,2.0,0.0,0.0
638,8777.0,67.0,796.0,2.0,0.0,0.0,0.0,328.0,0.0
799,7200.0,60.0,731.0,3.0,2.0,0.0,1.0,0.0,0.0
380,5000.0,50.0,1026.0,3.0,1.0,0.0,1.0,0.0,0.0


In [14]:
# Concatenate the imputed categorical and numerical columns
X_imputed = pd.concat([X_cat_imputed, X_num_imputed], axis=1)
X_imputed.head()

Unnamed: 0,MSZoning,Condition1,Heating,Street,CentralAir,Foundation,ExterQual,ExterCond,BsmtQual,BsmtCond,...,FireplaceQu,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
254,RL,Norm,GasA,Pave,Y,CBlock,TA,Gd,TA,TA,...,N_A,8400.0,70.0,1314.0,3.0,0.0,0.0,1.0,250.0,0.0
1066,RL,Norm,GasA,Pave,Y,PConc,Gd,TA,Gd,TA,...,TA,7837.0,59.0,799.0,3.0,1.0,0.0,2.0,0.0,0.0
638,RL,Feedr,GasA,Pave,Y,CBlock,TA,TA,Fa,TA,...,N_A,8777.0,67.0,796.0,2.0,0.0,0.0,0.0,328.0,0.0
799,RL,Feedr,GasA,Pave,Y,BrkTil,TA,TA,Gd,TA,...,TA,7200.0,60.0,731.0,3.0,2.0,0.0,1.0,0.0,0.0
380,RL,Norm,GasA,Pave,Y,BrkTil,TA,TA,TA,TA,...,Gd,5000.0,50.0,1026.0,3.0,1.0,0.0,1.0,0.0,0.0


In [15]:
X_imputed.isna().sum() #no missing values

MSZoning        0
Condition1      0
Heating         0
Street          0
CentralAir      0
Foundation      0
ExterQual       0
ExterCond       0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
KitchenQual     0
FireplaceQu     0
LotArea         0
LotFrontage     0
TotalBsmtSF     0
BedroomAbvGr    0
Fireplaces      0
PoolArea        0
GarageCars      0
WoodDeckSF      0
ScreenPorch     0
dtype: int64

### 2.2. One Hot encoding

As I learnt in the Platform lesson, one-hot encoding means creating a new binary column for each category in every categorical column. Fortunately, a Scikit-Learn transformer, `OneHotEncoder`, takes care of everything.

####  Fitting the `OneHotEncoder`

As with any transformer, we have to:
1. Import it
2. Initialize it
3. Fit it to the data
4. Use it to transform the data

In [16]:
#import it
from sklearn.preprocessing import OneHotEncoder


In [17]:
#Initialize it
one_hot_coder = OneHotEncoder(sparse_output=False).set_output(transform='pandas')

In [18]:
#fit it to the data
one_hot_coder.fit(X_cat_imputed)

In [19]:
#transform the data
X_cat_hot_coded = one_hot_coder.transform(X_cat_imputed)

In [20]:
X_cat_hot_coded

Unnamed: 0,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,...,KitchenQual_Ex,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,FireplaceQu_Ex,FireplaceQu_Fa,FireplaceQu_Gd,FireplaceQu_N_A,FireplaceQu_Po,FireplaceQu_TA
254,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1066,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
638,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
799,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
380,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1130,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1294,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
860,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


### Concatenating "one-hot" columns with numerical columns:

Now that the categorical columns are numerical, we can join them back with the originally numerical columns and assemble the dataset that will be ready for modelling:

In [21]:
# Concatenate the one-hot encoded categorical features and the imputed numerical features along the columns
X_imputed_onehot = pd.concat([X_cat_hot_coded, X_num_imputed], axis=1)

# Display the first 3 rows of the resulting DataFrame
X_imputed_onehot.head(3)

Unnamed: 0,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,...,FireplaceQu_TA,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch
254,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,8400.0,70.0,1314.0,3.0,0.0,0.0,1.0,250.0,0.0
1066,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,7837.0,59.0,799.0,3.0,1.0,0.0,2.0,0.0,0.0
638,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,8777.0,67.0,796.0,2.0,0.0,0.0,0.0,328.0,0.0


##  Categorical encoding - "Automated" approach (Using Pipelines)

In the manual approach, to encode the categorical columns numerically, we have:

1. Selected the categorical columns.
2. Fitted a `OneHotEncoder` to them.
3. Transformed the categorical columns with the encoder.
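4. Converted the sparse matrix into a dataframe (handled here by `sparse_output=False` together with `set_output(transform='pandas')`).
5. Recovered the names of the columns (also handled by `set_output`).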
6. Concatenated the one-hot columns with the numerical columns.

All these steps can be condensed by using Scikit-Learn Pipelines, and specifically something called `ColumnTransformer`, which allows us to apply different transformations to two or more groups of columns: in our case, categorical and numerical columns.

This process is also called creating "branches" in the pipeline. One branch for the categorical columns and another for the numerical columns. Each branch will contain as many transformers as we want. Then, the branches will meet again, and the transformed columns will be automatically concatenated. Let's see the process in action:

###  Creating the "numeric pipe" and the "categoric pipe"

In [22]:
# Import the necessary modules
from sklearn.pipeline import make_pipeline

# Select categorical and numerical column names
X_cat_columns = X.select_dtypes(exclude='number').columns
X_num_columns = X.select_dtypes(include='number').columns

# Create a numerical pipeline, only with the SimpleImputer(strategy="mean")
numeric_pipe = make_pipeline(SimpleImputer(strategy='mean'))

# Create a categorical pipeline, with the SimpleImputer(fill_value="N_A") and the OneHotEncoder
categoric_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='N_A'),
    OneHotEncoder(sparse_output=False, handle_unknown='infrequent_if_exist', min_frequency=0.03)
)

### Using `ColumnTransformer` to build a pipeline with 2 branches (the `preprocessor`)

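We simply tell the `ColumnTransformer` the following:

- One branch will apply the steps in the `numeric_pipe` to the columns named in `X_num_columns`
- The second branch will apply the steps in the `categoric_pipe` to the columns named in `X_cat_columns`

Because we build it with `make_column_transformer`, we do not name the branches ourselves: they are named automatically after their class, as `"pipeline-1"` and `"pipeline-2"`.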

In [23]:
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
    (numeric_pipe, X_num_columns),
    (categoric_pipe, X_cat_columns),
)


In [24]:
preprocessor

### Creating the `full_pipeline` (`preprocessor` + Decision Tree)

Pipelines are modular. The `preprocessor` we created above with `make_column_transformer` can now become a step in a new pipeline, which we'll call `full_pipeline` and which will include, as a last step, a Decision Tree model:

In [25]:
from sklearn.tree import DecisionTreeClassifier

# Create a pipeline with a preprocessor and a DecisionTreeClassifier
full_pipeline = make_pipeline(
    preprocessor,
    DecisionTreeClassifier()
)


In [26]:
full_pipeline.fit(X_train, y_train)

In [30]:
#importing necessary modules
from sklearn.model_selection import GridSearchCV

param_grid = {
    'columntransformer__pipeline-1__simpleimputer__strategy':['mean', 'median'],
    'decisiontreeclassifier__criterion':['gini', 'entropy'],
    'decisiontreeclassifier__max_depth':range(2,14,2),
    'decisiontreeclassifier__min_samples_split':range(2,14,2),
    'decisiontreeclassifier__min_samples_leaf':range(3,12,2),
    'decisiontreeclassifier__max_features':['sqrt', 'log2', 4],
    'decisiontreeclassifier__max_leaf_nodes':range(2,15,2)
    
}

#define GridSearchCV
dt_search = GridSearchCV(full_pipeline,
                         param_grid,
                         cv=5,
                         verbose=1)
dt_search.fit(X_train,y_train)

best_parameters = dt_search.best_params_
best_parameters

Fitting 5 folds for each of 15120 candidates, totalling 75600 fits


{'columntransformer__pipeline-1__simpleimputer__strategy': 'mean',
 'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 6,
 'decisiontreeclassifier__max_features': 'sqrt',
 'decisiontreeclassifier__max_leaf_nodes': 14,
 'decisiontreeclassifier__min_samples_leaf': 3,
 'decisiontreeclassifier__min_samples_split': 2}
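The "15120 candidates" in the log above is simply the product of the number of values tried for each hyperparameter, and the number of fits is that product times the number of folds. A quick check, repeating the same grid:

```python
from math import prod

# Same grid as in the cell above
param_grid = {
    'columntransformer__pipeline-1__simpleimputer__strategy': ['mean', 'median'],
    'decisiontreeclassifier__criterion': ['gini', 'entropy'],
    'decisiontreeclassifier__max_depth': range(2, 14, 2),
    'decisiontreeclassifier__min_samples_split': range(2, 14, 2),
    'decisiontreeclassifier__min_samples_leaf': range(3, 12, 2),
    'decisiontreeclassifier__max_features': ['sqrt', 'log2', 4],
    'decisiontreeclassifier__max_leaf_nodes': range(2, 15, 2),
}

# 2 * 2 * 6 * 6 * 5 * 3 * 7 combinations
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * 5  # cv=5 folds per candidate

print(n_candidates, n_fits)  # 15120 75600
```

Every value added to any one list multiplies the total, so the grid grows quickly.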

In [32]:
y_predict = dt_search.predict(X_test)

In [34]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_predict)
acc

0.9417808219178082
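A test accuracy is easier to interpret next to the cross-validated accuracy of the best candidate, which a fitted `GridSearchCV` exposes as `best_score_`. The sketch below shows the comparison on synthetic stand-in data with a much smaller grid (both are assumptions for illustration, not the housing data or the grid above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the housing dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Deliberately tiny grid so the example runs quickly
toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 4, 6]},
    cv=5,
)
toy_search.fit(X_tr, y_tr)

# Mean cross-validated accuracy of the best candidate (train data only)
cv_accuracy = toy_search.best_score_
# Accuracy on data the search never saw
test_accuracy = accuracy_score(y_te, toy_search.predict(X_te))

print(f"CV accuracy:   {cv_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

If the two numbers are close, the cross-validation estimate was a fair preview of test performance; a test accuracy far below the cross-validated one would hint at overfitting to the training folds.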