```
1. LotFrontage: Linear feet of street connected to property
2. LotArea: Lot size in square feet
3. TotalBsmtSF: Total square feet of basement area
4. BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
5. Fireplaces: Number of fireplaces
6. PoolArea: Pool area in square feet
7. GarageCars: Size of garage in car capacity 
8. WoodDeckSF: Wood deck area in square feet
9. ScreenPorch: Screen porch area in square feet
10. MSZoning: Identifies the general zoning classification of the sale.
    * A		Agriculture
    * C		Commercial
    * FV	Floating Village Residential
    * I		Industrial
    * RH	Residential High Density
    * RL	Residential Low Density
    * RP	Residential Low Density Park 
    * RM	Residential Medium Density
11. Condition1: Proximity to various conditions
    * Artery	Adjacent to arterial street
    * Feedr	Adjacent to feeder street	
    * Norm	Normal	
    * RRNn	Within 200' of North-South Railroad
    * RRAn	Adjacent to North-South Railroad
    * PosN	Near positive off-site feature--park, greenbelt, etc.
    * PosA	Adjacent to postive off-site feature
    * RRNe	Within 200' of East-West Railroad
    * RRAe	Adjacent to East-West Railroad
12. Heating: Type of heating
    * Floor	Floor Furnace
    * GasA	Gas forced warm air furnace
    * GasW	Gas hot water or steam heat
    * Grav	Gravity furnace	
    * OthW	Hot water or steam heat other than gas
    * Wall	Wall furnace
13. Street: Type of road access to property
    * Grvl	Gravel	
    * Pave	Paved
14. CentralAir: Central air conditioning
    * N	No
    * Y	Yes
15. Foundation: Type of foundation
    * BrkTil	Brick & Tile
    * CBlock	Cinder Block
    * PConc	Poured Contrete	
    * Slab	Slab
    * Stone	Stone
    * Wood	Wood
16. ExterQual: Evaluates the quality of the material on the exterior 
    * Ex	Excellent
    * Gd	Good
    * TA	Average/Typical
    * Fa	Fair
    * Po	Poor
17. ExterCond: Evaluates the present condition of the material on the exterior
    * Ex	Excellent
    * Gd	Good
    * TA	Average/Typical
    * Fa	Fair
    * Po	Poor
18. BsmtQual: Evaluates the height of the basement
    * Ex	Excellent (100+ inches)	
    * Gd	Good (90-99 inches)
    * TA	Typical (80-89 inches)
    * Fa	Fair (70-79 inches)
    * Po	Poor (<70 inches
    * NA	No Basement
19. BsmtCond: Evaluates the general condition of the basement
    * Ex	Excellent
    * Gd	Good
    * TA	Typical - slight dampness allowed
    * Fa	Fair - dampness or some cracking or settling
    * Po	Poor - Severe cracking, settling, or wetness
    * NA	No Basement
20. BsmtExposure: Refers to walkout or garden level walls
    * Gd	Good Exposure
    * Av	Average Exposure (split levels or foyers typically score average or above)	
    * Mn	Mimimum Exposure
    * No	No Exposure
    * NA	No Basement
21. BsmtFinType1: Rating of basement finished area
    * GLQ	Good Living Quarters
    * ALQ	Average Living Quarters
    * BLQ	Below Average Living Quarters
    * Rec	Average Rec Room
    * LwQ	Low Quality
    * Unf	Unfinshed
    * NA	No Basement
22. KitchenQual: Kitchen quality
    * Ex	Excellent
    * Gd	Good
    * TA	Typical/Average
    * Fa	Fair
    * Po	Poor
23. FireplaceQu: Fireplace quality
    * Ex	Excellent - Exceptional Masonry Fireplace
    * Gd	Good - Masonry Fireplace in main level
    * TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
    * Fa	Fair - Prefabricated Fireplace in basement
    * Po	Poor - Ben Franklin Stove
    * NA	No Fireplace
```

# Importing the libraries

In [None]:
from sklearn.model_selection import GridSearchCV
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn import set_config

# Reading and Splitting and Defining Train and Test

In [None]:


# reading
url = "https://drive.google.com/file/d/11uge7w4gJr_ufpN6bNsyGOcdWNUsO7Kc/view?usp=sharing"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
data = df = pd.read_csv(path)

# X and y creation
X = data
y = X.pop("Expensive")

# Feature Engineering
#X.loc[:, "Cabin"] = X.Cabin.str[0]

# data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Selecting features for Ordinal Encoding

In [None]:
X_ordinal = pd.DataFrame(X.iloc[:,15:22])
X_ordinal.head(1)

Unnamed: 0,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,KitchenQual
0,Gd,TA,Gd,TA,No,GLQ,Gd


In [None]:
ordinal_columns = X.columns.get_indexer(["ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "KitchenQual", "FireplaceQu" ])
#16
ExterQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#17
ExterCond1 = ["Ex","Gd","TA","Fa","Po","NA"]
#18
BsmtQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#19
BsmtCond1 = ["Ex","Gd","TA","Fa","Po","NA"]
#20
BsmtExposure1 = ["Gd","Av","Mn","No","NA"]
#21
BsmtFinType11 = ["GLQ", "ALQ", "BLQ", "Rec", "LwQ", "Unf","NA"]
#22
KitchenQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#23
FireplaceQu1 = ["Ex","Gd","TA","Fa","Po","NA"]

ordinal_cats1 = [ExterQual1, ExterCond1, BsmtQual1, BsmtCond1, BsmtExposure1, BsmtFinType11, KitchenQual1, FireplaceQu1 ]

In [None]:
ordinal_columns

array([15, 16, 17, 18, 19, 20, 21, 22])

# Imputing Missing Values

In [None]:
X_ordinal_imputed = SimpleImputer(strategy="constant", fill_value="NA").fit_transform(X_ordinal)
pd.DataFrame(X_ordinal_imputed).head(1)

Unnamed: 0,0,1,2,3,4,5,6
0,Gd,TA,Gd,TA,No,GLQ,Gd


In [None]:
#X_ordinal_imputed_ord = OrdinalEncoder().fit_transform(X_ordinal_imputed)
#pd.DataFrame(X_ordinal_imputed_ord).head(10)

Ex    Excellent
    * Gd    Good
    * TA    Average/Typical
    * Fa    Fair
    * Po  

16. ExterQual: Evaluates the quality of the material on the exterior 
    * Ex    Excellent
    * Gd    Good
    * TA    Average/Typical
    * Fa    Fair
    * Po    Poor
17. ExterCond: Evaluates the present condition of the material on the exterior
    * Ex    Excellent
    * Gd    Good
    * TA    Average/Typical
    * Fa    Fair
    * Po    Poor
18. BsmtQual: Evaluates the height of the basement
    * Ex    Excellent (100+ inches)    
    * Gd    Good (90-99 inches)
    * TA    Typical (80-89 inches)
    * Fa    Fair (70-79 inches)
    * Po    Poor (<70 inches
    * NA    No Basement
19. BsmtCond: Evaluates the general condition of the basement
    * Ex    Excellent
    * Gd    Good
    * TA    Typical - slight dampness allowed
    * Fa    Fair - dampness or some cracking or settling
    * Po    Poor - Severe cracking, settling, or wetness
    * NA    No Basement
20. BsmtExposure: Refers to walkout or garden level walls
    * Gd    Good Exposure
    * Av    Average Exposure (split levels or foyers typically score average or above)    
    * Mn    Mimimum Exposure
    * No    No Exposure
    * NA    No Basement
21. BsmtFinType1: Rating of basement finished area
    * GLQ    Good Living Quarters
    * ALQ    Average Living Quarters
    * BLQ    Below Average Living Quarters
    * Rec    Average Rec Room
    * LwQ    Low Quality
    * Unf    Unfinshed
    * NA    No Basement
22. KitchenQual: Kitchen quality
    * Ex    Excellent
    * Gd    Good
    * TA    Typical/Average
    * Fa    Fair
    * Po    Poor
23. FireplaceQu: Fireplace quality
    * Ex    Excellent - Exceptional Masonry Fireplace
    * Gd    Good - Masonry Fireplace in main level
    * TA    Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
    * Fa    Fair - Prefabricated Fireplace in basement
    * Po    Poor - Ben Franklin Stove
    * NA    No Fireplace

In [None]:
ordinal_columns = X.columns.get_indexer(["ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "KitchenQual", "FireplaceQu" ])
#16
ExterQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#17
ExterCond1 = ["Ex","Gd","TA","Fa","Po","NA"]
#18
BsmtQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#19
BsmtCond1 = ["Ex","Gd","TA","Fa","Po","NA"]
#20
BsmtExposure1 = ["Gd","Av","Mn","No","NA"]
#21
BsmtFinType11 = ["GLQ", "ALQ", "BLQ", "Rec", "LwQ", "Unf","NA"]
#22
KitchenQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#23
FireplaceQu1 = ["Ex","Gd","TA","Fa","Po","NA"]

ordinal_cats1 = [ExterQual1, ExterCond1, BsmtQual1, BsmtCond1, BsmtExposure1, BsmtFinType11, KitchenQual1, FireplaceQu1 ]

In [None]:
Rank_Cat_15_18_21_22 = ['Ex','Gd','TA','Fa','Po', 'Na']
Rank_Cat_20 = ['GLQ','ALQ','BLQ','Rec','LwQ','Unf','NA']
Rank_Cat_19 = ['Gd','Av','Mn','No','NA']

In [None]:
ordinal_columns

array([15, 16, 17, 18, 19, 20, 21, 22])

In [None]:
# 0. Set the config so that we can view our preprocessor
set_config(display="diagram")

# Defining categorical & ordinal columns

In [None]:
# 1. defining categorical & ordinal columns
X_cat = X.select_dtypes(exclude="number").copy()
X_num = X.select_dtypes(include="number").copy()

# Numerical pipeline

In [None]:
# 2. numerical pipeline
numeric_pipe = make_pipeline(
SimpleImputer(strategy="mean"))

# categorical pipeline

In [None]:
# 3. categorical pipeline

# # 3.1 defining ordinal & onehot columns
# .get_indexer() get's the index to solve the problem described above about losing column names
ordinal_cols = X_cat.columns.get_indexer(['ExterQual', 'ExterCond', 
                                          'BsmtQual', 'BsmtCond','BsmtExposure', 
                                          'BsmtFinType1', 'KitchenQual', 'FireplaceQu'])
onehot_cols = X_cat.columns.get_indexer(list(set(X_cat) - set(ordinal_cols)))


In [None]:
ordinal_cats = [ExterQual1, ExterCond1, BsmtQual1, BsmtCond1, BsmtExposure1, BsmtFinType11, KitchenQual1, FireplaceQu1 ]

## Defining the categorical encoder

In [None]:
# # 3.2. defining the categorical encoder
# # # 3.2.1. we manually establish the order of the categories for our ordinal feature (Cabin), including "N_A"
ordinal_cats = [ExterQual1, ExterCond1, BsmtQual1, BsmtCond1, BsmtExposure1, BsmtFinType11, KitchenQual1, FireplaceQu1]

In [None]:
#16
ExterQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#17
ExterCond1 = ["Ex","Gd","TA","Fa","Po","NA"]
#18
BsmtQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#19
BsmtCond1 = ["Ex","Gd","TA","Fa","Po","NA"]
#20
BsmtExposure1 = ["Gd","Av","Mn","No","NA"]
#21
BsmtFinType11 = ["GLQ", "ALQ", "BLQ", "Rec", "LwQ", "Unf","NA"]
#22
KitchenQual1 = ["Ex","Gd","TA","Fa","Po","NA"]
#23
FireplaceQu1 = ["Ex","Gd","TA","Fa","Po","NA"]

## ColumnTransformer with 2 branches: ordinal & onehot

In [None]:
# # # 3.2.2. defining the categorical encoder: a ColumnTransformer with 2 branches: ordinal & onehot
categorical_encoder = ColumnTransformer(
    transformers=[
        ("cat_ordinal", OrdinalEncoder(categories=ordinal_cats), ordinal_cols),
        ("cat_onehot", OneHotEncoder(handle_unknown="ignore"), onehot_cols),
    ]
)

# Categorical pipeline = "N_A" imputer + categorical encoder

In [None]:
# # 3.3. categorical pipeline = "N_A" imputer + categorical encoder
categorical_pipe = make_pipeline(SimpleImputer(strategy="constant", fill_value="NA"),
                                 categorical_encoder
                                )

# 4. full preprocessing: a ColumnTransformer with 2 branches: numeric & categorical
full_preprocessing = ColumnTransformer(
    transformers=[
        ("num_pipe", numeric_pipe, X_num.columns),
        ("cat_pipe", categorical_pipe, X_cat.columns),
    ]
)

full_preprocessing

# Full pipeline: preprocessor + model

# DECISION TREE

In [None]:
from sklearn.model_selection import GridSearchCV

# full pipeline: preprocessor + model
full_pipeline = make_pipeline(full_preprocessing, 
                              DecisionTreeClassifier())

# define parameter grid
param_grid = {
    "columntransformer__num_pipe__simpleimputer__strategy":["constant", "median"],
    "decisiontreeclassifier__max_depth": range(2, 14, 2),
    "decisiontreeclassifier__min_samples_leaf": range(3, 12, 2),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
    
}

# define GridSearchCV
search = GridSearchCV(full_pipeline,
                      param_grid,
                      cv=5,
                      verbose=1)

search

In [None]:
search.fit(X_train, y_train)
 
print(f"The best average score in cross validation was {search.best_score_}")

Fitting 5 folds for each of 120 candidates, totalling 600 fits
The best average score in cross validation was 0.9178093246762774


In [32]:
search.best_params_

{'columntransformer__num_pipe__simpleimputer__strategy': 'constant',
 'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 10,
 'decisiontreeclassifier__min_samples_leaf': 9}

In [None]:
from sklearn.metrics import accuracy_score
y_train_pred = search.predict(X_train)

accuracy_score(y_train, y_train_pred)

0.952054794520548

In [None]:
# testing accuracy
y_test_pred = search.predict(X_test)
accuracy_score(y_test, y_test_pred)

0.9417808219178082

# KNN MODEL

In [49]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
knn_full_pipeline = make_pipeline(full_preprocessing, 
                                  scaler,                           
                                  KNeighborsClassifier()
                                 )

param_grid = {
    "columntransformer__num_pipe__simpleimputer__strategy":["mean", "median"],
    "kneighborsclassifier__n_neighbors": range(2, 50),
    "kneighborsclassifier__weights": ["uniform", "distance"]
}

knn_search = GridSearchCV(knn_full_pipeline,
                      param_grid,
                      cv=5,
                      scoring='accuracy',
                      verbose=1)

knn_search.fit(X_train, y_train)

knn_search.best_score_

Fitting 5 folds for each of 192 candidates, totalling 960 fits


0.9238032353912182

In [50]:
knn_search.best_params_

{'columntransformer__num_pipe__simpleimputer__strategy': 'mean',
 'kneighborsclassifier__n_neighbors': 7,
 'kneighborsclassifier__weights': 'distance'}

In [51]:
from sklearn.metrics import accuracy_score
y_train_pred_knn = knn_search.predict(X_train)

accuracy_score(y_train, y_train_pred_knn)

1.0

In [52]:
# testing accuracy
y_test_pred_knn = knn_search.predict(X_test)
accuracy_score(y_test, y_test_pred_knn)

0.9383561643835616

# RANDOM FOREST

In [53]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=3, 
                             max_depth=2
                             )

In [62]:
#from sklearn.neighbors import KNeighborsClassifier

rfc_full_pipeline = make_pipeline(full_preprocessing,
                                  scaler,                            
                                  rfc
                                 )
param_grid = {
    "columntransformer__num_pipe__simpleimputer__strategy":["mean", "median"],
    "randomforestclassifier__max_depth": range(2, 14, 2),
    "randomforestclassifier__min_samples_leaf": range(3, 12, 2),
    "randomforestclassifier__criterion":["gini", "entropy"]
  #  "randomforestclassifier__weights": ["uniform", "distance"]
}

rfc_search = GridSearchCV(rfc_full_pipeline,
                      param_grid,
                      cv=5,
                      scoring='accuracy',
                      verbose=1)

rfc_search.fit(X_train, y_train)
rfc_search.best_score_

Fitting 5 folds for each of 120 candidates, totalling 600 fits


0.9297714683980777

In [63]:
# define cross validation
from sklearn.model_selection import RandomizedSearchCV
rfc_full_pipeline = make_pipeline(full_preprocessing,
                                  scaler,                            
                                  rfc
                                 )
param_grid = {
    "columntransformer__num_pipe__simpleimputer__strategy":["mean", "median"],
    "randomforestclassifier__max_depth": range(2, 14, 2),
    "randomforestclassifier__min_samples_leaf": range(3, 12, 2),
    "randomforestclassifier__criterion":["gini", "entropy"]
  #  "randomforestclassifier__weights": ["uniform", "distance"]
}

search = RandomizedSearchCV(rfc_full_pipeline,
                      param_grid,
                      cv=10,
                      verbose=1,
                      scoring="accuracy",
                      n_jobs=-2, 
                      n_iter=100)


rfc_search.fit(X_train, y_train)
rfc_search.best_score_

Fitting 5 folds for each of 120 candidates, totalling 600 fits


0.9315065478155606

In [64]:
rfc_search.best_params_

{'columntransformer__num_pipe__simpleimputer__strategy': 'mean',
 'randomforestclassifier__criterion': 'entropy',
 'randomforestclassifier__max_depth': 8,
 'randomforestclassifier__min_samples_leaf': 7}

In [60]:
from sklearn.metrics import accuracy_score
y_train_pred_rfc = rfc_search.predict(X_train)

accuracy_score(y_train, y_train_pred_rfc)

0.934931506849315

In [61]:
# testing accuracy
y_test_pred_rfc = rfc_search.predict(X_test)

accuracy_score(y_test, y_test_pred_rfc)

0.9452054794520548

# SUPPORT VECTOR MACHINE