# House Price - Pipeline, GridSearch, CrossValidation

**Column Descriptions:**

1. LotFrontage: Linear feet of street connected to property
2. LotArea: Lot size in square feet
3. TotalBsmtSF: Total square feet of basement area
4. BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
5. Fireplaces: Number of fireplaces
6. PoolArea: Pool area in square feet
7. GarageCars: Size of garage in car capacity 
8. WoodDeckSF: Wood deck area in square feet
9. ScreenPorch: Screen porch area in square feet
10. MSZoning: Identifies the general zoning classification of the sale.
    * A		Agriculture
    * C		Commercial
    * FV	Floating Village Residential
    * I		Industrial
    * RH	Residential High Density
    * RL	Residential Low Density
    * RP	Residential Low Density Park 
    * RM	Residential Medium Density
11. Condition1: Proximity to various conditions
    * Artery	Adjacent to arterial street
    * Feedr	Adjacent to feeder street	
    * Norm	Normal	
    * RRNn	Within 200' of North-South Railroad
    * RRAn	Adjacent to North-South Railroad
    * PosN	Near positive off-site feature--park, greenbelt, etc.
    * PosA	Adjacent to positive off-site feature
    * RRNe	Within 200' of East-West Railroad
    * RRAe	Adjacent to East-West Railroad
12. Heating: Type of heating
    * Floor	Floor Furnace
    * GasA	Gas forced warm air furnace
    * GasW	Gas hot water or steam heat
    * Grav	Gravity furnace	
    * OthW	Hot water or steam heat other than gas
    * Wall	Wall furnace
13. Street: Type of road access to property
    * Grvl	Gravel	
    * Pave	Paved
14. CentralAir: Central air conditioning
    * N	No
    * Y	Yes
15. Foundation: Type of foundation
    * BrkTil	Brick & Tile
    * CBlock	Cinder Block
    * PConc	Poured Concrete
    * Slab	Slab
    * Stone	Stone
    * Wood	Wood

## 1. Import data and libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

Other classification models that could be swapped in are listed below (commented out):

In [2]:
#other models:
#from sklearn.model_selection import train_test_split
#from sklearn.datasets import make_moons, make_circles, make_classification

#from sklearn.neural_network import MLPClassifier
#from sklearn.neighbors import KNeighborsClassifier
#from sklearn.svm import SVC
#from sklearn.gaussian_process import GaussianProcessClassifier
#from sklearn.gaussian_process.kernels import RBF
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
#from sklearn.naive_bayes import GaussianNB
#from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [3]:
data = pd.read_csv("housing-classification-iter3_Liane.csv")
data.columns

Index(['LotArea', 'LotFrontage', 'TotalBsmtSF', 'BedroomAbvGr', 'Fireplaces',
       'PoolArea', 'GarageCars', 'WoodDeckSF', 'ScreenPorch', 'Expensive',
       'MSZoning', 'Condition1', 'Heating', 'Street', 'CentralAir',
       'Foundation'],
      dtype='object')

## 2. Split X and y

In [4]:
y = data.pop("Expensive")

In [5]:
X = data  # after pop("Expensive"), data holds only the feature columns

In [6]:
X.isna().sum()

LotArea           0
LotFrontage     259
TotalBsmtSF       0
BedroomAbvGr      0
Fireplaces        0
PoolArea          0
GarageCars        0
WoodDeckSF        0
ScreenPorch       0
MSZoning          0
Condition1        0
Heating           0
Street            0
CentralAir        0
Foundation        0
dtype: int64

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LotArea       1460 non-null   int64  
 1   LotFrontage   1201 non-null   float64
 2   TotalBsmtSF   1460 non-null   int64  
 3   BedroomAbvGr  1460 non-null   int64  
 4   Fireplaces    1460 non-null   int64  
 5   PoolArea      1460 non-null   int64  
 6   GarageCars    1460 non-null   int64  
 7   WoodDeckSF    1460 non-null   int64  
 8   ScreenPorch   1460 non-null   int64  
 9   MSZoning      1460 non-null   object 
 10  Condition1    1460 non-null   object 
 11  Heating       1460 non-null   object 
 12  Street        1460 non-null   object 
 13  CentralAir    1460 non-null   object 
 14  Foundation    1460 non-null   object 
dtypes: float64(1), int64(8), object(6)
memory usage: 171.2+ KB


**Conclusion: `LotFrontage` has 259 missing values (about 17.7% of the 1460 rows); all other columns are complete.**
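As a small illustration of what the imputation step later in the notebook does with such gaps, the sketch below applies mean imputation to a made-up `LotFrontage` column. The toy values and the names `toy`, `missing_share` and `filled` are assumptions for illustration only, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the LotFrontage column (values are made up)
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 100.0]})

# Share of missing values per column (here 1 of 4 rows)
missing_share = toy.isna().mean()

# Mean imputation: NaN is replaced by the mean of the observed values
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(toy)

print(missing_share["LotFrontage"])
print(filled.ravel())
```

The imputer learns the mean during `fit`, so inside a pipeline it is computed from the training folds only and then reused for unseen data.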

## 3. Split Train and Test


In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31416)
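With `test_size=0.2` and 1460 rows, the split should leave 1168 rows for training and 292 for testing. The sketch below checks this on synthetic stand-in data of the same length. The `stratify` argument is not used in the notebook; it is shown here only as an optional assumption for keeping the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the housing data
X_demo = np.arange(1460).reshape(-1, 1)
y_demo = np.array([0, 1] * 730)  # made-up, perfectly balanced target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=31416,
    stratify=y_demo,  # optional: preserves the class ratio in train and test
)

print(X_tr.shape, X_te.shape)
```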

## 4. Creating a numeric pipe and a categorical pipe

In [9]:
#Maybe some stuff to add a scaler:
# initialize transformers & model
#imputer = SimpleImputer()
#scaler = StandardScaler()

In [38]:
# select categorical and numerical column names
X_cat_columns = X.select_dtypes(exclude="number").columns
X_num_columns = X.select_dtypes(include="number").columns

# create numerical pipeline: impute missing values with the mean, then scale
numeric_pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler())

# create categorical pipeline: fill missing values with the constant "N_A",
# then one-hot encode (the default strategy="mean" would fail on string columns)
categoric_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"),
    OneHotEncoder(handle_unknown="ignore")
)
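To make the behaviour of the categorical pipe more concrete, here is a minimal sketch on a toy `Street`-like column. The data, the category `"Dirt"` and the name `cat_pipe_demo` are made up for illustration. It shows how a missing value becomes its own `"N_A"` category and how `handle_unknown="ignore"` encodes a category that was not seen during fitting as a row of zeros.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing value
train_cat = np.array([["Pave"], ["Grvl"], [np.nan], ["Pave"]], dtype=object)
# "Dirt" is a made-up category that never appears in the training data
test_cat = np.array([["Pave"], ["Dirt"]], dtype=object)

cat_pipe_demo = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N_A"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Learned categories are Grvl, N_A, Pave: one output column each
encoded_train = cat_pipe_demo.fit_transform(train_cat).toarray()
# The unseen category is encoded as all zeros instead of raising an error
encoded_test = cat_pipe_demo.transform(test_cat).toarray()

print(encoded_train)
print(encoded_test)
```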

## 5. Preprocess all data by running both pipes and merge the results 
**with `ColumnTransformer`**

In [39]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num_pipe", numeric_pipe, X_num_columns),
        ("cat_pipe", categoric_pipe, X_cat_columns),
    ]
)
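The following sketch runs a `ColumnTransformer` of the same shape on a three-column toy frame, to show how the numeric and categorical outputs are merged side by side. The frame and the names `toy`, `pre` and `out` are assumptions for illustration; `sparse_threshold=0` is set here only to force a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: two numeric columns (one with a gap) and one categorical column
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11000, 7000],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype=object),
})
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

pre = ColumnTransformer(
    transformers=[
        ("num_pipe",
         make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()),
         num_cols),
        ("cat_pipe",
         make_pipeline(SimpleImputer(strategy="constant", fill_value="N_A"),
                       OneHotEncoder(handle_unknown="ignore")),
         cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array (demo only)
)

# 2 scaled numeric columns followed by 2 one-hot columns (Grvl, Pave)
out = pre.fit_transform(toy)
print(out.shape)
```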

## 6. Create the `full_pipe` with preprocessor and Model

In [40]:
# initialize the model
dtree = DecisionTreeClassifier()
#rforest = RandomForestClassifier()

#create the full_pipeline
full_pipe = make_pipeline(preprocessor, 
                          dtree)

In [41]:
# display full_pipe as a diagram to check its structure
from sklearn import set_config

set_config(display="diagram")
full_pipe # click on the diagram below to see the details of each step
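Besides the diagram, the pipeline can be inspected programmatically. `make_pipeline` names each step after its lowercased class name, and nested parameters are addressed by joining names with a double underscore. The sketch below rebuilds a pipeline of the same shape (the column lists are placeholders, assumed for illustration) and lists some of the resulting parameter names.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same structure as the notebook's full_pipe, with placeholder column lists
demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num_pipe",
         make_pipeline(SimpleImputer(), StandardScaler()),
         ["LotArea", "LotFrontage"]),
        ("cat_pipe",
         make_pipeline(SimpleImputer(strategy="constant", fill_value="N_A"),
                       OneHotEncoder()),
         ["Street"]),
    ]
)
demo_pipe = make_pipeline(demo_preprocessor, DecisionTreeClassifier())

# Step names generated by make_pipeline
step_names = list(demo_pipe.named_steps)
print(step_names)

# Every key of get_params() is a valid name for set_params() or a search grid
all_params = demo_pipe.get_params()
for name in sorted(all_params):
    if name.endswith("__strategy") or name.endswith("__max_depth"):
        print(name, "=", all_params[name])
```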

## 7. Define the search grid

In [42]:
# create parameter grid
param_grid = {
    "columntransformer__num_pipe__standardscaler__with_mean":[True, False],
    "columntransformer__num_pipe__standardscaler__with_std":[True, False],
    "columntransformer__num_pipe__simpleimputer__strategy":["mean", "median"],
    "columntransformer__cat_pipe__simpleimputer__strategy": ["constant"],
    "columntransformer__cat_pipe__simpleimputer__fill_value": ["N_A"],
    "columntransformer__cat_pipe__onehotencoder__handle_unknown": ["ignore"],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(2, 12),
    "decisiontreeclassifier__min_samples_split": range(3, 40, 2),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
#    "randomforestclassifier__n_estimators": [100, 200],
#    "randomforestclassifier__max_depth": range(2, 14),
#    "randomforestclassifier__min_samples_leaf": range(2, 10),
#    "randomforestclassifier__criterion":["gini", "entropy"]
}
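It can help to count how many combinations this grid contains before choosing a search strategy. The sketch below copies the grid under the assumed name `param_grid_demo` and multiplies the number of options per parameter; the 10 folds match the `cv=10` setting used in this notebook.

```python
from math import prod

# Copy of the notebook's grid (only the number of values per key matters here)
param_grid_demo = {
    "columntransformer__num_pipe__standardscaler__with_mean": [True, False],
    "columntransformer__num_pipe__standardscaler__with_std": [True, False],
    "columntransformer__num_pipe__simpleimputer__strategy": ["mean", "median"],
    "columntransformer__cat_pipe__simpleimputer__strategy": ["constant"],
    "columntransformer__cat_pipe__simpleimputer__fill_value": ["N_A"],
    "columntransformer__cat_pipe__onehotencoder__handle_unknown": ["ignore"],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(2, 12),
    "decisiontreeclassifier__min_samples_split": range(3, 40, 2),
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
}

# Number of distinct parameter combinations in the full grid
n_combinations = prod(len(values) for values in param_grid_demo.values())

# Fits an exhaustive search would need with 10-fold cross-validation
n_fits_full_grid = n_combinations * 10

print(n_combinations, n_fits_full_grid)  # 36480 364800
```

An exhaustive `GridSearchCV` over this grid would need 364800 fits, which is why sampling a fixed number of candidates with `RandomizedSearchCV` is a reasonable choice here.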

## 8. Define the Cross-Validation

In [43]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [44]:
# define a randomized search with 10-fold cross validation;
# n_iter=100 samples 100 parameter combinations from param_grid
search = RandomizedSearchCV(full_pipe,
                            param_grid,
                            cv=10,
                            verbose=1,
                            scoring="accuracy",
                            n_jobs=-2,
                            n_iter=100)
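Because `RandomizedSearchCV` draws its candidates at random, two runs can try different combinations and report different best parameters. The sketch below, on synthetic data from `make_classification`, shows that passing `random_state` makes the sampled candidates repeatable. The helper `run_search`, the reduced grid and the small `n_iter` and `cv` values are assumptions chosen to keep the example fast; the notebook's own search does not set `random_state`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (not the housing data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Reduced grid for the demo
demo_grid = {
    "max_depth": range(2, 14),
    "min_samples_leaf": range(2, 12),
}

def run_search(seed):
    """Fit a small randomized search whose candidate sampling is seeded."""
    demo_search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),
        demo_grid,
        n_iter=5,           # number of sampled parameter combinations
        cv=3,
        scoring="accuracy",
        random_state=seed,  # fixes which combinations are sampled
    )
    return demo_search.fit(X_demo, y_demo)

first = run_search(seed=123)
second = run_search(seed=123)

# Same seed, same candidates, same winner
print(first.cv_results_["params"] == second.cv_results_["params"])
print(first.best_params_, second.best_params_)
```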

## 9. Build the model

In [35]:
# fit
search.fit(X_train, y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


In [28]:
# cross validation average accuracy
search.best_score_

0.917816091954023

In [29]:
# best parameters
search.best_params_

{'decisiontreeclassifier__min_samples_split': 13,
 'decisiontreeclassifier__min_samples_leaf': 2,
 'decisiontreeclassifier__max_depth': 4,
 'decisiontreeclassifier__criterion': 'entropy',
 'columntransformer__num_pipe__standardscaler__with_std': False,
 'columntransformer__num_pipe__standardscaler__with_mean': True,
 'columntransformer__num_pipe__simpleimputer__strategy': 'mean',
 'columntransformer__cat_pipe__simpleimputer__strategy': 'constant',
 'columntransformer__cat_pipe__simpleimputer__fill_value': 'N_A'}

In [20]:
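# the search fits clones of the pipeline, so dtree itself is never fitted;
# the fitted tree lives inside the refit best estimator:
#search.best_estimator_.named_steps["decisiontreeclassifier"].feature_importances_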
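Feature importances have to be read from the fitted copy of the tree, not from the object that was passed into the pipeline, because the search clones the estimator before fitting. A minimal sketch on synthetic data follows; the names `dtree_demo`, `demo_pipe` and `demo_search` are assumptions, and a plain `GridSearchCV` with a tiny grid stands in for the notebook's search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the housing features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

dtree_demo = DecisionTreeClassifier(random_state=0)
demo_pipe = make_pipeline(StandardScaler(), dtree_demo)

demo_search = GridSearchCV(
    demo_pipe,
    {"decisiontreeclassifier__max_depth": [2, 3, 4]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# The original tree object is still unfitted: the search worked on clones
original_is_fitted = hasattr(dtree_demo, "tree_")

# The fitted tree is a step of the refit best pipeline
best_tree = demo_search.best_estimator_.named_steps["decisiontreeclassifier"]
importances = best_tree.feature_importances_

print(original_is_fitted)
print(importances)
```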

## 10. Accuracy

In [21]:
from sklearn.metrics import accuracy_score

In [30]:
# training accuracy on the entire training set (best pipeline, refit on all of X_train)
y_train_pred = search.predict(X_train)

accuracy_score(y_train, y_train_pred)

0.9289383561643836

In [31]:
# testing accuracy
y_test_pred = search.predict(X_test)

accuracy_score(y_test, y_test_pred)

0.9178082191780822
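The two accuracy values can be translated back into counts of correctly classified houses. The sketch below is plain arithmetic based on the figures printed in this notebook (1460 rows, `test_size=0.2`, and the two accuracies); the variable names are assumptions for illustration.

```python
import math

n_rows = 1460
n_test = math.ceil(0.2 * n_rows)  # train_test_split rounds the test size up
n_train = n_rows - n_test

train_accuracy = 0.9289383561643836
test_accuracy = 0.9178082191780822

# Accuracy = correct predictions / number of rows, so invert it
train_correct = round(train_accuracy * n_train)
test_correct = round(test_accuracy * n_test)

print(n_train, train_correct, n_train - train_correct)  # 1168 1085 83
print(n_test, test_correct, n_test - test_correct)      # 292 268 24
```

Under these figures the best pipeline misclassifies 83 of 1168 training rows and 24 of 292 test rows. The gap of roughly one percentage point between training and test accuracy suggests only mild overfitting.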