## KNRegressor Model

In this project we will use the house prices dataset (homes in Ames, Iowa), fetched from OpenML through sklearn, and try to predict the sale prices with the KNeighborsRegressor model.

The steps will be:
1. [Checking the description and the info of the data](#info)
2. [Data Cleaning](#cleaning)
3. [Creating the Pipeline](#pipeline)
4. [Cross-Validation Step](#cv)
5. [Picking the Best Model](#best)

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
## to get sample data for our model
from sklearn.datasets import fetch_openml
## preprocessing and best-model selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, QuantileTransformer
## pipeline and the main model
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
## the metrics to compare (regression metrics, since the target is a price)
from sklearn.metrics import mean_absolute_error, r2_score, make_scorer

<a id='info'> The Basic Checks </a>

In [2]:
## we will be using the house_prices dataset
## and the first step is to check the description
print(fetch_openml(name="house_prices", as_frame=True)['DESCR'])

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1

In [3]:
## fetching the dataset once and reusing it for the features and the target
housing = fetch_openml(name="house_prices", as_frame=True)
housing_prop = housing['data'].set_index('Id')
housing_prices = housing['target']
housing_prop.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 1460 entries, 1.0 to 1460.0
Data columns (total 79 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   float64
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   float64
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   float64
 17  OverallCond    1460 non-null   float64
 18  Ye
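
The `info()` output already hints at which columns are sparse (for example, `Alley` has 91 non-null values out of 1460). A minimal sketch of ranking columns by their share of missing values is shown here on a made-up toy frame; `toy`, `missing_share` and `mostly_missing` are hypothetical names, and the 0.5 cut-off simply mirrors the threshold used in the cleaning step of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for housing_prop; the values are made up.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns at or above a 50% missing share are candidates for dropping.
mostly_missing = missing_share[missing_share >= 0.5].index.tolist()
print(mostly_missing)
```

The same two lines should work unchanged on `housing_prop` in place of `toy`.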

<a id='cleaning'>Data Cleaning</a>

In [4]:
## data cleaning
def data_cleaner(df):
    datatype_dict = {}
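    ## plain str.replace does not understand regex patterns,
    ## so the pandas string accessor is used for the column names
    df.columns = df.columns.str.replace(r'\s+', '_', regex=True).str.lower()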
    for col in df.columns:
        ## dropping the columns that are mostly null values
        if df[col].isnull().sum()/df.shape[0] >= 0.5:
            df.drop(col, axis=1, inplace=True)
        elif df[col].dtype == 'object' and df[col].nunique() < 10:
            df[col] = df[col].str.replace(r'\s+','_', regex = True).str.lower()
            df = pd.get_dummies(data=df, columns = [col])
        elif df[col].dtype == 'object':
            df.drop(col, axis=1, inplace=True)
        ## note: in the checks below `.min` is referenced without being called,
        ## so each comparison is against the method itself and never matches;
        ## no column is actually downcast (info() below still shows float64)
        elif df[col].dtype in ['float64', 'float32'] and df[col].min() == df.astype({col:'float16'})[col].min:
            datatype_dict[col] = 'float16'
        elif df[col].dtype == 'float64' and df[col].min() == df.astype({col:'float32'})[col].min:
            datatype_dict[col] = 'float32'
        elif df[col].dtype in ['int64', 'int32', 'int16'] and df[col].min() == df.astype({col:'int8'})[col].min:
            datatype_dict[col] = 'int8'
        elif df[col].dtype in ['int64', 'int32'] and df[col].min() == df.astype({col:'int16'})[col].min:
            datatype_dict[col] = 'int16'
        elif df[col].dtype in ['int64'] and df[col].min() == df.astype({col:'int32'})[col].min:
            datatype_dict[col] = 'int32'
    df.fillna(0, inplace= True)
    return df.astype(datatype_dict)

cleaned_X = data_cleaner(housing_prop.copy())
print(cleaned_X.info())
cleaned_X.head()


<class 'pandas.core.frame.DataFrame'>
Float64Index: 1460 entries, 1.0 to 1460.0
Columns: 219 entries, mssubclass to salecondition_partial
dtypes: float64(36), uint8(183)
memory usage: 682.9 KB
None


Id,mssubclass,lotfrontage,lotarea,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,bsmtfinsf1,bsmtfinsf2,...,saletype_cwd,saletype_new,saletype_oth,saletype_wd,salecondition_abnorml,salecondition_adjland,salecondition_alloca,salecondition_family,salecondition_normal,salecondition_partial
1.0,60.0,65.0,8450.0,7.0,5.0,2003.0,2003.0,196.0,706.0,0.0,...,0,0,0,1,0,0,0,0,1,0
2.0,20.0,80.0,9600.0,6.0,8.0,1976.0,1976.0,0.0,978.0,0.0,...,0,0,0,1,0,0,0,0,1,0
3.0,60.0,68.0,11250.0,7.0,5.0,2001.0,2002.0,162.0,486.0,0.0,...,0,0,0,1,0,0,0,0,1,0
4.0,70.0,60.0,9550.0,7.0,5.0,1915.0,1970.0,0.0,216.0,0.0,...,0,0,0,1,1,0,0,0,0,0
5.0,60.0,84.0,14260.0,8.0,5.0,2000.0,2000.0,350.0,655.0,0.0,...,0,0,0,1,0,0,0,0,1,0
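
The `info()` output above still reports 36 `float64` columns, so the manual downcasting branches in `data_cleaner` did not change any dtype. If a smaller memory footprint is wanted, one possible alternative is `pd.to_numeric` with its `downcast` argument, which goes no lower than `float32` for floats. The sketch below uses a made-up toy frame (`toy` is a hypothetical name) and checks that the values are unchanged; it is not what the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "lotarea": [8450.0, 9600.0, 215245.0],
    "overallqual": [7.0, 6.0, 8.0],
})

# Ask pandas for the smallest float dtype that can hold each column
# (float32 is the smallest dtype pandas will choose here).
downcast = toy.apply(pd.to_numeric, downcast="float")
print(downcast.dtypes)

# Confirm that the conversion did not alter any value.
values_preserved = np.allclose(toy.to_numpy(), downcast.to_numpy())
print(values_preserved)
```

Distance-based models such as k-nearest neighbours do not need the smaller dtype, so this is purely a memory optimisation.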


<a id='pipeline'>Creating the Pipeline</a>

In [5]:
## splitting the sample
X_train, X_test, y_train, y_test = train_test_split(cleaned_X, housing_prices, test_size=.2)
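
The split above is not seeded, so the rows that end up in the test set change on every run. A small sketch of a reproducible split follows; the synthetic arrays and `random_state=0` are arbitrary assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; shapes and values are arbitrary.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# random_state=0 is an arbitrary choice; any fixed integer works.
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)

# Both calls select exactly the same rows.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```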

In [37]:
## starting with the model pipeline
## using the QuantileTransformer for scaling
random_seed = 40
pipe = Pipeline(
    [
        ('scale', QuantileTransformer(random_state=random_seed)),
        ('model', KNeighborsRegressor())
    ]
)
## checking all the available params
pipe.get_params()

{'memory': None,
 'steps': [('scale', QuantileTransformer(random_state=40)),
  ('model', KNeighborsRegressor())],
 'verbose': False,
 'scale': QuantileTransformer(random_state=40),
 'model': KNeighborsRegressor(),
 'scale__copy': True,
 'scale__ignore_implicit_zeros': False,
 'scale__n_quantiles': 1000,
 'scale__output_distribution': 'uniform',
 'scale__random_state': 40,
 'scale__subsample': 100000,
 'model__algorithm': 'auto',
 'model__leaf_size': 30,
 'model__metric': 'minkowski',
 'model__metric_params': None,
 'model__n_jobs': None,
 'model__n_neighbors': 5,
 'model__p': 2,
 'model__weights': 'uniform'}
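
The keys in this dictionary follow the pattern `<step name>__<parameter>`, and the same names are what a parameter grid refers to. A short sketch of setting two of them directly is given below; `pipe_demo` is a hypothetical copy of the pipeline so that the notebook's own `pipe` is left untouched.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

# Hypothetical copy of the notebook's pipeline.
pipe_demo = Pipeline([
    ("scale", QuantileTransformer(random_state=40)),
    ("model", KNeighborsRegressor()),
])

# "<step name>__<parameter>" addresses a parameter of a single step.
pipe_demo.set_params(model__n_neighbors=7, scale__n_quantiles=500)

print(pipe_demo.named_steps["model"].n_neighbors)
print(pipe_demo.named_steps["scale"].n_quantiles)
```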

<a id='cv'> Cross-Validation Step </a>

In [48]:
## then the grid params
## first try
grid_params = {'scale__n_quantiles':[500, 700, 900],
               'model__n_neighbors':[3, 5, 7],
               'model__weights':['uniform','distance'],
               'model__leaf_size':[20, 30, 40],
               'model__algorithm' : ['auto', 'kd_tree', 'brute']
              }
## and the actual cross-validation
## with 3 folds
gc = GridSearchCV(pipe, param_grid=grid_params,
                cv=3, return_train_score=True)
gc.fit(X_train, y_train)
best_model = gc.best_estimator_
## checking the params picked
## and shifting the next grid towards the best values
best_model

Pipeline(steps=[('scale',
                 QuantileTransformer(n_quantiles=500, random_state=40)),
                ('model',
                 KNeighborsRegressor(leaf_size=20, n_neighbors=7,
                                     weights='distance'))])
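
A note on the `n_quantiles` values in this grid: with 1460 rows and a 20% test split, the training set has 1168 rows, and each of the 3 cross-validation training folds has roughly 780. As far as we know, `QuantileTransformer` falls back to the number of training samples (with a warning, which this notebook suppresses) when `n_quantiles` is larger than that, so a value such as 900 may not behave differently from the fold size. The sketch below illustrates the clipping on synthetic data; `X_small` is a made-up array.

```python
import warnings

import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic data with fewer rows than the requested number of quantiles.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(200, 3))

with warnings.catch_warnings():
    # The fit emits a warning about n_quantiles exceeding the sample count.
    warnings.simplefilter("ignore")
    qt = QuantileTransformer(n_quantiles=1000, random_state=40).fit(X_small)

# Requested value versus the value actually used after fitting.
print(qt.n_quantiles, qt.n_quantiles_)
```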

In [49]:
pd.DataFrame(gc.cv_results_).sort_values('rank_test_score').head()

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__algorithm,param_model__leaf_size,param_model__n_neighbors,param_model__weights,param_scale__n_quantiles,params,...,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
33,0.236251,0.050055,0.033568,0.004099,auto,30,7,distance,500,"{'model__algorithm': 'auto', 'model__leaf_size...",...,0.710443,0.743146,0.720806,0.01581,1,1.0,1.0,1.0,1.0,3.268408e-16
159,0.228702,0.030108,0.035179,0.00645,brute,40,7,distance,500,"{'model__algorithm': 'brute', 'model__leaf_siz...",...,0.710443,0.743146,0.720806,0.01581,1,1.0,1.0,1.0,1.0,3.268408e-16
87,0.369451,0.005706,0.212166,0.010609,kd_tree,30,7,distance,500,"{'model__algorithm': 'kd_tree', 'model__leaf_s...",...,0.710443,0.743146,0.720806,0.01581,1,1.0,1.0,1.0,1.0,0.0
15,0.220149,0.010337,0.030034,0.002578,auto,20,7,distance,500,"{'model__algorithm': 'auto', 'model__leaf_size...",...,0.710443,0.743146,0.720806,0.01581,1,1.0,1.0,1.0,1.0,3.268408e-16
69,0.315941,0.018118,0.256257,0.018539,kd_tree,20,7,distance,500,"{'model__algorithm': 'kd_tree', 'model__leaf_s...",...,0.710443,0.743146,0.720806,0.01581,1,1.0,1.0,1.0,1.0,0.0
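
The train scores of 1.0 in this table are expected with `weights='distance'` and do not say much about fit quality: every training row is its own nearest neighbour at distance zero, so its prediction is its own target. A minimal sketch on pure-noise synthetic data (all names below are made up) shows the effect next to uniform weights.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data where the target is pure noise, unrelated to the features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.normal(size=100)

# Distance weighting: each training point is predicted from itself.
distance_model = KNeighborsRegressor(n_neighbors=7, weights="distance")
train_r2_distance = distance_model.fit(X_demo, y_demo).score(X_demo, y_demo)

# Uniform weighting: the point is averaged with its 6 other neighbours.
uniform_model = KNeighborsRegressor(n_neighbors=7, weights="uniform")
train_r2_uniform = uniform_model.fit(X_demo, y_demo).score(X_demo, y_demo)

print(train_r2_distance, train_r2_uniform)
```

For this reason the test-fold scores are the ones worth comparing across rows.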


In [50]:
## 2nd try
grid_params_2 = {'scale__n_quantiles':[100, 300, 500],
               'model__n_neighbors':[7, 11, 13],
               'model__weights':['distance'],
               'model__leaf_size':[5, 10, 20],
               'model__algorithm' : ['auto']
              }
gc2 = GridSearchCV(pipe, param_grid=grid_params_2,
                 cv=3, return_train_score=True)
gc2.fit(X_train, y_train)
best_model = gc2.best_estimator_
## checking the params picked
## and shifting the next grid towards the best values
best_model

Pipeline(steps=[('scale',
                 QuantileTransformer(n_quantiles=100, random_state=40)),
                ('model',
                 KNeighborsRegressor(leaf_size=5, n_neighbors=7,
                                     weights='distance'))])
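
In the first search's results, rows that differ only in `model__algorithm` or `model__leaf_size` have identical test scores. That fits how these two parameters work: they control how the neighbours are looked up, not which neighbours are found, so they mainly affect speed. The sketch below checks this on synthetic data (all names are made up); barring exact distance ties, the predictions should agree up to floating-point error.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data; values are arbitrary.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(200, 5))
y_fit = rng.normal(size=200)
X_new = rng.normal(size=(50, 5))

# Same model, different neighbour-search settings.
settings = [("brute", 30), ("kd_tree", 5), ("kd_tree", 40), ("ball_tree", 20)]
predictions = {}
for algorithm, leaf_size in settings:
    model = KNeighborsRegressor(n_neighbors=7, weights="distance",
                                algorithm=algorithm, leaf_size=leaf_size)
    predictions[(algorithm, leaf_size)] = model.fit(X_fit, y_fit).predict(X_new)

# Compare every setting against the brute-force result.
reference = predictions[("brute", 30)]
all_match = all(np.allclose(reference, p) for p in predictions.values())
print(all_match)
```

Under that assumption, a grid over `n_neighbors`, `weights` and the scaler settings alone would cover the same models with far fewer fits.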

In [None]:
## 3rd try
grid_params_3 = {'scale__n_quantiles':[50, 70, 100],
               'model__n_neighbors':[3, 5, 7],
               'model__weights':['distance'],
               'model__leaf_size':[1, 3, 5],
               'model__algorithm' : ['auto']
              }
gc3 = GridSearchCV(pipe, param_grid=grid_params_3,
                 cv=3, return_train_score=True)
gc3.fit(X_train, y_train)
best_model = gc3.best_estimator_
## checking the params picked again
## and stopping once the middle values of the grid are picked
best_model

In [None]:
best_model.score(X_test, y_test)

In [27]:
## creating a new pipeline
## using the Standard Scaler for scaling
random_seed = 40
pipe = Pipeline(
    [
        ('scale', StandardScaler()),
        ('model', KNeighborsRegressor())
    ]
)
## checking all the available params
pipe.get_params()

{'memory': None,
 'steps': [('scale', StandardScaler()), ('model', KNeighborsRegressor())],
 'verbose': False,
 'scale': StandardScaler(),
 'model': KNeighborsRegressor(),
 'scale__copy': True,
 'scale__with_mean': True,
 'scale__with_std': True,
 'model__algorithm': 'auto',
 'model__leaf_size': 30,
 'model__metric': 'minkowski',
 'model__metric_params': None,
 'model__n_jobs': None,
 'model__n_neighbors': 5,
 'model__p': 2,
 'model__weights': 'uniform'}

In [29]:
## then the grid params
## first try
grid_params = {'model__n_neighbors':[3, 5, 7],
               'model__weights':['uniform','distance'],
               'model__leaf_size':[20, 30, 40],
               'model__algorithm' : ['auto', 'kd_tree', 'brute']
              }
## and the actual cross-validation
## with 3 folds
gc = GridSearchCV(pipe, param_grid=grid_params,
                 cv=3, return_train_score=True)
gc.fit(X_train, y_train)
best_model = gc.best_estimator_
## checking the params picked
## and shifting the next grid towards the best values
best_model

Pipeline(steps=[('scale', StandardScaler()),
                ('model',
                 KNeighborsRegressor(leaf_size=20, n_neighbors=7,
                                     weights='distance'))])

In [30]:
## 2nd try
grid_params_2 = {'model__n_neighbors':[7, 9, 11],
               'model__weights':['distance'],
               'model__leaf_size':[5, 10, 20],
               'model__algorithm' : ['auto']
              }
gc2 = GridSearchCV(pipe, param_grid=grid_params_2,
                 cv=3, return_train_score=True)
gc2.fit(X_train, y_train)
best_model = gc2.best_estimator_
## checking the params picked
## and shifting the next grid towards the best values
best_model

Pipeline(steps=[('scale', StandardScaler()),
                ('model',
                 KNeighborsRegressor(leaf_size=5, n_neighbors=9,
                                     weights='distance'))])

In [31]:
## 3rd try
grid_params_3 = {'model__n_neighbors':[9],
               'model__weights':['distance'],
               'model__leaf_size':[1, 3, 5],
               'model__algorithm' : ['auto']
              }
gc3 = GridSearchCV(pipe, param_grid=grid_params_3,
                 cv=3, return_train_score=True)
gc3.fit(X_train, y_train)
best_model = gc3.best_estimator_
## checking the params picked again
## and stopping once the middle values of the grid are picked
best_model

Pipeline(steps=[('scale', StandardScaler()),
                ('model',
                 KNeighborsRegressor(leaf_size=1, n_neighbors=9,
                                     weights='distance'))])

In [32]:
best_model.score(X_test, y_test)

0.7173219937841399
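
The number returned by `score` for a regressor is the R² (coefficient of determination). An error in the target's own units, such as the mean absolute error, can be easier to interpret for prices. The sketch below computes both on synthetic data; the data, the coefficients and the variable names are all made up, and it is not a re-run of the notebook's model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem: linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
model = KNeighborsRegressor(n_neighbors=9, weights="distance").fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# r2_score matches model.score; MAE is in the units of the target.
r2 = r2_score(y_te, y_pred)
mae = mean_absolute_error(y_te, y_pred)
print(r2, mae)
```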

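So far, the pipeline using the QuantileTransformer looks slightly ahead, but only provisionally: its best mean cross-validated R² in the first search was about 0.721, while the StandardScaler pipeline scored about 0.717 R² on the held-out test set. The two numbers are not the same kind of score, and the test-set cell for the QuantileTransformer pipeline was not executed.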
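
Since the test-score cell for the QuantileTransformer pipeline above has no output, one way to compare the two scalers on equal terms is to score both pipelines on the same cross-validation folds. The following is only a sketch under stated assumptions: it runs on synthetic data from `make_regression`, and `n_neighbors=7`, `n_quantiles=100` and the fold seed are illustrative choices rather than the notebook's final settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Synthetic stand-in for the cleaned housing data.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0)

# One shared set of folds, so both pipelines see identical splits.
folds = KFold(n_splits=3, shuffle=True, random_state=0)

scalers = {
    "quantile": QuantileTransformer(n_quantiles=100, random_state=40),
    "standard": StandardScaler(),
}
cv_scores = {}
for name, scaler in scalers.items():
    pipe_demo = Pipeline([
        ("scale", scaler),
        ("model", KNeighborsRegressor(n_neighbors=7, weights="distance")),
    ])
    # Default scoring for a regressor is R², one value per fold.
    cv_scores[name] = cross_val_score(pipe_demo, X_demo, y_demo, cv=folds)
    print(name, cv_scores[name].mean())
```

On the real data, `X_train` and `y_train` would take the place of the synthetic arrays, keeping the test set for a single final check.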

<a id='best'> Testing the Best Model Picked </a>