In this notebook, we will create features using k-means clustering.

In [1]:
#Set up notebooks
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

#matplotlib style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rc('figure', autolayout=True)
plt.rc('axes',
      labelweight='bold',
      labelsize='large',
      titleweight='bold',
      titlesize=14,
      titlepad=10)

#score functions
def score_dataset(X, y, model=XGBRegressor()):
    #work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    #label encoding for categorical columns
    for colname in X.select_dtypes(['category', 'object']):
        X[colname], _ = X[colname].factorize()
    #Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    #scoring runs once, after all columns are encoded (not inside the loop)
    score = cross_val_score(
        model, X, y, cv=5, scoring='neg_mean_squared_log_error')
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score
    
#load dataset Ames
df = pd.read_csv('./data/house-prices-advanced-regression-techniques/ames.csv')

The k-means algorithm is sensitive to scale. This means we need to be thoughtful about how and whether to rescale our features, since we might get very different results depending on our choices. As a rule of thumb, if the features are already directly comparable (like a test result at different times), then we would not want to rescale. On the other hand, features that are not on comparable scales (like height and weight) will usually benefit from rescaling. Sometimes the choice won't be clear, though. In that case, we should use common sense, remembering that features with larger values will be weighted more heavily.
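The effect of scale can be seen in a small experiment. The sketch below uses synthetic data, not the Ames dataset: two informative features on a small scale and one uninformative feature on a large scale, all of them assumptions made for illustration. `StandardScaler` is one possible way to rescale, not necessarily the one this notebook ends up using.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
group = np.repeat([0, 1], n // 2)            # true grouping (synthetic)
small_1 = group + rng.normal(0, 0.05, n)     # informative, small scale
small_2 = group + rng.normal(0, 0.05, n)     # informative, small scale
large = rng.normal(0, 1000, n)               # uninformative, large scale
X = np.column_stack([small_1, small_2, large])

def agreement(labels, truth):
    # Fraction of points matching the true grouping, ignoring label order
    acc = (labels == truth).mean()
    return max(acc, 1 - acc)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
raw_agreement = agreement(km.fit_predict(X), group)
scaled_agreement = agreement(
    km.fit_predict(StandardScaler().fit_transform(X)), group)

print(f"agreement without rescaling: {raw_agreement:.2f}")
print(f"agreement with rescaling:    {scaled_agreement:.2f}")
```

Without rescaling, the large-valued feature dominates the distance computation, so the clusters tend to follow it rather than the informative features.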

In [6]:
df.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', 'FirstFlrSF', 'SecondFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish',
       'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive',
       'WoodDeckSF',

In [8]:
df[['LotArea', 'GrLivArea']]

      LotArea  GrLivArea
0     31770.0     1656.0
1     11622.0      896.0
2     14267.0     1329.0
3     11160.0     2110.0
4     13830.0     1629.0
...       ...        ...
2925   7937.0     1003.0
2926   8885.0      902.0
2927  10441.0      970.0
2928  10010.0     1389.0
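To get a feel for how different the two scales are, the sketch below takes only the first five rows shown in the output above and compares their spread before and after standardizing. The name `sample` and the choice of z-score standardization are assumptions for illustration. Five rows are not representative of the full dataset.

```python
import pandas as pd

# First five rows copied from the output above
sample = pd.DataFrame({
    "LotArea": [31770.0, 11622.0, 14267.0, 11160.0, 13830.0],
    "GrLivArea": [1656.0, 896.0, 1329.0, 2110.0, 1629.0],
})

# Spread of each feature in its original units (square feet)
print(sample.std())

# Z-score standardization: each column gets mean 0 and standard deviation 1
standardized = (sample - sample.mean()) / sample.std()
print(standardized.round(2))
```

In these rows LotArea varies far more than GrLivArea in absolute terms, so an unscaled k-means would weight it more heavily.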


#### 1. Scaling Features

Consider the following sets of features. For each, decide whether:
* they definitely should be rescaled,
* they definitely should not be rescaled, or
* either might be reasonable.

Whether to rescale, by example:
1. Latitude and longitude of cities in California
 * Do not rescale: rescaling would distort the natural distances described by latitude and longitude.
2. Lot Area and Living Area of houses in Ames
 * Lot 
3. Number of Doors and Horsepower of a 1989 model car
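For the doors and horsepower pair, a quick calculation shows how unevenly the two features would contribute to a Euclidean distance, which is what k-means uses. The two cars below are hypothetical values chosen for illustration, not data from any source.

```python
import numpy as np

# Hypothetical cars for illustration only: (doors, horsepower)
car_a = np.array([2.0, 150.0])
car_b = np.array([4.0, 160.0])

sq_diff = (car_a - car_b) ** 2       # per-feature squared difference
share = sq_diff / sq_diff.sum()      # share of the squared Euclidean distance

for name, s in zip(["doors", "horsepower"], share):
    print(f"{name}: {s:.1%} of squared distance")
```

Even though two versus four doors is a large difference for that feature and ten horsepower is a small one, horsepower accounts for almost all of the unscaled distance.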

