# Ames, Iowa housing prices

# Frame the problem and look at the big picture

1. Frame the problem and look at the big picture
2. Get the data
3. Explore the data to gain insights
4. Prepare the data to better expose the underlying data patterns to machine learning algorithms
5. Explore many different models and short-list the best ones
6. Fine-tune your models and combine them into a great solution
7. Present your solution
8. Launch, monitor and maintain your system

# Get the data

In [27]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV


In [28]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/home-data-for-ml-course/sample_submission.csv
/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz
/kaggle/input/home-data-for-ml-course/train.csv.gz
/kaggle/input/home-data-for-ml-course/data_description.txt
/kaggle/input/home-data-for-ml-course/test.csv.gz
/kaggle/input/home-data-for-ml-course/train.csv
/kaggle/input/home-data-for-ml-course/test.csv


In [29]:
train_path = '../input/home-data-for-ml-course/train.csv'
test_path = '../input/home-data-for-ml-course/test.csv'

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

# Explore the data to gain insights

In [None]:
train.head()

In [None]:
train.info()

In [None]:
test.head()

In [None]:
test.info()

In [None]:
test.describe(include='all')

In [None]:
train.describe(include='all')

Count is the number of rows with a non-missing value in that column. Mean is the average. Std is the standard deviation, i.e. how much the values typically deviate from the mean. Min is the smallest value in the column. 25% is the value below which 25% of the rows fall (the first quartile); the same applies to 50% (the median) and 75%. Max is the largest value.
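
As a minimal sketch of how these statistics behave, the toy column below (hypothetical values, not taken from the Ames data) has five rows, one of them missing. `count` skips the missing row, and the percentiles are interpolated from the remaining values.

```python
import pandas as pd

# Toy column with one missing value (hypothetical data, for illustration only)
toy = pd.DataFrame({"area": [800, 1000, 1200, 1400, None]})

summary = toy["area"].describe()
print(summary)

# count ignores the missing row; 50% is the median of the four remaining values
n_present = int(summary["count"])
median = summary["50%"]
first_quartile = summary["25%"]
```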

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
train.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
train.dtypes

Identifying object datatypes

In [4]:
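# Numerical variables: boolean mask over the columns whose dtype is not 'object'
q = train.dtypes != 'object'
# list(q[q].index) would give the same names as a plain list
q[q].index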

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [5]:
# Categorical variables: columns stored with the 'object' dtype
s = train.dtypes == 'object'
# list(s[s].index) would give the same names as a plain list
s[s].index

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')

In [None]:
train.select_dtypes(include='object').describe()

In [None]:
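f, ax = plt.subplots(figsize=(14, 12))
# Correlate numeric columns only; recent pandas versions raise an error on 'object' columns
corr = train.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, ax=ax)

Pairwise correlation between the numerical features, including the sale price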

In [None]:
train.shape

In [5]:
# Looking for missing values
train.isnull().sum().sort_values(ascending=False).head(20)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtFinType2      38
BsmtExposure      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
MasVnrType         8
Electrical         1
Id                 0
dtype: int64

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False)

From this we can see that several features consist mostly of null values. 
The dataset has 81 columns, so we can afford to drop the features with the most null values: a column that is almost entirely empty is unlikely to carry much information for predicting the sale price. 
We can, however, double-check which features correlate most strongly with the sale price. 
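
To put a number on "mostly null", one option is to look at the share of missing values per column rather than the raw counts. The sketch below uses a small hypothetical frame; the 0.5 cutoff is an assumption for illustration, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame that borrows a few Ames column names
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# isnull() gives a boolean frame; its column mean is the fraction of missing rows
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns where more than half of the rows are missing (assumed cutoff)
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
```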

In [6]:
# Find which features have the highest absolute correlation with SalePrice
train_corr = train.select_dtypes(include='number').corr()[['SalePrice']].abs()
train_corr.sort_values('SalePrice', ascending=False)

Unnamed: 0,SalePrice
SalePrice,1.0
OverallQual,0.790982
GrLivArea,0.708624
GarageCars,0.640409
GarageArea,0.623431
TotalBsmtSF,0.613581
1stFlrSF,0.605852
FullBath,0.560664
TotRmsAbvGrd,0.533723
YearBuilt,0.522897


# Prepare the data to better expose the underlying data patterns to machine learning algorithms

In [31]:
prices = pd.DataFrame({'price':train['SalePrice'],'log(price + 1)':np.log1p(train['SalePrice'])})

Many of the columns are of type 'object'. These must be converted, because the models we will use work on numerical values. 

In [5]:
train.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


By default, `describe()` summarizes only the numerical columns, so the 'object' columns do not appear here. 

In [32]:
train.drop(['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis=1, inplace=True)
test.drop(['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature'], axis=1, inplace=True)

In [7]:
# drop columns where 20% of data is Null/NaN.
#thresh = len(train) * .8
#train.dropna(thresh = thresh, axis = 1, inplace=True)

Dropping the features with the most null values.
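
The threshold rule in the commented-out cell can be sketched on a toy frame. `dropna(axis=1, thresh=n)` keeps only the columns with at least `n` non-null values, so a threshold of 80% of the row count drops any column that is more than 20% empty. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-row frame: one complete column, one with a single gap, one nearly empty
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],
})

# Require at least 80% non-null values per column (4 of 5 rows here)
thresh = int(len(toy) * 0.8)
kept = toy.dropna(axis=1, thresh=thresh)
print(kept.columns.tolist())
```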

In [33]:
train.drop(['BedroomAbvGr', 'ScreenPorch', 'PoolArea', 'MoSold', '3SsnPorch', 'LowQualFinSF', 'YrSold', 'MiscVal', 'BsmtFinSF2', 'BsmtHalfBath', 'MSSubClass', 'KitchenAbvGr', 'EnclosedPorch'], axis=1, inplace=True)
test.drop(['BedroomAbvGr', 'ScreenPorch', 'PoolArea', 'MoSold', '3SsnPorch', 'LowQualFinSF', 'YrSold', 'MiscVal', 'BsmtFinSF2', 'BsmtHalfBath', 'MSSubClass', 'KitchenAbvGr', 'EnclosedPorch'], axis=1, inplace=True)

Removing features with lowest correlation with the sale price. 
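
One way to arrive at such a list is to rank the numerical features by their absolute correlation with the target and flag the weakest ones. The sketch below uses hypothetical values, and the 0.5 cutoff is an assumption chosen only to make the toy example split cleanly.

```python
import pandas as pd

# Hypothetical data: GrLivArea tracks the price closely, MoSold does not
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "MoSold": [2, 7, 3, 6],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [100000, 150000, 200000, 250000],
})

# Correlation is only defined for numeric columns, so select those first
corr = (
    toy.select_dtypes(include="number")
    .corr()["SalePrice"]
    .abs()
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr)

# Features below the assumed cutoff are candidates for removal
weak = corr[corr < 0.5].index.tolist()
```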

In [34]:
# Fill missing numerical values with the column mean (computed over numeric columns only)
train.fillna(train.mean(numeric_only=True), inplace=True)

Filling null/NaN values in the numerical columns with the column mean
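
A minimal sketch of mean imputation on hypothetical data follows. It assumes the means are learned on the training frame and then reused for the test frame, so that both are filled with the same values. Reusing the training means is a common convention, not something this notebook does explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with gaps in a numeric and a categorical column
train_toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                          "Street": ["Pave", "Pave", None]})
test_toy = pd.DataFrame({"LotFrontage": [np.nan, 70.0],
                         "Street": ["Grvl", "Pave"]})

# Means of the numeric columns only, learned from the training data
means = train_toy.mean(numeric_only=True)

# fillna with a Series fills each column by label; non-numeric columns are untouched
train_filled = train_toy.fillna(means)
test_filled = test_toy.fillna(means)
```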

In [None]:
train.shape  

In [None]:
test.shape 

In [10]:
train.columns.sort_values()


Index(['1stFlrSF', '2ndFlrSF', 'BldgType', 'BsmtCond', 'BsmtExposure',
       'BsmtFinSF1', 'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath',
       'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1', 'Condition2',
       'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd',
       'Fireplaces', 'Foundation', 'FullBath', 'Functional', 'GarageArea',
       'GarageCars', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType',
       'GarageYrBlt', 'GrLivArea', 'HalfBath', 'Heating', 'HeatingQC',
       'HouseStyle', 'Id', 'KitchenQual', 'LandContour', 'LandSlope',
       'LotArea', 'LotConfig', 'LotFrontage', 'LotShape', 'MSZoning',
       'MasVnrArea', 'MasVnrType', 'Neighborhood', 'OpenPorchSF',
       'OverallCond', 'OverallQual', 'PavedDrive', 'RoofMatl', 'RoofStyle',
       'SaleCondition', 'SalePrice', 'SaleType', 'Street', 'TotRmsAbvGrd',
       'TotalBsmtSF', 'Utilities', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd'],
      dtype='object')

In [43]:
# one-hot encoding
train2 = pd.get_dummies(train,drop_first=True)
train2.head()

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,65.0,8450,7,5,2003,2003,196.0,706,150,856,...,0,0,0,0,1,0,0,0,1,0
1,80.0,9600,6,8,1976,1976,0.0,978,284,1262,...,0,0,0,0,1,0,0,0,1,0
2,68.0,11250,7,5,2001,2002,162.0,486,434,920,...,0,0,0,0,1,0,0,0,1,0
3,60.0,9550,7,5,1915,1970,0.0,216,540,756,...,0,0,0,0,1,0,0,0,0,0
4,84.0,14260,8,5,2000,2000,350.0,655,490,1145,...,0,0,0,0,1,0,0,0,1,0


One hot encoding to transform categorical variables
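
When the same encoding has to be applied to a second frame, the dummy columns can differ if a category is absent from one of the frames. The sketch below, on hypothetical data, encodes the training frame with `drop_first=True` and then reindexes a fully encoded test frame to the training columns. This alignment step is an assumption about how one might handle it and is not part of the notebook.

```python
import pandas as pd

# Hypothetical frames: the test frame only contains one of the two Street categories
train_toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                          "LotArea": [8450, 9600, 11250]})
test_toy = pd.DataFrame({"Street": ["Pave", "Pave"],
                         "LotArea": [9550, 14260]})

# drop_first removes the first category (Grvl), leaving a single Street_Pave column
train_enc = pd.get_dummies(train_toy, drop_first=True)

# Encode the test frame in full, then keep exactly the training columns;
# dummy columns missing from the test frame would be filled with 0
test_enc = pd.get_dummies(test_toy).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())
```

Applying `drop_first=True` to the test frame on its own would drop `Street_Pave` here, because it is the only category present. That is why the test frame is encoded in full before being aligned.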

In [44]:
# Separate the target from the features; reading it from `train` keeps this cell safe to re-run
P_train = train['SalePrice']
train2 = train2.drop(columns=['SalePrice', 'Id'], errors='ignore')
print(train2.shape)

In [12]:
print(P_train)

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64


# Explore many different models and short-list the best ones

In [46]:
# Split into training and validation sets
y = train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(train2, y,
                                                    test_size=0.3, random_state=42)
print("X_train: ", X_train.shape)
print("X_test: ", X_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)

In [47]:
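# Define error measure for official scoring: RMSE.
# The scorer returns negated MSE (greater_is_better=False), hence the minus sign below.
scorer = make_scorer(mean_squared_error, greater_is_better=False)

def rmse_cv(model):
    # 5-fold cross-validated RMSE on the training split
    rmse = np.sqrt(-cross_val_score(model, X_train,
                                    y_train, scoring=scorer, cv=5))
    return rmse

def rmse_cv_test(model):
    # 5-fold cross-validated RMSE on the held-out split (the model is refit on its folds)
    rmse = np.sqrt(-cross_val_score(model, X_test,
                                    y_test, scoring=scorer, cv=5))
    return rmse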
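
As a sanity check on the sign handling above, the following sketch compares the manual approach (negated MSE scorer plus square root) with scikit-learn's built-in `"neg_root_mean_squared_error"` scoring string. The data is synthetic and the model is a plain `LinearRegression`, both assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data (not the Ames data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

model = LinearRegression()

# Manual route: the scorer yields -MSE per fold, so negate and take the square root
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
rmse_manual = np.sqrt(-cross_val_score(model, X, y, scoring=mse_scorer, cv=5))

# Built-in route: the scorer yields -RMSE per fold, so only negate
rmse_builtin = -cross_val_score(model, X, y,
                                scoring="neg_root_mean_squared_error", cv=5)

print(rmse_manual.mean(), rmse_builtin.mean())
```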

In [51]:
# Check for missing columns in testing dataset
X_train.columns.difference(X_test.columns).tolist()

[]

In [58]:
X_test.columns.difference(X_train.columns).tolist()

[]

In [59]:
from sklearn.feature_selection import RFECV
from xgboost import XGBRegressor

rfecv = RFECV(estimator=XGBRegressor(),
              step=10,
              n_jobs=-1,
              scoring="r2",
              cv=5,
              verbose=True)

rfecv.fit(X_train, y_train)

Fitting estimator with 219 features.
Fitting estimator with 209 features.
Fitting estimator with 199 features.
Fitting estimator with 189 features.
Fitting estimator with 179 features.
Fitting estimator with 169 features.
Fitting estimator with 159 features.
Fitting estimator with 149 features.
Fitting estimator with 139 features.
Fitting estimator with 129 features.
Fitting estimator with 119 features.
Fitting estimator with 109 features.
Fitting estimator with 99 features.
Fitting estimator with 89 features.
Fitting estimator with 79 features.
Fitting estimator with 69 features.
Fitting estimator with 59 features.


RFECV(cv=5,
      estimator=XGBRegressor(base_score=None, booster=None, callbacks=None,
                             colsample_bylevel=None, colsample_bynode=None,
                             colsample_bytree=None, early_stopping_rounds=None,
                             enable_categorical=False, eval_metric=None,
                             gamma=None, gpu_id=None, grow_policy=None,
                             importance_type=None, interaction_constraints=None,
                             learning_rate=None, max_bin=None,
                             max_cat_to_onehot=None, max_delta_step=None,
                             max_depth=None, max_leaves=None,
                             min_child_weight=None, missing=nan,
                             monotone_constraints=None, n_estimators=100,
                             n_jobs=None, num_parallel_tree=None,
                             predictor=None, random_state=None, reg_alpha=None,
                             reg_lambda=None, 

In [61]:
X_train.columns[rfecv.support_]

Index(['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'BsmtFinSF1', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
       'OpenPorchSF', 'MSZoning_RM', 'LandContour_HLS', 'LandContour_Lvl',
       'LotConfig_FR2', 'Neighborhood_Crawfor', 'Neighborhood_NWAmes',
       'Neighborhood_StoneBr', 'Condition1_Norm', 'Condition1_RRAe',
       'BldgType_2fmCon', 'BldgType_Duplex', 'Exterior1st_BrkComm',
       'Exterior1st_BrkFace', 'Exterior1st_HdBoard', 'Exterior2nd_CmentBd',
       'MasVnrType_BrkFace', 'MasVnrType_Stone', 'ExterQual_TA',
       'ExterCond_Fa', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA',
       'BsmtExposure_Gd', 'BsmtFinType1_GLQ', 'CentralAir_Y', 'KitchenQual_Gd',
       'KitchenQual_TA', 'Functional_Mod', 'Functional_Sev', 'Functional_Typ',
       'SaleType_ConLD', 'SaleType_New', 'SaleCondition_Family'],
      dtype='object')

In [62]:
# Keep only the columns selected by RFECV, for both splits
selected = X_train.columns[rfecv.support_]
x_train_final = X_train[selected]
X_test_final = X_test[selected]
X_test_final.shape

(438, 49)

In [65]:
x_train_final.shape

(1022, 49)

In [66]:
X_test_final.shape

(438, 49)

**Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, make_scorer

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

print("RMSE on train set: ", rmse_cv(lin_reg).mean())
print("RMSE on test set: ", rmse_cv_test(lin_reg).mean())

**XGBoost**

In [67]:
from xgboost import XGBRegressor
xgb_0 = XGBRegressor(n_estimators=1000, learning_rate=0.05)
xgb_0.fit(x_train_final, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.05, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=1000,
             n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
             reg_alpha=0, reg_lambda=1, ...)

In [69]:
x_train_final.columns.difference(X_test_final.columns).tolist()

[]


In [70]:
test_pred = xgb_0.predict(X_test_final)


In [22]:
print("RMSE (5-fold CV, train split): ", rmse_cv(xgb_0).mean())
print("RMSE (5-fold CV, held-out split): ", rmse_cv_test(xgb_0).mean())

RMSE (5-fold CV, train split):  36984.737664355416
RMSE (5-fold CV, held-out split):  33686.5590512496


Good model! The error of roughly 34,000 to 37,000 compares with a mean sale price of about 181,000.
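
The two numbers above come from `rmse_cv` and `rmse_cv_test`, so they are root mean squared errors, not mean absolute errors. The two metrics differ whenever the errors are uneven, because RMSE weights large errors more heavily. A small sketch with hypothetical prices and predictions:

```python
import numpy as np

# Hypothetical true prices and predictions, with one large miss
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 440.0])

errors = y_pred - y_true

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(errors))

# Root mean squared error: square, average, then take the root
rmse = np.sqrt(np.mean(errors ** 2))

print(mae, rmse)
```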

In [26]:
import xgboost as xgb

xgb.__version__

'1.6.2'

**Random Forest Regressor**

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(X_train, y_train)
print("RMSE on train set: ", rmse_cv(forest_reg).mean())
print("RMSE on test set: ", rmse_cv_test(forest_reg).mean())

Decided to go with XGBoost, as it was the best model for this project.

# Fine-tune your models and combine them into a great solution

# Present your solution

# Launch, monitor and maintain your system

# Submission

In [19]:
# Note: test_pred was computed on the validation split (X_test), so 'ID' holds its row indices
output = pd.DataFrame({'ID': X_test.index,
                       'SalePrice': test_pred})
output.to_csv('submission.csv', index=False)
output.head()

Unnamed: 0,ID,SalePrice
0,892,140007.15625
1,1105,316375.0
2,413,117039.601562
3,522,160711.140625
4,1036,304131.53125


In [71]:
import pickle
# Use a context manager so the file handle is closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(xgb_0, f)
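
Loading the file back follows the same pattern in reverse. The sketch below round-trips a plain dictionary as a stand-in for the fitted model (an assumption made so the example runs without XGBoost) and writes to a temporary directory instead of `model.pkl`.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object round-trips the same way
model_stand_in = {"n_estimators": 1000, "learning_rate": 0.05}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)

    # Read it back as a new, equal object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Pickle files are generally tied to the library versions that wrote them, so it is worth recording the version printed earlier alongside the saved model.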