# Predicting House Prices with XGBoost + GBM Models



**Bugra Sebati E.** - **July 2021**

## Introduction

For this competition, we are given a data set of 1460 homes, each with a few dozen features of types float, integer, and categorical. We are tasked with building a regression model to estimate a home's sale price. There are 81 attributes in total: 36 quantitative, 43 categorical, plus Id and SalePrice.

**What can you find in this notebook?**

* Understanding the data
* Exploratory Data Analysis
* Data Preprocessing
* PCA Trial
* GBM and XGBoost Models
* Submission

![](http://media1.tenor.com/images/286156bd33ce64d69f6a2367557392b5/tenor.gif?itemid=10804810)


### Let's meet the variables

* **SalePrice**: The property's sale price in dollars. This is the target variable to predict
* **MSSubClass**: The building class
* **MSZoning**: The general zoning classification
* **LotFrontage**: Linear feet of street connected to property
* **LotArea**: Lot size in square feet
* **Street**: Type of road access
* **Alley**: Type of alley access
* **LotShape**: General shape of property
* **LandContour**: Flatness of the property
* **Utilities**: Type of utilities available
* **LotConfig**: Lot configuration
* **LandSlope**: Slope of property
* **Neighborhood**: Physical locations within Ames city limits
* **Condition1**: Proximity to main road or railroad
* **Condition2**: Proximity to main road or railroad (if a second is present)
* **BldgType**: Type of dwelling
* **HouseStyle**: Style of dwelling
* **OverallQual**: Overall material and finish quality
* **OverallCond**: Overall condition rating
* **YearBuilt**: Original construction date
* **YearRemodAdd**: Remodel date
* **RoofStyle**: Type of roof
* **RoofMatl**: Roof material
* **Exterior1st**: Exterior covering on house
* **Exterior2nd**: Exterior covering on house (if more than one material)
* **MasVnrType**: Masonry veneer type
* **MasVnrArea**: Masonry veneer area in square feet
* **ExterQual**: Exterior material quality
* **ExterCond**: Present condition of the material on the exterior
* **Foundation**: Type of foundation
* **BsmtQual**: Height of the basement
* **BsmtCond**: General condition of the basement
* **BsmtExposure**: Walkout or garden level basement walls
* **BsmtFinType1**: Quality of basement finished area
* **BsmtFinSF1**: Type 1 finished square feet
* **BsmtFinType2**: Quality of second finished area (if present)
* **BsmtFinSF2**: Type 2 finished square feet
* **BsmtUnfSF**: Unfinished square feet of basement area
* **TotalBsmtSF**: Total square feet of basement area
* **Heating**: Type of heating
* **HeatingQC**: Heating quality and condition
* **CentralAir**: Central air conditioning
* **Electrical**: Electrical system
* **1stFlrSF**: First Floor square feet
* **2ndFlrSF**: Second floor square feet
* **LowQualFinSF**: Low quality finished square feet (all floors)
* **GrLivArea**: Above grade (ground) living area square feet
* **BsmtFullBath**: Basement full bathrooms
* **BsmtHalfBath**: Basement half bathrooms
* **FullBath**: Full bathrooms above grade
* **HalfBath**: Half baths above grade
* **BedroomAbvGr**: Number of bedrooms above basement level
* **KitchenAbvGr**: Number of kitchens
* **KitchenQual**: Kitchen quality
* **TotRmsAbvGrd**: Total rooms above grade (does not include bathrooms)
* **Functional**: Home functionality rating
* **Fireplaces**: Number of fireplaces
* **FireplaceQu**: Fireplace quality
* **GarageType**: Garage location
* **GarageYrBlt**: Year garage was built
* **GarageFinish**: Interior finish of the garage
* **GarageCars**: Size of garage in car capacity
* **GarageArea**: Size of garage in square feet
* **GarageQual**: Garage quality
* **GarageCond**: Garage condition
* **PavedDrive**: Paved driveway
* **WoodDeckSF**: Wood deck area in square feet
* **OpenPorchSF**: Open porch area in square feet
* **EnclosedPorch**: Enclosed porch area in square feet
* **3SsnPorch**: Three season porch area in square feet
* **ScreenPorch**: Screen porch area in square feet
* **PoolArea**: Pool area in square feet
* **PoolQC**: Pool quality
* **Fence**: Fence quality
* **MiscFeature**: Miscellaneous feature not covered in other categories
* **MiscVal**: Value of miscellaneous feature
* **MoSold**: Month Sold
* **YrSold**: Year Sold
* **SaleType**: Type of sale
* **SaleCondition**: Condition of sale

#### Now that we know the variables, we can start...
If you like this notebook, don't forget to upvote :) **Thanks!**

In [None]:
#### IMPORT LIBRARIES


import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from scipy.stats import skew
from scipy.special import boxcox1p
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
import xgboost as xgb
from sklearn.ensemble import GradientBoostingRegressor
import warnings
warnings.filterwarnings("ignore")

In [None]:
train = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")
traindf = train.copy()
testdf = test.copy()

In [None]:
train.head()

In [None]:
test.head()

In [None]:
#### I like colors :-9

# Format the numbers directly so the output is a readable sentence, not a printed tuple
print("\033[95m Train Data: {} obs, and {} features\033[00m".format(train.shape[0], train.shape[1]))
print("\033[95m Test Data: {} obs, and {} features\033[00m".format(test.shape[0], test.shape[1]))

In [None]:
# save id 
train_id = train["Id"]
test_id = test["Id"]

# drop id
train.drop("Id" , axis = 1 , inplace = True)
test.drop("Id" , axis = 1 , inplace = True)

In [None]:
train.describe().T

In [None]:
# Focus Target Variable

# distplot is deprecated in recent seaborn releases; histplot with kde = True draws a comparable plot
sns.histplot(train["SalePrice"], color = "g", bins = 60, kde = True, stat = "density", alpha = 0.4);

As we can see above, the target variable SalePrice is not normally distributed.

This can reduce the performance of ML regression models, because some of them assume a normal distribution.

Therefore we need to apply a log transform.
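
As a rough numeric illustration of this point, the sketch below measures skewness before and after `np.log1p`. It uses synthetic log-normal values, not the competition data, and `sample_skewness` is a hypothetical helper written only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal); NOT the competition data
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

def sample_skewness(x):
    """Third standardized moment: 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

skew_raw = sample_skewness(prices)
skew_log = sample_skewness(np.log1p(prices))

print("skewness of raw values:   {:.3f}".format(skew_raw))
print("skewness after log1p:     {:.3f}".format(skew_log))
```

The raw values have a clearly positive skewness, while the log-transformed values end up close to zero.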

In [None]:
sns.histplot(np.log1p(train["SalePrice"]), color = "g", bins = 60, kde = True, stat = "density", alpha = 0.4);

It looks better :)

Now let's look at the 8 variables with the highest correlation to SalePrice (SalePrice itself included) in a heatmap.

In [None]:
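# Correlations are only defined for numeric columns, so select them explicitly
corrmatrix = train.select_dtypes(include = "number").corr()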
plt.figure(figsize = (10,6))
columnss = corrmatrix.nlargest(8, "SalePrice")["SalePrice"].index
cm = np.corrcoef(train[columnss].values.T)
sns.set(font_scale = 1.1)
hm = sns.heatmap(cm, cbar = True, annot = True, square = True, cmap = "RdPu" ,  fmt = ".2f", annot_kws = {"size": 10},
                 yticklabels = columnss.values, xticklabels = columnss.values)
plt.show()

Now let's look at the three variables most strongly correlated with SalePrice.

In [None]:
f, ax = plt.subplots(figsize = (10, 7))
sns.boxplot(x = "OverallQual", y = "SalePrice", data = train);

In [None]:
sns.jointplot(x = train["GrLivArea"], y = train["SalePrice"], kind = "reg");

In [None]:
sns.boxplot(x = train["GarageCars"], y = train["SalePrice"]);

#### - **Outliers**

Can you see the two points at the bottom right of the GrLivArea plot? Yes! They are outliers!

Garages with room for more than 3 cars result in a lower sale price? That doesn't make much sense.

We need to remove these outliers.

In [None]:
train = train.drop(train[(train["GrLivArea"] > 4000) 
                         & (train["SalePrice"] < 200000)].index).reset_index(drop = True)
train = train.drop(train[(train["GarageCars"] > 3) 
                         & (train["SalePrice"] < 300000)].index).reset_index(drop = True)

It should look better.

In [None]:
sns.jointplot(x = train["GrLivArea"], y = train["SalePrice"], kind = "reg");

In [None]:
sns.boxplot(x = train["GarageCars"], y = train["SalePrice"]);

The results look successful.

Now we need to concatenate the train and test data for some cleaning operations.

In [None]:
df = pd.concat((train, test)).reset_index(drop = True)
df.drop(["SalePrice"], axis = 1, inplace = True)
df.shape

In [None]:
#### Focus missing values

df.isna().sum().nlargest(35)

In [None]:
sns.set_style("whitegrid")
f , ax = plt.subplots(figsize = (12, 6))
miss = round(df.isnull().mean()*100,2)
miss = miss[miss > 0]
miss.sort_values(inplace = True)
miss.plot.bar(color = "g")
ax.set(title="Percent missing data by variables");

As can be seen, there are many missing observations in the data.

#### - **Filling missing values**

For a few columns there are lots of NaN entries.

However, reading the data description we find this is not missing data:

For PoolQC, NaN is not missing data but means no pool, likewise for Fence, FireplaceQu etc.

Now let's fill the NA values :)

In [None]:
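some_miss_columns = ["PoolQC","MiscFeature","Alley","Fence","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond",
                  "BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","MasVnrType","MSSubClass"]

# NaN in these columns is treated as "feature absent", so fill with an explicit "None" category.
# Assigning the result back avoids relying on inplace modification of a column selection.
for i in some_miss_columns :
    df[i] = df[i].fillna("None")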

In [None]:
df["Functional"] = df["Functional"].fillna("Typ")

In [None]:
some_miss_columns2 = ["MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities",
                      "Electrical", "KitchenQual", "SaleType","Exterior1st", "Exterior2nd","MasVnrArea"]
# Fill the remaining gaps with the most frequent value of each column
for i in some_miss_columns2:
    df[i] = df[i].fillna(df[i].mode()[0])

In [None]:
some_miss_columns3 = ["GarageYrBlt", "GarageArea", "GarageCars","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF"]
for i in some_miss_columns3 :
    df[i] = df[i].fillna(0)

In [None]:
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
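
To make the `groupby(...).transform(...)` step more concrete, here is a small sketch on a made-up frame (the `toy` data and its values are assumptions for illustration only). Each missing `LotFrontage` is replaced by the median of its own neighborhood rather than by a global median.

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods, each with one missing LotFrontage
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 30.0, np.nan],
})

# transform returns a result aligned with the original rows,
# so it can be assigned straight back to the column
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda s: s.fillna(s.median()))
)

print(toy)
```

The gap in neighborhood A is filled with 70 (the median of 60 and 80), and the gap in neighborhood B with 30.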

We've filled in all the missing data.

Let's check.

In [None]:
df.isna().sum().nlargest(3)

Some numeric variables are really categorical, so we convert them to strings.

In [None]:
Nm = ["MSSubClass","MoSold","YrSold"]
for col in Nm:
    df[col] = df[col].astype(str)

#### - **Label Encoder**

To convert this kind of categorical text data into model-understandable numerical data, we use the LabelEncoder class.

In [None]:
lbe = LabelEncoder()
encodecolumns = ("FireplaceQu","BsmtQual","BsmtCond","ExterQual","ExterCond","HeatingQC","GarageQual",
                "GarageCond","PoolQC","KitchenQual","BsmtFinType1","BsmtFinType2","Functional","Fence",
                "BsmtExposure","GarageFinish","LandSlope","LotShape","PavedDrive","Street","Alley",
                "CentralAir","MSSubClass","OverallCond","YrSold","MoSold")
for i in encodecolumns :
    lbe.fit(list(df[i].values))
    df[i] = lbe.transform(list(df[i].values))
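
One thing worth knowing about `LabelEncoder`: it assigns integer codes in sorted (alphabetical) order of the labels, which does not necessarily match the quality order of ratings such as `Po < Fa < TA < Gd < Ex`. The sketch below shows this on a toy list next to a hand-written mapping. `quality_order` is an assumption made for illustration and is not used elsewhere in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy quality ratings, listed from worst to best
ratings = ["Po", "Fa", "TA", "Gd", "Ex"]

encoder = LabelEncoder()
alphabetical_codes = encoder.fit_transform(ratings).tolist()

# Hypothetical explicit mapping that preserves the quality order
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
ordinal_codes = [quality_order[r] for r in ratings]

print("classes (sorted):  ", list(encoder.classes_))
print("LabelEncoder codes:", alphabetical_codes)
print("ordinal codes:     ", ordinal_codes)
```

Tree-based models can usually cope with either encoding, since they only need split points, but an order-preserving mapping may be easier to interpret.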

#### - **Log Transform for SalePrice**

We must apply a logarithmic transformation to our target variable, because ML models tend to work better with a normal distribution.

In [None]:
train["SalePrice"] = np.log1p(train["SalePrice"])
y = train.SalePrice.values
y[:5]
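
Because the models are trained on `log1p(SalePrice)`, their predictions have to be mapped back with `np.expm1` before submission. A minimal round-trip sketch, using a few illustrative prices rather than the real column:

```python
import numpy as np

# Illustrative sale prices in dollars
prices = np.array([208500.0, 181500.0, 223500.0])

log_prices = np.log1p(prices)      # scale the models are trained on
recovered = np.expm1(log_prices)   # inverse transform, back to dollars

print(log_prices)
print(recovered)
```

`np.expm1` is the exact inverse of `np.log1p`, so the recovered values match the original prices up to floating-point error.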

#### - **Fixing "Skewed" features**

We need to fix all of the skewed data to be more normal so that our models will be more accurate when making predictions.

In [None]:
numeric = df.dtypes[df.dtypes != "object"].index
skewed_var = df[numeric].apply(lambda x: skew(x.dropna())).sort_values(ascending = False)
skewness = pd.DataFrame({"Skewed Features" :skewed_var})
skewness.head()

Now we will apply box cox transformation to these skewed values. So what is box cox transformation?

#### - **Box Cox Transformation** 

A Box-Cox transformation transforms a non-normal dependent variable into an approximately normal shape. Normality is an important assumption for many statistical techniques; if your data isn't normal, applying a Box-Cox transformation means that you are able to run a broader number of tests.
 
References : Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations.
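
For reference, `scipy.special.boxcox1p(x, lmbda)` applies the Box-Cox formula to `1 + x`: it returns `((1 + x)**lmbda - 1) / lmbda` when `lmbda` is not zero and `log(1 + x)` when it is zero. A minimal NumPy sketch of that formula follows. The name `boxcox1p_manual` is made up for this example.

```python
import numpy as np

def boxcox1p_manual(x, lmbda):
    """Box-Cox transform of (1 + x), written out by hand."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lmbda - 1.0) / lmbda

values = np.array([0.0, 10.0, 1000.0, 100000.0])
transformed = boxcox1p_manual(values, 0.15)

print(transformed)
```

With a small positive `lmbda` such as 0.15, zero stays at zero and large values are compressed strongly, which is what pulls in the long right tail of a skewed feature.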

Let's do it.

In [None]:
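# Filter on the column values: masking the DataFrame with a boolean DataFrame
# would keep every row and only replace the non-matching values with NaN
skewness = skewness[abs(skewness["Skewed Features"]) > 0.75]
skewed_var2 = skewness.index
for i in skewed_var2:
    df[i] = boxcox1p(df[i], 0.15)  # Box-Cox of (1 + x) with lambda = 0.15
    df[i] += 1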

#### - **Dummy Variables**

Next step is dummy variables ! 

In statistics and econometrics, particularly in regression analysis, a dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.

In [None]:
df = pd.get_dummies(df)
df.head()
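
To see what `pd.get_dummies` does, here is a sketch on a tiny made-up frame (the `toy` values are illustrative). Numeric columns pass through unchanged, while each text column is replaced by one indicator column per category.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Because train and test were concatenated before this step, both parts end up with the same set of dummy columns.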

In [None]:
X_train = df[:train.shape[0]]
X_test = df[train.shape[0]:]

Now we are ready for ML, but I want to try PCA first. So what is PCA?


#### **PCA (Principal component analysis)**
PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. Let's try it.

**Note** : You need to **standardize** the data before using PCA.

In [None]:
dff = df.copy()
##df_standardize = StandardScaler().fit_transform(dff)
##I didn't standardize it again because the data is already close to the standard.
# Fit once on all components and plot the cumulative explained variance
pca = PCA().fit(dff)
plt.plot(np.cumsum(pca.explained_variance_ratio_));

With about 30 components, we can explain roughly 90% of the variance in the dataset. How do we do that?

In [None]:
pca = PCA(n_components = 30)
pca_fit = pca.fit_transform(dff)
pca_df = pd.DataFrame(data = pca_fit)
pca_df.head()
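
Since the note above says PCA expects standardized inputs, here is a sketch on synthetic data (not the housing features) of what standardizing changes and how the number of components for 90% of the variance can be read off. All names and values are assumptions for illustration. Passing a float such as `0.90` as `n_components` asks scikit-learn to pick that number automatically.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 columns driven by 3 hidden factors plus a little noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))
X[:, 0] *= 1000  # one column on a much larger scale than the rest

# Without scaling, the large-scale column dominates the first component
raw_first_ratio = PCA().fit(X).explained_variance_ratio_[0]

# With scaling, every column contributes on an equal footing
X_std = StandardScaler().fit_transform(X)
cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.90)) + 1

# Equivalent shortcut: let PCA choose the component count for 90% variance
pca_90 = PCA(n_components=0.90).fit(X_std)

print("first component share without scaling: {:.4f}".format(raw_first_ratio))
print("components needed for 90% after scaling:", n_needed)
print("components chosen by PCA(n_components=0.90):", pca_90.n_components_)
```

On unscaled data a single large-valued column can account for almost all of the variance, so the cumulative curve may say more about units than about structure.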

I didn't have much experience with PCA, so I just wanted to try it. Your positive and negative opinions are important to me :)

Now we will build the models! First, we set up cross-validation with k folds.

In [None]:
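n_folds = 5

def rmsle_cv(model):
    # Pass the KFold object itself so that shuffle and random_state are actually used;
    # get_n_splits() would only return the integer number of folds
    kf = KFold(n_splits = n_folds, shuffle = True, random_state = 42)
    rmse = np.sqrt(-cross_val_score(model, X_train.values, y, scoring = "neg_mean_squared_error", cv = kf))
    return rmse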
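
A subtle point about the `cv` argument: `KFold.get_n_splits` returns only the number of folds as an integer, and an integer `cv` makes `cross_val_score` use plain, unshuffled folds for a regressor. The sketch below, on ten dummy rows, compares the folds produced by an unshuffled and a shuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten dummy rows

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# get_n_splits returns just the fold count, not a splitter
n = kf.get_n_splits(X)

# Test indices of each fold without and with shuffling
plain_folds = [test.tolist() for _, test in KFold(n_splits=5).split(X)]
shuffled_folds = [test.tolist() for _, test in kf.split(X)]

print("get_n_splits:", n)
print("unshuffled folds:", plain_folds)
print("shuffled folds:  ", shuffled_folds)
```

Shuffling matters when the rows have some ordering, because contiguous folds would then not be representative of the whole data set.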

In [None]:
# A single random_state is kept; passing seed as well would set the same thing twice
model_xgb = xgb.XGBRegressor(colsample_bytree = 0.2, gamma = 0.0 ,
                             learning_rate = 0.05, max_depth = 6, 
                             min_child_weight = 1.5, n_estimators = 7200,
                             reg_alpha = 0.9, reg_lambda = 0.6,
                             subsample = 0.2,
                             random_state = 7)

model_gbm = GradientBoostingRegressor(n_estimators = 3000, learning_rate = 0.05,
                                   max_depth = 4, max_features = "sqrt",
                                   min_samples_leaf = 15, min_samples_split = 10, 
                                   loss = "huber", random_state = 5)

We check the performance of the base models by evaluating the cross-validated RMSLE.

In [None]:
score = rmsle_cv(model_xgb)
print("XGBoost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_gbm)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [None]:
## we need this func

def rmsle(y, y_pred):
    # y and y_pred are already on the log1p scale, so a plain RMSE here is the RMSLE on prices
    return np.sqrt(mean_squared_error(y, y_pred))
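
The function is called `rmsle` although it computes a plain RMSE, because the target was already log-transformed. The sketch below spells this out with made-up prices; `rmse` and `rmsle_on_prices` are hypothetical helpers for this example. It also shows that the metric depends on relative rather than absolute errors.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def rmsle_on_prices(y_true, y_pred):
    """RMSLE computed from raw prices: RMSE of the log1p values."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))

# Made-up true and predicted prices
true_prices = np.array([200000.0, 150000.0, 320000.0])
pred_prices = np.array([210000.0, 140000.0, 300000.0])

score_from_prices = rmsle_on_prices(true_prices, pred_prices)
score_from_logs = rmse(np.log1p(true_prices), np.log1p(pred_prices))

# A 10% overestimate costs about the same for a cheap and an expensive home
cheap_error = rmsle_on_prices([100000.0], [110000.0])
expensive_error = rmsle_on_prices([1000000.0], [1100000.0])

print(score_from_prices, score_from_logs)
print(cheap_error, expensive_error)
```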

#### - **XGBoost**

In [None]:
model_xgb.fit(X_train, y)
xgb_train_pred = model_xgb.predict(X_train)
xgb_pred = np.expm1(model_xgb.predict(X_test))
print(rmsle(y, xgb_train_pred))

In [None]:
xgb_pred[:5]

#### - **GBM (Gradient Boosting Machines)**

In [None]:
model_gbm.fit(X_train, y)
gbm_train_pred = model_gbm.predict(X_train)
gbm_pred = np.expm1(model_gbm.predict(X_test))
print(rmsle(y, gbm_train_pred))

In [None]:
gbm_pred[:5]

#### - **SUBMISSION**

In [None]:
trybest = (0.5 * xgb_pred ) + (0.5 * gbm_pred)
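
The equal-weight average above is one point on a range of possible blends. As a sketch, a small hypothetical helper `blend` (not part of the original notebook) makes the weight explicit so that other mixes can be tried. The prediction values below are made up.

```python
import numpy as np

def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction arrays; weight_a must lie in [0, 1]."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return weight_a * pred_a + (1.0 - weight_a) * pred_b

# Made-up predictions from two models, in dollars
xgb_like = np.array([120000.0, 160000.0])
gbm_like = np.array([124000.0, 150000.0])

equal_blend = blend(xgb_like, gbm_like)          # same as 0.5 * a + 0.5 * b
xgb_heavy = blend(xgb_like, gbm_like, 0.7)       # leans towards the first model

print(equal_blend)
print(xgb_heavy)
```

Any weight chosen this way should be judged on held-out data, for example with the cross-validation scores of the individual models, rather than on the training error.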

In [None]:
submission = pd.DataFrame({"Id": test_id, "SalePrice": trybest})
submission.head(5)

In [None]:
submission.to_csv("submission.csv", index = False)

![](http://media.giphy.com/media/mofrKGJMwOHM4/giphy.gif)

You can get better scores with different models and combinations. I just wanted to try these 2 models and one combination of them in this notebook.

**Thanks for your attention! ;)**