# **Home Selling Price Prediction**

> <center><img src="https://media.istockphoto.com/photos/home-for-sale-real-estate-sign-and-house-picture-id168769007?k=20&m=168769007&s=612x612&w=0&h=uPj_q8BUB6N27npzmIsZlnUu-ysnvsoR1elRfmPhwlc=" width="1300px"></center>



* **Problem Statement:**
<div align='left'><font size="3" color="#000000"> It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 
</font></div>


## 1. EDA
> Analyze and investigate data sets and summarize their main characteristics.

## 2. Feature Engineering
* Convert non-numeric features to strings
> Some features are stored as numbers but actually represent categories. Convert them to strings (data type 'object') so they can be encoded with a one-hot encoder.
* Normalize skewed feature (SalePrice)
> Apply a log-transformation to the skewed data to remove or reduce its skewness and bring the distribution closer to normal (Gaussian).
* Deal with outliers
> Remove outliers from some of the features.

## 3. Data preprocessing
* Apply Pipeline
> `sklearn.pipeline.Pipeline` sequentially applies a list of transforms and a final estimator. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
* Check for null values
> If there are null values, impute numerical features with the mean and 'object' features with the most frequent value by using SimpleImputer().

* Check for categorical feature columns and encode them 
>  All features with data type 'object' can be encoded using a one-hot encoder. <br>

* Standardize
> Standardize numerical features by removing the mean and scaling to unit variance using StandardScaler().

* Build models and apply k-fold cross validation
> Wrap each model in the preprocessing pipeline and score it with k-fold cross validation.


## 4. Model Comparison
> Compare the scores of two regression models:
> * XGBoost Regressor 
> * Support Vector Regressor

## 5. Model Improvement
* RandomizedSearchCV
> Search for the best hyper-parameters to improve each model's score.
* Stacking
> Combining the predictions from multiple machine learning models on the same dataset.

## 6. Making Final Prediction
> Select the best model and train it on the whole training dataset, then use it to predict the test data.

## 7. Submission
> Submit the selected model with the best prediction.

## Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split

# preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Stats
from scipy.stats import skew, norm

# Plots
import seaborn as sns
import matplotlib.pyplot as plt

# column transformer
from sklearn.compose import ColumnTransformer

# cross validation
from sklearn.model_selection import KFold, cross_val_score

# Stacking
from mlxtend.regressor import StackingCVRegressor

# hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# Pipeline
from sklearn.pipeline import Pipeline

# model
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Removes warning
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Load Data

In [None]:
train_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")

print("Data is loaded")

## 1. EDA

In [None]:
print ("Train: ",train_df.shape[0],"sales, and ",train_df.shape[1],"features")
print ("Test: ",test_df.shape[0],"sales, and ",test_df.shape[1],"features")

In [None]:
train_df.head()

In [None]:
train_df.info()

### 1.1 SalePrice Distribution (Target Variable)

In [None]:
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))

# Check the distribution 
sns.distplot(train_df['SalePrice'], color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)
plt.show()

The graph above shows that SalePrice is right-skewed. Skewed data makes it harder for a model to find a proper pattern, which is why we transform it toward a normal (Gaussian) distribution. A log-transformation removes or reduces the skewness.

In [None]:
# Skew
print("Skewness: %f" % train_df['SalePrice'].skew())

### 1.2 Correlation

#### 1.2.1 Heatmap

In [None]:
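# numeric_only=True restricts the correlation to numeric columns
# (recent pandas versions raise an error on 'object' columns otherwise)
corr = train_df.corr(numeric_only=True)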
plt.subplots(figsize=(15,12))
sns.heatmap(corr, vmax=0.9, cmap='coolwarm', square=True)

In [None]:
train_df.corr(numeric_only=True)['SalePrice'].sort_values()

From the result above, **OverallQual** and **GrLivArea** are highly correlated with SalePrice. Based on this observation, we should look more closely at these two features and check whether any outliers are visible.

#### 1.2.2 GrLivArea

In [None]:
sns.scatterplot(data=train_df,x='GrLivArea', y='SalePrice')
plt.axhline(y=300000, color='r')
plt.axvline(x=4550, color='r')

The graph above shows two visible outliers: houses with a very large living area (GrLivArea above 4500) that sold for less than 300,000.

In [None]:
train_df[(train_df['GrLivArea']>4500) & (train_df['SalePrice']<300000)][['SalePrice', 'GrLivArea']]

#### 1.2.3 OverallQual

In [None]:
sns.scatterplot(data=train_df,x='OverallQual', y='SalePrice')
plt.axvline(x=4.9,color='r')
plt.axhline(y=650000,color='r')

In [None]:
# Low-quality houses (OverallQual < 5) that sold for an unusually high price
train_df[(train_df['OverallQual']<5) & (train_df['SalePrice']>200000)][['SalePrice', 'OverallQual']]

### 1.3 Check on Missing Value

In [None]:
# Define function to get the percentage of missing values in attributes
def missing_percent(train_df):
    nan_percent= 100*(train_df.isnull().sum()/len(train_df))
    nan_percent= nan_percent[nan_percent>0].sort_values()
    return nan_percent

In [None]:
# Percentage of missing values per attribute
nan_percent= missing_percent(train_df)
nan_percent

In [None]:
sns.set_style("whitegrid")
missing = train_df.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

## 2. Feature Engineering

### 2.1 Convert non-numeric features to strings

In [None]:
# Some categorical predictors are stored as numbers; convert them to strings.
# The same conversion is applied to test_df so both sets share the same dtypes.
for df in (train_df, test_df):
    for col in ['MSSubClass', 'YrSold', 'MoSold']:
        df[col] = df[col].astype(str)

### 2.2 Normalize Skewed Feature

#### 2.2.1 SalePrice Distribution (Target Variable)

In [None]:
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))

# Check the distribution 
sns.distplot(train_df['SalePrice'], color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)
plt.show()

The graph above shows that SalePrice is right-skewed. Skewed data makes it harder for a model to find a proper pattern, which is why we transform it toward a normal (Gaussian) distribution. A log-transformation removes or reduces the skewness.

In [None]:
# Apply log-transformation (log(1+x))
train_df["SalePrice"] = np.log1p(train_df["SalePrice"])

#### 2.2.2 After Applying Log-Transformation

In [None]:
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))
#Check the new distribution 
sns.distplot(train_df['SalePrice'] , fit=norm, color="b");

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train_df['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)

plt.show()

<div align='left'><font size="3" color="#000000"> After applying the log-transformation, SalePrice is approximately normally distributed and the skewness is largely removed.
</font></div>
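
As a small illustration of the effect, the sketch below applies the same `np.log1p` transform to synthetic right-skewed data and compares the skewness before and after. The `prices` series is made up (drawn from a log-normal distribution) and only stands in for SalePrice; the round trip through `np.expm1` shows how predictions made on the log scale can be mapped back to prices.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed "prices" drawn from a log-normal distribution
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

log_prices = np.log1p(prices)    # forward transform: log(1 + x)
restored = np.expm1(log_prices)  # inverse transform: exp(x) - 1

print("Skewness before: %f" % prices.skew())
print("Skewness after:  %f" % log_prices.skew())
print("Round trip recovers the original values:", np.allclose(restored, prices))
```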

### 2.3 Deal with Outliers

#### 2.3.1 What is an Outlier?
<div align='left'><font size="3" color="#000000"> An outlier is an observation that is numerically distant from the rest of the data; put simply, it is a value that falls outside the expected range. Let's take an example to see what happens to a data set with and without an outlier.
</font></div>

| | Data without outlier | Data with outlier |
|--|--|--|
|**Data**| 1, 2, 3, 3, 4, 5, 4 | 1, 2, 3, 3, 4, 5, **400** |
|**Mean**| 3.143 | **59.714** |
|**Median**| 3 | 3 |
|**Standard Deviation**| 1.345 | **150.057** |

<div align='left'><font size="3" color="#000000"> As you can see, the data set with an outlier has a significantly different mean and standard deviation. In the first scenario the average is 3.14, but with the outlier the average soars to 59.71. This would change the estimate completely.
</font></div>
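
The comparison above can be reproduced with a few lines of NumPy. This is only a sketch: the helper name `summarize` is made up for illustration, and the standard deviation uses the sample formula (`ddof=1`), which is the variant that matches the figures in the table.

```python
import numpy as np

without_outlier = np.array([1, 2, 3, 3, 4, 5, 4])
with_outlier = np.array([1, 2, 3, 3, 4, 5, 400])

def summarize(values):
    """Return mean, median, and sample standard deviation (ddof=1)."""
    return {"mean": values.mean(),
            "median": np.median(values),
            "std": values.std(ddof=1)}

for label, values in [("without outlier", without_outlier),
                      ("with outlier", with_outlier)]:
    stats = summarize(values)
    print(f"{label:16} mean={stats['mean']:.3f} "
          f"median={stats['median']:.0f} std={stats['std']:.3f}")
```

Note that the median is identical in both cases, while the mean and standard deviation are dominated by the single extreme value.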

<div align='left'><font size="3" color="#000000"> Let's take a real-world example. In a company of 50 employees, 45 people have a monthly salary of Rs.6,000 and 5 senior employees have a monthly salary of Rs.100,000 each. The average monthly salary of the employees is Rs.15,400, which leads to the wrong conclusion, because the majority of employees earn far less than 15.4k. The median salary, however, is Rs.6,000, which makes more sense than the average. For this reason the median is a more appropriate measure than the mean here. This shows the effect of outliers.
</font></div>    
<hr>   
<div class="alert alert-info" ><font size="3"><strong> Outlier </strong> is a term commonly used by analysts and data scientists. Outliers need close attention because they can otherwise lead to wildly wrong estimates. Simply put, an outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.</font></div>

Source and credit to https://www.kaggle.com/nareshbhat/outlier-the-silent-killer/notebook

#### 2.3.2 Remove Outliers

In [None]:
# Remove outliers. SalePrice is already log-transformed at this point,
# so the price thresholds are log-transformed as well.
train_df.drop(train_df[(train_df['OverallQual'] < 5) & (train_df['SalePrice'] > np.log1p(200000))].index, inplace=True)
train_df.drop(train_df[(train_df['GrLivArea'] > 4500) & (train_df['SalePrice'] < np.log1p(300000))].index, inplace=True)
train_df.reset_index(drop=True, inplace=True)

## 3. Data Preprocessing

### 3.1 Splitting Dataset

In [None]:
X = train_df.drop(columns=['SalePrice'])
y = train_df['SalePrice']

### 3.2 Generate Pipeline for Transformations

In [None]:
# integer features (numeric columns stored as int64)
int_cat_features = list(X.select_dtypes(include='int64').columns)
int_cat_transformers = Pipeline(steps=[('imputer', SimpleImputer()),
                                      ('scale', StandardScaler())])

# string category
str_cat_features = list(X.select_dtypes(include='object').columns)
str_cat_transformers = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                       ('one-hot', OneHotEncoder(handle_unknown='ignore'))])

# continuous numerical features (float64)
float_cat_features = list(X.select_dtypes(include='float64').columns)
float_cat_transformers = Pipeline(steps=[('imputer', SimpleImputer()),
                                         ('scale', StandardScaler())])
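
As a rough illustration of what these transformers do, the sketch below runs the same impute, scale, and one-hot steps on a tiny made-up frame. The `toy` data and its column names (`area`, `zone`) are hypothetical, and `sparse_threshold=0` is set only so the result is a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "area": [1000.0, np.nan, 2000.0, 1500.0],
    "zone": pd.Series(["A", "B", np.nan, "A"], dtype=object),
})

# Numeric columns: fill gaps with the mean (SimpleImputer default), then standardize
numeric = Pipeline(steps=[("imputer", SimpleImputer()),
                          ("scale", StandardScaler())])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical = Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                              ("one-hot", OneHotEncoder(handle_unknown="ignore"))])

# sparse_threshold=0 forces a dense array so the result is easy to inspect
toy_preprocess = ColumnTransformer(
    transformers=[("num", numeric, ["area"]),
                  ("cat", categorical, ["zone"])],
    sparse_threshold=0,
)

toy_transformed = toy_preprocess.fit_transform(toy)
print(toy_transformed.shape)
print(toy_transformed)
```

The first output column is the standardized `area`; the remaining columns are the one-hot indicators for `zone`, with the missing entry filled in by the most frequent category.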

### 3.3 Building Model

#### 3.3.1 What is K-Fold Cross Validation?
<div align='left'><font size="3" color="#000000"> In K-fold cross validation, a given data set is split into K sections (folds), and each fold is used as the testing set at some point. Take the scenario of 5-fold cross validation (K=5). Here, the data set is split into 5 folds. In the first iteration, the first fold is used to test the model and the rest are used to train it. In the second iteration, the second fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 5 folds has been used as the testing set.
</font></div>

> <center><img src="https://miro.medium.com/max/2000/1*IjKy-Zc9zVOHFzMw2GXaQw.png" width="800px"></center>
> <center><font size="3" color="#000000">5-Fold Cross Validation</font></center>

Source and credit to https://medium.datadriveninvestor.com/k-fold-cross-validation-6b8518070833
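
To make the procedure concrete, the sketch below splits ten made-up sample indices into 5 folds with scikit-learn's `KFold` and counts how often each sample lands in a test fold. The names `samples`, `kf_demo`, and `test_counts` are chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # hypothetical data set of 10 samples
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many times each sample is used for testing
test_counts = np.zeros(len(samples), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(samples), start=1):
    test_counts[test_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

print("Times each sample was used for testing:", test_counts.tolist())
```

Every sample is tested exactly once and used for training in the other four iterations.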

In [None]:
# Setup cross validation folds
kf = KFold(n_splits=5, random_state=42, shuffle=True)

In [None]:
# Model building

def model_building(model):
    #applying transformations
    preprocess = ColumnTransformer(transformers=[('int_cat', int_cat_transformers, int_cat_features),
                                                 ('str_cat', str_cat_transformers, str_cat_features),
                                                 ('float_cat', float_cat_transformers, float_cat_features)
                                                ])
    # preprocessing and modeling pipeline
    pipe = Pipeline(steps=[('preprocessing', preprocess),
                           ('modeling', model)])
    
    return pipe
    
# Cross validating
def cross_validate_pipeline(pipeline, X, y):
    cv_scores = -cross_val_score(pipeline, X, y, scoring="neg_root_mean_squared_error", cv=kf)
    return cv_scores

## 4. Model Comparison

In [None]:
models = [('SVR', SVR()),
          ('XGBRegressor',XGBRegressor()),
         ]


for name,model in models:
    model_pipeline = model_building(model)
    cv_scores = cross_validate_pipeline(model_pipeline, X, y)
    print(f'{name :20} {cv_scores.mean()}')

<div align='left'><font size="3" color="#000000"> The results show that XGBRegressor achieves a better (lower) RMSE than SVR.
</font></div>

> <center><img src="https://c.tenor.com/74hajejcvqwAAAAS/rock.gif" width="320px"></center>
> <center><font size="3" color="#000000">Even so, can the models still be improved?</font></center>

<div align='left'><font size="3" color="#000000"> Another way to improve the results is to tune the hyper-parameters of each model using GridSearchCV or RandomizedSearchCV, and to apply stacking. Both can improve the models' predictive performance.
</font></div>

## 5. Model Improvement

### 5.1 RandomizedSearchCV

In [None]:
models = [('SVR',
           SVR(),
           {'modeling__C':[20,30,40],
            'modeling__epsilon':[0.007,0.008,0.009],
            'modeling__gamma':[0.0002,0.0003,0.0004],}),
          
          
          ('XGBRegressor',
           XGBRegressor(),
           {'modeling__learning_rate':[0.01],
            'modeling__max_depth':[4],
            'modeling__n_estimators':[3000],
            'modeling__subsample':[0.6,0.5,0.7,]})
         ]
#[2000,3000]

for name, model, param_grid in models:
    pipe = model_building(model)
    rs = RandomizedSearchCV(estimator = pipe, 
                            param_distributions = param_grid,
                            scoring="neg_mean_squared_error", 
                            cv = 5,
                            n_iter = 5,
                            random_state = 34)
    rs.fit(X,y)
    print(f'{name :20} {np.sqrt(np.negative(rs.best_score_))}')
    print(f'{name :20} {rs.best_params_}')

<div align='left'><font size="3" color="#000000"> The results show that both models score better (lower RMSE) than before. RandomizedSearchCV samples combinations from the predefined parameter values and returns the combination with the best cross-validated score.
</font></div>
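
The same search pattern can be tried on a much smaller problem. The sketch below is not the tuning run from this notebook: it swaps in a `Ridge` model and synthetic data from `make_regression` so that it runs in seconds, and the names `demo_pipe`, `candidate_alphas`, and `search` are hypothetical. It shows how a parameter of a pipeline step is addressed as `<step name>__<parameter>` and how the negated MSE score is converted back to an RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the housing features)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

demo_pipe = Pipeline(steps=[("scale", StandardScaler()),
                            ("modeling", Ridge())])

# The "modeling__" prefix routes the parameter to the pipeline step named "modeling"
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = RandomizedSearchCV(estimator=demo_pipe,
                            param_distributions={"modeling__alpha": candidate_alphas},
                            scoring="neg_mean_squared_error",
                            cv=5,
                            n_iter=3,          # only 3 of the 5 candidates are sampled
                            random_state=34)
search.fit(X_demo, y_demo)

best_rmse = np.sqrt(-search.best_score_)  # best_score_ is a negated MSE
print("Best RMSE:  ", best_rmse)
print("Best params:", search.best_params_)
```

Because only `n_iter` combinations are sampled, the search is cheaper than an exhaustive grid search but is not guaranteed to visit the best candidate.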

### 5.2 Stacking

#### 5.2.1 What is Stacking?
<div align='left'><font size="3" color="#000000"> Stacking is an ensemble learning technique to combine multiple regression models via a meta-regressor. The StackingCVRegressor extends the standard stacking algorithm (implemented as StackingRegressor) using out-of-fold predictions to prepare the input data for the level-2 regressor.


</font></div>

<div align='left'><font size="3" color="#000000"> In the standard stacking procedure, the first-level regressors are fit to the same training set that is used to prepare the inputs for the second-level regressor, which may lead to overfitting. The StackingCVRegressor, however, uses the concept of out-of-fold predictions: the dataset is split into k folds, and in k successive rounds, k-1 folds are used to fit the first-level regressors. In each round, the first-level regressors are then applied to the remaining fold that was not used for model fitting in that iteration. The resulting predictions are then stacked and provided as input data to the second-level regressor. After the training of the StackingCVRegressor, the first-level regressors are fit to the entire dataset for optimal predictions.
</font></div>

> <center><img src="http://rasbt.github.io/mlxtend/user_guide/regressor/StackingCVRegressor_files/stacking_cv_regressor_overview.png" width="500px"></center>

Source and credit to http://rasbt.github.io/mlxtend/user_guide/regressor/StackingCVRegressor/
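
The out-of-fold idea can be sketched by hand with plain scikit-learn. This is not how `StackingCVRegressor` is implemented internally, only a simplified illustration under stated assumptions: the data comes from `make_regression`, the first-level models (`Ridge`, `DecisionTreeRegressor`) and the `LinearRegression` meta-regressor are stand-ins, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and stand-in first-level regressors (assumptions for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=15.0, random_state=0)
first_level = [Ridge(alpha=1.0),
               DecisionTreeRegressor(max_depth=4, random_state=0)]
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predictions: every row is predicted by a model that never saw it
oof = np.column_stack([cross_val_predict(m, X_demo, y_demo, cv=folds)
                       for m in first_level])

# The second-level regressor learns how to combine the first-level predictions
meta = LinearRegression().fit(oof, y_demo)

# Finally, refit the first-level regressors on the entire dataset
fitted = [m.fit(X_demo, y_demo) for m in first_level]

# Predicting new rows: first-level predictions feed the meta-regressor
X_new = X_demo[:5]
level_one_new = np.column_stack([m.predict(X_new) for m in fitted])
stacked_preds = meta.predict(level_one_new)
print(stacked_preds)
```

Each column of `oof` holds one first-level model's out-of-fold predictions, so the meta-regressor is trained on predictions that were never made on a model's own training rows.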

#### 5.2.2 Stacking the Tuned Models

In [None]:
svr_model= SVR(gamma= 0.0003,
               epsilon= 0.009,
               C= 30)

xgb_model= XGBRegressor(subsample=0.5,
                        n_estimators=3000,
                        max_depth=4,
                        learning_rate=0.01)

# Stack up the tuned models above, using the SVR model as the meta-regressor
stack_reg = StackingCVRegressor(regressors=(svr_model, xgb_model),
                                meta_regressor=svr_model,
                                use_features_in_secondary=True)

<div align='left'><font size="3" color="#000000"> In this implementation, the previously tuned regression models are stacked and the SVR model is used as the meta-regressor.
</font></div>

In [None]:
stack_model = [('StackingCVRegressor', stack_reg)]


for name,model in stack_model:
    model_pipeline = model_building(model)
    cv_scores = cross_validate_pipeline(model_pipeline, X, y)
    print(f'{name :20} {cv_scores.mean()}')

<div align='left'><font size="3" color="#000000"> The stacked model gives a slightly better result than either of the two tuned regression models on its own.
</font></div>

## 6. Making Final Prediction

In [None]:
# modeling
model = model_building(stack_reg)
# training
model.fit(X,y)
# making predictions
preds = model.predict(test_df)

## 7. Submission

In [None]:
output = pd.DataFrame(data={'Id':test_df['Id'],'SalePrice':np.expm1(preds)})
output.to_csv('submission.csv', index=False)
print('Your submission was successfully saved!')