# House Pricing Prediction Model

## Objective
- Build a model that predicts the sale price of a house from features such as location, lot area, number of rooms, and pool.
- Observe the impact of each feature to see which ones carry the highest and lowest weight.
- Evaluate the performance of the model with metrics such as MSE and MAE.
- Implement several models, such as Linear Regression and a Decision Tree, and decide which one performs best.
- Tune hyperparameters to improve performance.
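
The evaluation metrics named in the objectives can be sketched in a few lines of NumPy. The prices below are made-up illustration values, not data from this notebook, and the variable names are only for this example.

```python
import numpy as np

# Hypothetical true and predicted prices (illustration only)
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 185000.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # mean squared error
rmse = np.sqrt(mse)               # root mean squared error, same unit as the target
mae = np.mean(np.abs(errors))     # mean absolute error
# R^2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}, R^2: {r2:.4f}")
```

MSE and MAE are expressed in (squared) units of the target, so they are not percentages; R^2 is unitless and at most 1.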

### Loading datasets

In [1]:
# Basic Dependencies 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

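# Original train/test files and the sample submission (not used in the cells shown here)
train_dataset = pd.read_csv("datasets/train.csv")
test_dataset = pd.read_csv("datasets/test.csv")
test_labels = pd.read_csv("datasets/sample_submission.csv")

# Dataset used for the rest of the notebook
dataset = pd.read_csv("datasets/dataset.csv")

# read_csv already returns a DataFrame; work on a copy so `dataset` stays untouched
df = dataset.copy()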

print(df)
df.describe(include='all')

        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0        1          60       RL         65.0     8450   Pave   NaN      Reg   
1        2          20       RL         80.0     9600   Pave   NaN      Reg   
2        3          60       RL         68.0    11250   Pave   NaN      IR1   
3        4          70       RL         60.0     9550   Pave   NaN      IR1   
4        5          60       RL         84.0    14260   Pave   NaN      IR1   
...    ...         ...      ...          ...      ...    ...   ...      ...   
2914  2915         160       RM         21.0     1936   Pave   NaN      Reg   
2915  2916         160       RM         21.0     1894   Pave   NaN      Reg   
2916  2917          20       RL        160.0    20000   Pave   NaN      Reg   
2917  2918          85       RL         62.0    10441   Pave   NaN      Reg   
2918  2919          60       RL         74.0     9627   Pave   NaN      Reg   

     LandContour Utilities  ... PoolArea PoolQC  Fe

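| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2919.0 | 2919.0 | 2915 | 2433.0 | 2919.0 | 2919 | 198 | 2919 | 2919 | 2917 | ... | 2919.0 | 10 | 571 | 105 | 2919.0 | 2919.0 | 2919.0 | 2918 | 2919 | 2919.0 |
| unique | NaN | NaN | 5 | NaN | NaN | 2 | 2 | 4 | 4 | 2 | ... | NaN | 3 | 4 | 4 | NaN | NaN | NaN | 9 | 6 | NaN |
| top | NaN | NaN | RL | NaN | NaN | Pave | Grvl | Reg | Lvl | AllPub | ... | NaN | Ex | MnPrv | Shed | NaN | NaN | NaN | WD | Normal | NaN |
| freq | NaN | NaN | 2265 | NaN | NaN | 2907 | 120 | 1859 | 2622 | 2916 | ... | NaN | 4 | 329 | 95 | NaN | NaN | NaN | 2525 | 2402 | NaN |
| mean | 1460.0 | 57.137718 | NaN | 69.305795 | 10168.11408 | NaN | NaN | NaN | NaN | NaN | ... | 2.251799 | NaN | NaN | NaN | 50.825968 | 6.213087 | 2007.792737 | NaN | NaN | 180052.854648 |
| std | 842.787043 | 42.517628 | NaN | 23.344905 | 7886.996359 | NaN | NaN | NaN | NaN | NaN | ... | 35.663946 | NaN | NaN | NaN | 567.402211 | 2.714762 | 1.314964 | NaN | NaN | 57381.565721 |
| min | 1.0 | 20.0 | NaN | 21.0 | 1300.0 | NaN | NaN | NaN | NaN | NaN | ... | 0.0 | NaN | NaN | NaN | 0.0 | 1.0 | 2006.0 | NaN | NaN | 34900.0 |
| 25% | 730.5 | 20.0 | NaN | 59.0 | 7478.0 | NaN | NaN | NaN | NaN | NaN | ... | 0.0 | NaN | NaN | NaN | 0.0 | 4.0 | 2007.0 | NaN | NaN | 154795.08415 |
| 50% | 1460.0 | 50.0 | NaN | 68.0 | 9453.0 | NaN | NaN | NaN | NaN | NaN | ... | 0.0 | NaN | NaN | NaN | 0.0 | 6.0 | 2008.0 | NaN | NaN | 176734.8415 |
| 75% | 2189.5 | 70.0 | NaN | 80.0 | 11570.0 | NaN | NaN | NaN | NaN | NaN | ... | 0.0 | NaN | NaN | NaN | 0.0 | 8.0 | 2009.0 | NaN | NaN | 191895.74415 |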


### Handling missing data

In [2]:
missing_values = df.isnull().sum()

print(f"Number of columns with missing values: {len(missing_values[missing_values>0])}")
missing_values[missing_values>0]

Number of columns with missing values: 34


MSZoning           4
LotFrontage      486
Alley           2721
Utilities          2
Exterior1st        1
Exterior2nd        1
MasVnrType      1766
MasVnrArea        23
BsmtQual          81
BsmtCond          82
BsmtExposure      82
BsmtFinType1      79
BsmtFinSF1         1
BsmtFinType2      80
BsmtFinSF2         1
BsmtUnfSF          1
TotalBsmtSF        1
Electrical         1
BsmtFullBath       2
BsmtHalfBath       2
KitchenQual        1
Functional         2
FireplaceQu     1420
GarageType       157
GarageYrBlt      159
GarageFinish     159
GarageCars         1
GarageArea         1
GarageQual       159
GarageCond       159
PoolQC          2909
Fence           2348
MiscFeature     2814
SaleType           1
dtype: int64

##### It wouldn't be practical to fill the missing data of every feature with a single statistic such as the median or the mode. It's better to go through each feature and pick the approach that suits it.

The Approach: 
- Numerical Data: 
    - Crucial data such as LotFrontage is filled with the median.
    - Data such as Pool Area or Garage Area can be missing when the feature doesn't exist, hence it is filled with 0. 

- Categorical Data:
    - Mostly filled with either the mode for necessary data (for instance, YearBuilt, YrSold), or just 'Unknown' to indicate non-existence of the data. 
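
The approach above can be sketched on a toy frame before applying it to the real data. The values are hypothetical, and which column receives which strategy here is only an illustration of the three strategies.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with the kinds of gaps described above
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],     # crucial numerical feature -> median
    "PoolArea":    [0.0, 512.0, np.nan],     # absent feature -> 0
    "PoolQC":      [np.nan, "Ex", np.nan],   # absent feature -> 'Unknown'
    "SaleType":    ["WD", "WD", np.nan],     # necessary categorical feature -> mode
})

toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())
toy["PoolArea"] = toy["PoolArea"].fillna(0)
toy["PoolQC"] = toy["PoolQC"].fillna("Unknown")
# mode() returns a Series, so take its first entry to get a scalar fill value
toy["SaleType"] = toy["SaleType"].fillna(toy["SaleType"].mode().iloc[0])

print(toy)
```

Passing the whole `mode()` Series to `fillna` would align on the index and fill at most the first rows, which is why the scalar is extracted first.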

In [3]:
# Split the columns by dtype
num_cols = df.select_dtypes([np.number]).columns
category_cols = df.select_dtypes(['object']).columns

# Numerical features to be filled with the median
median_fill = ['LotFrontage', 'LotArea', 'OverallQual',
        'OverallCond', 'YearBuilt', 'YearRemodAdd',
        'GrLivArea', 'GarageYrBlt', 'GarageArea', 'MoSold', 'YrSold']

for col in median_fill:
    df[col] = df[col].fillna(df[col].median())

# Remaining numerical features: a missing value is treated as "not present" and filled with 0
for col in num_cols:
    df[col] = df[col].fillna(0)

# ------------------------------------------------------------

# Categorical features to be filled with 'Unknown' wherever NA.
# MasVnrArea and GarageYrBlt are numerical and were already filled above, so they are not listed here.
none_fill = ['Alley', 'Utilities', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
             'Electrical', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
             'PoolQC', 'Fence', 'MiscFeature', 'MSZoning', 'Exterior1st', 'Exterior2nd', 'KitchenQual', 'Functional',
             'SaleType']

for col in none_fill:
    df[col] = df[col].fillna('Unknown')

# Any remaining categorical gaps get the most frequent value.
# mode() returns a Series, so take its first entry to get a scalar.
for col in category_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

    
missing_values = df.isnull().sum()
missing_values[missing_values>0]

Series([], dtype: int64)

### Scaling Numerical Features

In [4]:
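from sklearn.preprocessing import StandardScaler

# num_cols also contains Id and SalePrice, so the target is standardized as well;
# the error metrics reported later are therefore in standardized units, not dollars
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])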
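
The cell above fits the scaler on the full frame, so rows that later land in the test split contribute to the mean and standard deviation. One common alternative, sketched here on synthetic data with hypothetical variable names, is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # synthetic features

X_tr, X_te = train_test_split(X_demo, test_size=0.15, random_state=0)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)   # learn mean/std from training rows only
X_te_scaled = scaler_demo.transform(X_te)       # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is the expected behaviour.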


### One-Hot Encoding Categorical Features

In [5]:
df = pd.get_dummies(df, columns=category_cols, drop_first=True)

df.isnull().sum()[df.isnull().sum()>0]

Series([], dtype: int64)
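
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame with hypothetical values. A column with two categories produces a single indicator column, because the first category (in sorted order) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame: one categorical and one numerical column (illustration only)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

encoded = pd.get_dummies(toy, columns=["Street"], drop_first=True)

# 'Street_Grvl' is dropped; a row is gravel exactly when 'Street_Pave' is false
print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters most for the linear baseline.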

#### Model Training

The Scikit-Learn library offers many models, but since we are predicting a numerical value (in our case, 'SalePrice'), we need a regressor. 
No single regression model is best for all cases; it depends on the data. It's best to try out several models, compare their performance, and then cross-validate. 
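
A minimal sketch of such a comparison with cross-validation is shown below. It uses synthetic data from `make_regression` as a stand-in for the housing features, and the candidate list and settings are assumptions for illustration, not the configuration used in the rest of this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated R^2: every fold is held out exactly once
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Averaging over folds gives a steadier comparison than a single train/test split, at the cost of fitting each model several times.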

#### Splitting data into Training and Testing

In [6]:
from sklearn.model_selection import train_test_split

y = df['SalePrice']
df = df.drop('SalePrice' , axis=1)
X = df

X_train , X_test , y_train , y_test = train_test_split(X , y , test_size=0.15 , random_state=69 )

## Linear Regressor (Baseline)

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

linear_reg = LinearRegression()
linear_reg.fit(X_train,y_train)

y_pred_lin_reg = linear_reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred_lin_reg)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_lin_reg)
r2 = r2_score(y_test, y_pred_lin_reg)

print("Linear Regression Model Performance Stats:\n")

# SalePrice was standardized, so these errors are in standardized units, not percentages
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R^2 Score: {r2}")

Linear Regression Model Performance Stats:

MSE: 0.5609
RMSE: 0.7490
MAE: 0.4920
R^2 Score: 0.32366669685428606
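
Because SalePrice is one of the numerical columns passed to the scaler, the error metrics in this notebook are measured in standard deviations of SalePrice rather than in dollars. The sketch below shows how a standardized RMSE could be converted back; the prices and errors are hypothetical, and `price_scaler` is a name introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sale prices (illustration only)
prices = np.array([[120000.0], [180000.0], [150000.0], [250000.0], [200000.0]])

price_scaler = StandardScaler()
prices_scaled = price_scaler.fit_transform(prices)

# Suppose a model's predictions, in standardized units, are off by these amounts
scaled_errors = np.array([0.5, -0.2, 0.1, -0.4, 0.3])
rmse_scaled = np.sqrt(np.mean(scaled_errors ** 2))

# scale_ holds the standard deviation used for scaling; multiplying by it
# converts an error from standardized units back to price units
rmse_price_units = rmse_scaled * price_scaler.scale_[0]

print(f"RMSE: {rmse_scaled:.4f} standard deviations = {rmse_price_units:.2f} in price units")
```

The same factor applies to MAE, while MSE needs the squared factor. R^2 is unaffected by the scaling.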


## Decision Tree Regressor

In [8]:
from sklearn.tree import DecisionTreeRegressor

decision_tree = DecisionTreeRegressor()
decision_tree.fit(X_train,y_train)

y_pred_dec_tree = decision_tree.predict(X_test)



mse = mean_squared_error(y_test, y_pred_dec_tree)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_dec_tree)
r2 = r2_score(y_test, y_pred_dec_tree)

print("Decision Tree Regressor Model Performance Stats: \n")

# SalePrice was standardized, so these errors are in standardized units, not percentages
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R^2 Score: {r2}")



Decision Tree Regressor Model Performance Stats: 

MSE: 0.2039
RMSE: 0.4516
MAE: 0.2571
R^2 Score: 0.7541037773506954


## Random Forest Regressor


In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

random_forest = RandomForestRegressor(random_state=69)

random_forest.fit(X_train,y_train)
y_pred = random_forest.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Random Forest Regressor (default parameters) Model Performance Stats: \n")

# SalePrice was standardized, so these errors are in standardized units, not percentages
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R^2 Score: {r2}")

print('\n\n\n')


# Note: this grid only offers 'sqrt' and 'log2' for max_features, so the regressor's
# default of considering all features at every split is not among the candidates
parameter_grid = {
    'n_estimators': [1000],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2']
}


grid_search = GridSearchCV(estimator=random_forest, param_grid=parameter_grid, cv=10, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best candidate, refitted on the whole training split
best_rf = grid_search.best_estimator_

y_pred_ran_for = best_rf.predict(X_test)

mse = mean_squared_error(y_test, y_pred_ran_for)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_ran_for)
r2 = r2_score(y_test, y_pred_ran_for)

print("Random Forest Regressor (tuned with GridSearchCV) Model Performance Stats: \n")

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R^2 Score: {r2}")


Random Forest Regressor (default parameters) Model Performance Stats: 

MSE: 0.0990
RMSE: 0.3147
MAE: 0.1655
R^2 Score: 0.880604213402168




Random Forest Regressor (tuned with GridSearchCV) Model Performance Stats: 

MSE: 0.2101
RMSE: 0.4584
MAE: 0.3102
R^2 Score: 0.7466930860929133
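
The second result comes from the grid-searched forest, and it scores lower on the test split than the first, default forest (R^2 of about 0.75 versus 0.88). One thing worth checking is what the search actually selected and how its cross-validated score compares with the test score. The sketch below does this on synthetic data with a deliberately small, hypothetical grid that also includes `max_features=1.0` (all features).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the housing features (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=0)

# Small illustrative grid; 1.0 means "consider all features at each split"
demo_grid = {"max_features": ["sqrt", 1.0], "min_samples_leaf": [1, 2]}

search = GridSearchCV(RandomForestRegressor(n_estimators=30, random_state=0),
                      param_grid=demo_grid, cv=3, scoring="r2")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated R^2 of the best candidate: {search.best_score_:.3f}")
print(f"R^2 on the held-out test split: {search.score(X_te, y_te):.3f}")
```

If the default configuration is not part of the grid, the search cannot return it, so a tuned model can end up worse than the untuned one.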


## Gradient Boosting Regressor

In [10]:
from sklearn.ensemble import GradientBoostingRegressor

gradient_boosting = GradientBoostingRegressor()
gradient_boosting.fit(X_train,y_train)

y_pred_gbr = gradient_boosting.predict(X_test)

mse = mean_squared_error(y_test, y_pred_gbr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_gbr)
r2 = r2_score(y_test, y_pred_gbr)

print("Gradient Boosting Regressor Model Performance Stats: \n")

# SalePrice was standardized, so these errors are in standardized units, not percentages
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R^2 Score: {r2}")



Gradient Boosting Regressor Model Performance Stats: 

MSE: 0.0990
RMSE: 0.3147
MAE: 0.1909
R^2 Score: 0.8806156871950728


## Support Vector Regressor 

In [11]:
from sklearn.svm import SVR

svr = SVR()
svr.fit(X_train,y_train)

y_pred_svr = svr.predict(X_test)

mse = mean_squared_error(y_test, y_pred_svr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_svr)
r2 = r2_score(y_test, y_pred_svr)

print("Support Vector Regressor Model Performance Stats: \n")

# SalePrice was standardized, so these errors are in standardized units, not percentages
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R^2 Score: {r2}")



Support Vector Regressor Model Performance Stats: 

MSE: 0.1859
RMSE: 0.4312
MAE: 0.2557
R^2 Score: 0.7758372238176671
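
The objective of observing feature impact can be approached with the `feature_importances_` attribute that tree ensembles expose after fitting. The sketch below uses synthetic data in which only two of five features carry signal; the feature names are hypothetical placeholders rather than columns of the housing dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only 2 of the 5 features are informative (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, n_informative=2,
                                 noise=5.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]   # hypothetical names

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; a larger value means the feature
# contributed more to reducing the error across the trees' splits
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

With one-hot encoded data, each indicator column gets its own importance, so the values of one original categorical feature would need to be summed to judge that feature as a whole.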
