# House Price Prediction using Multiple Linear Regression

<center><img src="https://raw.githubusercontent.com/Masterx-AI/Project_Housing_Price_Prediction_/main/hs.jpg" style="width: 700px;"/></center>

Description:

A simple yet challenging project: predict the housing price from factors such as house area, number of bedrooms, furnishing status, and nearness to a main road. The dataset is small, yet its complexity arises from strong multicollinearity. Can you overcome these obstacles and build a decent predictive model?

## Import libraries 🐍

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Step 1: Understand the Dataset

Load the dataset

In [None]:
df = pd.read_csv("house_prices.csv")

Show 10 random samples

In [None]:
# Random 10 samples of data
df.sample(10)

Display dataset information

In [None]:
# Data Information
df.info()

Show dataset dimensions

In [None]:
df.shape # data shape

Show dataset statistical summary

In [None]:
df.describe() # data stats

Check for null values

In [None]:
df.isnull().sum() # null values check

Check for duplicate values

In [None]:
df.duplicated().sum() # duplicate values check

## Step 2: Visualize the Dataset

Check the relationship between area and price using a scatterplot

In [None]:
# Put your answer here
sns.scatterplot(y=df['price'],x=df['area'],hue=df['furnishingstatus'])

Check the relationships between the independent and dependent variables using `.pairplot()`

In [None]:
sns.pairplot(df,hue="furnishingstatus")

## Step 3: Perform necessary data pre-processing

Create a duplicate of the original dataset

In [None]:
# Put your answer here
data=df.copy()

Convert the categorical columns into numbers/dummy variables by using the `get_dummies()` method.

- furnishingstatus
- mainroad
- guestroom
- basement
- hotwaterheating
- airconditioning
- prefarea

In [None]:
status = pd.get_dummies(data[['furnishingstatus','mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']])

In [None]:
status

Concatenate the converted columns to the dataframe copy using `.concat()`

In [None]:
data = pd.concat([data, status], axis = 1)

In [None]:
data.head()

Remove the categorical columns using `.drop()`.

- furnishingstatus
- mainroad
- guestroom
- basement
- hotwaterheating
- airconditioning
- prefarea

In [None]:
# Put your answer here
data.drop(['furnishingstatus','mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea'], axis = 1, inplace = True)

In [None]:
data.head()

## Step 4: Feature Selection

Check for multicollinearity between the features/independent variables using `.corr()`

In [None]:
# Put your answer here
correlation = data.corr()
correlation

Visualize the correlation by using a heatmap.

In [None]:
plt.figure(figsize=[25,20])
sns.heatmap(correlation, annot=True, vmin=-1, vmax=1, center=0)
plt.show()

By looking at the matrix, choose the independent variables to use in the model for predicting the house price.

When choosing independent variables, we need to make sure that:
1. No two selected variables are redundant, that is, strongly correlated with each other.
2. Every selected variable has a correlation with the dependent variable.
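
The correlation matrix only shows pairwise relationships. As a complementary check, a variance inflation factor (VIF) can be computed for each feature by regressing it on all the others. The sketch below is a minimal NumPy version on synthetic data; the function name `vif`, the array `X_demo`, and the column names are illustrative assumptions and are not part of the notebook.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        # A (numerically) perfect fit means the column is fully redundant.
        out.append(np.inf if r2 >= 1.0 - 1e-12 else 1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic stand-ins for two numeric features and one yes/no dummy pair.
rng = np.random.default_rng(0)
area = rng.normal(size=200)
other = rng.normal(size=200)
mainroad_yes = (rng.random(200) < 0.8).astype(float)
mainroad_no = 1.0 - mainroad_yes  # exact complement of mainroad_yes

X_demo = np.column_stack([area, other, mainroad_yes, mainroad_no])
vifs = vif(X_demo)
print(vifs)
```

In this sketch the two complementary dummy columns come out with an infinite VIF, because each one is fully determined by the other. This is why keeping both halves of a yes/no pair in the same model is redundant.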



## Step 5: Building the Model

In [None]:
print(data.columns)

Split the data set into a training and test set

In [None]:
# Put your answer here
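limit = 13      # maximum number of columns allowed in one candidate feature set
at_least = 1    # minimum number of columns required in one candidate feature set
# steps[i] is how far to advance after choosing column i, so that at most one
# dummy column is taken from each categorical group (one group of 3, then pairs)
steps = [1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]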

# Try every possible combination of columns and store it in a list
def tryAll(columns, curr, i):
    if (i >= len(columns)):
        return [curr]
    
    res = []

    # choose and move on
    if (len(curr) < limit):
        res += tryAll(columns, curr + [columns[i]], i+steps[i])
    # move on
    if (len(curr) + (len(columns) - i - 1) >= at_least):
        res += tryAll(columns, curr, i + 1)

    return res

X = data.drop(['price'], axis=1)
X_COL = tryAll([i for i in X.columns], [], 0)
X_DATA = [data[i] for i in X_COL]
y = data['price']

splitted_res = []
for sublist in X_DATA:
    X_train,X_test,y_train,y_test = train_test_split(sublist,y,test_size=0.3,random_state=12353)
    splitted_res.append([X_train,X_test,y_train,y_test])
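
The recursive search above can also be expressed with `itertools.product`, which may be easier to read. The sketch below is an alternative formulation, not the notebook's method: every free column is either in or out, and every dummy group contributes at most one of its columns. The function name `enumerate_feature_sets` and the column names are hypothetical, and the sets come out in a different order than `tryAll` produces.

```python
from itertools import product

def enumerate_feature_sets(free_columns, dummy_groups):
    """Yield every non-empty feature set with at most one column per dummy group."""
    # Each free column is either skipped (None) or included.
    free_options = [(None, col) for col in free_columns]
    # Each dummy group contributes nothing or exactly one of its columns.
    group_options = [(None, *group) for group in dummy_groups]
    for choice in product(*free_options, *group_options):
        selected = [col for col in choice if col is not None]
        if selected:  # mirrors at_least = 1 by dropping the empty set
            yield selected

# Hypothetical, shortened column layout for illustration only.
free = ["area", "bedrooms"]
groups = [
    ["mainroad_no", "mainroad_yes"],
    ["furnished", "semi_furnished", "unfurnished"],
]
feature_sets = list(enumerate_feature_sets(free, groups))
print(len(feature_sets))  # 2 * 2 * 3 * 4 - 1 = 47
```

If the `steps` list is read as five free columns, one three-column group, and six two-column groups, the same counting rule gives 2^5 · 4 · 3^6 − 1 = 93,311 candidate feature sets, each of which is split, scaled, and fitted in the cells that follow.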

In [None]:
print(X_COL[-45])

Apply scaling on the independent variables in the training and test set using the `MinMaxScaler()` method.

In [None]:
# Put your answer here
# Apply scaling to all the possible columns
scalers = [MinMaxScaler() for _ in range(len(splitted_res))]

for i in range(len(splitted_res)):
    scalers[i].fit(splitted_res[i][0])

X_train_scaled_vals, X_test_scaled_vals = [], []
for i in range(len(splitted_res)):
    X_train_scaled_vals.append(scalers[i].transform(splitted_res[i][0]))
    X_test_scaled_vals.append(scalers[i].transform(splitted_res[i][1]))
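
With its default settings, `MinMaxScaler` rescales each column to `(x - min) / (max - min)`, using the minimum and maximum learned during `fit`. The sketch below reproduces that arithmetic in plain NumPy on made-up numbers to show why the scaler is fitted on the training split only; the arrays and the helper `min_max` are illustrative assumptions.

```python
import numpy as np

# Made-up training and test rows with two features each.
X_train_demo = np.array([[1000.0, 2.0], [3000.0, 4.0], [2000.0, 3.0]])
X_test_demo = np.array([[4000.0, 1.0]])

# Learn the per-column range from the training split only.
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)

def min_max(X):
    """Apply the training-split range to any array with the same columns."""
    return (X - col_min) / (col_max - col_min)

train_scaled = min_max(X_train_demo)
test_scaled = min_max(X_test_demo)
print(train_scaled)
print(test_scaled)  # test values may fall outside [0, 1]
```

Test rows that lie outside the training range map to values below 0 or above 1. That is expected, and it keeps information about the test split from leaking into the preprocessing step.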

Create a new dataframe containing the unscaled features

In [None]:
'''
unscaled_df = pd.DataFrame(X_train, columns=X.columns)
unscaled_df.head()
'''

Create a new dataframe containing the scaled features

In [None]:
'''
scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns)
scaled_df.head()
'''

Use `LinearRegression()` as our algorithm for our model

In [None]:
# Put your answer here
models = [LinearRegression() for _ in range(len(splitted_res))]

Train our model using the training set.

In [None]:
# Put your answer here
for i in range(len(splitted_res)):
    models[i].fit(X_train_scaled_vals[i], splitted_res[i][2])

Test the performance of the model using the test set

In [None]:
# Put your answer here

y_preds = []
for i in range(len(splitted_res)):
    y_preds.append(models[i].predict(X_test_scaled_vals[i]))


## Coefficient of Determination ($R^2$)

The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by the independent variables. For a least-squares fit that includes an intercept, $R^2$ is calculated as below:

$$ R^2 = \frac{\sum(\hat{Y}_i-\bar{Y})^2}{\sum(Y_i-\bar{Y})^2} $$
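
The formula can be checked numerically. The sketch below fits a one-feature least-squares line with an intercept on synthetic data and compares the explained-variation form above with the equivalent residual form, 1 − SS_res / SS_tot. All names (`x`, `y`, `A`, `beta`) and the data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 + 2.0 * x + rng.normal(size=100)

# Design matrix: an intercept column plus one feature.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

ss_tot = ((y - y.mean()) ** 2).sum()      # total variation
ss_reg = ((y_hat - y.mean()) ** 2).sum()  # explained variation
ss_res = ((y - y_hat) ** 2).sum()         # unexplained variation

r2_explained = ss_reg / ss_tot
r2_residual = 1.0 - ss_res / ss_tot
print(r2_explained, r2_residual)
```

The two values agree only because the model has an intercept and is scored on the data it was fitted to. Without an intercept, or on held-out data, the two forms can differ.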

In statsmodels, we can obtain the $R^2$ value of a fitted model by accessing its `.rsquared` attribute.

In [None]:
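# Put your answer here
# Note: this fits a separate OLS model on each test split, with no intercept
# column (sm.OLS adds none by default); it does not score the
# LinearRegression models trained above.
olsmods = [sm.OLS(splitted_res[i][3], X_test_scaled_vals[i]).fit() for i in range(len(splitted_res))]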

'''
olsmod.summary()
'''

In [None]:
r_squareds = []

for i in range(len(splitted_res)):
    r_squareds.append([olsmods[i].rsquared, i])

r_squareds = sorted(r_squareds, key=lambda x: x[0])

For a least-squares fit with an intercept, evaluated on the data it was fitted to, $R^2$ ranges between 0 and 1, where $R^2=0$ means there is no linear relationship between the variables and $R^2=1$ shows a perfect linear relationship.

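In our case, the best $R^2$ score is about 0.9471, which would mean that 94.71% of the variation in our dependent variable can be explained by our independent variables. Because the OLS models above are fitted on the test split without an intercept term, statsmodels reports the uncentered $R^2$ here, which tends to be higher than the usual centered value and should not be read as held-out accuracy.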

In [None]:
ans = r_squareds[-1]
print(ans[0])
print(X_COL[ans[1]])
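
To see how much the choice of $R^2$ matters, the sketch below contrasts two numbers on synthetic data: (a) the held-out $R^2$ of a model fitted on the training split and scored on the test split, and (b) the uncentered $R^2$ of a no-intercept model refitted on the test split, which is the quantity statsmodels reports when no constant is included. The data, the split, and all names are illustrative assumptions. In the notebook itself, (a) would correspond to calling `r2_score` on each test target and its entry in `y_preds`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.random((n, 2))  # two features already in [0, 1]
# Target with a large mean and only a weak dependence on the features.
y = 50.0 + 1.0 * X[:, 0] + rng.normal(scale=5.0, size=n)

X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

# (a) Fit with an intercept on the training split, score on the test split.
A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
pred = np.column_stack([np.ones(len(X_te)), X_te]) @ beta
held_out_r2 = 1.0 - ((y_te - pred) ** 2).sum() / ((y_te - y_te.mean()) ** 2).sum()

# (b) Refit on the test split with no intercept and use the uncentered
# total sum of squares, sum(y^2), in the denominator.
gamma, *_ = np.linalg.lstsq(X_te, y_te, rcond=None)
resid = y_te - X_te @ gamma
uncentered_r2 = 1.0 - (resid ** 2).sum() / (y_te ** 2).sum()

print("held-out R^2:  ", held_out_r2)
print("uncentered R^2:", uncentered_r2)
```

In this setup the uncentered value comes out far higher than the held-out value even though the features carry little signal, because the large mean of the target inflates the denominator. Ranking feature sets by held-out $R^2$ is therefore the safer comparison.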