# Predicting House Prices using Regression Models

## Importing the Dataset and Data Preprocessing

Using the house sales dataset available on [Kaggle](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction), we will build multiple linear regression models, with and without polynomial features and Ridge regularization, to predict the price of a house from its features.

In [39]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Importing the dataset (expects kc_house_data.csv in the working directory)
file_name = 'kc_house_data.csv'
file_path = os.path.join(os.getcwd(), file_name)
df = pd.read_csv(file_path, header=0)
df.head()


|   | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [40]:
# Dropping the 'id' column, which carries no predictive information
df.drop('id', axis=1, inplace=True)
# Checking for missing values
df.isnull().sum()


date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [3]:
df.describe()

|   | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 |
| mean | 540088.1 | 3.370842 | 2.114757 | 2079.899736 | 15106.97 | 1.494309 | 0.007542 | 0.234303 | 3.40943 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 367127.2 | 0.930062 | 0.770163 | 918.440897 | 41420.51 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.67924 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 75000.0 | 0.0 | 0.0 | 290.0 | 520.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 290.0 | 0.0 | 1900.0 | 0.0 | 98001.0 | 47.1559 | -122.519 | 399.0 | 651.0 |
| 25% | 321950.0 | 3.0 | 1.75 | 1427.0 | 5040.0 | 1.0 | 0.0 | 0.0 | 3.0 | 7.0 | 1190.0 | 0.0 | 1951.0 | 0.0 | 98033.0 | 47.471 | -122.328 | 1490.0 | 5100.0 |
| 50% | 450000.0 | 3.0 | 2.25 | 1910.0 | 7618.0 | 1.5 | 0.0 | 0.0 | 3.0 | 7.0 | 1560.0 | 0.0 | 1975.0 | 0.0 | 98065.0 | 47.5718 | -122.23 | 1840.0 | 7620.0 |
| 75% | 645000.0 | 4.0 | 2.5 | 2550.0 | 10688.0 | 2.0 | 0.0 | 0.0 | 4.0 | 8.0 | 2210.0 | 560.0 | 1997.0 | 0.0 | 98118.0 | 47.678 | -122.125 | 2360.0 | 10083.0 |
| max | 7700000.0 | 33.0 | 8.0 | 13540.0 | 1651359.0 | 3.5 | 1.0 | 4.0 | 5.0 | 13.0 | 9410.0 | 4820.0 | 2015.0 | 2015.0 | 98199.0 | 47.7776 | -121.315 | 6210.0 | 871200.0 |


In [4]:
df.dtypes

date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
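
The `date` column is the only non-numeric column (dtype `object`), so it is left out of the correlation analysis that follows. If the sale date were to be used as a feature, it could be parsed as in the sketch below. The sample values are copied from the `df.head()` output; the column names `sale_year` and `sale_month` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Two sample values copied from the `date` column shown in df.head()
sample = pd.DataFrame({"date": ["20141013T000000", "20150225T000000"]})

# Parse the compact timestamp strings into proper datetime values
sample["date"] = pd.to_datetime(sample["date"], format="%Y%m%dT%H%M%S")

# Derived numeric columns (hypothetical names) that a model could consume
sample["sale_year"] = sample["date"].dt.year
sample["sale_month"] = sample["date"].dt.month
print(sample.dtypes)
```

On the real data, the same call would be applied to `df['date']` before any feature selection.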

## Feature Selection

In [6]:
# Finding features that are most correlated with Price to train the model
df.corr(numeric_only=True)['price'].sort_values()

zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308350
sqft_basement    0.323816
view             0.397293
bathrooms        0.525138
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64

In [8]:
# Selecting the features whose correlation with price is above roughly 0.25
features = ['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms', 'view',
            'sqft_basement', 'bedrooms', 'lat', 'waterfront', 'floors']
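
The hand-picked list matches the features whose correlation with `price` exceeds about 0.25. A minimal sketch of deriving the same list programmatically follows; the threshold value is an assumption inferred from the list, and the correlations are copied from the output above so the snippet runs without the CSV file.

```python
import pandas as pd

# Correlations with `price`, copied from the output above
corr_with_price = pd.Series({
    "zipcode": -0.053203, "long": 0.021626, "condition": 0.036362,
    "yr_built": 0.054012, "sqft_lot15": 0.082447, "sqft_lot": 0.089661,
    "yr_renovated": 0.126434, "floors": 0.256794, "waterfront": 0.266369,
    "lat": 0.307003, "bedrooms": 0.308350, "sqft_basement": 0.323816,
    "view": 0.397293, "bathrooms": 0.525138, "sqft_living15": 0.585379,
    "sqft_above": 0.605567, "grade": 0.667434, "sqft_living": 0.702035,
    "price": 1.000000,
})

threshold = 0.25  # assumed cut-off, inferred from the hand-picked list

# Drop the target itself, keep strong correlations, strongest first
selected = (
    corr_with_price.drop("price")
    .loc[lambda s: s.abs() > threshold]
    .sort_values(ascending=False)
    .index.tolist()
)
print(selected)
```

In the notebook itself, `corr_with_price` would simply be `df.corr(numeric_only=True)['price']`.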

## Multiple Linear Regression Model

In [50]:
# Selecting X and Y for training and standardizing the features
# Note: the scaler is fitted on the full dataset before the split; fitting it
# on the training split only would avoid leaking test-set statistics.
scaler = StandardScaler()
X = scaler.fit_transform(df[features])
Y = df[['price']]
# Splitting the dataset into training (85%) and testing (15%) data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=2)
# Training a Multiple Linear Regression model
mlr = LinearRegression()
mlr.fit(X_train, Y_train)
# For regressors, score() returns the R2 (coefficient of determination)
mlr_accuracy = mlr.score(X_test, Y_test)
print(f"The multiple Linear Regression model has an R2 score of {mlr_accuracy:.3f} on the testing dataset.")

The multiple Linear Regression model has an R2 score of 0.683 on the testing dataset.
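
The cell above standardizes the features before splitting, so the scaler also sees the test rows. One common way to avoid this, sketched below, is to split first and place the `StandardScaler` inside a `Pipeline`. The data here is synthetic and the names `X_demo`, `y_demo`, and `scaled_mlr` are illustrative assumptions, not part of the original notebook. For plain least-squares regression the R2 score should not depend on feature scaling, but for the Ridge model used later the scale does matter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df[features] and df['price'] (not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first, so the test rows never influence the scaler statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=2
)

scaled_mlr = Pipeline([
    ("scale", StandardScaler()),    # fitted on the training split only
    ("model", LinearRegression()),
])
scaled_mlr.fit(X_tr, y_tr)
print(f"R2 on held-out data: {scaled_mlr.score(X_te, y_te):.3f}")
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refitted inside every cross-validation fold.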


## Multiple Linear Regression with Polynomial Features

In [49]:
# Creating a pipeline to have polynomial features in the regression model
# (named poly_steps rather than input, which would shadow the Python builtin)
poly_steps = [('poly', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]
pipeline = Pipeline(poly_steps)
# Performing a 5-fold grid search to obtain the best degree for the polynomial features
poly_parameters = {'poly__degree': [1, 2, 3, 4]}
mlr_poly = GridSearchCV(pipeline, poly_parameters, cv=5)
mlr_poly.fit(X_train, Y_train)
# score() evaluates the best estimator, refitted on the whole training set
mlr_poly_accuracy = mlr_poly.score(X_test, Y_test)
mlr_poly_best_degree = mlr_poly.best_params_['poly__degree']
print(f"The best Multiple Linear Regression model uses polynomial features of degree {mlr_poly_best_degree}\nand has an R2 score of {mlr_poly_accuracy:.3f} on the testing dataset.")


The best Multiple Linear Regression model uses polynomial features of degree 2
and has an R2 score of 0.755 on the testing dataset.


## Ridge Regression with Polynomial Features

In [47]:
# Creating a pipeline that feeds polynomial features into a Ridge regression model
ridge_input = [('ridge_poly', PolynomialFeatures(include_bias=False)), ('ridge', Ridge())]
ridge_pipeline = Pipeline(ridge_input)
# Performing a 5-fold grid search over alpha and the degree of the polynomial features
ridge_parameters = {'ridge_poly__degree': [1, 2, 3, 4], 'ridge__alpha': [0.01, 0.1, 1, 10, 100]}
ridge_poly = GridSearchCV(ridge_pipeline, ridge_parameters, cv=5)
ridge_poly.fit(X_train, Y_train)
ridge_poly_accuracy = ridge_poly.score(X_test, Y_test)
print(f"The Ridge Regression model with polynomial features has an R2 score of {ridge_poly_accuracy:.3f} on the testing dataset.")
print("The best parameters for the model are:\n", ridge_poly.best_params_)

The Ridge Regression model with polynomial features has an R2 score of 0.755 on the testing dataset.
The best parameters for the model are:
 {'ridge__alpha': 100, 'ridge_poly__degree': 2}
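
The `alpha` parameter controls how strongly Ridge regression penalizes large coefficients. The sketch below illustrates that effect on synthetic data using the same alpha grid as the search above; the data and the names `X_demo`, `y_demo`, and `norms` are assumptions made for illustration and are unrelated to the house sales dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data for illustration only (not the house sales dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([4.0, -3.0, 2.0, 0.0, 1.0]) + rng.normal(size=50)

# Fit one model per alpha and record the size of its coefficient vector
norms = {}
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>6}: ||coef|| = {norms[alpha]:.3f}")
```

Larger values of `alpha` shrink the coefficients toward zero, trading some training fit for lower variance.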


## Results and Conclusion

| Model                                                              | R2 Score |
|--------------------------------------------------------------------|----------|
| Multiple Linear Regression                                         | 0.683    |
| Multiple Linear Regression with Polynomial Features (degree 2)     | 0.755    |
| Ridge Regression with Polynomial Features (degree 2, alpha = 100)  | 0.755    |

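The Multiple Linear Regression model with polynomial features and the Ridge Regression model with polynomial features perform best at predicting unseen house prices in this dataset: both reach an R2 score of 0.755 on the testing data. Ridge regularization therefore brings no measurable gain over the unregularized degree-2 model here, and since the selected alpha of 100 is the largest value in the search grid, a wider grid may be worth exploring.

The Multiple Linear Regression model without polynomial features performs worst, with an R2 score of 0.683.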
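
For reference, the R2 score reported throughout is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up prices; the helper name `r2_score_manual` and the sample values are hypothetical and only serve to illustrate the formula.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Toy prices, made up for illustration
y_true = [200_000, 300_000, 400_000, 500_000]
y_pred = [210_000, 290_000, 420_000, 480_000]
print(round(r2_score_manual(y_true, y_pred), 3))
```

A score of 1 means perfect predictions, while a model that always predicts the mean price scores 0.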

## Author

Amogha Subramanya