# Housing Price Prediction

## Using Tree-Based Regression Models

### About the dataset

**Dataset source:** [kaggle.com](https://www.kaggle.com/datasets/harishkumardatalab/housing-price-prediction) 

>This dataset provides key features for predicting house prices, including area, bedrooms, bathrooms, stories, amenities like air conditioning and parking, and information on furnishing status. It enables analysis and modelling to understand the factors impacting house prices and develop accurate predictions in real estate markets.

![dataset cover image](./images/dataset-cover.jpeg)

The dataset has 13 columns:

1. **Price:** The price of the house.
2. **Area:** The total area of the house in square feet.
3. **Bedrooms:** The number of bedrooms in the house.
4. **Bathrooms:** The number of bathrooms in the house.
5. **Stories:** The number of stories in the house.
6. **Mainroad:** Whether the house is connected to the main road (Yes/No).
7. **Guestroom:** Whether the house has a guest room (Yes/No).
8. **Basement:** Whether the house has a basement (Yes/No).
9. **Hot water heating:** Whether the house has a hot water heating system (Yes/No).
10. **Airconditioning:** Whether the house has an air conditioning system (Yes/No).
11. **Parking:** The number of parking spaces available within the house.
12. **Prefarea:** Whether the house is located in a preferred area (Yes/No).
13. **Furnishing status:** The furnishing status of the house (Fully Furnished, Semi-Furnished, Unfurnished).

In [5]:
# import all the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import xgboost as xgb
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split as split
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

In [None]:
# Read the csv file

df = pd.read_csv('./dataset/Housing.csv')
df

### Data Cleaning and Preprocessing

Get general information about the dataset

In [2]:
df.info()

The `df.info()` summary shows that no column contains null values, but several columns have the `object` dtype. These columns must be encoded as numbers, because machine learning models are mathematical models that, at their core, can only process numerical data.
Before encoding, a copy of the dataset is taken for later use in visualization.
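
As a minimal sketch of the checks described above, the following uses a small made-up frame in place of `Housing.csv`. The names `toy`, `text_cols` and `toy_raw` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy frame standing in for Housing.csv (values are illustrative only).
toy = pd.DataFrame({
    "price": [13300000, 12250000, 9100000],
    "area": [7420, 8960, 6000],
    "mainroad": ["yes", "yes", "no"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# Columns that are not numeric hold text and need encoding before modelling.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Count missing values per column, mirroring what df.info() reports.
missing_per_column = toy.isnull().sum()

# Keep an independent, unencoded copy for later visualization.
toy_raw = toy.copy()

print(text_cols)
print(missing_per_column)
```

Calling `.copy()` matters here: a plain assignment would only create a second name for the same frame, so later encoding would also change the "raw" version.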

>For the categorical data, count plots show how the observations are distributed across the categories. Spotting heavily imbalanced categories early may help avoid overfitting to rare values later.

In [None]:
# one count plot per categorical or discrete feature, filled row by row
count_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
              'parking', 'prefarea', 'furnishingstatus', 'bedrooms',
              'stories', 'airconditioning', 'bathrooms']
fig, ax = plt.subplots(6, 2, figsize=(10, 22))
for axis, col in zip(ax.flat, count_cols):
    sns.countplot(x=col, data=df, ax=axis)
# 11 plots on a 6x2 grid leave the last cell unused, so hide it
ax[5, 1].set_visible(False)
fig.savefig('./images/countplot.png')

Next, scatter plots show how these categories are distributed against each numerical feature, starting with area.
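
One way such a scatter plot could be drawn is sketched below, using plain matplotlib and a small made-up frame. The column names follow the dataset, but the values and the choice of `airconditioning` as the grouping column are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "area": [7420, 8960, 6000, 3500, 4000, 9960],
    "price": [13300000, 12250000, 9100000, 4200000, 5250000, 12250000],
    "airconditioning": ["yes", "no", "yes", "no", "no", "yes"],
})

fig, ax = plt.subplots(figsize=(6, 4))
# One scatter call per category so each gets its own color and legend entry.
for label, group in toy.groupby("airconditioning"):
    ax.scatter(group["area"], group["price"], label=f"airconditioning = {label}")
ax.set_xlabel("area")
ax.set_ylabel("price")
ax.legend()
```

Repeating the loop body for each categorical column, or passing the column to the `hue` argument of `sns.scatterplot`, gives the same kind of view for the other features.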

#### Insight

Apart from area and price, every feature in the dataset is categorical or a small discrete count. Since machine learning models are built on numerical values, the categorical values must be encoded.
>One-hot encoding is used for features whose categories have no natural order, for example mainroad, guestroom, basement, hotwaterheating, prefarea and airconditioning. It adds new columns and assigns each of them a value of 1 or 0.

>For furnishingstatus, LabelEncoder is used because its categories have an order, i.e. unfurnished < semi-furnished < furnished. Note that LabelEncoder assigns codes in sorted label order, so furnished becomes 0 and unfurnished becomes 2: the codes are monotonic but run opposite to the order stated here. Features like parking, bedrooms, stories and bathrooms are left as they are because they are already numerical.
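
The two encoding ideas can be sketched on a toy frame as follows. The explicit dictionary mapping is an alternative to `LabelEncoder`, not the approach used in this notebook; it is shown because it makes the intended order visible. The names `toy`, `furnishing_order` and `mainroad_dummies` are hypothetical.

```python
import pandas as pd

# Toy frame with one ordered and one unordered categorical column.
toy = pd.DataFrame({
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished"],
    "mainroad": ["yes", "no", "yes"],
})

# Explicit ordinal mapping: codes follow unfurnished < semi-furnished < furnished.
furnishing_order = {"unfurnished": 0, "semi-furnished": 1, "furnished": 2}
toy["furnishingstatus_code"] = toy["furnishingstatus"].map(furnishing_order)

# One-hot encode a yes/no column; drop_first keeps a single 0/1 column.
mainroad_dummies = pd.get_dummies(toy[["mainroad"]], drop_first=True, dtype=int)

print(toy)
print(mainroad_dummies)
```

For tree-based models the direction of an ordinal code rarely matters, since splits only depend on the ordering being consistent, but an explicit mapping avoids surprises when category names do not sort in the intended order.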

### Encode the data

In [None]:
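# label-encode furnishingstatus; .copy() avoids pandas' SettingWithCopyWarning
ordinal = df[['furnishingstatus']].copy()
le = LabelEncoder()
# LabelEncoder expects a 1-D array of labels, so pass the column, not the frame
ordinal['furnishingstatus'] = le.fit_transform(ordinal['furnishingstatus'])
ordinal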

In [None]:
nominal = df[['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'prefarea','airconditioning']]
nominal = pd.get_dummies(nominal, drop_first=True)
nominal

In [None]:
ordinal = pd.concat([ordinal,df[['parking','bedrooms','stories','bathrooms']]], axis=1)
ordinal

In [None]:
X = pd.concat([ordinal,nominal],axis=1)
X

In [None]:
Y = df.price
Y

### Split, Train, Test, Evaluate and Optimize

**Split the dataset**

In [None]:
x_train, x_test, y_train, y_test = split(X, Y, test_size=0.2)

In [None]:
# create the models
XGB = xgb.XGBRegressor()
decisiontree = DecisionTreeRegressor()
randomforest = RandomForestRegressor()

In [None]:
# train the models
XGB.fit(x_train, y_train)
decisiontree.fit(x_train, y_train)
randomforest.fit(x_train, y_train)

**Get the score for each model**

In [None]:
# score the models on the held-out test set (score() returns the R^2 value)
print("The R^2 score of the XG-Boost Regressor in predicting the housing prices is: ", XGB.score(x_test, y_test) * 100, "%")
print("The R^2 score of the Decision Tree Regressor in predicting the housing prices is: ", decisiontree.score(x_test, y_test) * 100, "%")
print("The R^2 score of the Random Forest Regressor in predicting the housing prices is: ", randomforest.score(x_test, y_test) * 100, "%")
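
A score computed on the training data mostly measures how well a model has memorized it, and tree models in particular can reach a near-perfect training score. The sketch below contrasts training and test scores on synthetic data; the data, the seed values and the names `X_demo`, `y_demo` and `tree` are assumptions made for illustration and are not part of the housing dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: one informative feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unpruned tree can fit the training rows almost exactly.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)  # R^2 on data the model has seen
test_r2 = tree.score(X_te, y_te)   # R^2 on held-out data
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

The gap between the two numbers is a rough indicator of overfitting, which is why the held-out score is the more informative one to report.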

### Optimize the models

In [3]:
# Try many train/test splits and keep the seed that gives the best test-set R^2.
# Note: 10000 fits per model is slow; lower n_seeds for a quicker run.
def best_split_seed(model, n_seeds=10000):
    best_score, best_seed = float('-inf'), None
    for i in range(n_seeds):
        x_train, x_test, y_train, y_test = split(X, Y, test_size=0.2, random_state=i)
        model.fit(x_train, y_train)
        score = model.score(x_test, y_test)
        if score > best_score:
            best_score, best_seed = score, i
    return best_score * 100, best_seed

# XG-Boost Regressor
XGBScore, XGBSeed = best_split_seed(XGB)

# DecisionTree Regressor
decisiontreeScore, decisiontreeSeed = best_split_seed(decisiontree)

# RandomForest Regressor
randomforestScore, randomforestSeed = best_split_seed(randomforest)


In [4]:
print("The best score for the XG-Boost Regressor is: {} with the seed value of: {}".format(XGBScore, XGBSeed))
print("The best score for the DecisionTree Regressor is: {} with the seed value of: {}".format(decisiontreeScore, decisiontreeSeed))
print("The best score for the RandomForest Regressor is: {} with the seed value of: {}".format(randomforestScore, randomforestSeed))
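
Keeping the best score over many seeds tends to report the luckiest split rather than typical performance. A common alternative, which is not what this notebook does, is k-fold cross-validation, which averages the score over several splits. The sketch below uses synthetic data; the data and the names `X_demo`, `y_demo`, `model` and `cv` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the encoded housing features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(150, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(0, 1.0, size=150)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Five shuffled folds: each row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print(f"mean R^2: {scores.mean():.3f} (std {scores.std():.3f})")
```

The mean gives a more stable estimate than any single split, and the standard deviation shows how much the score depends on which rows land in the test fold.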
