## **PREDICTING HOUSE PRICES WITH MACHINE LEARNING**

#### DESCRIPTION:
The objective of this project is to build a predictive model using linear regression to estimate a
numerical outcome based on a dataset with relevant features. Linear regression is a
fundamental machine learning algorithm, and this project provides hands-on experience in
developing, evaluating, and interpreting a predictive model.

#### **WORK BREAKDOWN**:
    1. Data Collection: Obtain a dataset with numerical features and a target variable for
    prediction.
    2. Data Exploration and Cleaning: Explore the dataset to understand its structure, handle
    missing values, and ensure data quality.
    3. Feature Selection: Identify relevant features that may contribute to the predictive model.
    4. Model Training: Implement linear regression using a machine learning library (e.g.,
    scikit-learn).
    5. Model Evaluation: Evaluate the model's performance on a separate test dataset using
    metrics such as Mean Squared Error or R-squared.
    6. Visualization: Create visualizations to illustrate the relationship between the predicted and
    actual values.
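
The workflow listed above can be sketched end to end on made-up data. This is only a minimal illustration: the synthetic `area` and `price` arrays and every variable name below are assumptions for the demo, not part of the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): price grows linearly with area, plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(1500, 10000, size=200)
price = 500 * area + 1_000_000 + rng.normal(0, 200_000, size=200)

# Feature matrix with a single numeric column, and a held-out test split.
X_demo = area.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, price, test_size=0.2, random_state=0)

# Train on the training rows only, then evaluate on the unseen rows.
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

print("MSE:", mean_squared_error(y_te, pred))
print("R^2:", r2_score(y_te, pred))
```

The same order (split, fit, predict, score) is what the notebook cells below follow on `Housing.csv`.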

DATASET LINK: https://www.kaggle.com/code/ashydv/housing-price-prediction-linear-regression

This is Project 1 (Proposal Level 2) of Oasis Infobyte.  
**DATE**: 27 August 2024

==================================*LOADING THE BASIC LIBRARIES*====================================

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# render plots inline in the notebook output
%matplotlib inline 

In [2]:
import warnings

In [3]:
warnings.filterwarnings('ignore')  # suppress all warnings for the rest of the session
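
A global `'ignore'` filter also hides warnings that may matter later. As a hedged alternative, the sketch below silences warnings only inside a `with` block. The helper `noisy_value` is hypothetical and exists only to raise a demo warning.

```python
import warnings

def noisy_value():
    # Hypothetical helper that emits a warning, used only for this demo.
    warnings.warn("demo warning", UserWarning)
    return 42

# Suppress warnings only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = noisy_value()

# Record warnings instead of printing them, to show the warning is still raised elsewhere.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_value()

print(quiet_result, loud_result, len(caught))
```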

======================================*LOADING THE DATA*==========================================

In [4]:
Data= pd.read_csv('Housing.csv')

==================================*DATA INSPECTION*=========================================

In [5]:
# first 5 rows of the dataset
Data.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [6]:
## last 5 rows of the dataset
Data.tail()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished
544,1750000,3850,3,1,2,yes,no,no,no,no,0,no,unfurnished


In [7]:
# shape of the data; it is an attribute, not a method
Data.shape

(545, 13)

In [8]:
#printing the no. of rows and columns
print("Number of Rows are",Data.shape[0])
print("Number of Columns are",Data.shape[1])

Number of Rows are 545
Number of Columns are 13


In [9]:
# information about the dataset: number of rows, number of columns, datatype of each column, and memory usage
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


In [10]:
#to Get Overall Statistics About The Dataset
Data.describe()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,parking
count,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.693578
std,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.861586
min,1750000.0,1650.0,1.0,1.0,1.0,0.0
25%,3430000.0,3600.0,2.0,1.0,1.0,0.0
50%,4340000.0,4600.0,3.0,1.0,2.0,0.0
75%,5740000.0,6360.0,3.0,2.0,2.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,3.0


In [11]:
Data.columns

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')

=======================================*CHECKING NULL VALUES*====================================================

In [12]:
#Check Null Values In The Dataset
Data.isnull()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,False,False,False,False,False,False,False,False,False,False,False,False,False
541,False,False,False,False,False,False,False,False,False,False,False,False,False
542,False,False,False,False,False,False,False,False,False,False,False,False,False
543,False,False,False,False,False,False,False,False,False,False,False,False,False


In [13]:
#Check the sum of Null Values In The Dataset
Data.isnull().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

There are no null values.
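
This dataset needs no imputation, but if the check had found gaps, one common approach is to fill numeric columns with the median and text columns with the most frequent value. A minimal sketch on a made-up frame (the `toy_missing` data is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
toy_missing = pd.DataFrame({
    "area": [3000.0, np.nan, 4500.0, 6000.0],
    "mainroad": ["yes", "no", None, "yes"],
})

print(toy_missing.isnull().sum())

filled = toy_missing.copy()
# Numeric column: fill with the median of the observed values.
filled["area"] = filled["area"].fillna(filled["area"].median())
# Text column: fill with the most frequent observed value.
filled["mainroad"] = filled["mainroad"].fillna(filled["mainroad"].mode()[0])

print(filled)
```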

#### ==============================CHECKING DUPLICATES==============================

In [14]:
# check for duplicate rows in the dataset
Data.duplicated().any()

False

There are no duplicate rows.
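
For reference, this is roughly how the same check behaves when duplicates do exist, and how they could be dropped. The `toy_dupes` frame is made up for the demo.

```python
import pandas as pd

# Hypothetical frame where the third row repeats the first.
toy_dupes = pd.DataFrame({"price": [100, 200, 100, 300], "area": [10, 20, 10, 30]})

has_dupes = toy_dupes.duplicated().any()   # True when any full row repeats an earlier one
n_dupes = toy_dupes.duplicated().sum()     # number of repeated rows

# Keep the first occurrence of each row and renumber the index.
deduped = toy_dupes.drop_duplicates().reset_index(drop=True)

print(has_dupes, n_dupes, deduped.shape)
```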

In [15]:
for column in Data.columns:
    print(Data[column].value_counts())
    print("*"*20)

price
3500000     17
4200000     17
4900000     12
3150000      9
5600000      9
            ..
6580000      1
4319000      1
4375000      1
4382000      1
13300000     1
Name: count, Length: 219, dtype: int64
********************
area
6000    24
3000    14
4500    13
4000    11
5500     9
        ..
6862     1
4815     1
9166     1
6321     1
3620     1
Name: count, Length: 284, dtype: int64
********************
bedrooms
3    300
2    136
4     95
5     10
6      2
1      2
Name: count, dtype: int64
********************
bathrooms
1    401
2    133
3     10
4      1
Name: count, dtype: int64
********************
stories
2    238
1    227
4     41
3     39
Name: count, dtype: int64
********************
mainroad
yes    468
no      77
Name: count, dtype: int64
********************
guestroom
no     448
yes     97
Name: count, dtype: int64
********************
basement
no     354
yes    191
Name: count, dtype: int64
********************
hotwaterheating
no     520
yes     25
Name: count, dty

In [16]:
Data.columns

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')

#### Price per square foot

In [17]:
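# NOTE: 'price' is already in full currency units, so the 100000 factor only rescales the ratio
# (the plain price/area for row 0 is about 1792). The column is dropped again in cell 22.
Data['price_per_sqft'] = Data['price'] * 100000 / Data['area']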

In [18]:
Data['price_per_sqft']

0      1.792453e+08
1      1.367188e+08
2      1.229920e+08
3      1.628667e+08
4      1.537736e+08
           ...     
540    6.066667e+07
541    7.363125e+07
542    4.834254e+07
543    6.013746e+07
544    4.545455e+07
Name: price_per_sqft, Length: 545, dtype: float64
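
As a quick sanity check on the magnitude of these values, the sketch below recomputes the ratio for the first two rows shown in `Data.head()`, with and without the 100000 factor. It assumes `area` is in square feet, as the heading suggests. The `sample` frame is a hand-copied stand-in, not the full dataset.

```python
import pandas as pd

# First two rows of price and area, copied from the head() output.
sample = pd.DataFrame({"price": [13300000, 12250000], "area": [7420, 8960]})

# Plain ratio: currency units per unit of area.
sample["price_per_sqft"] = sample["price"] / sample["area"]

# Ratio as computed in the notebook cell, inflated by the 100000 factor.
scaled = sample["price"] * 100000 / sample["area"]

print(sample["price_per_sqft"].round(2).tolist())
print((scaled / sample["price_per_sqft"]).tolist())
```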

In [19]:
Data.describe()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,parking,price_per_sqft
count,545.0,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.693578,99332700.0
std,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.861586,34653700.0
min,1750000.0,1650.0,1.0,1.0,1.0,0.0,27039560.0
25%,3430000.0,3600.0,2.0,1.0,1.0,0.0,74537040.0
50%,4340000.0,4600.0,3.0,1.0,2.0,0.0,95238100.0
75%,5740000.0,6360.0,3.0,2.0,2.0,1.0,118461500.0
max,13300000.0,16200.0,6.0,4.0,4.0,3.0,264000000.0


In [20]:
Data.shape

(545, 14)

In [21]:
Data

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus,price_per_sqft
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished,1.792453e+08
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished,1.367188e+08
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished,1.229920e+08
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished,1.628667e+08
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished,1.537736e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished,6.066667e+07
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished,7.363125e+07
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished,4.834254e+07
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished,6.013746e+07


In [22]:
Data.drop(columns=['price_per_sqft'],inplace=True)

In [23]:
Data.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


### ============================SAVING=================================

In [25]:
Data.to_csv("final_dataset.csv", index=False)  # index=False avoids writing the row index as an extra column
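
A small round-trip sketch of what the `index` argument of `to_csv` changes. It writes to in-memory buffers rather than to disk, and the two-row `frame` is made up for the demo.

```python
import io
import pandas as pd

frame = pd.DataFrame({"price": [13300000, 12250000], "area": [7420, 8960]})

# Default behaviour: the row index is written as a first, unnamed column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Reloading the first buffer yields an extra `Unnamed: 0` column, which would otherwise leak into the feature matrix as a spurious feature.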

In [27]:
X=Data.drop(columns=['price'])
y=Data['price']

In [28]:
#=====================================IMPORTING OTHER LIBRARIES=================================================
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

In [29]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=0)

In [30]:
print(X_train.shape)
print(y_train.shape)

(436, 12)
(436,)
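
The shapes above follow from `test_size=0.2` on 545 rows: 109 rows are held out and 436 remain for training. The sketch below reproduces that arithmetic on placeholder arrays (`X_dummy` and `y_dummy` are assumptions with the same row and column counts, not the real data) and shows that a fixed `random_state` gives the same split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows, n_cols = 545, 12
X_dummy = np.arange(n_rows * n_cols).reshape(n_rows, n_cols)
y_dummy = np.arange(n_rows)

# Same arguments as the notebook cell.
X_a, X_b, y_a, y_b = train_test_split(X_dummy, y_dummy, test_size=0.2, random_state=0)

# Repeating the call with the same seed returns the same rows.
_, _, _, y_b_again = train_test_split(X_dummy, y_dummy, test_size=0.2, random_state=0)

print(X_a.shape, X_b.shape)
print(np.array_equal(y_b, y_b_again))
```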


### ===================LINEAR REGRESSION==========================

In [32]:
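# One-hot encode the non-numeric (yes/no and furnishing) columns; left as raw strings they cannot be scaled
column_trans = make_column_transformer((OneHotEncoder(sparse_output=False, handle_unknown='ignore'), X.select_dtypes(exclude='number').columns.tolist()), remainder='passthrough')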

In [33]:
scaler = StandardScaler()

In [43]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
lr = make_pipeline(StandardScaler(),LinearRegression())

In [35]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [46]:
pipe = make_pipeline(column_trans,scaler, lr)

In [None]:
# Standalone baseline on the numeric features only; the yes/no text columns
# cannot be passed to StandardScaler without encoding them first
X_num = X.select_dtypes(include='number')
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)
lr = LinearRegression()
lr.fit(X_scaled, y)

In [None]:
pipe.fit(X_train,y_train)

In [None]:
y_pred_lr = pipe.predict(X_test)

In [None]:
r2_score(y_test,y_pred_lr)
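
`r2_score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand, together with the Mean Squared Error named in the work breakdown, and checks both against scikit-learn. The four-value arrays are made up for the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Mean Squared Error: average of the squared residuals.
mse = np.mean((y_true - y_hat) ** 2)

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
print(np.isclose(mse, mean_squared_error(y_true, y_hat)), np.isclose(r2, r2_score(y_true, y_hat)))
```

Here the residual sum of squares is 1.5 and the total sum of squares is 20, so R² is 0.925 and the MSE is 0.375.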

In [None]:
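# Reported R² of about 0.85 (R² measures variance explained rather than accuracy; no output was recorded for this cell)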
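
The work breakdown also calls for a plot of predicted against actual values. A minimal sketch of such a plot follows. It uses synthetic `actual` and `predicted` arrays, because the notebook records no prediction output, and it selects the `Agg` backend only so the snippet runs outside a notebook. In the notebook, `y_test` and `y_pred_lr` would take their place.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for y_test and y_pred_lr (hypothetical values).
rng = np.random.default_rng(0)
actual = rng.uniform(1_750_000, 13_300_000, size=100)
predicted = actual + rng.normal(0, 1_000_000, size=100)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, alpha=0.6)

# Points on the dashed diagonal would be perfect predictions.
lims = [actual.min(), actual.max()]
ax.plot(lims, lims, color="red", linestyle="--", label="perfect prediction")

ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
ax.set_title("Predicted vs. actual price")
ax.legend()
```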

================================================THANK YOU===============================================

For queries, mail: ranisoni6298@gmail.com