# Predicting House Prices Using Linear Regression

This project uses an already-cleaned dataset to predict house prices with a linear regression model. The data comes from Kaggle, with some columns removed. The goal is to explain the general idea using a very simple dataset.

This notebook follows the video at this link: https://www.youtube.com/watch?v=acfKAcqaT2w
    

In [20]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score 
from sklearn.preprocessing import LabelEncoder 
from sklearn.model_selection import train_test_split


In [2]:
data = pd.read_csv('kc_house.csv')

In [3]:
data.head()

Unnamed: 0,id,price,bedrooms,bathrooms,area,zip_code,age,grade
0,7129300520,221900.0,3,1.0,1180,98178,65,7
1,6414100192,538000.0,3,2.25,2570,98125,69,7
2,5631500400,180000.0,2,1.0,770,98028,87,6
3,2487200875,604000.0,4,3.0,1960,98136,55,7
4,1954400510,510000.0,3,2.0,1680,98074,33,8


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         21613 non-null  int64  
 1   price      21613 non-null  float64
 2   bedrooms   21613 non-null  int64  
 3   bathrooms  21613 non-null  float64
 4   area       21613 non-null  int64  
 5   zip_code   21613 non-null  int64  
 6   age        21613 non-null  int64  
 7   grade      21613 non-null  int64  
dtypes: float64(2), int64(6)
memory usage: 1.3 MB


As shown above, apart from id the data consists of the following columns:
-  price: the price of the house.
-  bedrooms: the number of bedrooms in the house.
-  bathrooms: the number of bathrooms in the house.
-  area: the area of the house in square feet.
-  zip_code: the zip code of the house.
-  age: the age of the house.
-  grade: the grade of the house.


The id column carries no useful information for the model, so it is removed from our dataset.

In [5]:
data = data.iloc[:,1:]
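Dropping by position works here because id is the first column. As a minimal sketch, the same step can be written by column name, shown on a small made-up frame (the two rows are copied from the preview above; `toy` is a hypothetical name):

```python
import pandas as pd

# Two rows copied from the data preview above, used only for illustration
toy = pd.DataFrame({
    "id": [7129300520, 6414100192],
    "price": [221900.0, 538000.0],
    "bedrooms": [3, 3],
})

# Drop the id column by name; this does not depend on the column order
by_name = toy.drop(columns=["id"])

# Positional form used in this notebook: keep everything after the first column
by_position = toy.iloc[:, 1:]

print(by_name.equals(by_position))
```

Dropping by name keeps working if the column order of the CSV file ever changes.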

Convert the zip_code column to the category data type.

In [6]:
data.zip_code = pd.Categorical(data.zip_code)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   price      21613 non-null  float64 
 1   bedrooms   21613 non-null  int64   
 2   bathrooms  21613 non-null  float64 
 3   area       21613 non-null  int64   
 4   zip_code   21613 non-null  category
 5   age        21613 non-null  int64   
 6   grade      21613 non-null  int64   
dtypes: category(1), float64(2), int64(4)
memory usage: 1.0 MB
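To see what the category dtype stores, here is a small sketch on three zip codes taken from the preview above (`zips` is a hypothetical name). A categorical keeps each distinct value once and represents every row by an integer code pointing into that list.

```python
import pandas as pd

zips = pd.Categorical([98178, 98125, 98178])

# Distinct values, kept once and sorted
print(list(zips.categories))

# For each row, the position of its value within categories
print(list(zips.codes))
```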


In [8]:
data.head(10)

Unnamed: 0,price,bedrooms,bathrooms,area,zip_code,age,grade
0,221900.0,3,1.0,1180,98178,65,7
1,538000.0,3,2.25,2570,98125,69,7
2,180000.0,2,1.0,770,98028,87,6
3,604000.0,4,3.0,1960,98136,55,7
4,510000.0,3,2.0,1680,98074,33,8
5,1230000.0,4,4.5,5420,98053,19,11
6,257500.0,3,2.25,1715,98003,25,7
7,291850.0,3,1.5,1060,98198,57,7
8,229500.0,3,1.0,1780,98146,60,7
9,323000.0,3,2.5,1890,98038,17,7


Machine learning models cannot work with categorical data directly, so the zip_code values have to be encoded as numbers. LabelEncoder, which scikit-learn intends for target labels, maps each distinct value to an integer between 0 and n_classes-1. For more details, see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

fit_transform is one of the methods of sklearn.preprocessing.LabelEncoder.
It fits the encoder and returns the encoded labels, turning non-numerical labels into numerical ones.

In [9]:
# Fit the label encoder on zip_code and replace the column with the integer codes
enc = LabelEncoder()
data['zip_code'] = enc.fit_transform(data['zip_code'])

In [10]:
data.head(3)

Unnamed: 0,price,bedrooms,bathrooms,area,zip_code,age,grade
0,221900.0,3,1.0,1180,66,65,7
1,538000.0,3,2.25,2570,55,69,7
2,180000.0,2,1.0,770,16,87,6


As shown above, the zip_code values are now integer codes instead of categories.
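A minimal sketch of what LabelEncoder does, using a few zip codes from the preview (the list `zip_codes` is made up for illustration): each distinct value gets an integer code according to its sorted position, and inverse_transform maps the codes back.

```python
from sklearn.preprocessing import LabelEncoder

zip_codes = [98178, 98125, 98028, 98178]

enc = LabelEncoder()
codes = enc.fit_transform(zip_codes)

# Sorted distinct values learned during fitting
print(list(enc.classes_))

# Integer code for each input value
print(list(codes))

# Codes mapped back to the original zip codes
print(list(enc.inverse_transform(codes)))
```

Keep in mind that a linear model treats these codes as ordinary numbers, although zip codes have no natural order.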

The value we want to predict is the price, which is our target y.

In [11]:
y = data.iloc[:,0]

In [12]:
y

0        221900.0
1        538000.0
2        180000.0
3        604000.0
4        510000.0
           ...   
21608    360000.0
21609    400000.0
21610    402101.0
21611    400000.0
21612    325000.0
Name: price, Length: 21613, dtype: float64

All columns except price are used as features for our model.

In [13]:
x = data.iloc[:,1:]

In [14]:
x.head()

Unnamed: 0,bedrooms,bathrooms,area,zip_code,age,grade
0,3,1.0,1180,66,65,7
1,3,2.25,2570,55,69,7
2,2,1.0,770,16,87,6
3,4,3.0,1960,58,55,7
4,3,2.0,1680,37,33,8


In [15]:
x['zip_code']

0        66
1        55
2        16
3        58
4        37
         ..
21608    42
21609    60
21610    59
21611    15
21612    59
Name: zip_code, Length: 21613, dtype: int64

pd.get_dummies(data) converts categorical data into boolean indicator (dummy) columns. By default it only converts columns with object, string, or category dtype. For more information, see this link:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

Note:
-  Most machine learning algorithms, including linear regression, need numeric input, so categorical features should be converted into boolean indicator columns.

Here is an example. Because zip_code was already label-encoded to integers earlier, get_dummies leaves x unchanged in this case:

In [16]:
# convert the categorical columns of x to boolean indicator columns
all_data = pd.get_dummies(x)

In [17]:
all_data.head()

Unnamed: 0,bedrooms,bathrooms,area,zip_code,age,grade
0,3,1.0,1180,66,65,7
1,3,2.25,2570,55,69,7
2,2,1.0,770,16,87,6
3,4,3.0,1960,58,55,7
4,3,2.0,1680,37,33,8
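For comparison, here is a minimal sketch, on a made-up frame with values from the preview, of how get_dummies expands a column that has category dtype (`toy` and `dummies` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    "area": [1180, 2570, 770],
    "zip_code": pd.Categorical([98178, 98125, 98028]),
})

# Numeric columns are kept as they are; the categorical column becomes
# one indicator column per distinct zip code
dummies = pd.get_dummies(toy)

print(dummies.columns.tolist())
print(dummies)
```

Each row has exactly one indicator column set for its zip code, so no artificial order between zip codes is introduced.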


train_test_split splits arrays or matrices into random train and test subsets. For more details, see:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [44]:
x_train, x_test, y_train, y_test = train_test_split(all_data, y, test_size=0.3)

In [45]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15129 entries, 7137 to 8700
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   bedrooms   15129 non-null  int64  
 1   bathrooms  15129 non-null  float64
 2   area       15129 non-null  int64  
 3   zip_code   15129 non-null  int64  
 4   age        15129 non-null  int64  
 5   grade      15129 non-null  int64  
dtypes: float64(1), int64(5)
memory usage: 827.4 KB


The output of x_train.info() shows that the training set contains 15129 of the 21613 rows, which is 70% of the data.
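The split sizes can be checked on placeholder arrays with the same number of rows. This is only a sketch: `features` and `target` are made-up stand-ins for the real data, and `random_state=0` is an assumption added here to make the split repeatable; the call in this notebook does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 21613  # number of rows in this dataset

# Placeholder arrays standing in for the features and the target
features = np.arange(n_rows).reshape(-1, 1)
target = np.arange(n_rows)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.3, random_state=0
)
print(len(f_train), len(f_test))

# Calling it again with the same random_state gives the same split
f_train_again, _, _, _ = train_test_split(
    features, target, test_size=0.3, random_state=0
)
print(np.array_equal(f_train, f_train_again))
```

Without a fixed random_state the rows are shuffled differently on every run, so the score obtained later can vary slightly between runs.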

## Fitting the linear regression model

In [46]:
linear = LinearRegression()

In [47]:
linear.fit(x_train, y_train)

LinearRegression()
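What fit learns is one coefficient per feature plus an intercept. A minimal sketch on made-up data that follows y = 2x + 1 exactly (`toy_x`, `toy_y` and `toy_model` are hypothetical names):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2*x + 1 exactly
toy_x = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = 2 * toy_x[:, 0] + 1

toy_model = LinearRegression().fit(toy_x, toy_y)

# Learned slope (one entry per feature) and intercept
print(toy_model.coef_, toy_model.intercept_)

# Prediction for an unseen value x = 4
toy_pred = toy_model.predict(np.array([[4.0]]))
print(toy_pred)
```

On the house data, linear.coef_ and linear.intercept_ can be inspected in the same way after fitting.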

In [48]:
pred = linear.predict(x_test)

Now we have predictions for the test data. We can compare them with the true prices using the R² score.

In [49]:
r2_score(y_test, pred)

0.618177161272839

An R² of about 0.62 is not bad, given that the model has not been tuned at all. We now have a trained model.
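The R² score is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch that computes it by hand and compares it with r2_score; the true prices are taken from the preview, while the predictions are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([221900.0, 538000.0, 180000.0, 604000.0])  # prices from the preview
y_hat = np.array([250000.0, 500000.0, 200000.0, 580000.0])   # made-up predictions

# Squared errors of the predictions
ss_res = np.sum((y_true - y_hat) ** 2)

# Squared errors of always predicting the mean price
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

manual_r2 = 1 - ss_res / ss_tot
print(manual_r2, r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean price.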

## Saving the model into file

Now we save the trained model object to a file, so that it can be loaded later and used to make predictions elsewhere. joblib is used here to save and load the scikit-learn model.

In [50]:
import joblib

In [51]:
# save the model to disk
joblib.dump(linear,'hous_price_model.ml')

['hous_price_model.ml']

## Loading a model later

In [52]:
# load the model from disk and check that it gives the same R² score on the test set
loaded_model = joblib.load('hous_price_model.ml')
result = loaded_model.score(x_test, y_test)
print(result)

0.618177161272839
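The identical score suggests that the loaded model behaves like the original one. A self-contained sketch of the same round trip on a tiny made-up model, written to a temporary directory (the file name `toy_model.joblib` and the variable names are hypothetical):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset, only for illustration
toy_x = np.array([[1.0], [2.0], [3.0]])
toy_y = np.array([2.0, 4.0, 6.0])
toy_model = LinearRegression().fit(toy_x, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    joblib.dump(toy_model, path)   # save the fitted model to disk
    restored = joblib.load(path)   # load it back from the file path

# The restored model should give the same predictions as the original
same = np.allclose(toy_model.predict(toy_x), restored.predict(toy_x))
print(same)
```

Since joblib is based on pickle, model files should only be loaded from sources you trust.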
