# Predict Boston Housing Prices

This Python program predicts house prices in Boston using linear regression, a machine learning algorithm.

<p align="center">
  <img src="https://www.maxpixel.net/static/photo/1x/Top-View-Top-Boston-City-Urban-Houses-1401212.jpg" width="400"/>
</p>


# Linear Regression
Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
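
As a sketch in standard textbook notation (the symbols $\beta_j$, $x_j$, $p$ and $n$ are assumptions and do not appear in the original notebook), the model with $p$ explanatory variables predicts

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
$$

and ordinary least squares chooses the coefficients that minimize the sum of squared residuals over the $n$ training samples:

$$
\min_{\beta_0, \dots, \beta_p} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

For this data set, $p = 13$ (the columns CRIM through LSTAT) and $y$ is the house price.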

## Pros:
1. Simple to implement.
2. Used to predict numeric values.

## Cons:
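1. Prone to overfitting when there are many features relative to the number of samples.
2. Performs poorly when the relationship between the independent and dependent variables is non-linear, unless the features are transformed first.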
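
The second limitation can be illustrated with a small synthetic example. This is only a sketch on made-up data (`y = x**2`), not on the Boston data, and the helper `r_squared` is a hypothetical name introduced here. A straight line explains almost none of the variance, while adding a squared term, which is still a model that is linear in its coefficients, fits the curve.

```python
import numpy as np

# Synthetic, purely illustrative data: a parabola on a symmetric interval
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Straight-line fit; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
r2_linear = r_squared(y, slope * x + intercept)

# Same least-squares machinery, but with an added x**2 term
coeffs = np.polyfit(x, y, deg=2)
r2_quadratic = r_squared(y, np.polyval(coeffs, x))

print("R^2 of straight line:", r2_linear)
print("R^2 with squared term:", r2_quadratic)
```

In practice this means a poor linear fit is sometimes a sign that a feature needs transforming rather than that regression is unusable.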


## Resources:

*   https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
*   https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
*   https://youtu.be/gOXoFDrseis





In [0]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split


In [2]:
# Load the Boston Housing data set from sklearn.datasets and print it
# NOTE: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
print(boston)

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]]), 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 1

In [0]:
# Transform the data set into data frames
# NOTE: boston.data = the feature values,
#       boston.feature_names = the column names of the features
#       boston.target = our target variable, the house prices (median value in $1000s)
df_x = pd.DataFrame(boston.data, columns=boston.feature_names)
df_y = pd.DataFrame(boston.target)


In [4]:
# Get some statistics from our data set: count, mean, standard deviation, etc.
df_x.describe()

statistic,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97


In [0]:
#Initialize the linear regression model
reg = linear_model.LinearRegression()

In [0]:
# Split the data into 67% training and 33% testing data
# NOTE: We have to split both the independent variables (x) and the target or dependent variable (y)
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.33, random_state=42)
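
To see what this split produces, here is a minimal sketch on placeholder arrays with the same number of rows (506) as the Boston data. The arrays `X` and `y` are made up for illustration. It shows that the rows of `X` and `y` stay paired after shuffling and that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): row i of X is [2*i, 2*i + 1], y[i] = i
X = np.arange(506 * 2).reshape(506, 2)
y = np.arange(506)

# Same arguments as the notebook cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Repeating the call with the same random_state gives the same partition
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.33, random_state=42
)

print("train rows:", X_train.shape[0], "test rows:", X_test.shape[0])
print("features and targets still aligned:", np.array_equal(X_test[:, 0], 2 * y_test))
print("reproducible:", np.array_equal(X_test, X_test2))
```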

In [7]:
#Train our model with the training data
reg.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
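
After fitting, scikit-learn stores the learned parameters on the estimator as `intercept_` and `coef_`. The sketch below uses noise-free synthetic data with known coefficients (an assumption made so the result is easy to check), not the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known parameters: y = 3.0 + 1.5*x1 - 2.0*x2 (no noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# The fitted intercept and slopes should match the values used to build y
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)

# predict() applies intercept_ + X @ coef_ to new rows
new_rows = np.array([[0.0, 0.0], [1.0, 1.0]])
predictions = model.predict(new_rows)
print("predictions:", predictions)
```

The same attributes on `reg` could be compared with the hand-computed coefficients in the following cells, keeping in mind that `reg` was fitted on the training rows only.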

In [8]:
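# Prepend a column of ones to the features so the first coefficient acts as the intercept
df_x_n = np.c_[np.ones((len(df_x), 1)), df_x]
df_x_n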

array([[1.0000e+00, 6.3200e-03, 1.8000e+01, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [1.0000e+00, 2.7310e-02, 0.0000e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [1.0000e+00, 2.7290e-02, 0.0000e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [1.0000e+00, 6.0760e-02, 0.0000e+00, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0000e+00, 1.0959e-01, 0.0000e+00, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [1.0000e+00, 4.7410e-02, 0.0000e+00, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

In [9]:
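# Solve the normal equation, coff = (X^T X)^-1 X^T y, for the least-squares coefficients
# NOTE: this uses every row (train and test), so the test error computed below is optimistic
coff = np.linalg.inv(df_x_n.T.dot(df_x_n)).dot(df_x_n.T).dot(df_y)
coff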

array([[ 3.64594884e+01],
       [-1.08011358e-01],
       [ 4.64204584e-02],
       [ 2.05586264e-02],
       [ 2.68673382e+00],
       [-1.77666112e+01],
       [ 3.80986521e+00],
       [ 6.92224640e-04],
       [-1.47556685e+00],
       [ 3.06049479e-01],
       [-1.23345939e-02],
       [-9.52747232e-01],
       [ 9.31168327e-03],
       [-5.24758378e-01]])
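
The computation above is the normal equation, $\hat{\beta} = (X^\top X)^{-1} X^\top y$, with the intercept as the first entry. As a hedged sketch on synthetic data (the array names and `true_beta` are made up for illustration), the explicit inverse can be compared with `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse and is usually the numerically safer choice.

```python
import numpy as np

# Synthetic data with known coefficients; intercept first, then three slopes
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
true_beta = np.array([4.0, 1.0, -2.0, 0.5])

# Prepend a column of ones so the first coefficient is the intercept
X_b = np.c_[np.ones((X.shape[0], 1)), X]
y = X_b @ true_beta  # noise-free targets

# Normal equation with an explicit inverse, as in the notebook
beta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Least-squares solver; the first return value holds the coefficients
beta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("normal equation:", beta_inv)
print("lstsq:          ", beta_lstsq)
```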

In [10]:
# Predict test-set prices: features times the slope coefficients, plus the intercept coff[0]
y_predict = np.dot(x_test, coff[1:]) + coff[0]
y_predict

array([[29.07194309],
       [36.10639166],
       [13.7590854 ],
       [24.55744959],
       [18.58215238],
       [22.94545405],
       [17.61572574],
       [13.57141933],
       [22.97718033],
       [20.46870848],
       [24.91525465],
       [18.92026212],
       [-4.28131069],
       [21.26198268],
       [19.45123792],
       [25.57824628],
       [20.56099821],
       [ 6.45198571],
       [40.84555723],
       [17.51834652],
       [27.20340973],
       [30.35192657],
       [11.4551176 ],
       [23.53334056],
       [18.12427477],
       [16.39147633],
       [22.42817375],
       [15.28000135],
       [22.39251098],
       [19.10854145],
       [22.97657221],
       [24.88682246],
       [25.30713066],
       [18.22846806],
       [16.94776852],
       [17.28601895],
       [30.56759673],
       [20.16097178],
       [23.49428994],
       [24.04295714],
       [14.01320789],
       [32.70905125],
       [43.18949845],
       [17.42153787],
       [27.11901941],
       [17

In [11]:
# Two different ways to check model performance using the mean squared error (MSE),
# the average squared difference between predicted and actual values (lower is better).

# 1. Mean squared error with NumPy
print(np.mean((y_predict - y_test) ** 2))

0    19.038545
dtype: float64
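
As a cross-check, the same quantity can be computed with scikit-learn's `mean_squared_error`. The sketch below uses four made-up values rather than the notebook's predictions, and it is only one possible way to do a second check, not necessarily the one the notebook intended.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE by hand: mean of the squared residuals
mse_numpy = np.mean((y_pred - y_true) ** 2)

# MSE with scikit-learn (argument order: true values, then predictions)
mse_sklearn = mean_squared_error(y_true, y_pred)

# The square root puts the error back in the units of the target
rmse = np.sqrt(mse_numpy)

print(mse_numpy, mse_sklearn, rmse)
```

Because the Boston target is expressed in thousands of dollars, the square root of the MSE can be read as a typical prediction error in the same unit.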
