<a href="https://colab.research.google.com/github/AjinJayan/AJ/blob/master/Predict_Boston_Housing_Price%20corrected.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predict Boston Housing Prices

This Python program predicts house prices in Boston using a machine learning algorithm called linear regression.

<p align="center">
  <img src="https://www.maxpixel.net/static/photo/1x/Top-View-Top-Boston-City-Urban-Houses-1401212.jpg" width="400"/>
</p>


# Linear Regression
Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
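
As a minimal sketch of that definition, the snippet below fits a straight line `y = slope * x + intercept` to synthetic data. The data, the true slope of 3 and the true intercept of 2 are assumptions made up for illustration; they are not taken from the Boston data set.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.shape)

# Fit a degree-1 polynomial, i.e. a straight line, by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The recovered slope and intercept should land close to the values used to generate the data, which is what "modeling the relationship" means in practice.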

## Pros:
1. Simple to implement.
2. Used to predict numeric values.

## Cons:
1. Prone to overfitting when there are many features relative to the number of samples.
2. Fits poorly when the relationship between the independent and dependent variables is nonlinear.
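
The second limitation can be illustrated with a small sketch on made-up data: a straight line fitted to a purely quadratic relationship explains almost none of the variance, while adding a squared term (still linear in the coefficients) recovers it. The helper `r_squared` is a hypothetical name introduced here for illustration.

```python
import numpy as np

# Made-up data: y depends on x only through x**2, symmetric about zero
x = np.linspace(-3, 3, 61)
y = x ** 2

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Straight-line fit versus a fit that also includes an x**2 term
line = np.polyval(np.polyfit(x, y, deg=1), x)
curve = np.polyval(np.polyfit(x, y, deg=2), x)

r2_line = r_squared(y, line)
r2_curve = r_squared(y, curve)
print(f"R^2 straight line: {r2_line:.3f}")
print(f"R^2 with x^2 term: {r2_curve:.3f}")
```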


## Resources:

*   https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
*   https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
*   https://youtu.be/gOXoFDrseis





In [0]:

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split


In [15]:
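#Load the Boston Housing Data Set from sklearn.datasets and print it
#NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version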
from sklearn.datasets import load_boston
boston = load_boston()
print(boston)

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]]), 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 1

In [0]:
#Transform the data set into a data frame 
#NOTE: boston.data = the data we want, 
#      boston.feature_names = the column names of the data
#      boston.target = Our target variable or the price of the houses
df_x = pd.DataFrame(boston.data, columns = boston.feature_names)
df_y = pd.DataFrame(boston.target)


In [17]:
#Get some statistics from our data set: count, mean, standard deviation, etc.
df_x.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


In [0]:
#Initialize the linear regression model
reg = linear_model.LinearRegression()

In [0]:
#Split the data into 67% training and 33% testing data
#NOTE: We have to split both the independent variables (x) and the target or dependent variable (y)
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.33, random_state=42)
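
To see what this split produces, here is a small sketch on placeholder arrays with the same number of rows as the data set (506). The arrays `X` and `y` are made up for illustration; only the row count, `test_size` and `random_state` mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 506 rows; row i holds [2i, 2i + 1] and its label is i
X = np.arange(506 * 2).reshape(506, 2)
y = np.arange(506)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# The test size is rounded up, so 506 rows become 167 test and 339 training rows
print(X_train.shape, X_test.shape)

# Features and labels are shuffled together, so each row keeps its own label
print(np.array_equal(X_train[:, 0] // 2, y_train))
```

The 339 training rows are the reason the number 339 appears when the column of ones is built later in the notebook.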

In [20]:
#Train our model with the training data
reg.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
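
After fitting, the learned parameters are available as `coef_` and `intercept_`. The sketch below shows this on synthetic data whose true weights and intercept are assumptions chosen for illustration, so the fitted values can be checked against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): three features, known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y = X @ true_coef + 4.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# One weight per feature, plus a single intercept term
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# R^2 on the training data, 1.0 would be a perfect fit
print("R^2:", model.score(X, y))
```

In the notebook, `y_train` is a one-column DataFrame rather than a 1-D array, so `reg.coef_` there should have shape `(1, 13)` instead of `(13,)`.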

In [21]:
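#Prepend a column of ones to the training features so the first coefficient acts as the intercept
df_x_n=np.c_[np.ones((len(x_train),1)),x_train]
df_x_n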

array([[1.00000e+00, 1.02330e+01, 0.00000e+00, ..., 2.02000e+01,
        3.79700e+02, 1.80300e+01],
       [1.00000e+00, 6.71910e-01, 0.00000e+00, ..., 2.10000e+01,
        3.76880e+02, 1.48100e+01],
       [1.00000e+00, 1.44550e-01, 1.25000e+01, ..., 1.52000e+01,
        3.96900e+02, 1.91500e+01],
       ...,
       [1.00000e+00, 1.50100e-02, 8.00000e+01, ..., 1.70000e+01,
        3.90940e+02, 5.99000e+00],
       [1.00000e+00, 1.11604e+01, 0.00000e+00, ..., 2.02000e+01,
        1.09850e+02, 2.32700e+01],
       [1.00000e+00, 2.28760e-01, 0.00000e+00, ..., 2.09000e+01,
        7.08000e+01, 1.06300e+01]])

In [22]:
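#Solve the normal equation coff = (X^T X)^-1 X^T y
#coff[0] is the intercept, coff[1:] are the 13 feature weights
coff=np.linalg.inv(df_x_n.T.dot(df_x_n)).dot(df_x_n.T).dot(y_train)
coff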

array([[ 3.33349758e+01],
       [-1.28749718e-01],
       [ 3.78232228e-02],
       [ 5.82109233e-02],
       [ 3.23866812e+00],
       [-1.61698120e+01],
       [ 3.90205116e+00],
       [-1.28507825e-02],
       [-1.42222430e+00],
       [ 2.34853915e-01],
       [-8.21331947e-03],
       [-9.28722459e-01],
       [ 1.17695921e-02],
       [-5.47566338e-01]])
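
The cell above evaluates the normal equation `beta = (X^T X)^-1 X^T y` with an explicit matrix inverse. As a hedged sketch on synthetic data (the weights and intercept below are assumptions for illustration), the same coefficients can also be obtained with `np.linalg.lstsq`, which avoids forming the inverse and tends to be more robust when the columns of `X` are nearly collinear.

```python
import numpy as np

# Synthetic data (assumed for illustration): three features, intercept 3.0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the first coefficient is the intercept
X_b = np.c_[np.ones((len(X), 1)), X]

# Normal equation with an explicit inverse, as in the notebook cell
beta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Least-squares solver that never forms the inverse
beta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(beta_inv)
print(np.allclose(beta_inv, beta_lstsq))
```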

In [23]:
#Predict on the test set: coff[1:] holds the feature weights and coff[0] the intercept
y_predict=np.dot(x_test,coff[1:])+coff[0]
y_predict

array([[28.53469473],
       [36.61870064],
       [15.63751083],
       [25.50144964],
       [18.70967345],
       [23.16471596],
       [17.31011039],
       [14.07736372],
       [23.01064393],
       [20.54223486],
       [24.91632355],
       [18.41098057],
       [-6.52079683],
       [21.83372609],
       [19.14903069],
       [26.05873225],
       [20.3023263 ],
       [ 5.74943571],
       [40.33137816],
       [17.4579145 ],
       [27.47486669],
       [30.21707575],
       [10.80555629],
       [23.87721733],
       [17.99492215],
       [16.02608795],
       [23.26828805],
       [14.36825211],
       [22.38116976],
       [19.30920685],
       [22.1728458 ],
       [25.05925446],
       [25.1378073 ],
       [18.46730203],
       [16.60405716],
       [17.46564051],
       [30.71367738],
       [20.05106792],
       [23.98977685],
       [24.94322412],
       [13.9794536 ],
       [31.64706971],
       [42.48057211],
       [17.70042818],
       [26.92507873],
       [17

In [24]:
# Two different ways to check model performance using mean squared error (MSE),
# which measures how close the regression line is to a set of points (lower is better).

# 1. Mean squared error by numpy
print(np.mean((y_predict-y_test)**2))

0    20.724023
dtype: float64
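
The comment in the cell above mentions two ways, but only the NumPy one appears in this section. A plausible second way, offered here as an assumption rather than as the notebook's own code, is `sklearn.metrics.mean_squared_error`. The sketch uses made-up toy values, not the notebook's predictions, and checks that both approaches agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values made up for illustration (not taken from the Boston data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# 1. Mean squared error with NumPy: average of the squared differences
mse_numpy = np.mean((y_pred - y_true) ** 2)

# 2. Mean squared error with scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_numpy, mse_sklearn)
```

Applied to the notebook's variables, the second call would be `mean_squared_error(y_test, y_predict)`.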
