# Boston Housing Price Prediction 
This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. The dataset is small, with only 506 cases.

### Dataset Naming
The name for this dataset is simply boston. It has two prototasks: nox, in which the nitric oxides concentration is to be predicted; and price, in which the median value of a home is to be predicted. This notebook addresses the price task, using MEDV as the target.

### Variables
There are 14 attributes in each case of the dataset. They are:
  - CRIM - per capita crime rate by town
  - ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  - INDUS - proportion of non-retail business acres per town.
  - CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  - NOX - nitric oxides concentration (parts per 10 million)
  - RM - average number of rooms per dwelling
  - AGE - proportion of owner-occupied units built prior to 1940
  - DIS - weighted distances to five Boston employment centres
  - RAD - index of accessibility to radial highways
  - TAX - full-value property-tax rate per $10,000
  - PTRATIO - pupil-teacher ratio by town
  - B - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  - LSTAT - % lower status of the population
  - MEDV - Median value of owner-occupied homes in $1000's



**Importing the dataset into a pandas DataFrame**

In [23]:
import numpy as np
import pandas as pd
import sklearn

bos1 = pd.read_csv('BostonHousing.csv')
bos1.head(10)

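|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
| 5 | 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
| 6 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.6 | 12.43 | 22.9 |
| 7 | 0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.9 | 19.15 | 27.1 |
| 8 | 0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
| 9 | 0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.1 | 18.9 |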


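**Preprocessing the data: Checking for NaN values**

In [24]:
# Count the missing (NaN) values in each column.
# The models used below cannot handle NaN inputs, so any missing values
# would have to be dropped or imputed before training.
# Every count below is 0, so nothing needs to be removed here.
bos1.isna().sum()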

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
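
The check above finds no missing values, so this dataset needs no cleanup. For data that does contain NaN values, the removal or imputation step might look like the following minimal sketch. The toy DataFrame and its values are made up for illustration and are not taken from the Boston data.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values (illustrative values only).
toy = pd.DataFrame({
    "RM": [6.5, np.nan, 7.1, 6.0],
    "LSTAT": [4.9, 9.1, np.nan, 12.4],
    "MEDV": [24.0, 21.6, 34.7, 22.9],
})

# Count missing values per column, as in the cell above.
missing_counts = toy.isna().sum()

# Option 1: drop every row that contains at least one NaN.
dropped = toy.dropna()

# Option 2: fill each NaN with the median of its own column.
filled = toy.fillna(toy.median())

print(missing_counts)
print(dropped.shape)
print(filled)
```

Dropping rows is the simplest choice but shrinks an already small dataset; imputing keeps every row at the cost of introducing estimated values.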

**Splitting model data with 70% for training**

In [25]:
from sklearn.model_selection import train_test_split
X = np.array(bos1.iloc[:, 0:13])
Y = np.array(bos1["MEDV"])
# hold out 30% of the data for testing; random_state fixes the shuffle so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 5)
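
To see what this split produces, the sketch below applies the same call to stand-in arrays that only share the shape of the Boston data (506 rows, 13 features). The names `X_demo` and `y_demo` and the random values are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shape as the Boston features and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=5
)

# Repeating the call with the same random_state reproduces the same split.
x_tr2, x_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=5
)

print(x_tr.shape, x_te.shape)
print(np.array_equal(x_te, x_te2))
```

With 506 rows, a 30% test share rounds up to 152 test rows, which leaves 354 rows for training.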

**Using Linear Regression Model**

In [26]:
from sklearn.linear_model import LinearRegression
# create the first model: ordinary least-squares linear regression
lr = LinearRegression()
# train the model on training data
lr.fit(x_train, y_train)
# predict the testing data so that we can later evaluate the model
pred_lr = lr.predict(x_test)

**Model Evaluation**

In [27]:
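from sklearn.metrics import mean_squared_error
# root-mean-squared error (RMSE) of linear regression on the test set
# (y_test holds the actual values, pred_lr the values predicted by the model)
# taking the square root of the MSE puts the error in the units of MEDV ($1000s)
rmse_lr = np.sqrt(mean_squared_error(y_test, pred_lr))
print("error for linear regression = {}".format(rmse_lr))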

error for linear regression = 5.540490745781328
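
The error reported above is a root-mean-squared error (RMSE): the square root of the average squared difference between actual and predicted values. Because MEDV is measured in $1000s, an RMSE of about 5.54 corresponds to a typical prediction error of roughly $5,540. The following small sketch computes the metric by hand and checks it against scikit-learn; the three-element arrays are made-up values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values (illustration only).
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# By hand: errors are 1, 0, -2; squared errors are 1, 0, 4; their mean is 5/3.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn: square root of the mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```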


**Using K-Nearest Neighbors (KNN) Model**

In [28]:
from sklearn.neighbors import KNeighborsRegressor
# create the KNN model (starting with K = 3 nearest neighbors)
knn = KNeighborsRegressor(n_neighbors=3)
# train it on the training data, then predict the testing data for later evaluation
knn.fit(x_train, y_train)
pred_knn = knn.predict(x_test)
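
KNN predicts from distances between rows, so features with large numeric ranges (TAX runs in the hundreds, NOX stays below 1) can dominate the neighbor search. The notebook does not scale its features; the sketch below is not the notebook's method but one possible variation, shown on synthetic data in which a small-range feature carries the signal and a large-range feature is pure noise. All names and values here are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 300
x_small = rng.uniform(0, 1, size=n)      # informative, small numeric range
x_large = rng.uniform(0, 1000, size=n)   # uninformative, large numeric range
X_demo = np.column_stack([x_small, x_large])
y_demo = 10 * x_small + rng.normal(scale=0.1, size=n)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=5
)

# KNN on raw features: distances are dominated by the large-range column.
raw_knn = KNeighborsRegressor(n_neighbors=3).fit(x_tr, y_tr)

# KNN on standardized features: the scaler is fitted on the training data only.
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=3)
).fit(x_tr, y_tr)

rmse_raw = np.sqrt(mean_squared_error(y_te, raw_knn.predict(x_te)))
rmse_scaled = np.sqrt(mean_squared_error(y_te, scaled_knn.predict(x_te)))
print(rmse_raw, rmse_scaled)
```

On this synthetic example the scaled pipeline gives the lower error. Whether scaling helps on the Boston data would need to be checked by rerunning the notebook.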

**Hyperparameter Tuning**

In [29]:
from sklearn.metrics import mean_squared_error
# Try each K from 1 to 29 and print the test-set RMSE for each.
# The lowest error is at K = 4, not the K = 3 assumed earlier.
# Note: choosing K on the test set makes the test error an optimistic estimate.
for k in range(1, 30):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(x_train, y_train)
    pred_y = model.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred_y))
    print("{} error for K = {}".format(rmse, k))

7.97154478854566 error for K = 1
7.159484875618533 error for K = 2
7.014927171138291 error for K = 3
7.004019640065342 error for K = 4
7.036131375752027 error for K = 5
7.103650686103268 error for K = 6
7.249246229196143 error for K = 7
7.278466403768686 error for K = 8
7.490296733721186 error for K = 9
7.573928228851226 error for K = 10
7.580880154071545 error for K = 11
7.620709624858009 error for K = 12
7.702433441773159 error for K = 13
7.745706188130712 error for K = 14
7.855546909761407 error for K = 15
7.970845764140948 error for K = 16
8.00708692880329 error for K = 17
8.05951400020052 error for K = 18
8.105972848197592 error for K = 19
8.171623447622684 error for K = 20
8.208766061680672 error for K = 21
8.266010100575647 error for K = 22
8.280897264278922 error for K = 23
8.326448746059764 error for K = 24
8.38105978099617 error for K = 25
8.410954693047014 error for K = 26
8.478704509976565 error for K = 27
8.50999986845734 error for K = 28
8.538275555508479 error for K = 29
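
The loop above picks K by comparing errors on the test set, which means the test set is no longer an unbiased check of the final model. A common alternative, sketched here rather than taken from the notebook, is to choose K by cross-validation on the training data and touch the test set only once at the end. The example uses synthetic data from `make_regression` as a stand-in, so its numbers say nothing about the Boston results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in with the same shape as the Boston data.
X_demo, y_demo = make_regression(
    n_samples=506, n_features=13, noise=10.0, random_state=5
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=5
)

# 5-fold cross-validation over K = 1..29, using the training data only.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 30))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(x_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
cv_rmse = -search.best_score_  # the scorer returns negated RMSE

# Single final evaluation on the held-out test set with the selected K.
test_rmse = np.sqrt(mean_squared_error(y_te, search.predict(x_te)))
print(best_k, cv_rmse, test_rmse)
```

By default `GridSearchCV` refits the best model on the full training set, so `search.predict` uses the selected K directly.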

**Model Evaluation**

In [30]:
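from sklearn.metrics import mean_squared_error
# RMSE for the KNN model fitted earlier with K = 3
rmse_knn = np.sqrt(mean_squared_error(y_test, pred_knn))
print("error for KNN is {}".format(rmse_knn))
# KNN's error (about 7.01) is higher than linear regression's (about 5.54),
# so linear regression is the better model for this dataset.
# Even the best K found above (K = 4, about 7.00) does not close the gap.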

error for KNN is 7.014927171138291
