## Boston House Prices

Linear regression is one of the fundamental statistical and machine learning techniques. Whether you want to do statistics, machine learning, or scientific computing, there is a good chance that you’ll need it. It’s advisable to learn it first and then move on to more complex methods.
https://realpython.com/linear-regression-in-python/





https://www.kaggle.com/prasadperera/the-boston-housing-dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. There are 506 records and 14 attributes (including price). The dataset columns are described below:

CRIM - per capita crime rate by town

ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS - proportion of non-retail business acres per town.

CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOX - nitric oxides concentration (parts per 10 million)

RM - average number of rooms per dwelling

AGE - proportion of owner-occupied units built prior to 1940

DIS - weighted distances to five Boston employment centres

RAD - index of accessibility to radial highways

TAX - full-value property-tax rate per $10,000

PTRATIO - pupil-teacher ratio by town

B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT - % lower status of the population

MEDV - Median value of owner-occupied homes in $1000's


In [20]:
# import libraries that will be used in completing this assignment
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import scipy.stats as stats
import pandas as pd
import seaborn as sns #plotting#

In [21]:
# The Boston House Prices dataset can be imported directly from sklearn.
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier release.
# https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
# https://medium.com/@haydar_ai/learning-data-science-day-9-linear-regression-on-boston-housing-dataset-cd62a80775ef (repeated)
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html

from sklearn.datasets import load_boston
boston_dataset = load_boston()  # returns a Bunch, a dictionary-like object holding the data, target, feature names and description
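On scikit-learn 1.2 and later, `load_boston` is no longer available and the import above fails. The sketch below is one possible workaround, not the method this notebook originally used. It assumes the raw StatLib file layout that the scikit-learn deprecation notice pointed to (http://lib.stat.cmu.edu/datasets/boston), in which each record is split over two lines: 11 values, then 3. The helper name `load_boston_raw` and its parameters are hypothetical.

```python
import io

import numpy as np
import pandas as pd

FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def load_boston_raw(source, skiprows=22):
    """Parse the raw two-lines-per-record file into a DataFrame (hypothetical helper).

    source: a URL, path, or file-like object.
    skiprows=22 assumes the descriptive header of the StatLib copy of the file.
    """
    raw = pd.read_csv(source, sep=r"\s+", skiprows=skiprows, header=None)
    # Even rows hold the first 11 features; odd rows hold B, LSTAT and MEDV.
    features = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
    frame = pd.DataFrame(features, columns=FEATURE_NAMES)
    frame["MEDV"] = raw.values[1::2, 2]
    return frame


# Offline check with two records written in the assumed layout.
sample = io.StringIO(
    "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30\n"
    "396.90 4.98 24.00\n"
    "0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80\n"
    "396.90 9.14 21.60\n"
)
sample_df = load_boston_raw(sample, skiprows=0)
print(sample_df)
```

With network access, `load_boston_raw("http://lib.stat.cmu.edu/datasets/boston")` would be the corresponding call, under the same layout assumption. Note that this helper already includes `MEDV`, whereas the notebook adds it in a separate step.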

In [22]:
df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
print(boston_dataset.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR'])





In [23]:
df.head()

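| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |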


The price column does not appear because it is stored separately under the "target" key. It needs to be added to the DataFrame, as it is the most important data element. The approach for getting this initial dataset into the required DataFrame draws heavily on https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155.

In [24]:
df['MEDV'] = boston_dataset.target

In [25]:
df.head() 

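| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |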


MEDV (median value of owner-occupied homes in $1000's) is the key value in the dataset: it is the target against which the other variables are measured for "best fit". The aim here is to explore the dataset for variables that show a close relationship to the house price.
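One common way to look for such relationships is to rank the features by their Pearson correlation with `MEDV`. The sketch below is an illustration rather than part of the original analysis: the helper `rank_by_correlation` is hypothetical, and it runs on only the five rows shown by `df.head()` (three features plus the target), so the numbers it prints are not representative of the full 506 records.

```python
import pandas as pd

# The five rows displayed by df.head(), restricted to three features plus the target.
head_rows = pd.DataFrame({
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    "PTRATIO": [15.3, 17.8, 17.8, 18.7, 18.7],
    "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
})


def rank_by_correlation(frame, target="MEDV"):
    """Return Pearson correlations with `target`, strongest (by magnitude) first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranking = rank_by_correlation(head_rows)
print(ranking)
```

On the full dataset the equivalent call would be `rank_by_correlation(df)`. Correlation only captures linear association, so it is a starting point for choosing candidate predictors rather than a final answer.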

In [28]:
df.MEDV.describe() # summary of house prices

count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64
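As a small worked example of reading this summary, the quartiles can be turned into an interquartile range and outlier fences. The 1.5 × IQR rule (Tukey's rule) used here is an assumption chosen for illustration; the original analysis does not apply it.

```python
# Quartiles, minimum and maximum copied from the df.MEDV.describe() output above.
q1, q3 = 17.025, 25.0
medv_min, medv_max = 5.0, 50.0

iqr = q3 - q1                   # spread of the middle 50% of prices
lower_fence = q1 - 1.5 * iqr    # values below this are flagged as unusually low
upper_fence = q3 + 1.5 * iqr    # values above this are flagged as unusually high

print(f"IQR = {iqr:.3f}")
print(f"fences = ({lower_fence:.4f}, {upper_fence:.4f})")
print("min below lower fence:", medv_min < lower_fence)
print("max above upper fence:", medv_max > upper_fence)
```

Under this rule the maximum of 50.0 lies well above the upper fence, which suggests the upper tail of `MEDV` deserves a closer look before a regression is fitted.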

In [30]:
df['MEDV'].groupby(df['CHAS']).describe()  # summary of house prices: tracts bounding the Charles River (CHAS = 1) vs. all others (CHAS = 0)

| CHAS | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 471.0 | 22.093843 | 8.831362 | 5.0 | 16.6 | 20.9 | 24.8 | 50.0 |
| 1.0 | 35.0 | 28.44 | 11.816643 | 13.4 | 21.1 | 23.3 | 33.15 | 50.0 |
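The riverside tracts have a higher mean price (28.44 versus 22.09), but there are only 35 of them. A rough way to check whether that gap could be chance is a two-sample t-test computed directly from the summary figures above. The choice of Welch's unequal-variance test is an assumption made for this sketch; it is not something the original analysis specifies.

```python
from scipy import stats

# Group summaries copied from the groupby output above.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=22.093843, std1=8.831362, nobs1=471,   # CHAS = 0 (not on the river)
    mean2=28.44, std2=11.816643, nobs2=35,       # CHAS = 1 (bounds the river)
    equal_var=False,                             # Welch's t-test: variances may differ
)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A small p-value here would only indicate that the two group means differ; it does not account for the other variables in the dataset, which is what the regression is for.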
