# Boston House prices

## GMIT Data Analytics 
### Machine Learning and Statistics 
#### Assignment 2019


![boston](img/Boston.jpg)


# research

[link](https://towardsdatascience.com/machine-learning-project-predicting-boston-house-prices-with-regression-b4e47493633d)
[another link](https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155)

[more link](https://levelup.gitconnected.com/predict-boston-house-prices-using-python-linear-regression-90469e0a341)




## References

## Describe the Boston Housing Dataset

### The dataset

### background

<img src="img/house.png" alt="house" style="width: 500px;"/>

The Boston housing data set originated in a 1978 paper titled *'Hedonic prices and the demand for clean air'* in the **Journal of Environmental Economics and Management, vol. 5**. The paper's authors, Harrison and Rubinfeld, discussed using census housing data to estimate how much people are willing to pay for anti-pollution measures, and how this can prove methodologically problematic.

The **hedonic price method** is an economic pricing model that infers the price of an item (such as housing) by looking at related data such as environmental features, goods or services (e.g. location of property, shopping districts, access to quality schools, crime rates).

Each surrogate feature comes with a cost or benefit that the market (e.g. people who buy houses) considers in assessing what a fair price is for the property. People pay more for a house close to a good school or in a nice low-crime neighbourhood. A similar property in a more challenging area would be marketed at a lower price, as the market would expect a discount to compensate for its poorer utility. [ref](http://www.cbabuilder.co.uk/Quant5.html)
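The hedonic idea can be sketched as an ordinary least-squares fit: if the price of a house is roughly a sum of implicit prices for its features, then regressing observed prices on those features recovers the implicit prices. The example below is a minimal illustration on synthetic data. The feature names, the "true" implicit prices and the noise level are all made up for the example and are not taken from Harrison and Rubinfeld's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_houses = 500

# Hypothetical features (assumptions for illustration only)
rooms = rng.uniform(3, 9, n_houses)     # number of rooms
crime = rng.uniform(0, 10, n_houses)    # crime index
school = rng.uniform(0, 1, n_houses)    # school quality score

# Made-up implicit prices in $1000s: intercept, rooms, crime, school
true_prices = np.array([5.0, 4.0, -1.5, 8.0])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n_houses), rooms, crime, school])
price = X @ true_prices + rng.normal(0, 1.0, n_houses)

# Least squares estimates the implicit price of each feature
estimated_prices, *_ = np.linalg.lstsq(X, price, rcond=None)

for name, est in zip(["intercept", "rooms", "crime", "school"], estimated_prices):
    print(f"{name:>9}: {est:6.2f}")
```

A positive estimate means the market pays a premium for that feature, and a negative one means it expects a discount.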

Harrison and Rubinfeld (1978) found that marginal air pollution damages, as revealed in the housing market, increase with the level of air pollution (as expected) and with household income. That is, households in more polluted areas and households with higher incomes were willing to pay more for a given improvement in air quality. They also found that the estimated willingness to pay for clean air was very sensitive to the specification of the hedonic pricing model, but not to the specification of the air quality demand equation. [ref](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.926.5532&rep=rep1&type=pdf)


The data itself was taken from the 1970 US census, in particular the census tracts of the Boston Standard Metropolitan Statistical Area (SMSA) [ref the paper by h&r](). It is a record of hedonic housing data from various regions around Boston city. Each row contains data relating to a specific region of Boston. In total there are fourteen measures (the columns) and 506 geographical regions sampled (the rows). [REF](https://webcache.googleusercontent.com/search?q=cache:8C4R8IZYvpgJ:https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781789804744/1/ch01lvl1sec11/our-first-analysis-the-boston-housing-dataset+&cd=3&hl=en&ct=clnk&gl=ie&client=firefox-b-d)

The dataset contains no null or missing values. It records hedonic measures of housing in the suburbs around Boston, USA. The 14 fields are as follows.

1. CRIM: This measures the per capita crime rate by town
2. ZN: This measures the proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: This measures the proportion of non-retail business acres per town
4. CHAS: This is the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: This measures the nitric oxides concentration (parts per 10 million)
6. RM: This measures the average number of rooms per dwelling
7. AGE: This measures the proportion of owner-occupied units built prior to 1940
8. DIS: This measures the weighted distances to five Boston employment centres
9. RAD: This measures the index of accessibility to radial highways
10. TAX: This measures the full-value property-tax rate per \$10,000
11. PTRATIO: This measures the pupil-teacher ratio by town
12. B: This measures the 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (very racist!!)
13. LSTAT: This measures the \% lower status of the population
14. MEDV: This measures the median value of owner-occupied homes in \$1000's

[ref](https://archive.fo/5RkVv#selection-6.3-855.2)
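Because the B field is a quadratic transform of the underlying proportion, it is not monotonic: two proportions on either side of 0.63 can give the same B. A small sketch of the formula from the list above (the function name `b_value` is my own and does not come from the dataset documentation):

```python
def b_value(bk: float) -> float:
    """Return B = 1000 * (Bk - 0.63)^2 for a proportion Bk between 0 and 1."""
    return 1000 * (bk - 0.63) ** 2

# Within [0, 1], B is largest at Bk = 0, falls to zero at Bk = 0.63, then rises again
for bk in (0.0, 0.33, 0.63, 0.93):
    print(f"Bk = {bk:.2f} -> B = {b_value(bk):.2f}")
```

At Bk = 0 the formula gives 396.9, which is the largest B value reported in the descriptive statistics for this dataset.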


The data can be found at the above link, but it was also included with the Python package sklearn (up to version 1.1).



### imports

The following python packages were used in analysing the dataset.

In [1]:
#https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns 
import pandas as pd



In [2]:
# using the data from machine learning repository 
# specify the column names
names =["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
# Read the file into a dataframe; columns are separated by one or more whitespace characters
df = pd.read_csv('data/housing.data.csv', sep=r'\s+', header=None, names=names, engine='python')  # file has no header row
print(df)
print(df.head())
print(df.tail())
print(df.describe())
print(df.nunique())

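# isnull() flags each missing cell; any() over the values reports whether at least one exists
if df.isnull().values.any():
    print("Null values present")
else:
    print("No missing values")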


         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS  RAD    TAX  \
0     0.00632  18.0   2.31     0  0.538  6.575   65.2  4.0900    1  296.0   
1     0.02731   0.0   7.07     0  0.469  6.421   78.9  4.9671    2  242.0   
2     0.02729   0.0   7.07     0  0.469  7.185   61.1  4.9671    2  242.0   
3     0.03237   0.0   2.18     0  0.458  6.998   45.8  6.0622    3  222.0   
4     0.06905   0.0   2.18     0  0.458  7.147   54.2  6.0622    3  222.0   
5     0.02985   0.0   2.18     0  0.458  6.430   58.7  6.0622    3  222.0   
6     0.08829  12.5   7.87     0  0.524  6.012   66.6  5.5605    5  311.0   
7     0.14455  12.5   7.87     0  0.524  6.172   96.1  5.9505    5  311.0   
8     0.21124  12.5   7.87     0  0.524  5.631  100.0  6.0821    5  311.0   
9     0.17004  12.5   7.87     0  0.524  6.004   85.9  6.5921    5  311.0   
10    0.22489  12.5   7.87     0  0.524  6.377   94.3  6.3467    5  311.0   
11    0.11747  12.5   7.87     0  0.524  6.009   82.9  6.2267    5  311.0   

Let's see what the sklearn version looks like.

In [13]:
# requires scikit-learn < 1.2, where load_boston still exists
from sklearn.datasets import load_boston
boston_dataset = load_boston()
print(boston_dataset)
print(boston_dataset.keys())
print(boston_dataset.DESCR)
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()

# Only 13 columns: MEDV is missing from the features (see DESCR)


# MEDV is stored in target
boston_dataset.target
# Add the target to the boston dataframe as the MEDV column
boston['MEDV'] = boston_dataset.target
boston.head()

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]]), 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 1

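| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |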


In [20]:
# compare with the dataframe read from the csv; True marks rows where any column differs
ne = (df != boston).any(axis=1)
ne

0      False
1      False
2      False
3       True
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14      True
15     False
16     False
17     False
18     False
19     False
20     False
21      True
22      True
23      True
24     False
25     False
26     False
27      True
28     False
29     False
       ...  
476    False
477    False
478    False
479    False
480     True
481     True
482    False
483    False
484    False
485     True
486    False
487     True
488    False
489    False
490     True
491     True
492    False
493    False
494    False
495    False
496    False
497    False
498     True
499    False
500     True
501     True
502    False
503    False
504    False
505    False
Length: 506, dtype: bool

In [21]:
df.where(df.values==boston.values).notna()

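| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 1 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | False | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 5 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 6 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 7 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 8 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 9 | True | True | True | True | True | True | True | True | True | True | True | True | True | True |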


In [51]:
# look at some of the false values in CRIM
print(df.iloc[3, 0])
print(boston.iloc[3,0])
print(df.iloc[14, 0])
print(boston.iloc[14,0])
print(df.iloc[21, 0])
print(boston.iloc[21,0])
print(df.iloc[22, 0])
print(boston.iloc[22, 0])
print(df.iloc[23, 0])
print(boston.iloc[23, 0])




0.032369999999999996
0.03237
0.6379600000000001
0.63796
0.8520399999999999
0.85204
1.2324700000000002
1.23247
0.9884299999999999
0.98843


The mismatches look like floating-point rounding errors rather than genuine differences in the data.

In [62]:
# round CRIM to 5 decimal places to remove the floating-point noise
df["CRIM"] = df["CRIM"].round(5)
# flag rows where any column still differs between the two dataframes
ne = (df != boston).any(axis=1)
if ne.any():
    print("different databases")
else:
    print("same databases")

same databases


Apart from these rounding differences, the two dataframes are the same. I'll use the csv one for now, but I might change to the boston one later.
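An alternative to rounding one column is to compare the two dataframes within a small tolerance. The sketch below uses `numpy.isclose` on a toy pair of frames built from three of the CRIM values printed above. The names `csv_frame` and `sklearn_frame` are placeholders standing in for `df` and `boston`, and relying on the default tolerances is an assumption that may need tightening for other data.

```python
import numpy as np
import pandas as pd

# Toy frames: the same three CRIM values as parsed from the csv and as given by sklearn
csv_frame = pd.DataFrame({"CRIM": [0.032369999999999996, 0.6379600000000001, 0.8520399999999999]})
sklearn_frame = pd.DataFrame({"CRIM": [0.03237, 0.63796, 0.85204]})

# Exact comparison flags every row, even though the values agree to 5 decimals
exact_mismatch = (csv_frame != sklearn_frame).any(axis=1)
print(exact_mismatch.sum(), "rows differ exactly")

# Tolerance-based comparison treats the floating-point noise as equal
close = np.isclose(csv_frame.to_numpy(), sklearn_frame.to_numpy())
frames_match = bool(close.all())
print("frames match within tolerance:", frames_match)
```

This avoids modifying either dataframe, at the cost of having to choose a tolerance.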

The main value of interest is the house price, df\["MEDV"\], but the assignment asks for a description of the whole dataset rather than a single field in it.

# descriptive statistics

In [64]:
df["MEDV"].describe()
df.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |
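The quartiles in the table can be used for a quick outlier check with the common 1.5 × IQR rule. This is a sketch of the arithmetic for MEDV only, using the 25%, 75%, min and max values copied from the table. The 1.5 multiplier is the usual convention and is an assumption here, not something specified by the assignment.

```python
# Quartiles and extremes for MEDV, copied from the describe() output
q1, q3 = 17.025, 25.0
medv_min, medv_max = 5.0, 50.0

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this are flagged as low outliers
upper_fence = q3 + 1.5 * iqr     # values above this are flagged as high outliers

print(f"IQR = {iqr:.3f}")
print(f"fences = ({lower_fence:.4f}, {upper_fence:.4f})")
print("max above upper fence:", medv_max > upper_fence)
print("min below lower fence:", medv_min < lower_fence)
```

With these numbers the upper fence is about 36.96, so the maximum of 50.0 sits well above it, which hints at a long right tail in the house prices.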


For reference, the fields again:

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: This is the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), i.e. a binary flag for tracts that border the river
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: This measures the average number of rooms per dwelling
7. AGE: This measures the proportion of owner-occupied units built prior to 1940
8. DIS: This measures the weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: This measures the full-value property-tax rate per \$10,000
11. PTRATIO: This measures the pupil-teacher ratio by town
12. B: This measures the 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (very racist!!)
13. LSTAT: This measures the \% lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000's


What would I expect this data to show?

1. CRIM: I would expect high crime to reduce house prices.
2. ZN: I would expect a high proportion of large residential lots to increase prices, as an area of big houses means more expensive houses, but I suspect it is more complicated than that.
3. INDUS: I'd expect industry to reduce house prices up to a point. If the area is very industrialised then house prices might be lower (or higher, as more workers want to live close to work), so I'd expect a non-linear relationship here.
4. CHAS: ignored.
5. NOX: I'd expect high-pollution areas to have lower house prices.
6. RM: I'd expect more rooms to increase house prices.
7. AGE: I'd expect older people to be more represented as home owners, although the field itself records the share of homes built before 1940 rather than the age of residents.
8. DIS: I'd expect distance to work to have a complex relationship. Houses close to work would be more expensive, but as long as the commute was reasonable I'd suspect suburban housing to also be expensive. Outside the commuter zone I'd expect prices to drop.
9. RAD: Radial highways are the main roads radiating out from the city centre. I suspect houses beside these roads would be more expensive as they give quicker access to work, but there is also a negative utility to living close to traffic, so I suppose it depends.
10. TAX: I'd expect higher property tax to be linked to higher house prices, but I think the measure refers to high tax being a disincentive to buying a house at a stated price.
11. PTRATIO: I expect that a better pupil-teacher ratio indicates better pupil learning and higher quality schools, so it would be positively correlated with house prices.
12. B: It was very racist to include this in 1978, but I suppose economists don't care what they measure. I suspect that the measure was included to indicate that areas with a high proportion of African American homeowners were negatively correlated with house prices, a legacy of Jim Crow laws.
13. LSTAT: I would suspect that areas with lower-income people would have cheaper housing.
14. MEDV: This is the value against which the other columns are compared in the hedonic pricing model.

To sum up, I expect a positive correlation between MEDV and RM, AGE, DIS?, ZN and PTRATIO; a negative correlation with CRIM, NOX, TAX, B and LSTAT; and a mixed relationship for INDUS, DIS? and RAD.
I will ignore the CHAS variable.
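These expectations can be checked against the sign of each column's Pearson correlation with MEDV. The helper below is a minimal sketch: `correlation_signs` is a name I made up, and the small frame is synthetic stand-in data rather than the Boston values, so only the mechanics carry over.

```python
import numpy as np
import pandas as pd

def correlation_signs(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return +1, -1 or 0 for each column's Pearson correlation with the target."""
    corr = frame.corr()[target].drop(target)
    return np.sign(corr).astype(int)

# Synthetic stand-in data (not the real dataset)
rng = np.random.default_rng(1)
rm = rng.uniform(4, 8, 200)
lstat = rng.uniform(2, 30, 200)
medv = 5 * rm - 0.8 * lstat + rng.normal(0, 2, 200)
toy = pd.DataFrame({"RM": rm, "LSTAT": lstat, "MEDV": medv})

# Expected signs for the toy data, compared side by side with the observed ones
expected = pd.Series({"RM": 1, "LSTAT": -1})
observed = correlation_signs(toy, "MEDV")
print(pd.DataFrame({"expected": expected, "observed": observed}))
```

On the real data the same helper could be called as `correlation_signs(boston, "MEDV")` and compared with the expected signs listed above.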



## Infer

## predict

In [None]:
boston

In [None]:
boston.isnull().sum()

In [None]:

sns.set(rc={'figure.figsize':(11.7,8.27)})
# histplot with kde=True replaces the deprecated distplot
sns.histplot(boston['MEDV'], bins=30, kde=True)
plt.show()

In [None]:
correlation_matrix = boston.corr().round(2)
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)

In [None]:
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = boston['MEDV']

for i, col in enumerate(features):
    plt.subplot(1, len(features) , i+1)
    x = boston[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')