# Boston House prices

# Research

[link](https://towardsdatascience.com/machine-learning-project-predicting-boston-house-prices-with-regression-b4e47493633d)
[another link](https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155)



## References

## Describe

### The data

### Background

The Boston housing data set originated in a 1978 paper titled 'Hedonic housing prices and the demand for clean air', published in the Journal of Environmental Economics and Management, vol. 5. The paper's authors, Harrison and Rubinfeld, discussed the problems that arise when housing market data is used to estimate how much people are willing to pay for anti-pollution measures.

The hedonic price method infers the price of an item (such as housing) by looking at the features, goods or services associated with it, such as the location of the property, local pollution levels and crime rates in the locality. Each ancillary feature comes with a cost or benefit that the market considers in assessing what a fair price is for the property. People may pay more for a house close to a good school or in a nice, low-crime neighbourhood. A similar property in a more challenging area would be marketed at a lower price, as the market would expect a discount as compensation for poorer utility. [ref](http://www.cbabuilder.co.uk/Quant5.html)
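To make the idea concrete, here is a minimal sketch of a hedonic regression. It is not the model from the paper: the attributes, the "implicit prices" and the data are all made up for illustration, and ordinary least squares is just one simple way to estimate how much each attribute adds to or takes from the price.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attributes for 200 synthetic homes (NOT the Boston data)
n = 200
rooms = rng.uniform(3, 9, n)          # number of rooms
pollution = rng.uniform(0.3, 0.9, n)  # made-up pollution index
crime = rng.uniform(0, 10, n)         # made-up crime rate

# Assumed implicit prices used to generate the synthetic sale prices:
# intercept, value per room, cost per pollution unit, cost per crime unit
true_prices = np.array([5.0, 4.0, -10.0, -0.5])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), rooms, pollution, crime])
price = X @ true_prices + rng.normal(0, 0.2, n)

# Ordinary least squares recovers an implicit price for each attribute
estimated, *_ = np.linalg.lstsq(X, price, rcond=None)

for label, value in zip(["intercept", "rooms", "pollution", "crime"], estimated):
    print(f"{label:>10}: {value:6.2f}")
```

The negative coefficients on pollution and crime are the "discount" the market expects for poorer utility; the positive coefficient on rooms is the premium paid for a desirable feature.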

Harrison and Rubinfeld (1978) found that marginal air pollution damages increase with the level of air pollution and with household income. They also found that estimates of households' willingness to pay for clean air were sensitive to the specification of the hedonic housing value equation, but much less so to the specification of the air quality demand equation. [ref](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.926.5532&rep=rep1&type=pdf)


The data itself consists of 506 records, each with 14 fields. The dataset contains no null or missing values. It describes hedonic measures of housing in the suburbs around Boston, USA. The 14 fields are as follows.

1. CRIM: This measures the per capita crime rate by town
2. ZN: This measures the proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: This measures the proportion of non-retail business acres per town
4. CHAS: This is the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: This measures the nitric oxides concentration (parts per 10 million)
6. RM: This measures the average number of rooms per dwelling
7. AGE: This measures the proportion of owner-occupied units built prior to 1940
8. DIS: This measures the weighted distances to five Boston employment centres
9. RAD: This measures the index of accessibility to radial highways
10. TAX: This measures the full-value property-tax rate per $10,000
11. PTRATIO: This measures the pupil-teacher ratio by town
12. B: This measures the 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (very racist!!)
13. LSTAT: This measures the % lower status of the population
14. MEDV: This measures the median value of owner-occupied homes in $1000's
[ref](https://archive.fo/5RkVv#selection-6.3-855.2)


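The data can be found at the above link, but it is also included with the Python package scikit-learn (sklearn) via `load_boston`. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so an older release is needed to load it this way.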



In [8]:
# Let's check the raw data file first, before looking at the sklearn copy
import pandas as pd

names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# With the default comma separator, each whitespace-delimited row is read as
# one long string in column 0
data = pd.read_csv("data/housing.data.csv", header=None)

# Yikes: some fields are separated by 3 spaces and some by 2, so collapse every
# run of whitespace into a single space
# https://www.geeksforgeeks.org/replace-values-in-pandas-dataframe-using-regex/
data = data.replace(to_replace=r'\s+', value=' ', regex=True)

# New data frame with each row string split at the first space only (n=1)
df = data[0].str.split(" ", n=1, expand=True)

# Each row starts with a space, so df[0] is empty and df[1] holds the rest of
# the row: CRIM ends up blank and ZN is not a single field yet
data["CRIM"] = df[0]
data["ZN"] = df[1]

# Still to do: split on every space, assign df[2] ... df[13] to the remaining
# names, then drop the original column 0. Reading the file with sep=r'\s+'
# would avoid the manual split altogether.

# df display
data

|   | 0 | CRIM | ZN |
|---|---|------|----|
| 0 | 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0... |  | 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.09... |
| 1 | 0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.96... |  | 0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.967... |
| 2 | 0.02729 0.00 7.070 0 0.4690 7.1850 61.10 4.96... |  | 0.02729 0.00 7.070 0 0.4690 7.1850 61.10 4.967... |
| 3 | 0.03237 0.00 2.180 0 0.4580 6.9980 45.80 6.06... |  | 0.03237 0.00 2.180 0 0.4580 6.9980 45.80 6.062... |
| 4 | 0.06905 0.00 2.180 0 0.4580 7.1470 54.20 6.06... |  | 0.06905 0.00 2.180 0 0.4580 7.1470 54.20 6.062... |
| 5 | 0.02985 0.00 2.180 0 0.4580 6.4300 58.70 6.06... |  | 0.02985 0.00 2.180 0 0.4580 6.4300 58.70 6.062... |
| 6 | 0.08829 12.50 7.870 0 0.5240 6.0120 66.60 5.5... |  | 0.08829 12.50 7.870 0 0.5240 6.0120 66.60 5.56... |
| 7 | 0.14455 12.50 7.870 0 0.5240 6.1720 96.10 5.9... |  | 0.14455 12.50 7.870 0 0.5240 6.1720 96.10 5.95... |
| 8 | 0.21124 12.50 7.870 0 0.5240 5.6310 100.00 6.... |  | 0.21124 12.50 7.870 0 0.5240 5.6310 100.00 6.0... |
| 9 | 0.17004 12.50 7.870 0 0.5240 6.0040 85.90 6.5... |  | 0.17004 12.50 7.870 0 0.5240 6.0040 85.90 6.59... |
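The display above shows that the rows are still packed into one string: CRIM is blank and ZN holds everything else. One possible way around this (a sketch, not what the cell above does) is to let `pd.read_csv` split on any run of whitespace with `sep=r"\s+"` and pass the column names directly. The three sample rows below are made-up values that only mimic the file's layout (leading space, 2 or 3 spaces between fields); they are not real records.

```python
import io

import pandas as pd

names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
         "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Made-up rows imitating the raw layout: leading space, uneven spacing
sample = (
    " 0.10  12.50   7.87  0  0.524  6.01   66.6  5.56   5  311.0  15.2  395.6  12.4  22.9\n"
    " 0.25   0.00   8.14  0  0.538  5.93   81.7  4.50   4  307.0  21.0  380.0  15.7  19.6\n"
    " 1.30   0.00  19.58  1  0.871  6.20   93.0  1.60   5  403.0  14.7  390.1  10.1  25.0\n"
)

# sep=r"\s+" treats any run of whitespace as one delimiter and ignores the
# leading space, so no manual regex replace or string split is needed
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=names)

print(df.shape)
print(df.dtypes)
```

For the real file, the `io.StringIO(sample)` argument would be swapped for the path used earlier (`"data/housing.data.csv"`), assuming that file has the same whitespace-delimited layout.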


## Infer

## Predict

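In [None]:
# https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_boston

%matplotlib inline

# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release
boston_dataset = load_boston()
print(boston_dataset)         # dumps the whole Bunch object (data, target, names, description)
print(boston_dataset.keys())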

In [None]:
# Print the dataset description, then put the 13 feature columns in a DataFrame
print(boston_dataset.DESCR)
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()


In [None]:
boston

In [None]:
boston.isnull().sum()

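In [None]:
# Add the target (median home value) as a column and plot its distribution
boston['MEDV'] = boston_dataset.target
sns.set(rc={'figure.figsize':(11.7,8.27)})
# sns.distplot is deprecated; histplot with a KDE overlay gives a similar plot
sns.histplot(boston['MEDV'], bins=30, kde=True, stat="density")
plt.show()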

In [None]:
correlation_matrix = boston.corr().round(2)
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)
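A heatmap with 14 variables can be hard to scan, so it may help to rank the features by the absolute value of their correlation with MEDV. The sketch below does that on a small synthetic stand-in frame (the values and the relationship between the columns are made up); with the real data the same lines would run on `boston` instead of `frame`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in frame (NOT the Boston data): MEDV is built from RM and
# LSTAT plus noise, while CHAS is an unrelated 0/1 column
frame = pd.DataFrame({
    "RM": rng.uniform(4, 8.5, n),
    "LSTAT": rng.uniform(2, 35, n),
    "CHAS": rng.integers(0, 2, n),
})
frame["MEDV"] = 5.0 * frame["RM"] - 0.6 * frame["LSTAT"] + rng.normal(0, 2.0, n)

# Correlation of every feature with the target, strongest first
corr_with_target = frame.corr()["MEDV"].drop("MEDV")
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)

top_two = list(ranked.index[:2])
print(ranked.round(2))
```

Sorting on the absolute value keeps strong negative correlations near the top, which matters for a feature such as LSTAT that is expected to move against price.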

In [None]:
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = boston['MEDV']

for i, col in enumerate(features):
    plt.subplot(1, len(features) , i+1)
    x = boston[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')
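A natural next step after the scatter plots is to fit a linear regression of MEDV on LSTAT and RM, in the spirit of the article linked at the top of this section. The following is only a sketch under assumptions: it uses a synthetic stand-in frame called `demo` (made-up values, not the Boston data) so that it runs without `load_boston`, and the 80/20 split and `random_state` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in for the two chosen features and the target
demo = pd.DataFrame({
    "LSTAT": rng.uniform(2, 35, n),
    "RM": rng.uniform(4, 8.5, n),
})
demo["MEDV"] = 5.0 * demo["RM"] - 0.6 * demo["LSTAT"] + rng.normal(0, 2.0, n)

X = demo[["LSTAT", "RM"]]
y = demo["MEDV"]

# Hold out 20% of the rows to evaluate the fit on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out rows
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"coefficients: {dict(zip(X.columns, model.coef_.round(2)))}")
print(f"RMSE: {rmse:.2f}, R2: {r2:.2f}")
```

With the real data, `demo` would be replaced by the `boston` frame built above; the scores would then reflect the actual relationship rather than the one baked into the synthetic values.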