<a href="https://colab.research.google.com/github/Salciano/Python/blob/main/Predicting_the_House_Market_(WIP).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [28]:
#@title Analyzing Housing Prices

"""
Predicting the housing market is not magic.

It is possible. However, credible predictions require fact-based data, which we must first analyse to understand what drives the market.
Only then can we start making predictions.

So let's start by importing the libraries and the market data.
"""

import numpy as np
import pandas as pd
import seaborn as snd


# The Boston Housing Dataset, derived from information collected by the US Census Service concerning housing in the area of Boston, MA.
# This copy is hosted by the MIT OpenCourseWare course "The Analytics Edge" (Spring 2017).
df = pd.read_csv("https://ocw.mit.edu/courses/15-071-the-analytics-edge-spring-2017/d4332a3056f44e1a1dec9600a31f21c8_boston.csv", header = 0)

# Note: scikit-learn used to ship this dataset as load_boston(), but that loader was deprecated in version 1.0 and removed in version 1.2.
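Because the CSV is fetched from a remote URL, it can help to confirm that the expected columns arrived before going further. The sketch below is one possible way to do that, not part of the original notebook: `load_housing` and `EXPECTED_COLUMNS` are hypothetical names, and the column list is assumed from the table displayed in the next cell. The function accepts anything `pd.read_csv` understands, so it can be tried offline on an in-memory sample.

```python
import io

import pandas as pd

# Column names as they appear in the displayed table (assumed to be complete).
EXPECTED_COLUMNS = [
    "TOWN", "TRACT", "LON", "LAT", "MEDV", "CRIM", "ZN", "INDUS",
    "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO",
]


def load_housing(source):
    """Read the housing CSV and fail early if an expected column is missing."""
    frame = pd.read_csv(source, header=0)
    missing = [name for name in EXPECTED_COLUMNS if name not in frame.columns]
    if missing:
        raise ValueError(f"CSV is missing expected columns: {missing}")
    return frame


# Offline check using the first two rows shown in the notebook output.
sample_csv = io.StringIO(
    "TOWN,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO\n"
    "Nahant,2011,-70.9550,42.2550,24.0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3\n"
    "Swampscott,2021,-70.9500,42.2875,21.6,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8\n"
)
sample = load_housing(sample_csv)
print(sample.shape)
```

In the notebook itself, the same function could be called with the URL used above in place of `sample_csv`.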



In [29]:
# Let's see if we have our data

display(df)
#print(df)


Unnamed: 0,TOWN,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO
0,Nahant,2011,-70.9550,42.2550,24.0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3
1,Swampscott,2021,-70.9500,42.2875,21.6,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8
2,Swampscott,2022,-70.9360,42.2830,34.7,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8
3,Marblehead,2031,-70.9280,42.2930,33.4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7
4,Marblehead,2032,-70.9220,42.2980,36.2,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,Winthrop,1801,-70.9860,42.2312,22.4,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0
502,Winthrop,1802,-70.9910,42.2275,20.6,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0
503,Winthrop,1803,-70.9948,42.2260,23.9,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0
504,Winthrop,1804,-70.9875,42.2240,22.0,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0
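Displaying the whole frame confirms that it loaded, but a few targeted checks are often easier to read. The following is a minimal sketch on a small stand-in frame built from rows shown above; the name `sample` is hypothetical, and in the notebook the same calls would be run on `df`.

```python
import pandas as pd

# Stand-in frame: four rows and four columns copied from the output above.
sample = pd.DataFrame({
    "TOWN": ["Nahant", "Swampscott", "Swampscott", "Marblehead"],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
    "RM": [6.575, 6.421, 7.185, 6.998],
    "CHAS": [0, 0, 0, 0],
})

n_rows, n_cols = sample.shape                    # size of the table
missing_per_column = sample.isna().sum()         # missing values per column
tracts_per_town = sample["TOWN"].value_counts()  # number of rows per town

print(n_rows, n_cols)
print(sample.dtypes)
print(missing_per_column)
print(tracts_per_town)
```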


In [26]:
"""
Great, we have our data! All 506 rows, one per census tract!

Now, let's analyze our data.

Well, 506 rows is a lot for a person to look at, let alone make sense of.

However, we can compute summary statistics to analyze and understand it.

With this in mind, let's examine each variable individually (univariate analysis).
"""

df.describe()



Unnamed: 0,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,2700.355731,-71.056389,42.21644,22.528854,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534
std,1380.03683,0.075405,0.061777,9.182176,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946
min,1.0,-71.2895,42.03,5.0,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6
25%,1303.25,-71.093225,42.180775,17.025,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4
50%,3393.5,-71.0529,42.2181,21.2,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05
75%,3739.75,-71.019625,42.25225,25.0,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2
max,5082.0,-70.81,42.381,50.0,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0
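In the summary above, the mean of CRIM (about 3.61) sits far above its median (about 0.26), which hints at a long right tail. `describe()` does not report skewness or outlier counts, so the sketch below adds them. It is an illustration under stated assumptions: `univariate_summary` is a hypothetical helper, the 1.5 × IQR rule is just one common convention for flagging outliers, and the toy data is invented rather than taken from the dataset.

```python
import pandas as pd


def univariate_summary(frame):
    """Per-column mean, median, skewness and 1.5 * IQR outlier count."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than 1.5 * IQR beyond a quartile.
    outside = (numeric < (q1 - 1.5 * iqr)) | (numeric > (q3 + 1.5 * iqr))
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),
        "n_outliers": outside.sum(),
    })


# Toy data (not from the dataset): one symmetric column, one right-tailed column.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tailed": [1.0, 1.0, 2.0, 2.0, 50.0],
})
summary = univariate_summary(toy)
print(summary)
```

In the notebook, `univariate_summary(df)` would produce one row per numeric column, which makes it easy to sort by `skew` or `n_outliers` and decide which variables deserve a closer look.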


In [30]:
"""
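Okay, if this looks incomprehensible, you are not alone.

So first, before any analysis, let's understand in simple terms what each variable means.

Note that TOWNNO, CMEDV, B and LSTAT are described by the source below but do not appear as columns in the CSV loaded above.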


TOWN: a factor with levels given by town names

TOWNNO: a numeric vector corresponding to TOWN

TRACT: a numeric vector of tract ID numbers

LON: a numeric vector of tract point longitudes in decimal degrees

LAT: a numeric vector of tract point latitudes in decimal degrees

MEDV: a numeric vector of median values of owner-occupied housing in USD 1000

CMEDV: a numeric vector of corrected median values of owner-occupied housing in USD 1000

CRIM: a numeric vector of per capita crime

ZN: a numeric vector of proportions of residential land zoned for lots over 25000 sq. ft per town (constant for all Boston tracts)

INDUS: a numeric vector of proportions of non-retail business acres per town (constant for all Boston tracts)

CHAS: a factor with levels 1 if tract borders Charles River; 0 otherwise

NOX: a numeric vector of nitric oxides concentration (parts per 10 million) per town

RM: a numeric vector of average numbers of rooms per dwelling

AGE: a numeric vector of proportions of owner-occupied units built prior to 1940

DIS: a numeric vector of weighted distances to five Boston employment centres

RAD: a numeric vector of an index of accessibility to radial highways per town (constant for all Boston tracts)

TAX: a numeric vector full-value property-tax rate per USD 10,000 per town (constant for all Boston tracts)

PTRATIO: a numeric vector of pupil-teacher ratios per town (constant for all Boston tracts)

B: a numeric vector of 1000*(Bk - 0.63)^2 where Bk is the proportion of blacks

LSTAT: a numeric vector of percentage values of lower status population

Source: https://search.r-project.org/CRAN/refmans/spData/html/boston.html

"""

"""



CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000's


post source...

"""

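To check which documented variables are actually present in the loaded data, the names can be compared programmatically. The sketch below is a hedged illustration: `DOCUMENTED`, `CSV_COLUMNS` and `compare_columns` are hypothetical names, the first list is copied from the variable descriptions above, and the second from the header of the displayed table.

```python
# Variable names taken from the documentation quoted above.
DOCUMENTED = [
    "TOWN", "TOWNNO", "TRACT", "LON", "LAT", "MEDV", "CMEDV", "CRIM", "ZN",
    "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO",
    "B", "LSTAT",
]

# Column names taken from the header of the displayed table.
CSV_COLUMNS = [
    "TOWN", "TRACT", "LON", "LAT", "MEDV", "CRIM", "ZN", "INDUS",
    "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO",
]


def compare_columns(documented, present):
    """Return (documented but absent, present but undocumented), in input order."""
    documented_only = [name for name in documented if name not in present]
    present_only = [name for name in present if name not in documented]
    return documented_only, present_only


documented_only, present_only = compare_columns(DOCUMENTED, CSV_COLUMNS)
print(documented_only)  # ['TOWNNO', 'CMEDV', 'B', 'LSTAT']
print(present_only)     # []
```

In the notebook, `list(df.columns)` could be passed in place of `CSV_COLUMNS` so the check follows whatever file was actually loaded.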