<a href="https://colab.research.google.com/github/Salciano/Python/blob/main/Predicting_the_House_Market_(WIP).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
#@title Analyzing Housing Prices

"""
It is possible to make predictions about the housing market. This is not magic, but data-driven analytics powered by computing algorithms and machine learning.
However, to make credible predictions, we need fact-based data. Then, we can analyze the data to help understand it and what drives the market, before making predictions.
With this said, bear in mind that there are always margins for error.

So, without further ado, let's start by importing the market data, coding libraries and so on that will be required for this...
"""

import numpy as np
import pandas as pd
import seaborn as sns

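# The Boston Housing Dataset, derived from information collected by the US Census Service concerning housing in the area of Boston, MA.
# This copy is hosted by the MIT OCW course "The Analytics Edge" (Spring 2017). It was also available in scikit-learn via load_boston(), which was removed in scikit-learn 1.2.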
df = pd.read_csv("https://ocw.mit.edu/courses/15-071-the-analytics-edge-spring-2017/d4332a3056f44e1a1dec9600a31f21c8_boston.csv", header = 0)



In [3]:
# Let's see if we have our data

display(df)
#print(df)


Unnamed: 0,TOWN,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO
0,Nahant,2011,-70.9550,42.2550,24.0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3
1,Swampscott,2021,-70.9500,42.2875,21.6,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8
2,Swampscott,2022,-70.9360,42.2830,34.7,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8
3,Marblehead,2031,-70.9280,42.2930,33.4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7
4,Marblehead,2032,-70.9220,42.2980,36.2,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,Winthrop,1801,-70.9860,42.2312,22.4,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0
502,Winthrop,1802,-70.9910,42.2275,20.6,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0
503,Winthrop,1803,-70.9948,42.2260,23.9,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0
504,Winthrop,1804,-70.9875,42.2240,22.0,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0


Great, we have our data! All 506 rows!

Now, if this looks incomprehensible, don't worry: you are not alone.
This is a common problem with datasets and needs to be addressed.
If you understood this, great! Just skip over this part.

So, let's try to decipher this mess...

* There are 16 columns and 506 rows in the dataset, totaling 8096 datapoints (16 × 506).
* Each row has an index number, ordered from 0 to 505, that represents one census tract (a small area containing many homes), not a single house. We can use this index to identify each tract.
* Each column has a cryptic title that represents a variable or attribute of each tract, which we will try to translate:

1. TOWN: The name of the town where the tract is located.
2. TRACT: The ID number of each census tract. We can also use this to identify each row.
3. LON: The geographical longitude in degrees. Perhaps houses with extreme values near the coast might be appealing?
4. LAT: The geographical latitude in degrees. Perhaps some houses will be a bit warmer than others, due to being further south, and this could affect housing prices? We will see...
5. MEDV: The median value of owner-occupied homes in the tract (in 1000 USD). So, 24.0 means 24,000 dollars.
6. CRIM: The crime rate per capita (per person) in each town.
7. ZN: The proportion of residential land zoned for lots over 25,000 square feet, per town (constant for all Boston tracts).
8. INDUS: The proportion of non-retail business acres per town. 1 acre = 4046.856 square meters.
9. CHAS: Whether the tract borders the Charles River. 1 if it does, 0 if it doesn't.
10. NOX: Nitric oxides concentration per town (in parts per 10 million).
11. RM: The average number of rooms per dwelling.
12. AGE: The proportion of owner-occupied housing units built before 1940.
13. DIS: How far away the tract is from five Boston employment centres (weighted distance).
14. RAD: The index of accessibility to radial highways per town.
15. TAX: The full-value property-tax rate per town (per 10,000 USD).
16. PTRATIO: The pupil-teacher ratio (students per teacher) in each town.

Source: https://search.r-project.org/CRAN/refmans/spData/html/boston.html
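
The row, column and datapoint counts above can be checked directly with pandas. Here is a minimal sketch that rebuilds only five of the rows visible in the table above, with a reduced set of columns, as a stand-in for the full file (the `sample` name and the reduced column set are assumptions for illustration). On the full dataset, the same `shape` and `size` attributes should report 506 rows, 16 columns and 8096 datapoints.

```python
import pandas as pd

# Hypothetical stand-in: the first five rows shown in the table above,
# limited to a few columns to keep the example short.
sample = pd.DataFrame(
    {
        "TOWN": ["Nahant", "Swampscott", "Swampscott", "Marblehead", "Marblehead"],
        "TRACT": [2011, 2021, 2022, 2031, 2032],
        "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
        "CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905],
    }
)

n_rows, n_cols = sample.shape  # (rows, columns)
n_datapoints = sample.size     # rows * columns

print(f"{n_rows} rows x {n_cols} columns = {n_datapoints} datapoints")

# The same arithmetic for the full dataset described above
full_rows, full_cols = 506, 16
full_datapoints = full_rows * full_cols
print(full_datapoints)
```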

Some of these variables may not look important at all. We should analyze them later to check whether they affect anything; if they don't, we can safely exclude them from our data. This would improve performance and make it easier to understand what really matters...
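
One way to check whether a variable matters, sketched here as an assumption and not as the method this notebook settles on, is to look at how strongly each numeric column correlates with MEDV. The example uses only the nine rows visible in the table above, so the numbers are illustrative and say nothing conclusive about the full dataset. The `sample`, `corr_with_medv` and `ranking` names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in: the nine rows visible in the displayed table,
# restricted to a few numeric columns.
sample = pd.DataFrame(
    {
        "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2, 22.4, 20.6, 23.9, 22.0],
        "CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905,
                 0.06263, 0.04527, 0.06076, 0.10959],
        "RM": [6.575, 6.421, 7.185, 6.998, 7.147, 6.593, 6.120, 6.976, 6.794],
        "PTRATIO": [15.3, 17.8, 17.8, 18.7, 18.7, 21.0, 21.0, 21.0, 21.0],
    }
)

# Pearson correlation of every column with MEDV (MEDV itself is dropped)
corr_with_medv = sample.corr()["MEDV"].drop("MEDV")

# Rank the variables by the absolute strength of the correlation
ranking = corr_with_medv.abs().sort_values(ascending=False)
print(ranking)
```

With only nine rows these values are unstable; the point is the workflow, which would be rerun on the full `df` before deciding to exclude anything.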


In [10]:
"""
Now that we have deciphered what each variable means, we can take a closer look at the data.

Well, you might be thinking that 8096 datapoints is a lot of data for a person to look at, let alone make some sense out of it or even make predictions with it.

If so, you are absolutely right. This is exactly what machines are great at. So, we can compute this data to analyze and understand it.

For this reason, we will use an extended data dictionary (EDD).

EDDs are great at summarizing or describing the data, without having to go through thousands or millions of datapoints, individually.

With this in mind, let's summarily examine some of these variables, individually (univariate analysis).
"""

df.describe()



Unnamed: 0,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,2700.355731,-71.056389,42.21644,22.528854,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534
std,1380.03683,0.075405,0.061777,9.182176,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946
min,1.0,-71.2895,42.03,5.0,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6
25%,1303.25,-71.093225,42.180775,17.025,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4
50%,3393.5,-71.0529,42.2181,21.2,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05
75%,3739.75,-71.019625,42.25225,25.0,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2
max,5082.0,-70.81,42.381,50.0,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0


In [7]:
"""
Again, if this looks incomprehensible, do not worry. I got your back!

First, let's understand in simple terms what we are looking at, before the analysis.

count: The number of non-missing values found in the variable.
mean: As the name suggests, this is the average or mean of the values.
std: The standard deviation, a measure of how spread out the values are around the mean.
min: The minimum value found.
25%: The 25th percentile. 25 percent of the data is below this value.
50%: The 50th percentile. 50 percent of the data is below this value. This is also the median, since the median is the value right in the middle of the others.
75%: The 75th percentile. 75 percent of the data is below this value.
max: The maximum value found. The max is also the upper end of the last quarter of the data, by the way.

Now that things are a bit clearer, we can start analyzing the data with this EDD.
"""

df.describe()
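
To make the definitions above concrete, here is a small sketch that computes each statistic by hand and compares it with `describe()`. It uses only the nine MEDV values visible in the first table as an illustrative sample (the `medv` and `by_hand` names are hypothetical); the same calls would apply to any numeric column of `df`.

```python
import pandas as pd

# Hypothetical sample: the nine MEDV values visible in the displayed table
medv = pd.Series([24.0, 21.6, 34.7, 33.4, 36.2, 22.4, 20.6, 23.9, 22.0], name="MEDV")

by_hand = pd.Series(
    {
        "count": medv.count(),       # number of non-missing values
        "mean": medv.mean(),         # arithmetic average
        "std": medv.std(),           # sample standard deviation (ddof=1)
        "min": medv.min(),
        "25%": medv.quantile(0.25),  # 25th percentile
        "50%": medv.median(),        # 50th percentile, i.e. the median
        "75%": medv.quantile(0.75),  # 75th percentile
        "max": medv.max(),
    }
)

summary = medv.describe()

# Side-by-side comparison: both columns should agree
print(pd.DataFrame({"by_hand": by_hand, "describe": summary}))
```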







In [8]:

df.describe()








Unnamed: 0,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,2700.355731,-71.056389,42.21644,22.528854,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534
std,1380.03683,0.075405,0.061777,9.182176,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946
min,1.0,-71.2895,42.03,5.0,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6
25%,1303.25,-71.093225,42.180775,17.025,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4
50%,3393.5,-71.0529,42.2181,21.2,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05
75%,3739.75,-71.019625,42.25225,25.0,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2
max,5082.0,-70.81,42.381,50.0,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0


TRACT

count: 506. There are 506 values in the TRACT variable. So, 506 IDs.

mean: ~2700. The mean is approximately 2700. This is mostly irrelevant to us.

The average of ID numbers is of no interest to us. So, we will be skipping such irrelevant variables henceforth, after this explanation.
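
Skipping identifier-like columns can also be done in code, so they never show up in the summary. A minimal sketch, assuming TRACT is the only ID column we want to leave out and using five rows from the table above as a stand-in for `df` (the `sample` and `id_columns` names are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for df: five rows from the displayed table
sample = pd.DataFrame(
    {
        "TRACT": [2011, 2021, 2022, 2031, 2032],
        "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
        "CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905],
    }
)

# Columns that only identify a row and carry no statistical meaning
id_columns = ["TRACT"]

# Summarize everything except the identifier columns
summary = sample.drop(columns=id_columns).describe()
print(summary)
```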







CRIM

count: 506. There are 506 datapoints on crime rate. So, all rows were accounted for.

mean: ~3.614.

min: ~0.006.
25%: ~0.082.
50%: ~0.257.
75%: ~3.677.
max: ~88.976.

You may have noticed that there is a big discrepancy between the maximum crime rate and the 75th percentile.
75% of the data has a crime rate of 3.677 or lower, while the maximum value found was 88.976.
The mean (~3.614) is also far above the median (~0.257), which is another sign that a few very large values pull the average up.
This indicates that there are outliers skewing the data.
There could be many reasons for this. Sometimes mistakes are made when collecting data. Sometimes there are exceptional events that distort our understanding of the data in general.
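
One common rule of thumb for flagging such values, which is an assumption added here and not something this notebook has applied, is the 1.5 × IQR rule: anything above the 75th percentile plus 1.5 times the interquartile range (75th minus 25th percentile) is treated as a potential outlier. The sketch below plugs in the CRIM figures from the summary table; the variable names are illustrative.

```python
# CRIM statistics copied from the df.describe() output above
crim_mean = 3.613524
crim_q1 = 0.082045     # 25%
crim_median = 0.25651  # 50%
crim_q3 = 3.677083     # 75%
crim_max = 88.9762

# Interquartile range: the spread of the middle 50% of the data
iqr = crim_q3 - crim_q1

# Values above this threshold are flagged as potential outliers
upper_fence = crim_q3 + 1.5 * iqr

print(f"IQR: {iqr:.3f}")
print(f"Upper fence: {upper_fence:.3f}")
print(f"Max exceeds fence: {crim_max > upper_fence}")
print(f"Mean is {crim_mean / crim_median:.1f} times the median")
```

Under this rule the fence works out to roughly 9.07, so the maximum of 88.976 sits far beyond it. Whether such values are errors or genuine extremes still has to be judged from the data itself.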















