In [1]:
# Imports
import numpy as np
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Allow Altair to make plots using more than 5000 rows
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Description of the Dataset

This data was collected as part of the 1990 census in California. An analysis of it first appeared in the 1997 paper Sparse Spatial Autoregressions by R. Kelley Pace and Ronald Barry. Aurélien Géron later extended the dataset with a categorical column describing proximity to the ocean. Each row describes a single census block group in California: the housing in that block group and the population living there.

## Load the Data

In [20]:
train = pd.read_csv('https://github.com/UBC-MDS/DSCI_522_GROUP_312/blob/master/data/train.csv?raw=true')
test = pd.read_csv('https://github.com/UBC-MDS/DSCI_522_GROUP_312/blob/master/data/test.csv?raw=true')


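# The first nine columns are the explanatory variables;
# the last column, median_house_value, is the response.
X_train = train.iloc[:, :9]
y_train = train['median_house_value']
X_test = test.iloc[:, :9]
y_test = test['median_house_value']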

## Explore the Dataset

In [22]:
train.describe(include='all')

| | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | latitude | longitude | median_house_value |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 16348.0 | 16348.0 | 16348.0 | 16348.0 | 16348.0 | 16348.0 | 16348 | 16348.0 | 16348.0 | 16348.0 |
| unique | | | | | | | 5 | | | |
| top | | | | | | | <1H OCEAN | | | |
| freq | | | | | | | 7219 | | | |
| mean | 28.547162 | 2640.916687 | 538.405004 | 1425.744556 | 499.875398 | 38712.215745 | | 35.643701 | -119.576828 | 206768.812943 |
| std | 12.54914 | 2179.474584 | 417.639062 | 1104.311037 | 378.105576 | 19021.014211 | | 2.141318 | 2.004809 | 115219.386524 |
| min | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 4999.0 | | 32.55 | -124.35 | 14999.0 |
| 25% | 18.0 | 1462.0 | 298.0 | 792.0 | 282.0 | 25643.75 | | 33.93 | -121.8 | 119500.0 |
| 50% | 29.0 | 2140.0 | 437.0 | 1170.0 | 411.5 | 35313.0 | | 34.26 | -118.51 | 179750.0 |
| 75% | 37.0 | 3154.0 | 648.0 | 1726.0 | 606.0 | 47473.5 | | 37.72 | -118.01 | 264700.0 |


In [24]:
print("There are {0} records in the training portion of the dataset. Each record is a census block group.\n".format(
    X_train.count().min()))

print("The median age of houses/complexes in census blocks ranges from {0} to {1} with a mean of {2} years old and a median of {3} years old.\n".format(
    X_train['housing_median_age'].min(),
    X_train['housing_median_age'].max(),
    round(X_train['housing_median_age'].mean(), 2),
    X_train['housing_median_age'].median()))

print("The total number of rooms in a census block ranges from {0} to {1} with a mean of {2} rooms and a median of {3} rooms.\n".format(
    X_train['total_rooms'].min(),
    X_train['total_rooms'].max(),
    round(X_train['total_rooms'].mean(), 2),
    X_train['total_rooms'].median()))

print("The number of bedrooms in a census block ranges from {0} to {1} with a mean of {2} bedrooms and a median of {3} bedrooms.\n".format(
    X_train['total_bedrooms'].min(),
    X_train['total_bedrooms'].max(),
    round(X_train['total_bedrooms'].mean(), 2),
    X_train['total_bedrooms'].median()))

print("The population of a census block ranges from {0} to {1} with a mean of {2} and a median of {3}.\n".format(
    X_train['population'].min(),
    X_train['population'].max(),
    round(X_train['population'].mean(), 2),
    X_train['population'].median()))

print("The number of households in a census block ranges from {0} to {1} with a mean of {2} and a median of {3}.\n".format(
    X_train['households'].min(),
    X_train['households'].max(),
    round(X_train['households'].mean(), 2),
    X_train['households'].median()))

print("The median annual income in a census block ranges from ${0} to ${1} with a mean of ${2} and a median of ${3}.\n".format(
    X_train['median_income'].min(),
    X_train['median_income'].max(),
    round(X_train['median_income'].mean(), 2),
    X_train['median_income'].median()))

print("Ocean Proximity is a categorical value with one of the values:\n\t{0}, \n\t{1} (very close to the ocean), \n\t{2}, \n\t{3}, \n\t{4}.\n".format(
    X_train['ocean_proximity'].unique()[0].lower(),
    X_train['ocean_proximity'].unique()[1].lower(),
    X_train['ocean_proximity'].unique()[2].lower(),
    X_train['ocean_proximity'].unique()[3].lower(),
    X_train['ocean_proximity'].unique()[4].lower()))

print("The median house value in a census block ranges from ${0} to ${1} with a mean of ${2} and a median of ${3}.\n".format(
    y_train.min(),
    y_train.max(),
    round(y_train.mean(), 2),
    y_train.median()))

There are 16348 records in the training portion of the dataset. Each record is a census block group.

The median age of houses/complexes in census blocks ranges from 1 to 52 with a mean of 28.55 years old and a median of 29.0 years old.

The total number of rooms in a census block ranges from 2 to 37937 with a mean of 2640.92 rooms and a median of 2140.0 rooms.

The number of bedrooms in a census block ranges from 1 to 6445 with a mean of 538.41 bedrooms and a median of 437.0 bedrooms.

The population of a census block ranges from 3 to 28566 with a mean of 1425.74 and a median of 1170.0.

The number of households in a census block ranges from 1 to 6082 with a mean of 499.88 and a median of 411.5.

The median annual income in a census block ranges from $4999.0 to $150001.0 with a mean of $38712.22 and a median of $35313.0.

Ocean Proximity is a categorical value with one of the values:
	near bay, 
	<1h ocean (very close to the ocean), 
	inland, 
	near ocean, 
	island.

The median house 

## Initial Thoughts

- Blocks vary drastically in their total number of rooms and number of bedrooms.
- The lowest median-income blocks differ widely in their number of households; some highly populated blocks have very low income.
- Blocks with newer houses vary widely in value, so the relationship between age and value is worth studying.
- 7 of the 10 lowest-value blocks are INLAND, so proximity to the ocean appears to heavily influence value.
- Median house value and median income both appear to be capped, at roughly 500000 and 150000 respectively.
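
One way to check the capping suspicion is to measure how many records sit exactly at a column's maximum: a continuous variable should have very few, while a capped one piles up there. The sketch below uses synthetic data rather than the housing data, and `share_at_max` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np
import pandas as pd


def share_at_max(values: pd.Series) -> float:
    """Return the fraction of entries equal to the series maximum."""
    return float((values == values.max()).mean())


# Synthetic stand-in for a value column (assumption: not the real data).
rng = np.random.default_rng(0)
raw = pd.Series(rng.normal(300_000, 150_000, size=1_000))

# Simulate a reporting cap by clipping everything above 500000.
capped = raw.clip(upper=500_000)

print(f"share at max, uncapped: {share_at_max(raw):.3f}")
print(f"share at max, capped:   {share_at_max(capped):.3f}")
```

Applied to the notebook's data, the same helper could be called as `share_at_max(y_train)` or `share_at_max(X_train['median_income'])`; a share far above `1 / len(y_train)` would support the capping hypothesis.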

## Research Question

Our research will focus on predicting the median house value of a census block group from the explanatory variables describing its location, housing, and population.

## Analysis and Visualizations

The following plots visualize the relationship between several explanatory variables and the response variable, median house value. Some variables appear to have a roughly linear relationship with the response, while others show high variance.

![Image](total-rooms_scatterplot.png) ![Image](total-bedrooms_scatterplot.png)
![Image](households_scatterplot.png) ![Image](population_scatterplot.png)
![Image](median-age_scatterplot.png) ![Image](median-income_scatterplot.png)

This raises the question of whether the explanatory variables are correlated with one another. The following heatmap shows the pairwise correlations between all variables. The negative correlation between latitude and longitude reflects California's geography (the state runs from northwest to southeast) rather than anything about housing.

![Image](correlation_heatmap.png)
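
The values behind a heatmap like this can be produced with `DataFrame.corr()` and then reshaped into long form (one row per variable pair), which is the layout most plotting libraries expect for a heatmap. This is a minimal sketch on synthetic data, not the code that produced the figure; the frame `demo` and the column names `var_x`, `var_y`, and `correlation` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: bedrooms are built as a noisy fraction of rooms.
rng = np.random.default_rng(0)
n = 500
rooms = rng.gamma(shape=2.0, scale=1000.0, size=n)
demo = pd.DataFrame({
    "total_rooms": rooms,
    "total_bedrooms": 0.2 * rooms + rng.normal(0, 50, size=n),
    "median_income": rng.normal(40_000, 15_000, size=n),
})

# Pairwise Pearson correlations (the pandas default).
corr = demo.corr()

# Reshape the square matrix to long form: one row per (var_x, var_y) pair.
corr_long = (
    corr.rename_axis("var_x")
        .reset_index()
        .melt(id_vars="var_x", var_name="var_y", value_name="correlation")
)

print(corr.round(2))
print(corr_long.head())
```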

Building on this, we can compute Variance Inflation Factors (VIFs) to quantify multicollinearity among the explanatory variables. As a common rule of thumb, a VIF above 5 indicates problematic multicollinearity, and a VIF above 10 is considered severe.

In [25]:
pd.read_csv('vif_table.csv')

| | variable | vif_val |
|---|---|---|
| 0 | housing_median_age | 1.163905 |
| 1 | total_rooms | 11.849443 |
| 2 | total_bedrooms | 34.891047 |
| 3 | population | 6.582837 |
| 4 | households | 33.871693 |
| 5 | median_income | 1.524263 |
| 6 | intercept | 18.278944 |


Four of the six explanatory variables have VIFs above 5 (total_rooms, total_bedrooms, population, and households), so there are strong linear relationships among the independent variables.

The highest VIF is for total_bedrooms, which appears to be linearly related to total_rooms (bedrooms are included in the room count).

![Image](total-rooms_total-bedrooms.png)

## Summary and Conclusions

The California Housing dataset contains nine variables that we identified as explanatory for the response variable, the median house value per census block group in California. Previous analyses suggest that the response can be modelled reasonably well with linear methods. Some variables have stronger linear relationships with the response than others, and there is strong multicollinearity among several explanatory variables, which can make regression coefficients unstable and hard to interpret. To predict the median house value, we will compare linear regression, k-nearest neighbours (KNN), and random forest regressors.
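
As a rough sketch of what such a comparison might look like, the three model families can be scored with the same cross-validation splits. Everything below is an assumption for illustration: the data is synthetic, and the hyperparameters, the 5-fold split, and the R² metric are placeholders rather than choices made in this analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem with a mostly linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# Placeholder hyperparameters; a real comparison would tune these.
models = {
    "linear_regression": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

On the housing data, `X` would need the categorical `ocean_proximity` column encoded numerically before any of these estimators could be fitted.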