# Exercise 02 - Neural Networks for Regression in Keras

## Part I - Exploratory Data Analysis

The overall objective of this exercise is to build and train a neural network model for predicting housing prices in California using the California census data. By learning from this data, the model should be able to predict the **median house price of any district**, given all the input attributes. In this first part of the exercise, we apply some statistics and visualization techniques in order to explore and get to know the data.

**Learning Objectives:**

- Learn how to perform Exploratory Data Analysis in order to get insight into the dataset.

## Dataset

In this exercise, we use the [California Housing dataset](https://scikit-learn.org/stable/datasets/index.html) from the StatLib repository (http://lib.stat.cmu.edu/datasets/). This dataset was derived from the 1990 U.S. census and has one *row per census block group*. A **block group** is the smallest geographical unit for which the U.S. Census Bureau publishes sample data. A census block group typically covers a population of 600 to 3,000 people. For the sake of brevity, we call these census block groups "districts" in this exercise.

The original dataset appeared in R. Kelley Pace and Ronald Barry, “Sparse Spatial Autoregressions,” Statistics & Probability Letters 33, no. 3 (1997): 291–297. Although it is not very recent, it has many qualities that make it well suited for learning how to build data-driven models. The dataset comes in tabular form, comprising more than 20,600 records with 8 input attributes (features), such as population and median income, for each *census block group* in California:

- MedInc - median income in block
- HouseAge - median house age in block
- AveRooms - average number of rooms
- AveBedrms - average number of bedrooms
- Population - block population
- AveOccup - average house occupancy
- Latitude - house block latitude
- Longitude - house block longitude

The target variable is the median house value for California districts, expressed in units of 100,000 USD.

First, we create a new directory called `datasets` in the current path `.` where the data is to be stored, if that directory does not exist.

In [None]:
import os

if not os.path.exists('./datasets'):
    os.mkdir('./datasets')

The housing prices dataset can be (down)loaded using the sklearn function `fetch_california_housing()` from the [datasets module](https://scikit-learn.org/stable/datasets/index.html), which stores the data in a user specified home directory `data_home`, and also returns the input features and target values.

In [None]:
from sklearn.datasets import fetch_california_housing

X_full, y_full = fetch_california_housing(data_home = r'./datasets', return_X_y=True) 

The data is returned as two NumPy arrays. To verify this, we apply the `type` function to the returned objects. To see how many samples (rows) and how many features (columns) we have, we use the `shape` attribute.

In [None]:
print(type(X_full))
print(X_full.shape)

print()

print(type(y_full))
print(y_full.shape)

## Data as DataFrames

To be able to work with the data more easily, we convert it into a (pandas) DataFrame with named columns (attributes). To do this, we first need to create a list of attribute names and then we can construct a DataFrame object from the NumPy array of input data.

In [None]:
import pandas as pd

# Column names for the 8 input features, in the order returned by scikit-learn
attributes = ['median_income', 'housing_median_age', 'aveRooms', 'aveBedrms', 'population', 'aveOccup', 'latitude', 'longitude']

housing = pd.DataFrame(X_full, columns = attributes )

housing.head()

For now, we also add the target values (`median_house_value`) as another column to the housing data.

In [None]:
housing['median_house_value'] = y_full

housing.head()

With the method `describe()` of a DataFrame (or Series) object, we can get some statistics of it. So, let us quickly take a look at the house values.

In [None]:
housing[['median_house_value']].describe()

A quick description of a DataFrame object can be obtained with the `info()` method, which lists the number of entries, the column names, and for each column the number of rows that contain data (non-null count) and the data type (here 64-bit floating point values).

In [None]:
housing.info()

- There are 20,640 instances in the dataset, which means that it is fairly small by Machine Learning standards, but it is still useful for our purposes.
- All attributes are `numerical`.
- As you can observe, all fields include the same number of non-null values (equal to the number of records), which means that the dataset has already been cleaned for us (and there are no rows with missing data). This is usually not the case with real datasets, which most often include missing or wrong values and require extensive data cleaning before they can be used to train a machine learning model.

The statistics can also be applied to the whole DataFrame.
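The non-null counts reported by `info()` can also be checked directly. The following is a minimal sketch on a small hypothetical DataFrame (`toy`, with made-up values, not the housing data), assuming pandas and NumPy are available. It counts the missing entries per column with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing value (not the housing data).
toy = pd.DataFrame({
    "median_income": [3.2, 5.1, np.nan, 2.4],
    "population": [1200.0, 850.0, 430.0, 2100.0],
})

# Number of missing (NaN) entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only if no column contains a missing value.
is_complete = not toy.isna().any().any()
print(is_complete)
```

Applied to the data above, `housing.isna().sum()` should report zero for every column.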

In [None]:
housing.describe()

Another quick way to get a feeling of the type of data you are dealing with is to *plot a histogram* for each numerical attribute.
- A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis).
- You can either plot this one attribute at a time, or you can call the `hist()` method on the whole dataset, and it will plot a histogram for each numerical attribute.

In [None]:
housing.hist(bins=50, figsize=(20,15));
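To make explicit what each histogram panel shows, the sketch below bins a small set of made-up values with `numpy.histogram` (the array `values` and the choice of 4 bins are illustrative assumptions). Each count is the number of values that fall into the corresponding range.

```python
import numpy as np

# Made-up values, only for illustration.
values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 7.0, 8.0, 9.0])

# Split the range [min, max] into 4 equal-width bins and count the values per bin.
# Every bin is half-open [left, right), except the last one, which includes its right edge.
counts, bin_edges = np.histogram(values, bins=4)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

The `bins=50` argument used with `hist()` above plays the same role as `bins=4` here: more bins give a finer, but noisier, picture of the distribution.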

**Observations from the histogram plots:**

1. The `median income` attribute is not expressed in US dollars (USD), but in units of 10k USD (e.g., a value of 3 corresponds to about 30,000 USD). Also, from the dataset metadata, we can read that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes.
2. The `housing median age` and the `median house value` were also capped. The latter may be a serious problem since it is our `target attribute` (label/ground truth). Our Deep Learning algorithm may learn that prices never go beyond that limit.

In a real-life setting, we would need to evaluate whether this could be a problem when we predict values (our predictions could be far from the real value in some cases, e.g. for a house that is worth much more than \\$500,000). If precise predictions beyond \\$500,000 are needed, then we have mainly two options:

- a. Collect proper labels for the districts whose labels were capped.
- b. Remove those districts from the training set (and also from the test set, since your system should then not be penalized if it predicts values beyond $500,000).

3. The attributes have very *different scales* (value ranges). 
4. Finally, many histograms are *tail heavy*: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions.
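Option b above could be sketched as follows. The DataFrame `toy` and its values are made up, and treating the maximum target value as the cap is an assumption that should be checked against the histogram before it is applied to real data.

```python
import pandas as pd

# Made-up target values in units of 100k USD; the last two sit at an assumed cap.
toy = pd.DataFrame({
    "median_income": [2.1, 3.4, 5.6, 8.3, 9.9],
    "median_house_value": [1.2, 2.0, 3.5, 5.0, 5.0],
})

# Assumption: the cap equals the largest value observed in the target column.
cap = toy["median_house_value"].max()

# Keep only the districts whose target lies strictly below the cap.
uncapped = toy[toy["median_house_value"] < cap]

print(f"removed {len(toy) - len(uncapped)} of {len(toy)} rows")
```

The same filter would have to be applied to both the training set and the test set, so that the model is evaluated only on districts with uncapped labels.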

## Visualize dataset

If the training dataset is very large, you may want to sample an exploration dataset, to make manipulations easy and fast. In our case, the dataset is quite small so you can just work directly on the full set.
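For a larger dataset, such an exploration subset could be drawn with the `sample()` method of a DataFrame. A minimal sketch, where the DataFrame `big`, the subset size of 1,000 rows, and the seed are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical large table with random values, standing in for a big dataset.
rng = np.random.default_rng(seed=42)
big = pd.DataFrame({
    "feature": rng.normal(size=100_000),
    "target": rng.normal(size=100_000),
})

# Draw 1,000 rows at random without replacement; random_state makes the draw repeatable.
exploration = big.sample(n=1_000, random_state=42)

print(exploration.shape)
```

Fixing `random_state` means that the same subset is drawn every time the cell is run, which keeps the exploration reproducible.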

Since the dataset contains geographical information (latitude and longitude), it is a good idea to create a scatterplot of all districts to visualize the data. For DataFrames, we can use the `plot()` method. In the following, the size of each circle (district) represents the district’s population (parameter `s`), and the color represents the price (parameter `c`). We use a predefined color map (parameter `cmap`) called `jet`, which ranges from blue (low prices) to red (high prices). Setting the `alpha` parameter to 0.4 makes the plotted circles semi-transparent, so that places with a high density of data points are easier to see.

In [None]:
import matplotlib.pyplot as plt

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="Population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend();

This plot tells us that the housing prices are very much related to the location (e.g., close to the ocean) and to the population density.

## Looking for Correlations

Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the `corr()` method:

In [None]:
corr_matrix = housing.corr()

corr_matrix

We can then extract the correlations for a specific column, such as the target column of median house values, and sort them in descending order (largest first).

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

The **correlation coefficient** ranges from –1 to 1 (and only measures linear correlations).
+ When it is close to **1**, it means that there is a **strong positive correlation**; for example, the median house value tends to go up when the median income goes up. (The value of 1 in the first row shows that the median house value is correlated with itself, which is no surprise.)
+ When the coefficient is close to –1, it means that there is a **strong negative correlation**; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north, which might have to do with the weather in California).
+ Finally, coefficients close to zero mean that there is no linear correlation.
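The remark that the coefficient only measures linear relationships can be illustrated with a small sketch. It computes Pearson's r from its definition (the covariance divided by the product of the standard deviations) for made-up data; the helper name `pearson_r` is hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Made-up inputs, symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)

r_linear = pearson_r(x, 3.0 * x + 1.0)   # exact linear relationship, positive slope
r_inverse = pearson_r(x, -2.0 * x)       # exact linear relationship, negative slope
r_quadratic = pearson_r(x, x ** 2)       # strong but nonlinear dependence

print(r_linear, r_inverse, r_quadratic)
```

The first two values are 1 and -1 up to rounding, while the third is close to 0 even though `y` is fully determined by `x`. A coefficient near zero therefore does not rule out a nonlinear relationship.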

We can plot some attributes in a scatterplot, to see these correlations.

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "aveRooms",
              "housing_median_age"]
_ = scatter_matrix(housing[attributes], figsize=(12, 8))

The main diagonal (top left to bottom right) displays a histogram of each attribute (and not the correlation of each variable against itself, which would not be interesting). The most promising attribute to predict the `median house value` (strongest correlation as a perceivable line) is the `median income`, so let’s zoom in on their correlation scatterplot.

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1);

This plot reveals a few things:

- The correlation is indeed very strong; you can clearly see the upward trend and the points are not too dispersed.
- The price cap that we noted earlier is clearly visible as a horizontal line at \\$500,000.
- This plot reveals other less obvious straight lines: a horizontal line around \\$450,000, another around \\$350,000, perhaps one around \\$280,000, and a few more below that.

In a real scenario, you may want to try to remove the corresponding districts to prevent your algorithms from learning to reproduce these data quirks.
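One way to locate such horizontal lines is to count how often each exact target value occurs, since a value shared by an unusually large number of districts shows up as a line in the scatterplot. A minimal sketch on made-up data; the threshold of 3 occurrences is an arbitrary assumption:

```python
import pandas as pd

# Made-up target values; 4.5 and 5.0 repeat suspiciously often.
values = pd.Series([1.2, 4.5, 2.3, 4.5, 5.0, 4.5, 5.0, 3.1, 5.0, 5.0],
                   name="median_house_value")

# Count the occurrences of each distinct value, most frequent first.
counts = values.value_counts()

# Assumption: values occurring at least 3 times are treated as suspicious.
suspicious = counts[counts >= 3]
print(suspicious)

# Drop the rows that carry one of the suspicious values.
cleaned = values[~values.isin(suspicious.index)]
print(len(cleaned))
```

On real data, a suitable threshold depends on the dataset size and would need to be chosen by inspecting the most frequent values first.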

This concludes our data exploration. In part II of this exercise, we continue with training a Deep Learning model to predict the house prices.