I have finished working on this notebook. If you'd like to take a look at my further work on this dataset (mostly regarding optimizing predictive models), please go [here](https://www.kaggle.com/mateuszbagiski/calihousing-fine-tuning-ml-models).

# 0. Import everything and load the data

In [None]:
# Basic stuff
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from tqdm import tqdm

# Preprocessing
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


# Models, metrics etc
from sklearn.linear_model import LinearRegression as LR
from sklearn.tree import DecisionTreeRegressor as DTR
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import cross_val_score as cvs
import tensorflow as tf
from tensorflow.keras import models, layers, optimizers, utils, callbacks

In [None]:
housing_dir = '../input/california-housing-prices/housing.csv'
data = pd.read_csv(housing_dir)
data.shape, data.columns

# 1. Exploratory Data Analysis

Let's take a rough look at the data we're dealing with here.

In [None]:
print(data.shape)
data.head()

So we have a table with 20640 rows. Each row contains information about one district in California: its geographical location (longitude, latitude, ocean_proximity), its demographics (population, median_income), and its residential buildings (housing_median_age, total_rooms, total_bedrooms, households, median_house_value).

All of these features are numeric with one exception, ocean_proximity, which is categorical. Let's see how many unique labels this feature contains.


In [None]:
data['ocean_proximity'].value_counts()

Let's see whether we have any missing values.

In [None]:
data.isnull().sum()

There are 207 missing values in the total_bedrooms column. Obviously, we cannot feed NaNs (Not-a-Numbers) into a Machine Learning Model, so they need to be either removed or replaced with, for example, mean or median values. Later we will employ the latter strategy, but for now let's continue to explore the data.

In [None]:
data.describe()

The maximum values in the median_house_value and median_income columns (the latter, by the way, seems to be expressed in tens of thousands of dollars rather than single dollars) are suspiciously "round": almost exactly 500000 and 15, respectively. This could mean that the original data contained rows with higher values, but someone treated them as outliers and "rounded them down" to a pre-set maximum. This is potentially disadvantageous for an ML model's performance and could result in an abnormally large number of rows holding these maximum values in their respective columns.

We can see that this is the case if we plot the distribution of each numerical feature as a histogram. Look at these "skyscrapers" on the right:

In [None]:
data.hist(bins=50, figsize=(20,15))
plt.show()

There do not seem to have been too many outliers in terms of median_income, but quite the opposite holds for median_house_value and apparently also for housing_median_age, which was "capped" at the value of 52.

In order to clean our data, we will later remove all the rows which satisfy at least one of the below criteria:

1) median_income equal to or greater than 15

2) median_house_value equal to or greater than 500000

3) housing_median_age equal to or greater than 52
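
Before dropping anything, it can help to quantify how many rows actually sit at a column's maximum. The snippet below is only a minimal sketch on a small made-up frame: the helper name `share_at_max` and the toy values are assumptions, not part of the original notebook. On the real data one would pass columns of `data` instead.

```python
import pandas as pd

def share_at_max(series: pd.Series) -> float:
    """Fraction of rows whose value equals the column maximum (hypothetical helper)."""
    return float((series == series.max()).mean())

# Toy stand-in for the housing data: three of five districts sit at the cap
toy = pd.DataFrame({
    "median_house_value": [120000.0, 500001.0, 340000.0, 500001.0, 500001.0],
    "median_income": [2.5, 15.0001, 4.1, 8.3, 6.0],
})

shares = {col: share_at_max(toy[col]) for col in toy.columns}
print(shares)
```

A share far above what a smooth distribution would produce is a hint that the column was capped.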

Let's look at how these features correlate with one another

In [None]:
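# Restrict to numeric columns: pandas >= 2.0 no longer drops ocean_proximity silently
corr_mat = data.select_dtypes('number').corr()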
corr_mat

More specifically, we are interested in the correlation between median_house_value and the other features:

In [None]:
corr_mhv = corr_mat['median_house_value'].sort_values()
corr_mhv

Aside from median_income, these correlations are rather weak, although taken together the features may still turn out to be good predictors of median_house_value. We can also perform Feature Engineering to derive new features, absent from the original data, which may turn out to be highly correlated with median_house_value.

Note that correlation matrices do not take into account categorical features such as ocean_proximity. We can use Seaborn's violinplot function to see whether there is some kind of relationship between this label and median_house_value.


In [None]:
fig, ax = plt.subplots(figsize=(15,10))

sns.violinplot(
    x='ocean_proximity', y='median_house_value',
    inner='box',
    data=data, ax=ax
)
plt.show()

Lo and behold, there it is. Although in each of these categories we can find a district with a median_house_value above 500000, the medians (marked by the small white dots inside the inner boxplots) are much lower for the INLAND category than for the other four categories. The overall distributions of values also differ significantly, which is clearly visible in both the outer violinplots and the inner boxplots.

We can also print this information for each individual label as a DataFrame: 

In [None]:
ocean_proximity_df = {
    label: data.query(' `ocean_proximity` == @label ')['median_house_value'].describe()
    for label in set(data['ocean_proximity'].values)
}

ocean_proximity_df = pd.DataFrame(ocean_proximity_df).round(1)

ocean_proximity_df
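
The same per-label summary can also be sketched with a single `groupby` call. The frame below is a made-up stand-in for `data` (the values are assumptions chosen for illustration); the pattern itself uses only standard pandas calls.

```python
import pandas as pd

# Toy stand-in for the housing data
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR BAY"],
    "median_house_value": [100000.0, 140000.0, 250000.0, 300000.0, 350000.0],
})

# One describe() per label; transpose so that labels become columns, as in the cell above
summary = toy.groupby("ocean_proximity")["median_house_value"].describe().T.round(1)
print(summary)
```

This avoids building the dictionary by hand and keeps the labels in sorted order.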

Now that we know what kind of data we are dealing with, we can start cleaning it.

# 2. Data Cleaning

## 2.1. Dealing with NaNs

In [None]:
data.isnull().sum()

There are 207 NaNs (not-a-numbers, a.k.a. missing values) in the total_bedrooms column. We will replace them with the median number of bedrooms:

In [None]:
print("Before: %i NaNs" % data['total_bedrooms'].isnull().sum())

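# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions
data['total_bedrooms'] = data['total_bedrooms'].fillna(data['total_bedrooms'].median())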

print("After: %i NaNs" % data['total_bedrooms'].isnull().sum())


## 2.2. Removing "lines" and "cappings"

In [None]:
data.hist(bins=50, figsize=(20,15))
plt.show()

These "skyscrapers" at the right end of the median_house_value and housing_median_age histograms clearly indicate that the data was capped: the most expensive districts were "rounded down" to a value of $500,000. Similarly, housing_median_age was capped at 52 years. Since we cannot retrieve the original values, we need to get rid of these instances:

### 2.2.1. The ordinary way

We could do it just by dropping these rows with a .drop() method...

In [None]:
print("Before: %i datapoints." % data.shape[0])

data_cleaned_1 = data.drop(
    index = data.query(' `median_house_value` >= 500000 | `housing_median_age` >= 52 | `median_income` >= 15 ').index.values,
    inplace = False
)

print("After: %i datapoints." % data_cleaned_1.shape[0])

(we've lost over 10% of our original data)

... or we could use a custom transformer, which would allow us to include this step in a pipeline:

In [None]:
class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy() # To make sure that we don't change the original DataFrame
        X_cleaned = X.drop(index = X.query(' `median_house_value` >= 500000 | `housing_median_age` >= 52 | `median_income` >= 15 ').index.values)
        return X_cleaned
    
outlier_remover = OutlierRemover()
data_cleaned_2 = outlier_remover.transform(data)
data_cleaned_2.shape

We can check that the two ways are perfectly equivalent:

In [None]:
np.all(data_cleaned_1 == data_cleaned_2)

In [None]:
data = outlier_remover.transform(data)

There is one more data-cleaning problem left:

In [None]:
sns.scatterplot(
    x='median_income', y='median_house_value',
    alpha=.4,
    data=data
)

In a very suspicious way, datapoints tend to aggregate along a few precise, discrete values of the median_house_value variable: 450,000, 350,000, 280,000... It looks as if their original values were much more dispersed, but similar (though originally different) values were lumped together into discrete categories before being included in our data.

In [None]:
mhv_counts = data['median_house_value'].value_counts().sort_index()
mhv_counts.loc[448000:452000]

In [None]:
x = mhv_counts.index
y = mhv_counts.values

plt.plot(x, y)
plt.show()

The above plot clearly shows that there are many more such "lumpings" in the data. Removing these values, however, would mean losing a lot of data, so I'm going to leave them there for now.

Possible options:

* Remove the "lumpings"
* Add some random noise to these "lumpings"
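
The second option could look roughly like the sketch below. It is not what this notebook does: the function name `jitter_lumped`, the count threshold, and the noise scale are all assumptions picked for illustration, and the right values would have to be tuned on the real data.

```python
import numpy as np
import pandas as pd

def jitter_lumped(values: pd.Series, min_count: int = 3,
                  scale: float = 500.0, seed: int = 0) -> pd.Series:
    """Add Gaussian noise only to values that occur at least `min_count` times."""
    counts = values.map(values.value_counts())   # how often each row's value occurs
    lumped = counts >= min_count                 # rows sitting on a "lumping"
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, scale, size=len(values))
    # Keep non-lumped rows as they are, replace lumped rows with value + noise
    return values.where(~lumped, values + noise)

# Toy example: four districts lumped at 450000, two ordinary ones
toy = pd.Series([450000.0] * 4 + [123400.0, 98700.0])
jittered = jitter_lumped(toy)
print(jittered)
```

Whether the added noise helps or hurts a model is an empirical question, so this would need to be checked with cross-validation.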

# 3. Data encoding

There is one categorical variable in the data: ocean_proximity. Since most Machine Learning algorithms can deal only with data represented numerically, we need to encode it. Here we use one-hot encoding.

For that we will use Scikit-learn's OneHotEncoder:

In [None]:
print(data['ocean_proximity'].value_counts())

encoder = OneHotEncoder()
#data_op = data[['ocean_proximity']]
op_ohe = encoder.fit_transform(data[['ocean_proximity']]).toarray()

op_ohe, op_ohe.shape

In [None]:
for category_i, category in enumerate(encoder.categories_[0]):
    print(category_i, category, data['ocean_proximity'].value_counts()[category], op_ohe[:,category_i].sum())

In [None]:
for category_i, category in enumerate(encoder.categories_[0]):
    data[category] = op_ohe[:, category_i]

In [None]:
data.loc[:, 'ocean_proximity':]

Categorical data encoded this way can easily be fed to a machine learning model. I think, however, that there is a way to extract more precise information about ocean proximity, which will make the one-hot-encoded columns redundant. We will do that in section 4.3.
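
For quick exploration, the encoding steps in this section can also be condensed with pandas alone. This is a sketch on a made-up frame, not a replacement for the OneHotEncoder used in the pipelines later on (an encoder fitted once can be reused on new data, which `pd.get_dummies` does not guarantee).

```python
import pandas as pd

# Toy stand-in for the ocean_proximity column
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]})

# One 0/1 column per label, named after the label itself
dummies = pd.get_dummies(toy["ocean_proximity"], dtype=float)
toy_encoded = toy.join(dummies)
print(toy_encoded)
```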

# 4. Feature Engineering

Here we will create new features for our dataset by processing/combining the original ones. Hopefully they will turn out to be more predictive of the value we want to predict, i.e. median_house_value.

## 4.1. Simple combined features

The total number of rooms or bedrooms in a given district doesn't tell us much by itself. These numbers are much more informative when expressed in relation to the district's population or total number of households. Another possibly interesting feature is the fraction of all rooms that are bedrooms.

* rooms_per_household, bedrooms_per_household

* rooms_per_person, bedrooms_per_person

* bedrooms_fraction

* people_per_household

### 4.1.1. Combining features the vanilla way

We can add these features in a very simple way, one by one, like this:

In [None]:
data_expanded_1 = data.copy()

data_expanded_1['rooms_per_household'] = data_expanded_1['total_rooms'] / data_expanded_1['households']
data_expanded_1['bedrooms_per_household'] = data_expanded_1['total_bedrooms'] / data_expanded_1['households']

data_expanded_1['rooms_per_person'] = data_expanded_1['total_rooms'] / data_expanded_1['population']
data_expanded_1['bedrooms_per_person'] = data_expanded_1['total_bedrooms'] / data_expanded_1['population']

data_expanded_1['bedrooms_fraction'] = data_expanded_1['total_bedrooms'] / data_expanded_1['total_rooms']

data_expanded_1['people_per_household'] = data_expanded_1['population'] / data_expanded_1['households']

data_expanded_1.head()

### 4.1.2. Combining features the "fancy" way

... or we could use a custom transformer, like the one below, based on the one from the notebook for the second chapter of Hands-On Machine Learning:

In [None]:
# Names of the new features/columns
new_features = [
    'rooms_per_household',
    'bedrooms_per_household',
    'rooms_per_person',
    'bedrooms_per_person',
    'bedrooms_fraction',
    'people_per_household'
]
class FeatExpander(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X['rooms_per_household'] = X['total_rooms'] / X['households']
        X['bedrooms_per_household'] = X['total_bedrooms'] / X['households']
        X['rooms_per_person'] = X['total_rooms'] / X['population']
        X['bedrooms_per_person'] = X['total_bedrooms'] / X['population']
        X['bedrooms_fraction'] = X['total_bedrooms'] / X['total_rooms']
        X['people_per_household'] = X['population'] / X['households']
        return X

feat_expander = FeatExpander()
data_expanded_2 = feat_expander.transform(data)

data_expanded_2.head()

We can check that the two ways of adding new features are perfectly equivalent:

In [None]:
# Compare column names and values, not just the index
data_expanded_1.equals(data_expanded_2)

In [None]:
data = feat_expander.fit_transform(data)
data.select_dtypes('number').corr()['median_house_value']

bedrooms_fraction is quite highly negatively correlated with median_house_value.

rooms_per_person and rooms_per_household also may turn out to be useful predictors.

Transformations we performed so far can be combined into the following pipeline. We use ColumnTransformer to separate transformations performed on the categorical attribute (ocean_proximity) from those performed on the numerical attributes (all the rest).

(the only difference is that using this pipeline we are going to drop the ocean_proximity attribute, but that's okay)

In [None]:
# Re-load the original data
data_original = pd.read_csv(housing_dir)

# Separate the numerical part of the data
data_num = data_original.drop('ocean_proximity', axis=1)

# Names of the numerical attributes in the original data
attribs_num = data_num.columns.tolist()

# Name of the only categorical attribute in the original data
attribs_cat = ['ocean_proximity']

# I initialize an encoder here only to extract the list of labels in the same order in which they will be given later in the pipeline
encoder = OneHotEncoder()
encoder.fit(data[attribs_cat])
oh_labels = encoder.categories_[0].tolist()

# Names of the attributes added by FeatExpander
new_features = [
    'rooms_per_household',
    'bedrooms_per_household',
    'rooms_per_person',
    'bedrooms_per_person',
    'bedrooms_fraction',
    'people_per_household'
]

# Names of columns needed for reconversion of the numpy array returned by column_transformer back into a DataFrame
columns_tr = [*attribs_num, *new_features, *oh_labels]


class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy() # To make sure that we don't change the original DataFrame
        X_cleaned = X.drop(index = X.query(' `median_house_value` >= 500000 | `housing_median_age` >= 52 | `median_income` >= 15 ').index.values)
        return X_cleaned

# I made my own imputer, because Scikit-learn's SimpleImputer returns a numpy array, whereas I prefer working on DataFrames
class MyImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X['total_bedrooms'] = X['total_bedrooms'].fillna(X['total_bedrooms'].median())
        return X
        

class FeatExpander(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X['rooms_per_household'] = X['total_rooms'] / X['households']
        X['bedrooms_per_household'] = X['total_bedrooms'] / X['households']
        X['rooms_per_person'] = X['total_rooms'] / X['population']
        X['bedrooms_per_person'] = X['total_bedrooms'] / X['population']
        X['bedrooms_fraction'] = X['total_bedrooms'] / X['total_rooms']
        X['people_per_household'] = X['population'] / X['households']
        return X
    
# Another custom transformer - just to reconvert NumPy arrays returned by column_transformer back into a DataFrame
class DFConverter(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X_df = pd.DataFrame(X)
        X_df.columns = columns_tr
        return X_df

# A pipeline for numerical attributes
pipeline_num = Pipeline([
    ('imputer', MyImputer()),
    ('feat_expander', FeatExpander()),
])


column_transformer = ColumnTransformer([
    ('num', pipeline_num, attribs_num), # For the numerical attributes
    ('cat', OneHotEncoder(), attribs_cat), # For the categorical attribute
])

pipeline_full = Pipeline([
    ('outlier_remover', OutlierRemover()), # Remove the outliers
    ('column_transformer', column_transformer), # Process the numerical attributes and the categorical attribute separately and concatenate them after
    ('df_converter', DFConverter()) # Reconvert the concatenated NumPy array back into a DataFrame
])

data_tr = pipeline_full.fit_transform(data_original)
data_tr.columns
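
One design point about MyImputer above: it recomputes the median inside `transform`, so a test set would be filled with its own median rather than the one seen during training. A possible stateful variant is sketched below. The class name `MedianImputerDF` and its `columns` parameter are assumptions made for this illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MedianImputerDF(BaseEstimator, TransformerMixin):
    """Hypothetical variant of MyImputer: learns medians in fit, returns a DataFrame."""
    def __init__(self, columns=("total_bedrooms",)):
        self.columns = columns

    def fit(self, X, y=None):
        # Remember the medians of the training data only
        self.medians_ = X[list(self.columns)].median()
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        cols = list(self.columns)
        X[cols] = X[cols].fillna(self.medians_)
        return X

# Toy train/test split: the test NaN is filled with the *training* median (200.0)
train = pd.DataFrame({"total_bedrooms": [100.0, 200.0, np.nan, 300.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 50.0]})

imputer = MedianImputerDF().fit(train)
filled = imputer.transform(test)
print(filled)
```

Because it still returns a DataFrame, such a class could be dropped into `pipeline_num` in place of MyImputer without changing the later steps.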

## 4.2. hotspot_distance - A distance from an area of high price

A quick look at the heatmap displaying median_house_value on a longitude/latitude plot indicates that there are two major "centers" distinguished by exceptionally high prices:

In [None]:
sns.set_style('white')

data.plot(
    x='longitude', y='latitude',
    kind='scatter', figsize=(10,7),
    alpha=.4,
    c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True
)

In [None]:
corr_mhv = data.select_dtypes('number').corr()['median_house_value'].sort_values(ascending=False)
corr_mhv

There is no significant simple, linear correlation between longitude and median_house_value. Latitude is weakly negatively correlated. However, if we plot longitude and latitude against median_house_value we can see that there is possibly a more complex, less-obvious, non-linear trend.

This becomes especially evident when we overlay a line plot (in red) of the mean median_house_value of all the districts sharing a particular longitude/latitude on top of the scatter plot (in blue).

In [None]:
sns.set_style('darkgrid')

fig, ax = plt.subplots(1,2, figsize=(30,12))

sns.scatterplot(
    x = 'longitude', y = 'median_house_value',
    alpha=.33,
    data=data, ax=ax[0]
)
sns.lineplot(
    x='longitude', y='median_house_value',
    ci=None, color='red', linewidth=1, alpha=.8,
    data=data, ax=ax[0]
)
ax[0].set_title('Linear correlation: %.5f' % data['median_house_value'].corr(data['longitude']))

sns.scatterplot(
    x = 'latitude', y = 'median_house_value',
    alpha=.33,
    data=data, ax=ax[1]
)
sns.lineplot(
    x='latitude', y='median_house_value',
    ci=None, color='red', linewidth=1, alpha=.8,
    data=data, ax=ax[1]
)
ax[1].set_title('Linear correlation: %.5f' % data['median_house_value'].corr(data['latitude']))

plt.show()
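
When several rows share the same x value, `sns.lineplot` aggregates them with the mean by default, which is what the red lines show. The sketch below computes that aggregate explicitly on a made-up frame (the toy values are assumptions), in case the numbers themselves are of interest.

```python
import pandas as pd

# Toy stand-in: several districts share the same longitude
toy = pd.DataFrame({
    "longitude": [-122.0, -122.0, -120.0, -118.0, -118.0],
    "median_house_value": [300000.0, 400000.0, 100000.0, 250000.0, 350000.0],
})

# Mean house value per distinct longitude, sorted from west to east
trend = toy.groupby("longitude")["median_house_value"].mean()
print(trend)
```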

So, while there is no significant linear correlation between median_house_value and longitude, there seem to be two high-price regions: one around -122 and another around -118 longitude. The same goes for 33 and 37 latitude. Moreover, a quick look at the previous map indicates that the two are not independent.


In [None]:
sns.set_style('white')

data.plot(
    x='longitude', y='latitude',
    kind='scatter', figsize=(10,7),
    alpha=.4,
    c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True
)

So there are two high-price regions, which I will further refer to as "hotspots":

1) North-western (NW): around longitude -122 and latitude 37

2) South-eastern (SE): around longitude -118 and latitude 33

Distance from these hotspots may be highly correlated with median_house_value.

We could settle for these roughly estimated values. This, however, does not satisfy us, so we're going to search for them methodically.

First, we will look for two longitude values, distances from which are most highly correlated with median_house_value. Then we will do the same for latitude. After that, we will combine the results into two pairs of coordinates, one for each high-price hotspot and perform a little fine-tuning.

### 4.2.1. Longitude

In [None]:
hotspot_NW_long = [0, 0] # correlation, longitude
hotspot_SE_long = [0, 0] # ^
inter_hotspot_long = -120 # longitude

while True:
    data_NW = data.copy().query('longitude < @inter_hotspot_long')
    data_SE = data.copy().query('longitude > @inter_hotspot_long')
    
    # hotspot_NW
    for long_val in np.arange(-124, inter_hotspot_long, .01):
        data_NW['hotspot_NW_long'] = (data_NW['longitude'] - long_val).abs()
        # Pairwise correlation only: faster and unaffected by non-numeric columns
        correlation = data_NW['hotspot_NW_long'].corr(data_NW['median_house_value'])
        if abs(correlation)>abs(hotspot_NW_long[0]):
            hotspot_NW_long = [correlation, long_val]
    
    # hotspot_SE
    for long_val in np.arange(inter_hotspot_long, -116, .01):
        data_SE['hotspot_SE_long'] = (data_SE['longitude'] - long_val).abs()
        correlation = data_SE['hotspot_SE_long'].corr(data_SE['median_house_value'])
        if abs(correlation)>abs(hotspot_SE_long[0]):
            hotspot_SE_long = [correlation, long_val]
            
    # inter_hotspot
    new_inter_hotspot_long = (hotspot_NW_long[1]+hotspot_SE_long[1])/2
    if new_inter_hotspot_long!=inter_hotspot_long:
        inter_hotspot_long = new_inter_hotspot_long
    else:
        break

print("NW:\t", hotspot_NW_long)
print("SE:\t", hotspot_SE_long)
print("inter-hotspot:\t", inter_hotspot_long)
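
The inner step of the search above (for a list of candidate coordinates, find the one whose absolute distance correlates most strongly with the target) can be isolated into a small function. The sketch below uses synthetic data in which the target falls off linearly from a known center, so the expected answer is clear. The function name `best_center` and the synthetic numbers are assumptions for illustration.

```python
import numpy as np

def best_center(coords, target, candidates):
    """Return (correlation, center) with the largest |Pearson r|
    between |coords - center| and target."""
    best_corr, best_c = 0.0, None
    for c in candidates:
        distance = np.abs(coords - c)
        r = np.corrcoef(distance, target)[0, 1]
        if abs(r) > abs(best_corr):
            best_corr, best_c = r, c
    return best_corr, best_c

# Synthetic data: the "price" drops linearly with the distance from 2.0
coords = np.linspace(0.0, 4.0, 41)
target = 100.0 - 10.0 * np.abs(coords - 2.0)

corr, center = best_center(coords, target, np.arange(0.0, 4.0, 0.5))
print(corr, center)
```

On real data the target is of course not an exact function of the distance, so the best correlation will be well below 1 in absolute value.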

### 4.2.2. Latitude

In [None]:
hotspot_NW_lat = [0, 0] # correlation, latitude
hotspot_SE_lat = [0, 0] # ^
inter_hotspot_lat = 36 # latitude

while True:
    data_NW = data.copy().query('latitude > @inter_hotspot_lat')
    data_SE = data.copy().query('latitude < @inter_hotspot_lat')
    
    # hotspot_NW
    for lat_val in np.arange(inter_hotspot_lat, inter_hotspot_lat+2, .01):
        data_NW['hotspot_NW_lat'] = (data_NW['latitude'] - lat_val).abs()
        correlation = data_NW['hotspot_NW_lat'].corr(data_NW['median_house_value'])
        if abs(correlation)>abs(hotspot_NW_lat[0]):
            hotspot_NW_lat = [correlation, lat_val]
    
    # hotspot_SE
    for lat_val in np.arange(inter_hotspot_lat-2, inter_hotspot_lat, .01):
        data_SE['hotspot_SE_lat'] = (data_SE['latitude'] - lat_val).abs()
        correlation = data_SE['hotspot_SE_lat'].corr(data_SE['median_house_value'])
        if abs(correlation)>abs(hotspot_SE_lat[0]):
            hotspot_SE_lat = [correlation, lat_val]
            
    # inter_hotspot
    new_inter_hotspot_lat = (hotspot_NW_lat[1]+hotspot_SE_lat[1])/2
    if new_inter_hotspot_lat!=inter_hotspot_lat:
        inter_hotspot_lat = new_inter_hotspot_lat
    else:
        break

print("NW:\t", hotspot_NW_lat)
print("SE:\t", hotspot_SE_lat)
print("inter-hotspot:\t", inter_hotspot_lat)

### 4.2.3. Longitude + latitude

In [None]:
hotspot_NW = [hotspot_NW_long[1], hotspot_NW_lat[1]] # longitude, latitude
hotspot_SE = [hotspot_SE_long[1], hotspot_SE_lat[1]] # ^

inter_hotspot = [inter_hotspot_long, inter_hotspot_lat] # ^

new_hotspot_NW = [0, 0, 0]   # correlation, longitude, latitude
new_hotspot_SE = [0, 0, 0]   # ^

# hotspot_NW
data_NW = data.copy().query('longitude < @inter_hotspot[0] & latitude > @inter_hotspot[1]')
for long_val in tqdm(np.arange(hotspot_NW[0]-.5, hotspot_NW[0]+.5, .05)):
    for lat_val in np.arange(hotspot_NW[1]-.5, hotspot_NW[1]+.5, .05):
        data_NW['hotspot_NW_distance'] = np.sqrt((data_NW['longitude']-long_val)**2 + (data_NW['latitude']-lat_val)**2)
        correlation = data_NW['hotspot_NW_distance'].corr(data_NW['median_house_value'])
        if abs(correlation)>abs(new_hotspot_NW[0]):
            new_hotspot_NW = [correlation, long_val, lat_val]
            
# hotspot_SE
data_SE = data.copy().query('longitude > @inter_hotspot[0] & latitude < @inter_hotspot[1]')
for long_val in tqdm(np.arange(hotspot_SE[0]-.5, hotspot_SE[0]+.5, .05)):
    for lat_val in np.arange(hotspot_SE[1]-.5, hotspot_SE[1]+.5, .05):
        data_SE['hotspot_SE_distance'] = np.sqrt((data_SE['longitude']-long_val)**2 + (data_SE['latitude']-lat_val)**2)
        correlation = data_SE['hotspot_SE_distance'].corr(data_SE['median_house_value'])
        if abs(correlation)>abs(new_hotspot_SE[0]):
            new_hotspot_SE = [correlation, long_val, lat_val]

# inter_hotspot
inter_hotspot = [(new_hotspot_NW[1]+new_hotspot_SE[1])/2, (new_hotspot_NW[2]+new_hotspot_SE[2])/2]

print("inter_hotspot:\t", inter_hotspot)
print("NW:\t", new_hotspot_NW)
print("SE:\t", new_hotspot_SE)

In [None]:
hotspot_NW = new_hotspot_NW[1:] # longitude, latitude
hotspot_SE = new_hotspot_SE[1:] # ^


# hotspot_NW
data_NW = data.copy().query('longitude < @inter_hotspot[0] & latitude > @inter_hotspot[1]')
for long_val in tqdm(np.arange(hotspot_NW[0]-.05, hotspot_NW[0]+.05, .01)):
    for lat_val in np.arange(hotspot_NW[1]-.05, hotspot_NW[1]+.05, .01):
        data_NW['hotspot_NW_distance'] = np.sqrt((data_NW['longitude']-long_val)**2 + (data_NW['latitude']-lat_val)**2)
        correlation = data_NW['hotspot_NW_distance'].corr(data_NW['median_house_value'])
        if abs(correlation)>abs(new_hotspot_NW[0]):
            new_hotspot_NW = [correlation, long_val, lat_val]
            
# hotspot_SE
data_SE = data.copy().query('longitude > @inter_hotspot[0] & latitude < @inter_hotspot[1]')
for long_val in tqdm(np.arange(hotspot_SE[0]-.05, hotspot_SE[0]+.05, .01)):
    for lat_val in np.arange(hotspot_SE[1]-.05, hotspot_SE[1]+.05, .01):
        data_SE['hotspot_SE_distance'] = np.sqrt((data_SE['longitude']-long_val)**2 + (data_SE['latitude']-lat_val)**2)
        correlation = data_SE['hotspot_SE_distance'].corr(data_SE['median_house_value'])
        if abs(correlation)>abs(new_hotspot_SE[0]):
            new_hotspot_SE = [correlation, long_val, lat_val]

# inter_hotspot
inter_hotspot = [(new_hotspot_NW[1]+new_hotspot_SE[1])/2, (new_hotspot_NW[2]+new_hotspot_SE[2])/2]
print("inter_hotspot:\t", inter_hotspot)

print("NW:\t", new_hotspot_NW)
print("SE:\t", new_hotspot_SE)

In [None]:
hotspot_NW = new_hotspot_NW[1:] # longitude, latitude
hotspot_SE = new_hotspot_SE[1:] # ^

### 4.2.4. Extracting hotspot_distance

In [None]:
def calculate_hotspot_distance(long_val, lat_val):
    NW_distance = np.sqrt((hotspot_NW[0]-long_val)**2 + (hotspot_NW[1]-lat_val)**2)
    SE_distance = np.sqrt((hotspot_SE[0]-long_val)**2 + (hotspot_SE[1]-lat_val)**2)
    return np.min([NW_distance, SE_distance])
    

data['hotspot_distance'] = data.apply(lambda x: calculate_hotspot_distance(x['longitude'], x['latitude']), axis=1)
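
The row-wise `apply` above works, but the same quantity can be computed for all rows at once with NumPy broadcasting. The following is a sketch under assumptions: the function name `nearest_hotspot_distance` is made up, and the two hotspot coordinates are the ones hard-coded later in this notebook, used here only as placeholders.

```python
import numpy as np
import pandas as pd

# (longitude, latitude) of the two hotspots; placeholder values
hotspots = np.array([[-122.94, 37.04], [-118.915, 33.165]])

def nearest_hotspot_distance(df: pd.DataFrame, hotspots: np.ndarray) -> pd.Series:
    """Euclidean distance (in degrees) from each row to its nearest hotspot."""
    lon = df["longitude"].to_numpy()[:, None]   # shape (n_rows, 1)
    lat = df["latitude"].to_numpy()[:, None]
    # Broadcasting yields a (n_rows, n_hotspots) distance matrix
    dist = np.hypot(lon - hotspots[:, 0], lat - hotspots[:, 1])
    return pd.Series(dist.min(axis=1), index=df.index, name="hotspot_distance")

# Toy districts: one exactly on the NW hotspot, one a degree north of the SE hotspot
toy = pd.DataFrame({"longitude": [-122.94, -118.915], "latitude": [37.04, 34.165]})
distances = nearest_hotspot_distance(toy, hotspots)
print(distances)
```

As in the original function, distances are measured in raw degrees, so a degree of longitude and a degree of latitude are treated as equal lengths.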

In [None]:
data['median_house_value'].corr(data['hotspot_distance'])

We can also create a heatmap, displaying not only the relative distance of each district from its nearest hotspot (marked by their hue), but also our calculated "perfect" locations of these idealized hotspots:

In [None]:
fig, ax = plt.subplots(1,1, figsize=(15,10))

data.plot(
    x='longitude', y='latitude',
    kind='scatter', figsize=(10,7),
    alpha=.4,
    c='hotspot_distance', cmap=plt.get_cmap('Spectral'), colorbar=True, ax=ax
)
ax.scatter(x=hotspot_NW[0], y=hotspot_NW[1], marker='X', color='k')
ax.scatter(x=hotspot_SE[0], y=hotspot_SE[1], marker='X', color='k')

plt.show()

As it turns out, they are located right on the surface of the ocean.

## 4.3. ocean_distance - Calculating the distance from the ocean

Instead of labeling each district as being located "inland", "near bay" etc., we can simply calculate its distance to the closest district that lies next to a body of water. This means that the "NEAR BAY", "NEAR OCEAN" and "ISLAND" categories automatically receive a value of 0, while for the "<1H OCEAN" and "INLAND" categories we search through the distances to all the districts lying next to a body of water and take the smallest one.

The expectation is that this smallest distance to the ocean (which we will place in the new column ocean_distance) will be negatively correlated with median_house_value.

However, districts located along the shore are quite densely packed, so we don't lose much information by using only a small fraction, like 1/20, of these datapoints. I ran some tests and it turned out that the resulting correlation is even slightly higher (in absolute terms) than if we used all the datapoints.

In [None]:
tqdm.pandas()

land_labels = ['<1H OCEAN', 'INLAND']
data_ocean = data.copy().query(' ocean_proximity == "NEAR OCEAN" | ocean_proximity == "NEAR BAY"').sample(frac=1/20, random_state=42) # skipping island districts, because they will obviously be always very far from the inland ones
def calculate_ocean_distance(long_val, lat_val):
    data_ocean['district_distance'] = data_ocean.apply(lambda x: np.sqrt((long_val-x['longitude'])**2 + (lat_val-x['latitude'])**2), axis=1)
    return data_ocean['district_distance'].min()

data['ocean_distance'] = data.progress_apply(lambda x: calculate_ocean_distance(x['longitude'], x['latitude']) if x['ocean_proximity'] in land_labels else 0, axis=1)
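
The nested `apply` calls above compute, for every district, the distance to every sampled shore district. The same idea can be sketched with a single broadcasted distance matrix. The function name `ocean_distance` and the toy coordinates are assumptions for illustration. Note that the matrix has one entry per (district, shore district) pair, so on the full data it is sensible to subsample the shore districts as done above.

```python
import numpy as np
import pandas as pd

def ocean_distance(df: pd.DataFrame,
                   shore_labels=("NEAR OCEAN", "NEAR BAY"),
                   zero_labels=("NEAR OCEAN", "NEAR BAY", "ISLAND")) -> pd.Series:
    """Distance (in degrees) to the nearest shore district; 0 for coastal labels."""
    coords = df[["longitude", "latitude"]].to_numpy()
    is_shore = df["ocean_proximity"].isin(shore_labels)
    shore = df.loc[is_shore, ["longitude", "latitude"]].to_numpy()
    # (n_rows, n_shore, 2) differences -> (n_rows, n_shore) distances
    diff = coords[:, None, :] - shore[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    # Coastal and island districts get 0 by definition
    nearest[df["ocean_proximity"].isin(zero_labels).to_numpy()] = 0.0
    return pd.Series(nearest, index=df.index, name="ocean_distance")

# Toy districts on a made-up grid
toy = pd.DataFrame({
    "longitude": [0.0, 3.0, 10.0, 0.0],
    "latitude": [0.0, 4.0, 10.0, 1.0],
    "ocean_proximity": ["NEAR OCEAN", "INLAND", "ISLAND", "<1H OCEAN"],
})
toy_distance = ocean_distance(toy)
print(toy_distance)
```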

In [None]:
data['median_house_value'].corr(data['ocean_distance'])

Once again, we have a very high negative correlation. This feature is also much more correlated with median_house_value than most ocean_proximity-derived labels and we can safely assume that this makes them redundant.

In [None]:
data[['ocean_distance', *encoder.categories_[0].tolist()]].corrwith(data['median_house_value'])

Again, we can create a heatmap to see that the districts located closer to the ocean really have lower ocean_distance values:

In [None]:
fig, ax = plt.subplots(1,1, figsize=(15,10))


data.plot(
    x='longitude', y='latitude',
    kind='scatter', figsize=(10,7),
    alpha=.4,
    c='ocean_distance', cmap=plt.get_cmap('Spectral'), colorbar=True, ax=ax
)

plt.show()

# 5. Normalization, the ultimate pipeline, and feature selection

For our models to perform well, we should normalize (more precisely, standardize) the data, that is, transform the values so that each feature has a mean equal to 0 and a standard deviation equal to 1. This can be easily done with Scikit-learn's StandardScaler.

Note that we transform only the values of numerical features, not the categorical labels (one-hot-encoded ocean_proximity labels). We also don't want to normalize the target feature (median_house_value), since this is a regression task and we want to predict the value of this feature.

In [None]:
scaler = StandardScaler()
num_cols = [col for col in data.columns if col!='median_house_value' and col!='ocean_proximity' and col not in oh_labels]
data[num_cols] = scaler.fit_transform(data[num_cols])
data.describe().round(1)

All the transformed features now have their mean equal to 0 and standard deviation (std) equal to 1.
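
What StandardScaler does to each column can be written out by hand as (x - mean) / std, where std is the population standard deviation (`ddof=0`). The sketch below applies this formula to a made-up frame (the toy values are assumptions). Since `describe()` reports the sample standard deviation (`ddof=1`), the std it shows after scaling is only approximately 1.

```python
import pandas as pd

# Toy stand-in for two numerical columns
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "households": [100.0, 300.0, 500.0, 700.0],
})

# Column-wise standardization with the population standard deviation
standardized = (toy - toy.mean()) / toy.std(ddof=0)
print(standardized.round(3))
```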

Everything done so far can be combined into the following pipeline (except data selection and splitting):

I changed the ordering slightly: I moved the calculation of ocean_distance to the beginning (right after the outlier removal), in order to make use of the ocean_proximity attribute (still present at that stage). This was much more convenient than the alternative. hotspot_distance is computed in the numerical attributes pipeline, right after the expanded features, and is followed by standard scaling applied to all numerical attributes except the target attribute median_house_value. After that, all that's left is to combine all the processed numerical attributes with the ocean_proximity-derived labels into a DataFrame with DFConverter.

In [None]:
# Re-load the original data
data_original = pd.read_csv(housing_dir)

# Separate the numerical part of the data
data_num = data_original.drop('ocean_proximity', axis=1)

# Names of the numerical attributes in the original data
attribs_num = [*data_num.columns.tolist(), 'ocean_distance']

# Name of the only categorical attribute in the original data
attribs_cat = ['ocean_proximity']

# I initialize an encoder here only to extract the list of labels in the same order in which they will be given later in the pipeline
encoder = OneHotEncoder()
encoder.fit(data_original[attribs_cat])
oh_labels = encoder.categories_[0].tolist()

# Names of the attributes added by FeatExpander
new_features = [
    'rooms_per_household',
    'bedrooms_per_household',
    'rooms_per_person',
    'bedrooms_per_person',
    'bedrooms_fraction',
    'people_per_household'
]

# Names of columns needed for reconversion of the numpy array returned by column_transformer back into a DataFrame
columns_tr = [*attribs_num, *new_features, 'hotspot_distance', *oh_labels]


class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy() # To make sure that we don't change the original DataFrame
        X_cleaned = X.drop(index = X.query(' `median_house_value` >= 500000 | `housing_median_age` >= 52 | `median_income` >= 15 ').index.values)
        return X_cleaned

class MyImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
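        # Assign the result back instead of using inplace=True on a column selection, which newer pandas versions may not propagate to X
        X['total_bedrooms'] = X['total_bedrooms'].fillna(X['total_bedrooms'].median())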
        return X

class FeatExpander(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X['rooms_per_household'] = X['total_rooms'] / X['households']
        X['bedrooms_per_household'] = X['total_bedrooms'] / X['households']
        X['rooms_per_person'] = X['total_rooms'] / X['population']
        X['bedrooms_per_person'] = X['total_bedrooms'] / X['population']
        X['bedrooms_fraction'] = X['total_bedrooms'] / X['total_rooms']
        X['people_per_household'] = X['population'] / X['households'] # name matches the entry in new_features
        return X

        
class DFConverter(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X_df = pd.DataFrame(X)
        X_df.columns = columns_tr
        return X_df
    
hotspot_NW = [-122.94, 37.04]
hotspot_SE = [-118.915, 33.165]
def calculate_hotspot_distance(long_val, lat_val):
    NW_distance = np.sqrt((hotspot_NW[0]-long_val)**2 + (hotspot_NW[1]-lat_val)**2)
    SE_distance = np.sqrt((hotspot_SE[0]-long_val)**2 + (hotspot_SE[1]-lat_val)**2)
    return np.min([NW_distance, SE_distance])
class HotspotDistanceCalculator(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X['hotspot_distance'] = X.apply(lambda x: calculate_hotspot_distance(x['longitude'], x['latitude']), axis=1)
        return X
        
tqdm.pandas()
land_labels = ['<1H OCEAN', 'INLAND']
data_ocean = data_original.copy().query(' ocean_proximity == "NEAR OCEAN" | ocean_proximity == "NEAR BAY"').sample(frac=1/20, random_state=42) # skipping island districts, because they will obviously be always very far from the inland ones
def calculate_ocean_distance(long_val, lat_val):
    data_ocean['district_distance'] = data_ocean.apply(lambda x: np.sqrt((long_val-x['longitude'])**2 + (lat_val-x['latitude'])**2), axis=1)
    return data_ocean['district_distance'].min()
class OceanDistanceCalculator(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X['ocean_distance'] = X.progress_apply(lambda x: calculate_ocean_distance(x['longitude'], x['latitude']) if x['ocean_proximity'] in land_labels else 0, axis=1)
        return X
        
class MyScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        columns_to_norm = [col for col in X.columns if col!='median_house_value']
        X[columns_to_norm] = StandardScaler().fit_transform(X[columns_to_norm])
        return X
        
        
pipeline_num = Pipeline([
    ('imputer', MyImputer()),
    ('feat_expander', FeatExpander()),
    ('hotspot_distance_calculator', HotspotDistanceCalculator()),
    ('scaler', MyScaler())
])        

column_transformer = ColumnTransformer([
    ('num', pipeline_num, attribs_num),
    ('cat', OneHotEncoder(), attribs_cat)
])

pipeline_full = Pipeline([
    ('outlier_remover', OutlierRemover()),
    ('ocean_distance_calculator', OceanDistanceCalculator()),
    ('column_transformer', column_transformer),
    ('df_converter', DFConverter()),
])

data_original = pd.read_csv(housing_dir)

data_tr = pipeline_full.fit_transform(data_original)
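
The row-wise `apply` calls used for `hotspot_distance` and `ocean_distance` are easy to read but slow on a table of roughly 20,000 rows. As a possible alternative (not what the pipeline above does), the same nearest-point Euclidean distance can be computed with NumPy broadcasting. The helper name `min_distance_to_points` and the sample district coordinates are my own illustrative assumptions; only the two hotspot coordinates come from the notebook.

```python
import numpy as np

def min_distance_to_points(coords, points):
    """For each row of coords (n, 2), return the Euclidean distance
    (in degrees) to the nearest row of points (m, 2)."""
    coords = np.asarray(coords, dtype=float)
    points = np.asarray(points, dtype=float)
    # Broadcasting: (n, 1, 2) - (1, m, 2) -> (n, m, 2) pairwise differences
    diffs = coords[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))  # (n, m) distance matrix
    return dists.min(axis=1)

# The two hotspots defined in the notebook, as (longitude, latitude)
hotspots = np.array([[-122.94, 37.04], [-118.915, 33.165]])

# Made-up district coordinates, for illustration only
districts = np.array([[-122.94, 37.04],    # exactly on the NW hotspot
                      [-118.915, 34.165],  # 1 degree north of the SE hotspot
                      [-120.0, 36.0]])

print(min_distance_to_points(districts, hotspots))
```

With `X[['longitude', 'latitude']].to_numpy()` passed as `coords`, this could stand in for the per-row `apply`; for the ocean distance, `points` would be the sampled coastal districts. The intermediate (n, m) distance matrix has to fit in memory.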

In [None]:
data_tr.describe().round(1)
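
One caveat about `MyScaler`: it fits a fresh `StandardScaler` inside `transform`, so whatever data passes through is scaled with its own statistics. This works here because the whole dataset is transformed in one go, but if the pipeline were fitted on a training split and then applied to new data, the scaling statistics would ideally be learned in `fit` and reused in `transform`. A minimal NumPy sketch of that pattern follows (the class name `FittedScaler` is hypothetical and not part of the notebook):

```python
import numpy as np

class FittedScaler:
    """Learns per-column mean/std in fit() and reuses them in transform()."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Uses the statistics learned in fit(), not those of X itself
        return (X - self.mean_) / self.std_

# Made-up train/test values, for illustration only
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [5.0]])

scaler = FittedScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.ravel())
```

The test rows are expressed relative to the training mean and spread, so their own mean is generally not 0 after scaling.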

We now have to decide which features we are going to use in training our models.

Initially, I wanted to retain only the features whose absolute correlation with the target feature (median_house_value) is greater than 0.1.

Discarding features with low correlation with the target value can speed up the training process, since it lowers the computational load. While this is usually a desirable strategy when dealing with datasets with huge numbers of features (most of which are probably not very highly correlated with whatever it is we are trying to predict), in the case of datasets with relatively few features (like this one), it can mean losing a significant amount of information and thus lowering the predictive power of our models.

To inspect the relative significance of the various features, I decided to make 5 sets of features (numbered from 0 to 4) and test the performance of several models when trained on each of them. As it turns out, retaining all the features, including the ones with very low absolute correlation with the target (below 0.1) or suspected to be redundant (the one-hot-encoded ocean_proximity labels), can significantly lower the root mean squared error (RMSE, our metric).

So, I created the following feature sets:

**Set 0**: The baseline split, only the original numerical attributes (no feature-engineering) and one-hot-encoded ocean_proximity labels.

**Set 1**: Only the numerical features (including the engineered ones) whose absolute correlation with median_house_value is above 0.1, without the ocean_proximity labels.

**Set 2**: Replication of the work from the Hands-On Machine Learning handbook: original numerical features, 3 combined attributes (3 out of 6 from section 4.1 of this notebook) and one-hot-encoded ocean_proximity labels.

**Set 3**: Set 1, but with the ocean_proximity labels retained.

**Set 4**: All attributes

In [None]:
# Information about correlation of each feature with median_house_value
corr_mhv = data_tr.corr()['median_house_value']

# Set 0: only the original num + OH
set_0 = [a for a in data_original.columns if a!='median_house_value' and a!='ocean_proximity']+oh_labels
# Set 1: absolute correlation above 0.1 (excluding OH) - my original idea
set_1 = [a for a in corr_mhv.index[:-5] if abs(corr_mhv[a])>.1 and a!='median_house_value']
# Set 2: replication of handbook's - original numerical + OH + handbook's combined attributes
set_2 = [a for a in data_original.columns if a!='median_house_value' and a!='ocean_proximity']+['rooms_per_household', 'people_per_household', 'bedrooms_fraction']+oh_labels
# Set 3: absolute correlation above 0.1 + OH - my original idea, but without excluding OH
set_3 = [a for a in corr_mhv.index[:-5] if abs(corr_mhv[a])>.1 and a!='median_house_value']+oh_labels
# Set 4: all attributes
set_4 = [a for a in corr_mhv.index.tolist() if a!='median_house_value']

sets_all = [
    set_0,
    set_1,
    set_2,
    set_3,
    set_4
]

data_X_all = [ data_tr[set_] for set_ in sets_all]

data_y = data_tr['median_house_value']
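
To make the correlation filter behind Sets 1 and 3 concrete, here is a small self-contained sketch on synthetic data. The feature names, coefficients, and random seed are made up for illustration; only the 0.1 threshold comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic features (hypothetical names), independent standard normals
features = {
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'noise': rng.normal(size=n),
}
# Target depends strongly on 'strong', mildly on 'weak', not at all on 'noise'
target = 3.0 * features['strong'] + 1.0 * features['weak'] + rng.normal(size=n)

# Pearson correlation of each feature with the target
corr = {name: np.corrcoef(values, target)[0, 1] for name, values in features.items()}

# Keep only the features whose absolute correlation exceeds the threshold
threshold = 0.1
selected = [name for name, r in corr.items() if abs(r) > threshold]

print(corr)
print(selected)
```

Pearson correlation only measures linear association, so a feature that falls below the threshold may still carry information that a model can use.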

# 6. Models

In an attempt to replicate the results reported in the Hands-On Machine Learning handbook, we will test the same 3 models provided by Scikit-learn (Linear Regression, Decision Tree Regressor and Random Forest Regressor) and use the same cross-validation with 10 splits. The performance measure is going to be the root mean squared error (RMSE).
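
A note on the scoring used below: scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the negated MSE of each fold, and the RMSE is recovered as `np.sqrt(-scores)`. The sketch below mimics this by hand with NumPy on synthetic data. The helper `kfold_neg_mse` is a hypothetical stand-in for `cross_val_score`, with ordinary least squares as the model, and the data are made up.

```python
import numpy as np

def kfold_neg_mse(X, y, k=10):
    """Return the negated test MSE of a least-squares fit for each of k folds."""
    folds = np.array_split(np.arange(len(y)), k)  # consecutive, unshuffled folds
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # Fit ordinary least squares (with intercept) on the training folds
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Evaluate on the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        mse_fold = np.mean((y[test_idx] - A_test @ coef) ** 2)
        scores.append(-mse_fold)  # negated, like scikit-learn's scorer
    return np.array(scores)

# Synthetic regression problem (made-up coefficients, noise std = 2)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=2.0, size=1000)

scores = kfold_neg_mse(X, y, k=10)
rmse = np.sqrt(-scores)  # one RMSE per fold
print(f"Mean: {rmse.mean():.2f}\tStd: {rmse.std():.2f}")
```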

## 6.1. Linear Regression


In [None]:
print("\tLinear Regression:")
lr_cvs_rmse = [] # a list the scores will be written into
for i, data_X in enumerate(data_X_all):
    lr = LR()
    rmse = np.sqrt(-cvs(lr, data_X, data_y, scoring='neg_mean_squared_error', cv=10, n_jobs=-1))
    lr_cvs_rmse.append(rmse)
    print(f"Set {i}:\tMean: {rmse.mean().round(2)}\tStd: {rmse.std().round(2)}")

First observations:

1. The model trained on Set 0 (baseline) performs the worst of all, which was to be expected.

2. My original idea (Set 1), that is, discarding both the features with absolute correlation below 0.1 and the one-hot-encoded ocean_proximity labels, achieves a better score (lower RMSE) than the replication of the handbook's approach (Set 2), so calculating hotspot_distance and/or ocean_distance seems to have been a good feature-engineering idea.

3. However, keeping the one-hot-encoded labels alongside the correlation-filtered features (Set 3) gives even better results, and the best results come from not discarding anything and training our Linear Regression model on all the features (Set 4).

## 6.2. Decision Tree Regressor

Decision Tree Regressor is much less deterministic than Linear Regression, which is easy to see if we run the same code several times. Whereas Linear Regression converges on the same solution every time it is trained on a given feature set, Decision Trees (with their default settings) give results that vary from run to run.

In [None]:
print("\tLinear Regression:")
for i, data_X in enumerate(data_X_all):
    print(f"Set {i}:")
    for run_i in range(5):
        lr = LR()
        rmse = np.sqrt(-cvs(lr, data_X, data_y, scoring='neg_mean_squared_error', cv=10, n_jobs=-1))
        print(f"\tRun {run_i}:\tMean: {rmse.mean().round(2)}\tStd: {rmse.std().round(2)}")

In [None]:
print("\tDecision Tree Regressor:")
dtr_cvs_rmse = []
for i, data_X in enumerate(data_X_all):
    rmses = 0
    print(f"Set {i}:")
    for run_i in range(5):
        dtr = DTR()
        rmse = np.sqrt(-cvs(dtr, data_X, data_y, scoring='neg_mean_squared_error', cv=10, n_jobs=-1))
        rmses += rmse
        print(f"\tRun {run_i}:\tMean: {rmse.mean().round(2)}\tStd: {rmse.std().round(2)}")
    rmses /= 5
    dtr_cvs_rmse.append(rmses)

However, after averaging over all the runs, we can see that models trained on data that includes the newly engineered features perform better than the baseline (Set 0) and the handbook approach (Set 2), although there does not seem to be a significant difference between Sets 1, 3, and 4.

In [None]:
print("\tDecision Tree Regressor:")
for i, rmse in enumerate(dtr_cvs_rmse):
    print(f"Set {i}:\tMean: {rmse.mean().round(2)}\tStd: {rmse.std().round(2)}")
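
If reproducible results are preferred over averaging several runs, `DecisionTreeRegressor` accepts a `random_state` argument that fixes its internal randomness. The sketch below is only an illustration on made-up synthetic data (the helper `tree_rmse` is hypothetical); it checks that two runs with the same seed give identical per-fold RMSE values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (made-up), standing in for one of the feature sets
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=300)

def tree_rmse(seed):
    """10-fold CV RMSE of a decision tree with a fixed random seed."""
    tree = DecisionTreeRegressor(random_state=seed)
    scores = cross_val_score(tree, X, y, scoring='neg_mean_squared_error', cv=10)
    return np.sqrt(-scores)

run_a = tree_rmse(seed=42)
run_b = tree_rmse(seed=42)
print(np.array_equal(run_a, run_b))
```

Fixing the seed makes a single run repeatable, but it does not remove the underlying variability, so averaging over several seeds (as done above) still gives a fairer picture.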

## 6.3. Random Forest Regressor

The same is true for random forests, which are just ensembles of decision trees.

In [None]:
print("\tRandom Forest Regressor:")
rfr_cvs_rmse = []
for i, data_X in enumerate(data_X_all):
    rfr = RFR()
    rmse = np.sqrt(-cvs(rfr, data_X, data_y, scoring='neg_mean_squared_error', cv=10, n_jobs=-1))
    rfr_cvs_rmse.append(rmse)
    print(f"Set {i}:\tMean: {rmse.mean().round(2)}\tStd: {rmse.std().round(2)}")

## 6.4. Keras Sequential Model

Let's now test a simple neural network consisting solely of L2-regularized Dense layers, with batch normalization and a little dropout after each hidden layer. It takes quite a while to train, but achieves much better results than the previous models.

10-fold cross-validation would take very long, so I decided to train this model just once on every feature set.

Of course, we need to split the data into train and test set first.

In [None]:
def lr_scheduler(epoch, lr):
    if epoch==110 or epoch==130:
        return lr/3
    else:
        return lr

callbacks_list = [
    callbacks.LearningRateScheduler(lr_scheduler),
    callbacks.ReduceLROnPlateau(factor=.1, monitor='val_loss', patience=3),
    #callbacks.ModelCheckpoint(filepath='model_best.h5', monitor='val_loss', save_best_only=True, save_freq='epoch'),
]

def build_model(n_features):
    model = models.Sequential(layers=[
        layers.Dense(32, activation='relu', kernel_regularizer='l2', input_shape=(n_features,)),
        layers.BatchNormalization(),
        layers.Dropout(.1),
        layers.Dense(64, activation='relu', kernel_regularizer='l2'),
        layers.BatchNormalization(),
        layers.Dropout(.1),
        layers.Dense(64, activation='relu', kernel_regularizer='l2'),
        layers.BatchNormalization(),
        layers.Dropout(.1),
        layers.Dense(64, activation='relu', kernel_regularizer='l2'),
        layers.BatchNormalization(),
        layers.Dropout(.1),
        layers.Dense(64, activation='relu', kernel_regularizer='l2'),
        layers.BatchNormalization(),
        layers.Dropout(.1),
        layers.Dense(1)
    ])
    return model
    

train_data, test_data = tts(data_tr, test_size=.1, random_state=42)
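
To see what `lr_scheduler` does on its own, the sketch below replays it for 150 epochs in plain Python. It assumes a starting learning rate of 0.001 (the Keras default for RMSprop) and ignores any additional reductions made by `ReduceLROnPlateau`, so the actual learning rate during training may drop further.

```python
def lr_scheduler(epoch, lr):
    # Same rule as above: divide the learning rate by 3 at epochs 110 and 130
    if epoch == 110 or epoch == 130:
        return lr / 3
    return lr

lr = 0.001  # assumed initial learning rate (Keras default for RMSprop)
schedule = []
for epoch in range(150):  # Keras passes 0-based epoch indices to the scheduler
    lr = lr_scheduler(epoch, lr)
    schedule.append(lr)

print(schedule[109], schedule[110], schedule[130], schedule[149])
```

Taken by itself, the scheduler leaves the rate untouched for the first 110 epochs and ends at one ninth of the starting value.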

In [None]:
histories = []
models_ = [] # with an underscore (_), because 'models' name is taken by a Keras module

for i, set_ in enumerate(tqdm(sets_all)): # wrapping the list (not the enumerate object) lets tqdm know the total
    train_X, train_y = train_data[set_], train_data['median_house_value']
    
    model = build_model(n_features=train_X.shape[1])
    
    model.compile(
        optimizer='rmsprop',
        loss='mse',
        metrics=['mae']
    )

    history = model.fit(
        train_X, train_y,
        validation_split=.1,
        callbacks = callbacks_list,
        epochs=150, batch_size=32,
        shuffle=True,
        verbose=0
    )
    
    histories.append(history)
    models_.append(model)
    

We can plot the training process...

In [None]:
history = histories[1]
epochs = np.arange(1, len(history.history['loss'])+1)
print("epochs:", len(epochs))

train_loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(epochs, train_loss, 'r-', label='train_loss')
plt.plot(epochs, val_loss, 'g--', label='val_loss')
plt.legend()
print("Training and validation loss:")
plt.show()

train_mae = history.history['mae']
val_mae = history.history['val_mae']
plt.plot(epochs, train_mae, 'r-', label='train_mae')
plt.plot(epochs, val_mae, 'g--', label='val_mae')
plt.legend()
print("Training and validation MAE:")
plt.show()

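# Depending on the Keras version, the learning rate is logged under 'lr' or 'learning_rate'
lr = history.history.get('lr', history.history.get('learning_rate'))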
plt.plot(epochs, lr, 'b--', label='lr')
plt.legend()
print("Learning rate:")
plt.show()

... and evaluate each model's performance:

In [None]:
nn_rmse = []

print("\tNeural Network:")
for i, set_ in enumerate(sets_all):
    train_X, train_y = train_data[set_], train_data['median_house_value']
    test_X, test_y = test_data[set_], test_data['median_house_value']
    
    model = models_[i]
    
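    # RMSE is computed from predictions, because the loss returned by model.evaluate also includes the L2 penalty
    train_rmse = np.sqrt(mse(train_y, model.predict(train_X, verbose=0).ravel()))
    test_rmse = np.sqrt(mse(test_y, model.predict(test_X, verbose=0).ravel()))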
    
    nn_rmse.append(test_rmse)
    
    print(f"Set {i}:\tTrain: {train_rmse.round(2)}\tTest: {test_rmse.round(2)}")
    

For reasons I don't understand, including the one-hot encodings severely degrades this network's performance (I tried normalizing them and it didn't help at all). Also, it seems that in this case discarding low-correlation features improves the model's performance, although the difference is very small, so it may not be significant after all.
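
A general point about evaluating regularized Keras models: with `kernel_regularizer='l2'`, the loss value returned by `model.evaluate` is the MSE plus the L2 penalty on the weights, so taking its square root overstates the RMSE, whereas computing the RMSE from `model.predict` avoids this. The NumPy sketch below illustrates the gap with made-up targets, predictions, and weights; the factor 0.01 is the Keras default for the `'l2'` shorthand.

```python
import numpy as np

# Made-up targets, predictions, and layer weights, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])
weights = [np.array([[0.5, -1.0], [2.0, 0.1]]), np.array([[1.5], [-0.7]])]

l2_factor = 0.01  # default factor of the Keras 'l2' shorthand

mse_value = np.mean((y_true - y_pred) ** 2)
l2_penalty = l2_factor * sum(np.sum(w ** 2) for w in weights)
regularized_loss = mse_value + l2_penalty  # MSE plus the weight penalty

print(f"RMSE from predictions:    {np.sqrt(mse_value):.4f}")
print(f"sqrt of regularized loss: {np.sqrt(regularized_loss):.4f}")
```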

To have it all in one place, we'll make a DataFrame containing the RMSE scores of all the models tested so far. For comparison, we can also include the RMSE values reported in the handbook, which for some reason differ from what we've obtained for Set 2 (our attempt to replicate these results).

In [None]:
all_cvs_rmse = [
    lr_cvs_rmse,
    dtr_cvs_rmse,
    rfr_cvs_rmse
]

scores_df = pd.DataFrame({
    f'Set {i}': [rmse[i].mean().round(2) for rmse in all_cvs_rmse]+[nn_rmse[i].round(2)] for i in range(len(sets_all))
})
scores_df['Handbook RMSE'] = [69052.46, 71407.69, 50182.30, None]
scores_df.index = ['Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Neural Network']

scores_df

That's it for now. I will continue to work on this dataset, mostly trying to fine-tune these models and/or find and fine-tune better ones in [this notebook](https://www.kaggle.com/mateuszbagiski/calihousing-fine-tuning-ml-models).