This California Housing Prices dataset was downloaded from the StatLib repository (http://lib.stat.cmu.edu/datasets/). It is based on data from the 1990 California census. The data is not recent, but its age does not matter for the purpose of practicing deep learning. The original dataset appeared in R. Kelley Pace and Ronald Barry, “Sparse Spatial Autoregressions,” Statistics & Probability Letters 33, no. 3 (1997): 291–297.

Each instance (observation) corresponds to a block group in California, that is, a district with a population of 600 to 3,000 people (1,425.5 on average).

The original raw dataset contains 20,640 instances. It is cleaned, preprocessed, and prepared in this notebook. After this data preparation phase, the final dataset has 20,433 instances and nine attributes, each either standardized to a mean of 0 and a variance of 1, $\frac{x-mean}{std}$ (where $std$ is the standard deviation), or normalized with min-max scaling, $\frac{x-min}{max-min}$: *longitude* and *latitude* (location), *median age*, *total rooms*, *total bedrooms*, *population*, *households*, *median income*, and *ocean proximity*. The file **MedianHouseValuePreparedCleanAttributes.csv** contains the resulting dataset.
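The two scalings mentioned above can be illustrated on a toy column. This is only a sketch: the values are made up, and the notebook's own preparation code may use a library scaler instead. Note that standardization divides by the standard deviation, which is what gives the result unit variance.

```python
import numpy as np

# Toy values standing in for one attribute column (not taken from the dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standardization: subtract the mean, then divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

# Min-max normalization: rescale the values to the interval [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_std.mean(), x_std.std())        # approximately 0 and 1
print(x_minmax.min(), x_minmax.max())   # 0.0 and 1.0
```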

The classification problem is to predict the median house value, discretized into the following three classes (price intervals in thousand dollars): Cheap: [15.0, 141.3], Averaged: [141.4, 230.2], and Expensive: [230.3, 500.0]. Each class is labeled from 0 (the cheapest) to 2 (the most expensive) and one-hot encoded in the <b>MedianHouseValueOneHotEncodedClasses.csv</b> file for training supervised models.
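As a rough illustration of the labeling just described, the sketch below bins a few hypothetical prices into the three classes and one-hot encodes them. The sample values and the use of `np.digitize` are assumptions made for illustration; the notebook may derive the classes differently. Under this scheme, a value falling in the small gap between two stated intervals (for example 141.35) would be assigned to the higher class.

```python
import numpy as np

# Hypothetical median house values in thousand dollars (not rows from the dataset).
values = np.array([90.0, 141.3, 180.0, 230.3, 450.0])

# Upper bounds of the Cheap and Averaged intervals stated in the text.
upper_bounds = np.array([141.3, 230.2])

# right=True gives 0 when v <= 141.3, 1 when 141.3 < v <= 230.2, and 2 otherwise.
labels = np.digitize(values, upper_bounds, right=True)

# One-hot encoding: row i of the 3x3 identity matrix represents class i.
one_hot = np.eye(3, dtype=int)[labels]

print(labels)
print(one_hot)
```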

In [2]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 2321482439413687148
 xla_global_id: -1,
 name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 1755224475
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 8148464397598274374
 physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6"
 xla_global_id: 416903419]

Import the necessary libraries:
- **numpy**: Numerical python library
- **pandas**: Will be used to work with dataframes from .csv files
- **sklearn**: Will be used to modify the labels of the data and do some statistical modifications
- **matplotlib**: Will be used to plot graphics

In [3]:
import numpy as np
import pandas as pd

Constants that store the paths to the files we will work with:

In [4]:
DATA_FOLDER = "../Data/"
INPUT_FILE_NAME = f"{DATA_FOLDER}HousingRawDataset.csv"
# ATT_FILE_NAME = f"{DATA_FOLDER}MedianHouseValuePreparedCleanAttributes.csv"
# ONE_HOT_ENCODED_LABEL_FILE_NAME = f"{DATA_FOLDER}MedianHouseValueOneHotEncodedClasses.csv"
# CONTINUOUS_LABEL_FILE_NAME = f"{DATA_FOLDER}MedianHouseValueContinuousOutput.csv"

Read the dataset and display its column information:

In [5]:
dataset = pd.read_csv(INPUT_FILE_NAME)

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [8]:
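dataset[:10]   # first 10 rows (not displayed: a cell shows only its last expression)
dataset[:-10]  # all rows except the last 10 (this is the output shown below)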

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20625 | -121.52 | 39.12 | 37.0 | 102.0 | 17.0 | 29.0 | 14.0 | 4.1250 | 72000.0 | INLAND |
| 20626 | -121.43 | 39.18 | 36.0 | 1124.0 | 184.0 | 504.0 | 171.0 | 2.1667 | 93800.0 | INLAND |
| 20627 | -121.32 | 39.13 | 5.0 | 358.0 | 65.0 | 169.0 | 59.0 | 3.0000 | 162500.0 | INLAND |
| 20628 | -121.48 | 39.10 | 19.0 | 2043.0 | 421.0 | 1018.0 | 390.0 | 2.5952 | 92400.0 | INLAND |


## Data cleaning

**First step**: check whether there are missing values and, if so, remove them.

In [9]:
{attribute: dataset[dataset[attribute].isnull()].shape[0] for attribute in dataset.columns}

{'longitude': 0,
 'latitude': 0,
 'housing_median_age': 0,
 'total_rooms': 0,
 'total_bedrooms': 207,
 'population': 0,
 'households': 0,
 'median_income': 0,
 'median_house_value': 0,
 'ocean_proximity': 0}
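Only *total_bedrooms* has missing values, and its 207 gaps account for the difference between the 20,640 raw instances and the 20,433 instances of the final dataset. A minimal sketch of counting and then dropping such rows, using a made-up toy frame and assuming that `dropna` is an acceptable way to remove them (the notebook's own removal step may differ):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; only total_bedrooms has gaps.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Per-column count of missing values, as a plain dictionary.
missing_counts = toy.isnull().sum().to_dict()

# Drop every row with at least one missing value and renumber the index.
cleaned = toy.dropna().reset_index(drop=True)

print(missing_counts)
print(len(toy), "->", len(cleaned))
```

Applied to the real `dataset`, the same two calls would be `dataset.isnull().sum()` and `dataset.dropna()`.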