## Context

This dataset is a modified version of the California Housing Data used in the paper Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297. It serves as a good introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and is neither too small to be interesting nor too large to be cumbersome.

The data contains information from the 1990 California census. Although it will not help you predict current housing prices the way the Zillow Zestimate dataset might, it is an accessible introductory dataset for teaching the basics of machine learning.

## Source

This data was entirely modified and cleaned by: https://www.kaggle.com/fedesoriano. The original data (without the distance features) was initially featured in the following paper:
Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

The original dataset can be found under the following link: https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

## Modifications with respect to the original data

This dataset includes 5 extra features defined by https://www.kaggle.com/fedesoriano : "Distance to coast", "Distance to Los Angeles", "Distance to San Diego", "Distance to San Jose", and "Distance to San Francisco". These extra features try to account for the distance to the nearest coast and the distance to the centre of the largest cities in California.

The distances were calculated using the Haversine formula with the Longitude and Latitude:

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

where:

* $d$ is the distance between point 1 and point 2;
* $\phi_1$ and $\phi_2$ are the Latitudes of point 1 and point 2, respectively;
* $\lambda_1$ and $\lambda_2$ are the Longitudes of point 1 and point 2, respectively;
* $r$ is the radius of the Earth (6371 km).
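
The Haversine calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `haversine_distance` and the city coordinates are hypothetical and are not necessarily the reference points used to build the dataset's distance columns.

```python
import math

EARTH_RADIUS_M = 6_371_000  # r = 6371 km, expressed in metres


def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Term under the square root of the Haversine formula
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


# Illustrative coordinates (assumed, not taken from the dataset)
los_angeles = (34.0522, -118.2437)
san_francisco = (37.7749, -122.4194)

distance_m = haversine_distance(*los_angeles, *san_francisco)
print(f"{distance_m / 1000:.0f} km")
```

The result is in metres because the radius is given in metres, which matches the unit of the distance columns listed in the data set summary.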

## Data set summary

The data pertains to the houses found in a given California district, along with summary statistics about them based on the 1990 census data. The columns are as follows; their names are largely self-explanatory:

* 1) Median House Value: Median house value for households within a block (measured in US Dollars);
* 2) Median Income: Median income for households within a block of houses (measured in tens of thousands of US Dollars);
* 3) Median Age: Median age of a house within a block; a lower number is a newer building [years];
* 4) Total Rooms: Total number of rooms within a block;
* 5) Total Bedrooms: Total number of bedrooms within a block;
* 6) Population: Total number of people residing within a block;
* 7) Households: Total number of households, a group of people residing within a home unit, for a block;
* 8) Latitude: A measure of how far north a house is; a higher value is farther north [°];
* 9) Longitude: A measure of how far west a house is; values are negative in California, so a more negative value is farther west [°];
* 10) Distance to coast: Distance to the nearest coast point [m];
* 11) Distance to Los Angeles: Distance to the centre of Los Angeles [m];
* 12) Distance to San Diego: Distance to the centre of San Diego [m];
* 13) Distance to San Jose: Distance to the centre of San Jose [m];
* 14) Distance to San Francisco: Distance to the centre of San Francisco [m].

## Main objective

Train a linear regression model that is balanced and generalizes well to new incoming data, avoiding both overfitting and underfitting, and find the best parameters to use on this dataset (polynomial features, standardization, regularization).

Can I do it?

## Importing libraries and loading up the tools 

In [26]:
# Data analysis and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

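# Make plots appear inside the notebook
%matplotlib inline

# Regression models from Scikit-Learn
# (LogisticRegression is a classifier, so linear regression models are imported instead)
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet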
from sklearn.preprocessing import (StandardScaler, 
                                   PolynomialFeatures)

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

## Load data

In [27]:
# Read and load California Housing dataset
df = pd.read_csv('data/California_Houses.csv')

data = df.copy()  # Keep a copy of the original data

In [33]:
df.head()

Unnamed: 0,Median_House_Value,Median_Income,Median_Age,Tot_Rooms,Tot_Bedrooms,Population,Households,Latitude,Longitude,Distance_to_coast,Distance_to_LA,Distance_to_SanDiego,Distance_to_SanJose,Distance_to_SanFrancisco
0,452600.0,8.3252,41,880,129,322,126,37.88,-122.23,9263.040773,556529.158342,735501.806984,67432.517001,21250.213767
1,358500.0,8.3014,21,7099,1106,2401,1138,37.86,-122.22,10225.733072,554279.850069,733236.88436,65049.908574,20880.6004
2,352100.0,7.2574,52,1467,190,496,177,37.85,-122.24,8259.085109,554610.717069,733525.682937,64867.289833,18811.48745
3,341300.0,5.6431,52,1274,235,558,219,37.85,-122.25,7768.086571,555194.266086,734095.290744,65287.138412,18031.047568
4,342200.0,3.8462,52,1627,280,565,259,37.85,-122.25,7768.086571,555194.266086,734095.290744,65287.138412,18031.047568


## Data Exploration: EDA (cleaning and feature engineering)

In [34]:
df.shape # (rows, columns)

(20640, 14)

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Median_House_Value        20640 non-null  float64
 1   Median_Income             20640 non-null  float64
 2   Median_Age                20640 non-null  int64  
 3   Tot_Rooms                 20640 non-null  int64  
 4   Tot_Bedrooms              20640 non-null  int64  
 5   Population                20640 non-null  int64  
 6   Households                20640 non-null  int64  
 7   Latitude                  20640 non-null  float64
 8   Longitude                 20640 non-null  float64
 9   Distance_to_coast         20640 non-null  float64
 10  Distance_to_LA            20640 non-null  float64
 11  Distance_to_SanDiego      20640 non-null  float64
 12  Distance_to_SanJose       20640 non-null  float64
 13  Distance_to_SanFrancisco  20640 non-null  float64
dtypes: float64(9), int64(5)

We can see that:

* There are 20,640 instances in the dataset.

* There are no missing values.

* All the values are numeric (float or int).

Next, let's display some statistical summaries of the numerical columns:

In [36]:
df.describe()

Unnamed: 0,Median_House_Value,Median_Income,Median_Age,Tot_Rooms,Tot_Bedrooms,Population,Households,Latitude,Longitude,Distance_to_coast,Distance_to_LA,Distance_to_SanDiego,Distance_to_SanJose,Distance_to_SanFrancisco
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,206855.816909,3.870671,28.639486,2635.763081,537.898014,1425.476744,499.53968,35.631861,-119.569704,40509.264883,269422.0,398164.9,349187.551219,386688.422291
std,115395.615874,1.899822,12.585558,2181.615252,421.247906,1132.462122,382.329753,2.135952,2.003532,49140.03916,247732.4,289400.6,217149.875026,250122.192316
min,14999.0,0.4999,1.0,2.0,1.0,3.0,1.0,32.54,-124.35,120.676447,420.5891,484.918,569.448118,456.141313
25%,119600.0,2.5634,18.0,1447.75,295.0,787.0,280.0,33.93,-121.8,9079.756762,32111.25,159426.4,113119.928682,117395.477505
50%,179700.0,3.5348,29.0,2127.0,435.0,1166.0,409.0,34.26,-118.49,20522.019101,173667.5,214739.8,459758.877,526546.661701
75%,264725.0,4.74325,37.0,3148.0,647.0,1725.0,605.0,37.71,-118.01,49830.414479,527156.2,705795.4,516946.490963,584552.007907
max,500001.0,15.0001,52.0,39320.0,6445.0,35682.0,6082.0,41.95,-114.31,333804.686371,1018260.0,1196919.0,836762.67821,903627.663298


In [37]:
# Count the unique values in each column
# (avoid naming the variable `dict`, which would shadow the built-in type)
unique_counts = {col: df[col].nunique() for col in df.columns}

pd.DataFrame(unique_counts, index=["unique count"]).T

Unnamed: 0,unique count
Median_House_Value,3842
Median_Income,12928
Median_Age,52
Tot_Rooms,5926
Tot_Bedrooms,1928
Population,3888
Households,1815
Latitude,862
Longitude,844
Distance_to_coast,12590


## Modelling

## Choose models, fit and score the ML models
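
As a starting point for this step, here is a minimal sketch of fitting and scoring a baseline linear regression. It assumes synthetic placeholder data (`X`, `y`) instead of the housing columns, and the 80/20 split and `random_state` values are arbitrary choices, not settings taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (an assumption; swap in the housing features and target)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression()
baseline.fit(X_train, y_train)

# For regressors, score() returns the coefficient of determination (R^2)
train_r2 = baseline.score(X_train, y_train)
test_r2 = baseline.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

Comparing the train and test scores gives a first indication of overfitting (train much higher than test) or underfitting (both low).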

## Add Polynomial and Regularization Techniques: Ridge, LASSO, and Elastic Net
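
One possible way to combine polynomial features, standardization and regularization is a scikit-learn pipeline tuned with a grid search. The sketch below uses Ridge on synthetic data; the data, the parameter grid and the choice of 5 folds are all assumptions made for illustration. `Lasso` or `ElasticNet` from `sklearn.linear_model` could be substituted for `Ridge` in the same pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic relationship (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Expand the features, standardize them, then fit a regularized linear model
pipeline = make_pipeline(
    PolynomialFeatures(include_bias=False), StandardScaler(), Ridge()
)

# make_pipeline names each step after its lowercased class name
param_grid = {
    "polynomialfeatures__degree": [1, 2, 3],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# The refitted best pipeline reports R^2 through its own score() method
test_r2 = search.best_estimator_.score(X_test, y_test)
print(search.best_params_)
print(f"test R^2: {test_r2:.3f}")
```

Placing the scaler inside the pipeline means it is fitted only on the training folds during cross-validation, so no information from the validation folds leaks into the preprocessing.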

## Evaluate
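
The regression metrics imported at the top of the notebook can be applied as in the sketch below. The `y_true` and `y_pred` arrays are hypothetical values chosen for illustration, not results from any model trained on this dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted house values (in US Dollars)
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 410_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0, 400_000.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same unit as the target
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MSE:  {mse:.1f}")
print(f"RMSE: {rmse:.1f}")
print(f"MAE:  {mae:.1f}")
print(f"R^2:  {r2:.4f}")
```

RMSE and MAE are expressed in the same unit as the target, which makes them easier to interpret against house values than the MSE.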