## Context

This dataset is a modified version of the California Housing Data used in the paper: Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297. It serves as an excellent introduction to implementing machine learning algorithms because it requires only rudimentary data cleaning, has an easily understandable list of variables, and is neither too small to be a toy nor too large to be cumbersome.

The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.

## Source

This data was modified and cleaned by https://www.kaggle.com/fedesoriano. The original data (without the distance features) was first featured in the following paper:
Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

The original dataset can be found under the following link: https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

## Modifications with respect to the original data

This dataset includes 5 extra features defined by https://www.kaggle.com/fedesoriano: "Distance to coast", "Distance to Los Angeles", "Distance to San Diego", "Distance to San Jose", and "Distance to San Francisco". These extra features capture the distance to the nearest coast and the distance to the centre of each of four of California's largest cities.

The distances were calculated using the Haversine formula with the Longitude and Latitude:

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

where:

* $d$ is the distance between point 1 and point 2;
* $\varphi_1$ and $\varphi_2$ are the Latitudes of point 1 and point 2, respectively;
* $\lambda_1$ and $\lambda_2$ are the Longitudes of point 1 and point 2, respectively;
* $r$ is the radius of the Earth (6371 km).
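
As a rough illustration, the Haversine calculation described above could be written in NumPy as follows. The function name `haversine_m`, the constant name, and the San Francisco reference coordinates are assumptions made for this sketch. The exact reference points used to build the dataset are not given here, so the result need not match the `Distance_to_SanFrancisco` column exactly.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # 6371 km, as stated above, expressed in metres


def haversine_m(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance in metres between points given in degrees."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlam = np.radians(lon2) - np.radians(lon1)
    # Term under the square root in the Haversine formula
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
    return 2 * radius * np.arcsin(np.sqrt(a))


# Hypothetical reference point for central San Francisco (an assumption,
# not necessarily the point used by the dataset author)
sf_lat, sf_lon = 37.7749, -122.4194

# Latitude and longitude of the first block in the data preview
d = haversine_m(37.88, -122.23, sf_lat, sf_lon)
print(f"{d / 1000:.1f} km")
```

Because the function uses only NumPy ufuncs, it also accepts whole columns, for example `haversine_m(df["Latitude"], df["Longitude"], sf_lat, sf_lon)`.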

## Data set summary

The data describes the houses in a given California district, with summary statistics based on the 1990 census. The columns are listed below; their names are largely self-explanatory:

* 1) Median House Value: Median house value for households within a block (measured in US Dollars);
* 2) Median Income: Median income for households within a block of houses (measured in tens of thousands of US Dollars);
* 3) Median Age: Median age of a house within a block; a lower number is a newer building [years];
* 4) Total Rooms: Total number of rooms within a block;
* 5) Total Bedrooms: Total number of bedrooms within a block;
* 6) Population: Total number of people residing within a block;
* 7) Households: Total number of households, a group of people residing within a home unit, for a block;
* 8) Latitude: A measure of how far north a house is; a higher value is farther north [°];
* 9) Longitude: A measure of how far west a house is; values are negative in California, and a more negative value is farther west [°];
* 10) Distance to coast: Distance to the nearest coast point [m];
* 11) Distance to Los Angeles: Distance to the centre of Los Angeles [m];
* 12) Distance to San Diego: Distance to the centre of San Diego [m];
* 13) Distance to San Jose: Distance to the centre of San Jose [m];
* 14) Distance to San Francisco: Distance to the centre of San Francisco [m].
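
The units listed above (tens of thousands of dollars for income, metres for distances) can be awkward to read in plots and tables. A minimal sketch of converting them, assuming the CSV column names shown in the data preview of this notebook; the new column names `Median_Income_USD` and `*_km` are hypothetical, chosen only for this example:

```python
import pandas as pd

# Two rows copied from the data preview in this notebook
sample = pd.DataFrame({
    "Median_Income": [8.3252, 8.3014],
    "Distance_to_coast": [9263.040773, 10225.733072],
    "Distance_to_LA": [556529.158342, 554279.850069],
})

distance_cols = [c for c in sample.columns if c.startswith("Distance_to_")]

converted = sample.assign(
    # Tens of thousands of US Dollars -> US Dollars
    Median_Income_USD=sample["Median_Income"] * 10_000,
    # Metres -> kilometres, one new column per distance feature
    **{f"{c}_km": sample[c] / 1_000 for c in distance_cols},
)
print(converted.round(2))
```

The original columns are left untouched, so a model can still be trained on the raw units.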

## Main objective

Train a linear regression model that is well balanced and generalizes to new incoming data, avoiding both overfitting and underfitting, and find the best parameters to use on this dataset (polynomial features, standardization, regularization).
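
The three ingredients named above (polynomial features, standardization, regularization) can be chained in a single scikit-learn `Pipeline`. The sketch below is only an illustration under stated assumptions: it uses synthetic data in place of the housing data so that it runs on its own, and the degree (2) and Ridge strength (`alpha=1.0`) are arbitrary starting values, not tuned results.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the housing features and target (an assumption)
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 2))
y = (1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.5 * X[:, 0] * X[:, 1]
     + 0.5 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # polynomial features
    ("scale", StandardScaler()),                                 # standardization
    ("reg", Ridge(alpha=1.0)),                                   # L2 regularization
])
model.fit(X_train, y_train)

# R^2 on both splits; a large gap between them would hint at overfitting
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which keeps information from the test split out of the preprocessing.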

Can I do it?

## Importing libraries and loading up the tools 

```python
# Data analysis and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots appear inside the notebook
%matplotlib inline

# Models from scikit-learn (regression models, since the target is a continuous house value)
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Model evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
```

## Load data

```python
df = pd.read_csv('data/California_Houses.csv')
df.head()
```


| | Median_House_Value | Median_Income | Median_Age | Tot_Rooms | Tot_Bedrooms | Population | Households | Latitude | Longitude | Distance_to_coast | Distance_to_LA | Distance_to_SanDiego | Distance_to_SanJose | Distance_to_SanFrancisco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 452600.0 | 8.3252 | 41 | 880 | 129 | 322 | 126 | 37.88 | -122.23 | 9263.040773 | 556529.158342 | 735501.806984 | 67432.517001 | 21250.213767 |
| 1 | 358500.0 | 8.3014 | 21 | 7099 | 1106 | 2401 | 1138 | 37.86 | -122.22 | 10225.733072 | 554279.850069 | 733236.88436 | 65049.908574 | 20880.6004 |
| 2 | 352100.0 | 7.2574 | 52 | 1467 | 190 | 496 | 177 | 37.85 | -122.24 | 8259.085109 | 554610.717069 | 733525.682937 | 64867.289833 | 18811.48745 |
| 3 | 341300.0 | 5.6431 | 52 | 1274 | 235 | 558 | 219 | 37.85 | -122.25 | 7768.086571 | 555194.266086 | 734095.290744 | 65287.138412 | 18031.047568 |
| 4 | 342200.0 | 3.8462 | 52 | 1627 | 280 | 565 | 259 | 37.85 | -122.25 | 7768.086571 | 555194.266086 | 734095.290744 | 65287.138412 | 18031.047568 |


## Data Exploration: EDA (cleaning and feature engineering)

## Modelling

## Choose Models, fit and score the ML models

## Add Polynomial and Regularization Techniques: Ridge, LASSO, and Elastic Net
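
One possible way to compare Ridge, LASSO, and Elastic Net together with the polynomial degree is a single `GridSearchCV` over a pipeline whose final step is swapped between the three estimators. This is a sketch under assumptions, not a tuned setup: the data is synthetic so the block runs on its own, and the grids of `alpha`, `l1_ratio`, and degree values are arbitrary examples.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with one quadratic effect (an assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("reg", Ridge()),  # placeholder; replaced by the grid below
])

degrees = [1, 2, 3]
param_grid = [
    {"poly__degree": degrees, "reg": [Ridge()],
     "reg__alpha": [0.1, 1.0, 10.0]},
    {"poly__degree": degrees, "reg": [Lasso(max_iter=10_000)],
     "reg__alpha": [0.001, 0.01, 0.1]},
    {"poly__degree": degrees, "reg": [ElasticNet(max_iter=10_000)],
     "reg__alpha": [0.001, 0.01, 0.1], "reg__l1_ratio": [0.2, 0.5, 0.8]},
]

# 5-fold cross-validation, scored by (negated) root mean squared error
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_rmse = -search.best_score_
print(search.best_params_)
print(f"best cross-validated RMSE: {best_rmse:.3f}")
```

On the real dataset the same grid would be fitted on the training split only, with the test split held back for the final evaluation.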

## Evaluate
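
A minimal sketch of one way to evaluate a fitted regression model: report RMSE and R² on both the training and the test split and compare them. The data and the Ridge model here are synthetic placeholders (assumptions) so that the block is self-contained; the helper name `report` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a known linear relationship (an assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + 0.2 * rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
model = Ridge(alpha=1.0).fit(X_train, y_train)


def report(name, X_part, y_part):
    """Print and return RMSE and R^2 of the fitted model on one split."""
    pred = model.predict(X_part)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    r2 = float(r2_score(y_part, pred))
    print(f"{name}: RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
    return rmse, r2


train_rmse, train_r2 = report("train", X_train, y_train)
test_rmse, test_r2 = report("test", X_test, y_test)
```

A test error far above the training error suggests overfitting, while poor scores on both splits suggest underfitting.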