# Introduction
This first project is based on Chapter 2 of "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron.
It is intended as a largely copy-pasted project, to confirm that the concepts of this chapter have been correctly assimilated.

## Direction
The intent of this ML project is to predict the median house price in a given California district. The output should be a numerical value in US dollars, and the input is block-level data from the US Census Bureau.

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import tarfile
import urllib.request
import matplotlib.pyplot as plt
import pdb
from sklearn.model_selection import train_test_split

## Data Import
Define a function that loads the data. If the data is not present locally, it is automatically downloaded into a local "datasets" directory.

In [None]:
def load_housing_data():
    tarball_path = Path("./datasets/housing.tgz")
    if not tarball_path.is_file():
        # Download and extract the archive only if it is not already present locally.
        Path("./datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    csv_path = Path("./datasets/housing/housing.csv")
    print(csv_path)
    return pd.read_csv(csv_path)

Import the data using the function defined above.

In [None]:
housing = load_housing_data()

## Data Exploration

A quick look at the data contained in the table to see what it looks like.

In [None]:
housing.info()

This dataset contains 9 numeric columns, and one column with a different data type.

We can also see that only _total\_bedrooms_ has missing values; the other columns are complete.
Hence, we will have to decide what to do with this specific column: drop it or impute values?
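
As a rough sketch of the options mentioned above, the following cell compares dropping and imputing on a tiny made-up table. The values and the name `toy` are illustrative assumptions, not taken from the real dataset, and dropping the incomplete rows is shown as an additional possibility.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing table (values are made up for illustration).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows where total_bedrooms is missing.
dropped_rows = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole column.
dropped_column = toy.drop(columns=["total_bedrooms"])

# Option 3: impute the missing values with the column median.
median = toy["total_bedrooms"].median()
imputed = toy.fillna({"total_bedrooms": median})

print(len(dropped_rows), list(dropped_column.columns))
print(imputed["total_bedrooms"].tolist())
```

Imputing keeps every district in the table, at the cost of introducing an estimated value; the median is only one possible choice of fill value.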

Let's use describe() and create histograms to gain more insight into the numerical columns.

In [None]:
housing.describe()

In [None]:
housing.hist(bins=50, figsize=(12,8))
plt.show()

The variables _latitude_ and _longitude_ seem defined appropriately and do not present too many oddities, except for their bimodal distributions.

The variable _housing\_median\_age_ is distributed between 1 and 52, so it relates to the age of the *house*, not of the inhabitants. Furthermore, the median age is capped at 52, so any older house (a centennial one, for example) is given this value instead.

The variables _total\_rooms_, _total\_bedrooms_, _population_ and _households_ do not present anything odd, but will need to be transformed to be more symmetric.
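
One common way to make a heavy-tailed count variable more symmetric is a log transform. The sketch below uses synthetic data (a log-normal sample standing in for a column such as _population_, an assumption made only for illustration) and compares the skewness before and after; whether a log is the right transform for the real columns remains to be checked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic heavy-tailed values standing in for a column such as population.
population = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5_000))

# The log compresses the long right tail.
log_population = np.log(population)

skew_before = population.skew()
skew_after = log_population.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```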

The variable _median\_income_ is not only skewed to the right, but is also not expressed in US dollars, and is capped at 0.5 and 15. Each unit represents ~$10'000.

Finally, the variable _median\_house\_value_ is both skewed to the right and capped at $500'000; it will need to be transformed for better symmetry.
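
A capped variable typically shows up as a spike at its maximum. As a minimal sketch, the share of rows sitting exactly at the maximum can be computed as follows; the series below is made up for illustration and is not the real data.

```python
import pandas as pd

# Toy values with an artificial ceiling at 500'000 (illustrative only).
toy_values = pd.Series(
    [120_000, 500_000, 340_000, 500_000, 500_000, 95_000],
    name="median_house_value",
)

cap = toy_values.max()
# Fraction of rows that sit exactly at the maximum value.
share_at_cap = (toy_values == cap).mean()
print(cap, share_at_cap)
```

A large share at the maximum suggests the values were clipped rather than observed, which matters if the model is expected to predict above the cap.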




## Train-Test Split
Before going any further, it's important to split the test data from the training data, and to not look at the former until we are confident in our model.

### Stratification
According to experts, median income is an extremely important predictor of median housing prices, and therefore we should ensure that the distribution of this variable is similar in our training and testing sets.

In [None]:
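income_cat = pd.cut(housing["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])  # stratify needs discrete classes; bin edges are an assumption
strat_train_set, strat_test_set = train_test_split(housing, test_size=0.2, stratify=income_cat, random_state=42)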
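
To check that a stratified split really preserves the income distribution, the proportions of each income category can be compared between the full data and the test set. The sketch below runs on synthetic incomes; the bin edges, the `income_cat` column name and the `toy` table are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic incomes clipped to the 0.5-15 range (illustrative only).
toy = pd.DataFrame(
    {"median_income": np.clip(rng.lognormal(1.2, 0.5, 2_000), 0.5, 15.0)}
)

# Bin the continuous income into discrete categories (assumed bin edges).
toy["income_cat"] = pd.cut(
    toy["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

train_set, test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["income_cat"], random_state=42
)

# Compare the category proportions in the full data and in the test set.
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = test_set["income_cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "test": in_test})
print(comparison)
```

The two columns should be nearly identical; any visible gap would indicate that the split was not stratified on the intended variable.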