**End-to-end Machine Learning project**

*Welcome to Machine Learning Housing Corp.!  
Your task is to predict median house values in Californian districts, given a number of features from these districts.*

This is a simplified version of a Jupyter notebook available on GitHub:  
https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb

**This notebook will NOT execute correctly as is; you will have to fill in the missing parts according to the descriptions.**

# Setup

In [None]:
# Common imports
import numpy as np
import os
import pandas as pd
import sys

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
print(sys.executable)

# Get the data
In the following section, we will read in the house pricing data and try to gain first insights.  
Here are links to pandas methods which can be used for this:  

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html

First, try to show the [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) rows of the pandas dataframe.

In [None]:
# pd.read_csv() reads a CSV file into a pandas DataFrame
housing = pd.read_csv("datasets/housing/housing.csv")

# Show the first rows of the dataframe:
housing.

Now try to get some [information](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) about the columns in the dataframe.  
Question: Which of the columns are special?

In [None]:
housing.

Now select the "special" column by using its name as the index.  
Then, [count](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) the number of individual entries.
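
As a minimal sketch of what `value_counts` returns, here is a toy series. The category names are made up for illustration and are not taken from the housing data.

```python
import pandas as pd

# Toy categorical column (values are illustrative, not from housing.csv)
toy = pd.Series(["a", "b", "a", "c", "a", "b"], name="category")

# value_counts() returns a Series indexed by the distinct values,
# sorted by frequency in descending order
counts = toy.value_counts()
print(counts)
```

The result is itself a Series, so a single count can be looked up by label, e.g. `counts["a"]`.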

In [None]:
housing["special column"].

Get a statistical [description](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) of each column in housing.

In [None]:
housing.

Now try to plot [histograms](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) of the columns.  
Hint: there are parameters in the histogram method for formatting the plots, see e.g. `bins` or `figsize`

In [None]:
housing.

Plot a histogram just of the `median_house_value` column:

In [None]:
housing.

# Discover and visualize the data to gain insights
In the following, visualize the spatial (longitude, latitude) information of the dataset in a [scatter-plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html).

In [None]:
housing.plot.

Try to use the plot parameters `s` (size) and `c` (color) to add more information to the plot, e.g. the population or the median house value.  
--> more colourful is better ;-)

In [None]:
housing.plot.
plt.legend()

Compute the [correlation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) matrix of the housing dataset:

In [None]:
# Get the correlation matrix 
corr_matrix = housing.
corr_matrix

[Sort](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) the correlations of the column `median_house_value` to find the highest correlations.

In [None]:
sorted_correlations = corr_matrix[]
sorted_correlations

Now plot a [scatter matrix](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html) of the 4 highest correlating columns.  
Hint for experts: use `sorted_correlations` and the series [index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.index.html) to retrieve the names of the needed 4 columns.
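
The hint above can be sketched on synthetic data. The column names `target`, `strong`, `weak` and `noise` are hypothetical and only serve to show how the index of a sorted correlation series gives back column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Synthetic columns with decreasing similarity to "target"
toy = pd.DataFrame({
    "target": x,
    "strong": x + 0.1 * rng.normal(size=200),
    "weak": x + 2.0 * rng.normal(size=200),
    "noise": rng.normal(size=200),
})

# Correlation of every column with "target", largest first
sorted_corr = toy.corr()["target"].sort_values(ascending=False)

# The index of the sorted Series holds the column names in that order
top_columns = list(sorted_corr.index[:2])
print(top_columns)
```

A list like `top_columns` can be passed directly as a column selector, e.g. `toy[top_columns]`.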

In [None]:
from pandas.plotting import scatter_matrix

attributes = 
scatter_matrix(housing[attributes], figsize=(12, 8))

In [None]:
# Now try to make a scatter plot of the correlation between median_income and median_house_value
housing.plot.scatter(x="median_income", y="median_house_value", alpha=0.1)
plt.axis([0, 16, 0, 550000])

The columns total_rooms, total_bedrooms and population are not representative of individual houses.  
Try to define new columns that express these amounts per household or per total_rooms.
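
As a hedged illustration of the idea, a ratio column is just an element-wise division of two existing columns. The numbers below are invented and the toy frame only borrows two column names from the dataset.

```python
import pandas as pd

# Two made-up districts
toy = pd.DataFrame({
    "total_rooms": [1000.0, 500.0],
    "households": [250.0, 100.0],
})

# Dividing one column by another works row by row
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]
print(toy)
```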

In [None]:
housing["rooms_per_household"] = ...
housing["bedrooms_per_room"] = ...
housing["population_per_household"] = ...
housing.head()

Calculate the correlation matrix again and sort the `median_house_value` column, now with the new columns included.  
What changed?

In [None]:
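# numeric_only=True (pandas >= 1.5) skips the text column ocean_proximity,
# which would otherwise raise an error in pandas >= 2.0
corr_matrix = housing.corr(numeric_only=True)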
corr_matrix

Again, make a [scatter plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html) of the new column with the highest correlation:

In [None]:
housing.plot.scatter(x = , y = , alpha=0.03)
plt.axis([0, 20, 0, 520000])
plt.show()

In [None]:
housing.describe()

# Prepare the data for Machine Learning algorithms

### Now separate labels from features:
1. Create a new dataset that contains only the labels ("median_house_value") we want to predict
2. Drop "median_house_value" from the input data used for training

In [None]:
housing_prep = housing.drop("median_house_value", axis = 1).copy() # drop labels for training set
housing_prep_labels = housing["median_house_value"].copy()

We need to find out where there is missing data before training a model.
Use the methods [isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html) and [any()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html) to find *any* rows where an entry *is null*.
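
A minimal sketch of how the two methods combine, on a toy frame with invented values: `isnull()` gives a boolean frame, and `any(axis=1)` reduces it to one boolean per row.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# True for every row that contains at least one missing entry
mask = toy.isnull().any(axis=1)

# A boolean Series can be used as a row selector
incomplete = toy[mask]
print(incomplete)
```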

In [None]:
sample_incomplete_rows = housing_prep[ ]
sample_incomplete_rows

How many rows with missing data do we have?

In [None]:
len( ... )

### How to deal with missing data?
There are several ways to deal with missing data:
1. Drop each row where data is missing / erroneous (see [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html))
2. Drop the columns where data is missing / erroneous (see [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html))
3. Replace the missing data with the [mean](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) or [median](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html) of the rows where data is available
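
The three options can be compared side by side on a toy frame. The columns `x` and `y` are hypothetical. Each call returns a new dataframe and leaves the original untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
})

# Option 1: drop every row that has a missing value
dropped_rows = toy.dropna()

# Option 2: drop the column that has missing values
dropped_cols = toy.drop(columns=["x"])

# Option 3: fill the gaps with the column median (NaN is ignored by median())
filled = toy.fillna({"x": toy["x"].median()})

print(dropped_rows.shape, dropped_cols.shape, filled["x"].tolist())
```

Option 1 loses samples, option 2 loses a feature, and option 3 keeps both at the cost of inventing plausible values.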

Choose an option in the cells below!

In [None]:
# Option 1
sample_incomplete_rows.dropna()

In [None]:
# Option 2
sample_incomplete_rows.drop()

In [None]:
# Option 3
median_tot_bedrooms = housing_prep["total_bedrooms"].median()
median_bedrooms_p_room = housing_prep["bedrooms_per_room"].median()

# fillna with a dict fills each column with its own value and returns a new dataframe,
# which avoids chained in-place assignment on a slice of housing_prep
sample_incomplete_rows = sample_incomplete_rows.fillna(
    {"total_bedrooms": median_tot_bedrooms, "bedrooms_per_room": median_bedrooms_p_room}
)
sample_incomplete_rows

Apply one of the three options to `housing_prep` here, so that the assertion below passes:

In [None]:



assert len(housing_prep[housing_prep.isnull().any(axis=1)]) == 0, "Please make sure there are no more missing values!"

### How to deal with categorical data?

The column `ocean_proximity` is categorical.
First, check how many `unique` categories there are.

In [None]:
housing_prep['ocean_proximity'].

[Dummy variables](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) could help here:

In [None]:
cat_dummies = pd.get_dummies( ... )
assert cat_dummies.shape[1] == 5
cat_dummies.head()
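
To illustrate what dummy variables look like, here is a small sketch on an invented categorical series. The name `colour` and its values are not part of the housing data.

```python
import pandas as pd

toy = pd.Series(["red", "green", "red", "blue"], name="colour")

# One indicator column per distinct category, in sorted order
dummies = pd.get_dummies(toy)
print(dummies)
```

Each row has exactly one active indicator, which is why this encoding is also called one-hot encoding.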

[Concatenate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) `cat_dummies` with `housing_prep`
and then drop the `ocean_proximity` column, which is no longer needed.  
The resulting concatenated dataframe `housing_prep2` should have 16 columns.

In [None]:
housing_prep2 = pd.concat([housing_prep, cat_dummies], axis = 1).drop(columns=['ocean_proximity'])
assert housing_prep2.shape[1] == 16
housing_prep2.head()

All columns in housing_prep2 are numerical now, so let's [scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) them so they have similar distribution widths and means.  
Be careful: `fit_transform` returns a numpy array, which must be converted to a dataframe before further usage.
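
A minimal sketch of the array-to-dataframe round trip, assuming scikit-learn's default output configuration. The toy columns `small` and `large` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "small": [1.0, 2.0, 3.0],
    "large": [1000.0, 2000.0, 3000.0],
})

scaler = StandardScaler()
# fit_transform returns a plain numpy array, so the column names are lost
scaled_arr = scaler.fit_transform(toy)

# Re-attach the original column names and index
scaled = pd.DataFrame(scaled_arr, columns=toy.columns, index=toy.index)
print(scaled)
```

After scaling, both columns have mean 0 and unit (population) standard deviation, regardless of their original magnitude.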

In [None]:
from sklearn.preprocessing import StandardScaler

housing_scaled_arr = ...
housing_scaled = pd.DataFrame(housing_scaled_arr, columns = ...)

# Select and train a model 
First, [split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) `housing_scaled_arr` and `housing_prep_labels` into training (`X_train`, `y_train`) and test (`X_test`, `y_test`) datasets.  
Use `X_train` and `y_train` to train a model, then test it using `X_test` and `y_test`.
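
The following sketch shows the shape of the four return values on toy arrays. The 80/20 split and the `random_state` value are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each; row i holds [2*i, 2*i + 1]
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Features and labels are shuffled together, so rows stay aligned
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Fixing `random_state` makes the split reproducible across runs.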

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = 

Train (`fit()`) a [linear regressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) on the training dataset `X_train` and `y_train`.

In [None]:
from sklearn.linear_model import LinearRegression

# Instantiate a LinearRegression named lin_reg here and fit it using X_train and y_train

Let's try the model on a few training instances:

In [None]:
some_data = X_train[:5]
some_labels = y_train[:5]

print("Predictions:", lin_reg.predict(some_data))

Compare against the actual values:

In [None]:
print("Labels:", list(some_labels))

So, how good is the linear regression model?  
A common metric is the root [mean square error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) (RMSE) between the values *predicted* by the model on the training features (X_train) and the actual labels (y_train).  
Hint: to calculate the root of column elements, use e.g. [numpy sqrt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sqrt.html)
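
As a small worked example with invented numbers, the RMSE is the square root of the value returned by `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Squared errors are 1, 0 and 4, so the mean is 5/3
mse = mean_squared_error(y_true, y_hat)

# Taking the root brings the error back to the units of the labels
rmse = np.sqrt(mse)
print(mse, rmse)
```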

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = # Use linear regressor to predict y_pred based on X_train
lin_rmse = # Calculate the root mean square error between y_pred and y_train
lin_rmse

Now calculate the same value for the test dataset X_test and y_test.

In [None]:
y_pred = # Use linear regressor to predict y_pred based on X_test
lin_rmse = # Calculate the root mean square error between y_pred and y_test
lin_rmse

In [None]:
from sklearn.metrics import mean_absolute_error

lin_mae = # Do the same as above for the mean absolute error between y_pred and y_test
lin_mae

Now train a [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) on X_train and y_train.

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Instantiate a DecisionTreeRegressor named tree_reg here and fit it using X_train and y_train

Again, compute the root mean squared error between the predictions and the training labels (y_train):

In [None]:
y_pred =  # Use DecisionTreeRegressor to predict y_pred based on X_train
tree_rmse = # Calculate the root mean square error between y_pred and y_train
tree_rmse

assert tree_rmse == 0.0 , 'this should be 0.0'

And the root mean squared error between the predictions based on X_test and the test labels (y_test):

In [None]:
y_pred =  # Use DecisionTreeRegressor to predict y_pred based on X_test
tree_rmse = # Calculate the root mean square error between y_pred and y_test
tree_rmse

**Questions:**
* Why is the error 0.0 for the training set?
* Why is there such a big difference in the errors?
* How can the model be improved?  

Hints:
* Try using the DecisionTreeRegressor options `max_depth` or `max_features`
* Maybe a [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) could help...
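
The effect of `max_depth` can be sketched on synthetic data. Everything below (the noisy sine curve, the depth of 3, the split ratio) is an assumption chosen for illustration and says nothing about the best settings for the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: a sine curve plus noise
rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 300).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

def rmse(model, features, labels):
    """Root mean squared error of a fitted model on the given data."""
    return np.sqrt(mean_squared_error(labels, model.predict(features)))

# An unrestricted tree can grow one leaf per training sample
full_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
# Limiting the depth forces the tree to average over several samples
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_tr, y_tr)

print("full:    train", rmse(full_tree, X_tr, y_tr), "test", rmse(full_tree, X_te, y_te))
print("shallow: train", rmse(shallow_tree, X_tr, y_tr), "test", rmse(shallow_tree, X_te, y_te))
```

The unrestricted tree memorizes the training noise, so its training error says little about how it behaves on unseen data. The depth-limited tree no longer fits the training set perfectly, which is the price of a smoother model.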

# Optional: Fine-tune your model
Try to optimize your model.  
When you think you are finished, [save](https://scikit-learn.org/stable/modules/model_persistence.html) your model to a file.  
**The model with the best performance on a hidden validation dataset wins ;-)**  
The model must take the following columns as input:  

'longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'rooms_per_household', 'bedrooms_per_room', 'population_per_household',
       '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN',

In [None]:
import pickle
import getpass

your_regressor = # Put your best regressor here

# The with-block closes the file even if pickling fails
with open(getpass.getuser() + "s_model.p", "wb") as model_file:
    pickle.dump(your_regressor, model_file)
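
To check that a saved model can be restored, a round trip through `pickle` can be sketched as follows. The toy model, the temporary directory and the file name `toy_model.p` are hypothetical stand-ins.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on the exact line y = 2x + 1
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

# Write the fitted model to a file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "toy_model.p")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read it back; the restored object predicts like the original
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(np.array([[3.0]])))
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them, and only from trusted sources.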

Some additional code which can be used for optimization:

In [None]:
from sklearn.model_selection import cross_val_score

# X_train / y_train are the training split created above
scores = cross_val_score(tree_reg, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
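
The sign flip in `np.sqrt(-scores)` is needed because scikit-learn scorers follow a "greater is better" convention, so the `neg_mean_squared_error` scorer returns the negated MSE. A self-contained sketch on synthetic data (the coefficients and noise level are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

# One negated MSE value per fold
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=5)

# Negate before taking the root to obtain one RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean(), rmse_per_fold.std())
```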

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

In [None]:
lin_scores = cross_val_score(lin_reg, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

**Note**: `n_estimators=10` is set explicitly because the default changed from 10 to 100 in Scikit-Learn 0.22; earlier versions warn about the upcoming change when the parameter is left unset.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(X_train, y_train)

In [None]:
housing_predictions = forest_reg.predict(X_train)
forest_mse = mean_squared_error(y_train, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, X_train, y_train,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

In [None]:
scores = cross_val_score(lin_reg, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
pd.Series(np.sqrt(-scores)).describe()