## steps
1. ✅Understand the whole picture
   - Quantify final objectives ?
   - What is the current solution (if any) ? Downsides ?
2. ✅Get data
3. ✅explore, visualize data => **insight**
4. ✅prepare data for machine learning algo
5. ✅select and train a model
6. fine-tune model 
   - Measure the errors made by the model, with **RMSE** or **MAE**
7. present solution
8. launch, monitor, maintain system
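
The two error measures named in step 6 can be sketched in a few lines of NumPy. This is only an illustration: the helper names `rmse` and `mae` and the toy values are assumptions, not part of the housing project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: squares the errors, so large misses weigh more."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: every miss weighs in proportion to its size."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical toy values, not taken from the housing data
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 9.0, 9.0]

toy_rmse = rmse(y_true, y_pred)  # sqrt((1 + 0 + 4 + 0) / 4)
toy_mae = mae(y_true, y_pred)    # (1 + 0 + 2 + 0) / 4
```

RMSE is never smaller than MAE on the same errors; the gap grows when a few errors are much larger than the rest.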

In [1]:
import pandas as pd
housing_lables = pd.read_feather("prepared-data/strat_train_set_lables.feather").set_index("index")
housing = pd.read_feather("prepared-data/strat_train_set_features.feather").set_index("index")

In [2]:
from utils import (
    ClusterSimilarity,
    log_pipeline,
    cat_pipeline,
    default_num_pipeline,
    ratio_pipeline)
from sklearn.compose import ColumnTransformer, make_column_selector

cluster_simil = ClusterSimilarity(
    n_clusters=10, gamma=1, random_state=42
)

preprocessing=ColumnTransformer([
    ("bedrooms", ratio_pipeline, ['total_bedrooms','total_rooms']),
    ('rooms_per_house', ratio_pipeline, ['total_rooms',"households"]),
    ('people_per_house', ratio_pipeline, ['population',"households"]),
    ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                           "households", "median_income"]),
    ('geo', cluster_simil, ['latitude','longitude']),
    ('cat', cat_pipeline, make_column_selector(dtype_include=object))], # Ocean Proximity
    remainder=default_num_pipeline # housing_median_age
    )

# I. Train and evaluate on the training set

## test the newly created **linear regression prediction** pipeline

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

lin_reg = make_pipeline(
    preprocessing,
    LinearRegression()
)

In [4]:
lin_reg.fit(
    X= housing,
    y = housing_lables.median_house_value.to_numpy()
)

- evaluate model using **RMSE** (*root mean squared error*)

In [5]:
housing_predictions = lin_reg.predict(housing)

In [6]:
from sklearn.metrics import mean_squared_error

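lin_rmse = mean_squared_error(
    housing_lables,
    housing_predictions,
    # squared=False returns the RMSE instead of the MSE.
    # Note: this argument is deprecated since scikit-learn 1.4,
    # which provides root_mean_squared_error instead.
    squared=False
)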

In [7]:
(lin_rmse/ (housing_lables.max() - housing_lables.min()))

median_house_value    0.142694
dtype: float64

## Thoughts

### observations

- `lin_rmse` is `69207.068`, which is *14.27%* of the `median_house_value` range
  ```python
  (
    lin_rmse
    /
    (housing_lables.max() - housing_lables.min())
  )
  # returns 0.142694, i.e. 14.27%
  ```
- The prediction was made on the **train set**, i.e. on data the model has already seen
- An error this large even on the training data => **UNDERFITTING**

### Ideas

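Options to deal with *underfitting*:
- Feed the model better features (more data alone does not fix underfitting)
- Choose a more powerful model, e.g. `DecisionTreeRegressor`
- Reduce the constraints on the model, or tune hyperparameters such as `gamma` and `n_clusters` in `ClusterSimilarity`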

In [8]:
# Try DecisionTreeRegressor,
# which can capture nonlinear relationships
from sklearn.tree import DecisionTreeRegressor
tree_reg=make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(housing, housing_lables.median_house_value)

In [9]:
housing_predictions = tree_reg.predict(housing)
tree_rmse = mean_squared_error(
    housing_lables.median_house_value,
    housing_predictions,
    squared=False
)

- `tree_rmse` is `0.0` on the train set => severe **overfitting**
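
Why a zero training error is a warning sign rather than a success can be seen on a small synthetic example. The sketch below is an assumption-laden illustration (made-up data, a local `rmse` helper), not the housing pipeline: an unconstrained tree memorizes the noisy training points but does much worse on held-out points.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a sine curve plus noise on an evenly spaced grid
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 500).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No depth limit: the tree can grow one leaf per training sample
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

def rmse(y_true, y_pred):
    """Root mean squared error (local helper for this sketch)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

train_rmse = rmse(y_train, tree.predict(X_train))  # memorized training points
val_rmse = rmse(y_val, tree.predict(X_val))        # unseen points
```

Comparing `train_rmse` with `val_rmse` shows the gap that evaluating on the train set alone hides.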

# II. Better Evaluation using **Cross-Validation**

- To **validate different models**, split the `train_set` into a smaller `train_set` and a `validation_set`
- E.g.: implement *k-fold cross-validation*
  - Divide `train_set` into **10 subsets** (called **10 folds**)
  - Use **9 folds to train** and **1 fold to validate**
  - Loop **10 times**, each time holding out a different fold for validation
  - Collect the 10 scores (the scoring metric is **chosen by the user**)
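
The loop described above can be written out by hand with `KFold`, which makes the train/validate rotation explicit. This is a minimal sketch on made-up data with a plain `LinearRegression`; the data, the coefficients, and the choice to shuffle the folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Hypothetical synthetic regression data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, val_idx in kfold.split(X):
    # Train on 9 folds
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Validate on the held-out fold
    errors = y[val_idx] - model.predict(X[val_idx])
    fold_rmse.append(float(np.sqrt(np.mean(errors ** 2))))

fold_rmse = np.array(fold_rmse)  # one RMSE per fold, 10 in total
```

`cross_val_score` wraps the same idea in a single call and clones the estimator for each fold.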

In [11]:
from sklearn.model_selection import cross_val_score
tree_rmse = cross_val_score(estimator = tree_reg,
                            X=housing,
                            y=housing_lables.median_house_value,
                            scoring="neg_root_mean_squared_error",
                            # cv functions in sklearn aim to maximize the score to find optimal models,
                            # => RMSE is replaced by negative RMSE
                            # the higher the score the better the model at prediction
                            cv=10 # cross validation strategy
                            )

In [15]:
pd.Series(-tree_rmse).describe()
# mean: estimated RMSE on unseen data
# std: how much that estimate varies across the 10 folds (its precision)

count       10.000000
mean     67130.179601
std       2739.852690
min      62414.452955
25%      66279.645501
50%      67577.222842
75%      68245.416450
max      71549.880237
dtype: float64

- `Ensemble`: a model built on top of other models.
- E.g.: `RandomForestRegressor` trains many `DecisionTreeRegressor`s on random **subsets of the training data** (bootstrap samples, optionally with random **subsets of the features** at each split), then takes the **mean of their predictions**.
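
The averaging step can be checked directly on a fitted forest: the fitted trees are exposed in `estimators_`, and the mean of their predictions should match `forest.predict`. The sketch below uses made-up data and a small `n_estimators`, both assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data with a nonlinear target
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

# One row of predictions per individual decision tree
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

manual_mean = per_tree.mean(axis=0)  # average over the 20 trees
forest_pred = forest.predict(X)      # what the ensemble itself returns
```

Averaging many trees that each saw a different sample of the data smooths out the errors of any single tree.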

In [16]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing,
                           RandomForestRegressor(random_state=42))
forest_rmse = -cross_val_score(
    estimator=forest_reg,
    X=housing,
    y=housing_lables.median_house_value,
    scoring="neg_root_mean_squared_error",
    cv=10
)

- Cross-validating **RandomForestRegressor** is slow (~3 minutes)
- The precomputed results are stored in `prepared-data/forest_rmse_cross_validation_10folds_randstate42.npz`
- They were saved using the Python code below
   ```python
   import numpy as np 
   np.savez_compressed('prepared-data/forest_rmse_cross_validation_10folds_randstate42.npz',
                           forest_rmse=forest_rmse)
   ```
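
Reading the scores back is the mirror image of saving them. The round trip below is a sketch that writes to a temporary directory with stand-in values, so the file name `forest_rmse_demo.npz` and the numbers are hypothetical; only the key `forest_rmse` matches the saving code above.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the 10 cross-validation scores
scores = np.linspace(43000.0, 51000.0, 10)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "forest_rmse_demo.npz")
    np.savez_compressed(path, forest_rmse=scores)

    # np.load returns an archive; arrays are looked up by the keyword used when saving
    with np.load(path) as archive:
        loaded = archive["forest_rmse"]
```

For the real file, the same `np.load(...)["forest_rmse"]` lookup on the path above would return the stored array without rerunning the cross-validation.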

In [19]:
pd.Series(forest_rmse).describe()
# lowest std of the models cross-validated so far
# much lower mean RMSE than the decision tree
# => suitable candidate model

count       10.000000
mean     47328.317269
std       2527.589595
min      43625.026527
25%      45224.777860
50%      47291.793601
75%      49073.526315
max      51265.623767
dtype: float64