## steps
1. ✅Understand the whole picture
   - Quantify the final objectives?
   - What is the current solution (if any)? What are its downsides?
2. ✅Get data
3. ✅explore, visualize data => **insight**
4. ✅prepare data for machine learning algo
5. select and train a model <- 
6. fine-tune model 
   - Measure the errors made by the model, with **RMSE** or **MAE**
7. present solution
8. launch, monitor, maintain system
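
As a minimal sketch of the two error metrics named in step 6, the helpers below compute RMSE and MAE by hand on made-up values. The function names `rmse` and `mae` and the sample numbers are illustrative assumptions, not part of this notebook's data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large errors weigh more."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(y_true, y_pred):
    """Mean absolute error: every error contributes in proportion to its size."""
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)

# Hypothetical targets and predictions (errors: -10, 10, 0, -40)
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 300.0, 440.0]

print(f"RMSE: {rmse(y_true, y_pred):.2f}")  # RMSE: 21.21
print(f"MAE:  {mae(y_true, y_pred):.2f}")   # MAE:  15.00
```

RMSE comes out larger than MAE here because the single error of 40 dominates once it is squared, so RMSE is the more sensitive of the two to outliers.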

In [1]:
import pandas as pd
housing_lables = pd.read_feather("prepared-data/strat_train_set_lables.feather").set_index("index")
housing = pd.read_feather("prepared-data/strat_train_set_features.feather").set_index("index")

In [2]:
from utils import (
    ClusterSimilarity,
    log_pipeline,
    cat_pipeline,
    default_num_pipeline,
    ratio_pipeline)
from sklearn.compose import ColumnTransformer, make_column_selector

cluster_simil = ClusterSimilarity(
    n_clusters=10, gamma=1, random_state=42
)

preprocessing=ColumnTransformer([
    ("bedrooms", ratio_pipeline, ['total_bedrooms','total_rooms']),
    ('rooms_per_house', ratio_pipeline, ['total_rooms',"households"]),
    ('people_per_house', ratio_pipeline, ['population',"households"]),
    ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                           "households", "median_income"]),
    ('geo', cluster_simil, ['latitude','longitude']),
    ('cat', cat_pipeline, make_column_selector(dtype_include=object))], # Ocean Proximity
    remainder=default_num_pipeline # housing_median_age
    )

# I. Train and evaluate on the training set

## test the newly created **linear regression prediction** pipeline

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

lin_reg = make_pipeline(
    preprocessing,
    LinearRegression()
)

In [5]:
lin_reg.fit(
    X=housing,
    y=housing_lables.median_house_value.to_numpy()
)

- evaluate model using **RMSE** (*root mean squared error*)

In [6]:
housing_predictions = lin_reg.predict(housing)

In [10]:
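from sklearn.metrics import mean_squared_error

# squared=False makes mean_squared_error return the RMSE instead of the MSE.
# Note: `squared` was deprecated in scikit-learn 1.4 and removed in later releases;
# on newer versions use sklearn.metrics.root_mean_squared_error instead.
lin_rmse = mean_squared_error(
    housing_lables,
    housing_predictions,
    squared=False
)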

In [17]:
(lin_rmse/ (housing_lables.max() - housing_lables.min()))

median_house_value    0.142694
dtype: float64

## Thoughts

### observations

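- `lin_rmse` is `69207.068`, which is *14.27%* of the `median_house_value` range (max minus min):
  ```python
  (
    lin_rmse
    /
    (housing_lables.max() - housing_lables.min())
  )
  # median_house_value    0.142694, i.e. 14.27%
  ```
- This error is measured on the **train set**, i.e. on the same data the model was fitted on.
- An error this large on data the model has already seen means the model is **UNDERFITTING**.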

### Ideas

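Options to deal with *underfitting*:
- Feed the model better features (adding more rows alone rarely fixes underfitting)
- Choose a more powerful model, e.g. `DecisionTreeRegressor`
- Reduce the constraints on the model (less regularization). `LinearRegression` is not regularized, so here this mostly means tuning the preprocessing hyperparameters, e.g. `gamma` and `n_clusters` in `ClusterSimilarity`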

In [18]:
# Try DecisionTreeRegressor, which can capture nonlinear relationships
from sklearn.tree import DecisionTreeRegressor
tree_reg=make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(housing, housing_lables.median_house_value)

In [19]:
housing_predictions = tree_reg.predict(housing)
tree_rmse = mean_squared_error(
    housing_lables.median_house_value,
    housing_predictions,
    squared=False
)

- `tree_rmse` is `0.0`: zero error on the train set, which almost certainly means severe **overfitting** rather than a perfect model
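
One way to check the overfitting suspicion without touching the test set is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses a small synthetic dataset in place of `housing` (the `utils` pipelines are not reproduced here) and a bare `DecisionTreeRegressor`, so the numbers it prints say nothing about the real housing data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption, not the real dataset)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=2.0, size=500)

# An unconstrained tree can memorize every training sample
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
train_rmse = np.sqrt(np.mean((y - tree.predict(X)) ** 2))

# 10-fold cross-validation: each fold is scored on rows the tree did not see
neg_mse = cross_val_score(
    DecisionTreeRegressor(random_state=42),
    X,
    y,
    scoring="neg_mean_squared_error",
    cv=10,
)
cv_rmse = np.sqrt(-neg_mse)  # scores are negated MSE, so flip the sign first

print(f"train RMSE: {train_rmse:.3f}")
print(f"CV RMSE: mean={cv_rmse.mean():.3f}, std={cv_rmse.std():.3f}")
```

If the cross-validated RMSE is far above the training RMSE, as it should be here because of the injected noise, the model is memorizing the training set. The same `cross_val_score` call should also accept a full pipeline such as `tree_reg` in place of the bare estimator.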