
## **Data Modeling Case - California Housing**
This notebook works through the Data Modeling step of CRISP-DM, using the California Housing dataset loaded with `fetch_california_housing` from SKLEARN.

Three different models are assessed on how well they predict the median house value from the block-group attributes.

Dataset attributes:
- MedInc:        median income in block group
- HouseAge:      median house age in block group
- AveRooms:      average number of rooms per household
- AveBedrms:     average number of bedrooms per household
- Population:    block group population
- AveOccup:      average number of household members
- Latitude:      block group latitude
- Longitude:     block group longitude
- Target: median house value for California districts, expressed in hundreds of thousands of dollars ($100,000)

In [16]:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from xgboost import XGBRegressor
import pandas as pd
import numpy as np

In [4]:
housing = datasets.fetch_california_housing()

In [5]:
housing.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [6]:
print(housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [7]:
X = housing.data      # house features
y = housing.target    # known median house values

# Modeling Techniques

1. Linear Regression from SKLEARN:
<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html>
2. Support Vector Regression from SKLEARN:
<https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html>
3. Decision Tree Regression (gradient-boosted trees, `XGBRegressor`) from XGBOOST:
<https://xgboost.readthedocs.io/en/stable/python/python_api.html>

Modeling Assumptions:
- Numeric variables only

# **Test Design**

**Dataset split with SKLEARN:**
<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>
- 80% train
- 10% validation
- 10% test

**Assess models:**
MSE and RMSE with SKLEARN:
<https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html>
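
For reference, MSE is the mean of the squared differences between true and predicted values, and RMSE is its square root, which puts the error back in the target's units. Below is a minimal pure-Python sketch of both quantities. The helper names `mse` and `rmse` and the sample values are illustrative only; the notebook itself uses `mean_squared_error` and `np.sqrt`.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

# Illustrative values only (targets in units of $100,000).
y_true_demo = [2.0, 1.5, 3.0, 4.0]
y_pred_demo = [2.5, 1.5, 2.0, 4.5]
print("MSE: ", mse(y_true_demo, y_pred_demo))   # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
print("RMSE:", rmse(y_true_demo, y_pred_demo))
```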

In [8]:
# splitting data: 80% train, then the 20% hold-out is halved into validation and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=0)
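
The second call splits the 20% hold-out in half, which is how the two `train_test_split` calls produce the 80/10/10 design. Below is a quick arithmetic check, assuming the 20,640 instances reported in the dataset description; the variable names are illustrative.

```python
import math

n_samples = 20640                         # instances listed in housing.DESCR

# First split: test_size=0.2 holds out 20% of the rows.
# ceil mirrors rounding the hold-out size up; here the products are whole numbers anyway.
n_holdout = math.ceil(0.2 * n_samples)
n_train = n_samples - n_holdout

# Second split: test_size=0.5 halves the hold-out into validation and test.
n_test = math.ceil(0.5 * n_holdout)
n_val = n_holdout - n_test

print(n_train, n_val, n_test)             # 16512 2064 2064
```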

**Linear Regression SKLEARN**

In [9]:
# building model
linReg = LinearRegression()
linReg.fit(X_train, y_train)

In [10]:
# assessing model
y_pred_LR = linReg.predict(X_val)
MSE_LR = mean_squared_error(y_val, y_pred_LR)
RMSE_LR = np.sqrt(MSE_LR)
print("MSE: ", MSE_LR)
print("RMSE: ",RMSE_LR)

MSE:  0.5439928608878853
RMSE:  0.737558716908617


**Support Vector Regression**

In [11]:
# building model
regSVR = SVR()
regSVR.fit(X_train, y_train)

In [12]:
# assessing model
y_pred_SVR = regSVR.predict(X_val)
MSE_SVR = mean_squared_error(y_val, y_pred_SVR)
RMSE_SVR = np.sqrt(MSE_SVR)
print("MSE: ", MSE_SVR)
print("RMSE: ",RMSE_SVR)

MSE:  1.3585302653827458
RMSE:  1.1655600651115092
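
SVR with its default RBF kernel is sensitive to feature scale, and the features here span very different ranges (for example, MedInc versus Population), so the unscaled fit above may understate what SVR can do. Below is a minimal sketch of wrapping SVR in a scaling pipeline. It runs on synthetic stand-in data rather than the housing data, and `X_demo`, `y_demo`, and `scaled_svr` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the housing data):
# two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(0, 10, size=200),      # small-scale feature
    rng.uniform(0, 5000, size=200),    # large-scale feature
])
y_demo = 0.4 * X_demo[:, 0] + 0.0002 * X_demo[:, 1]

# Standardize each feature to zero mean and unit variance, then fit SVR.
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X_demo, y_demo)
demo_pred = scaled_svr.predict(X_demo)
```

In the notebook, the same pipeline could be fitted on `X_train, y_train` and scored on `X_val` exactly like `regSVR`; whether it closes the gap to the other models would need to be measured.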


**Decision Tree Regression**

In [13]:
# building model
decTreeReg = XGBRegressor()
decTreeReg.fit(X_train, y_train)

In [14]:
# assessing model
y_pred_DTR = decTreeReg.predict(X_val)
MSE_DTR = mean_squared_error(y_val, y_pred_DTR)
RMSE_DTR = np.sqrt(MSE_DTR)
print("MSE: ", MSE_DTR)
print("RMSE: ",RMSE_DTR)

MSE:  0.20790119411658678
RMSE:  0.45596183405696006


The decision tree regression from XGBOOST achieved the lowest validation error of the three models (RMSE of about 0.456, against 0.738 for linear regression and 1.166 for SVR). Next, we try to improve it further by tuning some of its default parameters, then assess the tuned model on the test sample to confirm the result.

**Hyper-parameter optimization** with SKLEARN GridSearchCV <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>

In [18]:
# looking for parameters
decTreeReg.get_params().keys()

dict_keys(['objective', 'base_score', 'booster', 'callbacks', 'colsample_bylevel', 'colsample_bynode', 'colsample_bytree', 'device', 'early_stopping_rounds', 'enable_categorical', 'eval_metric', 'feature_types', 'gamma', 'grow_policy', 'importance_type', 'interaction_constraints', 'learning_rate', 'max_bin', 'max_cat_threshold', 'max_cat_to_onehot', 'max_delta_step', 'max_depth', 'max_leaves', 'min_child_weight', 'missing', 'monotone_constraints', 'multi_strategy', 'n_estimators', 'n_jobs', 'num_parallel_tree', 'random_state', 'reg_alpha', 'reg_lambda', 'sampling_method', 'scale_pos_weight', 'subsample', 'tree_method', 'validate_parameters', 'verbosity'])

In [29]:
# the first value in each list is the XGBoost default; the others are nearby values to try
parameters = {
    'max_depth': [6, 5, 7],
    'learning_rate': [0.3, 0.1, 0.2],
    'objective': ['reg:squarederror'],
    'booster': ['gbtree'],
    'n_jobs': [5],
    'gamma': [0, 1, 2],
    'min_child_weight': [1, 3],
    'max_delta_step': [0, 1, 3],
    'subsample': [1, 0, 0.5]
}
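
A grid search fits every combination of the listed values, so the cost grows multiplicatively with each extra option. Below is a quick count for this grid, assuming GridSearchCV's default of 5 cross-validation folds; the names `n_candidates` and `n_folds` are illustrative.

```python
import math

# Same grid as in the cell above, repeated so the snippet runs on its own.
parameters = {
    'max_depth': [6, 5, 7],
    'learning_rate': [0.3, 0.1, 0.2],
    'objective': ['reg:squarederror'],
    'booster': ['gbtree'],
    'n_jobs': [5],
    'gamma': [0, 1, 2],
    'min_child_weight': [1, 3],
    'max_delta_step': [0, 1, 3],
    'subsample': [1, 0, 0.5]
}

# Number of candidates = product of the number of options per parameter.
n_candidates = math.prod(len(values) for values in parameters.values())
n_folds = 5                                   # GridSearchCV default when cv is not given
print(n_candidates, n_candidates * n_folds)   # 486 2430
```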

In [30]:
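# no `scoring` is given, so candidates are ranked by the default R^2 score; refit=True retrains the best one on all of X_train
XGBGrid = GridSearchCV(XGBRegressor(), parameters, refit=True, verbose=4)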

In [31]:
XGBGridModel = XGBGrid.fit(X_train, y_train)

Fitting 5 folds for each of 486 candidates, totalling 2430 fits
[CV 1/5] END booster=gbtree, gamma=0, learning_rate=0.3, max_delta_step=0, max_depth=6, min_child_weight=1, n_jobs=5, objective=reg:squarederror, subsample=1;, score=0.825 total time=   0.6s
[CV 2/5] END booster=gbtree, gamma=0, learning_rate=0.3, max_delta_step=0, max_depth=6, min_child_weight=1, n_jobs=5, objective=reg:squarederror, subsample=1;, score=0.833 total time=   0.6s
[CV 3/5] END booster=gbtree, gamma=0, learning_rate=0.3, max_delta_step=0, max_depth=6, min_child_weight=1, n_jobs=5, objective=reg:squarederror, subsample=1;, score=0.822 total time=   0.6s
[CV 4/5] END booster=gbtree, gamma=0, learning_rate=0.3, max_delta_step=0, max_depth=6, min_child_weight=1, n_jobs=5, objective=reg:squarederror, subsample=1;, score=0.829 total time=   0.6s
[CV 5/5] END booster=gbtree, gamma=0, learning_rate=0.3, max_delta_step=0, max_depth=6, min_child_weight=1, n_jobs=5, objective=reg:squarederror, subsample=1;, score=0.821 

In [35]:
print(XGBGridModel.best_params_)
print(XGBGridModel.best_score_)

{'booster': 'gbtree', 'gamma': 0, 'learning_rate': 0.2, 'max_delta_step': 0, 'max_depth': 7, 'min_child_weight': 3, 'n_jobs': 5, 'objective': 'reg:squarederror', 'subsample': 1}
0.8320037947513885


In [36]:
# assessing the tuned model on the held-out test sample
y_pred_DTR_final = XGBGridModel.predict(X_test)
MSE_DTR_final = mean_squared_error(y_test, y_pred_DTR_final)
RMSE_DTR_final = np.sqrt(MSE_DTR_final)
print("final MSE: ", MSE_DTR_final)
print("final RMSE: ",RMSE_DTR_final)

final MSE:  0.2022242506670472
final RMSE:  0.4496935074770895
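
Because the target is expressed in units of $100,000, the final RMSE can be read in dollars by multiplying by 100,000. Below is a small sketch of that conversion; `TARGET_UNIT_USD` and `final_rmse_usd` are illustrative names.

```python
TARGET_UNIT_USD = 100_000            # target is in hundreds of thousands of dollars

final_rmse = 0.4496935074770895      # test RMSE printed above
final_rmse_usd = final_rmse * TARGET_UNIT_USD
print(f"final RMSE is about ${final_rmse_usd:,.0f}")   # final RMSE is about $44,969
```

Read this way, the tuned model's root-mean-square error on the test sample is roughly $45,000 in median house value.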
