# Driver Analysis

This notebook helps explore the data, analyze performance drivers for the target variable, track and tune models with MLflow, and package the results for prediction. It is not exhaustive and does not cover every aspect of the modeling pipeline; it provides a framework and guide for generating predictive models to use in a whitespace analysis.

In [None]:
import sys
import pathlib
import sweetviz as sv
import mlflow

# Make the project root importable so the local packages below resolve
sys.path.insert(0, '../../')

from models.models import WhitespaceModel
from utils.carto_helpers import set_creds, get_creds
from constants.global_constants import auv_features, MLFLOW_PATH

## Read data

In [None]:
# Set carto credentials
set_creds(type="cloud")

Here we assign the core table and column names used throughout the rest of the analysis:
- `store_table`: Enriched client locations with our dependent variable and aggregated predictor columns
- `id_col`: ID column in `store_table`
- `target`: Target column with dependent variable in `store_table`
- `report_exclude`: List of columns to exclude from the analysis and EDA report

In [None]:
# Client store locations and revenue
store_table = 'vtg_test_modeling_clean'

# Variable assignment
id_col = 'store_id'
target = 'target_var'

# Columns to exclude
report_exclude = ['cartodb_id', 'the_geom']

Instantiate the `WhitespaceModel` class.

In [None]:
# import cartoframes as cf
# import numpy as np

# tbl=cf.read_carto(store_table)

In [None]:
# cf.to_carto(tbl.fillna(np.nan), store_table, if_exists='replace')

In [None]:
wm = WhitespaceModel()

Build our train and test data. `load_data` takes the following inputs:
- `data_path`: Name of a Carto table stored in the account (recommended) or path to a local data file
- `features`: Optional dictionary of features to include in or exclude from the modeling dataset. If not provided, the `auv_features` dictionary defined in `constants/global_constants.py` is used
- `include`: Binary flag indicating whether the features list names columns to include (1) or exclude (0) from the modeling dataset. Defaults to 1
- `catboost_pool`: Binary flag indicating whether a CatBoost pool should also be returned as an item of the `model_df` output
- `test_size`: Optional fraction of the data to hold out for the validation dataset. Defaults to 0.2

It returns two objects:
- `df`: The dataset read in from Carto or from the local file path
- `model_df`: Dictionary with the train-test split and, if requested, the CatBoost pool. Items are stored as `model_df['X_train']`, `model_df['y_train']`, `model_df['X_test']`, etc.
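
To make the `include` and `test_size` semantics concrete, here is a minimal standard-library sketch. It is not the actual `load_data` implementation: the helper names `select_columns` and `split_model_data`, the sample rows, and the seeded shuffle are all assumptions made for illustration.

```python
import random


def select_columns(rows, features, include=1):
    """Keep (include=1) or drop (include=0) the listed feature columns."""
    if include:
        return [{k: v for k, v in row.items() if k in features} for row in rows]
    return [{k: v for k, v in row.items() if k not in features} for row in rows]


def split_model_data(rows, target, features, include=1, test_size=0.2, seed=0):
    """Hypothetical stand-in for the dictionary returned as `model_df`."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_size))
    test, train = shuffled[:n_test], shuffled[n_test:]

    def features_only(part):
        # Drop the target first so it can never leak into the predictors
        no_target = [{k: v for k, v in r.items() if k != target} for r in part]
        return select_columns(no_target, features, include)

    return {
        "X_train": features_only(train),
        "y_train": [r[target] for r in train],
        "X_test": features_only(test),
        "y_test": [r[target] for r in test],
    }


# Illustrative data only: ten stores with one predictor and a target
rows = [{"store_id": i, "population": 100 * i, "target_var": float(i)} for i in range(10)]

# include=0 treats the list as columns to exclude
model_df = split_model_data(rows, "target_var", ["store_id"], include=0)
```

With `include=0` the listed columns are removed and everything else (minus the target) is kept as a predictor; with `include=1` only the listed columns survive.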

## Exploratory analysis

In [None]:
# include=0 treats the feature list as columns to exclude
df, model_df = wm.load_data(data_path=store_table, include=0)

The helper functions below give a sense of data quality and highlight areas to explore further.

`check_missing` takes our dataframe as input and reports the number of columns with missing data, along with the most problematic columns.

In [None]:
wm.check_missing(df)
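
As a rough illustration of the kind of summary described above, the following standard-library sketch counts missing values per column and lists the worst columns first. The function names and sample rows are hypothetical, and the real `check_missing` may report its results differently.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_summary(rows):
    """Return {column: missing count} for columns with gaps, worst first."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            if is_missing(value):
                counts[col] = counts.get(col, 0) + 1
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))


# Illustrative rows only
rows = [
    {"store_id": 1, "population": 1200.0, "median_income": None},
    {"store_id": 2, "population": float("nan"), "median_income": None},
    {"store_id": 3, "population": 900.0, "median_income": 52000.0},
]
summary = missing_summary(rows)
```

On a pandas dataframe, `df.isna().sum()` gives the same per-column counts.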

`get_corrs` takes our store dataframe, the target column, the columns to exclude, and the number of rows to display, and outputs the top correlating features with their correlation coefficients.

In [None]:
wm.get_corrs(df, target, exclude_cols='cartodb_id', n_display=25)
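
The ranking can be sketched as follows, assuming a Pearson coefficient and ordering by absolute value. Both are assumptions, since the source does not say which coefficient `get_corrs` uses. The function names and sample columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns None when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def top_correlations(columns, target, exclude_cols=(), n_display=25):
    """Rank features by absolute correlation with the target."""
    scores = {}
    for name, values in columns.items():
        if name == target or name in exclude_cols:
            continue
        r = pearson(values, columns[target])
        if r is not None:  # skip constant columns, which have no defined correlation
            scores[name] = r
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:n_display]


# Illustrative columns only
columns = {
    "cartodb_id": [1, 2, 3, 4, 5],
    "population": [10.0, 20.0, 30.0, 40.0, 50.0],
    "competitors": [5.0, 3.0, 4.0, 1.0, 2.0],
    "target_var": [1.0, 2.0, 3.0, 4.0, 5.0],
}
ranked = top_correlations(columns, "target_var", exclude_cols=("cartodb_id",), n_display=2)
```

Sorting by absolute value keeps strong negative drivers (here `competitors`) near the top instead of pushing them to the bottom of the list.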

This cell uses the sweetviz library to generate a profile report on each variable in our dataset and its relationship with the target column.

In [None]:
# eda_report = sv.analyze([df.drop(columns=report_exclude), "Target Data"], target_feat=target)
# eda_report.show_html()

## Train models

Model training helper functions use MLflow tracking to enable fast testing and iteration. This lets us store different model hyperparameters and study their effect on performance.

Current functionality supports xgboost, randomforest, and catboost regression models, but the scripts in `models.py` can be updated to incorporate more model types. In the following cells, we train one model of each type and return the sklearn pipeline used to generate predictions along with the resulting accuracy metrics.

Each model type generates a sklearn pre-processing pipeline and model object. First, we set the MLflow tracking URI, which determines where MLflow runs are saved. To change it, adjust the `MLFLOW_PATH` variable in `global_constants.py`.

In [None]:
mlflow.set_tracking_uri(MLFLOW_PATH)

In [None]:
# Create an experiment; the name must be unique and is case sensitive
experiment_id = mlflow.create_experiment("Initial Tests")
experiment = mlflow.get_experiment(experiment_id)

Metrics are low here because the target data is randomly generated. In practice we would aim for an R2 of 0.2 or above; otherwise, moving to a business rules approach is recommended.
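
For reference, R2 compares the model's squared error against that of always predicting the mean. The sketch below uses the standard definition with made-up numbers; the function name `r_squared` and the constant `MIN_R2` are illustrative, and the metric logged by `mlflow_train` may be computed on a different split or with a different implementation.

```python
MIN_R2 = 0.2  # threshold suggested in the text above


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only
y_true = [10.0, 20.0, 30.0, 40.0]
mean_baseline = [25.0, 25.0, 25.0, 25.0]   # always predict the mean
model_preds = [14.0, 18.0, 33.0, 35.0]

baseline_r2 = r_squared(y_true, mean_baseline)
model_r2 = r_squared(y_true, model_preds)
use_model = model_r2 >= MIN_R2
```

A model that does no better than predicting the mean scores 0, so a threshold of 0.2 asks the model to explain at least 20% of the variance in the target.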

In [None]:
# Train xgboost model
xgb, xgb_metrics = wm.mlflow_train(model_df, "xgboost", experiment_id)

In [None]:
# # Train random forest model
# rf, rf_metrics = wm.mlflow_train(model_df, "randomforest", experiment_id)

In [None]:
# Train catboost model
cb, cb_metrics = wm.mlflow_train(model_df, "catboost", experiment_id)

You can also view a report on feature importance using the package SHAP and the following helper function:

### Optimize hyperparameters

In [None]:
import mlflow
import numpy as np
import hyperopt

from hyperopt.pyll.base import scope
from hyperopt import hp, fmin, tpe, STATUS_OK, STATUS_FAIL, Trials

We can set up an experiment to optimize hyperparameters using MLflow and Hyperopt.

In [None]:
MAX_EVALS = 50
METRIC = "val_RMSE"

In [None]:
# Create an experiment; the name must be unique and is case sensitive
experiment_id = mlflow.create_experiment("Hyperopt")
experiment = mlflow.get_experiment(experiment_id)

space = wm.search_space()
trials = Trials()

In [None]:
train_objective = wm.hyperopt_objective(model_df, METRIC, experiment_id)
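
Hyperopt minimizes an objective that maps a sampled parameter dictionary to a result dictionary containing at least `loss` and `status`. The sketch below illustrates that contract with plain random search standing in for TPE, using only the standard library. The names `make_objective`, `toy_rmse`, and `random_search` are hypothetical, and the internals of `wm.hyperopt_objective` are not shown in this notebook.

```python
import random

STATUS_OK = "ok"  # same string value as hyperopt.STATUS_OK


def make_objective(train_and_score):
    """Wrap a scoring function in the result format an optimizer expects."""
    def objective(params):
        rmse = train_and_score(params)
        return {"loss": rmse, "status": STATUS_OK, "params": params}
    return objective


def toy_rmse(params):
    # Hypothetical stand-in for training a model: error is lowest at 0.1
    return abs(params["learning_rate"] - 0.1) + 1.0


def random_search(objective, sample, max_evals, seed=0):
    """Random search (not TPE): evaluate max_evals samples, keep the lowest loss."""
    rng = random.Random(seed)
    results = [objective(sample(rng)) for _ in range(max_evals)]
    return min(results, key=lambda r: r["loss"])


best = random_search(
    make_objective(toy_rmse),
    lambda rng: {"learning_rate": rng.uniform(0.01, 0.3)},
    max_evals=50,
)
```

TPE differs from this sketch in that it uses earlier results to decide where to sample next, which usually finds a good region in fewer evaluations.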

In [None]:
hyperopt.fmin(fn=train_objective,
              space=space,
              algo=hyperopt.tpe.suggest,
              max_evals=MAX_EVALS,
              trials=trials)

In [None]:
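# Report the best run by METRIC.
# NOTE: `hyperopt_best` is not imported anywhere in this notebook; bring it into scope before running this cell.
hyperopt_best(experiment_id, METRIC)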

## Results exploration

To examine model results, run `mlflow ui` in a terminal from the directory where this notebook is located.

In [None]:
# !mlflow ui

## Save results

In [None]:
# import joblib
# joblib.dump(cb, '../models/mod_pipeline.pkl')