In this notebook we will see how `wandb` enables regulated entities to:
1. `Track and version` their data ETL pipelines (locally or in `GCS`, `S3`)
2. `Track experiment results` and store trained models
3. `Visually Inspect` metrics and candidate credit scorecards
4. `Optimize Performance` with hyperparameter sweeps

In [15]:
import ast
import sys
import json
from pathlib import Path
from dill.source import getsource
from dill import detect

import pandas as pd
import numpy as np
import plotly
import matplotlib.pyplot as plt

from scipy.stats import ks_2samp
from sklearn import metrics, model_selection
import xgboost as xgb

pd.set_option('display.max_columns', None)

## Data
`Wandb Artifacts` lets you log end-to-end training pipelines so that your experiments stay reproducible.

### Artifacts Reference Example
<b>Create an artifact with the `S3/GCS` metadata</b>

The artifact only consists of metadata about the S3/GCS object such as its ETag, size, and version ID (if object versioning is enabled on the bucket).
```python
run = wandb.init()
artifact = wandb.Artifact('mnist', type='dataset')
artifact.add_reference('s3://my-bucket/datasets/mnist')
run.log_artifact(artifact)
```
<b>Download the `artifact` locally when needed</b>

W&B will use the metadata recorded when the artifact was logged to retrieve the files from the underlying bucket.
```python
artifact = run.use_artifact('mnist:latest', type='dataset')
artifact_dir = artifact.download()
```
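
To make the idea of a reference artifact more concrete, the sketch below computes the kind of metadata a reference could record for a file (a content digest and a size) without copying the file itself. It uses only the standard library and is not W&B's implementation: `describe_object` is a hypothetical helper, and the MD5 digest merely stands in for an ETag (for simple, non-multipart S3 uploads the ETag is typically the MD5 of the object, but this is not guaranteed in general).

```python
import hashlib
import tempfile
from pathlib import Path


def describe_object(path):
    """Return reference-style metadata for a local file (hypothetical helper)."""
    data = Path(path).read_bytes()
    return {
        "name": Path(path).name,
        "digest": hashlib.md5(data).hexdigest(),  # stand-in for an ETag
        "size": len(data),                        # size in bytes
    }


# Demonstrate on a small temporary file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "mnist_sample.bin"
    sample.write_bytes(b"hello")
    meta = describe_object(sample)

print(meta)
```

If the file contents change, the digest changes too, which is what allows a stored reference to detect that the underlying object is no longer the version that was logged.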

In [16]:
# Setup 
import wandb
wandb.login()

ENTITY = None 

### Vehicle Loan Dataset
Load the dataset using Wandb `Artifacts`

In [17]:
data_dir = Path('.')
model_dir = Path('models')
model_dir.mkdir(exist_ok=True)

id_vars = ['UniqueID']
target_var = 'loan_default'

In [18]:
def fn_to_string(fn):
    """ Return the source code of a function as a string """
    return getsource(detect.code(fn))

Download data from Wandb Artifacts

In [19]:
run = wandb.init(entity=ENTITY, project='credit_scorecard_boosting_WandbEx', job_type='preprocess-data', config={'wandb_nb': 'wandb_credit_soc'})

In [20]:
dataset_art = run.use_artifact('morgan/credit_scorecard/vehicle_loan_defaults:latest', type='dataset')
dataset_dir = dataset_art.download(data_dir)

In [21]:
from data_utils import (
    describe_data_g_targ,
    one_hot_encode_data,
    create_feature_interaction_constraints,
    get_monotonic_constraints,
    load_training_data,
    calculate_credit_scores
)

from scorecard import generate_scorecard

### One Hot Encode Data

In [22]:
# Load data into Dataframe
dataset = pd.read_csv(data_dir/'vehicle_loans_subset.csv')

# One Hot Encode Data
dataset, p_vars = one_hot_encode_data(dataset, id_vars, target_var)

# Save Preprocessed data
processed_data_path = data_dir/'proc_ds.csv'
dataset.to_csv(processed_data_path, index=False)

### Log Processed Data to Wandb Artifacts

In [23]:
# Create a new Artifact for the processed data, storing the source of the function that created it as metadata
processed_ds_art = wandb.Artifact(name='vehicle_defaults_processed', 
                                    type='processed_dataset',
                                    description='One-hot encoded dataset',
                                    metadata={'preprocessing_fn': fn_to_string(one_hot_encode_data)}
                                 )

# Attach our processed data to the Artifact 
processed_ds_art.add_file(processed_data_path)

# Log this Artifact to the current wandb run
run.log_artifact(processed_ds_art)

run.finish()


### Get Train/Val Split

In [24]:
with wandb.init(entity=ENTITY, project='credit_scorecard_boosting_WandbEx', job_type='train-val-split', config={'wandb_nb': 'wandb_credit_soc'}) as run:
    # Download the processed vehicle loan default data from W&B
    dataset_art = run.use_artifact('vehicle_defaults_processed:latest', type='processed_dataset')
    dataset_dir = dataset_art.download(data_dir)
    dataset = pd.read_csv(processed_data_path)
    
    # Set Split Params
    test_size = 0.25
    random_state = 42
    
    # Log the split params
    run.config.update({'test_size':test_size, 'random_state': random_state})
    
    # Do the Train/Val Split
    trndat, valdat = model_selection.train_test_split(dataset, test_size=test_size, 
                                                      random_state=random_state, stratify=dataset[[target_var]])

    print(f'Train dataset size: {trndat[target_var].value_counts()} \n')
    print(f'Validation dataset size: {valdat[target_var].value_counts()}')
    
    # Save split datasets
    train_path = data_dir/'train.csv'
    val_path = data_dir/'val.csv'
    trndat.to_csv(train_path, index=False)
    valdat.to_csv(val_path, index=False)
    
    # Create a new artifact for the split data, recording the split parameters as metadata
    split_ds_art = wandb.Artifact(name='vehicle_defaults_split', 
                                        type='train-val-dataset',
                                        description='Processed dataset split into train and validation',
                                        metadata={'test_size': test_size, 'random_state': random_state}
                                     )
    
    # Attach our train and validation data to the Artifact 
    split_ds_art.add_file(train_path)
    split_ds_art.add_file(val_path)
    
    # Log the Artifact
    run.log_artifact(split_ds_art)

Train dataset size: 0    136907
1     37958
Name: loan_default, dtype: int64 

Validation dataset size: 0    45636
1    12653
Name: loan_default, dtype: int64
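
As a quick sanity check on the stratified split, the class counts printed above can be turned into default rates. The sketch below simply copies those counts by hand (it does not re-run the split) and compares the two rates and the validation share against `test_size = 0.25`.

```python
# Class counts copied from the output above
train_counts = {0: 136907, 1: 37958}
val_counts = {0: 45636, 1: 12653}


def default_rate(counts):
    """Share of records with loan_default == 1."""
    return counts[1] / (counts[0] + counts[1])


train_rate = default_rate(train_counts)
val_rate = default_rate(val_counts)
n_train = sum(train_counts.values())
n_val = sum(val_counts.values())

print(f"train default rate: {train_rate:.4f}")
print(f"val default rate:   {val_rate:.4f}")
print(f"validation share:   {n_val / (n_train + n_val):.4f}")
```

Both rates come out at roughly 21.7%, and the validation set holds about a quarter of the records, which is what stratifying on `loan_default` with `test_size = 0.25` should produce.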



#### Inspect Training Dataset

In [25]:
trndict = describe_data_g_targ(trndat, target_var)
trndat.head()

Number of records (num): 174865
Target count (n_targ): 37958
Target rate (base_rate): 0.21707031138306693
Target odds (base_odds): 0.27725390228403224
Target log odds (base_log_odds): -1.2828215778857626
Dummy model negative log-likelihood (NLL_null): 91484.9725597928
Dummy model LogLoss (LogLoss_null): 0.5231748638080393



Unnamed: 0,UniqueID,loan_default,AgeInMonths,DaysSinceDisbursement,LTV,PERFORM_CNS_SCORE,Employment_Type__Salaried,Employment_Type__Self employed,State_ID__1,State_ID__2,State_ID__3,State_ID__4,State_ID__5,State_ID__6,State_ID__7,State_ID__8,State_ID__9,State_ID__10,State_ID__11,State_ID__12,State_ID__13,State_ID__14,State_ID__15,State_ID__16,State_ID__17,State_ID__18,State_ID__19,State_ID__20,State_ID__21,State_ID__22,manufacturer_id__45,manufacturer_id__48,manufacturer_id__49,manufacturer_id__51,manufacturer_id__67,manufacturer_id__86,manufacturer_id__120,manufacturer_id__145,manufacturer_id__152,manufacturer_id__153,manufacturer_id__156
17797,435556,0,324.0,141.0,80.0,,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4403,421891,0,351.0,150.0,64.96,718.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
190714,611145,0,255.0,69.0,83.98,,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
115372,534751,0,229.0,98.0,76.38,,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
98591,517772,0,325.0,104.0,77.25,700.0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
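
The summary statistics printed by `describe_data_g_targ` can be reproduced from just the record count and the target count. The sketch below assumes the usual definitions of odds, log odds, and log loss for a constant ("dummy") model that predicts the base rate for every record; it is not taken from `data_utils`, whose implementation is not shown here.

```python
import math

num = 174865     # number of training records (from the output above)
n_targ = 37958   # number of defaults (from the output above)

base_rate = n_targ / num
base_odds = base_rate / (1 - base_rate)
base_log_odds = math.log(base_odds)

# Log loss of a dummy model that predicts base_rate for every record
logloss_null = -(base_rate * math.log(base_rate)
                 + (1 - base_rate) * math.log(1 - base_rate))

# Total negative log-likelihood is the per-record log loss times the record count
nll_null = num * logloss_null

print(base_rate, base_odds, base_log_odds, logloss_null, nll_null)
```

These null-model values give a baseline: a trained model is only useful if its validation log loss falls below `logloss_null`.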


### Log Dataset with Wandb Tables

In [26]:
run = wandb.init(entity=ENTITY, project='credit_scorecard_boosting_WandbEx', job_type='log-dataset', config={'wandb_nb': 'wandb_credit_soc'})

# Create a Wandb Table
table = wandb.Table(dataframe=trndat.sample(1000))

# Log the Table to your W&B workspace
wandb.log({'processed_dataset': table})

# Close the wandb run
wandb.finish()


## Modelling

#### Feature Interaction Modelling Constraints
Set feature interaction constraints such that no between-variable interactions are allowed. This is necessary to be able to use XGBoost to develop a scorecard. 

In [27]:
x_consts, interaction_consts, consts_path = create_feature_interaction_constraints(p_vars, data_dir)
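
XGBoost's `interaction_constraints` parameter takes a nested list in which only features that share an inner list may interact. The implementation of `create_feature_interaction_constraints` is not shown in this notebook, so the following is only a sketch of one way such groups could be built: it assumes that one-hot columns are named `<variable>__<level>` (as in `State_ID__1`) and that the dummies of a single raw variable are allowed to share a tree. `build_interaction_groups` is a hypothetical helper.

```python
def build_interaction_groups(columns, sep="__"):
    """Group one-hot columns by the raw variable they came from."""
    groups = {}
    for col in columns:
        raw_name = col.split(sep)[0]  # 'State_ID__1' -> 'State_ID'
        groups.setdefault(raw_name, []).append(col)
    return list(groups.values())


cols = ["AgeInMonths", "LTV", "Employment_Type__Salaried",
        "Employment_Type__Self employed", "State_ID__1", "State_ID__2"]
groups = build_interaction_groups(cols)
print(groups)
```

With one group per raw variable, every tree can split on a single variable only, so each tree's contribution can be read as points for that variable in a scorecard.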

#### Monotonic Constraints
Manually define which variables will have a `Monotonic Relationship` to our target `loan_default`. 

In [28]:
monotonic_vars = ['PERFORM_CNS_SCORE', 'AgeInMonths', 'LTV']

monotonic_constraints_str = get_monotonic_constraints(monotonic_vars, p_vars, data_dir)
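
XGBoost accepts monotonic constraints as a string such as `"(1,0,-1)"`, with one entry per feature in column order: `1` for increasing, `-1` for decreasing, and `0` for unconstrained. The sketch below shows how such a string could be assembled. `build_monotone_constraints` is a hypothetical helper, not the code behind `get_monotonic_constraints`, and the directions chosen here are illustrative assumptions rather than the notebook's actual settings.

```python
def build_monotone_constraints(columns, directions):
    """Return a string such as '(1,0,-1)', one entry per column in order."""
    return "(" + ",".join(str(directions.get(c, 0)) for c in columns) + ")"


cols = ["AgeInMonths", "LTV", "PERFORM_CNS_SCORE", "State_ID__1"]
# Directions are illustrative assumptions, not the notebook's actual choices
directions = {"AgeInMonths": -1, "LTV": 1, "PERFORM_CNS_SCORE": -1}

monotone_str = build_monotone_constraints(cols, directions)
print(monotone_str)  # (-1,1,-1,0)
```

Columns that do not appear in `directions` default to `0`, so the one-hot dummies stay unconstrained.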

## Fit XGBoost Model

#### Training on GPU
To train the model on a GPU, change the following parameter:
```python
'tree_method': 'gpu_hist'
``` 

In [29]:
run = wandb.init(entity=ENTITY, project='credit_scorecard_boosting_WandbEx', job_type='train-model', config={'wandb_nb': 'wandb_credit_soc'})

#### Setup and Log Parameters
1. Monotonic constraints require `tree_method` to be one of `exact`, `hist` or `gpu_hist`.
2. `n_estimators` defines the number of trees used in the final scorecard.

In [30]:
base_rate = round(trndict['base_rate'], 6)
early_stopping_rounds = 40

In [31]:
bst_params = {
        'objective': 'binary:logistic'
        , 'base_score': base_rate
        , 'gamma': 1               ## def: 0
        , 'learning_rate': 0.1     ## def: 0.1
        , 'max_depth': 3
        , 'min_child_weight': 100  ## def: 1
        , 'n_estimators': 25
        , 'nthread': 24 
        , 'random_state': 42
        , 'reg_alpha': 0
        , 'reg_lambda': 0          ## def: 1
        , 'eval_metric': ['auc', 'logloss']
        , 'tree_method': 'hist'  # use `gpu_hist` to train on GPU
        , 'interaction_constraints' : interaction_consts
        , 'monotone_constraints' : monotonic_constraints_str
    }

Log the xgboost training parameters to the W&B run config

In [32]:
run.config.update(dict(bst_params))
run.config.update({'early_stopping_rounds':early_stopping_rounds})

#### Load Training Data from Wandb Artifacts

In [33]:
# Load our training data from Artifacts
trndat, valdat = load_training_data(run=run, data_dir=data_dir, 
                                    artifact_name='vehicle_defaults_split:latest')

## Extract target column as a series
y_trn = trndat.loc[:,target_var].astype(int)
y_val = valdat.loc[:,target_var].astype(int)

#### Fit Model
To log the xgboost evaluation metrics to W&B during training we use `wandb_callback`.

In [34]:
from wandb.xgboost import wandb_callback

# Initialize the XGBClassifier
xgbmodel = xgb.XGBClassifier(**bst_params, use_label_encoder=False)

# Train the model, using the wandb_callback for logging
xgbmodel.fit(trndat[p_vars], y_trn, eval_set=[(valdat[p_vars], y_val)], 
             early_stopping_rounds=run.config['early_stopping_rounds'],
             callbacks=[wandb_callback()])

bstr = xgbmodel.get_booster()



ValueError: Constrained features are not a subset of training data feature names
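
The error message says that some feature named in the constraints is not among the columns passed to `fit`. One way to investigate is to list those names explicitly. The sketch below assumes the constraints are available as a nested list of feature names (the format returned by `create_feature_interaction_constraints` is not shown here, so this may need adapting); `missing_constrained_features` and the toy column names are hypothetical.

```python
def missing_constrained_features(constraint_groups, train_columns):
    """Return constrained feature names that are absent from the training columns."""
    train_set = set(train_columns)
    return sorted({name for group in constraint_groups for name in group
                   if name not in train_set})


# Toy example with hypothetical names
constraint_groups = [["LTV"], ["State_ID__1", "State_ID__2"], ["State_ID__99"]]
train_columns = ["LTV", "State_ID__1", "State_ID__2"]

missing = missing_constrained_features(constraint_groups, train_columns)
print(missing)  # ['State_ID__99']
```

In this notebook the same check could be run against the constraint groups and `trndat[p_vars].columns`. An empty result would suggest the mismatch lies elsewhere, for example in how the constraints are serialized.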