# Goal: package new interpretability tools to test locally in MLBox

As we know, in MLBox a pipeline is composed of:
- preprocessing (MLBox)
- feature engineering (MLBox)
- optimisation, i.e. feature selection and model selection (MLBox)
- interpretability tools (the part you are going to extend)

After the optimisation steps, MLBox outputs a 'validator' object, in which the best pipelines are encoded. For example,


In particular, validator.dict_fs_models is a dictionary, where validator.dict_fs_models[model_with_metric_name] gives you the trained feature selector and predictor of the chosen model. For example, 


## This notebook
Since validator is an MLBox module, you cannot access it directly. I have prepared:
- 2 trained models, i.e. validator.dict_fs_models[model_with_metric_name]['est'].\_estimator:
[RF_trained_christine.pkl](./RF_trained_christine.pkl) and [LightGBM_trained_christine.pkl](./LightGBM_trained_christine.pkl)
- the processed test dataframes, one per model:
[christine_test_after_fs_RF.csv](./christine_test_after_fs_RF.csv) and [christine_test_after_fs_gbm.csv](./christine_test_after_fs_gbm.csv)

These files should be enough for you to apply interpretability tools.
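
The same loading pattern applies to both model/dataframe pairs listed above. Below is a minimal sketch of a helper that loads one pair; the function name `load_model_and_data` is hypothetical, and it assumes the LightGBM CSV stores the row index in its first column, as the RF CSV does. Unpickling the LightGBM model will probably also require the lightgbm package to be installed.

```python
import pickle

import pandas as pd


def load_model_and_data(model_path, csv_path):
    """Load a pickled estimator and its matching processed test dataframe."""
    with open(model_path, "rb") as input_file:
        model = pickle.load(input_file)
    # Assumption: the first CSV column holds the row index
    df_test = pd.read_csv(csv_path, index_col=[0])
    return model, df_test


# Usage with the files listed above (uncomment once the files are in place):
# gbm_trained, df_test_gbm = load_model_and_data(
#     "./LightGBM_trained_christine.pkl", "./christine_test_after_fs_gbm.csv"
# )
```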


## Getting started

In [1]:
import pandas as pd
import pickle
import sklearn
import sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

In [2]:
sklearn.__version__

'0.22.2.post1'
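
scikit-learn does not guarantee that a pickled model loads correctly under a different library version, so it may be worth checking the version before unpickling. The sketch below assumes the models were pickled with the version printed above; the helper name `same_version` is hypothetical.

```python
import sklearn

# Assumption: the pickles were created with the version shown in this notebook
EXPECTED_SKLEARN = "0.22.2.post1"


def same_version(installed, expected=EXPECTED_SKLEARN):
    """Return True when the installed version string matches exactly."""
    return installed == expected


if not same_version(sklearn.__version__):
    print(
        "Warning: scikit-learn", sklearn.__version__,
        "differs from", EXPECTED_SKLEARN,
        "- the pickled models may not load or behave identically.",
    )
```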

In [3]:
### load model optimised by MLBox ###
with open('./RF_trained_christine.pkl', 'rb') as input_file:
    RF_trained = pickle.load(input_file)
print(RF_trained)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=17, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=-1, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)


In [10]:
### read the test dataframe ###
df_test_RF = pd.read_csv('./christine_test_after_fs_RF.csv', index_col=[0])
print(df_test_RF.columns)

Index(['v_11_nor', 'pca0', 'v_16_nor', 'v_1_nor', 'v_9_nor', 'v_3_nor',
       'v_5_nor', 'v_2_nor', 'pca1', 'v_8_nor', 'pca2', 'v_13_nor'],
      dtype='object')


In [11]:
### you can generate predictions ###
RF_trained.predict(df_test_RF)

array([0, 0, 0, ..., 1, 0, 0], dtype=int32)
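
Before running an interpretability tool, it can help to confirm that the dataframe has the number of columns the model was trained on, and to look at class probabilities rather than hard labels. The sketch below uses a toy model as a stand-in for RF_trained and df_test_RF; the helper names `expected_n_features` and `check_inputs` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def expected_n_features(model):
    """Number of features the fitted model was trained on, or None if unknown."""
    # n_features_in_ exists in newer scikit-learn, n_features_ in 0.22 forests
    for attr in ("n_features_in_", "n_features_"):
        if hasattr(model, attr):
            return getattr(model, attr)
    return None


def check_inputs(model, df):
    """Raise ValueError if the dataframe width does not match the model."""
    n_expected = expected_n_features(model)
    if n_expected is not None and n_expected != df.shape[1]:
        raise ValueError(
            f"Model expects {n_expected} features, dataframe has {df.shape[1]}"
        )
    return True


# Toy stand-in for RF_trained and df_test_RF
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.rand(20, 3), columns=["a", "b", "c"])
y_toy = np.array([0, 1] * 10)
toy_model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

check_inputs(toy_model, X_toy)
proba = toy_model.predict_proba(X_toy)  # one column of probabilities per class
```

With the real objects, the equivalent calls would be `check_inputs(RF_trained, df_test_RF)` and `RF_trained.predict_proba(df_test_RF)`.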

## Now, you can write your interpretability toolbox
It should be importable from this notebook, for example:

In [None]:
from interpret_toolbox import ShapExplainer, LimeExplainer
shap = ShapExplainer()
# use shap's TreeExplainer to explain our RF_trained
shap.explain(algorithm='tree', model_to_explain=RF_trained, X_test=df_test_RF) 
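
To make the expected interface concrete, here is one possible skeleton for the `ShapExplainer` class. Only the call signature comes from the cell above; the body is an assumption on my side, not a required design. It relies on `shap.TreeExplainer` and its `shap_values` method from the shap library, and the structure of the returned values for classifiers (a list of per-class arrays or a single array) depends on the shap version installed.

```python
class ShapExplainer:
    """Hypothetical wrapper exposing the explain() call used in this notebook."""

    SUPPORTED_ALGORITHMS = ("tree",)

    def explain(self, algorithm, model_to_explain, X_test):
        """Return SHAP values for X_test using the requested algorithm."""
        if algorithm not in self.SUPPORTED_ALGORITHMS:
            raise ValueError(
                f"Unsupported algorithm {algorithm!r}; "
                f"expected one of {self.SUPPORTED_ALGORITHMS}"
            )
        # Imported lazily so the toolbox can be imported without shap installed
        import shap

        explainer = shap.TreeExplainer(model_to_explain)
        return explainer.shap_values(X_test)
```

Validating the algorithm name before importing shap keeps configuration errors separate from installation errors, which should make the toolbox easier to debug.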

Please also include a short technical report that guides me through installing these libraries correctly.
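
The report could include a small check like the one below, which lists the installed version of each required package. This is only a sketch: the package list is my guess at what the toolbox will need, the helper name `report_versions` is hypothetical, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata


def report_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list; adjust it to what the toolbox actually imports
required = ["scikit-learn", "pandas", "lightgbm", "shap", "lime"]
for name, version in report_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```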

## Let me know if you need anything. Good luck!
