# I. Introduction
This notebook guides you through the first month of the internship, when you can only work from home on your personal computer, i.e. without access to internal code or data. The main purpose of this period is to build a state-of-the-art review of interpretability and AutoML. While reading the articles, you might want to (1) take notes, which will be very helpful when you write your final report at the end of the internship; (2) test / implement / improve the algorithms from the literature on some public datasets. This notebook shows you how to: 
    - load public datasets from financial domains;
    - explore the datasets and do the necessary data engineering;
    - apply an ML pipeline to them (where AutoML / HP selection comes in);
    - interpret the results (where interpretability comes in).

# II. Data
MLBox is an AutoML toolkit that 'focuses' on banking / financial data. Compared to some classical datasets, such data have specific characteristics: non-numeric features (e.g. names, addresses, ...) and time-series features (e.g. logs, tracking, ...). In the shared Dropbox folder, we have collected some similar public datasets, each of which comes with a .info file that summarizes the essential properties of the dataset. In this section, we use one of them to demonstrate the entire workflow. 

Your task: you are encouraged to enrich this public dataset suite. Improving the quality of the datasets on which we base our research is the No. 1 task in any data science research.

In [1]:
Data_dir = './Data/banking-dataset-marketing-targets/' # set dataset directory

In [8]:
# read the train and test sets
import os
import pandas as pd

D_tr = pd.read_csv(os.path.join(Data_dir, 'train.csv'))
D_te = pd.read_csv(os.path.join(Data_dir, 'test.csv'))

In [14]:
D_tr.head()

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,subscribed
0,26110,56,admin.,married,unknown,no,1933,no,no,telephone,19,nov,44,2,-1,0,unknown,no
1,40576,31,unknown,married,secondary,no,3,no,no,cellular,20,jul,91,2,-1,0,unknown,no
2,15320,27,services,married,secondary,no,891,yes,no,cellular,18,jul,240,1,-1,0,unknown,no
3,43962,57,management,divorced,tertiary,no,3287,no,no,cellular,22,jun,867,1,84,3,success,yes
4,29842,31,technician,married,secondary,no,119,yes,no,cellular,4,feb,380,1,-1,0,unknown,no


In [15]:
D_te.head()

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,38441,32,services,married,secondary,no,118,yes,no,cellular,15,may,20,6,-1,0,unknown
1,40403,78,retired,divorced,primary,no,2787,no,no,telephone,1,jul,372,1,-1,0,unknown
2,3709,31,self-employed,single,tertiary,no,144,yes,no,unknown,16,may,676,1,-1,0,unknown
3,37422,57,services,single,primary,no,3777,yes,no,telephone,13,may,65,2,-1,0,unknown
4,12527,45,blue-collar,divorced,secondary,no,-705,no,yes,unknown,3,jul,111,1,-1,0,unknown


In [10]:
D_tr.shape, D_te.shape

((31647, 18), (13564, 17))

In [13]:
y_tr = D_tr['subscribed'] # from the .info file, we know the target column is 'subscribed'
X_tr = D_tr[[col for col in D_tr.columns if col not in ['ID', 'subscribed']]] # keep only the feature columns
X_te = D_te[[col for col in D_te.columns if col not in ['ID']]] # the test set has no target column
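
Before designing the preprocessing step, it helps to know which features are non-numeric, how balanced the target is, and how often placeholder values occur. The following is a minimal sketch of such an exploration. To stay runnable without the CSV files, it uses five rows copied from the `D_tr.head()` output above. The helper name `summarize` is hypothetical, and treating the string `'unknown'` as a missing-value marker is an assumption about this dataset, not something stated in the .info file.

```python
import pandas as pd

# Five rows copied from the D_tr.head() output above, so the sketch runs
# without the CSV files; in the notebook, pass D_tr directly instead.
sample = pd.DataFrame({
    "age": [56, 31, 27, 57, 31],
    "job": ["admin.", "unknown", "services", "management", "technician"],
    "education": ["unknown", "secondary", "secondary", "tertiary", "secondary"],
    "balance": [1933, 3, 891, 3287, 119],
    "pdays": [-1, -1, -1, 84, -1],
    "poutcome": ["unknown", "unknown", "unknown", "success", "unknown"],
    "subscribed": ["no", "no", "no", "yes", "no"],
})


def summarize(df, target):
    """Collect basic facts needed before choosing preprocessing steps."""
    features = df.drop(columns=[target])
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    numeric = features.select_dtypes(include="number").columns.tolist()
    # Share of each class in the target, to detect class imbalance.
    class_share = df[target].value_counts(normalize=True).to_dict()
    # Assumption: the string 'unknown' marks a missing value.
    unknown_share = (features[categorical] == "unknown").mean().to_dict()
    return {
        "categorical": categorical,
        "numeric": numeric,
        "class_share": class_share,
        "unknown_share": unknown_share,
    }


summary = summarize(sample, target="subscribed")
print(summary)
```

On the full training set, the same call would be `summarize(D_tr.drop(columns=['ID']), target='subscribed')`.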

# III. ML pipelines
In this section, you are asked to build some ML pipelines for the above dataset. A pipeline mainly contains 4 parts:
    - preprocessing (including data preprocessing and feature preprocessing)
    - predictor optimization (HP selection)
    - final predictor and predictions
    - interpretability

Your task: 
    - A written summary of the SoA of Interpretability, which will be used in your final report
    - Improve MLBox's interpretability tool 


## III.1 Human-knowledge-based preprocessing + grid / random search for predictor optimization

In this section, please build a simple pipeline using sklearn, in which you:

&nbsp; &nbsp; 1. apply some preprocessing techniques using your own knowledge (for example, as in MLBox, you can encode non-numeric features and missing values, balance the classes if they are imbalanced, and select features based on Chi2 or on feature importance from a RandomForest predictor)

&nbsp; &nbsp; 2. do some research on the HP space

&nbsp; &nbsp; 3. output the final predictive model and a cross-validation score. Notice how much of your expert knowledge is required during this exercise.

&nbsp; &nbsp; 4. interpret your model / predictions; for example, if your final model is a sklearn decision tree, you can explain the feature importance by returning feature_importances_.

References: [Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." Journal of machine learning research 13.Feb (2012): 281-305.](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)


In [5]:
from sklearn.preprocessing import OneHotEncoder
########### your turn #################

## III.2 Interpretability on these traditional models

In the previous section, you worked through a very simple ML pipeline. You might have already noticed that not every predictor has a built-in feature_importances_ attribute that allows you to interpret your result. In this section, you will need to do some research (see section 'SoA en interprétabilité' in 'Plan_detaillé_StageAutoML_Interprétabilité') to propose:
    - a valid tool to interpret the predictions delivered by each of the following traditional models, which are integrated into MLBox: Random Forest, XGBoost, LightGBM, ExtraTrees, Tree, Bagging, AdaBoost, Linear. For the moment, the interpretability of these models is supported either by the built-in feature_importances_ (e.g. RF) or by [shap](https://shap.readthedocs.io/en/latest/). Your task here is to propose any possible improvement. 
OR,
    - a model-agnostic interpretation tool
    

References: [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/) and other resources mentioned in 'Plan_detaillé_StageAutoML_Interprétabilité'.
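
As one example of a model-agnostic tool, permutation importance measures how much a validation score drops when a single feature column is shuffled. The sketch below uses `sklearn.inspection.permutation_importance` on a bagging classifier, which exposes no feature_importances_ attribute. The synthetic data and variable names are assumptions made for illustration; this is a baseline to compare against, not a proposed improvement.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only columns 0 and 1 carry signal, columns 2 and 3 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# BaggingClassifier has no feature_importances_, so a model-agnostic
# method is needed to rank its features.
model = BaggingClassifier(n_estimators=25, random_state=0).fit(X_train, y_train)

# Shuffle each column n_repeats times on held-out data and record the
# resulting drop in the default score (accuracy for classifiers).
result = permutation_importance(
    model, X_valid, y_valid, n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Computing the importances on held-out data rather than on the training set avoids rewarding features that the model has merely memorized.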

In [6]:
########### your experiments here #################

# IV. Automating the design of ML pipelines: towards AutoML


In the previous section, you experimented with a handcrafted ML pipeline, and you might have noticed how much human (= your) effort it took to obtain a good predictor. How to take the human out of the loop is what AutoML studies. In this section, you are encouraged to build your understanding of this subject, try some open-source solutions, such as [auto-sklearn](https://automl.github.io/auto-sklearn/master/) and [Hyperopt](https://github.com/hyperopt/hyperopt), and use the interpretability tool you built in the previous section to explain the results.

Your task: 
    - A written summary of the SoA of AutoML, which will be used in your final report
    - Benchmarking AutoML tools: apply these tools to the public datasets prepared in Section II, and record the performance and computation time
 
References: [AutoMLBook chapters 1, 4, 5, 6](https://www.automl.org/book/) and other resources mentioned in 'Plan_detaillé_StageAutoML_Interprétabilité'.
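
A benchmark of this kind needs a harness that records both the score and the wall-clock time of each tool under the same cross-validation protocol. The sketch below is one possible shape for it. The function name `benchmark` is hypothetical, the two scikit-learn models are placeholders standing in for AutoML tools, and the harness assumes each candidate exposes a scikit-learn-compatible estimator interface.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def benchmark(candidates, X, y, cv=3, scoring="accuracy"):
    """Cross-validate each candidate and record its score and run time."""
    rows = []
    for name, estimator in candidates.items():
        start = time.perf_counter()
        scores = cross_val_score(estimator, X, y, cv=cv, scoring=scoring)
        elapsed = time.perf_counter() - start
        rows.append({
            "tool": name,
            "mean_score": scores.mean(),
            "std_score": scores.std(),
            "seconds": elapsed,
        })
    report = pd.DataFrame(rows).sort_values("mean_score", ascending=False)
    return report.reset_index(drop=True)


# Synthetic data and placeholder candidates (assumptions for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
report = benchmark(candidates, X_demo, y_demo)
print(report)
```

Reporting the standard deviation across folds alongside the mean makes it easier to tell whether a gap between two tools is larger than the fold-to-fold noise.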
    

In [7]:
########### your experiments here #################

# V. Going deep (or not)?

Until now, you have done your research on traditional ML models; no deep learning algorithms have been explored directly. In this section, you are asked to do some research in this domain and help us decide, with justifiable arguments, whether we should include deep learning in MLBox. 

Your task: 
    - A written summary of your research, which will be used in your final report
    - Identify some deep models, try them on our datasets, and see whether they improve on the current performance / computation time.
 
References: 

In [None]:
########### your experiments here #################

# VI. Interpreting deep models

You have certainly got some ideas about how to interpret a deep NN or the predictions it outputs. Now it's time to implement them.

Your task:
    - Build interpretability tools for the deep models you identified in the previous section
    
References: section 'SoA en interprétabilité' in 'Plan_detaillé_StageAutoML_Interprétabilité'
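
One common starting point for neural networks is gradient-based saliency: the gradient of the predicted probability with respect to the input shows which features the prediction is locally most sensitive to. The sketch below computes this gradient by hand for a one-hidden-layer scikit-learn `MLPClassifier`, which is used here only as a small stand-in for a deep model. The helper name `input_gradient` and the synthetic data are assumptions. Deep learning frameworks obtain the same quantity through automatic differentiation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data: the label depends on columns 0 and 1; column 2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

net = MLPClassifier(
    hidden_layer_sizes=(16,), activation="relu", max_iter=2000, random_state=0
).fit(X, y)


def input_gradient(net, x):
    """Gradient of P(class 1) w.r.t. x for a one-hidden-layer ReLU MLP."""
    W1, W2 = net.coefs_          # shapes: (n_features, n_hidden), (n_hidden, 1)
    b1, b2 = net.intercepts_
    pre_activation = x @ W1 + b1
    hidden = np.maximum(pre_activation, 0.0)
    logit = (hidden @ W2 + b2)[0]
    p = 1.0 / (1.0 + np.exp(-logit))
    # Chain rule: sigmoid derivative times the output weights,
    # masked by the hidden units whose ReLU is active.
    relu_mask = (pre_activation > 0).astype(float)
    return p * (1.0 - p) * (W1 @ (W2[:, 0] * relu_mask))


x0 = X[0]
saliency = np.abs(input_gradient(net, x0))
print(saliency)
```

A saliency vector explains one prediction at one input point; averaging it over many samples gives a rough global view, but the result should be compared against a model-agnostic method before drawing conclusions.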

In [None]:
########### your experiments here #################

# VII. AutoDL

During your experiments in Sections V and VI, you might have encountered some challenges in HP selection (architecture design). You might want to explore AutoML for deep learning.

Your task:
    - A written summary of your literature research, which can be used in your final report
    - Implement / Try some AutoDL algorithms on our testing suite
    
References: [AutoMLBook chapters 3, 7](https://www.automl.org/book/)
    

In [None]:
########### your experiments here #################