# I. Introduction
This notebook guides you through the first month of the internship, when you can only work from home on your personal computer, i.e. without access to internal code or data. The main purpose of this period is to build a state of the art on interpretability and AutoML. While reading the articles, you might want to (1) take notes, which will be very helpful when you write your final report at the end of the internship; (2) test, implement, or improve the algorithms from the literature on public datasets. This notebook shows you how to:
    - load public datasets from the financial domain;
    - explore the datasets and do the necessary data engineering;
    - apply an ML pipeline to them (where AutoML / HP selection comes in);
    - interpret the results (where interpretability comes in).

# II. Data
MLBox is an AutoML toolkit that 'focuses' on banking / financial data. Compared with classical datasets, such data have some specific characteristics: non-numeric features (e.g. names, addresses, ...) and time-series features (e.g. logs, tracking, ...). In the shared Dropbox folder, we have collected some similar public datasets, each accompanied by a .info file that summarizes its essential properties. In this section, we use one of them to demonstrate the entire workflow.

In [1]:
Data_dir = './Data/banking-dataset-marketing-targets/' # set dataset directory

In [2]:
# load the training set into a DataFrame
import os
import pandas as pd

D_tr = pd.read_csv(os.path.join(Data_dir, 'train.csv'))

In [3]:
D_tr.head()

Unnamed: 0,ID,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,subscribed
0,26110,56,admin.,married,unknown,no,1933,no,no,telephone,19,nov,44,2,-1,0,unknown,no
1,40576,31,unknown,married,secondary,no,3,no,no,cellular,20,jul,91,2,-1,0,unknown,no
2,15320,27,services,married,secondary,no,891,yes,no,cellular,18,jul,240,1,-1,0,unknown,no
3,43962,57,management,divorced,tertiary,no,3287,no,no,cellular,22,jun,867,1,84,3,success,yes
4,29842,31,technician,married,secondary,no,119,yes,no,cellular,4,feb,380,1,-1,0,unknown,no
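
The `Unnamed: 0` column in the preview above is what pandas typically produces when a CSV was written together with its row index, leaving the first header field empty. The sketch below illustrates this behaviour, assuming that this is indeed how `train.csv` was saved (not verified against the file). The inline `csv_text` is a hypothetical stand-in built from two rows and a few columns of the preview.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: the first header field is empty,
# as happens when a DataFrame is written with to_csv() including its index.
csv_text = (
    ",ID,age,job,subscribed\n"
    "0,26110,56,admin.,no\n"
    "1,40576,31,unknown,no\n"
)

# Default read: the unnamed first column becomes a regular data column.
df_default = pd.read_csv(io.StringIO(csv_text))
print(list(df_default.columns))

# With index_col=0, that column is used as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_indexed.columns))
```

With `index_col=0` the saved index is no longer counted as a column, so the reported column count drops by one.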


In [4]:
D_tr.shape

(31647, 18)
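
Beyond the shape, a quick exploration usually checks which columns are non-numeric, how balanced the target is, and how missing values are represented. The following is a minimal sketch of such checks, not a prescribed procedure. The `sample` frame is copied from a subset of the columns in the `D_tr.head()` preview so that the block runs on its own; in the notebook the same calls would be applied to `D_tr`.

```python
import pandas as pd

# Small sample copied from the D_tr.head() preview (subset of columns).
sample = pd.DataFrame({
    "age": [56, 31, 27, 57, 31],
    "job": ["admin.", "unknown", "services", "management", "technician"],
    "balance": [1933, 3, 891, 3287, 119],
    "month": ["nov", "jul", "jul", "jun", "feb"],
    "pdays": [-1, -1, -1, 84, -1],
    "subscribed": ["no", "no", "no", "yes", "no"],
})

# Non-numeric columns need encoding before most ML models can use them.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Share of each target class; a strong imbalance affects the choice of metric.
class_share = sample["subscribed"].value_counts(normalize=True)

# Missing values per column as seen by pandas (NaN only).
missing = sample.isna().sum()

# The preview uses the string 'unknown' as a placeholder, which isna() does not catch.
unknown_counts = (sample[categorical_cols] == "unknown").sum()

print(categorical_cols, numeric_cols)
print(class_share)
print(unknown_counts)
```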

In [5]:
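y_tr = D_tr['subscribed'] # from the .info file, we know the target column is 'subscribed'
X_tr = D_tr[[col for col in D_tr.columns if col not in ['ID', 'subscribed']]] # extract the feature columns
# note: 'Unnamed: 0' (the row index stored in the CSV) is still among the features at this point
X_tr.head(10) # preview the feature columns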

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,56,admin.,married,unknown,no,1933,no,no,telephone,19,nov,44,2,-1,0,unknown
1,31,unknown,married,secondary,no,3,no,no,cellular,20,jul,91,2,-1,0,unknown
2,27,services,married,secondary,no,891,yes,no,cellular,18,jul,240,1,-1,0,unknown
3,57,management,divorced,tertiary,no,3287,no,no,cellular,22,jun,867,1,84,3,success
4,31,technician,married,secondary,no,119,yes,no,cellular,4,feb,380,1,-1,0,unknown
5,33,management,single,tertiary,no,0,yes,no,cellular,2,feb,116,3,-1,0,unknown
6,56,retired,married,secondary,no,1044,no,no,telephone,3,jul,353,2,-1,0,unknown
7,50,technician,single,secondary,no,1811,no,no,cellular,8,jun,97,4,-1,0,unknown
8,45,blue-collar,divorced,secondary,no,1951,yes,no,cellular,4,feb,692,1,-1,0,unknown
9,35,admin.,married,secondary,no,1204,no,no,cellular,3,sep,789,2,-1,0,unknown
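
Most estimators expect numeric inputs, so the non-numeric columns and the yes/no target shown above have to be encoded at some point. The block below is a minimal sketch of one common option (one-hot encoding with `pandas.get_dummies`), not necessarily what MLBox does internally. `X_sample` and `y_sample` are copied from the first five rows of the preview, restricted to a few columns, so that the block runs on its own.

```python
import pandas as pd

# Sample copied from the preview above (subset of columns, first five rows).
X_sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "age": [56, 31, 27, 57, 31],
    "marital": ["married", "married", "married", "divorced", "married"],
    "housing": ["no", "no", "yes", "no", "yes"],
    "duration": [44, 91, 240, 867, 380],
})
y_sample = pd.Series(["no", "no", "no", "yes", "no"], name="subscribed")

# Drop the leftover row-index column so a model cannot use it as a feature.
X_clean = X_sample.drop(columns=["Unnamed: 0"])

# One-hot encode the non-numeric columns; numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X_clean, dtype=int)

# Encode the target as 1 (subscribed) / 0 (not subscribed).
y_encoded = y_sample.map({"yes": 1, "no": 0})

print(X_encoded.columns.tolist())
print(y_encoded.tolist())
```

One-hot encoding keeps every category visible as its own column, which tends to make later interpretability analyses easier to read, at the cost of a wider feature matrix.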
