##### Data set: banking dataset for cross-selling a term deposit, used for this analysis

#### Variables:

age (numeric)

job : type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")

marital : marital status (categorical: "married", "divorced", "single", "unknown"; note: "divorced" means divorced or widowed)

education (categorical: "basic.4y", "basic.6y", "basic.9y", "high.school", "illiterate", "professional.course", "university.degree", "unknown")

default: has credit in default? (categorical: "no", "yes", "unknown")

balance: average yearly balance, in euros (numeric; this column is not present in the data1.csv file loaded below)

housing: has housing loan? (categorical: "no", "yes", "unknown")

loan: has personal loan? (categorical: "no", "yes", "unknown")

contact: contact communication type (categorical: "telephone", "cellular")

day_of_week: last contact day of the week (categorical: "mon", "tue", "wed", "thu", "fri")

month: last contact month of year (categorical: "mar", "apr", ..., "nov", "dec")

duration: last contact duration, in seconds (numeric)

campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

previous: number of contacts performed before this campaign and for this client (numeric)

poutcome: outcome of the previous marketing campaign (categorical: "nonexistent", "failure", "success")

Output variable (desired target):

y - has the client subscribed to a term deposit? (binary: "0","1")

In [46]:
#packages used
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

#Anchor related
from alibi.explainers import AnchorTabular

In [47]:
data = pd.read_csv('data1.csv')
data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,previous,poutcome,y
0,44,blue-collar,married,basic.4y,unknown,yes,no,cellular,aug,thu,210,1,0,nonexistent,0
1,53,technician,married,unknown,no,no,no,cellular,nov,fri,138,1,0,nonexistent,0
2,28,management,single,university.degree,no,yes,no,cellular,jun,thu,339,3,2,success,1
3,39,services,married,high.school,no,no,no,cellular,apr,fri,185,2,0,nonexistent,0
4,55,retired,married,basic.4y,no,yes,no,cellular,aug,fri,137,1,1,success,1


In [48]:
X = data.drop('y', axis=1)
y = data['y']

To use the alibi explainers here, the categorical features need to be converted to integer labels rather than left as strings or expanded into dummy variables.

In [49]:
le_job= LabelEncoder()
le_marital= LabelEncoder()
le_education= LabelEncoder()
le_default= LabelEncoder()
le_housing= LabelEncoder()
le_contact= LabelEncoder()
le_month= LabelEncoder()
le_day_of_week= LabelEncoder()
le_poutcome= LabelEncoder()
le_loan= LabelEncoder()

X['job'] = le_job.fit_transform(X['job'])
X['marital'] = le_marital.fit_transform(X['marital'])
X['education'] = le_education.fit_transform(X['education'])
X['default'] = le_default.fit_transform(X['default'])
X['housing'] = le_housing.fit_transform(X['housing'])
X['loan'] = le_loan.fit_transform(X['loan'])
X['contact'] = le_contact.fit_transform(X['contact'])
X['month'] = le_month.fit_transform(X['month'])
X['day_of_week'] = le_day_of_week.fit_transform(X['day_of_week'])
X['poutcome'] = le_poutcome.fit_transform(X['poutcome'])

X.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,previous,poutcome
0,44,1,1,0,1,2,0,0,1,2,210,1,0,1
1,53,9,1,7,0,0,0,0,7,0,138,1,0,1
2,28,4,2,6,0,2,0,0,4,2,339,3,2,2
3,39,7,1,3,0,0,0,0,0,0,185,2,0,1
4,55,5,1,0,0,2,0,0,1,0,137,1,1,2


In [70]:
#Split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

#### Fitting Random Forest Classifier Model 

Tree-based models are largely unaffected by label encoding, so one-hot encoding is not necessary. Otherwise, we would have to convert from one-hot to label encoding when fitting the explainer, and back again when comparing with predictions.

In [71]:
model = RandomForestClassifier()
model.fit(X_train,y_train)

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [72]:
prediction = model.predict(X_test)

Check Model Performance

In [73]:
print('Accuracy :',accuracy_score(y_test,prediction)*100)
print('Precision :',precision_score(y_test,prediction)*100)
print('Recall :',recall_score(y_test,prediction)*100)

Accuracy : 95.62116108579353
Precision : 87.61689777491132
Recall : 97.90990990990991


In [74]:
def con_matrix(CM):
    df=pd.DataFrame(data=CM,index=['0','1'], columns=['0','1'])
    df.index.name='Actual'
    df.columns.name='Prediction'
    df.loc['Total']=df.sum()
    df['Total']=df.sum(axis=1)
    return df

In [75]:
CM1 = confusion_matrix(y_test,prediction)
con_matrix(CM1)

Actual \ Prediction,0,1,Total
0,6935,384,7319
1,58,2717,2775
Total,6993,3101,10094
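
The metrics printed earlier can be cross-checked by hand from the confusion matrix. A small sketch, with the four counts copied from the table (the names `tn`, `fp`, `fn`, `tp` are only for illustration):

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted)
tn, fp = 6935, 384   # actual class 0
fn, tp = 58, 2717    # actual class 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted 1s that are truly 1
recall = tp / (tp + fn)      # share of actual 1s that the model found

print('Accuracy : %.2f' % (accuracy * 100))
print('Precision : %.2f' % (precision * 100))
print('Recall : %.2f' % (recall * 100))
```

The three values agree with the scikit-learn output to two decimals (95.62, 87.62, 97.91).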


### Constructing Anchors

In [76]:
cat_columns = data.select_dtypes(include=['object']).columns
cat_columns

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'day_of_week', 'poutcome'],
      dtype='object')

In [77]:
# List positions must match the LabelEncoder codes, which follow sorted order (le.classes_)
category_map = {1:['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
                        'retired', 'self-employed', 'services', 'student', 'technician','unemployed', 'unknown'],
                     2:['divorced', 'married', 'single', 'unknown'],
                     3:['basic.4y', 'basic.6y', 'basic.9y', 'high.school',
                        'illiterate', 'professional.course', 'university.degree', 'unknown'],
                     4:['no', 'unknown', 'yes'],
                     5:['no', 'unknown', 'yes'],
                     6:['no', 'unknown', 'yes'],
                     7:['cellular', 'telephone'],
                     8:['apr', 'aug', 'dec', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep'],
                     9:['fri', 'mon', 'thu', 'tue', 'wed'],
                     13:['failure', 'nonexistent', 'success']
                    }

In [78]:
category_map

{1: ['admin.',
  'blue-collar',
  'entrepreneur',
  'housemaid',
  'management',
  'retired',
  'self-employed',
  'services',
  'student',
  'technician',
  'unemployed',
  'unknown'],
 2: ['divorced', 'married', 'single', 'unknown'],
 3: ['basic.4y',
  'basic.6y',
  'basic.9y',
  'high.school',
  'illiterate',
  'professional.course',
  'university.degree',
  'unknown'],
 4: ['no', 'unknown', 'yes'],
 5: ['no', 'unknown', 'yes'],
 6: ['no', 'unknown', 'yes'],
 7: ['cellular', 'telephone'],
 8: ['apr', 'aug', 'dec', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep'],
 9: ['fri', 'mon', 'thu', 'tue', 'wed'],
 13: ['failure', 'nonexistent', 'success']}
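
Because `LabelEncoder` assigns codes in sorted order of the class names, each list in `category_map` has to follow that same order for the anchor text to show the right category. One way to avoid typing the lists by hand is to build the map from the fitted encoders. The sketch below uses a small made-up frame in place of the real data; the `encoders` dictionary and the toy values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data (hypothetical values)
df = pd.DataFrame({
    'age': [44, 53, 28, 39],
    'job': ['blue-collar', 'technician', 'management', 'services'],
    'poutcome': ['nonexistent', 'nonexistent', 'success', 'failure'],
})
cat_cols = ['job', 'poutcome']

encoders = {}       # column name -> fitted LabelEncoder
category_map = {}   # column position -> category names, ordered by code

for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le
    # classes_[k] is the original label that was encoded as k
    category_map[df.columns.get_loc(col)] = le.classes_.tolist()

print(category_map)
```

Keeping the fitted encoders around also makes it possible to call `inverse_transform` later to turn codes back into the original labels.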

In [79]:
feature_names = X_train.columns.tolist()
feature_names

['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'previous',
 'poutcome']

##### Initialize the explainer

In [80]:
predict_fn = lambda x: model.predict(x)

In [81]:
explainer = AnchorTabular(predict_fn, feature_names, categorical_names=category_map)

Tabular data requires a fit step to map the ordinal features into quantiles and therefore needs access to a representative set of the training data. disc_perc is a list with percentiles used for binning:

In [82]:
explainer.fit(X_train.to_numpy(), disc_perc=[25, 50, 75])

AnchorTabular(meta={
    'name': 'AnchorTabular',
    'type': ['blackbox'],
    'explanations': ['local'],
    'params': {'seed': None, 'disc_perc': [25, 50, 75]}
})
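
To give a feel for what `disc_perc=[25, 50, 75]` means, here is a rough NumPy sketch of quartile binning on a made-up numeric column. It only illustrates the idea; alibi's internal discretiser may treat bin edges differently, and the `values` array is an assumption, not data from this notebook.

```python
import numpy as np

values = np.arange(1, 101)   # made-up feature values 1..100
disc_perc = [25, 50, 75]

# Bin boundaries sit at the requested percentiles of the data
edges = np.percentile(values, disc_perc)

# right=True: a value equal to an edge falls into the lower bin
bins = np.digitize(values, edges, right=True)

print(edges)              # three edges -> four bins
print(np.bincount(bins))  # number of values per bin
```

Numeric conditions in an anchor, such as a threshold on `duration`, are then expressed in terms of bin edges of this kind rather than arbitrary cut points.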

#### Explain new observations

In [83]:
i=98
X_obs = X_test.iloc[[i],:]
X_obs

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,previous,poutcome
49121,31,0,2,6,0,0,0,0,1,3,265,1,3,2


In [84]:
print('True label:', y_test.iloc[i])
print('RF model prediction probability for 1:',(model.predict_proba(X_test)[i])[1])

True label: 1
RF model prediction probability for 1: 1.0


In [87]:
arr = X_test.iloc[i].values

In [88]:
%%time
explanation = explainer.explain(arr,threshold=0.95)

Wall time: 1.12 s


In [89]:
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('Coverage: %.2f' % explanation.coverage)

Anchor: poutcome = success AND duration > 209.00 AND previous > 0.00 AND month = aug
Precision: 0.96
Coverage: 0.06


According to the constructed anchor, the model predicts that a customer subscribes to a term deposit when the outcome of the previous marketing campaign was a success, the last contact lasted more than 209 seconds, the customer was contacted at least once before this campaign, and the last contact month was August, regardless of the values of the other variables.

The precision of the anchor is high (96%), although its coverage is low (6%), so the rule applies to only a small share of instances.
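
As a rough illustration of what precision and coverage measure, the sketch below evaluates a rule directly on a small made-up dataset. alibi estimates these quantities by sampling, so this is a simplification; the array contents and the helper name `anchor_stats` are hypothetical.

```python
import numpy as np

def anchor_stats(preds, mask, target):
    """Empirical precision and coverage of a rule.

    preds  : model predictions for every row
    mask   : boolean array, True where a row satisfies the rule
    target : prediction for the instance being explained
    """
    coverage = mask.mean()   # share of rows the rule applies to
    # among covered rows, share with the same prediction as the instance
    precision = (preds[mask] == target).mean() if mask.any() else float('nan')
    return precision, coverage

# Columns: poutcome code, duration in seconds (made-up rows)
X = np.array([[2, 300], [2, 250], [2, 100], [1, 400],
              [0, 220], [2, 500], [1, 50], [2, 210]])
preds = np.array([1, 1, 0, 0, 0, 1, 0, 0])

# Rule: poutcome = success (code 2) AND duration > 209
mask = (X[:, 0] == 2) & (X[:, 1] > 209)
precision, coverage = anchor_stats(preds, mask, target=1)

print('Precision: %.2f' % precision)
print('Coverage: %.2f' % coverage)
```

In this toy case the rule covers 4 of 8 rows and the prediction is 1 for 3 of those 4, giving a precision of 0.75 and a coverage of 0.50.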