# This notebook contains the experiments on the Banknote dataset with LionForests

In [1]:
from LionForests import LionForests
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd 
import numpy as np

First, we load the dataset and set the feature and class names. Note that `nrows=50` restricts the load to the first 50 rows of the file.

In [2]:
banknote_datadset = pd.read_csv('https://raw.githubusercontent.com/Kuntal-G/Machine-Learning/master/R-machine-learning/data/banknote-authentication.csv',nrows= 50)
feature_names = ['variance','skew','curtosis','entropy']
class_names=['fake banknote','real banknote'] #0: no, 1: yes #or ['not authenticated banknote','authenticated banknote']

We can display the first few instances to see the features and their values

In [3]:
banknote_datadset.head()

Unnamed: 0,variance,skew,curtosis,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


Moreover, we can use pandas' describe() to see the range of each feature. For example, in the 50 rows loaded here, curtosis ranges from -4.6795 to 8.4636 (the range of -5.286 to 17.927 refers to the full dataset).

In [4]:
banknote_datadset.describe()

Unnamed: 0,variance,skew,curtosis,entropy,class
count,50.0,50.0,50.0,50.0,50.0
mean,2.326602,4.851703,0.325998,-1.525692,0.0
std,2.048826,5.328737,3.771307,2.173618,0.0
min,-1.6162,-6.81,-4.6795,-7.5034,0.0
25%,0.836623,1.32174,-2.760375,-3.06715,0.0
50%,2.5555,6.7717,-0.43942,-0.714015,0.0
75%,3.9297,9.209825,2.793025,0.101601,0.0
max,6.5633,11.0272,8.4636,1.4771,0.0
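
As a minimal sketch of how such per-feature ranges are obtained, the column-wise minimum and maximum can also be computed directly with NumPy. The array below contains only the five rows printed by head() earlier, so its ranges are narrower than the ones reported by describe() for all 50 rows; the names `sample` and `ranges` are introduced here for illustration only.

```python
import numpy as np

# The five rows shown by head(), with columns in feature_names order.
feature_names = ['variance', 'skew', 'curtosis', 'entropy']
sample = np.array([
    [3.6216, 8.6661, -2.8073, -0.44699],
    [4.5459, 8.1674, -2.4586, -1.4621],
    [3.866, -2.6383, 1.9242, 0.10645],
    [3.4566, 9.5228, -4.0112, -3.5944],
    [0.32924, -4.4552, 4.5718, -0.9888],
])

# The column-wise minimum and maximum give the range of each feature.
ranges = {
    name: (float(lo), float(hi))
    for name, lo, hi in zip(feature_names, sample.min(axis=0), sample.max(axis=0))
}

for name, (lo, hi) in ranges.items():
    print(f"{name}: {lo} to {hi}")
```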


Then we extract the training data from the dataframe

In [5]:
X = banknote_datadset.iloc[:, 0:4].values 
y = banknote_datadset.iloc[:, 4].values 

In [6]:
len(X)

50
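
Before training, it can be worth checking how many instances each class has, since the describe() output above reports a standard deviation of 0 for the class column. The following is a small sketch with NumPy; `class_counts` and the two label vectors are hypothetical helpers for illustration and are not part of LionForests.

```python
import numpy as np

def class_counts(labels):
    """Return a {label: count} dict for a 1-D array of class labels."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

# Hypothetical label vectors, for illustration only.
single_class = np.zeros(50, dtype=int)   # every instance has label 0
two_classes = np.array([0, 0, 1, 1, 1])  # both labels are present

print(class_counts(single_class))
print(class_counts(two_classes))
```

In the notebook itself, the same check would be `class_counts(y)`; a result with a single key means the classifier has only one class to learn from.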

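We have 50 instances, because `nrows=50` limits the load (the full dataset has 1372), and the describe() output above shows that all of them belong to class 0. We are going to use the built-in GridSearch of LionForests to find and train the best classifier for this dataset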

In [7]:
parameters = [{
    'max_depth': [10],
    'max_features': [0.75],
    'bootstrap': [True],
    'min_samples_leaf' : [1],
    'n_estimators': [500]
}]
lf = LionForests(class_names=class_names)
scaler = MinMaxScaler(feature_range=(-1,1))
lf.train(X, y, scaler, feature_names, parameters)

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    5.1s finished
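
The scaler passed to `lf.train` above is a `MinMaxScaler` with `feature_range=(-1,1)`, which rescales each feature linearly so that its minimum maps to -1 and its maximum to 1. The sketch below reproduces that transformation with plain NumPy to make the arithmetic explicit; `min_max_scale` and `X_demo` are illustrative names, and this is not the code LionForests runs internally.

```python
import numpy as np

def min_max_scale(X, feature_range=(-1.0, 1.0)):
    """Linearly rescale each column of X to the given feature_range."""
    X = np.asarray(X, dtype=float)
    lo, hi = feature_range
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    span[span == 0] = 1.0  # constant columns: avoid division by zero
    return (X - col_min) / span * (hi - lo) + lo

# Hypothetical data: two features on very different scales.
X_demo = np.array([[0.0, 10.0],
                   [5.0, 20.0],
                   [10.0, 30.0]])
X_scaled = min_max_scale(X_demo)
print(X_scaled)
```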


Now, we can see the best model's parameters

In [8]:
print("Accuracy:",lf.accuracy,", Number of estimators:",lf.number_of_estimators)
print(lf.model)

Accuracy: 0.0 , Number of estimators: 500
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features=0.75,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)


Now let's predict and explain the third instance at the same time

In [9]:
rule = lf.following_breadcrumbs(X[2], False, True, False, complexity=4)
print(rule)

ValueError: cannot call `vectorize` on size 0 inputs unless `otypes` is set

And we can extract the rule without reduction:

In [None]:
lf.findFixedRanges(X[2], 'skew', rule)

In [None]:
X[2] #feature_names = ['variance','skew','curtosis','entropy']

Now let's try changing the entropy value:

In [None]:
T_X = X[2].copy()  # copy the row so that X itself is not modified
T_X[3] = 1  # entropy
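
When modifying a single instance like this, keep in mind that indexing a row of a NumPy array returns a view, so an assignment through the view also changes the original array unless a copy is taken first. A minimal sketch of the difference, using a small hypothetical array `X_demo`:

```python
import numpy as np

X_demo = np.array([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0]])

# Basic indexing returns a view: writing to it also writes to X_demo.
view = X_demo[0]
view[3] = -1.0

# An explicit copy is independent: X_demo stays untouched.
independent = X_demo[1].copy()
independent[3] = -1.0

print(X_demo)
print(np.shares_memory(view, X_demo), np.shares_memory(independent, X_demo))
```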

We can now see the prediction and the explanation for the modified instance

In [None]:
lf.following_breadcrumbs(T_X, False, True, False, complexity=4)