# Example Notebook for classifier finder

## 1. libraries

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sam_ml.models import CTest

## 2. data

In [2]:
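# load_iris() returns a Bunch (not a DataFrame) holding the feature matrix and the targets
df = load_iris()
y = pd.Series(df.target)
X = pd.DataFrame(df.data, columns=df.feature_names)
# keep 80% of the 150 samples for training; the remaining 30 rows form the test set
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)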
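
The split above is not stratified, so the test set does not contain exactly ten samples per class (the classification report in 3.2.2 lists supports of 10, 9 and 11). As a minimal sketch of an alternative, the snippet below passes `stratify=y` to scikit-learn's `train_test_split`; the notebook itself does not use this option, and `test_counts` is a name introduced here only for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris as plain arrays (the notebook wraps them in pandas objects instead).
X, y = load_iris(return_X_y=True)

# Same split size and seed as the notebook, but stratified by class label.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=42, stratify=y
)

# Count how many test samples fall into each class.
test_counts = Counter(y_test.tolist())
print(len(x_train), len(x_test), dict(test_counts))
```

With three equally sized classes, stratification keeps the class proportions identical in the training and test sets, which makes per-class metrics on a 30-row test set a little easier to compare.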

## 3. model

### 3.1. evaluation of the models

#### 3.1.1. small dataset cross-validation

In [3]:
tester = CTest()
tester.eval_models_cv(X, y, avg="macro", small_data_eval=True, upsampling="SMOTE")
tester.output_scores_as_pd(sort_by="recall", console_out=False)

QDA / LDA / LR / MLPC / LSVC does not work with upsampling='SMOTE' --> going on with upsampling='ros'


Crossvalidation: 100%|██████████| 18/18 [02:45<00:00,  9.17s/it]


| model | accuracy | precision | recall | s_score | l_score | avg train score | avg train time |
|---|---|---|---|---|---|---|---|
| LinearDiscriminantAnalysis | 0.98 | 0.980125 | 0.98 | 0.990437 | 1.0 | 0.980133 | 0:00:00 |
| MLP Classifier | 0.98 | 0.981132 | 0.98 | 0.990119 | 1.0 | 0.980089 | 0:00:00 |
| SupportVectorClassifier (rbf-kernel) | 0.973333 | 0.973333 | 0.973333 | 0.989583 | 1.0 | 0.972844 | 0:00:00 |
| QuadraticDiscriminantAnalysis | 0.973333 | 0.973825 | 0.973333 | 0.989409 | 1.0 | 0.980178 | 0:00:00 |
| LogisticRegression | 0.966667 | 0.966787 | 0.966667 | 0.988529 | 1.0 | 0.974 | 0:00:00 |
| KNeighborsClassifier | 0.966667 | 0.966787 | 0.966667 | 0.988529 | 1.0 | 0.967067 | 0:00:00 |
| BaggingClassifier (DTC based) | 0.96 | 0.96 | 0.96 | 0.987445 | 1.0 | 0.996044 | 0:00:00 |
| BaggingClassifier (RFC based) | 0.96 | 0.96 | 0.96 | 0.987445 | 1.0 | 0.985289 | 0:00:00 |
| LinearSupportVectorClassifier | 0.96 | 0.96 | 0.96 | 0.987445 | 1.0 | 0.967644 | 0:00:00 |
| GradientBoostingMachine | 0.953333 | 0.953448 | 0.953333 | 0.986112 | 1.0 | 1.0 | 0:00:00 |
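
The much longer runtime of this cell (2:45 versus 5 seconds in 3.1.2) would be consistent with a leave-one-out scheme, where every sample is held out once and the model is refitted each time. That reading of `small_data_eval=True` is an assumption and is not confirmed by the output above. The sketch below only illustrates leave-one-out itself, in plain Python on made-up one-dimensional data; `leave_one_out_splits` and `nearest_neighbor_predict` are hypothetical helpers, not part of `sam_ml`.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_indices) with exactly one held-out sample per split."""
    for held_out in range(n_samples):
        train_idx = [i for i in range(n_samples) if i != held_out]
        yield train_idx, [held_out]


def nearest_neighbor_predict(train_X, train_y, point):
    """Predict the label of the closest training point (1-nearest-neighbour)."""
    distances = [abs(x - point) for x in train_X]
    return train_y[distances.index(min(distances))]


# Toy data (invented for illustration): two well separated classes.
X_toy = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
y_toy = [0, 0, 0, 1, 1, 1]

predictions = []
for train_idx, test_idx in leave_one_out_splits(len(X_toy)):
    train_X = [X_toy[i] for i in train_idx]
    train_y = [y_toy[i] for i in train_idx]
    # One "fit" per held-out sample: this is why leave-one-out is slow on larger data.
    predictions.append(nearest_neighbor_predict(train_X, train_y, X_toy[test_idx[0]]))

loo_accuracy = sum(p == t for p, t in zip(predictions, y_toy)) / len(y_toy)
print(predictions, loo_accuracy)
```

With n samples, leave-one-out fits each model n times (150 times for Iris), so it is usually reserved for small datasets.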


#### 3.1.2. multiple split cross-validation

In [4]:
tester = CTest()
tester.eval_models_cv(X, y, avg="macro", small_data_eval=False)
tester.output_scores_as_pd(sort_by="recall", console_out=False)

Crossvalidation: 100%|██████████| 18/18 [00:05<00:00,  3.30it/s]


| model | accuracy | precision | recall | s_score | l_score | avg train score | avg train time |
|---|---|---|---|---|---|---|---|
| LinearDiscriminantAnalysis | 0.98 | 0.980755 | 0.979984 | 0.990086 | 1.0 | 0.98 | 0.001951 |
| MLP Classifier | 0.98 | 0.982131 | 0.979575 | 0.989352 | 0.999999 | 0.976667 | 0.057869 |
| KNeighborsClassifier | 0.98 | 0.981481 | 0.979575 | 0.99011 | 1.0 | 0.97 | 0.001342 |
| LogisticRegression | 0.973333 | 0.975958 | 0.972631 | 0.988509 | 0.999999 | 0.976667 | 0.010198 |
| LinearSupportVectorClassifier | 0.966667 | 0.968288 | 0.966912 | 0.986886 | 0.999999 | 0.966667 | 0.005258 |
| GaussianProcessClassifier | 0.966667 | 0.967275 | 0.966503 | 0.98834 | 1.0 | 0.97 | 0.01762 |
| RandomForestClassifier | 0.966667 | 0.967901 | 0.966095 | 0.987494 | 0.999999 | 1.0 | 0.100012 |
| GradientBoostingMachine | 0.966667 | 0.967901 | 0.966095 | 0.987494 | 0.999999 | 1.0 | 0.097931 |
| QuadraticDiscriminantAnalysis | 0.966667 | 0.968347 | 0.966095 | 0.985158 | 0.999979 | 0.98 | 0.001459 |
| SupportVectorClassifier (rbf-kernel) | 0.96 | 0.961057 | 0.959967 | 0.98675 | 0.999999 | 0.97 | 0.001862 |
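
In the table above, accuracy, precision and recall differ slightly for the same model. With `avg="macro"`, precision and recall are presumably computed per class and then averaged with equal weight per class, as in scikit-learn's `average="macro"`. The plain-Python sketch below shows that calculation on invented labels; `macro_precision_recall` is a hypothetical helper, not a `sam_ml` function.

```python
def macro_precision_recall(y_true, y_pred):
    """Return (macro precision, macro recall): the unweighted mean over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Invented, imbalanced labels: one sample of class 1 is predicted as class 2.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 2, 2]

macro_p, macro_r = macro_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(macro_p, 4), round(macro_r, 4), accuracy)
```

Here a single mistake gives an accuracy of 7/8, a macro precision of 8/9 and a macro recall of 5/6, because the error hits the small classes harder than the large one.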


#### 3.1.3. evaluation on a given train-test split

In [5]:
tester = CTest()
tester.eval_models(x_train, y_train, x_test, y_test, avg="macro")
tester.output_scores_as_pd(sort_by="recall", console_out=False)

Crossvalidation: 100%|██████████| 18/18 [00:01<00:00, 17.28it/s]


| model | accuracy | precision | recall | s_score | l_score | train_score | train_time |
|---|---|---|---|---|---|---|---|
| LogisticRegression | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.975 | 0:00:00 |
| GradientBoostingMachine | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 1.0 | 0:00:00 |
| BaggingClassifier (DTC based) | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 1.0 | 0:00:00 |
| GaussianProcessClassifier | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.966667 | 0:00:00 |
| GaussianNB | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.95 | 0:00:00 |
| ExtraTreesClassifier | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 1.0 | 0:00:00 |
| KNeighborsClassifier | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.966667 | 0:00:00 |
| AdaBoostClassifier (RFC based) | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 1.0 | 0:00:00 |
| AdaBoostClassifier (DTC based) | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.966667 | 0:00:00 |
| SupportVectorClassifier (rbf-kernel) | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.975 | 0:00:00 |
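
For a single model, the accuracy, precision, recall and train_score columns can be approximated directly with scikit-learn. The sketch below does this for a plain `LogisticRegression`; it is not how `eval_models` is implemented, the `max_iter=1000` setting is chosen here for illustration, and the `s_score` and `l_score` columns are specific to `sam_ml` and are not reproduced.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Same data and split parameters as in section 2, using plain arrays.
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)

# Fit one classifier on the training part only.
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Macro-averaged metrics on the held-out test set, plus accuracy on the training set.
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, average="macro"),
    "recall": recall_score(y_test, y_pred, average="macro"),
    "train_score": model.score(x_train, y_train),
}
print(scores)
```

Note that the test set has only 30 rows, so a single misclassification changes accuracy by 1/30 (about 0.033). This is why many models tie at 1.0 here and why the cross-validated scores in 3.1.1 and 3.1.2 separate the models better.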


### 3.2. find best model

#### 3.2.1. creating scores in find_best_model method

In [6]:
tester = CTest()
tester.find_best_model(x_train, y_train, x_test, y_test, scoring="recall", avg="macro", rand_search=True)

no scores are already created -> creating scores using 'eval_models()'


Crossvalidation: 100%|██████████| 18/18 [00:01<00:00, 17.17it/s]


                                      accuracy precision    recall   s_score  \
LogisticRegression                         1.0       1.0       1.0    0.9926   
GradientBoostingMachine                    1.0       1.0       1.0    0.9926   
BaggingClassifier (DTC based)              1.0       1.0       1.0    0.9926   
GaussianProcessClassifier                  1.0       1.0       1.0    0.9926   
GaussianNB                                 1.0       1.0       1.0    0.9926   
ExtraTreesClassifier                       1.0       1.0       1.0    0.9926   
KNeighborsClassifier                       1.0       1.0       1.0    0.9926   
AdaBoostClassifier (RFC based)             1.0       1.0       1.0    0.9926   
AdaBoostClassifier (DTC based)             1.0       1.0       1.0    0.9926   
SupportVectorClassifier (rbf-kernel)       1.0       1.0       1.0    0.9926   
RandomForestClassifier                     1.0       1.0       1.0    0.9926   
DecisionTreeClassifier                  

<sam_ml.models.LogisticRegression.LR at 0x7f81d290e040>

#### 3.2.2. creating scores using eval_models_cv

In [7]:
tester = CTest()
tester.eval_models_cv(X, y, avg="macro", small_data_eval=True, upsampling="SMOTE")
tester.find_best_model(x_train, y_train, x_test, y_test, scoring="recall", avg="macro", rand_search=True)

QDA / LDA / LR / MLPC / LSVC does not work with upsampling='SMOTE' --> going on with upsampling='ros'


Crossvalidation: 100%|██████████| 18/18 [02:40<00:00,  8.92s/it]


-> using already created scores for the models. Please run 'eval_models()'/'eval_models_cv()' again if something changed with the data

best model type (recall):  LogisticRegression  -  0.98
starting to hyperparametertune best model type (rand_search =  True )...


Best: 0.976111 using {'solver': 'sag', 'penalty': 'l2', 'C': 1.0}


... hyperparameter tuning finished

accuracy:  1.0
precision:  1.0
recall:  1.0
classification report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



<sam_ml.models.LogisticRegression.LR at 0x7f81d290e040>
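
The "Best: ... using {...}" line above resembles the output of a randomized hyperparameter search. As a rough sketch of that idea, the snippet below runs scikit-learn's `RandomizedSearchCV` over a small, invented grid for `LogisticRegression`, scored by macro recall. The grid, `n_iter`, `cv` and `max_iter` values are assumptions made for this example; the search space that `find_best_model` actually uses is not shown in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Same data and split parameters as in section 2, using plain arrays.
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)

# Hypothetical search space (not taken from sam_ml).
param_distributions = {
    "solver": ["lbfgs", "sag", "saga"],
    "C": [0.01, 0.1, 1.0, 10.0],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000),
    param_distributions=param_distributions,
    n_iter=5,                 # number of sampled parameter combinations
    scoring="recall_macro",   # macro-averaged recall, matching scoring="recall", avg="macro"
    cv=5,                     # 5-fold cross-validation on the training data only
    random_state=42,
)
search.fit(x_train, y_train)

print("Best: %f using %s" % (search.best_score_, search.best_params_))
print("test accuracy:", search.best_estimator_.score(x_test, y_test))
```

The search only ever sees the training data; the held-out test set is used once at the end, which keeps the final scores an honest estimate for the tuned model.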