# Example Notebook for classifier finder

## 1. libraries

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sam_ml.models import CTest, LR

## 2. data

In [2]:
df = load_iris()
y = pd.Series(df.target)
X = pd.DataFrame(df.data, columns=df.feature_names)
x_train, x_test, y_train, y_test = train_test_split(X,y, train_size=0.80, random_state=42)

## 3. model

### 3.1. create tester class object

CTest is an auto-ML class. You can use it to compare different models and find the best one for your data.

**models**: a list of *Classifier* subclass objects, *'all'* (every integrated wrapper-class classifier), or *'basic'* (a smaller selection of basic classifiers)

**vectorizer**, **scaler**, **selector**, **sampler**: data-class parameters; the CTest init wraps each of the given models in a *Pipeline* object built with these parameters
(see the *iris_pipeline.ipynb* notebook for the possible values)

In [3]:
tester = CTest("all", scaler="minmax")

Get all models in the CTest class object:

In [4]:
tester.models

{'LogisticRegression': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba0cf70>,
 'QuadraticDiscriminantAnalysis': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba140d0>,
 'LinearDiscriminantAnalysis': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba141c0>,
 'MLP Classifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba142b0>,
 'LinearSupportVectorClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba143a0>,
 'DecisionTreeClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14490>,
 'RandomForestClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14580>,
 'SupportVectorClassifier (rbf-kernel)': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14670>,
 'GradientBoostingMachine': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14760>,
 'AdaBoostClassifier (DTC based)': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14850>,
 'AdaBoostClassifier (RFC based)': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14940>,
 'KNeighborsClassifier': <sa

You can add models:

In [5]:
tester.add_model(LR(model_name="LogisticRegression (elasticnet penalty)", penalty="elasticnet", solver="saga", l1_ratio=0.5))

In [6]:
tester.models

{'LogisticRegression': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba0cf70>,
 'QuadraticDiscriminantAnalysis': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba140d0>,
 'LinearDiscriminantAnalysis': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba141c0>,
 'MLP Classifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba142b0>,
 'LinearSupportVectorClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba143a0>,
 'DecisionTreeClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14490>,
 'RandomForestClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14580>,
 'SupportVectorClassifier (rbf-kernel)': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14670>,
 'GradientBoostingMachine': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14760>,
 'AdaBoostClassifier (DTC based)': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14850>,
 'AdaBoostClassifier (RFC based)': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14940>,
 'KNeighborsClassifier': <sa

You can remove models:

In [7]:
tester.remove_model("AdaBoostClassifier (RFC based)")

In [8]:
tester.models

{'LogisticRegression': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba0cf70>,
 'QuadraticDiscriminantAnalysis': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba140d0>,
 'LinearDiscriminantAnalysis': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba141c0>,
 'MLP Classifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba142b0>,
 'LinearSupportVectorClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba143a0>,
 'DecisionTreeClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14490>,
 'RandomForestClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14580>,
 'SupportVectorClassifier (rbf-kernel)': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14670>,
 'GradientBoostingMachine': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14760>,
 'AdaBoostClassifier (DTC based)': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14850>,
 'KNeighborsClassifier': <sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba14a30>,
 'ExtraTreesClassifier': <sam_ml.model

## 3.1. evaluation of the models

CTest implements three ways to evaluate the models. Which one to use depends on the dataset.

### 3.1.1. one-vs-all cross validation

**Concept:**

The model is trained on all datapoints except one and then tested on the held-out datapoint. This is repeated for every datapoint, so that in the end there is a prediction for each datapoint. This scheme is commonly known as leave-one-out cross validation.

**Advantage:** optimal use of information for training

**Disadvantage:** long train time

This approach is most useful for small datasets (fewer than 150 datapoints): the training time stays manageable, and when the model has little information to learn from, it is important to use all of it for training.
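The concept can be sketched in plain Python. This is only an illustration of the general idea, not how CTest implements it: the toy data, the nearest-neighbor predictor, and the function names are all assumptions made for this example.

```python
def nearest_neighbor_predict(train_X, train_y, point):
    """Predict the label of `point` as the label of its closest training point."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    closest = min(range(len(train_X)), key=lambda i: squared_distance(train_X[i], point))
    return train_y[closest]


def leave_one_out_predictions(X, y, predict):
    """Hold out each datapoint once, train on the rest, predict the held-out point."""
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]  # every datapoint except the i-th
        train_y = y[:i] + y[i + 1:]
        predictions.append(predict(train_X, train_y, X[i]))
    return predictions


# hypothetical toy data: two well-separated clusters
X_toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
y_toy = [0, 0, 0, 1, 1, 1]

predictions = leave_one_out_predictions(X_toy, y_toy, nearest_neighbor_predict)
accuracy = sum(p == t for p, t in zip(predictions, y_toy)) / len(y_toy)
print(predictions, accuracy)
```

Note that the model is refit once per datapoint, which is where the long training time comes from.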

In [9]:
tester.eval_models_cv(X, y, avg="macro", small_data_eval=True)
tester.output_scores_as_pd(sort_by="recall", console_out=False)

Crossvalidation: 100%|██████████| 18/18 [02:33<00:00,  8.54s/it]


| | accuracy | precision | recall | s_score | l_score | avg train score | avg train time |
|---|---|---|---|---|---|---|---|
| LinearDiscriminantAnalysis | 0.98 | 0.980125 | 0.98 | 0.9904373 | 1.0 | 0.98 | 0:00:00 |
| QuadraticDiscriminantAnalysis | 0.973333 | 0.973825 | 0.973333 | 0.9894085 | 1.0 | 0.980045 | 0:00:00 |
| BaggingClassifier (DTC based) | 0.96 | 0.96 | 0.96 | 0.9874448 | 1.0 | 0.999911 | 0:00:00 |
| AdaBoostClassifier (DTC based) | 0.953333 | 0.954369 | 0.953333 | 0.9856405 | 0.999999 | 0.959866 | 0:00:00 |
| ExtraTreesClassifier | 0.953333 | 0.953448 | 0.953333 | 0.9861117 | 1.0 | 1.0 | 0:00:00 |
| BaggingClassifier (RFC based) | 0.953333 | 0.953448 | 0.953333 | 0.9861117 | 1.0 | 0.988456 | 0:00:00 |
| GaussianNB | 0.953333 | 0.953448 | 0.953333 | 0.9861117 | 1.0 | 0.959418 | 0:00:00 |
| RandomForestClassifier | 0.953333 | 0.953448 | 0.953333 | 0.9861117 | 1.0 | 1.0 | 0:00:00 |
| SupportVectorClassifier (rbf-kernel) | 0.953333 | 0.953448 | 0.953333 | 0.9861117 | 1.0 | 0.979732 | 0:00:00 |
| GradientBoostingMachine | 0.953333 | 0.953448 | 0.953333 | 0.9861117 | 1.0 | 1.0 | 0:00:00 |


### 3.1.2. multiple split crossvalidation

Splits the data **cv_num** times and averages the scores over all splits to evaluate each model.
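As a rough sketch of the general idea (not CTest's actual implementation), the following plain-Python example splits a toy dataset into `cv_num` folds, scores a model on each fold, and averages the results. The helper names, the nearest-class-mean model, and the data are assumptions made for this illustration.

```python
def k_fold_indices(n_samples, cv_num):
    """Split range(n_samples) into cv_num contiguous test folds of near-equal size."""
    base, extra = divmod(n_samples, cv_num)
    folds, start = [], 0
    for fold in range(cv_num):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nearest_mean_accuracy(train_X, train_y, test_X, test_y):
    """Fit a nearest-class-mean model on 1D data and return its test accuracy."""
    means = {}
    for label in set(train_y):
        values = [x for x, t in zip(train_X, train_y) if t == label]
        means[label] = sum(values) / len(values)
    predictions = [min(means, key=lambda label: abs(means[label] - x)) for x in test_X]
    return sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)


def cross_val_mean(X, y, cv_num, fit_and_score):
    """Score the model once per fold and return (mean score, per-fold scores)."""
    fold_scores = []
    for test_idx in k_fold_indices(len(X), cv_num):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        fold_scores.append(fit_and_score(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx],
        ))
    return sum(fold_scores) / len(fold_scores), fold_scores


# hypothetical 1D toy data with interleaved classes
X_toy = [1.0, 9.0, 2.0, 8.0, 1.5, 9.5, 2.5, 8.5, 1.2, 9.2]
y_toy = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

mean_score, fold_scores = cross_val_mean(X_toy, y_toy, cv_num=5, fit_and_score=nearest_mean_accuracy)
print(mean_score, fold_scores)
```

Compared with holding out a single datapoint at a time, the model is only refit `cv_num` times, which explains the much shorter runtime.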

In [10]:
tester.eval_models_cv(X, y, avg="macro", small_data_eval=False, cv_num=10)
tester.output_scores_as_pd(sort_by="recall", console_out=False)

Crossvalidation: 100%|██████████| 18/18 [00:06<00:00,  2.89it/s]


| | accuracy | precision | recall | s_score | l_score | avg train score | avg train time |
|---|---|---|---|---|---|---|---|
| LinearDiscriminantAnalysis | 0.966667 | 0.85 | 0.833333 | 0.698512 | 0.7 | 0.979259 | 0:00:00 |
| QuadraticDiscriminantAnalysis | 0.966667 | 0.85 | 0.833333 | 0.698512 | 0.7 | 0.982222 | 0:00:00 |
| SupportVectorClassifier (rbf-kernel) | 0.953333 | 0.8 | 0.776667 | 0.599006 | 0.6 | 0.979259 | 0:00:00 |
| RandomForestClassifier | 0.946667 | 0.8 | 0.773333 | 0.599004 | 0.6 | 1.0 | 0:00:00 |
| KNeighborsClassifier | 0.946667 | 0.8 | 0.773333 | 0.599005 | 0.6 | 0.964444 | 0:00:00 |
| ExtraTreesClassifier | 0.946667 | 0.8 | 0.773333 | 0.599005 | 0.6 | 1.0 | 0:00:00 |
| GradientBoostingMachine | 0.926667 | 0.8 | 0.763333 | 0.598997 | 0.6 | 1.0 | 0:00:00 |
| BaggingClassifier (DTC based) | 0.92 | 0.8 | 0.76 | 0.598951 | 0.6 | 0.992593 | 0:00:00 |
| DecisionTreeClassifier | 0.953333 | 0.766667 | 0.743333 | 0.499746 | 0.5 | 1.0 | 0:00:00 |
| GaussianNB | 0.946667 | 0.766667 | 0.74 | 0.499745 | 0.5 | 0.961481 | 0:00:00 |


### 3.1.3. evaluate on given train-test-split

Sometimes a dataset can only be split sensibly in one way, in which case cross validation is not useful.
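One situation where this can apply, offered here as an assumption rather than something this notebook states, is time-ordered data: the test set should only contain records that come after all training records. The sketch below shows such a fixed split in plain Python; the record layout and the function name are hypothetical.

```python
def chronological_split(records, train_fraction):
    """Sort records by timestamp and split so every test record is later than every training record."""
    ordered = sorted(records, key=lambda record: record["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# hypothetical records arriving in arbitrary order
records = [
    {"timestamp": t, "feature": t * 0.5, "label": t % 2}
    for t in [5, 1, 4, 2, 3, 8, 7, 6, 10, 9]
]

train_records, test_records = chronological_split(records, train_fraction=0.8)
print([r["timestamp"] for r in train_records])
print([r["timestamp"] for r in test_records])
```

A split like this yields exactly one train-test pair, which is the kind of input the evaluation on a given split expects.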

In [11]:
tester.eval_models(x_train, y_train, x_test, y_test, avg="macro")
tester.output_scores_as_pd(sort_by="recall", console_out=False)

Crossvalidation: 100%|██████████| 18/18 [00:01<00:00, 17.73it/s]


| | accuracy | precision | recall | s_score | l_score | train_score | train_time |
|---|---|---|---|---|---|---|---|
| AdaBoostClassifier (DTC based) | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.966667 | 0:00:00 |
| GaussianNB | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.95 | 0:00:00 |
| LinearDiscriminantAnalysis | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.975 | 0:00:00 |
| BaggingClassifier (RFC based) | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.966667 | 0:00:00 |
| BaggingClassifier (DTC based) | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 1.0 | 0:00:00 |
| DecisionTreeClassifier | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 1.0 | 0:00:00 |
| RandomForestClassifier | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 1.0 | 0:00:00 |
| SupportVectorClassifier (rbf-kernel) | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.975 | 0:00:00 |
| GradientBoostingMachine | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 1.0 | 0:00:00 |
| KNeighborsClassifier | 1.0 | 1.0 | 1.0 | 0.9926 | 1.0 | 0.958333 | 0:00:00 |


### 3.2. find best model

**Idea:**

The find_best_model method uses one of the evaluation methods above to pick the best model type for a specific metric and then tunes the hyperparameters of that model. In the end, you have the best model for this metric on your dataset.

**Useful parameters:**

- the **cv_kind** parameter (*"small"*, *"multi"*, *"no"*) selects which of the three evaluation types above is used to find the best model type

- the **scoring** parameter selects the metric used to search for the best model (use **avg**, **secondary_scoring**, **strength**, and **pos_label** to specify it further)

- the **rand_search** parameter selects between *GridSearchCV* and *RandomizedSearchCV* for the hyperparameter tuning (*RandomizedSearchCV* is recommended because it takes less time)
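The two stages (pick the best model type by one metric, then tune it) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code behind find_best_model: the function names, the hyperparameter names `C` and `max_iter`, and the toy objective are hypothetical, and the scores are a few recall values from the one-vs-all evaluation above.

```python
import itertools
import random


def pick_best_model(scores, scoring):
    """Return the name of the model with the highest value for the `scoring` metric."""
    return max(scores, key=lambda name: scores[name][scoring])


def grid_search(param_grid, evaluate):
    """Evaluate every combination in the grid and return (best params, best score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(param_grid, evaluate, n_iter, seed=0):
    """Evaluate only n_iter randomly drawn combinations, trading completeness for speed."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# stage 1: pick the model type with the best recall
scores = {
    "LinearDiscriminantAnalysis": {"recall": 0.98},
    "QuadraticDiscriminantAnalysis": {"recall": 0.973333},
    "GaussianNB": {"recall": 0.953333},
}
best_name = pick_best_model(scores, scoring="recall")


# stage 2: tune hypothetical hyperparameters against a toy objective
def toy_objective(params):
    # stands in for "fit the model with these params and return the metric"
    return -abs(params["C"] - 1.0) - abs(params["max_iter"] - 200) / 1000


param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "max_iter": [100, 200, 500]}
grid_params, grid_score = grid_search(param_grid, toy_objective)
rand_params, rand_score = random_search(param_grid, toy_objective, n_iter=5)
print(best_name, grid_params, rand_params)
```

The grid search evaluates all 12 combinations here, while the random search evaluates only 5, so it is faster but may miss the best combination.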

In [12]:
tester.find_best_model(x_train, y_train, x_test, y_test, cv_kind="no", scoring="recall", avg="macro", rand_search=True)

creating scores using 'eval_models()'


Crossvalidation: 100%|██████████| 18/18 [00:00<00:00, 18.35it/s]


                                         accuracy  precision    recall  \
LinearDiscriminantAnalysis               1.000000   1.000000  1.000000   
DecisionTreeClassifier                   1.000000   1.000000  1.000000   
RandomForestClassifier                   1.000000   1.000000  1.000000   
SupportVectorClassifier (rbf-kernel)     1.000000   1.000000  1.000000   
GradientBoostingMachine                  1.000000   1.000000  1.000000   
AdaBoostClassifier (DTC based)           1.000000   1.000000  1.000000   
KNeighborsClassifier                     1.000000   1.000000  1.000000   
ExtraTreesClassifier                     1.000000   1.000000  1.000000   
GaussianNB                               1.000000   1.000000  1.000000   
BaggingClassifier (DTC based)            1.000000   1.000000  1.000000   
BaggingClassifier (RFC based)            1.000000   1.000000  1.000000   
LogisticRegression                       0.966667   0.972222  0.962963   
QuadraticDiscriminantAnalysis         

<sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba141c0>

The find_best_model method returns the best model and prints its parameters to the console. If you need more than that, you can access the best model as follows:

In [13]:
tester.best_model

<sam_ml.models.main_pipeline.Pipeline at 0x7fd33ba141c0>