TPOT documentation: https://epistasislab.github.io/tpot/

Example configurations: https://github.com/EpistasisLab/tpot/tree/master/tpot/config

%%bash
conda install -c conda-forge tpot

# Data preparation

In [18]:
from tpot import TPOTClassifier

import pandas as pd
import numpy as np

In [26]:
X_train = pd.read_csv("../output/X_train.csv", index_col = "index")
y_train = pd.read_csv("../output/y_train.csv", names = ["index", "klasa"], index_col = "index")

X_test = pd.read_csv("../output/X_test.csv", index_col = "index")
y_test = pd.read_csv("../output/y_test.csv", names = ["index", "klasa"], index_col = "index")
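As a quick check of what these `read_csv` calls produce, the sketch below loads a small headerless label file from an in-memory string and flattens the resulting single-column frame to a 1-D array, which is the form later passed to `fit` via `.values.ravel()`. The CSV contents are made up for illustration; the real files under `../output/` are not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for y_train.csv: no header row, two columns.
csv_text = "10,Ł\n11,Z\n12,Ł\n"

y = pd.read_csv(io.StringIO(csv_text), names=["index", "klasa"], index_col="index")

# One data column, so the frame is 2-D with shape (n_samples, 1).
frame_shape = y.shape

# .values.ravel() flattens it to a 1-D array of length n_samples.
flat = y.values.ravel()
print(frame_shape, flat.shape)
```

Passing `names` tells pandas that the file has no header row, so the first line is read as data rather than as column labels.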

In [27]:
# Encode the class labels as integers: "Ł" -> 0, "Z" -> 1.
# Assigning the mapped column back avoids in-place replacement on a column selection.
label_map = {"Ł": 0, "Z": 1}

y_train["klasa"] = y_train["klasa"].map(label_map)
y_test["klasa"] = y_test["klasa"].map(label_map)
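The encoding step can also be guarded against unexpected labels. The following is a minimal sketch on toy data, assuming only the two labels `"Ł"` and `"Z"` are valid; the `ValueError` check is an illustrative addition, not part of the original notebook.

```python
import pandas as pd

# Toy labels for illustration; the notebook's real labels come from y_train.csv.
labels = pd.Series(["Ł", "Z", "Z", "Ł"], name="klasa")

label_map = {"Ł": 0, "Z": 1}
encoded = labels.map(label_map)

# map() yields NaN for labels missing from the mapping, so fail loudly if any remain.
if encoded.isna().any():
    raise ValueError("unexpected class label found")

encoded = encoded.astype(int)
print(encoded.tolist())
```

Unlike `replace`, which silently leaves unknown values untouched, `map` turns them into `NaN`, which makes a stray label easy to detect before training.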

# Using known models

In [3]:
konfiguracja_tpot = {
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ["gini", "entropy"],
        'max_depth': range(1, 11),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21)
    },
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100],
        'criterion': ["gini", "entropy"],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf':  range(1, 21),
        'bootstrap': [True, False]
    }
}
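To get a feel for how large this search space is, the sketch below counts the hyperparameter combinations per model. It restates the grids with the standard library only; the `max_features` list is assumed to be equivalent to `np.arange(0.05, 1.01, 0.05)` (20 values from 0.05 to 1.00), and `grid_size` is a helper introduced here for illustration.

```python
from math import prod

# Same grids as konfiguracja_tpot, written with the standard library only.
# max_features mirrors np.arange(0.05, 1.01, 0.05): 0.05, 0.10, ..., 1.00.
config = {
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": range(1, 11),
        "min_samples_split": range(2, 21),
        "min_samples_leaf": range(1, 21),
    },
    "sklearn.ensemble.RandomForestClassifier": {
        "n_estimators": [100],
        "criterion": ["gini", "entropy"],
        "max_features": [round(0.05 * i, 2) for i in range(1, 21)],
        "min_samples_split": range(2, 21),
        "min_samples_leaf": range(1, 21),
        "bootstrap": [True, False],
    },
}


def grid_size(params):
    """Number of distinct hyperparameter combinations for one operator."""
    return prod(len(values) for values in params.values())


sizes = {name: grid_size(params) for name, params in config.items()}
for name, size in sizes.items():
    print(name, size)
```

TPOT samples from these grids through its genetic search rather than enumerating them, so a short run evaluates only a small part of the space.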

In [6]:
klasyfikator = TPOTClassifier(config_dict = konfiguracja_tpot, 
                              generations = 5, 
                              population_size = 50, 
                              verbosity = 2, 
                              random_state = 42)

In [7]:
klasyfikator.fit(features = X_train, target = y_train.values.ravel())

TPOT closed prematurely. Will use the current best pipeline.


RuntimeError: A pipeline has not yet been optimized. Please call fit() first.

# "Fast" configuration

In [32]:
klasyfikator = TPOTClassifier(config_dict = "TPOT light", 
                              generations = 100, 
                              population_size = 100, 
                              verbosity = 2, 
                              random_state = 42, 
                              early_stop = 10)
klasyfikator.fit(features = X_train, target = y_train.values.ravel())

Optimization Progress:   2%|▏         | 200/10100 [00:16<13:29, 12.22pipeline/s]

Generation 1 - Current best internal CV score: 0.9758241758241759


Optimization Progress:   3%|▎         | 300/10100 [00:31<11:35, 14.08pipeline/s]  

Generation 2 - Current best internal CV score: 0.9758241758241759


Optimization Progress:   4%|▍         | 400/10100 [00:46<14:40, 11.02pipeline/s]  

Generation 3 - Current best internal CV score: 0.9758241758241759


Optimization Progress:   5%|▍         | 500/10100 [01:15<59:44,  2.68pipeline/s]  

Generation 4 - Current best internal CV score: 0.9780219780219781


Optimization Progress:   6%|▌         | 600/10100 [01:44<35:49,  4.42pipeline/s]  

Generation 5 - Current best internal CV score: 0.9780219780219781


Optimization Progress:   7%|▋         | 700/10100 [02:18<1:24:45,  1.85pipeline/s]

Generation 6 - Current best internal CV score: 0.9780219780219781


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: LogisticRegression(MinMaxScaler(SelectPercentile(input_matrix, percentile=89)), C=10.0, dual=True, penalty=l2)


TPOTClassifier(config_dict={'sklearn.naive_bayes.GaussianNB': {}, 'sklearn.naive_bayes.BernoulliNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.naive_bayes.MultinomialNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.tree.DecisionT...e_selection.VarianceThreshold': {'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        early_stop=10, generations=100, max_eval_time_mins=5,
        max_time_mins=None, memory=None, mutation_rate=0.9, n_jobs=1,
        offspring_size=100, periodic_checkpoint_folder=None,
        population_size=100, random_state=42, scoring=None, subsample=1.0,
        verbosity=2, warm_start=False)
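The numbers in the log above can be cross-checked with a little arithmetic. TPOT's documentation gives the total number of evaluated pipelines as `population_size + generations * offspring_size`, which matches the progress bar total of 10100. The second helper below is a simplified model of how an `early_stop` counter could behave (generations since the last improvement of the best score); it is an assumption for illustration, not TPOT's actual implementation.

```python
def total_evaluations(population_size, generations, offspring_size=None):
    """Pipelines evaluated in a full run: initial population plus offspring per generation."""
    if offspring_size is None:
        # TPOT documents offspring_size as defaulting to population_size.
        offspring_size = population_size
    return population_size + generations * offspring_size


def generations_since_improvement(best_scores):
    """Count trailing generations whose best score did not beat the earlier best."""
    best = float("-inf")
    stale = 0
    for score in best_scores:
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
    return stale


budget = total_evaluations(population_size=100, generations=100)

# Best internal CV scores for generations 1-6, copied from the log above.
logged = [
    0.9758241758241759, 0.9758241758241759, 0.9758241758241759,
    0.9780219780219781, 0.9780219780219781, 0.9780219780219781,
]
stale = generations_since_improvement(logged)
print(budget, stale)
```

Under this model only two generations had passed without improvement when the run ended, well short of `early_stop = 10`. That suggests the "TPOT closed prematurely" message here came from an interruption of the run, for example a manual one, rather than from early stopping.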

In [33]:
klasyfikator.score(testing_features = X_test, testing_target = y_test)

0.9736842105263158
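With `scoring=None`, as shown in the estimator repr above, `TPOTClassifier` is documented to fall back to accuracy, so the value returned by `score` should be the fraction of test samples classified correctly. The sketch below spells that metric out in plain Python; the labels are made up and the `accuracy` helper is introduced here for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
acc = accuracy(y_true, y_pred)
print(acc)
```

Accuracy can look optimistic when the classes are imbalanced; in that case a different metric can be requested through the `scoring` parameter.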

In [34]:
klasyfikator.export("../output/tpot.py")

True

In [35]:
%%bash
cat ../output/tpot.py

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=42)

# Score on the training set was:0.9780219780219781
exported_pipeline = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=89),
    MinMaxScaler(),
    LogisticRegression(C=10.0, dual=True, penalty="l2")
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_feature