In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv


### Genetic Hyperparameter Tuning with TPOT
You're going to work through a simple example of genetic hyperparameter tuning with TPOT. TPOT is a powerful library with many features. This lesson only scratches the surface, so you are encouraged to explore it further in your own time.

This is a very small example. In real life, TPOT is designed to be run for many hours to find the best model. You would have a much larger population and offspring size as well as hundreds more generations to find a good model.

You will create the estimator, fit the estimator to the training data and then score this on the test data.

For this example we use:

- 10 generations
- a population size of 10
- the default offspring size (offspring_size=None, which makes it equal to the population size)
- accuracy for scoring
- 2-fold cross-validation (cv=2)

A random_state of 2 has been set for consistency of results.

#### TPOT components

The key arguments to a TPOT classifier are:
- generations: Iterations to run training for.
- population_size: The number of models to keep after each iteration.
- offspring_size: Number of models to produce in each iteration.
- mutation_rate: The proportion of pipelines to apply randomness to.
- crossover_rate: The proportion of pipelines to breed each iteration.
- scoring: The function to determine the best models.
- cv: Cross-validation strategy to use.
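
To get a feel for how generations, population_size and offspring_size drive the cost of a run, here is a minimal sketch of the evaluation budget. It relies on the rule stated in the TPOT documentation that a run evaluates population_size + generations × offspring_size pipelines, with offspring_size defaulting to population_size. The helper name `total_pipelines` is hypothetical and not part of TPOT.

```python
def total_pipelines(generations, population_size, offspring_size=None):
    """Number of pipelines a TPOT run evaluates (hypothetical helper, not a TPOT API)."""
    # TPOT falls back to the population size when no offspring size is given.
    if offspring_size is None:
        offspring_size = population_size
    # Initial population plus the offspring produced in every generation.
    return population_size + generations * offspring_size


# Settings used in the fitting cell of this notebook.
budget = total_pipelines(generations=10, population_size=10)
print(budget)
```

With 10 generations and a population of 10 this gives 110, which matches the total shown in the optimization progress bar of the run.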


In [2]:
from sklearn.model_selection import train_test_split

data = pd.read_csv('/kaggle/input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv')

X = data.drop('default.payment.next.month', axis=1)
y = data['default.payment.next.month']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [3]:
X_train.shape

(20100, 24)
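
The training shape can be checked by hand. The sketch below assumes the dataset has 30,000 rows (an assumption consistent with the 20,100 training rows reported above) and that a fractional test_size is rounded up to whole rows with the remainder going to training, which is how scikit-learn is understood to size the split.

```python
import math

n_samples = 30000   # assumed row count of UCI_Credit_Card.csv
test_size = 0.33    # same fraction passed to train_test_split

# The test share is rounded up to a whole number of rows;
# everything else is left for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The 24 columns are every column of the CSV except the target, default.payment.next.month.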

In [4]:
from tpot import TPOTClassifier

# Assign the values outlined to the inputs
number_generations = 10
population_size = 10
offspring_size = None
scoring_function = 'accuracy'

# Create the tpot classifier
tpot_clf = TPOTClassifier(generations=number_generations, population_size=population_size,
                          offspring_size=offspring_size, scoring=scoring_function,
                          verbosity=2, random_state=2, cv=2)

# Fit the classifier to the training data
tpot_clf.fit(X_train, y_train)

# Score on the test set
print(tpot_clf.score(X_test, y_test))

Optimization Progress:   0%|          | 0/110 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8183084577114428

Generation 2 - Current best internal CV score: 0.8186567164179105

Generation 3 - Current best internal CV score: 0.8192537313432836

Generation 4 - Current best internal CV score: 0.8192537313432836

Generation 5 - Current best internal CV score: 0.8192537313432836

Generation 6 - Current best internal CV score: 0.8192537313432836

Generation 7 - Current best internal CV score: 0.8192537313432836

Generation 8 - Current best internal CV score: 0.8192537313432836

Generation 9 - Current best internal CV score: 0.8200497512437811

Generation 10 - Current best internal CV score: 0.8200497512437811

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.7500000000000001, min_samples_leaf=18, min_samples_split=15, n_estimators=100)
0.8218181818181818


Nice work! The output shows the best internal cross-validation score after each generation, followed by the best pipeline with its chosen hyperparameters and its accuracy on the test set (about 0.822). This is a first example of using TPOT for automated hyperparameter tuning. You can now build on it on your own and apply it to your own machine learning models!
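
The best pipeline reported in the output is a plain scikit-learn estimator, so it can be rebuilt without TPOT. The following is a minimal sketch using the hyperparameters printed above. The synthetic data from make_classification is only a stand-in so the snippet runs on its own, and the random_state passed to the forest is an assumption added for repeatability. In the notebook you would fit on X_train, y_train and score on X_test, y_test instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): replace with the notebook's own split.
X_demo, y_demo = make_classification(n_samples=600, n_features=24, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Hyperparameters copied from the "Best pipeline" line of the TPOT output
# (max_features was printed as 0.7500000000000001).
best_rf = RandomForestClassifier(
    bootstrap=True,
    criterion="gini",
    max_features=0.75,
    min_samples_leaf=18,
    min_samples_split=15,
    n_estimators=100,
    random_state=2,  # assumption: fixed only to make the sketch repeatable
)
best_rf.fit(X_tr, y_tr)

demo_accuracy = best_rf.score(X_te, y_te)
print(f"Accuracy on the synthetic stand-in data: {demo_accuracy:.3f}")
```

The accuracy printed here describes the synthetic data only and is not comparable to the score TPOT reported on the credit card dataset.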

In [None]:
# NOTE: tweak random state param

Well done! You can see that TPOT is quite unstable when run with only a few generations and a small population and offspring size. As the random state was varied, the first model chosen was a Decision Tree, then a K-Nearest Neighbors model and finally a Random Forest. Increasing the number of generations, the population size and the offspring size, and running TPOT for much longer, helps produce better models and more stable results. Don't hesitate to try it yourself on your own machine!