In [2]:
!pip3 install optuna

Collecting optuna
  Downloading optuna-3.4.0-py3-none-any.whl (409 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.12.1-py3-none-any.whl (226 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.0-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.0 alembic-1.12.1 colorlog-6.7.0 optuna-3.4.0


In [3]:
!pip3 install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.datasets import load_wine
import optuna
from optuna.samplers import TPESampler
import catboost
import pickle

**Load the data**

We use the wine dataset from scikit-learn, which contains 178 samples with 13 features and 3 classes. The goal is to predict the class of a wine from its features. Calling load_wine() with return_X_y=True and as_frame=True returns the features as a Pandas DataFrame and the target as a Pandas Series.

In [5]:
X, y = load_wine(return_X_y=True, as_frame=True)
X.sample(5)

index,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
115,11.03,1.51,2.2,21.5,85.0,2.46,2.17,0.52,2.01,1.9,1.71,2.87,407.0
66,13.11,1.01,1.7,15.0,78.0,2.98,3.18,0.26,2.28,5.3,1.12,3.18,502.0
58,13.72,1.43,2.5,16.7,108.0,3.4,3.67,0.19,2.04,6.8,0.89,2.87,1285.0
130,12.86,1.35,2.32,18.0,122.0,1.51,1.25,0.21,0.94,4.1,0.76,1.29,630.0
81,12.72,1.81,2.2,18.8,86.0,2.2,2.53,0.26,1.77,3.9,1.16,3.14,714.0


In [6]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
dtypes: fl

**Examine the target variable**

In [7]:
y.value_counts()

1    71
0    59
2    48
Name: target, dtype: int64

Calling the Pandas value_counts() function on the target variable y shows that the dataset has three classes with 71, 59, and 48 samples. The classes are not perfectly balanced, but an imbalance this mild shouldn't be a major problem for CatBoost.
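To put numbers on the imbalance, the counts from the output above can be turned into class shares. This is a small standard-library sketch; the ratio of largest to smallest class is just one rough way to summarise imbalance, not a CatBoost-specific measure.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 71, 0: 59, 2: 48}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")

# Ratio between the largest and smallest class: a rough imbalance measure.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

The largest class is only about one and a half times the size of the smallest, which is why the imbalance is described as mild.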

**Split the data into training and test sets**

Next, split the data into training and test sets. We use 70% of the data for training and 30% for testing by setting the test_size parameter to 0.3. The random_state parameter is set to 1 so that the split is reproducible; if you omit it, you could get a different split each time you run the function.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
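To illustrate why a fixed random_state makes the split reproducible, here is a minimal standard-library sketch. The seeded_split helper is hypothetical and is not how train_test_split is implemented. It only mimics the idea of shuffling row indices with a seeded generator and rounding the test size up, which is consistent with the 54 test rows reported in the evaluation below.

```python
import math
import random


def seeded_split(n_samples, test_size, seed):
    """Hypothetical helper: shuffle indices with a seeded RNG, then cut off the test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # round the test size up
    return indices[n_test:], indices[:n_test]


train_a, test_a = seeded_split(178, 0.3, seed=1)
train_b, test_b = seeded_split(178, 0.3, seed=1)
train_c, test_c = seeded_split(178, 0.3, seed=2)

print(len(train_a), len(test_a))
print("same seed, same split:", test_a == test_b)
print("different seed, same split:", test_a == test_c)
```

The same seed always yields the same test rows, while a different seed yields a different selection.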

**Create the CatBoostClassifier model**

This is a simple base model with no hyperparameter tuning. First define the model, then fit it to the training data. It should train quickly as this dataset is very small.

In [9]:
model = catboost.CatBoostClassifier(verbose=False)
model.fit(X_train, y_train)

<catboost.core.CatBoostClassifier at 0x7d5831690f40>

Now generate some predictions from the test data.

In [10]:
y_pred = model.predict(X_test)

**Evaluate the model**

There are a couple of scikit-learn functions we can use to evaluate the model. The classification_report function returns a report with the precision, recall, and F1 score for each class, and the accuracy_score function returns the overall accuracy of the model.

In [11]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98        23
           1       1.00      0.95      0.97        19
           2       1.00      1.00      1.00        12

    accuracy                           0.98        54
   macro avg       0.99      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54



In [12]:
print(accuracy_score(y_test, y_pred))

0.9814814814814815
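The numbers in the report can be reproduced by hand from a confusion matrix. The matrix below is not output from the notebook; it is reconstructed from the rounded precision, recall, and support values above (one class 1 wine predicted as class 0), so treat it as an assumption used for illustration.

```python
# Rows are the true class, columns are the predicted class.
# Reconstructed from the report above; an inference, not notebook output.
confusion = [
    [23, 0, 0],
    [1, 18, 0],
    [0, 0, 12],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[k][k] for k in range(n_classes))
accuracy = correct / total

per_class = []
for k in range(n_classes):
    true_positives = confusion[k][k]
    predicted_k = sum(confusion[i][k] for i in range(n_classes))  # column sum
    support = sum(confusion[k])                                   # row sum
    precision = true_positives / predicted_k
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)
    per_class.append({"precision": precision, "recall": recall, "f1": f1, "support": support})
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")

print(f"accuracy={accuracy:.4f} ({correct} of {total} correct)")
```

Read this way, the 0.98 accuracy corresponds to a single misclassified wine out of 54.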


**Use Optuna to find the best hyperparameters**

To try to squeeze extra accuracy out of our model, we'll now use the Optuna hyperparameter tuning library to search for better hyperparameters. The first step is to create a custom objective function designed specifically for the CatBoostClassifier model.

This function takes an Optuna trial, samples a value for each hyperparameter we want to tune, and returns the test-set accuracy of a model trained with those values. Optuna then calls the function many times with different hyperparameter values to find the best combination.

In [13]:
#Use Optuna to find the best hyperparameters
def objective(trial):
    model = catboost.CatBoostClassifier(
        iterations=trial.suggest_int("iterations", 100, 1000),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
        depth=trial.suggest_int("depth", 4, 10),
        l2_leaf_reg=trial.suggest_float("l2_leaf_reg", 1e-8, 100.0, log=True),
        bootstrap_type=trial.suggest_categorical("bootstrap_type", ["Bayesian"]),
        random_strength=trial.suggest_float("random_strength", 1e-8, 10.0, log=True),
        bagging_temperature=trial.suggest_float("bagging_temperature", 0.0, 10.0),
        od_type=trial.suggest_categorical("od_type", ["IncToDec", "Iter"]),
        od_wait=trial.suggest_int("od_wait", 10, 50),
        verbose=False
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)

**Efficient sampling**

By default, Optuna uses a model-based approach called TPE (Tree-structured Parzen Estimator), a form of Bayesian optimization based on kernel density estimation. After sampling different areas of the search space, it concentrates on the regions that gave the best results and continues to search there. Here we create the sampler explicitly so that we can fix its seed.

In [14]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

sampler = TPESampler(seed=1)

**Create the study**

Next we need to create an Optuna study and run it with our objective function. We set the direction to maximize, since we want to maximise the accuracy score, and run 100 trials. To avoid getting a log message every time a trial finishes, the previous cell lowered Optuna's logging verbosity to WARNING.

In [15]:
#Create the study
study = optuna.create_study(study_name="catboost", direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)

**Evaluate the trial**

After a couple of minutes, depending on the speed of your workstation, Optuna should have crunched through the trials and tried the hyperparameters that you specified. We can access the data from the study to find out which hyperparameters performed best.

In [16]:
#evaluate the trial
print("Number of finished trials: ", len(study.trials))
print("Best trial:")
trial = study.best_trial
print("  Value: ", trial.value)
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

Number of finished trials:  100
Best trial:
  Value:  1.0
  Params: 
    iterations: 503
    learning_rate: 0.06564339077069614
    depth: 6
    l2_leaf_reg: 7.546635702360232e-06
    bootstrap_type: Bayesian
    random_strength: 1.4799844388224288e-07
    bagging_temperature: 0.19366957870297075
    od_type: IncToDec
    od_wait: 20


**Create the model with the best hyperparameters**

Now that Optuna has identified the best-performing combination of hyperparameters for our CatBoostClassifier, we can create a new model with these hyperparameters and train it on the training set. Unpacking the dictionary with **trial.params passes each hyperparameter that Optuna found as a keyword argument.

In [17]:
model_tuned = catboost.CatBoostClassifier(**trial.params, verbose=False)
model_tuned.fit(X_train, y_train)
y_pred = model_tuned.predict(X_test)
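For readers less familiar with the ** syntax, here is a minimal sketch of dictionary unpacking. The build_model function is a hypothetical stand-in for a model constructor, and the parameter values are shortened copies of those printed for the best trial.

```python
def build_model(iterations, learning_rate, depth, verbose=True):
    """Hypothetical stand-in for a model constructor; it just records its arguments."""
    return {
        "iterations": iterations,
        "learning_rate": learning_rate,
        "depth": depth,
        "verbose": verbose,
    }


# Values copied (and shortened) from the best-trial printout above.
best_params = {"iterations": 503, "learning_rate": 0.0656, "depth": 6}

# **best_params expands to iterations=503, learning_rate=0.0656, depth=6.
config = build_model(**best_params, verbose=False)
print(config)

# Equivalent call written out by hand.
assert config == build_model(iterations=503, learning_rate=0.0656, depth=6, verbose=False)

# Passing a key both in the dictionary and explicitly is an error.
try:
    build_model(**best_params, depth=4)
except TypeError as err:
    print("duplicate keyword:", err)
```

The last part is worth keeping in mind: any argument passed explicitly alongside the unpacked dictionary must not also be one of its keys.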

**Evaluate the tuned model**

Finally, we can evaluate the model on the test set and see how well it performs.

In [18]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        23
           1       1.00      1.00      1.00        19
           2       1.00      1.00      1.00        12

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54



The base model was already pretty solid, and the tuned model now reaches 100% accuracy on the test set. That is one more correct prediction out of 54, and because the objective function scored every trial on this same test set, the figure is an optimistic estimate of how the model will perform on new data.

In [19]:
print(accuracy_score(y_test, y_pred))

1.0
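With only 54 test rows, a single prediction moves the accuracy by almost two percentage points. As a rough sketch of the uncertainty involved, the code below computes a Wilson score interval for the base model (53 of 54 correct) and the tuned model (54 of 54 correct). It assumes the test rows behave like independent draws, and the wilson_interval helper is written here for illustration rather than taken from any of the libraries used above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


base_low, base_high = wilson_interval(53, 54)    # base model
tuned_low, tuned_high = wilson_interval(54, 54)  # tuned model

print(f"base model:  {base_low:.3f} to {base_high:.3f}")
print(f"tuned model: {tuned_low:.3f} to {tuned_high:.3f}")
print("intervals overlap:", tuned_low < base_high)
```

The two intervals overlap, so a test set of this size cannot by itself establish that the tuned model is better than the base model.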


**Save the model using Pickle**

Now that we have a tuned model, we can save it for future use. We'll use Pickle to save the tuned model, model_tuned, to disk.

In [20]:
with open("catboost_model.pkl", "wb") as f:
    pickle.dump(model_tuned, f)

**Load the model from disk**

Pickle allows us to load the model at any time and use it to make predictions on new data without the hassle of retraining or reoptimising it.

In [21]:

# Load the model back and check that it reproduces the tuned model's accuracy
with open("catboost_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print(result)

1.0
