In [28]:
import warnings

import pandas as pd
from numpy import (mean, std)
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.multiclass import OutputCodeClassifier
from sklearn.metrics import accuracy_score

The goal of this exercise is to implement the Error-Correcting Output Codes (ECOC) strategy for a multiclass classification problem.



# **Info**
---

@By: Kaiziferr

@Git: https://github.com/Kaiziferr

# **Config**
---


In [29]:
random_seed=12354

# **Data**

---



A synthetic dataset of 1,000 samples is generated with 8 features (6 informative, 2 redundant) and 5 classes. Label noise is introduced with flip_y=0.05, meaning that 5% of the samples have their class assigned at random.

In [30]:
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=6,
    n_redundant=2,
    n_classes=5,
    flip_y=0.05,
    random_state=random_seed
)
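The flip_y=0.05 argument does not translate directly into a 5% error rate: a class drawn at random can coincide with the original one. The following is a rough back-of-the-envelope sketch, assuming the new class is drawn uniformly over the 5 classes; the helper name is hypothetical and is not part of scikit-learn.

```python
def expected_label_noise(n_samples, n_classes, flip_y):
    """Return (samples re-drawn, samples expected to end up with a different label)."""
    redrawn = n_samples * flip_y
    # Assumption: a uniform re-draw hits the original class with probability 1 / n_classes.
    changed = redrawn * (1 - 1 / n_classes)
    return redrawn, changed


redrawn, changed = expected_label_noise(n_samples=1000, n_classes=5, flip_y=0.05)
print(f"re-drawn: {redrawn:.0f}, expected to actually change: {changed:.0f}")
```

Under that assumption, roughly 40 of the 1,000 labels are expected to be genuinely wrong, which puts an approximate ceiling on the accuracy any model can reach on this data.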

To give the data some meaning, the classes are mapped to letters.

The dataset is approximately balanced; the small differences between class counts come from the random label reassignment (flip_y), since the data has not been split at this point.



In [31]:
y = pd.Series(y).replace({0:'A',1:'B',2:'C',3:'D', 4:'E'})
y.value_counts()

class,count
D,208
B,201
E,199
A,196
C,196


# **Data Split**
---

In [32]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.35,
    stratify=y,
    random_state=random_seed
)
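The stratify=y argument keeps the class proportions nearly identical in the training and test partitions. The idea can be sketched in plain Python as follows. This is an illustration only, not scikit-learn's implementation: the function name is hypothetical, and the per-class rounding used here can make the total test size differ slightly from the one train_test_split produces.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=0):
    """Split indices so each class sends roughly `test_size` of its samples to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Class counts taken from the value_counts() output above.
counts = {"D": 208, "B": 201, "E": 199, "A": 196, "C": 196}
labels = [label for label, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels, test_size=0.35, seed=12354)
print(Counter(labels[i] for i in test_idx))
```

Each class contributes close to 35% of its samples to the test set, so the accuracy measured there is not distorted by a class being over- or under-represented.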

# **Model**
---



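Two logistic regression models and two SVC models are defined. Each pair shares the same parameters: one copy serves as the stand-alone baseline and the other as the base estimator for ECOC.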

In [33]:
model_log = LogisticRegression(random_state=random_seed)

model_logB = LogisticRegression(**model_log.get_params())

model_svc = SVC(random_state=random_seed)

model_svcB = SVC(**model_svc.get_params())

**LogisticRegression**

A logistic regression model is fitted with its default parameters (no tuning) to observe how it performs without ECOC.

In [34]:
model_logB.fit(X_train, y_train)
y_predict = model_logB.predict(X_test)

The model reaches an accuracy of 51%. It does not separate the classes well, although it is still clearly above the roughly 20% expected from random guessing across 5 balanced classes, so it would be hasty to dismiss it as only slightly better than chance.

In [35]:
accuracy_score(y_test, y_predict)

0.5085714285714286
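To put the 51% in context, it helps to compute what a model with no skill would score. The sketch below uses the class counts of the full dataset shown earlier (the test partition is stratified, so its proportions should be close) and compares two naive reference points.

```python
# Class counts taken from the value_counts() output above.
counts = {"D": 208, "B": 201, "E": 199, "A": 196, "C": 196}
total = sum(counts.values())

# Uniform random guessing over k classes is right 1/k of the time on average.
uniform_guess = 1 / len(counts)

# Always predicting the most frequent class scores that class's share of the data.
majority_class = max(counts.values()) / total

print(f"uniform guess: {uniform_guess:.3f}, majority class: {majority_class:.3f}")
```

Both reference points sit near 0.20, so an accuracy of 0.51 is weak in absolute terms but well above a no-skill model.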

Cross-validation over the entire dataset is used to evaluate the model across different data splits. Whatever the partition, the accuracy stays low with only minor variation, so the hold-out result above is not an artifact of one particular split.

In [36]:
score = cross_val_score(model_logB, X, y, scoring='accuracy', cv=5, n_jobs=-1)
print('Mean: %.3f Std: (%.3f)' % (mean(score), std(score)), "scores:" ,score)

Mean: 0.513 Std: (0.032) scores: [0.54  0.475 0.56  0.505 0.485]
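The summary line can be reproduced from the five fold scores alone. The sketch below uses only the standard library and makes explicit that numpy's std defaults to the population standard deviation (ddof=0); the sample standard deviation (ddof=1) is somewhat larger with so few folds.

```python
import statistics

scores = [0.54, 0.475, 0.56, 0.505, 0.485]  # fold accuracies reported above

print(f"mean: {statistics.mean(scores):.3f}")
print(f"population std (ddof=0, numpy's default): {statistics.pstdev(scores):.3f}")
print(f"sample std (ddof=1): {statistics.stdev(scores):.3f}")
```

With only five folds, either figure is a coarse estimate of variability, so small differences between models should be read with care.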


**Logistic Regression with Error-Correcting Output Codes (ECOC)**

ECOC is applied using 5 logistic regression models.

In [37]:
ecoc_log = OutputCodeClassifier(model_log, code_size=1, random_state=random_seed)
ecoc_log.fit(X_train, y_train)
y_predict = ecoc_log.predict(X_test)
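With code_size=1 and 5 classes, each class is represented by a 5-bit codeword and one binary model is trained per bit. At prediction time, the bits produced by the binary models are matched to the closest codeword. The sketch below illustrates that decoding step with a hand-written, hypothetical code book and Hamming distance; it is not the code book that OutputCodeClassifier draws at random, and the function names are illustrative.

```python
def hamming(a, b):
    """Count the positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_classes(bits, code_book):
    """Return every class whose codeword is at minimum Hamming distance from `bits`."""
    distances = {label: hamming(bits, word) for label, word in code_book.items()}
    best = min(distances.values())
    return sorted(label for label, d in distances.items() if d == best)


# Hypothetical code book: 5 classes x 5 bits, one bit per binary model.
code_book = {
    "A": (0, 0, 0, 1, 1),
    "B": (1, 1, 1, 0, 0),
    "C": (0, 1, 0, 1, 0),
    "D": (1, 0, 1, 1, 1),
    "E": (1, 1, 0, 0, 1),
}

print(nearest_classes((1, 1, 1, 0, 0), code_book))  # all five models correct for class B
print(nearest_classes((1, 1, 0, 0, 0), code_book))  # third model wrong for class B
```

In this toy code book, a single wrong binary model leaves B and E equally close, which shows how little room for error correction a 5-bit code offers for 5 classes.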

The model performs worse than the implementation without ECOC.

In [38]:
accuracy_score(y_test, y_predict)

0.3742857142857143

Cross-validation over the entire dataset confirms the result: accuracy is consistently low across partitions, with only minor variation, and the mean is in line with the hold-out score above.

In [39]:
score = cross_val_score(ecoc_log, X, y, scoring='accuracy', cv=5, n_jobs=-1)
print('Mean: %.3f Std: (%.3f)' % (mean(score), std(score)), "scores:" ,score)

Mean: 0.376 Std: (0.026) scores: [0.37  0.385 0.385 0.41  0.33 ]
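One plausible reason for the drop is the code length. A code can only be guaranteed to correct (d - 1) // 2 wrong bits, where d is the minimum Hamming distance between any two codewords, and five codewords of 5 bits cannot all be at distance 3 or more from each other. The sketch below draws random code books from independent fair coin flips (an assumption that only loosely mimics a randomly generated code book) and compares a 5-bit code with a 70-bit one chosen for contrast; the function names are hypothetical.

```python
import random
from itertools import combinations


def random_code_book(n_classes, n_bits, seed):
    """Draw one random bit string per class (independent fair coin flips)."""
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 1) for _ in range(n_bits)) for _ in range(n_classes)]


def min_hamming_distance(code_book):
    """Smallest number of differing bits between any two codewords."""
    return min(
        sum(a != b for a, b in zip(u, v))
        for u, v in combinations(code_book, 2)
    )


for n_bits in (5, 70):  # 5 bits corresponds to code_size=1 with 5 classes
    d = min_hamming_distance(random_code_book(n_classes=5, n_bits=n_bits, seed=12354))
    # A code with minimum distance d is guaranteed to correct (d - 1) // 2 bit errors.
    print(f"{n_bits} bits: min distance {d}, guaranteed correctable errors {max(d - 1, 0) // 2}")
```

Whatever the seed, the 5-bit code cannot guarantee the correction of even one wrong binary model, whereas longer codes leave far more separation between classes.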


**Logistic Regression with Error-Correcting Output Codes (ECOC) using GridSearchCV**

In [40]:
paramgrid = {
    "code_size": list(range(1,21))
}
ecoc_log = OutputCodeClassifier(
    estimator=model_log,
    )
grid_ecoc = GridSearchCV(
    ecoc_log,
    paramgrid,
    scoring='accuracy',
    refit=True,
    cv=5,
    return_train_score=True)

mf = grid_ecoc.fit(X_train, y_train)

In [41]:
mf.best_params_

{'code_size': 14}
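The code_size parameter is a multiplier on the number of classes rather than a count of models. Assuming the number of binary estimators is int(n_classes * code_size), as scikit-learn describes it, the sketch below works out how many binary models each candidate needs and how many are trained over the whole search.

```python
n_classes = 5
code_sizes = range(1, 21)  # same grid as paramgrid above
cv_folds = 5

# Binary estimators per candidate, assuming int(n_classes * code_size).
estimators = {cs: int(n_classes * cs) for cs in code_sizes}

search_fits = sum(estimators.values()) * cv_folds  # every candidate on every fold
refit_fits = estimators[14]                        # refit=True retrains the selected candidate
print(estimators[1], estimators[14], estimators[20])
print(search_fits + refit_fits)
```

The selected code_size=14 therefore corresponds to 70 binary logistic regressions, and the full search trains several thousand of them, which is worth keeping in mind with slower base estimators.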

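Tuning code_size does not make ECOC outperform plain logistic regression. The grid explores code_size values from 1 to 20 (that is, 5 to 100 binary models for 5 classes) and selects code_size=14, yet the resulting accuracy is marginally below the baseline. Because this OutputCodeClassifier is created without random_state, the code book, and therefore the selected code_size, may change between runs.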

In [42]:
best_model = mf.best_estimator_
y_pred = best_model.predict(X_test)
accuracy_score(y_test, y_pred)

0.5028571428571429

In [43]:
score = cross_val_score(best_model, X, y, scoring='accuracy', cv=5, n_jobs=-1)
print('Mean: %.3f Std: (%.3f)' % (mean(score), std(score)), "scores:" ,score)

Mean: 0.497 Std: (0.030) scores: [0.535 0.485 0.525 0.49  0.45 ]
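Since cross_val_score with cv=5 uses the same unshuffled stratified folds for both classifiers, the fold scores of the plain logistic regression and of the tuned ECOC model can be compared pairwise. The sketch below does so with the scores reported above; it is an informal check, not a statistical test.

```python
import statistics

baseline = [0.54, 0.475, 0.56, 0.505, 0.485]    # plain logistic regression
tuned_ecoc = [0.535, 0.485, 0.525, 0.49, 0.45]  # ECOC with the selected code_size

diffs = [b - e for b, e in zip(baseline, tuned_ecoc)]
print(f"mean difference: {statistics.mean(diffs):.3f}")
print(f"std of differences: {statistics.pstdev(diffs):.3f}")
print(f"folds where baseline wins: {sum(d > 0 for d in diffs)} of {len(diffs)}")
```

The average gap is about the same size as its spread across folds, which supports reading the two models as roughly equivalent rather than one being clearly better.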


**Support Vector Machine (SVC)**

A Support Vector Machine (SVC) model is fitted with its default parameters (no tuning) to observe how it performs without ECOC.

In [44]:
model_svcB.fit(X_train, y_train)
y_predict = model_svcB.predict(X_test)

The untuned model has an accuracy of 69%, which may be acceptable for a non-critical problem.

In [45]:
accuracy_score(y_test, y_predict)

0.6857142857142857

Across the folds, the model scores slightly higher than on the hold-out set, with an average accuracy of about 0.71. Note that the entire dataset was used for this cross-validation.

In [46]:
score = cross_val_score(model_svcB, X, y, scoring='accuracy', cv=5, n_jobs=-1)
print('Mean: %.3f Std: (%.3f)' % (mean(score), std(score)), "scores:" ,score)

Mean: 0.712 Std: (0.026) scores: [0.715 0.72  0.755 0.685 0.685]


**Support Vector Machine (SVC) with Error-Correcting Output Codes (ECOC)**

ECOC is applied using 5 SVC models.

In [47]:
ecoc_svc = OutputCodeClassifier(model_svc, code_size=1, random_state=random_seed)
ecoc_svc.fit(X_train, y_train)
y_predict = ecoc_svc.predict(X_test)

The model performs worse than the implementation without ECOC.

In [48]:
accuracy_score(y_test, y_predict)

0.4942857142857143

The average accuracy across folds is around 51%, roughly 20 points lower than the plain SVC model.

In [50]:
score = cross_val_score(ecoc_svc, X, y, scoring='accuracy', cv=5, n_jobs=-1)
print('Mean: %.3f Std: (%.3f)' % (mean(score), std(score)), "scores:" ,score)

Mean: 0.507 Std: (0.014) scores: [0.48  0.505 0.52  0.515 0.515]


**Support Vector Machine (SVC) with Error-Correcting Output Codes (ECOC) using GridSearchCV**

In [51]:
paramgrid = {
    "code_size": list(range(1,21))
}
ecoc_svc = OutputCodeClassifier(
    estimator=model_svc,
    )
grid_ecoc = GridSearchCV(
    ecoc_svc,
    paramgrid,
    scoring='accuracy',
    refit=True,
    cv=5,
    return_train_score=True)

mf = grid_ecoc.fit(X_train, y_train)

In [52]:
mf.best_params_

{'code_size': 17}

Tuning code_size does not yield a significant improvement either. With the selected code_size=17 (85 binary models for 5 classes), the performance is roughly equivalent to, and slightly below, that of the plain untuned SVC.

In [54]:
best_model = mf.best_estimator_
y_pred = best_model.predict(X_test)

In [55]:
accuracy_score(y_test, y_pred)

0.6742857142857143

In [56]:
score = cross_val_score(best_model, X, y, scoring='accuracy', cv=5, n_jobs=-1)
print('Mean: %.3f Std: (%.3f)' % (mean(score), std(score)), "scores:" ,score)

Mean: 0.697 Std: (0.026) scores: [0.69  0.68  0.745 0.67  0.7  ]
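The figures reported across the sections above can be gathered into a single ranking. The sketch below only restates the hold-out and cross-validation accuracies already shown, rounded to three decimals, and sorts them by cross-validation mean.

```python
# (hold-out accuracy, 5-fold CV mean) as reported in the cells above.
results = {
    "LogReg": (0.509, 0.513),
    "LogReg + ECOC (code_size=1)": (0.374, 0.376),
    "LogReg + ECOC (code_size=14)": (0.503, 0.497),
    "SVC": (0.686, 0.712),
    "SVC + ECOC (code_size=1)": (0.494, 0.507),
    "SVC + ECOC (code_size=17)": (0.674, 0.697),
}

ranking = sorted(results.items(), key=lambda kv: kv[1][1], reverse=True)
for name, (holdout, cv_mean) in ranking:
    print(f"{name:<30} hold-out={holdout:.3f}  cv={cv_mean:.3f}")
```

For both base estimators the order is the same: the plain model first, the tuned ECOC variant close behind, and ECOC with code_size=1 well below.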


Although ECOC did not improve on the plain models in this experiment, it may still be worth evaluating for traditional (non-ensemble) models on other problems, as it can improve their performance in some cases.