# Exercise 5
Same instructions as in Exercise 4, except that this time a classification has to
be performed, and the dataset is stored in FTML/Project/data/classification/.
Your objective should be to obtain a mean accuracy above 0.85 on the test set
(same remark about the test set).
Hint: a solution, with the correct hyperparameters, exists in scikit-learn among
the following classes:
- linear_model.LogisticRegression
- svm.SVC
- neighbors.KNeighborsClassifier
- neural_network.MLPClassifier
- ensemble.AdaBoostClassifier

We first import the libraries we will use, load the data, print some information about it, and finally define our target accuracy:

In [38]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import optuna

import warnings
warnings.filterwarnings('ignore')

plt.style.use('default')
sns.set_palette("husl")

X_train = np.load("../data/classification/X_train.npy")
y_train = np.load("../data/classification/y_train.npy")
X_test = np.load("../data/classification/X_test.npy")
y_test = np.load("../data/classification/y_test.npy")

print(f"dataset shape: X_train {X_train.shape}, y_train {y_train.shape}")
print(f"test set shape: X_test {X_test.shape}, y_test {y_test.shape}")
print(f"unique classes: {np.unique(y_train)}")

TARGET_ACCURACY = 0.85  # as stated in the assignment

dataset shape: X_train (2000, 30), y_train (2000,)
test set shape: X_test (2000, 30), y_test (2000,)
unique classes: [0 1]
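The output lists the classes but not how often each one occurs. Because accuracy is the target metric, the class proportions give a useful reference point: always predicting the majority class already reaches that class's proportion as accuracy. Here is a minimal sketch of that check, using hypothetical labels rather than the project's `y_train`; the helper name `class_proportions` is an assumption introduced for illustration.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class among the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical labels, for illustration only (not the project's data).
labels = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

proportions = class_proportions(labels)
# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(proportions.values())

print(f"class proportions: {proportions}")
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

For a NumPy array such as `y_train`, the same function can be called on `y_train.tolist()`.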


## Models we will use

We define the different models that we will test. Since our objective is classification, we use classification models.

In [39]:
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=2000),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'MLP Classifier': MLPClassifier(random_state=42, max_iter=2000),
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'SVC': SVC(random_state=42, probability=True)
}

best_models = {}
cv_results = {}


## Hyperparameter tuning

We use Optuna for hyperparameter tuning, running 100 trials for each model. Each trial is scored by 5-fold cross-validation (CV) on the training data: the mean CV accuracy is the value Optuna maximizes, so parameter settings that give a lower mean accuracy are discarded. Once every model has been tuned and refit on the full training set, we evaluate them all on the test data.
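The idea of scoring each candidate by its mean cross-validation accuracy and keeping the best one can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's actual procedure: the data is a hypothetical 1-D toy set, the helpers `k_fold_indices`, `knn_predict` and `cv_accuracy` are made up for the example, the folds are plain contiguous splits (scikit-learn stratifies them for classifiers), and the candidates are tried exhaustively rather than proposed by Optuna's sampler.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors closest 1-D training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cv_accuracy(xs, ys, n_neighbors, k=5):
    """Mean validation accuracy over k folds: the value a trial reports."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct = sum(
            knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
            for i in fold
        )
        scores.append(correct / len(fold))
    return mean(scores)


# Hypothetical toy data: label 1 when x > 0, with roughly 10% of labels flipped.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [int(x > 0) if rng.random() > 0.1 else int(x <= 0) for x in xs]

# Search loop: score every candidate and keep the best mean CV accuracy.
best_k, best_score = None, -1.0
for n_neighbors in range(3, 16, 2):
    score = cv_accuracy(xs, ys, n_neighbors)
    if score > best_score:
        best_k, best_score = n_neighbors, score

print(f"best n_neighbors: {best_k}, mean CV accuracy: {best_score:.4f}")
```

The test set plays no role in this loop; it is only used once the best configuration has been chosen.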

In [40]:
def optimize_logistic_regression(trial):
    penalty = trial.suggest_categorical('penalty', ['l1', 'l2', 'elasticnet'])
    solver = trial.suggest_categorical('solver', ['liblinear', 'saga'])
    
    params = {
        'C': trial.suggest_float('C', 1e-3, 100, log=True),
        'penalty': penalty,
        'solver': solver,
        'random_state': 42,
        'max_iter': 2000
    }
    
    if penalty == 'elasticnet':
        # Only 'saga' supports the elastic-net penalty, so override the sampled solver.
        params['solver'] = 'saga'
        params['l1_ratio'] = trial.suggest_float('l1_ratio', 0.1, 0.9)
    # 'l1' and 'l2' work with both sampled solvers ('liblinear' and 'saga').
    
    model = LogisticRegression(**params) 
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    return scores.mean()

lr_study = optuna.create_study(direction='maximize', study_name="LogisticRegression_optimization")
lr_study.optimize(optimize_logistic_regression, n_trials=100)

print(f"Best Logistic Regression params: {lr_study.best_params}")
print(f"Best Logistic Regression cross-validation score: {lr_study.best_value:.4f}")

# Rebuild the winning configuration exactly as it was evaluated in the objective:
# best_params holds only the sampled values, not the fixed settings.
best_lr_params = lr_study.best_params.copy()
best_lr_params.update({'random_state': 42, 'max_iter': 2000})

if best_lr_params['penalty'] == 'elasticnet':
    # The objective forces 'saga' for elastic net, whatever solver was sampled.
    best_lr_params['solver'] = 'saga'

best_lr_model = LogisticRegression(**best_lr_params)
best_lr_model.fit(X_train, y_train)

best_models['Logistic Regression'] = best_lr_model
cv_results['Logistic Regression'] = lr_study.best_value



[I 2025-07-05 02:13:43,645] A new study created in memory with name: LogisticRegression_optimization
[I 2025-07-05 02:13:43,722] Trial 0 finished with value: 0.7154999999999999 and parameters: {'penalty': 'l1', 'solver': 'saga', 'C': 0.2914552612272273}. Best is trial 0 with value: 0.7154999999999999.
[I 2025-07-05 02:13:43,766] Trial 1 finished with value: 0.7200000000000001 and parameters: {'penalty': 'l1', 'solver': 'saga', 'C': 0.010532256457261163}. Best is trial 1 with value: 0.7200000000000001.
[I 2025-07-05 02:13:43,791] Trial 2 finished with value: 0.7155 and parameters: {'penalty': 'l1', 'solver': 'liblinear', 'C': 0.045741220666837674}. Best is trial 1 with value: 0.7200000000000001.
[I 2025-07-05 02:13:43,970] Trial 3 finished with value: 0.712 and parameters: {'penalty': 'elasticnet', 'solver': 'liblinear', 'C': 0.11224410523774374, 'l1_ratio': 0.15402281101340576}. Best is trial 1 with value: 0.7200000000000001.
[I 2025-07-05 02:13:44,000] Trial 4 finished with value: 0.7

Best Logistic Regression params: {'penalty': 'l1', 'solver': 'saga', 'C': 0.0212809980245425}
Best Logistic Regression cross-validation score: 0.7230


In [None]:

def optimize_knn(trial):
    params = {
        'n_neighbors': trial.suggest_int('n_neighbors', 3, 15),
        'weights': trial.suggest_categorical('weights', ['uniform', 'distance']),
        'metric': trial.suggest_categorical('metric', ['euclidean', 'manhattan', 'minkowski'])
    }
    
    model = KNeighborsClassifier(**params) 
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    return scores.mean()

knn_study = optuna.create_study(direction='maximize', study_name="KNN_optimization")
knn_study.optimize(optimize_knn, n_trials=100)

print(f"Best K-Nearest Neighbors params: {knn_study.best_params}")
print(f"Best K-Nearest Neighbors cross-validation score: {knn_study.best_value:.4f}")

best_knn_params = knn_study.best_params
best_knn_model = KNeighborsClassifier(**best_knn_params)
best_knn_model.fit(X_train, y_train)

best_models['K-Nearest Neighbors'] = best_knn_model
cv_results['K-Nearest Neighbors'] = knn_study.best_value


[I 2025-07-05 02:13:50,446] A new study created in memory with name: KNN_optimization
[I 2025-07-05 02:13:50,557] Trial 0 finished with value: 0.762 and parameters: {'n_neighbors': 8, 'weights': 'distance', 'metric': 'manhattan'}. Best is trial 0 with value: 0.762.



[I 2025-07-05 02:13:50,659] Trial 1 finished with value: 0.76 and parameters: {'n_neighbors': 3, 'weights': 'uniform', 'metric': 'minkowski'}. Best is trial 0 with value: 0.762.
[I 2025-07-05 02:13:50,766] Trial 2 finished with value: 0.769 and parameters: {'n_neighbors': 10, 'weights': 'uniform', 'metric': 'minkowski'}. Best is trial 2 with value: 0.769.
[I 2025-07-05 02:13:50,874] Trial 3 finished with value: 0.7455 and parameters: {'n_neighbors': 4, 'weights': 'uniform', 'metric': 'euclidean'}. Best is trial 2 with value: 0.769.
[I 2025-07-05 02:13:51,001] Trial 4 finished with value: 0.7779999999999999 and parameters: {'n_neighbors': 9, 'weights': 'uniform', 'metric': 'minkowski'}. Best is trial 4 with value: 0.7779999999999999.
[I 2025-07-05 02:13:51,119] Trial 5 finished with value: 0.76 and parameters: {'n_neighbors': 3, 'weights': 'uniform', 'metric': 'euclidean'}. Best is trial 4 with value: 0.7779999999999999.
[I 2025-07-05 02:13:51,233] Trial 6 finished with value: 0.7775 an

Best K-Nearest Neighbors params: {'n_neighbors': 14, 'weights': 'distance', 'metric': 'euclidean'}
Best K-Nearest Neighbors cross-validation score: 0.7860


In [42]:
def optimize_mlp(trial):
    params = {
        'hidden_layer_sizes': trial.suggest_categorical('hidden_layer_sizes', [(50,), (100,), (50, 50), (100, 50)]),
        'alpha': trial.suggest_float('alpha', 1e-4, 1e-1, log=True),
        'learning_rate': trial.suggest_categorical('learning_rate', ['constant', 'adaptive']),
        'random_state': 42,
        'max_iter': 2000
    }
    
    model = MLPClassifier(**params)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    return scores.mean()

mlp_study = optuna.create_study(direction='maximize', study_name="MLP_optimization")
mlp_study.optimize(optimize_mlp, n_trials=100)

print(f"Best MLP Classifier params: {mlp_study.best_params}")
print(f"Best MLP Classifier cross-validation score: {mlp_study.best_value:.4f}")

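# Re-apply the fixed settings used in the objective; best_params holds only the tuned values.
best_mlp_params = {**mlp_study.best_params, 'random_state': 42, 'max_iter': 2000}
best_mlp_model = MLPClassifier(**best_mlp_params)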
best_mlp_model.fit(X_train, y_train)

best_models['MLP Classifier'] = best_mlp_model
cv_results['MLP Classifier'] = mlp_study.best_value


[I 2025-07-05 02:14:01,471] A new study created in memory with name: MLP_optimization
[I 2025-07-05 02:14:08,333] Trial 0 finished with value: 0.749 and parameters: {'hidden_layer_sizes': (100, 50), 'alpha': 0.0010966731262273496, 'learning_rate': 'constant'}. Best is trial 0 with value: 0.749.
[I 2025-07-05 02:14:20,547] Trial 1 finished with value: 0.7195 and parameters: {'hidden_layer_sizes': (100,), 'alpha': 0.01597521465885276, 'learning_rate': 'constant'}. Best is trial 0 with value: 0.749.
[I 2025-07-05 02:14:30,741] Trial 2 finished with value: 0.7165 and parameters: {'hidden_layer_sizes': (50,), 'alpha': 0.007222985707526683, 'learning_rate': 'adaptive'}. Best is trial 0 with value: 0.749.
[I 2025-07-05 02:14:42,585] Trial 3 finished with value: 0.7154999999999999 and parameters: {'hidden_layer_sizes': (100,), 'alpha': 0.05174689363765446, 'learning_rate': 'constant'}. Best is trial 0 with value: 0.749.
[I 2025-07-05 02:14:47,364] Trial 4 finished with value: 0.7335 and parame

Best MLP Classifier params: {'hidden_layer_sizes': (100, 50), 'alpha': 0.05938953197404983, 'learning_rate': 'adaptive'}
Best MLP Classifier cross-validation score: 0.7565


In [43]:
def optimize_adaboost(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),
        'algorithm': 'SAMME',
        'random_state': 42
    }
    
    model = AdaBoostClassifier(**params)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    return scores.mean()

ada_study = optuna.create_study(direction='maximize', study_name="AdaBoost_optimization")
ada_study.optimize(optimize_adaboost, n_trials=100)

print(f"Best AdaBoost params: {ada_study.best_params}")
print(f"Best AdaBoost cross-validation score: {ada_study.best_value:.4f}")

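# Re-apply the fixed settings used in the objective; best_params holds only the tuned values.
best_ada_params = {**ada_study.best_params, 'algorithm': 'SAMME', 'random_state': 42}
best_ada_model = AdaBoostClassifier(**best_ada_params)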
best_ada_model.fit(X_train, y_train)

best_models['AdaBoost'] = best_ada_model
cv_results['AdaBoost'] = ada_study.best_value


[I 2025-07-05 02:24:34,355] A new study created in memory with name: AdaBoost_optimization
[I 2025-07-05 02:24:37,195] Trial 0 finished with value: 0.701 and parameters: {'n_estimators': 130, 'learning_rate': 0.6279749248412576}. Best is trial 0 with value: 0.701.
[I 2025-07-05 02:24:40,190] Trial 1 finished with value: 0.7115 and parameters: {'n_estimators': 137, 'learning_rate': 0.4705483456215528}. Best is trial 1 with value: 0.7115.
[I 2025-07-05 02:24:43,903] Trial 2 finished with value: 0.7154999999999999 and parameters: {'n_estimators': 170, 'learning_rate': 0.31031814178700123}. Best is trial 2 with value: 0.7154999999999999.
[I 2025-07-05 02:24:47,675] Trial 3 finished with value: 0.7034999999999999 and parameters: {'n_estimators': 171, 'learning_rate': 0.5187507064501229}. Best is trial 2 with value: 0.7154999999999999.
[I 2025-07-05 02:24:49,055] Trial 4 finished with value: 0.71 and parameters: {'n_estimators': 63, 'learning_rate': 0.555894496824633}. Best is trial 2 with v

Best AdaBoost params: {'n_estimators': 86, 'learning_rate': 0.4103670333149539}
Best AdaBoost cross-validation score: 0.7200


In [44]:
def optimize_svc(trial):
    kernel = trial.suggest_categorical('kernel', ['rbf', 'poly', 'sigmoid'])
    params = {
        'C': trial.suggest_float('C', 1e-4, 1e-2, log=True),
        'kernel': kernel,
        'gamma': trial.suggest_float('gamma', 0.1, 0.2, log=True),
        'degree': trial.suggest_int('degree', 2, 5) if kernel == 'poly' else 3,
        'probability': True,
        'random_state': 42
    }
    
    model = SVC(**params)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    return scores.mean()

svc_study = optuna.create_study(direction='maximize', study_name="SVC_optimization")
svc_study.optimize(optimize_svc, n_trials=100)

print(f"Best SVC params: {svc_study.best_params}")
print(f"Best SVC cross-validation score: {svc_study.best_value:.4f}")

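# Re-apply the fixed settings used in the objective; best_params holds only the tuned values.
best_svc_params = {**svc_study.best_params, 'probability': True, 'random_state': 42}
best_svc_model = SVC(**best_svc_params)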
best_svc_model.fit(X_train, y_train)

best_models['SVC'] = best_svc_model
cv_results['SVC'] = svc_study.best_value


[I 2025-07-05 02:29:24,451] A new study created in memory with name: SVC_optimization
[I 2025-07-05 02:29:26,111] Trial 0 finished with value: 0.6475 and parameters: {'kernel': 'sigmoid', 'C': 0.008025045610108538, 'gamma': 0.16138916712028184}. Best is trial 0 with value: 0.6475.
[I 2025-07-05 02:29:27,485] Trial 1 finished with value: 0.5095 and parameters: {'kernel': 'rbf', 'C': 0.0005960157647714102, 'gamma': 0.1298855502502769}. Best is trial 0 with value: 0.6475.
[I 2025-07-05 02:29:28,609] Trial 2 finished with value: 0.7444999999999999 and parameters: {'kernel': 'poly', 'C': 0.002650676766349632, 'gamma': 0.1754762515483813, 'degree': 5}. Best is trial 2 with value: 0.7444999999999999.
[I 2025-07-05 02:29:29,295] Trial 3 finished with value: 0.753 and parameters: {'kernel': 'poly', 'C': 0.0003118014689234045, 'gamma': 0.19234693954050996, 'degree': 3}. Best is trial 3 with value: 0.753.
[I 2025-07-05 02:29:31,043] Trial 4 finished with value: 0.669 and parameters: {'kernel': 's

Best SVC params: {'kernel': 'poly', 'C': 0.004253297057026472, 'gamma': 0.14869220535181446, 'degree': 3}
Best SVC cross-validation score: 0.7810


## Model validation on test data

Finally, we evaluate the tuned models on the test data and check whether the mean accuracy exceeds 0.85.
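The mean accuracy used here is the fraction of test samples whose predicted label matches the true label. As a minimal sketch of that computation and of the comparison with the target, here is a pure-Python version run on hypothetical labels; the function name `accuracy` and the sample lists are assumptions for illustration, not the project's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


TARGET = 0.85  # threshold from the exercise statement

# Hypothetical labels, for illustration only.
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 0]

acc = accuracy(y_true, y_pred)  # 9 of the 10 predictions are correct
print(f"accuracy = {acc:.2f}, meets target: {acc > TARGET}")
```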

In [45]:
for name, score in cv_results.items():
    print(f"{name:25s} | CV Score: {score:.4f}")

test_accuracies = {}
print(f"\nEvaluating models on test set (Target: {TARGET_ACCURACY}):")
print("-" * 50)

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred)
    test_accuracies[name] = test_acc
    
    status = "✓" if test_acc > TARGET_ACCURACY else "✗"
    print(f"{name:25s} | Test Accuracy: {test_acc:.4f} {status}")

models_above_target = [name for name, score in test_accuracies.items() if score > TARGET_ACCURACY]
print(f"\nModels achieving target accuracy ({TARGET_ACCURACY}) on TEST SET:")
if models_above_target:
    for model in models_above_target:
        print(f"{model}: {test_accuracies[model]:.4f}")
else:
    print("None")


print(f"\nBest test performer: {max(test_accuracies.keys(), key=lambda x: test_accuracies[x])}")
print(f"Best test accuracy: {max(test_accuracies.values()):.4f}")


Logistic Regression       | CV Score: 0.7230
K-Nearest Neighbors       | CV Score: 0.7860
MLP Classifier            | CV Score: 0.7565
AdaBoost                  | CV Score: 0.7200
SVC                       | CV Score: 0.7810

Evaluating models on test set (Target: 0.85):
--------------------------------------------------
Logistic Regression       | Test Accuracy: 0.7465 ✗
K-Nearest Neighbors       | Test Accuracy: 0.7925 ✗
MLP Classifier            | Test Accuracy: 0.7600 ✗
AdaBoost                  | Test Accuracy: 0.7420 ✗
SVC                       | Test Accuracy: 0.9040 ✓

Models achieving target accuracy (0.85) on TEST SET:
SVC: 0.9040

Best test performer: SVC
Best test accuracy: 0.9040
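The cross-validation and test scores printed above can also be compared model by model. The short sketch below copies the reported values into two dictionaries and computes the difference between test accuracy and mean CV accuracy; the variable names are assumptions for illustration.

```python
# Scores copied from the output above.
cv_scores = {
    'Logistic Regression': 0.7230,
    'K-Nearest Neighbors': 0.7860,
    'MLP Classifier': 0.7565,
    'AdaBoost': 0.7200,
    'SVC': 0.7810,
}
test_scores = {
    'Logistic Regression': 0.7465,
    'K-Nearest Neighbors': 0.7925,
    'MLP Classifier': 0.7600,
    'AdaBoost': 0.7420,
    'SVC': 0.9040,
}

# Difference between test accuracy and mean CV accuracy for each model.
gaps = {name: round(test_scores[name] - cv_scores[name], 4) for name in cv_scores}

# Print the models from the largest gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:25s} | test - CV: {gap:+.4f}")
```

A gap as large as SVC's means the CV score alone would not have predicted its test result, so it is worth re-checking before relying on the figure.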


## Conclusion

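SVC is the only model that exceeds the 0.85 target, with a test accuracy of 0.9040 (polynomial kernel, degree 3). The other four models score between 0.7420 and 0.7925 on the test set. SVC's test accuracy is also well above its cross-validation score of 0.7810, whereas the test and CV scores of the other models differ by less than 0.03.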