# Lab exam of block (Group 3E1 Turn 2)

In this exam, we will use one of the classification tasks available in OpenML. More precisely, we use a 10% stratified subsample of the [*KDDCup99*](https://www.openml.org/search?type=data&id=1113) task (data_id=1113). The goal is to predict whether a connection is normal or an attack, where each attack is labeled with exactly one specific attack type. The input features comprise basic, content, and traffic features.

Below is a baseline result obtained with the logistic regression classifier using default parameters, with 9% of the data used for training and 1% for testing (random_state=23).

In [2]:
import warnings
import numpy as np

warnings.filterwarnings("ignore")  # silence library warnings (e.g. convergence warnings)
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data_id = 1113
train_size = 0.09
test_size = 0.01
X, y = fetch_openml(data_id=data_id, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, test_size=test_size, random_state=23)
# Default parameter values: tol=1e-4, C=1.0, solver='lbfgs', max_iter=100
clf = LogisticRegression(random_state=23).fit(X_train, y_train)
print(f'Test error: {(1 - accuracy_score(y_test, clf.predict(X_test)))*100:5.1f}%')

Test error:   1.8%


### Exercise 1
Applying the logistic regression classifier with default parameter values except for the solver, explore different solvers to find the best one. Report the classification error rate on the training and test sets. Use random_state=23.

In [6]:
for solver in ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']:
    clf = LogisticRegression(random_state=23, solver=solver, max_iter=100).fit(X_train, y_train)
    err_test = 1 - accuracy_score(y_test, clf.predict(X_test))
    print(f"Test error after training with the solver {solver!s}: {err_test:.1%}")

Test error after training with the solver lbfgs: 1.8%
Test error after training with the solver liblinear: 0.6%
Test error after training with the solver newton-cg: 0.5%
Test error after training with the solver newton-cholesky: 0.1%
Test error after training with the solver sag: 29.6%
Test error after training with the solver saga: 29.6%


The best solver for the logistic regression classifier on this data set is newton-cholesky, because it gives the lowest test error of all the solvers tried (0.1%). In other words, it assigns 99.9% of the test samples to their correct class.
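The exercise asks for the error rate on both the training and the test set, while the loop above prints only the test error. The following is a minimal sketch of how both could be computed. The names `error_rate` and `report_errors` are hypothetical helpers, not part of scikit-learn, and the only assumption about `clf` is that it is an already fitted estimator exposing `predict`.

```python
def error_rate(clf, X, y):
    """Fraction of samples in (X, y) that the fitted classifier mislabels."""
    predictions = clf.predict(X)
    mistakes = sum(1 for p, t in zip(predictions, y) if p != t)
    return mistakes / len(y)


def report_errors(clf, X_train, y_train, X_test, y_test):
    """Return (training error, test error) for an already fitted classifier."""
    return error_rate(clf, X_train, y_train), error_rate(clf, X_test, y_test)
```

Inside the solver loop, `report_errors(clf, X_train, y_train, X_test, y_test)` could be called right after `fit`, and both values printed side by side.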

### Exercise 2
Applying the logistic regression classifier with default parameter values except for the parameter C and the best solver from exercise 1, explore the values of the parameter C in logarithmic scale to determine an optimal value. Report classification error rate on training and test sets. Use random_state=23. 

In [8]:
for C in (1e-2, 1e-1, 1, 1e1, 1e2):
    clf = LogisticRegression(C=C, random_state=23, solver='newton-cholesky', max_iter=10000).fit(X_train, y_train)
    err_test = 1 - accuracy_score(y_test, clf.predict(X_test))
    print(f"Test error with C {C:g}: {err_test:.3%}")

Test error with C 0.01: 0.465%
Test error with C 0.1: 0.283%
Test error with C 1: 0.121%
Test error with C 10: 0.081%
Test error with C 100: 0.061%


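Among the values explored, the best is C = 100, since it yields the test error closest to zero (0.061%). Because C is the inverse of the regularization strength, this is the weakest regularization in the grid, not the strongest. The test error is still decreasing at the upper end of the grid, so even larger values of C might reduce it further.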
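In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so moving up the logarithmic grid weakens the penalty. The sketch below builds such a grid and prints the corresponding value of 1/C. `log_grid` is a hypothetical helper introduced only for illustration.

```python
def log_grid(low_exp, high_exp, base=10.0):
    """Values base**low_exp, ..., base**high_exp, one per integer exponent."""
    return [base ** e for e in range(low_exp, high_exp + 1)]


for C in log_grid(-2, 2):
    # Larger C means a smaller 1/C, i.e. a weaker regularization penalty.
    print(f"C = {C:g}  ->  1/C = {1 / C:g}")
```

Extending the grid, for example with `log_grid(-2, 4)`, would show whether the test error keeps improving beyond C = 100.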

### Exercise 3
Applying the logistic regression classifier with default parameter values except for the maximum number of iterations, and using the best solver and the optimal value of C from the previous exercises, explore the maximum number of iterations in logarithmic scale to determine an optimal value. Report the classification error rate on the training and test sets. Use random_state=23.

In [14]:
for max_iter in (2, 4, 8, 16, 32):
    clf = LogisticRegression(C=100, solver='newton-cholesky', random_state=23, max_iter=max_iter).fit(X_train, y_train)
    err_test = 1 - accuracy_score(y_test, clf.predict(X_test))
    print(f"Test error with max_iter {max_iter}: {err_test:.10%}")

Test error with max_iter 2: 0.3238210888%
Test error with max_iter 4: 0.1821493625%
Test error with max_iter 8: 0.1214329083%
Test error with max_iter 16: 0.0607164542%
Test error with max_iter 32: 0.0607164542%


Testing powers of two (a base-2 logarithmic scale, which gives finer resolution at small iteration counts), we can see that the test error decreases up to max_iter = 16 and stays at 0.061% for max_iter = 32. With our C value and our solver, 16 is therefore the smallest explored value that reaches the lowest error. Fewer iterations leave a noticeably higher error, while more iterations do not improve the accuracy of the classifier any further.
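The choice of iteration budget can be made explicit by picking the smallest setting whose error matches the minimum. This is a small sketch using the test errors printed above. `smallest_best` and its `tol` parameter are hypothetical names introduced for illustration.

```python
# Test errors (in %) copied from the output of the cell above.
test_error = {2: 0.3238210888, 4: 0.1821493625, 8: 0.1214329083,
              16: 0.0607164542, 32: 0.0607164542}


def smallest_best(errors, tol=1e-9):
    """Smallest setting whose error is within tol of the minimum error."""
    best = min(errors.values())
    return min(k for k, err in errors.items() if err - best <= tol)


print(smallest_best(test_error))
```

For the values above this selects 16, the point after which doubling the number of iterations no longer changes the test error.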