## Introduction
In this notebook, hyperparameters are tuned for two models, an SVM and a logistic regression classifier, and the performance of both is evaluated on the test set.

### Imports
The notebook starts with the required imports and adds the project's `src` directory to the import path.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
from pathlib import Path

project_root = Path.cwd()
while not (project_root / "src").exists():
    project_root = project_root.parent

sys.path.append(str(project_root / "src"))

from model_selection import grid_search_cv
from models import SVM, LogisticRegression

### Notebook Parameters

In [None]:
METRICS = 'f1'
RUNMODE = 'evaluation'  # set to 'training' to run the computationally expensive grid searches guarded by this flag

### Data Loading
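The preprocessed train and test splits are loaded, and the `quality` score is binarized: samples with a quality of 6 or higher are labeled 1 (good), all others -1 (bad).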

In [None]:
X_train_df = pd.read_csv('../data/processed/X_train.csv')
y_train_df = pd.read_csv('../data/processed/y_train.csv')
X_test_df = pd.read_csv('../data/processed/X_test.csv')
y_test_df = pd.read_csv('../data/processed/y_test.csv')

y_train = np.where(y_train_df['quality'] >= 6, 1, -1)
y_test = np.where(y_test_df['quality'] >= 6, 1, -1)

X_train = X_train_df.to_numpy()
X_test = X_test_df.to_numpy()
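As a small illustration of the label encoding used in the cell above, the sketch below applies the same `np.where` threshold to a handful of made-up quality scores and counts the resulting classes. The `quality` values are invented for the example and are not taken from the dataset.

```python
import numpy as np

# Made-up quality scores, used only to illustrate the encoding
quality = np.array([3, 5, 6, 7, 5, 8])

# Same rule as in the notebook: quality >= 6 -> 1, otherwise -1
labels = np.where(quality >= 6, 1, -1)

# Count how many samples fall into each class
values, counts = np.unique(labels, return_counts=True)
print(labels)
print(dict(zip(values.tolist(), counts.tolist())))
```

Running the same `np.unique` call on `y_train` would show the class balance of the real data.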

# Model Evaluation

## Hyperparameter Tuning
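No single grid can cover every possible parameter combination, so the search proceeds in rounds: a coarse grid first locates a promising region, and a finer grid then explores its neighborhood. Each combination is scored with k-fold cross-validation on the training set.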
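The project's `grid_search_cv` lives in `src/model_selection` and is not shown in this notebook. The following is only a minimal sketch of how such a search could work, assuming one fit per parameter combination per fold and a score function called as `score_fn(predictions, y_true)`. The names `k_fold_indices`, `simple_grid_search`, `ThresholdClassifier` and `accuracy` are hypothetical and exist only for this illustration.

```python
import itertools
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k roughly equal, shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def simple_grid_search(model_class, param_grid, X, y, score_fn, k=5):
    """Return the parameter combination with the highest mean validation score."""
    names = list(param_grid)
    best_params, best_score = None, -np.inf
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for train_idx, val_idx in k_fold_indices(len(X), k):
            model = model_class(**params)
            model.fit(X[train_idx], y[train_idx])
            scores.append(score_fn(model.predict(X[val_idx]), y[val_idx]))
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

class ThresholdClassifier:
    """Toy model: predicts 1 when the first feature exceeds a fixed threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def fit(self, X, y):
        return self  # nothing to learn in this toy example
    def predict(self, X):
        return np.where(X[:, 0] > self.threshold, 1, -1)

def accuracy(predictions, y_true):
    return float(np.mean(predictions == y_true))

# Toy data whose true decision boundary sits at 0.5
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.where(X[:, 0] > 0.5, 1, -1)

best_params, best_score = simple_grid_search(
    ThresholdClassifier, {'threshold': [0.0, 0.5, 1.0]}, X, y, accuracy, k=5
)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes times the number of folds, which is why a coarse pass followed by a narrower one is usually cheaper than one dense grid.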

### SVM
For the SVM, two hyperparameters are tuned: the number of iterations (*n_iters*) and the regularization parameter lambda (*lambda_param*). The number of cross-validation folds is typically chosen between 5 and 10; 5 folds are used here.

In [None]:
svm_param_grid = {
        'n_iters': [1000, 2000, 3000, 4000, 5000, 6000, 7000],
        'lambda_param' : [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
    }

svm_best_params, svm_best_metrics = grid_search_cv(SVM, svm_param_grid, X_train, y_train, cv=5, scoring=METRICS)
print(f'SVM best parameter: {svm_best_params}')
print(f'SVM best metrics: {svm_best_metrics}')

The optimal hyperparameters identified are *n_iters: 2000* and *lambda_param: 0.01*. A refined search will now be conducted within the neighborhood of these parameters.
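One way to build such a neighborhood is to scale the best value by a few factors around 1. The sketch below does this for *n_iters*; the helper name `neighborhood` and the chosen factors are assumptions made for illustration, not part of the project code.

```python
def neighborhood(center, factors):
    """Scale a best value by each factor to build a finer grid around it."""
    return [center * f for f in factors]

# Factors chosen so the grid spans 75% to 125% of the best n_iters
n_iters_grid = [int(v) for v in neighborhood(2000, [0.75, 0.875, 1.0, 1.125, 1.25])]
print(n_iters_grid)
```

For parameters searched on a logarithmic scale, such as a regularization strength, multiplicative factors like these are usually a better fit than fixed additive steps.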

In [None]:
svm_param_grid = {
        'n_iters': [1500, 1750, 2000, 2250, 2500],
        'lambda_param' : [5e-1, 3e-1, 1e-1, 9e-2, 7e-2]
    }

svm_best_params, svm_best_metrics = grid_search_cv(SVM, svm_param_grid, X_train, y_train, cv=5, scoring=METRICS)
print(f'SVM best parameter: {svm_best_params}')
print(f'SVM best metrics: {svm_best_metrics}')

In [None]:
svm_n_iters = svm_best_params['n_iters']
svm_lambda_param = svm_best_params['lambda_param']

### Logistic Regression
As with the SVM, the parameters include *n_iters* and *lambda_param*; this model additionally has a learning rate parameter (*learning_rate*).

In [None]:
if RUNMODE == 'training':
    lr_param_grid = {
            'n_iters': [1, 2, 5, 10, 20],
            'lambda_param' : [1e-1, 1e-2, 1e-3],
            'learning_rate' : [1e-1, 1e-2, 1e-3]
        }

    lr_best_params, lr_best_metrics = grid_search_cv(LogisticRegression, lr_param_grid, X_train, y_train, cv=5, scoring=METRICS)
    print(f'Logistic Regression best parameter: {lr_best_params}')
    print(f'Logistic Regression best metrics: {lr_best_metrics}')

The best parameters are *n_iters: 5*, *lambda_param: 0.001* and *learning_rate: 0.01*. A finer grid around these values is searched next.

In [None]:
if RUNMODE == 'training':
    lr_param_grid = {
            'n_iters': [3, 4, 5, 6, 7],
            'lambda_param' : [5e-3, 1e-3, 5e-4],
            'learning_rate' : [5e-2, 1e-2, 5e-3]
        }

    lr_best_params, lr_best_metrics = grid_search_cv(LogisticRegression, lr_param_grid, X_train, y_train, cv=5, scoring=METRICS)
    print(f'Logistic Regression best parameter: {lr_best_params}')
    print(f'Logistic Regression best metrics: {lr_best_metrics}')

It is notable that the model does not necessarily tend toward high iteration values, indicating that convergence likely occurs rapidly. This phenomenon will be more readily observable through the examination of learning curves.

Because the searches are time-consuming, the selected hyperparameters are hard-coded for `RUNMODE = 'evaluation'`; with `RUNMODE = 'training'` the values are taken from the grid search, which remains fully reproducible. *(1m 31s and 54s)*
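To put the search cost in perspective, the sketch below counts the model fits implied by the four grids used so far. It assumes one fit per parameter combination per fold and no final refit, which may not match the actual `grid_search_cv` implementation; `count_fits` is a hypothetical helper.

```python
from math import prod

def count_fits(param_grid, cv=5):
    """Number of model fits: one per parameter combination per fold."""
    return prod(len(values) for values in param_grid.values()) * cv

# Grids copied from the cells above
grids = {
    'SVM coarse': {
        'n_iters': [1000, 2000, 3000, 4000, 5000, 6000, 7000],
        'lambda_param': [1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
    },
    'SVM refined': {
        'n_iters': [1500, 1750, 2000, 2250, 2500],
        'lambda_param': [5e-1, 3e-1, 1e-1, 9e-2, 7e-2],
    },
    'LR coarse': {
        'n_iters': [1, 2, 5, 10, 20],
        'lambda_param': [1e-1, 1e-2, 1e-3],
        'learning_rate': [1e-1, 1e-2, 1e-3],
    },
    'LR refined': {
        'n_iters': [3, 4, 5, 6, 7],
        'lambda_param': [5e-3, 1e-3, 5e-4],
        'learning_rate': [5e-2, 1e-2, 5e-3],
    },
}

fits = {name: count_fits(grid) for name, grid in grids.items()}
for name, n in fits.items():
    print(f'{name}: {n} fits')
```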

In [None]:
if RUNMODE == 'training':
    lr_n_iters = lr_best_params['n_iters']
    lr_lambda_param = lr_best_params['lambda_param']
    lr_learning_rate = lr_best_params['learning_rate']
else: 
    lr_n_iters = 5
    lr_lambda_param = 5e-4
    lr_learning_rate = 5e-3

## Learning Curves
Learning curves show how and when each algorithm converges. Here, each point on a curve is the F1-score of a model trained from scratch with a given *n_iters*, measured on both the training and the test set.

### Helper Functions

In [None]:
def calculate_f1(predictions, y_true):
    tp = np.sum((predictions == 1) & (y_true == 1))
    fp = np.sum((predictions == 1) & (y_true == -1))
    fn = np.sum((predictions == -1) & (y_true == 1))
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return f1

def plot_learning_curve(model_class, X_train, y_train, X_test, y_test, iterations_list, **model_kwargs):
    
    train_scores = []
    test_scores = []
    
    for n_iter in iterations_list:
        model = model_class(n_iters=n_iter, **model_kwargs)
        model.fit(X_train, y_train)
        
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)
        
        train_score = calculate_f1(train_pred, y_train)
        test_score = calculate_f1(test_pred, y_test)
        
        train_scores.append(train_score)
        test_scores.append(test_score)
        
        print(f"Iter {n_iter}: Train={train_score:.3f}, Test={test_score:.3f}")
    
    plt.figure(figsize=(8, 5))
    plt.plot(iterations_list, train_scores, 'o-', label='Training', color='blue')
    plt.plot(iterations_list, test_scores, 'o-', label='Test', color='red')
    
    plt.xlabel('Iteration Number')
    plt.ylabel('F1-Score')
    plt.title('Learning Curve')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

### SVM

In [None]:
plot_learning_curve(SVM, X_train, y_train, X_test, y_test, [100, 300, 500, 1000, 1500, 2000, 3000, 4000, 5000, 6000, 7000], lambda_param=svm_lambda_param)

### Logistic Regression

In [None]:
plot_learning_curve(LogisticRegression, X_train, y_train, X_test, y_test, [0, 1, 3, 5, 7, 8, 9, 10, 15, 20], lambda_param=lr_lambda_param, learning_rate=lr_learning_rate)

### Conclusions
Logistic Regression demonstrates superior efficiency and stability for this dataset, exhibiting rapid convergence and enhanced generalization capability. While SVM achieves acceptable performance, it requires extended training time and displays greater instability, likely attributable to the complexity of the margin optimization process.

## Evaluation

### Helper Functions

In [None]:
def calculate_metrics(predictions, y_test):
    tp = np.sum((predictions == 1) & (y_test == 1))
    fp = np.sum((predictions == 1) & (y_test == -1))
    tn = np.sum((predictions == -1) & (y_test == -1))
    fn = np.sum((predictions == -1) & (y_test == 1))
    
    accuracy = (tp + tn) / len(y_test)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn
    }

def plot_metrics(predictions, y_test):
    metrics = calculate_metrics(predictions, y_test)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    names = ['Accuracy', 'Precision', 'Recall', 'F1']
    values = [metrics['accuracy'], metrics['precision'], 
              metrics['recall'], metrics['f1']]
    
    ax1.bar(names, values, color=['skyblue', 'lightcoral', 'lightgreen', 'orange'])
    ax1.set_ylim(0, 1)
    ax1.set_title('Metrics')
    
    for i, v in enumerate(values):
        ax1.text(i, v + 0.02, f'{v:.3f}', ha='center')
    
    cm = [[metrics['tn'], metrics['fp']], 
          [metrics['fn'], metrics['tp']]]
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax2,
                xticklabels=['Bad', 'Good'], yticklabels=['Bad', 'Good'])
    ax2.set_title('Confusion Matrix')
    ax2.set_xlabel('Predicted')
    ax2.set_ylabel('Actual')
    
    plt.tight_layout()
    plt.show()
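As a quick check of the metric definitions used by the helpers above, the sketch below computes the same quantities by hand on five made-up labels. The arrays `y_true` and `y_pred` are invented for the example.

```python
import numpy as np

# Made-up labels in the notebook's {-1, 1} encoding
y_true = np.array([1, 1, 1, -1, -1])
y_pred = np.array([1, 1, -1, 1, -1])

# Confusion-matrix counts, with 1 as the positive class
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == -1)))
tn = int(np.sum((y_pred == -1) & (y_true == -1)))
fn = int(np.sum((y_pred == -1) & (y_true == 1)))

accuracy = (tp + tn) / len(y_true)                  # share of correct predictions
precision = tp / (tp + fp)                          # correct among predicted positives
recall = tp / (tp + fn)                             # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp, fp, tn, fn)
print(accuracy, precision, recall, f1)
```

Because precision and recall are equal in this toy case, their harmonic mean equals both of them.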

### SVM

In [None]:
svm = SVM(n_iters=svm_n_iters, lambda_param=svm_lambda_param)
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)
plot_metrics(predictions, y_test)

### Logistic Regression

In [None]:
lr = LogisticRegression(n_iters=lr_n_iters, lambda_param=lr_lambda_param, learning_rate=lr_learning_rate)
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
plot_metrics(predictions, y_test)

### Conclusions

The metric plots indicate that logistic regression generally outperforms the linear SVM on this dataset.

# Kernel Model Evaluation

## Hyperparameter Tuning

### SVM
The models are now tuned with a non-linear kernel; this analysis uses polynomial kernels of degree 2 and 3.
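The kernel implementation inside the project's `SVM` and `LogisticRegression` classes is not shown in this notebook. The sketch below uses a common form of the polynomial kernel, K(x, z) = (x · z + c)^d, purely as an illustration; the function name `polynomial_kernel` and the offset `coef0` are assumptions and may differ from the project code.

```python
import numpy as np

def polynomial_kernel(A, B, degree=3, coef0=1.0):
    """Gram matrix with entries K[i, j] = (A[i] . B[j] + coef0) ** degree."""
    return (A @ B.T + coef0) ** degree

# Two toy samples with two features each
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])

K = polynomial_kernel(A, A, degree=2)
print(K)
```

A higher `degree` lets the decision boundary bend more, at the price of a greater risk of overfitting, which is one reason to include it in the grid.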

In [None]:
if RUNMODE == 'training':
    svm_k_param_grid = {
            'kernel': ['poly'],
            'n_iters': [3000, 4000, 5000],
            'lambda_param': [1, 1e-1, 1e-2],
            'degree': [2, 3]
        }

    svm_k_best_params, svm_k_best_metrics = grid_search_cv(SVM, svm_k_param_grid, X_train, y_train, cv=5, scoring=METRICS)
    print(f'SVM best parameter: {svm_k_best_params}')
    print(f'SVM best metrics: {svm_k_best_metrics}')

Because the search is time-consuming, the selected hyperparameters are hard-coded for `RUNMODE = 'evaluation'`; with `RUNMODE = 'training'` the values are taken from the grid search, which remains fully reproducible. *(13 min 20s)*

In [None]:
if RUNMODE == 'training':
    svm_k_n_iters = svm_k_best_params['n_iters']
    svm_k_lambda_param = svm_k_best_params['lambda_param']
    svm_k_degree = svm_k_best_params['degree']
else:
    svm_k_n_iters = 4000
    svm_k_lambda_param = 1
    svm_k_degree = 3

### Logistic Regression

In [None]:
if RUNMODE == 'training':
    lr_k_param_grid = {
            'kernel': ['poly'],
            'n_iters': [2, 5, 10, 20, 50, 100],
            'lambda_param': [1e-4, 1e-5],
            'learning_rate': [1e-4, 1e-5],
            'degree': [2, 3]
        }

    lr_k_best_params, lr_k_best_metrics = grid_search_cv(LogisticRegression, lr_k_param_grid, X_train, y_train, cv=5, scoring=METRICS)
    print(f'Logistic Regression best parameter: {lr_k_best_params}')
    print(f'Logistic Regression best metrics: {lr_k_best_metrics}')

Because the search is time-consuming, the selected hyperparameters are hard-coded for `RUNMODE = 'evaluation'`; with `RUNMODE = 'training'` the values are taken from the grid search, which remains fully reproducible. *(14 min 57s)*

In [None]:
if RUNMODE == 'training':
    lr_k_n_iters = lr_k_best_params['n_iters']
    lr_k_lambda_param = lr_k_best_params['lambda_param']
    lr_k_learning_rate = lr_k_best_params['learning_rate']
    lr_k_degree = lr_k_best_params['degree']

else: 
    lr_k_n_iters = 50
    lr_k_lambda_param = 1e-4
    lr_k_learning_rate = 1e-5
    lr_k_degree = 3

With the parameters obtained, the learning curves will now be visualized to analyze convergence behavior.

## Learning Curves

### SVM

In [None]:
plot_learning_curve(SVM, X_train, y_train, X_test, y_test, [100, 300, 500, 1000, 1500, 2000, 3000, 4000, 5000, 6000, 7000], lambda_param=svm_k_lambda_param, kernel='poly', degree=svm_k_degree)

### Logistic Regression

In [None]:
plot_learning_curve(LogisticRegression, X_train, y_train, X_test, y_test, [0, 1, 3, 5, 7, 8, 9, 10, 15, 20], lambda_param=lr_k_lambda_param, learning_rate=lr_k_learning_rate, kernel='poly', degree=lr_k_degree)

It is evident that SVM requires significantly more iterations than logistic regression, which is expected given that logistic regression updates parameters for each example at every iteration. Additionally, SVM performance has improved, while logistic regression exhibits slight overfitting tendencies as iterations progress.

## Evaluation

### SVM

In [None]:
svm = SVM(n_iters=svm_k_n_iters, lambda_param=svm_k_lambda_param, kernel='poly', degree=svm_k_degree)
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)
plot_metrics(predictions, y_test)

### Logistic Regression

In [None]:
lr = LogisticRegression(n_iters=lr_k_n_iters, lambda_param=lr_k_lambda_param, learning_rate=lr_k_learning_rate, kernel='poly', degree=lr_k_degree)
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
plot_metrics(predictions, y_test)

### Conclusions
With kernel methods, SVM performance has improved significantly, although logistic regression still achieves better results. Recall is also consistently higher than precision for both models. Accuracy is approximately 75%, a satisfactory result given that, with the 60-40 class imbalance, always predicting the majority class would already reach about 60%.
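To make the baseline concrete, the sketch below scores a trivial classifier that always predicts the positive class on synthetic labels with a 60-40 split. It assumes that the positive class is the 60% majority, which the notebook does not state explicitly; the labels are synthetic and are not the real test set.

```python
import numpy as np

# Synthetic labels with a 60-40 split; assumes +1 is the majority class
y = np.array([1] * 60 + [-1] * 40)
always_positive = np.ones_like(y)

tp = np.sum((always_positive == 1) & (y == 1))
fp = np.sum((always_positive == 1) & (y == -1))
fn = np.sum((always_positive == -1) & (y == 1))

accuracy = float(np.mean(always_positive == y))
precision = float(tp / (tp + fp))
recall = float(tp / (tp + fn))
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

Under this assumption the trivial baseline already has perfect recall and a fairly high F1, so accuracy and precision are the more informative figures when comparing the trained models against it.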