# Support Vector Machines 
### with Hyperparameter Tuning

### Hinweis:
Anders als in den Praktika und Vorlesungen verwenden wir hier nicht Scikit-learn, sondern die Bibliothek cuML (CUDA Machine Learning). Diese ermöglicht es uns, unsere SVMs auf einer NVIDIA GPU zu trainieren, was den Trainingsprozess extrem beschleunigt. An der Herangehensweise und der Art und Weise, wie wir die Hyperparameter tunen, ändert sich dadurch jedoch nichts.

1. Laden der aufbereiteten Daten der vorherigen Gruppe: (Da wir cuML verwenden, benutzen wir hier ein cuDF DataFrame anstelle eines Pandas DataFrame.)
2. cuML akzeptiert nur Arrays und keine DataFrames, deswegen müssen wir noch die Sets und Labels in ein CuPy-Array umwandeln.

In [38]:
import cudf

DATAPATH = '../Data/'

train_set = cudf.read_csv(DATAPATH + 'train_set.csv')
train_labels = cudf.read_csv(DATAPATH + 'train_labels.csv')
test_set = cudf.read_csv(DATAPATH + 'test_set.csv')
test_labels = cudf.read_csv(DATAPATH + 'test_labels.csv')

train_set = train_set.to_cupy()
test_set = test_set.to_cupy()
train_labels = train_labels.to_cupy()
test_labels = test_labels.to_cupy()


3. Vorbereiten der SVM (Wir verwenden natürlich eine lineare SVM und nicht Epsilon-SVM, da wir eine kategoriale Vorhersage treffen wollen.)
4. Zuerst verwenden wir ParameterSampler, um eine breitere Fläche an Hyperparametern testen zu koennen.


In [55]:
import cupy as cp
import numpy as np
import time
from cuml.svm import LinearSVC
from cuml.metrics import accuracy_score
from sklearn.model_selection import ParameterSampler


clf = LinearSVC()

param_distributions = {
    'C': np.logspace(-2, 2, 50),
    'class_weight': [None, 'balanced'],
    'max_iter': np.linspace(50, 1000, 10),
    'linesearch_max_iter': np.linspace(50, 1000, 10),
    'penalty': ['l1', 'l2'],
    'loss': ['hinge', 'squared_hinge'],
}

n_iter = 500

best_accuracy = 0
best_params = None

start_time = time.time()

for params in ParameterSampler(param_distributions, n_iter=n_iter, random_state=42):
    clf.set_params(**params)
    clf.fit(train_set, train_labels)
    predictions = clf.predict(test_set)
    accuracy = accuracy_score(test_labels.get(), predictions)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_params = params

end_time = time.time()

print("\n" + "="*50)
print("HYPERPARAMETER SEARCH RESULTS")
print("="*50)
print("\nBest Parameters Found:")
for param, value in best_params.items():
    print(f"  {param}: {value}")

print("\nModel Performance:")
print(f"  Validation Accuracy: {best_accuracy*100:.2f}%")

# Evaluate on test set
best_model = LinearSVC(**best_params)
best_model.fit(train_set, train_labels)
test_predictions = best_model.predict(test_set)
test_accuracy = accuracy_score(test_labels, test_predictions)

print(f"  Test Accuracy: {test_accuracy*100:.2f}%")
print(f"\nTraining Time: {end_time - start_time:.2f} seconds")
print("="*50)


HYPERPARAMETER SEARCH RESULTS

Best Parameters Found:
  penalty: l2
  max_iter: 683.3333333333334
  loss: hinge
  linesearch_max_iter: 261.1111111111111
  class_weight: balanced
  C: 4.941713361323833

Model Performance:
  Validation Accuracy: 74.03%
  Test Accuracy: 74.03%

Training Time: 21.55 seconds


5. Mit Parameter sampler haben wir einige gute Parameter gefunden
6. Im naechsten schritt wollen wir nun mit einem ParameterGrid unsere Hyperparameter weiter verbessern

- Best Parameters Found:
    - penalty: l2
    - max_iter: 683
    - loss: hinge
    - linesearch_max_iter: 261
    - class_weight: balanced
    - C: 4.941
  
 - Model Performance:
    - Validation Accuracy: 74.03%
    - Test Accuracy: 74.03%
  
  Training Time: 21.55 seconds

In [32]:
import cudf
import cupy as cp
from cuml.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import accuracy_score, classification_report

DATAPATH = '../Data/'

# Load data (assuming your CSVs have a header row)
train_set = cudf.read_csv(DATAPATH + 'train_set.csv')
train_labels = cudf.read_csv(DATAPATH + 'train_labels.csv')['Outcome']  # Extract 'Outcome' column
test_set = cudf.read_csv(DATAPATH + 'test_set.csv')
test_labels = cudf.read_csv(DATAPATH + 'test_labels.csv')['Outcome']  # Extract 'Outcome' column

# Convert to cuPy arrays
train_set = train_set.to_cupy()
test_set = test_set.to_cupy()
train_labels = train_labels.to_cupy()
test_labels = test_labels.to_cupy()

# 1. Randomized Search for Initial Hyperparameter Tuning
# -------------------------------------------------------
clf = LinearSVC()

# Define the hyperparameter space to search
param_distributions = {
    'C': cp.logspace(-2, 2, 50),  # Regularization parameter (log scale)
    'penalty': ['l1', 'l2'],      # Regularization type
    'loss': ['hinge', 'squared_hinge'],
    'tol': cp.logspace(-5, -2, 50)  # Tolerance for stopping criteria (log scale)
}

# Create RandomizedSearchCV object
random_search = RandomizedSearchCV(
    clf,
    param_distributions,
    n_iter=20,  # Number of random combinations to try (adjust as needed)
    cv=3,       # 3-fold cross-validation
    scoring='accuracy',
    random_state=42,
    n_jobs=-1    # this is for cpu usage, for gpu use fit_params
)

# Perform the random search
random_search.fit(train_set, train_labels)

# Print the best hyperparameters and the corresponding score
print("Best hyperparameters (Randomized Search):", random_search.best_params_)
print("Best cross-validation accuracy (Randomized Search):", random_search.best_score_)

# 2. Grid Search Based on Randomized Search Results
# -------------------------------------------------

# Refine the parameter grid based on the best results from random search
param_grid = {
    'C': cp.logspace(-1, 1, 10),  # Narrower range around the best 'C' from random search
    'penalty': [random_search.best_params_['penalty']],  # Fix to the best penalty
    'loss': [random_search.best_params_['loss']], # Fix to the best loss
    'tol': cp.logspace(-4, -3, 10)   # Narrower range around the best 'tol'
}

# Create GridSearchCV object
grid_search = GridSearchCV(
    clf,
    param_grid,
    cv=5,       # You can use more folds here for a more thorough search
    scoring='accuracy',
    n_jobs=-1
)

# Perform the grid search
grid_search.fit(train_set, train_labels)

# Print the best hyperparameters and the corresponding score
print("\nBest hyperparameters (Grid Search):", grid_search.best_params_)
print("Best cross-validation accuracy (Grid Search):", grid_search.best_score_)

# 3. Evaluate the Best Model (from Grid Search) on the Test Set
# -------------------------------------------------------------

# Get the best estimator (model) from grid search
best_model = grid_search.best_estimator_

# Make predictions on the test set
predictions = best_model.predict(test_set)

# Evaluate the model
accuracy = accuracy_score(test_labels, predictions)
print("\nTest set accuracy:", accuracy)
print(classification_report(test_labels, predictions))

# 4. K-Fold Cross-Validation with the Best Hyperparameters
# ---------------------------------------------------------

# Create a KFold object
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Store the scores
cv_scores = []

# Perform K-Fold cross-validation
for train_index, val_index in kf.split(train_set):
    X_train_fold, X_val_fold = train_set[train_index], train_set[val_index]
    y_train_fold, y_val_fold = train_labels[train_index], train_labels[val_index]

    # Create a new model with the best hyperparameters
    fold_model = LinearSVC(**grid_search.best_params_, random_state=42)

    # Fit the model on the training fold
    fold_model.fit(X_train_fold, y_train_fold)

    # Make predictions on the validation fold
    fold_predictions = fold_model.predict(X_val_fold)

    # Calculate the accuracy for the fold
    fold_accuracy = accuracy_score(y_val_fold, fold_predictions)
    cv_scores.append(fold_accuracy)

# Print the cross-validation scores
print("\nCross-validation scores:", cv_scores)
print("Average cross-validation accuracy:", cp.mean(cp.array(cv_scores)))

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.