# Support Vector Machines 
### with Hyperparameter Tuning

### Note:
Unlike in the lab sessions and lectures, we do not use Scikit-learn here but the cuML (CUDA Machine Learning) library. It lets us train our SVMs on an NVIDIA GPU, which speeds up training considerably. The approach and the way we tune the hyperparameters remain the same.

1. Load the preprocessed data from the previous group. (Because we use cuML, we load it into a cuDF DataFrame instead of a Pandas DataFrame.)
2. We pass arrays rather than DataFrames to cuML, so we also convert the sets and labels to CuPy arrays.

In [1]:
import cudf
import cupy as cp
import numpy as np
import time
from cuml.svm import SVC
from cuml.metrics import accuracy_score
from sklearn.model_selection import ParameterSampler, ParameterGrid
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import joblib

DATAPATH = '../Data/'
MODELPATH = '../Data/Models/SVM/'
MODELPATHSKLEARN = '../Data/Models/SVM/SVM_sklearn/'

train_set = cudf.read_csv(DATAPATH + 'train_set.csv')
train_labels = cudf.read_csv(DATAPATH + 'train_labels.csv')
test_set = cudf.read_csv(DATAPATH + 'test_set.csv')
test_labels = cudf.read_csv(DATAPATH + 'test_labels.csv')

train_set = train_set.to_cupy()
test_set = test_set.to_cupy()
train_labels = train_labels.to_cupy()
test_labels = test_labels.to_cupy()


In [5]:
import pandas as pd
from sklearn.svm import LinearSVC
# sklearn.metrics is deliberately not imported here, so that accuracy_score
# keeps referring to the cuML version imported in the first cell.

train_set_sk = pd.read_csv(DATAPATH + 'train_set.csv')
train_labels_sk = pd.read_csv(DATAPATH + 'train_labels.csv')
test_set_sk = pd.read_csv(DATAPATH + 'test_set.csv')
test_labels_sk = pd.read_csv(DATAPATH + 'test_labels.csv')

svm_scikit_learn = LinearSVC(penalty='l2', max_iter=680, loss='hinge', class_weight='balanced', C=4.943)
start = time.time()
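# Passing the labels as a one-column DataFrame triggers the column_or_1d warning shown in the output;
# train_labels_sk.values.ravel() would pass them as a 1-D array instead.
svm_scikit_learn.fit(train_set_sk, train_labels_sk)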
end = time.time()

svm_scikit_learn_score = svm_scikit_learn.score(test_set_sk, test_labels_sk)

print('Time to train: ', end - start)
print('Accuracy: ', svm_scikit_learn_score)

joblib.dump(svm_scikit_learn, MODELPATHSKLEARN + 'SVM_para_grid_sklearn.pkl')


Time to train:  0.003141641616821289
Accuracy:  0.7402597402597403


  y = column_or_1d(y, warn=True)


['../Data/Models/SVM/SVM_sklearn/SVM_para_grid_sklearn.pkl']

In [9]:
clf = SVC()

clf.fit(train_set, train_labels)

test_pred = clf.predict(test_set)
# Store the result under a new name so the accuracy_score function is not overwritten
accuracy = accuracy_score(test_labels, test_pred)

joblib.dump(clf, MODELPATH + 'SVM_no_hyper.pkl')

print(f'Accuracy: {accuracy}')

Accuracy: 0.7532467246055603


* Without hyperparameter tuning we reach an accuracy of **approx. 0.753**, a value we definitely want to improve.
* First we use ParameterSampler so that we can test a broader range of hyperparameters.
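To illustrate what random sampling over a parameter space means, here is a minimal sketch using only the standard library. It is not how `ParameterSampler` is implemented; the helper `logspace` and the variable names are hypothetical stand-ins, and the value lists mirror the linear-kernel space used in the sampling cell of this notebook.

```python
import itertools
import random


def logspace(start_exp, stop_exp, num):
    """Return num values spaced evenly on a log10 scale (stand-in for np.logspace)."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]


# Hypothetical search space mirroring the linear-kernel entry of the notebook
space = {
    "kernel": ["linear"],
    "class_weight": [None, "balanced"],
    "C": logspace(-2, 2, 10),
    "tol": logspace(-5, -1, 10),
    "max_iter": [1000],
}

# Enumerate every combination, then draw a fixed number without replacement
keys = sorted(space)
all_combinations = [
    dict(zip(keys, values))
    for values in itertools.product(*(space[k] for k in keys))
]

rng = random.Random(42)
sampled = rng.sample(all_combinations, 50)

print(f"{len(sampled)} of {len(all_combinations)} combinations sampled")
```

With 50 draws out of 200 possible combinations, the random search evaluates a quarter of this space, which is why it is a cheap first pass before a finer grid.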


$$
C \sum_{i=1}^{n} \mathcal{L}(f(x_i), y_i) + \Omega(w)
$$
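The objective above weights the summed per-sample loss by C and adds a penalty Ω(w) on the weights. The following sketch evaluates it for a tiny hand-made example. It assumes the hinge loss for the loss term and Ω(w) = ½‖w‖² for the penalty; the source formula leaves both unspecified, and the data, weights and function names are purely illustrative.

```python
def hinge_loss(score, label):
    """Hinge loss for a label in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - label * score)


def svm_objective(w, b, X, y, C):
    """C * sum of hinge losses + 0.5 * ||w||^2 (assumed choices for L and Omega)."""
    data_term = sum(
        hinge_loss(sum(wj * xj for wj, xj in zip(w, x)) + b, yi)
        for x, yi in zip(X, y)
    )
    penalty = 0.5 * sum(wj * wj for wj in w)
    return C * data_term + penalty


# Illustrative data: the third sample lies on the wrong side of the boundary
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
y = [1, -1, -1]
w, b = [1.0, -1.0], 0.0

print(svm_objective(w, b, X, y, C=1.0))   # 2.5
print(svm_objective(w, b, X, y, C=10.0))  # 16.0
```

Only the misclassified sample contributes a loss (1.5), so raising C from 1 to 10 makes that single error dominate the objective. This is the trade-off that tuning C controls.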

In [7]:
train_labels = train_labels.flatten()
test_labels = test_labels.flatten()

param_distributions = [
    {
        'kernel': ['linear'],
        'class_weight': [None, 'balanced'],
        'C': np.logspace(-2, 2, 10), 
        'tol': np.logspace(-5, -1, 10),
        'max_iter': [1000]
    },
    #{
    #    'kernel': ['poly'],
    #    'C': np.logspace(-2, 2, 10),
    #    'tol': np.logspace(-5, -1, 10),
    #    'degree': [2, 3, 4],
    #    'max_iter': [1000]
    #},
    #{
    #    'kernel': ['rbf'],
    #    'C': np.logspace(-2, 2, 10),
    #    'gamma': np.logspace(-2, 1, 10),
    #    'max_iter': [-1]
    #},
    #{
    #    'kernel': ['sigmoid'],
    #    'C': np.logspace(-2, 2, 10),
    #    'gamma': np.logspace(-2, 1, 10),
    #    'coef0': np.linspace(-1, 1, 10),
    #    'max_iter': [-1]
    #}
]

n_iter = 50  # Reduced for demonstration

best_accuracy = 0
best_params = None

start_time = time.time()

all_sampled_params = []
for params in param_distributions:
    sampler = ParameterSampler(params, n_iter=n_iter, random_state=42)
    sampled_params = list(sampler)
    all_sampled_params.extend(sampled_params)

# Create a single SVC instance outside the loop
clf = SVC()  # Use SVC

for params in all_sampled_params:
    try:
        clf.set_params(**params)
        clf.fit(train_set, train_labels)
        predictions = clf.predict(test_set)
        accuracy = accuracy_score(test_labels, predictions)

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = params
    except ValueError as e:
        print(f"Skipping invalid parameter combination: {params}")
        print(f"Error: {e}")


end_time = time.time()

print(f"Best parameters: {best_params}")
print(f"Best accuracy: {best_accuracy}")
print(f"Time taken: {end_time - start_time:.2f} seconds")

# Refit the best configuration on the training data and evaluate it on the test set.
# Note: if every sampled combination fails, best_params is still None and SVC(**best_params) raises a TypeError.
best_model = SVC(**best_params)
best_model.fit(train_set, train_labels)
test_predictions = best_model.predict(test_set)
test_accuracy = accuracy_score(test_labels, test_predictions)  # Evaluate on test data
print(f"Test accuracy with best model: {test_accuracy}")

Skipping invalid parameter combination: {'tol': np.float64(0.00021544346900318823), 'max_iter': 1000, 'kernel': 'linear', 'class_weight': 'balanced', 'C': np.float64(35.93813663804626)}
Error: At least 3 classes are required to use probabilistic SVC with class weights.
Skipping invalid parameter combination: {'tol': np.float64(0.00021544346900318823), 'max_iter': 1000, 'kernel': 'linear', 'class_weight': 'balanced', 'C': np.float64(1.668100537200059)}
Error: At least 3 classes are required to use probabilistic SVC with class weights.
Skipping invalid parameter combination: {'tol': np.float64(9.999999999999999e-06), 'max_iter': 1000, 'kernel': 'linear', 'class_weight': 'balanced', 'C': np.float64(12.915496650148826)}
Error: At least 3 classes are required to use probabilistic SVC with class weights.
Skipping invalid parameter combination: {'tol': np.float64(0.001668100537200059), 'max_iter': 1000, 'kernel': 'linear', 'class_weight': 'balanced', 'C': np.float64(0.5994842503189409)}
Error

TypeError: cuml.svm.svc.SVC() argument after ** must be a mapping, not NoneType

5. With ParameterSampler we found some good parameters.
6. In the next step we use a ParameterGrid to refine the hyperparameters further.

- Best Parameters Found:
  - penalty: l2
  - max_iter: 683
  - loss: hinge
  - linesearch_max_iter: 261
  - class_weight: balanced
  - C: 4.941

- Model Performance:
  - Validation Accuracy: 74.03%
  - Test Accuracy: 74.03%

Training Time: 21.55 seconds
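An exhaustive grid fits one model per combination, so its cost is the product of the list lengths. The following standard-library sketch counts the combinations for the grid used in the next cell; it mimics what `ParameterGrid` enumerates without calling scikit-learn, and the variable names are illustrative.

```python
import itertools

# Same value lists as the grid searched in the next cell
param_grid = {
    "C": [4.93, 4.939, 4.94, 4.941, 4.942, 4.95, 4.96],
    "class_weight": ["balanced"],
    "max_iter": [680, 681, 682, 683, 684, 685, 670],
    "linesearch_max_iter": [250, 255, 259, 260, 261, 262, 265, 270],
    "penalty": ["l2"],
    "loss": ["hinge"],
}

# The number of fits is the product of the list lengths
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)

# Cross-check by enumerating every combination explicitly
combinations = list(itertools.product(*param_grid.values()))

print(n_combinations)  # 392
```

Every added value multiplies the number of fits, which is why the grid is kept narrow around the values found by the random search.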

In [None]:
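from cuml.svm import LinearSVC as cuLinearSVC  # cuML's LinearSVC accepts linesearch_max_iter
from cuml.metrics import accuracy_score  # re-import in case the name was overwritten earlier

param_grid = {
    'C': [4.93, 4.939, 4.94, 4.941, 4.942, 4.95, 4.96],
    'class_weight': ['balanced'],
    'max_iter': [680, 681, 682, 683, 684, 685, 670],
    'linesearch_max_iter': [250, 255, 259, 260, 261, 262, 265, 270],
    'penalty': ['l2'],
    'loss': ['hinge'],
}

best_accuracy = 0
best_params = None

start_time = time.time()

# penalty, loss and linesearch_max_iter are LinearSVC parameters, so search over cuML's LinearSVC
clf = cuLinearSVC()
for params in ParameterGrid(param_grid):
    clf.set_params(**params)
    clf.fit(train_set, train_labels)
    predictions = clf.predict(test_set)
    # Note: candidates are scored on the test set, so this "validation" score equals the test score
    accuracy = accuracy_score(test_labels, predictions)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_params = params

end_time = time.time()

# Refit the best configuration on the training data and store the model
best_model = cuLinearSVC(**best_params)
best_model.fit(train_set, train_labels)
test_predictions = best_model.predict(test_set)

test_accuracy = accuracy_score(test_labels, test_predictions)

joblib.dump(best_model, MODELPATH + 'SVM_para_grid.pkl')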

['../Data/Models/SVM/lin_reg_para_grid.pkl']

In [12]:
print("\n" + "="*50)
print("GRID SEARCH RESULTS")
print("="*50)
print("\nBest Parameters Found:")
for param, value in best_params.items():
    print(f"  {param}: {value}")

print("\nModel Performance:")
print(f"  Validation Accuracy: {best_accuracy*100:.2f}%")

print(f"  Test Accuracy: {test_accuracy*100:.2f}%")
print(f"\nTraining Time: {end_time - start_time:.2f} seconds")
print("="*50)


GRID SEARCH RESULTS

Best Parameters Found:
  C: 4.93
  class_weight: balanced
  linesearch_max_iter: 250
  loss: hinge
  max_iter: 680
  penalty: l2

Model Performance:
  Validation Accuracy: 74.03%
  Test Accuracy: 74.03%

Training Time: 18.28 seconds


The grid search did select slightly different hyperparameters, but the model became neither better nor worse:

- Best Parameters Found:
  - C: 4.93
  - class_weight: balanced
  - linesearch_max_iter: 250
  - loss: hinge
  - max_iter: 680
  - penalty: l2

- Model Performance:
  - **Validation Accuracy: 74.03%**
  - **Test Accuracy: 74.03%**

Training Time: 19.51 seconds
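The unchanged score is less surprising once the granularity of the accuracy is considered. As a rough check, the sketch below recovers the simplest fractions behind the accuracies printed in this notebook. The implied test-set size is an inference from those numbers, not something the source states, and the true size could also be a multiple of it.

```python
from fractions import Fraction

# Accuracies as printed in the notebook outputs
reported = {
    "LinearSVC (scikit-learn)": 0.7402597402597403,
    "SVC with defaults (cuML)": 0.7532467246055603,
}

# Find the simplest fraction (denominator <= 1000) closest to each value
implied = {
    name: Fraction(value).limit_denominator(1000)
    for name, value in reported.items()
}

for name, frac in implied.items():
    print(f"{name}: {frac.numerator} correct out of {frac.denominator}")
```

Both values reduce to the same denominator, and the two models differ by a single correct prediction. On a test set this small, small changes in the hyperparameters can easily leave the accuracy exactly where it was.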


# Cross Validation

There is one important tool for improving our model that we have not looked at yet.
The hyperparameter tuning methods used so far have already improved the model somewhat,
but to get the most out of tuning we also need cross-validation.

In the next steps we start again by choosing the parameters at random, but this time we additionally use cross-validation.
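As a minimal sketch of the idea behind `cv=5`, the hypothetical helper below splits sample indices into k contiguous folds, so that each sample is used for validation exactly once. scikit-learn typically uses stratified folds for classifiers, so this unstratified version only illustrates the principle and is not the library's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(12, 5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: validate on {val_idx}, train on {len(train_idx)} samples")
```

Each candidate is fitted k times and scored on the held-out fold each time; the mean of these scores is what `RandomizedSearchCV` reports as `best_score_`. The test set then stays untouched until the final evaluation.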

In [1]:
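# Imports are repeated so the cell also runs after a kernel restart;
# train_set and train_labels still come from the loading cell.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from cuml.svm import LinearSVC  # assumed: cuML's LinearSVC, which accepts linesearch_max_iter

param_distributions = {
    'C': np.logspace(-2, 2, 50),
    'class_weight': [None, 'balanced'],
    'max_iter': np.linspace(100, 1000, 10, dtype=int),  # iteration counts must be integers
    'linesearch_max_iter': np.linspace(50, 1000, 10, dtype=int),
    'penalty': ['l1', 'l2'],
    'loss': ['hinge', 'squared_hinge'],
}

# 5-fold cross-validation over 500 randomly drawn parameter combinations
rnd_lin_reg = RandomizedSearchCV(LinearSVC(), param_distributions, n_iter=500, random_state=42, cv=5, n_jobs=-1)
rnd_lin_reg.fit(train_set.get(), train_labels.get())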

In [None]:
best_model = rnd_lin_reg.best_estimator_
test_predictions = best_model.predict(test_set)
test_accuracy = accuracy_score(test_labels, test_predictions)

joblib.dump(best_model, MODELPATH + 'SVM_rnd_srch_CV.pkl')

In [None]:
print("\n" + "="*50)
print("RANDOMIZED SEARCH CV RESULTS")
print("="*50)
print("\nBest Parameters Found:")
# best_params_ contains only the searched parameters, not every estimator default
for param, value in rnd_lin_reg.best_params_.items():
    print(f"  {param}: {value}")

print("\nModel Performance:")
print(f"  Validation Accuracy: {rnd_lin_reg.best_score_*100:.2f}%")
print(f"  Test Accuracy: {test_accuracy*100:.2f}%")
print("="*50)