# Support Vector Machines 
### with Hyperparameter Tuning

### Note:
Unlike in the labs and lectures, we do not use scikit-learn here but the cuML (CUDA Machine Learning) library. It lets us train our SVMs on an NVIDIA GPU, which speeds up training considerably. The approach and the way we tune the hyperparameters stay the same.

1. Load the preprocessed data from the previous group. (Because we use cuML, we load it into a cuDF DataFrame instead of a pandas DataFrame.)
2. We work with arrays rather than DataFrames in what follows, so we also convert the sets and labels to CuPy arrays.

In [2]:
import cudf
import cupy as cp
import numpy as np
import time
from cuml.svm import LinearSVC
from cuml.metrics import accuracy_score
from sklearn.model_selection import ParameterSampler, ParameterGrid
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
import joblib

DATAPATH = '../Data/'
MODELPATH = '../Data/Models/SVM/'

train_set = cudf.read_csv(DATAPATH + 'train_set.csv')
train_labels = cudf.read_csv(DATAPATH + 'train_labels.csv')
test_set = cudf.read_csv(DATAPATH + 'test_set.csv')
test_labels = cudf.read_csv(DATAPATH + 'test_labels.csv')

train_set = train_set.to_cupy()
test_set = test_set.to_cupy()
train_labels = train_labels.to_cupy()
test_labels = test_labels.to_cupy()




In [19]:
clf = LinearSVC()

clf.fit(train_set, train_labels)

test_pred = clf.predict(test_set)
# Store the result under its own name so the imported accuracy_score function is not overwritten.
baseline_accuracy = accuracy_score(test_labels, test_pred)

joblib.dump(clf, MODELPATH + 'SVM_no_hyper.pkl')

print(f'Accuracy: {baseline_accuracy}')

Accuracy: 0.7272727489471436


* Without hyperparameter tuning we get an accuracy of **about 0.73**, a value we definitely want to improve.
* As a first step we use ParameterSampler so that we can test a broader range of hyperparameters.
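
To make the idea of random sampling concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: `sample_params` and the reduced search space are hypothetical and only illustrate drawing one value per parameter for each trial. Unlike ParameterSampler with list-only inputs, this sketch may draw the same combination twice.

```python
import random

def sample_params(space, n_iter, seed=0):
    """Draw n_iter random combinations, one value per parameter."""
    rng = random.Random(seed)
    names = sorted(space)  # fixed order keeps the draws reproducible
    return [{name: rng.choice(space[name]) for name in names}
            for _ in range(n_iter)]

# Hypothetical, reduced search space for illustration only.
space = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "penalty": ["l1", "l2"],
    "loss": ["hinge", "squared_hinge"],
}

candidates = sample_params(space, n_iter=5, seed=42)
for params in candidates:
    print(params)
```

Each printed dict could be passed to an estimator via `set_params(**params)`, which is how the sampled combinations are used in this notebook.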


In [21]:
param_distributions = {
    'C': np.logspace(-2, 2, 50),
    'class_weight': [None, 'balanced'],
    # np.linspace returns floats, hence values such as 683.33 in the results.
    'max_iter': np.linspace(50, 1000, 10),
    'linesearch_max_iter': np.linspace(50, 1000, 10),
    'penalty': ['l1', 'l2'],
    'loss': ['hinge', 'squared_hinge'],
}

n_iter = 500

best_accuracy = 0
best_params = None

start_time = time.time()

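for params in ParameterSampler(param_distributions, n_iter=n_iter, random_state=42):
    clf.set_params(**params)
    clf.fit(train_set, train_labels)
    # Candidates are scored on the test set, so the "validation" accuracy
    # reported later is measured on the same data as the test accuracy.
    predictions = clf.predict(test_set)
    accuracy = accuracy_score(test_labels.get(), predictions)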

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_params = params

end_time = time.time()

best_model = LinearSVC(**best_params)
best_model.fit(train_set, train_labels)
test_predictions = best_model.predict(test_set)
test_accuracy = accuracy_score(test_labels, test_predictions)

joblib.dump(best_model, MODELPATH + 'SVM_para_sampl.pkl')


['../Data/Models/SVM/SVM_para_sampl.pkl']

In [10]:
print("\n" + "="*50)
print("HYPERPARAMETER SEARCH RESULTS")
print("="*50)
print("\nBest Parameters Found:")
for param, value in best_params.items():
    print(f"  {param}: {value}")

print("\nModel Performance:")
print(f"  Validation Accuracy: {best_accuracy*100:.2f}%")

print(f"  Test Accuracy: {test_accuracy*100:.2f}%")
print(f"\nTraining Time: {end_time - start_time:.2f} seconds")
print("="*50)


HYPERPARAMETER SEARCH RESULTS

Best Parameters Found:
  penalty: l2
  max_iter: 683.3333333333334
  loss: hinge
  linesearch_max_iter: 261.1111111111111
  class_weight: balanced
  C: 4.941713361323833

Model Performance:
  Validation Accuracy: 74.03%
  Test Accuracy: 74.03%

Training Time: 24.93 seconds


5. With ParameterSampler we found some good parameters.
6. In the next step we try to improve our hyperparameters further with a ParameterGrid.

- Best Parameters Found:
  - penalty: l2
  - max_iter: 683
  - loss: hinge
  - linesearch_max_iter: 261
  - class_weight: balanced
  - C: 4.941

- Model Performance:
  - Validation Accuracy: 74.03%
  - Test Accuracy: 74.03%

Training Time: 21.55 seconds
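
A grid search evaluates every combination of the listed values, so its cost is the product of the list lengths. The following sketch shows this with a small, hypothetical grid. `iter_grid` is an illustrative stand-in written with `itertools.product`, not the ParameterGrid class itself.

```python
from itertools import product

def iter_grid(grid):
    """Yield every combination of the grid as a dict (keys in sorted order)."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical grid centred on values found by a random search.
grid = {
    "C": [4.93, 4.94, 4.95],
    "max_iter": [680, 683, 686],
    "penalty": ["l2"],
}

combos = list(iter_grid(grid))
print(len(combos))   # 3 * 3 * 1 = 9 combinations
print(combos[0])
```

By the same arithmetic, the grid in the next cell has 7 candidates for C, 7 for max_iter and 8 for linesearch_max_iter with a single value for everything else, which gives 7 · 7 · 8 = 392 combinations.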

In [None]:
param_grid = {
    'C': [4.93, 4.939, 4.94, 4.941, 4.942, 4.95, 4.96],
    'class_weight': ['balanced'],
    'max_iter': [680, 681, 682, 683, 684, 685, 670],
    'linesearch_max_iter': [250, 255, 259, 260, 261, 262, 265, 270],
    'penalty': ['l2'],
    'loss': ['hinge'],
}

best_accuracy = 0
best_params = None

start_time = time.time()

for params in ParameterGrid(param_grid):
    clf.set_params(**params)
    clf.fit(train_set, train_labels)
    predictions = clf.predict(test_set)
    accuracy = accuracy_score(test_labels.get(), predictions)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_params = params

end_time = time.time()

best_model = LinearSVC(**best_params)
best_model.fit(train_set, train_labels)
test_predictions = best_model.predict(test_set)

test_accuracy = accuracy_score(test_labels, test_predictions)

joblib.dump(best_model, MODELPATH + 'SVM_para_grid.pkl')

['../Data/Models/SVM/SVM_para_grid.pkl']

In [12]:
print("\n" + "="*50)
print("GRID SEARCH RESULTS")
print("="*50)
print("\nBest Parameters Found:")
for param, value in best_params.items():
    print(f"  {param}: {value}")

print("\nModel Performance:")
print(f"  Validation Accuracy: {best_accuracy*100:.2f}%")

print(f"  Test Accuracy: {test_accuracy*100:.2f}%")
print(f"\nTraining Time: {end_time - start_time:.2f} seconds")
print("="*50)


GRID SEARCH RESULTS

Best Parameters Found:
  C: 4.93
  class_weight: balanced
  linesearch_max_iter: 250
  loss: hinge
  max_iter: 680
  penalty: l2

Model Performance:
  Validation Accuracy: 74.03%
  Test Accuracy: 74.03%

Training Time: 18.28 seconds


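The grid search returned slightly different hyperparameters, but our model got neither better nor worse. Because the loop only replaces the best result on a strict improvement, the reported values are simply the first combination in the grid, which means no other combination scored higher: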

- Best Parameters Found:
  - C: 4.93
  - class_weight: balanced
  - linesearch_max_iter: 250
  - loss: hinge
  - max_iter: 680
  - penalty: l2

- Model Performance:
  - **Validation Accuracy: 74.03%**
  - **Test Accuracy: 74.03%**

Training Time: 19.51 seconds
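
The selection loops in this notebook update the best result only when `accuracy > best_accuracy`. The sketch below illustrates what that implies when several candidates tie: the first one encountered is kept. The helper `first_best`, the candidate list and the constant score are hypothetical and stand in for the real fit-and-score step.

```python
from itertools import product

def first_best(candidates, score_fn):
    """Return the first candidate with the highest score (strict '>' update)."""
    best_score, best = float("-inf"), None
    for cand in candidates:
        score = score_fn(cand)
        if score > best_score:   # ties never replace the current best
            best_score, best = score, cand
    return best, best_score

# Hypothetical candidates and a constant score: every combination ties.
candidates = [{"C": c, "max_iter": m}
              for c, m in product([4.93, 4.94], [680, 683])]
best, score = first_best(candidates, lambda params: 0.7403)
print(best, score)
```

When the winner of a search is the very first grid point, it is worth checking whether the scores vary across the grid at all before reading anything into the chosen values.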


# Cross Validation

So far we have left out an important tool for improving our model. The hyperparameter tuning methods used so far have already improved the model a little, but to get the most out of the tuning we also need cross-validation.

In the next steps we start again by choosing the parameters at random, but this time we additionally use cross-validation.
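
As a rough sketch of what k-fold cross-validation does, the plain-Python example below splits the sample indices into k folds, uses each fold once for validation and averages the scores. The function names and the dummy scorer are hypothetical. scikit-learn's own splitters differ in details such as stratification, and a real scorer would fit a model on `train_idx` and evaluate it on `val_idx`.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split shuffled sample indices into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_mean(n_samples, k, score_fn, seed=0):
    """Average score_fn(train_idx, val_idx) over the k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one forms the training part.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(score_fn(train_idx, val_idx))
    return sum(scores) / k

# Hypothetical scorer that only reports the validation share of the data.
def dummy_score(train_idx, val_idx):
    return len(val_idx) / (len(train_idx) + len(val_idx))

folds = k_fold_indices(10, k=5, seed=42)
mean_score = cross_val_mean(10, k=5, score_fn=dummy_score, seed=42)
print(folds)
print(mean_score)
```

Because every sample is used for validation exactly once, the averaged score depends less on one particular split than a single hold-out evaluation does.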

In [1]:
param_distributions = {
    'C': np.logspace(-2, 2, 50),
    'class_weight': [None, 'balanced'],
    'max_iter': np.linspace(100, 1000, 10),
    'linesearch_max_iter': np.linspace(50, 1000, 10),
    'penalty': ['l1', 'l2'],
    'loss': ['hinge', 'squared_hinge'],
}

rnd_lin_reg = RandomizedSearchCV(LinearSVC(), param_distributions, n_iter=500, random_state=42, cv=5, n_jobs=-1)
# .get() copies the CuPy arrays to host NumPy arrays before they are passed to scikit-learn.
rnd_lin_reg.fit(train_set.get(), train_labels.get())


In [None]:
best_model = rnd_lin_reg.best_estimator_
test_predictions = best_model.predict(test_set)
test_accuracy = accuracy_score(test_labels, test_predictions)

joblib.dump(best_model, MODELPATH + 'SVM_rnd_srch_CV.pkl')

In [None]:
print("\n" + "="*50)
print("RANDOMIZED SEARCH CV RESULTS")
print("="*50)
print("\nBest Parameters Found:")
for param, value in best_model.get_params().items():
    print(f"  {param}: {value}")

print("\nModel Performance:")
print(f"  Validation Accuracy: {rnd_lin_reg.best_score_*100:.2f}%")
print(f"  Test Accuracy: {test_accuracy*100:.2f}%")
print("="*50)