# Support Vector Machines 
### with Hyperparameter Tuning

### Note:
Unlike in the labs and lectures, we do not use `scikit-learn` here but the `cuML` (CUDA Machine Learning) library. It lets us train our SVMs on an NVIDIA GPU, which speeds up training considerably. The approach and the way we tune the hyperparameters stay the same.

1. Load the preprocessed data from the previous group. (Because we use cuML, we load it into a cuDF DataFrame instead of a pandas DataFrame.)
2. cuML operates on GPU arrays. The sets and labels could be converted to CuPy arrays first; the cells below pass the cuDF DataFrames to cuML directly.

In [1]:
import cudf
import cupy as cp
import numpy as np
import time
from cuml.svm import SVC
from cuml.metrics import accuracy_score
from sklearn.model_selection import ParameterSampler, ParameterGrid
from cuml.model_selection import GridSearchCV
import joblib

DATAPATH = '../Data/'
MODELPATH = '../Data/Models/SVM/'
MODELPATHSKLEARN = '../Data/Models/SVM/SVM_sklearn/'

train_set = cudf.read_csv(DATAPATH + 'train_set.csv')
train_labels = cudf.read_csv(DATAPATH + 'train_labels.csv')
test_set = cudf.read_csv(DATAPATH + 'test_set.csv')
test_labels = cudf.read_csv(DATAPATH + 'test_labels.csv')


In [2]:
clf = SVC()

clf.fit(train_set, train_labels)

test_pred = clf.predict(test_set)
# Use a separate name so the imported accuracy_score function is not overwritten
baseline_accuracy = accuracy_score(test_labels, test_pred)

joblib.dump(clf, MODELPATH + 'SVM_no_hyper.pkl')

print(f'Accuracy: {baseline_accuracy}')

Accuracy: 0.7532467246055603


* Without hyperparameter tuning we get an accuracy of **about 0.753**, a value we definitely want to improve.
* First we use `ParameterSampler` so that we can test a broader range of hyperparameters.
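
To illustrate the idea behind random sampling of hyperparameters, here is a minimal pure-Python sketch. It is not the `ParameterSampler` implementation; the helper `sample_param_combinations` and the small search space are hypothetical and only assume that every parameter is given as a list of discrete values.

```python
import itertools
import random


def sample_param_combinations(space, n_iter, seed=42):
    """Draw up to n_iter distinct combinations from a dict of value lists."""
    keys = sorted(space)
    # Enumerate the full Cartesian product of all value lists.
    grid = [dict(zip(keys, values))
            for values in itertools.product(*(space[k] for k in keys))]
    rng = random.Random(seed)
    # Sample without replacement; never ask for more than the grid contains.
    return rng.sample(grid, min(n_iter, len(grid)))


# Hypothetical search space, much smaller than the one used in this notebook.
space = {
    "kernel": ["rbf"],
    "C": [0.001, 0.1, 10.0, 1000.0],
    "gamma": [0.001, 0.1, 10.0],
}

samples = sample_param_combinations(space, n_iter=5)
for params in samples:
    print(params)
```

Only 5 of the 12 possible combinations are evaluated here, which is the trade-off random search makes: wider coverage of the space for a fixed budget, at the cost of possibly missing the best grid point.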


In [None]:
param_distributions = [
    {
        'kernel': ['linear'],
        'C': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True), 
        'tol': np.logspace(np.log10(0.0000001), np.log10(0.1), 7, endpoint=True),
        'max_iter': [1000, 2000, 3000, 4000, 5000],
        'class_weight': ['balanced'],
    },
    {
        'kernel': ['poly'],
        'C': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True), 
        'tol': np.logspace(np.log10(0.0000001), np.log10(0.1), 7, endpoint=True),
        'gamma': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'degree': [2, 3, 4, 5, 6],  # Only meaningful for degree > 1; degree 1 would be linear
        'max_iter': [1000, 2000, 3000, 4000, 5000],
        'class_weight': ['balanced'],
    },
    {
        'kernel': ['rbf'],
        'C': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'tol': np.logspace(np.log10(0.0000001), np.log10(0.1), 7, endpoint=True),
        'gamma': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'max_iter': [1000, 2000, 3000, 4000, 5000],
        'class_weight': ['balanced'],
    },
    {
        'kernel': ['sigmoid'],
        'C': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'tol': np.logspace(np.log10(0.0000001), np.log10(0.1), 7, endpoint=True),
        'gamma': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'coef0': np.linspace(-1, 1, 10),
        'max_iter': [1000, 2000, 3000, 4000, 5000],
        'class_weight': ['balanced'],
    }
]

n_iter = 100

best_accuracy = 0
best_params = None

start_time = time.time()

all_sampled_params = []
for params in param_distributions:
    sampler = ParameterSampler(params, n_iter=n_iter, random_state=42)
    sampled_params = list(sampler)
    all_sampled_params.extend(sampled_params)

clf = SVC()

for params in all_sampled_params:
    try:
        # Reuse one estimator and overwrite only the sampled parameters
        clf.set_params(**params)
        clf.fit(train_set, train_labels)
        # Each candidate is scored on the test set
        predictions = clf.predict(test_set)
        accuracy = accuracy_score(test_labels, predictions)

        # Keep the best-scoring combination seen so far
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = params
    except ValueError as e:
        print(f"Skipping invalid parameter combination: {params}")
        print(f"Error: {e}")


end_time = time.time()

best_model = SVC(**best_params)
best_model.fit(train_set, train_labels) 
test_predictions = best_model.predict(test_set)
test_accuracy = accuracy_score(test_labels, test_predictions)

joblib.dump(best_model, MODELPATH + 'SVM_para_sampl.pkl')

['../Data/Models/SVM/SVM_para_sampl.pkl']

In [23]:
print(f"Best parameters: {best_params}")
print(f"Best accuracy: {best_accuracy}")
print(f"Time taken: {end_time - start_time:.2f} seconds")
print(f"Test accuracy with best model: {test_accuracy}")

Best parameters: {'tol': np.float64(0.1), 'max_iter': 4000, 'kernel': 'poly', 'gamma': np.float64(0.1), 'degree': 5, 'class_weight': 'balanced', 'C': np.float64(1.0)}
Best accuracy: 0.7662337422370911
Time taken: 35.68 seconds
Test accuracy with best model: 0.701298713684082


### 5. Parameter Optimization with `ParameterSampler`

Using `ParameterSampler`, we identified a promising set of parameters:

- Parameters found:
    - kernel: poly
    - C: 1.0
    - degree: 5
    - tol: 0.1
    - gamma: 0.1
    - max_iter: 4000
    - class_weight: balanced

- Model performance:
    - Validation accuracy: 76.62% (best score during the search, measured on the test set)
    - Test accuracy: 70.13% after refitting with the best parameters (a hint of overfitting)

- Training time: 35.68 seconds

### 6. Refining the Hyperparameters with `ParameterGrid`

In the next step we use a `ParameterGrid` to optimize the hyperparameters further.
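
Before running an exhaustive grid it helps to know how many combinations it contains. The following sketch counts them with the standard library only; the value lists are copied from the grid used in this section, except that `max_iter` is replaced by a stand-in list of the same length (10 values).

```python
import itertools

# Value lists copied from the grid in this section.
param_grid = {
    "kernel": ["poly"],
    "C": [0.01, 0.1, 0.8, 0.9, 1, 1.1, 1.2, 2, 5, 10],
    "tol": [0.01, 0.01, 0.1],
    "gamma": [0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.5],
    "degree": [4, 5, 6],
    "max_iter": list(range(10)),  # stand-in for the 10 linspace values
    "class_weight": ["balanced"],
}

# The grid size is the product of the lengths of all value lists.
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)

# Cross-check by enumerating the Cartesian product explicitly.
n_enumerated = sum(1 for _ in itertools.product(*param_grid.values()))

print(n_combinations)  # 9000
```

With 9000 combinations and the reported search time of 436.49 seconds, each fit-and-predict round takes roughly 0.05 seconds on average.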

In [25]:
param_grid = {
    'kernel': ['poly'],
    'C': [0.01, 0.1, 0.8, 0.9, 1, 1.1, 1.2, 2, 5, 10],
    'tol': [0.01, 0.01, 0.1],  # 0.01 is listed twice, so these combinations run twice
    'gamma': [0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.5],
    'degree': [4, 5, 6],
    'max_iter': np.linspace(3900, 4100, 10),  # 10 evenly spaced float values
    'class_weight': ['balanced'],
}

best_accuracy = 0
best_params = None

start_time = time.time()

for params in ParameterGrid(param_grid):
    clf.set_params(**params)
    clf.fit(train_set, train_labels)
    predictions = clf.predict(test_set)
    accuracy = accuracy_score(test_labels, predictions)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_params = params    

end_time = time.time()

best_model = SVC(**best_params)
best_model.fit(train_set, train_labels)
test_predictions = best_model.predict(test_set)

test_accuracy = accuracy_score(test_labels, test_predictions)

joblib.dump(best_model, MODELPATH + 'SVM_para_grid.pkl')

['../Data/Models/SVM/SVM_para_grid.pkl']

In [26]:
print("\n" + "="*50)
print("GRID SEARCH RESULTS")
print("="*50)
print("\nBest Parameters Found:")
for param, value in best_params.items():
    print(f"  {param}: {value}")

print("\nModel Performance:")
print(f"  Validation Accuracy: {best_accuracy*100:.2f}%")

print(f"  Test Accuracy: {test_accuracy*100:.2f}%")
print(f"\nTraining Time: {end_time - start_time:.2f} seconds")
print("="*50)


GRID SEARCH RESULTS

Best Parameters Found:
  C: 1.2
  class_weight: balanced
  degree: 4
  gamma: 0.07
  kernel: poly
  max_iter: 3900.0
  tol: 0.1

Model Performance:
  Validation Accuracy: 78.57%
  Test Accuracy: 66.88%

Training Time: 436.49 seconds


Although we seemingly found better parameters, the test accuracy actually got worse! A clear sign of overfitting.

- Best parameters found:
  - C: 1.2
  - class_weight: balanced
  - degree: 4
  - gamma: 0.07
  - kernel: poly
  - max_iter: 3900.0
  - tol: 0.1

- Model performance:
  - **Validation accuracy: 78.57%**
  - **Test accuracy: 66.88%**

- Training time: 436.49 seconds
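
One common way to keep the test score honest is to choose hyperparameters on a separate validation split and touch the test split only once at the end. The sketch below shows that pattern on synthetic data with a tiny 1-D k-nearest-neighbour rule instead of an SVM; the data, the helper names, and the candidate values for `k` are all assumptions made for illustration, not part of this notebook's pipeline.

```python
import random

rng = random.Random(0)


def make_rows(n):
    """Synthetic 1-D samples: label 1 if x > 0.5, with 20 % label noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        rows.append((x, y))
    return rows


# Three disjoint parts: fit on train, choose on validation, report on test.
train, val, test = make_rows(300), make_rows(100), make_rows(100)


def knn_predict(train_rows, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train_rows, eval_rows, k):
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)


candidates = [1, 5, 15, 45]  # hypothetical hyperparameter values (odd, so no ties)
best_k = max(candidates, key=lambda k: accuracy(train, val, k))
val_acc = accuracy(train, val, best_k)
test_acc = accuracy(train, test, best_k)  # evaluated once, after the choice is made

print(f"best k: {best_k}, validation: {val_acc:.3f}, test: {test_acc:.3f}")
```

Because the test split plays no role in choosing `best_k`, the final number is not biased by the search itself.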


# Cross-Validation

So far we have looked at different hyperparameter tuning methods to improve our models. To get the full benefit of tuning, and above all to avoid overfitting, we need one more important tool: cross-validation.

In the next step we again start by choosing the parameters at random. This time, however, we additionally use cross-validation.
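
As a rough illustration of how k-fold cross-validation partitions the data, the following pure-Python sketch generates the index sets for each fold. The helper `kfold_indices` is hypothetical; it does not shuffle, and it spreads any remainder over the first folds so that no sample is dropped.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds.

    The first n_samples % k folds receive one extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        end = start + size
        val_idx = list(range(start, end))
        # Everything outside the validation slice is used for training.
        train_idx = list(range(0, start)) + list(range(end, n_samples))
        yield train_idx, val_idx
        start = end


folds = list(kfold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Every sample appears in exactly one validation fold, so averaging the k scores uses all of the training data for evaluation without ever scoring a model on rows it was fitted on.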

We reuse the same parameter distributions as with `ParameterSampler`.

In [2]:
param_distributions = [
    {
        'kernel': ['linear'],
        'C': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True), 
        'tol': np.logspace(np.log10(0.0000001), np.log10(0.1), 7, endpoint=True),
        'max_iter': [1000, 2000, 3000, 4000, 5000],
        'class_weight': ['balanced'],
    },
    {
        'kernel': ['poly'],
        'C': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True), 
        'tol': np.logspace(np.log10(0.0000001), np.log10(0.1), 7, endpoint=True),
        'gamma': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'degree': [2, 3, 4, 5, 6],  # Only meaningful for degree > 1; degree 1 would be linear
        'max_iter': [1000, 2000, 3000, 4000, 5000],
        'class_weight': ['balanced'],
    },
    {
        'kernel': ['rbf'],
        'C': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'tol': np.logspace(np.log10(0.0000001), np.log10(0.1), 7, endpoint=True),
        'gamma': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'max_iter': [1000, 2000, 3000, 4000, 5000],
        'class_weight': ['balanced'],
    },
    {
        'kernel': ['sigmoid'],
        'C': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'tol': np.logspace(np.log10(0.0000001), np.log10(0.1), 7, endpoint=True),
        'gamma': np.logspace(np.log10(0.001), np.log10(1000), 7, endpoint=True),
        'coef0': np.linspace(-1, 1, 10),
        'max_iter': [1000, 2000, 3000, 4000, 5000],
        'class_weight': ['balanced'],
    }
]

In [8]:
train_set = train_set.astype('float32')
train_labels = train_labels.astype('float32')
test_set = test_set.astype('float32')
test_labels = test_labels.astype('float32')

In [12]:
train_labels.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Outcome  800 non-null    float32
dtypes: float32(1)
memory usage: 3.1 KB


In [15]:

def randomized_search_with_cv(X, y, param_distributions, n_iter, cv_method='kfold', k=5):
    """
    Performs randomized search with cross-validation for cuML's SVC.

    Args:
        X: cuDF DataFrame, feature matrix (rows are selected with .iloc).
        y: cuDF DataFrame or Series, target variable.
        param_distributions: List of dictionaries, parameter space to search.
        n_iter: Integer, number of parameter settings to sample from each dict.
        cv_method: String, 'kfold', 'stratified' or 'loocv' (default: 'kfold').
        k: Integer, number of folds for k-fold cross-validation (default: 5).

    Returns:
        best_params: Dictionary, best hyperparameter combination found.
        best_score: Float, best average cross-validation score.
    """

    results = []

    for param_distribution in param_distributions:
        # Use ParameterSampler to generate parameter combinations
        sampler = ParameterSampler(param_distribution, n_iter=n_iter, random_state=42)
        sampled_params_list = list(sampler)

        for sampled_params in sampled_params_list:
            # 2. Cross-validation
            scores = []
            if cv_method == 'kfold':
                # Implementation for k-fold
                fold_size = len(X) // k
                for fold in range(k):
                    start = fold * fold_size
                    end = (fold + 1) * fold_size
    
                    X_train = cudf.concat([X.iloc[:start], X.iloc[end:]])
                    y_train = cudf.concat([y.iloc[:start], y.iloc[end:]])
                    X_val = X.iloc[start:end]
                    y_val = y.iloc[start:end]
    
                    model = SVC(**sampled_params)
                    model.fit(X_train, y_train)
                    score = model.score(X_val, y_val)
                    scores.append(score)

            elif cv_method == 'stratified':
                # Implementation for stratified k-fold
                from cuml.model_selection import StratifiedKFold
                skf = StratifiedKFold(n_splits=k)
    
                # Ensure X is a DataFrame (if it's not already)
                if not isinstance(X, cudf.DataFrame):
                    X = cudf.DataFrame(X)
    
                for train_index, val_index in skf.split(X, y):
                    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
                    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    
                    model = SVC(**sampled_params)
                    model.fit(X_train, y_train)
                    score = model.score(X_val, y_val)
                    scores.append(score)
    
            elif cv_method == 'loocv':
                # Implementation for leave-one-out cv
                for i in range(len(X)):
                    X_train = X.drop(i)
                    y_train = y.drop(i)
                    X_val = X.iloc[[i]]
                    y_val = y.iloc[[i]]
                
                    model = SVC(**sampled_params)
                    model.fit(X_train, y_train)
                    score = model.score(X_val, y_val)
                    scores.append(score)

            else:
                raise ValueError("Invalid cv_method. Choose 'kfold', 'stratified' or 'loocv'.")

            avg_score = np.mean(scores)
            results.append({'params': sampled_params, 'score': avg_score})

    # 3. Find the best result
    best_result = max(results, key=lambda x: x['score'])
    best_params = best_result['params']
    best_score = best_result['score']

    return best_params, best_score
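
The `'stratified'` branch of the function above relies on folds that preserve the class proportions. The sketch below shows one simple way to build such folds in pure Python; `stratified_fold_ids` and the example labels are hypothetical and are not how cuML's `StratifiedKFold` is implemented.

```python
from collections import defaultdict


def stratified_fold_ids(labels, k):
    """Assign each sample a fold id so every class is spread evenly over k folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_ids = [None] * len(labels)
    for indices in by_class.values():
        # Deal the samples of one class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_ids[index] = position % k
    return fold_ids


labels = [0] * 12 + [1] * 6  # hypothetical imbalanced labels
fold_ids = stratified_fold_ids(labels, k=3)

fold_counts = []
for fold in range(3):
    val_labels = [labels[i] for i, f in enumerate(fold_ids) if f == fold]
    fold_counts.append((val_labels.count(0), val_labels.count(1)))
    print(fold, fold_counts[-1])
```

Each fold ends up with the same 2:1 class ratio as the full label list, which keeps the per-fold scores comparable when the classes are imbalanced.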

In [16]:
best_params, best_score = randomized_search_with_cv(
    train_set, train_labels, param_distributions, n_iter=20, cv_method='kfold', k=5
)

[W] [20:05:52.244767] SVC with the linear kernel can be much faster using the specialized solver provided by LinearSVC. Consider switching to LinearSVC if tranining takes too long.


RuntimeError: exception occurred! file=/opt/conda/conda-bld/work/cpp/src/svm/kernelcache.cuh line=444: Working set has already been initialized!
Obtained 35 stack frames
#1 in /home/chris/miniconda3/envs/cuml/lib/python3.12/site-packages/cuml/internals/../../../../libcuml++.so: ML::SVM::KernelCache<float, std::experimental::mdspan<float, std::experimental::extents<int, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_stride, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> > >::InitWorkingSet(int const*) +0x3ec [0x7f0d3597144c]
#2 in /home/chris/miniconda3/envs/cuml/lib/python3.12/site-packages/cuml/internals/../../../../libcuml++.so: void ML::SVM::SmoSolver<float>::Solve<std::experimental::mdspan<float, std::experimental::extents<int, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_stride, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> > >(std::experimental::mdspan<float, std::experimental::extents<int, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_stride, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, int, int, float*, float const*, float**, int*, ML::SVM::SupportStorage<float>*, int**, float*, int, int) +0x43a [0x7f0d3598c42a]
#3 in /home/chris/miniconda3/envs/cuml/lib/python3.12/site-packages/cuml/internals/../../../../libcuml++.so: void ML::SVM::svcFitX<float, std::experimental::mdspan<float, std::experimental::extents<int, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_stride, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> > >(raft::handle_t const&, std::experimental::mdspan<float, std::experimental::extents<int, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_stride, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, int, int, float*, ML::SVM::SvmParameter const&, raft::distance::kernels::KernelParams&, ML::SVM::SvmModel<float>&, float const*) +0xa39 [0x7f0d3598fe19]
#4 in /home/chris/miniconda3/envs/cuml/lib/python3.12/site-packages/cuml/internals/../../../../libcuml++.so: void ML::SVM::svcFit<float>(raft::handle_t const&, float*, int, int, float*, ML::SVM::SvmParameter const&, raft::distance::kernels::KernelParams&, ML::SVM::SvmModel<float>&, float const*) +0x4b [0x7f0d3599036b]
#5 in /home/chris/miniconda3/envs/cuml/lib/python3.12/site-packages/cuml/svm/svc.cpython-312-x86_64-linux-gnu.so(+0x481c7) [0x7f0c886b11c7]
#6 in /home/chris/miniconda3/envs/cuml/bin/python(+0x113339) [0x561e4e04e339]
#7 in /home/chris/miniconda3/envs/cuml/bin/python: PyEval_EvalCode +0xa1 [0x561e4e1f5741]
#8 in /home/chris/miniconda3/envs/cuml/bin/python(+0x2d5ece) [0x561e4e210ece]
#9 in /home/chris/miniconda3/envs/cuml/bin/python(+0x112f8e) [0x561e4e04df8e]
#10 in /home/chris/miniconda3/envs/cuml/bin/python(+0x2d099f) [0x561e4e20b99f]
#11 in /home/chris/miniconda3/envs/cuml/bin/python(+0x2d1c57) [0x561e4e20cc57]
#12 in /home/chris/miniconda3/envs/cuml/bin/python(+0x113e38) [0x561e4e04ee38]
#13 in /home/chris/miniconda3/envs/cuml/bin/python(+0x251adc) [0x561e4e18cadc]
#14 in /home/chris/miniconda3/envs/cuml/bin/python(+0x2515be) [0x561e4e18c5be]
#15 in /home/chris/miniconda3/envs/cuml/bin/python: _PyObject_Call +0x12b [0x561e4e1701ab]
#16 in /home/chris/miniconda3/envs/cuml/bin/python(+0x113339) [0x561e4e04e339]
#17 in /home/chris/miniconda3/envs/cuml/bin/python(+0x2d099f) [0x561e4e20b99f]
#18 in /home/chris/miniconda3/envs/cuml/lib/python3.12/lib-dynload/_asyncio.cpython-312-x86_64-linux-gnu.so(+0x8274) [0x7f0e5f8de274]
#19 in /home/chris/miniconda3/envs/cuml/lib/python3.12/lib-dynload/_asyncio.cpython-312-x86_64-linux-gnu.so(+0x8a63) [0x7f0e5f8dea63]
#20 in /home/chris/miniconda3/envs/cuml/bin/python(+0x222fbc) [0x561e4e15dfbc]
#21 in /home/chris/miniconda3/envs/cuml/bin/python(+0x34db0c) [0x561e4e288b0c]
#22 in /home/chris/miniconda3/envs/cuml/bin/python(+0x1c402e) [0x561e4e0ff02e]
#23 in /home/chris/miniconda3/envs/cuml/bin/python(+0x21940b) [0x561e4e15440b]
#24 in /home/chris/miniconda3/envs/cuml/bin/python(+0x113339) [0x561e4e04e339]
#25 in /home/chris/miniconda3/envs/cuml/bin/python: PyEval_EvalCode +0xa1 [0x561e4e1f5741]
#26 in /home/chris/miniconda3/envs/cuml/bin/python(+0x2d5ece) [0x561e4e210ece]
#27 in /home/chris/miniconda3/envs/cuml/bin/python(+0x21940b) [0x561e4e15440b]
#28 in /home/chris/miniconda3/envs/cuml/bin/python: PyObject_Vectorcall +0x2e [0x561e4e1541ae]
#29 in /home/chris/miniconda3/envs/cuml/bin/python(+0x1126a1) [0x561e4e04d6a1]
#30 in /home/chris/miniconda3/envs/cuml/bin/python(+0x2eb328) [0x561e4e226328]
#31 in /home/chris/miniconda3/envs/cuml/bin/python: Py_RunMain +0x3d1 [0x561e4e225ed1]
#32 in /home/chris/miniconda3/envs/cuml/bin/python: Py_BytesMain +0x37 [0x561e4e1e00c7]
#33 in /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f0e60d95d90]
#34 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x80 [0x7f0e60d95e40]
#35 in /home/chris/miniconda3/envs/cuml/bin/python(+0x2a4f71) [0x561e4e1dff71]
