# Data Preprocessing for SVM Classification with Separable Data

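This code snippet demonstrates the preprocessing steps for SVM classification on the separable ring data in 'A2-ring-merged.txt' (used here as the training set) and 'A2-ring-test.txt' (used as the test set). The code reads both tab-separated files, assigns column names, and separates the features from the labels. Because the training and test sets come from separate files, no random split is performed.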

## Libraries Used

- **pandas (pd):** For data manipulation and reading CSV data.
- **numpy (np):** For numerical operations on the data.
- **sklearn.datasets:** For loading datasets (imported, but not used in the code shown).
- **sklearn.model_selection:** Provides train_test_split for splitting data and GridSearchCV for hyperparameter tuning with cross-validation. Only GridSearchCV is used in the code shown.
- **sklearn.svm.SVC:** Support Vector Machine classifier.
- **sklearn.metrics.accuracy_score:** For evaluating the accuracy of the classifier.
- **sklearn.preprocessing.MinMaxScaler:** For rescaling numeric features to a fixed range, [0, 1] by default (imported, but not applied in the code shown).
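
MinMaxScaler is listed above but never applied in the cells that follow. As a minimal sketch of how it is commonly used, the block below fits the scaler on training features only and reuses the fitted ranges for the test features. The arrays `X_train_demo` and `X_test_demo` are made-up stand-ins, not values from the A2-ring files.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical two-feature arrays standing in for X_train and X_test
X_train_demo = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test_demo = np.array([[2.5, 15.0], [12.0, 30.0]])

scaler = MinMaxScaler()

# Learn the per-column min and max from the training data only
X_train_scaled = scaler.fit_transform(X_train_demo)

# Apply the same training ranges to the test data (no refitting)
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled)
print(X_test_scaled)
```

Test values that fall outside the training range map outside [0, 1], which is expected when the scaler is fitted on the training set alone.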


In [1]:
# import data preprocessing libraries
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

# import A2-ring-merged.txt as a dataframe
df_separable = pd.read_csv('A2-ring-merged.txt', sep='\t', header=None)
df_separable.columns = ['x1', 'x2', 'y']

X_train, y_train = df_separable.drop(["y"], axis=1), df_separable[["y"]]

df_test = pd.read_csv('A2-ring-test.txt', sep='\t', header=None)
df_test.columns = ['x1', 'x2', 'y']

X_test, y_test = df_test.drop(["y"], axis=1), df_test[["y"]]

# flatten the single-column label frames to 1-D arrays, the shape SVC.fit expects for y
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
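
The loading steps above assume that both files are tab-delimited and contain exactly three columns. A minimal sketch of a loader that checks this assumption is shown below. The helper name `load_xy` and the in-memory sample rows are hypothetical and are not part of the original notebook.

```python
import io

import numpy as np
import pandas as pd


def load_xy(source, sep="\t"):
    """Read a headerless three-column file and return (X, y)."""
    df = pd.read_csv(source, sep=sep, header=None)
    # A wrong delimiter typically yields a single column, so fail early
    if df.shape[1] != 3:
        raise ValueError(
            f"expected 3 columns, got {df.shape[1]}; check the delimiter"
        )
    df.columns = ["x1", "x2", "y"]
    X = df[["x1", "x2"]]
    y = df["y"].to_numpy()  # already 1-D, so no ravel is needed
    return X, y


# Made-up rows used only to exercise the helper
sample = io.StringIO("0.1\t0.2\t0\n0.5\t0.5\t1\n0.9\t0.1\t0\n")
X_demo, y_demo = load_xy(sample)

print(X_demo.shape, y_demo.shape)
print(np.unique(y_demo, return_counts=True))  # class labels and their counts
```

Selecting the label column as a Series and converting it with `to_numpy()` gives a 1-D array directly, which is an alternative to selecting a one-column frame and flattening it afterwards.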


Best Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Test Set Accuracy: 0.9788

Cross-Validation Results:
Mean CV Score: 0.5515, Parameters: {'C': 0.01, 'gamma': 'scale', 'kernel': 'linear'}
Mean CV Score: 0.7501, Parameters: {'C': 0.01, 'gamma': 'scale', 'kernel': 'rbf'}
Mean CV Score: 0.5515, Parameters: {'C': 0.01, 'gamma': 'scale', 'kernel': 'poly'}
Mean CV Score: 0.4712, Parameters: {'C': 0.01, 'gamma': 'scale', 'kernel': 'sigmoid'}
Mean CV Score: 0.5515, Parameters: {'C': 0.01, 'gamma': 'auto', 'kernel': 'linear'}
Mean CV Score: 0.7546, Parameters: {'C': 0.01, 'gamma': 'auto', 'kernel': 'rbf'}
Mean CV Score: 0.5515, Parameters: {'C': 0.01, 'gamma': 'auto', 'kernel': 'poly'}
Mean CV Score: 0.5515, Parameters: {'C': 0.01, 'gamma': 'auto', 'kernel': 'sigmoid'}
Mean CV Score: 0.5515, Parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Mean CV Score: 0.7698, Parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'rbf'}
Mean CV Score: 0.5515, Parameters: {'C': 0.1, 'gamma

# Support Vector Machine (SVM) Model Training and Evaluation

This code segment trains and evaluates a Support Vector Machine (SVM) classifier using the scikit-learn library. GridSearchCV tunes C, kernel, and gamma with 5-fold cross-validation on the training data, and the best model found is then evaluated on the test set.


In [None]:

# Define the SVM model
svm_model = SVC()

# Define the parameter grid to search
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],       # regularization parameter
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],  # kernel type
    'gamma': ['scale', 'auto']    # kernel coefficient for 'rbf', 'poly' and 'sigmoid'; ignored by 'linear'
}

# Use GridSearchCV for hyperparameter tuning and cross-validation
grid_search = GridSearchCV(svm_model, param_grid, cv=5)  # 5-fold cross-validation

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found
print("Best Parameters:", grid_search.best_params_)

# Evaluate the model on the test set
y_pred_test = grid_search.predict(X_test)

# Calculate the accuracy on the test set
test_accuracy = accuracy_score(y_test, y_pred_test)
print("Test Set Accuracy:", test_accuracy)

# Report cross-validation results
print("\nCross-Validation Results:")
cv_results = grid_search.cv_results_
for mean_score, params in zip(cv_results['mean_test_score'], cv_results['params']):
    print(f"Mean CV Score: {mean_score:.4f}, Parameters: {params}")

# Calculate and report expected classification error from cross-validation
expected_cv_error = 1 - grid_search.best_score_
print("\nExpected Classification Error from Cross-Validation:", expected_cv_error)

# Compare with the classification error on the test set
print("Comparison:")
print(f"Expected CV Error < Test Set Error: {expected_cv_error < (1 - test_accuracy)}")
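
Because gamma has no effect on the linear kernel, a single flat grid evaluates each linear setting twice, which is why the linear rows in the output repeat the same score for 'scale' and 'auto'. The sketch below shows one possible variation, not the notebook's own method: the grid is given as a list of dictionaries, a MinMaxScaler is placed inside a Pipeline so that scaling is refitted within each fold, and the results are ranked in a DataFrame. It runs on synthetic ring-shaped data from `make_circles`, which is an assumption standing in for the A2-ring files. All names ending in `_demo` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic concentric-ring data (assumption: a stand-in for the real files)
X_demo, y_demo = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])

# One sub-grid per kernel family; gamma is searched only where it is used
param_grid_demo = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {
        "svc__kernel": ["rbf", "poly", "sigmoid"],
        "svc__C": [0.1, 1, 10],
        "svc__gamma": ["scale", "auto"],
    },
]

search = GridSearchCV(pipe, param_grid_demo, cv=5)
search.fit(X_demo, y_demo)

# Rank all candidates by mean cross-validated accuracy
results = (
    pd.DataFrame(search.cv_results_)
    .sort_values("rank_test_score")
    [["params", "mean_test_score", "std_test_score"]]
)
print(results.head())
print("Best CV accuracy:", search.best_score_)
```

The list-of-dictionaries grid yields 3 linear candidates plus 18 for the other kernels, instead of evaluating redundant linear and gamma combinations.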