# Introduction to SVM

A hyperplane is a flat subspace with one dimension fewer than its ambient space. For example, if the space is 3-dimensional, its hyperplanes are 2-dimensional planes, and if the space is 2-dimensional, its hyperplanes are 1-dimensional lines. A hyperplane splits the space into two halves: the points where the left-hand side of its equation is positive and the points where it is negative.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.gridspec as gridspec

plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENSF444/main/Files/mystyle.mplstyle')

# Create a figure
fig = plt.figure(figsize=(12, 8))

# Define the grid
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 4])

# Left Subplot - 2D hyperplane plot
ax1 = plt.subplot(gs[0])
X1 = np.linspace(-1.5, 1.5, 1000)
X2 = -(1 + 2*X1)/3
ax1.plot(X1, X2)
ax1.set(xlabel=r'$X_1$', ylabel=r'$X_2$', xlim=[-1.5, 1.5], ylim=[-1.5, 1.5], aspect=1)
ax1.fill_between(X1, -1.5, -(1 + 2*X1)/3, color='LimeGreen', alpha=0.1)
ax1.annotate(r'$1 + 2X_1 + 3X_2 > 0$', xy=(-0.5, 0.5), fontsize='xx-large')
ax1.fill_between(X1, -(1 + 2*X1)/3, 1.5, color='Orange', alpha=0.1)
ax1.annotate(r'$1 + 2X_1 + 3X_2 < 0$', xy=(-0.9, -1), fontsize='xx-large')
ax1.grid(True)
ax1.set_title('A 2D Hyperplane Plot')

# Right Subplot - 3D-like representation
ax2 = plt.subplot(gs[1], projection='3d')
X1_3d, X2_3d = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
# Calculate corresponding values for X3 based on the hyperplane equation -(1 + 2*X1 + 3*X2)/4
X3_3d = -(1 + 2*X1_3d + 3*X2_3d)/4
# Plot the 3D hyperplane
ax2.plot_surface(X1_3d, X2_3d, X3_3d, alpha=0.5, rstride=100, cstride=100, color='b')
# Set labels for X, Y, and Z axes
ax2.set(xlabel=r'$X_1$', ylabel=r'$X_2$', zlabel=r'$X_3$')
# Set limits for X, Y, and Z axes
ax2.set_xlim([-1.5, 1.5])
ax2.set_ylim([-1.5, 1.5])
ax2.set_zlim([-1.5, 1.5])
# Enable the grid
ax2.grid(True)
# Annotate the region above the hyperplane
ax2.text(-0.5, 0, 0.5, r'$1 + 2X_1 + 3X_2 + 4X_3 > 0$', fontsize='xx-large')
# Annotate the region below the hyperplane
ax2.text(-2.5, 0, -3, r'$1 + 2X_1 + 3X_2 + 4X_3 < 0$', fontsize='xx-large')
ax2.set_title('A 3D Hyperplane Plot')

# Ensure a tight layout
plt.tight_layout()
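
The annotations in the figure follow from evaluating the left-hand side of the hyperplane equation: its sign tells which side of the hyperplane a point lies on. A minimal sketch, assuming the two hyperplanes plotted above; the helper name `side_of_hyperplane` is a hypothetical choice made for illustration.

```python
import numpy as np

def side_of_hyperplane(point, w, b):
    """Return +1, -1, or 0 according to the sign of b + w . point."""
    return int(np.sign(b + np.dot(w, point)))

# 2D hyperplane from the left panel: 1 + 2*X1 + 3*X2 = 0
w_2d, b_2d = np.array([2.0, 3.0]), 1.0
side_above = side_of_hyperplane([-0.5, 0.5], w_2d, b_2d)    # 1 - 1 + 1.5 = 1.5 > 0
side_below = side_of_hyperplane([-0.9, -1.0], w_2d, b_2d)   # 1 - 1.8 - 3 = -3.8 < 0

# 3D hyperplane from the right panel: 1 + 2*X1 + 3*X2 + 4*X3 = 0
w_3d, b_3d = np.array([2.0, 3.0, 4.0]), 1.0
side_3d = side_of_hyperplane([-0.5, 0.0, 0.5], w_3d, b_3d)  # 1 - 1 + 0 + 2 = 2 > 0

print(side_above, side_below, side_3d)
```

The first two test points are the anchor positions of the two annotations in the 2D panel.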

## Finding the Maximal Margin Hyperplane

We have *n* points *$x_1, ..., x_n$* in a *p*-dimensional space, each with a label *$y_1, ..., y_n$* of either -1 or 1. We want to find the best hyperplane (a line when *p* = 2) that splits the points by their labels.

The best boundary is the one that has the largest distance to the nearest points from each class. This distance is the margin (*$M$*), and it shows how well the boundary splits the classes. The bigger the margin, the more room new points have to land on the correct side, so the less likely we are to make mistakes.

To find the best boundary, we need to solve an optimization problem that finds the values *$b, w_1, ..., w_p$* that define the boundary equation:

\begin{equation}
b + w^T x = 0
\end{equation}

Where:
- *$x$* is any data point.
- *$b$* is the intercept, and *$w$* is the weight vector, which is normal (perpendicular) to the hyperplane.

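The margin, also known as the geometric margin, is the distance between the hyperplane and the closest data point on either side of the hyperplane. If $w$ and $b$ are rescaled so that the closest points satisfy $y_i (b + w^T x_i) = 1$, this distance equals $1 / \|w\|$, and the full width of the band separating the two classes is twice that:

$$\text{Margin width} = \frac{2}{\|w\|}$$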

where $\|w\|$ is the $\ell_2$ norm of the weight vector. The $\ell_2$ norm measures the length or magnitude of the vector.
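
To make the relation between $\|w\|$ and the margin concrete, here is a small numerical sketch. The toy data and the hand-picked hyperplane are hypothetical, and the hyperplane is only a separating one, not necessarily the maximal margin one. The sketch rescales $w$ and $b$ so that the closest points satisfy $y_i (b + w^T x_i) = 1$ and then compares $1 / \|w\|$ with the directly measured distance.

```python
import numpy as np

# Hypothetical separable toy data with labels -1 / +1
X = np.array([[1.0, 1.0], [2.0, 0.0], [4.0, 4.0], [5.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# A separating hyperplane b + w^T x = 0, chosen by hand
w = np.array([1.0, 1.0])
b = -5.0

functional = y * (X @ w + b)                 # y_i (b + w^T x_i) for each point
distances = functional / np.linalg.norm(w)   # geometric distance to the hyperplane

# Rescale so that the closest points have y_i (b + w^T x_i) = 1
scale = functional.min()
w_c, b_c = w / scale, b / scale

half_width = 1 / np.linalg.norm(w_c)   # distance from hyperplane to closest point
full_width = 2 / np.linalg.norm(w_c)   # width of the band between the classes

print(distances.min(), half_width, full_width)
```

Rescaling changes $w$ and $b$ but not the hyperplane itself, which is why the measured distance and $1 / \|w\|$ agree after the rescaling.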

The math problem has some rules for each data point (*$i$*) [James et al., 2023]:

\begin{equation}
y_i (b + w^T x_i) \geq M (1 - \epsilon_i)
\end{equation}

Where:
- *$y_i$* is the class label of the *$i$*-th point, either -1 or 1.
- *$x_i$* is the *$i$*-th data point.
- *$\epsilon_i$* is a slack variable that lets us be flexible in the separation. It is never negative. It is zero for points that are on or outside the margin on the correct side, between 0 and 1 for points that are inside the margin but still on the correct side of the boundary, and greater than 1 for points on the wrong side of the boundary.

These rules make sure that the boundary separates the classes as well as we can, while allowing some errors for noisy or overlapping data.
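
As a sketch of how the slack values follow from the rule above: for a fixed boundary and margin, the smallest non-negative $\epsilon_i$ that satisfies $y_i (b + w^T x_i) \geq M (1 - \epsilon_i)$ is $\max(0, 1 - y_i (b + w^T x_i) / M)$. The function name `minimal_slack` and the four sample points are hypothetical, and the function normalizes $w$ to unit length itself so that $M$ is a distance.

```python
import numpy as np

def minimal_slack(X, y, w, b, M):
    """Smallest eps_i >= 0 with y_i (b + w^T x_i) >= M (1 - eps_i), for unit-norm w."""
    w = np.asarray(w, dtype=float)
    norm = np.linalg.norm(w)
    signed_dist = y * (X @ w + b) / norm   # positive means correct side of the boundary
    return np.maximum(0.0, 1.0 - signed_dist / M)

# Hypothetical example: boundary x1 = 0, margin M = 1, all points labelled +1
X = np.array([[2.0, 0.0],     # outside the margin, correct side
              [1.0, 0.0],     # exactly on the margin
              [0.5, 0.0],     # inside the margin, still on the correct side
              [-0.5, 0.0]])   # wrong side of the boundary
y = np.array([1, 1, 1, 1])

eps = minimal_slack(X, y, w=[1.0, 0.0], b=0.0, M=1.0)
print(eps)  # [0.  0.  0.5 1.5]
```

The last two values show the two kinds of violation: a slack between 0 and 1 for a point inside the margin, and a slack above 1 for a misclassified point.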

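Another rule fixes the scale of the coefficients:

\begin{equation}
\| w \|^2_2 = \sum_{j=1}^{p} w_j^2 = 1
\end{equation}

Multiplying *$b$* and *$w$* by the same positive constant does not move the boundary, so without this rule *$M$* could be made as large as we like just by rescaling. With $\| w \|_2 = 1$, the quantity $y_i (b + w^T x_i)$ is the signed distance from *$x_i$* to the boundary, so *$M$* is a true distance.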

Also, the parameter *$C$* balances between maximizing the margin *$M$* and minimizing the errors *$\epsilon_i$*:

\begin{equation}
\sum_{i=1}^{n} \epsilon_i \leq C
\end{equation}

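The parameter *$C$* is a budget that limits how much total error we can have in the separation. A small *$C$* tolerates few violations, which gives a narrow margin and a strict separation, while a large *$C$* tolerates more violations, which gives a wider margin and a more flexible separation. Note that the C parameter of scikit-learn's SVC is a penalty on violations rather than a budget, so it acts in the opposite direction: a small value gives a wider margin.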
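
The effect of the trade-off can be observed numerically. The sketch below uses scikit-learn, whose `C` argument is a penalty weight on the violations (regularization strength is inversely proportional to it), so it behaves inversely to the budget in the inequality above. The data set is a hypothetical pair of overlapping Gaussian blobs and the three `C` values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data
X, y = make_blobs(n_samples=100, centers=[[-2, 0], [2, 0]],
                  cluster_std=1.0, random_state=0)

results = {}
for C in [0.001, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_)      # margin width 2 / ||w||
    n_sv = len(clf.support_)                   # number of support vectors
    results[C] = (width, n_sv)
    print(f"C={C:>7}: margin width = {width:.3f}, support vectors = {n_sv}")
```

With a small penalty the fitted margin is wide and many points end up inside it as support vectors; with a large penalty the margin shrinks and fewer points are involved.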

The math problem tries to find the values *$b, w_1, ..., w_p$* and the variables *$\epsilon_i$* that follow these rules and maximize the margin *$M$*. By solving this problem, we get the best boundary that separates the classes in the data space. The points that are closest to the boundary and decide its position and margin are called support vectors. They have the most effect on the separation result.

*  Support vectors are the data points closest to the decision boundary (hyperplane) that define its position and orientation. They are crucial for SVM (Support Vector Machine) algorithms because they determine the margin, which is the distance between the decision boundary and the nearest data points. The support_vectors variable contains the coordinates of these data points in your feature space.

*  Coefficients are the weights of each feature in the decision boundary equation. They represent the importance and direction of each feature in the feature space. For a linear SVC model, the decision boundary is a hyperplane, and the coefficients are the normal vector to this hyperplane. The coefficients variable contains the coefficients for each feature dimension.

*  Intercept is the offset of the decision boundary from the origin in the feature space. It shifts the decision boundary along the axis that is orthogonal to the hyperplane. A positive intercept means the decision boundary is shifted in one direction, and a negative intercept means it is shifted in the opposite direction. The intercept variable contains the intercept value for the decision boundary.

*  Margin is the width of the band between the two classes, measured perpendicular to the decision boundary; the nearest support vectors lie at half this width from the boundary. It measures how well the model can separate the classes. The margin variable calculates this width using the formula margin = 2 / np.linalg.norm(coefficients).

*  Slack variables measure how far a data point falls short of its margin. They allow some data points to be inside the margin or on the wrong side of the decision boundary, and they are used in soft-margin SVMs. A point that meets its margin has zero slack. For a fitted model they can be computed from the decision values as slack_variables = np.maximum(0, 1 - y * svc.decision_function(X)).

*  Dual coefficients are derived from the Lagrange multipliers associated with the support vectors. They indicate the importance of each support vector in determining the decision boundary. For a binary classification problem, dual_coefs = svc.dual_coef_[0] retrieves them; each entry is the multiplier of one support vector multiplied by its class label.
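
The quantities in the list above can all be read from, or derived from, a fitted scikit-learn model. The following is a sketch rather than a definitive recipe: it uses the same sample data as the example cell that follows, and the slack values are computed from the decision values in the scaling where the margin boundaries sit at decision value +1 and -1.

```python
import numpy as np
from sklearn.svm import SVC

# Same sample data as the example cell that follows
X = np.array([[2, 1], [3, 3], [4, 3], [5, 4], [6, 5], [7, 5], [8, 6], [9, 7]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

svc = SVC(kernel="linear").fit(X, y)

# Decision values b + w^T x_i, computed by hand and by scikit-learn
w, b = svc.coef_[0], svc.intercept_[0]
manual_scores = X @ w + b
library_scores = svc.decision_function(X)

# Slack of each point: max(0, 1 - y_i (b + w^T x_i))
slack_variables = np.maximum(0, 1 - y * library_scores)

# dual_coef_ has one entry per support vector (shape (1, n_SV) for two classes)
dual_coefs = svc.dual_coef_[0]

# For a linear kernel, w is the dual-coefficient-weighted sum of the support vectors
w_from_dual = svc.dual_coef_ @ svc.support_vectors_

print("Slack variables:  ", np.round(slack_variables, 3))
print("Dual coefficients:", np.round(dual_coefs, 3))
print("w from dual form: ", np.round(w_from_dual, 3))
```

The last line illustrates why the support vectors alone determine the boundary: every other training point contributes nothing to $w$.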

In [None]:
import numpy as np
from sklearn.svm import SVC

# Sample data (features and class labels)
X = np.array([[2, 1], [3, 3], [4, 3], [5, 4], [6, 5], [7, 5], [8, 6], [9, 7]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# Create a Support Vector Classifier
svc = SVC(kernel='linear')

# Fit the model to the data
svc.fit(X, y)

# Get the support vectors
support_vectors = svc.support_vectors_
print("Support Vectors:")
# Iterate over rows (one row per support vector), not over feature columns
for sv in support_vectors:
    print(f'\t\t{sv}')

# Get the coefficients (w values) and the intercept (b value)
# The hyperplane is defined by w^T x + b = 0
coefficients = svc.coef_
intercept = svc.intercept_
print("Coefficients (w values):\t\t", coefficients)
print("Intercept (b value):\t\t\t", intercept)

# Get the margin width
# With the closest points at y_i (w^T x_i + b) = 1, the band between the classes is 2 / ||w|| wide
margin = 2 / np.linalg.norm(coefficients)
print(f'Margin width (2 / ||w||):\t\t {margin:.3f}')

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay
from matplotlib.colors import ListedColormap

# Hyperplane equation
w1, w2 = svc.coef_[0]
b = svc.intercept_[0]
# Raw f-string so that the LaTeX backslashes are not treated as escape sequences
equation = rf'${w1:+.2f} \cdot x_1 {w2:+.2f} \cdot x_2 {b:+.2f} = 0$'

# Display the equation using LaTeX
from IPython.display import Latex, display
display(Latex(equation))



colors = ["#f5645a", "#b781ea"]
edge_colors = ['#8A0002', '#3C1F8B']
# Create a figure and axis
fig, ax = plt.subplots(figsize=(7, 6))

# Decision boundary display
DecisionBoundaryDisplay.from_estimator(svc, X, cmap= ListedColormap(colors), ax=ax,
                                       grid_resolution = 300,
                                       response_method="predict",
                                       plot_method="pcolormesh",
                                       alpha=0.5,
                                       xlabel='Feature 1', ylabel='Feature 2', shading="auto")

# Define labels and markers for different classes
class_info = [(1, 'o', colors[1]), (-1, 's', colors[0])]

for label, marker, color in class_info:
    class_data = X[y == label]
    ax.scatter(class_data[:, 0], class_data[:, 1], fc=color, ec=edge_colors[label == 1],
               label=str(label), marker=marker)

# Plot support vectors
support_vectors = svc.support_vectors_
ax.scatter(support_vectors[:, 0], support_vectors[:, 1], s=200, facecolors='none',
           edgecolors='k', lw=2, label='Support Vectors')

# Decision boundary line
w1, w2 = svc.coef_[0]
b = svc.intercept_[0]
line_x = np.linspace(ax.get_xlim()[0], ax.get_xlim()[1], 100)
line_y = (-w1 / w2) * line_x - (b / w2)
ax.plot(line_x, line_y, color='black', linestyle='--', label= f'Decision Boundary: {equation}')

# Plot settings
ax.legend(loc = 'lower right')
ax.set_title("""Support Vector Classifier""", fontweight='bold', fontsize=16)
ax.grid(False)
ax.set_ylim(0, 8)
plt.tight_layout()

# Kernel Methods

Not all data sets are linearly separable or have a linear relationship with the target variable. In such cases, SVM can use a technique called kernel methods to map the data points into a higher-dimensional feature space, where a linear hyperplane can be found. Kernel methods are based on the observation that many machine learning algorithms can be written in terms of inner products between data points, such as $x^T y$. By replacing these inner products with a kernel function $K(x, y)$, we can implicitly compute the inner products in the feature space without explicitly transforming the data points. This is known as the kernel trick, and it allows us to use SVM with nonlinear and complex data sets.

The kernel function, also known as the similarity function, can be any function that satisfies some mathematical properties, such as symmetry and positive semi-definiteness. Some common examples of kernel functions are:

- The **linear kernel**, which uses a simple dot product, $K(x, y) = x^T y$
- The **polynomial kernel**, which raises the dot product (plus an optional constant) to a power
- The **RBF kernel** or **Gaussian kernel**, which decreases with the squared Euclidean distance between points, so nearby points count as similar
- The **sigmoid kernel**, which applies a sigmoid function (the hyperbolic tangent) to a scaled and shifted dot product
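
A small numerical check makes the kernel trick tangible. For 2-dimensional inputs, the homogeneous polynomial kernel of degree 2, $K(x, z) = (x^T z)^2$, equals an ordinary inner product after the explicit map $\phi(v) = (v_1^2, \sqrt{2} v_1 v_2, v_2^2)$. The function names `phi` and `poly2_kernel` and the two sample points are hypothetical choices for this sketch.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D point: (v1^2, sqrt(2) v1 v2, v2^2)."""
    v1, v2 = v
    return np.array([v1**2, np.sqrt(2) * v1 * v2, v2**2])

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x^T z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # inner product in the 3D feature space
implicit = poly2_kernel(x, z)       # same quantity, computed in the original 2D space

print(explicit, implicit)           # both equal 1 up to floating-point rounding
```

The kernel never builds the 3-dimensional vectors, which is what keeps the approach practical when the feature space is very large or, as with the RBF kernel, infinite-dimensional.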

The choice of kernel and its parameters, also known as the hyperparameters of the SVM model, depends on the characteristics of the data set and the problem at hand. It is important to evaluate the performance of different kernels and tune their parameters using methods such as cross-validation combined with grid search, random search, or Bayesian optimization. With a suitable kernel and well-tuned hyperparameters, SVM can learn complex, nonlinear patterns from the data while still generalizing to unseen data.
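
As one possible way to carry out such tuning, the sketch below runs a cross-validated grid search with scikit-learn's `GridSearchCV` on the Iris data. The grid values are hypothetical and purely illustrative, not recommendations, and a held-out test set is kept aside so that the final score is not biased by the search.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Hypothetical search grid: one sub-grid per kernel with its own parameters
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3]},
]

# 5-fold cross-validation on the training set for every parameter combination
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy:   {search.score(X_test, y_test):.3f}")
```

Splitting the grid into one dictionary per kernel avoids fitting meaningless combinations, such as a `degree` value for the RBF kernel.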

# SVM Examples

## Example: Iris Flower Dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.inspection import DecisionBoundaryDisplay
from matplotlib.colors import ListedColormap

# Load data
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create SVM models
models = [svm.SVC(kernel="linear", C=1.0),
          svm.SVC(kernel="rbf", gamma=0.7, C=1.0),
          svm.SVC(kernel="poly", degree=3, gamma="auto", C=1.0),
          svm.SVC(kernel="poly", degree=5, gamma="auto", C=1.0)]

# Feature labels
xlabel, ylabel = [x.title().replace('Cm','CM') for x in iris.feature_names[:2]]

# Fit models to data
models = [clf.fit(X_train, y_train) for clf in models]

# Titles for the plots
titles = ["SVC with linear kernel",
          "SVC with RBF kernel",
          "SVC with polynomial (degree 3) kernel",
          "SVC with polynomial (degree 5) kernel"]

# Define colors and markers
colors = ["#f44336", "#40a347", '#0086ff']
edge_colors = ["#cc180b", "#16791d", '#11548f']
markers = ['o', 's', 'd']
cmap_ = ListedColormap(colors)

# Create 2x2 grid for plotting
fig, axes = plt.subplots(2, 2, figsize=(11, 9))

# Create a dictionary to map target names to numbers
label_map = dict(zip( np.arange(len(iris.target_names)), [x.title() for x in iris.target_names]))

# Plot decision boundaries
for clf, title, ax in zip(models, titles, axes.flatten()):
    disp = DecisionBoundaryDisplay.from_estimator(clf, X_train,
                                                  response_method="predict", cmap=cmap_,
                                                  grid_resolution = 300,
                                                  alpha=0.2, ax=ax,
                                                  xlabel=xlabel,
                                                  ylabel=ylabel)

    # Scatter plot of data points with target names
    for num in np.unique(y):
        ax.scatter(X_train[:, 0][y_train == num], X_train[:, 1][y_train == num],
                   c=colors[num],
                   s=40, ec=edge_colors[num],
                   marker=markers[num], label=label_map[num])

    ax.set_title(title, weight='bold')
    ax.grid(False)
    ax.legend(title = 'Flower Type', fontsize = 11, loc = 'best')

# Add a title
plt.suptitle("SVM Classifier Comparison (Train Dataset)", fontsize=16, fontweight='bold')

# Adjust layout
plt.tight_layout()

# Calculate accuracy scores for the models fitted above
for clf, title in zip(models, titles):
    train_preds = clf.predict(X_train)
    test_preds = clf.predict(X_test)
    train_acc = accuracy_score(y_train, train_preds)
    test_acc = accuracy_score(y_test, test_preds)
    print(f"{title}\nTrain Accuracy: {train_acc:.3f}\nTest Accuracy: {test_acc:.3f}\n")

## Example: SVR

In [None]:
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENSF444/main/Files/mystyle.mplstyle')

# Generate sample data
rng = np.random.RandomState(42)
X = 5 * rng.rand(200, 1)
y = np.cos(X).ravel()

# Add noise to targets
y[::5] += 3 * (0.5 - rng.rand(X.shape[0] // 5))

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_plot = np.linspace(0, 5, 1000)[:, None]

# Specify the kernel types
kernels = ['linear', 'rbf']

# Initialize the plot
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5), sharex=True, sharey=True)

for ax, kernel in zip(axes.ravel(), kernels):
    # Fit regression model
    svr = SVR(kernel=kernel, C=1.0, epsilon=0.1)
    svr.fit(X_train, y_train)
    y_train_pred = svr.predict(X_train)
    y_test_pred = svr.predict(X_test)
    y_plot = svr.predict(X_plot)

    # Calculate MSE
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    print(f'MSE for {kernel} kernel: Train = {mse_train:.3f}, Test = {mse_test:.3f}')

    # Plot
    ax.plot(X_plot, y_plot, color='k', lw=2, label=f'{kernel} model')
    ax.scatter(X_train[svr.support_], y_train[svr.support_], facecolor="green",
               edgecolor='DarkGreen', s=50, label=f'{kernel} support vectors')
    # Indices of the training points that are not support vectors
    non_support = np.setdiff1d(np.arange(len(X_train)), svr.support_)
    ax.scatter(X_train[non_support], y_train[non_support], facecolor="none",
               edgecolor='red', s=50, label='other training data')
    ax.legend(loc='best', ncol=1, fancybox=True, shadow=True)
    ax.set_title(f'SVR with a {kernel.title()} Kernel', weight = 'bold')

# Adjust layout
plt.tight_layout()