## Answer 1)
In a linear Support Vector Machine (SVM), the goal is to find a hyperplane that separates the data into two classes. The mathematical formula for a linear SVM can be expressed as follows:

Given a set of training data \((\mathbf{x}_i, y_i)\) where \(\mathbf{x}_i\) is the input feature vector and \(y_i\) is the class label (+1 or -1), the decision function for a linear SVM is given by:

\[ f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \]

Here:
- \(f(\mathbf{x})\) is the decision function.
- \(\mathbf{w}\) is the weight vector, perpendicular to the hyperplane.
- \(\mathbf{x}\) is the input feature vector.
- \(b\) is the bias term (also known as the intercept).

The decision function \(f(\mathbf{x})\) classifies an input \(\mathbf{x}\) as follows:
- If \(f(\mathbf{x}) > 0\), then \(\mathbf{x}\) is classified as class +1.
- If \(f(\mathbf{x}) < 0\), then \(\mathbf{x}\) is classified as class -1.

The goal during the training of a linear SVM is to find the values of \(\mathbf{w}\) and \(b\) that maximize the margin between the two classes. The margin is the distance between the hyperplane and the nearest data point from either class. When \(\mathbf{w}\) and \(b\) are scaled so that the nearest points satisfy \(y_i (\mathbf{w} \cdot \mathbf{x}_i + b) = 1\), this distance equals \(\frac{1}{\|\mathbf{w}\|}\), so maximizing the margin is equivalent to minimizing \(\|\mathbf{w}\|\). For linearly separable data, the optimization problem is therefore to find \(\mathbf{w}\) and \(b\) that minimize \(\frac{1}{2} \|\mathbf{w}\|^2\) subject to the constraints \(y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1\) for all training points.

In practice, this problem is solved with quadratic programming (often on its dual form) or, for the equivalent hinge-loss formulation, with (sub)gradient descent. The resulting SVM can then be used for classification by applying the decision function \(f(\mathbf{x})\).
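
The decision function and classification rule above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn's `SVC` with a linear kernel as the solver; the toy data and the helper names `decision_function` and `classify` are made up for this example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A large C approximates the hard-margin problem on separable data.
clf = SVC(kernel="linear", C=1e3).fit(X, y)
w = clf.coef_[0]        # weight vector, perpendicular to the hyperplane
b = clf.intercept_[0]   # bias term

def decision_function(x, w, b):
    """f(x) = w . x + b, evaluated for one point or a batch of points."""
    return np.dot(x, w) + b

def classify(x, w, b):
    """Predict +1 where f(x) > 0 and -1 otherwise."""
    return np.where(decision_function(x, w, b) > 0, 1, -1)

scores = decision_function(X, w, b)
labels = classify(X, w, b)
print("f(x):", scores)
print("predicted labels:", labels)
```

Computing \(f(\mathbf{x})\) by hand from `coef_` and `intercept_` should agree with `clf.decision_function(X)`, which is a quick way to confirm that the fitted model is exactly the linear function described here.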

## Answer 2)

The objective function of a linear Support Vector Machine (SVM) aims to find the optimal hyperplane that maximizes the margin between the classes. The margin is the distance between the hyperplane and the nearest data point from either class. The objective function involves minimizing the norm of the weight vector while simultaneously satisfying certain constraints.

For a linear SVM, the objective function is typically formulated as follows:

\[ \text{Minimize} \quad \frac{1}{2} \|\mathbf{w}\|^2 \]

Subject to the constraints:

\[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \quad \text{for all training points } (\mathbf{x}_i, y_i) \]

Here:
- \(\mathbf{w}\) is the weight vector.
- \(b\) is the bias term (intercept).
- \(\|\mathbf{w}\|\) represents the Euclidean norm (magnitude) of the weight vector.

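The objective function is minimized with respect to \(\mathbf{w}\) and \(b\) under the constraint that each data point \((\mathbf{x}_i, y_i)\) is correctly classified with a functional margin \(y_i (\mathbf{w} \cdot \mathbf{x}_i + b)\) of at least 1. The factor \(\frac{1}{2}\) is included for mathematical convenience (it cancels when the objective is differentiated). Minimizing \(\|\mathbf{w}\|^2\) also acts as a regularization term: a smaller norm corresponds to a wider margin, which helps prevent overfitting.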

In summary, the objective function of a linear SVM seeks the values of \(\mathbf{w}\) and \(b\) that give the maximum margin between the classes, while ensuring that each training point is correctly classified and lies on or outside the margin boundary of its class. This optimization problem is typically solved during training with quadratic programming or, in the hinge-loss formulation, with (sub)gradient descent.
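
To make the objective and its constraints concrete, here is a small sketch that evaluates them for a hand-picked \(\mathbf{w}\) and \(b\). The data points, the chosen parameter values, and the helper names are assumptions made for illustration; nothing is being optimized here.

```python
import numpy as np

def hard_margin_objective(w):
    """Objective value 1/2 * ||w||^2."""
    return 0.5 * np.dot(w, w)

def functional_margins(X, y, w, b):
    """y_i * (w . x_i + b) for every training point."""
    return y * (X @ w + b)

# Hypothetical separable data and a hand-picked candidate (w, b).
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-4.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([0.5, 0.0])
b = 0.0

margins = functional_margins(X, y, w, b)
feasible = bool(np.all(margins >= 1.0 - 1e-12))  # all constraints satisfied?
objective = hard_margin_objective(w)
geometric_margin = 1.0 / np.linalg.norm(w)       # distance to the nearest point

print("functional margins:", margins)
print("feasible:", feasible)
print("objective:", objective)
print("geometric margin:", geometric_margin)
```

The two points with a functional margin of exactly 1 are the ones that sit on the margin boundary; shrinking \(\|\mathbf{w}\|\) any further would violate their constraints.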

## Answer 3)

The kernel trick is a technique used in Support Vector Machines (SVM) to handle non-linear decision boundaries by implicitly mapping input features into a higher-dimensional space. In SVM, the kernel trick allows the algorithm to compute the dot product between transformed feature vectors without explicitly calculating the transformation. This is computationally efficient and avoids the need to explicitly work in the higher-dimensional space.

The idea behind the kernel trick can be understood as follows:

1. **Linear SVM in the Original Feature Space:**
   - In the case of a linear SVM, the decision boundary is a hyperplane in the original feature space.

2. **Non-Linear Decision Boundaries:**
   - For problems where a linear decision boundary is insufficient (e.g., non-linearly separable data), the input features can be mapped into a higher-dimensional space in which a linear boundary may exist. The kernel trick makes this mapping implicit.

3. **Kernel Function:**
   - A kernel function, denoted as \(K(\mathbf{x}_i, \mathbf{x}_j)\), is used to compute the dot product between the transformed feature vectors \(\Phi(\mathbf{x}_i)\) and \(\Phi(\mathbf{x}_j)\) in the higher-dimensional space without explicitly calculating \(\Phi(\mathbf{x}_i)\) and \(\Phi(\mathbf{x}_j)\).
   - The choice of kernel function determines the mapping.

4. **Common Kernel Functions:**
   - Some commonly used kernel functions include:
     - **Linear Kernel:** \(K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j\)
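     - **Polynomial Kernel:** \(K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + c)^d\), where \(c \geq 0\) is a constant offset and \(d\) is the polynomial degree.
     - **Radial Basis Function (RBF) or Gaussian Kernel:** \(K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)\), where \(\sigma > 0\) controls the width of the kernel.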

5. **Implicit Higher-Dimensional Representation:**
   - The kernel trick allows the SVM to operate in an implicitly higher-dimensional space without the need to explicitly calculate or represent the transformed features. This is computationally efficient.

6. **Solving the Dual Optimization Problem:**
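   - The dual form of the SVM optimization problem depends on the training data only through dot products between input vectors. The kernel trick replaces each dot product \(\mathbf{x}_i \cdot \mathbf{x}_j\) with \(K(\mathbf{x}_i, \mathbf{x}_j)\), which lets the same algorithm learn a non-linear boundary.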

By using the kernel trick, SVMs can effectively handle non-linear relationships in the data, making them powerful tools for various machine learning tasks. The choice of the appropriate kernel function depends on the specific characteristics of the data and the problem at hand.
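
The equivalence between a kernel and a dot product in a transformed space can be checked numerically. The sketch below uses the polynomial kernel with \(c = 1\) and \(d = 2\) on two-dimensional inputs, for which the explicit feature map \(\Phi\) is known in closed form. The function names `poly_kernel` and `phi` and the sample vectors are chosen for this example only.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel K(x, z) = (x . z + c)^d, computed in the input space."""
    return (np.dot(x, z) + c) ** d

def phi(x):
    """Explicit feature map for 2-D inputs that matches poly_kernel with c=1, d=2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)        # never leaves the 2-D input space
k_explicit = np.dot(phi(x), phi(z))   # dot product in the 6-D feature space

print("kernel value:", k_implicit)
print("explicit dot product:", k_explicit)
```

Both values agree, but the kernel needs only one 2-D dot product, while the explicit route builds two 6-D vectors first. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit route is practical.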

## Answer 4)

In Support Vector Machines (SVM), support vectors play a crucial role in determining the decision boundary and maximizing the margin between different classes. Support vectors are the data points that lie closest to the decision boundary (hyperplane) and have a non-zero contribution to defining the boundary. These points influence the position and orientation of the hyperplane, making SVM robust and effective, especially in high-dimensional spaces. Let's explain the role of support vectors with an example:

### Example:

Consider a binary classification problem with two classes, represented by red and blue points in a two-dimensional feature space. The goal is to find a decision boundary that separates the two classes. In this case, we'll use a linear SVM.

![SVM Example](https://i.imgur.com/bS3ECKE.png)

1. **Training Data:**
   - Red points belong to Class 1 (+1), and blue points belong to Class 2 (-1).

2. **Linear Decision Boundary:**
   - A linear SVM aims to find a hyperplane that maximizes the margin between the two classes. The hyperplane is the set of points where \(f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = 0\).
   - Support vectors are the training points closest to the decision boundary. In the hard-margin case they lie exactly on the margin boundaries \(f(\mathbf{x}) = \pm 1\), not on the decision boundary itself. In this example, these are the circled points.

3. **Margin:**
   - The margin is the perpendicular distance between the support vectors and the decision boundary. Maximizing the margin is crucial for better generalization to new data.

4. **Optimization:**
   - The SVM optimization problem involves finding the optimal weights (\(\mathbf{w}\)) and bias (\(b\)) that define the decision boundary. In the dual form of this problem, only the support vectors have non-zero Lagrange multipliers, so only they contribute to \(\mathbf{w}\).

5. **Decision Rule:**
   - The decision rule is based on the sign of \(f(\mathbf{x})\):
     - If \(f(\mathbf{x}) > 0\), predict Class 1.
     - If \(f(\mathbf{x}) < 0\), predict Class 2.

6. **Out-of-Sample Prediction:**
   - Support vectors are all the model needs at prediction time. Because the decision boundary depends only on the support vectors, a new, unseen point is classified using them alone, and the remaining training points can be discarded after training.

### Role of Support Vectors:

1. **Defining the Decision Boundary:**
   - Support vectors are critical in defining the location and orientation of the decision boundary. The boundary is determined by the support vectors alone and does not change if other data points are moved or removed, as long as they stay outside the margin.

2. **Maximizing Margin:**
   - The margin is maximized by considering the support vectors. These are the points that contribute to the margin and determine the robustness of the SVM model.

3. **Sparse Solution:**
   - The majority of training points typically do not influence the decision boundary. The SVM solution depends only on the support vectors, so the trained model is sparse in the training samples, while the number of features stays the same.

4. **Handling Non-Linearity:**
   - In the case of non-linear decision boundaries (using kernel tricks), support vectors still play a central role in defining the decision boundary in the transformed feature space.

In summary, support vectors are the key elements in SVM that define the decision boundary and maximize the margin between classes, making the model robust, effective, and capable of handling non-linear relationships in the data.
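
The claim that the boundary depends only on the support vectors can be checked with a short experiment. This is a sketch under stated assumptions: it uses scikit-learn's `SVC` with a linear kernel, a large `C` to approximate a hard margin, and made-up toy data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.5],
              [-1.0, -1.0], [-2.0, -2.0], [-3.0, -2.5], [-4.0, -4.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)

sv_idx = clf.support_  # indices of the support vectors in X
# dual_coef_ holds y_i * alpha_i for the support vectors only, so
# w = sum_i (y_i * alpha_i) * x_i can be rebuilt from them alone.
w_from_sv = (clf.dual_coef_ @ clf.support_vectors_)[0]

# Refit using only the support vectors; the hyperplane should not move.
clf_sv_only = SVC(kernel="linear", C=1e3).fit(X[sv_idx], y[sv_idx])

print("support vectors:\n", clf.support_vectors_)
print("w from full fit:", clf.coef_[0])
print("w rebuilt from support vectors:", w_from_sv)
print("w after refitting on support vectors only:", clf_sv_only.coef_[0])
```

Only a few of the eight points are selected as support vectors, and dropping every other point leaves the fitted weight vector essentially unchanged (up to solver tolerance).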

## Answer 5)

Let's illustrate the concepts of Hyperplane, Marginal plane, Soft margin, and Hard margin in SVM with examples and graphs. We'll use a two-dimensional feature space for simplicity.

### 1. Hyperplane:

The hyperplane is the decision boundary that separates different classes. In a two-dimensional space, it is a straight line. Here's an example:

![Hyperplane](https://i.imgur.com/53Up2bY.png)

- Red and blue points represent two classes.
- The hyperplane (solid line) separates the two classes. It is the set of points where \(f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = 0\).

### 2. Marginal plane:

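The marginal planes are the boundaries of the margin: the two planes parallel to the hyperplane that pass through the support vectors of each class, given by \(\mathbf{w} \cdot \mathbf{x} + b = +1\) and \(\mathbf{w} \cdot \mathbf{x} + b = -1\). The margin is the region between them. Its full width is the distance between the marginal planes, \(\frac{2}{\|\mathbf{w}\|}\), which is twice the distance from the hyperplane to the nearest point. Here's an example: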

![Marginal Plane](https://i.imgur.com/VyTOs1w.png)

- The hyperplane (solid line) is the decision boundary.
- The marginal planes (dashed lines) are parallel to the hyperplane and define the margin.

### 3. Hard Margin:

In a hard margin SVM, the goal is to find a hyperplane with the maximum margin without allowing any misclassifications or any points inside the margin. This is only possible when the data is linearly separable. Here's an example:

![Hard Margin](https://i.imgur.com/Vz1KqUE.png)

- The hyperplane (solid line) maximizes the margin between classes without allowing any points inside the margin.

### 4. Soft Margin:

In a soft margin SVM, a certain amount of misclassification is allowed to find a more flexible decision boundary. This is beneficial when dealing with noisy or overlapping data. Here's an example:

![Soft Margin](https://i.imgur.com/Q6T4ovU.png)

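- The hyperplane (solid line) tolerates a small number of points inside the margin or on the wrong side of the boundary (soft margin).

In the soft margin example, the points within the margin or on the wrong side of the hyperplane are penalized in proportion to how far they violate the margin, scaled by a cost parameter \(C\). The choice of \(C\) sets the trade-off between maximizing the margin and allowing violations: a large \(C\) penalizes violations heavily and gives a narrower margin, while a small \(C\) tolerates more violations in exchange for a wider margin.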

These visualizations illustrate the concepts of hyperplane, marginal plane, hard margin, and soft margin in SVM. The hyperplane and marginal plane concepts are fundamental to understanding SVM's decision boundary and the trade-offs involved in choosing between hard and soft margins.
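
The soft-margin penalty can be made concrete with slack variables. The text above does not spell out the formula, so the sketch below assumes the standard soft-margin objective \(\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i\) with \(\xi_i = \max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))\). The data, the hand-picked \(\mathbf{w}\) and \(b\), and the helper names are illustrative only.

```python
import numpy as np

def slack_variables(X, y, w, b):
    """xi_i = max(0, 1 - y_i * (w . x_i + b)): how far each point violates the margin."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(X, y, w, b, C):
    """Assumed standard soft-margin objective: 1/2 * ||w||^2 + C * sum(xi)."""
    return 0.5 * np.dot(w, w) + C * slack_variables(X, y, w, b).sum()

# Hypothetical data: the second point is inside the margin,
# and the third point is on the wrong side of the hyperplane.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [-3.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

xi = slack_variables(X, y, w, b)
margin_width = 2.0 / np.linalg.norm(w)  # distance between the marginal planes

print("slack per point:", xi)
print("margin width:", margin_width)
for C in (0.1, 1.0, 10.0):
    print(f"C={C}: objective = {soft_margin_objective(X, y, w, b, C)}")
```

A slack of 0 means the point respects the margin, a value between 0 and 1 means it is inside the margin but still correctly classified, and a value above 1 means it is misclassified. For the same \(\mathbf{w}\) and \(b\), a larger \(C\) makes the same violations cost more.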

## Answer 6)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a linear SVM classifier
def train_svm_classifier(X_train, y_train, C=1.0):
    svm_classifier = SVC(kernel='linear', C=C)
    svm_classifier.fit(X_train, y_train)
    return svm_classifier

# Predict labels for the testing set
def predict_labels(model, X_test):
    return model.predict(X_test)

# Compute the accuracy of the model on the testing set
def compute_accuracy(y_true, y_pred):
    return accuracy_score(y_true, y_pred)

# Plot the decision regions of a model that was trained on exactly two features.
# X must contain only those two columns; feature_indices is used for axis labels.
def plot_decision_boundaries(X, y, model, feature_indices=(0, 1)):
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
    plt.xlabel(iris.feature_names[feature_indices[0]] + ' (standardized)')
    plt.ylabel(iris.feature_names[feature_indices[1]] + ' (standardized)')
    plt.title('Decision Boundaries')

# Train the SVM classifier with the default regularization parameter C=1.0
svm_model_default = train_svm_classifier(X_train_scaled, y_train)
y_pred_default = predict_labels(svm_model_default, X_test_scaled)
accuracy_default = compute_accuracy(y_test, y_pred_default)
print(f'Test accuracy with C=1.0 (all features): {accuracy_default:.2f}')

# The plotting mesh is two-dimensional, so a model trained on all four features
# cannot predict on it. The plotted models are trained on the first two features
# only; the reported accuracy still comes from the model trained on all features.
feature_indices = (0, 1)
X_train_2d = X_train_scaled[:, list(feature_indices)]

# Try different values of the regularization parameter C (C=1.0 is the default)
C_values = [0.1, 1.0, 10.0]
plt.figure(figsize=(15, 5))
for i, C in enumerate(C_values):
    svm_model = train_svm_classifier(X_train_scaled, y_train, C=C)
    y_pred = predict_labels(svm_model, X_test_scaled)
    accuracy = compute_accuracy(y_test, y_pred)

    svm_model_2d = train_svm_classifier(X_train_2d, y_train, C=C)
    plt.subplot(1, len(C_values), i + 1)
    plot_decision_boundaries(X_train_2d, y_train, svm_model_2d, feature_indices)
    plt.title(f'SVM (C={C})\nTest accuracy (all features): {accuracy:.2f}')

plt.tight_layout()
plt.show()
