# Support Vector Machines (SVM) for Classification: A Comprehensive Project

## Introduction

Support Vector Machines (SVM) are powerful supervised learning models used for classification and regression tasks. They work by finding the optimal hyperplane that separates classes in the feature space and can handle both linear and non-linear problems using kernel functions.

**Applications:**
- Text Classification (e.g., sentiment analysis, product reviews)
- Image Classification (e.g., handwritten digit recognition, object detection)
- Bioinformatics (e.g., protein classification, disease prediction)

**Problem Statement:**
In this project we demonstrate the end-to-end process for building an SVM-based classification model using a real-world dataset (the Iris dataset). The dataset is stored and processed in JSON format. We include sections for data exploration, preprocessing, a mathematical explanation of SVM, model training and evaluation, visual analysis of the results, and concluding discussions.

## Dataset Description & Exploratory Data Analysis (EDA)

The Iris dataset comprises 150 samples with 4 features (sepal length, sepal width, petal length, and petal width) and 3 classes representing different iris species. In this section, we load the dataset from a JSON file, inspect its structure, and perform basic visualizations to identify key patterns.

In [11]:
# Import necessary libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import collections

In [12]:
# For reproducibility
np.random.seed(42)

In [13]:
# Load the Iris dataset from scikit-learn (it is written to JSON in the next cell)
from sklearn.datasets import load_iris
iris_sklearn = load_iris()

# Convert the dataset to a dictionary format
iris_dict = {
    'data': iris_sklearn.data.tolist(),
    'target': iris_sklearn.target.tolist(),
    'feature_names': iris_sklearn.feature_names,
    'target_names': iris_sklearn.target_names.tolist()
}

In [None]:
# Save the dataset to a JSON file
with open('iris.json', 'w') as f:
    json.dump(iris_dict, f, indent=4)

# Load the dataset from the JSON file
with open('iris.json', 'r') as f:
    iris_json = json.load(f)

# Convert the JSON data to a pandas DataFrame
df = pd.DataFrame(iris_json['data'], columns=iris_json['feature_names'])
df['target'] = iris_json['target']

# Map target to species names for clarity
target_map = {idx: name for idx, name in enumerate(iris_json['target_names'])}
df['species'] = df['target'].map(target_map)

# Display the first few rows of the DataFrame
df.head()

In [None]:
# Basic EDA
print("DataFrame Shape:", df.shape)
print("\nData Types:\n\n", df.dtypes)


In [None]:
print("\nStatistical Description:\n\n", df.describe())

In [None]:
# Check for missing values
print("\nMissing Values:\n", df.isnull().sum())

In [None]:
# Visualize relationships between features using seaborn's pairplot
sns.pairplot(df, hue='species')
plt.show()

### Interpretation of Pairplot

The pairplot above reveals that the features **petal length (cm)** and **petal width (cm)** provide the clearest separation between the three iris species. Setosa is linearly separable from the other two classes in almost all feature combinations, while Versicolor and Virginica overlap more, especially in sepal measurements. This insight guides our later choice to visualize decision boundaries using petal features.
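
To put rough numbers behind the visual impression, one option is to compare per-species ranges and means. The sketch below reloads the data directly from scikit-learn so that it runs on its own; the names `frame`, `petal_range`, and `feature_means` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Reload Iris so the snippet is self-contained
iris = load_iris()
frame = pd.DataFrame(iris.data, columns=iris.feature_names)
frame['species'] = iris.target_names[iris.target]

# Range of petal length per species: non-overlapping ranges mean
# a single threshold on this feature separates the two classes
petal_range = frame.groupby('species')['petal length (cm)'].agg(['min', 'max'])
print(petal_range)

# Mean of every feature per species
feature_means = frame.groupby('species').mean()
print(feature_means.round(2))
```

If the setosa maximum falls below the versicolor minimum, setosa can be separated by a threshold on petal length alone, which matches the pairplot.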

## Data Preprocessing

The Iris dataset has no missing values and all four features are numeric, so no imputation or categorical encoding is needed. The remaining preprocessing steps are feature scaling and a stratified train/test split.

In [None]:
# Separate features and target
X = df[iris_json['feature_names']]
y = df['target']

# Split first, so the test set stays unseen while the scaler is fitted
from sklearn.model_selection import train_test_split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Fit the scaler on the training set only, then apply it to both sets
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

# Print class distribution in train and test sets to verify stratification
print("Class distribution in the full dataset:", collections.Counter(y))
print("Class distribution in the training set:", collections.Counter(y_train))
print("Class distribution in the test set:", collections.Counter(y_test))

# Output shapes of the training and testing sets
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

> **Why Feature Scaling?**  
Support Vector Machines are sensitive to the scale of input features because the algorithm relies on distance calculations to find the optimal hyperplane. Features with larger scales can dominate the distance metric, leading to suboptimal decision boundaries. Standardizing features ensures that each feature contributes equally to the model.
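
One caveat when scaling: if the scaler is fitted on samples that are later used for validation, those samples influence the scaling statistics. A common way to avoid this during cross-validation is to wrap the scaler and the classifier in a scikit-learn pipeline, so the scaler is refitted on each training fold. The following is a minimal sketch under that assumption, not the workflow used in this notebook; the variable names and the default-like `SVC` settings are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The pipeline refits StandardScaler on the training part of every fold,
# so the held-out fold never contributes to the mean and variance
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))

scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

The same pipeline object can be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `svc__C`).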

## Mathematical Explanation of SVM

Support Vector Machines aim to determine the optimal hyperplane that separates different classes in the feature space. Key concepts include:

1. **Hyperplane Separation:**
   - A hyperplane is a decision boundary. In a 2D space it is a line, in 3D it is a plane, and in an n-dimensional feature space it is an (n-1)-dimensional flat surface.

2. **Maximizing the Margin:**
   - The margin is the distance from the hyperplane to the nearest data points (support vectors). SVM maximizes this margin to improve model generalization.

3. **Kernel Functions:**
   - Kernel functions (e.g., linear, polynomial, radial basis function) allow SVMs to perform non-linear classification by mapping data to a higher-dimensional space where a linear separation is feasible.

4. **Optimization Techniques:**
   - Training an SVM amounts to solving a convex quadratic programming problem. Because the problem is convex, any local optimum is also the global optimum.
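
The margin idea in item 2 can be checked numerically. The sketch below is an illustration rather than part of the pipeline: it fits a linear SVM to two linearly separable Iris classes (setosa and versicolor) with a large `C` to approximate a hard margin, then computes the margin width as $2 / \|\mathbf{w}\|$. The value `C=1e6` and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Keep only setosa (0) and versicolor (1), and relabel them as -1 / +1
mask = y < 2
X2 = X[mask]
y_signed = np.where(y[mask] == 1, 1, -1)

# A very large C leaves almost no room for margin violations
clf = SVC(kernel='linear', C=1e6).fit(X2, y_signed)

# For a linear kernel, coef_ holds the weight vector w
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

# Functional margins y_i * (w . x_i + b); support vectors sit close to 1
functional_margins = y_signed * clf.decision_function(X2)

print("Number of support vectors:", len(clf.support_))
print("Margin width:", margin_width)
print("Smallest functional margin:", functional_margins.min())
```

The smallest functional margin should be close to 1, and only the few points that attain it (the support vectors) determine the hyperplane.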

## SVM Optimization Objective

For linearly separable data, the (hard-margin) SVM optimization problem can be formulated as:

$$
\begin{align*}
& \underset{\mathbf{w}, b}{\text{minimize}} \quad \frac{1}{2} \|\mathbf{w}\|^2 \\
& \text{subject to} \quad y_i (\mathbf{w}^\top \mathbf{x}_i + b) \geq 1, \quad \forall i
\end{align*}
$$

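where $\mathbf{w}$ is the weight vector, $b$ is the bias, $\mathbf{x}_i$ are the feature vectors, and $y_i \in \{-1, +1\}$ are the class labels. This is a binary formulation; scikit-learn's `SVC` handles the three Iris classes by training one binary classifier per pair of classes (one-vs-one).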
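
The constraints above require every training point to lie on the correct side of the margin, which is only possible when the classes are perfectly separable. The regularization parameter `C` tuned later in this project belongs to the standard soft-margin variant, which introduces slack variables $\xi_i$ (notation added here for illustration) that let individual points violate the margin at a cost:

$$
\begin{align*}
& \underset{\mathbf{w}, b, \boldsymbol{\xi}}{\text{minimize}} \quad \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i} \xi_i \\
& \text{subject to} \quad y_i (\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i
\end{align*}
$$

A large $C$ penalizes violations heavily and approaches the hard-margin solution, while a small $C$ tolerates more violations in exchange for a wider margin.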


### Visual Explanation: Margin and Support Vectors

Below is a simple diagram illustrating the SVM concept of margin and support vectors:

![SVM Margin and Support Vectors](https://miro.medium.com/v2/resize:fit:1400/1*oRk-5aab0G8SkBX2fpw8Gw.png)


## Model Training & Evaluation

We now train an SVM model using scikit-learn's SVC, perform hyperparameter tuning via grid search, and evaluate the model's performance with metrics such as accuracy, classification report, and confusion matrix.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

# Set up a parameter grid for hyperparameter tuning
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Initialize the Support Vector Classifier
svc = SVC()

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Display the best parameters found
print("Best Parameters:", grid_search.best_params_)
best_svc = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_svc.predict(X_test)

# Evaluate the model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
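
The `gamma` options in the grid are shorthand for data-dependent values. According to the scikit-learn documentation, `'scale'` uses `1 / (n_features * X.var())` and `'auto'` uses `1 / n_features`, and `gamma` is ignored by the linear kernel. The sketch below checks the `'scale'` formula on standardized Iris data by comparing against an explicitly computed value; the variable names are illustrative, and the data is reloaded so the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

n_features = X_std.shape[1]
gamma_scale = 1.0 / (n_features * X_std.var())  # documented meaning of 'scale'
gamma_auto = 1.0 / n_features                   # documented meaning of 'auto'

# Fit once with the named option and once with the explicit number
clf_named = SVC(kernel='rbf', gamma='scale').fit(X_std, y)
clf_manual = SVC(kernel='rbf', gamma=gamma_scale).fit(X_std, y)

same = np.allclose(clf_named.decision_function(X_std),
                   clf_manual.decision_function(X_std))

print("gamma for 'scale':", gamma_scale)
print("gamma for 'auto': ", gamma_auto)
print("Named and explicit gamma give the same model:", same)
```

Because standardized features have an overall variance of 1, the two options nearly coincide on scaled data, so this grid explores less variation in `gamma` than it appears to. Adding a few explicit numeric values (for example on a logarithmic scale) would be one way to widen the search.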

## Model Analysis & Visualization

Although the Iris dataset has 4 features, we can visualize decision boundaries by considering the two most discriminative features: **petal length** and **petal width**. We also display a heatmap of the confusion matrix for the full model.

In [28]:
# For visualization, select two features: petal length and petal width
features = ['petal length (cm)', 'petal width (cm)']
X_vis = df[features]
y_vis = df['target']

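# Scale the two selected features with a separate scaler,
# so the four-feature scaler fitted earlier is not overwritten
scaler_vis = StandardScaler()
X_vis_scaled = scaler_vis.fit_transform(X_vis)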

# Train an SVM on these two features using the best parameters determined earlier.
# This model is fitted on all 150 samples and serves only to illustrate the boundary.
svc_vis = SVC(**grid_search.best_params_)
svc_vis.fit(X_vis_scaled, y_vis)

# Create a mesh to plot the decision boundaries
x_min, x_max = X_vis_scaled[:, 0].min() - 1, X_vis_scaled[:, 0].max() + 1
y_min, y_max = X_vis_scaled[:, 1].min() - 1, X_vis_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                     np.arange(y_min, y_max, 0.01))

# Predict over the grid
Z = svc_vis.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

In [None]:
# Plot the decision boundary and support vectors
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
sns.scatterplot(x=X_vis_scaled[:, 0], y=X_vis_scaled[:, 1], 
                hue=[target_map[val] for val in y_vis], palette='coolwarm', edgecolor='k')
# Plot support vectors
plt.scatter(svc_vis.support_vectors_[:, 0], svc_vis.support_vectors_[:, 1], 
            s=120, facecolors='none', edgecolors='black', linewidths=1.5, label='Support Vectors')
# The axes show standardized values, not centimetres
plt.xlabel(features[0] + ' (standardized)')
plt.ylabel(features[1] + ' (standardized)')
plt.title('SVM Decision Boundary (Using Petal Dimensions)')
plt.legend()
plt.show()

In [None]:
# Plot confusion matrix for the full model trained on all features
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris_json['target_names'], 
            yticklabels=iris_json['target_names'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

### Confusion Matrix Interpretation

Each row of the confusion matrix corresponds to a true class and each column to a predicted class, so correct predictions lie on the diagonal and errors lie off it. For example, a *versicolor* sample misclassified as *virginica* is counted in the versicolor row and the virginica column. In this run, the model achieves high accuracy, with very few (if any) misclassifications, indicating strong performance on this dataset.
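
The precision and recall values in the classification report can be derived directly from a confusion matrix. The sketch below shows the arithmetic on a hypothetical 3x3 matrix; the counts are invented for illustration and are not the output of the model trained above.

```python
import numpy as np

# HYPOTHETICAL counts for illustration only (rows = true class, columns = predicted class)
cm_example = np.array([
    [15, 0, 0],
    [0, 14, 1],
    [0, 2, 13],
])
class_names = ['setosa', 'versicolor', 'virginica']

# Recall: correct predictions divided by the number of true samples in each row
recall = np.diag(cm_example) / cm_example.sum(axis=1)
# Precision: correct predictions divided by the number of predictions in each column
precision = np.diag(cm_example) / cm_example.sum(axis=0)
# Accuracy: sum of the diagonal divided by the total number of samples
accuracy = np.trace(cm_example) / cm_example.sum()

for name, p, r in zip(class_names, precision, recall):
    print(f"{name:<10} precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Applying the same three lines to the `cm` array computed above should reproduce the per-class figures printed by `classification_report`.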

## Discussion

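The SVM model achieved high accuracy on the Iris test set, which is consistent with SVMs working well on small, low-dimensional datasets with clear class structure. Key observations include:

- **Decision Boundary:** The two-feature visualization shows setosa cleanly separated from the other species, with the only contested region lying between versicolor and virginica. This indicates that petal dimensions are highly discriminative.
- **Hyperparameter Tuning:** The grid search selected the combination of `C`, kernel, and `gamma` with the best 5-fold cross-validated accuracy on the training set. No untuned baseline was evaluated, so the size of the improvement is not quantified here.
- **Model Limitations:** While SVM performs well on this dataset, training time for kernel SVMs grows quickly with the number of samples, so larger or more complex datasets may require more computational resources and careful kernel selection.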

Future comparisons with other models (e.g., Decision Trees, K-Nearest Neighbors) can provide insight into alternative approaches for similar classification problems.
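
As a starting point for such a comparison, the sketch below scores an SVM, a k-nearest-neighbors classifier, and a decision tree with the same cross-validation splits. It is a rough sketch under assumed default hyperparameters, not a tuned benchmark; the `candidates` dictionary, the choice of `n_neighbors=5`, and the fold settings are all illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Use identical, reproducible folds for every model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Distance-based models get a scaler; the tree does not need one
candidates = {
    'SVM (RBF)': make_pipeline(StandardScaler(), SVC()),
    'k-NN (k=5)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    'Decision tree': DecisionTreeClassifier(random_state=42),
}

results = {name: cross_val_score(model, X, y, cv=cv).mean()
           for name, model in candidates.items()}

for name, score in results.items():
    print(f"{name:<14} mean CV accuracy = {score:.3f}")
```

On a dataset this small, differences of one or two percentage points between models are within fold-to-fold noise and should not be over-interpreted.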

## Conclusion

This project demonstrated a comprehensive machine learning pipeline using SVM for classification. Key steps included:

- Loading and processing data stored in JSON format
- Conducting Exploratory Data Analysis (EDA)
- Data preprocessing and feature scaling
- Providing a mathematical explanation of the SVM algorithm
- Training the SVM model with hyperparameter tuning and evaluating its performance
- Visualizing decision boundaries and the confusion matrix

Future work could extend this approach to larger, more complex datasets and explore alternative kernel functions or classification algorithms.

## References

1. Cortes, C. and Vapnik, V. (1995). Support-vector networks. *Machine Learning*, 20(3), 273-297.
2. [scikit-learn: Machine Learning in Python](https://scikit-learn.org/stable/)
3. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
4. [The Iris Dataset: Fisher's Iris Data](http://archive.ics.uci.edu/ml/datasets/Iris)