# Support Vector Machines
You should build a machine learning pipeline using a support vector machine model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Train and test a support vector machine model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

1. Loading the dataset (MNIST) using Pandas.
2. Splitting the data into training and test sets.
3. Exploratory Data Analysis (EDA) and preprocessing.
4. Training an SVM model using Scikit-Learn.
5. Evaluating the model's performance.
6. Tuning hyperparameters to improve accuracy.




In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

In [None]:
# Load the MNIST dataset (Assuming it's in CSV format)
mnist = pd.read_csv("datasets/mnist.csv")

# Display dataset information
print(mnist.info())
print(mnist.head())

# Check if there are missing values
print(mnist.isnull().sum().sum())  # Should be 0

In [None]:
# Separate features and target
X = mnist.drop(columns=['label'])  # Features (pixel values)
y = mnist['label']  # Target variable (digit)

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
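
The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in both splits. A minimal sketch of how one might check this, using hypothetical synthetic labels (`y_demo`, `X_demo`) rather than the MNIST data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for the `label` column
rng = np.random.default_rng(42)
y_demo = pd.Series(rng.choice([0, 1, 2], size=1000, p=[0.6, 0.3, 0.1]), name="label")
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class proportions in the full data and in each split, side by side
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(proportions.round(3))

# With stratification the train and test proportions should nearly match
max_gap = (proportions["train"] - proportions["test"]).abs().max()
print(f"Largest train/test proportion gap: {max_gap:.4f}")
```

The same `value_counts(normalize=True)` comparison can be run on `y_train` and `y_test` from the cell above.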

In [None]:
# Show some sample images
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
    img = X_train.iloc[i].values.reshape(28, 28)  # Reshape 784 pixels into 28x28 image
    ax.imshow(img, cmap='gray')
    ax.set_title(f"Label: {y_train.iloc[i]}")
    ax.axis('off')
plt.show()

SVMs are sensitive to feature scales, so we standardize the pixel values (originally in the range 0 to 255) using StandardScaler.

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
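
To illustrate what the scaler does, here is a small sketch on a hypothetical 4-pixel dataset (`X_train_demo` and `X_test_demo` are made up for illustration). It shows that the statistics are learned from the training data only, and that a constant column, like the always-black border pixels in digit images, is mapped to zeros rather than causing a division by zero. Dividing by 255 is shown as a common alternative, not as what this notebook uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 4-pixel "images"; column 0 is constant, like a border pixel
X_train_demo = np.array([[0, 0, 128, 255],
                         [0, 64, 128, 0],
                         [0, 255, 0, 128]], dtype=float)
X_test_demo = np.array([[0, 32, 64, 255]], dtype=float)

scaler_demo = StandardScaler()
X_train_std = scaler_demo.fit_transform(X_train_demo)  # learn mean/std on training data
X_test_std = scaler_demo.transform(X_test_demo)        # reuse the training statistics

print("Per-column mean after scaling:", X_train_std.mean(axis=0).round(6))
print("Per-column std after scaling: ", X_train_std.std(axis=0).round(6))
print("Constant column after scaling:", X_train_std[:, 0])

# Alternative often used for image data: rescale pixel values to [0, 1]
X_train_01 = X_train_demo / 255.0
print("Range after dividing by 255:", X_train_01.min(), "to", X_train_01.max())
```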

Train the SVM Model

In [None]:
# Initialize SVM model with RBF kernel
svm_model = SVC(kernel='rbf', C=10, gamma=0.01, random_state=42)

# Train the model
svm_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = svm_model.predict(X_test_scaled)
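
After fitting, an `SVC` exposes several attributes and methods worth inspecting, such as `classes_`, `n_support_`, `support_vectors_`, and `decision_function`. The sketch below is one way to look at them. As an assumption for the sake of a quick, self-contained run, it uses Scikit-Learn's small built-in `load_digits` dataset (8x8 images) instead of the MNIST CSV, and `gamma="scale"` rather than the value used above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: load_digits stands in for the MNIST CSV to keep the run short
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42, stratify=y_d
)

scaler_d = StandardScaler()
X_tr_s = scaler_d.fit_transform(X_tr)
X_te_s = scaler_d.transform(X_te)

clf = SVC(kernel="rbf", C=10, gamma="scale")
clf.fit(X_tr_s, y_tr)

# Fitted attributes
print("Classes:", clf.classes_)
print("Support vectors per class:", clf.n_support_)
print("Support vector matrix shape:", clf.support_vectors_.shape)

# decision_function returns one score per class (one-vs-rest shape by default)
scores = clf.decision_function(X_te_s[:5])
print("decision_function shape:", scores.shape)

# score() reports mean accuracy on the given data
print("Test accuracy:", clf.score(X_te_s, y_te))
```

A large number of support vectors relative to the training set size can be a hint that `C` or `gamma` deserves tuning.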

In [None]:
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Classification Report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()
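
Beyond the heatmap, the confusion matrix can be queried numerically. The sketch below uses small hypothetical label arrays (`y_true_demo`, `y_pred_demo`) to compute per-class recall and to find the most frequent misclassification. Rows are true labels and columns are predicted labels, matching the axis labels used above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for three classes
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 2, 1, 1])

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Recall per class: correct predictions divided by the row total
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print("Per-class recall:", per_class_recall.round(3))

# Zero out the diagonal to look only at errors, then find the largest cell
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_idx, pred_idx = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"Most common error: true {true_idx} predicted as {pred_idx} "
      f"({off_diag[true_idx, pred_idx]} times)")
```

The same steps can be applied to `confusion_matrix(y_test, y_pred)` to see which digit pairs the model confuses most.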


Finding the Best Hyperparameters using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [1, 10, 100],
    'gamma': [0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

grid_search = GridSearchCV(SVC(), param_grid, cv=3, scoring='accuracy', verbose=2, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Best hyperparameters
print("Best Parameters:", grid_search.best_params_)

# best_estimator_ is already refit on the full training set with the best parameters (refit=True by default)
best_svm = grid_search.best_estimator_
y_pred_best = best_svm.predict(X_test_scaled)

# Final accuracy
print(f"Optimized Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
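
One possible refinement is to put the scaler and the SVM into a `Pipeline` and search over that, so the scaler is refit inside each cross-validation fold instead of seeing the whole training set beforehand. The following is a sketch under stated assumptions: it uses Scikit-Learn's small `load_digits` dataset in place of the MNIST CSV and a reduced, hypothetical grid (`param_grid_demo`) so it finishes quickly.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: load_digits stands in for the MNIST CSV to keep the search fast
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42, stratify=y_d
)

# Scaling is part of the estimator, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid_demo = {
    "svc__C": [1, 10],
    "svc__gamma": [0.01, 0.001],
}

search = GridSearchCV(pipe, param_grid_demo, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)  # raw features go in; the pipeline handles scaling

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")

# The refit best pipeline is applied directly to the unscaled test set
test_acc = search.score(X_te, y_te)
print(f"Test accuracy: {test_acc:.4f}")
```

On the full MNIST data an SVM grid search can take a long time, so tuning on a random subsample first and then refitting on all training rows is a common compromise.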
