
# Support Vector Machines
You should build a machine learning pipeline using a support vector machine model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Conduct data exploration, data preprocessing, and feature engineering if necessary.
- Train and test a support vector machine model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Data Collection and Exploration

In [38]:
dataset = pd.read_csv('https://raw.githubusercontent.com/m-mahdavi/teaching/refs/heads/main/datasets/mnist.csv')
dataset.head(3)

Unnamed: 0,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,31953,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,34452,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,60897,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
dataset.describe()

Unnamed: 0,id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
count,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,...,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0,4000.0
mean,34415.17925,4.4395,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.07675,0.01525,0.013,0.0015,0.0,0.0,0.0,0.0,0.0,0.0
std,20508.890104,2.879655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.616022,0.964495,0.822192,0.094868,0.0,0.0,0.0,0.0,0.0,0.0
min,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16575.75,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,34435.5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,52111.5,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,69998.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,125.0,61.0,52.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0
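The summary above shows several pixel columns whose standard deviation is 0, meaning they hold the same value in every row and carry no information for a classifier. The following is a minimal sketch, on a small hypothetical frame (`toy` and its values are made up for illustration), of how such constant columns could be found and dropped with pandas. On the real data the constant columns would ideally be identified on the training set only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the MNIST table (values are invented)
toy = pd.DataFrame({
    "class": [5, 8, 5, 1],
    "pixel1": [0, 0, 0, 0],      # constant column
    "pixel2": [0, 125, 61, 0],   # varying column
    "pixel3": [0, 0, 0, 0],      # constant column
})

# Consider every column except the target
pixel_columns = [c for c in toy.columns if c != "class"]

# A column with a single unique value has zero variance
constant_columns = [c for c in pixel_columns if toy[c].nunique() == 1]

# Drop the uninformative columns
reduced = toy.drop(columns=constant_columns)

print(constant_columns)
print(reduced.shape)
```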


In [39]:
dataset.drop(columns = ['id'], inplace = True)
# checking for null values
dataset.isnull().sum().value_counts()

Unnamed: 0,count
0,785


# Data Preprocessing

In [40]:
# Split the data into a training set (75%) and a test set (25%), the train_test_split defaults
train_data, test_data = train_test_split(dataset)
print(f"dataset_size: {dataset.shape}")
print(f"dataset_train_size: {train_data.shape}")
print(f"dataset_test_size: {test_data.shape}")

dataset_size: (4000, 785)
dataset_train_size: (3000, 785)
dataset_test_size: (1000, 785)
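Called without extra arguments, `train_test_split` shuffles the rows differently on each run, so the reported accuracy can vary slightly between runs. As a sketch of one possible variation (not what the cell above does), the call below fixes `random_state` and passes `stratify` so that both sets keep the same class proportions. The frame `toy` and the seed value 42 are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with two balanced classes
toy = pd.DataFrame({
    "class": [0, 1] * 20,
    "pixel1": range(40),
})

# Reproducible, class-balanced split: 75% train, 25% test
train_toy, test_toy = train_test_split(
    toy,
    test_size=0.25,
    random_state=42,        # arbitrary seed, fixed for reproducibility
    stratify=toy["class"],  # preserve class proportions in both sets
)

print(train_toy.shape, test_toy.shape)
print(test_toy["class"].value_counts().to_dict())
```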


In [41]:
# Declaring feature vector and target variable
x_train = train_data.drop('class', axis=1)
y_train = train_data['class']
x_test = test_data.drop('class', axis=1)
y_test = test_data['class']

print(f"x_train_size: {x_train.shape}")
print(f"y_train_size: {y_train.shape}")
print(f"x_test_size: {x_test.shape}")
print(f"y_test_size: {y_test.shape}")

x_train_size: (3000, 784)
y_train_size: (3000,)
x_test_size: (1000, 784)
y_test_size: (1000,)
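`StandardScaler` is imported at the top of the notebook but is not applied in the cells shown here. Because the RBF kernel depends on distances between samples, feature scaling often matters for SVMs. The sketch below, on small hypothetical arrays, shows the usual pattern: fit the scaler on the training features only, then reuse the fitted statistics on the test features so that no test information leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: the first column is constant, the second varies
x_train_toy = np.array([[0.0, 10.0], [0.0, 20.0], [0.0, 30.0]])
x_test_toy = np.array([[0.0, 40.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
x_train_scaled = scaler.fit_transform(x_train_toy)

# Apply the training statistics to the test data
x_test_scaled = scaler.transform(x_test_toy)

print(x_train_scaled)
print(x_test_scaled)
```

Constant columns, such as the always-zero border pixels, are left at 0 by `StandardScaler` rather than causing a division by zero.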


# [Method 1] Data Modelling and Prediction with svm.SVC (default parameters)


In [42]:
# Create the model for prediction
# Default hyperparameters include C=1.0, kernel='rbf', and gamma='scale'

# Initialise the classifier with default parameters
svc = SVC()

# Fit the classifier to the training set
svc.fit(x_train, y_train)

# Make predictions on the test set
y_predict = svc.predict(x_test)

# Check the accuracy of the model
accuracy = accuracy_score(y_test, y_predict)
print(f"Accuracy of our model with default parameters is: {accuracy*100}")


Accuracy of our model with default parameters is: 95.7
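The task list also asks for the main hyperparameters, attributes, and methods of the model to be used in practice. The sketch below is one way this could look. It assumes scikit-learn's small bundled `load_digits` dataset as a stand-in for the MNIST CSV so that it runs without a download, and the grid values for `C` and `kernel` are arbitrary choices for illustration. It tunes the hyperparameters with `GridSearchCV`, then inspects the fitted attributes `classes_`, `n_support_`, and `support_vectors_` and calls the `decision_function` method.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images bundled with scikit-learn (not the MNIST CSV)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Scaling and the classifier are chained so the scaler is refit inside each CV fold
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid over the main SVC hyperparameters
param_grid = {
    "svc__C": [1, 10],
    "svc__kernel": ["rbf", "linear"],
    "svc__gamma": ["scale"],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

# Fitted attributes of the best SVC found by the search
best_svc = search.best_estimator_.named_steps["svc"]
print(search.best_params_)
print(best_svc.classes_)                 # class labels seen during fit
print(best_svc.n_support_)               # number of support vectors per class
print(best_svc.support_vectors_.shape)   # (total support vectors, n_features)

# Methods: per-class decision scores for one sample, and test accuracy
scores = search.decision_function(X_test[:1])
test_accuracy = search.score(X_test, y_test)
print(scores.shape)
print(f"{test_accuracy:.3f}")
```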
