# Support Vector Machine

## Table of Contents

1. [**Introduction**](#Intro)   
2. [**Support Vector Machine**](#SVM)
3. [**Model Development**](#ModelDev)
  
    3.1. [**Steps to Implement**](#Proc)
  
    3.2. [**Hyper-parameters**](#Params) 

    3.3. [**Implementation**](#Impl)

4. [**Model Evaluation**](#ModelEval)
5. [**One-Hot Encoding**](#OneHotEncode)
6. [**Final Comments**](#FinCom)

## 1 Introduction <a name="Intro"></a>

A classifier separates the classes in the data using __decision boundaries__, carving the feature space into regions so that every point within a given region is assigned the same label. **Support Vector Machine (SVM)** is a supervised learning algorithm used mostly for classification, and it can be viewed as a relative of logistic regression.




## 2 Support Vector Machine <a name="SVM"></a>
In **SVM**, we are seeking to maximize the margin for the separator between the two classes. 

<img src="https://docs.google.com/uc?export=download&id=1Fg5h0FchJhWXvVsMpupTSqVgqn36aMpv" width="700">

The channel between the two classes is defined by a small number of data points, as opposed to logistic regression, where all the points contribute to the position of the line. These contact points are the __support vectors__ that define the channel.


<img src="https://docs.google.com/uc?export=download&id=10EDJvJ7rGuMdLHNzPJjHt4VK8MfHJD6z" width="500">

(ref: https://www.learnopencv.com/support-vector-machines-svm/)

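The idea that only a few points define the margin can be checked on a small example. The sketch below is an illustration on hypothetical toy data (the arrays `X` and `y` are made up for this purpose), using a linear kernel and a large `C` to approximate a hard margin. It reads the support vectors from `support_vectors_` and computes the margin width as $2/\lVert w \rVert$ from `coef_`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2-D
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.0],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data
clf = SVC(kernel='linear', C=1e3)
clf.fit(X, y)

# For a linear kernel, coef_ holds the normal vector w of the separator
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print('support vectors:\n', clf.support_vectors_)
print('number of support vectors per class:', clf.n_support_)
print('margin width: %0.3f' % margin_width)
```

Only a subset of the six points is reported as support vectors; moving any of the other points (without crossing the margin) would leave the separator unchanged.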
## 3 Model Development <a name="ModelDev"></a>

### 3.1 Steps to Implement <a name="Proc"></a>
Here are the steps to implement __SVM__ in Python using the <font color='blue'>scikit-learn</font> library:

__1.__ Import `SVC`, `train_test_split`, and `MinMaxScaler` from the scikit-learn library, along with the `numpy` library:
```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler  # For normalization
import numpy as np
```

__2.__ Define the dependent variable (target variable) and the independent variables (features) from the dataset:
```python
x_data=np.array(df[['feature1','feature2',...]])
y_data=df['target variable']
```

__3.__ Normalize your data using `MinMaxScaler` (Optional but advised)
```python
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data
```

__4.__ Split the data into train and test sets: `x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data)`



__5.__ Create an SVM classifier object using the constructor: `classifier = SVC(kernel = 'kernel type', C = numeric value, gamma = numeric value) `


__6.__ Use the fit function to fit the model to the training data: `classifier.fit(x_train,y_train)`

__7.__ Then, make predictions using the test data and training data:
```python
yhatTest=classifier.predict(x_test)
yhatTrain=classifier.predict(x_train)
```

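The seven steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the course dataset), and the values of `C`, `gamma`, and `random_state` are arbitrary. It also shows one common variant of step 3, where the data are split first and a `Pipeline` fits the scaler on the training set only, so that no information from the test set enters the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for x_data / y_data (assumption: not the course dataset)
x_data, y_data = make_classification(n_samples=300, n_features=5, random_state=0)

# Split first so that the scaler never sees the test set
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=0)

# The pipeline fits MinMaxScaler on the training data, then fits the SVC
model = make_pipeline(MinMaxScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
model.fit(x_train, y_train)

# Predictions for both sets; the pipeline applies the fitted scaler automatically
yhatTest = model.predict(x_test)
yhatTrain = model.predict(x_train)
print('test accuracy: %0.3f' % model.score(x_test, y_test))
```

Scaling the full dataset before splitting, as in step 3 above, is simpler and usually gives very similar results; the pipeline variant is mainly useful when an unbiased test estimate matters.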

### 3.2 Hyper-parameters <a name="Params"></a>

- <font color='red'> __C__ </font> (regularization parameter) tells the SVM optimization how strongly to avoid misclassifying each training example. If __C__ is high, the optimization chooses a smaller-margin hyperplane, so the misclassification rate on the training data is low. If __C__ is low, the margin is larger, even if some training examples end up misclassified.

- <font color='red'> __Gamma__ </font> defines how far the influence of a single training example reaches. A high gamma means only points close to the plausible hyperplane are considered, while a low gamma means points at a greater distance are considered as well. This parameter only comes into play when the 'rbf', 'poly', or 'sigmoid' kernel is used.

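The effect of the two hyper-parameters can be explored by fitting the same model with a few different settings. The sketch below is an illustration on hypothetical noisy data from `make_moons`; the particular values of `C` and `gamma` are arbitrary choices for the comparison, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical noisy two-class data (assumption: not the course dataset)
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit one RBF SVM per (C, gamma) pair and record accuracies and model size
results = {}
for C in (0.1, 100):
    for gamma in (0.1, 30):
        clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(x_train, y_train)
        results[(C, gamma)] = (clf.score(x_train, y_train),
                               clf.score(x_test, y_test),
                               int(clf.n_support_.sum()))

for (C, gamma), (acc_train, acc_test, n_sv) in results.items():
    print('C=%-5g gamma=%-4g train=%0.3f test=%0.3f support vectors=%d'
          % (C, gamma, acc_train, acc_test, n_sv))
```

Typically the high-C, high-gamma setting fits the training data most closely, and a large gap between training and test accuracy is a hint of overfitting.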
### 3.3 Implementation <a name="Impl"></a>


The steel plates faults dataset is provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. In this dataset, the faults of steel plates are classified into 7 types. Since it was donated on October 26, 2010, this dataset has been widely used in machine learning for automatic pattern recognition. The types of fault and the corresponding numbers of samples are shown in the table below.

<img src="https://docs.google.com/uc?export=download&id=1pw1oJ7plDsTASg_ntI_QSVivQ-tMhlqq" width="500">


The number of samples varies a lot from one category to another. Fault 7 is a special class because it contains all faults other than the first six kinds. In other words, samples in class 7 may have no obvious common characteristics. For every sample, 27 features are recorded, providing evidence for its fault class. All attributes are expressed as integers or real numbers. Detailed information about these 27 independent variables is listed in the following table.

<img src="https://docs.google.com/uc?export=download&id=1lAV-mPa2seL9VWkezbaCicnZVwOup2c6" width="500">


Ref: https://www.sciencedirect.com/science/article/pii/S0925231214012193?casa_token=8ZvcrfiUELkAAAAA:Vt2ShomuyzpagA6Su9nSQHzImgti_HHvtK5zuGqgC01It_Xn9UsccPB-5HVtzBonmsYCibDgYQ



Ref for the dataset: https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults


In [None]:
import pandas as pd

url = ('https://raw.githubusercontent.com/MasoudMiM/ME_364/main/Steel_Plates_Faults/Data.csv')
df = pd.read_csv(url,names=['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter',
                            'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity',
                            'Length_of_Conveyer', 'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
                            'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index', 'Edges_X_Index',
                            'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas', 'Log_X_Index', 'Log_Y_Index',
                            'Orientation_Index', 'Luminosity_Index', 'SigmoidOfAreas', 'Pastry', 'Z_Scratch',
                            'K_Scratch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults'])           
df.head()

In [None]:
df.shape

__Step 1__, importing `SVC`, `train_test_split`, `MinMaxScaler`, and `numpy`

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler  # For normalization
import numpy as np

__Step 2__, defining target variable and independent variables

In [None]:
x_data=np.array(df[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter',
             'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity',
             'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness']])
y_data=df['K_Scratch']

__Step 3__, normalizing using `MinMaxScaler`

In [None]:
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data

__Step 4__, splitting the data into train and test sets

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data,test_size=0.3)

__Step 5__, creating an SVM classifier

In [None]:
classifier = SVC(kernel = 'rbf', C = 0.1,gamma=30)

__Step 6__, fitting to the training data

In [None]:
classifier.fit(x_train,y_train)

__Step 7__, making predictions

In [None]:
yhatTest=classifier.predict(x_test)
yhatTrain=classifier.predict(x_train)

In [None]:
classifier.classes_

## 4 Model Evaluation <a name="ModelEval"></a>

In [None]:
from sklearn.metrics import accuracy_score 
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
acc_scoreTrain = accuracy_score(y_train,yhatTrain)
acc_scoreTest = accuracy_score(y_test,yhatTest)
print('accuracy for training data is %0.3f' %acc_scoreTrain)
print('accuracy for test data is %0.3f' %acc_scoreTest)

In [None]:
J_scoreTrain = jaccard_score(y_train,yhatTrain)
J_scoreTest = jaccard_score(y_test,yhatTest)
print('Jaccard Index for training data is %0.3f' %J_scoreTrain)
print('Jaccard Index for test data is %0.3f' %J_scoreTest)

In [None]:
F_scoreTrain = f1_score(y_train,yhatTrain)
F_scoreTest = f1_score(y_test,yhatTest)
print('F-Score for training data is %0.3f' %F_scoreTrain)
print('F-Score for test data is %0.3f' %F_scoreTest)

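To see what these scores measure, they can be recomputed by hand from the four confusion-matrix counts. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical) with TP = 3, FP = 1, FN = 1, and TN = 3, and checks the formulas accuracy = (TP + TN) / N, Jaccard = TP / (TP + FP + FN), and F1 = 2TP / (2TP + FP + FN) against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, jaccard_score)

# Hypothetical true labels and predictions for a binary problem
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc_manual = (tp + tn) / (tp + tn + fp + fn)
jaccard_manual = tp / (tp + fp + fn)
f1_manual = 2 * tp / (2 * tp + fp + fn)

print('accuracy: %0.3f (manual) vs %0.3f (sklearn)'
      % (acc_manual, accuracy_score(y_true, y_pred)))
print('Jaccard:  %0.3f (manual) vs %0.3f (sklearn)'
      % (jaccard_manual, jaccard_score(y_true, y_pred)))
print('F1:       %0.3f (manual) vs %0.3f (sklearn)'
      % (f1_manual, f1_score(y_true, y_pred)))
```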
In [None]:
# Note: hard 0/1 predictions are passed here, not predicted probabilities
LogLossTrain = log_loss(y_train,yhatTrain)
LogLossTest = log_loss(y_test,yhatTest)
print('Log loss for training data is %0.3f' %LogLossTrain)
print('Log loss for test data is %0.3f' %LogLossTest)

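Log loss is defined on predicted probabilities, whereas `predict` returns hard class labels. One way to obtain probabilities from an SVM is to wrap it in `CalibratedClassifierCV`, as sketched below. This is an illustration on synthetic data under stated assumptions (the dataset, `cv=5`, and the hyper-parameter values are arbitrary), not the evaluation used in the cells of this notebook.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the steel plates dataset)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Wrap the SVC so that it outputs calibrated class probabilities
calibrated = CalibratedClassifierCV(SVC(kernel='rbf', C=1.0, gamma='scale'), cv=5)
calibrated.fit(x_train, y_train)

# One column of probabilities per class, in the order of calibrated.classes_
proba_train = calibrated.predict_proba(x_train)
proba_test = calibrated.predict_proba(x_test)

ll_train = log_loss(y_train, proba_train)
ll_test = log_loss(y_test, proba_test)
print('Log loss for training data is %0.3f' % ll_train)
print('Log loss for test data is %0.3f' % ll_test)
```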
In [None]:
print('Confusion matrix for training data')
CM_scoreTrain = confusion_matrix(y_train,yhatTrain)   # possible option normalize='true'
print(CM_scoreTrain)

print(40*'-')

print('Confusion matrix for test data')
CM_scoreTest = confusion_matrix(y_test,yhatTest)   # possible option normalize='true'
print(CM_scoreTest)

In [None]:
dispTr = ConfusionMatrixDisplay(CM_scoreTrain,display_labels=['No Fault','Fault'])
dispTr.plot()

dispTs = ConfusionMatrixDisplay(CM_scoreTest,display_labels=['No Fault','Fault'])
dispTs.plot()

<font color='red'>__IMPORTANT NOTE:__</font> The SVM algorithm can also be used for multiclass classification, similar to the approach followed when implementing the logistic regression algorithm for multiclass classification.

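As a minimal sketch of the multiclass case, the example below fits `SVC` on the three-class iris dataset bundled with scikit-learn (an assumption made for illustration; it is not the steel plates data). `SVC` accepts a target with more than two classes directly and combines binary classifiers internally, so the calling code is the same as in the binary case.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Three-class example data shipped with scikit-learn
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Same workflow as the binary case: scale, then fit an RBF SVM
model = make_pipeline(MinMaxScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
model.fit(x_train, y_train)

classes = model.named_steps['svc'].classes_
test_accuracy = model.score(x_test, y_test)
print('classes:', classes)
print('accuracy for test data is %0.3f' % test_accuracy)
```

For multiclass targets, `jaccard_score` and `f1_score` need an explicit `average` argument (for example `average='macro'`), since their default assumes a binary problem.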
## 5 One-Hot Encoding <a name="OneHotEncode"></a>

Many machine learning algorithms cannot operate on categorical (label) data directly. They require all input and output variables to be numeric. When we have categorical data and want to use it in binary classification, we need to convert the categorical variable into integer values.

One way to do this is __one-hot encoding__. For instance, a categorical variable with three possible classes is replaced by three new variables, each taking the value 0 or 1.

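The idea can be written out by hand before turning to a real dataset. In the sketch below, the variable `color` and its three classes are hypothetical; each class becomes its own 0/1 column, and every row has exactly one 1.

```python
import pandas as pd

# Hypothetical categorical variable with three possible classes
color = pd.Series(['red', 'green', 'blue', 'green'], name='color')

# One new 0/1 column per class: 1 where the row belongs to that class
classes = sorted(color.unique())
one_hot = pd.DataFrame({c: (color == c).astype(int) for c in classes})

print(one_hot)
```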
Let's demonstrate that using an example with part of the Fuel Economy Data Set, which is produced by the Office of Energy Efficiency and Renewable Energy of the U.S. Department of Energy. Fuel economy data are the result of vehicle testing done at the Environmental Protection Agency's National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA. This data set can be accessed from here: https://github.com/MasoudMiM/ME_364/blob/main/EPA_Green_Vehicle_Guide/EPA_2020_Fuel_Economy.csv and a description of the data is provided at this link: https://www.fueleconomy.gov/feg/EPAGreenGuide/GreenVehicleGuideDocumentation.pdf

In [None]:
import pandas as pd

url = ('https://raw.githubusercontent.com/MasoudMiM/ME_364/main/EPA_Green_Vehicle_Guide/EPA_2020_Fuel_Economy.csv')
dfFuel = pd.read_csv(url)           

dfFuel.drop(columns='Unnamed: 0',inplace=True)
dfFuel.head()

The column __SmartWay__ is categorical and has three possible classes: __No__, __Yes__, and __Elite__.

In [None]:
dfFuel['SmartWay'].unique()

We can transform this variable into a vector of three numerical values for each row, with 1 if that row belongs to a specific class and 0 if it does not. This can be done using the `get_dummies` function from the <font color='blue'>Pandas</font> library.



In [None]:
dfFuel_encoded = pd.get_dummies(data=dfFuel,columns=['SmartWay'])
dfFuel_encoded.head()

In [None]:
dfFuel_encoded.tail()

You can do the same for the __Fuel__ column, which has two unique classes, __Gasoline__ and __Diesel__:

In [None]:
dfFuel['Fuel'].unique()

In [None]:
dfFuel_encoded2 = pd.get_dummies(data=dfFuel,columns=['Fuel'])
dfFuel_encoded2.head()

This increases the number of columns (the dimensions of your dataset) and, as a result, the size of your dataframe, but it lets the learning algorithm use the categorical variable as numeric input.

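If the growth in columns is a concern, one option is the `drop_first` argument of `get_dummies`, which keeps one column fewer than the number of classes; the dropped class is implied when all remaining columns are 0. The sketch below uses a small hypothetical frame (`toy`) rather than the fuel economy data.

```python
import pandas as pd

# Hypothetical frame with a two-class categorical column
toy = pd.DataFrame({'Fuel': ['Gasoline', 'Diesel', 'Gasoline', 'Gasoline']})

# Default: one 0/1 column per class
full = pd.get_dummies(data=toy, columns=['Fuel'], dtype=int)

# drop_first=True: the first class (alphabetically) is dropped
reduced = pd.get_dummies(data=toy, columns=['Fuel'], drop_first=True, dtype=int)

print(full)
print(reduced)
```

For a two-class variable such as this one, a single 0/1 column carries the same information as the pair of columns.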
## 6 Final Comments <a name="FinCom"></a>

- SVM can be very efficient because its decision function uses only a subset of the training data, the support vectors
- It can reach high accuracy and sometimes performs even better than neural networks
- It is not very sensitive to overfitting
- Training time is high for large datasets
- SVM doesn’t perform well when the dataset is noisy (i.e. the target classes overlap)
- One-hot encoding converts categorical data into integer data that can be used for classification