## Support Vector Machine

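A support vector machine (SVM) looks for the decision boundary (a line in two dimensions, a hyperplane in general) that maximizes the margin, that is, the distance between the boundary and the closest points of each class. Those closest points are the support vectors.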
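To make the idea of a margin concrete, here is a minimal sketch on a made-up, linearly separable toy set (the points and the names `X_toy`, `y_toy` and `toy_clf` are assumptions for illustration, not part of the ads dataset). For a linear SVM with weight vector `w`, the margin width is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable points (hypothetical values, not the ads data)
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data
toy_clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = toy_clf.coef_[0]                      # normal vector of the separating line
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin lines

print("support vectors:")
print(toy_clf.support_vectors_)
print("margin width:", margin_width)
```

Only the support vectors determine the boundary; moving any other point (without crossing the margin) leaves the fitted line unchanged.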

A company sent out email offers inviting customers to take an action (make a purchase) or not. For each customer, the dataset records their gender, age, estimated salary, and whether they took the action.

The task is to classify the customers by whether they took the action, and to predict whether a new customer would take it based on their gender, age and estimated salary.

###### Predict
Imagine a customer who is female and 35 years old who earns 100,000 USD. Would she take an action or not? How confident are you in your prediction?

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt

# Feature scaling
from sklearn.preprocessing import StandardScaler

# Split data
from sklearn.model_selection import train_test_split

# SVM
from sklearn.svm import SVC

# Accuracy
from sklearn import metrics

# Confusion matrix
from sklearn.metrics import confusion_matrix

# Classification report
from sklearn.metrics import classification_report

In [2]:
os.getcwd()

'C:\\Users\\Wale'

In [3]:
os.chdir('C:\\Users\\Wale\\Machine Learning data')

In [4]:
data = pd.read_csv("Social_Network_Ads.csv")

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


In [6]:
data.head(10)

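| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |
| 5 | 15728773 | Male | 27 | 58000 | 0 |
| 6 | 15598044 | Female | 27 | 84000 | 0 |
| 7 | 15694829 | Female | 32 | 150000 | 1 |
| 8 | 15600575 | Male | 25 | 33000 | 0 |
| 9 | 15727311 | Female | 35 | 65000 | 0 |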


### Encode the gender column to numerical values of 1s and 0s

In [7]:
data['Gender'] = data['Gender'].astype('category')

In [8]:
data['Gender']=data['Gender'].cat.codes

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   User ID          400 non-null    int64
 1   Gender           400 non-null    int8 
 2   Age              400 non-null    int64
 3   EstimatedSalary  400 non-null    int64
 4   Purchased        400 non-null    int64
dtypes: int64(4), int8(1)
memory usage: 13.0 KB
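It helps to know which gender ends up as 0 and which as 1. The sketch below uses a small made-up series (the name `gender` is an assumption for illustration). With string values, pandas sorts the inferred categories, so `Female` receives code 0 and `Male` receives code 1.

```python
import pandas as pd

# Small illustrative series, not the full dataset
gender = pd.Series(["Male", "Female", "Female", "Male"]).astype("category")

# Inferred string categories are sorted, and codes follow that order
mapping = dict(enumerate(gender.cat.categories))
codes = gender.cat.codes.tolist()

print(mapping)
print(codes)
```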


## Machine Learning

In [10]:
X = data.iloc[ :,1:4]
X

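| | Gender | Age | EstimatedSalary |
|---|---|---|---|
| 0 | 1 | 19 | 19000 |
| 1 | 1 | 35 | 20000 |
| 2 | 0 | 26 | 43000 |
| 3 | 0 | 27 | 57000 |
| 4 | 1 | 19 | 76000 |
| ... | ... | ... | ... |
| 395 | 0 | 46 | 41000 |
| 396 | 1 | 51 | 23000 |
| 397 | 0 | 50 | 20000 |
| 398 | 1 | 36 | 33000 |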


In [11]:
y = data.iloc[:,-1:]
y

| | Purchased |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 395 | 1 |
| 396 | 1 |
| 397 | 1 |
| 398 | 0 |


### Feature scaling

In [12]:
Scaler = StandardScaler()
X_scaled = Scaler.fit_transform(X)
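As a quick check of what `StandardScaler` computes, the sketch below standardizes the first five Age and EstimatedSalary rows shown by `data.head()` and compares the result with the formula `(x - mean) / std`. The names `sample`, `demo_scaler`, `scaled` and `manual` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and EstimatedSalary of the first five customers from data.head()
sample = np.array([[19, 19000],
                   [35, 20000],
                   [26, 43000],
                   [27, 57000],
                   [19, 76000]], dtype=float)

demo_scaler = StandardScaler().fit(sample)
scaled = demo_scaler.transform(sample)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so age and salary contribute on a comparable scale.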

### Split the data

In [13]:
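# Split data (75% train, 25% test).
# Note: this passes the unscaled X, so X_scaled from the previous cell is not used by the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)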

### Fit the model

In [14]:
classifier = SVC(kernel='linear')
# y_train is a one-column DataFrame; flatten it to 1-D to avoid a DataConversionWarning
classifier.fit(X_train, y_train.values.ravel())


SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
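The notebook fits a `StandardScaler` but trains the classifier on the unscaled split. One possible way to tie scaling and model together is a scikit-learn pipeline, sketched below on synthetic data. Everything here is an assumption for illustration: the random features, the invented label rule, and the names `X_demo`, `purchased` and `model`. It is not the notebook's method and its numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n = 400
gender = rng.integers(0, 2, size=n)            # 0 = Female, 1 = Male
age = rng.integers(18, 61, size=n)
salary = rng.integers(15000, 150001, size=n)
# Synthetic label rule, invented for this sketch only
purchased = ((age > 42) | (salary > 110000)).astype(int)

X_demo = np.column_stack([gender, age, salary])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, purchased, test_size=0.25, random_state=42)

# The pipeline fits the scaler on the training rows only and
# reapplies the same scaling whenever it predicts
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

new_customer = [[0, 35, 100000]]               # raw, unscaled values
print("test accuracy:", model.score(X_te, y_te))
print("prediction:", model.predict(new_customer))
# Signed distance to the boundary: larger magnitude means further from it
print("decision value:", model.decision_function(new_customer))
```

The decision value gives a rough per-customer sense of how close a prediction is to the boundary, which overall accuracy does not.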

### Make predictions with test set

In [15]:
test_pred = classifier.predict(X_test)
test_pred

array([0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1], dtype=int64)

In [16]:
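# Predict for the 35-year-old female (Gender code 0) who earns 100,000 USD.
# The classifier was fit on unscaled features, so pass the raw values;
# Scaler.fit_transform on a single row would map every feature to 0.
case = classifier.predict([[0, 35, 100000]])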
case

array([0], dtype=int64)

In this case the woman is predicted not to take action.
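A note on scoring a single new row when a scaler is involved: the scaler should be reused with `transform`, not refit. The sketch below (with the hypothetical names `train_rows`, `fitted`, `reused` and `refit`) shows that calling `fit_transform` on one row subtracts that row's own values as the mean, so every feature becomes 0 and the customer's details are lost.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few encoded rows from the dataset (Gender, Age, EstimatedSalary)
train_rows = np.array([[1, 19, 19000],
                       [1, 35, 20000],
                       [0, 26, 43000],
                       [0, 27, 57000]], dtype=float)
new_customer = np.array([[0, 35, 100000]], dtype=float)

fitted = StandardScaler().fit(train_rows)

reused = fitted.transform(new_customer)               # uses training mean/std
refit = StandardScaler().fit_transform(new_customer)  # refits on one row

print("transform:    ", reused)
print("fit_transform:", refit)
```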

In [17]:
# How accurate is our model
accuracy = metrics.accuracy_score(y_test, test_pred)
accuracy

0.84

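An accuracy of 0.84 means the model classified 84 of the 100 test customers correctly. This is an overall hit rate on the test set, not the confidence of any single prediction. Let's look at the confusion matrix and the classification report.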

In [18]:
# Confusion matrix
cm = confusion_matrix(y_test,test_pred)
cm

array([[54,  9],
       [ 7, 30]], dtype=int64)

In the confusion matrix, rows are the actual classes and columns are the predicted classes. From it we conclude:
1. **True positive:** 30 (we predicted a positive result and it was positive)
2. **True negative:** 54 (we predicted a negative result and it was negative)
3. **False positive:** 9 (we predicted a positive result and it was negative)
4. **False negative:** 7 (we predicted a negative result and it was positive)

The model did better at predicting the negative class (not taking action).
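The four counts can be read off the matrix programmatically. This sketch rebuilds the matrix reported above as a plain array (the names `cm_demo` and `accuracy_demo` are assumptions for illustration) and unpacks it in scikit-learn's order for binary labels: TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted
cm_demo = np.array([[54, 9],
                    [7, 30]])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm_demo.ravel()

# Accuracy is the share of correct predictions among all test rows
accuracy_demo = (tp + tn) / cm_demo.sum()
print(tn, fp, fn, tp, accuracy_demo)
```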

In [19]:
# print the first 25 true and predicted responses
print('True', y_test[0:25])
print('Pred', test_pred[0:25])

True      Purchased
209          0
280          1
33           0
210          1
93           0
84           0
329          1
94           0
266          0
126          0
9            0
361          1
56           0
72           0
132          0
42           0
278          1
376          0
231          0
385          1
77           0
15           0
391          1
271          1
0            0
Pred [0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0]


In [20]:
# classification report
print(classification_report(y_test, test_pred))

              precision    recall  f1-score   support

           0       0.89      0.86      0.87        63
           1       0.77      0.81      0.79        37

    accuracy                           0.84       100
   macro avg       0.83      0.83      0.83       100
weighted avg       0.84      0.84      0.84       100



Recall, which I would assume the business wants to be high, tells us what share of the customers who actually belong to a class the model identified correctly. Recall is 0.86 for the negative class (specificity) and 0.81 for the positive class (sensitivity), so the model identifies non-buyers slightly better than buyers, although the gap is small. Both values are reasonably high.

Precision tells us how often the model is correct when it predicts a class. For the positive class it is 0.77, meaning 77% of the customers predicted to take the action actually did.
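The figures in the classification report follow directly from the confusion matrix counts. A short sketch of the arithmetic for the positive class, plus specificity, using the counts reported above (variable names are assumptions for illustration):

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 54, 9, 7, 30

precision_pos = tp / (tp + fp)   # of predicted buyers, share who bought
recall_pos = tp / (tp + fn)      # sensitivity: of actual buyers, share found
specificity = tn / (tn + fp)     # recall of the negative class
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(precision_pos, 2), round(recall_pos, 2),
      round(specificity, 2), round(f1_pos, 2))
```

Rounded to two decimals, these match the report: 0.77, 0.81, 0.86 and 0.79.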