# Customer Preference Prediction with SVM Classification algorithm

In this project, we are going to classify customers as buyers or non-buyers of a specific product, based on their gender, age and salary.

We will utilize Support Vector Machine (SVM) classification algorithm to classify the customers in order to focus the advertisement activities on the right target group.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
df=pd.read_csv(r'C:\Users\HP\Downloads\Social_Network_Ads.csv')

#I had to stop at this point for the first commit today because the csv file seems  
# to be problematic and cannot be imported. I will fix it and continue tomorrow.

In [4]:
# Now I'm back and have already fixed it! 
# So let's go ahead with taking a look at the dataset

df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [5]:
df.shape

(400, 5)

In [6]:
df.describe()

Unnamed: 0,User ID,Age,EstimatedSalary,Purchased
count,400.0,400.0,400.0,400.0
mean,15691540.0,37.655,69742.5,0.3575
std,71658.32,10.482877,34096.960282,0.479864
min,15566690.0,18.0,15000.0,0.0
25%,15626760.0,29.75,43000.0,0.0
50%,15694340.0,37.0,70000.0,0.0
75%,15750360.0,46.0,88000.0,1.0
max,15815240.0,60.0,150000.0,1.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


In [8]:
df.isnull().sum()

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

Looks like we have a quite clean dataset with no null values! 

Now, let's define our independent (predictor) variables and also the dependent (target) variable.

In [9]:
x=df.iloc[:,[2,3]]
y=df.iloc[:,4]

In [10]:
x.head()

Unnamed: 0,Age,EstimatedSalary
0,19,19000
1,35,20000
2,26,43000
3,27,57000
4,19,76000


In [11]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Purchased, dtype: int64

Now, I need to split the data into two groups, namely; train and test sets.

In [20]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.25)

So, 300 records will be used for training and 100 records for testing the model.

Now let's scale the data.

In [21]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_test)

Now that the data is well preprocessed, let's employ the SVM algorithm to find the best hyperplane that separates different groups of data.

In [22]:
from sklearn.svm import SVC
classifier=SVC(kernel='linear',random_state=0)
classifier.fit(x_train,y_train)

y_predict=classifier.predict(x_test)

And the accuracy will be:

In [23]:
from sklearn import metrics
print('Accuracy with linear kernel:', metrics.accuracy_score(y_test,y_predict))

Accuracy with linear kernel: 0.83


Now let's change the kernel and see how can it affect the accuracy of our prediction:

In [24]:
from sklearn.svm import SVC
classifier=SVC(kernel='rbf') # rbf stands for radial basis function
classifier.fit(x_train,y_train)

y_predict=classifier.predict(x_test)
from sklearn import metrics
print('Accuracy with RBF kernel:', metrics.accuracy_score(y_test,y_predict))

Accuracy with RBF kernel: 0.95


Great! That's quite an improvement! As can be clearly seen, the kernel type considerably impacts the accuracy level of the SVM model predictions!

Now, let's play with the parameters of the RBF kernel to see how modifying these parameters can affect the accuracy.

In [27]:
from sklearn.svm import SVC
classifier=SVC(kernel='rbf',gamma=15,C=7,random_state=0)
classifier.fit(x_train,y_train)

y_predict=classifier.predict(x_test)
from sklearn import metrics
print('Accuracy with Modified RBF kernel:', metrics.accuracy_score(y_test,y_predict))

Accuracy with Modified RBF kernel: 0.82


We can also see that any improper tuning of these model parameters can severly damage the accuracy of our model. So, it is crucial to set these parameters to the optimal values dependig on the case under study.

Now let's go ahead and check the polynomial kernel.

In [34]:
from sklearn.svm import SVC
classifier=SVC(kernel='poly',degree=5)
classifier.fit(x_train,y_train)

y_predict=classifier.predict(x_test)
from sklearn import metrics
print('Accuracy with Modified Polynomial kernel:', metrics.accuracy_score(y_test,y_predict))

Accuracy with Modified Polynomial kernel: 0.79


Again, by changing the degree of the polynomial, we can get different results in terms of accuracy. This again highlights the importance of parameter selection for the model.