# **Support Vector Machine (SVM) - Salary Data**

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [2]:
# Read csv file and store data for processing
df_sal_train = pd.read_csv('SalaryData_Train.csv')
df_sal_train.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## **Exploratory Data Analysis (EDA)**

In [3]:
df_sal_train.shape

(30161, 14)

In [4]:
df_sal_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30161 entries, 0 to 30160
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   age            30161 non-null  int64 
 1   workclass      30161 non-null  object
 2   education      30161 non-null  object
 3   educationno    30161 non-null  int64 
 4   maritalstatus  30161 non-null  object
 5   occupation     30161 non-null  object
 6   relationship   30161 non-null  object
 7   race           30161 non-null  object
 8   sex            30161 non-null  object
 9   capitalgain    30161 non-null  int64 
 10  capitalloss    30161 non-null  int64 
 11  hoursperweek   30161 non-null  int64 
 12  native         30161 non-null  object
 13  Salary         30161 non-null  object
dtypes: int64(5), object(9)
memory usage: 3.2+ MB


Note: The maritalstatus, relationship, and race columns are assumed to have little influence on salary, so they are dropped.

In [5]:
df_sal_train = df_sal_train.drop(['maritalstatus', 'relationship', 'race'], axis=1)
df_sal_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30161 entries, 0 to 30160
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   age           30161 non-null  int64 
 1   workclass     30161 non-null  object
 2   education     30161 non-null  object
 3   educationno   30161 non-null  int64 
 4   occupation    30161 non-null  object
 5   sex           30161 non-null  object
 6   capitalgain   30161 non-null  int64 
 7   capitalloss   30161 non-null  int64 
 8   hoursperweek  30161 non-null  int64 
 9   native        30161 non-null  object
 10  Salary        30161 non-null  object
dtypes: int64(5), object(6)
memory usage: 2.5+ MB


In [6]:
df_sal_train.shape

(30161, 11)

In [7]:
# Convert all string object columns to categorical columns
columns = ['workclass', 'education', 'occupation', 'sex', 'native', 'Salary']

for col in columns:
  df_sal_train[col] = df_sal_train[col].astype('category')

df_sal_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30161 entries, 0 to 30160
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   age           30161 non-null  int64   
 1   workclass     30161 non-null  category
 2   education     30161 non-null  category
 3   educationno   30161 non-null  int64   
 4   occupation    30161 non-null  category
 5   sex           30161 non-null  category
 6   capitalgain   30161 non-null  int64   
 7   capitalloss   30161 non-null  int64   
 8   hoursperweek  30161 non-null  int64   
 9   native        30161 non-null  category
 10  Salary        30161 non-null  category
dtypes: category(6), int64(5)
memory usage: 1.3 MB


In [8]:
# Change categorical input features to dummies
# Encode the output label as numeric category codes
df_temp = pd.get_dummies(df_sal_train.drop('Salary', axis=1))
df_temp['Salary'] = df_sal_train['Salary'].cat.codes 

df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30161 entries, 0 to 30160
Data columns (total 85 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   age                                 30161 non-null  int64
 1   educationno                         30161 non-null  int64
 2   capitalgain                         30161 non-null  int64
 3   capitalloss                         30161 non-null  int64
 4   hoursperweek                        30161 non-null  int64
 5   workclass_ Federal-gov              30161 non-null  uint8
 6   workclass_ Local-gov                30161 non-null  uint8
 7   workclass_ Private                  30161 non-null  uint8
 8   workclass_ Self-emp-inc             30161 non-null  uint8
 9   workclass_ Self-emp-not-inc         30161 non-null  uint8
 10  workclass_ State-gov                30161 non-null  uint8
 11  workclass_ Without-pay              30161 non-null  uint8
 12  educ
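How `cat.codes` numbers the two salary classes matters when reading the confusion matrices later on. The following is a minimal sketch on toy data, assuming the raw labels carry a leading space, as the dummy column names (for example `workclass_ Federal-gov`) suggest. The `salary` series here is illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy labels; the leading space is an assumption about the raw Salary values
salary = pd.Series([' <=50K', ' >50K', ' <=50K', ' >50K', ' >50K']).astype('category')

# Categories are stored in sorted order, and each code is an index into that list
code_to_label = dict(enumerate(salary.cat.categories))
codes = salary.cat.codes.tolist()

print(code_to_label)
print(codes)
```

Under this assumption, code 0 stands for the `<=50K` class and code 1 for the `>50K` class, which is how the rows of the classification reports would be read.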

In [9]:
# Separate out x and y variables
x = df_temp.drop('Salary', axis=1)
y = df_temp['Salary']

print('x shape: ', x.shape)
print('y shape: ', y.shape)

x shape:  (30161, 84)
y shape:  (30161,)


## **Hyperparameter Tuning**

Note: The training set is large (30,161 rows), and a GridSearchCV run to find the best hyperparameters did not finish within 20 minutes.

The first attempt used dummy columns for the categorical inputs. Execution did not complete after 20 minutes, even on a GPU runtime.

The second attempt replaced the dummy columns with numeric category codes to reduce the number of features. This also did not complete after 10 minutes.

Hyperparameter tuning was therefore skipped for this problem.
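One way the search might be made tractable is to tune on a stratified subsample, with features scaled before the RBF kernel. scikit-learn's `SVC` runs on the CPU, so a GPU runtime would not be expected to speed it up. The sketch below uses synthetic data from `make_classification` as a stand-in for `x` and `y`. The subsample fraction, the grid values, and the names `X_demo`, `X_sub`, and `search` are all assumptions for illustration, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the salary features and labels (sizes are arbitrary)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.75, 0.25], random_state=0
)

# Keep a stratified 25% subsample so the class ratio is preserved
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, train_size=0.25, stratify=y_demo, random_state=0
)

# Scale the features, then search a deliberately small grid with 3-fold CV
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.01]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_sub, y_sub)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best parameters found on the subsample could then be used to fit a single model on the full training set, which costs one fit rather than one fit per grid point and fold.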

## **SVM Model Creation**

In [10]:
svc = SVC(C=0.1, kernel='rbf')

In [11]:
svc.fit(x,y)

SVC(C=0.1)

In [12]:
preds = svc.predict(x)

In [13]:
confusion_matrix(y, preds)

array([[21942,   711],
       [ 5449,  2059]])

In [14]:
print(classification_report(y,preds))

              precision    recall  f1-score   support

           0       0.80      0.97      0.88     22653
           1       0.74      0.27      0.40      7508

    accuracy                           0.80     30161
   macro avg       0.77      0.62      0.64     30161
weighted avg       0.79      0.80      0.76     30161



In [15]:
accuracy_score(y,preds)

0.7957627399622028
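The figures in the classification report can be recomputed directly from the confusion matrix, which shows where the low recall for class 1 comes from. This is a small worked check using the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Training confusion matrix from above: rows = true class, columns = predicted class
cm = np.array([[21942, 711],
               [5449, 2059]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_1 = tp / (tp + fp)  # share of rows predicted as class 1 that are class 1
recall_1 = tp / (tp + fn)     # share of true class-1 rows that were found

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
```

The 5,449 class-1 rows predicted as class 0 are what drive recall for class 1 down to 0.27, even though overall accuracy is close to 0.80.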

Note: The SVM model reaches about 80% accuracy (0.796) on the training data.

## **Prediction for Test Data**

In [16]:
# Read test csv file and store data for processing
df_sal_test = pd.read_csv('SalaryData_Test.csv')
df_sal_test.head()

Unnamed: 0,age,workclass,education,educationno,maritalstatus,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native,Salary
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,34,Private,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K


**EDA for Test Data**

In [17]:
df_sal_test.shape

(15060, 14)

In [18]:
df_sal_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15060 entries, 0 to 15059
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   age            15060 non-null  int64 
 1   workclass      15060 non-null  object
 2   education      15060 non-null  object
 3   educationno    15060 non-null  int64 
 4   maritalstatus  15060 non-null  object
 5   occupation     15060 non-null  object
 6   relationship   15060 non-null  object
 7   race           15060 non-null  object
 8   sex            15060 non-null  object
 9   capitalgain    15060 non-null  int64 
 10  capitalloss    15060 non-null  int64 
 11  hoursperweek   15060 non-null  int64 
 12  native         15060 non-null  object
 13  Salary         15060 non-null  object
dtypes: int64(5), object(9)
memory usage: 1.6+ MB


In [19]:
# Drop the same low-impact columns that were dropped from the training dataset
df_sal_test = df_sal_test.drop(['maritalstatus', 'relationship', 'race'], axis=1)
df_sal_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15060 entries, 0 to 15059
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   age           15060 non-null  int64 
 1   workclass     15060 non-null  object
 2   education     15060 non-null  object
 3   educationno   15060 non-null  int64 
 4   occupation    15060 non-null  object
 5   sex           15060 non-null  object
 6   capitalgain   15060 non-null  int64 
 7   capitalloss   15060 non-null  int64 
 8   hoursperweek  15060 non-null  int64 
 9   native        15060 non-null  object
 10  Salary        15060 non-null  object
dtypes: int64(5), object(6)
memory usage: 1.3+ MB


In [20]:
df_sal_test.shape

(15060, 11)

In [21]:
# Convert all string object columns to categorical columns
columns = ['workclass', 'education', 'occupation', 'sex', 'native', 'Salary']

for col in columns:
  df_sal_test[col] = df_sal_test[col].astype('category')

df_sal_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15060 entries, 0 to 15059
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   age           15060 non-null  int64   
 1   workclass     15060 non-null  category
 2   education     15060 non-null  category
 3   educationno   15060 non-null  int64   
 4   occupation    15060 non-null  category
 5   sex           15060 non-null  category
 6   capitalgain   15060 non-null  int64   
 7   capitalloss   15060 non-null  int64   
 8   hoursperweek  15060 non-null  int64   
 9   native        15060 non-null  category
 10  Salary        15060 non-null  category
dtypes: category(6), int64(5)
memory usage: 679.9 KB


In [22]:
# Change categorical input features to dummies
# Encode the output label as numeric category codes
df_temp1 = pd.get_dummies(df_sal_test.drop('Salary', axis=1))
df_temp1['Salary'] = df_sal_test['Salary'].cat.codes 

df_temp1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15060 entries, 0 to 15059
Data columns (total 85 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   age                                 15060 non-null  int64
 1   educationno                         15060 non-null  int64
 2   capitalgain                         15060 non-null  int64
 3   capitalloss                         15060 non-null  int64
 4   hoursperweek                        15060 non-null  int64
 5   workclass_ Federal-gov              15060 non-null  uint8
 6   workclass_ Local-gov                15060 non-null  uint8
 7   workclass_ Private                  15060 non-null  uint8
 8   workclass_ Self-emp-inc             15060 non-null  uint8
 9   workclass_ Self-emp-not-inc         15060 non-null  uint8
 10  workclass_ State-gov                15060 non-null  uint8
 11  workclass_ Without-pay              15060 non-null  uint8
 12  educ

In [23]:
# Separate out x and y variables
x_test = df_temp1.drop('Salary', axis=1)
y_test = df_temp1['Salary']

print('x_test shape: ', x_test.shape)
print('y_test shape: ', y_test.shape)

x_test shape:  (15060, 84)
y_test shape:  (15060,)
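Here the test features have the same 84 columns as the training features, but calling `pd.get_dummies` separately on two files only lines up when both contain the same category levels. A hedged sketch of one way to guard against a mismatch, using toy frames (`train_raw` and `test_raw` are hypothetical and not taken from the salary files):

```python
import pandas as pd

# Toy frames: the hypothetical test split lacks the 'Cuba' level seen in training
train_raw = pd.DataFrame({'age': [39, 50, 28],
                          'native': ['United-States', 'Cuba', 'United-States']})
test_raw = pd.DataFrame({'age': [25, 38],
                         'native': ['United-States', 'United-States']})

train_x = pd.get_dummies(train_raw)
test_x = pd.get_dummies(test_raw)

# Reindex the test columns to the training layout: missing dummies are filled
# with 0, and any test-only dummies are dropped
test_aligned = test_x.reindex(columns=train_x.columns, fill_value=0)

print(list(train_x.columns))
print(list(test_aligned.columns))
```

After the reindex, the test frame has the same columns in the same order as the training frame, which is what `svc.predict` relies on.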


**Prediction for Test Data**

In [24]:
preds_test = svc.predict(x_test)

In [25]:
confusion_matrix(y_test, preds_test)

array([[10991,   369],
       [ 2703,   997]])

In [26]:
print(classification_report(y_test, preds_test))

              precision    recall  f1-score   support

           0       0.80      0.97      0.88     11360
           1       0.73      0.27      0.39      3700

    accuracy                           0.80     15060
   macro avg       0.77      0.62      0.64     15060
weighted avg       0.78      0.80      0.76     15060



In [27]:
accuracy_score(y_test, preds_test)

0.7960159362549801
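Because the classes are imbalanced, it helps to compare this accuracy with what always predicting the majority class would score. This is a small worked check using the test confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Test confusion matrix from above: rows = true class, columns = predicted class
cm_test = np.array([[10991, 369],
                    [2703, 997]])

support = cm_test.sum(axis=1)                  # number of true rows per class
baseline_acc = support.max() / support.sum()   # always predict the majority class
model_acc = np.trace(cm_test) / cm_test.sum()  # correct predictions / all rows

print(round(baseline_acc, 4), round(model_acc, 4))
```

The majority-class baseline is about 0.754, so the model's 0.796 is a gain of roughly four percentage points over predicting class 0 for every row.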

**Conclusion: The SVM model achieves about 80% accuracy (0.796) on the test data, although recall for class 1 is only 0.27.**