# Part 8: SVM

Evaluating different SVM models on the [Credit Card dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud)

\*Note: We suggest running the notebook in a Google Colab environment

## Preparing the environment

For this exercise, we are going to use a Google Colab notebook and make use of GPU parallelization, since SVMs can take a lot of training time.

For this purpose, we are going to use the `thundersvm` module which can help us utilize a graphics card to drastically reduce training time.

We install CUDA and `thundersvm` in our Colab environment.

In [1]:
!wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64-deb
!dpkg -i cuda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64-deb
!apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
!apt-get update
!sudo apt-get install cuda-9.0
!nvcc --version
!pip install thundersvm


We are also going to mount our Google Drive, which contains the credit card dataset we are going to use.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We import all needed libraries, including `thundersvm` for our SVM model, and over- and under-sampling techniques from `imblearn`, which will help us with our imbalanced dataset.

We will also use `sklearn` for model evaluation, the train-test split and feature standardization.

We also need `pandas` to manage our data and `tqdm` to show training progress with a progress bar.

In [3]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from thundersvm import SVC
# from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
import pandas as pd
from tqdm import tqdm



## Data preparation

We load our data from Google Drive.

In [4]:
# creditcard_df = pd.read_csv('creditcard.csv')
creditcard_df = pd.read_csv('/content/drive/MyDrive/creditcard.csv')
creditcard_df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.551600,-0.617801,-0.991390,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.119670,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,4.356170,-1.593105,2.711941,-0.689256,4.626942,-0.924459,1.107641,1.991691,0.510632,-0.682920,1.475829,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,-0.975926,-0.150189,0.915802,1.214756,-0.675143,1.164931,-0.711757,-0.025693,-1.221179,-1.545556,0.059616,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,-0.484782,0.411614,0.063119,-0.183699,-0.510602,1.329284,0.140716,0.313502,0.395652,-0.577252,0.001396,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,-0.399126,-1.933849,-0.962886,-1.042082,0.449624,1.962563,-0.608577,0.509928,1.113981,2.897849,0.127434,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


The dataset is extremely imbalanced: fewer than 0.2% of the transactions belong to the fraudulent class.

In [5]:
creditcard_df.value_counts(['Class'])

Class
0        284315
1           492
dtype: int64
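
The counts above also explain why accuracy alone is a weak metric here. The following is a quick arithmetic sketch that uses only the two class counts printed above (the variable names are illustrative): a model that labels every transaction as legitimate would already reach roughly 99.8% accuracy.

```python
# Class counts copied from the value_counts output above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the whole dataset.
fraud_share = n_fraud / n_total

# Accuracy of a trivial model that always predicts class 0 (legitimate).
baseline_accuracy = n_legit / n_total

print(f"fraud share:       {fraud_share:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```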

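We are going to standardize the `Amount` column, since its values have a much greater range than those of the other columns.

SVM kernels depend on distances or dot products between samples, so a feature on a much larger scale would otherwise dominate the others.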

In [6]:
creditcard_df.Amount = StandardScaler().fit_transform(creditcard_df.Amount.values.reshape((-1,1)))
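
As a sketch of what this step computes, the snippet below standardizes a handful of amounts by hand. It assumes the usual definition behind `StandardScaler`, z = (x - mean) / std with the population standard deviation. It also fits the statistics on a training portion only and then applies them to held-out values, which is an alternative to fitting on the full column as the cell above does. The helper names and the toy split are hypothetical.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return the mean and population standard deviation of the values."""
    return fmean(values), pstdev(values)

def apply_standardizer(values, mean, std):
    """Scale values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

# First five Amount values from the table above, used as a toy training portion.
train_amounts = [149.62, 2.69, 378.66, 123.50, 69.99]
# Last four Amount values shown in the table, treated here as held-out data.
test_amounts = [0.77, 24.79, 67.88, 10.00]

mean, std = fit_standardizer(train_amounts)
train_scaled = apply_standardizer(train_amounts, mean, std)
test_scaled = apply_standardizer(test_amounts, mean, std)

print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

The held-out values are scaled with the training mean and standard deviation, so no information from them enters the fitted statistics.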

## Modeling, Training and Evaluation

We start by using the data as it is. In the next step we are going to use SMOTE to oversample the fraudulent minority class and random undersampling to shrink the majority class.

We split into features and targets.

In [7]:
features = creditcard_df.drop(columns=['Time', 'Class'])
targets = creditcard_df.Class

We then split our data into training and test sets, stratifying on the target so that both sets keep the same fraud ratio.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets)

### Using regular features

#### Modeling

The following are our model parameters. Each entry lists `C`, the kernel, `gamma` and, optionally, the polynomial degree.

In [9]:
parameter_list = [
    # [0.1, 'poly', 0.2, 2],
    # [10, 'poly', 6, 5],
    [0.1, 'polynomial', 0.2, 2],
    [10, 'polynomial', 6, 5],
    [0.1, 'rbf', 0.3],
    [10, 'rbf', 5],
    [0.1, 'sigmoid', 0.5],
    [10, 'sigmoid', 2],
    [100, 'sigmoid', 5]
]

We create a helper function that builds a dictionary of models keyed by their parameters. The degree defaults to 3 when an entry does not specify it.

In [10]:
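def make_models(param_list):
    """Build one SVC per parameter entry, keyed by the joined parameters."""
    models = dict()
    for params in param_list:
        # Each entry is [C, kernel, gamma] with an optional fourth item, the degree.
        degree = params[3] if len(params) > 3 else 3
        key = '_'.join(str(i) for i in params)
        models[key] = SVC(C=params[0],
                          kernel=params[1],
                          gamma=params[2],
                          degree=degree)
    return models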
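
The commented-out lines in the import and parameter cells suggest a CPU fallback with `sklearn.svm.SVC`, which names the polynomial kernel `'poly'` where the parameter list here uses `'polynomial'`. Below is a minimal sketch of turning one parameter entry into keyword arguments for either backend. The function name `to_svc_kwargs`, the `backend` argument and the mapping constant are hypothetical and not part of either library.

```python
# Kernel names that differ between the two backends (assumed mapping).
SKLEARN_KERNEL_NAMES = {'polynomial': 'poly'}

def to_svc_kwargs(params, backend='thundersvm'):
    """Translate [C, kernel, gamma, (degree)] into SVC keyword arguments."""
    kernel = params[1]
    if backend == 'sklearn':
        # Rename only the kernels whose spelling differs.
        kernel = SKLEARN_KERNEL_NAMES.get(kernel, kernel)
    return {
        'C': params[0],
        'kernel': kernel,
        'gamma': params[2],
        'degree': params[3] if len(params) > 3 else 3,
    }

print(to_svc_kwargs([0.1, 'polynomial', 0.2, 2], backend='sklearn'))
print(to_svc_kwargs([10, 'rbf', 5]))
```

The resulting dictionary could then be passed as `SVC(**kwargs)` to whichever class was imported.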

We create our models.

In [11]:
models = make_models(parameter_list)
models

{'0.1_polynomial_0.2_2': SVC(C=0.1, cache_size=None, class_weight=None, coef0=0.0,
     decision_function_shape='ovo', degree=2, gamma=0.2, gpu_id=0,
     kernel='polynomial', max_iter=-1, max_mem_size=-1, n_jobs=-1,
     probability=False, random_state=None, shrinking=False, tol=0.001,
     verbose=False),
 '0.1_rbf_0.3': SVC(C=0.1, cache_size=None, class_weight=None, coef0=0.0,
     decision_function_shape='ovo', degree=3, gamma=0.3, gpu_id=0, kernel='rbf',
     max_iter=-1, max_mem_size=-1, n_jobs=-1, probability=False,
     random_state=None, shrinking=False, tol=0.001, verbose=False),
 '0.1_sigmoid_0.5': SVC(C=0.1, cache_size=None, class_weight=None, coef0=0.0,
     decision_function_shape='ovo', degree=3, gamma=0.5, gpu_id=0,
     kernel='sigmoid', max_iter=-1, max_mem_size=-1, n_jobs=-1,
     probability=False, random_state=None, shrinking=False, tol=0.001,
     verbose=False),
 '100_sigmoid_5': SVC(C=100, cache_size=None, class_weight=None, coef0=0.0,
     decision_function_sha

And train them.

In [12]:
[model.fit(x_train, y_train) for model in tqdm(models.values())]

100%|██████████| 7/7 [00:49<00:00,  7.06s/it]


[SVC(C=0.1, cache_size=None, class_weight={}, coef0=0.0,
     decision_function_shape='ovo', degree=2, gamma=0.2, gpu_id=0,
     kernel='polynomial', max_iter=-1, max_mem_size=-1, n_jobs=-1,
     probability=False, random_state=None, shrinking=False, tol=0.001,
     verbose=False), SVC(C=10, cache_size=None, class_weight={}, coef0=0.0,
     decision_function_shape='ovo', degree=5, gamma=6, gpu_id=0,
     kernel='polynomial', max_iter=-1, max_mem_size=-1, n_jobs=-1,
     probability=False, random_state=None, shrinking=False, tol=0.001,
     verbose=False), SVC(C=0.1, cache_size=None, class_weight={}, coef0=0.0,
     decision_function_shape='ovo', degree=3, gamma=0.3, gpu_id=0, kernel='rbf',
     max_iter=-1, max_mem_size=-1, n_jobs=-1, probability=False,
     random_state=None, shrinking=False, tol=0.001, verbose=False), SVC(C=10, cache_size=None, class_weight={}, coef0=0.0,
     decision_function_shape='ovo', degree=3, gamma=5, gpu_id=0, kernel='rbf',
     max_iter=-1, max_mem_size=-1,

#### Evaluation

We proceed to evaluate. We want to measure test accuracy, precision, recall and F1 score.

In [13]:
y_predicted = [model.predict(x_test) for model in models.values()]

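The degree-2 polynomial model performs best, with an F1 score of about 0.86. The degree-5 polynomial model keeps recall at about 0.80, but its precision drops to about 0.31. The RBF and sigmoid models miss most fraud cases (recall below 0.12), even though every model exceeds 99.6% accuracy because of the class imbalance.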

In [14]:
accuracies = [metrics.accuracy_score(y_test, pred) for pred in y_predicted]
precisions = [metrics.precision_score(y_test, pred) for pred in y_predicted]
recalls = [metrics.recall_score(y_test, pred) for pred in y_predicted]
f1s = [metrics.f1_score(y_test, pred) for pred in y_predicted]

for key, acc, prec, rec, f1 in zip(models.keys(), accuracies, precisions, recalls, f1s):
  print(key, 'Accuracy:', acc, 'Precision:', prec, 'Recall:', rec, 'F1:', f1)



0.1_polynomial_0.2_2 Accuracy: 0.9995505744220669 Precision: 0.9099099099099099 Recall: 0.8211382113821138 F1: 0.8632478632478633
10_polynomial_6_5 Accuracy: 0.9965450408696385 Precision: 0.308411214953271 Recall: 0.8048780487804879 F1: 0.44594594594594594
0.1_rbf_0.3 Accuracy: 0.9982725204348193 Precision: 0.0 Recall: 0.0 F1: 0.0
10_rbf_5 Accuracy: 0.9984550995758547 Precision: 1.0 Recall: 0.10569105691056911 F1: 0.1911764705882353
0.1_sigmoid_0.5 Accuracy: 0.9974438920255049 Precision: 0.12658227848101267 Recall: 0.08130081300813008 F1: 0.09900990099009901
10_sigmoid_2 Accuracy: 0.9965731299682593 Precision: 0.05185185185185185 Recall: 0.056910569105691054 F1: 0.05426356589147286
100_sigmoid_5 Accuracy: 0.9969944664475717 Precision: 0.11764705882352941 Recall: 0.11382113821138211 F1: 0.11570247933884298
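
To make the metrics above concrete, here is a worked calculation for the `0.1_polynomial_0.2_2` model. The confusion counts are not printed by the notebook. They are inferred from the reported ratios (precision 101/111, recall 101/123) and from a 25% test split of the 284,807 rows, so treat them as an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Assumed counts, inferred from the printed scores (not notebook output).
tp, fp, fn = 101, 10, 22
n_test = 71202               # assumed size of the 25% test split
tn = n_test - tp - fp - fn   # remaining test rows are true negatives

accuracy, precision, recall, f1 = classification_metrics(tp, fp, fn, tn)
print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

Under these counts the model makes only 32 errors, which keeps accuracy very high, while precision, recall and F1 reflect the 10 falsely flagged and 22 missed fraud cases.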


### Using SMOTE

We repeat the same procedure using SMOTE.

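We create a processing pipeline with over- and under-sampling as steps: SMOTE first generates synthetic fraud samples until the minority class is 10% the size of the majority class, and random undersampling then reduces the majority class to twice the size of the minority class. We then apply the pipeline to produce the resampled data.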

In [15]:
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps)
smote_features, smote_targets = pipeline.fit_resample(features, targets)
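
The two `sampling_strategy` values determine the class sizes after resampling. The arithmetic below is a sketch that assumes `imblearn`'s convention that a float strategy is the desired minority-to-majority ratio after the step, with counts truncated to integers (the exact rounding is an assumption).

```python
# Class counts of the original dataset.
n_legit, n_fraud = 284315, 492

# SMOTE(sampling_strategy=0.1): grow the minority to 10% of the majority.
n_fraud_after_smote = int(0.1 * n_legit)

# RandomUnderSampler(sampling_strategy=0.5): shrink the majority until
# minority / majority equals 0.5.
n_legit_after_under = int(n_fraud_after_smote / 0.5)

# Fraud samples that SMOTE had to synthesize.
n_synthetic = n_fraud_after_smote - n_fraud

print("fraud samples:     ", n_fraud_after_smote, f"({n_synthetic} synthetic)")
print("legitimate samples:", n_legit_after_under)
print("total:             ", n_fraud_after_smote + n_legit_after_under)
```

Under these assumptions most of the fraud samples in the resampled set are synthetic, which matters when the same set is later used for testing.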



We split the resampled data into training and test sets.

In [16]:
x_train, x_test, y_train, y_test = train_test_split(smote_features, smote_targets, stratify=smote_targets)

#### Modeling

And build our models once more.

In [17]:
models = make_models(parameter_list)
models

{'0.1_polynomial_0.2_2': SVC(C=0.1, cache_size=None, class_weight=None, coef0=0.0,
     decision_function_shape='ovo', degree=2, gamma=0.2, gpu_id=0,
     kernel='polynomial', max_iter=-1, max_mem_size=-1, n_jobs=-1,
     probability=False, random_state=None, shrinking=False, tol=0.001,
     verbose=False),
 '0.1_rbf_0.3': SVC(C=0.1, cache_size=None, class_weight=None, coef0=0.0,
     decision_function_shape='ovo', degree=3, gamma=0.3, gpu_id=0, kernel='rbf',
     max_iter=-1, max_mem_size=-1, n_jobs=-1, probability=False,
     random_state=None, shrinking=False, tol=0.001, verbose=False),
 '0.1_sigmoid_0.5': SVC(C=0.1, cache_size=None, class_weight=None, coef0=0.0,
     decision_function_shape='ovo', degree=3, gamma=0.5, gpu_id=0,
     kernel='sigmoid', max_iter=-1, max_mem_size=-1, n_jobs=-1,
     probability=False, random_state=None, shrinking=False, tol=0.001,
     verbose=False),
 '100_sigmoid_5': SVC(C=100, cache_size=None, class_weight=None, coef0=0.0,
     decision_function_sha

We train.

In [18]:
[model.fit(x_train, y_train) for model in tqdm(models.values())]

100%|██████████| 7/7 [01:15<00:00, 10.83s/it]


[SVC(C=0.1, cache_size=None, class_weight={}, coef0=0.0,
     decision_function_shape='ovo', degree=2, gamma=0.2, gpu_id=0,
     kernel='polynomial', max_iter=-1, max_mem_size=-1, n_jobs=-1,
     probability=False, random_state=None, shrinking=False, tol=0.001,
     verbose=False), SVC(C=10, cache_size=None, class_weight={}, coef0=0.0,
     decision_function_shape='ovo', degree=5, gamma=6, gpu_id=0,
     kernel='polynomial', max_iter=-1, max_mem_size=-1, n_jobs=-1,
     probability=False, random_state=None, shrinking=False, tol=0.001,
     verbose=False), SVC(C=0.1, cache_size=None, class_weight={}, coef0=0.0,
     decision_function_shape='ovo', degree=3, gamma=0.3, gpu_id=0, kernel='rbf',
     max_iter=-1, max_mem_size=-1, n_jobs=-1, probability=False,
     random_state=None, shrinking=False, tol=0.001, verbose=False), SVC(C=10, cache_size=None, class_weight={}, coef0=0.0,
     decision_function_shape='ovo', degree=3, gamma=5, gpu_id=0, kernel='rbf',
     max_iter=-1, max_mem_size=-1,

#### Evaluation

And evaluate.

In [19]:
y_predicted = [model.predict(x_test) for model in models.values()]

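The polynomial and RBF models reach F1 scores above 0.94 on the resampled data, while the sigmoid models remain weak. These numbers are not directly comparable with the previous run: the split was made after resampling, so the test set is itself rebalanced (one fraud sample for every two legitimate ones) and contains synthetic SMOTE samples.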

In [20]:
accuracies = [metrics.accuracy_score(y_test, pred) for pred in y_predicted]
precisions = [metrics.precision_score(y_test, pred) for pred in y_predicted]
recalls = [metrics.recall_score(y_test, pred) for pred in y_predicted]
f1s = [metrics.f1_score(y_test, pred) for pred in y_predicted]

for key, acc, prec, rec, f1 in zip(models.keys(), accuracies, precisions, recalls, f1s):
  print(key, 'Accuracy:', acc, 'Precision:', prec, 'Recall:', rec, 'F1:', f1)

0.1_polynomial_0.2_2 Accuracy: 0.9905740011254924 Precision: 0.9902058197303052 Recall: 0.9814293753517164 F1: 0.9857980640146965
10_polynomial_6_5 Accuracy: 0.9915588069780529 Precision: 0.9772664645907964 Recall: 0.9978897017445132 F1: 0.9874704162606153
0.1_rbf_0.3 Accuracy: 0.9732226599137123 Precision: 0.9981710105166895 Recall: 0.9213562183455262 F1: 0.9582266442314726
10_rbf_5 Accuracy: 0.9648283624085537 Precision: 0.9998427672955975 Recall: 0.894625773776027 F1: 0.9443124443124443
0.1_sigmoid_0.5 Accuracy: 0.27541737009941847 Precision: 0.23676405628825645 Recall: 0.5278559369724254 F1: 0.3269004574166848
10_sigmoid_2 Accuracy: 0.687816544738323 Precision: 0.5168774792305965 Recall: 0.9715813168261114 F1: 0.6747764912794958
100_sigmoid_5 Accuracy: 0.6874413806040143 Precision: 0.5165607476635514 Recall: 0.9720033764772088 F1: 0.6746082116877411
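
Because the test set in this section is rebalanced, its precision does not carry over to the original class ratio. The sketch below re-expresses precision in terms of recall, false positive rate and fraud prevalence. The confusion counts for `10_polynomial_6_5` are inferred from the printed scores (recall 7093/7108, precision 7093/7258) and an assumed resampled test split of 21,324 rows. It also assumes that recall and false positive rate would stay the same on real data, which is optimistic.

```python
def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Precision implied by recall and FPR for a given positive-class share."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumed counts for 10_polynomial_6_5 on the resampled test set.
tp, fn, fp = 7093, 15, 165
n_test = 21324
n_negative = n_test - (tp + fn)

recall = tp / (tp + fn)
fpr = fp / n_negative

# Prevalence in the resampled test set versus the original dataset.
balanced = precision_at_prevalence(recall, fpr, (tp + fn) / n_test)
original = precision_at_prevalence(recall, fpr, 492 / 284807)

print(f"precision at resampled prevalence: {balanced:.3f}")
print(f"precision at original prevalence:  {original:.3f}")
```

Under these assumptions the same classifier would flag far more legitimate transactions than fraudulent ones at the original class ratio, so the scores above are best read as performance on the resampled distribution only.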
