# Linear SVM Grid Search

Linear SVMs have been the best-performing model so far. This notebook tunes the hyperparameters of the linear SVM using scikit-learn's GridSearchCV.

This notebook is based on the following scikit-learn documentation pages:

https://scikit-learn.org/stable/modules/grid_search.html#grid-search

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV


In [24]:
import pandas as pd
import numpy as np
import re

#sklearn imports
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.svm import SVC
from sklearn.svm import OneClassSVM
from sklearn.model_selection import train_test_split, GridSearchCV

## Importing the Data

In [25]:
data = pd.read_csv('./data/frequenciesExtra.csv')

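# Convert 'hospDistance' strings to integers: if the string starts with a digit,
# its first space-separated token is parsed as an int; the string 'None' becomes None.
numberRe = re.compile(r'[0-9]+')
noneRe = re.compile(r'None')
def daysStrToInt(dStr):
    if isinstance(dStr, str):
        if numberRe.match(dStr):
            return int(dStr.split(' ')[0])
        elif noneRe.match(dStr):
            return None
    # Non-string values and unrecognized strings pass through unchanged
    return dStr

data['hospDistance'] = data['hospDistance'].transform(daysStrToInt)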
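As a quick check of the conversion, here is a minimal sketch that applies the same function to a few hand-written values. The sample strings (for example '12 days') are assumptions about what the column looks like; the actual contents of `hospDistance` are not shown in this notebook.

```python
import re

numberRe = re.compile(r'[0-9]+')
noneRe = re.compile(r'None')

def daysStrToInt(dStr):
    # Same logic as the notebook's helper, repeated so this cell runs on its own
    if isinstance(dStr, str):
        if numberRe.match(dStr):
            return int(dStr.split(' ')[0])
        elif noneRe.match(dStr):
            return None
    return dStr

# Hypothetical sample values: a day-count string, the string 'None', and an int
samples = ['12 days', 'None', 7]
converted = [daysStrToInt(s) for s in samples]
print(converted)
```

Values that are already numeric pass through unchanged, so the function is safe to apply to a column with mixed types.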



In [26]:
def deIdCrf(crfs):
    return crfs.drop(columns=['Masked Client ID', 'Date of Review', 'Date'])

def deIdAdl(adls):
    return adls.drop(columns=['DeIdentify ID', 'CaregiverID', 'VisitDate', 'ActualTimeIn', 'ActualTimeOut', 'Date'])

In [27]:
x = deIdAdl(data).drop(columns=['hasHospitalization', 'hospDistance'])
y = data['hasHospitalization']
d = data['hospDistance']

In [28]:
#split into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.25, random_state=15)
y_train = np.array(y_train, dtype=bool)
y_test = np.array(y_test, dtype=bool)
print(y_train.sum())
print(y_test.sum())

221
56
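With so few positive samples, the share of positives in each split can drift when the split is purely random. One option the notebook does not use is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The sketch below uses synthetic data (`x_demo` and `y_demo` are made-up stand-ins, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 40 positives (4%)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=bool)
y_demo[:40] = True

# stratify=y_demo keeps the positive rate equal in the train and test splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=15, stratify=y_demo
)
print(y_tr.sum(), y_te.sum())
```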


## Grid Search

SVMs have a few hyperparameters to consider:
* C: Regularization parameter; it acts roughly inversely to nu. It sets the penalty for misclassified training samples, so larger values fit the training data more closely.
* Kernel: Set to linear since that provided the best results.
* Gamma: Kernel coefficient that controls the curvature of the decision boundary for non-linear kernels. Not used because we are using a linear kernel.
* Weighting: Determines the penalty for misclassifying a sample from each class. We set those who have been hospitalized to 0 and those who have not to 1. Because we are doing anomaly detection, this parameter becomes very important, since linear SVMs are not designed for that.
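In `SVC`, the weight of a class multiplies C for samples of that class. The weights in this notebook are chosen by hand. As a reference point, scikit-learn's 'balanced' heuristic sets each weight to `n_samples / (n_classes * class_count)`. The sketch below computes it for a made-up 90/10 label vector (`y_demo` is an assumption, not the notebook's data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 90 samples of class 0 and 10 samples of class 1
y_demo = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# 'balanced' weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
balanced = dict(zip(classes.tolist(), weights.tolist()))

# The rare class is weighted more heavily, by the ratio of the class counts
ratio = balanced[1] / balanced[0]
print(balanced, ratio)
```

The ratio between the two weights equals the ratio of the class counts, which gives a rough sense of scale for hand-picked weights such as 10, 25, 50 and 100.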

In [7]:
# Parameters to grid search through
param_grid = [
    {'C': [0.1, 0.5, 1, 10, 100],
     'class_weight': [{1.0: 1, 0.0: 10}, {1.0: 1, 0.0: 25}, {1.0: 1, 0.0: 50}, {1.0: 1, 0.0: 100}]
    }
]

# verbose sets how much progress output the grid search prints; higher values print more.
# With parallel workers the per-fit messages may not show up in the notebook output.
# n_jobs=-1 uses all available processors to run the grid search faster.
search = GridSearchCV(estimator=SVC(kernel='linear'), param_grid=param_grid, verbose=3, n_jobs=-1)
search.fit(x_train, y_train)
search.best_params_

Fitting 5 folds for each of 20 candidates, totalling 100 fits
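The search above does not pass `scoring`, so `GridSearchCV` ranks candidates with the estimator's default `score` method, which for `SVC` is accuracy. With heavily imbalanced classes, accuracy can favor models that rarely predict the positive class. A minimal sketch of ranking by F1 instead, on synthetic data (`x_demo`, `y_demo`, and `demo_grid` are illustrative assumptions; in this demo the rare class is labeled 1, so label 1 gets the larger weight):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic imbalanced data: roughly 10% of samples belong to class 1
x_demo, y_demo = make_classification(
    n_samples=300, n_features=5, weights=[0.9, 0.1], random_state=0
)

# Small illustrative grid; the larger weight goes to the rare class (label 1)
demo_grid = {
    'C': [0.1, 1],
    'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 10}],
}

# scoring='f1' ranks candidates by F1 on the positive class instead of accuracy
demo_search = GridSearchCV(
    estimator=SVC(kernel='linear'),
    param_grid=demo_grid,
    scoring='f1',
    cv=5,
)
demo_search.fit(x_demo, y_demo)
print(demo_search.best_params_)
print(demo_search.best_score_)
```

Which label should carry the larger weight depends on how the target is encoded, so it is worth checking the keys of `class_weight` against the values in `y_train`.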


## Statistics for Testing Data

In [None]:
best_preds = search.predict(x_test)
print("Precision:",precision_score(y_test, best_preds))
print("Recall:",recall_score(y_test, best_preds))
print("F1 Score",f1_score(y_test, best_preds))

Precision: 0.8157894736842105
Recall: 0.5535714285714286
F1 Score 0.6595744680851064
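To make these numbers concrete, the sketch below recomputes them from confusion counts. The counts are not printed by the notebook; they are inferred from the reported metrics and the 56 positives in `y_test` (31 true positives, 7 false positives, 25 false negatives).

```python
# Confusion counts inferred from the reported metrics (not printed by the notebook)
tp, fp, fn = 31, 7, 25

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

In other words, the model flags 38 samples, 31 of them correctly, and misses 25 of the 56 hospitalizations in the test set.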


# Trying things out for OneClassSVM

In [43]:
# Parameters to grid search through
param_grid = [
    {'nu': [0.001, 0.01],
     'gamma': ['scale', 'auto', 0.1],
     'degree': [2, 3, 4]
    }
]

# verbose sets how much progress output the grid search prints; higher values print more.
# With parallel workers the per-fit messages may not show up in the notebook output.
# n_jobs=-1 uses all available processors to run the grid search faster.
search = GridSearchCV(estimator=OneClassSVM(kernel='poly'), param_grid=param_grid, scoring='f1', verbose=3, n_jobs=-1)
# OneClassSVM.fit ignores y_train; the labels are only used by the 'f1' scorer
search.fit(x_train, y_train)


Fitting 5 folds for each of 18 candidates, totalling 90 fits
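`OneClassSVM.predict` returns +1 for inliers and -1 for outliers, while the built-in 'f1' scorer treats the label 1 as the positive class. If the rare class is meant to be the outlier, one option is a custom scorer that maps -1 to the positive label before computing F1. This is a minimal sketch on synthetic data, not the notebook's method: `outlier_f1`, `x_demo`, and `y_demo` are hypothetical names, and the demo uses an RBF kernel rather than the polynomial kernel above to keep it fast.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import OneClassSVM

def outlier_f1(y_true, y_pred):
    # Treat OneClassSVM's -1 (outlier) as the positive class
    return f1_score(y_true, y_pred == -1, zero_division=0)

# Synthetic data: 190 inliers around the origin, 10 outliers far away
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(190, 2))
outliers = rng.normal(6.0, 1.0, size=(10, 2))
x_demo = np.vstack([inliers, outliers])
y_demo = np.array([False] * 190 + [True] * 10)

# Shuffle so that the unshuffled K-fold splits mix both groups
perm = rng.permutation(len(y_demo))
x_demo, y_demo = x_demo[perm], y_demo[perm]

demo_search = GridSearchCV(
    estimator=OneClassSVM(kernel='rbf'),
    param_grid={'nu': [0.01, 0.05, 0.1], 'gamma': ['scale', 0.1]},
    scoring=make_scorer(outlier_f1),
    cv=5,
)
demo_search.fit(x_demo, y_demo)
print(demo_search.best_params_)
```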




In [44]:
print(search.best_params_)
print(pd.DataFrame(search.cv_results_))

{'degree': 2, 'gamma': 'scale', 'nu': 0.001}
    mean_fit_time  std_fit_time  mean_score_time  std_score_time param_degree  \
0        5.084579      0.722745         0.224872        0.036444            2   
1       36.398026      0.873565         1.463747        0.559063            2   
2        5.587369      1.093401         0.305981        0.060711            2   
3       38.276191      1.268626         1.438520        0.280943            2   
4        5.991043      0.579889         0.338076        0.033605            2   
5       35.738123      1.370767         1.959844        0.335767            2   
6        5.682433      0.854755         0.276370        0.039397            3   
7       38.696854      2.462547         1.912894        0.336341            3   
8        6.178154      1.033592         0.338661        0.072811            3   
9       37.845771      2.904658         1.865541        0.396649            3   
10       6.357777      0.941626         0.344685        0.090771

### Testing the best OCSVM model
Evaluate the best OCSVM model found by the grid search on the test data.

In [45]:
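best_preds = search.predict(x_test)
print(best_preds)
# OneClassSVM.predict returns +1 (inlier) or -1 (outlier). Casting to bool maps
# both values to True, so every sample counts as a positive prediction below.
best_preds = np.array(best_preds, dtype=bool)

print(y_test)
print("Precision:",precision_score(y_test, best_preds))
print("Recall:",recall_score(y_test, best_preds))
print("F1 Score",f1_score(y_test, best_preds))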

[1 1 1 ... 1 1 1]
[False False False ... False False False]
Precision: 0.004005149477900157
Recall: 1.0
F1 Score 0.007978344493517595
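A recall of exactly 1.0 with near-zero precision is what results when every sample is predicted positive. The sketch below shows why the boolean cast produces that, and one possible alternative mapping, assuming the outlier label (-1) is meant to stand for hospitalization. The `raw_preds` array is a made-up example.

```python
import numpy as np

# Made-up OneClassSVM-style predictions: +1 = inlier, -1 = outlier
raw_preds = np.array([1, -1, 1, 1, -1])

# Casting to bool treats every nonzero value as True, including -1
as_bool = np.array(raw_preds, dtype=bool)

# Explicit comparison marks only the outliers as positive
is_outlier = raw_preds == -1

print(as_bool)
print(is_outlier)
```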
