# Linear SVM Grid Search

Linear SVMs have been found to be the best model so far. This notebook tunes the hyperparameters of the linear SVM using scikit-learn's `GridSearchCV`.

The following pages were used as references when creating this notebook.

https://scikit-learn.org/stable/modules/grid_search.html#grid-search

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV


In [4]:
import pandas as pd
import numpy as np
import re

#sklearn imports
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.svm import SVC
from sklearn.svm import OneClassSVM
from sklearn.model_selection import train_test_split, GridSearchCV

## Importing the Data

In [5]:
data = pd.read_csv('./data/frequenciesExtra.csv')

numberRe = re.compile('[0-9]+')
noneRe = re.compile('None')
def daysStrToInt(dStr):
    if isinstance(dStr, str):
        if numberRe.match(dStr):
            return int(dStr.split(' ')[0])
        elif noneRe.match(dStr):
            return None
    return dStr

data['hospDistance'] = data['hospDistance'].transform(daysStrToInt)



In [6]:
def deIdCrf(crfs):
  return crfs.drop(columns=['Masked Client ID', 'Date of Review', 'Date'])

def deIdAdl(adls):
  return adls.drop(columns=['DeIdentify ID', 'CaregiverID', 'VisitDate', 'ActualTimeIn', 'ActualTimeOut', 'Date'])

In [7]:
x = deIdAdl(data).drop(columns=['hasHospitalization', 'hospDistance'])
y = data['hasHospitalization']
d = data['hospDistance']

In [8]:
#split into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.25, random_state=15)
y_train = np.array(y_train, dtype=bool)
y_test = np.array(y_test, dtype=bool)
print(y_train.sum())
print(y_test.sum())

221
56


## Grid Search

SVMs have a few hyperparameters to consider:
* C: Regularization parameter. The strength of the regularization is inversely proportional to C, so a larger C penalizes misclassified samples more heavily and fits the training data more closely.
* Kernel: Set to linear, since that provided the best results.
* Gamma: Kernel coefficient that controls the curvature of the decision boundary for non-linear kernels. Not used here because the kernel is linear.
* Weighting: Sets the penalty for misclassifying a sample from each class. We set those who have been hospitalized to 0 and those who have not to 1. Because we are doing anomaly detection, this parameter becomes very important, since linear SVMs are not designed for that task.
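
As a rough illustration of how C and the class weighting interact: scikit-learn documents that `class_weight` sets the penalty for class `i` to `class_weight[i] * C`. The sketch below only does that arithmetic for the weightings searched in this notebook. The fixed `C = 1.0` is an arbitrary value chosen for illustration.

```python
# Sketch: effective per-class penalty when class_weight is combined with C.
# C = 1.0 is an illustrative assumption; the weights mirror this notebook's grid.
C = 1.0
class_weights = [
    {1.0: 1, 0.0: 10},
    {1.0: 1, 0.0: 25},
    {1.0: 1, 0.0: 50},
    {1.0: 1, 0.0: 100},
]

# For each weighting, class i is penalized with C * class_weight[i].
effective_penalties = [
    {label: C * weight for label, weight in cw.items()} for cw in class_weights
]

for cw, eff in zip(class_weights, effective_penalties):
    print(cw, "->", eff)
```

With these weightings, a misclassified sample from class 0.0 costs 10 to 100 times as much as one from class 1.0.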

In [26]:
#params to grid search through
param_grid = [
   {'C': [0.1,0.5,1,10,100], 
   'class_weight' : [{1.0: 1, 0.0: 10},{1.0: 1, 0.0: 25},{1.0: 1, 0.0: 50},{1.0: 1, 0.0: 100}]
   }
]

#verbose sets how much output is printed during the grid search; higher means more. With n_jobs != 1 the per-fit messages may not show up in the notebook
#n_jobs is set to -1, meaning that it will use all available processors to run the grid search faster
search = GridSearchCV(estimator=SVC(kernel='linear'), param_grid=param_grid, verbose=3, n_jobs=-1)
search.fit(x_train, y_train)
search.best_params_

Fitting 5 folds for each of 20 candidates, totalling 100 fits


{'C': 0.5, 'class_weight': {1.0: 1, 0.0: 10}}
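
The "20 candidates, totalling 100 fits" message follows from the grid size. A minimal sketch of the count, using `itertools.product` in place of scikit-learn's own grid expansion and assuming the default 5-fold cross-validation (no `cv` argument was passed):

```python
from itertools import product

# Values copied from param_grid above.
c_values = [0.1, 0.5, 1, 10, 100]
class_weights = [{1.0: 1, 0.0: w} for w in (10, 25, 50, 100)]
n_folds = 5  # assumed: GridSearchCV's default when cv is not specified

# Every combination of C and class_weight is one candidate.
candidates = [
    {'C': c, 'class_weight': cw} for c, cw in product(c_values, class_weights)
]
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Each extra value added to either list multiplies the number of fits, which matters here because a single fit takes roughly 30 seconds.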

In [28]:
print(search.best_params_)
print(pd.DataFrame(search.cv_results_))

{'C': 0.5, 'class_weight': {1.0: 1, 0.0: 10}}
    mean_fit_time  std_fit_time  mean_score_time  std_score_time param_C  \
0       37.482078      1.430349         0.802794        0.389252     0.1   
1       33.720942      4.048325         0.543968        0.073956     0.1   
2       30.740556      2.881127         0.531128        0.089459     0.1   
3       28.585553      2.824071         0.501283        0.084969     0.1   
4       34.520521      3.134284         0.513889        0.060154     0.5   
5       35.441495      5.487460         0.524834        0.054402     0.5   
6       33.545605      3.403674         0.507459        0.049434     0.5   
7       34.210650      4.373831         0.532525        0.076677     0.5   
8       36.029290      3.706238         0.509050        0.067354       1   
9       37.176459      5.004768         0.509038        0.096848       1   
10      36.394962      3.996773         0.504720        0.087341       1   
11      35.325864      5.473005         0.

## Statistics for Testing Data

In [27]:
best_preds = search.predict(x_test)
print("Precision:",precision_score(y_test, best_preds))
print("Recall:",recall_score(y_test, best_preds))
print("F1 Score",f1_score(y_test, best_preds))

Precision: 0.8157894736842105
Recall: 0.5535714285714286
F1 Score 0.6595744680851064
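
To make these scores concrete, the sketch below recomputes them from confusion counts. The counts are not printed by the notebook. They are inferred from the reported precision and recall together with the 56 positive test samples, so treat them as an assumption.

```python
# Inferred counts (assumption): 31 true positives, 7 false positives, 25 false negatives.
tp, fp, fn = 31, 7, 25

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score", f1)
```

Read this way, the model flags 38 test samples as positive, 31 of them correctly, and misses 25 of the 56 actual positives.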


# Trying things out for OneClassSVM

While a grid search has already been done for the One Class SVM model, I wanted to try it here with `GridSearchCV` as well.

In [29]:
#params to grid search through
param_grid = [
   {'nu': [0.1], 
   'gamma': [0.005],
   'degree': [3]
   }
]

#verbose sets how much output is printed during the grid search; higher means more. With n_jobs != 1 the per-fit messages may not show up in the notebook
#n_jobs is set to -1, meaning that it will use all available processors to run the grid search faster
search = GridSearchCV(estimator=OneClassSVM(kernel='poly'), param_grid=param_grid, scoring='f1', verbose=3, n_jobs=-1)
search.fit(x_train,y_train)


Fitting 5 folds for each of 1 candidates, totalling 5 fits




GridSearchCV(estimator=OneClassSVM(kernel='poly'), n_jobs=-1,
             param_grid=[{'degree': [3], 'gamma': [0.005], 'nu': [0.1]}],
             scoring='f1', verbose=3)

In [31]:
print(search.best_params_)
print(pd.DataFrame(search.cv_results_))

{'degree': 3, 'gamma': 0.005, 'nu': 0.1}
   mean_fit_time  std_fit_time  mean_score_time  std_score_time param_degree  \
0     236.915109      1.988311        21.365844        1.061736            3   

  param_gamma param_nu                                    params  \
0       0.005      0.1  {'degree': 3, 'gamma': 0.005, 'nu': 0.1}   

   split0_test_score  split1_test_score  split2_test_score  split3_test_score  \
0                NaN                NaN                NaN                NaN   

   split4_test_score  mean_test_score  std_test_score  rank_test_score  
0                NaN              NaN             NaN                1  
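
Every test score above is NaN. A likely cause is a label mismatch: `OneClassSVM.predict` returns -1 (outlier) and +1 (inlier), while `y_train` holds booleans, so the built-in `'f1'` scorer probably fails on each fold and `GridSearchCV` records its default `error_score` of NaN. The sketch below is one possible fix rather than the notebook's own method. The name `ocsvm_f1` is hypothetical, and the function treats -1 as the positive class, matching the `best_preds < 0` conversion used later in this notebook.

```python
def ocsvm_f1(y_true, y_pred):
    """F1 for the True class, treating OneClassSVM's -1 (outlier) as positive."""
    flagged = [p == -1 for p in y_pred]   # map -1/+1 predictions to booleans
    actual = [bool(t) for t in y_true]

    tp = sum(a and f for a, f in zip(actual, flagged))
    fp = sum((not a) and f for a, f in zip(actual, flagged))
    fn = sum(a and (not f) for a, f in zip(actual, flagged))

    if tp == 0:
        return 0.0  # no correct positives, so F1 is defined as 0 here
    return 2 * tp / (2 * tp + fp + fn)


# Tiny hand-made example: one hit, one false alarm, one miss.
y_true_demo = [True, False, False, True]
y_pred_demo = [-1, 1, -1, 1]
demo_score = ocsvm_f1(y_true_demo, y_pred_demo)
print(demo_score)
```

A function like this could be wrapped with `sklearn.metrics.make_scorer` and passed as `scoring=` in place of `'f1'`. That has not been run against this notebook's data.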


### Testing the best OCSVM model
Using the best OCSVM model found by the grid search. Since the grid contains a single candidate, this is simply that candidate refit on the training data.

In [32]:
best_preds = search.predict(x_test)
print(best_preds)
print(np.count_nonzero(best_preds == -1))
print(np.count_nonzero(y_test == 1))

best_preds = (best_preds < 0)


print(np.count_nonzero(best_preds == 1))
print(np.count_nonzero(y_test == 1))

print(best_preds)
print(y_test)

print("Precision:",precision_score(y_test, best_preds))
print("Recall:",recall_score(y_test, best_preds))
print("F1 Score",f1_score(y_test, best_preds))

[1 1 1 ... 1 1 1]
1419
56
1419
56
[False False False ... False False False]
[False False False ... False False False]
Precision: 0.007751937984496124
Recall: 0.19642857142857142
F1 Score 0.014915254237288135
