# Create SVM Model

In this notebook, we create the SVM models for the ICD9 multi-label classification problem.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import precision_recall_fscore_support
import matplotlib.pyplot as plt
from datetime import datetime

start = datetime.now()

Load the prepared data.

In [None]:
data = pd.read_parquet('prepared-data.pq')
data.head()

1. Create the X and y datasets.
2. Convert the ICD9 codes to a binary indicator array.

In [None]:
X = data['toks'].values
y_raw = data['ICD9_CODE'].astype(str)

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(data['ICD9_CODE'].to_list())
mlb.classes_
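
To make the binarization step concrete, here is a minimal sketch on made-up data. The code lists and the `toy_` names are hypothetical and only illustrate the shape of the output; they are not taken from the real dataset.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical code lists, one list per admission (illustrative only)
toy_codes = [["4019", "25000"], ["4019"], ["41401", "25000"]]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_codes)

print(toy_mlb.classes_)  # sorted unique codes, one per output column
print(toy_y)             # one row per admission, 1 where the code is present
```

Each entry passed to `fit_transform` must be an iterable of codes. If an entry were a plain string instead of a list, every character would be treated as a separate label.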

Display the number of codes. This should match the count reported by the pre-processing script.

In [None]:
mlb.classes_.shape

Create the train/test split, holding out 10% of the data for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

Create a bag-of-words representation and apply TF-IDF weighting.

The vectorizer is fit on the training set only and then used to transform both the training and test sets.

In [None]:
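def identity_tokenizer(text):
    # The notes are already tokenized, so pass the token list through unchanged
    return text

# lowercase=False is needed because each document is a token list, not a string;
# min_df=0.001 drops tokens that appear in fewer than 0.1% of the training documents
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False, min_df=0.001)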
X_train = tfidf.fit_transform(X_train)
X_train

In [None]:
X_test = tfidf.transform(X_test)

In [None]:
len(tfidf.vocabulary_)
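
As a small illustration of the identity-tokenizer approach, the sketch below runs TF-IDF on three made-up, pre-tokenized documents. The documents and the `toy_` names are hypothetical, and `token_pattern=None` is an addition here that only silences the warning about the unused default pattern.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def toy_identity_tokenizer(tokens):
    # Input is already a list of tokens, so return it unchanged
    return tokens

# Hypothetical pre-tokenized documents (illustrative only)
toy_docs = [["chest", "pain", "pain"], ["chest", "xray"], ["fever"]]

toy_tfidf = TfidfVectorizer(tokenizer=toy_identity_tokenizer,
                            lowercase=False, token_pattern=None)
toy_X = toy_tfidf.fit_transform(toy_docs)

print(sorted(toy_tfidf.vocabulary_))              # learned vocabulary
print(toy_X.shape)                                # (documents, vocabulary size)
print(np.linalg.norm(toy_X.toarray(), axis=1))    # rows are L2-normalized by default
```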

Create the model. For this model we will use SGDClassifier.

With `loss='hinge'`, this classifier optimizes the same hinge loss as a linear SVM, but it is trained with stochastic gradient descent (SGD). We use SGD to speed up the model training process.

We also use the One-vs-Rest strategy with the SVM model, where we fit a separate binary SVM for each label against all other labels.

In [None]:
clf = OneVsRestClassifier(SGDClassifier(loss='hinge', n_jobs=6, class_weight={0: 1, 1: 10}), n_jobs=6)
clf.fit(X_train, y_train)
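
The following sketch shows the One-vs-Rest setup on synthetic data. The random features, the three thresholded labels, and the `toy_` names are assumptions made purely for illustration. It also computes the hinge loss by hand from the decision scores, as max(0, 1 - y * f(x)) with labels mapped to -1/+1.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
toy_X = rng.rand(40, 5)
# Three hypothetical labels, each obtained by thresholding one feature
toy_y = (toy_X[:, :3] > 0.5).astype(int)

toy_clf = OneVsRestClassifier(SGDClassifier(loss='hinge', random_state=0))
toy_clf.fit(toy_X, toy_y)

print(len(toy_clf.estimators_))  # one binary classifier per label column

# Signed distances to each label's hyperplane, shape (n_samples, n_labels)
toy_scores = toy_clf.decision_function(toy_X)

# Hinge loss per sample and label: max(0, 1 - y * f(x)), with y in {-1, +1}
toy_signed = 2 * toy_y - 1
toy_hinge = np.maximum(0, 1 - toy_signed * toy_scores)
print(toy_hinge.mean())
```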

A hinge-loss SVM outputs decision scores rather than probabilities, so we next calibrate the classifier with 5-fold stratified cross-validation to obtain class probabilities. We will use the probabilities later to find the best threshold.

In [None]:
cv = StratifiedKFold(n_splits=5)
calibrated_clf = MultiOutputClassifier(CalibratedClassifierCV(clf, cv=cv, n_jobs=6), n_jobs=6)
calibrated_clf.fit(X_train, y_train)
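
To show why the calibration wrapper is needed, here is a minimal single-label sketch on synthetic data. The data and the `toy_` names are hypothetical. It checks that a hinge-loss `SGDClassifier` exposes no `predict_proba`, then wraps it in `CalibratedClassifierCV` to obtain probabilities.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
toy_X = rng.rand(200, 4)
# Hypothetical binary label: a noisy threshold on the first feature
toy_y = (toy_X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

toy_base = SGDClassifier(loss='hinge', random_state=0)
toy_has_proba = hasattr(toy_base, 'predict_proba')
print(toy_has_proba)  # hinge loss gives scores only, no probabilities

# Calibrate the decision scores into probabilities over 5 stratified folds
toy_cal = CalibratedClassifierCV(toy_base, cv=StratifiedKFold(n_splits=5))
toy_cal.fit(toy_X, toy_y)

toy_p = toy_cal.predict_proba(toy_X)
print(toy_p.shape)  # one column per class; each row sums to 1
```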

In [None]:
y_pred = calibrated_clf.predict_proba(X_test)
y_pred = np.dstack(y_pred)
y_pred = np.transpose(y_pred, (0, 2, 1))
y_pred.shape
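
The reshaping above can be hard to follow, so here is a sketch with made-up numbers. It assumes, as with `MultiOutputClassifier.predict_proba`, a list containing one `(n_samples, 2)` array per label; the values and `toy_` names are hypothetical.

```python
import numpy as np

# Hypothetical output for 3 labels and 4 samples:
# a list with one (4, 2) array per label, columns = [P(absent), P(present)]
toy_proba = [np.tile([1 - p, p], (4, 1)) for p in (0.2, 0.5, 0.9)]

toy_stacked = np.dstack(toy_proba)                     # (samples, classes, labels)
toy_reordered = np.transpose(toy_stacked, (0, 2, 1))   # (samples, labels, classes)
toy_positive = toy_reordered[:, :, 1]                  # P(present), (samples, labels)

print(toy_stacked.shape, toy_reordered.shape, toy_positive.shape)

# Equivalent shortcut: take the positive-class column of each label directly
toy_shortcut = np.column_stack([p[:, 1] for p in toy_proba])
```

After the transpose, `[:, :, 1]` selects the probability that each label is present, which is the quantity thresholded in the following cells.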


In [None]:
y_pred_cls = (y_pred[:, :, 1] > 0.5) * 1

In [None]:
thresholds = np.arange(0, 1, 0.1)
scores = []  # separate name so the `data` DataFrame loaded earlier is not overwritten
for threshold in thresholds:
    y_pred_cls = (y_pred[:, :, 1] > threshold) * 1
    # With average='micro', support is returned as None
    precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_pred_cls, average='micro')
    scores.append((threshold, precision, recall, fscore, support))

In [None]:
df = pd.DataFrame(scores, columns=['threshold', 'precision', 'recall', 'fscore', 'support'])
df.head()

In [None]:
df.iloc[df['fscore'].argmax()]
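
As a worked example of what the micro-averaged scores in the sweep measure, the sketch below pools true positives over all labels and samples by hand and compares the result with scikit-learn. The label matrix, probabilities, and `toy_` names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground truth and predicted probabilities (3 samples, 3 labels)
toy_true = np.array([[1, 0, 1],
                     [0, 1, 0],
                     [1, 1, 0]])
toy_prob = np.array([[0.9, 0.2, 0.4],
                     [0.1, 0.7, 0.3],
                     [0.6, 0.35, 0.1]])

toy_results = {}
for t in (0.25, 0.5):
    pred = (toy_prob > t).astype(int)
    # Micro-averaging pools counts over every (sample, label) cell
    tp = np.sum((pred == 1) & (toy_true == 1))
    manual_p = tp / pred.sum()       # TP / all predicted positives
    manual_r = tp / toy_true.sum()   # TP / all actual positives
    p, r, f, _ = precision_recall_fscore_support(toy_true, pred, average='micro')
    toy_results[t] = (manual_p, manual_r, p, r, f)
    print(t, round(p, 3), round(r, 3), round(f, 3))
```

On this toy data, lowering the threshold from 0.5 to 0.25 trades precision for recall, which is the same trade-off the threshold sweep explores on the real predictions.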

In [None]:
plt.figure()
plt.title("SVM F-score Curve")
plt.plot(df['threshold'], df['fscore'], label='F-Score')
plt.plot(df['threshold'], df['recall'], label='Recall')
plt.plot(df['threshold'], df['precision'], label='Precision')
plt.xlabel('Threshold Cutoff')
plt.ylabel('Metric')
plt.legend(loc='upper right')
# plt.text()
plt.savefig('svm-fscore.png')
plt.show()

In [None]:
df.to_csv('svm-scores.csv')

In [None]:
end = datetime.now()
total_time = end - start
total_time