<a href="https://colab.research.google.com/github/PDNow-Research/PDNow/blob/main/SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Data Science
import re
import csv
import json
import itertools
from tqdm import tqdm
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# General
import os
import sys
import time
import math
import random
from datetime import date
import warnings
current_date = date.today()
warnings.filterwarnings("ignore")

# SVM
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, average_precision_score, classification_report

In [None]:
spiral_df = pd.read_csv('/content/drive/My Drive/Data/HandPD-Replication/Spiral_HandPD.txt', delimiter=' ', index_col=0, names=['Patient', 'Label', 'RMS', 'Max_dist', 'Min_dist', 'SD', 'MRT', 'Max_ET', 'Min_ET', 'SD_ET', 'HT_ET_Diff'])

In [None]:
spiral_df.head()

Patient,Label,RMS,Max_dist,Min_dist,SD,MRT,Max_ET,Min_ET,SD_ET,HT_ET_Diff
1,1,3521.258301,6247.052734,30801.99219,0.014133,26.785328,176.600113,0.00213,1781.795898,0.25
2,1,4098.876465,6032.535156,34369.70313,0.022838,26.529615,168.352737,0.08496,1443.217529,0.273585
3,1,3854.601807,6453.114746,34709.44531,0.000251,23.670755,180.8983,0.009303,1621.75,0.256329
4,1,4069.221924,6844.231445,32181.26367,0.000168,23.456329,179.116043,0.021419,1454.390137,0.249221
5,1,4104.271973,6949.925293,36444.95313,0.004731,22.488258,188.25621,0.0,1553.536499,0.214511


## **SVM IMPLEMENTATION**

DAY 1 Process
1. Import the extracted features. No normalization performed (the extracted features do not appear to have been normalized either). Run the model. It collapses to the majority class: every sample is predicted as PD, which gives 81% accuracy (60 of the 74 test rows are PD).
2. Try normalization with StandardScaler from sklearn. It did not help; the model still predicted a single class.
3. Try weighting classes with class_weight='balanced', since we have many more PD samples than non-PD (class imbalance). Better results: the model no longer predicts only one class. 62% accuracy, with 24 false negatives and 4 false positives.
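
The 81% figure in step 1 is what a classifier that always answers "PD" would score, so it is a useful floor to compare against. A minimal sketch of that baseline, assuming only the class counts reported by `value_counts()` later in this notebook; `DummyClassifier` and the placeholder feature arrays are additions for illustration and are not part of the original run:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts copied from the value_counts() outputs in this notebook
# (1 = Control, 2 = PD).
y_train = np.array([2] * 236 + [1] * 58)
y_test = np.array([2] * 60 + [1] * 14)

# The baseline ignores the features, so placeholder zeros are enough here.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent training label (PD).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
print(round(baseline_acc, 4))  # equals 60 / 74
```

Any model worth keeping should beat this number on a metric that accounts for both classes, not just on raw accuracy.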

Next Steps: Try the paper's normalization method. Investigate the per-class results: 10/14 control rows predicted correctly, 36/60 PD rows predicted correctly. Also important: while we have 368 spiral images, each patient drew 4, so we are predicting per image, not per patient. (We might be unable to predict per patient anyway, since we may lack the patient indexes and so don't know which images belong to which patient.)
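
If patient identifiers do become available, per-image predictions could be combined into one prediction per patient. The sketch below uses a majority vote; the `patient_id` column, the example values, and the tie-breaking rule are all hypothetical and not taken from the HandPD data:

```python
import pandas as pd

# Hypothetical per-image predictions; patient_id is an assumed column that
# the current data may not provide. Labels: 1 = Control, 2 = PD.
preds = pd.DataFrame({
    "patient_id": [101] * 4 + [102] * 4 + [103] * 4,
    "y_pred":     [2, 2, 1, 2,  1, 1, 2, 1,  1, 1, 2, 2],
})


def vote(labels: pd.Series) -> int:
    """Majority vote over one patient's images; ties go to PD (2).

    The tie rule is an arbitrary choice made for this sketch.
    """
    counts = labels.value_counts()
    top = counts[counts == counts.max()].index
    return int(max(top))


per_patient = preds.groupby("patient_id")["y_pred"].agg(vote)
print(per_patient.to_dict())
```

With patient identifiers, the train/test split would also need to keep all of a patient's images on the same side (for example with scikit-learn's `GroupShuffleSplit`); otherwise images from one patient can land in both sets.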

In [None]:
X = spiral_df[['RMS', 'Max_dist', 'Min_dist', 'SD', 'MRT', 'Max_ET', 'Min_ET', 'SD_ET', 'HT_ET_Diff']]
y = spiral_df['Label']

y_label = y.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.2, stratify = y_label)

In [None]:
scaler = StandardScaler().fit(X_train)

In [None]:
X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
y_train = y_train.reset_index(drop = True)
y_test = y_test.reset_index(drop = True)
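
The manual fit/transform steps in the preceding cells could also be written as a single pipeline with `make_pipeline`, which this notebook imports but does not use. This is a sketch on synthetic data, not the HandPD features: `make_classification` and the `X_demo`/`y_demo` names are stand-ins chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the nine spiral features (assumption: the real
# HandPD file is not loaded here). weights gives roughly a 20/80 class split.
X_demo, y_demo = make_classification(
    n_samples=368, n_features=9, n_informative=5,
    weights=[0.2, 0.8], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The scaler is fitted on the training split only, inside the pipeline,
# and the same scaling is applied automatically at prediction time.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", class_weight="balanced"),
)
model.fit(X_tr, y_tr)
demo_score = model.score(X_te, y_te)
print(demo_score)
```

Keeping the scaler inside the pipeline also makes it safe to pass the whole model to cross-validation, since scaling is refitted on each training fold.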

In [None]:
y_train.value_counts()

2    236
1     58
Name: Label, dtype: int64

In [None]:
y_test.value_counts()

2    60
1    14
Name: Label, dtype: int64

In [None]:
clf = SVC(kernel='rbf', probability=True, class_weight='balanced')
clf.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

In [None]:
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)

In [None]:
clf.score(X_test, y_test)

0.6216216216216216

In [None]:
y_pred = pd.Series(y_pred)

In [None]:
type(y_test)

pandas.core.series.Series

In [None]:
target_names = ['Control', 'PD']
results = classification_report(y_test, y_pred, target_names = target_names, output_dict=True)
results = pd.DataFrame(results).transpose()
conf_mat = confusion_matrix(y_test, y_pred)

In [None]:
results

,precision,recall,f1-score,support
Control,0.294118,0.714286,0.416667,14.0
PD,0.9,0.6,0.72,60.0
accuracy,0.621622,0.621622,0.621622,0.621622
macro avg,0.597059,0.657143,0.568333,74.0
weighted avg,0.785374,0.621622,0.662613,74.0


In [None]:
conf_mat

array([[10,  4],
       [24, 36]])

In [None]:
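# confusion_matrix orders rows (true) and columns (predicted) by sorted label:
# 1 = Control, 2 = PD, so PD is treated as the positive class here.
TN, FP, FN, TP = conf_mat.ravel()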

# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)

# Specificity or true negative rate
TNR = TN/(TN+FP) 

# Precision or positive predictive value
PPV = TP/(TP+FP)

# Negative predictive value
NPV = TN/(TN+FN)

# Fall out or false positive rate
FPR = FP/(FP+TN)

# False negative rate
FNR = FN/(TP+FN)

# False discovery rate
FDR = FP/(TP+FP)

print("TP: ", TP)
print("TN: ", TN)
print("FP: ", FP)
print("FN: ", FN)

print("Sensitivity: ", TPR)
print("Specificity: ", TNR)
print("NPV: ", NPV)
print("PPV: ", PPV)

TP:  36
TN:  10
FP:  4
FN:  24
Sensitivity:  0.6
Specificity:  0.7142857142857143
NPV:  0.29411764705882354
PPV:  0.9
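
As a worked check on the numbers above, accuracy and balanced accuracy can be recomputed directly from the confusion matrix. Balanced accuracy is not reported in the original cells; it is added here as a sketch because it is not inflated by the class imbalance (an always-PD classifier scores 0.5 on it).

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels,
# columns are predictions, ordered Control (1), PD (2).
conf_mat = np.array([[10, 4],
                     [24, 36]])
TN, FP, FN, TP = conf_mat.ravel()

sensitivity = TP / (TP + FN)   # recall for PD: 36 / 60
specificity = TN / (TN + FP)   # recall for Control: 10 / 14
accuracy = (TP + TN) / conf_mat.sum()               # 46 / 74
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

The balanced accuracy matches the macro-average recall in the classification report (0.657143), as expected for a two-class problem.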

