# Support Vector Machine

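A Support Vector Machine (SVM) looks for the hyperplane (a line in two dimensions) that best separates the classes. Like logistic regression, a linear SVM produces a linear decision boundary, but it chooses the boundary that maximizes the margin: the region within ±1 of the boundary, measured in units of the decision function. The wider the margin, the better the separation between the classes. A new sample is classified according to which side of the boundary it falls on.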
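
The margin idea can be made concrete with a small worked example. The sketch below is illustrative only: the weight vector `w` and offset `b` are hypothetical values chosen for easy arithmetic, not parameters learned from the expression data. It shows how the sign of the decision function picks a side and how the margin width follows from the length of `w`.

```python
import math

# Hypothetical 2-D linear classifier (assumed values, not fitted to any data).
w = (3.0, 4.0)
b = -5.0


def decision_function(x):
    """Signed score w.x + b; its sign says which side of the boundary x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x):
    """Return +1 or -1 depending on the side of the boundary."""
    return 1 if decision_function(x) >= 0 else -1


# The margin boundaries are the lines where the score equals +1 and -1,
# so the distance between them is 2 / ||w||.
margin_width = 2.0 / math.hypot(*w)

print(margin_width)         # 0.4
print(predict((3.0, 1.0)))  # 1  (score = 9 + 4 - 5 = 8)
print(predict((0.0, 0.0)))  # -1 (score = -5)
```

A smaller `||w||` gives a wider margin, which is why SVM training minimizes the norm of `w` subject to the samples staying on the correct side.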

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.svm import SVC

from src.data import make_dataset


# Import data

In [3]:
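# Load the original and the standardized expression data
labled_data_set, expression_level, labels, true_labels = make_dataset.get_data("original")
labled_data_set_sd, expression_level_sd, labels, true_labels = make_dataset.get_data("standardized")

# Class labels as a NumPy array
labels_array = labels["Class"].values

# Load the transformed representations of the data
HGV, PCA, UMAP, TSNA = make_dataset.get_transformed_data()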

# 3. Train classification model


## Using original data

In [5]:
X = expression_level
Y = labels_array

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)

# Train a linear SVM model
lsvc_model = SVC(kernel='linear', random_state=10)
lsvc_model.fit(X_train, Y_train)
Y_pred = lsvc_model.predict(X_test)

# Accuracy on the training and held-out test sets
print('train score: ' + str(lsvc_model.score(X_train, Y_train)))
print('test score:  ' + str(lsvc_model.score(X_test, Y_test)))

train score: 1.0
test score:  0.9950248756218906
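
Accuracy alone does not show which classes are confused when a prediction is wrong. The notebook imports `confusion_matrix` and `classification_report` but does not call them, so the following is a minimal sketch of how they could be applied to predictions such as `Y_pred`. It uses synthetic data from `make_blobs` as a stand-in for the expression matrix (an assumption), and the names `X_demo`, `y_demo`, `model`, and `y_hat` are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the expression data and tumor labels (assumption).
X_demo, y_demo = make_blobs(n_samples=200, centers=3, random_state=10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10
)

model = SVC(kernel="linear", random_state=10).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Rows are true classes, columns are predicted classes,
# both ordered as in model.classes_.
cm = confusion_matrix(y_te, y_hat, labels=model.classes_)
print(cm)

# Per-class precision, recall, and F1 on the test set.
print(classification_report(y_te, y_hat))
```

In the notebook itself, the equivalent calls would presumably be `confusion_matrix(Y_test, Y_pred)` and `classification_report(Y_test, Y_pred)`.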


## Using standardized data

In [7]:
X = expression_level_sd
Y = labels_array

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)

# Train a linear SVM model
lsvc_model = SVC(kernel='linear', random_state=10)
lsvc_model.fit(X_train, Y_train)
Y_pred = lsvc_model.predict(X_test)

# Accuracy on the training and held-out test sets
print('train score: ' + str(lsvc_model.score(X_train, Y_train)))
print('test score:  ' + str(lsvc_model.score(X_test, Y_test)))

train score: 1.0
test score:  1.0
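
The source does not show how `make_dataset.get_data("standardized")` computes the scaling. If the mean and standard deviation come from the full dataset, the test samples influence the features the model is trained on. One way to rule that out is to fit the scaler on the training split only, for example with a scikit-learn pipeline. The sketch below assumes synthetic data from `make_classification` in place of the expression matrix; `scaled_svc` and the other names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the unscaled expression data (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=10
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te.
scaled_svc = make_pipeline(
    StandardScaler(), SVC(kernel="linear", random_state=10)
)
scaled_svc.fit(X_tr, y_tr)

test_accuracy = scaled_svc.score(X_te, y_te)
print(test_accuracy)
```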


## Using UMAP-transformed data

In [4]:
X = UMAP
Y = labels_array

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)

# Train a linear SVM model
lsvc_model = SVC(kernel='linear', random_state=10)
lsvc_model.fit(X_train, Y_train)
Y_pred = lsvc_model.predict(X_test)

# Accuracy on the training and held-out test sets
print('train score: ' + str(lsvc_model.score(X_train, Y_train)))
print('test score:  ' + str(lsvc_model.score(X_test, Y_test)))

train score: 1.0
test score:  1.0


## Using t-SNE-transformed data

In [6]:
X = TSNA
Y = labels_array

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)

# Train a linear SVM model
lsvc_model = SVC(kernel='linear', random_state=10)
lsvc_model.fit(X_train, Y_train)
Y_pred = lsvc_model.predict(X_test)

# Accuracy on the training and held-out test sets
print('train score: ' + str(lsvc_model.score(X_train, Y_train)))
print('test score:  ' + str(lsvc_model.score(X_test, Y_test)))

train score: 0.9983333333333333
test score:  1.0
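
The four experiments above repeat the same split, fit, and score steps with a different feature matrix each time. A small helper makes that comparison explicit. The sketch below is one possible refactoring, not the notebook's own code: `evaluate_linear_svc` is a hypothetical function, and the two synthetic feature matrices stand in for the real representations.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def evaluate_linear_svc(X, y, test_size=0.25, random_state=10):
    """Split, fit a linear SVC, and return (train accuracy, test accuracy)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = SVC(kernel="linear", random_state=random_state).fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)


# Synthetic stand-ins for the different data representations (assumption).
X_a, y_demo = make_blobs(n_samples=200, centers=3, n_features=2, random_state=10)
representations = {
    "synthetic 2-D": X_a,
    "synthetic 2-D, rescaled": X_a * 10.0,
}

scores = {
    name: evaluate_linear_svc(X, y_demo) for name, X in representations.items()
}
for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

With the notebook's variables, the dictionary would presumably map names such as `"original"`, `"standardized"`, `"UMAP"`, and `"t-SNE"` to `expression_level`, `expression_level_sd`, `UMAP`, and `TSNA`, with `labels_array` as the target.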


# 4. Conclusion
UMAP and t-SNE do an excellent job of separating the tumor types. A simple linear SVM, with no hyperparameter optimization, predicts the test data with 100% accuracy using the UMAP-transformed data, even though over 20,000 genes have been reduced to two dimensions. The t-SNE-transformed and standardized data also reach 100% test accuracy, and the original data reaches about 99.5%. Two distinct clusters appear for the BRCA tumors; additional data, e.g. single-cell RNA-seq, would be interesting to analyze to understand the distinction.