<a href="https://colab.research.google.com/github/Edu126/Support-Vector-Machine-SVM/blob/main/support_vector_machine_svm_model_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Support Vector Machine (SVM) - Model Implementation
**Definition**
> Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. Its primary objective is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points into different classes.

> In simple terms, SVM draws the best line (hyperplane) to separate the instances of each group, leaving the widest possible gap between them. Kernels, like Picasso armed with a super calculator, let SVM use more interesting shapes to achieve this, which makes it good at picking up distinguishing features and deciding who's who.
> And regression? Imagine that, instead of dividing your data, you want the hyperplane to follow the data so you can analyze its trend.
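
To make the "widest gap" idea concrete, here is a minimal sketch on made-up 2-D points (not the leukemia data). It assumes scikit-learn's `SVC` with a linear kernel and a large `C` to approximate a hard margin; the margin width uses the standard formula 2 / ||w||. The names `X_toy`, `y_toy` and `toy_model` are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy points (illustration only).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
toy_model = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = toy_model.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = toy_model.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("hyperplane normal:", w, "offset:", b)
print("margin width:", margin_width)
print("support vectors:")
print(toy_model.support_vectors_)
```

Only the support vectors (the points closest to the boundary) determine the hyperplane; moving any other point slightly would leave the result unchanged.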

**About the dataset**:

> The dataset contains gene expression measurements of various leukemia patients
at 39 selected locations of the human genome.
These genome positions refer to the genes NPM1, RUNX1, HOXA1, . . .,
HOXA11, HOXA13. These genes are commonly known to be relevant for leukemia.
This genomic data is the basis on which doctors diagnose whether a
patient has leukemia.
For more information, refer to the Word file 'Lukemia - Data_Dictionary'.

This is a curated dataset, so no major data transformations are required.


**Task**:

> Build an SVM classifier that decides for each patient whether or not
they have blood cancer




Note: No cross-validation or hyperparameter optimization is performed, given the initial purpose of the notebook.

## Data Exploration

In [428]:
#Import Initial libraries
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)

In [429]:
#Read DF
df = pd.read_csv('leukemia.csv')
df.head()

Unnamed: 0,Patient_ID,1563591_at,1570350_at,200063_s_at,206289_at,206847_s_at,208129_x_at,208493_at,208557_at,208604_s_at,209359_x_at,209360_s_at,209905_at,209905_at.1,210365_at,210805_x_at,211180_x_at,211181_x_at,211182_x_at,211620_x_at,213147_at,213147_at.1,213150_at,213150_at.1,213823_at,213844_at,214457_at,214639_s_at,214651_s_at,214651_s_at.1,217263_x_at,221691_x_at,221923_s_at,231786_at,235521_at,235753_at,237697_at,238571_at,238808_at,243058_at,Leukemia
0,Sample_1000,3.056436,3.618254,12.641006,5.062973,3.622257,5.536213,3.796584,4.578751,5.061145,5.290259,6.625408,3.943709,3.943709,5.319614,4.506269,5.983453,4.782621,4.34229,4.948362,4.679406,4.679406,3.565061,3.565061,4.519316,3.984862,3.300462,3.484044,3.482531,3.482531,3.981086,11.077904,9.686451,2.980329,3.059078,3.780181,3.873638,3.631859,3.032457,5.18667,CLL
1,Sample_1001,2.972746,3.656448,13.009815,5.444977,4.430324,6.629713,4.143195,4.581042,4.815637,6.658555,9.108859,7.834432,7.834432,7.560836,4.993084,6.850473,5.087822,4.903638,4.579395,6.606606,6.606606,8.168509,8.168509,5.39532,7.630198,3.147876,5.237195,10.051367,10.051367,4.367952,11.784089,11.272479,3.504151,4.614741,4.456387,3.3926,3.448984,3.547128,5.084203,AML
2,Sample_1002,3.111013,3.910347,12.271732,6.454073,6.61231,7.080542,4.68584,5.839468,5.313898,6.910273,8.577111,9.403318,9.403318,6.119185,4.905725,7.434363,5.076497,5.192318,5.080847,8.065462,8.065462,8.535223,8.535223,5.49458,9.136387,3.765256,8.191289,11.708283,11.708283,3.875326,11.022868,10.209611,3.029066,8.911515,6.942798,3.864401,3.886512,3.015252,5.046901,AML
3,Sample_1003,2.882058,3.582897,12.784057,6.593272,4.799354,5.912197,3.515558,5.22402,5.401763,5.439815,9.079139,8.459776,8.459776,5.088605,4.999124,5.970327,4.81454,4.572632,4.809954,6.777287,6.777287,8.155721,8.155721,3.89822,8.122287,3.193175,3.699731,11.347153,11.347153,4.007342,11.64552,10.333872,2.632752,7.398745,5.028869,3.845556,3.326164,2.811341,4.80397,AML
4,Sample_1004,3.335401,3.426485,12.671934,6.060153,6.8328,6.332313,3.391523,5.938946,5.526973,6.680934,8.888095,8.965483,8.965483,6.551588,6.034367,6.700104,5.776896,5.257346,5.097954,8.174451,8.174451,9.377438,9.377438,5.407589,9.206095,3.490509,3.22443,11.609701,11.609701,4.577778,11.237164,10.486609,2.693749,8.503826,7.260643,3.547633,3.222044,2.904241,5.076562,AML


In [430]:
#Check the size of our df
df.shape

(1273, 41)

In [431]:
#Descriptive exploration
df.describe()

Unnamed: 0,1563591_at,1570350_at,200063_s_at,206289_at,206847_s_at,208129_x_at,208493_at,208557_at,208604_s_at,209359_x_at,209360_s_at,209905_at,209905_at.1,210365_at,210805_x_at,211180_x_at,211181_x_at,211182_x_at,211620_x_at,213147_at,213147_at.1,213150_at,213150_at.1,213823_at,213844_at,214457_at,214639_s_at,214651_s_at,214651_s_at.1,217263_x_at,221691_x_at,221923_s_at,231786_at,235521_at,235753_at,237697_at,238571_at,238808_at,243058_at
count,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0,1273.0
mean,3.115412,3.644894,12.507614,5.520219,4.133758,6.247992,3.939691,4.782672,4.970276,6.074678,8.108587,5.443801,5.443801,6.280217,5.078159,6.541487,5.142368,5.119446,5.065188,5.64431,5.64431,4.902718,4.902718,4.485124,5.445865,3.243379,4.233219,5.942459,5.942459,4.129535,11.088813,10.286677,2.924103,4.086756,4.23498,3.772124,3.684046,2.995646,5.045064
std,0.166992,0.224113,0.343771,0.579613,1.017319,0.752175,0.485354,0.869639,0.30312,0.730567,1.145677,1.908463,1.908463,1.462943,0.808264,0.709039,0.707208,0.68939,0.536771,1.100937,1.100937,2.125653,2.125653,0.762584,2.384459,0.386378,0.899628,3.222464,3.222464,0.364682,0.523535,0.590797,0.450628,1.856661,1.155967,0.274776,0.289629,0.212019,0.29484
min,2.662753,3.017731,9.934693,4.490391,2.96393,4.266826,3.151199,3.548252,4.105572,4.145656,4.741976,3.190132,3.190132,3.497821,3.735281,4.714119,3.536383,3.620302,3.916784,4.056136,4.056136,2.767695,2.767695,3.562991,3.144433,2.602331,2.991688,2.781264,2.781264,3.242886,6.084229,8.187756,2.563503,2.529696,2.951206,3.072127,2.861418,2.507,4.323821
25%,3.006619,3.498312,12.376825,5.134898,3.491341,5.705982,3.663542,4.252174,4.776805,5.520393,7.293183,4.05587,4.05587,5.099879,4.489526,6.004298,4.651087,4.635643,4.745591,4.848919,4.848919,3.337838,3.337838,4.123251,3.732406,3.014873,3.610749,3.387498,3.387498,3.877556,10.853339,9.885619,2.765655,2.994094,3.516835,3.595149,3.520205,2.875824,4.848525
50%,3.108314,3.625279,12.564261,5.338404,3.717901,6.188445,3.83796,4.500758,4.946706,6.034923,7.929928,4.363569,4.363569,6.014045,4.877209,6.511357,4.994205,4.985503,4.948362,5.208287,5.208287,3.574624,3.574624,4.286365,4.125259,3.135998,3.910382,3.738156,3.738156,4.066445,11.110813,10.275159,2.849247,3.166868,3.732807,3.730214,3.641934,2.963258,5.001983
75%,3.204183,3.774032,12.727351,5.731433,4.327442,6.771927,4.073029,4.912413,5.134135,6.602183,8.902546,6.719136,6.719136,7.380672,5.511237,7.038619,5.516647,5.484664,5.239236,6.259974,6.259974,6.560848,6.560848,4.537772,6.72137,3.332474,4.574764,8.970728,8.970728,4.317955,11.374287,10.68575,2.947633,4.005509,4.367431,3.894196,3.770722,3.073988,5.188874
max,4.576415,5.267596,13.196702,8.26293,8.245297,8.645966,8.595141,9.429559,6.364921,8.57997,11.321528,10.929677,10.929677,10.650974,9.022157,8.648116,9.927173,8.894274,8.470631,9.36468,9.36468,11.431219,11.431219,11.09366,12.468934,5.58651,8.191289,12.918816,12.918816,6.085998,12.569077,11.965435,8.137276,9.730069,8.558392,6.339444,6.23394,5.370392,6.803573
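
The `.1` suffix in names such as `209905_at.1` is what pandas appends when a CSV header repeats a column name, and the summary statistics for those pairs are identical in the table above. That suggests, but does not prove, that the columns are exact copies. Below is a minimal sketch of how one could check, using a tiny stand-in frame with values copied from `df.head()` rather than the real file; `demo` and `deduplicated` are hypothetical names.

```python
import pandas as pd

# Tiny stand-in frame: three rows of three columns copied from df.head().
demo = pd.DataFrame({
    "209905_at": [3.943709, 7.834432, 9.403318],
    "209905_at.1": [3.943709, 7.834432, 9.403318],
    "210365_at": [5.319614, 7.560836, 6.119185],
})

# Transpose so each column becomes a row, then flag rows that repeat an earlier one.
duplicate_mask = demo.T.duplicated()
duplicate_columns = duplicate_mask[duplicate_mask].index.tolist()
print(duplicate_columns)

# Keep only the first copy of each duplicated column.
deduplicated = demo.loc[:, ~duplicate_mask.values]
print(deduplicated.columns.tolist())
```

Whether to drop such columns before modelling is a judgment call; the notebook keeps them.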


In [432]:
#Validate classes of target variable
df['Leukemia'].value_counts()

AML            542
CLL            448
ALL            134
CML             76
Nonleukemia     73
Name: Leukemia, dtype: int64

## Data Modelling

In [433]:
#Create a new column - transform from multiclass to binary
df['LeukemiaClasification'] = df.Leukemia.apply(lambda x: 1 if x != 'Nonleukemia' else 0)

In [434]:
x = df.drop(['Patient_ID','Leukemia','LeukemiaClasification'], axis=1)  # X, the explanatory variables; shared by both tasks, no transformation applied
y_bin = df['LeukemiaClasification']  # target for binary classification (1 and 0)
y_mc = df['Leukemia']  # target for multiclass classification

## Support Vector Machine Implementation

The goal is to determine the leukemia class for each patient using two approaches:

1. **Binary Classification:** In this approach, we assume there are only two classes: Leukemia (1) or non-Leukemia (0).

2. **Multiclass Classification:** The goal is to determine which of the five classes (AML, CLL, ALL, CML, Nonleukemia) each patient most likely belongs to.

Since the DataFrame is not very large (relative to your computing system), we'll iterate over different SVM kernels for each approach, binary and multiclass:

- **Radial Basis Function (RBF) - Default:** RBF kernel is versatile and often works well in practice. It is suitable for capturing complex relationships in the data.

- **Linear:** The linear kernel is computationally efficient and works well when the data is linearly separable. It is a good choice for a starting point.

- **Polynomial (Poly):** The polynomial kernel is effective in capturing non-linear relationships. The degree of the polynomial can be adjusted to control the model complexity.

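- **Sigmoid:** The sigmoid kernel is based on the hyperbolic tangent, the same function used as an activation in neural networks. It is not a valid (positive semi-definite) kernel for every parameter setting, so its results can be less reliable than those of the other kernels.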

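- **Precomputed:** The precomputed option lets you pass a custom kernel (Gram) matrix instead of the raw features. This can be beneficial when you have prior knowledge about the relationships between data points. It is listed for completeness and is not part of the kernel loop in this notebook.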

These kernel choices provide flexibility in capturing different types of relationships within the data, and the iteration helps identify the most suitable kernel for the leukemia classification task.
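
Since the precomputed option is the least self-explanatory of the list, here is a minimal sketch of how it is typically used. It assumes synthetic, well-separated blobs instead of the gene-expression data and uses scikit-learn's `linear_kernel` as the "custom" kernel, so the result can be compared with the built-in linear kernel. All variable names are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import linear_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, well-separated data (stand-in for the real feature matrix).
X_demo, y_demo = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                            cluster_std=1.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Gram matrices: train-vs-train for fitting, test-vs-train for predicting.
K_train = linear_kernel(Xtr, Xtr)
K_test = linear_kernel(Xte, Xtr)

precomputed_model = SVC(kernel="precomputed").fit(K_train, ytr)
builtin_model = SVC(kernel="linear").fit(Xtr, ytr)

acc_precomputed = precomputed_model.score(K_test, yte)
acc_builtin = builtin_model.score(Xte, yte)
print(acc_precomputed, acc_builtin)
```

The key detail is the shape of the matrices: fitting expects `(n_train, n_train)` and predicting expects `(n_test, n_train)`, which is why this kernel cannot simply be added to a list of kernel names.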




### Utilities

In [435]:
#Import libraries
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [436]:
#Let's define a function to perform the model iterations
def train_evaluate_model(kernel, X_train, X_test, y_train, y_test):
    """Fit an SVC with the given kernel and return its accuracy on the test split."""
    model = SVC(kernel=kernel, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)
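
The helper fits the SVM on raw feature values. Kernel SVMs are generally sensitive to feature scale, so one possible variant standardizes the features inside a pipeline. This is only a sketch: `train_evaluate_scaled_model` is a hypothetical name, the data below is synthetic, and whether scaling helps on this curated dataset has not been tested here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_evaluate_scaled_model(kernel, X_train, X_test, y_train, y_test):
    """Hypothetical variant of train_evaluate_model that standardizes features first."""
    # The scaler is fitted on the training split only, inside the pipeline.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=42))
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in data, only to show the call pattern.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.30, stratify=y_syn, random_state=42)

scaled_scores = {
    k: train_evaluate_scaled_model(k, Xs_train, Xs_test, ys_train, ys_test)
    for k in ["rbf", "linear", "poly", "sigmoid"]
}
print(scaled_scores)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.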


In [437]:
#Define kernels to loop on
kernels = ['rbf','linear','poly','sigmoid']

### Binary
**Winner:** Poly with 96% accuracy

In [438]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(x, y_bin, test_size=0.30, stratify=y_bin, random_state= 42)

In [439]:
#Create a dictionary to store the accuracy for each model
bin_accuracy_dict = {}

In [440]:
#Kernel Iterations
for kernel in kernels:
  bin_accuracy_dict[kernel] = train_evaluate_model(kernel,X_train, X_test, y_train, y_test)

In [441]:
#Explore models performance for binary classification
pd.DataFrame.from_dict(bin_accuracy_dict,orient='index', columns=['Accuracy'])

Unnamed: 0,Accuracy
rbf,0.942408
linear,0.955497
poly,0.963351
sigmoid,0.942408
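
These accuracies are easier to judge against a trivial baseline. The sketch below computes the accuracy of a model that always predicts "leukemia". It assumes the binary test split holds 382 rows with 22 Nonleukemia patients, which matches 30% of the 1273 rows and of the 73 Nonleukemia cases, but is not printed by the notebook for this split.

```python
# Assumed test-split sizes: 382 rows, 22 of them Nonleukemia (about 30% of 1273 and of 73).
n_test = 382
n_nonleukemia_test = 22

# Accuracy of a trivial model that always predicts "leukemia" (class 1).
majority_baseline = (n_test - n_nonleukemia_test) / n_test
print(round(majority_baseline, 6))

# Accuracies reported in the table above.
reported = {"rbf": 0.942408, "linear": 0.955497, "poly": 0.963351, "sigmoid": 0.942408}
for kernel, acc in reported.items():
    extra_correct = round(acc * n_test) - (n_test - n_nonleukemia_test)
    print(f"{kernel}: {extra_correct} more correct predictions than the baseline")
```

Under that assumption the rbf and sigmoid scores coincide with the baseline, which would be consistent with those models predicting leukemia for every patient. Confirming this would require looking at the predictions themselves.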


### Multiclass
**Winner:** Linear (88.0%) and Poly (87.7%), practically tied
* There might be a slight difference if the model is run again.

In [442]:
## Train Test Split
X_train, X_test, y_train, y_test = train_test_split(x, y_mc, test_size=0.30, stratify=y_mc, random_state= 42)

In [443]:
#Create a dictionary to store the accuracy for each model
mc_accuracy_dict = {}

In [444]:
#Kernel Iterations
for kernel in kernels:
  mc_accuracy_dict[kernel] = train_evaluate_model(kernel,X_train, X_test, y_train, y_test)

In [445]:
#Explore models performance for multiclass classification
pd.DataFrame.from_dict(mc_accuracy_dict,orient='index', columns=['Accuracy'])

Unnamed: 0,Accuracy
rbf,0.837696
linear,0.879581
poly,0.876963
sigmoid,0.426702


### Individual Model Analysis - Multiclass using **Linear** kernel

In [446]:
## Train Test Split
X_train, X_test, y_train, y_test = train_test_split(x, y_mc, test_size=0.30, stratify=y_mc, random_state= 42)

In [447]:
#Initialize the model with a linear kernel; probability=True enables probability estimates
mc_svm = SVC(kernel='linear', probability=True, random_state=42)

In [448]:
#Train the model
model = mc_svm.fit(X_train,y_train)

In [449]:
#Make prediction over test data
y_pred = model.predict(X_test)

In [451]:
#Explore the model performance for each class
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         ALL       0.75      0.82      0.79        40
         AML       0.90      0.87      0.89       163
         CLL       0.94      1.00      0.97       134
         CML       0.60      0.65      0.63        23
 Nonleukemia       0.86      0.55      0.67        22

    accuracy                           0.88       382
   macro avg       0.81      0.78      0.79       382
weighted avg       0.88      0.88      0.88       382
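
The gap between the `macro avg` and `weighted avg` rows comes from how the per-class scores are combined. As a small worked example, the sketch below recomputes both recall averages from the rounded per-class values printed in this report, so the results only match the report to two decimals.

```python
# Per-class recall and support copied from the linear-kernel report above.
recall = {"ALL": 0.82, "AML": 0.87, "CLL": 1.00, "CML": 0.65, "Nonleukemia": 0.55}
support = {"ALL": 40, "AML": 163, "CLL": 134, "CML": 23, "Nonleukemia": 22}

total = sum(support.values())

# Macro average: every class counts equally, regardless of size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(total, round(macro_recall, 2), round(weighted_recall, 2))
```

The weighted figure is dominated by AML and CLL, which together make up most of the test rows, while the macro figure exposes the weaker CML and Nonleukemia results.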



### Individual Model Analysis - Multiclass using **Poly** kernel


In [452]:
from sklearn.metrics import  classification_report
import seaborn as sns

In [453]:
## Train Test Split
X_train, X_test, y_train, y_test = train_test_split(x, y_mc, test_size=0.30, stratify=y_mc, random_state= 42)

In [454]:
#Initialize the model; for the polynomial kernel we can also 'play' with an additional parameter, degree
mc_svm = SVC(kernel='poly', probability=True, degree=2, random_state=42)

In [455]:
#Train the model
model = mc_svm.fit(X_train,y_train)

In [456]:
#Make prediction over test data
y_pred = model.predict(X_test)

In [457]:
#Explore the model performance for each class
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         ALL       0.79      0.68      0.73        40
         AML       0.88      0.91      0.89       163
         CLL       0.93      1.00      0.96       134
         CML       0.67      0.52      0.59        23
 Nonleukemia       0.69      0.50      0.58        22

    accuracy                           0.87       382
   macro avg       0.79      0.72      0.75       382
weighted avg       0.86      0.87      0.87       382



## **Preliminary conclusions**

1. **Model Performance:** The linear and polynomial kernels reach comparable overall accuracy (about 87-88%). Accuracy alone is flattering here, though: recall for the Nonleukemia class is only 0.55 (linear) and 0.50 (poly), and macro-averaged recall is 0.78 and 0.72, so the minority classes are classified much less reliably than AML and CLL.

2. **Preference for Simplicity:** The linear kernel slightly outperforms the polynomial kernel, and it is also the simpler and more interpretable of the two. In line with Occam's razor, the simpler model is preferable when performance is similar.

3. **Consideration of Class Imbalance:** Class imbalance may impact model performance and introduce biases. Techniques such as resampling or adjusting class weights could improve the model.
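
To give a feel for what "adjusting class weights" would mean here, the sketch below computes scikit-learn's `balanced` weights from the class counts printed earlier in the notebook. The label vector is rebuilt from those counts rather than read from the CSV, and no model is trained, so this shows the weights only, not their effect.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from df['Leukemia'].value_counts() in the notebook.
class_counts = {"AML": 542, "CLL": 448, "ALL": 134, "CML": 76, "Nonleukemia": 73}

# Rebuild a label vector with the same class frequencies.
labels = np.repeat(list(class_counts), list(class_counts.values()))
classes = np.array(sorted(class_counts))

# 'balanced' weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
balanced_weights = dict(zip(classes, weights))
for name, weight in balanced_weights.items():
    print(f"{name}: {weight:.2f}")
```

Passing `class_weight='balanced'` to `SVC` scales the penalty `C` per class by these factors, so errors on Nonleukemia and CML would cost more than errors on AML. Whether that improves recall on this dataset would need to be measured.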


## **What's next?**
- Evaluate model feature importance (SHAP)
- Analyze decision boundaries (Anchor)
- Perform a two-step classification (cascade classifier), in which the binary result is used as a feature in the multiclass model, possibly improving the multiclass result. This approach has pros and cons that I'm still reading about. [Wikipedia](https://en.wikipedia.org/wiki/Cascading_classifiers), [Paper](https://www.umiacs.umd.edu/labs/cvl/pirl/vikas/publications/raykar_kdd2010_cascade_v3.pdf)
- A different approach to boost model interpretability is to use a linear kernel with a OneVsRestClassifier. [Sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)
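
As a rough illustration of the last idea, the sketch below wraps a linear `SVC` in `OneVsRestClassifier` on synthetic five-class data (not the leukemia data). `SVC` on its own handles multiclass problems with a one-vs-one scheme, which yields one weight vector per pair of classes; the one-vs-rest wrapper instead yields one weight vector per class, which may be easier to read. Variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 5-class data standing in for the five diagnosis labels.
X_ovr, y_ovr = make_classification(n_samples=500, n_features=10, n_informative=6,
                                   n_classes=5, n_clusters_per_class=1,
                                   random_state=42)

ovr_model = OneVsRestClassifier(SVC(kernel="linear")).fit(X_ovr, y_ovr)

# One binary linear SVM per class; each exposes one weight per feature.
for class_label, estimator in zip(ovr_model.classes_, ovr_model.estimators_):
    class_weights = estimator.coef_.ravel()
    top_feature = abs(class_weights).argmax()
    print(f"class {class_label}: largest |weight| on feature {top_feature}")
```

On the real data, each weight vector would rank the gene probes by how strongly they push a sample toward or away from one diagnosis, under the usual caveat that weights are only comparable across features on a similar scale.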
