# SVM and MLP Model Comparison

## Instructions

Use the Crohn’s Disease dataset: CrohnD (make sure it is CrohnD with a 'D' - there are multiple Crohn's datasets)

You will need to preprocess this before you can use it. You will need to drop the 'ID' column and you will need to rename the following values:

c1 -> 0, c2 -> 1, F -> 0, M -> 1

Build four SVM (see documentation) models with the best cross-validated performance you can find. Do a cross-validated grid search over the following kernels:

kernel: linear

kernel: poly by varying parameters 'degree' between 2-6, 'coef0'  around 1.0

kernel: rbf by varying parameters 'gamma' between 'scale', 'auto'

kernel: sigmoid by varying parameter coef0

Note: Vary C, regularization constant, between
going up in exponential steps. Set max_iter value to 10,000 to avoid situations where your model doesn't converge.

Does tuning the 'gamma' parameter for poly and sigmoid kernels give you better results?

Compare your best SVM models to each other. Report whether the difference between the models is statistically significant (hint: use confidence intervals).


In [1]:
# Imports

# Model Stuff
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline

# Other Stuff
import seaborn as sn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Data (GitHub clone)
!git clone https://github.com/KamronAggorURI/CSC310.git
%cd CSC310
!git status
!git add CrohnD.csv

fatal: destination path 'CSC310' already exists and is not an empty directory.
/Users/kamronaggor/Desktop/School/URI/Spring 2025/DSP 310/Homework/Homework 9/CSC310
On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.DS_Store

nothing added to commit but untracked files present (use "git add" to track)


In [2]:
df = pd.read_csv("CrohnD.csv")
df.head() # good

Unnamed: 0,rownames,ID,nrAdvE,BMI,height,country,sex,age,weight,treat
0,1,19908,4,25.22,163,c1,F,47,67,placebo
1,2,19909,4,23.8,164,c1,F,53,64,d1
2,3,19910,1,23.05,164,c1,F,68,62,placebo
3,4,20908,1,25.71,165,c1,F,48,70,d2
4,5,20909,2,25.95,170,c1,F,67,75,placebo


## Step 1.
You will need to preprocess this before you can use it. You will need to drop the 'ID' column and you will need to rename the following values:

> c1 -> 0, c2 -> 1, F -> 0, M -> 1

In [3]:
# Preprocessing

# Drop the ID column (and the redundant rownames column)
df.drop(labels=['ID', 'rownames'], axis=1, inplace=True)

df.head() # Good

# Convert categoricals to numericals
df.replace(
    {'country' : {'c1' : 0, 'c2' : 1}, 'sex' : {'F' : 0, 'M' : 1}, 'treat' : {'placebo' : 0, 'd1' : 1, 'd2' : 2}}, inplace=True)
df.head() # Good

Unnamed: 0,nrAdvE,BMI,height,country,sex,age,weight,treat
0,4,25.22,163,0,0,47,67,0
1,4,23.8,164,0,0,53,64,1
2,1,23.05,164,0,0,68,62,0
3,1,25.71,165,0,0,48,70,2
4,2,25.95,170,0,0,67,75,0


## Step 2.
Build four SVM (see documentation) models with the best cross-validated performance you can find.
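
One way to read "best cross-validated performance" is to fit one `GridSearchCV` per kernel and then read its `best_score_` and `best_params_` directly. The sketch below illustrates this on synthetic data. `X_demo`, `y_demo`, and the grid values are placeholders chosen for illustration, not the CrohnD data or the ranges the assignment requires.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption); swap in X_train, y_train for the real run
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# One small illustrative grid per kernel
grids = {
    'linear': {'C': [0.1, 1, 10]},
    'rbf': {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']},
}

results = {}
for kernel, grid in grids.items():
    search = GridSearchCV(SVC(kernel=kernel, max_iter=10_000), grid, cv=5)
    search.fit(X_demo, y_demo)
    # best_score_ is the mean CV accuracy of the best parameter combination
    results[kernel] = (search.best_score_, search.best_params_)
    print(f'{kernel}: best CV accuracy = {search.best_score_:.3f}, '
          f'params = {search.best_params_}')
```

Keying the results by kernel name also keeps the printed summary short, compared with printing the full `GridSearchCV` representation.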

In [4]:
# Split the data
X = df.drop('nrAdvE', axis=1)
y = df['nrAdvE']

# Train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Grid Search the best params to get the best score available per kernel
pipeline = make_pipeline(SVC())

# Model setup
param_grid_linear = {
    'svc__C': [0.001, 0.01, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'svc__kernel': ['linear']
}
svm_linear = GridSearchCV(pipeline, param_grid_linear, cv=5, n_jobs=-1)

param_grid_poly = {
    'svc__C': [0.001, 0.01, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'svc__kernel': ['poly']
}
svm_poly = GridSearchCV(pipeline, param_grid_poly, cv=5, n_jobs=-1)

param_grid_rbf = {
    'svc__C': [0.001, 0.01, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'svc__kernel': ['rbf']
}
svm_rbf = GridSearchCV(pipeline, param_grid_rbf, cv=5, n_jobs=-1)

param_grid_sigmoid = {
    'svc__C': [0.001, 0.01, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'svc__kernel': ['sigmoid']
}
svm_sigmoid = GridSearchCV(pipeline, param_grid_sigmoid, cv=5, n_jobs=-1)

models = [svm_linear, svm_poly, svm_rbf, svm_sigmoid]

# 5 Fold CV
scores = dict({model : cross_val_score(model, X_train, y_train, cv=5).mean() for model in models})

for model, score in scores.items():
    print(f'\033[1m {model} 5-F CV Accuracy: {score} \033[0m \n')



GridSearchCV(cv=5, estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=-1,
             param_grid={'svc__C': [0.001, 0.01, 0.1, 1, 10],
                         'svc__gamma': ['scale', 'auto'],
                         'svc__kernel': ['linear']}) 5-F CV Accuracy: 0.4514619883040935

GridSearchCV(cv=5, estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=-1,
             param_grid={'svc__C': [0.001, 0.01, 0.1, 1, 10],
                         'svc__gamma': ['scale', 'auto'],
                         'svc__kernel': ['poly']}) 5-F CV Accuracy: 0.4619883040935672

GridSearchCV(cv=5, estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=-1,
             param_grid={'svc__C': [0.001, 0.01, 0.1, 1, 10],
                         'svc__gamma': ['scale', 'auto'],
                         'svc__kernel': ['rbf']}) 5-F CV Accuracy: 0.4619883040935672

GridSearchCV(cv=5, estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=-1,
             param_grid={'svc__C': [0.001



> The best 5-fold cross-validated accuracy is ~46.20%, regardless of the SVC kernel.
> Only the linear kernel scores differently from the others, at ~45.15%.
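
Near-identical accuracy across kernels is consistent with, though it does not prove, the models all predicting the most frequent class. A quick check is to compare against a majority-class baseline, and to try standardizing the features, since SVMs are sensitive to feature scale. The sketch below uses synthetic data; `X_demo`, `y_demo`, and the added `StandardScaler` step are assumptions and are not part of the notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class stand-in data (an assumption)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=3, n_classes=3, random_state=0
)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline_acc = cross_val_score(baseline, X_demo, y_demo, cv=5).mean()

# SVM with standardized features for comparison
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', max_iter=10_000))
svm_acc = cross_val_score(scaled_svm, X_demo, y_demo, cv=5).mean()

print(f'Majority-class baseline: {baseline_acc:.3f}')
print(f'Scaled RBF SVM:          {svm_acc:.3f}')
```

If the SVM scores on the real data match the baseline, the grid search has likely not found a model that learns anything beyond the class frequencies.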

## Step 3.
Do a cross-validated grid search over the following kernels:

kernel: linear

kernel: poly by varying parameters 'degree' between 2-6, 'coef0'  around 1.0

kernel: rbf by varying parameters 'gamma' between 'scale', 'auto'

kernel: sigmoid by varying parameter coef0

Note: Vary C, regularization constant, between
going up in exponential steps. Set max_iter value to 10,000 to avoid situations where your model doesn't converge.
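
A minimal sketch of the note above: `np.logspace` produces values of C in exponential steps, and `max_iter` is set on the `SVC` itself. The bounds for C are not given in the text, so the range here is hypothetical; it simply reproduces the five values used in the Step 2 grids.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical bounds: 10^-3 to 10^1 in powers of ten
C_values = np.logspace(-3, 1, num=5)  # 0.001, 0.01, 0.1, 1, 10

# max_iter caps the solver iterations so a hard fit cannot run indefinitely
pipeline_demo = make_pipeline(SVC(max_iter=10_000))

# The 'svc__' prefix routes the parameter to the SVC step of the pipeline
param_grid_demo = {'svc__C': C_values.tolist()}
print(param_grid_demo)
```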

Does tuning the 'gamma' parameter for poly and sigmoid kernels give you better results?
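
To reason about this question it helps to know what the two settings mean. In scikit-learn, `gamma='scale'` is `1 / (n_features * X.var())` and `gamma='auto'` is `1 / n_features`; gamma is used by the rbf, poly, and sigmoid kernels and ignored by linear. The sketch below computes both values on synthetic data (`X_demo` and `y_demo` are assumptions, multiplied by 50 to mimic unscaled columns) and checks that passing the number gives the same model as passing `'scale'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
X_demo = X_demo * 50.0  # mimic unscaled features (assumption)

n_features = X_demo.shape[1]
gamma_scale = 1.0 / (n_features * X_demo.var())  # depends on the data's variance
gamma_auto = 1.0 / n_features                    # depends only on the column count

# Fitting with the string and with the computed number should agree
by_name = SVC(kernel='rbf', gamma='scale').fit(X_demo, y_demo)
by_value = SVC(kernel='rbf', gamma=gamma_scale).fit(X_demo, y_demo)
same = np.allclose(by_name.decision_function(X_demo),
                   by_value.decision_function(X_demo))

print(f"gamma='scale' -> {gamma_scale:.6f}, gamma='auto' -> {gamma_auto:.6f}, same model: {same}")
```

On unscaled data the two values can differ by orders of magnitude, so the two settings can behave very differently; on standardized data they are nearly equal.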

In [15]:
# Cross-validated grid search over linear kernel -> already done!

# Cross-validated grid search over 'poly' kernel, varying parameters 'degree' between 2-6, 'coef0' around 1.0:

# First pass
param_grid_poly = {
    'svc__kernel' : ['poly'],
    'svc__degree' : [2, 3, 4, 5, 6],
    'svc__coef0' : [1.0],
    'svc__gamma' : ['scale']
}

# Cross-validated grid search over 'rbf' kernel, varying parameters 'gamma' between 'scale' and auto:
param_grid_rbf = {
    'svc__kernel' : ['rbf'],
    'svc__gamma' : ['scale']
}

# Cross-validated grid search over 'sigmoid' kernel, varying parameter 'coef0'.
param_grid_sigmoid = {
    'svc__kernel' : ['sigmoid'],
    'svc__coef0' : [2, 3, 4, 5, 6],
    'svc__gamma' : ['scale']
}

# Running our Gridsearch

# Use 3 folds if the smallest class has fewer than 5 samples, otherwise 5
cv_folds = 3 if y_train.value_counts().min() < 5 else 5

# Then run it
svm_poly = GridSearchCV(pipeline, param_grid_poly, cv=cv_folds, n_jobs=1, error_score='raise')

svm_rbf = GridSearchCV(pipeline, param_grid_rbf, cv=cv_folds, n_jobs=1, error_score='raise')

svm_sigmoid = GridSearchCV(pipeline, param_grid_sigmoid, cv=cv_folds, n_jobs=1, error_score='raise')

# Outputting scores
models = [svm_linear, svm_poly, svm_rbf, svm_sigmoid]

scores = dict({model : cross_val_score(model, X_train, y_train, cv=cv_folds).mean() for model in models})

for model, score in scores.items():
    print(f'\033[1m {model} 5-F CV Accuracy: {score} \033[0m \n')


    
# Second pass
param_grid_poly = {
    'svc__kernel' : ['poly'],
    'svc__degree' : [2, 3, 4, 5, 6],
    'svc__coef0' : [1.0],
    'svc__gamma' : ['auto']
}

# Cross-validated grid search over 'rbf' kernel, varying parameters 'gamma' between 'scale' and auto:
param_grid_rbf = {
    'svc__kernel' : ['rbf'],
    'svc__gamma' : ['auto']
}

# Cross-validated grid search over 'sigmoid' kernel, varying parameter 'coef0'.
param_grid_sigmoid = {
    'svc__kernel' : ['sigmoid'],
    'svc__coef0' : [2, 3, 4, 5, 6],
    'svc__gamma' : ['auto']
}

# Running our Gridsearch

# Use 3 folds if the smallest class has fewer than 5 samples, otherwise 5
cv_folds = 3 if y_train.value_counts().min() < 5 else 5

# Then run it
svm_poly = GridSearchCV(pipeline, param_grid_poly, cv=cv_folds, n_jobs=1, error_score='raise')

svm_rbf = GridSearchCV(pipeline, param_grid_rbf, cv=cv_folds, n_jobs=1, error_score='raise')

svm_sigmoid = GridSearchCV(pipeline, param_grid_sigmoid, cv=cv_folds, n_jobs=1, error_score='raise')

# Outputting scores
models = [svm_linear, svm_poly, svm_rbf, svm_sigmoid]

scores = dict({model : cross_val_score(model, X_train, y_train, cv=cv_folds).mean() for model in models})

for model, score in scores.items():
    print(f'\033[1m {model} 5-F CV Accuracy: {score} \033[0m \n')
    



GridSearchCV(cv=5, estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=-1,
             param_grid={'svc__C': [0.001, 0.01, 0.1, 1, 10],
                         'svc__gamma': ['scale', 'auto'],
                         'svc__kernel': ['linear']}) 5-F CV Accuracy: 0.4623655913978495

GridSearchCV(cv=3, error_score='raise',
             estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=1,
             param_grid={'svc__coef0': [1.0], 'svc__degree': [2, 3, 4, 5, 6],
                         'svc__gamma': ['scale'], 'svc__kernel': ['poly']}) 5-F CV Accuracy: 0.4623655913978495

GridSearchCV(cv=3, error_score='raise',
             estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=1,
             param_grid={'svc__gamma': ['scale'], 'svc__kernel': ['rbf']}) 5-F CV Accuracy: 0.4623655913978495

GridSearchCV(cv=3, error_score='raise',
             estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=1,
             param_grid={'svc__coef0': [2, 3, 4, 5, 6]



GridSearchCV(cv=5, estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=-1,
             param_grid={'svc__C': [0.001, 0.01, 0.1, 1, 10],
                         'svc__gamma': ['scale', 'auto'],
                         'svc__kernel': ['linear']}) 5-F CV Accuracy: 0.4623655913978495

GridSearchCV(cv=3, error_score='raise',
             estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=1,
             param_grid={'svc__coef0': [1.0], 'svc__degree': [2, 3, 4, 5, 6],
                         'svc__gamma': ['auto'], 'svc__kernel': ['poly']}) 5-F CV Accuracy: 0.22580645161290325

GridSearchCV(cv=3, error_score='raise',
             estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=1,
             param_grid={'svc__gamma': ['auto'], 'svc__kernel': ['rbf']}) 5-F CV Accuracy: 0.4623655913978495

GridSearchCV(cv=3, error_score='raise',
             estimator=Pipeline(steps=[('svc', SVC())]), n_jobs=1,
             param_grid={'svc__coef0': [2, 3, 4, 5, 6],



> The scores are essentially identical for every kernel except poly (the linear kernel does not use gamma at all).
> Switching gamma from 'scale' to 'auto' therefore does not improve results for any kernel. For the poly kernel it is substantially worse: cross-validated accuracy drops from ~46.24% to ~22.58% with gamma='auto'.
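
The instructions also ask whether differences between models are statistically significant. One hedged way to check is to score two models on the same folds and build a t-based 95% confidence interval on the per-fold accuracy differences; if the interval excludes 0, the difference is significant at roughly that level. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the two kernels compared are assumptions). Fold scores are not fully independent, so the interval is approximate.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Same folds for both models so the per-fold scores are paired
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_a = cross_val_score(SVC(kernel='linear', max_iter=10_000), X_demo, y_demo, cv=cv)
scores_b = cross_val_score(SVC(kernel='rbf', max_iter=10_000), X_demo, y_demo, cv=cv)

diff = scores_a - scores_b
mean_diff = diff.mean()
sem = stats.sem(diff)  # standard error of the mean difference

if sem == 0:
    # All fold differences are identical, so the interval collapses to a point
    low = high = mean_diff
else:
    low, high = stats.t.interval(0.95, df=len(diff) - 1, loc=mean_diff, scale=sem)

significant = not (low <= 0.0 <= high)
print(f'Mean difference: {mean_diff:.3f}, 95% CI: [{low:.3f}, {high:.3f}], '
      f'significant: {significant}')
```

With only 3 to 5 folds, as used in the cells above, the interval will be wide; more folds or repeated cross-validation would give a tighter estimate.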

