# CMPS-320 Homework #6
## Support Vector Machine Practice
Elijah Campbell-Ihim

12/12/23

(a) The fundamental idea behind a Support Vector Machine (SVM) is to find the hyperplane (the generalization of a line to higher dimensions) that separates the classes with the largest possible margin. The margin is the distance from the hyperplane to the nearest training point of either class; those nearest points are the support vectors. Maximizing the margin, rather than merely separating the classes, tends to help the classifier generalize to new, unseen data.
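
As a rough illustration of the margin idea, the sketch below fits a linear SVC to four made-up points (the `X_toy` and `y_toy` arrays are hypothetical, not part of the Auto data) and recovers the margin from the fitted weight vector. A very large `C` is used here only to approximate a hard-margin classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable classes in 2-D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin classifier.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane

# Distance from the hyperplane to the nearest training point.
margin = 1.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```

For these points the separating hyperplane sits midway between the two columns of points, so the margin should come out close to 1.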

## Import libraries and dataset

In [4]:
import pandas as pd

# Load the Auto dataset
auto = pd.read_csv("Auto.csv")

auto.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


## (b) Create binary variable for high/low gas mileage

In [5]:
# Calculate the median gas mileage
median_mpg = auto['mpg'].median()

# Create a binary variable
auto['high_mileage'] = (auto['mpg'] > median_mpg).astype(int)

# Display the modified dataset
auto.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,high_mileage
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,0
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,0
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,0
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,0
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,0
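
One detail of the median split is worth checking: because the comparison is a strict `>`, cars whose mpg equals the median are labelled 0. The small sketch below uses a hypothetical `toy` frame (values invented for illustration) to show this.

```python
import pandas as pd

# Hypothetical mpg values; two of them sit exactly on the median.
toy = pd.DataFrame({"mpg": [18.0, 15.0, 18.0, 16.0, 17.0, 30.0, 32.0]})

toy_median = toy["mpg"].median()

# Strict '>' means ties with the median fall into the low-mileage class.
toy["high_mileage"] = (toy["mpg"] > toy_median).astype(int)

print("median:", toy_median)
print(toy["high_mileage"].value_counts())
```

Running `value_counts()` on the real `auto['high_mileage']` column is a quick way to confirm how balanced the two classes are.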


## Some Data Preprocessing

In [6]:
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           397 non-null    float64
 1   cylinders     397 non-null    int64  
 2   displacement  397 non-null    float64
 3   horsepower    397 non-null    object 
 4   weight        397 non-null    int64  
 5   acceleration  397 non-null    float64
 6   year          397 non-null    int64  
 7   origin        397 non-null    int64  
 8   name          397 non-null    object 
 9   high_mileage  397 non-null    int32  
dtypes: float64(3), int32(1), int64(4), object(2)
memory usage: 29.6+ KB


In [7]:
# Convert 'horsepower' column to numeric (float) type
auto['horsepower'] = pd.to_numeric(auto['horsepower'], errors='coerce')


# Display the modified dataset
auto.head()


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,high_mileage
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,0
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320,0
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,0
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,0
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino,0


In [8]:
auto = auto.dropna()
auto.info()

<class 'pandas.core.frame.DataFrame'>
Index: 392 entries, 0 to 396
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
 9   high_mileage  392 non-null    int32  
dtypes: float64(4), int32(1), int64(4), object(1)
memory usage: 32.2+ KB
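
The two cleaning steps above can be sketched on a tiny series. The `"?"` placeholder is an assumption about what the non-numeric horsepower entries look like; the notebook output only shows that the column was read as `object` and that five rows were dropped (397 to 392).

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings plus one placeholder.
raw = pd.Series(["130", "?", "150"], name="horsepower")

# errors="coerce" turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(raw, errors="coerce")

# dropna() then removes the rows that failed to parse.
cleaned = converted.dropna()

print(converted)
print(cleaned)
```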


## (c) Linear Support Vector Classifier

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Split the data into features (X) and target variable (y)
X = auto.drop(['mpg', 'high_mileage', 'name'], axis=1)
y = auto['high_mileage']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit a linear Support Vector Classifier (SVC) with various values of cost
cost_values = [0.1, 1, 10, 100, 1000]
for cost in cost_values:
    svm_model = SVC(kernel='linear', C=cost)
    svm_model.fit(X_train_scaled, y_train)
    y_pred = svm_model.predict(X_test_scaled)
    
    accuracy = accuracy_score(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    
    print(f"Linear SVM with C={cost}: Accuracy = {accuracy:.4f}")
    print("Confusion Matrix:")
    print(confusion_mat)
    print("\n")


Linear SVM with C=0.1: Accuracy = 0.8734
Confusion Matrix:
[[34  9]
 [ 1 35]]


Linear SVM with C=1: Accuracy = 0.9114
Confusion Matrix:
[[38  5]
 [ 2 34]]


Linear SVM with C=10: Accuracy = 0.8861
Confusion Matrix:
[[36  7]
 [ 2 34]]


Linear SVM with C=100: Accuracy = 0.8861
Confusion Matrix:
[[36  7]
 [ 2 34]]


Linear SVM with C=1000: Accuracy = 0.8861
Confusion Matrix:
[[36  7]
 [ 2 34]]
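
To read these matrices: scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, so with labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below recomputes the summary numbers for the C=1 matrix printed above.

```python
import numpy as np

# Confusion matrix reported above for the linear SVM with C=1.
cm = np.array([[38, 5],
               [2, 34]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all test cars classified correctly
recall_high = tp / (tp + fn)      # share of high-mileage cars found
recall_low = tn / (tn + fp)       # share of low-mileage cars found

print(f"accuracy    = {accuracy:.4f}")
print(f"recall (1)  = {recall_high:.4f}")
print(f"recall (0)  = {recall_low:.4f}")
```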




### Interpretation:

#### The linear SVM models are consistently accurate on the test set, ranging from 87.34% (C=0.1) to 91.14% (C=1). The C=1 model has the highest accuracy and the most balanced confusion matrix (38 true negatives, 34 true positives, and 7 errors in total), making it the best of the linear models trained here. Accuracy is identical (88.61%) for C=10, 100, and 1000.
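
The comparison above picks C by test-set accuracy. An alternative, which is not what this notebook does, is to choose C by cross-validation on the training set and keep the test set for a final check. The sketch below shows one way to do that with a `Pipeline` and `GridSearchCV`; it runs on synthetic data from `make_classification` (the `X_demo`/`y_demo` names are placeholders), so its numbers say nothing about the Auto results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled Auto features (assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so it is refit on each CV training fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
param_grid = {"svc__C": [0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best C:", search.best_params_["svc__C"])
print("mean CV accuracy:", search.best_score_)
print("held-out test accuracy:", search.score(X_te, y_te))
```

With the real data, `X_train` and `y_train` from the split above would take the place of `X_tr` and `y_tr`, passed in unscaled since the pipeline handles standardization.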

## (d) SVM with radial basis kernel

In [10]:
# Fit SVMs with radial basis kernel with different values of gamma and cost
gamma_values = [0.1, 1, 10]
cost_values = [0.1, 1, 10, 100]

for gamma in gamma_values:
    for cost in cost_values:
        svm_model = SVC(kernel='rbf', gamma=gamma, C=cost)
        svm_model.fit(X_train_scaled, y_train)
        y_pred = svm_model.predict(X_test_scaled)

        accuracy = accuracy_score(y_test, y_pred)
        confusion_mat = confusion_matrix(y_test, y_pred)

        print(f"Radial SVM with gamma={gamma}, C={cost}: Accuracy = {accuracy:.4f}")
        print("Confusion Matrix:")
        print(confusion_mat)
        print("\n")


Radial SVM with gamma=0.1, C=0.1: Accuracy = 0.8608
Confusion Matrix:
[[33 10]
 [ 1 35]]


Radial SVM with gamma=0.1, C=1: Accuracy = 0.8861
Confusion Matrix:
[[34  9]
 [ 0 36]]


Radial SVM with gamma=0.1, C=10: Accuracy = 0.8861
Confusion Matrix:
[[37  6]
 [ 3 33]]


Radial SVM with gamma=0.1, C=100: Accuracy = 0.8987
Confusion Matrix:
[[39  4]
 [ 4 32]]


Radial SVM with gamma=1, C=0.1: Accuracy = 0.8608
Confusion Matrix:
[[35  8]
 [ 3 33]]


Radial SVM with gamma=1, C=1: Accuracy = 0.8861
Confusion Matrix:
[[38  5]
 [ 4 32]]


Radial SVM with gamma=1, C=10: Accuracy = 0.8861
Confusion Matrix:
[[38  5]
 [ 4 32]]


Radial SVM with gamma=1, C=100: Accuracy = 0.8861
Confusion Matrix:
[[37  6]
 [ 3 33]]


Radial SVM with gamma=10, C=0.1: Accuracy = 0.5443
Confusion Matrix:
[[43  0]
 [36  0]]


Radial SVM with gamma=10, C=1: Accuracy = 0.8734
Confusion Matrix:
[[42  1]
 [ 9 27]]


Radial SVM with gamma=10, C=10: Accuracy = 0.8734
Confusion Matrix:
[[42  1]
 [ 9 27]]


Radial SVM with gam

### Interpretation:

#### The radial SVM models generally show good accuracy, but results vary with the hyperparameters. For gamma=0.1 and gamma=1, accuracy rises or holds steady as C increases. The effect of gamma is clearest at gamma=10: with C=0.1 accuracy drops to 54.43% because the model predicts every test observation as low mileage, and with C=1 and C=10 it misclassifies 9 of the 36 high-mileage cars. A large gamma makes each training point's influence very local, so the model tracks the training data closely and generalizes poorly. In this grid, a small gamma paired with a large C worked best. Among the results shown, the model with gamma = 0.1 and C = 100 is the best, both in accuracy (89.87%) and in the balance of its confusion matrix.
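
To see why gamma matters so much, recall that the RBF kernel used by scikit-learn is K(x, z) = exp(-gamma * ||x - z||^2). The sketch below evaluates it for two hypothetical points at squared distance 1 using the three gamma values from the grid; `rbf_kernel_value` is a helper written for this illustration.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) for two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two hypothetical standardized points one unit apart.
x = [0.0, 0.0]
z = [1.0, 0.0]

for gamma in [0.1, 1, 10]:
    print(f"gamma={gamma}: K(x, z) = {rbf_kernel_value(x, z, gamma):.6f}")
```

At gamma=10 the similarity between points one unit apart is already close to zero, so each training point only influences its immediate neighbourhood.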

## (d) SVM with polynomial basis kernel

In [11]:
# Fit SVMs with polynomial basis kernel with different values of degree, gamma, and cost
degree_values = [2, 3, 4]
gamma_values = [0.1, 1, 10]
cost_values = [0.1, 1, 10, 100]

for degree in degree_values:
    for gamma in gamma_values:
        for cost in cost_values:
            svm_model = SVC(kernel='poly', degree=degree, gamma=gamma, C=cost)
            svm_model.fit(X_train_scaled, y_train)
            y_pred = svm_model.predict(X_test_scaled)

            accuracy = accuracy_score(y_test, y_pred)
            confusion_mat = confusion_matrix(y_test, y_pred)

            print(f"Polynomial SVM with degree={degree}, gamma={gamma}, C={cost}: Accuracy = {accuracy:.4f}")
            print("Confusion Matrix:")
            print(confusion_mat)
            print("\n")


Polynomial SVM with degree=2, gamma=0.1, C=0.1: Accuracy = 0.5823
Confusion Matrix:
[[14 29]
 [ 4 32]]


Polynomial SVM with degree=2, gamma=0.1, C=1: Accuracy = 0.6203
Confusion Matrix:
[[28 15]
 [15 21]]


Polynomial SVM with degree=2, gamma=0.1, C=10: Accuracy = 0.7342
Confusion Matrix:
[[36  7]
 [14 22]]


Polynomial SVM with degree=2, gamma=0.1, C=100: Accuracy = 0.7468
Confusion Matrix:
[[36  7]
 [13 23]]


Polynomial SVM with degree=2, gamma=1, C=0.1: Accuracy = 0.7342
Confusion Matrix:
[[36  7]
 [14 22]]


Polynomial SVM with degree=2, gamma=1, C=1: Accuracy = 0.7468
Confusion Matrix:
[[36  7]
 [13 23]]


Polynomial SVM with degree=2, gamma=1, C=10: Accuracy = 0.7848
Confusion Matrix:
[[37  6]
 [11 25]]


Polynomial SVM with degree=2, gamma=1, C=100: Accuracy = 0.8354
Confusion Matrix:
[[36  7]
 [ 6 30]]


Polynomial SVM with degree=2, gamma=10, C=0.1: Accuracy = 0.7848
Confusion Matrix:
[[37  6]
 [11 25]]


Polynomial SVM with degree=2, gamma=10, C=1: Accuracy = 0.8354
Confusi

### Interpretation:

#### For degree=3, the accuracy is consistently high across the gamma and cost values, ranging from 88.61% to 91.14%. This suggests that a cubic polynomial may be the best choice for capturing the underlying relationships in the data. The pattern in gamma and C is similar to the radial models: a relatively large C combined with a relatively small gamma produces a strong model. Overall, the best choice here would be degree = 3, gamma = 0.1 (the smallest gamma in the grid), and C = 100.
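
One pattern in the degree=2 output is worth noting: gamma=0.1 with C=10 and gamma=1 with C=0.1 give the same confusion matrix, as do several other pairs. A likely explanation is that `SVC` uses `coef0=0` by default, so the polynomial kernel is (gamma * <x, z>)^degree, and multiplying gamma by k only rescales the kernel by k^degree, which has the same effect as multiplying C by k^degree. The sketch below checks this on synthetic data (`X_demo` and `y_demo` are placeholders, not the Auto features).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption for the demo).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# With coef0=0, a 10x larger gamma scales a degree-2 kernel by 10**2.
K_small = polynomial_kernel(X_demo, degree=2, gamma=0.1, coef0=0)
K_large = polynomial_kernel(X_demo, degree=2, gamma=1.0, coef0=0)
kernels_match = np.allclose(K_large, 100 * K_small)

# So (gamma=0.1, C=10) and (gamma=1, C=0.1) describe the same problem.
model_a = SVC(kernel="poly", degree=2, gamma=0.1, C=10, tol=1e-5).fit(X_demo, y_demo)
model_b = SVC(kernel="poly", degree=2, gamma=1.0, C=0.1, tol=1e-5).fit(X_demo, y_demo)
agreement = np.mean(model_a.predict(X_demo) == model_b.predict(X_demo))

print("kernel rescaling holds:", kernels_match)
print("share of identical predictions:", agreement)
```

A practical consequence is that, with `coef0=0`, a grid over both gamma and C for the polynomial kernel contains redundant combinations; fixing gamma and varying C (or setting a nonzero `coef0`) would explore more distinct models.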