# **Week 4: Colab Experiment**

# I. Introduction
In this exercise, we load the Breast Cancer Wisconsin dataset (569 samples, 30 numeric features, binary target) from scikit-learn and use it for classification.

The approach combines hyperparameter tuning with cross-validation to compare three classifiers: Logistic Regression, Support Vector Machine (SVM), and Decision Tree. For Logistic Regression and SVM, features are first standardized with StandardScaler inside a pipeline so that scaling is learned from the training data only. We use GridSearchCV to tune each model: C, penalty, and solver for Logistic Regression; C, kernel, gamma, degree, coef0, and class_weight for SVM; and max_depth, min_samples_split, min_samples_leaf, and criterion for the Decision Tree. Each tuned model is then evaluated on the held-out fold of an outer 5-fold cross-validation, and we report the error rate.
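
As a minimal sketch of the pipeline-plus-grid-search idea described above, the cell below tunes only `C` for a scaled logistic regression. The `demo_` names and the three-value grid are assumptions made for illustration and are not the settings used in the Methods section.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler is fit inside each CV split, so no test-fold statistics leak in.
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=10000)),
])

# Grid keys follow the pattern "<pipeline step name>__<parameter name>".
demo_grid = {"classifier__C": [0.1, 1, 10]}  # illustrative values only

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(f"best mean CV accuracy: {demo_search.best_score_:.4f}")
```

The `classifier__` prefix is what lets GridSearchCV route each value to the right step of the pipeline.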

# II. Methods

In [5]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from collections import Counter
from datetime import datetime
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import zero_one_loss
from sklearn.decomposition import PCA


In [6]:
# Define the dependent and independent variables.
data = load_breast_cancer()
Y = data.target
X = data.data


In [7]:
# Create CV folds
num_folds = 5
kf = KFold(n_splits=num_folds, random_state=0, shuffle=True)
kfold_indices = {}

for i, (train_index, test_index) in enumerate(kf.split(X)):
  kfold_indices[f"fold_{i}"] = {'train': train_index, 'test': test_index}
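
A quick sanity check on the folds can be sketched as follows. It rebuilds the same `KFold` split under new, hypothetical names (`X_chk`, `y_chk`, `kf_chk`) so that it does not touch the variables above, then confirms that train and test indices never overlap and that every sample is tested exactly once.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

X_chk, y_chk = load_breast_cancer(return_X_y=True)
kf_chk = KFold(n_splits=5, shuffle=True, random_state=0)

test_parts = []
for fold, (train_idx, test_idx) in enumerate(kf_chk.split(X_chk)):
    # Within a fold, the train and test index sets must be disjoint.
    assert np.intersect1d(train_idx, test_idx).size == 0
    test_parts.append(test_idx)
    class1_fraction = y_chk[test_idx].mean()
    print(f"fold_{fold}: n_test = {len(test_idx)}, "
          f"fraction of class 1 = {class1_fraction:.3f}")

# Across folds, the test sets should partition the whole dataset.
all_test = np.sort(np.concatenate(test_parts))
covers_all = np.array_equal(all_test, np.arange(len(X_chk)))
print("each sample is tested exactly once:", covers_all)
```

Plain `KFold` does not balance classes across folds. If the printed class fractions differ noticeably, `StratifiedKFold` would be one alternative to consider, although it would change the splits and therefore the reported numbers.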

In [1]:
# Parameter grids for each classifier
param_grid_logreg = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear', 'saga'],
    'classifier__max_iter': [10000]  # Higher iteration counts for better convergence
}

param_grid_svm = {
    'classifier__C': [0.1, 1, 10, 100, 1000, 5000, 10000],  # Inverse regularization strength (larger C = weaker regularization)
    'classifier__kernel': ['linear', 'rbf', 'poly'],  # Different kernels to test
    'classifier__gamma': ['scale', 'auto', 0.01, 0.1, 0.5, 1],  # Kernel coefficient, used by 'rbf' and 'poly', ignored by 'linear'
    'classifier__degree': [2, 3, 4, 5, 6],  # Polynomial degree, only used by the 'poly' kernel
    'classifier__class_weight': [None, 'balanced'],  # Compare unweighted and balanced class weights
    'classifier__coef0': [0.0, 0.1, 0.3, 0.5, 0.8, 1.0],  # Independent term, only used by 'poly' here ('sigmoid' is not in the grid)
}

param_grid_tree = {
    'classifier__max_depth': [None, 5, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__criterion': ['gini', 'entropy']
}
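
Because `degree` and `coef0` only affect the `poly` kernel and `gamma` is ignored by the `linear` kernel, the single-dictionary SVM grid fits many settings that behave identically. The sketch below counts the candidates with `ParameterGrid` and compares them with a conditional list-of-dictionaries grid. The `svm_grid_full` and `svm_grid_conditional` names are hypothetical, and the conditional grid is one possible restructuring rather than the grid used in this experiment.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100, 1000, 5000, 10000]
gammas = ['scale', 'auto', 0.01, 0.1, 0.5, 1]
degrees = [2, 3, 4, 5, 6]
coef0s = [0.0, 0.1, 0.3, 0.5, 0.8, 1.0]
weights = [None, 'balanced']

# Same values as param_grid_svm: every combination is tried for every kernel.
svm_grid_full = {
    'classifier__C': C_values,
    'classifier__kernel': ['linear', 'rbf', 'poly'],
    'classifier__gamma': gammas,
    'classifier__degree': degrees,
    'classifier__class_weight': weights,
    'classifier__coef0': coef0s,
}

# Assumed alternative: vary only the parameters each kernel actually uses.
svm_grid_conditional = [
    {'classifier__kernel': ['linear'], 'classifier__C': C_values,
     'classifier__class_weight': weights},
    {'classifier__kernel': ['rbf'], 'classifier__C': C_values,
     'classifier__gamma': gammas, 'classifier__class_weight': weights},
    {'classifier__kernel': ['poly'], 'classifier__C': C_values,
     'classifier__gamma': gammas, 'classifier__degree': degrees,
     'classifier__coef0': coef0s, 'classifier__class_weight': weights},
]

n_full = len(ParameterGrid(svm_grid_full))
n_conditional = len(ParameterGrid(svm_grid_conditional))
print(f"full grid: {n_full} candidates")                # 7560
print(f"conditional grid: {n_conditional} candidates")  # 2618
```

With `cv=5` inside each of the 5 outer folds, every candidate is fit 25 times, so trimming redundant candidates shortens the run considerably. The set of distinct models searched stays the same, though tie-breaking among equal scores could pick a different setting.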

In [2]:
Error_rate = {'logreg': [], 'svm': [], 'decision_tree': []}

In [10]:
# Train models and apply them to the test set

for fold_id in range(num_folds):
  X_train = X[kfold_indices[f"fold_{fold_id}"]['train']]
  Y_train = Y[kfold_indices[f"fold_{fold_id}"]['train']]
  X_test = X[kfold_indices[f"fold_{fold_id}"]['test']]
  Y_test = Y[kfold_indices[f"fold_{fold_id}"]['test']]

  # Logistic regression
  ######################## TODO #####################################
  # Standardize features, then fit logistic regression. C, penalty, solver,
  # and max_iter are all set by param_grid_logreg during the grid search.
  pipe_logreg = Pipeline([
      ('scaler', StandardScaler()),  # Step 1: Standardize features
      ('classifier', LogisticRegression(random_state=100))  # Step 2: Logistic Regression
  ])
  # Tune hyperparameters with an inner 5-fold CV on the training data only
  clf_logreg = GridSearchCV(pipe_logreg, param_grid_logreg, cv=5)
  clf_logreg.fit(X_train, Y_train)

  # Make predictions and compute error rate
  Y_pred = clf_logreg.predict(X_test)
  error = zero_one_loss(Y_test, Y_pred)

  Error_rate['logreg'].append(error)
  #####################################################################

In [8]:
for fold_id in range(num_folds):
  # Prepare train and test data for this fold
  X_train = X[kfold_indices[f"fold_{fold_id}"]['train']]
  Y_train = Y[kfold_indices[f"fold_{fold_id}"]['train']]
  X_test = X[kfold_indices[f"fold_{fold_id}"]['test']]
  Y_test = Y[kfold_indices[f"fold_{fold_id}"]['test']]

  # SVM
  #####################################################################
  # Create the pipeline for SVM with StandardScaler
  pipe_svm = Pipeline([
      ('scaler', StandardScaler()),  # Standardize features
      ('classifier', SVC(random_state=0))  # SVM classifier
  ])

  # Use GridSearchCV for hyperparameter tuning
  clf_svm = GridSearchCV(pipe_svm, param_grid_svm, cv=5, scoring='accuracy')

  # Run the grid search on the training data; the best setting is then refit on all of X_train
  clf_svm.fit(X_train, Y_train)

  # Make predictions and compute the error rate
  Y_pred = clf_svm.predict(X_test)
  error = zero_one_loss(Y_test, Y_pred)

  # Store the error rate for this fold
  Error_rate['svm'].append(error)
  #####################################################################

In [11]:
for fold_id in range(num_folds):
  X_train = X[kfold_indices[f"fold_{fold_id}"]['train']]
  Y_train = Y[kfold_indices[f"fold_{fold_id}"]['train']]
  X_test = X[kfold_indices[f"fold_{fold_id}"]['test']]
  Y_test = Y[kfold_indices[f"fold_{fold_id}"]['test']]

  # Decision tree
  ######################## TODO #####################################
  pipe_tree = Pipeline([
    ('classifier', DecisionTreeClassifier(random_state=0))  # Step 1: Decision Tree
  ])
  clf_tree = GridSearchCV(pipe_tree, param_grid_tree, cv=5)
  clf_tree.fit(X_train, Y_train)
  Y_pred_tree = clf_tree.predict(X_test)
  Error_rate['decision_tree'].append(zero_one_loss(Y_test, Y_pred_tree))
  #####################################################################
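
The three loops above share the same structure: tune on the training part of each outer fold, predict on the held-out part, and record the zero-one loss. As a sketch, that pattern could be factored into one helper. `outer_cv_errors` is a hypothetical name, and the two-value grid below is an assumption chosen only to keep the example fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def outer_cv_errors(pipeline, param_grid, X, y, outer_cv, inner_folds=5):
    """Return one test error per outer fold; tuning only sees the training part."""
    errors = []
    for train_idx, test_idx in outer_cv.split(X):
        # GridSearchCV clones the pipeline, so it can be reused across folds.
        search = GridSearchCV(pipeline, param_grid, cv=inner_folds)
        search.fit(X[train_idx], y[train_idx])
        y_pred = search.predict(X[test_idx])
        errors.append(zero_one_loss(y[test_idx], y_pred))
    return errors


X_all, y_all = load_breast_cancer(return_X_y=True)
outer = KFold(n_splits=5, shuffle=True, random_state=0)

tree_pipe = Pipeline([("classifier", DecisionTreeClassifier(random_state=0))])
small_grid = {"classifier__max_depth": [3, 5]}  # illustrative, not param_grid_tree

tree_errors = outer_cv_errors(tree_pipe, small_grid, X_all, y_all, outer)
print([round(e, 4) for e in tree_errors])
```

Calling the helper once per classifier with its own pipeline and grid would reproduce the loop logic without repeating the indexing code.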

# III. Results

Here we report the mean and standard deviation of the error rates over 5 folds for each method.

In [12]:
######################## TODO #####################################
print("The error rate over 5 folds in CV:")
print(f"Logistic Regression: mean = {np.mean(Error_rate['logreg']):.4f}, std = {np.std(Error_rate['logreg']):.4f}")
print(f"SVM: mean = {np.mean(Error_rate['svm']):.4f}, std = {np.std(Error_rate['svm']):.4f}")
print(f"Decision Tree: mean = {np.mean(Error_rate['decision_tree']):.4f}, std = {np.std(Error_rate['decision_tree']):.4f}")
#####################################################################

The error rate over 5 folds in CV:
Logistic Regression: mean = 0.0211, std = 0.0131
SVM: mean = 0.0211, std = 0.0181
Decision Tree: mean = 0.0686, std = 0.0131
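
Note that `np.std` uses `ddof=0` by default, so the values above are population standard deviations of the five fold errors. If the sample standard deviation (`ddof=1`) is preferred, it can be recovered from the reported numbers by multiplying by sqrt(n / (n - 1)). The sketch below does only that conversion; the per-fold errors themselves are not reproduced here.

```python
import numpy as np

n_folds = 5
# Standard deviations as printed above (np.std with the default ddof=0).
reported_std = {"logreg": 0.0131, "svm": 0.0181, "decision_tree": 0.0131}

# std(ddof=1) = std(ddof=0) * sqrt(n / (n - 1))
factor = np.sqrt(n_folds / (n_folds - 1))
sample_std = {name: s * factor for name, s in reported_std.items()}

for name, s in sample_std.items():
    print(f"{name}: sample std (ddof=1) is about {s:.4f}")
```

With five folds the factor is about 1.118, which gives roughly 0.0146 for Logistic Regression and the Decision Tree and 0.0202 for SVM. The ranking of the methods is unchanged.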


# IV. Conclusion and Discussion
### Conclusion
Based on the cross-validation results over 5 folds, Logistic Regression and SVM tied for the lowest mean error rate (0.0211). Logistic Regression had the smaller standard deviation across folds (0.0131 versus 0.0181 for SVM), so it was slightly more consistent, although with only five folds this difference is small. The Decision Tree had the highest mean error rate (0.0686, with a standard deviation of 0.0131), suggesting it is the weakest of the three choices for this dataset.

### Discussion
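The results suggest that Logistic Regression and SVM are better suited to this dataset than a single Decision Tree. Logistic Regression's low error rate indicates that a linear decision boundary on the standardized features separates the two classes well. The same cannot be concluded for SVM without further inspection: its grid also includes the rbf and poly kernels, so the tuned SVM is not necessarily linear, and the kernel selected in each fold (available from `clf_svm.best_params_`) would need to be checked. Because the two models have the same mean error and the fold-to-fold spread is of similar size, these results do not separate them clearly.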
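
One way to check which kernel a tuned SVM prefers is to read `best_params_` and `cv_results_` after the search. The sketch below does this on the full dataset with a deliberately small grid. Both the grid and the `_k` variable names are assumptions made to keep the example quick, so its output does not describe the models fitted in the experiment above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_k, y_k = load_breast_cancer(return_X_y=True)

svm_pipe_k = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC(random_state=0)),
])

# Small illustrative grid: two kernels, three values of C.
kernel_grid = {
    "classifier__kernel": ["linear", "rbf"],
    "classifier__C": [0.1, 1, 10],
}

search_k = GridSearchCV(svm_pipe_k, kernel_grid, cv=5, scoring="accuracy")
search_k.fit(X_k, y_k)

print("selected kernel:", search_k.best_params_["classifier__kernel"])

# Best mean CV accuracy reached by each kernel over the values of C.
best_by_kernel = {}
for params, score in zip(search_k.cv_results_["params"],
                         search_k.cv_results_["mean_test_score"]):
    kernel = params["classifier__kernel"]
    best_by_kernel[kernel] = max(best_by_kernel.get(kernel, 0.0), score)

for kernel, score in best_by_kernel.items():
    print(f"{kernel}: best mean CV accuracy = {score:.4f}")
```

If the linear kernel is selected, or scores about as well as rbf, that would support describing the problem as close to linearly separable.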