# CSE475 HW3, Problem 2

## Introduction

We will perform experiments with different classifiers using the breast cancer detection dataset given in "data.txt" on Canvas. It has 699 labeled data points with 11 entries for each point. The first 10 entries are ID, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, and Mitoses. The last entry represents the type of tumor (2 for benign, 4 for malignant). We will use 5-Fold Cross Validation (5-fold CV, review lecture 7) to find the best classifier among the following 6 classifiers for this dataset:

1) Logistic Regression
2) Naive Bayes Classifier
3) Decision Tree
4) K Nearest Neighbors
5) Linear SVM
6) SVM with RBF Kernel

Note: 5-fold CV is used to find the best classifier, not the hyperparameters of each classifier (such as K for KNN). The best classifier selected by 5-fold CV is then trained on the entire training data and tested on the test data. For both 5-fold CV and the final training on the entire training data, please use the default hyperparameters provided by the scikit-learn library for each classifier, and use the built-in functions in the scikit-learn library to train and test each classifier. Please set the "max_iter" parameter of the LinearSVC function to a large number, such as 15000, to avoid warnings. By solving this problem, you will learn how to use the scikit-learn library to perform CV for classification tasks.

Please follow the following steps:

1) Download the dataset named "data.txt" from Canvas. You can get more details about the dataset from https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original) (UCI has a huge collection of datasets which can be used for various machine learning tasks).

2) Separate the data into features and class labels. The features are all the entries from the second entry to the tenth entry inclusive, i.e. (Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, and Mitoses). The last entry is the class label. Please use the first 680 points as the training data, and the remaining 19 points as the test data.

Steps 1)-2) have been done for your convenience.

3) Use 5-fold CV on the training data to find the best classifier. Please use the KFold function in the scikit-learn library for 5-fold CV. In each step of 5-fold CV, every classifier is trained on 4 folds (the training folds) and tested on the remaining fold (the test fold), yielding a classification accuracy (the percentage of data points classified correctly) on the test fold. Therefore, each classifier will be trained (and tested) 5 times. Report the average classification accuracy of each classifier over the 5 folds. Please shuffle the data for 5-fold CV, which can be done by setting the "shuffle" parameter of the KFold function. Shuffling helps ensure there is a mix of both classes in the training and test folds.

4) Report the best classifier which has the maximum average classification accuracy in step 3), along with its average classification accuracy in step 3).

5) Train the best classifier with the (scikit-learn) default hyperparameters on the entire training data, and report the classification accuracy of the best classifier on the test data. Again, if Linear SVM is the best classifier, please set the "max_iter" parameter of the LinearSVC function to a large number, such as 15000, to avoid warnings.

There is no standard format for reporting the results required in the previous steps; please choose whichever format you prefer.
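To illustrate why step 3 asks for shuffling, here is a minimal sketch on made-up labels. It assumes a dataset whose rows happen to be sorted by class, which is an assumption for illustration only and not a claim about "data.txt". The names `y_sorted`, `X_dummy`, and `classes_per_test_fold` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical labels sorted by class: 40 benign (2) followed by 60 malignant (4).
# This ordering is assumed for illustration; it is not taken from data.txt.
y_sorted = np.array([2] * 40 + [4] * 60)
X_dummy = np.zeros((len(y_sorted), 1))  # KFold only needs the number of rows


def classes_per_test_fold(shuffle):
    """Return the set of class labels present in each of the 5 test folds."""
    # random_state may only be set when shuffle=True
    kf = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    return [
        sorted(set(y_sorted[test_idx].tolist()))
        for _, test_idx in kf.split(X_dummy)
    ]


unshuffled = classes_per_test_fold(shuffle=False)
shuffled = classes_per_test_fold(shuffle=True)

print("Classes per test fold without shuffling:", unshuffled)
print("Classes per test fold with shuffling:   ", shuffled)
```

Without shuffling, KFold cuts the rows into contiguous blocks, so with sorted labels every test fold in this sketch contains a single class. With shuffling, each fold is drawn from randomly permuted indices and will typically contain both classes, although KFold does not guarantee any particular class ratio.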

In [24]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

random_seed = 0
np.random.seed(random_seed)

# Read data
data = np.genfromtxt("data.txt", delimiter=",")


# Separate features
features = data[:, 1:-1].astype(np.float32)

# Separate labels
labels = data[:, -1]

train_features = features[:680,:]
train_labels = labels[:680]

test_features = features[680:,:]
test_labels = labels[680:]

In [25]:

models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": LinearSVC(max_iter=15000),
    "SVM with RBF Kernel": SVC(kernel='rbf')
}

In [26]:
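# 5 shuffled folds; the fixed random_state makes the split reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=random_seed)
cv_results = {}

print("5-Fold Mean CV Accuracies: \n")

for name, model in models.items():
    # cross_val_score fits a fresh copy of the model on 4 folds and scores it on the 5th, for each fold
    scores = cross_val_score(model, train_features, train_labels, cv=kf, scoring='accuracy')
    # Store the accuracy averaged over the 5 test folds
    cv_results[name] = scores.mean()
    print(f"{name} = {scores.mean():.4f}")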

5-Fold Mean CV Accuracies: 

Logistic Regression = 0.9676
Naive Bayes = 0.9618
Decision Tree = 0.9353
K-Nearest Neighbors = 0.9676
Linear SVM = 0.9662
SVM with RBF Kernel = 0.9721
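
The `cross_val_score` call above hides the per-fold loop that step 3 describes. As a rough sketch of what that loop looks like when written out by hand, the example below trains and tests on each fold explicitly. It runs on synthetic two-class data rather than "data.txt", and the names `manual_cv_accuracy`, `X`, and `y` are hypothetical, introduced only for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (NOT the breast cancer dataset):
# two well-separated Gaussian clusters labeled 2 and 4.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 3)),
    rng.normal(3.0, 1.0, size=(50, 3)),
])
y = np.array([2] * 50 + [4] * 50)


def manual_cv_accuracy(model, X, y, n_splits=5, seed=0):
    """Train on 4 folds and test on the remaining fold, once per fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_accuracies = []
    for train_idx, test_idx in kf.split(X):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], y[train_idx])
        predictions = fold_model.predict(X[test_idx])
        # Accuracy = fraction of test-fold points classified correctly
        fold_accuracies.append(float(np.mean(predictions == y[test_idx])))
    return float(np.mean(fold_accuracies)), fold_accuracies


candidates = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

for name, model in candidates.items():
    mean_acc, per_fold = manual_cv_accuracy(model, X, y)
    print(f"{name}: mean = {mean_acc:.4f}, per-fold = {per_fold}")
```

Writing the loop out makes the per-fold accuracies available, which can help when two classifiers have nearly identical means, as several do in the results above.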


In [None]:

# Pick the classifier with the highest mean CV accuracy
best_clf_name = max(cv_results, key=cv_results.get)
print(f"\nBest classifier based on CV: {best_clf_name} with Mean CV Accuracy = {cv_results[best_clf_name]:.4f} \n")

# Retrain the selected classifier on the entire training set, then evaluate on the held-out test set
best_model = models[best_clf_name]
best_model.fit(train_features, train_labels)
test_predictions = best_model.predict(test_features)
test_accuracy = accuracy_score(test_labels, test_predictions)

print(f"Test Accuracy of {best_clf_name}: {test_accuracy:.4f}")


Best classifier based on CV: SVM with RBF Kernel with Mean CV Accuracy = 0.9721 

Test Accuracy of SVM with RBF Kernel: 1.0000 

