## CS345 Fall 2024 Assignment 2

Updated 9/13/2024

## Part 0:  Data Loaders

Write data loaders for each of the four datasets listed under Preliminaries below.  Use the same API as in assignment 1 for creating the penguins dataset, which mimics the data loaders provided by scikit-learn.

For example, a function `load_qsar` should return a feature matrix X and a labels vector y for this dataset, and similarly for the other datasets.  For the `load_cancer` function, you may use the scikit-learn function `load_breast_cancer` that creates the dataset.  Since the gisette dataset has separate training and validation sets (that you will use as training / test sets), you will need to write two data loaders: one for the training set, and one for the test set.  Note that you will need to convert the labels from the values 0, 1 to $\pm 1$, since that is what our perceptron learning algorithm expects as label values.
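As a minimal sketch of the label conversion, 0/1 labels or class-name labels can be mapped to $\pm 1$ without a loop. The arrays `y01` and `names` are made-up examples for illustration, not data from any of the datasets.

```python
import numpy as np

# hypothetical 0/1 labels, for illustration only
y01 = np.array([0, 1, 1, 0, 1])

# map 0 -> -1 and 1 -> +1 with a single vectorized expression
y_pm = 2 * y01 - 1

# hypothetical string labels: pick one class name to be the positive class
names = np.array(['pos', 'neg', 'pos'])
y_from_names = np.where(names == 'pos', 1, -1)
```

Either form returns an integer vector whose entries are all 1 or -1.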

#### Missing values
Whenever a dataset has missing values, any training example that has missing features should be removed.  (An alternative to removing training examples is to *impute* missing features, e.g. by replacing missing values by the average of that feature.)
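The two options can be sketched as follows, assuming missing entries are encoded as NaN. The small matrix `X` and labels `y` are made up for illustration. Only the first option (removal) is what this assignment asks for.

```python
import numpy as np

# hypothetical feature matrix with one missing entry encoded as NaN
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0]])
y = np.array([1, -1, 1])

# option 1: drop every example that has at least one missing feature
complete = ~np.isnan(X).any(axis=1)
X_dropped, y_dropped = X[complete], y[complete]

# option 2: impute each missing value with the mean of its column,
# computed over the entries that are present
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)
```

Note that the same row mask must be applied to both `X` and `y` so that features and labels stay aligned.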

#### File structure

Store all the files you download in a sub-directory called `data` relative to the location of your notebook.  
Make sure to use the filenames specified in each data loader.
This will ensure that your code runs properly when we execute your notebook.

#### A note on the heart disease diagnosis dataset
The heart disease diagnosis dataset has several data files associated with it.  Use [this file](http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data), where categorical variables have been replaced with numerical values.  The last column in the file contains the label associated with each example.  In the processed file, a label `0` corresponds to a healthy individual; other values correspond to varying levels of heart disease.  **In your experiments focus on the binary classification problem of trying to distinguish between healthy and non-healthy individuals.**


### Preliminaries

We'll start with a review of the notation used to represent a labeled dataset. In supervised learning we work with a dataset of $N$ labeled examples, each of which is a pair $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is a $d$-dimensional vector (we always use boldface to denote vectors), and $y_i$ is the label associated with $\mathbf{x}_i$.  Keep in mind that the formulation of the perceptron algorithm that we used in class relies on the labels being $\pm 1$, so make sure that is the case for the data you use!

In this assignment we will use the following datasets:


* The [QSAR](https://archive.ics.uci.edu/dataset/254/qsar+biodegradation) data for predicting the biochemical activity of a molecule.
* The [Wisconsin breast cancer dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer). 
* The [Gisette](https://archive.ics.uci.edu/dataset/170/gisette) handwritten digit recognition dataset. For this dataset you are provided separate training/validation/test sets.  Since the test set doesn't come with labels, use the validation set for testing the classifier.
* The [heart disease diagnosis](https://archive.ics.uci.edu/dataset/45/heart+disease) dataset.
 

In [127]:
import numpy as np
from sklearn.datasets import load_breast_cancer
import pandas as pd

In [129]:
# complete the following data loaders.  
# All the data files should be in a sub-folder called "data" relative to the
# location of your notebook.
# In each dataloader we specify the expected filenames you should use
# to ensure your notebook runs correctly

def load_qsar():
    qsar_filename='data/biodeg.csv'

    # the file is semicolon-separated and has no header row
    qsar_data = pd.read_csv(qsar_filename, sep=';', header=None).values

    # the last column (index 41) holds the class: 'RB' -> 1, 'NRB' -> -1
    y_qsar = np.where(qsar_data[:,41] == 'RB', 1, -1).astype(np.int64)

    # the remaining 41 columns are the numeric features
    X_qsar = np.delete(qsar_data, 41, 1).astype(np.float64)
    
    return X_qsar, y_qsar

def load_cancer():
    # use the scikit-learn data loader
    data = load_breast_cancer()
    X = data.data
    y = data.target
    y = y * 2 - 1
    X = np.hstack([X, np.ones((len(X), 1))])
    return X, y

def load_gisette_train():
    features_filename='data/gisette_train.data'
    labels_filename='data/gisette_train.labels'

    # whitespace-separated integers, one example per line, no header row
    X_gistrain = np.loadtxt(features_filename, dtype=np.int64)

    # one label per line; loadtxt returns a 1-D vector
    y_gistrain = np.loadtxt(labels_filename, dtype=np.int64)
    
    return X_gistrain, y_gistrain

def load_gisette_test():
    features_filename='data/gisette_valid.data'
    labels_filename='data/gisette_valid.labels'    

    # whitespace-separated integers, one example per line, no header row
    X_gistest = np.loadtxt(features_filename, dtype=np.int64)

    # one label per line; loadtxt returns a 1-D vector
    y_gistest = np.loadtxt(labels_filename, dtype=np.int64)
    
    return X_gistest, y_gistest

def load_heart():
    heart_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
    heart_data = pd.read_csv(heart_url, header=None, na_values='?', dtype=np.float64).values
    
    y_heart = heart_data[:,13]
    X_heart = np.delete(heart_data, 13, 1)

    # drop every example that has at least one missing feature
    complete = ~np.isnan(X_heart).any(axis=1)
    X_heart = X_heart[complete]
    y_heart = y_heart[complete]

    # label 0 is healthy (-1); any other value indicates heart disease (+1)
    y_heart = np.where(y_heart == 0, -1, 1)
    
    return X_heart, y_heart


In [139]:
X, y = load_cancer()

print(X.shape)
print(y.shape)
print(X.dtype)

(569, 31)
(569,)
float64


You can use the following function to check that your function returns arrays of the appropriate shapes (you will need to determine how many features / examples each dataset contains).  The only case where this is somewhat of a challenge is the heart dataset, which contains some missing values.

In [439]:
def data_is_valid(X, y, examples=0, features=0):
    return (
        X.shape == (examples, features)
        and y.shape == (examples,)
        and not np.any(np.isnan(X))
        and np.all((y==1) | (y==-1))
    )

# for example:
heart_X, heart_y = load_heart()
print("validity for heart dataset: ", data_is_valid(heart_X, heart_y, examples=297, features=13))


validity for heart dataset:  True




## Part 1:  Evaluating the Perceptron on Real World Datasets

In this part of the assignment you will work with the perceptron algorithm and run it on two real-world datasets.  For comparison, you will also evaluate an [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) classifier on the same datasets.  We will cover SVMs in detail later in the course, and here you will simply use it with its default settings.

- Compare the performance of the  perceptron using the implementation we used in class with the SVM classifier on the QSAR and breast cancer diagnosis datasets. Do so by estimating the accuracy on a sample of the data that you reserve for testing (the test set).  In each case reserve  70% of the data for training, and 30% for testing.  To gain more confidence in your error estimates, repeat this experiment using 10 random splits of the data into training/test sets for each algorithm.  Use the same train-test splits for each algorithm.  Report the average accuracy and its standard deviation in a nicely formatted table.  Is there a classifier among the two that appears to perform better?  In answering this, consider the differences in performance you observe in comparison to the standard deviation.  Make sure to let the perceptron algorithm run for a sufficient number of epochs.  In implementing this task, you may use a for loop to iterate over the 10 random splits.
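One possible way to guarantee that both classifiers see the same train-test splits is to generate the index arrays once and reuse them. The sketch below is an illustration under assumptions: the helper name `random_splits`, its parameters, and the example size of 100 are hypothetical and not part of the assignment.

```python
import numpy as np

def random_splits(n_examples, n_splits=10, test_fraction=0.3, seed=0):
    """Yield (train_idx, test_idx) index arrays for repeated random splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * n_examples))
    for _ in range(n_splits):
        # a fresh random permutation of all example indices for each split
        perm = rng.permutation(n_examples)
        yield perm[n_test:], perm[:n_test]

# materialize the splits once so every classifier is trained and tested
# on exactly the same partitions (100 examples is a made-up size)
splits = list(random_splits(n_examples=100))
```

Each classifier can then loop over `splits`, fit on `X[train_idx], y[train_idx]`, and record its accuracy on `X[test_idx], y[test_idx]`; `np.mean` and `np.std` of the ten recorded accuracies give the two numbers for the table.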

A note about the classifier API:  in this course we follow the scikit-learn classifier API, which requires that a classifier have the following methods (in addition to a constructor):

* `fit(X, y)`:  trains a classifier using a feature matrix `X` and a labels vector `y`.
* `predict(X)`:  given a feature matrix `X`, return a vector of labels for each feature vector represented by `X`.

For those interested in more information about the scikit-learn API, here's a [link](https://scikit-learn.org/stable/developers/develop.html).
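To make the API concrete, here is a toy classifier that follows the `fit` / `predict` convention. It is only an illustration (the class `MajorityClassifier` and the toy arrays are made up here) and is not the perceptron used in class.

```python
import numpy as np

class MajorityClassifier:
    """Toy classifier that always predicts the most common training label."""

    def __init__(self):
        self.label_ = None

    def fit(self, X, y):
        # remember the label that occurs most often in the training data
        labels, counts = np.unique(y, return_counts=True)
        self.label_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        # one predicted label per row of X
        return np.full(len(X), self.label_)

X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([1, 1, -1])
clf = MajorityClassifier().fit(X_toy, y_toy)
accuracy = np.mean(clf.predict(X_toy) == y_toy)
```

Because every classifier exposes the same two methods, the evaluation code does not need to know which classifier it is running.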


### A note about displaying your results

We recommend displaying the results of your experiments in the form of an automatically-generated table.  pandas DataFrame objects render nicely in Jupyter notebooks, and are an easy way to achieve this with minimal work.  Here's an example that you can use as a template:

In [4]:
import pandas as pd
data = [
    ['Perceptron (cancer)', 0, 0],
    ['SVM (cancer)', 0, 0],
    ['Perceptron (biodeg)', 0, 0],
    ['SVM (biodeg)', 0, 0],
]
pd.DataFrame(data, columns = ['Classifier', 'Mean', 'StdDev'])

|   | Classifier | Mean | StdDev |
|---|---|---|---|
| 0 | Perceptron (cancer) | 0 | 0 |
| 1 | SVM (cancer) | 0 | 0 |
| 2 | Perceptron (biodeg) | 0 | 0 |
| 3 | SVM (biodeg) | 0 | 0 |


In [5]:
# your code here

# use an SVM classifier with default settings to compare to the
# perceptron:
from sklearn.svm import SVC
svm = SVC()


*discussion of your results*

## Part 2:  Learning Curves 

Whenever we train a classifier it is useful to know if we have collected a sufficient amount of data for accurate classification.  A good way of determining that is to construct a **learning curve**, which is a plot of classifier performance as a function of the number of training examples.  Plot a learning curve for the perceptron algorithm using the [Gisette](http://archive.ics.uci.edu/dataset/170/gisette) handwritten digit recognition dataset.  For this dataset use the separately provided validation set for testing your classifiers, since the provided test set comes without labels and is therefore not usable here.
The x-axis for the plot (number of training examples) should be on a logarithmic scale - something like 10,20,40,80,200,400,800 etc.  In your submission use numbers that are appropriate for the dataset at hand.  Since the x-axis is on a logarithmic scale, plot the learning curve using [plt.semilogx](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.semilogx.html).

What can you conclude from the learning curve you have constructed for this particular dataset?

In answering this question, you can use the following [wikipedia article](https://en.wikipedia.org/wiki/Learning_curve#In_machine_learning).
Make sure that you use a fixed test set to evaluate performance while varying the size of the training set.  Use the Gisette validation set for that purpose.
Also, do not use the scikit-learn function for computing the validation curve, or any other scikit-learn functions for this task.
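A logarithmically spaced sequence of training-set sizes can be generated by repeated doubling. This is a sketch only: the value of `n_train` below is a placeholder, and you should replace it with the actual number of training examples in the dataset.

```python
import numpy as np

n_train = 1000  # placeholder: total number of available training examples

# roughly doubling sizes: 10, 20, 40, ... and finally the full training set
sizes = [10]
while sizes[-1] * 2 < n_train:
    sizes.append(sizes[-1] * 2)
sizes.append(n_train)

sizes = np.array(sizes)
```

For each entry of `sizes`, one would train on that many training examples, measure accuracy on the fixed validation set, and pass the two resulting sequences to `plt.semilogx`.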

In [6]:
# your code here

*discussion of your results*

## Part 3:  Data standardization 

In this section we will explore the effect of normalizing the data, focusing on normalization of each feature individually.  In class we saw how to convert each column (i.e. feature) of a data matrix so that it falls in the range $[-1,1]$.  In this assignment we will explore a different approach called **standardization**.

Here's what you need to do:

* Write a method to standardize a data matrix, so that each column has zero mean and standard deviation equal to 1.  This is done by subtracting the mean of each column, and dividing by its standard deviation.  See details [here](https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)).  Scikit-learn has a method called [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) which does this.  Do not use it!  To demonstrate that your method works correctly, show that after standardization, your feature matrix has a zero mean and standard deviation equal to 1 for each column.  Make sure not to use for loops!

* Compare the accuracy of the standard perceptron on the heart dataset with standardization and without it (make sure to evaluate the accuracy on a held-out test set).  As we did earlier, report the accuracy as the average over ten train-test splits.  Which leads to better performance?  Can you explain why?
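In symbols, standardization can be written as follows. The notation is introduced here for illustration: $x_{ij}$ is feature $j$ of example $i$, and $\mu_j$ and $\sigma_j$ are the mean and standard deviation of column $j$ over the $N$ examples. The $1/N$ convention is assumed for the standard deviation; the $1/(N-1)$ convention gives slightly different values.

$$
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - \mu_j\right)^2}, \qquad
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$

By construction, each standardized column $z_{\cdot j}$ has mean 0 and standard deviation 1, provided $\sigma_j \neq 0$.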


In [7]:
# your code

*discussion and explanation*

## Part 4:  Use of AI and other web resources

In the cell below indicate in detail how you used AI and other web resources for this assignment.  If you used AI tools, indicate how useful they were.

*Your answer here*

### Your Report

Answer the questions in the cells reserved for that purpose.


### Submission

Submit your report as a Jupyter notebook via Canvas.  Running the notebook should generate all the results and plots in your notebook.

### Grading 


```
Grading sheet for assignment 2
Part 0:  20 points
Part 1:  40 points
Part 2:  20 points
Part 3:  20 points
```

