## Assignment 1 Part 2:  A simple classification task (70 Points - Due 09/07/2022 11:59pm)

For this part of the assignment, you will be using the Breast Cancer Wisconsin (Diagnostic) Database to create a classifier that can help diagnose patients. First, read through the description of the dataset (below). Then, try out the first one or two questions, which use basic numpy to prepare the data, so you can get familiar with the various columns, etc. Then use k-NN classifiers to learn and make predictions.

- Correct answers and code receive full credit, but partial credit will be awarded if you have the right idea even if your final answers aren't quite right.<br><br>

- Submit both of your completed notebook files to the Canvas site as a single zip file named si670-hw1-youruniqname.zip.<br><br>

- Please name this notebook si670-hw1-part2-youruniqname.ipynb.<br><br>

- As a reminder, the notebook code you submit must be your own work. Feel free to discuss general approaches to the homework with classmates. If this turns into a team discussion covering multiple questions, please include the names of the people you worked with at the top of your notebook file.<br><br>

- Any file submitted after the deadline will be marked as late. Please consult the syllabus regarding late submission policies. You can submit the homework as many times as you want, but only your latest submission will be graded.

In [1]:
# import required modules and load data file
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR)  # print the data set description

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

The object returned by `load_breast_cancer()` is a scikit-learn Bunch object, which is similar to a dictionary.

In [2]:
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
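
Because a Bunch behaves like a dictionary whose keys are also exposed as attributes, the two access styles below should be interchangeable. This is a minimal sketch using only keys listed in the output above; the variable names `features_by_key` and `features_by_attr` are made up for illustration.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Key-style and attribute-style access refer to the same stored object
features_by_key = cancer['data']
features_by_attr = cancer.data
print(features_by_key is features_by_attr)

print(cancer.data.shape)     # (n_samples, n_features)
print(cancer.target.shape)   # one label per sample
print(cancer.target_names)   # class names, indexed by the target value
```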

In [3]:
cancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

### Question 1 (10 points)

Scikit-learn works with lists, numpy arrays, scipy-sparse matrices, and pandas DataFrames, so converting the dataset to a DataFrame is not necessary for training this model. A DataFrame does, however, make many tasks easier, such as manipulating data, so let's practice creating a classifier with a pandas DataFrame.



Convert the scikit-learn dataset `cancer` to a DataFrame.

*This function should return a `(569, 31)` DataFrame with*

*columns =*

    ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension',
    'radius error', 'texture error', 'perimeter error', 'area error',
    'smoothness error', 'compactness error', 'concavity error',
    'concave points error', 'symmetry error', 'fractal dimension error',
    'worst radius', 'worst texture', 'worst perimeter', 'worst area',
    'worst smoothness', 'worst compactness', 'worst concavity',
    'worst concave points', 'worst symmetry', 'worst fractal dimension',
    'target']

*and index =*

    RangeIndex(start=0, stop=569, step=1)

In [14]:
cancer.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [18]:
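def answer_one():
    # Pass the feature names directly: wrapping them in a list (columns=[col])
    # would create a MultiIndex instead of flat column labels
    cancer_data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
    # Append the class labels as the 31st column
    cancer_data['target'] = cancer.target
    return cancer_data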


answer_one().shape

(569, 31)
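
One detail worth checking when passing column labels to `pd.DataFrame`: wrapping an array of names in an extra list (for example `columns=[names]`) makes pandas build a `MultiIndex` rather than a flat index. The sketch below illustrates the difference on toy data; `values`, `names`, `flat`, and `nested` are hypothetical names, not part of the assignment.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)
names = np.array(['a', 'b'])

flat = pd.DataFrame(values, columns=names)      # flat column labels
nested = pd.DataFrame(values, columns=[names])  # list of arrays -> MultiIndex

print(type(flat.columns).__name__)
print(type(nested.columns).__name__)

# With flat labels, selecting one column gives a one-dimensional Series
print(type(flat['a']).__name__)
```

Flat labels keep single-column selection one-dimensional, which is the shape scikit-learn expects for a target vector.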

### Question 2 (20 points)
Using `train_test_split`, split the dataset into training and test sets `(X_train, X_test, y_train, and y_test)`.

**Set the random number generator state to 0 using `random_state=0` to make sure your results match ours.**

*This function should return a tuple of length 4:* `(X_train, X_test, y_train, y_test)`*, where* 
* `X_train` *has shape* `(426, 30)`
* `X_test` *has shape* `(143, 30)`
* `y_train` *has shape* `(426,)`
* `y_test` *has shape* `(143,)`

In [49]:
from sklearn.model_selection import train_test_split

def answer_two():
    df = answer_one()
    X = df[cancer.feature_names]
    y = df.target.squeeze()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = answer_two()

assert X_train.shape == (426, 30), "error: shape incorrect"
assert X_test.shape == (143, 30), "error: shape incorrect"
assert y_train.shape == (426,), "error: shape incorrect"
assert y_test.shape == (143,), "error: shape incorrect"
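
The expected shapes follow from the split fraction: with a test fraction of 0.25 (which is also the default when neither `test_size` nor `train_size` is given), scikit-learn rounds the test count up, so 569 samples give 143 test rows and 426 training rows. The sketch below checks this on placeholder arrays; the zero-filled `X` and `y` are assumptions used only to reproduce the row counts.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 569
X = np.zeros((n_samples, 30))  # placeholder features; values do not matter here
y = np.zeros(n_samples)        # placeholder labels

# test_size is left unset, so the default fraction of 0.25 applies
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_test = math.ceil(0.25 * n_samples)  # test count is rounded up
print(X_train.shape, X_test.shape)
print(n_samples - n_test, n_test)
```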

### Question 3 (20 points)
Using `KNeighborsClassifier`, fit a k-nearest neighbors (k-NN) classifier with `n_neighbors = 5` on `X_train`, `y_train`. Then evaluate the classifier's accuracy using the `score` method on `X_test` and `y_test`.

*This function should return a tuple of (knn, accuracy), where*
* `knn` is a `sklearn.neighbors.KNeighborsClassifier`
* `accuracy` is a `float` number returned by the `score` function

In [58]:
from sklearn.neighbors import KNeighborsClassifier

def answer_three():
    X_train, X_test, y_train, y_test = answer_two()
    
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)
    return (knn, accuracy)


answer_three()



(KNeighborsClassifier(), 0.9370629370629371)
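
For a classifier, `score` reports mean accuracy, that is, the fraction of samples whose predicted label matches the true label. The sketch below checks this on a small synthetic dataset; the two Gaussian clusters and the names `X_toy` and `y_toy` are made up for illustration and are unrelated to the cancer data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two synthetic 2-D clusters, 20 points each
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                   rng.normal(3.0, 1.0, size=(20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)
predictions = knn.predict(X_toy)

# Three equivalent ways to compute accuracy
via_score = knn.score(X_toy, y_toy)
via_metric = accuracy_score(y_toy, predictions)
via_numpy = np.mean(predictions == y_toy)
print(via_score, via_metric, via_numpy)
```

The printed representation `KNeighborsClassifier()` in the output above shows no arguments because `n_neighbors=5` is the default value, and scikit-learn only prints parameters that differ from their defaults.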

### Question 4 (20 points)
Recall from the fruits example in lab1 that feature scales matter. In this question, examine the mean and standard deviation of `X_train`, then use `sklearn.preprocessing.StandardScaler` to standardize the features. Then train another k-NN (k=5) classifier on the standardized features and evaluate it.

*This function should return a tuple of (standardized_X_train, knn, accuracy), where*
* `standardized_X_train` is a `pandas.DataFrame` of the standardized features
* `knn` is a `sklearn.neighbors.KNeighborsClassifier`
* `accuracy` is a `float` number returned by the `score` function
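
One way to examine the feature scales is to tabulate the per-column mean and standard deviation of the training set. This is a sketch, not the required solution: it reloads the data with `as_frame=True` so that it runs on its own, and `summary` is a hypothetical name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)  # as_frame=True returns pandas objects
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# One row per feature, with its training-set mean and standard deviation
summary = X_train.agg(['mean', 'std']).T
print(summary.loc[['mean area', 'mean smoothness']])
```

Features such as `mean area` vary over a far wider numeric range than features such as `mean smoothness`, so unscaled Euclidean distances are dominated by the large-valued columns.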

In [66]:
from sklearn.preprocessing import StandardScaler


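def answer_four():
    X_train, X_test, y_train, y_test = answer_two()
    scaler = StandardScaler()
    # Fit the scaler on the training set only; keep the result as a DataFrame
    standardized_X_train = pd.DataFrame(scaler.fit_transform(X_train),
                                        columns=X_train.columns, index=X_train.index)
    # Reuse the training-set statistics to standardize the test set
    standardized_X_test = pd.DataFrame(scaler.transform(X_test),
                                       columns=X_test.columns, index=X_test.index)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(standardized_X_train, y_train)
    # Score on the standardized test set. Scoring the raw X_test against a model
    # trained on standardized features gives the 0.37 accuracy recorded in the output below.
    accuracy = knn.score(standardized_X_test, y_test)
    return (standardized_X_train, knn, accuracy)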


answer_four()



(array([[-0.65079907, -0.43057322, -0.68024847, ..., -0.36433881,
          0.32349851, -0.7578486 ],
        [-0.82835341,  0.15226547, -0.82773762, ..., -1.45036679,
          0.62563098, -1.03071387],
        [ 1.68277234,  2.18977235,  1.60009756, ...,  0.72504581,
         -0.51329768, -0.96601386],
        ...,
        [-1.33114223, -0.22172269, -1.3242844 , ..., -0.98806491,
         -0.69995543, -0.12266325],
        [-1.25110186, -0.24600763, -1.28700242, ..., -1.75887319,
         -1.56206114, -1.00989735],
        [-0.74662205,  1.14066273, -0.72203706, ..., -0.2860679 ,
         -1.24094654,  0.2126516 ]]),
 KNeighborsClassifier(),
 0.3706293706293706)
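
A k-NN model trained on standardized features expects test inputs on the same scale; scoring it on raw features is what produces an accuracy as low as the 0.37 shown above. One way to keep the two steps in sync is to wrap the scaler and classifier in a pipeline, sketched below under the assumption that the same default split with `random_state=0` is used. The names `scaled_knn`, `matched`, and `mismatched` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on X_train and reuses it for any later input
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

# Test data passes through the fitted scaler before reaching the classifier
matched = scaled_knn.score(X_test, y_test)

# For contrast: bypass the scaler and feed raw features to the fitted classifier
mismatched = scaled_knn[-1].score(X_test, y_test)

print(matched, mismatched)
```

Because the pipeline applies `transform` to whatever is passed to `score` or `predict`, the test set cannot accidentally skip the scaling step.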