<a href="https://colab.research.google.com/github/Meng-MiamiOH/MTH231/blob/main/exercise_machine_learning_and_coding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise for Machine Learning and Coding Session

For this exercise, you will be using the Breast Cancer Wisconsin (Diagnostic) Database to create classifiers that can help diagnose patients. First, read through the description of the dataset (below).

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR) # Print the data set description

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

The object returned by load_breast_cancer() is a scikit-learn Bunch object, which is similar to a dictionary.

In [2]:
cancer.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
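Because a Bunch behaves like a dictionary, its fields can be read either by key or as attributes. The snippet below is a small sketch of that equivalence; it reloads the dataset so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Key access and attribute access return the same underlying array.
same_object = cancer["data"] is cancer.data
print(same_object)

# The feature matrix has one row per instance and one column per feature.
print(cancer["data"].shape)

# Class names, in the order of their integer encoding (0, then 1).
class_names = [str(name) for name in cancer["target_names"]]
print(class_names)
```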

### 1. Convert the scikit-learn dataset `cancer['data']` to a `(569, 30)` DataFrame `X` and convert `cancer['target']` to a `(569,)` Series `y`.

In [3]:
X = pd.DataFrame(data=cancer['data'], columns=cancer['feature_names'])
y = pd.Series(cancer['target'])
print(X)
print(y)

     mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0          17.99         10.38  ...          0.4601                  0.11890
1          20.57         17.77  ...          0.2750                  0.08902
2          19.69         21.25  ...          0.3613                  0.08758
3          11.42         20.38  ...          0.6638                  0.17300
4          20.29         14.34  ...          0.2364                  0.07678
..           ...           ...  ...             ...                      ...
564        21.56         22.39  ...          0.2060                  0.07115
565        20.13         28.25  ...          0.2572                  0.06637
566        16.60         28.08  ...          0.2218                  0.07820
567        20.60         29.33  ...          0.4087                  0.12400
568         7.76         24.54  ...          0.2871                  0.07039

[569 rows x 30 columns]
0      0
1      0
2      0
3      0
4      0
      

### 2. What is the class distribution? (i.e. how many instances of `malignant` (encoded 0) and how many `benign` (encoded 1)?)


In [4]:
len(y[y == 0])

212

In [5]:
len(y[y == 1])

357
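As an alternative to two separate `len` calls, the counts for both classes can be obtained in one step. This is a minimal sketch that assumes the same encoding as above (0 for malignant, 1 for benign) and rebuilds `y` so it is self-contained.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
y = pd.Series(cancer["target"])

# Count the instances of each encoded class (0 and 1), ordered by label.
counts = y.value_counts().sort_index()

# Replace the integer labels with the readable class names.
counts.index = [str(cancer["target_names"][label]) for label in counts.index]
print(counts)
```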

### 4. Using `train_test_split`, split `X` and `y` into training and test sets `X_train`, `X_test`, `y_train`, and `y_test`.

* `X_train` *has shape* `(426, 30)`
* `X_test` *has shape* `(143, 30)`
* `y_train` *has shape* `(426,)`
* `y_test` *has shape* `(143,)`

In [6]:
# your code
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
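Note that `train_test_split` shuffles the data randomly, so the split (and every score computed from it) changes between runs unless a seed is fixed. The sketch below shows one way to make the split reproducible; the choice of `random_state=0` and of `stratify=y` (which keeps the class proportions similar in both sets) is an assumption for illustration, not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X = pd.DataFrame(data=cancer["data"], columns=cancer["feature_names"])
y = pd.Series(cancer["target"])

# The default test size is 25% of the data; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```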

### 5. Using KNeighborsClassifier, fit a k-nearest neighbors (knn) classifier with `X_train`, `y_train` and using one nearest neighbor (`n_neighbors = 1`).

In [7]:
# your code
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

### 6. Using your knn classifier, predict the class label using the mean value for each feature.


In [8]:
mean = X_train.mean().values.reshape(1, -1)
mean

array([[1.41017371e+01, 1.92302582e+01, 9.17882864e+01, 6.51732160e+02,
        9.63570657e-02, 1.04227606e-01, 8.98661261e-02, 4.90516174e-02,
        1.80954930e-01, 6.28922770e-02, 3.99731690e-01, 1.22207676e+00,
        2.82412676e+00, 3.94833991e+01, 7.03827230e-03, 2.60056502e-02,
        3.33039967e-02, 1.19317066e-02, 2.04458427e-02, 3.86995540e-03,
        1.62159484e+01, 2.56751174e+01, 1.06917136e+02, 8.73415728e+02,
        1.32423732e-01, 2.55204859e-01, 2.78357545e-01, 1.15696033e-01,
        2.90094836e-01, 8.42250235e-02]])

In [9]:
# your code
knn.predict(mean)

array([1])

### 7. Use the knn classifier you created for Question 5 to predict the class labels for the test set `X_test`.

*The result should be a numpy array with shape `(143,)` and values either `0` or `1`.*

In [10]:
# your code
knn.predict(X_test)

array([1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0])

In [11]:
print(y_test)

479    0
135    0
32     0
198    0
140    1
      ..
80     1
493    1
330    0
12     0
441    0
Length: 143, dtype: int64


### 8. Find the score (mean accuracy) of your knn classifier using `X_test` and `y_test`.

*The calculated score should be a float between 0 and 1*

In [12]:
# your code
knn.score(X_test,y_test)

0.9230769230769231
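The mean accuracy returned by `score` is simply the fraction of test instances whose predicted label matches the true label. The sketch below checks this by hand; it uses its own split with an assumed `random_state=0`, so the number it prints will not necessarily match the output above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Fraction of correct predictions, computed manually.
manual_accuracy = np.mean(knn.predict(X_test) == y_test)

# The built-in score should agree with the manual computation.
builtin_accuracy = knn.score(X_test, y_test)
print(manual_accuracy, builtin_accuracy)
```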

### 9. Train kNN classifiers with different values of k (1, 3, 5, 10) and compare their performance.

In [13]:
# your code
# Fit and score a kNN classifier for each candidate k.
for k in (1, 3, 5, 10):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))

0.9230769230769231
0.9230769230769231
0.9230769230769231
0.9370629370629371


### 10. Use a logistic regression model to create a classifier for the data.

In [15]:
# your code
from sklearn.linear_model import LogisticRegression

lc = LogisticRegression().fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
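The warning above suggests two remedies: raising `max_iter` or scaling the features. A possible sketch that applies both is shown here, standardizing the features inside a pipeline before fitting. The name `scaled_lc`, the value `max_iter=1000`, and `random_state=0` are illustrative assumptions, and the resulting score depends on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize each feature to zero mean and unit variance, then fit the model.
# The scaler is fitted on the training data only, as part of the pipeline.
scaled_lc = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lc.fit(X_train, y_train)

scaled_score = scaled_lc.score(X_test, y_test)
print(scaled_score)
```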


### 11. Evaluate the logistic regression classifier and compare it with the kNN classifier.

In [16]:
# your code
lc.score(X_test,y_test)

0.9300699300699301

### 12. Use cross-validation to calculate accuracy

In [None]:
import numpy as np
# your code
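
One possible approach, sketched here rather than given as the expected answer, is `cross_val_score` with `scoring="accuracy"`. The choice of a kNN classifier with `n_neighbors=5` and of 5 folds (`cv=5`) is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)

# Fit and evaluate the classifier on 5 different train/validation splits.
accuracy_scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")

print(accuracy_scores)         # one accuracy value per fold
print(accuracy_scores.mean())  # average accuracy across folds
```

Averaging over several folds gives a more stable estimate than the single train/test split used earlier, since it does not depend on one particular random partition.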


### 13. Use cross-validation to calculate f1

In [None]:
# your code
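
The same function can report the F1 score by passing `scoring="f1"`. The sketch below is one possible solution under assumed settings (a scaled logistic regression pipeline, `max_iter=1000`, 5 folds). For binary targets, this scorer treats the class encoded as 1 (here `benign`) as the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is refitted on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# F1 is the harmonic mean of precision and recall for the positive class.
f1_scores = cross_val_score(model, X, y, cv=5, scoring="f1")

print(f1_scores)         # one F1 value per fold
print(f1_scores.mean())  # average F1 across folds
```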