# Breast Cancer Classification using K-Nearest Neighbors

###### Project from Codecademy Unsupervised Machine Learning Course

---

## Imports

In [97]:
from IPython.display import display  # public import path; IPython.core.display is deprecated for this

import numpy as np
import pandas as pd

from pandas_profiling import ProfileReport

import sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

print('numpy version   :', np.__version__)
print('pandas version  :', pd.__version__)
print('sklearn version :', sklearn.__version__)

numpy version   : 1.20.3
pandas version  : 1.3.2
sklearn version : 0.24.2


---

## Data Inspection

Let's begin by loading the dataset from sklearn.

In [98]:
breast_cancer_data = load_breast_cancer()

Next, let's inspect the data, so we know what we're working with.

In [99]:
# show all data in dataset
print(breast_cancer_data.data)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]


In [100]:
# show the first datapoint in the dataset
print(breast_cancer_data.data[0])

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]


That's a lot of numbers without any context. Let's take a look at the feature names, so we can see what each of them represents.

In [101]:
print(breast_cancer_data.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
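The names above follow a pattern: the same ten measurements appear three times, as a mean, an error and a "worst" value. Here's a small sketch that groups them to make that structure explicit. The `groups` dictionary is just a helper for this illustration, and the reading of "error" as standard error and "worst" as the largest values comes from the dataset description (`breast_cancer_data.DESCR`), not from the names themselves.

```python
from collections import defaultdict

# Feature names copied from the output above.
feature_names = [
    'mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension',
    'radius error', 'texture error', 'perimeter error', 'area error',
    'smoothness error', 'compactness error', 'concavity error',
    'concave points error', 'symmetry error', 'fractal dimension error',
    'worst radius', 'worst texture', 'worst perimeter', 'worst area',
    'worst smoothness', 'worst compactness', 'worst concavity',
    'worst concave points', 'worst symmetry', 'worst fractal dimension',
]

# Group each name by its statistic, keeping only the base measurement.
groups = defaultdict(list)
for name in feature_names:
    if name.startswith('mean '):
        groups['mean'].append(name[len('mean '):])
    elif name.startswith('worst '):
        groups['worst'].append(name[len('worst '):])
    elif name.endswith(' error'):
        groups['error'].append(name[:-len(' error')])

for stat, measurements in groups.items():
    print(f'{stat:<5} : {len(measurements)} features')
```

So the 30 columns are really 10 measurements described in 3 different ways, which suggests many of them will be strongly correlated.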


Now that we have a better understanding of the features within this dataset, let's take a look at the target variable.

In [102]:
# show target values
print(breast_cancer_data.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 

In [103]:
# show target names
print(breast_cancer_data.target_names)

['malignant' 'benign']


We can see that our target variable for this dataset is a binary variable, either 0 or 1. A value of 0 represents a
malignant tumor, and a value of 1 represents a benign tumor.
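As a quick illustration of that mapping, the integer targets can be used directly as indices into the array of names. The `sample_targets` array below is made up for the example; only `target_names` is copied from the output above.

```python
import numpy as np

# Copied from the target_names output above.
target_names = np.array(['malignant', 'benign'])

# Hypothetical handful of target values, for illustration only.
sample_targets = np.array([0, 1, 1, 0])

# Integer array indexing: each label selects the name at that position.
labels = target_names[sample_targets]
print(labels)
```

With the real data, the same idea would be `breast_cancer_data.target_names[breast_cancer_data.target]`.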

Just from a visual scan, the two target classes look reasonably balanced. Let's confirm this is the case before we
move on.

In [104]:
# count the samples in each class with vectorised comparisons
num_malignant = int(np.sum(breast_cancer_data.target == 0))
num_benign = int(np.sum(breast_cancer_data.target == 1))
num_samples = len(breast_cancer_data.target)

In [105]:
print(f'Percentage malignant : {round((num_malignant/num_samples)*100, 2)}%')
print(f'Percentage benign    : {round((num_benign/num_samples)*100, 2)}%')

Percentage malignant : 37.26%
Percentage benign    : 62.74%


We have a mild imbalance, roughly 1 : 1.7, with malignant tumors as the minority class. This shouldn't need any
special treatment, and we can proceed without having to worry about resampling or class weights.
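For reference, here's a more compact way to get the same figures using `np.bincount`. To keep the sketch self-contained, `target` is a stand-in built from the class counts implied by the percentages above (212 malignant and 357 benign out of 569 samples) rather than the real `breast_cancer_data.target` array.

```python
import numpy as np

# Stand-in for breast_cancer_data.target: 212 zeros followed by 357 ones.
# These counts are an assumption derived from the printed percentages.
target = np.repeat([0, 1], [212, 357])

counts = np.bincount(target)               # counts[i] = samples with label i
percentages = 100 * counts / counts.sum()  # share of each class, in percent
ratio = counts[1] / counts[0]              # benign samples per malignant sample

print('counts      :', counts)
print('percentages :', np.round(percentages, 2))
print('ratio       :', round(float(ratio), 2))
```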

Next, I'm going to convert the data from sklearn's dictionary-like ```Bunch``` object to a pandas DataFrame, and do a
little exploratory data analysis. This will help us get a better idea of the features.

In [106]:
col_names = [col_name.replace(' ', '_') for col_name in list(breast_cancer_data.feature_names)] + ['target']

# join the feature and target arrays column-wise with np.c_, then wrap the result in a pandas DataFrame
df = pd.DataFrame(data = np.c_[breast_cancer_data.data, breast_cancer_data.target],
                  columns = col_names)
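If `np.c_` is unfamiliar, here's a toy sketch of what it does in the cell above. The small `features` and `target` arrays are stand-ins for `breast_cancer_data.data` and `breast_cancer_data.target`.

```python
import numpy as np

# Toy stand-ins for the real feature matrix and target vector.
features = np.array([[17.99, 10.38],
                     [20.57, 17.77],
                     [19.69, 21.25]])  # floats, shape (3, 2)
target = np.array([0, 0, 1])           # integer labels, shape (3,)

# np.c_ treats the 1-D target as a column and appends it on the right.
combined = np.c_[features, target]

print(combined.shape)
print(combined.dtype)
print(combined)
```

Because a NumPy array holds a single dtype, the integer labels are upcast to floats when they are joined to the features. That is why the `target` column shows `0.0` rather than `0` in the table below. If integer labels are preferred, one option is `df['target'] = df['target'].astype(int)`.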

In [107]:
display(df.head())

| | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | ... | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0.0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0.0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0.0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0.0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0.0 |


### Data Scaling

The first thing that jumps out is the scale of the feature values. For example, ```mean_area``` has values that are
orders of magnitude larger than those in columns such as ```mean_smoothness```.

This is a particular problem for k-nearest neighbors classification, as it works using the distance between data points,
based on all features. Thus, if one feature has a much larger scale than the others, it will dominate the distance calculations.
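To make this concrete, here's a small sketch using just two features from the first two rows of the unscaled table above. It assumes Euclidean distance, which is a common default for k-nearest neighbors but is an assumption as far as this notebook goes.

```python
import math

# (mean_area, mean_smoothness) for rows 0 and 1 of the unscaled table above.
row_0 = (1001.0, 0.1184)
row_1 = (1326.0, 0.08474)

area_gap = abs(row_0[0] - row_1[0])
smoothness_gap = abs(row_0[1] - row_1[1])
distance = math.dist(row_0, row_1)  # Euclidean distance (Python 3.8+)

print(f'area gap       : {area_gap:.5f}')
print(f'smoothness gap : {smoothness_gap:.5f}')
print(f'distance       : {distance:.5f}')
```

The distance is practically identical to the area gap on its own, so on the raw data the smoothness values would have almost no say in which neighbors are considered "nearest".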

Scaling is also a good habit for many other models, particularly those trained with gradient descent, where features on
very different scales can slow down or destabilise the optimisation of the loss function.

We can either choose to scale all the feature data to lie within some fixed range (e.g. between 0 and 1), or we can
standardise the data so that each feature has zero mean and unit variance.
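Here's a minimal NumPy sketch of the two options on a handful of values. The second option is taken here to mean standardising to zero mean and unit variance, which is an assumption about what is intended. The numbers are the `mean_area` values from the five rows of the unscaled table above, so the results won't match the full-column scaling, where the minimum and maximum come from all rows.

```python
import numpy as np

# mean_area for the first five rows of the unscaled table above.
mean_area = np.array([1001.0, 1326.0, 1203.0, 386.1, 1297.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
min_max = (mean_area - mean_area.min()) / (mean_area.max() - mean_area.min())

# Standardisation: subtract the mean, then divide by the standard deviation.
standardised = (mean_area - mean_area.mean()) / mean_area.std()

print('min-max      :', np.round(min_max, 3))
print('standardised :', np.round(standardised, 3))
```

Min-max scaling guarantees a fixed range but is sensitive to outliers, since a single extreme value sets the range for everything else. Standardised values are not bounded, but extreme values distort them less.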

For now, let's simply scale the data to be between 0 and 1.

In [108]:
scaler = MinMaxScaler()
df[df.columns[:-1]] = scaler.fit_transform(df[df.columns[:-1]])  # defaults to 0-1 scaling
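One thing to keep in mind: the cell above learns the minimum and maximum of each column from every row. If the data is later split into training and test sets (an assumption, as no split appears in this section), a common precaution is to learn the scaling statistics from the training rows only and reuse them on the test rows. Here's a NumPy-only sketch of that idea, with a made-up feature matrix `X` and a made-up 7/3 split.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(10, 3))  # hypothetical feature matrix
X_train, X_test = X[:7], X[7:]         # hypothetical train/test split

# "Fit": learn each column's minimum and range from the training rows only.
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

# "Transform": apply the same training statistics to both splits.
X_train_scaled = (X_train - col_min) / col_range
X_test_scaled = (X_test - col_min) / col_range

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

With `MinMaxScaler`, this corresponds to calling `fit` (or `fit_transform`) on the training data and `transform` on the test data. Test values can then fall slightly outside the 0 to 1 range, which is expected.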

In [109]:
display(df.head())

| | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | ... | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.521037 | 0.022658 | 0.545989 | 0.363733 | 0.593753 | 0.792037 | 0.70314 | 0.731113 | 0.686364 | 0.605518 | ... | 0.141525 | 0.66831 | 0.450698 | 0.601136 | 0.619292 | 0.56861 | 0.912027 | 0.598462 | 0.418864 | 0.0 |
| 1 | 0.643144 | 0.272574 | 0.615783 | 0.501591 | 0.28988 | 0.181768 | 0.203608 | 0.348757 | 0.379798 | 0.141323 | ... | 0.303571 | 0.539818 | 0.435214 | 0.347553 | 0.154563 | 0.192971 | 0.639175 | 0.23359 | 0.222878 | 0.0 |
| 2 | 0.601496 | 0.39026 | 0.595743 | 0.449417 | 0.514309 | 0.431017 | 0.462512 | 0.635686 | 0.509596 | 0.211247 | ... | 0.360075 | 0.508442 | 0.374508 | 0.48359 | 0.385375 | 0.359744 | 0.835052 | 0.403706 | 0.213433 | 0.0 |
| 3 | 0.21009 | 0.360839 | 0.233501 | 0.102906 | 0.811321 | 0.811361 | 0.565604 | 0.522863 | 0.776263 | 1.0 | ... | 0.385928 | 0.241347 | 0.094008 | 0.915472 | 0.814012 | 0.548642 | 0.88488 | 1.0 | 0.773711 | 0.0 |
| 4 | 0.629893 | 0.156578 | 0.630986 | 0.48929 | 0.430351 | 0.347893 | 0.463918 | 0.51839 | 0.378283 | 0.186816 | ... | 0.123934 | 0.506948 | 0.341575 | 0.437364 | 0.172415 | 0.319489 | 0.558419 | 0.1575 | 0.142595 | 0.0 |


In [110]:
display(df.describe().T)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| mean_radius | 569.0 | 0.338222 | 0.166787 | 0.0 | 0.223342 | 0.302381 | 0.416442 | 1.0 |
| mean_texture | 569.0 | 0.323965 | 0.145453 | 0.0 | 0.218465 | 0.308759 | 0.40886 | 1.0 |
| mean_perimeter | 569.0 | 0.332935 | 0.167915 | 0.0 | 0.216847 | 0.293345 | 0.416765 | 1.0 |
| mean_area | 569.0 | 0.21692 | 0.149274 | 0.0 | 0.117413 | 0.172895 | 0.271135 | 1.0 |
| mean_smoothness | 569.0 | 0.394785 | 0.126967 | 0.0 | 0.304595 | 0.390358 | 0.47549 | 1.0 |
| mean_compactness | 569.0 | 0.260601 | 0.161992 | 0.0 | 0.139685 | 0.224679 | 0.340531 | 1.0 |
| mean_concavity | 569.0 | 0.208058 | 0.186785 | 0.0 | 0.06926 | 0.144189 | 0.306232 | 1.0 |
| mean_concave_points | 569.0 | 0.243137 | 0.192857 | 0.0 | 0.100944 | 0.166501 | 0.367793 | 1.0 |
| mean_symmetry | 569.0 | 0.379605 | 0.138456 | 0.0 | 0.282323 | 0.369697 | 0.45303 | 1.0 |
| mean_fractal_dimension | 569.0 | 0.270379 | 0.148702 | 0.0 | 0.163016 | 0.243892 | 0.340354 | 1.0 |


In [111]:
print(df.dtypes)

mean_radius                float64
mean_texture               float64
mean_perimeter             float64
mean_area                  float64
mean_smoothness            float64
mean_compactness           float64
mean_concavity             float64
mean_concave_points        float64
mean_symmetry              float64
mean_fractal_dimension     float64
radius_error               float64
texture_error              float64
perimeter_error            float64
area_error                 float64
smoothness_error           float64
compactness_error          float64
concavity_error            float64
concave_points_error       float64
symmetry_error             float64
fractal_dimension_error    float64
worst_radius               float64
worst_texture              float64
worst_perimeter            float64
worst_area                 float64
worst_smoothness           float64
worst_compactness          float64
worst_concavity            float64
worst_concave_points       float64
worst_symmetry      

I've found that a superb way to quickly and visually inspect the features in a dataset is to use the ```ProfileReport```
from the ```pandas_profiling``` module. Let's take a look at the data using this method.

In [112]:
profile = ProfileReport(df, title='Breast Cancer Dataset Report', dark_mode=True, progress_bar=False)

In [113]:
profile.to_widgets()


