# This is the notebook for the breast cancer dataset.

In [80]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from KNN import KNN
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [81]:
breast_data = pd.read_csv("../datasets/cancer.csv")
breast_data.head(3)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,


In [82]:
#Show me only the columns where there are null values
breast_data.isna().sum()[breast_data.isna().sum() > 0]

Unnamed: 32    569
dtype: int64

In [83]:
#I am dropping the 'Unnamed: 32' column because it is filled with null values
#Also getting rid of the id because I do not see that being useful in any way
breast_data.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)
#Better
breast_data.head(3)

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758


In [84]:
#Show me all data fields that aren't a float
breast_data.dtypes[breast_data.dtypes != 'float64']
#Seems like only y is a string, which is great and we don't need to do further cleaning of the data

diagnosis    object
dtype: object

In [85]:
#Let's separate our X and y
X = breast_data.drop(["diagnosis"], axis=1)
y = breast_data['diagnosis']

#Let's do a train test split
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.75, random_state=42, shuffle=True)

### I am going to fit KNN to this dataset for a few reasons:
- There are enough rows to account for the high dimensionality.
- The dataset is not too big and therefore we will not face any efficiency issues.
- KNN is a good fit for data that is not too noisy. (Breasts come in lots of shapes and sizes, but the tumours in this dataset do not vary greatly in size.)

### Another model to consider fitting: Decision Tree / Random Forest
- Because there are lots of metrics, a Decision Tree would be able to ignore any unrelated ones easily.
- I feel like since we are looking at the topological features of a body part, the Decision Tree could come up with good rules to classify the tumour.

In [86]:
#Let's do knn and optimize parameters with GridSearchCV

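#Parameters for GridSearchCV
#Note: 'minkowski' with the default p=2 is the same as 'euclidean'.
#'wminkowski', 'seuclidean' and 'mahalanobis' normally expect extra metric_params (weights, variances, covariance),
#so depending on the scikit-learn version those combinations may fail and be scored as NaN.
params = {'n_neighbors': list(range(1, 50)),
          'metric': ['euclidean', 'manhattan', 'minkowski', 'chebyshev', 'wminkowski', 'seuclidean', 'mahalanobis']}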

#Our instance of KNN
knn = KNeighborsClassifier()

#GridSearchCV to try combinations of parameters
clf = GridSearchCV(knn, params, cv=5)

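#Fitting to the full dataset; GridSearchCV holds out one fold at a time internally, so the test split above is not kept unseen here
clf.fit(X, y)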


GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'metric': ['euclidean', 'manhattan', 'minkowski',
                                    'chebyshev', 'wminkowski', 'seuclidean',
                                    'mahalanobis'],
                         'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                         13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                         23, 24, 25, 26, 27, 28, 29, 30, ...]})
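
To make the `metric` options in the grid more concrete, here is a small plain-Python sketch of three of them, applied to the first four features (`radius_mean`, `texture_mean`, `perimeter_mean`, `area_mean`) of rows 0 and 1 from the `head(3)` output. The helper functions are my own illustrations, not scikit-learn functions.

```python
import math

# First four features of rows 0 and 1 from breast_data.head(3):
# radius_mean, texture_mean, perimeter_mean, area_mean
row_0 = [17.99, 10.38, 122.8, 1001.0]
row_1 = [20.57, 17.77, 132.9, 1326.0]


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest single absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


for name, fn in [("euclidean", euclidean), ("manhattan", manhattan), ("chebyshev", chebyshev)]:
    print(name, round(fn(row_0, row_1), 3))
```

On these unscaled values the `area_mean` difference (325.0) dominates all three distances, so it may be worth checking whether standardizing the features changes which metric comes out on top.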

In [87]:
print("The best parameters that GSCV has found are: {}".format(clf.best_params_))
print("The highest accuracy achieved with these parameters is: {}%".format(np.round(clf.best_score_ * 100, 3)))

The best parameters that GSCV has found are: {'metric': 'manhattan', 'n_neighbors': 9}
The highest accuracy achieved with these parameters is: 93.852%
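
`best_score_` is the mean accuracy over the `cv=5` validation folds, not the accuracy on a single held-out set. As a rough sketch of how such folds are carved out, the plain-Python helper below (`k_fold_indices` is a hypothetical name) splits row indices into 5 contiguous folds. It is simplified: for classifiers scikit-learn uses stratified folds by default, which this sketch does not attempt.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # The first (n_rows % n_folds) folds get one extra row so every row is used once.
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


# 569 rows, as in the null-count output earlier, and cv=5 as in the grid search
fold_sizes = [len(validation) for _, validation in k_fold_indices(569, 5)]
print(fold_sizes)
```

Each row lands in exactly one validation fold, and the reported score is the average of the five per-fold accuracies.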


In [88]:
#I am curious what my version of KNN will do, I will stick to euclidean distance
k = KNN(9)
#Fitting to the training data
k.fit(np.array(train_X), np.array(train_y))
#Getting predictions
predictions = k.predict(np.array(test_X))
#Calculating the accuracy score
score = accuracy_score(test_y, predictions)

print("The accuracy achieved with my implementation of KNN is: {}%".format(np.round(score * 100, 3)))

The accuracy achieved with my implementation of KNN is: 95.804%
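
The `KNN` class imported from `KNN.py` is not shown in this notebook, so the block below is only a guess at the general shape of such a classifier: a minimal brute-force sketch with Euclidean distance and majority voting. `SimpleKNN` and its toy data are hypothetical and written in plain Python for illustration.

```python
import math
from collections import Counter


class SimpleKNN:
    """Hypothetical brute-force k-nearest-neighbours classifier (illustration only)."""

    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: fitting just stores the training data.
        self.X = [list(row) for row in X]
        self.y = list(y)
        return self

    def predict(self, X):
        return [self._predict_one(row) for row in X]

    def _predict_one(self, row):
        # Distance from the query row to every stored training row.
        distances = [(math.dist(row, train_row), label)
                     for train_row, label in zip(self.X, self.y)]
        # Keep the k closest rows and return the most common label among them.
        nearest = sorted(distances, key=lambda pair: pair[0])[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters labelled 'B' and 'M'
toy_X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
toy_y = ["B", "B", "B", "M", "M", "M"]

model = SimpleKNN(k=3).fit(toy_X, toy_y)
print(model.predict([[1.1, 1.0], [4.9, 5.0]]))
```

Every prediction scans all training rows, which is fine for a dataset of this size but would become slow on much larger ones.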


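### Small flex: my algorithm scored higher, although the two numbers are not directly comparable (93.852% is a mean over 5 cross-validation folds on the full dataset with manhattan distance, 95.804% comes from a single 75/25 split with euclidean distance).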

### Let's create a confusion matrix:

In [89]:
matrix = confusion_matrix(test_y, predictions, labels=['M', 'B'])
print("Here is the confusion matrix:\n" + str(matrix))

Here is the confusion matrix:
[[50  4]
 [ 2 87]]


In [90]:
print("A true positive means that the model predicted malignant and the tumour was malignant.\n")
#Rows of the matrix are the true labels and columns are the predictions, both in the order ['M', 'B']
print("The number of true positives is:", matrix[0][0])
print("The number of true negatives is:", matrix[1][1])
print("The number of false positives is:", matrix[1][0])
print("The number of false negatives is:", matrix[0][1])

A true positive means that the model predicted malignant and the tumour was malignant.

The number of true positives is: 50
The number of true negatives is: 87
The number of false positives is: 2
The number of false negatives is: 4


### We can observe that the value for false negatives (4) is double the value for false positives (2). For a cancer classifier this is the less comfortable direction, since a malignant tumor predicted as benign is usually the costlier mistake. In terms of predictions: 2 of the 52 malignant predictions were wrong (about 3.8%), while 4 of the 91 benign predictions were wrong (about 4.4%), so the two kinds of prediction are roughly equally reliable. (For this model and data split at least).

### Let's generate a classification report

In [91]:
print(classification_report(test_y, predictions, labels=["M", "B"], target_names=["Malignant", "Benign"]))

              precision    recall  f1-score   support

   Malignant       0.96      0.93      0.94        54
      Benign       0.96      0.98      0.97        89

    accuracy                           0.96       143
   macro avg       0.96      0.95      0.96       143
weighted avg       0.96      0.96      0.96       143



### We observe some things:
- The precision (share of predictions for a class that were correct) for both malignant and benign tumors is 96%.
- The recall (share of actual cases of a class that were caught) for benign tumors is about 5 percentage points higher than for malignant tumors (0.98 vs 0.93).
- The F1 score (harmonic mean of precision and recall) for benign tumors is 3 points higher than for malignant tumors (0.97 vs 0.94). Since precision is the same for both classes, the gap comes entirely from recall: the model misses malignant tumors more often than it misses benign ones.
- I hypothesize this is due to the higher support (occurrences in the test set) for benign tumors: 89 versus 54.
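
The numbers in the report can be reproduced by hand from the confusion matrix `[[50, 4], [2, 87]]`, whose rows are the true labels and whose columns are the predictions, both ordered `['M', 'B']`. A short plain-Python check (the variable names are mine):

```python
# Confusion matrix from above: rows = true label, columns = predicted label, order ['M', 'B']
matrix = [[50, 4],
          [2, 87]]

tp = matrix[0][0]  # malignant, predicted malignant
fn = matrix[0][1]  # malignant, predicted benign
fp = matrix[1][0]  # benign, predicted malignant
tn = matrix[1][1]  # benign, predicted benign


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Malignant treated as the positive class
precision_m = tp / (tp + fp)
recall_m = tp / (tp + fn)

# Benign treated as the positive class
precision_b = tn / (tn + fn)
recall_b = tn / (tn + fp)

accuracy = (tp + tn) / (tp + fn + fp + tn)

print("Malignant:", round(precision_m, 2), round(recall_m, 2), round(f1(precision_m, recall_m), 2))
print("Benign:   ", round(precision_b, 2), round(recall_b, 2), round(f1(precision_b, recall_b), 2))
print("Accuracy: ", round(accuracy * 100, 3))
```

The rounded values agree with the classification report, and the accuracy agrees with the 95.804% printed earlier.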