# Discriminative Classifiers - Diagnosing Breast Cancer

In this practical, we will work with a real dataset of medical data. The features are generated from images of masses taken from breast tissue. The outcome variable is whether the mass is malignant or benign. More information can be found [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).

We will use a train/test split to explore the impact of $k$ on performance, looking at the trade-off between bias and variance. We will also look at how sensitive the KNN model is to the scale of the features you use.

First, we load the data into a DataFrame and assign the features to `X` and the `diagnosis` variable to `y`.

Look at the distribution of benign (`y==0`) and malignant (`y==1`). What do you notice?

In [1]:
import pandas as pd

data = pd.read_csv("data/wisconsin_data.csv")
X = data.drop("diagnosis", axis=1)
y = data["diagnosis"]

# Your thoughts here...
print(y.value_counts(normalize=True))

## Creating a test/train split

Split the data into train and test datasets. You could do this manually but the `sklearn.model_selection.train_test_split` function can handle it all. It takes in data `X` and `y` and splits it into `X_train`, `X_test`, `y_train` and `y_test`.

Use this function to split up your data. Make the test set contain around 20% of the data using `test_size=0.2`.

Set `stratify=y` to ensure the ratio of classes in `y_train`/`y_test` is preserved and check this is the case.
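To make the idea behind `stratify=y` concrete, here is a minimal pure-Python sketch of a stratified split. It is not how scikit-learn implements it; the helper `stratified_split` and the 60/40 label counts are made up for illustration. Each class is shuffled and split separately, so both parts keep the same class ratio.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    # Group sample indices by their class label
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Illustrative labels only: 60 benign (0) and 40 malignant (1)
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both parts keep the 60:40 ratio of the full label list, which is the property you are asked to check for `y_train` and `y_test`.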

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=144)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

## Training an initial model

Instantiate a `sklearn.neighbors.KNeighborsClassifier` model with default parameters (`n_neighbors=5`, i.e. $k=5$), named `knn`.

Use the `.fit()` method to train it on `X_train` and `y_train`.

Use the `.score()` method of the trained model to find its accuracy on the test set, `X_test` and `y_test`.

In [3]:
from sklearn.neighbors import KNeighborsClassifier

# Your code here...
neighbours = 5

knn = KNeighborsClassifier(n_neighbors=neighbours)

knn.fit(X_train, y_train);

accuracy = knn.score(X_test, y_test)

print(accuracy)

## Pre-processing data for optimal KNN performance

Because KNN uses the concept of **distance** between points to determine similarity, if the scales of features differ wildly then it can cause issues.

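For example, if one feature is in the range 1 to 5 but another is in the range 400 to 290000, then differences in the second feature are numerically far larger than differences in the first. The Euclidean distance between two points will be dominated by the large-scale feature, and the small-scale feature will have almost no influence on which neighbours are chosen.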
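A small sketch of this effect, using three made-up points whose two features lie in the ranges quoted above (the points `a`, `b`, `c` and the helper `euclidean` are illustrative only):

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Each point is (small_feature in 1-5, large_feature in 400-290000)
a = (1.0, 1000.0)
b = (5.0, 1010.0)  # opposite end of the small feature, almost equal large feature
c = (1.0, 1500.0)  # identical small feature, slightly different large feature

dist_ab = euclidean(a, b)
dist_ac = euclidean(a, c)
print(f"d(a, b) = {dist_ab:.2f}")
print(f"d(a, c) = {dist_ac:.2f}")

# Share of the squared distance a-b that comes from the small feature
small_share = (a[0] - b[0]) ** 2 / dist_ab**2
print(f"small feature share of d(a, b)^2: {small_share:.2f}")
```

Point `b` differs from `a` across the entire range of the small feature, yet it still counts as much closer to `a` than `c` does, because a modest change in the large feature outweighs it.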

The min/max of `X_train` shows this:

In [4]:
X_train.describe()

It is not the raw feature values that matter, but their size relative to each other. Therefore, we can scale all features to be within the same range. This is normally 0 to 1.

This can be easily done using `sklearn.preprocessing.MinMaxScaler`:



In [5]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"New min: {X_train_scaled.min():.3f} New max: {X_train_scaled.max():.3f}")
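As a rough sketch of what min-max scaling computes for a single feature, the snippet below applies $x' = (x - \min) / (\max - \min)$ by hand. The helper names and the values are invented for illustration. As in the cell above, the minimum and maximum are learned from the training data only and then reused for the test data.

```python
def fit_min_max(column):
    """Learn the min and max of one feature from the training data."""
    return min(column), max(column)


def transform(column, lo, hi):
    """Rescale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in column]


# Made-up values for one feature
train_col = [400.0, 1200.0, 290000.0]
test_col = [350.0, 1200.0, 300000.0]

lo, hi = fit_min_max(train_col)  # fitted on the training data only
train_scaled = transform(train_col, lo, hi)
test_scaled = transform(test_col, lo, hi)

print("train:", [round(v, 4) for v in train_scaled])
print("test: ", [round(v, 4) for v in test_scaled])
```

Because the test values are scaled with the training minimum and maximum, they can fall slightly outside the 0 to 1 range. This is expected, and it avoids leaking information about the test set into the preprocessing step.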

Train and score a new KNN as before, named `knn_scaled`, using the new scaled data.

In [6]:
# Your code here...
neighbours = 5

knn_scaled = KNeighborsClassifier(n_neighbors=neighbours)

knn_scaled.fit(X_train_scaled, y_train);

accuracy = knn_scaled.score(X_test_scaled, y_test)

print(f"Accuracy: {accuracy:.3f}")

Accuracy has improved quite a bit!

## The impact of `k` on accuracy

Recall that the value of $k$ in KNN impacts model **bias** (the error from a model that is too simple to capture the relevant relations in the features) and model **variance** (how sensitive the model is to noise in the training data).

A KNN model is most prone to overfitting when $k$ is low, and underfitting when $k$ is high.
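The toy example below sketches both extremes with a hand-written one-dimensional KNN. The data and the helpers `knn_predict` and `accuracy` are made up for illustration, and the labels include one deliberately noisy training point at `x = 3`.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to query (1-D)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, xs, ys, k):
    """Fraction of points in (xs, ys) that are classified correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(xs, ys))
    return hits / len(xs)


# Class 0 at low x, class 1 at high x, plus one noisy label at x = 3
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 1, 0, 0, 1, 1, 1, 1]
test_x = [2.8, 3.3, 6.6]
test_y = [0, 0, 1]

for k in (1, 3, 9):
    train_acc = accuracy(train_x, train_y, train_x, train_y, k)
    test_acc = accuracy(train_x, train_y, test_x, test_y, k)
    print(f"k={k}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

With `k=1` every training point is its own nearest neighbour, so the model memorises the noisy label and scores perfectly on the training data but poorly on the test points. With `k=9` (every training point votes) the prediction is always the majority class, whatever the input. The middle value smooths over the noise without ignoring the features.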

For values of $k$ in `range(1, 400)`, create a model using that value of $k$ and `.fit()` it using `X_train_scaled` and `y_train`.

Use the `.score()` method on the train data (`X_train_scaled` and `y_train`) and store the resulting score in `accs_train`.

Use the `.score()` method on the test data (`X_test_scaled` and `y_test`) and store the resulting score in `accs_test`.

(This might take 30 seconds or so to complete!)

In [7]:
accs_train = []
accs_test = []

# Your code here...
for i in range(1, 400):
    knn = KNeighborsClassifier(n_neighbors=i)

    knn.fit(X_train_scaled, y_train);

    accs_train.append(knn.score(X_train_scaled, y_train))
    accs_test.append(knn.score(X_test_scaled, y_test))

The cell below will plot accuracy against $k$ for both the train and test data. What do you observe?

In [8]:
import seaborn as sns

# Make plot a readable size
sns.set_theme(rc={"figure.figsize": (12, 8)})
# Convert data to DataFrame
df = pd.DataFrame(
    zip(accs_train, accs_test, range(1, 400)),
    columns=["train data (seen)", "test data (unseen)", "k"],
)
# Melt to long format for easy plotting
df = df.melt(var_name="Evaluated against:", id_vars="k", value_name="Accuracy")
# Plot dataframe
g = sns.lineplot(data=df, hue="Evaluated against:", x="k", y="Accuracy")


# Your thoughts here...

## Evaluating KNN: true/false positives/negatives

The `.score()` method uses the accuracy metric: the fraction of all classifications that were correct.

This doesn't really give the best picture of model performance, though. As you saw when $k>350$, accuracy flat-lines at 0.63. This is because the model is using almost all of the training points as neighbours for every prediction, and around 63% of them are in the benign class, so it predicts benign every time.
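A quick sketch of why accuracy alone can mislead here. The labels below are hypothetical, chosen only to match the rough 63% benign share mentioned above, and the "model" simply predicts benign for every case.

```python
# Hypothetical labels: 63 benign (0) and 37 malignant (1)
y_true = [0] * 63 + [1] * 37

# A "model" that ignores its input and always predicts benign
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
malignant_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {accuracy:.2f}")
print(f"Malignant cases detected: {malignant_found} of {sum(y_true)}")
```

The accuracy equals the share of the majority class even though not a single malignant case is detected.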

A more useful approach is to see how the model performed for each individual class. Especially for health-related tasks, we are interested in:

* True Positives (TP): Cases in which the tissue is malignant and it was predicted as such.
* True Negatives (TN): Cases in which the tissue is benign (not malignant) and it was predicted as such.
* False Positives (FP): Cases in which the tissue is benign (not malignant) and it was predicted as malignant. (This is often called Type I error.)
* False Negatives (FN): Cases in which the tissue is malignant and it was predicted as benign. (This is often called Type II error.)
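The four counts can be tallied directly from true and predicted labels. This is a minimal sketch with made-up labels, where malignant (`1`) is treated as the positive class; `confusion_counts` is an illustrative helper, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(t == positive and p == positive for t, p in pairs),
        "TN": sum(t != positive and p != positive for t, p in pairs),
        "FP": sum(t != positive and p == positive for t, p in pairs),
        "FN": sum(t == positive and p != positive for t, p in pairs),
    }


# Made-up labels: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)
```

In a diagnostic setting the FN count deserves particular attention, since each false negative is a malignant mass reported as benign.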

A confusion matrix can show this and can be computed using `pandas.crosstab` then visualised with `seaborn.heatmap`.

The cell below will do this for you. What do you observe?

In [9]:
import matplotlib.pyplot as plt

# Make readable size
sns.set_theme(rc={"figure.figsize": (18, 6)})
# Get 4 new blank plots in a row
fig, axes = plt.subplots(1, 4)

# Iterate through a few values of k
for e, k in enumerate([50, 100, 200, 350]):
    # Make model
    knn = KNeighborsClassifier(n_neighbors=k)
    # Train on training data
    knn.fit(X_train_scaled, y_train)
    # Get predictions of test data
    y_pred = knn.predict(X_test_scaled)

    # Make the confusion matrix. Normalise the cells to show percentages overall
    cm = pd.crosstab(
        y_test, y_pred, rownames=["True"], colnames=["Predicted"], normalize=True
    )

    # Plot confusion matrix, one on each of the blank axes.
    g = sns.heatmap(
        data=cm, cmap="Blues", square=True, annot=True, ax=axes[e], cbar=False
    )

    # Label them so it's clear which is which
    g.set_title(f"k = {k}")


# Your thoughts here...
### k=100 reflects the best relation to truth
### k=350 is awful: it predicts everything as 0

## Evaluating KNN: precision, recall, F1 score

True/false positives/negatives can be combined to make new metrics, to give a more concise understanding of how the model is performing.

* Precision = TP / (TP + FP)
    * Ratio of correctly predicted positive observations to the total predicted positive observations
* Recall = TP / (TP + FN)
    * Ratio of correctly predicted positive observations to all of the observations that are actually in the positive class
* F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
    * Harmonic mean of Precision and Recall
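As a worked example, the sketch below computes the three metrics from raw counts. The counts are invented for illustration, and `precision_recall_f1` is a hypothetical helper that returns 0 whenever a denominator is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative counts: 30 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")

# A model that never predicts the positive class: TP = FP = 0
never_positive = precision_recall_f1(tp=0, fp=0, fn=37)
print(never_positive)
```

The second call shows the edge case where precision would be 0/0. Here it is reported as 0 instead of raising an error.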
    
`sklearn.metrics.classification_report` can provide a nice summary of all of these metrics, per class.

For values of `k` in `[50, 100, 200, 350]`, create a new model and fit it on the scaled training data.

Use the model's `.predict()` method with the scaled test data. Store as `y_pred`.

Use `classification_report(y_test, y_pred, zero_division=0)` to calculate metrics for the model and print them out.

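(Note: when a model never predicts a class, the precision for that class is 0/0 and therefore undefined. `zero_division=0` reports such metrics as 0 and suppresses the warning scikit-learn would otherwise show.)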

What do you observe?

In [10]:
from sklearn.metrics import classification_report

# Your code and thoughts below...
for k in [50, 100, 200, 350]:
    knn = KNeighborsClassifier(n_neighbors=k)
    
    knn.fit(X_train_scaled, y_train)
    
    y_pred = knn.predict(X_test_scaled)
    
    print(f"Classification for {k} Neighbours:\n{classification_report(y_test, y_pred, zero_division=0)}")

## Conclusion

In this section you trained and evaluated a KNN model for diagnosing breast cancer.

You could also look more into the evaluation metrics for these classifiers. See [the sklearn documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) for a range of classification metrics.