# **AI in Medicine**

Welcome to the second programming workshop for AI in Radiology!

Today we will dive into some more advanced concepts in machine learning:
*    Supervised learning: decision trees, understanding overfitting, and kNN classifiers


Before you begin reading and editing any code, make sure to make a personal copy of this notebook by clicking `File` --> `Save a Copy in Drive` so you can make changes to the code.

## Imports and loading the dataset
Let's start with the necessary `import`s for our code. We will return to the breast cancer dataset for our supervised learning topics.

**Confused about something?** Raise questions during the session! Also remember that you can always Google a function or piece of code that you're not sure about and you will find lots of documentation explaining what is happening.

In [None]:
# Load the necessary python libraries
from sklearn import preprocessing, decomposition
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold, cross_validate
# Note: plot_confusion_matrix was removed from sklearn.metrics in scikit-learn 1.2; use ConfusionMatrixDisplay instead
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import graphviz
from graphviz import Source
from IPython.display import SVG
import pandas as pd
import numpy as np
import scipy

%matplotlib inline
plt.style.use('bmh')
plt.rcParams.update({'font.size': 14,
                     'xtick.labelsize' : 14,
                     'ytick.labelsize' : 14,
                     'figure.figsize' : [12,8],
                     })

# **Supervised Learning**: Decision Trees for Classification

This session we will look at another supervised learning technique: decision trees. Recall that last session we looked at logistic regression, which is suitable for classifying data that is *linearly separable*. Decision trees, on the other hand, can learn non-linear classification boundaries, which makes them suitable for more complex datasets. Another benefit of decision trees is that they are **easy to interpret**! For a trained decision tree classifier, you can directly view the sequence of decisions that leads to the final classification. As a result, decision trees are a great option for medical problems (such as reaching a diagnosis based on certain patient symptoms) where it is important to explain and understand how an algorithm reaches its decision.
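
To make the interpretability point concrete, here is a minimal sketch that prints the rules of a small tree as plain text. It assumes scikit-learn's `export_text` helper; the variable names `small_tree` and `rules` and the choice of `max_depth=2` are made up for this illustration and are not part of the workshop code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

cancer = load_breast_cancer()

# A deliberately shallow tree (max_depth=2 is an arbitrary choice for illustration)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(cancer.data, cancer.target)

# export_text renders the learned decision rules as indented plain text;
# feature_names is passed as a plain list so each split shows a readable name
rules = export_text(small_tree, feature_names=list(cancer.feature_names))
print(rules)
```

Each indented line is one question about a feature, and each leaf ends in a predicted class (0 = malignant, 1 = benign), so the path to any prediction can be read off directly.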

Once again, we will be analyzing the breast cancer dataset. We will use decision trees to learn to distinguish benign from malignant tumors, and we will investigate how the accuracy changes as we change the depth of the decision tree.

**Reminder:** the breast cancer dataset has 569 tumor cases with 30 features each. Feel free to visualize and plot the features to explore the data.
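
One possible way to start exploring is to put the features into a Pandas dataframe. The sketch below is only a starting point; the name `df` and the two columns chosen for the summary are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Put the features into a DataFrame with named columns
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target  # 0 = malignant, 1 = benign

print(df.shape)                     # number of rows and columns (30 features + target)
print(df["target"].value_counts())  # class balance
print(df[["mean radius", "mean texture"]].describe())  # summary statistics for two features
```

From here, calls such as `df.hist()` or a scatter plot of two columns colored by `target` can give a first impression of how well individual features separate the classes.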

In [None]:
# Load the breast cancer dataset and standardize its features. This code is the same as used in the previous session
cancer = load_breast_cancer()
# Define a scaler which standardizes each feature to have mean 0 and standard deviation 1. This usually improves stability and performance
scaler = preprocessing.StandardScaler().fit(cancer.data)
# Use the scaler to transform the whole dataset (it is split into training and validation sets later)
cancer_data_scaled = scaler.transform(cancer.data)

# Get statistics
# Print the number of malignant samples (in this dataset, target 0 = malignant and 1 = benign)

n_samples_cancer = (cancer.target == 0).sum()
print(n_samples_cancer)


212


## Training a decision tree and understanding overfitting:

`sklearn` provides all the code necessary to create and train a decision tree.

One of the most important parameters that we need to settle on for a decision tree is its maximum depth, which is essentially how many levels of decisions the algorithm may make before arriving at the final classification. If we pick a decision tree with too few levels (e.g. `max_depth = 1`), it will not be complex enough to accurately differentiate between benign and malignant tumors (underfitting). On the other hand, if our decision tree has too many levels (e.g. `max_depth = 15` for our dataset), the algorithm will try to classify the training set perfectly over up to 15 levels of questions. The issue is that such a tree also memorizes noise in the training data, so its near-perfect training accuracy tends not to carry over to the validation set. This is called **overfitting**. We will do a simple experiment to figure out how to pick an appropriate decision tree depth, and you can use this type of analysis in the future to settle on other important parameters for algorithms.

This will be our experiment:
1.  Split the dataset into training and validation sets with a split of 75%/25%
2.  Train 15 different decision tree classifiers with exactly the same parameters, with the exception of maximum depth which varies from 1 to 15.
3.  Visualize training and validation set accuracy for each decision tree
4.  Visualize the decision tree with the most appropriate maximum depth
5.  Visualize the best decision tree results as a confusion matrix



In [None]:
# This seed controls how the data is randomly split into training and validation sets.
# Try changing it to another integer and rerunning the max_depth experiment below.
random_state = 13

In [None]:
# Split the dataset into training and validation sets (train_test_split defaults to 75% training / 25% validation)
X_train, X_validation, y_train, y_validation = train_test_split(cancer_data_scaled, cancer.target, random_state=random_state)

# Create empty variables to store the decision tree accuracy results
training_accuracy = []
validation_accuracy = []

# Define the range of maximum tree depths as 1 to 15 (range excludes its end value, 16)
max_dep = range(1,16)

# Use a for loop to try each of the maximum depth decision trees, and store the scores in the appropriate lists
for md in max_dep:
    tree = DecisionTreeClassifier(max_depth=md,random_state=0)
    tree.fit(X_train,y_train)
    training_accuracy.append(tree.score(X_train, y_train))
    validation_accuracy.append(tree.score(X_validation, y_validation))

# Plot the tree depths against training and validation set accuracies
plt.figure()
plt.plot(max_dep,training_accuracy, label='Training set')
plt.plot(max_dep,validation_accuracy, label='Validation set')
plt.ylabel('Accuracy')
plt.xlabel('Max Depth')
plt.legend()
plt.show()
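
The plot above comes from a single train/validation split, so the curve can shift when `random_state` changes. One possible way to get a steadier estimate is k-fold cross-validation, using the `KFold` and `cross_validate` helpers that the import cell already loads. The following is a minimal sketch, assuming 5 folds and the unscaled features (both are choices made for this illustration, not part of the workshop code); the names `depths`, `mean_train`, and `mean_val` are likewise hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()

# 5 shuffled folds: an assumption for this sketch, not a value from the workshop
cv = KFold(n_splits=5, shuffle=True, random_state=13)

depths = range(1, 16)
mean_train, mean_val = [], []
for md in depths:
    tree = DecisionTreeClassifier(max_depth=md, random_state=0)
    # cross_validate fits one tree per fold and returns the per-fold scores
    scores = cross_validate(tree, cancer.data, cancer.target,
                            cv=cv, return_train_score=True)
    mean_train.append(scores["train_score"].mean())
    mean_val.append(scores["test_score"].mean())

# Print the fold-averaged accuracies for each depth
for md, tr, va in zip(depths, mean_train, mean_val):
    print(f"max_depth={md:2d}  train={tr:.3f}  validation={va:.3f}")
```

Averaging over folds makes the choice of depth less dependent on one lucky or unlucky split, at the cost of training five trees per depth instead of one.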


## What is the optimal tree depth?
## Can you explain why the training set accuracy keeps increasing with depth?

In [None]:
# Set the optimal depth as the new max_depth (replace ??? with an integer)
#### YOUR CODE HERE ####
max_depth = ???


# Now we can fit a new DecisionTreeClassifier with our new max_depth and evaluate it
tree = DecisionTreeClassifier(max_depth=max_depth,random_state=0)
# TODO: Train the classifier, i.e. fit it to the data X_train and y_train
#### YOUR CODE HERE ####
tree.fit(X_train, y_train)


training_accuracy = tree.score(X_train, y_train)
validation_accuracy = tree.score(X_validation, y_validation)


# TODO: print the accuracy for the training and the validation set
#### YOUR CODE HERE ####




# Visualize decision tree
# TODO: Run the code to draw the trained decision tree
graph = Source(export_graphviz(tree, out_file=None, class_names=['malignant','benign'], feature_names=cancer.feature_names, impurity=False, filled=True))
SVG(graph.pipe(format='svg'))

In [None]:
# TODO: Write code to print the most important feature
MOST_IMPORTANT = ...
print(MOST_IMPORTANT)
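
Step 5 of the experiment asks for a confusion matrix of the best tree. A minimal sketch of how that could look is given here; it is self-contained, so it reloads and splits the data itself, and `max_depth=3` is only a placeholder to be replaced by the depth you selected. The name `cm` is an assumption for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_validation, y_train, y_validation = train_test_split(
    cancer.data, cancer.target, random_state=13)

# max_depth=3 is a placeholder: substitute the depth you chose above
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Rows are the true classes, columns the predicted classes
# (label order: 0 = malignant, 1 = benign)
cm = confusion_matrix(y_validation, tree.predict(X_validation))
print(cm)
```

To draw the matrix instead of printing it, `ConfusionMatrixDisplay(cm, display_labels=cancer.target_names).plot()` should work in recent scikit-learn versions. In a medical setting the off-diagonal entries deserve a close look, since a malignant tumor predicted as benign is usually the more costly mistake.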

**BONUS**: Can you improve the accuracy by building a kNN classifier?

1. Check https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html on how to build a kNN classifier
2. Loop over a range of k to find the best performing one
3. Store the results in a loop and then plot them
4. Train a kNN classifier with the best k and compare the validation accuracy of the decision tree and the kNN classifier
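
As a starting point, the basic `KNeighborsClassifier` workflow for a single, fixed k might look like the sketch below. It is self-contained, so it repeats the scaling and splitting, and `n_neighbors=5` is simply scikit-learn's default, not a recommended value. The loop over k and the comparison with the tree remain the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# kNN relies on distances, so features are standardized first
X_scaled = StandardScaler().fit_transform(cancer.data)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_scaled, cancer.target, random_state=13)

# n_neighbors is the "k" in kNN; 5 is only an example value
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)
val_acc = knn.score(X_validation, y_validation)
print("training accuracy:  ", train_acc)
print("validation accuracy:", val_acc)
```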

In [None]:
# define the range of k
#### YOUR CODE HERE ####
from_k =...
to_k = ...

# lists to store the results
train_accuracies_knn = []
validation_accuracies_knn = []

# Loop over k and train a kNN classifier for each value (note: range excludes to_k itself)
for i in range(from_k, to_k):
  #### YOUR CODE HERE ####

In [None]:
# plot the results and select best k
#### YOUR CODE HERE ####

In [None]:
# train knn-Classifier with best k and compute the accuracy for the training set and the validation set
#### YOUR CODE HERE ####
best_k = ...

# fit knn-Classifier
#### YOUR CODE HERE ####


#### YOUR CODE HERE ####
knn_train_accuracy = ...
knn_validation_accuracy = ...


# compare the accuracies of both models
#### YOUR CODE HERE ####





**What do you think? Why does the kNN classifier perform better here?**