# Data Science 1 - Tutorial 5.3 - Classification

## The Breast Cancer Wisconsin Dataset

For this exercise we will use [the Breast Cancer Wisconsin dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html).

In [None]:
# Go to the linked page and find out how to import the dataset
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Understanding the data

In [None]:
# Load the dataset
bc_data = __________

# Print the description
print(bc_data.__________)

In [None]:
# Save the dataset into a DataFrame
# As the column names, take feature_names for the explanatory variables
# and "type" for the response

# __________

bc_df.head(3)

In [None]:
# We've seen a pairplot before, here we introduce a heatmap
# to display all the pairwise correlation values
plt.figure(figsize=(20,20))

# Use heatmap from seaborn
__________(bc_df._____, #correlation values of bc_df
           annot=True);


In [None]:
# Map the values of "type" in bc_df into: 0:"malignant", 1:"benign"

bc_df['type'] = bc_df._____._____(__________)

# Display the value counts for the malignant and benign classes
bc_df['type'].__________

In [None]:
# This is just an example of another type of display, catplot
# Run this and optionally, find out how you can change the kind of the plot
sns.catplot(x="worst perimeter", y="type", data=bc_df,# kind="box",
           height=3, aspect=2);

### Splitting the data

In [None]:
# Perform an 80/20 train-test-split

from __________ import __________

X_train, X_test, y_train, y_test = train_test_split(__________, ___________, # Use the DataFrame
                                                    stratify=__________, # Stratify the split!
                                                    random_state=123)
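
To see what stratification does, here is a minimal sketch on hypothetical toy data (the names `X_toy` and `y_toy` are not part of the exercise). It assumes labels with a 30/70 class ratio and checks that an 80/20 split keeps that ratio in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 30% labelled "a" and 70% labelled "b"
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["a"] * 30 + ["b"] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,     # 80/20 split
    stratify=y_toy,    # keep the class ratio in both parts
    random_state=123,
)

# Share of class "a" in each part; both should match the overall 30%
train_share = np.mean(y_tr == "a")
test_share = np.mean(y_te == "a")
print(train_share, test_share)
```

Without `stratify`, the two shares would generally differ from each other, which matters most when one class is rare.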

In [None]:
# Let's display the stratified data split
# Simply run this
fig, ax = plt.subplots(1, 2, figsize=(8, 3)) #, sharey=True)
sns.countplot(x=y_train, ax=ax[0], order=["malignant", "benign"])
sns.countplot(x=y_test, ax=ax[1], order=["malignant", "benign"]);

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Complete the following

# Instantiate the tree
model_tree = __________

# Fit the tree to the training set
__________

print('Training accuracy: ', model_tree.score(X_train, y_train))

### Evaluation Metrics

In [None]:
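# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay (from_estimator / from_predictions) replaces it
from sklearn.metrics import classification_report, ConfusionMatrixDisplay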

# Using the fitted model_tree, predict the outcomes from the test set
pred_tree = __________

# Using the imported metrics, print out the reports
# classification report
__________

# plot the confusion matrix
__________
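
As a reading aid for these reports, the sketch below builds a confusion matrix from a handful of made-up labels (`y_true` and `y_pred` are hypothetical, not model output). Rows are the true classes and columns the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for eight samples
y_true = ["malignant", "malignant", "malignant",
          "benign", "benign", "benign", "benign", "benign"]
y_pred = ["malignant", "malignant", "benign",
          "benign", "benign", "benign", "benign", "malignant"]

labels = ["malignant", "benign"]

# Rows: true class, columns: predicted class
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall and F1 per class
print(classification_report(y_true, y_pred, labels=labels))
```

Here two of the three malignant samples are found (recall 2/3), and one benign sample is wrongly flagged as malignant.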

### Feature Importance

In [None]:
# How can we print out the feature importances?

# print(model_tree.__________)

# These numbers correspond to the columns of the features
print(X_train.columns)

In [None]:
# We can collect the feature importances in a DataFrame and sort the values for easier exploration
# Fill in the following to display the sorted feature importance values

pd.DataFrame(__________, # set column names as the index
             __________, # feature importance values
             __________ # column name for the feature importance
            ).__________ # sort ascending

### Displaying the Tree

In [None]:
# Run this cell
from sklearn.tree import plot_tree

plt.figure(figsize=(20,10), dpi=300)
plot_tree(model_tree, feature_names=X_train.columns);#, filled=True);
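
When the plotted tree is hard to read, a text rendering can help. The following is a minimal sketch using `export_text` on the iris dataset (a stand-in chosen for illustration; `toy_tree` is a hypothetical name), showing each split as an indented rule.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small stand-in dataset, only for illustration
iris = load_iris()

# A shallow tree keeps the printed rules short
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# One line per node: split conditions are indented by depth,
# leaves show the predicted class
tree_text = export_text(toy_tree, feature_names=list(iris.feature_names))
print(tree_text)
```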

### Changing some hyperparameters

To see the effects of changing some of the parameters, let's create a function to simplify our reporting task.

In [None]:
# Complete the following
def tree_report(model):
    test_pred = model.__________(__________) # prediction on the test set

    # print the classification report
    print(__________)

    # plot the confusion matrix
    __________

    # plot the resulting tree
    plt.figure(figsize=(20,10), dpi=300)
    __________

In [None]:
# Instantiate a pruned tree with a maximum depth of 3
model_pruned = DecisionTreeClassifier(__________,
                                      random_state=123)

# Fit the pruned tree to the training set
__________

# Generate the report
__________


In [None]:
# Now try changing other hyperparameters and report the performance.

## Random Forests

1. What might be the motivation for random forests (i.e. what are the shortcomings of using a single decision tree)?

**A**:

2. List two hyperparameters of random forests.

**A**:

3. Explain bagging in random forests.

**A**:

In [None]:
# Complete the following

from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(___________, # Set the number of trees to 20
                                 random_state=123)

# Fit the model to the training set
__________

# Predict from the test set
_________

# Print the classification report and plot the confusion matrix
__________

### Using Cross Validation for Random Forest

We'll use grid search to choose two hyperparameters, `n_estimators` and `max_features`. Later on in the SVM section, we will use `GridSearchCV` again.

In [None]:
# Run this cell
from sklearn.model_selection import GridSearchCV

n_estimators = [32, 64, 96, 128, 160]
max_features = [4,5,6,8]
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features}

model_rf = RandomForestClassifier(random_state=123) # fixed seed for reproducible results
grid = GridSearchCV(model_rf, param_grid)
grid.fit(X_train, y_train)
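
The grid search can take a while, and the sketch below shows why. It expands the same grid with `ParameterGrid` and counts the model fits, assuming the default 5-fold cross-validation of `GridSearchCV` (scikit-learn 0.22 or later).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [32, 64, 96, 128, 160],
              'max_features': [4, 5, 6, 8]}

# Every combination of the listed values is one candidate model
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)    # 5 * 4 combinations

n_folds = 5                       # assumed default cv of GridSearchCV
n_fits = n_candidates * n_folds   # the final refit on all training data is not counted

print(n_candidates, n_fits)
print(candidates[0])
```

Each extra value in either list adds a whole row or column of candidates, so the grid grows multiplicatively.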

In [None]:
# Print the best parameters
grid.__________

In [None]:
# Fit the model using the best chosen parameters and report the performance
model_rfcv = __________
__________

### Error vs. number of trees

We'll now use another technique to choose the number of trees: plotting the test error and the number of misclassifications against the number of trees.

In [None]:
from sklearn.metrics import accuracy_score

errors = []
misclassifications = []

for n in range(1,200,5):
    # Instantiate, fit, predict
    model_rf = __________ # set max_features=6
    model_rf.__________
    test_pred = __________

    e = 1-accuracy_score(y_test, test_pred)
    missed = __________ # How to count the misclassified outcomes?

    errors.append(e)
    misclassifications.append(missed)

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,4))
# Plot errors and misclassifications side by side
__________

## SVM

We'll take a subset of the breast cancer dataframe to illustrate the separating hyperplane.

In [None]:
# Take the two columns which have the highest feature importance values
# (see "Feature Importance" above) and the target column
df = __________


sns.scatterplot(x=_________, y=_________, # The order doesn't matter here
                hue=__________, # Differentiate the color based on the outcomes
                data=df);

In [None]:
from sklearn.svm import SVC
#help(SVC)

In [None]:
# Run this helper function
def plot_svm(model,X,y):

    X = X.values

    # Scatter Plot
    plt.scatter(X[:, 0], X[:, 1],
                c=y, s=30,cmap='seismic')


    # plot the decision function
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = model.decision_function(xy).reshape(XX.shape)

    # plot decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    # plot support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=100,
               linewidth=1, facecolors='none', edgecolors='k')
    plt.show()


In [None]:
# Also run this
X = df.drop('type', axis=1)
y = bc_data.target

model_svm = SVC(kernel='linear')
model_svm.fit(X,y)

plt.figure(figsize=(10,5))
plot_svm(model_svm, X, y)
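
The contour levels `[-1, 0, 1]` in `plot_svm` are values of the decision function: 0 is the separating line and -1 and 1 are the margins. The sketch below illustrates this on hypothetical, linearly separable toy points (`X_toy`, `y_toy` and `toy_svm` are made-up names), where every support vector should sit on a margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [-1.0, 0.5],
                  [2.0, 0.5], [3.0, 0.0], [3.0, 1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_svm = SVC(kernel='linear', C=10)
toy_svm.fit(X_toy, y_toy)

# Signed score of each support vector; for separable data these
# are close to -1 or +1, i.e. the points lie on the dashed margins
sv_scores = toy_svm.decision_function(toy_svm.support_vectors_)
print(toy_svm.support_vectors_)
print(sv_scores)

# All remaining points lie on or outside the margins
all_scores = toy_svm.decision_function(X_toy)
print(all_scores)
```

In the breast cancer data the two classes overlap, so some circled support vectors fall inside the margins or on the wrong side of the line.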

Try it yourself:

1. Research: What is a kernel in an SVM, and what is a radial basis function?
2. Now plot the decision boundary with the rbf kernel (Simply remove the argument `kernel='linear'` because `rbf` is the default value). Try changing the values of `C` and `gamma`.
3. Use `GridSearchCV` as shown in "Using Cross Validation for Random Forest" to find the best among these hyperparameters:
`'C':[0.01, 0.1, 1, 10]`, `'kernel':['linear','rbf']`.