<a href="https://colab.research.google.com/github/EdenShaveet/Disclosure-Curriculum/blob/main/Module1_KNN_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module Exercise: Create a Simple KNN Classifier

---


**Script Description:** Trains a K-Nearest Neighbors (KNN) classifier on the Wisconsin Breast Cancer (Diagnostic) Dataset.

**Dataset Description:** The [Wisconsin Breast Cancer (Diagnostic) dataset (WBC)](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) is a classification dataset sourced from the University of California, Irvine (UCI) Machine Learning Repository. The dataset contains measurements of breast masses, along with a label indicating whether each mass was benign or malignant.

**Script Attributions:** This script was developed for the Machine Learning Model & Dataset Disclosure for Healthcare & Public Health Contexts (MDSD4Health) Curriculum.

**Instructions:** Read each text block and run each corresponding code block to generate a simple K-Nearest Neighbors (KNN) classifier for the Wisconsin Breast Cancer (Diagnostic) Dataset. For help with Python and Colab, see the Python Help page at the MDSD4Health curriculum website.

# Package Management

First, let's import the necessary packages.

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# View Dataset

Next, we'll import our dataset using the `load_breast_cancer()` function and view a preview of the dataframe. Take a look at the variables contained in this dataset.

In [None]:
df = load_breast_cancer(as_frame=True) # Loads the dataset (a dictionary-like object holding pandas data) and assigns it to object "df"
df.frame # Returns the full dataframe (features plus the "target" column)
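
As a quick check on what `load_breast_cancer(as_frame=True)` returns, the sketch below inspects the loaded object. The variable name `bunch` is introduced here for illustration only; the notebook itself calls this object `df`.

```python
from sklearn.datasets import load_breast_cancer

# "bunch" is an illustrative name; the notebook assigns this object to "df"
bunch = load_breast_cancer(as_frame=True)

print(type(bunch.frame))          # pandas DataFrame: features plus the "target" column
print(bunch.frame.shape)          # (number of rows, number of columns)
print(list(bunch.target_names))   # class names, indexed by target value (0, 1)
```

The order of `target_names` is what ties the numeric labels to their meaning: position 0 is the name for label 0 and position 1 is the name for label 1.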

Our "target" variable (our label of interest) is whether a breast mass was benign (1) or malignant (0). Let's look at the number of benign and malignant masses, first as a table of counts and then as a plot.

In [None]:
df['target'].value_counts() # Returns count of each target variable class

In [None]:
sns.countplot(x=df['target']) # Generates plot of target variable classes (recent seaborn versions require the "x" keyword)
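
The counts above are keyed by 0 and 1, which is easy to misread. A minimal sketch, assuming you would rather see class names, maps the numeric labels to the names stored with the dataset. The names `bunch`, `label_names`, and `named_counts` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)  # illustrative name; the notebook uses "df"

# Build a {label: name} lookup from the dataset's own class names
label_names = {i: str(name) for i, name in enumerate(bunch.target_names)}

# Replace 0/1 with class names, then count each class
named_counts = bunch.target.map(label_names).value_counts()

print(named_counts)                       # counts per class
print(named_counts / named_counts.sum())  # share of each class
```

The class shares are worth noting before training, since accuracy alone can look good on an imbalanced dataset.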

# Data Splitting

Before creating and training a model, we need to split our dataset into training and test subsets. We do this by assigning our feature (indicator) variables to object "X" and our target variable to object "y," then splitting each into random train and test subsets using `train_test_split`.



In [6]:
X = pd.DataFrame(df.data, columns=df.feature_names) # Assign feature (indicator) variables to object "X"
y = pd.Series(df.target) # Assign target variable (diagnosis) to object "y"

X_train, X_test, y_train, y_test = train_test_split(X, y) # Randomly split into training (75%) and test (25%) sets, the default proportions
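
Because the split above is random, each run produces different subsets and therefore slightly different metrics. The sketch below shows one way to make the split repeatable and to keep the class balance similar in both subsets. The values `random_state=42` and `test_size=0.25` are arbitrary choices made for this example, not settings taken from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and target directly as pandas objects
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # hold out 25% of rows for testing (same as the default)
    random_state=42,  # fixed seed so the split is the same on every run (arbitrary value)
    stratify=y,       # keep the benign/malignant proportions similar in both subsets
)

print(X_train.shape, X_test.shape)     # rows and columns in each subset
print(y_train.mean(), y_test.mean())   # share of label 1 (benign) in each subset
```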

# Create KNN

Finally, let's use the dataset to create and train a KNN classifier to classify breast masses as benign or malignant.

In [7]:
knn = KNeighborsClassifier(n_neighbors=5) # Create KNN classifier (⬅️✏️Try experimenting with different values of n_neighbors!)

knn.fit(X_train, y_train) # Train classifier using training sets

y_pred = knn.predict(X_test) # Predict response for test dataset
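
To experiment with the number of neighbors more systematically, one option is to loop over several values and record the test accuracy for each. This is a sketch under stated assumptions: the candidate values and `random_state=0` are arbitrary, and the names `model` and `accuracy_by_k` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # arbitrary seed

accuracy_by_k = {}
for k in (1, 3, 5, 7, 9, 11):            # candidate values, chosen arbitrarily
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)           # train on the training subset
    accuracy_by_k[k] = model.score(X_test, y_test)  # mean accuracy on the test subset

for k, acc in accuracy_by_k.items():
    print(f"n_neighbors={k}: accuracy={acc:.3f}")
```

Picking the best value this way uses the test set for tuning, so the winning accuracy is likely to be somewhat optimistic; a separate validation split or cross-validation would give a fairer comparison.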

# Performance Metrics

In [None]:
metrics.confusion_matrix(y_test, y_pred) # Return confusion matrix

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred)) # Return classifier's accuracy metric
print("F1 Score:", metrics.f1_score(y_test, y_pred)) # Return classifier's F1 score (by default, label 1, benign, is the positive class)
print("Precision:", metrics.precision_score(y_test, y_pred)) # Return classifier's precision metric (positive class: label 1, benign)
print("Recall / Sensitivity:", metrics.recall_score(y_test, y_pred)) # Return classifier's recall (or sensitivity) metric (positive class: label 1, benign)
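
To see how these metrics relate to the confusion matrix, the sketch below recomputes them by hand. The labels `y_true` and `y_hat` are made up for illustration and are not output from the classifier above. It also shows the `pos_label` parameter, which changes which class the precision, recall, and F1 functions treat as positive.

```python
from sklearn import metrics

# Made-up labels for illustration only (1 = benign, 0 = malignant)
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_hat  = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# For binary labels 0/1, ravel() returns the cells in this order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many were correct
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

# Treat malignant (label 0) as the positive class instead
print("Recall for malignant:", metrics.recall_score(y_true, y_hat, pos_label=0))
```

In a screening context, the recall for the malignant class is often the number of most interest, so it is worth checking which class a reported "sensitivity" refers to.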

# Assess

Refer to the classification performance metric information contained in Module 1.1. How did your model perform?