# Task 2: kNN Classifier

### 1. Introduction & Objectives

In this task, we will process data and use a kNN Classifier on Breast Cancer Wisconsin (Diagnostic) Data Set. 

The dataset contains 569 instances of cancer biopsies, each with 30 features. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The dataset is available at the UCI Machine Learning Repository.

Our aim in this assignment is to learn about the usage of kNN algorithm with the aforementioned dataset, byt training our own classifier model by splitting the available data in to a training set  for a classifier, and using the rest of the data to test our trained classifier model. We will measure the classifiers accuracy, precision and recall by giving it a test group where we compare the verified data to the classifiers predictions.

The objectives of the assignment are:
1. To learn to use the kNN algorithm for classification problems
2. To learn to evaluate the performance of a classifier.

### 2. Data Understanding

#### 2.1. Importing the Libraries and Loading the Dataset

Let's start by importing the necessary libraries for loading the dataset.

In [18]:
from ucimlrepo import fetch_ucirepo

In [19]:
# Load the dataset
data = fetch_ucirepo(id=17)

# Data (as pandas dataframes)
features = data.data.features
targets = data.data.targets

features = features.copy()
targets = targets.copy()

#### 2.2. Summary of variables

Let's check the first few rows of the dataset to understand the variables and their types.

In [20]:
print("Features:")
features.head()

Features:


Unnamed: 0,radius1,texture1,perimeter1,area1,smoothness1,compactness1,concavity1,concave_points1,symmetry1,fractal_dimension1,...,radius3,texture3,perimeter3,area3,smoothness3,compactness3,concavity3,concave_points3,symmetry3,fractal_dimension3
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [28]:
print("Targets:")
targets.head()

Targets:


Unnamed: 0,Diagnosis
0,0
1,0
2,0
3,0
4,0


As we can see, the dataset contains 30 features and 1 target variable. The target variable is a string variable with two classes: 'M' (Malignant) and 'B' (Benign). The features are as follows: radius, texture, perimeter, smoothness, compactnes, convacity, conclave_points, symmetry and fractal_dimension. All of the features are represented in numerical variables. The features are measured across three different sizes (1, 2, 3) providing a total of 30 features.

#### 2.3. Data Preprocessing

According to the UC Irvine Machine Learning Repository, the dataset is clean and does not contain any missing values, so we don't need to perform any data cleaning operations.

### 3. Data Preparation

#### 3.1 Conversion of Target Variable

Let's convert the target variable to a binary variable. We will convert 'M' to 1 and 'B' to 0, so it can be used in the kNN classifier. After the conversion, we will check the results.

In [33]:
# Convert target variable to binary
targets["Diagnosis"] = [1 if x == 'M' else 0 for x in targets["Diagnosis"]]
targets.head()

Unnamed: 0,Diagnosis
0,0
1,0
2,0
3,0
4,0


The target variable has been successfully converted to a binary variable. The target variable now contains 1 for 'M' and 0 for 'B'.

#### 3.2 Standardization of Features

We will standardize the features so that they have a mean of 0 and a standard deviation of 1. This is important because the kNN algorithm is sensitive to the scale of the features. We will use the StandardScaler from the scikit-learn library to standardize the features.

In [23]:
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

In [27]:
features_scaled

array([[ 1.09706398, -2.07333501,  1.26993369, ...,  2.29607613,
         2.75062224,  1.93701461],
       [ 1.82982061, -0.35363241,  1.68595471, ...,  1.0870843 ,
        -0.24388967,  0.28118999],
       [ 1.57988811,  0.45618695,  1.56650313, ...,  1.95500035,
         1.152255  ,  0.20139121],
       ...,
       [ 0.70228425,  2.0455738 ,  0.67267578, ...,  0.41406869,
        -1.10454895, -0.31840916],
       [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
         1.91908301,  2.21963528],
       [-1.80840125,  1.22179204, -1.81438851, ..., -1.74506282,
        -0.04813821, -0.75120669]])

The features have been successfully standardized. The features are now centered around 0 with a standard deviation of 1.

### 4. 