# Assignment

## Business Understanding
The objectives of this assignment are:
- to learn to use the k-nearest neighbors algorithm for classification problems
- to learn to evaluate the performance of a classifier

Our task is to fetch a breast cancer diagnosis dataset and build a k-nearest neighbors (KNN) classifier that predicts whether a tumor is malignant or benign from its measured features. We will also evaluate the classifier's performance using appropriate metrics.
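
To make the idea behind KNN concrete, here is a minimal from-scratch sketch of the voting step. It is an illustration only: the notebook itself relies on scikit-learn's `KNeighborsClassifier`, and the function name `knn_predict` and the 2-D points below are made up for this example rather than taken from the dataset.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    # Euclidean distance from the query to every training point, paired with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k smallest distances and let their labels vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up 2-D points (hypothetical, not real measurements).
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["B", "B", "B", "M", "M", "M"]

prediction = knn_predict(train_points, train_labels, query=(3.9, 4.1), k=3)
print(prediction)
```

Because the vote depends directly on distances, features measured on very different scales can dominate the result, which is one reason a scaler such as `StandardScaler` is commonly applied before fitting KNN.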


## Data Understanding
Let's load our dataset and investigate its structure and contents. The dataset consists mostly of numerical measurements of the cell nuclei in each sample, such as:
- radius            - Mean of distances from center to points on the perimeter
- texture           - Standard deviation of gray-scale values
- perimeter         - Length of the nucleus outline
- area              - Area of the nucleus
- smoothness        - Local variation in radius lengths
- compactness       - Calculated as: (perimeter^2 / area) - 1.0
- concavity         - Severity of concave portions of the contour
- concave points    - Number of concave portions of the contour
- symmetry          - Symmetry of the nucleus
- fractal dimension - "Coastline approximation" - 1

Each measurement appears three times, as a mean, a standard error, and a "worst" (largest) value, hence the numeric suffixes 1 to 3 in the column names.

We also have a variable called "Diagnosis", which indicates whether a tumor is malignant or benign:
- M = Malignant
- B = Benign


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from ucimlrepo import fetch_ucirepo

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

# fetch dataset
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
data = breast_cancer_wisconsin_diagnostic.data.original
data.head()

| | ID | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | ... | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 | Diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | M |
| 1 | 842517 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | M |
| 2 | 84300903 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | M |
| 3 | 84348301 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | M |
| 4 | 84358402 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | M |


The preview shows that each row represents one breast cancer case, with a set of features measured for that case. The target variable is Diagnosis, which indicates whether the tumor is malignant or benign.
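
Beyond `head()`, a few quick checks help confirm the structure: the shape, the column types, and whether any values are missing. The sketch below runs on a toy frame named `sample` (a hypothetical stand-in built from three rows and four columns of the preview above), so it works without downloading anything; the same calls should apply to `data`.

```python
import pandas as pd

# Toy stand-in for `data`: three rows and four columns copied from the preview.
sample = pd.DataFrame({
    "ID": [842302, 842517, 84300903],
    "radius1": [17.99, 20.57, 19.69],
    "texture1": [10.38, 17.77, 21.25],
    "Diagnosis": ["M", "M", "M"],
})

# Number of rows and columns.
n_rows, n_cols = sample.shape

# Count of missing values in each column.
missing_per_column = sample.isna().sum()

# Names of the numeric columns (candidate features, plus the ID column).
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print(f"{n_rows} rows x {n_cols} columns")
print(sample.dtypes)
print(missing_per_column)
print(numeric_columns)
```

Note that `ID` is numeric but only identifies the case, so it would normally be excluded from the features passed to a classifier.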

Next, we want to check how many malignant and benign cases the dataset contains to understand the class distribution.

In [None]:
# Printing data distribution
diagnosis_counts = data['Diagnosis'].value_counts()
print("Diagnosis distribution:")
print(diagnosis_counts)
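
Relative shares are often easier to read than raw counts when judging class balance. As a small sketch, `value_counts(normalize=True)` returns proportions instead of counts. The `diagnosis` series below is a hypothetical five-label stand-in for `data['Diagnosis']`, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical labels standing in for data['Diagnosis'].
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="Diagnosis")

# Absolute counts per class.
counts = diagnosis.value_counts()

# Proportion of each class; the proportions sum to 1.
shares = diagnosis.value_counts(normalize=True)

# Side-by-side view of counts and proportions.
summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)
```

If one class clearly dominates, plain accuracy can look better than the classifier really is, which is why the confusion matrix is a useful complement when evaluating the model.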

## Data Preparation

## Modeling

## Evaluation

## Deployment