## INTRO
Definition:

K-Nearest Neighbors is a supervised learning algorithm used for classification and regression tasks. It predicts the target for a new data point based on the majority label (for classification) or the average value (for regression) of the closest K points in the training set.

Key Concepts:
•	Distance Metric: KNN typically uses Euclidean distance to find the nearest neighbors.
•	Choosing K: The number of neighbors, K, affects the model's performance. Lower K values lead to more flexible decision boundaries, while higher K values give smoother boundaries between classes.
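
For reference, the Euclidean distance between two data points with $n$ features can be written as follows. The symbols $x$, $y$, and $n$ are chosen here for illustration and do not appear elsewhere in this notebook.

$$
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$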


#### Business Problem
Suppose we want to classify whether a tumor is benign or malignant based on features like size and texture. This is a supervised learning classification problem, requiring us to take the following steps:

> Step 1: Collect data with features and labels (benign or malignant).
> Step 2: For a new tumor, calculate the distance to each tumor in the training set.
> Step 3: Select the K closest tumors and use the majority label (since this is a classification problem; would use the average value if it were a regression task) to predict the class.
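
The three steps above can be sketched in plain Python. This is only an illustrative sketch: the function name `knn_predict` and the toy (size, texture) values are invented for the example and are not part of the dataset used later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest points and take the most common label
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Step 1: toy training data, invented for illustration: (size, texture)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_points, train_labels, (2.9, 3.1), k=3))  # prints "malignant"
```

For a regression task, the majority vote in the last step would be replaced by the average of the neighbors' values.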

#### Dataset Description

> Features: 30 numeric features (e.g., radius, texture, smoothness, symmetry, etc.)

> Target: Binary (0 = malignant, 1 = benign)

This dataset will serve well for a KNN demo as it includes various characteristics of tumors, allowing for a practical illustration of classification with KNN.

#### Data Collection

In [1]:
# import statements

from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()

# Create a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Adding the target column (0 = malignant, 1 = benign)

# Display first few rows
df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


#### Data Analysis/Cleaning

In [3]:
# doing a quick check to ascertain if the dataset is clean

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

The output shows 569 non-null values in every listed column, so there are no missing values to handle. Datasets bundled with Python libraries usually have few or no quality issues, which lets the student concentrate on the real task at hand.
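
A more direct check than reading `df.info()` is to count missing values and duplicated rows explicitly. The sketch below reloads the dataset so that it runs on its own; the variable names `missing_total` and `duplicate_rows` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the same DataFrame as in the data collection step
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Total number of missing values across all columns
missing_total = int(df.isnull().sum().sum())

# Number of rows that are exact duplicates of an earlier row
duplicate_rows = int(df.duplicated().sum())

print(f"Missing values: {missing_total}")
print(f"Duplicate rows: {duplicate_rows}")
```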

#### Exploratory Data Analysis (EDA)

In [4]:
# Check the correlation between each predictor variable (feature) and the target variable, and also
# identify features that are strongly correlated with each other so that redundant ones can be dropped.
# Since this project is based on medical data, any feature whose correlation with the target has an
# absolute value greater than 0.3 is considered good. Any two features whose correlation with each other
# has an absolute value greater than 0.5 are treated as redundant, and one of them is dropped.
# Notice that in this dataset, all variables are numeric.

df.corr()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
mean radius,1.0,0.323782,0.997855,0.987357,0.170581,0.506124,0.676764,0.822529,0.147741,-0.311631,...,0.297008,0.965137,0.941082,0.119616,0.413463,0.526911,0.744214,0.163953,0.007066,-0.730029
mean texture,0.323782,1.0,0.329533,0.321086,-0.023389,0.236702,0.302418,0.293464,0.071401,-0.076437,...,0.912045,0.35804,0.343546,0.077503,0.27783,0.301025,0.295316,0.105008,0.119205,-0.415185
mean perimeter,0.997855,0.329533,1.0,0.986507,0.207278,0.556936,0.716136,0.850977,0.183027,-0.261477,...,0.303038,0.970387,0.94155,0.150549,0.455774,0.563879,0.771241,0.189115,0.051019,-0.742636
mean area,0.987357,0.321086,0.986507,1.0,0.177028,0.498502,0.685983,0.823269,0.151293,-0.28311,...,0.287489,0.95912,0.959213,0.123523,0.39041,0.512606,0.722017,0.14357,0.003738,-0.708984
mean smoothness,0.170581,-0.023389,0.207278,0.177028,1.0,0.659123,0.521984,0.553695,0.557775,0.584792,...,0.036072,0.238853,0.206718,0.805324,0.472468,0.434926,0.503053,0.394309,0.499316,-0.35856
mean compactness,0.506124,0.236702,0.556936,0.498502,0.659123,1.0,0.883121,0.831135,0.602641,0.565369,...,0.248133,0.59021,0.509604,0.565541,0.865809,0.816275,0.815573,0.510223,0.687382,-0.596534
mean concavity,0.676764,0.302418,0.716136,0.685983,0.521984,0.883121,1.0,0.921391,0.500667,0.336783,...,0.299879,0.729565,0.675987,0.448822,0.754968,0.884103,0.861323,0.409464,0.51493,-0.69636
mean concave points,0.822529,0.293464,0.850977,0.823269,0.553695,0.831135,0.921391,1.0,0.462497,0.166917,...,0.292752,0.855923,0.80963,0.452753,0.667454,0.752399,0.910155,0.375744,0.368661,-0.776614
mean symmetry,0.147741,0.071401,0.183027,0.151293,0.557775,0.602641,0.500667,0.462497,1.0,0.479921,...,0.090651,0.219169,0.177193,0.426675,0.4732,0.433721,0.430297,0.699826,0.438413,-0.330499
mean fractal dimension,-0.311631,-0.076437,-0.261477,-0.28311,0.584792,0.565369,0.336783,0.166917,0.479921,1.0,...,-0.051269,-0.205151,-0.231854,0.504942,0.458798,0.346234,0.175325,0.334019,0.767297,0.012838


From the results above, notice that `mean radius` is highly correlated with `mean perimeter`, `mean area`, `mean compactness`, `mean concavity`, `mean concave points`, `worst perimeter`, `worst area`, `worst concavity`, and `worst concave points`. These will all be dropped, since they largely carry the same information as `mean radius`.

We must also check that `mean radius` is well correlated with the target variable before using it as a feature to build our model. In this case, the correlation is -0.73, which is strong enough.
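
Reading these values off the full correlation matrix by eye is error-prone, so the same checks can be done programmatically. The sketch below is one possible way to do it, not the procedure used to pick the features in this notebook; the names `target_corr` and `radius_corr` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the same DataFrame as in the data collection step
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

corr = df.corr()

# Absolute correlation of each feature with the target, strongest first
target_corr = corr['target'].drop('target').abs().sort_values(ascending=False)
print(target_corr.head(10))

# Features whose absolute correlation with 'mean radius' exceeds 0.5
radius_corr = corr['mean radius'].drop(['mean radius', 'target']).abs()
print(radius_corr[radius_corr > 0.5].sort_values(ascending=False))
```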

For the same reasons as stated above, the other features that qualify to be used are `mean texture`, `mean smoothness`, `mean fractal dimension`, `radius error`, `texture error`, `smoothness error`, and `fractal dimension error`. Together with `mean radius`, this gives a total of 8 key features out of the original 30.

Since the df contains many columns, dropping the unwanted ones (22 in this case) would be laborious and time-consuming, as it would require listing every feature to drop.

So, we will create a new df that includes only the features we want, which is a smarter technique!

In [5]:
# List of columns to keep
selected_features = ['mean radius', 'mean texture', 'mean smoothness', 'mean fractal dimension', 'radius error',
'texture error', 'smoothness error', 'fractal dimension error', 'target']

# Create a new DataFrame with only the selected features
df_1 = df[selected_features]

# Display the new DataFrame
df_1.head()

,mean radius,mean texture,mean smoothness,mean fractal dimension,radius error,texture error,smoothness error,fractal dimension error,target
0,17.99,10.38,0.1184,0.07871,1.095,0.9053,0.006399,0.006193,0
1,20.57,17.77,0.08474,0.05667,0.5435,0.7339,0.005225,0.003532,0
2,19.69,21.25,0.1096,0.05999,0.7456,0.7869,0.00615,0.004571,0
3,11.42,20.38,0.1425,0.09744,0.4956,1.156,0.00911,0.009208,0
4,20.29,14.34,0.1003,0.05883,0.7572,0.7813,0.01149,0.005115,0


#### Model Building

In [6]:
# To avoid array mismatching issues, we split df_1 into two objects: the target column ('target')
# and a dataframe of the predictor columns (extracting these from df_1)

target_df=df_1.target
predictors_df = df_1.drop(['target'], axis=1)

In [7]:
# checking to see if both were properly created

target_df.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int32

In [8]:
predictors_df.head()

,mean radius,mean texture,mean smoothness,mean fractal dimension,radius error,texture error,smoothness error,fractal dimension error
0,17.99,10.38,0.1184,0.07871,1.095,0.9053,0.006399,0.006193
1,20.57,17.77,0.08474,0.05667,0.5435,0.7339,0.005225,0.003532
2,19.69,21.25,0.1096,0.05999,0.7456,0.7869,0.00615,0.004571
3,11.42,20.38,0.1425,0.09744,0.4956,1.156,0.00911,0.009208
4,20.29,14.34,0.1003,0.05883,0.7572,0.7813,0.01149,0.005115


In [9]:
# instantiating a variable called knn to hold our KNN model

knn = KNeighborsClassifier(n_neighbors=3)
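
The value `n_neighbors=3` is only one possible choice. One common way to compare several values of K is cross-validation. The sketch below is an assumption about how that comparison could be done and is not part of the original workflow; the candidate list of K values and the 5-fold setting are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Same 8 features as selected earlier in this notebook
features = ['mean radius', 'mean texture', 'mean smoothness', 'mean fractal dimension',
            'radius error', 'texture error', 'smoothness error', 'fractal dimension error']

data = load_breast_cancer(as_frame=True)
X = data.data[features]
y = data.target

# Mean 5-fold cross-validated accuracy for a few odd values of K
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

for k, score in scores.items():
    print(f"K={k}: mean accuracy = {score:.3f}")

# K with the highest mean accuracy among the candidates tried
best_k = max(scores, key=scores.get)
print(f"Best K among those tried: {best_k}")
```

Odd values of K are used here so that a two-class vote cannot end in a tie.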

#### Model Training

In [10]:
# Now we call the .fit method on the predictors and the target to train our model; running this cell
# performs the training

knn.fit(predictors_df, target_df)


#### Model Testing

In [11]:
# Now that our model is trained, we use it to predict the class of a new tumor by passing made-up
# values for each of the predictor variables in predictors_df (in the same column order).
# The result is stored in a variable called 'prediction'.
# knn is now our trained model


prediction = knn.predict([[50, 30, 0.5, 0.0003, 0.444, 0.55552, 1.55555, 0.444563]])  

# printing the results of prediction to see the diagnosis
prediction



array([0])

So, the model predicted that the tumor with the supplied measurements is malignant [0], since this dataset encodes 0 = malignant and 1 = benign.
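
To see which training samples drove this prediction, the fitted classifier's `kneighbors` method returns the distances to the nearest neighbors and their row positions. The sketch below rebuilds the model so that it runs on its own; the names `new_tumor` and `neighbor_labels` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

# Same 8 features as selected earlier in this notebook
features = ['mean radius', 'mean texture', 'mean smoothness', 'mean fractal dimension',
            'radius error', 'texture error', 'smoothness error', 'fractal dimension error']

data = load_breast_cancer(as_frame=True)
X = data.data[features]
y = data.target

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Same values as in the prediction cell, wrapped in a DataFrame so the column names match
new_tumor = pd.DataFrame([[50, 30, 0.5, 0.0003, 0.444, 0.55552, 1.55555, 0.444563]],
                         columns=features)

# Distances to, and row positions of, the 3 nearest training samples
distances, indices = knn.kneighbors(new_tumor)
neighbor_labels = y.iloc[indices[0]].tolist()

print("Neighbor distances:", distances[0])
print("Neighbor labels:", neighbor_labels)
print("Predicted class:", data.target_names[knn.predict(new_tumor)[0]])
```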

Note that in this project, we skipped the step of splitting our dataset into train and test sets. We used the entire dataset to train the model and a single made-up set of values to test it because we are not data-rich here (the dataset has only 569 rows).

#### Model Evaluation

We cannot evaluate this model at this time because we did not split our dataset into train and test sets.
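
For readers who want to see what that step could look like, here is a minimal sketch of a hold-out evaluation. It is not part of the workflow above: the 80/20 split, the `random_state=42` seed, and the use of accuracy and a confusion matrix are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same 8 features as selected earlier in this notebook
features = ['mean radius', 'mean texture', 'mean smoothness', 'mean fractal dimension',
            'radius error', 'texture error', 'smoothness error', 'fractal dimension error']

data = load_breast_cancer(as_frame=True)
X = data.data[features]
y = data.target

# Hold out 20% of the rows for testing; stratify keeps the class balance similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train on the training rows only, then predict the held-out rows
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion matrix (rows = actual, columns = predicted):")
print(matrix)
```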