<h2 style='color:blue' align="center">K Nearest Neighbor Classification</h2>

**We will predict whether a patient has cancer or not, based on 9 different attributes.**

The attributes are: <br>
**1. Clump Thickness <br>
2. Uniformity of Cell Size <br>
3. Uniformity of Cell Shape <br>
4. Marginal Adhesion <br>
5. Single Epithelial Cell Size <br>
6. Bare Nuclei <br>
7. Bland Chromatin <br>
8. Normal Nucleoli <br>
9. Mitoses**

Install pandas and scikit-learn

pip install pandas

pip install scikit-learn

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualization purposes
import seaborn as sns # for data visualization
%matplotlib inline

In [2]:
knn_df = pd.read_csv("5_breast-cancer-wisconsin.data")
knn_df.head()

Unnamed: 0,Id,Clump_thickness,Uniformity_Cell_Size,Uniformity_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


**Analyze Shape of Data**

In [3]:
knn_df.shape

(683, 11)

*We can see that there are **683 instances and 11 columns** in the data set.*<br>

The dataset description lists 10 attributes (including the Id) and 1 Class column, which is the target variable. <br>

So, we have **10 attributes and 1 target** variable.

**Drop redundant columns** <br>
We should drop any redundant columns that have no predictive power. <br>
Here, Id is the redundant column: it only identifies the sample.

In [4]:
knn_df.drop('Id', axis=1, inplace=True)

**View Summary of Dataset**

In [5]:
knn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Clump_thickness              683 non-null    int64
 1   Uniformity_Cell_Size         683 non-null    int64
 2   Uniformity_Cell_Shape        683 non-null    int64
 3   Marginal_Adhesion            683 non-null    int64
 4   Single_Epithelial_Cell_Size  683 non-null    int64
 5   Bare_Nuclei                  683 non-null    int64
 6   Bland_Chromatin              683 non-null    int64
 7   Normal_Nucleoli              683 non-null    int64
 8   Mitoses                      683 non-null    int64
 9   Class                        683 non-null    int64
dtypes: int64(10)
memory usage: 53.5 KB


*We can see that the **Id** column has been removed from the dataset.* <br>

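We can see that all 10 remaining columns are stored as `int64`. There are **9 numerical feature variables and 1 categorical target variable** (`Class`), which is encoded as integers: 2 for benign and 4 for malignant.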
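Because `Class` is stored as integers, it is easy to treat it as a quantity by mistake. The following is a minimal sketch of how one might check the class balance and attach readable names. It uses a small hand-made frame, not the real file; `toy_df` and `class_names` are illustrative names, and the 2 = benign / 4 = malignant mapping is taken from the dataset description.

```python
import pandas as pd

# Toy stand-in for the Class column (made-up values, not the real data).
toy_df = pd.DataFrame({"Class": [2, 2, 4, 2, 4, 2]})

# Per the dataset description: 2 = benign, 4 = malignant.
class_names = {2: "benign", 4: "malignant"}

# Count how many rows fall into each class.
class_counts = toy_df["Class"].value_counts().sort_index()
print(class_counts)

# Readable labels for reporting; the integer column itself is left untouched.
readable = toy_df["Class"].map(class_names)
print(readable.tolist())
```

Running the same two calls on `knn_df['Class']` would show how balanced the real data is, which helps when reading the accuracy figure later.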

**Divide our data into attributes and labels**

In [6]:
knn_X = knn_df.drop(['Class'], axis=1)
knn_Y = knn_df['Class']

*Here `knn_X` contains all the columns from the dataset except "Class", which is the label, and `knn_Y` contains the values from the "Class" column. `knn_X` is our attribute set and `knn_Y` holds the corresponding labels.*

In [7]:
knn_X

Unnamed: 0,Clump_thickness,Uniformity_Cell_Size,Uniformity_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses
0,5,1,1,1,2,1,3,1,1
1,5,4,4,5,7,10,3,2,1
2,3,1,1,1,2,2,3,1,1
3,6,8,8,1,3,4,3,7,1
4,4,1,1,3,2,1,3,1,1
...,...,...,...,...,...,...,...,...,...
678,3,1,1,1,3,2,1,1,1
679,2,1,1,1,2,1,1,1,1
680,5,10,10,3,7,3,8,10,2
681,4,8,6,4,3,4,10,6,1


In [8]:
knn_Y

0      2
1      2
2      2
3      2
4      2
      ..
678    2
679    2
680    4
681    4
682    4
Name: Class, Length: 683, dtype: int64

**Divide our data into training and test sets.** <br>
*Hold out 20% of the data as the test set and use the remaining 80% for training.*

In [9]:
from sklearn.model_selection import train_test_split
knn_X_train, knn_X_test, knn_Y_train, knn_Y_test = train_test_split(knn_X, knn_Y, test_size = 0.2, random_state = 0)

**Check the shape of knn_X_train and knn_X_test**

In [10]:
knn_X_train.shape, knn_X_test.shape

((546, 9), (137, 9))
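The 546/137 split follows from `test_size = 0.2`: scikit-learn rounds the test count up, so ceil(0.2 × 683) = 137 rows go to the test set. The sketch below reproduces those sizes on synthetic data with the same number of rows. The `stratify` argument is an optional extra shown for illustration only; the notebook itself does not use it, and `X_demo` / `y_demo` are made-up names and values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the notebook's data.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(683, 9))
y_demo = rng.choice([2, 4], size=683, p=[0.65, 0.35])

# stratify=y_demo keeps the class ratio similar in both parts.
# This argument is an addition for illustration; the notebook does not use it.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
# Share of class 4 in each part; with stratification these should be close.
print((y_tr == 4).mean(), (y_te == 4).mean())
```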

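**Feature Scaling** <br>

KNN relies on distances between points, so a feature with a larger range would dominate the result. We therefore map all the feature variables onto the same scale, fitting the scaler on the training set only and reusing it on the test set.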
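To make the scaling step concrete, here is a small sketch of what `StandardScaler` computes: each value becomes (x − mean) / std, where the mean and standard deviation are learned from the training rows. The arrays below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up train/test arrays (two features on very different scales).
train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same result by hand: z = (x - mean) / std, using the population std (ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(train_scaled.mean(axis=0))  # each column is centered near 0
print(test_scaled, manual_test)   # the two results should agree
```

Calling `transform` rather than `fit_transform` on the test set is what keeps test-set statistics from leaking into the preprocessing.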

In [11]:
knn_cols = knn_X_train.columns

In [12]:
from sklearn.preprocessing import StandardScaler
knn_scaler = StandardScaler()

knn_X_train = knn_scaler.fit_transform(knn_X_train)
knn_X_test = knn_scaler.transform(knn_X_test)

In [13]:
# Pass the column Index directly; wrapping it in a list can produce MultiIndex columns.
knn_X_train = pd.DataFrame(knn_X_train, columns=knn_cols)
knn_X_test = pd.DataFrame(knn_X_test, columns=knn_cols)

**Check the Feature Scaled Data**

In [14]:
knn_X_train.head()

Unnamed: 0,Clump_thickness,Uniformity_Cell_Size,Uniformity_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses
0,1.988395,-0.697811,-0.741526,-0.633637,-0.54872,1.815536,0.619074,0.345321,-0.338637
1,-1.224684,-0.697811,-0.741526,-0.633637,-0.997897,-0.682796,-0.188607,-0.621578,-0.338637
2,0.203351,-0.697811,-0.741526,-0.633637,-0.54872,-0.682796,-0.188607,-0.621578,-0.338637
3,-0.510666,-0.697811,-0.404973,-0.633637,-0.54872,-0.682796,-0.592447,-0.621578,-0.338637
4,1.274378,-0.372444,-0.06842,-0.633637,1.247988,-0.127611,1.426754,-0.621578,-0.338637


**Train the K Neighbors Classifier on this data and make predictions.** <br>

**3 Nearest Neighbors** Classifier

In [15]:
from sklearn.neighbors import KNeighborsClassifier
knn_3 = KNeighborsClassifier(n_neighbors=3)

knn_3.fit(knn_X_train, knn_Y_train)

KNeighborsClassifier(n_neighbors=3)
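For intuition about what the fitted classifier does at prediction time, the following is a simplified from-scratch sketch of a 3-nearest-neighbor vote using Euclidean distance and equal weights, which correspond to the `KNeighborsClassifier` defaults. It is not scikit-learn's implementation (which uses faster neighbor search and its own tie handling); `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up 2-D points: label 2 clusters near the origin, label 4 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [5.0, 5.0], [5.2, 4.8], [4.7, 5.1]])
y_toy = np.array([2, 2, 2, 4, 4, 4])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))  # query near the label-2 cluster
print(knn_predict(X_toy, y_toy, np.array([4.9, 5.0])))  # query near the label-4 cluster
```

Note that "fitting" a KNN model mostly means storing the training data; the real work happens when a prediction is requested.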

**Make predictions on the test data.**

In [16]:
knn_Y_pred = knn_3.predict(knn_X_test)

In [17]:
knn_Y_pred

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 4, 2, 4, 2, 2, 4, 4, 4, 4, 2, 2, 2,
       4, 2, 4, 4, 2, 2, 2, 4, 2, 4, 4, 2, 2, 2, 4, 4, 2, 4, 2, 2, 2, 2,
       2, 2, 2, 4, 2, 2, 4, 2, 4, 2, 2, 2, 4, 4, 2, 4, 2, 2, 2, 2, 2, 2,
       2, 2, 4, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 4, 2, 4,
       4, 2, 4, 2, 4, 4, 4, 2, 4, 4, 4, 2, 2, 2, 4, 4, 2, 2, 4, 4, 2, 2,
       4, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 4, 2, 4, 2, 4, 2, 2,
       4, 2, 2, 4, 2])

In [18]:
knn_Y_test

113    2
378    2
303    4
504    4
301    2
      ..
21     4
454    2
506    2
500    4
77     2
Name: Class, Length: 137, dtype: int64

**Validate model performance by comparing predicted vs. actual values of Y for X_test**

**Precision and recall, F1 measure, accuracy and confusion matrix**

In [19]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(knn_Y_test, knn_Y_pred))
print(classification_report(knn_Y_test, knn_Y_pred))
print(accuracy_score(knn_Y_test, knn_Y_pred)) 

[[84  3]
 [ 1 49]]
              precision    recall  f1-score   support

           2       0.99      0.97      0.98        87
           4       0.94      0.98      0.96        50

    accuracy                           0.97       137
   macro avg       0.97      0.97      0.97       137
weighted avg       0.97      0.97      0.97       137

0.9708029197080292
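The figures in the report can be recomputed by hand from the confusion matrix, which is a useful check on how to read it. The sketch below copies the matrix printed above (rows are actual classes, columns are predicted classes, in label order 2 then 4) and treats class 4 as the positive class; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual class, columns = predicted class, label order [2, 4].
cm = np.array([[84, 3],
               [1, 49]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Treating class 4 as the positive class.
tp = cm[1, 1]  # actual 4, predicted 4
fn = cm[1, 0]  # actual 4, predicted 2
fp = cm[0, 1]  # actual 2, predicted 4

precision_4 = tp / (tp + fp)
recall_4 = tp / (tp + fn)
f1_4 = 2 * precision_4 * recall_4 / (precision_4 + recall_4)

print(round(accuracy, 4), round(precision_4, 2), round(recall_4, 2), round(f1_4, 2))
# → 0.9708 0.94 0.98 0.96
```

In other words, 133 of the 137 test samples are classified correctly, with 3 class-2 samples predicted as class 4 and 1 class-4 sample predicted as class 2.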
