# KNN - K Nearest Neighbors - Classification

To understand KNN for classification, we'll work with a simple dataset representing gene expression levels. A gene expression level is calculated as the ratio between the expression of the target gene (i.e., the gene of interest) and the expression of one or more reference genes (often housekeeping genes). This dataset is synthetic and specifically designed to show some of the strengths and limitations of using KNN for classification.


More info on gene expression: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/gene-expression-level

## Imports

In [1]:
import numpy as np
import pandas as pd


## Data

In [2]:
df = pd.read_csv('gene_expression.csv')

In [3]:
df.head()

Unnamed: 0,Gene One,Gene Two,Cancer Present
0,4.3,3.9,1
1,2.5,6.3,0
2,5.7,3.9,1
3,6.1,6.2,0
4,7.4,3.4,1


## Train|Test Split and Scaling Data

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [5]:
X = df.drop('Cancer Present',axis=1)
y = df['Cancer Present']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [7]:
scaler = StandardScaler()

In [8]:
# Fit the scaler on the training set only, then reuse its mean and std on the test set
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
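
The scaler should learn its mean and standard deviation from the training data only, so that no information from the test set leaks into preprocessing. The sketch below illustrates this on made-up data; `toy_train`, `toy_test`, and `toy_scaler` are hypothetical names, not part of the gene expression dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for X_train / X_test
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(30, 2))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train mean/std

# transform() applies (x - train_mean) / train_std to each column
manual = (toy_test - toy_scaler.mean_) / toy_scaler.scale_
print(np.allclose(manual, toy_test_scaled))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns generally do not, which is expected.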

In [9]:
from sklearn.neighbors import KNeighborsClassifier

In [10]:
knn_model = KNeighborsClassifier(n_neighbors=1)

In [11]:
knn_model.fit(scaled_X_train,y_train)

KNeighborsClassifier(n_neighbors=1)
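
To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of a KNN prediction: compute the distance from a query point to every training point, take the `k` closest, and return the majority label. The helper `knn_predict` and the toy arrays are hypothetical and use plain Euclidean distance; scikit-learn's implementation is more general and more efficient.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def knn_predict(train_X, train_y, query, k):
    """Hypothetical helper: majority vote among the k nearest training points."""
    distances = np.linalg.norm(train_X - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up, well-separated toy data: three points per class
toy_X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
query_point = np.array([5.5, 6.5])

manual_label = knn_predict(toy_X, toy_y, query_point, k=3)
sk_label = (KNeighborsClassifier(n_neighbors=3)
            .fit(toy_X, toy_y)
            .predict(query_point.reshape(1, -1))[0])
print(manual_label, sk_label)
```

With `k=1` the prediction is simply the label of the single closest training point, which is why very small K values tend to be sensitive to noise.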

## Model Evaluation

### Metrics
*****************************************************************************************************************
Accuracy: the proportion of all cases in the dataset that the model classifies correctly.\
Precision: the proportion of predicted positives that are truly positive, i.e. the ability of the model to flag only relevant instances.\
Recall: the proportion of actual positives that the model detects, i.e. the ability of the model to find all instances of a class.
*****************************************************************************************************************

#### 1. Accuracy
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

#### 2. Precision
$$
\text{Precision} = \frac{TP}{TP + FP}
$$

#### 3. Recall
$$
\text{Recall} = \frac{TP}{TP + FN}
$$

Where:
- \( TP \) = True Positives
- \( TN \) = True Negatives
- \( FP \) = False Positives
- \( FN \) = False Negatives
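
As an illustration of the three formulas, the sketch below computes them by hand from a confusion matrix and compares the results with scikit-learn's own metric functions. The label arrays `y_true_toy` and `y_pred_toy` are made up for this example.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true and predicted labels (1 = positive class)
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)
print(accuracy_score(y_true_toy, y_pred_toy),
      precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy))
```

Here TP = 4, TN = 4, FP = 1 and FN = 1, so all three metrics equal 0.8. On imbalanced data the three values usually diverge, which is why accuracy alone can be misleading.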

In [12]:
y_pred = knn_model.predict(scaled_X_test)

In [13]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

## Elbow Method for Choosing Reasonable K Values

**NOTE: This uses the test set for the hyperparameter selection of K, so the test error of the chosen K is an optimistic estimate of performance on unseen data.**

In [39]:
test_error_rates = []


for k in range(1,30):
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(scaled_X_train,y_train) 
   
    y_pred_test = knn_model.predict(scaled_X_test)
    
    test_error = 1 - accuracy_score(y_test,y_pred_test)
    test_error_rates.append(test_error)

In [19]:
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(10,6),dpi=200)
plt.plot(range(1,30),test_error_rates,label='Test Error')
plt.legend()
plt.ylabel('Error Rate')
plt.xlabel("K Value")
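
One way to avoid tuning K on the test set is to carve a separate validation set out of the training data and run the same loop against it. The sketch below assumes synthetic data from `make_classification` as a stand-in for the gene expression data; the 60/20/20 split and all variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two features
X_demo, y_demo = make_classification(n_samples=400, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

# 60% train, 20% validation, 20% test
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Scaling statistics come from the training portion only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s, X_val_s, X_te_s = (demo_scaler.transform(a) for a in (X_tr, X_val, X_te))

k_values = list(range(1, 30))
val_error_rates = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    val_error_rates.append(1 - accuracy_score(y_val, model.predict(X_val_s)))

# Pick K on the validation set, then touch the test set exactly once
best_k = k_values[int(np.argmin(val_error_rates))]
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr_s, y_tr)
test_error = 1 - accuracy_score(y_te, final_model.predict(X_te_s))
print(best_k, test_error)
```

K-fold cross-validation on the training set is a common alternative when the dataset is too small to spare a dedicated validation split.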

## Final Model



In [42]:

knn5 = KNeighborsClassifier(n_neighbors=5)


In [43]:
knn5.fit(scaled_X_train,y_train)

KNeighborsClassifier()

In [44]:
# The model was trained on scaled features, so predict on the scaled test set
y_pred = knn5.predict(scaled_X_test)



In [None]:
print(classification_report(y_test,y_pred))

In [None]:
confusion_matrix(y_test,y_pred)
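
Because a KNN model trained on scaled features must also receive scaled features at prediction time, it can help to bundle the scaler and the classifier into a single object. The sketch below uses scikit-learn's `make_pipeline` on synthetic stand-in data and checks that it matches the manual scale-then-predict steps; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two features
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The pipeline scales raw inputs internally on both fit() and predict()
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)
pipe_pred = knn_pipe.predict(X_te)  # raw, unscaled test features go in

# Equivalent manual steps for comparison
manual_scaler = StandardScaler().fit(X_tr)
manual_knn = KNeighborsClassifier(n_neighbors=5).fit(manual_scaler.transform(X_tr), y_tr)
manual_pred = manual_knn.predict(manual_scaler.transform(X_te))

print(np.array_equal(pipe_pred, manual_pred))
```

With this design there is no separate scaled array to keep track of, so the raw test features can be passed directly to `predict`.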