<a href="https://colab.research.google.com/github/3m6d/ML-techniques-practise/blob/main/KNearestNeighbours.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



### K-Nearest Neighbors (KNN)
KNN is a supervised learning algorithm used for both classification and regression. It is an instance-based (lazy) learner: it does not fit an explicit model during training. Instead, it stores the training data and predicts a new point from its k most similar (nearest) training points, by majority vote for classification or by averaging for regression.
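
To make the "memorize, then compare" idea concrete, here is a minimal from-scratch sketch. The helper `knn_predict` and the toy arrays `x_toy` and `y_toy` are hypothetical names made up for illustration; this is not how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(x_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small, well-separated clusters
x_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
y_toy = np.array(["dis_a", "dis_a", "dis_a", "dis_b", "dis_b", "dis_b"])

print(knn_predict(x_toy, y_toy, np.array([1.2, 1.4])))  # dis_a
print(knn_predict(x_toy, y_toy, np.array([8.7, 8.2])))  # dis_b
```

Note that there is no training step at all: all the work happens at prediction time, which is why KNN is called a lazy learner.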

## When to Use Manhattan vs Euclidean Distance

### Manhattan Distance
- Use when:
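  - The data is high-dimensional (many features).
  - The data contains outliers or large gaps; differences are not squared, so one extreme value has less influence than under Euclidean distance.
  - The data is sparse (e.g., many zero entries). Missing values must still be handled before any distance is computed.
  - The features are mostly discrete.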

### Euclidean Distance
- Use when:
  - Data is continuous.
  - The data distribution is more uniform without significant gaps or sparse regions.
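
As a quick illustration of how the two metrics differ, the sketch below computes both by hand for two made-up points `p` and `q` (hypothetical values, not taken from the disease dataset).

```python
import numpy as np

# Two made-up points in a 2-feature space
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Manhattan: sum of absolute differences -> |1-4| + |2-6| = 7
manhattan = np.sum(np.abs(p - q))
# Euclidean: square root of summed squared differences -> sqrt(9 + 16) = 5
euclidean = np.sqrt(np.sum((p - q) ** 2))

print(manhattan, euclidean)  # 7.0 5.0
```

In scikit-learn, `KNeighborsClassifier` uses Euclidean distance by default (Minkowski with `p=2`); passing `metric="manhattan"` or `p=1` switches it to Manhattan distance.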


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [7]:

file_path = '/content/drive/MyDrive/disease.csv'
df = pd.read_csv(file_path)

print(df.head())


   Unnamed: 0          a          b Category
0           0  37.454012   9.256646    dis_a
1           1  95.071431  27.095047    dis_a
2           2  73.199394  43.647292    dis_a
3           3  59.865848  36.611244    dis_a
4           4  15.601864  40.328057    dis_a


In [8]:
x = df[['a','b']].to_numpy()
y = df['Category'].to_numpy()

In [9]:
x.shape, y.shape

((2000, 2), (2000,))

In [12]:
x,y

(array([[37.45401188,  9.25664644],
        [95.07143064, 27.09504737],
        [73.19939418, 43.64729179],
        ...,
        [30.97878592, 19.72861577],
        [29.0045532 , 26.49702935],
        [87.14140342,  8.06836792]]),
 array(['dis_a', 'dis_a', 'dis_a', ..., 'dis_b', 'dis_b', 'dis_b'],
       dtype=object))

Why 42 is a Common Example Value:

You will often see `random_state=42` in examples. It is a reference to Douglas Adams' *The Hitchhiker's Guide to the Galaxy*, where 42 is the "Answer to the Ultimate Question of Life, the Universe, and Everything." It has become a fun "default" in the data science community, but any integer works for ensuring reproducibility.

In summary, setting random_state makes your experiments reproducible and consistent when you use random operations in data processing or model training.
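
A small sketch of what reproducibility means in practice, assuming a stand-in array `data` rather than the disease dataset: two splits with the same seed are identical, while a different seed generally shuffles the rows differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # stand-in data, not the disease dataset

# Same seed: the two splits are identical
_, test_a = train_test_split(data, test_size=0.2, random_state=42)
_, test_b = train_test_split(data, test_size=0.2, random_state=42)
print(np.array_equal(test_a, test_b))  # True

# A different seed generally produces a different split
_, test_c = train_test_split(data, test_size=0.2, random_state=7)
print(test_a, test_c)
```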

In [20]:
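# Hold out 20% of the rows as a test set; the remaining 80% is used for training.
# random_state=42 fixes the shuffle so the split is reproducible.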

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)



In [21]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
# n_neighbors (k) is 3 here; the best value depends on your data.
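
One common way to pick k is to compare cross-validated accuracy over a few candidate values. The sketch below is an assumption about how that could be done, not part of the original notebook; it uses synthetic data (`x_demo`, `y_demo`) as a stand-in for `x_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; swap in x_train, y_train from the notebook
x_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=42
)

scores_by_k = {}
for k in (1, 3, 5, 7, 9, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores_by_k[k] = cross_val_score(model, x_demo, y_demo, cv=5).mean()

# Candidate k with the highest mean cross-validated accuracy
best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k)
print("best k:", best_k)
```

Odd values of k are used here so that a two-class vote cannot end in a tie.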

In [22]:
knn.fit(x_train,y_train)
# fit() only stores the training data (and may build a search index); KNN learns no parameters

In [15]:
print(x_test[33].shape)                 # a single sample is 1-D
print(x_test[33].reshape(1, -1).shape)  # reshaped to 1 row, 2 features

(2,)
(1, 2)


In [17]:
# predict() expects a 2-D array of shape (n_samples, n_features),
# so the single sample is reshaped to (1, 2) first

predicted_class = knn.predict(x_test[33].reshape(1,-1))
predicted_class

array(['dis_a'], dtype=object)

In [18]:
predicted_probabilities = knn.predict_proba(x_test[33].reshape(1,-1))
predicted_probabilities

array([[1., 0.]])
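
With the default uniform weights, `predict_proba` for KNN is simply the fraction of the k neighbours that belong to each class, so `[1., 0.]` above means all 3 neighbours were `dis_a`. The sketch below shows this on made-up 1-D data (`x_demo`, `y_demo`, and `knn_demo` are hypothetical names).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D training data: three "dis_a" points and two "dis_b" points
x_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = np.array(["dis_a", "dis_a", "dis_a", "dis_b", "dis_b"])

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(x_demo, y_demo)

# Column order of predict_proba follows classes_
print(knn_demo.classes_)

# Neighbours of 0.5 are 0, 1, 2: all dis_a, so the row is [1, 0]
proba_all_a = knn_demo.predict_proba([[0.5]])
print(proba_all_a)

# Neighbours of 2.9 are 3, 2, 4: one dis_a and two dis_b, so the row is [1/3, 2/3]
proba_mixed = knn_demo.predict_proba([[2.9]])
print(proba_mixed)
```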

## Classification Report


In [23]:
from sklearn.metrics import classification_report, confusion_matrix
y_pred = knn.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       dis_a       0.53      0.55      0.54       199
       dis_b       0.54      0.52      0.53       201

    accuracy                           0.54       400
   macro avg       0.54      0.54      0.54       400
weighted avg       0.54      0.54      0.54       400
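
The cell above imports `confusion_matrix` but never calls it. A minimal sketch of its use follows, with made-up labels (`y_true_demo`, `y_pred_demo`) for illustration; in the notebook you would pass `y_test` and `y_pred` instead.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; in the notebook use y_test and y_pred
y_true_demo = ["dis_a", "dis_a", "dis_a", "dis_b", "dis_b", "dis_b"]
y_pred_demo = ["dis_a", "dis_a", "dis_b", "dis_b", "dis_b", "dis_a"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["dis_a", "dis_b"])
print(cm)
# [[2 1]
#  [1 2]]
```

The diagonal holds the correct predictions for each class; the off-diagonal entries are the misclassifications that precision and recall are computed from.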



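Micro average:
Computes the metric globally by counting the total true positives, false positives, and false negatives across all classes, so every instance counts equally. For single-label classification like this one, micro-averaged precision, recall, and F1 all equal accuracy, which is why the report shows an `accuracy` row instead.

Macro average:
Computes the metric for each class separately and takes the unweighted mean, so every class counts equally regardless of its size.

Support:
The number of true samples of each class in `y_test` (199 `dis_a`, 201 `dis_b`). The average rows show the total number of samples considered, 400.

Weighted average:
Computes the metric for each class and then averages them, weighted by the number of true instances in each class (its support), so larger classes have a greater influence.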
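
The difference between the averaging modes is easiest to see on imbalanced data. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`) chosen so that the three averages of precision come out different.

```python
from sklearn.metrics import precision_score

# Made-up, deliberately imbalanced labels: 4 samples of class "a", 2 of class "b"
y_true_demo = ["a", "a", "a", "a", "b", "b"]
y_pred_demo = ["a", "a", "b", "b", "b", "b"]

# Per-class precision: "a" = 2/2 = 1.0, "b" = 2/4 = 0.5
averages = {}
for average in ("micro", "macro", "weighted"):
    averages[average] = precision_score(y_true_demo, y_pred_demo, average=average)
    print(average, round(averages[average], 3))

# micro    = 4 correct / 6 predictions     = 0.667
# macro    = (1.0 + 0.5) / 2               = 0.75
# weighted = (4 * 1.0 + 2 * 0.5) / 6       = 0.833
```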

For the coursework, work through these steps:

* Data Cleaning
* Correlation Test
* Handling Missing Values
* Normalization of Data
* Train-Test Split
* Model Training and Comparison:
  1. Algorithm 1 with Same Data
  2. Algorithm 2 with Same Data
  3. Algorithm 3 with Same Data
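
The checklist above could be sketched roughly as follows. Everything here is an assumption for illustration: the data is synthetic, and the three algorithms (KNN, logistic regression, decision tree) are arbitrary picks, not the ones the coursework requires.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with about 5% of values knocked out to mimic missing entries
x_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=42)
rng = np.random.default_rng(42)
x_demo[rng.random(x_demo.shape) < 0.05] = np.nan

# Correlation check between features (pandas skips NaNs pairwise)
print(pd.DataFrame(x_demo).corr())

# One split, reused for every model so the comparison is on the same data
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    # Impute missing values -> normalize -> fit.
    # The preprocessing steps are learned from the training split only.
    pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), model)
    pipeline.fit(x_tr, y_tr)
    results[name] = pipeline.score(x_te, y_te)  # accuracy on the held-out test set

print(results)
```

Wrapping imputation and scaling in a pipeline keeps test-set statistics from leaking into training, and normalization matters for KNN in particular because distances are dominated by features with large ranges.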

