
In [11]:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, accuracy_score as sk_acc

In [12]:
dataset = pd.read_csv('Iris.csv')
dataset.head()

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [15]:
dataset = dataset.drop('Id', axis=1)
X = dataset.drop('Species', axis=1)
y = dataset['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
display(X_train.head())
display(X_test.head())
display(y_train.head())
display(y_test.head())

|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| 81 | 5.5 | 2.4 | 3.7 | 1.0 |
| 133 | 6.3 | 2.8 | 5.1 | 1.5 |
| 137 | 6.4 | 3.1 | 5.5 | 1.8 |
| 75 | 6.6 | 3.0 | 4.4 | 1.4 |
| 109 | 7.2 | 3.6 | 6.1 | 2.5 |

|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| 73 | 6.1 | 2.8 | 4.7 | 1.2 |
| 18 | 5.7 | 3.8 | 1.7 | 0.3 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 |
| 78 | 6.0 | 2.9 | 4.5 | 1.5 |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 |

|   | Species |
|---|---|
| 81 | Iris-versicolor |
| 133 | Iris-virginica |
| 137 | Iris-virginica |
| 75 | Iris-versicolor |
| 109 | Iris-virginica |

|   | Species |
|---|---|
| 73 | Iris-versicolor |
| 18 | Iris-setosa |
| 118 | Iris-virginica |
| 78 | Iris-versicolor |
| 76 | Iris-versicolor |


In [23]:
def euclidean_distance(point1, point2):
    distance = np.sqrt(np.sum((point1 - point2)**2))
    return distance
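
As a quick sanity check of the distance formula, the sketch below recomputes the distance between the first two rows shown in `dataset.head()` using only the standard library. `euclidean_distance_plain` is a hypothetical stand-in for the NumPy version above, not a function from the notebook.

```python
import math

def euclidean_distance_plain(point1, point2):
    """Standard-library stand-in (hypothetical) for euclidean_distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

# First two rows from dataset.head():
# sepal length, sepal width, petal length, petal width
row0 = [5.1, 3.5, 1.4, 0.2]
row1 = [4.9, 3.0, 1.4, 0.2]

d = euclidean_distance_plain(row0, row1)
print(round(d, 4))  # sqrt(0.2**2 + 0.5**2) = sqrt(0.29), about 0.5385
```

Only the two sepal measurements differ between these rows, so the petal columns contribute nothing to the sum.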

## Implement KNN prediction

### Subtask:
Create a function that takes a training set, a test point, and the number of neighbors (k) as input. This function should calculate the distance between the test point and all training points, find the k nearest neighbors, and predict the class based on the majority class among the neighbors.


**Reasoning**:
Define the `predict_knn` function to calculate distances, find k nearest neighbors, and predict the class based on majority voting.



In [17]:
def predict_knn(X_train, y_train, test_point, k):
    """Predicts the class of a test point using KNN.

    Args:
        X_train: DataFrame of training features.
        y_train: Series of training labels.
        test_point: Series representing the test point.
        k: Number of neighbors to consider.

    Returns:
        The predicted class for the test point.
    """
    distances = []
    for index, train_point in X_train.iterrows():
        dist = euclidean_distance(test_point.values, train_point.values)
        distances.append((dist, index))

    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]

    # Look labels up by index label explicitly; y_train keeps the original row labels
    neighbor_labels = [y_train.loc[i] for _, i in k_nearest_neighbors]
    label_counts = Counter(neighbor_labels)
    predicted_class = label_counts.most_common(1)[0][0]

    return predicted_class
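
The same distance, select, and vote steps can be sketched without pandas. The variant below is an illustration under assumptions, not the notebook's method: `predict_knn_plain` and the toy points are hypothetical, and `heapq.nsmallest` is swapped in for the full sort so that only the `k` closest candidates are kept.

```python
import heapq
import math
from collections import Counter

def predict_knn_plain(train_points, train_labels, test_point, k):
    """Hypothetical list-based variant of predict_knn (same voting rule)."""
    # Pair each distance with its label; nsmallest keeps only the k closest,
    # so the full list of distances never has to be sorted.
    scored = ((math.dist(p, test_point), label)
              for p, label in zip(train_points, train_labels))
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    # nearest is ordered closest first, so Counter sees closer labels first
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_labels = ["a", "a", "a", "b", "b"]

print(predict_knn_plain(train_points, train_labels, (1.1, 1.0), k=3))  # a
print(predict_knn_plain(train_points, train_labels, (5.1, 5.0), k=3))  # b
```

When two labels receive the same number of votes, `Counter.most_common` returns the one that was encountered first, which here is the tied label whose closest member is nearer to the test point. An odd `k` avoids ties only in two-class problems; with three Iris species a tie is still possible.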

## Evaluate the model

### Subtask:
Create a function to evaluate the performance of the KNN model on the testing set. This could include calculating accuracy, precision, recall, and F1-score.


**Reasoning**:
Define the `evaluate_knn` function to calculate accuracy, precision, recall, and F1-score using scikit-learn metrics.



In [18]:
def evaluate_knn(y_true, y_pred):
    """Evaluates the performance of the KNN model.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        A dictionary containing accuracy, precision, recall, and F1-score.
    """
    accuracy = sk_acc(y_true, y_pred)
    precision, recall, f1_score, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted', zero_division=0)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score
    }
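
To make the `average='weighted'` setting concrete, the sketch below computes support-weighted precision and recall by hand. It is a simplified illustration, not scikit-learn's implementation: `weighted_precision_recall` is a hypothetical helper, and the labels are made up.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Hypothetical helper: support-weighted precision and recall by hand."""
    support = Counter(y_true)      # true instances per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    precision = recall = 0.0
    for label, n in support.items():
        # A class that is never predicted contributes 0 (like zero_division=0)
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        # Weight each per-class score by that class's share of the true labels
        precision += p * n / total
        recall += r * n / total
    return precision, recall

# Made-up labels: one versicolor sample is mistaken for virginica
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
precision, recall = weighted_precision_recall(y_true, y_pred)
print(round(precision, 2), round(recall, 2))  # 0.9 0.8
```

Support-weighted recall always equals accuracy, because the per-class weights cancel the per-class denominators. In the example, 4 of 5 predictions are correct, which gives the recall of 0.8.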

## Integrate and execute

### Subtask:
Combine the functions and run the KNN algorithm on the dataset.


**Reasoning**:
Initialize an empty list for predictions and iterate through the test set, making predictions using the custom `predict_knn` function, then evaluate the predictions using `evaluate_knn` and print the results.



In [20]:
y_pred = []
k = 5  # Choose a value for k
for index, test_point in X_test.iterrows():
    predicted_class = predict_knn(X_train, y_train, test_point, k)
    y_pred.append(predicted_class)

metrics = evaluate_knn(y_test, y_pred)
print(metrics)

{'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0}


## Analyze the results

### Subtask:
Interpret the evaluation metrics to understand the performance of the custom KNN implementation.


**Reasoning**:
Examine the metrics dictionary and interpret the results to describe the model's performance based on the calculated values.



In [21]:
print("KNN Model Performance Metrics:")
print(f"Accuracy: {metrics['accuracy']:.4f}")
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall: {metrics['recall']:.4f}")
print(f"F1-score: {metrics['f1_score']:.4f}")

print("\nInterpretation:")
print("The custom KNN model achieved perfect scores (1.0) across all evaluation metrics (accuracy, precision, recall, and F1-score) on the test dataset.")
print("This indicates that the model correctly classified every instance in the test set.")
print("A perfect score of 1.0 for accuracy means the model made zero incorrect predictions.")
print("Perfect precision (1.0) means that when the model predicted a class, it was always correct.")
print("Perfect recall (1.0) means the model was able to find all the actual instances of each class in the test set.")
print("A perfect F1-score (1.0), being the harmonic mean of precision and recall, further confirms the model's flawless performance on this specific test set.")
print("It's important to note that while these results are excellent, perfect scores on a test set, especially a small one or one derived from the same distribution as the training data, might not always generalize perfectly to unseen data.")

KNN Model Performance Metrics:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-score: 1.0000

Interpretation:
The custom KNN model achieved perfect scores (1.0) across all evaluation metrics (accuracy, precision, recall, and F1-score) on the test dataset.
This indicates that the model correctly classified every instance in the test set.
A perfect score of 1.0 for accuracy means the model made zero incorrect predictions.
Perfect precision (1.0) means that when the model predicted a class, it was always correct.
Perfect recall (1.0) means the model was able to find all the actual instances of each class in the test set.
A perfect F1-score (1.0), being the harmonic mean of precision and recall, further confirms the model's flawless performance on this specific test set.
It's important to note that while these results are excellent, perfect scores on a test set, especially a small one or one derived from the same distribution as the training data, might not always generalize perfectly to unseen data.
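
The interpretation above describes F1 as the harmonic mean of precision and recall. For a single class $c$ with precision $P_c$ and recall $R_c$ this reads as follows (the symbols $P_c$, $R_c$, $n_c$, and $N$ are notation introduced here, not taken from the notebook):

$$F_{1,c} = \frac{2\,P_c\,R_c}{P_c + R_c}$$

With `average='weighted'`, the reported value is the mean of the per-class scores, weighted by the number of true instances $n_c$ of each class out of $N$ test samples:

$$F_1^{\text{weighted}} = \sum_c \frac{n_c}{N}\,F_{1,c}$$

When every class has $P_c = R_c = 1$, as in this run, each $F_{1,c}$ is 1 and so is the weighted average. In general, the weighted F1 need not equal the harmonic mean of the weighted precision and the weighted recall.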

## Summary:

### Data Analysis Key Findings

*   The dataset was successfully split into training and testing sets, with 30% of the data allocated for testing.
*   A custom `euclidean_distance` function was implemented from scratch to calculate the distance between two data points.
*   A custom `predict_knn` function was created, which calculates distances to training points, identifies the `k` nearest neighbors (with $k=5$ used in the execution), and predicts the class based on majority voting among the neighbors.
*   A custom `evaluate_knn` function was developed using scikit-learn metrics to calculate accuracy, precision, recall, and F1-score.
*   With $k=5$, the custom KNN model scored 1.0000 on accuracy and on support-weighted precision, recall, and F1-score for the test set.

### Insights or Next Steps

*   A perfect score on a single 30% hold-out split is only one estimate of performance. Evaluating on additional unseen data or with cross-validation would give a more robust estimate of how well the custom KNN implementation generalizes.
*   Experimenting with different values of $k$ could be beneficial to see if the performance is sensitive to this hyperparameter.
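
Both next steps can be combined by scoring several values of $k$ with k-fold cross-validation. The sketch below is a minimal standard-library illustration under assumptions: `knn_predict` and `cross_val_accuracy` are hypothetical helpers, the folds are plain (not stratified), and the data are synthetic rather than the Iris measurements.

```python
import math
import random
from collections import Counter

def knn_predict(train, test_point, k):
    """Majority vote among the k nearest (features, label) training pairs."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], test_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, k, n_folds=5):
    """Mean accuracy over n_folds folds; every sample is tested exactly once."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]  # every n_folds-th sample, offset by fold
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds

# Synthetic two-class data (made up; not the Iris measurements)
rng = random.Random(42)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
        for c, label in [(0.0, "a"), (2.5, "b")] for _ in range(40)]
rng.shuffle(data)  # mix the classes so each fold contains both

for k in (1, 3, 5, 7, 9):
    print(k, round(cross_val_accuracy(data, k), 3))
```

If the printed accuracies stay close together across the candidate values, the model is not very sensitive to $k$ on that data; a large spread suggests the choice deserves more care.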
