
**Muhammad Roshaan Idrees**

---


**56177**

# 1: Implement a KNN Classifier and complete the following steps:
- Implement KNN with different parameters.
- Apply KNN to different datasets with different numbers of neighbors.
- Calculate the overall accuracy of each model.
- Compare the results.



In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.stats import mode
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# K Nearest Neighbors Classification
class K_Nearest_Neighbors_Classifier():

    def __init__(self, K):
        self.K = K

    # Function to store training set
    def fit(self, X_train, Y_train):
        self.X_train = X_train
        self.Y_train = Y_train
        self.m, self.n = X_train.shape  # no_of_training_examples, no_of_features

    # Function to predict class for test examples
    def predict(self, X_test):
        self.X_test = X_test
        self.m_test, self.n = X_test.shape

        Y_pred = np.zeros(self.m_test)

        for i in range(self.m_test):
            x = self.X_test[i]
            neighbors = self.find_neighbors(x)
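            # Majority vote: scipy.stats.mode returns a (mode, count) result.
            # np.ravel handles both the scalar and the length-1 array that
            # different SciPy versions return for 1-D input.
            Y_pred[i] = np.ravel(mode(neighbors)[0])[0]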

        return Y_pred

    def find_neighbors(self, x):
        # calculate all the euclidean distances between current
        # test example x and training set X_train
        euclidean_distances = np.zeros(self.m)

        for i in range(self.m):
            d = self.euclidean(x, self.X_train[i])
            euclidean_distances[i] = d

        # sort Y_train according to euclidean_distance_array and
        # store into Y_train_sorted
        inds = euclidean_distances.argsort()
        Y_train_sorted = self.Y_train[inds]

        return Y_train_sorted[:self.K]

    # Function to calculate euclidean distance
    def euclidean(self, x, x_train):
        return np.sqrt(np.sum(np.square(x - x_train)))

# Driver code
def main():
    # Importing dataset
    df = pd.read_csv("diabetes.csv")

    X = df.iloc[:,:-1].values
    Y = df.iloc[:,-1].values

    # Splitting dataset into train and test set
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=1/3, random_state=0)

    # Model training with different K values
    k_values = [1, 3, 5, 7, 9]

    print("KNN Implementation Results")
    print("=" * 50)

    for k in k_values:
        # Our custom KNN model
        model = K_Nearest_Neighbors_Classifier(K=k)
        model.fit(X_train, Y_train)

        # Sklearn KNN model
        model1 = KNeighborsClassifier(n_neighbors=k)
        model1.fit(X_train, Y_train)

        # Prediction on test set
        Y_pred = model.predict(X_test)
        Y_pred1 = model1.predict(X_test)

        # Calculate accuracy
        accuracy_custom = accuracy_score(Y_test, Y_pred) * 100
        accuracy_sklearn = accuracy_score(Y_test, Y_pred1) * 100

        print(f"\nK = {k}:")
        print(f"Accuracy on test set by our model    : {accuracy_custom:.6f}%")
        print(f"Accuracy on test set by sklearn model: {accuracy_sklearn:.6f}%")

if __name__ == "__main__":
    main()
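
The per-row loop in `find_neighbors` could also be expressed as a single broadcast operation. This is a minimal sketch, assuming `X_train` is a 2-D NumPy array; the helper name `euclidean_distances_to` and the sample points are made up for illustration and are not part of the lab code.

```python
import numpy as np

def euclidean_distances_to(x, X_train):
    """Distance from one sample x to every row of X_train (hypothetical helper)."""
    # Broadcasting subtracts x from each row; the sum runs over features.
    return np.sqrt(np.sum((X_train - x) ** 2, axis=1))

# Illustrative points only: rows at distance 0, 5 and 10 from the origin.
X_small = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d = euclidean_distances_to(np.array([0.0, 0.0]), X_small)
print(d.tolist())  # [0.0, 5.0, 10.0]
```

The result has one entry per training row, so it can be passed straight to `argsort()` in the same way as the `euclidean_distances` array.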

In [None]:
# Additional analysis with different test sizes and features
def extended_analysis():
    df = pd.read_csv("diabetes.csv")

    print("\n" + "="*60)
    print("Extended Analysis with Different Parameters")
    print("="*60)

    # Analysis 1: Different test sizes
    test_sizes = [0.2, 0.25, 0.3, 0.33]
    print("\n1. Different Test Sizes (K=3):")
    for test_size in test_sizes:
        X = df.iloc[:,:-1].values
        Y = df.iloc[:,-1].values

        X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, test_size=test_size, random_state=0)

        model = K_Nearest_Neighbors_Classifier(K=3)
        model.fit(X_train, Y_train)
        Y_pred = model.predict(X_test)
        accuracy = accuracy_score(Y_test, Y_pred) * 100

        print(f"Test Size: {test_size} - Accuracy: {accuracy:.6f}%")

    # Analysis 2: Using different feature subsets
    print("\n2. Different Feature Subsets (K=3, test_size=1/3):")
    feature_subsets = [
        ['Glucose', 'BMI', 'Age'],  # Basic health indicators
        ['Pregnancies', 'Glucose', 'BloodPressure', 'BMI'],  # Core metrics
        ['Glucose', 'Insulin', 'BMI', 'DiabetesPedigreeFunction']  # Diabetes specific
    ]

    for i, features in enumerate(feature_subsets, 1):
        X = df[features].values
        Y = df['Outcome'].values

        X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, test_size=1/3, random_state=0)

        model = K_Nearest_Neighbors_Classifier(K=3)
        model.fit(X_train, Y_train)
        Y_pred = model.predict(X_test)
        accuracy = accuracy_score(Y_test, Y_pred) * 100

        print(f"Feature Set {i}: {features}")
        print(f"Accuracy: {accuracy:.6f}%")

# Run extended analysis
extended_analysis()


Extended Analysis with Different Parameters

1. Different Test Sizes (K=3):
Test Size: 0.2 - Accuracy: 72.727273%
Test Size: 0.25 - Accuracy: 59.259259%
Test Size: 0.3 - Accuracy: 63.636364%
Test Size: 0.33 - Accuracy: 63.888889%

2. Different Feature Subsets (K=3, test_size=1/3):
Feature Set 1: ['Glucose', 'BMI', 'Age']
Accuracy: 75.000000%
Feature Set 2: ['Pregnancies', 'Glucose', 'BloodPressure', 'BMI']
Accuracy: 69.444444%
Feature Set 3: ['Glucose', 'Insulin', 'BMI', 'DiabetesPedigreeFunction']
Accuracy: 72.222222%


# 2: Answers to some Questions

**Data Preparation**
- Loaded the diabetes dataset from CSV.
- Created a synthetic dataset with 200 samples, 8 features, and class imbalance using make_classification.
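
A synthetic dataset of this shape could be generated roughly as follows. Only the 200 samples and 8 features come from the text; the class weights, the number of informative features, and the random seed are assumed values chosen for illustration.

```python
from collections import Counter
from sklearn.datasets import make_classification

# n_samples and n_features follow the description above; every other
# argument (weights, n_informative, n_redundant, random_state) is assumed.
X_syn, y_syn = make_classification(
    n_samples=200,
    n_features=8,
    n_informative=5,
    n_redundant=1,
    n_classes=2,
    weights=[0.7, 0.3],  # assumed imbalance: roughly 70% of samples in class 0
    random_state=0,
)

print(X_syn.shape)
print(Counter(y_syn.tolist()))
```

Printing the class counts is a quick way to confirm that the imbalance requested through `weights` actually shows up in the labels.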

**Custom KNN Implementation**
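- Built a class `K_Nearest_Neighbors_Classifier` with methods for fitting (`fit`), predicting (`predict`), finding the K nearest neighbors (`find_neighbors`), and calculating Euclidean distance (`euclidean`).
- Used `scipy.stats.mode` to pick the most common label among the K nearest neighbors.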
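
A majority vote based on `collections.Counter`, as an alternative to `scipy.stats.mode`, might look like the sketch below. The class name `CustomKNN`, its method names, and the toy data are illustrative assumptions rather than the code used in the cells above.

```python
from collections import Counter
import numpy as np

class CustomKNN:
    """Minimal KNN sketch: Euclidean distance plus a Counter majority vote."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def euclidean(self, a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            dists = [self.euclidean(x, row) for row in self.X_train]
            nearest = np.argsort(dists)[: self.k]
            votes = Counter(self.y_train[nearest].tolist())
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)

# Toy data (made up for illustration): two well-separated clusters.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
knn = CustomKNN(k=3).fit(X_toy, y_toy)
toy_pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(toy_pred.tolist())  # [0, 1]
```

One design difference worth noting: when two labels tie, `Counter.most_common` returns the label seen first among the neighbors, while `scipy.stats.mode` returns the smallest label. With an odd K and two classes no tie can occur.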

**Model Training and Evaluation**
- Trained both the custom and the sklearn KNN models for K = 1, 3, 5, 7, 9.
- Used train_test_split with 1/3 test size and accuracy_score for evaluation.
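
The evaluation loop can be sketched on stand-in data so that it runs without `diabetes.csv`. The generated dataset, the variable names, and the seed are assumptions; the K values mirror the `k_values` list in the driver code, and the 1/3 test split follows the text.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (an assumption); swap in the real features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0)

# Fit one model per K and record its accuracy on the held-out third.
results = {}
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = accuracy_score(y_te, clf.predict(X_te))

for k, acc in results.items():
    print(f"K = {k}: accuracy = {acc:.3f}")
```

Collecting the scores in a dictionary keeps the comparison across K values in one place, which makes it easy to tabulate or plot afterwards.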

**Comparison**
- Both models gave identical results, validating the custom implementation.
- The synthetic dataset showed higher accuracy due to better class separation and feature distribution.
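
Whether two implementations really agree can be checked prediction by prediction rather than through accuracy alone, since two models can reach the same accuracy while erring on different samples. The sketch below is one way to do that; the helper `knn_predict`, the generated data, and K = 5 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_train, y_train, X_test, k):
    """Brute-force KNN for non-negative integer labels (illustrative helper)."""
    # Pairwise Euclidean distances, shape (n_test, n_train).
    diffs = X_test[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote per test row; bincount ties go to the smaller label.
    return np.array([np.bincount(row).argmax() for row in y_train[nearest]])

# Stand-in data (an assumption), split with the same 1/3 test size.
X_cmp, y_cmp = make_classification(n_samples=200, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cmp, y_cmp, test_size=1/3, random_state=0)

k = 5
ours = knn_predict(X_tr, y_tr, X_te, k)
theirs = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).predict(X_te)

# Fraction of test samples on which both models predict the same label.
agreement = float(np.mean(ours == theirs))
print(f"Prediction agreement at K={k}: {agreement:.3f}")
```

An agreement below 1.0 would usually point to tie handling, either between equidistant neighbors or between equally frequent labels, rather than to a bug in the distance computation.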
