
In [None]:

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.
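As a rough illustration of the scikit-learn route the exercise mentions, the sketch below fits `KNeighborsClassifier` with `n_neighbors` set to 1 and 3 and scores each with `accuracy_score`. The helper `knn_accuracy` and the toy two-blob data are assumptions made for this example; they are not part of the lab.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


def knn_accuracy(train_x, train_y, test_x, test_y, k):
    """Fit a k-nearest-neighbour classifier and return its test accuracy."""
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(train_x, train_y)
    return accuracy_score(test_y, model.predict(test_x))


# Hypothetical toy data: two Gaussian blobs with integer class labels.
demo_rng = np.random.default_rng(0)
demo_x = np.vstack([
    demo_rng.normal(0.0, 1.0, size=(100, 2)),
    demo_rng.normal(3.0, 1.0, size=(100, 2)),
])
demo_y = np.repeat([0, 1], 100)

# Shuffle once, then hold out the last 50 points for testing.
order = demo_rng.permutation(len(demo_y))
train_idx, test_idx = order[:150], order[150:]

demo_scores = {
    k: knn_accuracy(demo_x[train_idx], demo_y[train_idx],
                    demo_x[test_idx], demo_y[test_idx], k)
    for k in (1, 3)
}
for k, score in demo_scores.items():
    print(f"k={k}: test accuracy = {score * 100:.2f} %")
```

The same helper could be called with the lab's own train and test arrays, provided the labels are discrete classes.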

In [None]:
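dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

# The target is a continuous house value. Cast it to integers so it can be
# treated as a class label; np.bincount in NN1 needs non-negative integers.
dataset.target = dataset.target.astype(int)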

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [None]:
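def NN1(traindata, trainlabel, query, k=1):
    # Squared Euclidean distance from the query to every training point
    diff = traindata - query
    sq = diff * diff
    dist = sq.sum(1)
    # Indices of the k closest training points
    nearest_indices = np.argsort(dist)[:k]
    nearest_labels = trainlabel[nearest_indices]
    # Majority vote; labels must be non-negative integers for np.bincount
    return np.argmax(np.bincount(nearest_labels))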

def NN(traindata, trainlabel, testdata, k=1):
    predlabel = np.array([NN1(traindata, trainlabel, i, k) for i in testdata])
    return predlabel
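The voting step at the end of `NN1` can be seen in isolation with a small sketch. The label arrays below are made up for illustration. Note that on a tie `np.argmax` returns the first maximum, so the smallest tied label wins.

```python
import numpy as np

# Labels of the three nearest neighbours for a hypothetical query.
nearest_labels = np.array([2, 0, 2])
counts = np.bincount(nearest_labels)   # counts[i] = neighbours with label i
majority = int(np.argmax(counts))      # label with the highest count
print(counts, majority)                # [1 0 2] 2

# Three-way tie: each label appears once, so argmax picks the smallest label.
tied_labels = np.array([4, 1, 3])
tie_winner = int(np.argmax(np.bincount(tied_labels)))
print(tie_winner)                      # 1
```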

In [None]:
def Accuracy(gtlabel, predlabel):
    assert len(gtlabel) == len(predlabel), "Length of the ground-truth labels and predicted labels should be the same"
    return np.mean(gtlabel == predlabel)

In [None]:
def split(data, label, percent):
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label
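Because `split` assigns each row independently using a random number, the two parts only approximately match the requested fraction. The sketch below shows this on made-up data; `random_split` is a hypothetical stand-in that takes the generator as an argument so the example is self-contained.

```python
import numpy as np


def random_split(data, label, fraction, rng):
    """Send each row to the first part with probability `fraction`."""
    mask = rng.random(len(label)) < fraction
    return data[mask], label[mask], data[~mask], label[~mask]


demo_rng = np.random.default_rng(0)
demo_data = np.arange(2000, dtype=float).reshape(1000, 2)  # 1000 rows, 2 features
demo_label = np.arange(1000) % 3                           # made-up class labels

part1, lab1, part2, lab2 = random_split(demo_data, demo_label, 0.2, demo_rng)
# The first part holds roughly, not exactly, 20% of the rows.
print(len(lab1), len(lab2))
```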

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)


In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)


In [None]:
trainpred_k1 = NN(traindata, trainlabel, traindata, k=1)
trainAccuracy_k1 = Accuracy(trainlabel, trainpred_k1)
print("Training accuracy using nearest neighbor algorithm when k=1:", trainAccuracy_k1 * 100, "%")


trainpred_k3 = NN(traindata, trainlabel, traindata, k=3)
trainAccuracy_k3 = Accuracy(trainlabel, trainpred_k3)
print("Training accuracy using nearest neighbor algorithm when k=3:", trainAccuracy_k3 * 100, "%")

Training accuracy using nearest neighbor algorithm when k=1: 100.0 %
Training accuracy using nearest neighbor algorithm when k=3: 63.54416227439707 %
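The 100% training accuracy at k=1 is expected: when the query is itself a training point, its nearest neighbour is that same point at distance zero, so the classifier returns the stored label. A minimal sketch on made-up data with random labels (assuming no duplicate points) shows the effect:

```python
import numpy as np

demo_rng = np.random.default_rng(0)
train_x = demo_rng.normal(size=(50, 2))
train_y = demo_rng.integers(0, 3, size=50)  # labels unrelated to the features

# Squared distances between every pair of training points, shape (50, 50).
dist = ((train_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)

# For each point, the closest training point is the point itself.
nearest = np.argmin(dist, axis=1)
self_match = np.array_equal(nearest, np.arange(len(train_x)))
acc_k1 = float(np.mean(train_y[nearest] == train_y))
print(self_match, acc_k1)
```

With k=3 the point's own label can be outvoted by its two other neighbours, which is why the training accuracy drops below 100%.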


Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.
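One way to explore this exercise is to sweep over the training fraction and k, and to record the spread of accuracies across splits as well as the mean. The sketch below does this on made-up data; `knn_predict`, `accuracy_over_splits` and the three-blob dataset are assumptions for illustration, not the lab's own functions.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k):
    """Predict labels by majority vote among the k nearest training points."""
    # Squared Euclidean distances, shape (n_test, n_train).
    dist = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return np.array([np.argmax(np.bincount(votes)) for votes in train_y[nearest]])


def accuracy_over_splits(x, y, fraction, iterations, k, rng):
    """Return mean and standard deviation of validation accuracy over random splits."""
    scores = []
    for _ in range(iterations):
        in_train = rng.random(len(y)) < fraction
        pred = knn_predict(x[in_train], y[in_train], x[~in_train], k)
        scores.append(np.mean(pred == y[~in_train]))
    return float(np.mean(scores)), float(np.std(scores))


# Hypothetical toy data: three overlapping Gaussian blobs, 100 points each.
demo_rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
demo_x = np.vstack([demo_rng.normal(c, 1.0, size=(100, 2)) for c in centers])
demo_y = np.repeat([0, 1, 2], 100)

results = {}
for fraction in (0.5, 0.75, 0.9):
    for k in (1, 3):
        results[(fraction, k)] = accuracy_over_splits(
            demo_x, demo_y, fraction, iterations=10, k=k, rng=demo_rng
        )
        mean, std = results[(fraction, k)]
        print(f"train fraction={fraction}, k={k}: {mean * 100:.2f} % +/- {std * 100:.2f}")
```

Comparing the gap between k=1 and k=3 with the standard deviation across splits gives a sense of whether a difference is larger than the split-to-split noise.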

In [None]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float in [0, 1], the fraction of the data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [None]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 37.29545652577994 %
Test accuracy: 38.45393924425784 %


In [None]:
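# Redefine NN1 with a default of k=3. NN always passes k explicitly,
# so this default only applies when NN1 is called directly.
def NN1(traindata, trainlabel, query, k=3):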
    diff = traindata - query
    sq = diff * diff
    dist = sq.sum(1)
    nearest_indices = np.argsort(dist)[:k]
    nearest_labels = trainlabel[nearest_indices]
    return np.argmax(np.bincount(nearest_labels))


In [None]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN, k=3):
    """
    Run the classifier on several random train/validation splits and
    return its average validation accuracy.

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float in [0, 1], the fraction of the data to be used for training
    iterations: int, the number of random splits to average over
    classifier: classifier function to use; it must accept a k keyword argument
    k: int, the number of neighbours passed to the classifier (defaults to 3)

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for _ in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata, k=k)  # use k neighbours
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies


In [None]:
# Example usage with k=3
avg_acc_k3 = AverageAccuracy(alltraindata, alltrainlabel, 0.75, 10, classifier=NN)
print("Average validation accuracy with k=3:", avg_acc_k3 * 100, "%")

testpred_k3 = NN(alltraindata, alltrainlabel, testdata, k=3)
test_acc_k3 = Accuracy(testlabel, testpred_k3)
print("Test accuracy with k=3:", test_acc_k3 * 100, "%")


Average validation accuracy with k=3: 38.25921061728629 %
Test accuracy with k=3: 38.18226722647567 %


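Compare the results with the 1 nearest neighbour classifier:

When k increases from 1 to 3, the average validation accuracy increases slightly (37.30% to 38.26%).

When k increases from 1 to 3, the test accuracy decreases slightly (38.45% to 38.18%). Both changes are under one percentage point, so these runs do not show a clear advantage for either value of k.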
