# Machine Learning terms and metrics

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [None]:
rng = np.random.default_rng(seed=42)

In [None]:
dataset = datasets.fetch_california_housing()
dataset

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

Given below is the list of target values. These are the median house values (`MedHouseVal`) that we want to predict from the 8 input features, and they are continuous. We should use regression models to predict such values, but for the sake of simplicity we will start with a classification model. To do so, we convert each value to an integer by dropping its fractional part (note that `astype(int)` truncates rather than rounding to the nearest integer) and use a classification model to predict the resulting class.

In [None]:
print("Original target values:", dataset.target)

# astype(int) truncates toward zero, e.g. 4.526 -> 4 and 0.923 -> 0
dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
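The difference between truncating and rounding matters for how the classes are defined. The following is a minimal sketch on a few of the target values printed above; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# A few of the target values printed above.
sample_targets = np.array([4.526, 3.585, 0.923])

# astype(int) drops the fractional part (truncation toward zero).
truncated = sample_targets.astype(int)

# np.rint rounds to the nearest integer instead.
rounded = np.rint(sample_targets).astype(int)

print("Truncated:", truncated)
print("Rounded:  ", rounded)
```

With truncation, class `k` covers all values in the interval from `k` up to (but not including) `k + 1`.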


The simplest model to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value with a K value of 1. We will also use the accuracy metric to evaluate the model.

In [None]:
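def NN1(traindata, trainlabel, query):
    # Squared Euclidean distance from the query to every training sample
    diff = traindata - query
    sq = diff * diff
    dist = sq.sum(1)
    # Label of the closest training sample
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    # Predict a label for each test sample, one query at a time
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel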
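To see how the nearest-neighbour rule behaves, here is a small sketch on three hand-made 2D points. The helper `nearest_neighbour_label` and the toy arrays are hypothetical names used only for this illustration; the logic mirrors `NN1` above.

```python
import numpy as np

def nearest_neighbour_label(train_x, train_y, query):
    # Squared Euclidean distance from the query to every training row.
    sq_dist = ((train_x - query) ** 2).sum(axis=1)
    # Return the label of the closest training row.
    return train_y[np.argmin(sq_dist)]

# Three toy training points: two of class 0 near the origin, one of class 1.
toy_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_y = np.array([0, 0, 1])

pred_near_origin = nearest_neighbour_label(toy_x, toy_y, np.array([0.2, 0.1]))
pred_far = nearest_neighbour_label(toy_x, toy_y, np.array([4.0, 4.5]))

print(pred_near_origin, pred_far)
```

The first query is closest to the point at the origin and the second is closest to `[5, 5]`, so each query takes the label of that point.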

We will also define a 'random classifier', which assigns a uniformly random label to each sample.

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use it to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.
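The definition of accuracy can be written in a couple of lines. This is a sketch with a hypothetical helper name and made-up labels; the notebook itself relies on `accuracy_score` from scikit-learn, which computes the same ratio for this kind of input.

```python
import numpy as np

def accuracy(true_labels, pred_labels):
    # Fraction of positions where the prediction equals the true label.
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    return float(np.mean(true_labels == pred_labels))

# Five made-up samples, three of which are predicted correctly.
example_accuracy = accuracy([4, 3, 3, 0, 0], [4, 3, 2, 0, 1])
print(example_accuracy)
```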

We use scikit-learn's `train_test_split` to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

We will reserve 20% of our dataset as the test set and will not change this portion throughout our experiments. Of the remaining data, 25% (that is, 20% of the whole dataset) is held out as a validation set.

In [None]:
all_train_data, test_data, all_train_target, test_target = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=42)
train_data, val_data, train_target, val_target = train_test_split(all_train_data, all_train_target, test_size=0.25, random_state=42)
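The two-stage split above ends up as a 60/20/20 partition into train, validation and test sets. The sketch below checks this on a synthetic array of 1000 rows; the array and variable names are assumptions made for illustration, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, one feature, alternating labels.
toy_x = np.arange(1000).reshape(-1, 1)
toy_y = np.arange(1000) % 2

# Stage 1: hold out 20% as the test set.
rest_x, test_x, rest_y, test_y = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0
)
# Stage 2: hold out 25% of the remaining 80% as the validation set.
train_x, val_x, train_y, val_y = train_test_split(
    rest_x, rest_y, test_size=0.25, random_state=0
)

sizes = (len(train_x), len(val_x), len(test_x))
print("train / val / test sizes:", sizes)
```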

What is the accuracy of our classifiers on the train dataset?

In [None]:
train_pred = NN(train_data, train_target, train_data)
train_accuracy = accuracy_score(train_target, train_pred)
print(f"Training Accuracy using Nearest Neighbour Algorithm: {train_accuracy*100:.2f}%")

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(train_data, train_target)
train_pred = knn.predict(train_data)
train_accuracy = accuracy_score(train_target, train_pred)
print(f"Training Accuracy using K Nearest Neighbours Algorithm: {train_accuracy*100:.2f}%")

train_pred = RandomClassifier(train_data, train_target, train_data)
train_accuracy = accuracy_score(train_target, train_pred)
print(f"Training Accuracy using Random Classifier: {train_accuracy*100:.2f}%")

Training Accuracy using Nearest Neighbour Algorithm: 100.00%
Training Accuracy using K Nearest Neighbours Algorithm: 100.00%
Training Accuracy using Random Classifier: 17.30%


For nearest neighbour with K = 1, the train accuracy is 100% because each training sample is its own nearest neighbour (this can only fail when identical points carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case, since the converted targets take the six values 0 to 5. This is because the random classifier picks each label with equal probability, so the probability of assigning the correct label is 1/(number of classes).

Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.
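The 1/(number of classes) figure holds even when the classes are unbalanced, as long as the guesses are uniform. The following simulation is a sketch under assumed class proportions (the probabilities below are invented for illustration and are not the housing data's actual class frequencies).

```python
import numpy as np

sim_rng = np.random.default_rng(0)
n_samples = 60_000
n_classes = 6

# Deliberately unbalanced true labels (made-up proportions).
true_labels = sim_rng.choice(
    n_classes, size=n_samples, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01]
)
# Uniform random guesses, as in a random classifier.
guesses = sim_rng.integers(low=0, high=n_classes, size=n_samples)

random_accuracy = float(np.mean(true_labels == guesses))
print(f"Simulated accuracy: {random_accuracy:.4f}, 1/6 = {1 / 6:.4f}")
```

Each guess matches the true label with probability 1/6 whatever that label is, so the class proportions cancel out of the expected accuracy.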

In [None]:
val_pred = NN(train_data, train_target, val_data)
val_accuracy = accuracy_score(val_target, val_pred)
print(f"Validation Accuracy using Nearest Neighbour Algorithm: {val_accuracy*100:.2f}%")

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_data, train_target)
val_pred = knn.predict(val_data)
val_accuracy = accuracy_score(val_target, val_pred)
print(f"Validation Accuracy using K Nearest Neighbours Algorithm: {val_accuracy*100:.2f}%")

val_pred = RandomClassifier(train_data, train_target, val_data)
val_accuracy = accuracy_score(val_target, val_pred)
print(f"Validation Accuracy using Random Classifier: {val_accuracy*100:.2f}%")

Validation Accuracy using Nearest Neighbour Algorithm: 33.67%
Validation Accuracy using K Nearest Neighbours Algorithm: 36.80%
Validation Accuracy using Random Classifier: 16.79%


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier stays about the same. Still, nearest neighbour is about twice as accurate as the random classifier on the validation set.

Now let us try another split, this time holding out only 1% of the training data for validation, and check the validation accuracy. We will see that the validation accuracy changes with the split. When the validation set is small, the accuracy depends heavily on which samples happen to land in it. We can get a better estimate of the accuracy by using cross-validation.

In [None]:
train_data, val_data, train_target, val_target = train_test_split(all_train_data, all_train_target, test_size=0.01, random_state=42)
val_pred = NN(train_data, train_target, val_data)
val_accuracy = accuracy_score(val_target, val_pred)
print(f"Validation Accuracy using Nearest Neighbour Algorithm: {val_accuracy*100:.2f}%")

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_data, train_target)
val_pred = knn.predict(val_data)
val_accuracy = accuracy_score(val_target, val_pred)
print(f"Validation Accuracy using K Nearest Neighbours Algorithm: {val_accuracy*100:.2f}%")

Validation Accuracy using Nearest Neighbour Algorithm: 37.95%
Validation Accuracy using K Nearest Neighbours Algorithm: 34.94%


To try different random splits, change or remove `random_state` in the cell above and run it again; with `random_state=42` fixed, every run produces exactly the same split.
The accuracy differs from split to split, but the values stay close together.
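How far the accuracy moves between splits depends on the size of the validation set. As a rough guide, and assuming each prediction behaves like an independent coin flip with success probability equal to the true accuracy (a simplifying assumption), the standard error of a measured accuracy is sqrt(p(1 - p)/n). The sketch below evaluates it for an accuracy of 0.35 with 166 and 4128 validation samples, which are the sizes of a 1% and a 25% hold-out from the 16512 training rows; the helper name is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# 1% hold-out (166 samples) versus 25% hold-out (4128 samples).
se_small = accuracy_standard_error(0.35, 166)
se_large = accuracy_standard_error(0.35, 4128)

print(f"Standard error with 166 samples:  {se_small * 100:.2f} percentage points")
print(f"Standard error with 4128 samples: {se_large * 100:.2f} percentage points")
```

Under this assumption the small validation set gives an uncertainty of roughly 3.7 percentage points, against roughly 0.7 for the larger one, which is consistent with accuracies that shift by a few points from split to split.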

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
# 1-nearest-neighbour, trained on the full training set (train + validation)
test_pred = NN(all_train_data, all_train_target, test_data)
test_accuracy = accuracy_score(test_target, test_pred)

print(f"Test Accuracy: {test_accuracy*100:.2f}%")

# scikit-learn K-nearest-neighbours with K = 5, trained on the same data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(all_train_data, all_train_target)
test_pred = knn.predict(test_data)
test_accuracy = accuracy_score(test_target, test_pred)

print(f"Test Accuracy: {test_accuracy*100:.2f}%")

Test Accuracy: 35.63%
Test Accuracy: 38.20%


## Multiple Splits

One way to get a more reliable estimate of the test accuracy is <b>cross-validation</b>. Here, we will try a simple version: we make multiple random train/validation splits and take the average of the validation accuracies as our estimate of the test accuracy. Each split has to be drawn with a different random seed, otherwise every iteration repeats the same split and the average equals a single-split accuracy. Here is a function for doing this. Note that this function takes a long time to execute; you can reduce the number of iterations to make it faster.

In [None]:
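def average_accuracy(all_data, all_target, split_percent, iterations):
    # split_percent is the fraction of samples held out for validation (e.g. 0.25)
    accuracy = 0
    for ii in range(iterations):
        # Seed each split with the iteration index so that every iteration
        # draws a different (but reproducible) train/validation split
        train_data, val_data, train_target, val_target = train_test_split(all_data, all_target, test_size=split_percent, random_state=ii)
        val_pred = NN(train_data, train_target, val_data)
        accuracy += accuracy_score(val_target, val_pred)
    return accuracy / iterations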

In [None]:
avg_acc = average_accuracy(all_train_data, all_train_target, 0.25, 10)
print("Average Validation Accuracy:", avg_acc*100, "%")

test_pred = NN(all_train_data, all_train_target, test_data)
print("Test Accuracy:", accuracy_score(test_target, test_pred)*100, "%")

Average Validation Accuracy: 33.672480620155035 %
Test Accuracy: 35.634689922480625 %
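scikit-learn also ships this "repeated random split" scheme as `ShuffleSplit`, which can be passed to `cross_val_score`. The sketch below runs it on a small synthetic two-class problem rather than the housing data, so that it finishes quickly; the toy data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: the label depends on the sign of the first feature.
demo_rng = np.random.default_rng(0)
toy_x = demo_rng.normal(size=(300, 2))
toy_y = (toy_x[:, 0] > 0).astype(int)

# Ten independent random splits, each holding out 25% for validation.
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=1), toy_x, toy_y, cv=splitter
)

mean_score, std_score = scores.mean(), scores.std()
print(f"Mean validation accuracy: {mean_score:.3f} (std {std_score:.3f})")
```

Reporting the standard deviation alongside the mean shows how much the estimate varies from one split to the next.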


In [None]:
def average_accuracy(all_data, all_target, split_percent, iterations):
    # Same procedure as before, using scikit-learn's classifier with K = 3
    accuracy = 0
    for ii in range(iterations):
        # A different seed per iteration gives a different split each time
        train_data, val_data, train_target, val_target = train_test_split(all_data, all_target, test_size=split_percent, random_state=ii)
        knn = KNeighborsClassifier(n_neighbors=3)
        knn.fit(train_data, train_target)
        val_pred = knn.predict(val_data)
        accuracy += accuracy_score(val_target, val_pred)
    return accuracy / iterations

In [None]:
avg_acc = average_accuracy(all_train_data, all_train_target, 0.25, 100)
print("Average Validation Accuracy:", avg_acc*100, "%")

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(all_train_data, all_train_target)
test_pred = knn.predict(test_data)
print("Test Accuracy:", accuracy_score(test_target, test_pred)*100, "%")

Average Validation Accuracy: 33.57558139534876 %
Test Accuracy: 36.240310077519375 %
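The more common form of cross-validation is K-fold: the data is cut into K disjoint folds, and each fold serves as the validation set exactly once. This is a different scheme from the repeated random splits used above. The following sketch uses scikit-learn's `KFold` on synthetic data; the toy arrays and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data for a quick demonstration.
demo_rng = np.random.default_rng(0)
toy_x = demo_rng.normal(size=(200, 2))
toy_y = (toy_x[:, 0] + toy_x[:, 1] > 0).astype(int)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Collect the validation indices of all folds: every sample appears once.
val_indices = np.concatenate([val_idx for _, val_idx in kfold.split(toy_x)])

fold_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), toy_x, toy_y, cv=kfold
)
kfold_mean = fold_scores.mean()
print("Per-fold accuracy:", np.round(fold_scores, 3))
print(f"Mean accuracy: {kfold_mean:.3f}")
```

Because the folds do not overlap, every sample contributes to the validation estimate exactly once, whereas repeated random splits may pick some samples several times and others never.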
