<a href="https://colab.research.google.com/github/Akshaya345/AIML_Tutorial/blob/main/AIML_Module1_Lab2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [35]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
rand=np.random.default_rng(seed=42)

In [36]:
dataset=datasets.fetch_california_housing()
print(dataset.DESCR)
print(dataset.keys())
# np.int was deprecated in NumPy 1.20 and removed in 1.24; the builtin int is equivalent
dataset.target=dataset.target.astype(int)
print()
print()
print(dataset.data.shape)
print(dataset.target.shape)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived



In [37]:
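def NN1(traindata,trainlabel,query):
  # squared Euclidean distance from the query to every training sample
  diff=traindata-query
  square=diff*diff
  dist=square.sum(1)
  # label of the closest training sample
  label=trainlabel[np.argmin(dist)]
  return label
def NN(traindata,trainlabel,testdata):
  # classify each test sample independently with NN1
  predlabel=np.array([NN1(traindata,trainlabel,i) for i in testdata])
  return predlabel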
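
The nearest neighbour rule above can also be written without the Python-level loop. The following is a minimal sketch, not the notebook's own method: `nn_predict` and the toy arrays are hypothetical names introduced for illustration, and the broadcasted array needs memory proportional to (test samples × train samples × features), so it may not suit the full dataset.

```python
import numpy as np

def nn_predict(traindata, trainlabel, testdata):
    # Hypothetical helper: same 1-nearest-neighbour rule as NN, computed in one step.
    # diff has shape (n_test, n_train, n_features) thanks to broadcasting.
    diff = testdata[:, None, :] - traindata[None, :, :]
    dist = (diff * diff).sum(axis=2)          # squared Euclidean distances
    return trainlabel[np.argmin(dist, axis=1)]  # label of the closest train sample

# Toy data (made up for this example)
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_test = np.array([[0.2, 0.1], [4.0, 4.5]])

toy_pred = nn_predict(toy_train, toy_label, toy_test)
print(toy_pred)  # [0 1]
```

The first query lies next to the origin (label 0) and the second is closest to the point at (5, 5) (label 1).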

In [38]:
def RandomClassifier(traindata,trainlabel,testdata):
  classes=np.unique(trainlabel)
  rints=rand.integers(low=0,high=len(classes),size=len(testdata))
  predlabel=classes[rints]
  return predlabel
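
Because the random classifier above draws each label uniformly from the observed classes, its expected accuracy is 1/K for K classes, whatever the class balance. The sketch below checks this on synthetic labels; the class proportions and the six-class assumption are made up for illustration. Accuracies of roughly 0.16 to 0.17, as reported in this notebook, would be consistent with six integer classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, deliberately imbalanced labels over 6 classes (assumed proportions)
labels = rng.choice(6, size=60000, p=[0.05, 0.35, 0.30, 0.15, 0.10, 0.05])

# Same guessing rule as RandomClassifier: a uniform draw over the unique classes
classes = np.unique(labels)
guesses = classes[rng.integers(low=0, high=len(classes), size=len(labels))]

chance_accuracy = (guesses == labels).mean()
print(chance_accuracy, 1 / len(classes))
```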

In [39]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct=(gtlabel==predlabel).sum()
  return correct/len(gtlabel)

In [40]:
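def split(data, label, percent):
  # percent is a fraction between 0 and 1: each sample lands in split1
  # with that probability, so the split sizes are only approximate
  rnd=rand.random(len(label))
  split1=rnd<percent
  split2=rnd>=percent
  split1data=data[split1,:]
  split1label=label[split1]
  split2data=data[split2,:]
  split2label=label[split2]
  return split1data, split1label, split2data, split2label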
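
Since `split` assigns each sample independently, the realised fraction only approximates the requested one. The sketch below illustrates this on made-up data; `random_split` is a hypothetical re-implementation of the same idea with its own generator, not the notebook's function.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_split(data, label, fraction):
    # Each sample goes to the first split with probability `fraction`
    mask = rng.random(len(label)) < fraction
    return data[mask], label[mask], data[~mask], label[~mask]

# Made-up data: 10000 samples with 2 features and 3 classes
data = np.arange(20000, dtype=float).reshape(10000, 2)
label = np.arange(10000) % 3

a_data, a_label, b_data, b_label = random_split(data, label, 0.2)
realised = len(a_label) / len(label)
print(len(a_label), len(b_label), realised)
```

The two parts always cover every sample exactly once, but the first part holds close to 20% of the data rather than exactly 20%.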

In [41]:
testdata,testlabel,alltraindata,alltrainlabel=split(dataset.data,dataset.target,20/100)
print('Number of test samples=', len(testlabel))
print('Number of other samples=', len(alltrainlabel))
print('Percent of test data=', len(testlabel)*100/len(dataset.target),'%')

Number of test samples= 4144
Number of other samples= 16496
Percent of test data= 20.07751937984496 %


In [42]:
traindata,trainlabel,valdata,vallabel=split(alltraindata,alltrainlabel,75/100)

In [43]:
trainprediction=NN(traindata,trainlabel,traindata)
trainAccuracy=Accuracy(trainlabel,trainprediction)
print("Train accuracy using nearest neighbour is: ", trainAccuracy)
trainprediction=RandomClassifier(traindata,trainlabel,traindata)
trainAccuracy=Accuracy(trainlabel,trainprediction)
print("Train accuracy using random classifier is: ", trainAccuracy)

Train accuracy using nearest neighbour is:  1.0
Train accuracy using random classifier is:  0.164375808538163


In [44]:
valprediction=NN(traindata,trainlabel,valdata)
valAccuracy=Accuracy(vallabel,valprediction)
print("Validation accuracy using nearest neighbour is: ",valAccuracy)
valprediction=RandomClassifier(traindata,trainlabel,valdata)
valAccuracy=Accuracy(vallabel,valprediction)
print("Validation accuracy using random classifier is: ",valAccuracy)

Validation accuracy using nearest neighbour is:  0.34108527131782945
Validation accuracy using random classifier is:  0.1688468992248062


In [45]:
traindata,trainlabel,valdata,vallabel=split(alltraindata,alltrainlabel,75/100)
valpred=NN(traindata,trainlabel,valdata)
valAccuracy=Accuracy(vallabel,valpred)
print("Validation accuracy of nearest neighbour is: ", valAccuracy)

Validation accuracy of nearest neighbour is:  0.34048257372654156


In [49]:
testprediction=NN(alltraindata,alltrainlabel,testdata)
testAccuracy=Accuracy(testlabel,testprediction)
print('Test accuracy is: ', testAccuracy)

Test accuracy is:  0.34917953667953666


**1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?**

**Ans:** Increasing the percentage of the validation set gives a more reliable (lower-variance) estimate of the model's performance, because the accuracy is averaged over more samples. It also leaves less data for training, so **nearest neighbour** has fewer reference points and its validation accuracy tends to drop slightly. Reducing the validation set does the opposite: the model sees more training data, but the accuracy estimate becomes noisier. The **random classifier** guesses uniformly among the classes, so its accuracy stays at chance level whatever the split.
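
The trade-off can be explored with a small sweep over the train fraction. This is a sketch under stated assumptions: the data are synthetic Gaussian blobs rather than the housing dataset, and `nn_predict` is a hypothetical vectorised 1-nearest-neighbour helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_predict(traindata, trainlabel, testdata):
    # Hypothetical vectorised 1-nearest-neighbour classifier
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]

# Synthetic stand-in data (assumption): three overlapping 2-D Gaussian blobs
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, 3, size=600)
points = centers[labels] + rng.normal(size=(600, 2))

results = {}
for train_fraction in (0.1, 0.5, 0.9):
    accs = []
    for _ in range(10):  # ten random splits per setting
        mask = rng.random(len(labels)) < train_fraction
        pred = nn_predict(points[mask], labels[mask], points[~mask])
        accs.append((pred == labels[~mask]).mean())
    # mean and spread of the validation accuracy for this train fraction
    results[train_fraction] = (float(np.mean(accs)), float(np.std(accs)))
    print(train_fraction, results[train_fraction])
```

The mean shows how accuracy changes with the amount of training data, and the standard deviation shows how noisy a single split is at that validation size.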

**2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?**

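**Ans:** A larger train set gives the model more examples to learn from, so the model evaluated on the validation set behaves more like the final model that is trained on all the non-test data. A larger validation set gives a less noisy accuracy estimate. Because both sets are carved from the same pool, enlarging one shrinks the other: too small a train set makes the validation accuracy underestimate the test accuracy, while too small a validation set makes it fluctuate from split to split.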
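
A rough feel for the noise in a validation accuracy comes from the binomial standard error, sqrt(p(1-p)/n). The sketch below assumes independent validation samples and uses figures close to those in this notebook (accuracy near 0.34, a pool of 16496 samples); `accuracy_standard_error` is a hypothetical helper.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    # Binomial standard error of an accuracy measured on n_samples
    # independent samples (an approximation, not an exact guarantee)
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

pool = 16496  # samples available for train + validation
for val_fraction in (0.05, 0.25, 0.60):
    n_val = round(pool * val_fraction)
    se = accuracy_standard_error(0.34, n_val)
    print(f"validation fraction {val_fraction:.2f}: n={n_val}, standard error ~ {se:.4f}")
```

Under these assumptions the standard error falls only with the square root of the validation size, so very large validation sets buy little extra precision.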

**3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?**

**Ans:** For both nearest neighbour and the random classifier, a good percentage to reserve for the validation set typically falls in the range of 10% to 30%. The 25% used in this notebook lies within that range.

In [53]:
traindata,trainlabel,valdata,vallabel=split(alltraindata,alltrainlabel,40/100)
valprediction=NN(traindata,trainlabel,valdata)
valAccuracy=Accuracy(vallabel,valprediction)
print("Validation accuracy using nearest neighbour is: ",valAccuracy)
valprediction=RandomClassifier(traindata,trainlabel,valdata)
valAccuracy=Accuracy(vallabel,valprediction)
print("Validation accuracy using random classifier is: ",valAccuracy)

Validation accuracy using nearest neighbour is:  0.3348187158193235
Validation accuracy using random classifier is:  0.16832084261697386


In [54]:
testprediction=NN(alltraindata,alltrainlabel,testdata)
testAccuracy=Accuracy(testlabel,testprediction)
print('Test accuracy is: ', testAccuracy)

Test accuracy is:  0.34917953667953666


In [51]:
def AverageAccuracy(alldata,alllabel,splitpercent,iterations,classifier=NN):
  # mean validation accuracy over `iterations` independent random splits
  accuracy = 0
  for i in range(iterations):
    traindata,trainlabel,valdata,vallabel=split(alldata,alllabel,splitpercent)
    valprediction=classifier(traindata,trainlabel,valdata)
    accuracy+=Accuracy(vallabel,valprediction)
  return accuracy/iterations

In [52]:
print('Average validation accuracy is: ',AverageAccuracy(alltraindata,alltrainlabel,75/100,10,classifier=NN))
testprediction=NN(alltraindata,alltrainlabel,testdata)
print('Test accuracy is: ',Accuracy(testlabel,testprediction) )

Average validation accuracy is:  0.3359366875267045
Test accuracy is:  0.34917953667953666


**1.** **Does averaging the validation accuracy across multiple splits give more consistent results?**

**Ans:** Yes. Averaging the validation accuracy across multiple random splits (repeated random subsampling, a form of cross-validation) gives a more stable estimate of the model's performance, because it reduces the influence of any single split's randomness. The averaged figure is therefore a more consistent indication of how the model will generalise to previously unseen data.


**2.** **Does it give a more accurate estimate of test accuracy?**

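**Ans:** Mostly yes. Averaging lowers the variability associated with a single data split, so the result is a more reliable prediction of how the model will perform on new, unseen data. It does not remove systematic bias: each validation model is trained on only part of the available data, so the averaged validation accuracy can sit slightly below the test accuracy of the model trained on all of it, which is consistent with the 0.336 versus 0.349 obtained above.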

**3.** **What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?**

**Ans:** Here the number of iterations is the number of random train/validation splits that are averaged, not the number of training steps; the nearest neighbour classifier has no iterative training. More iterations reduce the variance of the averaged estimate, so repeated runs agree more closely with each other.

However, the gain has diminishing returns. The spread of the average shrinks roughly with the square root of the number of iterations, so going from 1 to 10 splits helps far more than going from 10 to 100.

Therefore, higher iterations give a more stable estimate, but not an unbiased one. Any systematic gap caused by training on only part of the data remains, and the computation time grows linearly with the number of iterations.
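
The effect of the number of iterations on the stability of the estimate can be checked empirically. The following is a sketch on synthetic data, not on the housing dataset; `nn_predict` and `average_accuracy` are hypothetical helpers, and the blob centres and sample counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_predict(traindata, trainlabel, testdata):
    # Hypothetical vectorised 1-nearest-neighbour classifier
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]

# Synthetic stand-in data (assumption): three overlapping 2-D Gaussian blobs
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, 3, size=400)
points = centers[labels] + rng.normal(size=(400, 2))

def average_accuracy(iterations, train_fraction=0.75):
    # Mean validation accuracy over `iterations` random splits
    accs = []
    for _ in range(iterations):
        mask = rng.random(len(labels)) < train_fraction
        pred = nn_predict(points[mask], labels[mask], points[~mask])
        accs.append((pred == labels[~mask]).mean())
    return float(np.mean(accs))

# Repeat the whole averaging procedure 30 times and measure how much
# the averaged estimate itself varies from run to run
spread = {}
for iterations in (1, 16):
    estimates = [average_accuracy(iterations) for _ in range(30)]
    spread[iterations] = float(np.std(estimates))
    print(iterations, spread[iterations])
```

With 16 splits per estimate the run-to-run spread should be several times smaller than with a single split, while the level of the estimate stays about the same.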

**4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?**

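**Ans:** Only partly. More iterations average out the noise that comes from a very small validation set, so the estimate becomes more stable. They cannot make up for a very small training set: each model is still trained on too few samples, so its accuracy stays low and the validation estimate remains pessimistic compared with a model trained on more data. A careful balance between train and validation set sizes is still necessary to get accurate findings.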
