### Chapter 6, Q1(a) Show empirically that the information limit of 2 prediction bits per parameter also holds for nearest neighbors.

In [34]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

Below is a diabetes dataset. As displayed, it has 8 named feature columns plus the `Outcome` label, with row indices running from 0 to 766; the experiment below uses up to the first 7 columns as features. I am using a real-world dataset rather than randomly generated data so that the results reflect a realistic scenario.

In [35]:
df=pd.read_csv("diabetes.csv")
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


Now I will demonstrate that the (expected) information limit of 2 prediction bits per memorized point holds for KNN. <br>
To do that, I will increase the dimensionality, following the function counting theorem. In theory, as the dimensionality increases, the model's ability to generalize decreases. In other words, it increasingly memorizes the data points, to the extent that if I reduce the dataset by half, the accuracy of the model should remain roughly unchanged. <br>

In [36]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

# Read the dataset
df = pd.read_csv("diabetes.csv")

# Initialize lists to store accuracies for each variation
accuracies_full_dataset = []
accuracies_half_dataset = []

# Set the range of columns to consider (from 1 to 7)
for num_columns in range(1, 8):
    # Select the first 'num_columns' columns as features
    x = df.iloc[:, :num_columns]
    y = df['Outcome']
    
    # Scale the features using Min-Max scaling
    scaler = MinMaxScaler()
    x = scaler.fit_transform(x)
    
    # Split the dataset into training and testing sets for the entire dataset
    x_train_full, x_test_full, y_train_full, y_test_full = train_test_split(x, y, test_size=0.2, shuffle=True, random_state=42)
    
    # Initialize a KNN classifier with 3 neighbors (the reduced-dataset model below uses 15)
    model_full = KNeighborsClassifier(n_neighbors=3)
    
    # Train the KNN model on the entire dataset
    model_full.fit(x_train_full, y_train_full.values.ravel())  # Flattening y
    
    # Make predictions on the test set for the entire dataset
    y_pred_full = model_full.predict(x_test_full)
    
    # Calculate the accuracy of the model for the entire dataset
    accuracy_full = accuracy_score(y_test_full, y_pred_full)
    
    # Store the accuracy for the entire dataset
    accuracies_full_dataset.append(accuracy_full)

    # Calculate the accuracy for the dataset with samples reduced by half
    # Randomly select half of the samples
    df_half = df.sample(frac=0.5, random_state=42)
    
    # Select the first 'num_columns' columns as features for the reduced dataset
    x_half = df_half.iloc[:, :num_columns]
    y_half = df_half['Outcome']
    
    # Scale the features using Min-Max scaling for the reduced dataset
    x_half = scaler.transform(x_half)
    
    # Split the reduced dataset into training and testing sets
    x_train_half, x_test_half, y_train_half, y_test_half = train_test_split(x_half, y_half, test_size=0.2, shuffle=True, random_state=42)
    
    # Initialize a KNN classifier with 15 neighbors for the reduced dataset
    model_half = KNeighborsClassifier(n_neighbors=15)
    
    # Train the KNN model on the reduced dataset
    model_half.fit(x_train_half, y_train_half.values.ravel())  # Flattening y
    
    # Make predictions on the test set for the reduced dataset
    y_pred_half = model_half.predict(x_test_half)
    
    # Calculate the accuracy of the model for the reduced dataset
    accuracy_half = accuracy_score(y_test_half, y_pred_half)
    
    # Store the accuracy for the reduced dataset
    accuracies_half_dataset.append(accuracy_half)

    # Print the results for the current iteration
    print(f"Number of Columns: {num_columns}")
    print(f"Accuracy - n_full: {accuracy_full}")
    print(f"Accuracy - n_half: {accuracy_half}\n")

Number of Columns: 1
Accuracy - n_full: 0.6103896103896104
Accuracy - n_half: 0.6623376623376623

Number of Columns: 2
Accuracy - n_full: 0.6948051948051948
Accuracy - n_half: 0.7272727272727273

Number of Columns: 3
Accuracy - n_full: 0.7532467532467533
Accuracy - n_half: 0.7012987012987013

Number of Columns: 4
Accuracy - n_full: 0.6753246753246753
Accuracy - n_half: 0.7012987012987013

Number of Columns: 5
Accuracy - n_full: 0.6883116883116883
Accuracy - n_half: 0.7012987012987013

Number of Columns: 6
Accuracy - n_full: 0.6623376623376623
Accuracy - n_half: 0.6883116883116883

Number of Columns: 7
Accuracy - n_full: 0.7077922077922078
Accuracy - n_half: 0.6883116883116883



#### The results show that at d = 5, 6, 7 the accuracy of **n_full and n_half is roughly the same** (within about 3 percentage points). Hence at d = 5, 6, 7, n_full/n_half = 2 is consistent with the limit of 2 prediction bits per memorized point.
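
Each accuracy above comes from a single train/test split, and the two models use different `n_neighbors` values (3 and 15), so gaps of a few points are hard to interpret. The following is a minimal sketch, not the method used above, of how the full-versus-half comparison could be repeated over several random splits with the same k. It uses a small NumPy k-NN on synthetic stand-in data; the function names `knn_predict`, `holdout_accuracy`, and `full_vs_half`, the repeat count, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Binary (0/1) k-NN prediction by majority vote, Euclidean distance."""
    # Squared distances between every test point and every training point
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]
    # An odd k avoids ties for binary labels
    return (votes.mean(axis=1) > 0.5).astype(int)

def holdout_accuracy(X, y, k, rng, test_frac=0.2):
    """Accuracy on one random hold-out split."""
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(test_frac * len(X)))
    test, train = idx[:n_test], idx[n_test:]
    pred = knn_predict(X[train], y[train], X[test], k)
    return float((pred == y[test]).mean())

def full_vs_half(X, y, k=3, n_repeats=20, seed=0):
    """Mean hold-out accuracy for the full data and for random halves."""
    rng = np.random.default_rng(seed)
    full, half = [], []
    for _ in range(n_repeats):
        full.append(holdout_accuracy(X, y, k, rng))
        keep = rng.choice(len(X), size=len(X) // 2, replace=False)
        half.append(holdout_accuracy(X[keep], y[keep], k, rng))
    return float(np.mean(full)), float(np.mean(half))

# Synthetic stand-in data (assumption): 400 points, 5 features, noisy labels
data_rng = np.random.default_rng(42)
X = data_rng.random((400, 5))
y = (X[:, 0] + 0.3 * data_rng.standard_normal(400) > 0.5).astype(int)

acc_full, acc_half = full_vs_half(X, y, k=3)
print(f"mean accuracy, full: {acc_full:.3f} | half: {acc_half:.3f}")
```

Replacing `X` and `y` with the scaled diabetes features and `Outcome` values would give averaged accuracies for each choice of `num_columns`, with both models sharing one value of k.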

### (b) Extend your experiments to multi-class classification.

We will now explore a dataset that demonstrates multi-class classification. `animal_class.csv` lists 7 classes, which are the species classes of the animals, and every animal in `zoo.csv` is assigned one of them.

In [37]:
df_class = pd.read_csv("animal_class.csv")
df_animals = pd.read_csv("zoo.csv")
df_class

Unnamed: 0,Class_Number,Number_Of_Animal_Species_In_Class,Class_Type,Animal_Names
0,1,41,Mammal,"aardvark, antelope, bear, boar, buffalo, calf,..."
1,2,20,Bird,"chicken, crow, dove, duck, flamingo, gull, haw..."
2,3,5,Reptile,"pitviper, seasnake, slowworm, tortoise, tuatara"
3,4,13,Fish,"bass, carp, catfish, chub, dogfish, haddock, h..."
4,5,4,Amphibian,"frog, frog, newt, toad"
5,6,8,Bug,"flea, gnat, honeybee, housefly, ladybird, moth..."
6,7,10,Invertebrate,"clam, crab, crayfish, lobster, octopus, scorpi..."


In [50]:
df_animals

Unnamed: 0,animal_name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,class_type
0,aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,wallaby,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,1,1
97,wasp,1,0,1,0,1,0,0,0,0,1,1,0,6,0,0,0,6
98,wolf,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
99,worm,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,7


In [49]:
X=df_animals.iloc[:, 1: 17].values
y = df_animals.iloc[:, 17].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2) 
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy of Multi-Class KNN Classification:",accuracy_score(y_test, y_pred))

Accuracy of Multi-Class KNN Classification: 1.0


The `df_animals` dataset has 16 feature dimensions describing the animals, and the output column y takes one of **7 classes**. <br>

Now I am going to vary the dimensionality. For each d = 4, 8, 12, 16, I check the accuracy on the entire dataset and on a copy with half of the samples removed at random. In theory, as the dimensionality increases, the accuracy should be almost the same for the entire dataset and for half the dataset, achieving **2 prediction bits per parameter**.

In [51]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Define the dimensions to test
dimensions_to_test = [4, 8, 12, 16]

for dim in dimensions_to_test:
    # Assuming df_animals is your DataFrame
    X = df_animals.iloc[:, 1: dim+1].values  # Selecting features
    y = df_animals.iloc[:, 17].values  # Target variable
    
    # Split data into training and testing sets for entire dataset
    X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y, test_size=0.2, random_state=0)
    
    # Initialize and train KNN classifier for entire dataset
    clf_full = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2) 
    clf_full.fit(X_train_full, y_train_full)
    
    # Make predictions on the test set for entire dataset
    y_pred_full = clf_full.predict(X_test_full)
    
    # Calculate accuracy for entire dataset
    accuracy_full = accuracy_score(y_test_full, y_pred_full)
    
    # Halve the sample values by random selection
    indices_to_keep = np.random.choice(X.shape[0], size=X.shape[0]//2, replace=False)
    X_half = X[indices_to_keep]
    y_half = y[indices_to_keep]
    
    # Split data into training and testing sets for half dataset
    X_train_half, X_test_half, y_train_half, y_test_half = train_test_split(X_half, y_half, test_size=0.2, random_state=0)
    
    # Initialize and train KNN classifier for half dataset (k=1 here, versus k=5 for the full dataset)
    clf_half = KNeighborsClassifier(n_neighbors=1, metric="minkowski", p=2) 
    clf_half.fit(X_train_half, y_train_half)
    
    # Make predictions on the test set for half dataset
    y_pred_half = clf_half.predict(X_test_half)
    
    # Calculate accuracy for half dataset
    accuracy_half = accuracy_score(y_test_half, y_pred_half)
    
    # Print the accuracies for both entire and half datasets
    print(f"Accuracy for {dim} dimensions (n_full): {accuracy_full:.4f} | Accuracy for {dim} dimensions (n_half): {accuracy_half:.4f}")


Accuracy for 4 dimensions (n_full): 0.7619 | Accuracy for 4 dimensions (n_half): 0.9000
Accuracy for 8 dimensions (n_full): 0.8095 | Accuracy for 8 dimensions (n_half): 0.9000
Accuracy for 12 dimensions (n_full): 0.9524 | Accuracy for 12 dimensions (n_half): 0.8000
Accuracy for 16 dimensions (n_full): 1.0000 | Accuracy for 16 dimensions (n_half): 0.9000
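
The halving step above draws indices with `np.random.choice` without a seed, so rerunning the cell can give different n_half accuracies. As a hedged sketch, one way to make the halving reproducible is a seeded NumPy generator; the helper name `half_indices` and the sample count of 101 are assumptions for illustration.

```python
import numpy as np

def half_indices(n_samples, seed):
    """Return a reproducible random selection of half the row indices."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_samples, size=n_samples // 2, replace=False)

# The same seed always yields the same half
first = half_indices(101, seed=0)
second = half_indices(101, seed=0)

print(len(first), len(np.unique(first)))  # 50 50
print(np.array_equal(first, second))      # True
```

Using the same seed for every value of `dim` would also mean each dimensionality is evaluated on the same half of the samples.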


#### As the results above show, when d=16 the accuracy of **n_full and n_half is close** (1.0000 versus 0.9000). Hence at d = 16, n_full/n_half = 2 is consistent with the limit of 2 prediction bits per memorized point.
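
To judge how close two accuracies are, it helps to know the smallest step accuracy can take on each test set. The sketch below computes that step, assuming `zoo.csv` has 101 rows; this row count is an assumption, not stated above, though it is consistent with the reported 0.7619 (16/21) and 0.9000 (9/10). The helper name `holdout_size` is hypothetical.

```python
import math

def holdout_size(n_samples, test_frac=0.2):
    # For a fractional test_size, scikit-learn's train_test_split
    # rounds the number of test samples up.
    return math.ceil(test_frac * n_samples)

n_rows = 101  # assumed row count of zoo.csv (not stated in the text)

n_test_full = holdout_size(n_rows)
n_test_half = holdout_size(n_rows // 2)

# One misclassified sample changes accuracy by 1 / n_test
print(f"full: {n_test_full} test samples, step {1 / n_test_full:.4f}")
print(f"half: {n_test_half} test samples, step {1 / n_test_half:.4f}")
```

Under this assumption, the 0.10 gap at d=16 corresponds to a single misclassified sample in the half-dataset test set.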