#### Project Description: 
The task is to predict the material of a mug (one of three material types) from four measurements: height, diameter, weight, and hue (color).

In [None]:
# Run this code to install the required libraries
!pip install pandas
!pip install scikit-learn

In [2]:
# Import necessary libraries
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

a) Using Cartesian (Euclidean) distance as the similarity measure, show the material type predictions for the evaluation data generated for a), based on the corresponding generated training data, for K = 1, 3, and 5.

In [3]:
# a) Load the data
train = pd.read_csv('train_dataset.csv')
test = pd.read_csv('test_dataset.csv')

# Encode the material types (class labels)
le = LabelEncoder()
train['class'] = le.fit_transform(train['class'])

# Split the data into features and targets
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]
X_test = test

In [4]:
# Printing train data head
pd.read_csv('train_dataset.csv').head()

Unnamed: 0,height,diameter,weight,hue,class
0,0.07909,0.062436,0.348144,5.06225,Ceramic
1,0.083359,0.067004,0.17466,3.042573,Plastic
2,0.13942,0.141117,0.44522,4.199028,Plastic
3,0.145907,0.146399,0.526784,1.492253,Plastic
4,0.107969,0.110693,0.531342,3.126876,Metal


In [5]:
# Printing Test data head
pd.read_csv('test_dataset.csv').head()

Unnamed: 0,height,diameter,weight,hue
0,0.095138,0.091299,0.519574,2.532808
1,0.073791,0.03,0.163126,2.109131
2,0.073555,0.075702,0.273573,0.196916
3,0.119338,0.15,0.450114,2.754943


b) Implement the KNN algorithm for this problem. It should work with different training data sets and allow a data point to be entered for prediction.

In [6]:
# Build and train the KNN model
for k in [1, 3, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Predict on test data
    y_pred = knn.predict(X_test)
    print(f"Predictions with k={k}: {le.inverse_transform(y_pred)}")

Predictions with k=1: ['Ceramic' 'Plastic' 'Plastic' 'Ceramic']
Predictions with k=3: ['Ceramic' 'Plastic' 'Plastic' 'Ceramic']
Predictions with k=5: ['Metal' 'Plastic' 'Plastic' 'Metal']
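
The cell above relies on scikit-learn's `KNeighborsClassifier`. For readers who want to see the algorithm itself, here is a minimal from-scratch sketch that accepts any training set and a single query point. The function name `knn_predict` and its parameters are hypothetical, and the toy data is only the five training rows and the first test row shown in the previews above, so its predictions need not match the full-dataset output.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy subset: (height, diameter, weight, hue) for the five previewed training rows
toy_points = [
    (0.07909, 0.062436, 0.348144, 5.06225),
    (0.083359, 0.067004, 0.17466, 3.042573),
    (0.13942, 0.141117, 0.44522, 4.199028),
    (0.145907, 0.146399, 0.526784, 1.492253),
    (0.107969, 0.110693, 0.531342, 3.126876),
]
toy_labels = ["Ceramic", "Plastic", "Plastic", "Plastic", "Metal"]

# First previewed test row used as the query point
query_point = (0.095138, 0.091299, 0.519574, 2.532808)

for k in [1, 3, 5]:
    print(f"k={k}: {knn_predict(toy_points, toy_labels, query_point, k)}")
```

Ties in the vote are resolved by whichever label `Counter` encountered first, which is a simplification; a fuller implementation might break ties by distance.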


c) To evaluate the performance of the KNN algorithm, implement a leave-one-out evaluation routine. In leave-one-out validation, we repeatedly evaluate the algorithm by removing one data point from the training set, training the algorithm on the remaining data, and then testing it on the removed point to see whether the predicted label matches. Repeating this for every data point gives an estimate of the percentage of erroneous predictions the algorithm makes, and thus a measure of its accuracy on the given data. Apply leave-one-out validation with the KNN algorithm to the dataset for c) for K = 1, 3, and 5 and report the results. For which value of K do you get the best performance?

In [11]:
# c) Leave-one-out validation
loo = LeaveOneOut()

data = pd.read_csv('program_dataset.csv')
X = data.iloc[:, :-1]
# Encode labels with the encoder fitted on the training data; a Series allows .iloc indexing
y = pd.Series(le.transform(data.iloc[:, -1]))

for k in [1, 3, 5]:
    y_pred = list()
    for train_index, test_index in loo.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)

        y_hat = knn.predict(X_test)
        y_pred.append(y_hat[0])
        
    # Calculate accuracy for each k
    accuracy = accuracy_score(y, y_pred)
    print(f"Leave-one-out cross-validation accuracy with k={k}: {accuracy}")


Leave-one-out cross-validation accuracy with k=1: 0.4
Leave-one-out cross-validation accuracy with k=3: 0.425
Leave-one-out cross-validation accuracy with k=5: 0.4166666666666667
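
The explicit loop above can also be written more compactly with scikit-learn's `cross_val_score`, which scores each held-out point as 1.0 (correct) or 0.0 (wrong), so the mean of the scores is the leave-one-out accuracy. The sketch below is an alternative formulation, not the routine used for the reported numbers. The helper name `loo_accuracy` is hypothetical, and the random arrays are placeholders standing in for `program_dataset.csv`.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def loo_accuracy(X, y, k, p=2):
    """Leave-one-out accuracy of k-NN (p=2 is Euclidean, p=1 is Manhattan)."""
    knn = KNeighborsClassifier(n_neighbors=k, p=p)
    # One fold per sample; each fold's score is 0.0 or 1.0
    scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
    return scores.mean()

# Placeholder data (assumption): 30 points, 4 features, 3 classes
rng = np.random.default_rng(0)
X_demo = rng.random((30, 4))
y_demo = rng.integers(0, 3, size=30)

for k in [1, 3, 5]:
    print(f"k={k}: {loo_accuracy(X_demo, y_demo, k):.3f}")
```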


d) Modify the KNN algorithm to use Manhattan distance (L1) as the similarity measure and repeat the experiment from part c). Which similarity measure gives better performance for each value of K?

In [14]:
# d) KNN with Manhattan distance (p=1 in Minkowski distance)
for k in [1, 3, 5]:
    y_pred = list()
    for train_index, test_index in loo.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        knn = KNeighborsClassifier(n_neighbors=k, p=1)
        knn.fit(X_train, y_train)

        y_hat = knn.predict(X_test)
        y_pred.append(y_hat[0])

    # Calculate accuracy for each k
    accuracy = accuracy_score(y, y_pred)
    print(f"Leave-one-out cross-validation accuracy with Manhattan distance and k={k}: {accuracy}")

Leave-one-out cross-validation accuracy with Manhattan distance and k=1: 0.4583333333333333
Leave-one-out cross-validation accuracy with Manhattan distance and k=3: 0.44166666666666665
Leave-one-out cross-validation accuracy with Manhattan distance and k=5: 0.45
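
In `KNeighborsClassifier`, the default metric is Minkowski distance with exponent `p`: `p=2` gives Euclidean distance and `p=1` gives Manhattan distance. As a small illustration of how the two differ, the sketch below computes both for the first previewed training row and the first previewed test row. The helper `minkowski_distance` is a hypothetical name written for this example, not a scikit-learn function.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum of |difference|^p over all features, then the p-th root
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# (height, diameter, weight, hue) taken from the data previews above
train_row = [0.07909, 0.062436, 0.348144, 5.06225]
test_row = [0.095138, 0.091299, 0.519574, 2.532808]

manhattan = minkowski_distance(train_row, test_row, p=1)
euclidean = minkowski_distance(train_row, test_row, p=2)
print(f"Manhattan (p=1): {manhattan:.4f}")
print(f"Euclidean (p=2): {euclidean:.4f}")
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, and it weights each per-feature difference linearly rather than quadratically, which can change which neighbours count as nearest.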


e) Repeat the prediction experiment from part c) using KNN and Cartesian distance when the fourth attribute in the data is removed (i.e. when only the first three features in the input data are available). Which data gives better predictions?

In [31]:
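# e) KNN with only three features
# NOTE: iloc[:, :3] selects the first three columns by position; this assumes they are
# height, diameter, and weight (i.e. no leading index column in the DataFrame).
X_train = train.iloc[:, :3]
y_train = train.iloc[:, -1]  # Redefine y_train (overwritten by the leave-one-out loops above)
X_test = test.iloc[:, :3]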

for k in [1, 3, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    # We can't compute the accuracy in e) because we don't have the true labels for the test data.
    print(f"Predictions with 3 features (height, diameter, weight) and k={k}: {le.inverse_transform(y_pred)}")


Predictions with 3 features (height, diameter, weight) and k=1: ['Metal' 'Ceramic' 'Metal' 'Plastic']
Predictions with 3 features (height, diameter, weight) and k=3: ['Plastic' 'Plastic' 'Ceramic' 'Plastic']
Predictions with 3 features (height, diameter, weight) and k=5: ['Plastic' 'Ceramic' 'Ceramic' 'Plastic']
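
Because the test data has no true labels, the predictions above cannot be scored. One way to compare the four-feature and three-feature inputs, as part e) asks, is to rerun the leave-one-out evaluation from part c) on labeled data with and without the hue column. The sketch below is a minimal illustration under stated assumptions: the helper `loo_accuracy_for_features` is a hypothetical name, the random DataFrame is a placeholder for `program_dataset.csv` with the same column names, and features are selected by name so that column order does not matter.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def loo_accuracy_for_features(data, feature_names, k, label_column="class"):
    """Leave-one-out accuracy of k-NN (Euclidean) using only the named feature columns."""
    X = data[feature_names]  # select by name, not by position
    y = data[label_column]
    knn = KNeighborsClassifier(n_neighbors=k)
    # cross_val_predict returns one held-out prediction per row
    y_pred = cross_val_predict(knn, X, y, cv=LeaveOneOut())
    return accuracy_score(y, y_pred)

# Placeholder frame (assumption): random values, same column names as the previews
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "height": rng.uniform(0.07, 0.15, 30),
    "diameter": rng.uniform(0.03, 0.15, 30),
    "weight": rng.uniform(0.15, 0.55, 30),
    "hue": rng.uniform(0.0, 6.28, 30),
    "class": rng.choice(["Ceramic", "Metal", "Plastic"], 30),
})

all_features = ["height", "diameter", "weight", "hue"]
for k in [1, 3, 5]:
    with_hue = loo_accuracy_for_features(demo, all_features, k)
    without_hue = loo_accuracy_for_features(demo, all_features[:3], k)
    print(f"k={k}: four features {with_hue:.3f}, three features {without_hue:.3f}")
```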
