**Project Description :-**

Lets imagine that you are working in the recycling division of the department of energy. The government has requested your division to identify the types of plastics found in the sea so that they can take necessary actions to reduce or prevent their usage. You are required to find the type of plastic, given its remains from the sea

Assuming that there are only six different types d plastics. The compositions of plastics vary over a specific range of values. You are provided with the data regarding the range of composition percentages, properties and the corresponding types of plastic.

In [1]:
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
X = np.genfromtxt("train_X_knn.csv", delimiter=',', dtype=np.float64, skip_header=1)
Y = np.genfromtxt("train_Y_knn.csv", delimiter=',', dtype=int)

In [3]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7, random_state=1)

In [4]:
"""
Returns:
ln norm distance
"""
def compute_ln_norm_distance(vector1, vector2, n):
    sum = 0
    for i in range(len(vector2)):
        sum = sum + abs((vector2[i]-vector1[i])**(n))
    return (sum)**(1/n)

In [5]:
"""
Returns:
Indices of the 1st k- nearest neighbors in train_X, in the order with nearest first.
"""
def find_k_nearest_neighbors(train_X, test_example, k, n):
    indices_dist_pairs = []
    index= 0
    for train_elem_x in train_X:
        distance = compute_ln_norm_distance(train_elem_x, test_example, n)
        indices_dist_pairs.append([index, distance])
        index += 1
    indices_dist_pairs.sort(key = lambda x: x[1])
    top_k_pairs = indices_dist_pairs[:k]
    top_k_indices = [i[0] for i in top_k_pairs]
    return top_k_indices

In [6]:
"""
Returns:
Classified points using knn method
"""
def classify_points_using_knn(train_X, train_Y, test_X, k, n):
    classified_Y = []
    for test_elem_x in test_X:
        top_k_nn_indices = find_k_nearest_neighbors(train_X, test_elem_x, k, n)
        top_knn_labels = []

        for i in top_k_nn_indices:
            top_knn_labels.append(train_Y[i])
        Y_values = list(set(top_knn_labels))

        max_count = 0
        most_frequent_label = -1
        for y in Y_values:
            count = top_knn_labels.count(y)
            if(count > max_count):
                max_count = count
                most_frequent_label = y

        classified_Y.append(most_frequent_label)
    return np.array(classified_Y)

In [7]:
"""
Returns:
Calculates accuracy of the model.
"""
def calculate_accuracy(predicted_Y, actual_Y):
    count = 0
    for i in range(len(predicted_Y)):
        if (predicted_Y[i] == actual_Y[i]):
            count += 1
    return (count/len(actual_Y))

In [8]:
"""
Returns K value based on validation data.
"""
def best_k_and_best_n_value_using_validation_set(train_X, train_Y, validation_split_percent):
    import math 
    total_num_of_observations = len(train_X)
    train_length = math.floor((100 - validation_split_percent)/100 * total_num_of_observations )
    validation_X = train_X[train_length :]
    validation_Y = train_Y[train_length :]
    train_X = train_X[0 : train_length]
    train_Y = train_Y[0 : train_length]
 
    best_k = -1
    best_accuracy = 0
    best_N = -1
    for n in range(1, 20):
        for k in range(1, train_length+1):
            predicted_Y = classify_points_using_knn(train_X, train_Y, validation_X, n, k)
            accuracy = calculate_accuracy(predicted_Y, validation_Y)
            if accuracy > best_accuracy:
                best_k, best_N = k, n
                best_accuracy = accuracy

    return best_k, best_N

In [13]:
best_K, best_n = best_k_and_best_n_value_using_validation_set(X_train, Y_train, validation_split_percent = 70)
print("Best value of K and N using cross validation : ", best_K,"and", best_n)

Best value of K and N using cross validation :  1 and 6


In [14]:
"""
Returns the classified value.
"""
def predict_target_values(X, Y, best_K, best_n):
    test_X = X
    X = X.tolist()
    Y = Y.tolist()
    predicted_Y = classify_points_using_knn(X, Y, test_X, best_K, best_n)
    return predicted_Y

In [15]:
pred = predict_target_values(X_test, Y_test, best_K, best_n)

In [16]:
print(calculate_accuracy(pred, Y_test))

1.0
