# Hierarchical Clustering for Seed Categorization

================================================================================================================================

#### Implementation Steps:
1. Import necessary libraries and modules.
2. Define helper functions for distance calculation, k-NN classification, and accuracy metric.
3. Define a function `splits_CV()` to create k-folds from a given dataset.
4. Define a function `k_Fold()` that performs k-fold cross-validation and returns the average accuracy.
5. Define a function `dataset_KNN()` that performs the following steps:
    a. Remove the target column from the input DataFrame.
    b. Shuffle the DataFrame.
    c. Split the DataFrame into train and test sets.
    d. Apply Agglomerative Clustering to the train set.
    e. Calculate the distance of data points to the centroids of the clusters.
    f. Perform min-max scaling on the calculated distances.
    g. Return the modified train and test sets with the new features.
6. Load the dataset (Seed_Data.csv).
7. Perform a loop for each linkage type (single, complete, average) and then for each cluster number (3, 4, 5, 6, 7):
    a. Call `dataset_KNN()` to get the modified train and test sets.
    b. Save the modified train and test sets as .csv files.
    c. Perform a loop for each k-NN (3, 5, 7, 9, 11):
        i. Call `k_Fold()` to perform cross-validation on the train set and calculate the average accuracy.
        ii. Print the average accuracy.
8. Calculate the best scenario (best number of clusters) for each linkage type.
9. Use the best scenario (in this case, 3 clusters and 'complete' linkage) to preprocess the dataset, and obtain the modified train and test sets.
10. Apply k-NN (with k = 5) to the modified test set and obtain the predictions.
11. Apply k-NN to the original dataset:
    a. Shuffle the DataFrame.
    b. Split the DataFrame into train and test sets.
    c. Perform a loop for each k-NN (3, 5, 7, 9, 11):
        i. Call `k_Fold()` to perform cross-validation on the train set and calculate the average accuracy.
        ii. Print the average accuracy.
12. Apply k-NN (with k = 5) to the original test set and obtain the predictions.
13. Calculate the accuracy of the predictions for both the original and modified datasets using the `accuracy_metric()` function.

In [1]:
import numpy as np
import pandas as pd

## Agglomerative Clustering

In [2]:
class AgglomerativeClustering:
    def __init__(self,n_clusters=2,linkage="single"):
        
        self.n_clusters = n_clusters
        self.linkage = linkage

    def fit_predict(self,X):
        # Perform agglomerative clustering for input and returns cluster labels for each data items
        n=X.shape[0]
        d=self.d_matrix(X)
        cluster=self.get_initial_cluster(n)
        s=set(range(n))  
        for _ in range(n-self.n_clusters):
            p,q=np.unravel_index(np.argmin(d, axis=None), d.shape) # Find the indices of the smallest distance in distance matrix d
            t_set=s-{p,q} 
            d=self.update_d(d,p,q,t_set,self.linkage) 
            cluster=self.update_cluster(cluster,p,q) 
            s=s-{max(p,q)} # remove indext of merged set
        decor_l=[] # Store final clusters
        for v in cluster.values():
            decor_l.append(v)
        
        self.labels_= self.clustertolabels(decor_l)
        return self.labels_

    def clustertolabels(self,clusters):
        # Method takes the cluster and label to the data points
        ln = sum([len(c) for c in clusters])
        labels = np.zeros(ln,dtype = np.int)
        ind = -1
        for c in clusters:
            ind+=1
            for i in c:
                labels[i] = ind # Assign current cluster index as label 
        return labels


    def d_matrix(self,data):
        # Compute distance matrix
        n=data.shape[0]  
        d=np.empty(shape=[n,n]) 
        d.fill(np.inf)  
        for i in range(n-1):
            for j in range(i+1,n):
                d[i,j]=distance(data[i],data[j]) 
        return d

    
    def get_initial_cluster(self,n):
        # initialize cluster, with each points being its own cluster
        c={}
        for i in range(n):
            c[i]={i}   
        return c

   
    def update_d(self,d,p,q,t_set,linkage):
        # Update distance matrix after merging clusters p and q using linkage method
        for i in t_set: 
            u,v=min(i,p),max(i,p) 
            w,x=min(i,q),max(i,q)
            if(linkage=="complete"):
                t=max(d[u,v],d[w,x])
            elif(linkage=="average"):
                t=(d[u,v]+d[w,x])/2
            else:     
                t=min(d[u,v],d[w,x])
        
            d[u,v]=t
            d[w,x]=t
            
        m_pq=max(p,q)
        d[m_pq,:]=np.inf
        d[:,m_pq]=np.inf
        return d


    def update_cluster(self,c,p,q):
        # Update cluster after merging p and q
        i=c.pop(max(p,q)) 
        m=min(p,q)
        c[m]=c[m].union(i) # Merge two clusters by taking union of their sets of data points
        return c
    
def distance(pt1,pt2):
    # Compute the Euclidean distance between two points
        if(len(pt1)!=len(pt2)):
            print("Dimensions of the points are not equal")
            return  
        dim=len(pt1)  
        s=0
        for i in range(dim):
            s+=(pt1[i]-pt2[i])**2 
        dist=np.sqrt(s)  
        return dist

# KNN

In [3]:
def dist_cartesian(sample, inputs):
    #calculate cartesian distance between given set of samples and inpouts
    diff = sample - inputs
    sum_pow = np.sum(diff**2, axis=1)
    return sum_pow**0.5
    
def lbl_classify(k, sorted_labels):
    # Classify a point based on majority class labels amoung its k nearest neighbours
    k_neighbors = sorted_labels[:k]
    target = np.unique(k_neighbors)
    count = []
    for i in target:
        x = np.count_nonzero(k_neighbors == i)
        count.append(x)
    return target[np.argmax(count)]

def KNN_classification(sample, k, X, y):
    """
    sample: the point to be classified
    k: k numbers of neighbours
    X: the input dataset
    y: class label for input dataset
    """
    # Perform KNN classification for given sample point
    labels = list(y)
    inputs = list(X)
    cart_distance = dist_cartesian(sample, inputs)
    labeled_cart = np.vstack((cart_distance, labels))
    sorted_cart = labeled_cart.T[labeled_cart.T[:, 0].argsort()] #Transpose and sort by distance
    sorted_labels = sorted_cart.T[1]
    return lbl_classify(k, sorted_labels)

def accuracy_metric(actual, predicted):
    # Calculate accuracy of classification
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual))

# K fold

In [4]:
from random import randrange
def splits_CV(dataset_KNN, folds):
    # Split input dataset into fold number of equally sized subset
    dataset_KNN_split = []
    df_copy = dataset_KNN
    fold_size = int(df_copy.shape[0] / folds)
    for i in range(folds):
        fold = []
        while len(fold) < fold_size:
            r = randrange(df_copy.shape[0]) 
            index = df_copy.index[r]
            fold.append(df_copy.loc[index].values.tolist())
            df_copy = df_copy.drop(index) # Remove the row at index from df_copy to avoide selecting again
        dataset_KNN_split.append(np.asarray(fold))
    return dataset_KNN_split


In [5]:
def k_Fold(dataset_KNN, f, k):
    # Perform k-fold cross validation to evaluate performance of KNN
    data=splits_CV(dataset_KNN,f)
    result=[] # Store accuracy score for each fold
    for i in range(f):
        r = list(range(f))
        r.pop(i) # Remove current fold from index as test set
        for j in r :
            if j == r[0]:
                cv = data[j]
            else:    
                cv=np.concatenate((cv,data[j]), axis=0) # concatenate fold index with cv
  
        predictions = []
        for sample in data[i][:,:-1]: # Exclude last column which is true label
            prediction_1 = KNN_classification(sample, k, cv[:,:-1], cv[:,-1]) # cv[:,:-1] training features, cv[:,-1] training set labels
            predictions.append(prediction_1)
        acc = accuracy_metric(data[i][:,-1], predictions)   
        result.append(acc) 
    return result # Classification accuracy for each fold

In [6]:
def dataset_KNN(df, n, linkage):
    # Prepare input dataset for KNN classifier
    df = df.drop("target", axis=1) # Remove target column, only need of features column
    df = df.sample(frac=1).reset_index(drop = True) # shuffle datafram to avid bais in order

    train_set = df.iloc[:170, :].reset_index(drop = True)
    test_set = df.iloc[170:, :].reset_index(drop = True)

    clustering = AgglomerativeClustering(n_clusters=n, linkage=linkage)
    pred_clusters = clustering.fit_predict(train_set.values)
    
    train_set['labels'] = pred_clusters # add cluster labels to training set
    
    for i in range(n):
        indexes = np.where(train_set['labels'] == i) # Find index of samples in train set that belongs to current cluster
        df_i = train_set.iloc[indexes[0], :-1].reset_index(drop = True) #Create new datafram containing only samples in the current cluster
        centroid = list(df_i.mean()) # calculate centroid of current my taking mean of each features in df_i
        dataset_KNN = train_set.iloc[:,:-1]
        column_name = "cluster_feature_" + str(i)
        #distance of datapoints 
        train_set[column_name] = [np.sum(np.square(row-centroid)) for row in dataset_KNN.values]
        #min max scaling
        train_set[column_name] = train_set[column_name]/train_set[column_name].max()
        test_set[column_name] = [np.sum(np.square(row-centroid)) for row in test_set.values]
        test_set[column_name] = test_set[column_name]/test_set[column_name].max()
    
    train_set['labels'] = train_set.pop('labels')
    #return modified train and test set
    return train_set, test_set
  

In [7]:
df = pd.read_csv("Seed_Data.csv")

In [8]:
df.head()

Unnamed: 0,A,P,C,LK,WK,A_Coef,LKG,target
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22,0
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,0
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825,0
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,0
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,0


In [9]:
'''
numbers of clusters
number of neighbours
linkages(single, complete, average)
'''
n_clusters = [3,4,5,6,7]
knn = [3,5,7,9,11] 
linkages = ['single', 'complete', 'average']
Accuracy_linkage = []

In [10]:
for linkage in linkages:
    print("\u0332".join(f"Scenario for {linkage} Linkage :\n"))
    Accuracy = []
    for n in n_clusters:
        
        train_set, test_set = dataset_KNN(df, n, linkage)
        #df - dataframe
        #n - clusters
        #linkage = single, complete, averagee
        train_set.to_csv("train_set.csv")
        test_set.to_csv("test_set.csv")
        print(f"No of clusters :{n}")
        acc_clusters = []
        for k in knn:
            result = k_Fold(train_set, 8, k) #fold accuracy for 8-fold cross validation
            acc = sum(result)/len(result) #KNN accuracy based on  folds
            print(f"Accuracy for {n} clusters using {k} nearest data points: {acc}")
            acc_clusters.append(acc)
        Accuracy.append(sum(acc_clusters)/len(acc_clusters)) #average cluster accuracy
        acc = (sum(acc_clusters)/len(acc_clusters))*100 #percentage
        print(f"Accuracy : {acc} %   ") 
     
    print(f"Best scenario for no of clusters : {n_clusters[np.argmax(Accuracy)]} \n\n") # for best case 
    Accuracy_linkage.append(max(Accuracy))
    

S̲c̲e̲n̲a̲r̲i̲o̲ ̲f̲o̲r̲ ̲s̲i̲n̲g̲l̲e̲ ̲L̲i̲n̲k̲a̲g̲e̲ ̲:̲

No of clusters :3
Accuracy for 3 clusters using 3 nearest data points: 0.9880952380952381


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 3 clusters using 5 nearest data points: 0.9880952380952381
Accuracy for 3 clusters using 7 nearest data points: 0.9761904761904762
Accuracy for 3 clusters using 9 nearest data points: 0.9583333333333335
Accuracy for 3 clusters using 11 nearest data points: 0.9583333333333335
Accuracy : 97.3809523809524 %   
No of clusters :4
Accuracy for 4 clusters using 3 nearest data points: 0.9821428571428572


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 4 clusters using 5 nearest data points: 0.9821428571428572
Accuracy for 4 clusters using 7 nearest data points: 0.9702380952380952
Accuracy for 4 clusters using 9 nearest data points: 0.9523809523809524
Accuracy for 4 clusters using 11 nearest data points: 0.9523809523809524
Accuracy : 96.78571428571429 %   
No of clusters :5
Accuracy for 5 clusters using 3 nearest data points: 0.9761904761904762


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 5 clusters using 5 nearest data points: 0.9821428571428572
Accuracy for 5 clusters using 7 nearest data points: 0.9702380952380952
Accuracy for 5 clusters using 9 nearest data points: 0.9464285714285715
Accuracy for 5 clusters using 11 nearest data points: 0.9523809523809524
Accuracy : 96.54761904761905 %   
No of clusters :6
Accuracy for 6 clusters using 3 nearest data points: 0.9642857142857144


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 6 clusters using 5 nearest data points: 0.9642857142857143
Accuracy for 6 clusters using 7 nearest data points: 0.9642857142857144
Accuracy for 6 clusters using 9 nearest data points: 0.9345238095238095
Accuracy for 6 clusters using 11 nearest data points: 0.9345238095238095
Accuracy : 95.23809523809523 %   
No of clusters :7
Accuracy for 7 clusters using 3 nearest data points: 0.9642857142857143


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 7 clusters using 5 nearest data points: 0.9642857142857143
Accuracy for 7 clusters using 7 nearest data points: 0.9464285714285715
Accuracy for 7 clusters using 9 nearest data points: 0.9345238095238095
Accuracy for 7 clusters using 11 nearest data points: 0.9345238095238095
Accuracy : 94.88095238095238 %   
Best scenario for no of clusters : 3 


S̲c̲e̲n̲a̲r̲i̲o̲ ̲f̲o̲r̲ ̲c̲o̲m̲p̲l̲e̲t̲e̲ ̲L̲i̲n̲k̲a̲g̲e̲ ̲:̲



Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


No of clusters :3
Accuracy for 3 clusters using 3 nearest data points: 0.9821428571428572
Accuracy for 3 clusters using 5 nearest data points: 0.9761904761904762
Accuracy for 3 clusters using 7 nearest data points: 0.9642857142857143
Accuracy for 3 clusters using 9 nearest data points: 0.9523809523809524
Accuracy for 3 clusters using 11 nearest data points: 0.9464285714285714
Accuracy : 96.42857142857144 %   


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


No of clusters :4
Accuracy for 4 clusters using 3 nearest data points: 0.9702380952380952
Accuracy for 4 clusters using 5 nearest data points: 0.9761904761904763
Accuracy for 4 clusters using 7 nearest data points: 0.9702380952380952
Accuracy for 4 clusters using 9 nearest data points: 0.9583333333333334
Accuracy for 4 clusters using 11 nearest data points: 0.9583333333333334
Accuracy : 96.66666666666669 %   


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


No of clusters :5
Accuracy for 5 clusters using 3 nearest data points: 0.9642857142857143
Accuracy for 5 clusters using 5 nearest data points: 0.9404761904761906
Accuracy for 5 clusters using 7 nearest data points: 0.9166666666666666
Accuracy for 5 clusters using 9 nearest data points: 0.9047619047619048
Accuracy for 5 clusters using 11 nearest data points: 0.8809523809523808
Accuracy : 92.14285714285714 %   


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


No of clusters :6
Accuracy for 6 clusters using 3 nearest data points: 0.9583333333333334
Accuracy for 6 clusters using 5 nearest data points: 0.9642857142857144
Accuracy for 6 clusters using 7 nearest data points: 0.9345238095238095
Accuracy for 6 clusters using 9 nearest data points: 0.8988095238095238
Accuracy for 6 clusters using 11 nearest data points: 0.8988095238095237
Accuracy : 93.0952380952381 %   


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


No of clusters :7
Accuracy for 7 clusters using 3 nearest data points: 0.9404761904761906
Accuracy for 7 clusters using 5 nearest data points: 0.9047619047619048
Accuracy for 7 clusters using 7 nearest data points: 0.8988095238095238
Accuracy for 7 clusters using 9 nearest data points: 0.8869047619047619
Accuracy for 7 clusters using 11 nearest data points: 0.857142857142857
Accuracy : 89.76190476190476 %   
Best scenario for no of clusters : 4 


S̲c̲e̲n̲a̲r̲i̲o̲ ̲f̲o̲r̲ ̲a̲v̲e̲r̲a̲g̲e̲ ̲L̲i̲n̲k̲a̲g̲e̲ ̲:̲

No of clusters :3


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 3 clusters using 3 nearest data points: 0.9880952380952381
Accuracy for 3 clusters using 5 nearest data points: 0.9940476190476191
Accuracy for 3 clusters using 7 nearest data points: 0.9940476190476191
Accuracy for 3 clusters using 9 nearest data points: 0.9940476190476191
Accuracy for 3 clusters using 11 nearest data points: 0.9702380952380953
Accuracy : 98.80952380952381 %   
No of clusters :4


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 4 clusters using 3 nearest data points: 0.9523809523809524
Accuracy for 4 clusters using 5 nearest data points: 0.9226190476190476
Accuracy for 4 clusters using 7 nearest data points: 0.9464285714285715
Accuracy for 4 clusters using 9 nearest data points: 0.9404761904761905
Accuracy for 4 clusters using 11 nearest data points: 0.898809523809524
Accuracy : 93.21428571428572 %   
No of clusters :5


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 5 clusters using 3 nearest data points: 1.0
Accuracy for 5 clusters using 5 nearest data points: 0.9880952380952381
Accuracy for 5 clusters using 7 nearest data points: 0.9702380952380953
Accuracy for 5 clusters using 9 nearest data points: 0.9404761904761905
Accuracy for 5 clusters using 11 nearest data points: 0.9404761904761906
Accuracy : 96.78571428571429 %   
No of clusters :6
Accuracy for 6 clusters using 3 nearest data points: 0.9523809523809523


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 6 clusters using 5 nearest data points: 0.9642857142857143
Accuracy for 6 clusters using 7 nearest data points: 0.9345238095238095
Accuracy for 6 clusters using 9 nearest data points: 0.9583333333333334
Accuracy for 6 clusters using 11 nearest data points: 0.9464285714285715
Accuracy : 95.11904761904762 %   
No of clusters :7
Accuracy for 7 clusters using 3 nearest data points: 0.9642857142857143


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


Accuracy for 7 clusters using 5 nearest data points: 0.9464285714285714
Accuracy for 7 clusters using 7 nearest data points: 0.9523809523809524
Accuracy for 7 clusters using 9 nearest data points: 0.9285714285714286
Accuracy for 7 clusters using 11 nearest data points: 0.9285714285714286
Accuracy : 94.40476190476191 %   
Best scenario for no of clusters : 3 




In [11]:
train_set.head()

Unnamed: 0,A,P,C,LK,WK,A_Coef,LKG,cluster_feature_0,cluster_feature_1,cluster_feature_2,cluster_feature_3,cluster_feature_4,cluster_feature_5,cluster_feature_6,labels
0,18.17,16.26,0.8637,6.271,3.512,2.853,6.273,0.008466,0.385712,0.495419,0.355588,0.665801,0.480277,0.34557,0
1,11.35,13.12,0.8291,5.176,2.668,4.337,5.132,0.75972,0.008994,0.039452,0.313838,0.076476,0.859283,0.880877,1
2,13.22,13.84,0.868,5.395,3.07,4.157,5.088,0.447116,0.027948,0.025301,0.097221,0.084817,0.413196,0.559389,2
3,15.11,14.54,0.8986,5.579,3.462,3.128,5.18,0.205342,0.121633,0.141458,0.006964,0.196581,0.154308,0.335935,3
4,11.49,13.22,0.8263,5.304,2.695,5.388,5.31,0.767923,0.096284,0.049101,0.322121,0.002738,0.599657,0.68522,4


In [12]:
test_set.head()

Unnamed: 0,A,P,C,LK,WK,A_Coef,LKG,cluster_feature_0,cluster_feature_1,cluster_feature_2,cluster_feature_3,cluster_feature_4,cluster_feature_5,cluster_feature_6
0,13.54,13.85,0.8871,5.348,3.156,2.587,5.178,0.414035,0.040169,0.140324,0.142864,0.202567,0.600609,0.716926
1,15.05,14.68,0.8779,5.712,3.328,2.129,5.36,0.217536,0.140817,0.25127,0.128796,0.321164,0.496828,0.581456
2,18.65,16.41,0.8698,6.285,3.594,4.391,6.102,0.027154,0.572784,0.569898,0.516746,0.645222,0.472349,0.306478
3,12.78,13.57,0.8716,5.262,3.026,1.176,4.782,0.567662,0.057496,0.236899,0.211487,0.238081,0.800388,0.884427
4,13.45,14.02,0.8604,5.516,3.065,3.531,5.097,0.419879,0.033876,0.088796,0.15913,0.171173,0.565506,0.683321


# Predictions on test_set

In [19]:
# Prediction on the test set
train_set, test_set = dataset_KNN(df, 3, 'complete')
predictions = []
# Extract feature matrix X and target vector y from the train set dataframe
X = train_set.iloc[:,:-1].values
y = train_set.iloc[:,-1].values
for sample in test_set.values:
    prediction_1 = KNN_classification(sample, 5, X, y)
    predictions.append(prediction_1)
print(predictions)

[2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 0.0]


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels = np.zeros(ln,dtype = np.int)


# KNN on original dataset_KNN

In [20]:
df = pd.read_csv("Seed_Data.csv")
# Shuffle and Split and reset index
df = df.sample(frac=1).reset_index(drop = True)
train_set = df.iloc[:170, :].reset_index(drop = True)
test_set = df.iloc[170:, :].reset_index(drop = True)
# Extract feature matrics and target vectors
X = train_set.iloc[:,:-1].values
y = train_set.iloc[:,-1].values
X_test = test_set.iloc[:,:-1].values
y_test = test_set.iloc[:,-1].values

In [21]:
knn = [3,5,7,9,11] 
Accuracy = []
for k in knn:
    # Perform 10-fold cross validation on the first 150 samples of train set
    result = k_Fold(train_set.iloc[:150,:], 10, k)
    acc = sum(result)/len(result)
    print(f"Accuracy using {k} knn: {acc}")
    Accuracy.append(acc)
    av_acc = sum(Accuracy)/len(Accuracy)
print(f"Accuracy using KNN: {av_acc} ")

Accuracy using 3 knn: 0.9266666666666667
Accuracy using 5 knn: 0.9400000000000001
Accuracy using 7 knn: 0.9333333333333333
Accuracy using 9 knn: 0.9400000000000002
Accuracy using 11 knn: 0.9400000000000001
Accuracy using KNN: 0.9360000000000002 


In [22]:
predictions = []
for sample in X_test:
    prediction_1 = KNN_classification(sample, 5, X, y)
    predictions.append(prediction_1)

In [23]:
accuracy_metric(y_test, predictions)

0.8

#### Comparison of the results:

1. Agglomerative Clustering:

   a. Single Linkage:
      - Best scenario: 3 clusters
      - Average accuracy: 97.38%
      
   b. Complete Linkage:
      - Best scenario: 4 clusters
      - Average accuracy: 96.67%
      
   c. Average Linkage:
      - Best scenario: 3 clusters
      - Average accuracy: 98.81%

2. KNN Classifier (without clustering preprocessing):
   - Average accuracy: 93.6%

From the comparison, we can see that the highest average accuracy is achieved when using the Average Linkage method with 3 clusters (98.81%). This indicates that preprocessing the dataset with agglomerative clustering using average linkage improves the KNN classification accuracy compared to using the KNN classifier without any clustering preprocessing (93.6%).