1. In this question, you will explore the Mahalanobis distance and apply it to classifying samples from the Iris dataset. The Iris dataset is commonly used in machine learning and contains three classes of iris plants: Setosa, Versicolor, and Virginica. You will compute the Mahalanobis distance from one sample of each class to every class and classify each sample by its smallest distance.
Tasks:
Load the Iris dataset (csv file present in the classroom).
Choose one random sample from each class (Setosa, Versicolor, and Virginica) which will act as the test data.
Compute the mean vector and covariance matrix for each class, excluding the samples picked in the previous step (they serve as the test data).
Calculate the Mahalanobis distance from each selected sample to each class using the formula:
Mahalanobis distance = sqrt((x - μ)ᵀ * Σ⁻¹ * (x - μ))
Where:
x is the feature vector of the sample.
μ is the mean vector for each class.
Σ⁻¹ is the inverse of the covariance matrix for each class.

Compare the Mahalanobis distances for the three samples and classify each sample to the class with the smallest Mahalanobis distance.
Print the original class and the predicted class for each sample, along with their Mahalanobis distances.
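
The formula above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference solution: the name `mahalanobis_sketch` and the toy 2-D numbers are made up for the example, and `np.linalg.solve` is used in place of an explicit matrix inverse.

```python
import numpy as np

def mahalanobis_sketch(x, mu, sigma):
    """Hypothetical helper: sqrt((x - mu)^T Sigma^-1 (x - mu))."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve Sigma * z = diff instead of forming Sigma^-1 explicitly
    z = np.linalg.solve(np.asarray(sigma, dtype=float), diff)
    return float(np.sqrt(diff @ z))

# Toy 2-D example (made-up numbers, not Iris data)
x = [3.0, 4.0]
mu = [0.0, 0.0]

# With the identity covariance the distance equals the Euclidean norm
d_identity = mahalanobis_sketch(x, mu, np.eye(2))

# A larger variance along the first axis shrinks that axis's contribution
d_scaled = mahalanobis_sketch(x, mu, np.diag([9.0, 1.0]))

print(d_identity, d_scaled)
```

With the identity covariance the result is the plain Euclidean distance; a class that spreads widely along one feature penalizes deviations along that feature less.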


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import math

In [37]:
df = pd.read_csv('./iris.csv')
test_df = []

#Extract one random sample from each class
for class_label in df['variety'].unique():
    cr = df[df['variety'] == class_label]
    idx = random.randint(0, len(cr) - 1)
    sample = cr.iloc[idx]
    #print(sample)
    test_df.append(sample)

#Testing set
test_df = pd.DataFrame(test_df)
print("Testing dataframe:\n",test_df)

#Remove testing set from the training set.
train_df = df.drop(test_df.index)
print("Training dataframe:\n", train_df)

Testing dataframe:
      sepal.length  sepal.width  petal.length  petal.width     variety
48            5.3          3.7           1.5          0.2      Setosa
59            5.2          2.7           3.9          1.4  Versicolor
138           6.0          3.0           4.8          1.8   Virginica
Training dataframe:
      sepal.length  sepal.width  petal.length  petal.width    variety
0             5.1          3.5           1.4          0.2     Setosa
1             4.9          3.0           1.4          0.2     Setosa
2             4.7          3.2           1.3          0.2     Setosa
3             4.6          3.1           1.5          0.2     Setosa
4             5.0          3.6           1.4          0.2     Setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  Virginica
146           6.3          2.5           5.0          1.9  Virginica
147           6.5          3.0           5.2          2.0 
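
The split above depends on `random.randint`, so the selected rows change on every run. A reproducible variant could look like the sketch below. It assumes pandas 1.1 or newer (for `groupby(...).sample`), and the `toy` frame, the names `test_part` and `train_part`, and the seed `0` are all hypothetical stand-ins, not the real iris.csv data.

```python
import pandas as pd

# Hypothetical toy frame standing in for iris.csv (made-up rows)
toy = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "variety": ["Setosa", "Setosa", "Versicolor",
                "Versicolor", "Virginica", "Virginica"],
})

# Draw exactly one row per class; random_state makes the draw repeatable
test_part = toy.groupby("variety").sample(n=1, random_state=0)

# Everything that was not drawn stays in the training set
train_part = toy.drop(test_part.index)

print(test_part)
print(train_part)
```

Because the sampled rows keep their original index, dropping `test_part.index` removes exactly those rows from the training set, as in the cell above.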

In [43]:

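#Computing mean for each class.
#Note: this uses the full df (test samples included) and is for inspection only;
#distance() further down recomputes the means from train_df.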
mean_vector = df.groupby('variety').mean()
print(mean_vector)

            sepal.length  sepal.width  petal.length  petal.width
variety                                                         
Setosa             5.006        3.428         1.462        0.246
Versicolor         5.936        2.770         4.260        1.326
Virginica          6.588        2.974         5.552        2.026


In [73]:
#Function for calculating the covariance between each pair of attributes in the dataset.
#It takes two features and computes the covariance.

def cov(x, y):

    n = len(x)
    
    # Calculate means
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    total = 0  # avoid shadowing the built-in sum()
    # Calculate covariance
    for xi, yi in zip(x, y):
        total += (xi - mean_x) * (yi - mean_y)
    total /= n - 1
    
    return total

#Function to build the covariance matrix over all pairs of features
def make_matrix(df):
    
    #features and length of them
    features = df.columns
    n = len(features)

    #Init a zero based matrix
    cov_matrix = [[0] * n for _ in range(n)] 

    for i in range(n):
        for j in range(n):
            #The matrix is symmetric, so compute the upper triangle and mirror it
            if i <= j:
                c = cov(df[features[i]], df[features[j]])
                cov_matrix[i][j] = c
                cov_matrix[j][i] = c

    return cov_matrix
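
As a sanity check on the hand-written covariance, the sketch below compares a vectorized version against `np.cov`. The name `sample_cov` is hypothetical, and the data are just the first five Setosa rows shown in the training output, so the value is not a class-level covariance.

```python
import numpy as np

def sample_cov(x, y):
    """Unbiased sample covariance with the n - 1 denominator."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1))

# First five Setosa rows from the training output above
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6]

manual = sample_cov(sepal_length, sepal_width)

# np.cov uses the n - 1 denominator by default; entry [0, 1] is cov(x, y)
reference = np.cov(sepal_length, sepal_width)[0, 1]

print(manual, reference)
```

The two numbers should agree up to floating-point rounding, which confirms that the loop-based function and `np.cov` use the same normalization.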


In [74]:
final_cov = []

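#Loop over each class.
#Note: this uses the full df (test samples included); distance() further down
#recomputes the covariance from train_df with np.cov.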
for class_label in df['variety'].unique():
    cr = df[df['variety'] == class_label]
    c_removed = cr.drop(columns = ['variety'])
    cov_mat = make_matrix(c_removed)
    final_cov.append(cov_mat)


#Printing each class's covariance matrix
for i, class_label in enumerate(df['variety'].unique()):
    print(f"Covariance Matrix for {class_label}:")
    print(np.array(final_cov[i]))

Covariance Matrix for Setosa:
[[0.12424898 0.09921633 0.0163551  0.01033061]
 [0.09921633 0.1436898  0.01169796 0.00929796]
 [0.0163551  0.01169796 0.03015918 0.00606939]
 [0.01033061 0.00929796 0.00606939 0.01110612]]
Covariance Matrix for Versicolor:
[[0.26643265 0.08518367 0.18289796 0.05577959]
 [0.08518367 0.09846939 0.08265306 0.04120408]
 [0.18289796 0.08265306 0.22081633 0.07310204]
 [0.05577959 0.04120408 0.07310204 0.03910612]]
Covariance Matrix for Virginica:
[[0.40434286 0.09376327 0.3032898  0.04909388]
 [0.09376327 0.10400408 0.07137959 0.04762857]
 [0.3032898  0.07137959 0.30458776 0.04882449]
 [0.04909388 0.04762857 0.04882449 0.07543265]]
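
The Mahalanobis formula needs the inverse of each covariance matrix, so it can be worth checking that a matrix is safely invertible first. The sketch below does this for the Setosa matrix copied from the output above; the variable names are illustrative, and a condition-number check is one possible diagnostic, not something the assignment requires.

```python
import numpy as np

# Setosa covariance matrix copied from the printed output above
cov_setosa = np.array([
    [0.12424898, 0.09921633, 0.0163551,  0.01033061],
    [0.09921633, 0.1436898,  0.01169796, 0.00929796],
    [0.0163551,  0.01169796, 0.03015918, 0.00606939],
    [0.01033061, 0.00929796, 0.00606939, 0.01110612],
])

# A very large condition number would signal a nearly singular matrix
cond = np.linalg.cond(cov_setosa)

# The inverse should satisfy cov @ cov_inv ≈ identity
cov_inv = np.linalg.inv(cov_setosa)
residual = np.max(np.abs(cov_setosa @ cov_inv - np.eye(4)))

print("condition number:", cond)
print("max deviation from identity:", residual)
```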


In [94]:
import numpy as np

# Mahalanobis distance function
def mahalanobis(x, mean_vector, cov_mat):
    diff = x - mean_vector
    cov_inv = np.linalg.inv(cov_mat)
    dist = np.sqrt(np.dot(np.dot(diff.T, cov_inv), diff))  
    return dist


def distance(test_df, train_df):
    dis = []
    class_labels = train_df['variety'].unique()
    for class_label in train_df['variety'].unique():

        
        cr = train_df[train_df['variety'] == class_label]
        
        # Drop the 'variety' column to get only feature columns
        cf = cr.drop(columns=['variety'])
        
        # Calculate the mean vector and covariance matrix for the class
        mean_vector = np.mean(cf, axis=0)
        
        # np.cov expects one variable per row, so transpose to put the features in rows
        cov_mat = np.cov(cf.T)  

        d = []
        for i, sample in test_df.iterrows():
            # Convert the sample to a numpy array for calculation
            d.append(mahalanobis(np.array(sample), mean_vector, cov_mat))

        dis.append(d)  
    
    return dis, class_labels



# Dropping the class label column from test_df (assuming it has 'variety' too)
test_features = test_df.drop(columns=['variety'])

# Call the distance function
mahalanobis_distances, class_labels = distance(test_features, train_df)

# Print the distances: one row per class, one entry per test sample
print("The Mahalanobis distances are:\n")
for d in mahalanobis_distances:
    print(d)


The Mahalanobis distances are:

[1.1454203905257336, 16.469649685600842, 21.900645321496718]
[11.22112337557554, 2.2161285528253187, 3.001754059755531]
[14.221055622925201, 3.6474173249202537, 1.8281873435831342]
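
Each printed row holds one class's distances to the three test samples, so classifying a sample means taking the smallest value in its column. A compact sketch of that step, using the numbers printed above and an assumed class order of Setosa, Versicolor, Virginica:

```python
import numpy as np

# Class order assumed to match the row order of the printed distances
class_labels = ["Setosa", "Versicolor", "Virginica"]

# Rows: classes; columns: test samples (values copied from the output above)
distances = np.array([
    [1.1454203905257336, 16.469649685600842, 21.900645321496718],
    [11.22112337557554, 2.2161285528253187, 3.001754059755531],
    [14.221055622925201, 3.6474173249202537, 1.8281873435831342],
])

# For each column (sample), pick the row (class) with the smallest distance
nearest = distances.argmin(axis=0)
predicted = [class_labels[k] for k in nearest]

print(predicted)
```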


In [97]:
#Compare the Mahalanobis distances for the three samples and classify each sample to the class with the smallest Mahalanobis distance.
def predict_assign(test_df, train_df):
    #Get distances: distances[j][i] is the distance from class j to test sample i
    distances, class_labels = distance(test_df, train_df)

    predicted = []
    per_sample = []

    for i in range(len(test_df)):
        sample_dis = []
        for j in range(len(class_labels)):
            sample_dis.append(distances[j][i])

        min_dist = min(sample_dis)
        predicted_class = class_labels[sample_dis.index(min_dist)]
        predicted.append(predicted_class)
        #Keep the distances of every sample, not only the last one
        per_sample.append(dict(zip(class_labels, sample_dis)))

    return predicted, per_sample

predicted_classes, d = predict_assign(test_features, train_df)
print("Predicted classes:", predicted_classes)
print("distances:", d)
print("True labels:", list(test_df['variety']))

Predicted classes: ['Setosa', 'Versicolor', 'Virginica']
distances: [{'Setosa': 1.1454203905257336, 'Versicolor': 11.22112337557554, 'Virginica': 14.221055622925201}, {'Setosa': 16.469649685600842, 'Versicolor': 2.2161285528253187, 'Virginica': 3.6474173249202537}, {'Setosa': 21.900645321496718, 'Versicolor': 3.001754059755531, 'Virginica': 1.8281873435831342}]
True labels: ['Setosa', 'Versicolor', 'Virginica']
