Consider the 128- dimensional feature vectors (d=128) given in the “gender.csv” file. (2 classes, male and female)


a) Use PCA to reduce the dimension from d to d‟. (Here d=128).


b) Display the eigenvalue based on increasing order, select the d‟ of the corresponding eigenvector which is the appropriate dimension d‟ ( select d‟ S.T first 95% of λ values of the covariance matrix are considered).


c) Use d‟ features to classify the test cases (use any classification algorithm taught in class like Bayes classifier, minimum distance classifier, and so on).

Dataset Specifications:

Total number of samples = 800

Number of classes = 2 (labeled as “male” and “female”)

Samples from “1 to 400” belongs to class “male”

Samples from “401 to 800” belongs to class “female”

Number of samples per class = 400

Number of dimensions = 128

Use the following information to design classifier:


Number of test cases (first 10 in each class) = 20

Number of training feature vectors ( remaining 390 in each class) = 390

Number of reduced dimensions = d‟ (map 128 to d‟ features vector)

In [30]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [31]:
gender_df = pd.read_csv('gender.csv')
gender_df['Label'] = gender_df['Unnamed: 1']
gender_df.drop(['Unnamed: 0', 'Unnamed: 1'], axis=1, inplace=True)

In [32]:
df = gender_df.drop('Label', axis=1)
cov_matrix = np.cov(df, rowvar = False)
print(cov_matrix)


[[ 2.97351868e-03 -3.60642117e-04 -2.16383652e-04 ...  4.44221964e-04
  -1.00563881e-04 -4.91091254e-05]
 [-3.60642117e-04  2.46669576e-03  5.11762350e-05 ... -3.82309128e-04
  -1.37693470e-04 -1.87163213e-05]
 [-2.16383652e-04  5.11762350e-05  2.58212973e-03 ... -3.07653151e-04
  -2.05282568e-04  2.86060002e-05]
 ...
 [ 4.44221964e-04 -3.82309128e-04 -3.07653151e-04 ...  2.37008172e-03
   2.47776832e-06  1.52785404e-04]
 [-1.00563881e-04 -1.37693470e-04 -2.05282568e-04 ...  2.47776832e-06
   2.46498262e-03  3.22312975e-05]
 [-4.91091254e-05 -1.87163213e-05  2.86060002e-05 ...  1.52785404e-04
   3.22312975e-05  2.25878968e-03]]


In [33]:
import numpy as np

# Function for Principal Component Analysis (PCA) which takes a DataFrame as input
def pca(X):
    
    # Making the data mean-centered
    mean = np.mean(X, axis=0)
    X = X - mean

    # Calculate the covariance matrix
    cov_matrix = np.cov(X, rowvar=False)

    # Calculate the eigenvalues and eigenvectors
    eig_values, eig_vectors = np.linalg.eig(cov_matrix)

    eig_values, eig_vectors = eig_values.real, eig_vectors.real

    # Sort the eigenvectors by decreasing eigenvalues
    idx = np.argsort(eig_values)[::-1]

    # Sort eigenvalues
    eig_values = eig_values[idx]  

    print("Eigen values after sorting\n", eig_values)

    # Compute cumulative variance
    variance = np.cumsum(eig_values) / np.sum(eig_values)

    # Get the number of components based on the variance threshold
    d = np.argmax(variance >= 0.95) + 1
    print("New dimensions: ", d)

    # Select the top d eigenvectors
    sel_eigen_vectors = eig_vectors[:, :d]

    # Project the data onto the selected eigenvectors
    X_pca = np.dot(X, sel_eigen_vectors)

    return X_pca


In [34]:
df_reduced = pca(df)
df_reduced = pd.DataFrame(df_reduced)

Eigen values after sorting
 [4.12176766e-02 2.40118775e-02 1.71111604e-02 1.45027660e-02
 1.26813904e-02 1.18931602e-02 1.01143111e-02 9.04300685e-03
 8.47642204e-03 7.92259881e-03 7.56723390e-03 7.13193051e-03
 6.75919664e-03 6.47829814e-03 6.31321208e-03 5.78163452e-03
 5.54058725e-03 5.48252160e-03 5.26469830e-03 5.03681869e-03
 4.78933015e-03 4.52810294e-03 4.47827311e-03 4.22639684e-03
 4.21242903e-03 3.97050356e-03 3.83249923e-03 3.66188374e-03
 3.45420175e-03 3.36381832e-03 3.21785618e-03 3.04563338e-03
 3.01768926e-03 2.85672221e-03 2.79537719e-03 2.70232846e-03
 2.63687059e-03 2.55370940e-03 2.52691826e-03 2.42131249e-03
 2.37047722e-03 2.29834291e-03 2.27518972e-03 2.14555764e-03
 2.05654128e-03 1.99342469e-03 1.91474242e-03 1.84400602e-03
 1.83671248e-03 1.76746631e-03 1.63982016e-03 1.59730472e-03
 1.53753513e-03 1.47002752e-03 1.42703305e-03 1.36569109e-03
 1.30160525e-03 1.21898058e-03 1.17783237e-03 1.16898611e-03
 1.11811877e-03 1.05886796e-03 1.00515486e-03 9.89754891e

In [35]:
df_reduced.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,...,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,1.7763570000000002e-17,1.4988010000000002e-17,4.996004e-18,-1.7486010000000002e-17,1.720846e-17,4.440892e-18,6.383782e-18,-8.049117e-18,7.21645e-18,-1.3322680000000001e-17,...,-8.049117e-18,9.645063e-18,9.436896e-18,1.94289e-18,1.967176e-17,9.436896e-18,7.736867e-18,-7.5287e-18,5.2041700000000004e-18,3.4694469999999996e-19
std,0.2030214,0.1549577,0.1308096,0.1204274,0.1126117,0.1090558,0.1005699,0.09509473,0.09206749,0.08900898,...,0.04049469,0.04204125,0.04294189,0.04285688,0.03996629,0.03921142,0.01862687,0.01966605,0.01322105,0.03777609
min,-0.3546874,-0.3176415,-0.3720125,-0.374311,-0.2812327,-0.3378446,-0.2513338,-0.3136177,-0.3233857,-0.3051982,...,-0.1492379,-0.1341538,-0.1152212,-0.1321982,-0.114362,-0.1262619,-0.06095278,-0.05246327,-0.034511,-0.1120404
25%,-0.2050774,-0.1066496,-0.08985982,-0.08397985,-0.07359415,-0.07189799,-0.07328894,-0.06631901,-0.06393577,-0.06255846,...,-0.02707542,-0.02841119,-0.02901279,-0.02797426,-0.0269061,-0.02661118,-0.01171505,-0.01303794,-0.008824402,-0.02574077
50%,0.0295984,-0.03750354,0.0003624099,0.00575796,0.002377023,0.003567549,0.0006567316,0.007273334,-0.001603639,-0.0008802275,...,0.0002183583,-0.001343643,-0.000371412,0.001625032,-0.001087592,-8.005682e-05,0.0007857637,-0.0007923597,-0.0008755637,-0.0001882582
75%,0.1953019,0.07021733,0.09471993,0.08692933,0.07946624,0.07288791,0.06896335,0.06493091,0.06365774,0.06070273,...,0.0284873,0.0287823,0.02919775,0.02925536,0.02741744,0.02737686,0.01198817,0.01221487,0.008580014,0.02572501
max,0.3607476,0.5501217,0.3966303,0.3274456,0.3288025,0.3410586,0.2933661,0.2529252,0.2538657,0.2603154,...,0.121922,0.1279623,0.1645562,0.1125424,0.1552163,0.1288062,0.05586602,0.05768246,0.0541937,0.1257339


In [36]:
#Here I am using Baye's classifier to classify the data
#Function to check whether the data falls under case 1
def isCase1(mat):
    first = mat[0][0]
    for i in range(1, len(mat)):
        if mat[i][i] != first:
            return False
    return True


#If the data falls under case-1 discriminant function is computed accordingly.
def linear_case1(w, pw, cov_mat):
    u1 = np.mean(w, axis=0)
    cov = cov_mat[0][0]
    weight = u1 / cov
    bias = np.log(pw) - 0.5 * np.dot(u1.T, u1) / (cov ** 2)
    return weight, bias

#If the data falls under case-2 discriminant function is computed accordingly.
def linear_case2(w, pw, cov_mat):
    u1 = np.mean(w, axis=0)
    inv_cov = np.linalg.inv(cov_mat)
    weight = inv_cov @ u1
    bias = np.log(pw) - 0.5 * u1.T @ inv_cov @ u1
    return weight, bias

#If the data falls under case-3 discriminant function is computed accordingly.
def non_linear(w, pw, cov_mat):
    u1 = np.mean(w, axis=0)
    inv_cov = np.linalg.inv(cov_mat)
    weight1 = -0.5 * inv_cov
    weight2 = inv_cov @ u1
    bias = np.log(pw) - 0.5 * np.log(np.linalg.det(cov_mat)) - 0.5 * u1.T @ inv_cov @ u1
    return weight1, weight2, bias


#wieghts and biases are extracted from above functions by checking the cases for the given data.
def bayes_classifier(w1, w2, pw1, pw2):
    w1_cov = np.cov(w1, rowvar=False)
    w2_cov = np.cov(w2, rowvar=False)
    
    if np.allclose(w1_cov, w2_cov):
        weight1, bias1 = linear_case2(w1, pw1, w1_cov)
        weight2, bias2 = linear_case2(w2, pw2, w2_cov)
        return lambda x: np.dot(weight1 - weight2, x) + (bias1 - bias2)
    elif isCase1(w1_cov) and isCase1(w2_cov):
        weight1, bias1 = linear_case1(w1, pw1, w1_cov)
        weight2, bias2 = linear_case1(w2, pw2, w2_cov)
        return lambda x: np.dot(weight1 - weight2, x) + (bias1 - bias2)
    else:
        weight1_1, weight1_2, bias1 = non_linear(w1, pw1, w1_cov)
        weight2_1, weight2_2, bias2 = non_linear(w2, pw2, w2_cov)
        return lambda x: x.T @ (weight1_1 - weight2_1) @ x + np.dot(weight1_2 - weight2_2, x) + (bias1 - bias2)



In [37]:
#as first 400 are male and next 400 are female
labels = np.array([0] * 400 + [1] * 400)

#Test dataset first 10 in each class
test_df = pd.concat([df_reduced[ : 10], df_reduced[400 : 410]])
test_labels = np.array([0] * 10 + [1] * 10)

#Remaining data is used for training
train_male = df_reduced.iloc[:390]       # First 390 samples belong to "male"
train_female = df_reduced.iloc[390:780]

In [38]:
pw1, pw2 = 0.5, 0.5
test_df = np.array(test_df.values)

#Predictions are made using the bayes classifier
result = bayes_classifier(train_male, train_female, pw1, pw2)
predictions = np.array([0 if result(x) > 0 else 1 for x in test_df])

#Accuracy is calculated
accuracy = np.mean(test_labels == predictions)*100
print("Accuracy of the model is", accuracy, "%")


Accuracy of the model is 90.0 %


In [39]:
for i, (true, pred) in enumerate(zip(test_labels, predictions)):
    print(f"Sample {i+1}: True label = {'Male' if true == 0 else 'Female'}, Predicted = {'Male' if pred == 0 else 'Female'}")


Sample 1: True label = Male, Predicted = Male
Sample 2: True label = Male, Predicted = Male
Sample 3: True label = Male, Predicted = Male
Sample 4: True label = Male, Predicted = Male
Sample 5: True label = Male, Predicted = Male
Sample 6: True label = Male, Predicted = Male
Sample 7: True label = Male, Predicted = Male
Sample 8: True label = Male, Predicted = Female
Sample 9: True label = Male, Predicted = Male
Sample 10: True label = Male, Predicted = Male
Sample 11: True label = Female, Predicted = Male
Sample 12: True label = Female, Predicted = Female
Sample 13: True label = Female, Predicted = Female
Sample 14: True label = Female, Predicted = Female
Sample 15: True label = Female, Predicted = Female
Sample 16: True label = Female, Predicted = Female
Sample 17: True label = Female, Predicted = Female
Sample 18: True label = Female, Predicted = Female
Sample 19: True label = Female, Predicted = Female
Sample 20: True label = Female, Predicted = Female


Eigenfaces-Face classification using PCA (40 classes)

a) Use the following “face.csv” file to classify the faces of 40 different people using PCA.

b) Do not use the in-built function for implementing PCA.

c) Use appropriate classifier taught in class (use any classification algorithm taught in class like Bayes classifier, minimum distance classifier, and so on)

d) Refer to the following link for a description of the dataset: https://towardsdatascience.com/eigenfaces-face-classification-in-python-7b8d2af3d3ea


In [40]:
face_df = pd.read_csv('face.csv')
face_df.head(20)
face_df.dropna(inplace=True)
y = np.array(face_df['target'].values)
face_df.drop(columns='target', inplace=True)

In [41]:
face_df_reduced = pca(face_df)

Eigen values after sorting
 [ 1.88401756e+01  1.10717620e+01  6.30461460e+00 ... -2.62195151e-16
 -2.62195151e-16 -2.75029807e-16]
New dimensions:  123


In [42]:
#Splitting data set to train the model
X_train, X_test, y_train, y_test = train_test_split(face_df_reduced, y, test_size=0.2, random_state=42)

In [43]:

# classify the testing data using Nearest Neighbor Classifier and print the accuracy 
knn = KNeighborsClassifier(n_neighbors = 1)

knn.fit(X_train, y_train)

pred = knn.predict(X_test)

print("Accuracy of the model using Nearest Neighbor is: ", accuracy_score(y_test, pred)*100, "%")


Accuracy of the model using Nearest Neighbor is:  92.5 %
