1. Load the height/weight data from the file heightWeightData.txt. The first column is the class label (1=male, 2=female), the second column is height, and the third is weight. Start by replacing the weight column with the product of height and weight.

For Fisher’s linear discriminant analysis as discussed in class, send the Python/MATLAB code and answers for the following questions:

a. What’s the SB matrix?

b. What’s the SW matrix?

c. What’s the optimal 1d projection direction?

d. Project the data onto the optimal 1d projection direction. Set the decision threshold at the midpoint between the projected means. What’s the misclassification error rate?

e. What’s your height and weight? What’s the model prediction for your case (male/female)?
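
For reference, the standard two-class Fisher quantities can be written as below. The notation is an assumption (class means m_1 and m_2, index sets C_1 and C_2 for the samples of each class); the slides may use different symbols.

```latex
\begin{aligned}
S_B &= (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^{\top}, \\
S_W &= \sum_{k=1}^{2} \sum_{n \in C_k} (\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^{\top}, \\
\mathbf{w} &\propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1), \qquad
w_0 = \tfrac{1}{2}\,\mathbf{w}^{\top}(\mathbf{m}_1 + \mathbf{m}_2).
\end{aligned}
```

With two features, both S_B and S_W are 2x2 matrices and w is a vector of length 2.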

In [1]:
#Imports
import numpy as np
import matplotlib.pyplot as plt

#Load Data
data = np.genfromtxt("heightWeightData.txt", delimiter=",")

#Replace the weight (3rd column) with height * weight
np.set_printoptions(suppress=True)
new_data = data.copy()
new_data[:, 2] = data[:, 1] * data[:, 2]

In [2]:
#Implementing Fisher's Linear Discriminant Analysis
#Group the rows by class label: males (=1) and females (=2)
males = new_data[new_data[:, 0] == 1]
females = new_data[new_data[:, 0] == 2]

#Class sizes
nr_males = males.shape[0]
nr_females = females.shape[0]
class_sizes = np.array([nr_males, nr_females])

#Drop the label column
f_males = males[:, 1:]
f_females = females[:, 1:]
#Calculate the mean vector for each class
mean_males = np.mean(a=f_males, axis=0)
mean_females = np.mean(a=f_females, axis=0)

print("Mean Vector for Males Class: \n", mean_males,"\nMean Vector for Females Class: \n", mean_females)

Mean Vector for Males Class: 
 [  182.01013699 14552.85501781] 
Mean Vector for Females Class: 
 [ 165.28540146 9757.31728073]


a. What’s the SB matrix?

In [3]:
#Calculate Overall Mean
overall_mean = np.mean(new_data[:, 1:], axis=0)
#print("Overall mean vector is: ", overall_mean)

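#Let's Compute Between Class Scatter Matrix S_B
#According to the slides: S_B = (m2-m1)(m2-m1).T
#np.multiply is element-wise and .T is a no-op on a 1-D array, so this line
#yields only the diagonal of that outer product, as a length-2 vector
S_B = np.multiply((mean_females-mean_males), (mean_females-mean_males).T)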
print("S_B Matrix is: ", S_B)

S_B Matrix is:  [     279.71677843 22997182.18774202]
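
The slide formula S_B = (m2-m1)(m2-m1).T is an outer product, so for two features it gives a 2x2 matrix. A minimal sketch using `np.outer` follows, with the class means hard-coded from the printed output of In [2]. The name `S_B_full` is introduced here for illustration, and the result is only as precise as that printout.

```python
import numpy as np

# Class means copied from the printed output above (8 decimals).
mean_males = np.array([182.01013699, 14552.85501781])
mean_females = np.array([165.28540146, 9757.31728073])

# Outer product of the mean difference with itself: a symmetric 2x2 matrix.
diff = mean_females - mean_males
S_B_full = np.outer(diff, diff)

print(S_B_full.shape)     # (2, 2)
print(np.diag(S_B_full))  # the diagonal equals the element-wise product diff * diff
```

The off-diagonal entries are the products of the height difference and the height*weight difference. The element-wise version does not produce them.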


b. What’s the SW matrix?

In [11]:
#Let's Compute Within Class Scatter Matrix S_W
#np.matmul(X.T, X) of the centered class data is the 2x2 class scatter matrix;
#the built-in sum() then adds its rows, collapsing it to a length-2 vector
scatter_male = sum(np.matmul((f_males-mean_males).T, (f_males-mean_males)))
scatter_female = sum(np.matmul((f_females-mean_females).T, (f_females-mean_females)))
S_W = scatter_male+scatter_female
print("S_W Matrix is: ", S_W)

S_W Matrix is:  [2.39983269e+06 1.07899323e+09]
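
For comparison, the within-class scatter can be kept as a full matrix by adding the per-class scatter matrices directly. The sketch below uses hypothetical toy data rather than heightWeightData.txt, and the helper `class_scatter` and the name `S_W_full` are illustrative, not part of the original solution.

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are (height, height*weight).
rng = np.random.default_rng(0)
X_males = rng.normal(loc=[180.0, 14000.0], scale=[7.0, 2000.0], size=(50, 2))
X_females = rng.normal(loc=[165.0, 9800.0], scale=[6.0, 1500.0], size=(60, 2))

def class_scatter(X):
    """Return sum_n (x_n - m)(x_n - m)^T for one class as a (d, d) matrix."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered

# Within-class scatter: add the per-class matrices, with no further summation.
S_W_full = class_scatter(X_males) + class_scatter(X_females)
print(S_W_full.shape)  # (2, 2)
```

Each class scatter equals (n-1) times the sample covariance of that class, which gives a quick sanity check against `np.cov`.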


c. What’s the optimal 1d projection direction?

In [14]:
#Projection direction W
#1/S_W is an element-wise reciprocal of the length-2 vector S_W, not a matrix inverse
W = (1/S_W)*(mean_females-mean_males)
print("Optimal 1D Projection Direction is: ", W)

Optimal 1D Projection Direction is:  [-0.00000697 -0.00000444]
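
With a full 2x2 within-class scatter matrix, the Fisher direction is proportional to the inverse of S_W applied to (m2-m1). A minimal sketch on hypothetical toy data follows. It solves the linear system with `np.linalg.solve` rather than forming the inverse, and the names `X1`, `X2`, `S_W_full` and `w_unit` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data standing in for the two classes.
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[180.0, 14000.0], scale=[7.0, 2000.0], size=(50, 2))
X2 = rng.normal(loc=[165.0, 9800.0], scale=[6.0, 1500.0], size=(60, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W_full = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Solve S_W w = (m2 - m1) instead of forming the inverse explicitly.
w = np.linalg.solve(S_W_full, m2 - m1)

# Only the direction matters, so w can be rescaled to unit length.
w_unit = w / np.linalg.norm(w)
print(w_unit)
```

Because S_W is positive definite here, class 2 always has the larger projected mean along w. That is why a "greater than or equal to the threshold means class 2" rule is consistent with this choice of sign.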


d. Project the data onto the optimal 1d projection direction. Set the decision threshold at the midpoint between the projected means. What’s the misclassification error rate?

In [15]:
#Calculate Threshold: midpoint of the two projected class means
tot = 0
class_means = np.array([mean_males, mean_females])
for mean in class_means:
    tot += np.dot(W.T, mean)
w0 = 0.5 * tot
print("Calculated threshold is: ", w0)

Calculated threshold is:  -0.055232916501277526


In [17]:
#Calculate Error
#For each input project the point
features = (new_data[:, 1:]).T
labels = new_data[:,0]
projected = np.dot(W.T, np.array(features))
#projected

In [18]:
#Assign Predictions
predictions = []
for item in projected:
    if item >= w0:
        predictions.append(2)
    else:
        predictions.append(1)
predictions

[2,
 2,
 2,
 2,
 2,
 2,
 1,
 2,
 1,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 2,
 2,
 1,
 1,
 1,
 2,
 2,
 2,
 2,
 2,
 1,
 2,
 2,
 2,
 1,
 2,
 2,
 2,
 1,
 2,
 2,
 1,
 2,
 2,
 1,
 1,
 2,
 1,
 2,
 2,
 1,
 2,
 2,
 1,
 2,
 1,
 1,
 2,
 2,
 2,
 1,
 2,
 1,
 2,
 2,
 2,
 2,
 1,
 1,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 1,
 2,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 1,
 1,
 1,
 1,
 2,
 2,
 1,
 1,
 2,
 2,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 1,
 1,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 2,
 1,
 2,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 2,
 1,
 2,
 1,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 1,
 2,
 1,
 2,
 2,
 2,
 2,
 1,
 2,
 1,
 2,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 1,
 2,
 2,
 1,
 2,
 2,
 2,
 1,
 2,
 2,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 2,
 2,
 1,
 2,
 1,
 1,
 1,
 1,
 2,
 1]

In [19]:
#Check Classification
errors = (labels != predictions)
n_errors = sum(errors)

error_rate = (n_errors/len(predictions) * 100)
print("Error Rate is: ", error_rate, "%")

Error Rate is:  11.904761904761903 %
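
The steps of part d (direction, midpoint threshold, projection, error rate) can be collected into two small functions. This is a sketch under stated assumptions: it uses the matrix form of S_W, runs on hypothetical toy data rather than the homework file, and the names `fisher_fit` and `fisher_predict` are illustrative.

```python
import numpy as np

def fisher_fit(X, y):
    """Fit two-class Fisher LDA for labels 1 and 2; return (w, threshold)."""
    X1, X2 = X[y == 1], X[y == 2]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    threshold = 0.5 * (w @ m1 + w @ m2)  # midpoint of the projected means
    return w, threshold

def fisher_predict(X, w, threshold):
    """Label 2 at or above the threshold, since w @ m2 > w @ m1 for this w."""
    return np.where(X @ w >= threshold, 2, 1)

# Hypothetical toy data, not the homework file.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([180.0, 14000.0], [7.0, 2000.0], size=(100, 2)),
               rng.normal([165.0, 9800.0], [6.0, 1500.0], size=(100, 2))])
y = np.repeat([1, 2], 100)

w, threshold = fisher_fit(X, y)
error_rate = np.mean(fisher_predict(X, w, threshold) != y) * 100
print(f"Error rate: {error_rate:.2f} %")
```

On the real data, the same calls would take `new_data[:, 1:]` and `new_data[:, 0]` as `X` and `y`.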


e. What’s your height and weight? What’s the model prediction for your case (male/female)?

In [21]:
#My case
my_height = 164
my_weight = 65
my_features = np.array([my_height, my_weight*my_height])
my_ground_truth = "Male"

#My Prediction
my_projection = np.dot(W.T, my_features)
if my_projection >= w0:
    my_pred = "Female"
else:
    my_pred = "Male"

print("In my case I was predicted as: ", my_pred, " which is ", my_ground_truth==my_pred)

In my case I was predicted as:  Female  which is  False


In [26]:
#Let's use Sklearn to see if our solution is correct
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(new_data[:, 1:], labels)
print(clf.get_params())
predictions = clf.predict(new_data[:, 1:])
print(predictions)
#Count sklearn's own mismatches
errors = sum(labels!=predictions)
#The rate below is computed from n_errors, the manual classifier's count from In [19]
error_rate = (n_errors/len(predictions) * 100)
print("Error Rate is: ", error_rate, "%")
print("\nThe sklearn predictions are similar to ours, but not identical.")

{'n_components': None, 'priors': None, 'shrinkage': None, 'solver': 'svd', 'store_covariance': False, 'tol': 0.0001}
[2. 2. 2. 2. 2. 2. 1. 2. 1. 2. 2. 1. 1. 2. 2. 2. 2. 2. 2. 1. 2. 2. 2. 1.
 1. 2. 2. 2. 2. 2. 1. 2. 2. 2. 1. 2. 2. 2. 1. 2. 2. 1. 2. 2. 1. 1. 2. 1.
 2. 2. 1. 2. 2. 1. 2. 1. 1. 2. 2. 2. 1. 2. 1. 2. 2. 2. 2. 1. 1. 2. 2. 1.
 2. 2. 2. 2. 1. 2. 1. 2. 2. 2. 2. 2. 2. 1. 2. 1. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 1. 1. 1. 1. 1. 2. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 1. 1. 2.
 2. 2. 2. 2. 2. 1. 1. 1. 2. 1. 1. 1. 1. 2. 2. 2. 1. 2. 2. 2. 2. 1. 2. 2.
 1. 2. 2. 2. 2. 2. 1. 2. 1. 2. 2. 1. 2. 2. 2. 2. 1. 2. 2. 2. 2. 2. 2. 1.
 2. 1. 2. 2. 2. 1. 2. 2. 2. 2. 2. 2. 2. 1. 2. 2. 1. 2. 2. 2. 1. 2. 2. 1.
 2. 2. 2. 2. 1. 2. 2. 1. 2. 2. 1. 2. 1. 1. 1. 1. 2. 1.]
Error Rate is:  11.904761904761903 %

The sklearn predictions are similar to ours, but not identical.
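
When comparing two classifiers, it helps to compute each error rate from that classifier's own predictions and to measure their agreement directly. The sketch below uses small hypothetical label and prediction lists only to show the calls; the helper names `error_rate_of` and `agreement` are illustrative.

```python
import numpy as np

def error_rate_of(labels, predictions):
    """Percentage of samples whose prediction differs from the label."""
    return np.mean(np.asarray(labels) != np.asarray(predictions)) * 100

def agreement(pred_a, pred_b):
    """Percentage of samples on which two classifiers give the same label."""
    return np.mean(np.asarray(pred_a) == np.asarray(pred_b)) * 100

# Hypothetical labels and predictions, not taken from the homework data.
toy_labels = [1, 1, 2, 2, 2, 1]
manual_pred = [1, 2, 2, 2, 1, 1]
sklearn_pred = [1, 1, 2, 2, 1, 1]

print(error_rate_of(toy_labels, manual_pred))   # 2 of 6 wrong
print(error_rate_of(toy_labels, sklearn_pred))  # 1 of 6 wrong
print(agreement(manual_pred, sklearn_pred))     # 5 of 6 agree
```

In the notebook, the corresponding inputs would be `labels`, the manual prediction list from In [18] (saved under a separate name before it is overwritten), and the output of `clf.predict`.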
