1. Perform Fisher Discriminant Analysis on pima-indians-diabetes.csv
2. Perform Decision Tree classification on pima-indians-diabetes.csv
3. Perform Fisher Discriminant Analysis on pima-indians-diabetes.csv and then perform Decision Tree classification on the reduced data. Compare the results.

# Fisher's Discriminant Analysis

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
random_seed = 5

Let's perform Fisher's discriminant analysis, using scatter matrices, on the Pima Indians Diabetes database to predict whether a patient is likely to have diabetes or not.

In [2]:
df = pd.read_csv("Data/pima-indians-diabetes.csv")
df.head()

Unnamed: 0,pregnancies,glucose,blood_pressure,skin_thickness,insulin,bmi,pedigree,age,result
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


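> Let's compute the weight matrix (here a single weight vector, since there are only two classes) using Fisher's criterion for the Pima database. We first select the features and the label, then split the data 80/20 into training and test sets.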

In [3]:
features = ["pregnancies", "glucose", "blood_pressure", "skin_thickness", "insulin", "bmi", "pedigree", "age"]
label = ["result"]
X = df[features]
y = df[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)

Let us define our own FDA class.
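
For reference, the two-class Fisher criterion that the class below appears to follow can be written as shown here. The symbols are chosen for illustration: $m_1$ and $m_0$ are the class means (`PMean` and `NMean` in the code), $S_W$ is the within-class scatter matrix (`SW`), and $z_{\text{cut}}$ is the decision threshold (`zcut`).

```latex
\begin{aligned}
J(w) &= \frac{w^\top S_B\, w}{w^\top S_W\, w},
\qquad S_B = (m_1 - m_0)(m_1 - m_0)^\top,
\qquad S_W = \sum_{c \in \{0,1\}} \sum_{x \in C_c} (x - m_c)(x - m_c)^\top \\
w^\ast &\propto S_W^{-1}\,(m_1 - m_0),
\qquad z_{\text{cut}} = \tfrac{1}{2}\, w^{\ast\top}(m_1 + m_0) \\
\hat{y}(x) &= \begin{cases} 1 & \text{if } w^{\ast\top} x \ge z_{\text{cut}} \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
```

The threshold is simply the midpoint between the two projected class means, which implicitly treats both classes as equally likely.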

In [4]:
class FDA:
    def __calculate_weight_matrix__(self):
        """
        Compute the Fisher weight vector from the scatter matrices: build the within-class scatter matrix SW,
        solve W = SW^-1 (PMean - NMean), and set zcut, the threshold on the projection that separates the two classes.
        """
        P = np.array(self.X_train[self.y_train["result"] == 1])
        N = np.array(self.X_train[self.y_train["result"] == 0])
        PMean = np.mean(P, 0)
        NMean = np.mean(N, 0)
        S1 = (len(P) - 1) * np.cov(P.T)
        S2 = (len(N) - 1) * np.cov(N.T)
        SW = S1 + S2
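        # Between-class scatter as an outer product; kept for reference only, the two-class solution below does not use it.
        SB = np.outer(PMean - NMean, PMean - NMean)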
        W = np.matmul(np.linalg.inv(SW), (PMean - NMean))
        PMean_proj = W.T@PMean
        NMean_proj = W.T@NMean
        self.zcut = 0.5*(PMean_proj + NMean_proj)
        return W
    
    def fit(self, X, y):
        """
        Method to fit the model
        """
        self.X_train = X
        self.y_train = y
        self.W = self.__calculate_weight_matrix__()
        return self
        
    def predict(self, X_test):
        """
        Predict labels for X_test: 1 if the projection W.T @ x is at least zcut, otherwise 0
        """
        return pd.DataFrame(X_test.apply(lambda row: 1 if self.W.T@row >= self.zcut else 0, axis=1))

    def score(self, X_test, y_test):
        """
        Return the accuracy of the model's predictions on X_test, measured against the true labels y_test
        """
        y_pred = self.predict(X_test)
        return accuracy_score(y_test, y_pred)

In [5]:
clf = FDA()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Accuracy score of the FDA : ", score)

Accuracy score of the FDA :  0.7662337662337663
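
As a cross-check of the class above, here is a minimal NumPy-only sketch of the same computation on synthetic data. The helper names `fisher_fit` and `fisher_predict` and the generated data are assumptions made for illustration; they are not part of the notebook. The sketch solves the linear system instead of forming the inverse explicitly and predicts with one matrix product rather than a row-wise `apply`.

```python
import numpy as np

def fisher_fit(X, y):
    """Return (w, zcut) for a two-class Fisher discriminant (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).ravel()
    P, N = X[y == 1], X[y == 0]
    m_p, m_n = P.mean(axis=0), N.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    S_w = (len(P) - 1) * np.cov(P.T) + (len(N) - 1) * np.cov(N.T)
    # Solve S_w w = (m_p - m_n) rather than computing the inverse.
    w = np.linalg.solve(S_w, m_p - m_n)
    # Threshold: midpoint of the two projected class means.
    zcut = 0.5 * (w @ m_p + w @ m_n)
    return w, zcut

def fisher_predict(X, w, zcut):
    """Label 1 where the projection reaches the threshold, else 0."""
    return (np.asarray(X, dtype=float) @ w >= zcut).astype(int)

# Synthetic, well-separated two-class data (not the Pima data).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=2.0, scale=1.0, size=(200, 3))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(200, 3))
X_demo = np.vstack([X_pos, X_neg])
y_demo = np.array([1] * 200 + [0] * 200)

w_demo, zcut_demo = fisher_fit(X_demo, y_demo)
acc_demo = (fisher_predict(X_demo, w_demo, zcut_demo) == y_demo).mean()
print("Training accuracy on synthetic data:", acc_demo)
```

Because the rule is linear, the vectorized prediction should agree with the row-wise version in the `FDA` class when given the same `W` and `zcut`.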


# Decision tree classification

In [6]:
clf = DecisionTreeClassifier(max_depth=3, criterion="entropy")
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
score = accuracy_score(y_test, y_pred)
print("Accuracy score of the DTC : ", score)

Accuracy score of the DTC :  0.7337662337662337


# LDA and then Decision tree

- Feature Scaling

In [7]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

- Dimensionality reduction

In [8]:
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train.values.ravel())
X_test = lda.transform(X_test)
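
With two classes, LDA can produce at most one discriminant axis (n_classes - 1), so `n_components=1` is the only available choice here. That axis should coincide, up to sign and scale, with the Fisher direction computed from the scatter matrices earlier. The sketch below checks this on synthetic data; the data and variable names are assumptions for illustration, and it relies on scikit-learn's default `svd` solver.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
# Synthetic two-class data (not the Pima data).
X_pos = rng.normal(loc=1.0, size=(150, 4))
X_neg = rng.normal(loc=-1.0, size=(150, 4))
X_demo = np.vstack([X_pos, X_neg])
y_demo = np.array([1] * 150 + [0] * 150)

# Fisher direction from the within-class scatter matrix.
S_w = (len(X_pos) - 1) * np.cov(X_pos.T) + (len(X_neg) - 1) * np.cov(X_neg.T)
w_fisher = np.linalg.solve(S_w, X_pos.mean(axis=0) - X_neg.mean(axis=0))

# Direction that scikit-learn's LDA uses for its one-dimensional projection.
lda_demo = LinearDiscriminantAnalysis(n_components=1)
Z_demo = lda_demo.fit_transform(X_demo, y_demo)
w_lda = lda_demo.scalings_[:, 0]

# Absolute cosine similarity: close to 1 means the directions are parallel.
cosine = abs(w_fisher @ w_lda) / (np.linalg.norm(w_fisher) * np.linalg.norm(w_lda))
print(Z_demo.shape, cosine)
```

The projected array has a single column, which is why the decision tree in the next cell effectively chooses thresholds along one axis.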

In [9]:
clf = DecisionTreeClassifier(max_depth=3, criterion="entropy")
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
score = accuracy_score(y_test, y_pred)
print("Accuracy score of the DTC : ", score)

Accuracy score of the DTC :  0.7922077922077922
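
The three steps above (scaling, LDA projection, decision tree) overwrite `X_train` and `X_test` in place. One alternative, sketched here under the assumption that synthetic data from `make_classification` is an acceptable stand-in for the Pima data, is to chain the steps with scikit-learn's `make_pipeline`, so that every transformer is fitted on the training split only and the original arrays stay untouched. All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 8 features (not the Pima data).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=5
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=5
)

# Scale, project onto the single LDA axis, then fit a shallow tree.
pipe = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),
    DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=5),
)
pipe.fit(Xs_train, ys_train)
pipe_score = pipe.score(Xs_test, ys_test)

# The first two steps alone give the one-dimensional data the tree sees.
projected = pipe[:-1].transform(Xs_test)
print("Pipeline accuracy on synthetic data:", pipe_score)
print("Projected test shape:", projected.shape)
```

Fixing `random_state` on the tree is a choice made in this sketch for repeatability; the notebook's own cells leave it unset.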


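Thus, using both methods together, i.e. LDA for dimensionality reduction followed by a decision tree on the reduced data, raised the test accuracy to about 79.2%. That is roughly 2.6 percentage points higher than FDA alone (76.6%) and about 5.8 percentage points higher than the decision tree alone (73.4%) on this train/test split.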