<b>Objective</b>: Implement Fisher's Linear Discriminant Analysis to predict Heart Disease (1-4) or No Heart Disease (0) based on factors like Cholesterol, BP, Blood Sugar, Heart Rate, etc.

Linear Discriminant Analysis is a technique used to <b>find a linear combination</b> of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a <b>linear classifier</b>, or, more commonly, for <b>dimensionality reduction</b> before later classification.

LDA, unlike PCA, is a supervised dimensionality reduction technique. The reduced dimensions can be ranked by their ability to maximize the distance between clusters (classes) while minimizing the distance of the data points within a cluster (class) from its centroid.
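The trade-off described above is usually written as Fisher's criterion. The symbols below are standard textbook notation rather than something defined in this notebook; $S_W$ and $S_B$ correspond to the `vW` and `vB` matrices built in the implementation further down, $\mu_c$ and $n_c$ are the mean and size of class $c$, and $\mu$ is the overall mean.

$$
S_W = \sum_{c} \sum_{x \in C_c} (x - \mu_c)(x - \mu_c)^T, \qquad
S_B = \sum_{c} n_c \, (\mu_c - \mu)(\mu_c - \mu)^T
$$

$$
J(w) = \frac{w^T S_B \, w}{w^T S_W \, w}
$$

Maximizing $J(w)$ leads to the eigenvectors of $S_W^{-1} S_B$, ranked by eigenvalue.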

### Importing Libraries and Dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('processed.cleveland.data',header=None)

### Data Preprocessing 

In [3]:
data.head()

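| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |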


From the dataset description we can name and understand the features.

In [4]:
cols = ['Age','Sex','Chest Pain Type','Resting BP','Cholestrol','Fasting Blood Sugar','Resting ECG',
        'Max Heart Rate','Exercise Angina','Oldpeak','Slope','CA','Thal','Heart Disease']

In [5]:
data.columns = cols

In [6]:
data.head()

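| | Age | Sex | Chest Pain Type | Resting BP | Cholestrol | Fasting Blood Sugar | Resting ECG | Max Heart Rate | Exercise Angina | Oldpeak | Slope | CA | Thal | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |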


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
Age                    303 non-null float64
Sex                    303 non-null float64
Chest Pain Type        303 non-null float64
Resting BP             303 non-null float64
Cholestrol             303 non-null float64
Fasting Blood Sugar    303 non-null float64
Resting ECG            303 non-null float64
Max Heart Rate         303 non-null float64
Exercise Angina        303 non-null float64
Oldpeak                303 non-null float64
Slope                  303 non-null float64
CA                     303 non-null object
Thal                   303 non-null object
Heart Disease          303 non-null int64
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


Looking at the features 'CA' and 'Thal', which are shown as non-numerical (object dtype), we find that both contain '?' values, probably in place of missing values.

To fix this, we'll eliminate the instances having missing values since they form a very small percentage of our data.

In [8]:
data['CA'].unique()

array(['0.0', '3.0', '2.0', '1.0', '?'], dtype=object)

As we can see, there's a '?' value that was probably used for unavailable data. Let's see how many rows have 'CA' as '?'.

In [14]:
len(data[data.CA=='?'])

4

Since it's a very small number, we can safely remove all 4 rows. And then, we'll follow the same steps for the 'Thal' feature.

In [15]:
data_numerical = data[data['CA'] != '?']

In [16]:
data_numerical['Thal'].unique()

array(['6.0', '3.0', '7.0', '?'], dtype=object)

In [17]:
len(data_numerical[data_numerical.Thal=='?'])

2

In [18]:
data_preprocessed = data_numerical[data_numerical['Thal'] != '?']
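An alternative to filtering the '?' rows column by column is to let pandas treat '?' as missing while reading the file. The snippet below is a minimal sketch on a made-up three-row sample in the same 14-column layout (the rows are illustrative, not taken verbatim from the dataset); `na_values` and `dropna` are standard pandas features. A side effect is that the affected columns are parsed as floats rather than strings.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as processed.cleveland.data
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
)

# na_values='?' makes pandas parse '?' as NaN, so columns 11 and 12 stay numeric
demo = pd.read_csv(sample, header=None, na_values='?')

# Drop every row that has at least one missing value
demo_clean = demo.dropna()

print(demo.isna().sum().sum())  # 2 missing cells in the sample
print(demo_clean.shape)         # (1, 14)
```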

### Separating Features and Labels

In [19]:
X = data_preprocessed.drop(['Heart Disease'], axis=1)
y_ = data_preprocessed['Heart Disease']

In [20]:
y_ = y_.to_frame()
y = pd.DataFrame()

Changing all target label values above 0 (i.e. 1, 2, 3, 4) to 1, as we want to classify into no heart disease (0) and heart disease (1-4).

In [21]:
y['Heart Disease'] = y_['Heart Disease'].apply(lambda x: x if x < 1 else 1)

We have mapped the label values 1-4 to 1, as can be seen by comparing the initial dataframe (y_) with our final label dataframe (y):

In [22]:
y_.head(10)

| | Heart Disease |
|---|---|
| 0 | 0 |
| 1 | 2 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 3 |
| 7 | 0 |
| 8 | 2 |
| 9 | 1 |


In [23]:
y.head(10)

| | Heart Disease |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 1 |
| 7 | 0 |
| 8 | 1 |
| 9 | 1 |
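The same binarization can also be written without `apply`, using a vectorized comparison. This is a small sketch on the ten labels shown in `y_.head(10)`; the variable names `labels` and `binary` are illustrative only.

```python
import pandas as pd

# First ten labels, as shown in y_.head(10)
labels = pd.Series([0, 2, 1, 0, 0, 0, 3, 0, 2, 1], name='Heart Disease')

# The comparison yields booleans; astype(int) maps False -> 0 and True -> 1
binary = (labels > 0).astype(int)

print(binary.tolist())  # [0, 1, 1, 0, 0, 0, 1, 0, 1, 1]
```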


### Model

#### Training and Test sets

Splitting the data into training and test sets with an 80-20 split.

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(), test_size=0.2, random_state=10)
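With a small dataset, a purely random split can leave the two classes in different proportions in the training and test sets. One possible variation, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses synthetic stand-in data and illustrative variable names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features, 30% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio the same in both subsets
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

print(Xd_train.shape, Xd_test.shape)
print(yd_train.mean(), yd_test.mean())  # share of positives in each subset
```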

Scaling the features (the scaler is fit on the training set only and then applied to the test set) -

In [25]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
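To make explicit what `fit_transform` and `transform` do here, the sketch below repeats the scaling on synthetic stand-in data and compares it with the manual formula `(x - mean) / std`, using the statistics learned from the training portion only. The data and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales
rng = np.random.default_rng(0)
train = rng.normal(loc=[130.0, 1.0], scale=[20.0, 0.5], size=(200, 2))
test = rng.normal(loc=[130.0, 1.0], scale=[20.0, 0.5], size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Equivalent manual computation with the stored training statistics
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(test_scaled, manual))
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```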

#### Linear Discriminant Analysis

In [26]:
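class LDA:

    def __init__(self):
        self.w = None          # Fisher direction (projection weights)
        self.threshold = None  # Decision threshold on the projected axis

    def fit(self, X, y):
        # Assumes binary labels 0 and 1
        n_features = X.shape[1]
        targets = np.unique(y)
        vW = np.zeros((n_features, n_features)) # vW: scatter within each cluster (class)
        vB = np.zeros((n_features, n_features)) # vB: scatter between the cluster means
        total_mean = np.mean(X, axis=0)
        means = {}
        for i in targets:
            Xi = X[y == i]
            mean_i = np.mean(Xi, axis=0)
            means[i] = mean_i
            vW += (Xi - mean_i).T.dot(Xi - mean_i)
            ni = Xi.shape[0]
            mean_diff = (mean_i - total_mean).reshape(n_features, 1)
            vB += ni * mean_diff.dot(mean_diff.T)

        # The discriminant direction is the leading eigenvector of vW^-1 * vB
        eigvals, eigvecs = np.linalg.eig(np.linalg.inv(vW).dot(vB))
        w = eigvecs[:, np.argmax(eigvals.real)].real
        # The sign of an eigenvector is arbitrary: orient w so class 1 projects above class 0
        if means[targets[1]].dot(w) < means[targets[0]].dot(w):
            w = -w
        self.w = w
        # Use the projection of the overall mean as the decision threshold
        self.threshold = total_mean.dot(w)

    def predict(self, X):
        # Project every sample onto w and compare with the threshold
        return (X.dot(self.w) > self.threshold).astype(int)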
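With only two classes, the between-class scatter is built from a single mean difference, so `vW^-1 * vB` has one non-zero eigenvalue and its eigenvector points along `vW^-1 (mean_1 - mean_0)`. The sketch below checks this numerically on synthetic data (not the heart disease dataset); all names in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data with 4 features
X0 = rng.normal(loc=0.0, size=(60, 4))
X1 = rng.normal(loc=1.0, size=(40, 4))
X_demo = np.vstack([X0, X1])
m0, m1, m = X0.mean(axis=0), X1.mean(axis=0), X_demo.mean(axis=0)

# Within-class and between-class scatter, built the same way as in LDA.fit
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
S_b = 60 * np.outer(m0 - m, m0 - m) + 40 * np.outer(m1 - m, m1 - m)

# Leading eigenvector of S_w^-1 S_b
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
lead = eigvecs[:, np.argmax(eigvals.real)].real

# Closed-form two-class Fisher direction
w_fisher = np.linalg.solve(S_w, m1 - m0)

# Absolute cosine of the angle between the two directions (the sign is arbitrary)
cosine = abs(lead @ w_fisher) / (np.linalg.norm(lead) * np.linalg.norm(w_fisher))
print(round(float(cosine), 6))
```

A value close to 1 indicates that both routes give the same projection axis.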

Fitting our model on the training data

In [27]:
lda = LDA()
lda.fit(X_train,y_train)

Testing the model on X_test

In [28]:
predictions = lda.predict(X_test)
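As a sanity check, a hand-written classifier like this can be compared against scikit-learn's `LinearDiscriminantAnalysis`. The sketch below only shows the scikit-learn side on synthetic, well-separated data; it is not a result for the heart disease dataset. Note that scikit-learn's decision rule also takes class priors into account, so its predictions may differ slightly from a hand-written version.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic, well-separated two-class data with 3 features
X_demo = np.vstack([rng.normal(-2.0, 1.0, size=(50, 3)),
                    rng.normal(2.0, 1.0, size=(50, 3))])
y_demo = np.array([0] * 50 + [1] * 50)

sk_lda = LinearDiscriminantAnalysis()
sk_lda.fit(X_demo, y_demo)
sk_pred = sk_lda.predict(X_demo)

# Training accuracy on the synthetic data and shape of the learned weights
print((sk_pred == y_demo).mean(), sk_lda.coef_.shape)
```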

### Evaluation

In [29]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('Confusion Matrix:\n', confusion_matrix(y_test,predictions))

print("\nWe have achieved an accuracy of about {}% in predicting Heart Disease through our model.".format(str(round(100*accuracy_score(y_test, predictions),2))))

Accuracy score:  0.8833333333333333
Precision score:  0.8461538461538461
Recall score:  0.88
Confusion Matrix:
 [[31  4]
 [ 3 22]]

We have achieved an accuracy of about 88.33% in predicting Heart Disease through our model.
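The three scores can be re-derived by hand from the confusion matrix printed above, which makes it easier to see what each one measures. In scikit-learn's convention the rows are true labels and the columns are predicted labels, so the matrix unpacks as TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions
cm = np.array([[31, 4],
               [3, 22]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # (22 + 31) / 60
precision = tp / (tp + fp)        # 22 / 26
recall = tp / (tp + fn)           # 22 / 25

print(round(accuracy, 4), round(precision, 4), round(recall, 4))  # 0.8833 0.8462 0.88
```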
