Implementing Fisher's Linear Discriminant Analysis to predict Heart Disease (1-4) or No Heart Disease (0) based on factors like Cholesterol, BP, Blood Sugar, Heart Rate, etc.

### Importing Libraries and Dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('processed.cleveland.data',header=None)

### Data Preprocessing

In [3]:
data.head()

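| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |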


From the dataset description we can name and understand the features.

In [4]:
cols = ['Age','Sex','Chest Pain Type','Resting BP','Cholestrol','Fasting Blood Sugar','Resting ECG',
        'Max Heart Rate','Exercise Angina','Oldpeak','Slope','CA','Thal','Heart Disease']

In [5]:
data.columns = cols

In [6]:
data.head()

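| | Age | Sex | Chest Pain Type | Resting BP | Cholestrol | Fasting Blood Sugar | Resting ECG | Max Heart Rate | Exercise Angina | Oldpeak | Slope | CA | Thal | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |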


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
Age                    303 non-null float64
Sex                    303 non-null float64
Chest Pain Type        303 non-null float64
Resting BP             303 non-null float64
Cholestrol             303 non-null float64
Fasting Blood Sugar    303 non-null float64
Resting ECG            303 non-null float64
Max Heart Rate         303 non-null float64
Exercise Angina        303 non-null float64
Oldpeak                303 non-null float64
Slope                  303 non-null float64
CA                     303 non-null object
Thal                   303 non-null object
Heart Disease          303 non-null int64
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


`data.info()` reports the features 'CA' and 'Thal' as non-numerical (`object`) columns. Both contain some '?' entries, probably in place of missing values.

To fix this, we'll drop the rows that have missing values, since they form a very small percentage of our data.
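
As a sketch of an alternative to filtering the '?' strings by hand, the placeholder can be parsed as NaN at load time and the incomplete rows dropped in one step. The snippet below uses a tiny made-up sample in place of the real file (the names `sample_csv`, `sample` and `sample_clean` are illustrative only); `na_values` and `dropna` are standard pandas features.

```python
import io
import pandas as pd

# Tiny made-up sample in the same comma-separated layout; '?' marks a missing value.
sample_csv = io.StringIO(
    "63.0,1.0,0.0,6.0,0\n"
    "67.0,1.0,?,3.0,2\n"
    "41.0,0.0,0.0,?,0\n"
    "56.0,1.0,1.0,7.0,1\n"
)

# na_values='?' makes pandas read the placeholder as NaN, so the columns stay numeric.
sample = pd.read_csv(sample_csv, header=None, na_values='?')

missing_per_column = sample.isna().sum()  # number of missing entries in each column
sample_clean = sample.dropna()            # keep only the complete rows

print(missing_per_column.tolist())
print(sample_clean.shape)
```

A side effect of this approach is that the affected columns are parsed as floats rather than strings, which may be preferable before scaling.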

In [8]:
pd.set_option('display.max_rows', 500)
data['CA']

0      0.0
1      3.0
2      2.0
3      0.0
4      0.0
5      0.0
6      2.0
7      0.0
8      1.0
9      0.0
10     0.0
11     0.0
12     1.0
13     0.0
14     0.0
15     0.0
16     0.0
17     0.0
18     0.0
19     0.0
20     0.0
21     0.0
22     0.0
23     2.0
24     2.0
25     0.0
26     0.0
27     0.0
28     0.0
29     0.0
30     2.0
31     2.0
32     0.0
33     0.0
34     0.0
35     0.0
36     0.0
37     1.0
38     1.0
39     0.0
40     3.0
41     0.0
42     2.0
43     0.0
44     0.0
45     1.0
46     0.0
47     0.0
48     1.0
49     0.0
50     1.0
51     0.0
52     1.0
53     0.0
54     1.0
55     1.0
56     1.0
57     0.0
58     1.0
59     1.0
60     0.0
61     0.0
62     3.0
63     0.0
64     1.0
65     2.0
66     0.0
67     0.0
68     0.0
69     0.0
70     0.0
71     2.0
72     2.0
73     2.0
74     1.0
75     0.0
76     1.0
77     1.0
78     0.0
79     0.0
80     0.0
81     0.0
82     0.0
83     0.0
84     0.0
85     0.0
86     0.0
87     0.0
88     0.0
89     0.0
90     0.0

In [9]:
data_numerical = data[data['CA'] != '?']

In [10]:
data_numerical['Thal']

0      6.0
1      3.0
2      7.0
3      3.0
4      3.0
5      3.0
6      3.0
7      3.0
8      7.0
9      7.0
10     6.0
11     3.0
12     6.0
13     7.0
14     7.0
15     3.0
16     7.0
17     3.0
18     3.0
19     3.0
20     3.0
21     3.0
22     3.0
23     7.0
24     7.0
25     3.0
26     3.0
27     3.0
28     3.0
29     7.0
30     3.0
31     7.0
32     3.0
33     7.0
34     3.0
35     3.0
36     7.0
37     6.0
38     7.0
39     3.0
40     7.0
41     7.0
42     3.0
43     3.0
44     3.0
45     7.0
46     3.0
47     7.0
48     3.0
49     3.0
50     3.0
51     7.0
52     3.0
53     3.0
54     7.0
55     7.0
56     7.0
57     7.0
58     3.0
59     3.0
60     7.0
61     3.0
62     7.0
63     3.0
64     7.0
65     7.0
66     3.0
67     7.0
68     7.0
69     3.0
70     3.0
71     7.0
72     7.0
73     6.0
74     3.0
75     3.0
76     7.0
77     3.0
78     3.0
79     7.0
80     3.0
81     3.0
82     3.0
83     7.0
84     3.0
85     3.0
86     3.0
87       ?
88     3.0
89     3.0
90     3.0

In [11]:
data_preprocessed = data_numerical[data_numerical['Thal'] != '?']

In [12]:
X = data_preprocessed.drop(['Heart Disease'], axis=1)
y_ = data_preprocessed['Heart Disease']

In [13]:
y_ = y_.to_frame()
y = pd.DataFrame()

Mapping all target label values above 0 (i.e. 1, 2, 3, 4) to 1, since we want to classify into no heart disease (0) and heart disease (1-4).

In [14]:
y['Heart Disease'] = y_['Heart Disease'].apply(lambda x: x if x < 1 else 1)

Label values 1-4 have been mapped to 1, as can be seen by comparing the initial dataframe (y_) with our final label dataframe (y):

In [15]:
y_.head(15)

| | Heart Disease |
|---|---|
| 0 | 0 |
| 1 | 2 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 3 |
| 7 | 0 |
| 8 | 2 |
| 9 | 1 |


In [16]:
y.head(15)

| | Heart Disease |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 1 |
| 7 | 0 |
| 8 | 1 |
| 9 | 1 |
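
The same binarization can also be written as a vectorized comparison instead of `apply` with a lambda. This is only a sketch on a made-up series (`raw_labels` and `binary_labels` are illustrative names, not variables from the notebook):

```python
import pandas as pd

# Hypothetical labels in the dataset's 0-4 coding (0 = no disease, 1-4 = disease).
raw_labels = pd.Series([0, 2, 1, 0, 3, 4], name='Heart Disease')

# The comparison yields booleans; casting to int gives the 0/1 labels.
binary_labels = (raw_labels > 0).astype(int)

print(binary_labels.tolist())  # [0, 1, 1, 0, 1, 1]
```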


### Model

#### Training and Test sets

Splitting the data into training and test sets with an 80-20 split.

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(), test_size=0.2, random_state=10)

Standardizing the features, with the scaler fitted on the training set only:

In [18]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
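
To make the scaling step concrete, here is a rough NumPy sketch of what standardization does, on made-up numbers. The key point is that the mean and standard deviation come from the training split and are then reused for the test split. The arrays `train` and `test` are illustrative and unrelated to the heart disease data.

```python
import numpy as np

# Made-up numbers standing in for a training and a test split.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are computed on the training split only
# (population standard deviation, ddof=0, which is what StandardScaler uses).
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately zero in each column
print(test_scaled)
```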

#### Linear Discriminant Analysis

In [19]:
class LDA:

    def __init__(self):
        self.w = None # vW^-1 * vB, set by fit()

    def fit(self, X, y):
        n_features = X.shape[1]
        targets = np.unique(y)
        vW = np.zeros((n_features, n_features)) # vW: within-class scatter matrix
        vB = np.zeros((n_features, n_features)) # vB: between-class scatter matrix
        total_mean = np.mean(X, axis=0)
        for i in targets:
            Xi = X[y == i]
            mean_i = np.mean(Xi, axis=0)
            vW += (Xi - mean_i).T.dot((Xi - mean_i))
            ni = Xi.shape[0]
            mean_diff = (mean_i - total_mean).reshape(n_features, 1)
            vB += ni * (mean_diff).dot(mean_diff.T)

        # vW^-1 * vB; with two classes vB has rank one, so every column of
        # this matrix is a multiple of the same Fisher direction
        self.w = np.linalg.inv(vW).dot(vB)

    def predict(self, X):
        y_pred = []
        for sample in X:
            h = sample.dot(self.w)
            labels = 1 * (h < 0)
            # Column 7 ('Max Heart Rate') is hard-coded: a negative projection on
            # it is read as class 1. The sign convention depends on the data, so
            # this choice is specific to this dataset.
            y_pred.append(labels[7])
        return y_pred
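
For two classes, the direction that maximizes Fisher's criterion has the well-known closed form w ∝ S_W⁻¹(m₁ − m₀). Because the between-class scatter has rank one in that case, every column of vW⁻¹·vB is a multiple of this same direction. The sketch below computes w directly on synthetic data. It is an illustration, not a drop-in replacement for the class above: the function name `fisher_direction`, the midpoint threshold and the data are all assumptions.

```python
import numpy as np

def fisher_direction(X, y):
    """Return (w, threshold) for two classes labelled 0 and 1."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Solve Sw w = (m1 - m0) instead of forming the inverse explicitly.
    w = np.linalg.solve(Sw, m1 - m0)
    # Assumed decision rule: midpoint between the projected class means.
    threshold = 0.5 * (m0 @ w + m1 @ w)
    return w, threshold

# Synthetic, well-separated data (not the heart disease data).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
                    rng.normal(3.0, 1.0, size=(50, 3))])
y_demo = np.array([0] * 50 + [1] * 50)

w, threshold = fisher_direction(X_demo, y_demo)
# Class 1 projects above the threshold because w points from m0 towards m1.
pred_demo = (X_demo @ w > threshold).astype(int)
print('accuracy on the synthetic data:', (pred_demo == y_demo).mean())
```

Using a single projection vector with an explicit threshold avoids having to pick one column of the full matrix by index.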

Fitting the model on the training data

In [20]:
lda = LDA()
lda.fit(X_train,y_train)

Testing the model on X_test

In [21]:
predictions = lda.predict(X_test)

### Evaluation

In [22]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('Confusion Matrix:\n', confusion_matrix(y_test,predictions))

print("\nWe have achieved an accuracy of about {}% in predicting Heart Disease through our model.".format(str(round(100*accuracy_score(y_test, predictions),2))))

Accuracy score:  0.8833333333333333
Precision score:  0.8461538461538461
Recall score:  0.88
Confusion Matrix:
 [[31  4]
 [ 3 22]]

We have achieved an accuracy of about 88.33% in predicting Heart Disease through our model.
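
As a quick cross-check, the three scores can be recomputed by hand from the confusion matrix printed above. scikit-learn's convention is rows for the true class and columns for the predicted class, so the entries read as TN, FP, FN, TP. The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[31, 4],
               [3, 22]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # 53 / 60
precision = tp / (tp + fp)       # 22 / 26
recall = tp / (tp + fn)          # 22 / 25

print(accuracy, precision, recall)
```

These ratios agree with the accuracy, precision and recall values printed by scikit-learn.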
