## This project analyzes the Heart Failure Prediction Dataset using Logistic Regression and Random Forests, then compares the results. These two methods were chosen because the target is binary (0/1). Link to the dataset: https://www.kaggle.com/fedesoriano/heart-failure-prediction.
### Finished on 20/01/2022.

### First, we need to read the comma-separated file using pandas. Custom headers were given for convenience.

In [1]:
import pandas as pd
headers = ["age", "sex", "cpa", "rbp", "chol",
           "fbs", "recg", "maxhr", "EA",
           "oldpeak", "ST_slope", "HD"]
# cpa = ChestPainType
# rbp = resting blood pressure
# chol = cholesterol
# fbs = fasting blood sugar
# recg = resting ECG
# EA = exercise angina
# HD = heart disease

In [2]:
df = pd.read_csv("../input/heart-failure-prediction/heart.csv", header=None, names=headers)
df.head()

Unnamed: 0,age,sex,cpa,rbp,chol,fbs,recg,maxhr,EA,oldpeak,ST_slope,HD
0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
1,40,M,ATA,140,289,0,Normal,172,N,0,Up,0
2,49,F,NAP,160,180,0,Normal,156,N,1,Flat,1
3,37,M,ATA,130,283,0,ST,98,N,0,Up,0
4,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1


### Because `header=None` was passed to `read_csv`, pandas treated the file's own header line as the first row of data, so that row has to be removed.

In [3]:
df = df.iloc[1: , :]

In [4]:
df.head()

Unnamed: 0,age,sex,cpa,rbp,chol,fbs,recg,maxhr,EA,oldpeak,ST_slope,HD
1,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
2,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
3,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
4,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
5,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
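The extra row can also be avoided at read time. A minimal sketch, assuming the first line of the file holds the original column names (as the output above suggests): passing `header=0` together with `names` tells pandas to discard that line and use the custom headers instead. The inline CSV text is a stand-in built from the rows shown above, and `df_alt` is a name introduced only for this illustration.

```python
import io
import pandas as pd

headers = ["age", "sex", "cpa", "rbp", "chol",
           "fbs", "recg", "maxhr", "EA",
           "oldpeak", "ST_slope", "HD"]

# Stand-in for heart.csv: the original header line plus three rows shown above.
sample_csv = io.StringIO(
    "Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,"
    "MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease\n"
    "40,M,ATA,140,289,0,Normal,172,N,0,Up,0\n"
    "49,F,NAP,160,180,0,Normal,156,N,1,Flat,1\n"
    "37,M,ATA,130,283,0,ST,98,N,0,Up,0\n"
)

# header=0 marks the first line as a header row that `names` replaces,
# so it is not read as data and numeric columns keep numeric dtypes.
df_alt = pd.read_csv(sample_csv, header=0, names=headers)
print(df_alt.shape)
print(df_alt.dtypes)
```

Read this way, the index starts at 0 rather than 1, and the numeric columns should not need a separate type conversion.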


### To check if there are any missing values, the following function was used.

In [5]:
df.isnull().sum()

age         0
sex         0
cpa         0
rbp         0
chol        0
fbs         0
recg        0
maxhr       0
EA          0
oldpeak     0
ST_slope    0
HD          0
dtype: int64

### Logistic regression only works with numeric inputs, so the string values in the categorical columns have to be replaced by numbers. The next cells list the unique values in those columns so that they can be mapped to numbers.

#### For example, the Chest Pain Type column takes one of the values "ASY", "NAP", "ATA", or "TA", which can be mapped to 1, 2, 3, and 4 respectively.

In [6]:
df['cpa'].value_counts()

ASY    496
NAP    203
ATA    173
TA      46
Name: cpa, dtype: int64

In [7]:
df['recg'].value_counts()

Normal    552
LVH       188
ST        178
Name: recg, dtype: int64

In [8]:
df['ST_slope'].value_counts()

Flat    460
Up      395
Down     63
Name: ST_slope, dtype: int64

In [9]:
cleanup_nums = {"cpa": {"ASY": 1, "NAP": 2, "ATA": 3, "TA": 4},
                "sex": {"M": 0, "F": 1},
                "recg": {"Normal": 1, "LVH": 2, "ST": 3},
                "EA": {"Y": 1, "N": 0},
                "ST_slope": {"Down": 0, "Flat": 1, "Up": 2}
                }

### The "replace" method uses the above dictionary to replace strings with numbers.

In [10]:
df = df.replace(cleanup_nums)
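Integer codes such as ASY=1, NAP=2, ATA=3, TA=4 impose an order that a linear model like logistic regression treats as meaningful. One common alternative, not used in this notebook, is one-hot encoding. The sketch below applies `pd.get_dummies` to a small toy frame; the frame and the name `encoded` are illustrative only.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "cpa": ["ATA", "NAP", "ATA", "ASY", "TA"],
    "recg": ["Normal", "Normal", "ST", "Normal", "LVH"],
    "age": [40, 49, 37, 48, 54],
})

# One indicator column per category; drop_first removes one redundant
# column per feature (the first category in sorted order).
encoded = pd.get_dummies(toy, columns=["cpa", "recg"], drop_first=True, dtype=int)
print(encoded)
```

Tree-based models are generally less sensitive to this choice, so integer codes may be adequate for the Random Forest.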

### Here, it is seen that the dataframe only has numbers (at least at a glance).

In [11]:
df.head()

Unnamed: 0,age,sex,cpa,rbp,chol,fbs,recg,maxhr,EA,oldpeak,ST_slope,HD
1,40,0,3,140,289,0,1,172,0,0.0,2,0
2,49,1,2,160,180,0,1,156,0,1.0,1,1
3,37,0,3,130,283,0,3,98,0,0.0,2,0
4,48,1,1,138,214,0,1,108,1,1.5,1,1
5,54,0,2,150,195,0,1,122,0,0.0,2,0


### However, the columns left untouched by the dictionary (which look numeric) are stored with the "object" dtype rather than "int" or "float", because the stray header row made pandas read them as strings. So their data types have to be converted.

In [12]:
df['age'].dtype

dtype('O')

In [13]:
# Convert every column that is still stored as strings to float32.
num_cols = ['age', 'rbp', 'chol', 'fbs', 'maxhr', 'oldpeak', 'HD']
df = df.astype({col: 'float32' for col in num_cols})

In [14]:
df['age']

1      40.0
2      49.0
3      37.0
4      48.0
5      54.0
       ... 
914    45.0
915    68.0
916    57.0
917    57.0
918    38.0
Name: age, Length: 918, dtype: float32

In [15]:
df['HD']

1      0.0
2      1.0
3      0.0
4      1.0
5      0.0
      ... 
914    1.0
915    1.0
916    1.0
917    1.0
918    0.0
Name: HD, Length: 918, dtype: float32

### As the dataset now truly contains only numbers, we can separate the independent and dependent variables. The outcome (Heart Disease) is stored in "y", and the features are stored in the feature matrix X, since all the other columns are treated as potential predictors of heart disease.

In [16]:
X = df.drop("HD", axis=1)
y = df['HD']

### Importing the scikit-learn function which splits X and y into train and test datasets.

In [17]:
from sklearn.model_selection import train_test_split

### 3:1 ratio should suffice for the dataset of this size.

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
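With a dataset of this size, the class balance of the test set can drift from that of the full data. A minimal sketch, on synthetic stand-in data, of the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both parts. All `*_demo`, `*_tr` and `*_te` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 3 features, 55% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 90 + [1] * 110)

# stratify=y_demo preserves the positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```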

In [19]:
from sklearn.linear_model import LogisticRegression

### Here, specific parameters were set to get rid of a problem that kept showing up (presumably the solver failing to converge within the default number of iterations). An alternative would be to scale the inputs so that they have much smaller values, but setting a large maximum iteration count worked as well.

In [20]:
logmodel = LogisticRegression(solver="lbfgs", max_iter=1000)
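The scaling alternative mentioned above can be sketched with a pipeline. This is an illustration on synthetic data, not the approach used in this notebook: `StandardScaler` standardizes each feature before `LogisticRegression` sees it, which usually lets `lbfgs` converge within the default iteration limit. The name `scaled_logmodel` and the demo arrays are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 11 features, the same width as X here.
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)
# Inflate one feature to mimic a large-valued column such as cholesterol.
X_demo[:, 0] = X_demo[:, 0] * 100 + 200

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_logmodel = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logmodel.fit(X_tr, y_tr)

demo_score = scaled_logmodel.score(X_te, y_te)
n_iter = scaled_logmodel.named_steps["logisticregression"].n_iter_
print(demo_score, n_iter)
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training.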

### Checking the datatype of a column in the train dataset to make sure it is numeric rather than "object".

In [21]:
X_train['age']

156    56.0
363    56.0
870    59.0
102    51.0
200    57.0
       ... 
107    48.0
271    45.0
861    60.0
436    60.0
103    40.0
Name: age, Length: 688, dtype: float32

### Training the model on train dataset.

In [22]:
logmodel.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

### "Predictions" will have the final results made by our model.

In [23]:
predictions = logmodel.predict(X_test)

In [24]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [25]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         0.0       0.79      0.88      0.83        98
         1.0       0.90      0.83      0.86       132

    accuracy                           0.85       230
   macro avg       0.84      0.85      0.85       230
weighted avg       0.85      0.85      0.85       230



### The confusion matrix breaks the results down further: the first row holds the True Negatives and False Positives, and the second row the False Negatives and True Positives.

In [26]:
confusion_matrix(y_test, predictions)

array([[ 86,  12],
       [ 23, 109]])

### To calculate the accuracy, the number of correct predictions is divided by the total number of predictions.

In [27]:
print(f"The accuracy of the model is {(86+109)/(86+12+23+109)}")

The accuracy of the model is 0.8478260869565217
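The same figures can be derived from the matrix instead of being typed in by hand. A minimal sketch using the confusion matrix reported above, where the variable names are illustrative:

```python
import numpy as np

# Confusion matrix reported above for Logistic Regression.
# Rows are true classes, columns are predicted classes.
cm = np.array([[86, 12],
               [23, 109]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.8478 precision=0.9008 recall=0.8258
```

The precision and recall agree with the 0.90 and 0.83 shown for class 1.0 in the classification report. `sklearn.metrics.accuracy_score(y_test, predictions)` gives the accuracy directly.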


In [28]:
from sklearn.ensemble import RandomForestRegressor

In [29]:
forest_model = RandomForestRegressor(random_state=1)

### Training the second model on the previous train dataset.

In [30]:
forest_model.fit(X_train, y_train)

RandomForestRegressor(random_state=1)
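For a binary target, scikit-learn also provides `RandomForestClassifier`, which returns class labels directly and exposes class probabilities through `predict_proba`. The sketch below is an alternative to the regressor used in this notebook, shown on synthetic stand-in data. `forest_clf` and the demo arrays are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 11 features.
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

forest_clf = RandomForestClassifier(random_state=1)
forest_clf.fit(X_tr, y_tr)

labels = forest_clf.predict(X_te)              # class labels, already 0 or 1
proba = forest_clf.predict_proba(X_te)[:, 1]   # estimated probability of class 1
print(labels[:5], proba[:5])
```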

### Below, the heart disease predictions come out as real numbers between 0 and 1, because a regressor was used. To binarize them, a for loop is used. The threshold of 0.5 was chosen intuitively; it could be chosen in a more principled way.

In [31]:
heart_predictions = forest_model.predict(X_test)
print(heart_predictions)
for i in range(len(heart_predictions)):
    if heart_predictions[i]<0.5:
        heart_predictions[i]=0
    else:
        heart_predictions[i]=1
print(heart_predictions)

[0.11 0.67 0.99 0.99 0.03 0.56 0.74 0.01 0.64 0.94 0.48 0.02 0.68 0.08
 0.96 0.8  0.15 0.46 0.86 0.38 0.82 0.95 0.   0.56 0.73 0.94 0.   0.61
 0.   0.   0.98 0.02 0.53 0.95 0.92 0.53 1.   0.   0.92 0.68 0.68 0.76
 0.45 0.   0.15 0.65 0.86 0.92 1.   0.34 0.01 0.   1.   0.97 0.31 0.08
 0.25 0.85 0.73 0.87 0.23 0.08 0.   0.95 0.1  0.95 1.   0.9  0.99 0.64
 0.36 0.   0.73 0.41 0.04 0.76 0.25 0.57 0.02 0.53 0.49 0.91 0.7  0.01
 0.93 0.91 0.   0.65 0.05 0.34 0.32 0.9  0.99 0.08 0.37 0.   0.89 0.46
 0.8  0.72 0.06 0.97 0.85 0.   0.62 0.89 0.   0.52 0.96 0.09 0.97 0.95
 0.01 0.15 0.97 0.21 0.87 0.   0.9  0.57 0.75 0.53 0.69 0.7  0.01 0.
 0.11 0.23 0.   0.3  0.23 0.99 0.93 0.05 0.96 0.05 0.51 0.73 0.   0.72
 0.15 0.1  0.92 0.86 0.92 0.84 0.86 0.34 0.08 0.28 0.   0.61 0.71 0.03
 0.69 0.02 0.01 0.03 0.   0.82 0.01 0.18 1.   0.01 0.93 0.97 0.54 0.98
 0.11 0.24 0.95 0.91 0.02 0.86 0.   0.85 0.47 0.72 0.13 0.56 0.72 1.
 0.01 0.98 0.24 0.   0.93 0.   1.   0.98 0.64 0.97 0.52 0.03 0.53 0.32
 0.   0.  
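The binarization loop can be written as a single vectorized comparison, and a threshold can be picked from data instead of by intuition. The sketch below uses the first 14 scores printed above. The true labels are made up purely for illustration, and maximizing Youden's J (TPR minus FPR) is only one of several possible criteria.

```python
import numpy as np
from sklearn.metrics import roc_curve

# First 14 scores from the output above.
scores = np.array([0.11, 0.67, 0.99, 0.99, 0.03, 0.56, 0.74,
                   0.01, 0.64, 0.94, 0.48, 0.02, 0.68, 0.08])
# Hypothetical true labels, invented for this example only.
y_true = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0])

# Vectorized equivalent of the loop: below 0.5 becomes 0, otherwise 1.
binary = (scores >= 0.5).astype(int)

# Candidate thresholds from the ROC curve; keep the one maximizing TPR - FPR.
fpr, tpr, thresholds = roc_curve(y_true, scores)
best_threshold = thresholds[np.argmax(tpr - fpr)]
print(binary, best_threshold)
```

In practice the threshold would be tuned on a validation split rather than on the test set, so that the reported accuracy stays unbiased.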

In [32]:
print(classification_report(y_test, heart_predictions))

              precision    recall  f1-score   support

         0.0       0.83      0.88      0.86        98
         1.0       0.91      0.87      0.89       132

    accuracy                           0.87       230
   macro avg       0.87      0.87      0.87       230
weighted avg       0.88      0.87      0.87       230



In [33]:
confusion_matrix(y_test, heart_predictions)

array([[ 86,  12],
       [ 17, 115]])

### The confusion matrix for the Random Forest shows an accuracy of ~87.39%.
### That is slightly higher than the ~84.78% of Logistic Regression.

In [34]:
print(f"Accuracy of the Random Forests Model is: {(86+115)/(86+12+17+115)}")

Accuracy of the Random Forests Model is: 0.8739130434782608
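On 230 test rows, the gap between the two models comes down to 6 predictions (201 correct versus 195), so a single split may not settle which model is better. A minimal sketch of a k-fold comparison with `cross_val_score`, run on synthetic stand-in data. The models, the 5 folds and the `cv_means` dictionary are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 11 features.
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=1),
}

# Mean and spread of accuracy over 5 folds for each model.
cv_means = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

If the spread across folds is larger than the gap between the means, the difference is probably not meaningful.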


## Two models were built, both with reasonably high accuracy. This project has shown me that Decision Trees and Random Forests can be easier to implement and require less data wrangling than Logistic Regression. With the methods I used, Random Forests turned out to be slightly more accurate. Still, this project can be improved in many ways:
- choosing an optimal max_iter value for Logistic Regression;
- choosing the split size on a more principled basis;
- choosing a more accurate threshold value or using better ways to binarize numbers for Random Forests;
- analyzing weights of the features (which factors were the most deciding?);
- using Gradient Descent for further optimization, among many others.
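As a starting point for the feature-weight item above, both model families expose per-feature numbers: `coef_` for logistic regression (comparable across features only when the inputs are on a common scale) and `feature_importances_` for random forests. The sketch below runs on synthetic data. The column names are borrowed from this notebook purely as labels, so the printed values say nothing about heart disease.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["age", "sex", "cpa", "rbp", "chol", "fbs",
                 "recg", "maxhr", "EA", "oldpeak", "ST_slope"]

# Synthetic stand-in data; the names above are labels only.
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

logmodel_demo = LogisticRegression(max_iter=1000).fit(X_scaled, y_demo)
forest_demo = RandomForestClassifier(random_state=1).fit(X_scaled, y_demo)

# One row per feature: signed coefficient and impurity-based importance.
weights = pd.DataFrame({
    "logreg_coef": logmodel_demo.coef_[0],
    "forest_importance": forest_demo.feature_importances_,
}, index=feature_names)
print(weights.sort_values("forest_importance", ascending=False))
```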