
# Logistic Regression - Heart Disease Prediction

---

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

csv_file = 'https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/uci-heart-disease/heart.csv'
df = pd.read_csv(csv_file)
df.info() # info() prints its summary itself and returns None, so it is called on its own.
print("\n", df.head(), "\n")


print("Number of records in each label:")
print(df['target'].value_counts())


print("\nPercentage of records in each label:")
print(df['target'].value_counts() * 100 / df.shape[0])


X = df.drop(columns = 'target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

    age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187 

---

#### Multivariate Logistic Regression


In [3]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

lg_clf_1 = LogisticRegression()
lg_clf_1.fit(X_train, y_train)
lg_clf_1.score(X_train, y_train)

y_train_pred = lg_clf_1.predict(X_train)

print(f"{'Train Set'.upper()}\n{'-' * 75}\nConfusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))

print("\nClassification Report:")
print(classification_report(y_train, y_train_pred))

TRAIN SET
---------------------------------------------------------------------------
Confusion Matrix:
[[ 77  20]
 [  9 106]]

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.79      0.84        97
           1       0.84      0.92      0.88       115

    accuracy                           0.86       212
   macro avg       0.87      0.86      0.86       212
weighted avg       0.87      0.86      0.86       212



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
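
The convergence warning above suggests two remedies: raising `max_iter` or scaling the data. Below is a minimal sketch of both combined. It runs on synthetic data from `make_classification` rather than on the heart disease dataset, and the sample size, feature count and `max_iter=1000` are illustrative assumptions, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 300 rows, 13 features).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training data only, then fits the classifier
# on the scaled features with a larger iteration budget.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_tr, y_tr)

test_accuracy = scaled_clf.score(X_te, y_te)
iterations_used = scaled_clf.named_steps["logisticregression"].n_iter_[0]
print(f"test accuracy: {test_accuracy:.2f}, solver iterations: {iterations_used}")
```

If the solver stops well below `max_iter`, the warning should no longer appear.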


In [4]:
y_test_pred = lg_clf_1.predict(X_test)

print(f"{'Test Set'.upper()}\n{'-' * 75}\nConfusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))

print("\nClassification Report")
print(classification_report(y_test, y_test_pred))

TEST SET
---------------------------------------------------------------------------
Confusion Matrix:
[[32  9]
 [ 8 42]]

Classification Report
              precision    recall  f1-score   support

           0       0.80      0.78      0.79        41
           1       0.82      0.84      0.83        50

    accuracy                           0.81        91
   macro avg       0.81      0.81      0.81        91
weighted avg       0.81      0.81      0.81        91



As the reports show,
- The FP and FN counts in the test confusion matrix are low (9 and 8 out of 91 records)
- The precision and recall values are good for both classes (0.78 to 0.84 on the test set)
- The f1-scores are greater than **0.7** for both classes (0.79 and 0.83 on the test set)

This suggests that the decision boundary separates the two labels (or classes) reasonably well: the test accuracy is 0.81, against 0.86 on the train set.

However, this logistic regression model (the object stored in the `lg_clf_1` variable) was built using all the features (or independent variables). It is quite possible that not all features are important for classifying the labels in the `target` column. We may therefore be able to improve the model by reducing the number of features to obtain higher f1-scores.
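
The precision, recall and f1 figures in the test-set report can be re-derived by hand from the test confusion matrix. The sketch below does this in plain Python. The helper name `per_class_metrics` is introduced here for illustration only; it is not part of the notebook or of scikit-learn.

```python
# Test-set confusion matrix from the output above:
# rows are actual labels, columns are predicted labels.
cm = [[32, 9],
      [8, 42]]

def per_class_metrics(cm, label):
    """Return (precision, recall, f1) for one label of a 2x2 confusion matrix."""
    other = 1 - label
    tp = cm[label][label]  # actual == label, predicted == label
    fp = cm[other][label]  # actual != label, predicted == label
    fn = cm[label][other]  # actual == label, predicted != label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: per_class_metrics(cm, label) for label in (0, 1)}
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)

for label, (p, r, f1) in metrics.items():
    print(f"label {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Rounded to two decimals, these values agree with the classification report for the test set.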

---

#### Data Standardisation






In [5]:
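def standard_scaler(series):
  # Z-score standardisation: subtract the column mean and divide by the sample standard deviation.
  new_series = (series - series.mean()) / series.std()
  return new_series

# Each set is standardised column by column using its own mean and standard deviation.
norm_X_train = X_train.apply(standard_scaler, axis = 0)
norm_X_test = X_test.apply(standard_scaler, axis = 0)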

norm_X_train.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 | 212.0 |
| mean | 1.864337e-16 | 1.298751e-16 | 2.2518670000000003e-17 | 5.697748e-16 | 1.424437e-16 | -5.81296e-17 | -1.005485e-16 | 3.05835e-16 | 9.216946000000001e-17 | 7.541138000000001e-17 | 5.865329e-17 | 7.960090000000001e-17 | 3.7705690000000006e-17 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -2.757098 | -1.391141 | -0.9778484 | -2.142798 | -2.129975 | -0.3811266 | -1.029172 | -2.731467 | -0.6855616 | -0.928991 | -2.305793 | -0.6746937 | -3.912465 |
| 25% | -0.7177485 | -1.391141 | -0.9778484 | -0.6152369 | -0.6649586 | -0.3811266 | -1.029172 | -0.6547229 | -0.6855616 | -0.928991 | -0.676366 | -0.6746937 | -0.5475864 |
| 50% | 0.07080006 | 0.7154438 | -0.0136444 | -0.02771338 | -0.1338901 | -0.3811266 | 0.8680843 | 0.1693821 | -0.6855616 | -0.1961683 | -0.676366 | -0.6746937 | -0.5475864 |
| 75% | 0.723392 | 0.7154438 | 0.9505596 | 0.5598102 | 0.5162111 | -0.3811266 | 0.8680843 | 0.7847138 | 1.451778 | 0.5366543 | 0.9530612 | 0.3770347 | 1.134853 |
| max | 2.463637 | 0.7154438 | 1.914764 | 3.614933 | 5.799427 | 2.611423 | 2.765341 | 2.279091 | 1.451778 | 4.200768 | 0.9530612 | 3.53222 | 1.134853 |


In [6]:
norm_X_test.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 | 91.0 |
| mean | -2.147245e-16 | -1.390829e-16 | -1.9520400000000003e-17 | -6.868742e-16 | -4.1480860000000003e-17 | 3.5380730000000005e-17 | -4.880101e-18 | -5.29491e-16 | -1.049222e-16 | 1.339037e-16 | -1.244426e-16 | -6.344132000000001e-17 | 1.848338e-16 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -2.301763 | -1.661622 | -0.8425578 | -1.853721 | -2.624853 | -0.4938276 | -0.9430373 | -3.319275 | -0.714835 | -0.8367971 | -2.184053 | -0.8102615 | -3.491486 |
| 25% | -0.8354271 | -1.661622 | -0.8425578 | -0.6650121 | -0.7201088 | -0.4938276 | -0.9430373 | -0.6418709 | -0.714835 | -0.8367971 | -0.5812398 | -0.8102615 | -0.4364358 |
| 50% | 0.1797284 | 0.595208 | -0.8425578 | -0.0166253 | -0.01836075 | -0.4938276 | -0.9430373 | 0.1078023 | -0.714835 | -0.3799059 | -0.5812398 | -0.8102615 | -0.4364358 |
| 75% | 0.6309086 | 0.595208 | 1.12341 | 0.4696648 | 0.6165541 | -0.4938276 | 0.9639937 | 0.6432832 | 1.383552 | 0.5719508 | 1.021573 | 0.9246514 | 1.091089 |
| max | 2.43563 | 0.595208 | 2.106394 | 3.549502 | 3.67974 | 2.002745 | 2.871025 | 1.86418 | 1.383552 | 3.884412 | 1.021573 | 2.659564 | 1.091089 |
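
In the cells above, the test set is standardised with its own mean and standard deviation. A commonly used alternative is to learn the statistics from the training set only and reuse them on the test set, so that both sets pass through the same transformation. The sketch below illustrates that alternative on two small hypothetical frames (`train` and `test` are made-up stand-ins for `X_train` and `X_test`); it is not what the notebook does.

```python
import pandas as pd

# Hypothetical miniature train/test frames; the notebook's X_train and X_test would take their place.
train = pd.DataFrame({"age": [40, 50, 60, 70], "chol": [200, 220, 240, 260]})
test = pd.DataFrame({"age": [45, 75], "chol": [210, 270]})

# Learn the statistics from the training data only.
train_mean = train.mean()
train_std = train.std()  # pandas defaults to the sample standard deviation (ddof=1)

# Apply the same transformation to both sets.
norm_train = (train - train_mean) / train_std
norm_test = (test - train_mean) / train_std

print(norm_train.describe().loc[["mean", "std"]])
print(norm_test.describe().loc[["mean", "std"]])
```

With this approach the scaled test columns generally do not have a mean of exactly 0 or a standard deviation of exactly 1, because they are measured against the training distribution.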


---

#### Feature Selection Using RFE



In [14]:
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

dict_rfe = {}

# Try every possible number of selected features, from 1 up to all of them.
for i in range(1, len(X_train.columns) + 1):
  # RFE fits the estimator on norm_X_train and y_train and repeatedly drops the weakest feature until only 'i' features remain.
  lg_clf_2 = LogisticRegression()
  rfe = RFE(lg_clf_2, n_features_to_select = i)
  rfe.fit(norm_X_train, y_train)

  rfe_features = list(norm_X_train.columns[rfe.support_]) # The 'i' features retained by RFE.
  rfe_X_train = norm_X_train[rfe_features]

  # Refit a fresh model on the retained features only and evaluate it on the test set.
  lg_clf_3 = LogisticRegression()
  lg_clf_3.fit(rfe_X_train, y_train)

  y_test_pred = lg_clf_3.predict(norm_X_test[rfe_features])

  f1_scores_array = f1_score(y_test, y_test_pred, average = None) # One f1-score per label (0 and 1).
  dict_rfe[i] = {"features": rfe_features, "f1_score": f1_scores_array} # Keyed by the number of selected features.

In [10]:
dict_rfe

{1: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f1_score': array([0.8       , 0.84313725])},
 2: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f1_score': array([0.8       , 0.84313725])},
 3: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f1_score': array([0.8       , 0.84313725])},
 4: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f1_score': array([0.8       , 0.84313725])},
 5: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f1_score': array([0.8       , 0.84313725])},
 6: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f1_score': array([0.8       , 0.84313725])},
 7: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f1_score': array([0.8       , 0.84313725])},
 8: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f1_score': array([0.8       , 0.84313725])},
 9: {'features': ['cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'],
  'f

In [15]:
pd.options.display.max_colwidth = 100
f1_df = pd.DataFrame.from_dict(dict_rfe, orient = 'index')
f1_df


| | features | f1_score |
|---|---|---|
| 1 | [oldpeak] | [0.6578947368421052, 0.7547169811320756] |
| 2 | [cp, oldpeak] | [0.7837837837837839, 0.851851851851852] |
| 3 | [cp, oldpeak, ca] | [0.8, 0.8431372549019608] |
| 4 | [cp, oldpeak, ca, thal] | [0.7692307692307694, 0.826923076923077] |
| 5 | [cp, exang, oldpeak, ca, thal] | [0.7948717948717948, 0.8461538461538461] |
| 6 | [cp, exang, oldpeak, slope, ca, thal] | [0.8, 0.8431372549019608] |
| 7 | [sex, cp, exang, oldpeak, slope, ca, thal] | [0.7848101265822786, 0.8349514563106797] |
| 8 | [sex, cp, restecg, exang, oldpeak, slope, ca, thal] | [0.7948717948717948, 0.8461538461538461] |
| 9 | [sex, cp, restecg, thalach, exang, oldpeak, slope, ca, thal] | [0.8205128205128206, 0.8653846153846153] |
| 10 | [sex, cp, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal] | [0.8205128205128206, 0.8653846153846153] |
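
One way to compare the rows of `f1_df` is to average the two per-label f1-scores into a single macro f1 and then look for the smallest feature count that stays close to the best value. The sketch below does this with the scores copied from the table. The averaging rule and the `tolerance` value are assumptions made for illustration; the notebook does not state how its feature count was chosen.

```python
# Per-label f1-scores (label 0, label 1) copied from the f1_df output above,
# keyed by the number of features selected by RFE.
f1_by_k = {
    1: (0.6578947368421052, 0.7547169811320756),
    2: (0.7837837837837839, 0.851851851851852),
    3: (0.8, 0.8431372549019608),
    4: (0.7692307692307694, 0.826923076923077),
    5: (0.7948717948717948, 0.8461538461538461),
    6: (0.8, 0.8431372549019608),
    7: (0.7848101265822786, 0.8349514563106797),
    8: (0.7948717948717948, 0.8461538461538461),
    9: (0.8205128205128206, 0.8653846153846153),
    10: (0.8205128205128206, 0.8653846153846153),
}

# Macro f1: unweighted mean of the per-label scores.
macro_f1 = {k: sum(scores) / len(scores) for k, scores in f1_by_k.items()}

# Feature count with the highest macro f1 (the first one wins on ties).
best_k = max(macro_f1, key=macro_f1.get)

# Hypothetical tolerance: accept any feature count within 0.025 of the best macro f1.
tolerance = 0.025
smallest_k = min(k for k, score in macro_f1.items() if score >= macro_f1[best_k] - tolerance)

print(f"best macro f1 at {best_k} features: {macro_f1[best_k]:.4f}")
print(f"smallest feature count within tolerance: {smallest_k}")
```

With these scores, the highest macro f1 occurs at 9 features (tied with 10), while a tolerance of 0.025 admits a subset as small as 3 features. A different tolerance would give a different answer, so the threshold should be treated as a modelling choice.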


In [16]:
# Use RFE with a logistic regression estimator to keep the 3 strongest features.
rfe = RFE(LogisticRegression(), n_features_to_select = 3)
rfe.fit(norm_X_train, y_train)

rfe_features = norm_X_train.columns[rfe.support_]
print(rfe_features)
final_X_train = norm_X_train[rfe_features]

# Fit the final model on the selected features only and evaluate it on the test set.
lg_clf_4 = LogisticRegression()
lg_clf_4.fit(final_X_train, y_train)

y_test_predict = lg_clf_4.predict(norm_X_test[rfe_features])
final_f1_scores_array = f1_score(y_test, y_test_predict, average = None) # One f1-score per label (0 and 1).
print(final_f1_scores_array)

Index(['cp', 'oldpeak', 'ca'], dtype='object')
[0.8        0.84313725]
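
Besides `support_`, a fitted `RFE` object also exposes `ranking_`, which records the order in which features were eliminated (rank 1 marks the retained features). The sketch below shows both attributes on synthetic data from `make_classification`; the dataset shape and `max_iter=1000` are illustrative assumptions, and the feature indices are not related to the heart disease columns.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: 200 rows, 8 features, 3 of them informative).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Keep 3 features, removing one feature per elimination round (the default step).
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

selected = [idx for idx, keep in enumerate(rfe_demo.support_) if keep]
ranking = rfe_demo.ranking_.tolist()

print("selected feature indices:", selected)
print("elimination ranking:", ranking)
```

Features with a higher rank were dropped earlier, which gives a rough ordering of how much each feature contributed to the estimator.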


---