# Heart Disease - ML Prediction

This project builds a heart disease prediction system using logistic regression. It demonstrates the workflow end to end: training and evaluating the model, making predictions, and saving the trained model.


## Project Overview
**Objective:** Develop a heart disease prediction system that uses logistic regression to classify whether a person has heart disease based on health metrics such as age, blood pressure, and cholesterol.

**Outcome:** A complete workflow covering data preparation, model training, evaluation, prediction, and model persistence.

## Dataset
This [dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The "target" field indicates the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.

**Content**

Attribute Information:
- age
- sex
- chest pain type (4 values)
- resting blood pressure
- serum cholesterol in mg/dl
- fasting blood sugar > 120 mg/dl
- resting electrocardiographic results (values 0,1,2)
- maximum heart rate achieved
- exercise induced angina
- oldpeak = ST depression induced by exercise relative to rest
- the slope of the peak exercise ST segment
- number of major vessels (0-3) colored by fluoroscopy
- thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

The names and social security numbers of the patients were recently removed from the database and replaced with dummy values.
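
The CSV loaded later in this notebook uses abbreviated column names for these attributes. A small lookup table, sketched below, can make downstream code and any form labels easier to read. The name `FEATURE_DESCRIPTIONS` is hypothetical; the keys are the column names reported by `df.columns`, and the descriptions paraphrase the list above.

```python
# Hypothetical lookup table: dataset column name -> human-readable description.
# Keys follow the order of df.columns; "target" is excluded because it is the label.
FEATURE_DESCRIPTIONS = {
    "age": "age",
    "sex": "sex",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results (values 0, 1, 2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels colored by fluoroscopy",
    "thal": "thal (normal, fixed defect, reversible defect)",
}

for column, description in FEATURE_DESCRIPTIONS.items():
    print(f"{column:>8}: {description}")
```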

## Machine Learning Predictions

**Import dependencies**

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

**Data collection and processing**

In [2]:
df = pd.read_csv('/kaggle/input/heart-disease/heart_disease_data.csv')

df.head()

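| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |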


In [3]:
df.shape

(303, 14)

In [4]:
df.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [6]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [7]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [8]:
# Check the distribution of the target variable
# 1 = heart disease present
# 0 = no heart disease

df['target'].value_counts()

target
1    165
0    138
Name: count, dtype: int64
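
Raw counts are easier to compare as proportions. The sketch below rebuilds a stand-in target column from the counts reported above (165 ones, 138 zeros) so it runs without the CSV; in the notebook itself the same call would be `df['target'].value_counts(normalize=True)`.

```python
import pandas as pd

# Stand-in for df['target'], rebuilt from the counts reported above.
target = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns class proportions instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions.round(3))
```

The classes are split roughly 54% to 46%, which is why plain accuracy is a reasonable first metric here.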

**Splitting features and target**

In [9]:
X = df.drop(columns='target')
Y = df['target']

In [10]:
print(X.head())

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  
0   0     1  
1   0     2  
2   0     2  
3   0     2  
4   0     2  


In [11]:
print(Y.head())

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64


In [12]:
# List the feature column names for later use when building a web app interface
for column in X.columns:
    print(column)

age
sex
cp
trestbps
chol
fbs
restecg
thalach
exang
oldpeak
slope
ca
thal


**Splitting training data and test data**

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=3, stratify=Y)

In [14]:
print(X.shape, X_train.shape, X_test.shape)

(303, 13) (242, 13) (61, 13)
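
Passing `stratify=Y` asks `train_test_split` to keep the class ratio roughly equal in both splits. A minimal sketch of that effect, using synthetic labels with the same size and class counts as this dataset (the arrays `X_demo` and `y_demo` are placeholders, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 165 positives and 138 negatives, as in the target column.
y_demo = np.array([1] * 165 + [0] * 138)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3, stratify=y_demo
)

# Split sizes, then the share of positives overall, in train, and in test.
print(len(y_tr), len(y_te))
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```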


In [15]:
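# Fit the scaler on the training features only, so no test-set statistics leak in.
# fit() only learns the per-column mean and std; X_train and X_test stay unscaled here.
scaler = StandardScaler()
scaler.fit(X_train)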
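
Fitting a `StandardScaler` only learns the per-column mean and standard deviation; the features change only when `transform` is called. A minimal sketch on a toy array of fitting on the training split and applying the same statistics to both splits (the names `X_train_demo` and `X_test_demo` are made up for illustration; the values are the `age` and `chol` columns of the first five rows shown earlier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the training and test features (two columns: age, chol).
X_train_demo = np.array([[63.0, 233.0], [37.0, 250.0], [41.0, 204.0], [56.0, 236.0]])
X_test_demo = np.array([[57.0, 354.0]])

scaler_demo = StandardScaler()
# Learn mean/std from the training rows and scale them in one step.
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
# Reuse the training statistics for the test rows; never refit on test data.
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.mean(axis=0))  # approximately 0 per column
print(X_train_scaled.std(axis=0))   # approximately 1 per column
print(X_test_scaled)
```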

**Model training: Logistic Regression**

In [16]:
# Instantiate LogisticRegression

lr = LogisticRegression()

In [17]:
# Train the LogisticRegression model on the (unscaled) training data

lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
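
The warning above names two remedies: raise `max_iter` or scale the data. One way to apply both is to chain the scaler and the classifier with `make_pipeline`, sketched here on synthetic data from `make_classification` (an assumption made so the example runs without the CSV; it is not the notebook's actual training step, and scores on the real data would differ).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the same shape as the feature matrix (303 x 13).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=3)
# Inflate a few columns to mimic large-valued features such as chol or trestbps.
X_demo[:, :3] *= 100.0

# The pipeline scales the features, then fits the classifier with a higher
# iteration cap than the default of 100.
model_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_demo.fit(X_demo, y_demo)

print(model_demo.score(X_demo, y_demo))
```

Because the scaler lives inside the pipeline, calling `predict` on raw inputs applies the same scaling automatically.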


**Model evaluation: Accuracy score**

In [18]:
# Accuracy on train data

X_train_pred = lr.predict(X_train)

train_accuracy = accuracy_score(y_train, X_train_pred)

print('Accuracy on train data: ', train_accuracy)

Accuracy on train data:  0.8636363636363636


In [19]:
# Accuracy on test data

X_test_pred = lr.predict(X_test)

test_accuracy = accuracy_score(y_test, X_test_pred)

print('Accuracy on test data: ', test_accuracy)

Accuracy on test data:  0.8032786885245902
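
Accuracy alone does not show which kind of error the model makes. A confusion matrix separates false positives from false negatives. The sketch below uses small hand-written label lists (`y_true_demo` and `y_pred_demo` are illustrative, not notebook results); in the notebook the call would take `y_test` and `X_test_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels only; argument order is (true labels, predicted labels).
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)   # [[3 1]
            #  [1 3]]
print(acc)  # 0.75
```

With 61 test rows, the reported test accuracy of 0.8033 corresponds to 49 correct predictions, so a confusion matrix would show how the remaining 12 errors divide between missed cases and false alarms.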


**Making a predictive system**

In [20]:
input_data = (58,1,0,150,270,0,0,111,1,0.8,2,0,3)

# change the input data to a numpy array
input_data_numpy_array = np.asarray(input_data)

# reshape the nparray as we are predicting only 1 instance
input_data_reshaped = input_data_numpy_array.reshape(1,-1)

prediction = lr.predict(input_data_reshaped)
print(prediction)

if prediction[0] == 0:
    print('The person does not have heart disease')
else:
    print('The person has heart disease')

[0]
The person does not have heart disease
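
When a model is fitted on a DataFrame, passing a bare NumPy array to `predict` makes scikit-learn warn about missing feature names. One option, sketched below, is to wrap the input in a one-row DataFrame and also report the predicted probability. The helper `predict_patient` and the toy model are hypothetical and exist only to make the example runnable; in the notebook the arguments would be `lr`, `X.columns`, and `input_data`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def predict_patient(model, feature_names, values):
    """Hypothetical helper: return (label, probability of class 1) for one patient."""
    # A one-row DataFrame keeps the column names the model saw during fit.
    row = pd.DataFrame([values], columns=feature_names)
    label = int(model.predict(row)[0])
    prob = float(model.predict_proba(row)[0, 1])
    return label, prob


# Toy model on two made-up features, only to exercise the helper.
toy_X = pd.DataFrame({"f1": [0.0, 1.0, 2.0, 3.0], "f2": [1.0, 0.0, 1.0, 0.0]})
toy_y = [0, 0, 1, 1]
toy_model = LogisticRegression().fit(toy_X, toy_y)

label, prob = predict_patient(toy_model, list(toy_X.columns), (2.5, 1.0))
print(label, round(prob, 3))
```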




**Saving the trained model**

In [21]:
import pickle

In [22]:
# Save the model to a file
filename = 'heart_disease_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(lr, f)

In [23]:
# Save the scaler to a file
scaler_filename = 'heart_disease_scaler.sav'
with open(scaler_filename, 'wb') as f:
    pickle.dump(scaler, f)

In [24]:
# Loading the saved model

with open('heart_disease_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [25]:
# Loading the saved scaler
with open('heart_disease_scaler.sav', 'rb') as f:
    loaded_scaler = pickle.load(f)

In [26]:
input_data = (66,1,0,160,228,0,0,138,0,2.3,2,0,1)

# change the input data to a numpy array
input_data_numpy_array = np.asarray(input_data)

# reshape the nparray as we are predicting only 1 instance
input_data_reshaped = input_data_numpy_array.reshape(1,-1)

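# Standardize the input data using the loaded scaler.
# Caution: the saved model was fitted on unscaled features (cell 17), so this step is
# only consistent with a model trained on scaler.transform(X_train).
input_data_standardized = loaded_scaler.transform(input_data_reshaped)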

prediction = loaded_model.predict(input_data_standardized)
print(prediction)

if prediction[0] == 0:
    print('The person does not have heart disease')
else:
    print('The person has heart disease')

[1]
The person has heart disease
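
Saving the scaler and the model as two separate files leaves room for them to drift apart, for example when a scaler is applied to a model that was trained on unscaled features. A possible alternative, sketched here on toy data, is to pickle a single pipeline that contains both steps. The file name `heart_disease_pipeline.sav` and the toy arrays are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, only to make the sketch runnable without the CSV.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Scaler and classifier are fitted, saved, and loaded as one object.
pipeline_demo = make_pipeline(StandardScaler(), LogisticRegression())
pipeline_demo.fit(X_demo, y_demo)

# A temporary directory keeps the example free of leftover files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "heart_disease_pipeline.sav")
    with open(path, "wb") as f:
        pickle.dump(pipeline_demo, f)
    with open(path, "rb") as f:
        loaded_pipeline = pickle.load(f)

# The reloaded pipeline should reproduce the original predictions exactly.
same = np.array_equal(pipeline_demo.predict(X_demo), loaded_pipeline.predict(X_demo))
print(same)
```

As with any pickle file, a saved model should only be loaded from a trusted source and with a compatible scikit-learn version.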


