
🔬💓 **Heart Disease Prediction using Machine Learning** 💓🔬

Welcome to a project where **data meets health**! 🚀 Using **Logistic Regression** and Python libraries (**NumPy**, **Pandas**, and **Scikit-learn**), we build a classifier that predicts whether a person has heart disease from clinical features such as age, resting blood pressure, cholesterol, and chest pain type. 🩺💡

The notebook works on a dataset of 1,025 patient records and walks through **loading and inspecting the data**, **encoding categorical features**, a **stratified train/test split**, and **model evaluation** by accuracy on both splits. 📊🤖

Let's dive into the world of **predictive analytics** and **machine learning** for a healthier future! 🌱👩‍💻

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Import Dependencies

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Data collection and Processing


In [None]:
# loading the csv data to a Pandas DataFrame
heart_data = pd.read_csv('/content/heart_data.csv')

Print first 5 rows


In [None]:
# print first 5 rows of the dataset
heart_data.head()

Unnamed: 0,age,sex,chest_pain_type,resting_blood_pressure,cholestoral,fasting_blood_sugar,rest_ecg,Max_heart_rate,exercise_induced_angina,oldpeak,slope,vessels_colored_by_flourosopy,thalassemia,target
0,52,Male,Typical angina,125,212,Lower than 120 mg/ml,ST-T wave abnormality,168,No,1.0,Downsloping,Two,Reversable Defect,0
1,53,Male,Typical angina,140,203,Greater than 120 mg/ml,Normal,155,Yes,3.1,Upsloping,Zero,Reversable Defect,0
2,70,Male,Typical angina,145,174,Lower than 120 mg/ml,ST-T wave abnormality,125,Yes,2.6,Upsloping,Zero,Reversable Defect,0
3,61,Male,Typical angina,148,203,Lower than 120 mg/ml,ST-T wave abnormality,161,No,0.0,Downsloping,One,Reversable Defect,0
4,62,Female,Typical angina,138,294,Greater than 120 mg/ml,ST-T wave abnormality,106,No,1.9,Flat,Three,Fixed Defect,0


In [None]:
# print last 5 rows of the dataset
heart_data.tail()

Unnamed: 0,age,sex,chest_pain_type,resting_blood_pressure,cholestoral,fasting_blood_sugar,rest_ecg,Max_heart_rate,exercise_induced_angina,oldpeak,slope,vessels_colored_by_flourosopy,thalassemia,target
1020,59,Male,Atypical angina,140,221,Lower than 120 mg/ml,ST-T wave abnormality,164,Yes,0.0,Downsloping,Zero,Fixed Defect,1
1021,60,Male,Typical angina,125,258,Lower than 120 mg/ml,Normal,141,Yes,2.8,Flat,One,Reversable Defect,0
1022,47,Male,Typical angina,110,275,Lower than 120 mg/ml,Normal,118,Yes,1.0,Flat,One,Fixed Defect,0
1023,50,Female,Typical angina,110,254,Lower than 120 mg/ml,Normal,159,No,0.0,Downsloping,Zero,Fixed Defect,1
1024,54,Male,Typical angina,120,188,Lower than 120 mg/ml,ST-T wave abnormality,113,No,1.4,Flat,One,Reversable Defect,0


In [None]:
# number of rows and columns in the dataset
heart_data.shape

(1025, 14)

In [None]:
# getting some info about the data
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   age                            1025 non-null   int64  
 1   sex                            1025 non-null   object 
 2   chest_pain_type                1025 non-null   object 
 3   resting_blood_pressure         1025 non-null   int64  
 4   cholestoral                    1025 non-null   int64  
 5   fasting_blood_sugar            1025 non-null   object 
 6   rest_ecg                       1025 non-null   object 
 7   Max_heart_rate                 1025 non-null   int64  
 8   exercise_induced_angina        1025 non-null   object 
 9   oldpeak                        1025 non-null   float64
 10  slope                          1025 non-null   object 
 11  vessels_colored_by_flourosopy  1025 non-null   object 
 12  thalassemia                    1025 non-null   o

In [None]:
# checking for missing values
heart_data.isnull().sum()

Unnamed: 0,0
age,0
sex,0
chest_pain_type,0
resting_blood_pressure,0
cholestoral,0
fasting_blood_sugar,0
rest_ecg,0
Max_heart_rate,0
exercise_induced_angina,0
oldpeak,0


In [None]:
# statistical measures about the data
heart_data.describe()

Unnamed: 0,age,resting_blood_pressure,cholestoral,Max_heart_rate,oldpeak,target
count,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0
mean,54.434146,131.611707,246.0,149.114146,1.071512,0.513171
std,9.07229,17.516718,51.59251,23.005724,1.175053,0.50007
min,29.0,94.0,126.0,71.0,0.0,0.0
25%,48.0,120.0,211.0,132.0,0.0,0.0
50%,56.0,130.0,240.0,152.0,0.8,1.0
75%,61.0,140.0,275.0,166.0,1.8,1.0
max,77.0,200.0,564.0,202.0,6.2,1.0


In [None]:
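from sklearn.preprocessing import LabelEncoder

# Two-valued text columns: map each to 0/1.
binary_cols = ['sex', 'fasting_blood_sugar', 'exercise_induced_angina']
label_encoder = LabelEncoder()

# LabelEncoder assigns codes in sorted label order (e.g. Female -> 0, Male -> 1).
# It is refit for each column, so the fitted mapping is not kept for later reuse.
for col in binary_cols:
    heart_data[col] = label_encoder.fit_transform(heart_data[col])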


In [None]:
heart_data = pd.get_dummies(heart_data, columns=['chest_pain_type', 'rest_ecg', 'slope',
                                                 'vessels_colored_by_flourosopy', 'thalassemia'],
                            drop_first=True)  # drop_first=True avoids dummy variable trap
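
To make the two encoding steps above concrete, here is a minimal sketch on a toy frame. The three rows are hypothetical and only reuse category names that appear in the dataset; `toy`, `sex_mapping`, and `encoded_columns` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows; the category names mirror those seen in the dataset.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "slope": ["Downsloping", "Upsloping", "Flat"],
})

# LabelEncoder assigns integer codes in sorted order of the labels.
encoder = LabelEncoder()
toy["sex"] = encoder.fit_transform(toy["sex"])
sex_mapping = {label: code for code, label in enumerate(encoder.classes_)}

# drop_first=True drops the first category in sorted order ("Downsloping"),
# which then acts as the all-zeros baseline for the remaining dummy columns.
toy = pd.get_dummies(toy, columns=["slope"], drop_first=True)
encoded_columns = list(toy.columns)

print(sex_mapping)
print(encoded_columns)
```

This is why the feature list later in the notebook contains `slope_Flat` and `slope_Upsloping` but no `slope_Downsloping` column.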


In [None]:
# checking the distribution of Target Variable
heart_data['target'].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
1,526
0,499


1 --> Defective Heart

0 --> Healthy Heart
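
The class counts above also give a useful reference point for the accuracy figures reported later. A short sketch of the arithmetic, using only the counts printed by `value_counts()` (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {1: 526, 0: 499}

total = sum(counts.values())
positive_share = counts[1] / total                 # share of rows with target == 1
majority_baseline = max(counts.values()) / total   # accuracy of always predicting the larger class

print(total, round(positive_share, 4), round(majority_baseline, 4))
```

The classes are close to balanced, so a model that always predicts the majority class would score only about 51% accuracy.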

Splitting the Features and Target


In [None]:
X = heart_data.drop(columns='target')  # columns= already selects the axis, so axis=1 is redundant
Y = heart_data['target']

In [None]:
print(X)

      age  sex  resting_blood_pressure  cholestoral  fasting_blood_sugar  \
0      52    1                     125          212                    1   
1      53    1                     140          203                    0   
2      70    1                     145          174                    1   
3      61    1                     148          203                    1   
4      62    0                     138          294                    0   
...   ...  ...                     ...          ...                  ...   
1020   59    1                     140          221                    1   
1021   60    1                     125          258                    1   
1022   47    1                     110          275                    1   
1023   50    0                     110          254                    1   
1024   54    1                     120          188                    1   

      Max_heart_rate  exercise_induced_angina  oldpeak  \
0                168         

In [None]:
print(Y)

0       0
1       0
2       0
3       0
4       0
       ..
1020    1
1021    0
1022    0
1023    1
1024    0
Name: target, Length: 1025, dtype: int64


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

(1025, 22) (820, 22) (205, 22)
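
The effect of `stratify=Y` can be checked on synthetic labels. The sketch below is an assumption-labeled stand-in, not the notebook's data: it builds a label vector with the same class counts (526 ones, 499 zeros) and a single placeholder feature, then applies the same split settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the dataset, placeholder feature values.
y = np.array([1] * 526 + [0] * 499)
X_placeholder = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_placeholder, y, test_size=0.2, stratify=y, random_state=2
)

split_sizes = (len(y_tr), len(y_te))
train_share = y_tr.mean()  # share of class 1 in the training split
test_share = y_te.mean()   # share of class 1 in the test split

print(split_sizes, round(train_share, 3), round(test_share, 3))
```

Both splits keep roughly the same share of positive cases as the full dataset, which is the point of stratifying.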


Model Training

Logistic Regression


In [None]:
model = LogisticRegression()

In [None]:
# training the LogisticRegression model with Training data
model.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
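
The warning above suggests two remedies: scaling the data or raising `max_iter`. One possible way to apply both is a scikit-learn pipeline, sketched below on synthetic data from `make_classification`. The data, the exaggerated feature scale, and the names `scaled_model`, `n_iterations`, and `demo_accuracy` are assumptions for illustration; the notebook's reported accuracies come from the unscaled model, not from this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the notebook's shape (1025 rows, 22 features).
X_demo, y_demo = make_classification(n_samples=1025, n_features=22, random_state=2)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale, similar to unscaled clinical values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The scaler is fit on the training split only and reused at prediction time.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

n_iterations = int(scaled_model.named_steps["logisticregression"].n_iter_[0])
demo_accuracy = scaled_model.score(X_te, y_te)

print(n_iterations, demo_accuracy)
```

Keeping the scaler inside the pipeline means `scaled_model.predict(...)` applies the training-time scaling automatically, so no separate scaler has to be refit on new input.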


Model evaluation

In [None]:
# accuracy on training data
X_train_prediction = model.predict(X_train)
# accuracy_score expects (y_true, y_pred)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy on Training data : ', training_data_accuracy)

Accuracy on Training data :  0.8658536585365854


In [None]:
# accuracy on test data
X_test_prediction = model.predict(X_test)
# accuracy_score expects (y_true, y_pred)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy on Test data : ', test_data_accuracy)

Accuracy on Test data :  0.8341463414634146
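
Accuracy is simply the fraction of rows where the prediction equals the true label. The sketch below checks that on a tiny made-up label vector and then converts the two accuracies printed above back into counts of correct predictions; the five-element arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Tiny illustrative example: 4 of 5 predictions match.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

manual_accuracy = (y_true == y_pred).mean()
library_accuracy = accuracy_score(y_true, y_pred)

# Accuracies printed above, multiplied by the split sizes (820 train, 205 test).
train_correct = round(0.8658536585365854 * 820)
test_correct = round(0.8341463414634146 * 205)

print(manual_accuracy, library_accuracy, train_correct, test_correct)
```

So the model classified 710 of 820 training rows and 171 of 205 test rows correctly.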


Build Predictive System

In [None]:
print(X_train.shape[1])  # This should show 22, matching the model
print(X_train.columns)    # Get feature names


22
Index(['age', 'sex', 'resting_blood_pressure', 'cholestoral',
       'fasting_blood_sugar', 'Max_heart_rate', 'exercise_induced_angina',
       'oldpeak', 'chest_pain_type_Atypical angina',
       'chest_pain_type_Non-anginal pain', 'chest_pain_type_Typical angina',
       'rest_ecg_Normal', 'rest_ecg_ST-T wave abnormality', 'slope_Flat',
       'slope_Upsloping', 'vessels_colored_by_flourosopy_One',
       'vessels_colored_by_flourosopy_Three',
       'vessels_colored_by_flourosopy_Two',
       'vessels_colored_by_flourosopy_Zero', 'thalassemia_No',
       'thalassemia_Normal', 'thalassemia_Reversable Defect'],
      dtype='object')


In [None]:
import numpy as np
import pandas as pd

# input_data: one patient's 22 encoded feature values, in the order of X_train.columns.
# It must be defined before this cell runs.
input_df = pd.DataFrame(np.asarray(input_data).reshape(1, -1), columns=X_train.columns)

# The model was trained on unscaled features, so no scaler is applied here.
# Fitting a new StandardScaler on a single row would turn every feature into 0.
prediction = model.predict(input_df)

# Output result
if prediction[0] == 0:
    print("The Person does NOT have Heart Disease")
else:
    print("The Person HAS Heart Disease")


The Person HAS Heart Disease
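
A raw patient record still has text categories, so it has to be encoded and aligned to the training columns before it reaches the model. The sketch below shows one way to do that with `get_dummies` followed by `reindex`. The patient values are hypothetical, and `train_columns` is a shortened stand-in for `X_train.columns` that uses a few of the column names listed above.

```python
import pandas as pd

# Shortened stand-in for X_train.columns (assumption: the real list has 22 entries).
train_columns = [
    "age", "sex", "slope_Flat", "slope_Upsloping",
    "thalassemia_Normal", "thalassemia_Reversable Defect",
]

# Hypothetical patient record with illustrative values.
raw_record = pd.DataFrame([
    {"age": 58, "sex": 1, "slope": "Flat", "thalassemia": "Reversable Defect"}
])

# One-hot encode the text columns, then force the training column set and order.
# Dummy columns that this record does not produce are filled with 0.
encoded = pd.get_dummies(raw_record, columns=["slope", "thalassemia"])
aligned = encoded.reindex(columns=train_columns, fill_value=0).astype(float)

row = aligned.iloc[0].to_dict()
print(row)
```

With the full column list, `aligned` could be passed to `model.predict` directly, since its names and order match what the model saw during training.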




### Conclusion:

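The logistic regression model reached **86.6% accuracy on training data** (710 of 820 rows) and **83.4% accuracy on test data** (171 of 205 rows). The small gap between the two suggests limited overfitting. Two issues remain. First, training stopped at the solver's iteration limit, so the fit should be repeated with a larger `max_iter` or with scaled features. Second, a warning during prediction pointed to a mismatch in input format: the input must keep the same feature names and column order as the training data and go through exactly the preprocessing used in training. Because this model was trained on unscaled features, a scaler should only be used at prediction time if it was fit on the training data and the model was trained on the scaled features. With those fixes and further validation, the model is a reasonable baseline for heart disease prediction and can be optimized further before deployment.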