# Heart Disease Prediction with Logistic Regression

This notebook applies Logistic Regression, a widely used classification algorithm, to predict the likelihood of heart disease. The goal is to build a model that classifies individuals as either at risk or not at risk for heart disease based on a set of health-related features.

## Problem Overview:
Heart disease is a leading cause of death globally, and early detection is crucial for effective treatment and prevention. By leveraging machine learning, we can predict an individual’s risk of developing heart disease and enable timely interventions.

The dataset contains important health metrics such as age, cholesterol levels, blood pressure, and other cardiovascular indicators. Our objective is to train a Logistic Regression model to classify individuals into two categories: **at risk (1)** or **not at risk (0)** for heart disease.
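
As a rough illustration of how Logistic Regression arrives at a 0/1 label, the sketch below passes a linear score through the sigmoid function and thresholds the resulting probability at 0.5. The scores are made-up numbers, not values from the heart disease data, and `sigmoid` is a helper defined only for this example.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w . x + b) for three individuals
scores = np.array([-2.0, 0.5, 2.0])

probabilities = sigmoid(scores)
# A probability above 0.5 is classified as at risk (1), otherwise not at risk (0)
predicted_class = (probabilities > 0.5).astype(int)

print(probabilities.round(3))  # [0.119 0.622 0.881]
print(predicted_class)         # [0 1 1]
```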

## Steps Involved:
- **Data Exploration & Preprocessing**: Understanding the dataset, checking for missing values, and standardizing the features.
- **Model Training**: Fitting a Logistic Regression model to the standardized training data.
- **Model Evaluation**: Assessing the model's performance using accuracy on the training and test sets.
- **Prediction**: Using the trained model to classify a single new record.

This analysis will provide insights into how machine learning, particularly Logistic Regression, can be applied to healthcare problems like heart disease prediction.

## Libraries Used:
- **`numpy`** and **`pandas`** are used for data manipulation; **`scikit-learn`** is used for preprocessing, model building, and evaluation.


### Importing Libraries

In [6]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Loading and Exploring the Dataset

In [2]:
data = pd.read_csv("heart_disease_data.csv")

In [3]:
data.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [4]:
data.shape

(303, 14)

In [68]:
data["target"].value_counts()  # Check class balance: number of records per target class

target
1    165
0    138
Name: count, dtype: int64
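
To put the class counts in context, the sketch below rebuilds a target column with the same counts as the output above (165 ones and 138 zeros) and computes the class proportions. The largest proportion is the accuracy a model would reach by always predicting the majority class, which is a useful baseline for any accuracy figure. The `target` series here is reconstructed for illustration rather than loaded from the CSV.

```python
import pandas as pd

# Reconstructed target column with the class counts reported above
target = pd.Series([1] * 165 + [0] * 138, name="target")

# Share of each class in the dataset
proportions = target.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = proportions.max()

print(proportions.round(3))
print(round(majority_baseline, 3))  # 0.545
```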

### Summary Statistics and Missing Values

In [11]:
data.describe()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [12]:
data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
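
This dataset has no missing values, so no imputation is needed here. For a dataset that did have gaps, one common option is median imputation. The sketch below shows it on a tiny made-up frame; the `demo` values and the choice of the median strategy are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame with one missing value per column
demo = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, np.nan, 204, 236],
})
print(demo.isnull().sum())  # one missing value in each column

# Replace each missing entry with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)

print(imputed)
```

In a train/test workflow, the imputer would be fit on the training split only and then applied to the test split, in the same way as the scaler.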

### Splitting Data

In [13]:
# Separate the feature columns (X) from the label column (Y)
X = data.drop(columns="target")
Y = data["target"]

In [14]:
X.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [73]:
# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

In [74]:
print(X.shape, X_train.shape, X_test.shape)
print(Y.shape, Y_train.shape, Y_test.shape)

(303, 13) (242, 13) (61, 13)
(303,) (242,) (61,)
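
The split above is purely random, so the share of positive cases in the two subsets can drift slightly from the full dataset. One option is to pass `stratify` to `train_test_split`, which keeps the class proportions roughly equal in both subsets. The sketch below uses stand-in arrays with the same class counts as this dataset; `X_demo` and `y_demo` are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the dataset (165 ones, 138 zeros)
y_demo = np.array([1] * 165 + [0] * 138)
# Placeholder single-column feature matrix, one row per label
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio similar in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3, stratify=y_demo
)

print(len(y_tr), len(y_te))  # 242 61
# Share of positive cases overall, in the training subset, and in the test subset
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```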


### Standardizing Data

In [20]:
# Fit the scaler on the training set only, so no test-set statistics leak into preprocessing
stdScalar = StandardScaler()
stdScalar.fit(X_train)

In [21]:
X_train_std = stdScalar.transform(X_train)

In [43]:
X_test_std = stdScalar.transform(X_test)

In [82]:
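# Overall std of the scaled test set: near 1, not exactly 1, because the scaler was fit on the training set
print(X_test_std.std())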

1.0225601418328714
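
To make the standardization step concrete, the sketch below reproduces what `StandardScaler` does on synthetic data: each column is shifted by the training mean and divided by the training standard deviation. The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics. The random arrays are placeholders, not the heart disease features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 3 features with mean around 50 and std around 10
rng = np.random.default_rng(0)
X_train_demo = rng.normal(50, 10, size=(200, 3))
X_test_demo = rng.normal(50, 10, size=(50, 3))

# Learn the per-column mean and std from the training data only
scaler = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

# Manual equivalent: z = (x - training mean) / training std
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print(np.allclose(manual, X_test_scaled))  # True
print(X_train_scaled.std(axis=0))          # 1 for every column, up to floating-point error
print(X_test_scaled.std(axis=0))           # near 1, but not exactly
```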


### Building Logistic Regression

In [83]:
logisticReg = LogisticRegression()
logisticReg.fit(X_train_std, Y_train)

In [84]:
# Accuracy on the data the model was trained on
Y_train_predicted = logisticReg.predict(X_train_std)
train_accuracy = accuracy_score(Y_train, Y_train_predicted)

In [85]:
# Accuracy on the held-out test set, which the model has not seen
predictY = logisticReg.predict(X_test_std)
test_accuracy = accuracy_score(Y_test, predictY)
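
With only 61 test rows, a single accuracy figure can shift noticeably depending on the random split. A possible complement is k-fold cross-validation, sketched below on synthetic data generated with `make_classification` (an assumption standing in for the real dataset, sized at 303 rows and 13 features). Wrapping the scaler and the model in a pipeline means the scaler is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease data (303 rows, 13 features)
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=3
)

# The pipeline refits the scaler on each training fold, so validation rows never influence scaling
model = make_pipeline(StandardScaler(), LogisticRegression())

# Five accuracy scores, one per held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.round(3))
print(scores.mean().round(3), scores.std().round(3))
```

The mean and spread of the fold scores give a sense of how much the accuracy varies across different splits.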

### Evaluating the Model

In [86]:
print(test_accuracy)   # accuracy on the held-out test set
print(train_accuracy)  # accuracy on the training set

0.8688524590163934
0.8429752066115702
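
Accuracy alone does not show which kinds of errors the model makes. The sketch below computes a confusion matrix, precision, and recall with scikit-learn on a small hand-written set of labels; `y_true` and `y_pred` are made-up values for illustration, not the notebook's actual predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions for eight individuals
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)

# Precision: of those predicted at risk, the fraction truly at risk
precision = precision_score(y_true, y_pred)
# Recall: of those truly at risk, the fraction the model caught
recall = recall_score(y_true, y_pred)

print(cm)
# [[3 1]
#  [1 3]]
print(precision, recall)  # 0.75 0.75
```

In this notebook the same functions could be called with `Y_test` and `predictY`. For a screening task such as heart disease, recall is often worth particular attention because a missed at-risk patient is a costly error.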


### Building a Predictive System

In [89]:
import warnings
# Suppress scikit-learn's feature-name warning, raised because the scaler was fit on a DataFrame but is given a plain NumPy array here
warnings.filterwarnings("ignore", message="X does not have valid feature names")

In [92]:
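# One patient record, with values in the same column order as X (age, sex, cp, ..., thal)
data_input = [52, 1, 0, 128, 255, 0, 1, 161, 1, 0, 2, 1, 3]
data_array = np.asarray(data_input)
# Reshape to (1, 13): a single sample with 13 features
data_array_rshp = data_array.reshape(1, -1)
# Scale with the statistics learned from the training data
std_data = stdScalar.transform(data_array_rshp)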

In [93]:
prediction = logisticReg.predict(std_data)
print(prediction)  # expected: 0 (not at risk)

[0]
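
A hard 0/1 label hides how confident the model is. `predict_proba` returns the estimated probability of each class, which can be more informative for a risk-screening use case. The sketch below trains a scaler-plus-model pipeline on synthetic data from `make_classification` (an assumption standing in for the real dataset) and prints both the probabilities and the label for one sample.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease data (303 rows, 13 features)
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=3
)

# Scaling and classification in one object, so new records are scaled automatically
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

sample = X_demo[:1]                  # one record, shape (1, 13)
proba = model.predict_proba(sample)  # columns follow model.classes_, here [0, 1]
label = model.predict(sample)        # the class with the higher probability

print(model.classes_)
print(proba.round(3), label)
```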
