In [1]:
import pandas as pd
import numpy as np

# Data Understanding

## Description
The dataset consists of 1763 observations, each representing a unique patient, and 12 attributes associated with heart disease: 11 clinical features and one target variable. It is intended as a resource for research on predictive analytics in cardiovascular disease.

## Variables Overview:
1. **Age**: A continuous variable indicating the age of the patient in years.
2. **Sex**: A categorical variable with two levels ('Male', 'Female'), indicating the gender of the patient. It is stored as the integers 0 and 1.
3. **CP (Chest Pain type)**: A categorical variable describing the type of chest pain experienced by the patient, with categories such as 'Asymptomatic', 'Atypical Angina', 'Typical Angina', and 'Non-Angina'.
4. **TRTBPS (Resting Blood Pressure)**: A continuous variable indicating the resting blood pressure (in mm Hg) on admission to the hospital.
5. **Chol (Serum Cholesterol)**: A continuous variable measuring the serum cholesterol in mg/dl.
6. **FBS (Fasting Blood Sugar)**: A binary variable where 1 represents fasting blood sugar > 120 mg/dl, and 0 otherwise.
7. **Rest ECG (Resting Electrocardiographic Results)**: Categorizes the resting electrocardiographic results of the patient into 'Normal', 'ST Elevation', and other categories.
8. **Thalachh (Maximum Heart Rate Achieved)**: A continuous variable indicating the maximum heart rate achieved by the patient.
9. **Exng (Exercise Induced Angina)**: A binary variable where 1 indicates the presence of exercise-induced angina, and 0 otherwise.
10. **Oldpeak (ST Depression Induced by Exercise Relative to Rest)**: A continuous variable indicating the ST depression induced by exercise relative to rest.
11. **Slope (Slope of the Peak Exercise ST Segment)**: A categorical variable with levels such as 'Flat', 'Up Sloping', representing the slope of the peak exercise ST segment.
12. **Target**: A target variable indicating the presence (1) or absence (0) of heart disease. The loaded data also contains a third value, 2 (120 rows), which this encoding does not explain.

## Descriptive Statistics:
The patients' ages range from 28 to 77 years, with a mean of approximately 54 years. Resting blood pressure reaches a maximum of 200 mm Hg and has a recorded minimum of 0, and the average cholesterol level is about 223 mg/dl. The maximum heart rate achieved varies widely among patients, from 60 to 202 beats per minute. These figures are taken from the `df.describe()` output.
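
The ranges quoted above can be read directly off the DataFrame. The following is a minimal sketch, assuming the column names reported by `df.info()`; the small `sample` frame is a hypothetical stand-in built from the first five rows of `df.head()` so that the example runs on its own.

```python
import pandas as pd

# First five rows shown by df.head(); stands in for the full DataFrame.
sample = pd.DataFrame({
    "age": [70, 67, 57, 64, 74],
    "trestbps": [130, 115, 124, 128, 120],
    "cholesterol": [322, 564, 261, 263, 269],
    "max heart rate": [109, 160, 141, 105, 121],
})

# One row per statistic, one column per variable.
summary = sample.agg(["min", "max", "mean"])
print(summary.round(1))
```

On the real data, replacing `sample` with `df[[...]]` for the same columns should give the minimum, maximum, and mean used in the text.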

## Importance for Research:
This dataset covers a range of factors that could be linked to heart disease, which makes it useful for developing predictive models. By analyzing relationships and patterns among these variables, researchers can identify key predictors of heart disease and improve the accuracy of diagnostic tools. This could lead to better preventive measures and treatment strategies, ultimately improving patient outcomes in cardiovascular health.

In [2]:
# Load CSV from Google Drive using a direct-download URL (preferred).
file_id = '1Ih588PiAEbWzxJx_jDVUTATwFOnWhI6n'
url = f'https://drive.google.com/uc?id={file_id}&export=download'
df = pd.read_csv(url)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1763 entries, 0 to 1762
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  1763 non-null   int64  
 1   sex                  1763 non-null   int64  
 2   Chest pain type      1763 non-null   int64  
 3   trestbps             1763 non-null   int64  
 4   cholesterol          1763 non-null   int64  
 5   fasting blood sugar  1763 non-null   int64  
 6   resting ecg          1763 non-null   int64  
 7   max heart rate       1763 non-null   int64  
 8   exercise angina      1763 non-null   int64  
 9   oldpeak              1763 non-null   float64
 10  ST slope             1763 non-null   int64  
 11  target               1763 non-null   int64  
dtypes: float64(1), int64(11)
memory usage: 165.4 KB


In [None]:
df.head()

,age,sex,Chest pain type,trestbps,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
0,70,1,4,130,322,0,2,109,0,2.4,2,2
1,67,0,3,115,564,0,2,160,0,1.6,2,1
2,57,1,2,124,261,0,0,141,0,0.3,1,2
3,64,1,4,128,263,0,0,105,1,0.2,2,1
4,74,0,2,120,269,0,2,121,1,0.2,1,1


In [None]:
df['target'].unique()

array([2, 1, 0])

In [None]:
df['target'].value_counts()

target,count
1,918
0,725
2,120
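
Because `target` takes three values here rather than two, it helps to look at class shares as well as raw counts before modelling. The following is a minimal sketch that rebuilds the counts shown above into a hypothetical `counts` Series so the example is self-contained; on the real data, `df['target'].value_counts(normalize=True)` gives the proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({1: 918, 0: 725, 2: 120}, name="count")
counts.index.name = "target"

# Share of each class among all observations.
shares = (counts / counts.sum()).round(3)
print(shares)
```

The value 2 accounts for a small minority of rows, so how it is handled (dropped, merged, or kept as a separate class) is a modelling decision that should be stated explicitly.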


In [None]:
df.describe()

,age,sex,Chest pain type,trestbps,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
count,1763.0,1763.0,1763.0,1763.0,1763.0,1763.0,1763.0,1763.0,1763.0,1763.0,1763.0,1763.0
mean,53.952921,0.736245,3.039138,131.950085,222.625638,0.192286,0.798071,142.952921,0.368123,0.962337,1.442428,0.656835
std,9.267101,0.440793,1.023642,18.154333,90.119674,0.394208,0.923926,25.150727,0.482432,1.109458,0.722159,0.601448
min,28.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0,0.0,-2.6,0.0,0.0
25%,47.0,0.0,2.0,120.0,199.0,0.0,0.0,125.0,0.0,0.0,1.0,0.0
50%,55.0,1.0,3.0,130.0,234.0,0.0,0.0,145.0,0.0,0.6,1.0,1.0
75%,61.0,1.0,4.0,140.0,272.5,0.0,2.0,162.0,1.0,1.6,2.0,1.0
max,77.0,1.0,4.0,200.0,603.0,1.0,2.0,202.0,1.0,6.2,3.0,2.0
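
The minimum of 0 for `trestbps` and `cholesterol` in the summary above is not a plausible clinical reading, so those zeros may be placeholders for missing values. That is an assumption, not something the dataset documents. The following is a minimal sketch of counting such zeros and optionally treating them as missing, using a small made-up frame (`toy`) in place of `df`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "trestbps": [130, 0, 124, 128],
    "cholesterol": [322, 0, 0, 263],
})

cols = ["trestbps", "cholesterol"]

# Count exact zeros per column.
zero_counts = (toy[cols] == 0).sum()
print(zero_counts)

# Assumption: zeros stand for missing measurements, so mark them as NaN.
cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)
print(cleaned.isnull().sum())
```

After such a replacement, `isnull().sum()` reports the affected rows, whereas on the raw data it shows no missing values.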


In [None]:
df.isnull().sum()

,0
age,0
sex,0
Chest pain type,0
trestbps,0
cholesterol,0
fasting blood sugar,0
resting ecg,0
max heart rate,0
exercise angina,0
oldpeak,0
