#### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

# Data Collection

Dataset Source : https://www.kaggle.com/datasets/mexwell/heart-disease-dataset?select=heart_statlog_cleveland_hungary_final.csv

The dataset consists of 12 columns and 918 rows (after removing duplicates; 1,190 rows before).

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Importing the dataset

In [4]:
data=pd.read_csv("heart.csv")

Showing the top 2 records

In [5]:
data.head(2)

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
0,40,1,2,140,289,0,0,172,0,0.0,1,0
1,49,0,3,160,180,0,0,156,0,1.0,2,1


Shape of Dataset

In [6]:
data.shape

(1190, 12)

Checking null values

In [7]:
data.isnull().sum()

age                    0
sex                    0
chest pain type        0
resting bp s           0
cholesterol            0
fasting blood sugar    0
resting ecg            0
max heart rate         0
exercise angina        0
oldpeak                0
ST slope               0
target                 0
dtype: int64

In [8]:
data.duplicated().sum()

272

In [9]:
data.drop_duplicates(inplace=True)

In [10]:
data.duplicated().sum()

0

After removing duplicated values, the shape is:

In [11]:
data.shape

(918, 12)
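
`drop_duplicates` keeps the original index labels, which is why `data.info()` later reports "918 entries, 0 to 1189". A minimal sketch of the same step on a hypothetical toy frame (the values are made up for illustration), with an optional `reset_index` to make the index contiguous again:

```python
import pandas as pd

# Hypothetical toy frame standing in for the heart dataset
toy = pd.DataFrame({
    "age": [40, 49, 40, 37],
    "sex": [1, 0, 1, 1],
    "target": [0, 1, 0, 0],
})

n_before = len(toy)

# Drop exact duplicate rows, then renumber the index from 0
deduped = toy.drop_duplicates().reset_index(drop=True)

n_removed = n_before - len(deduped)
print(f"Removed {n_removed} duplicate row(s); new shape: {deduped.shape}")
```

Resetting the index is optional here; it only matters if later code relies on positional labels.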

Checking data types

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 918 entries, 0 to 1189
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  918 non-null    int64  
 1   sex                  918 non-null    int64  
 2   chest pain type      918 non-null    int64  
 3   resting bp s         918 non-null    int64  
 4   cholesterol          918 non-null    int64  
 5   fasting blood sugar  918 non-null    int64  
 6   resting ecg          918 non-null    int64  
 7   max heart rate       918 non-null    int64  
 8   exercise angina      918 non-null    int64  
 9   oldpeak              918 non-null    float64
 10  ST slope             918 non-null    int64  
 11  target               918 non-null    int64  
dtypes: float64(1), int64(11)
memory usage: 93.2 KB


Checking the number of unique values in each column

In [13]:
data.nunique()

age                     50
sex                      2
chest pain type          4
resting bp s            67
cholesterol            222
fasting blood sugar      2
resting ecg              3
max heart rate         119
exercise angina          2
oldpeak                 53
ST slope                 4
target                   2
dtype: int64

Checking statistics of dataset

In [14]:
data.describe()

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,0.78976,3.251634,132.396514,198.799564,0.233115,0.603486,136.809368,0.404139,0.887364,1.636166,0.553377
std,9.432617,0.407701,0.931031,18.514154,109.384145,0.423046,0.805968,25.460334,0.490992,1.06657,0.609341,0.497414
min,28.0,0.0,1.0,0.0,0.0,0.0,0.0,60.0,0.0,-2.6,0.0,0.0
25%,47.0,1.0,3.0,120.0,173.25,0.0,0.0,120.0,0.0,0.0,1.0,0.0
50%,54.0,1.0,4.0,130.0,223.0,0.0,0.0,138.0,0.0,0.6,2.0,1.0
75%,60.0,1.0,4.0,140.0,267.0,0.0,1.0,156.0,1.0,1.5,2.0,1.0
max,77.0,1.0,4.0,200.0,603.0,1.0,2.0,202.0,1.0,6.2,3.0,1.0
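
The summary above reports a minimum of 0 for both 'resting bp s' and 'cholesterol'. A reading of 0 is not physiologically plausible for either measurement, so it may mark a missing value (this is an interpretation, not something stated by the dataset source). A minimal sketch, on a hypothetical toy frame, of counting such zeros per column:

```python
import pandas as pd

# Hypothetical toy frame; the real columns come from heart.csv
toy = pd.DataFrame({
    "resting bp s": [140, 0, 130, 120],
    "cholesterol": [289, 0, 0, 214],
})

# Columns where a value of 0 is suspicious
cols = ["resting bp s", "cholesterol"]

# Count how many rows hold 0 in each of those columns
zero_counts = (toy[cols] == 0).sum()
print(zero_counts)
```

How to treat these rows (drop, impute, or keep) is a modelling decision that this notebook section does not make.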


Exploring data

In [15]:
variables = data.columns

for var in variables:
    print("*"*50)
    print(f"Categories in '{var}' variable:", data[var].unique())




**************************************************
Categories in 'age' variable: [40 49 37 48 54 39 45 58 42 38 43 60 36 44 53 52 51 56 41 32 65 35 59 50
 47 31 46 57 55 63 66 34 33 61 29 62 28 30 74 68 72 64 69 67 73 70 77 75
 76 71]
**************************************************
Categories in 'sex' variable: [1 0]
**************************************************
Categories in 'chest pain type' variable: [2 3 4 1]
**************************************************
Categories in 'resting bp s' variable: [140 160 130 138 150 120 110 136 115 100 124 113 125 145 112 132 118 170
 142 190 135 180 108 155 128 106  92 200 122  98 105 133  95  80 137 185
 165 126 152 116   0 144 154 134 104 139 131 141 178 146 158 123 102  96
 143 172 156 114 127 101 174  94 148 117 192 129 164]
**************************************************
Categories in 'cholesterol' variable: [289 180 283 214 195 339 237 208 207 284 211 164 204 234 273 196 201 248
 267 223 184 288 215 209 260 468 188 518 167 224 1

## Information about the dataset

In [17]:
info = [
    "age",
    "1: male, 0: female",
    "chest pain type, 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic",
    "resting blood pressure",
    "serum cholesterol in mg/dl",
    "fasting blood sugar > 120 mg/dl",
    "resting electrocardiographic results (values 0,1,2)",
    "maximum heart rate achieved",
    "exercise induced angina",
    "oldpeak = ST depression induced by exercise relative to rest",
    "the slope of the peak exercise ST segment",
    "target"
]

# Ensure the loop runs only up to the minimum length of data.columns and info
for i in range(min(len(data.columns), len(info))):
    print(f"{data.columns[i]}:\t\t\t{info[i]}")



age:			age
sex:			1: male, 0: female
chest pain type:			chest pain type, 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic
resting bp s:			resting blood pressure
cholesterol:			serum cholesterol in mg/dl
fasting blood sugar:			fasting blood sugar > 120 mg/dl
resting ecg:			resting electrocardiographic results (values 0,1,2)
max heart rate:			maximum heart rate achieved
exercise angina:			exercise induced angina
oldpeak:			oldpeak = ST depression induced by exercise relative to rest
ST slope:			the slope of the peak exercise ST segment
target:			target


Analysing the 'target' variable

In [18]:
data['target'].describe()

count    918.000000
mean       0.553377
std        0.497414
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: target, dtype: float64

In [19]:

data["target"].unique()


array([0, 1], dtype=int64)

This is a binary classification problem: the target variable takes only the values '0' and '1'.


* Checking the correlation of each column with the target

In [20]:
print(data.corr()["target"].abs().sort_values(ascending=False))


target                 1.000000
ST slope               0.553461
exercise angina        0.494282
chest pain type        0.471354
oldpeak                0.403951
max heart rate         0.400421
sex                    0.305445
age                    0.282039
fasting blood sugar    0.267291
cholesterol            0.232741
resting bp s           0.107589
resting ecg            0.061011
Name: target, dtype: float64
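
Taking `.abs()` ranks the features by strength of association but hides its direction. A minimal sketch, on a hypothetical toy frame with made-up columns `feature_up` and `feature_down`, of keeping the sign while still ordering by magnitude:

```python
import pandas as pd

# Hypothetical toy frame: one feature rises with the target, one falls
toy = pd.DataFrame({
    "feature_up": [1, 2, 3, 4, 5, 6],
    "feature_down": [6, 5, 4, 3, 2, 1],
    "target": [0, 0, 0, 1, 1, 1],
})

# Signed Pearson correlation of every feature with the target
signed = toy.corr()["target"].drop("target")

# Order by absolute value, but display the signed coefficients
ranked = signed.reindex(signed.abs().sort_values(ascending=False).index)
print(ranked)
```

On the real data, the same two lines applied to `data` would show whether each feature is positively or negatively associated with `target`.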


EDA


In [23]:
import plotly.express as px
import plotly.graph_objects as go


target_temp = data['target'].value_counts()

# Create a bar graph using Plotly
fig = go.Figure([go.Bar(x=target_temp.index, y=target_temp.values)])
fig.update_layout(
    title='Target Value Distribution',
    xaxis_title='Target Value',
    yaxis_title='Count',
    template='plotly_white'
)

fig.show()

In [24]:
data.sample(2)

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
690,76,0,3,140,197,0,1,116,0,1.1,2,0
446,55,1,3,136,228,0,1,124,1,1.6,2,1


In [25]:
# Count patients in each class of the target
countNoDisease = len(data[data.target == 0])
countHaveDisease = len(data[data.target == 1])
totalPatients = data.shape[0]

# Convert the counts to percentages of all patients
percentageNoDisease = (countNoDisease * 100) / totalPatients
percentageDisease = (countHaveDisease * 100) / totalPatients

print(f"Percentage of Patients without heart problem: {percentageNoDisease:.2f}%")

print(f"Percentage of Patients with heart problem: {percentageDisease:.2f}%")

Percentage of Patients without heart problem: 44.66%
Percentage of Patients with heart problem: 55.34%
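
The class percentages can also be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on a hypothetical toy series (the values are made up, and the `labels` dictionary is introduced only for this illustration):

```python
import pandas as pd

# Hypothetical toy target column: 0 = no heart problem, 1 = heart problem
target = pd.Series([0, 1, 1, 0, 1], name="target")

# Share of each class, expressed as a percentage
share = target.value_counts(normalize=True).mul(100).round(2)

labels = {0: "without heart problem", 1: "with heart problem"}
for value, pct in share.sort_index().items():
    print(f"Percentage of patients {labels[value]}: {pct:.2f}%")
```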


In [26]:
data["sex"].unique()


array([1, 0], dtype=int64)

In [27]:
import plotly.express as px

# Create a count plot with Plotly
fig = px.histogram(data, x='sex', 
                   title='Count of Observations by Sex',
                   labels={'sex': 'Sex'},
                   text_auto=True)  # Automatically displays the text on bars

# Show the plot
fig.show()


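The count plot shows that the sample is predominantly male (the mean of 'sex' is about 0.79, with 1 = male). It shows group sizes only, so it cannot tell us which sex is more likely to have heart problems; that requires comparing the target rate within each sex.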
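
To compare how often heart problems occur within each sex, the target rate can be computed per group rather than read off a count plot. A minimal sketch on a hypothetical toy frame (values made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame: sex (1 = male, 0 = female) and target (1 = heart problem)
toy = pd.DataFrame({
    "sex": [1, 1, 1, 1, 0, 0],
    "target": [1, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 target is the share of positive cases in each group;
# the count shows how many rows that share is based on
rate_by_sex = toy.groupby("sex")["target"].agg(["mean", "count"])
print(rate_by_sex)
```

Reporting the group size next to the rate helps avoid over-reading a rate that rests on few rows.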


In [28]:
data.sample(2)

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
825,64,1,3,125,309,0,0,131,1,1.8,2,1
1166,58,0,4,130,197,0,0,131,0,0.6,2,0


In [29]:
data['chest pain type'].unique()


array([2, 3, 4, 1], dtype=int64)

In [30]:
target_temp = data['chest pain type'].value_counts()

# Create a bar graph using Plotly
fig = go.Figure([go.Bar(x=target_temp.index, y=target_temp.values)])
fig.update_layout(
    title='Chest Pain Type Distribution',
    xaxis_title='Chest Pain Type Value',
    yaxis_title='Count',
    template='plotly_white'
)

fig.show()

In [31]:
chest_pain_mapping = {
    1: 'Typical Angina',
    2: 'Atypical Angina',
    3: 'Non-Anginal Pain',
    4: 'Asymptomatic'

}
data['chestpaintype'] = data['chest pain type'].map(chest_pain_mapping)
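
`Series.map` with a dictionary returns `NaN` for any code that is not a key of the dictionary, so it is worth checking for unmapped values after this step. A minimal sketch on a hypothetical toy series that deliberately includes an unexpected code `0`:

```python
import pandas as pd

# Hypothetical toy codes; 0 is deliberately not part of the mapping
codes = pd.Series([2, 3, 4, 1, 0], name="chest pain type")

chest_pain_mapping = {
    1: "Typical Angina",
    2: "Atypical Angina",
    3: "Non-Anginal Pain",
    4: "Asymptomatic",
}

labels = codes.map(chest_pain_mapping)

# Codes that produced NaN because they are missing from the mapping
unmapped = codes[labels.isna()].unique()
print("Unmapped codes:", unmapped)
```

In this notebook the column holds only the values 1 to 4, so no unmapped codes are expected.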



In [39]:
chest_pain_counts = data['chestpaintype'].value_counts().reset_index()
chest_pain_counts.columns = ['chestpaintype', 'Count']

# Create the bar plot using Plotly
fig = px.bar(chest_pain_counts, x='chestpaintype', y='Count', 
             title='Distribution of Chest Pain Types',
             labels={'chestpaintype': 'Chest Pain Type', 'Count': 'Count'},
             text='Count')

# Customize layout
fig.update_layout(
    xaxis_title='Chest Pain Type',
    yaxis_title='Count',
    xaxis_tickangle=-45,  # Rotate x-axis labels
    height=500,           # Set figure height
    width=1000            # Set figure width
)

# Show the plot
fig.show()



In [40]:
data.sample(2)

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target,chestpaintype
115,33,0,4,100,246,0,0,150,1,1.0,2,1,Asymptomatic
108,50,1,4,140,129,0,0,135,0,0.0,1,0,Asymptomatic


In [41]:
data['chest pain type'].unique()

array([2, 3, 4, 1], dtype=int64)

In [43]:
data['fasting blood sugar'].describe()

count    918.000000
mean       0.233115
std        0.423046
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: fasting blood sugar, dtype: float64

In [44]:
data['fasting blood sugar'].unique()

array([0, 1], dtype=int64)

In [45]:
target_temp = data['fasting blood sugar'].value_counts()

# Create a bar graph using Plotly
fig = go.Figure([go.Bar(x=target_temp.index, y=target_temp.values)])
fig.update_layout(
    title='Fasting Blood Sugar Distribution',
    xaxis_title='Fasting blood sugar',
    yaxis_title='Count',
    template='plotly_white'
)

fig.show()

In [46]:
data['resting ecg'].unique()

array([0, 1, 2], dtype=int64)

In [48]:
resting_ecg_counts = data['resting ecg'].value_counts().reset_index()
resting_ecg_counts.columns = ['Resting ECG Results', 'Count']

# Create the bar plot using Plotly Express
fig = px.bar(resting_ecg_counts, x='Resting ECG Results', y='Count', 
             title='Distribution of Resting Electrocardiogram Results',
             labels={'Resting ECG Results': 'Resting Electrocardiogram Results', 'Count': 'Count'},
             text='Count')

# Customize layout
fig.update_layout(
    xaxis_title='Resting Electrocardiogram Results',
    yaxis_title='Count',
    template='plotly_white'  # Apply the white theme
)

# Show the plot
fig.show()

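Resting ECG result '0' is the most common value (the median of 'resting ecg' is 0). This plot shows counts only, so it says nothing about heart-disease risk; in the correlation table, 'resting ecg' has the weakest association with the target (|r| ≈ 0.06).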


In [53]:
exercise_ang_counts = data['exercise angina'].value_counts().reset_index()
exercise_ang_counts.columns = ['Exercise Angina Results', 'Count']

# Create the bar plot using Plotly Express
fig = px.bar(exercise_ang_counts, x='Exercise Angina Results', y='Count', 
             title='Distribution of Exercise Angina Results',
             labels={'Exercise Angina Results': 'Exercise Angina Results', 'Count': 'Count'},
             text='Count')

# Customize layout
fig.update_layout(
    xaxis_title='Exercise Angina Results',
    yaxis_title='Count',
    template='plotly_white'  # Apply the white theme
)

# Show the plot
fig.show()

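This plot shows only how many patients have exercise-induced angina ('exercise angina' = 1) and how many do not; it does not show how likely either group is to have heart problems. The correlation table gives |r| ≈ 0.49 between 'exercise angina' and 'target', the second-strongest association, but because absolute values were taken, its direction still has to be checked by comparing target rates per group.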


In [55]:
data.sample(2)

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target,chestpaintype
358,53,1,2,120,0,0,0,95,0,0.0,2,1,Atypical Angina
78,52,1,2,140,100,0,0,138,1,0.0,1,0,Atypical Angina


In [61]:
data['ST slope'].value_counts()

2    459
1    395
3     63
0      1
Name: ST slope, dtype: int64

In [67]:
# Only one row has ST slope == 0; fold it into category 1
data.loc[data['ST slope'] == 0, 'ST slope'] = 1


In [68]:
data['ST slope'].value_counts()

2    459
1    396
3     63
Name: ST slope, dtype: int64

In [69]:
st_slope_counts = data['ST slope'].value_counts().reset_index()
st_slope_counts.columns = ['ST Slope Results', 'Count']

# Create the bar plot using Plotly Express
fig = px.bar(st_slope_counts, x='ST Slope Results', y='Count',
             title='ST Slope Results',
             labels={'ST Slope Results': 'ST Slope Results', 'Count': 'Count'},
             text='Count')

# Customize layout
fig.update_layout(
    xaxis_title='ST Slope Results',
    yaxis_title='Count',
    template='plotly_white'  # Apply the white theme
)

# Show the plot
fig.show()

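Slope '2' (459 rows) is far more common than slope '3' (63 rows). These are frequencies only; they do not show that any slope value causes heart problems. The correlation table does rank 'ST slope' as the feature most strongly associated with the target (|r| ≈ 0.55).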
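
To relate a categorical feature such as 'ST slope' to the target, a row-normalised cross-tabulation gives the share of each target class within each category. A minimal sketch on a hypothetical toy frame (values made up for illustration); it describes association, not causation:

```python
import pandas as pd

# Hypothetical toy frame with an ST slope code and a 0/1 target
toy = pd.DataFrame({
    "ST slope": [1, 1, 2, 2, 2, 3],
    "target": [0, 0, 1, 1, 0, 1],
})

# normalize="index" makes each row sum to 1, so every cell is the
# share of that target class within the given slope category
table = pd.crosstab(toy["ST slope"], toy["target"], normalize="index")
print(table)
```

The same call works for 'sex', 'resting ecg', or 'exercise angina' by swapping the first argument.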


In [70]:
data.columns

Index(['age', 'sex', 'chest pain type', 'resting bp s', 'cholesterol',
       'fasting blood sugar', 'resting ecg', 'max heart rate',
       'exercise angina', 'oldpeak', 'ST slope', 'target', 'chestpaintype'],
      dtype='object')