Contributed by: Alaa Saif and Janat Alkhuld Musallam

# Title: Data Manipulation on a Heart Disease Dataset Using the Pandas Library

#### Keywords: Data Manipulation, Python, Pandas, NumPy, DataFrame, Heart Disease, Cardiovascular.

# Introduction

As the world continues to develop, new diseases and health risks are being recorded in human history as "modern-day diseases".
In the developing world, the risk of heart disease and related cardiovascular diseases is on the rise.
The dataset used here was acquired from Kaggle (https://www.kaggle.com/datasets/rishidamarla/heart-disease-prediction).
It is considered a stepping stone for the work that lies ahead on preventing the development or occurrence of a heart attack or stroke.

# Objectives

- Acquire a dataset from Kaggle.
- Examine the structure of the dataset and describe it.
- Perform data manipulation on the acquired dataset.

# Attributes documentation

1. (age): Age of the patient.
2. (sex): Gender of the patient (0 = female, 1 = male).
3. (cp): Chest pain type.
4. (trestbps): Blood pressure.
5. (chol): Cholesterol.
6. (fbs): FBS over 120.
7. (restecg): EKG results.
8. (thalach): Max HR.
9. (exang): Exercise angina.
10. (oldpeak): ST depression.
11. (slope): Slope of ST.
12. (ca): Number of vessels fluro.
13. (thal): Thallium.
14. (num): Heart Disease (the predicted attribute).
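The short names in parentheses differ from the column headers that appear in the CSV file later in this notebook. The following is a minimal sketch of how the two naming schemes could be linked; the dictionary name `SHORT_TO_COLUMN` and the two-row `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Short attribute names from the list above, mapped to the CSV column headers.
# SHORT_TO_COLUMN is a hypothetical helper name used only for this sketch.
SHORT_TO_COLUMN = {
    "age": "Age",
    "sex": "Sex",
    "cp": "Chest pain type",
    "trestbps": "BP",
    "chol": "Cholesterol",
    "fbs": "FBS over 120",
    "restecg": "EKG results",
    "thalach": "Max HR",
    "exang": "Exercise angina",
    "oldpeak": "ST depression",
    "slope": "Slope of ST",
    "ca": "Number of vessels fluro",
    "thal": "Thallium",
    "num": "Heart Disease",
}

# Tiny illustrative frame with two of the CSV headers.
sample = pd.DataFrame({"Age": [70, 67], "BP": [130, 115]})

# Invert the mapping to rename CSV headers to the short names.
column_to_short = {column: short for short, column in SHORT_TO_COLUMN.items()}
renamed = sample.rename(columns=column_to_short)
print(renamed.columns.tolist())
```

Columns that are missing from the mapping are left unchanged by `rename`, so the same dictionary can be applied to a subset of the dataset.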

# Importing libraries

In [68]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # to plot heatmap

# Reading the dataset

In [51]:
# To read the dataset from a CSV file:

df = pd.read_csv("Heart_Disease_Prediction.csv") 

# Data Structure & Description

In [52]:
# To display the first five rows:

df.head()

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,Absence
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,Presence
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,Absence
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,Absence


In [53]:
# To display the last five rows:

df.tail() 

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
265,52,1,3,172,199,1,0,162,0,0.5,1,0,7,Absence
266,44,1,2,120,263,0,0,173,0,0.0,1,0,7,Absence
267,56,0,2,140,294,0,2,153,0,1.3,2,0,3,Absence
268,57,1,4,140,192,0,0,148,0,0.4,2,0,6,Absence
269,67,1,4,160,286,0,2,108,1,1.5,2,3,3,Presence


In [54]:
# To find out the size of the dataset:

df.shape

(270, 14)

In [55]:
# To find out the datatypes within the dataset:

df.dtypes

Age                          int64
Sex                          int64
Chest pain type              int64
BP                           int64
Cholesterol                  int64
FBS over 120                 int64
EKG results                  int64
Max HR                       int64
Exercise angina              int64
ST depression              float64
Slope of ST                  int64
Number of vessels fluro      int64
Thallium                     int64
Heart Disease               object
dtype: object

In [56]:
# To fetch information about the dataset (helps us to spot missing values):

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      270 non-null    int64  
 1   Sex                      270 non-null    int64  
 2   Chest pain type          270 non-null    int64  
 3   BP                       270 non-null    int64  
 4   Cholesterol              270 non-null    int64  
 5   FBS over 120             270 non-null    int64  
 6   EKG results              270 non-null    int64  
 7   Max HR                   270 non-null    int64  
 8   Exercise angina          270 non-null    int64  
 9   ST depression            270 non-null    float64
 10  Slope of ST              270 non-null    int64  
 11  Number of vessels fluro  270 non-null    int64  
 12  Thallium                 270 non-null    int64  
 13  Heart Disease            270 non-null    object 
dtypes: float64(1), int64(12), object(1)
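The non-null counts above all equal the number of entries, which suggests there are no missing values. A more direct check is to count missing values per column. The sketch below assumes a three-row excerpt of the dataset with only some of the columns, read from an in-memory string so that it runs without the CSV file.

```python
import io

import pandas as pd

# Three rows and five columns copied from the head() output shown earlier.
csv_text = """Age,Sex,BP,Cholesterol,Heart Disease
70,1,130,322,Presence
67,0,115,564,Absence
57,1,124,261,Presence
"""
sample = pd.read_csv(io.StringIO(csv_text))

# isna() flags missing cells; sum() counts them per column.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

On the full dataset the same two calls would be applied to `df` instead of `sample`.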

In [57]:
# Summary statistics of the numeric columns:

df.describe()

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium
count,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0
mean,54.433333,0.677778,3.174074,131.344444,249.659259,0.148148,1.022222,149.677778,0.32963,1.05,1.585185,0.67037,4.696296
std,9.109067,0.468195,0.95009,17.861608,51.686237,0.355906,0.997891,23.165717,0.470952,1.14521,0.61439,0.943896,1.940659
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0
25%,48.0,0.0,3.0,120.0,213.0,0.0,0.0,133.0,0.0,0.0,1.0,0.0,3.0
50%,55.0,1.0,3.0,130.0,245.0,0.0,2.0,153.5,0.0,0.8,2.0,0.0,3.0
75%,61.0,1.0,4.0,140.0,280.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0


In [58]:
# Counting the number of unique values in each column:
df.nunique()

Age                         41
Sex                          2
Chest pain type              4
BP                          47
Cholesterol                144
FBS over 120                 2
EKG results                  3
Max HR                      90
Exercise angina              2
ST depression               39
Slope of ST                  3
Number of vessels fluro      4
Thallium                     3
Heart Disease                2
dtype: int64

In [59]:
# Counting the records with absence and presence of heart disease:

df['Heart Disease'].value_counts()

Absence     150
Presence    120
Name: Heart Disease, dtype: int64
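The counts can also be expressed as proportions. The sketch below rebuilds a label column from the two counts reported above (150 and 120) rather than from the CSV file, so the Series itself is an assumption made for illustration.

```python
import pandas as pd

# Rebuild a label column from the counts printed above.
labels = pd.Series(["Absence"] * 150 + ["Presence"] * 120, name="Heart Disease")

# normalize=True returns the share of each class instead of the raw count.
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```

On the full dataset the equivalent call would be `df['Heart Disease'].value_counts(normalize=True)`.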

# Data Manipulation

In [60]:
df = pd.read_csv("Heart_Disease_Prediction.csv")

# Dropping the 'FBS over 120' column, which was judged unnecessary while inspecting the dataset:

df = df.drop('FBS over 120', axis=1)
df.head(4)

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,70,1,4,130,322,2,109,0,2.4,2,3,3,Presence
1,67,0,3,115,564,2,160,0,1.6,2,0,7,Absence
2,57,1,2,124,261,0,141,0,0.3,1,0,7,Presence
3,64,1,4,128,263,0,105,1,0.2,2,1,7,Absence


### Observation: The 'FBS over 120' column is deleted and the dataset is displayed without it
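Note that `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch on a three-row excerpt (the names `sample`, `trimmed` and `trimmed_again` are illustrative):

```python
import pandas as pd

# Three rows and three columns taken from the head() output.
sample = pd.DataFrame(
    {"Age": [70, 67, 57], "FBS over 120": [0, 0, 0], "BP": [130, 115, 124]}
)

# drop() returns a new frame; `sample` keeps all three columns.
trimmed = sample.drop(columns=["FBS over 120"])

# errors="ignore" skips labels that are absent instead of raising KeyError,
# which makes the cell safe to run a second time.
trimmed_again = trimmed.drop(columns=["FBS over 120"], errors="ignore")

print(sample.columns.tolist())
print(trimmed.columns.tolist())
```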

In [61]:
df = pd.read_csv("Heart_Disease_Prediction.csv")

# Concatenating the BP and Heart Disease columns into one string column to view each BP reading next to its diagnosis

df["BP-Heart Disease"] = df['BP'].astype(str) +"-"+ df["Heart Disease"]
print(df)

     Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  EKG results  \
0     70    1                4  130          322             0            2   
1     67    0                3  115          564             0            2   
2     57    1                2  124          261             0            0   
3     64    1                4  128          263             0            0   
4     74    0                2  120          269             0            2   
..   ...  ...              ...  ...          ...           ...          ...   
265   52    1                3  172          199             1            0   
266   44    1                2  120          263             0            0   
267   56    0                2  140          294             0            2   
268   57    1                4  140          192             0            0   
269   67    1                4  160          286             0            2   

     Max HR  Exercise angina  ST depression  Slope 

### Observation: a new column, 'BP-Heart Disease', is added to the dataset
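Joining the two columns as text places the values side by side but does not quantify any relationship between them. One possible way to summarise the relationship is to compare the mean BP per class, or to encode the label as 0/1 and compute a correlation. The sketch below uses only the first five rows shown earlier, so its numbers illustrate the technique and say nothing about the full dataset.

```python
import pandas as pd

# First five rows of the BP and Heart Disease columns.
sample = pd.DataFrame(
    {
        "BP": [130, 115, 124, 128, 120],
        "Heart Disease": ["Presence", "Absence", "Presence", "Absence", "Absence"],
    }
)

# Mean blood pressure within each class.
mean_bp = sample.groupby("Heart Disease")["BP"].mean()

# Encode the label as 1 (Presence) / 0 (Absence) and compute the Pearson correlation.
has_disease = (sample["Heart Disease"] == "Presence").astype(int)
correlation = sample["BP"].corr(has_disease)

print(mean_bp)
print(round(correlation, 3))
```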

In [71]:
df = pd.read_csv("Heart_Disease_Prediction.csv")

# Accessing a column that does not exist in the DataFrame (raises KeyError):
nonexistent_column = df['NonexistentColumn']
print(nonexistent_column)

KeyError: 'NonexistentColumn'

### Observation: running this code raises a KeyError whose message names the column that does not exist in the DataFrame
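The error can be avoided or handled in a few ways. The sketch below shows three options on a two-row illustrative frame: a membership test, `DataFrame.get` (which returns `None` by default for a missing column), and a `try`/`except` block.

```python
import pandas as pd

sample = pd.DataFrame({"Age": [70, 67], "BP": [130, 115]})
column_name = "NonexistentColumn"

# Option 1: test for membership before indexing.
exists = column_name in sample.columns

# Option 2: get() returns None instead of raising when the column is missing.
fallback = sample.get(column_name)

# Option 3: catch the KeyError explicitly.
try:
    sample[column_name]
    caught = False
except KeyError:
    caught = True

print(exists, fallback, caught)
```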

In [72]:
# Assign the value 1 to the entire 'EKG results' column.

df = pd.read_csv("Heart_Disease_Prediction.csv")

df['EKG results'] = 1
print(df)

     Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  EKG results  \
0     70    1                4  130          322             0            1   
1     67    0                3  115          564             0            1   
2     57    1                2  124          261             0            1   
3     64    1                4  128          263             0            1   
4     74    0                2  120          269             0            1   
..   ...  ...              ...  ...          ...           ...          ...   
265   52    1                3  172          199             1            1   
266   44    1                2  120          263             0            1   
267   56    0                2  140          294             0            1   
268   57    1                4  140          192             0            1   
269   67    1                4  160          286             0            1   

     Max HR  Exercise angina  ST depression  Slope 

### Observation: every value in the 'EKG results' column is now 1
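Assigning a scalar to a column overwrites every row, so the original readings are lost in that DataFrame. A minimal sketch of two alternatives, assuming a three-row excerpt: working on a copy, and changing only the rows that match a condition.

```python
import pandas as pd

# Three rows taken from the head() output.
original = pd.DataFrame({"Age": [70, 67, 57], "EKG results": [2, 2, 0]})

# Work on a copy so the original values stay available.
modified = original.copy()
modified["EKG results"] = 1  # the scalar is broadcast to every row

# Change only the rows that satisfy a condition.
conditional = original.copy()
conditional.loc[conditional["EKG results"] == 2, "EKG results"] = 1

print(original["EKG results"].tolist())
print(modified["EKG results"].tolist())
print(conditional["EKG results"].tolist())
```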

In [76]:
# Using the sort_values() method based on 'Max HR'

df.sort_values(['Max HR'])

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
101,67,1,4,120,237,0,1,71,0,1.0,2,0,3,Presence
122,57,1,4,152,274,0,1,88,1,1.2,2,1,7,Presence
145,53,1,4,123,282,0,1,95,1,2.0,2,2,7,Presence
133,64,1,4,120,246,0,1,96,1,2.2,3,1,3,Presence
57,60,0,3,120,178,1,1,96,0,0.0,1,0,3,Absence
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
229,52,1,1,118,186,0,1,190,0,0.0,2,0,6,Absence
138,34,0,2,118,210,0,1,192,0,0.7,1,0,3,Absence
180,42,1,3,120,240,1,1,194,0,0.8,3,0,7,Absence
144,54,1,2,192,283,0,1,195,0,0.0,1,1,7,Presence


### Observation: The dataset is sorted in ascending order based on 'Max HR'
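`sort_values` returns a sorted copy, so `df` keeps its original order unless the result is assigned. The sketch below, on the first five rows, also shows the `ascending=False` option for descending order.

```python
import pandas as pd

# First five rows of the Age and Max HR columns.
sample = pd.DataFrame(
    {"Age": [70, 67, 57, 64, 74], "Max HR": [109, 160, 141, 105, 121]}
)

ascending = sample.sort_values("Max HR")
descending = sample.sort_values("Max HR", ascending=False)

print(ascending.index.tolist())   # original row labels in ascending Max HR order
print(descending.index.tolist())  # original row labels in descending Max HR order
print(sample.index.tolist())      # the source frame is unchanged
```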

In [80]:
# Using the rank() method based on 'Max HR'

df["Rank"] = df['Max HR'].rank()
df

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease,Rank
0,70,1,4,130,322,0,1,109,0,2.4,2,3,3,Presence,16.5
1,67,0,3,115,564,0,1,160,0,1.6,2,0,7,Absence,168.0
2,57,1,2,124,261,0,1,141,0,0.3,1,0,7,Presence,83.5
3,64,1,4,128,263,0,1,105,1,0.2,2,1,7,Absence,11.0
4,74,0,2,120,269,0,1,121,1,0.2,1,1,3,Absence,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
265,52,1,3,172,199,1,1,162,0,0.5,1,0,7,Absence,182.5
266,44,1,2,120,263,0,1,173,0,0.0,1,0,7,Absence,232.5
267,56,0,2,140,294,0,1,153,0,1.3,2,0,3,Absence,134.0
268,57,1,4,140,192,0,1,148,0,0.4,2,0,6,Absence,113.0


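### Observation: a 'Rank' column is added holding the rank of each row's 'Max HR'; the rows themselves stay unsorted, and tied values share the average of their ranks (hence values such as 16.5)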
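Fractional ranks come from ties: by default `rank()` uses `method="average"`. The sketch below compares three tie-handling methods on a handful of 'Max HR' values that appear in the outputs above (they are not necessarily consecutive in the full dataset).

```python
import pandas as pd

# A few Max HR values from the outputs above; 96 appears twice.
max_hr = pd.Series([71, 88, 95, 96, 96, 105], name="Max HR")

average_rank = max_hr.rank()              # ties share the mean of their positions
min_rank = max_hr.rank(method="min")      # ties take the lowest position
dense_rank = max_hr.rank(method="dense")  # like "min", but without gaps afterwards

print(pd.DataFrame({"average": average_rank, "min": min_rank, "dense": dense_rank}))
```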

In [87]:
# Applying str.split() to the 'Heart Disease' column

df['Existing'] = df['Heart Disease'].str.split()
df

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease,Rank,Existing
0,70,1,4,130,322,0,1,109,0,2.4,2,3,3,Presence,16.5,[Presence]
1,67,0,3,115,564,0,1,160,0,1.6,2,0,7,Absence,168.0,[Absence]
2,57,1,2,124,261,0,1,141,0,0.3,1,0,7,Presence,83.5,[Presence]
3,64,1,4,128,263,0,1,105,1,0.2,2,1,7,Absence,11.0,[Absence]
4,74,0,2,120,269,0,1,121,1,0.2,1,1,3,Absence,35.0,[Absence]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
265,52,1,3,172,199,1,1,162,0,0.5,1,0,7,Absence,182.5,[Absence]
266,44,1,2,120,263,0,1,173,0,0.0,1,0,7,Absence,232.5,[Absence]
267,56,0,2,140,294,0,1,153,0,1.3,2,0,3,Absence,134.0,[Absence]
268,57,1,4,140,192,0,1,148,0,0.4,2,0,6,Absence,113.0,[Absence]


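### Observation: str.split() splits each string on whitespace and returns a list; because the labels contain no spaces, each value becomes a one-element list such as [Presence], and the rows are not separated by heart condition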
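If the aim is to separate the rows by diagnosis, one option is `groupby` or a boolean mask. The following is a minimal sketch on the first five rows; the names `groups` and `present` are illustrative.

```python
import pandas as pd

# First five rows of the Age and Heart Disease columns.
sample = pd.DataFrame(
    {
        "Age": [70, 67, 57, 64, 74],
        "Heart Disease": ["Presence", "Absence", "Presence", "Absence", "Absence"],
    }
)

# str.split() on a label without spaces yields a one-element list.
first_split = sample["Heart Disease"].str.split().iloc[0]

# groupby yields one sub-DataFrame per class.
groups = {label: group for label, group in sample.groupby("Heart Disease")}

# A boolean mask selects a single class directly.
present = sample[sample["Heart Disease"] == "Presence"]

print(first_split)
print(groups["Absence"])
print(present)
```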

In [77]:
# Creating subsets of the original dataset

np.random.seed(421) # setting the seed so the sample is reproducible
sample1 = df.sample(frac = 0.5) # draw a random subset containing half of the rows
sample1

Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
1,67,0,3,115,564,0,1,160,0,1.6,2,0,7,Absence
14,57,0,4,128,303,0,1,159,0,0.0,1,1,3,Absence
149,41,0,3,112,268,0,1,172,1,0.0,1,0,3,Absence
40,40,1,4,152,223,0,1,181,0,0.0,1,0,7,Presence
54,45,0,2,130,234,0,1,175,0,0.6,2,0,3,Absence
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224,35,0,4,138,183,0,1,182,0,1.4,1,0,3,Absence
142,50,1,3,140,233,0,1,163,0,0.6,2,1,7,Presence
168,45,0,4,138,236,0,1,152,1,0.2,2,0,3,Absence
158,56,1,1,120,193,0,1,162,0,1.9,2,0,7,Absence


### Observation: a random subset of 135 of the 270 rows is drawn from the dataset
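An alternative to seeding NumPy globally is the `random_state` parameter of `sample`, which keeps the seed local to the call. The sketch below uses a placeholder frame of 270 numbered rows in place of the real dataset.

```python
import pandas as pd

# Placeholder frame with the same number of rows as the dataset.
frame = pd.DataFrame({"row_id": range(270)})

# The same random_state gives the same subset on every call.
first = frame.sample(frac=0.5, random_state=421)
second = frame.sample(frac=0.5, random_state=421)

print(len(first))
print(first.index.equals(second.index))
```

Sampling is done without replacement by default, so no row appears twice in the subset.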

In [82]:
# Using the loc indexer to select all the rows and three specific columns by label

ind1 = df.loc[:,['Age','BP','Thallium']]
ind1

Unnamed: 0,Age,BP,Thallium
0,70,130,3
1,67,115,7
2,57,124,7
3,64,128,7
4,74,120,3
...,...,...,...
265,52,172,7
266,44,120,7
267,56,140,3
268,57,140,6


### Observation: the loc indexer selects data by label; here it returns every row for the three named columns
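`loc` also accepts a boolean condition or a label slice in the row position. A short sketch on the first five rows (the threshold of 65 is an arbitrary value chosen for illustration):

```python
import pandas as pd

# First five rows of four columns from the head() output.
sample = pd.DataFrame(
    {
        "Age": [70, 67, 57, 64, 74],
        "Sex": [1, 0, 1, 1, 0],
        "BP": [130, 115, 124, 128, 120],
        "Thallium": [3, 7, 7, 7, 3],
    }
)

# Rows chosen by a condition, columns chosen by label.
older = sample.loc[sample["Age"] > 65, ["Age", "BP", "Thallium"]]

# Label slices in loc include both endpoints.
label_slice = sample.loc[1:3, ["Age", "BP"]]

print(older)
print(label_slice)
```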

In [84]:
# Using the iloc indexer to select a single row by integer position

ind2 = df.iloc[57]
ind2

Age                              47
Sex                               1
Chest pain type                   3
BP                              108
Cholesterol                     243
FBS over 120                      0
EKG results                       1
Max HR                          152
Exercise angina                   0
ST depression                   0.0
Slope of ST                       1
Number of vessels fluro           0
Thallium                          3
Heart Disease              Presence
Rank                          129.5
Name: 69, dtype: object

### Observation: iloc retrieves data by integer position rather than by label; here position 57 holds the row whose index label is 69 (the Name shown in the output)
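Position and label can differ whenever the rows have been reordered. The sketch below shows the difference on the first five rows after sorting by 'Max HR'; the variable names are illustrative.

```python
import pandas as pd

# First five rows of the Age and Max HR columns.
sample = pd.DataFrame(
    {"Age": [70, 67, 57, 64, 74], "Max HR": [109, 160, 141, 105, 121]}
)
reordered = sample.sort_values("Max HR")

# iloc[0] is the first row in the current order, whatever its label.
by_position = reordered.iloc[0]

# loc[0] is the row whose index label is 0, wherever it now sits.
by_label = reordered.loc[0]

# iloc also takes a (row position, column position) pair.
cell = reordered.iloc[0, 1]

print(by_position.name, by_position["Max HR"])
print(by_label.name, by_label["Max HR"])
print(cell)
```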