# <center>FINAL PROJECT: Heart Attack Analysis & Prediction Dataset</center>

| FULL NAME          | ID NUMBER |
| :----------------- | :-------: |
| Nguyễn Phương Thảo | 21120336  |
| Phan Cao Nguyên    | 21120299  |
| Lê Trần Minh Khuê  | 21120279  |

---

## Contents
1. [Introduction](#1-introduction)  
    1.1. [Dataset information](#11-dataset-information)  
    1.2. [Attribute information](#12-attribute-information)  
    1.3. [Why we select this dataset](#13-why-we-select-this-dataset)  
2. [Data exploration](#2-data-exploration)  
    2.1. [Import Lib](#21-import-lib)  
    2.2. [Downloading data](#22-downloading-data)  
    2.3. [How many rows and columns](#23-how-many-rows-and-columns)  
    2.4. [Rows exploration](#24-rows-exploration)  
    2.4. [Columns exploration](#24-columns-exploration)  
3. [Frame the problem](#3-frame-the-problems---ask-meaningful-questions)  
4. [Preprocessing](#4-preprocessing)  
5. [Analyzing to answer questions](#5-analyzing-to-answer-questions)  
6. [Reflection](#6-reflection)  
7. [References](#7-references)  

## 1. Introduction

### 1.1. Dataset information

**Subject of this data:**
The dataset is the `Cleveland Heart Disease dataset` from the UCI Machine Learning Repository. It contains data on 303 individuals and is available on the repository site under "Data Folder".  

**Acknowledgements**

- ***Creators:***
    - Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
    - University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
    - University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
    - V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

- ***Donor:***
    - David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

**License:**

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.

**Source:**

**Original data:** [Dataset from UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)

*In this project, we will use the processed version of the data posted in the following Kaggle dataset.*
[Heart Attack Analysis & Prediction Dataset](https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset?resource=download&select=heart.csv)   

### 1.2. Attribute information

Cleveland Heart Disease Dataset (description of attributes from the data source webpage):
- `processed.cleveland.data`:
    - `age`: Age of the person (in years)
    - `sex`: Gender of the person (1 = male, 0 = female)
    - `cp`: Chest Pain type 
        - Value 1: typical angina
        - Value 2: atypical angina
        - Value 3: non-anginal pain
        - Value 4: asymptomatic
    - `trtbps`: resting blood pressure (in mmHg on admission to the hospital)
    - `chol`: serum cholesterol (in mg/dl fetched via BMI sensor - integer value)
    - `fbs`: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
    - `restecg`: resting electrocardiographic results
        - Value 0: normal
        - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
        - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    - `thalach`: maximum heart rate achieved 
    - `exang`: exercise induced angina (1 = yes; 0 = no)
    - `oldpeak`: previous peak - ST depression induced by exercise relative to rest 
    - `slp`: the slope of the peak exercise ST segment
        - Value 1: upsloping
        - Value 2: flat
        - Value 3: downsloping
    - `ca`: number of major vessels (0 - 3)
    - `thal`: thal rate
        - 3 = normal;
        - 6 = fixed defect;
        - 7 = reversible defect
    - `output`: target variable - diagnosis of heart disease (angiographic disease status)
        - Value 0: < 50% diameter narrowing
        - Value 1: > 50% diameter narrowing

        (in any major vessel: attributes 59 through 68 are vessels)
        
#### NOTE:


### 1.3. Why we select this dataset?

Cardiovascular diseases, including heart disease, rank among the leading causes of death worldwide. The prevalence of heart-related conditions such as angina, myocardial infarction, and stroke makes it a crucial area of focus in medical research. Understanding the contributing factors to heart disease and the ability to predict the risk of its occurrence are paramount in improving intervention and prevention strategies.

The dataset contains valuable information about various factors and clinical parameters related to heart disease. It includes both numerical and categorical features, such as age, sex, cholesterol levels, and electrocardiographic measurements, making it suitable for analyzing and predicting heart disease.

## 2. Data Exploration

### 2.1. IMPORT LIB

In [1]:
# import Lib

#   to download data
import os
import urllib.request
# data handling
import pandas as pd
import numpy as np
import re
# visualization 
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import colors
from matplotlib.colors import ListedColormap
%matplotlib inline


### 2.2. Downloading data

In [2]:
# Create a folder if it does not exist
data_folder = "./Data"
os.makedirs(data_folder, exist_ok=True)

# Path to the "processed.cleveland.data" file in the "Data" folder
file_path = os.path.join(data_folder, "processed.cleveland.data")

# Check if the file exists
if not os.path.exists(file_path):
    # If not, download it from the URL and save it to the "Data" folder
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
    urllib.request.urlretrieve(url, file_path)
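
A minimal sketch of the same check-then-download logic wrapped in a reusable function. The name `download_if_missing` is hypothetical (it is not part of the original notebook), and the demonstration uses a local `file://` URL inside a temporary folder instead of the UCI address so that it runs offline.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url, file_path):
    """Download url to file_path unless the file already exists.

    Returns True when a download happened, False when the existing file was reused.
    """
    if os.path.exists(file_path):
        return False
    # Create the parent folder (for example "./Data") if it is missing.
    os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, file_path)
    return True


# Offline demonstration: a local file stands in for the remote data file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.data"
    source.write_text("63.0,1.0,1.0\n")
    target = os.path.join(tmp, "Data", "copy.data")

    first_call = download_if_missing(source.as_uri(), target)   # downloads
    second_call = download_if_missing(source.as_uri(), target)  # reuses the file
    downloaded_text = Path(target).read_text()

print(first_call, second_call)
```

Returning a boolean makes it easy to log whether the cached copy was reused.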


In [3]:
# Load the data into a DataFrame (the raw file has no header row)
heart_df = pd.read_csv(file_path,header=None)
#heart_df = pd.read_csv("Data/heart.csv")
heart_df.head()

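|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |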


Based on the attribute information above, we rename the columns so that each title matches the values the column holds.

In [4]:
# Rename columns
heart_df.columns = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slp', 'ca', 'thal', 'num']
heart_df.head()

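|   | age | sex | cp | trtbps | chol | fbs | restecg | thalach | exang | oldpeak | slp | ca | thal | num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |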


### 2.3. How many rows and columns?
Use pandas to find the number of rows and columns, then store them in two variables, `num_rows` and `num_cols`.

In [5]:
num_rows, num_cols = heart_df.shape
print('Number of rows:',num_rows)
print('Number of columns:',num_cols)

Number of rows: 303
Number of columns: 14


### 2.4. Rows exploration

In [6]:
heart_df.sample()

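|   | age | sex | cp | trtbps | chol | fbs | restecg | thalach | exang | oldpeak | slp | ca | thal | num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 131 | 51.0 | 1.0 | 3.0 | 94.0 | 227.0 | 0.0 | 0.0 | 154.0 | 1.0 | 0.0 | 1.0 | 1.0 | 7.0 | 0 |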


#### 2.4.1. Meaning of each row
Each row in the dataset represents an individual, and the values in each column provide information about various health-related factors for that person. The goal is often to analyze these features to predict the likelihood or presence of heart disease.

#### 2.4.2. Are there duplicated rows?
We use pandas `duplicated()` to check for duplicate rows.
- Store the number of duplicated rows in the variable `have_duplicated_rows`.
- If `have_duplicated_rows` > 0:
    - Print the duplicate rows and then drop them.
    - Recount and check that `have_duplicated_rows` == 0, meaning no duplicated rows remain.
- Otherwise, proceed to the next part.

In [7]:
have_duplicated_rows = heart_df.duplicated().sum()
have_duplicated_rows

0

In [8]:
if have_duplicated_rows > 0: 
    duplicates = heart_df[heart_df.duplicated()]
    print(duplicates) 
    
    # remove duplicates
    heart_df.drop_duplicates(inplace=True)
    
    # update 'have_duplicated_rows'
    have_duplicated_rows = heart_df.duplicated().sum()

    # TEST: no duplicated rows should remain
    assert have_duplicated_rows == 0
    

### 2.4. Columns exploration


#### 2.4.1. Meaning of each column
Based on the definitions given on the source page:
- `age`: Age of the person (in years - integer value)
- `sex`: Gender of the person (1 = male, 0 = female)
- `cp`: Chest Pain type (1 = Typical Angina, 2 = Atypical Angina, 3 = Non-anginal Pain, 4 = Asymptomatic)
- `trtbps`: resting blood pressure (in mmHg on admission to the hospital - integer value)
- `chol`: serum cholesterol (in mg/dl fetched via BMI sensor - integer value)
- `fbs`: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- `restecg`: resting electrocardiographic results (0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy)
- `thalach`: maximum heart rate achieved (integer value)
- `exang`: exercise induced angina (1 = yes; 0 = no)
- `oldpeak`: previous peak - ST depression induced by exercise relative to rest
- `slp`: the slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
- `ca`: number of major vessels (0 - 3)
- `thal`: thal rate (3 = normal; 6 = fixed defect; 7 = reversible defect)
- `num`: target variable - diagnosis of heart disease (integer value) (0: < 50% diameter narrowing; 1: > 50% diameter narrowing)

#### 2.4.2. Checking the number of unique values in each column

In [9]:
unique_counts = heart_df.nunique()
unique_counts.sort_values(ascending=True)

sex          2
fbs          2
exang        2
restecg      3
slp          3
cp           4
thal         4
ca           5
num          5
oldpeak     40
age         41
trtbps      50
thalach     91
chol       152
dtype: int64

#### 2.4.3. Current data type? Are there inappropriate data types?
We check the data type of each column, store the result in a Series named `col_dtypes`, and print a summary with `info()`.

In [10]:
col_dtypes = heart_df.dtypes  # info() only prints and returns None, so dtypes are stored separately
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   age      303 non-null    float64
 1   sex      303 non-null    float64
 2   cp       303 non-null    float64
 3   trtbps   303 non-null    float64
 4   chol     303 non-null    float64
 5   fbs      303 non-null    float64
 6   restecg  303 non-null    float64
 7   thalach  303 non-null    float64
 8   exang    303 non-null    float64
 9   oldpeak  303 non-null    float64
 10  slp      303 non-null    float64
 11  ca       303 non-null    object 
 12  thal     303 non-null    object 
 13  num      303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


##### Processing for 'ca' and 'thal'.
The `ca` and `thal` columns should hold numerical values, but the result above shows them as `object`. Therefore, we print the rows containing values that are not numbers or cannot be converted to numbers for further examination.

In [11]:
# Flag rows that contain at least one value that cannot be converted to a number
non_numeric_rows = heart_df.apply(pd.to_numeric, errors='coerce').isna().any(axis=1)

# Filter the DataFrame to retain only the rows that satisfy the condition.
filtered_df = heart_df[non_numeric_rows]
filtered_df

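|   | age | sex | cp | trtbps | chol | fbs | restecg | thalach | exang | oldpeak | slp | ca | thal | num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 87 | 53.0 | 0.0 | 3.0 | 128.0 | 216.0 | 0.0 | 2.0 | 115.0 | 0.0 | 0.0 | 1.0 | 0.0 | ? | 0 |
| 166 | 52.0 | 1.0 | 3.0 | 138.0 | 223.0 | 0.0 | 0.0 | 169.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0 |
| 192 | 43.0 | 1.0 | 4.0 | 132.0 | 247.0 | 1.0 | 2.0 | 143.0 | 1.0 | 0.1 | 2.0 | ? | 7.0 | 1 |
| 266 | 52.0 | 1.0 | 4.0 | 128.0 | 204.0 | 1.0 | 0.0 | 156.0 | 1.0 | 1.0 | 2.0 | 0.0 | ? | 2 |
| 287 | 58.0 | 1.0 | 2.0 | 125.0 | 220.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.4 | 2.0 | ? | 7.0 | 0 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0 |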


It is easy to see that the values that cannot be converted are all the character "?", which indicates missing data. Therefore, we will replace them with `np.nan`.

In [12]:
for column in heart_df.columns:
    if heart_df[column].dtype == object and '?' in heart_df[column].values:
        heart_df.loc[non_numeric_rows & (heart_df[column] == '?'), column] = np.nan

heart_df.loc[[87,166,192,266,287,302]]

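|   | age | sex | cp | trtbps | chol | fbs | restecg | thalach | exang | oldpeak | slp | ca | thal | num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 87 | 53.0 | 0.0 | 3.0 | 128.0 | 216.0 | 0.0 | 2.0 | 115.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | 0 |
| 166 | 52.0 | 1.0 | 3.0 | 138.0 | 223.0 | 0.0 | 0.0 | 169.0 | 0.0 | 0.0 | 1.0 | NaN | 3.0 | 0 |
| 192 | 43.0 | 1.0 | 4.0 | 132.0 | 247.0 | 1.0 | 2.0 | 143.0 | 1.0 | 0.1 | 2.0 | NaN | 7.0 | 1 |
| 266 | 52.0 | 1.0 | 4.0 | 128.0 | 204.0 | 1.0 | 0.0 | 156.0 | 1.0 | 1.0 | 2.0 | 0.0 | NaN | 2 |
| 287 | 58.0 | 1.0 | 2.0 | 125.0 | 220.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.4 | 2.0 | NaN | 7.0 | 0 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 1.0 | NaN | 3.0 | 0 |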
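
As an alternative to replacing "?" after loading (this is not what the notebook does), `pd.read_csv` can be told to treat "?" as missing at read time through its `na_values` parameter, and can assign the column names through `names`. The sketch below assumes three raw rows copied from the outputs shown earlier and reads them from an in-memory string rather than from the downloaded file.

```python
import io

import pandas as pd

# Rows 0, 87 and 166 of the raw file, copied from the outputs shown earlier.
raw_text = (
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0\n"
)
column_names = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slp', 'ca', 'thal', 'num']

# na_values="?" converts every "?" cell to NaN while parsing.
sample_df = pd.read_csv(io.StringIO(raw_text), header=None,
                        names=column_names, na_values="?")

print(sample_df[['ca', 'thal']].dtypes)
print(sample_df[['ca', 'thal']].isna().sum())
```

With this approach `ca` and `thal` are parsed as floating-point columns from the start, so the `object` dtype never appears.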


>***Note:*** 
We will not convert 'ca' and 'thal' to numerical types. The specific reasons will be explained in the next section.

##### Processing for categorical columns:
Taking a closer look at the previously established definitions and values, we observe that certain attributes should be of categorical type, such as:
- `sex` (Male/Female),

- `cp` (1- typical angina/ 2 - atypical angina/ 3 - non-anginal pain/ 4 - asymptomatic),

- `fbs` (True/False),

- `restecg` (normal, ST-T wave abnormality, left ventricular hypertrophy),

- `exang` (Yes/No),

- `slp` (upsloping/ flat/ downsloping),

- `ca` (0-3 major vessels)

- `thal` (3 - normal/ 6 - fixed defect/ 7 - reversible defect)

Because the values in these columns are independent and unordered, they are categorical in meaning. Representing them as numbers could therefore cause confusion during calculations.

- For columns that take only 2 values (`sex`, `exang`, `fbs`), we will replace 0 and 1 with the labels they represent.
- For columns that take more than 2 values (or whose labels are long), we will simply change the data type to object.
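
For comparison, pandas also offers a dedicated `category` dtype. The sketch below is an illustration on a toy Series, not the approach taken in this notebook; the label dictionary `cp_labels` is a hypothetical helper built from the attribute description in section 1.2.

```python
import pandas as pd

# Labels taken from the attribute description in section 1.2.
# Keys are floats because the raw column stores the codes as floats.
cp_labels = {
    1.0: 'typical angina',
    2.0: 'atypical angina',
    3.0: 'non-anginal pain',
    4.0: 'asymptomatic',
}

# Toy data standing in for the `cp` column.
cp_codes = pd.Series([1.0, 4.0, 4.0, 3.0, 2.0], name='cp')

# Map each code to its label, then store the result as a categorical column.
cp_categorical = cp_codes.map(cp_labels).astype('category')

print(cp_categorical.dtype)
print(cp_categorical.value_counts().idxmax())
```

A categorical column keeps the readable labels while still restricting the values to a fixed set.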

In [13]:
heart_df['sex'] = heart_df['sex'].replace({1: 'Male', 0: 'Female'})
heart_df['exang'] = heart_df['exang'].replace({1: 'Yes', 0: 'No'})
heart_df['fbs'] = heart_df['fbs'].replace({1: 'True', 0: 'False'})

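# Note: astype(str) turns np.nan into the string 'nan', which is why
# missing values in 'ca' and 'thal' are matched as 'nan' later on.
heart_df[['cp', 'restecg', 'slp', 'ca', 'thal']] = heart_df[['cp', 'restecg', 'slp', 'ca', 'thal']].astype(str)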

heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   age      303 non-null    float64
 1   sex      303 non-null    object 
 2   cp       303 non-null    object 
 3   trtbps   303 non-null    float64
 4   chol     303 non-null    float64
 5   fbs      303 non-null    object 
 6   restecg  303 non-null    object 
 7   thalach  303 non-null    float64
 8   exang    303 non-null    object 
 9   oldpeak  303 non-null    float64
 10  slp      303 non-null    object 
 11  ca       303 non-null    object 
 12  thal     303 non-null    object 
 13  num      303 non-null    int64  
dtypes: float64(5), int64(1), object(8)
memory usage: 33.3+ KB


#### 2.4.4. Numeric columns - How are values distributed?
- First, we will retrieve columns with numeric types.

- Next, we will calculate the missing ratio and consider some statistical values associated with them.

In [14]:
numeric_columns = heart_df.select_dtypes(include=['number'])
numeric_columns.columns

Index(['age', 'trtbps', 'chol', 'thalach', 'oldpeak', 'num'], dtype='object')

Remove the 'num' column as it is the target column; we are not considering it.

In [15]:
numeric_columns = numeric_columns.drop('num', axis=1)
numeric_columns.columns

Index(['age', 'trtbps', 'chol', 'thalach', 'oldpeak'], dtype='object')

***But wait a moment.*** 

Take a quick look at the 'num' column.

In [16]:
unique_values_num = heart_df['num'].unique()
unique_values_num

array([0, 2, 1, 3, 4], dtype=int64)

According to the provided definition, 'num' only takes the values 0 (< 50% diameter narrowing) or 1 (> 50% diameter narrowing).

However, we observe that 'num' actually takes 5 values: [0, 2, 1, 3, 4]. Therefore, we need to map 1, 2, 3, and 4 to a single class = 1 (indicating patients with the disease).

In [17]:
# Assign the result back instead of using inplace=True on a selected column
heart_df['num'] = heart_df['num'].replace([2, 3, 4], 1)
unique_values_num = heart_df['num'].unique()
unique_values_num

array([0, 1], dtype=int64)
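
The same mapping can be written as a threshold, since every raw code above 0 is treated as "disease present". The sketch below uses a toy Series rather than the real column and checks that both formulations agree.

```python
import pandas as pd

# Toy target column containing every raw code seen in the data (0 to 4).
num_raw = pd.Series([0, 2, 1, 0, 3, 4], name='num')

# Threshold form: any code above 0 becomes class 1.
num_binary = (num_raw > 0).astype(int)

# Replacement form, as used in the cell above.
num_replaced = num_raw.replace([2, 3, 4], 1)

print(num_binary.tolist())  # [0, 1, 1, 0, 1, 1]
```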

##### **Missing ratio**

To calculate the missing ratio for each column in `numeric_columns`, we chain the `isnull()` and `mean()` methods, which amounts to counting the missing values and dividing by the total number of rows (multiplied by 100 to express the result as a percentage).

> Explanation:
> - The `isnull()` method creates a new DataFrame with *True* where a value is missing and *False* elsewhere.
> - Applying `mean()` to this DataFrame treats *True* as 1 and *False* as 0 during averaging, which yields the missing ratio.
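
A small illustration of this mechanism on a toy DataFrame (the column names `a` and `b` are made up for the example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],     # 1 of 4 values missing
    'b': [np.nan, np.nan, 2.0, 5.0],  # 2 of 4 values missing
})

mask = toy.isnull()                # boolean DataFrame: True where a value is missing
missing_ratio = mask.mean() * 100  # True counts as 1, False as 0

print(missing_ratio.to_dict())  # {'a': 25.0, 'b': 50.0}
```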

In [18]:
missing_ratio_numeric_columns = numeric_columns.isnull().mean()*100
missing_ratio_numeric_columns

age        0.0
trtbps     0.0
chol       0.0
thalach    0.0
oldpeak    0.0
dtype: float64

There are no missing values in the numeric columns.

The categorical column `ca`, however, has a missing ratio of 1.32%. We decided to drop all records with missing values (in `ca` and `thal`) to prevent potential effects later on.

##### **Min, max, quantiles**

Use the describe function to compute the minimum, maximum, and quantile values.

In [19]:
numeric_columns.describe()

|   | age | trtbps | chol | thalach | oldpeak |
| --- | --- | --- | --- | --- | --- |
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.438944 | 131.689769 | 246.693069 | 149.607261 | 1.039604 |
| std | 9.038662 | 17.599748 | 51.776918 | 22.875003 | 1.161075 |
| min | 29.0 | 94.0 | 126.0 | 71.0 | 0.0 |
| 25% | 48.0 | 120.0 | 211.0 | 133.5 | 0.0 |
| 50% | 56.0 | 130.0 | 241.0 | 153.0 | 0.8 |
| 75% | 61.0 | 140.0 | 275.0 | 166.0 | 1.6 |
| max | 77.0 | 200.0 | 564.0 | 202.0 | 6.2 |


##### **Are they abnormal?**
- **Age:**
  - The dataset includes 303 individuals.
  - The average age is approximately 54.44 years, with a standard deviation of around 9.04 years.
  - The youngest individual is 29 years old, while the oldest is 77.

- **Resting Blood Pressure (trtbps):**
  - Average resting blood pressure is approximately 131.69 mm Hg, with a standard deviation of around 17.60 mm Hg. This is higher than the normal range (below 120/80 mm Hg).
  - The minimum resting blood pressure is 94 mm Hg, and the maximum is 200 mm Hg.

- **Serum Cholesterol (chol):**
  - Average serum cholesterol is approximately 246.69 mg/dl, exceeding the normal limit (below 200 mg/dl)
  - The minimum serum cholesterol level is 126 mg/dl, and the maximum is 564 mg/dl.

- **Maximum Heart Rate Achieved (thalach):**
  - The average maximum heart rate achieved is approximately 149.61 beats per minute, with a standard deviation of around 22.88 beats per minute.
  - The lowest maximum heart rate achieved is 71 beats per minute, and the highest is 202 beats per minute.

- **ST Depression Induced by Exercise (oldpeak):**
  - The average ST depression induced by exercise is approximately 1.04, with a standard deviation of around 1.16.
  - The minimum ST depression is 0, and the maximum is 6.2.
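
One common heuristic for making "abnormal" concrete is Tukey's 1.5 × IQR rule. It is not part of the original analysis and is shown here only as a sketch. The function name `tukey_fences` is hypothetical, and the quartiles are copied from the `describe()` output above.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes copied from the describe() output above.
summary = {
    'chol': {'min': 126.0, 'q1': 211.0, 'q3': 275.0, 'max': 564.0},
    'oldpeak': {'min': 0.0, 'q1': 0.0, 'q3': 1.6, 'max': 6.2},
}

flags = {}
for name, s in summary.items():
    lower, upper = tukey_fences(s['q1'], s['q3'])
    # A column is flagged when its extreme value lies outside the fences.
    flags[name] = {
        'high_outlier': s['max'] > upper,
        'low_outlier': s['min'] < lower,
    }

print(flags)
```

For `chol` the interquartile range is 275 - 211 = 64, so the upper fence is 275 + 1.5 × 64 = 371. The maximum of 564 mg/dl lies well above it, which suggests at least one unusually high value under this rule.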

#### 2.4.5. Categorical columns - How are values distributed?
- First, we will retrieve columns with categorical types.

- Next, we will calculate the missing ratio and take a look at their different values.

In [20]:
categorical_columns = heart_df.select_dtypes(include='object')
categorical_columns.columns

Index(['sex', 'cp', 'fbs', 'restecg', 'exang', 'slp', 'ca', 'thal'], dtype='object')

**Missing ratio**

In [21]:
missing_ratio_categorical = (heart_df[categorical_columns.columns] == 'nan').mean()
missing_ratio_categorical

sex        0.000000
cp         0.000000
fbs        0.000000
restecg    0.000000
exang      0.000000
slp        0.000000
ca         0.013201
thal       0.006601
dtype: float64
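
These ratios can be checked by hand against the six rows listed earlier that contained "?": rows 166, 192, 287 and 302 lack `ca`, and rows 87 and 266 lack `thal`. A short sketch of the arithmetic:

```python
total_rows = 303

# Counts read off the rows that contained "?" earlier in this section.
missing_counts = {'ca': 4, 'thal': 2}

formatted = {}
for name, count in missing_counts.items():
    ratio = count / total_rows
    formatted[name] = f"{ratio:.6f} ({ratio * 100:.2f}%)"
    print(name, formatted[name])

# ca 0.013201 (1.32%)
# thal 0.006601 (0.66%)
```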

In [22]:
# astype(str) stored missing values as the string 'nan'; turn them back into NaN and drop those rows
heart_df = heart_df.replace('nan', np.nan).dropna()
categorical_columns = heart_df.select_dtypes(include='object')
heart_df.shape


(297, 14)

In [23]:
missing_ratio_categorical = (heart_df[categorical_columns.columns] == 'nan').mean()
missing_ratio_categorical

sex        0.0
cp         0.0
fbs        0.0
restecg    0.0
exang      0.0
slp        0.0
ca         0.0
thal       0.0
dtype: float64

**Different values? Show a few**


In [24]:
ob_col_info_df = pd.DataFrame([], index=['num diff value', 'Most appear', 'Min appear'])

for col in categorical_columns.columns:
    diff_value = len(categorical_columns[col].dropna().unique())
    count_value = categorical_columns[col].dropna().value_counts()
    most_ap = count_value.index[0]
    least_ap = count_value.index[-1]

    # Add the information to the DataFrame
    ob_col_info_df[col] = [diff_value, most_ap, least_ap]

ob_col_info_df

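|   | sex | cp | fbs | restecg | exang | slp | ca | thal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| num diff value | 2 | 4.0 | 2 | 3.0 | 2 | 3.0 | 4.0 | 3.0 |
| Most appear | Male | 4.0 | False | 0.0 | No | 1.0 | 0.0 | 3.0 |
| Min appear | Female | 1.0 | True | 1.0 | Yes | 3.0 | 3.0 | 6.0 |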
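
The "Most appear" and "Min appear" rows rely on `value_counts()` returning counts sorted in descending order, so the first index entry is the most frequent value and the last one the least frequent. A minimal sketch on toy data:

```python
import pandas as pd

# Toy data standing in for the `sex` column.
sex = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female'])

counts = sex.value_counts()      # sorted by count, descending
most_common = counts.index[0]    # first entry: highest count
least_common = counts.index[-1]  # last entry: lowest count

# idxmax / idxmin give the same answers without relying on the sort order.
same_most = most_common == counts.idxmax()
same_least = least_common == counts.idxmin()

print(counts.to_dict())  # {'Male': 3, 'Female': 2}
```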


**Are they abnormal?**

Yes

## 3. Frame the problems - Ask meaningful questions

### Question 1: 
#### Benefits of finding the answer:

### Question 2: 
#### Benefits of finding the answer:

## 4. Preprocessing

## 5. Analyzing to answer questions

## 6. Reflection:

## 7. References:
- Data source: Janosi, Andras, Steinbrunn, William, Pfisterer, Matthias, and Detrano, Robert. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X.