# <center>FINAL PROJECT: Heart Attack Analysis & Prediction Dataset</center>

| FULL NAME             | ID NUMBER |
| :-----------          |     :----:|
| Nguyễn Phương Thảo  | 21120336  |
| Phan Cao Nguyên| 21120299  |
| Lê Trần Minh Khuê  | 21120279  |

---

## Contents
1. [Introduction](#1-introduction)  
    1.1. [Dataset information](#11-dataset-information)  
    1.2. [Attribute information](#12-attribute-information)  
    1.3. [Why we select this dataset](#13-why-we-select-this-dataset)  
2. [Data exploration](#2-data-exploration)  
    2.1. [Import lib](#21-import-lib)  
    2.2. [Downloading data](#22-downloading-data)  
    2.3. [How many rows and columns](#23-how-many-rows-and-columns)  
    2.4. [Rows exploration](#24-rows-exploration)  
    2.5. [Columns exploration](#25-columns-exploration)  
3. [Frame the problem](#3-frame-the-problems---ask-meaningful-questions)  
4. [Preprocessing](#4-preprocessing)  
5. [Analyzing to answer questions](#5-analyzing-to-answer-questions)  
6. [Reflection](#6-reflection)  
7. [References](#7-references)  

## 1. Introduction

### 1.1. Dataset information

**Subject of this data:**
This project uses the `Cleveland Heart Disease dataset` from the UCI Machine Learning Repository. It contains records for 303 individuals and is available on the repository site under "Data Folder".  

**Acknowledgements**

- ***Creators:***
    - Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
    - University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
    - University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
    - V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

- ***Donor:***
    - David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

**License:**

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.

**Source:**

**Original data:** [Dataset from UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)

*In this project, we will use the processed version of the data posted in the following Kaggle dataset.*
[Heart Attack Analysis & Prediction Dataset](https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset?resource=download&select=heart.csv)   

### 1.2. Attribute information

Cleveland Heart Disease Dataset (description of attributes from the data source webpage):
- `processed.cleveland.data`:
    - `age`: Age of the person (in years - integer value)
    - `sex`: Gender of the person (1 = male, 0 = female)
    - `cp`: Chest Pain type 
        - Value 1: typical angina
        - Value 2: atypical angina
        - Value 3: non-anginal pain
        - Value 4: asymptomatic
    - `trtbps`: resting blood pressure (in mmHg on admission to the hospital - integer value)
    - `chol`: serum cholesterol (in mg/dl fetched via BMI sensor - integer value)
    - `fbs`: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
    - `restecg`: resting electrocardiographic results
        - Value 0: normal
        - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
        - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    - `thalach`: maximum heart rate achieved (integer value)
    - `exang`: exercise induced angina (1 = yes; 0 = no)
    - `oldpeak`: previous peak - ST depression induced by exercise relative to rest (float value)
    - `slp`: the slope of the peak exercise ST segment
        - Value 1: upsloping
        - Value 2: flat
        - Value 3: downsloping
    - `ca`: number of major vessels (0 - 3)
    - `thal`: thal rate
        - 3 = normal
        - 6 = fixed defect
        - 7 = reversible defect
    - `output`: target variable - diagnosis of heart disease (angiographic disease status, integer value), based on diameter narrowing in any major vessel (attributes 59 through 68 are vessels)
        - Value 0: < 50% diameter narrowing
        - Value 1: > 50% diameter narrowing
        
#### NOTE:


### 1.3. Why we select this dataset?

Cardiovascular diseases, including heart disease, rank among the leading causes of death worldwide. The prevalence of conditions such as angina, myocardial infarction, and stroke makes them a crucial focus of medical research. Understanding the factors that contribute to heart disease and being able to predict the risk of its occurrence are paramount for improving intervention and prevention strategies.

The dataset contains valuable information about various factors and clinical parameters related to heart disease. It includes both numerical and categorical features, such as age, sex, cholesterol levels, and electrocardiographic measurements, making it suitable for analyzing and predicting heart disease.

## 2. Data Exploration

### 2.1. IMPORT LIB

In [1]:
# Import libraries

# Downloading data
import os
import urllib.request
# Data handling
import pandas as pd
import numpy as np
import re
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import colors
from matplotlib.colors import ListedColormap
%matplotlib inline


### 2.2. Downloading data

In [2]:
# # Create a folder if it does not exist
# data_folder = "Data"
# os.makedirs(data_folder, exist_ok=True)

# # Path to the "processed.cleveland.data" file in the "Data" folder
# file_path = os.path.join(data_folder, "processed.cleveland.data")

# # Check if the file exists
# if not os.path.exists(file_path):
#     # If not, download it from the URL and save it to the "Data" folder
#     url = "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
#     urllib.request.urlretrieve(url, file_path)


In [3]:
# # Loading data into dataframe
# heart_df = pd.read_csv(file_path,header=None)
heart_df = pd.read_csv("Data/heart.csv")
heart_df.head()

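|   | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|-----|-----|----|--------|------|-----|---------|----------|------|---------|-----|-----|-------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |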


The Kaggle file names some columns differently from the original attribute names (`thalachh`, `exng`, `caa`, `thall`, `output`), so we rename the columns to the names used in the rest of this notebook.

In [4]:
# Rename columns
heart_df.columns = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slp', 'ca', 'thal', 'num']
heart_df.head()

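|   | age | sex | cp | trtbps | chol | fbs | restecg | thalach | exang | oldpeak | slp | ca | thal | num |
|---|-----|-----|----|--------|------|-----|---------|---------|-------|---------|-----|----|------|-----|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |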
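
Assigning a full list to `heart_df.columns` relies on the column order of the file staying the same. A minimal sketch of a name-based alternative is shown here; the mapping `KAGGLE_TO_NOTEBOOK` and the one-row frame `toy_df` are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping: Kaggle column name -> name used in this notebook.
KAGGLE_TO_NOTEBOOK = {
    "thalachh": "thalach",
    "exng": "exang",
    "caa": "ca",
    "thall": "thal",
    "output": "num",
}

# Toy frame with the Kaggle header and the first row of the file.
toy_df = pd.DataFrame(
    [[63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1]],
    columns=["age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
             "thalachh", "exng", "oldpeak", "slp", "caa", "thall", "output"],
)

# Fail early if the file header is not what we expect.
missing = set(KAGGLE_TO_NOTEBOOK) - set(toy_df.columns)
assert not missing, f"unexpected header, missing columns: {missing}"

# rename() only touches the listed columns and ignores column order.
renamed_df = toy_df.rename(columns=KAGGLE_TO_NOTEBOOK)
print(list(renamed_df.columns))
```

With this approach a reordered or extended file would either be renamed correctly or fail the header check, instead of silently receiving wrong labels.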


### 2.3. How many rows and columns?
We read the number of rows and columns from `heart_df.shape` and store them in two variables, `num_rows` and `num_cols`.

In [5]:
num_rows, num_cols = heart_df.shape
print('Number of rows:',num_rows)
print('Number of columns:',num_cols)

Number of rows: 303
Number of columns: 14


### 2.4. Rows exploration

In [6]:
heart_df.sample()

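|    | age | sex | cp | trtbps | chol | fbs | restecg | thalach | exang | oldpeak | slp | ca | thal | num |
|----|-----|-----|----|--------|------|-----|---------|---------|-------|---------|-----|----|------|-----|
| 40 | 51 | 0 | 2 | 140 | 308 | 0 | 0 | 142 | 0 | 1.5 | 2 | 1 | 2 | 1 |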


#### 2.4.1. Meaning of each row
Each row in the dataset represents an individual, and the values in each column provide information about various health-related factors for that person. The goal is often to analyze these features to predict the likelihood or presence of heart disease.

#### 2.4.2. Are there duplicated rows?
We use pandas `duplicated()` to check for duplicate rows.
- Store the number of duplicated rows in the variable `have_duplicated_rows`.
- If `have_duplicated_rows` > 0:
    - Print the duplicated rows, then drop them.
    - Recount and assert that `have_duplicated_rows` is now 0, i.e. no duplicated rows remain.
- Otherwise, proceed to the next part.

In [7]:
have_duplicated_rows = heart_df.duplicated().sum()
have_duplicated_rows

1

In [8]:
if have_duplicated_rows > 0: 
    duplicates = heart_df[heart_df.duplicated()]
    print(duplicates) 
    
    # remove duplicates
    heart_df.drop_duplicates(inplace=True)
    
    # update 'have_duplicated_rows'
    have_duplicated_rows = heart_df.duplicated().sum()

    # Sanity check: no duplicated rows should remain
    assert have_duplicated_rows == 0
    

     age  sex  cp  trtbps  chol  fbs  restecg  thalach  exang  oldpeak  slp  \
164   38    1   2     138   175    0        1      173      0      0.0    2   

     ca  thal  num  
164   4     2    1  
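
By default `duplicated()` flags only the later copies of a repeated row, which is why a single row is printed above. The sketch below, on a made-up three-row frame `toy_df` (an assumption for illustration only), shows how `keep=False` lists every copy and how `drop_duplicates()` treats the index.

```python
import pandas as pd

# Toy frame: rows 0 and 2 are identical.
toy_df = pd.DataFrame({
    "age": [38, 51, 38],
    "chol": [175, 308, 175],
    "num": [1, 1, 1],
})

# Default keep="first": only the second and later copies are flagged.
later_copies = toy_df[toy_df.duplicated()]

# keep=False: every member of a duplicated group is flagged.
all_copies = toy_df[toy_df.duplicated(keep=False)]

# drop_duplicates() keeps the first copy and the original row labels.
deduplicated = toy_df.drop_duplicates()

print(later_copies.index.tolist())
print(all_copies.index.tolist())
print(deduplicated.index.tolist())
```

Because the original row labels are kept, the index is no longer contiguous after a row is dropped; `reset_index(drop=True)` can renumber the rows if that matters for later steps.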


### 2.5. Columns exploration


#### 2.5.1. Meaning of each column
- `age`: Age of the person (in years - integer value)
- `sex`: Gender of the person (1 = male, 0 = female)
- `cp`: Chest Pain type (0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic)
- `trtbps`: resting blood pressure (in mmHg on admission to the hospital - integer value)
- `chol`: serum cholesterol (in mg/dl fetched via BMI sensor - integer value)
- `fbs`: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- `restecg`: resting electrocardiographic results (0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy)
- `thalach`: maximum heart rate achieved (integer value)
- `exang`: exercise induced angina (1 = yes; 0 = no)
- `oldpeak`: previous peak - ST depression induced by exercise relative to rest
- `slp`: the slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
- `ca`: number of major vessels (0 - 3)
- `thal`: thal rate (3 = normal; 6 = fixed defect; 7 = reversible defect), mapped to 0, 1, 2.
- `num`: target variable - diagnosis of heart disease (integer value) (0: < 50% diameter narrowing; 1: > 50% diameter narrowing)
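
Since most columns hold integer codes, a readable label column can help when inspecting rows or building plots. This is a minimal sketch using the code tables listed above for `sex` and `cp`; the dictionaries `SEX_LABELS` and `CP_LABELS` and the frame `toy_df` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Code tables taken from the column descriptions above.
SEX_LABELS = {0: "female", 1: "male"}
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}

# Toy frame holding the sex and cp codes of three rows.
toy_df = pd.DataFrame({"sex": [1, 1, 0], "cp": [3, 2, 1]})

# Series.map looks each code up in the dictionary;
# codes missing from the dictionary become NaN.
labelled = toy_df.assign(
    sex_label=toy_df["sex"].map(SEX_LABELS),
    cp_label=toy_df["cp"].map(CP_LABELS),
)
print(labelled)
```

Any `NaN` in a label column would point to a code that the description does not cover, which is a cheap consistency check on the data.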

#### 2.5.2. Current data types: are any inappropriate?
We print a summary of the DataFrame with `info()` and store the data type of each column in a Series named `col_dtypes`.

In [9]:
heart_df.info()  # prints the summary below and returns None
col_dtypes = heart_df.dtypes  # Series: column name -> dtype

<class 'pandas.core.frame.DataFrame'>
Index: 302 entries, 0 to 302
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   age      302 non-null    int64  
 1   sex      302 non-null    int64  
 2   cp       302 non-null    int64  
 3   trtbps   302 non-null    int64  
 4   chol     302 non-null    int64  
 5   fbs      302 non-null    int64  
 6   restecg  302 non-null    int64  
 7   thalach  302 non-null    int64  
 8   exang    302 non-null    int64  
 9   oldpeak  302 non-null    float64
 10  slp      302 non-null    int64  
 11  ca       302 non-null    int64  
 12  thal     302 non-null    int64  
 13  num      302 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 35.4 KB
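
Every column except `oldpeak` is stored as `int64`, although the descriptions above present several of them as category codes rather than measurements. The sketch below shows one possible way to make that distinction explicit. The list `CATEGORICAL_COLS` is our reading of the attribute descriptions (treating `ca` as categorical is an assumption), and `toy_df` holds only the first two rows of the data.

```python
import pandas as pd

# Assumed split: columns described as codes or flags are treated as categorical.
CATEGORICAL_COLS = ["sex", "cp", "fbs", "restecg", "exang", "slp", "ca", "thal", "num"]

toy_df = pd.DataFrame(
    [[63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1],
     [37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1]],
    columns=["age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slp", "ca", "thal", "num"],
)

# Convert the coded columns to the pandas "category" dtype.
typed_df = toy_df.astype({col: "category" for col in CATEGORICAL_COLS})

# select_dtypes then separates measurements from codes.
numeric_cols = typed_df.select_dtypes(include="number").columns.tolist()
categorical_cols = typed_df.select_dtypes(include="category").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```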


#### 2.5.3. Numeric columns - How are values distributed?


**Missing ratio**

**Min, max, quantiles**

**Are they abnormal?**

#### 2.5.4. Categorical columns - How are values distributed?
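
For coded columns, the checks under this heading mostly come down to counting values and comparing them with the codes the description allows. A minimal sketch follows; `EXPECTED_VALUES` covers just two columns as an example, and `toy_df` is made-up data that includes a `ca` value of 4, as seen in the duplicated row printed earlier even though the description lists 0 - 3.

```python
import pandas as pd

# Allowed codes per column, taken from the column descriptions (two columns as an example).
EXPECTED_VALUES = {"sex": {0, 1}, "ca": {0, 1, 2, 3}}

# Toy data; the last row carries ca = 4, outside the documented range.
toy_df = pd.DataFrame({"sex": [1, 1, 0, 1], "ca": [0, 0, 1, 4]})

# Missing ratio per column, in percent.
missing_ratio = toy_df.isna().mean() * 100

value_counts_by_col = {}
unexpected_by_col = {}
for col, allowed in EXPECTED_VALUES.items():
    # Frequency of each distinct code.
    value_counts_by_col[col] = toy_df[col].value_counts()
    # Codes present in the data but absent from the description.
    observed = set(toy_df[col].dropna().unique().tolist())
    unexpected_by_col[col] = sorted(observed - allowed)

print(missing_ratio)
print(unexpected_by_col)
```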


**Missing ratio**

**Different values? Show a few**

**Are they abnormal?**

## 3. Frame the problems - Ask meaningful questions

### Question 1: 
#### Benefits of finding the answer:

### Question 2: 
#### Benefits of finding the answer:

## 4. Preprocessing

## 5. Analyzing to answer questions

## 6. Reflection:

## 7. References:
- Data source: Janosi, Andras; Steinbrunn, William; Pfisterer, Matthias; Detrano, Robert (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X