## Introduction

The growth of AI has created value in many fields. Medicine in particular has made strides in applying machine learning to understand disease patterns and draw useful insights.



## Heart Disease UCI Dataset

The dataset used in this project is a subset of a larger dataset with 76 attributes, of which only 14 are used here. It was obtained from the UCI Machine Learning Repository (https://archive-beta.ics.uci.edu/ml/datasets/heart+disease#Attributes). The dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which allows sharing and adapting the data for any purpose, provided that appropriate credit is given.

The features used in this dataset are as follows:
1. Age
2. Sex
3. CP (Chest Pain Type)
    * 0 = Typical Angina
    * 1 = Atypical Angina
    * 2 = non-anginal pain
    * 3 = Asymptomatic
4. Trestbps (Resting Blood Pressure in mm Hg)
5. Chol (Serum cholesterol in mg/dl)
6. Fbs (Fasting Blood sugar > 120 mg/dl) 
    * 1 = True 
    * 0 = False
7. Restecg (Resting electrocardiographic results)
    * 0 = normal
    * 1 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria
8. Thalach (Max heart rate achieved)
9. Exang (Exercise induced angina)
    * 1 = yes
    * 0 = no
10. Oldpeak (ST depression induced by exercise relative to rest)
11. Slope (the slope of the peak exercise ST segment)
    * 0 = upsloping
    * 1 = flat
    * 2 = downsloping
12. Thal (A blood disorder called thalassemia)
    * 3 = normal
    * 6 = fixed defect
    * 7 = reversible defect
13. Ca (number of major vessels (0-3) colored by fluoroscopy)
14. Target (diagnosis of heart disease)
    * 1 = disease present
    * 0 = disease absent
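
The integer codes in the list above can be turned back into readable labels with a lookup table. The following is a minimal sketch, assuming the columns hold the integer codes as listed; the dictionaries, the `decode_codes` helper, and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical lookup tables copied from the feature list above.
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}
SLOPE_LABELS = {0: "upsloping", 1: "flat", 2: "downsloping"}


def decode_codes(frame, column, labels):
    """Return a copy of `frame` with the integer codes in `column` replaced by labels."""
    out = frame.copy()
    # Series.map leaves NaN for any code that is missing from `labels`.
    out[column] = out[column].map(labels)
    return out


# Toy data for illustration only, not rows from the real dataset.
toy = pd.DataFrame({"cp": [0, 3, 2], "slope": [2, 1, 0]})
decoded = decode_codes(decode_codes(toy, "cp", CP_LABELS), "slope", SLOPE_LABELS)
print(decoded)
```

Because unmapped codes become `NaN`, a quick `decoded.isnull().sum()` afterwards would reveal any code that the lookup tables do not cover.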

## Importing Libraries

In [6]:
# Libraries for Data Loading and Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

## Loading Data

In [7]:
# Load the dataset from a local CSV file
df = pd.read_csv('data/heart_disease_uci.csv')

# Preview the first five rows
df.head()

|   | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |
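
The feature list describes a binary target, while the `num` column in the preview also takes values such as 2. One way to reconcile the two, assumed here rather than confirmed by the notebook, is to treat any nonzero `num` as disease present. The `add_binary_target` helper and the `target` column name below are hypothetical.

```python
import pandas as pd


def add_binary_target(frame, source="num", target="target"):
    """Return a copy of `frame` with a 0/1 column derived from `source`.

    Assumption: any value above 0 in `source` means disease is present.
    """
    out = frame.copy()
    out[target] = (out[source] > 0).astype(int)
    return out


# The `num` values of the five rows shown in the preview above.
toy = pd.DataFrame({"num": [0, 2, 1, 0, 0]})
with_target = add_binary_target(toy)
print(with_target)
```

On the full DataFrame this would be called as `add_binary_target(df)`, leaving the original `num` column in place for reference.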


## Exploratory Data Analysis

In [8]:
# Check for Missing Values
df.isnull().sum()

id            0
age           0
sex           0
dataset       0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64

Most columns have missing values. The most affected are `ca` (611 of 920 rows), `thal` (486), and `slope` (309).
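
Raw counts are easier to judge as a share of all rows. The sketch below shows one way to tabulate both; the `missing_summary` helper and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(frame) * 100).round(1),
        }
    )
    # Keep only affected columns, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "ca": [0.0, np.nan, np.nan, np.nan],
        "chol": [233.0, np.nan, 229.0, 250.0],
        "age": [63, 67, 67, 37],
    }
)
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the loaded data would give the same counts as `df.isnull().sum()`, with a percentage column alongside.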

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        920 non-null    int64  
 1   age       920 non-null    int64  
 2   sex       920 non-null    object 
 3   dataset   920 non-null    object 
 4   cp        920 non-null    object 
 5   trestbps  861 non-null    float64
 6   chol      890 non-null    float64
 7   fbs       830 non-null    object 
 8   restecg   918 non-null    object 
 9   thalch    865 non-null    float64
 10  exang     865 non-null    object 
 11  oldpeak   858 non-null    float64
 12  slope     611 non-null    object 
 13  ca        309 non-null    float64
 14  thal      434 non-null    object 
 15  num       920 non-null    int64  
dtypes: float64(5), int64(3), object(8)
memory usage: 115.1+ KB
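
In the output above, `fbs` and `exang` are stored as `object` even though the preview shows True/False values. This is typical when a boolean column also contains missing entries. A possible cleanup, sketched here on toy data as an assumption and not a step taken in the original notebook, is to cast them to pandas' nullable `boolean` dtype.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns: booleans mixed with NaN give object dtype.
toy = pd.DataFrame(
    {
        "fbs": [True, False, np.nan],
        "exang": [False, True, np.nan],
    }
)
print(toy.dtypes)

# Cast to the nullable boolean dtype; missing entries become pd.NA.
converted = toy.astype({"fbs": "boolean", "exang": "boolean"})
print(converted.dtypes)
```

The nullable dtype keeps missing entries as `pd.NA` instead of forcing a fill value, so imputation can be decided separately.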


In [10]:
df.shape

(920, 16)