
# Predicting Heart Disease
 
Heart disease is the number one cause of death worldwide, so if you're looking to use data science for good, you've come to the right place. To learn how to prevent heart disease, we must first learn to detect it reliably.
Our dataset comes from a heart disease study whose data has been publicly available for many years. The study collects various measurements of patient health and cardiovascular statistics, and anonymizes patient identities.
 
## About
Preventing heart disease is important. Good data-driven systems for predicting heart disease can improve the entire research and prevention process, making sure that more people can live healthy lives.
In the United States, the Centers for Disease Control and Prevention is a good resource for information about heart disease. According to their website:
•	About 610,000 people die of heart disease in the United States every year, which is 1 in every 4 deaths.
•	Heart disease is the leading cause of death for both men and women. More than half of the deaths due to heart disease in 2009 were in men.
•	Coronary heart disease (CHD) is the most common type of heart disease, killing over 370,000 people annually.
•	Every year about 735,000 Americans have a heart attack. Of these, 525,000 are a first heart attack and 210,000 happen in people who have already had a heart attack.
•	Heart disease is the leading cause of death for people of most ethnicities in the United States, including African Americans, Hispanics, and whites. For American Indians or Alaska Natives and Asians or Pacific Islanders, heart disease is second only to cancer.
 
## Problem description
Your goal is to predict the binary class heart_disease_present, which represents whether or not a patient has heart disease:
•	0 represents no heart disease present
•	1 represents heart disease present

 
## Dataset

There are 14 columns in the dataset, where the patient_id column is a unique and random identifier. The remaining 13 features are described in the section below.
•	slope_of_peak_exercise_st_segment (type: int): the slope of the peak exercise ST segment, an electrocardiography readout indicating the quality of blood flow to the heart
•	thal (type: categorical): results of thallium stress test measuring blood flow to the heart, with possible values normal, fixed_defect, reversible_defect
•	resting_blood_pressure (type: int): resting blood pressure
•	chest_pain_type (type: int): chest pain type (4 values)
•	num_major_vessels (type: int): number of major vessels (0-3) colored by fluoroscopy
•	fasting_blood_sugar_gt_120_mg_per_dl (type: binary): fasting blood sugar > 120 mg/dl
•	resting_ekg_results (type: int): resting electrocardiographic results (values 0,1,2)
•	serum_cholesterol_mg_per_dl (type: int): serum cholesterol in mg/dl
•	oldpeak_eq_st_depression (type: float): oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
•	sex (type: binary): 0: female, 1: male
•	age (type: int): age in years
•	max_heart_rate_achieved (type: int): maximum heart rate achieved (beats per minute)
•	exercise_induced_angina (type: binary): exercise-induced chest pain (0: False, 1: True)
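
Of the features listed above, `thal` is the only one stored as text, and most scikit-learn estimators expect numeric input. The following is a minimal sketch of one way to handle it, assuming one-hot encoding with `pd.get_dummies`; the notebook itself may use a different encoding. The `sample` frame and its `age` values are illustrative only and are not rows from the dataset.

```python
import pandas as pd

# Toy frame using the three documented `thal` categories (illustrative values only)
sample = pd.DataFrame({
    "thal": ["normal", "fixed_defect", "reversible_defect", "normal"],
    "age": [45, 54, 77, 40],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(sample, columns=["thal"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column prefixed with `thal_`, so no artificial ordering is implied between `normal`, `fixed_defect`, and `reversible_defect`.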
 
## Feature data example

Here's one example row from the dataset so that you can see the kinds of values to expect. Some are binary, some are integers, some are floats, and some are categorical. There are no missing values.

| field | value |
| --- | --- |
| slope_of_peak_exercise_st_segment | 2 |
| thal | normal |
| resting_blood_pressure | 125 |
| chest_pain_type | 3 |
| num_major_vessels | 0 |
| fasting_blood_sugar_gt_120_mg_per_dl | 1 |
| resting_ekg_results | 2 |
| serum_cholesterol_mg_per_dl | 245 |
| oldpeak_eq_st_depression | 2.4 |
| sex | 1 |
| age | 51 |
| max_heart_rate_achieved | 166 |
| exercise_induced_angina | 0 |
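
The claims about value types and missing values are easy to check in pandas. The sketch below rebuilds the example row above as a one-row DataFrame (the name `row` is just for illustration) and runs the same checks you could later run on the full feature table once it is loaded.

```python
import pandas as pd

# The example row from the table above, as a one-row DataFrame
row = pd.DataFrame([{
    "slope_of_peak_exercise_st_segment": 2,
    "thal": "normal",
    "resting_blood_pressure": 125,
    "chest_pain_type": 3,
    "num_major_vessels": 0,
    "fasting_blood_sugar_gt_120_mg_per_dl": 1,
    "resting_ekg_results": 2,
    "serum_cholesterol_mg_per_dl": 245,
    "oldpeak_eq_st_depression": 2.4,
    "sex": 1,
    "age": 51,
    "max_heart_rate_achieved": 166,
    "exercise_induced_angina": 0,
}])

# Count missing values per column; every count should be zero
missing = row.isna().sum()
print(missing.sum())

# Non-numeric columns are the ones that need encoding before modelling
categorical = row.select_dtypes(exclude="number").columns.tolist()
print(categorical)
```

Running `isna().sum()` and `select_dtypes` on the full table is a cheap sanity check before any modelling.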



In [1]:
# Import all the tools we need

# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# we want our plots to appear inside the notebook
%matplotlib inline 

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay

In [3]:
values = pd.read_csv("https://raw.githubusercontent.com/21Ovi/DataScience-DataSets/main/PRCP-1016-HeartDieseasePred/Data/values.csv")
labels = pd.read_csv("https://raw.githubusercontent.com/21Ovi/DataScience-DataSets/main/PRCP-1016-HeartDieseasePred/Data/labels.csv")

In [4]:
values.head()

Unnamed: 0,patient_id,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina
0,0z64un,1,normal,128,2,0,0,2,308,0.0,1,45,170,0
1,ryoo3j,2,normal,110,3,0,0,0,214,1.6,0,54,158,0
2,yt1s1x,1,normal,125,4,3,0,2,304,0.0,1,77,162,1
3,l2xjde,1,reversible_defect,152,4,0,0,0,223,0.0,1,40,181,0
4,oyt4ek,3,reversible_defect,178,1,0,0,2,270,4.2,1,59,145,0


In [5]:
labels.head()

Unnamed: 0,patient_id,heart_disease_present
0,0z64un,0
1,ryoo3j,0
2,yt1s1x,1
3,l2xjde,1
4,oyt4ek,0
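
Because features and labels live in separate files that share `patient_id`, they need to be joined before modelling. The sketch below shows one possible way to do that, using only the five rows printed above and a subset of columns; `values_head`, `labels_head`, `df`, `X`, and `y` are illustrative names, not ones the notebook is known to use.

```python
import pandas as pd

# First five rows of values.csv and labels.csv as shown above (subset of columns)
values_head = pd.DataFrame({
    "patient_id": ["0z64un", "ryoo3j", "yt1s1x", "l2xjde", "oyt4ek"],
    "thal": ["normal", "normal", "normal", "reversible_defect", "reversible_defect"],
    "age": [45, 54, 77, 40, 59],
    "sex": [1, 0, 1, 1, 1],
})
labels_head = pd.DataFrame({
    "patient_id": ["0z64un", "ryoo3j", "yt1s1x", "l2xjde", "oyt4ek"],
    "heart_disease_present": [0, 0, 1, 1, 0],
})

# Join on patient_id; validate="one_to_one" raises if an ID appears twice on either side
df = values_head.merge(labels_head, on="patient_id", how="inner", validate="one_to_one")

# Split into features and target; the random patient_id carries no predictive signal
X = df.drop(columns=["patient_id", "heart_disease_present"])
y = df["heart_disease_present"]

# Class balance of the target (0 = no heart disease, 1 = heart disease present)
print(y.value_counts().to_dict())
```

Merging on the key rather than relying on row order guards against the two files being sorted differently.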
