# Project

This project builds a machine learning model, using Python-based machine learning libraries, to predict whether a patient has heart disease from their medical attributes.

## Problem Definition

This is a binary classification problem: given a patient's clinical parameters, classify whether or not they have heart disease.

## About Data

This is a multivariate dataset, meaning each record is described by several separate statistical variables. The original database includes 76 attributes, but all published studies use a subset of 14 of them. The Cleveland database is the only one that has been used by ML researchers to date. The main task on this dataset is to predict, from a patient's attributes, whether that person has heart disease. A secondary, exploratory task is to analyze the dataset for insights that help in understanding the problem.

- **age:** age of the patient in years; integer
- **sex:** sex of the patient (male or female); categorical
- **cp:** chest pain type; categorical
    - 0: Typical angina: chest pain related to decreased blood supply to the heart.
    - 1: Atypical angina: chest pain not related to the heart.
    - 2: Non-anginal pain: typically esophageal spasms (non-heart related).
    - 3: Asymptomatic: chest pain not showing signs of disease.
- **trestbps:** resting blood pressure on admission to the hospital, in mm Hg; integer
- **chol:** serum cholesterol in mg/dl; values above 200 are a cause for concern; integer
- **fbs:** whether fasting blood sugar > 120 mg/dl; categorical, values [1: true, 0: false]
- **restecg:** resting electrocardiographic results; categorical
    - 0: normal
    - 1: ST-T wave abnormality
    - 2: probable or definite left ventricular hypertrophy by Estes' criteria
- **thalach:** maximum heart rate achieved; integer
- **exang:** exercise-induced angina; categorical, values [1: yes, 0: no]
- **oldpeak:** ST depression induced by exercise relative to rest; float
- **slope:** slope of the peak exercise ST segment; categorical, values [flat, downsloping, upsloping]
- **ca:** number of major vessels (0-3) coloured by fluoroscopy; integer
- **thal:** categorical, values [normal, fixed defect, reversible defect]
- **target:** whether the patient has heart disease; categorical
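
The integer codes in the attribute list can be mapped to readable labels during exploration. The following is a minimal sketch, assuming the code-to-label mappings listed above for `cp` and `restecg`. The names `CP_LABELS`, `RESTECG_LABELS`, and `decode` are illustrative and not part of the project, and the three sample rows are copied from the first rows of the dataset.

```python
import pandas as pd

# Code-to-label mappings taken from the attribute list above.
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}
RESTECG_LABELS = {
    0: "normal",
    1: "ST-T abnormality",
    2: "left ventricular hypertrophy",
}


def decode(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with cp and restecg codes replaced by labels."""
    out = df.copy()
    out["cp"] = out["cp"].map(CP_LABELS)
    out["restecg"] = out["restecg"].map(RESTECG_LABELS)
    return out


# cp and restecg values from the first three rows of the dataset.
sample = pd.DataFrame({"cp": [3, 2, 1], "restecg": [0, 1, 0]})
decoded = decode(sample)
print(decoded)
```

Working on a copy keeps the numeric codes available in the original frame, which is what a model would typically be trained on.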


## Evaluation Metric

The initial goal is for the model to reach 95% accuracy. This target may be revised depending on how the models perform and on other metrics such as AUC-ROC, the confusion matrix, and the classification report.
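
To make the accuracy target concrete, here is a minimal sketch that computes the confusion-matrix counts and the accuracy by hand and compares the result with the 95% goal. The labels and predictions are hypothetical, not results from this project, and the sketch assumes `1` marks a patient with heart disease. In practice, `sklearn.metrics` provides `accuracy_score`, `confusion_matrix`, `roc_auc_score`, and `classification_report` for the same purpose.

```python
from typing import Sequence, Tuple

TARGET_ACCURACY = 0.95  # initial goal stated above


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary labels, assuming 1 = heart disease."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tn + fp + fn + tp)


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={acc:.2f}, meets target: {acc >= TARGET_ACCURACY}")
```

Accuracy reduces all four counts to one number, while the confusion matrix keeps false negatives (missed cases) separate from false positives, which is one reason to look at both.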

## Imports - tools needed for the project

```python
# Data handling
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
```



## Data Exploration

```python
# Load the dataset and preview the first five rows
data = pd.read_csv("./heart-disease.csv")
data.head()
```

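| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |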
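
The `Unnamed: 0` column in the output above typically appears when a CSV was written together with its row index and is then read back without naming an index column. Below is a minimal sketch of one way to handle it, assuming the first column of `heart-disease.csv` is such a saved index with an empty header field. The inline CSV text stands in for the file and contains only the first two rows shown above.

```python
import io

import pandas as pd

# Stand-in for ./heart-disease.csv: header plus two rows from the output above.
# Assumption: the first header field is empty, which pandas shows as "Unnamed: 0".
csv_text = (
    ",age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1\n"
    "1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1\n"
)

# Default read: the saved index becomes a regular column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns[0])
print(data.shape)
print(data.dtypes["oldpeak"])
```

With the index column handled this way, the frame holds only the 14 attributes described in the data section, so the saved row numbers cannot be picked up as a feature by accident.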
