# Heart Disease
In this document, we apply the concepts and technologies we have learned to the task of classifying heart disease.

The data comes from the UC Irvine Machine Learning Repository and contains 13 features and 1 target.

## Framing the problem

### Defining the objective in business terms
Our main objective is to create a machine learning model that can classify whether a patient has heart disease, based on clinical features that range from age and resting blood pressure to exercise test results.

Only 14 attributes are used:

1. #3 (age)
2. #4 (sex)
3. #9 (cp): chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
4. #10 (trestbps): resting blood pressure (in mm Hg on admission to the hospital)
5. #12 (chol)
6. #16 (fbs): fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. #19 (restecg): resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. #32 (thalach): maximum heart rate achieved
9. #38 (exang): exercise induced angina (1 = yes; 0 = no)
10. #40 (oldpeak): ST depression induced by exercise relative to rest
11. #41 (slope): the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
12. #44 (ca): number of major vessels (0-3) colored by fluoroscopy
13. #51 (thal): 3 = normal; 6 = fixed defect; 7 = reversible defect
14. #58 (num): the predicted attribute



### How will our solution be used?
Our solution will be part of a desktop or web application that doctors in hospitals can use to see the probability that a patient has heart disease.

We will create an API that receives the patient's information and returns a prediction of whether the patient has heart disease.

The API will be built with FastAPI, which we will use to define the endpoints.

## How would we frame the problem?
This can be framed as an offline, supervised classification problem: our goal is to classify patients using a pre-established, labeled dataset.
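One detail worth noting is that the `num` target in this dataset takes values from 0 to 4 (see the `df.describe()` output later in this notebook), while the objective is a yes/no answer. A common way to handle this, assumed here rather than taken from this notebook, is to treat 0 as "no disease" and any value from 1 to 4 as "disease". The helper name `binarize_target` is hypothetical.

```python
def binarize_target(num_values):
    """Map the 0-4 `num` diagnosis to 0 (no disease) or 1 (disease)."""
    return [1 if int(value) > 0 else 0 for value in num_values]


# `num` values of the first five rows shown by df.head() in this notebook
first_five_num = [0, 2, 1, 0, 0]
binary_labels = binarize_target(first_five_num)
print(binary_labels)  # [0, 1, 1, 0, 0]
```

With the DataFrame used later in the notebook, the equivalent pandas expression would be `(df["num"] > 0).astype(int)`.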

## How should the performance be measured?
Since this is a classification problem, we will build a confusion matrix and compute precision and recall from it. This lets us check that the model gives reasonable answers and see which target classes need improvement.
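As an illustration of these measures, the sketch below counts the four cells of a binary confusion matrix and derives precision and recall from them. The labels are toy values made up for the example, not model output, and the function names are hypothetical. In practice, `sklearn.metrics` provides `confusion_matrix`, `precision_score`, and `recall_score` for the same purpose.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 1 and pred == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def precision_recall(counts):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, made up for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(counts)
print(counts)             # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
print(precision, recall)  # 0.75 0.75
```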

## Is the performance measure aligned with the business objective?

The performance measure is aligned with the business objective: we would rather get false positives than false negatives, since a false negative (a missed diagnosis) can lead to the patient's death. Recall therefore matters most.

## What would be the minimum performance needed to reach the business objective?
At a minimum, the model should keep precision on positive predictions as high as possible and the rate of false negatives as low as possible. We would aim for 95%.
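One way to act on such a target, assuming the 95% is read as a minimum recall (the text does not say which metric it applies to), is to pick the decision threshold on the predicted probabilities accordingly. The sketch below returns the highest threshold that still reaches the target recall. The scores are toy values and the function names are hypothetical.

```python
def recall_at_threshold(y_true, scores, threshold):
    """Recall when every score >= threshold is predicted as positive."""
    positives = sum(1 for truth in y_true if truth == 1)
    true_positives = sum(
        1 for truth, score in zip(y_true, scores)
        if truth == 1 and score >= threshold
    )
    return true_positives / positives if positives else 0.0


def highest_threshold_meeting_recall(y_true, scores, target_recall=0.95):
    """Return the highest observed score usable as threshold, or None."""
    # Recall can only grow as the threshold drops, so the first hit
    # when scanning from high to low is the highest valid threshold.
    for threshold in sorted(set(scores), reverse=True):
        if recall_at_threshold(y_true, scores, threshold) >= target_recall:
            return threshold
    return None


# Toy labels and predicted probabilities, made up for illustration only
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.6, 0.55, 0.3, 0.2, 0.1]

threshold = highest_threshold_meeting_recall(y_true, scores, target_recall=0.95)
print(threshold)  # 0.3
```

Lowering the threshold this way trades precision for recall, which matches the stated preference for false positives over false negatives.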

## Is human expertise available?
Yes. The measurements should be taken by a medical professional so that they are as precise as possible and are not noisy or incorrectly performed.

## Get the data

We can download the data automatically from the UCI repository through its Python package, `ucimlrepo`, using the code below and the id of the dataset.

In [4]:
import pandas as pd
import numpy as np

In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 

Now we can look at the data itself, including its first 5 rows, summary statistics, and column info.

In [23]:
heart_disease.data

{'ids': None,
 'features':      age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
 0     63    1   1       145   233    1        2      150      0      2.3   
 1     67    1   4       160   286    0        2      108      1      1.5   
 2     67    1   4       120   229    0        2      129      1      2.6   
 3     37    1   3       130   250    0        0      187      0      3.5   
 4     41    0   2       130   204    0        2      172      0      1.4   
 ..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
 298   45    1   1       110   264    0        0      132      0      1.2   
 299   68    1   4       144   193    1        0      141      0      3.4   
 300   57    1   4       130   131    0        0      115      1      1.2   
 301   57    0   2       130   236    0        2      174      0      0.0   
 302   38    1   3       138   175    0        0      173      0      0.0   
 
      slope   ca  thal  
 0        3  0.0   6.0 

In [38]:
# Copy the data into a df
df = heart_disease.data.original.copy()
columns_names = heart_disease.data.headers.tolist()

column_target = columns_names.pop(-1)
columns_features = columns_names.copy()

In [39]:
print(f"These are the feature columns {columns_features}")
print(f"This is the target column {column_target}")

These are the feature columns ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
This is the target column num


Next, we will study each attribute of the data.

In [40]:
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0


In [42]:
# Getting numerical statistics of our data
df.describe()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,299.0,301.0,303.0
mean,54.438944,0.679868,3.158416,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,1.60066,0.672241,4.734219,0.937294
std,9.038662,0.467299,0.960126,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,0.616226,0.937438,1.939706,1.228536
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,3.0,0.0
50%,56.0,1.0,3.0,130.0,241.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0,3.0,0.0
75%,61.0,1.0,4.0,140.0,275.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0,2.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0,4.0


In [43]:
# Getting info about the columns and their data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
 13  num       303 non-null    int64  
dtypes: float64(3), int64(11)
memory usage: 33.3 KB


The data contains 303 entries and 14 columns: 13 features and one target. The `ca` and `thal` columns have missing values (4 and 2, respectively).
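The missing values can also be counted directly instead of being read off `df.info()`. The sketch below uses a small toy frame with made-up values so that it runs on its own. The helper name `missing_value_report` is hypothetical, and dropping incomplete rows is only one possible way to handle the gaps, not necessarily the one used later.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame):
    """Return the number of missing values for each column that has any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame with made-up values, for illustration only
toy = pd.DataFrame(
    {
        "age": [63, 67, 41],
        "ca": [0.0, np.nan, 1.0],
        "thal": [6.0, 3.0, np.nan],
    }
)

report = missing_value_report(toy)
print(report)

# One simple option: keep only the rows where both columns are present
complete_rows = toy.dropna(subset=["ca", "thal"])
print(len(complete_rows))
```

Applied to the notebook's `df`, `missing_value_report(df)` should list `ca` and `thal`, in line with the non-null counts reported by `df.info()`.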