# Various factors contributing to heart diseases 

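This dataset contains a number of clinical variables along with a target condition: whether or not the patient has heart disease. In the original UCI data, the "goal" field refers to the presence of heart disease in the patient and is integer valued from 0 (no presence) to 4. In the version used here it appears as a binary `target` column (0 = no, 1 = yes), as listed below.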

### Attribute Information:

It's a clean, easy-to-understand set of data. However, the meaning of some of the column headers is not obvious. Here's what they mean:

1. age: The person's age in years
1. sex: The person's sex (1 = male, 0 = female)
1. cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
1. trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
1. chol: The person's cholesterol measurement in mg/dl
1. fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
1. restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
1. thalach: The person's maximum heart rate achieved
1. exang: Exercise induced angina (1 = yes; 0 = no)
1. oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
1. slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
1. ca: The number of major vessels (0-3)
1. thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
1. target: Heart disease (0 = no, 1 = yes)
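
The coded values above can be turned into readable labels before plotting. The following is a minimal sketch, assuming the codes listed in this section; the `CODE_LABELS` dictionary, the `decode` helper and the two sample rows are hypothetical names and values introduced only for illustration. The codes in the downloaded CSV should be checked against this list before applying the mapping.

```python
import pandas as pd

# Labels copied from the attribute list above
# (assumption: the CSV uses exactly these codes).
CODE_LABELS = {
    "sex": {1: "male", 0: "female"},
    "fbs": {1: "> 120 mg/dl", 0: "<= 120 mg/dl"},
    "exang": {1: "yes", 0: "no"},
    "target": {0: "no disease", 1: "disease"},
}

def decode(df, labels=CODE_LABELS):
    """Return a copy of df with coded columns replaced by readable labels."""
    out = df.copy()
    for column, mapping in labels.items():
        if column in out.columns:
            out[column] = out[column].map(mapping)
    return out

# Hypothetical rows, only to show the call.
sample = pd.DataFrame(
    {"age": [63, 41], "sex": [1, 0], "fbs": [1, 0], "exang": [0, 1], "target": [1, 0]}
)
decoded = decode(sample)
print(decoded)
```

Working on a copy keeps the numeric codes available for any later computation that needs them.
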
### Acknowledgements
#### Creators:

1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
1. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
1. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
1. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
#### Donor:
1. David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

## Downloading the Dataset

In [None]:
!pip install azureml-opendatasets

In [None]:
!pip install jovian opendatasets --upgrade --quiet

Let's begin by downloading the data, and listing the files within the dataset.

In [None]:

dataset_url = 'https://www.kaggle.com/ronitf/heart-disease-uci'

In [None]:
import opendatasets as od
od.download(dataset_url)

The dataset has been downloaded and extracted.

In [None]:
data_dir = './heart-disease-uci'

In [None]:
import os
os.listdir(data_dir)

Let us save and upload our work to Jovian before continuing.

In [None]:
project_name = "zerotopandas-course-project-starter" # change this (use lowercase letters and hyphens only)

In [None]:
!pip install jovian --upgrade -q

In [None]:
import jovian

In [None]:
jovian.commit(project=project_name)

## Data Preparation and Cleaning

Let us analyze this using pandas, matplotlib and seaborn. 



In [None]:
!pip install pandas --upgrade -q

In [None]:
import pandas as pd

In [None]:
heart_df = pd.read_csv('./heart-disease-uci/heart.csv')
heart_df.head()

In [None]:
heart_df.describe()

In [None]:
heart_df.shape

This dataset has 303 rows and 14 columns.

In [None]:
heart_df.isna().sum()

The output of `isna().sum()` on `heart_df` shows that the dataset does not contain any null values.
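
A slightly richer check reports the share of missing values per column and counts duplicate rows. This is a sketch on a hypothetical toy frame; `missing_report` and `toy` are illustrative names, not part of the dataset or of pandas.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Count and percentage of missing values for every column."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Hypothetical data with one missing age, only to show the output shape.
toy = pd.DataFrame({"age": [63, 41, np.nan, 57], "chol": [233, 204, 250, 354]})
report = missing_report(toy)
print(report)
print("duplicate rows:", toy.duplicated().sum())
```

On the real data the same call would be `missing_report(heart_df)`.
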

In [None]:
import jovian

In [None]:
jovian.commit(project = project_name)

## Exploratory Analysis and Visualization

I'm going to take a look at online guides on how heart disease is diagnosed, and look up some of the terms above.

Diagnosis: Heart disease is diagnosed from a combination of clinical signs and test results. The tests are chosen on the basis of what the physician thinks is going on [[1]](https://www.mayoclinic.org/diseases-conditions/heart-disease/diagnosis-treatment/drc-20353124), ranging from electrocardiograms and cardiac computerized tomography (CT) scans to blood tests and exercise stress tests [[2]](https://www.heartfoundation.org.au/heart-health-education/Medical-tests-for-heart-disease). Looking at information on heart disease risk factors led me to the following: high cholesterol, high blood pressure, diabetes, weight, family history and smoking [[3]](https://www.bhf.org.uk/informationsupport/risk-factors). According to another source [[4]](https://www.heart.org/en/health-topics/heart-attack/understand-your-risks-to-prevent-a-heart-attack), the major factors that can't be changed are **increasing age, male gender and heredity**. Note that **thalassemia**, one of the variables in this dataset, is hereditary. Major factors that can be modified are **smoking, high cholesterol, high blood pressure, physical inactivity, being overweight and having diabetes**. Other factors include **stress, alcohol and poor diet/nutrition**.

I can see no reference to the 'number of major vessels', but given that the definition of heart disease is **"...what happens when your heart's blood supply is blocked or interrupted by a build-up of fatty substances in the coronary arteries"**, it seems logical that **more major vessels is a good thing** and would therefore reduce the probability of heart disease.



Let's begin by importing `matplotlib.pyplot` and `seaborn`.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Let us analyze if our assumptions hold true based on the data that we have.

In [None]:
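# Split the patients by target so the two distributions can be overlaid
heart_df_no = heart_df[heart_df['target'] == 0]
heart_df_yes = heart_df[heart_df['target'] == 1]
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a comparable plot
sns.histplot(heart_df_no["chol"], kde=True, stat="density", label='Heart disease = 0');
sns.histplot(heart_df_yes["chol"], kde=True, stat="density", label='Heart disease = 1');
plt.xlabel('Cholesterol');
plt.legend();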


Here we can observe that patients with higher cholesterol readings have a higher chance of having heart disease.
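
Overlapping density curves can be hard to compare by eye, so a per-group numeric summary is a useful cross-check. The sketch below uses hypothetical cholesterol values in a toy frame; on the real data, `toy` would be replaced by `heart_df`.

```python
import pandas as pd

# Hypothetical values, only to illustrate the groupby summary.
toy = pd.DataFrame({
    "chol": [200, 220, 240, 260, 210, 250],
    "target": [0, 0, 0, 1, 1, 1],
})

# Count, mean and median cholesterol for each target group.
chol_summary = toy.groupby("target")["chol"].agg(["count", "mean", "median"])
print(chol_summary)
```

If the group means and medians are close, the visual difference between the curves may not be meaningful.
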

In [None]:
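# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a comparable plot
sns.histplot(heart_df_no["age"], kde=True, stat="density", label='Heart disease = 0');
sns.histplot(heart_df_yes["age"], kde=True, stat="density", label='Heart disease = 1');
plt.xlabel('Age');
plt.legend();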

From the above distribution plot we can see that a higher age does not by itself mean a higher chance of having heart disease.

In [None]:
jovian.commit(project=project_name)

In [None]:
temp = (heart_df.groupby(['target']))['cp'].value_counts(normalize=True).mul(100).reset_index(name = "percentage");
sns.barplot(x = "target", y = "percentage", hue = "cp", data = temp).set_title("Chest Pain vs Heart Disease");

Bar plot of chest pain type for each target value. We can see that patients who have cp = 0 (typical angina) have a higher chance of not having heart disease.
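
The percentages behind this kind of bar plot can also be read as a table. A minimal sketch using `pd.crosstab` with `normalize="index"`, which gives the share of each chest pain type within each target group; the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical rows, only to show the table layout.
toy = pd.DataFrame({
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
    "cp":     [0, 0, 0, 1, 1, 2, 2, 0],
})

# Each row sums to 100: percentage of every cp value within a target group.
cp_pct = pd.crosstab(toy["target"], toy["cp"], normalize="index").mul(100)
print(cp_pct.round(1))
```

Unlike the `groupby` plus `value_counts` approach, the cross-tabulation also shows combinations that never occur, as zeros.
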

In [None]:
temp = (heart_df.groupby(['target']))['thal'].value_counts(normalize=True).mul(100).reset_index(name = "percentage");
sns.barplot(x = "target", y = "percentage", hue = "thal", data = temp).set_title("Thalassemia vs Heart Disease");

The bar plot of thalassemia vs target shows that patients with non-curable thalassemia have a higher chance of having heart disease.


Let us save and upload our work to Jovian before continuing.

In [None]:
import jovian

In [None]:
jovian.commit(project = project_name)

## Asking and Answering Questions

Let us answer some basic questions about the data using pandas and matplotlib/seaborn.



#### Q1: What is the distribution of heart disease patients among males and females?

In [None]:
sns.countplot(x="target", data=heart_df,hue = 'sex');
plt.title("GENDER distribution of Heart Diseases");

We can see that the data contains more male patients, and that males are also more prone to heart disease.

#### Q2: What is the average age of males and females who have heart disease?

In [None]:
avg_male = round(heart_df[(heart_df.target ==  1) & (heart_df.sex == 1)].age.mean(), 2)
avg_female = round(heart_df[(heart_df.target ==  1) & (heart_df.sex == 0)].age.mean(), 2)

In [None]:
print(f'Average age for male and female who have heart disease is {avg_male} and {avg_female} respectively')

Here we can infer that males have a higher chance of getting heart disease even at a lower age.
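
The two averages can also be computed in a single `groupby`, together with the group sizes, which matter when the groups are unbalanced. This is a sketch on hypothetical rows; on the real data, `toy` would be `heart_df`.

```python
import pandas as pd

# Hypothetical rows, only to illustrate the grouping.
toy = pd.DataFrame({
    "age":    [50, 52, 60, 58, 45, 70],
    "sex":    [1, 1, 0, 0, 1, 0],
    "target": [1, 1, 1, 1, 0, 0],
})

# Mean age and number of patients per sex, restricted to patients with heart disease.
avg_age_by_sex = (
    toy[toy["target"] == 1]
    .groupby("sex")["age"]
    .agg(["mean", "count"])
    .round(2)
)
print(avg_age_by_sex)
```
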

#### Q3: What do the average values of all the features look like for males and females who have heart disease?

In [None]:
# Columns summarized by their mean; every other column is summarized by its mode
mean_columns = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope']

In [None]:
def summarize(df):
    # Build a one-row summary: mean for the measurement columns,
    # most frequent value (first mode) for the remaining columns
    summary = {}
    for column in df.columns:
        if column in mean_columns:
            summary[column] = round(df[column].mean(), 2)
        else:
            summary[column] = df[column].mode().iloc[0]
    return pd.DataFrame([summary])

# Assigning scalars column by column to an empty dataframe can leave NaN values,
# so each summary is built as a dictionary and converted in one step
diseased = heart_df[heart_df.target == 1]
avg_male_df = summarize(diseased[diseased.sex == 1])
avg_female_df = summarize(diseased[diseased.sex == 0])



In [None]:
avg_male_df

In [None]:
avg_female_df

#### Q4: What is the effect of fasting blood sugar > 120 mg/dl?


In [None]:
temp = (heart_df.groupby(['target']))['fbs'].value_counts(normalize=True).mul(100).reset_index(name = "percentage");
sns.barplot(x = "target", y = "percentage", hue = "fbs", data = temp).set_title("Fasting blood sugar vs Heart Disease");

There is no clear relationship between fasting blood sugar > 120 mg/dl and the patient having heart disease.

#### Q5: How does exercise-induced angina affect the chances of a person having heart disease?


In [None]:
temp = (heart_df.groupby(['target']))['exang'].value_counts(normalize=True).mul(100).reset_index(name = "percentage");
sns.barplot(x = "target", y = "percentage", hue = "exang", data = temp).set_title("Exercise induced angina vs Heart Disease");

Therefore, patients who do not have exercise-induced angina have a higher chance of having heart disease.
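
The plot above shows the share of each `exang` value within each target group. The statement itself is about the reverse quantity: the share of patients with heart disease within each `exang` group. Because `target` is coded 0/1, that share is simply the group mean. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows, only to illustrate the calculation.
toy = pd.DataFrame({
    "exang":  [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Percentage of patients with heart disease for each exang value.
disease_rate = toy.groupby("exang")["target"].mean().mul(100)
print(disease_rate)
```

On the real data the same line, applied to `heart_df`, gives the disease rate for patients with and without exercise-induced angina directly.
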


Let us save and upload our work to Jovian before continuing.

In [None]:
import jovian

In [None]:
jovian.commit(project = project_name)

## Inferences and Conclusion

1. Heart problems are more prevalent in males than in females. Males are also more prone to getting heart disease at a younger age than females.
1. Patients with higher cholesterol readings have a higher chance of having heart disease.
1. When both males and females are considered, age does not show a direct relationship with having heart disease.
1. Patients who do not have exercise-induced angina have a higher chance of having heart disease.
1. There is no clear relationship between fasting blood sugar > 120 mg/dl and the patient having heart disease.
1. The average age of males and females who have heart disease is 50.9 and 54.56 respectively.


In [None]:
import jovian

In [None]:
jovian.commit(project = project_name)

## References and Future Work

1. We can use this data to train a machine learning model that estimates the probability of a new patient having heart disease.

References
1. https://www.kaggle.com/ritikasaini/heart-disease-classifier-using-xgboost
1. https://www.kaggle.com/ronitf/heart-disease-uci
1. https://pandas.pydata.org/pandas-docs/stable/index.html
1. https://matplotlib.org/
1. https://seaborn.pydata.org/

In [None]:
import jovian

In [None]:
jovian.commit(project = project_name)