![Heart.png](attachment:Heart.png)

# Preface

In each data analysis project like this, there is a certain path toward success. Such a path can be considered as the set of following steps that we need to go through to take the required actions, gather the necessary information, and finally perform a well-structured data analysis.

## Problem Identification/ Problem statement

First of all, in each project, we should understand exactly what the problem is, which reasons lie behind it, how it affects all the involved parties, and what the main goal of the data science project (or the business goal of the action) is. 

To clarify the final objective of the project, we may ask ourselves the following question: 

***For what purpose can the company/customer use the provided analysis?***

## Business understanding

To this aim, we should get the **domain knowledge** by speaking to the stakeholders, industrial partners, and domain experts. In addition, we should get enough information about the company/customer itself, their business/services, and their intended products. 

Furthermore, by reading the **documentation of the project**, including data/feature descriptions, the format of the provided data files, and other primary technical aspects, we can get useful information about the data and become familiar with it, which may help us identify potential problems.

To this aim, the following 3 steps can be considered as the best starting points:

1. We should make clear **which questions we want to answer in this project** in order to provide useful insights from the data, improve the decision-making process, and provide data-driven solutions. By **designing meaningful and insightful questions** to be answered during the analysis, we can correctly understand and interpret the hidden relations discovered during the data analysis process.


2. Moreover, it is highly recommended to **search the internet and review reliable external sources** to add useful aspects, insights, and suggestions to the project, or to find hints for useful Exploratory Data Analysis (EDA) and data visualization steps. It is also a good idea to **find some valid documentation (information/reports/research/analysis), evidence, and statistics** that support the obtained knowledge.


3. Finally, if it is possible, it is strongly recommended to **add reliable supplementary data to the original data set** and if required, combine structured and unstructured data. 

### Identifying Challenges

In this step, we need to clarify **which challenges we are facing regarding the project and the data**. Here, as the most important action, we need to <span style="color:red;font-size:18pt;font-weight: bold">check data reliability</span> <u>to be sure whether the data is a good source of knowledge for answering our questions about the problem</u>. 

After such an examination, we need to consider **technical challenges** as well. For example, if we are going to work with unstructured data such as text, we should be aware of the **operational complexities** of the required technical steps, such as extracting information from text files and converting unstructured information into a structured format, so that we can <u>**determine the required time and resources for those steps**</u>.

### Project Properties

As the next step, we should also determine the **key objectives of the project**, **project specifications**, and **project schedule**, **calculate time requirements**, and **prioritize project elements**.

### Extra Decorations

During our technical data analysis process, it is highly recommended to add some **artistic aspects** to the final result (report) in the form of supplementary images, diagrams, and figures to help stakeholders understand the usefulness of the provided insights, and to support them in their decision-making process. In this way, they can digest the provided knowledge and act on it.

# Our use case: Heart attack prediction problem

Based on the above-mentioned steps, we determine the following points for the current project:

## Problem Identification/ Problem statement

The problem this project is going to analyze is <u>**one of the most common and deadly human body failures**</u>: <span style="color:red;font-size:16pt">**The heart attack**</span>. 

According to the <span style="color:#008dc9;font-size:16pt">**World Health Organization (WHO)**</span> <a href="https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)">report</a>:
- <span style="color:#5C1009;font-size:16pt;font-weight:bold">Cardiovascular diseases (CVDs)</span> are the **leading cause of death globally**. 
- An estimated **17.9 million people died from CVDs in 2019**, representing **32% of all global deaths**. Of these deaths, 85% were due to **heart attack** and **stroke**.
- Over **three quarters of CVD deaths** take place in **low- and middle-income countries**.
- Out of the **17 million premature deaths (under the age of 70)** due to noncommunicable diseases in **2019**, **38% were caused by CVDs**.
- Most cardiovascular diseases <u>can be prevented</u> by addressing **behavioural risk factors** such as:
    - **tobacco use**, 
    - **unhealthy diet** and **obesity**, 
    - **physical inactivity** and **harmful use of alcohol**.

Another report reveals that about half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: 
- **high blood pressure**, 
- **high cholesterol**, and 
- **smoking**.

In addition, while some actions can be taken to decrease the effect of the above-mentioned risk factors, some other risk factors of heart disease such as the **age** or the **family history** cannot be controlled.

The following map (<a href="https://ourworldindata.org/grapher/cardiovascular-disease-death-rates">source</a>) illustrates how the death rate from CVDs varies between countries whose people have **different lifestyles** and **eating habits**:

In [None]:
from IPython.display import IFrame, display

# Embed the interactive Our World in Data chart (needs internet access to render)
display(IFrame('https://ourworldindata.org/grapher/cardiovascular-disease-death-rates', '100%', '600px'))

Therefore, to save more lives, <span style="color:red;font-size:16pt">**it is important to detect cardiovascular disease as early as possible**</span> so that management with counseling and medicines can begin.

---

The final objectives of this project are to use the provided data to perform the following tasks:

1. Performing EDA to obtain useful primary insights 
2. Using the provided set of input features and their corresponding labels to <span style="color:green;font-size:14pt">**predict the likelihood of a potential heart attack**</span> in each sample. In other words, the problem is to build a binary classifier that can predict whether a person is prone to a heart attack or not.

-----------------------

# Project properties

## Data exploration

The provided data for the project comes as **two CSV files**: `heart.csv` and `o2Saturation.csv`.

The `heart.csv` file includes **13 input features** and their corresponding **target feature** for **303 samples**.

Besides the first two features, which express the **age** and the **gender** of each sample (person), the remaining **11 features** provide the values of various medical measurements of that person.

**3 input features** as well as the **target feature** have **binary values ({0,1})**, while the others have numerical values in different ranges. So, although there is no categorical feature among the data that needs to be converted into a numerical feature, we may need to normalize our numerical input features before feeding them to some specific machine learning models. 

The `o2Saturation.csv` file is a single-column CSV file that includes **3585 floating point values** without any further information. So, using the values of this file in our data analysis process, without any clue about their meaning and with a number of rows that does not match the `heart.csv` file, can be quite challenging.
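As a minimal sketch of what such a normalization could look like, the following example applies z-score standardization to two numerical columns. The toy values and the choice of z-scores (rather than, say, min-max scaling) are assumptions made only for illustration.

```python
import pandas as pd

# Toy stand-in for a few columns of heart.csv (values are made up).
toy = pd.DataFrame({
    "age": [29, 45, 54, 61, 77],
    "chol": [180, 211, 240, 275, 330],
    "sex": [0, 1, 1, 0, 1],  # binary feature, left untouched
})

numeric_cols = ["age", "chol"]

# Z-score standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0).
standardized = toy.copy()
standardized[numeric_cols] = (
    toy[numeric_cols] - toy[numeric_cols].mean()
) / toy[numeric_cols].std(ddof=0)

print(standardized.round(2))
```

After this step each standardized column has mean 0 and unit variance, which is what distance-based or gradient-based models usually benefit from.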

## Feature description:

The very short documentation of the project provides the following descriptions for each feature of the `heart.csv` file: 

1. **Age**: Age of the patient in years

2. **Sex**: Gender of the patient, possibly:
    - 0: Female
    - 1: Male
    
3. **cp**: Chest pain type. For this feature, possible values and their meanings are as follows:
    - 0: typical angina
    - 1: atypical angina
    - 2: non-anginal pain
    - 3: asymptomatic    
    
4. **trtbps**: resting blood pressure (in mm Hg on admission to the hospital)

5. **chol**: Serum cholesterol (mg/dl) fetched via BMI sensor

6. **fbs**: (fasting blood sugar > 120 mg/dl)
    - 0: No
    - 1: Yes

7. **restecg**: resting electrocardiographic results. For this feature, possible values and their meanings are as follows:
    - 0: hypertrophy (enlargement and thickening of the walls of heart's main pumping chamber - left ventricle)
    - 1: normal
    - 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  
8. **thalachh**: maximum heart rate achieved during thallium stress test  
  
9. **exng**: exercise induced angina:
    - 0: No
    - 1: Yes
    
10. **oldpeak**: ST depression induced by exercise relative to rest    

11. **slp**: the slope of the peak exercise ST segment:
    - 0: downsloping
    - 1: flat
    - 2: upsloping
    
12. **caa**: number of major vessels colored by fluoroscopy (0-3).

13. **thall**: Thallium stress test result: 
    - 1: fixed defect
    - 2: normal
    - 3: reversible defect

14. **output**: the predicted angiographic disease status (target feature)
    - 0: less than 50% diameter narrowing
    - 1: more than 50% diameter narrowing

-------------------------

### Externally obtained knowledge about the input features of the project

Since we do not have a project partner who could tell us more about these features, we search the internet and try to gather some information about them.

3.**cp**: This feature is related to some well-known **chest pain types** as follows:
       
- **Typical angina (TA)**: is defined as substernal chest pain precipitated by physical exertion or emotional stress and relieved with rest or nitroglycerin. Angina chest pain is a pressure or squeezing like sensation that is usually caused when your heart muscle doesn’t get an adequate supply of oxygenated blood.

- **Atypical angina**: Chest pain that doesn't meet the criteria for angina is known as atypical chest pain. Unlike typical chest pain, it doesn't occur in the sternum and may radiate to other parts of the body.

- **Nonanginal pain**: Which has one characteristic of typical angina. Nonanginal chest pain carries intermediate risk for CAD (stable coronary artery disease) in women older than 60 years and men older than 40 years.

- **asymptomatic (Silent myocardial ischemia - SMI)**: is defined as a transient alteration in myocardial perfusion in the absence of chest pain or the usual anginal equivalents. Patients may be classified as having one of the three types of SMI: 
    - **type A**: totally asymptomatic patients with no history of angina or myocardial infarction,
    - **type B**: asymptomatic patients with previous myocardial infarction,
    - **type C**: patients with angina and asymptomatic ischemic episodes.

4.**trtbps**: Normal resting blood pressure in an adult is approximately 120/80 mmHg.
       
6.**fbs**: Fasting blood sugar test. A blood sample is taken after an overnight fast. A fasting blood sugar level less than 100 mg/dL (5.6 mmol/L) is normal. A fasting blood sugar level from 100 to 125 mg/dL (5.6 to 6.9 mmol/L) is considered prediabetes. If it's 126 mg/dL (7 mmol/L) or higher on two separate tests, you have diabetes.
       
7.**restecg**: The resting ECG is used to assess known cardiovascular diseases.

9.**exng**: Exercise-induced angina (AP) is a common complaint of cardiac patients, particularly when exercising in the cold. This feature indicates whether or not the patient experienced exercise-induced angina. 

12.**caa**: number of major vessels. The major blood vessels connected to your heart are 
  - aorta, 
  - superior vena cava, 
  - inferior vena cava, 
  - pulmonary artery (which takes oxygen-poor blood from the heart to the lungs where it is oxygenated), 
  - pulmonary veins (which bring oxygen-rich blood from the lungs to the heart), and 
  - coronary arteries (which supply blood to the heart muscle). 

    This feature indicates the **number of observable major vessels colored by fluoroscopy for each patient**. 

    **Fluoroscopy** is a test that uses a steady beam of x-rays (like a movie) to look at parts of the body and movement within the body, such as blood moving through a blood vessel. Fluoroscopy also can be used to help find a foreign object in the body, position a needle for a medical procedure, or realign a broken bone.

---

##### Some questions that we may ask about the problem (before any data exploration or going further to the analysis steps):

- What are the most important factors that increase the risk of heart attack occurrences?
- What is the amount of correlation between the given features and heart attack occurrences? 
- Does sex affect heart attack occurrences? 
- What is the rate of heart attack occurrence in different countries?
- Is there a special period of time during the day or night in which the rate of heart attack occurrence increases?
- Is there a special year in which the rate of heart attack increased? If yes, why?
- How does the rate of heart attack change during the different seasons of a year?
- What is the most effective strategy or medicine to prevent heart attack occurrences?
- Which factor(s) determine(s) whether a heart attack leads to death or not?
- What kind of chest pain can be considered a serious warning of a potential heart attack for a patient?
- Is there any correlation between levels of        
  * blood pressure
  * cholesterol 
  * blood sugar      
  and the risk of having a heart attack?
- Which features can be considered heart health measures that may reduce the risk of having a heart attack in a patient?  

   
More questions can be proposed during the data analysis process.

---
## Data Analytics Process

### 1. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is one of the first steps in the data analytics process. This step can be defined as the application of different data visualization techniques to the data to achieve all or some of the following objectives:

1. Gaining new insights about the data
2. Identifying important characteristics of the data 
3. Discovering the hidden relations among the data
4. Detecting outliers and anomalies of the data
5. Catching mistakes in the data
6. Testing our initial assumptions about the data

and much more.

As *Jake VanderPlas* mentioned in his book **Python Data Science Handbook**:

"**No matter what the data are, the first step in making them analyzable will be to transform them into arrays of
numbers**". 

Here are some primary hints:

- Since we are going to work with mathematical models, we obviously have to work with numbers. So, if we see any column in the data that has **non-numerical values**, we need to transform it into numbers in some way. 

- In addition, any column with the type `object` should be transformed to a numerical data type as well, even if it contains only numerical values. 

  As an example, suppose you inspect the type of a column using the `.info()` method or the `.dtypes` attribute of a pandas DataFrame and see:

    ```
    area 5000 non-null object
    ```

    We should note that, although the feature values are numerical, the data type of the feature is `object`. So, it should be transformed into a **numerical data type** using the pandas `.astype()` method.
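The conversion described above can be sketched as follows. The `area` column and its values are hypothetical and only mirror the `.info()` output shown in the hint; `pd.to_numeric` is included as an alternative for columns that may contain malformed entries.

```python
import pandas as pd

# Hypothetical column: numbers stored as strings with an explicit object dtype.
toy = pd.DataFrame({"area": ["5000", "3200", "4100"]}, dtype=object)
print(toy["area"].dtype)

# .astype() works when every value parses cleanly as a number.
toy["area"] = toy["area"].astype("int64")
print(toy["area"].dtype)

# pd.to_numeric with errors="coerce" is an alternative when some entries
# may be malformed: unparseable values become NaN instead of raising.
messy = pd.Series(["5000", "n/a", "4100"])
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```

With `.astype()`, a single unparseable value raises an error, which is often the safer behavior when the column is expected to be clean.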

Let's start our coding journey. First, we need to **import some necessary libraries**:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()


# import graph objects as "go" and import tools
import plotly.graph_objs as go
from plotly import tools
import plotly.express as px
import plotly.offline as py

As you may guess, we use the well-known and powerful **pandas** library to load the data and work on it.

The **pandas** library has a data structure called **DataFrame (DF)**.

**DataFrames** are <u>similar but not equal</u> to **SQL tables and spreadsheets**. In many cases, pandas DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they are an integral part of the Python and NumPy ecosystems. So, they provide a wide range of **<u>high-performance data analysis tools</u>**.

So, the first step is to read the data from the **csv file** and put it into a pandas **DataFrame**:

In [None]:
df = pd.read_csv('../input/heart-attack-prediction-data-set/heart.csv')

Now, let's take a look at the size of the data and the very first rows of the data we just read:

In [None]:
len(df)

In [None]:
df.head()

Okay, as can be seen, the data has **303 samples** in total.

It seems the data is <u>nice and clear</u>. All the features have numerical values and some of them are binary variables.

Let's check if our assumption is correct and if the data contains any **missing values** for any feature.
Here, the `.info()` method of the pandas DataFrame can be used to get a concise summary of the DataFrame:

In [None]:
df.info()

In [None]:
df.isnull().sum().sum()

From the above cells, we can see that the data **has no missing values**. 

It is also obvious that we do not have **<u>object type (non-numeric) features</u>**. Therefore, there is no need to perform **data type transformation or imputation techniques** (as part of feature engineering or pre-processing) to fill in missing values or to transform non-numeric (categorical) features into numerical values.

Now, let's see what insights we can obtain from the basic statistical characteristics of the features.
Here, the `.describe()` method provides summary statistics of the DataFrame columns:

In [None]:
pd.set_option("display.precision", 2) # to decrease the display precision to 2 digits
df.describe()

Different primary insights can be obtained from the above table. For example:

- The age range of the patients is **[29, 77]** and the average age is **54.37**.
- The sex (gender) information (**mean: 0.68**) reveals that most of the sample patients are **male** (assuming `1` encodes male, as the feature description suggests).
- The cholesterol values show that the cholesterol level of **25%** of patients is below **211**.

### Checking data redundancies

If you are familiar with the concept of data storage and retrieval, you should know that **data redundancy (having duplicated rows among the data)** may lead to data anomalies and corruption, and that it should be avoided when using a pandas DataFrame as well. 

To detect data redundancies in a pandas DataFrame, we can use the `.duplicated()` method. 

Let's see if we have any **duplicated sample** among the data:

In [None]:
df.duplicated().any()

**YES**, we have! 

Let's check **which sample is duplicated** and **at which indices the duplication occurs**:

In [None]:
df[df.duplicated()]

In [None]:
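# keep=False marks every copy of a duplicated row, not only the later ones
dups = df[df.duplicated(keep=False)]

# group identical rows and collect the index labels of each group
dups = dups.groupby(list(dups)).apply(lambda x: tuple(x.index)).tolist()
print(dups)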

So, the above-mentioned sample is repeated at indices **163 and 164**. See:

In [None]:
df[163:165]

Now, it is time to **drop the duplicated rows** and get rid of them!

In [None]:
df = df.drop_duplicates()
# let's check whether any duplicated samples remain
df[df.duplicated()]

### Increasing the readability of the data and code

Based on well-known coding guidelines, it is always recommended to **select meaningful names for your variables and functions**. The same is true for a data analysis process. So, it is always a good idea to have meaningful names for all the features (columns) in your data. Regarding the name of the DataFrame(s), many people prefer to keep it simple so that it stays understandable for others. Therefore, using a name like `data` or `df` for your main DataFrame is quite acceptable.

Let's rename our features as follows to have a better feeling when working with them:

'age' &#8594; 'Age'

'sex' &#8594; 'Gender'

'cp' &#8594; 'ChestPainType'

'trtbps' &#8594; 'RestingBloodPressure'

'chol' &#8594; 'CholesterolLevel'

'fbs' &#8594; 'FastingBloodSugar'

'restecg' &#8594; 'RestingECG'

'thalachh' &#8594; 'MaxHR'

'exng' &#8594; 'ExerciseInducedAngina'

'oldpeak' &#8594; 'ST_Depression'

'slp' &#8594; 'ST_Slope'

'caa' &#8594; 'NumMajorVessels'

'thall' &#8594; 'Thalassemia'

'output' &#8594; 'Prediction'
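One way to apply such a mapping is `DataFrame.rename` with an explicit old-to-new dictionary. The sketch below uses a toy frame with only three of the columns, and its values are made up; it illustrates the idea under those assumptions.

```python
import pandas as pd

# Toy frame with three of the original column names (values are made up).
toy = pd.DataFrame({"age": [63, 37], "sex": [1, 1], "cp": [3, 2]})

# Explicit old -> new mapping; only a subset of the full list is shown.
new_names = {
    "age": "Age",
    "sex": "Gender",
    "cp": "ChestPainType",
}

# rename() matches by name, so it does not depend on column order, and
# names missing from the mapping are left unchanged.
toy = toy.rename(columns=new_names)
print(toy.columns.tolist())
```

Matching by name guards against silently mislabeling a feature if the column order of the CSV file ever differs from what is expected.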

In [None]:
# Definition of some colors used in diagrams

G0_color = '#ef553b' # Gender 0
G1_color = '#636efa' # Gender 1

#G0_color = 'rgb(239,85,59)' # Gender 0
#G1_color = 'rgb(99,110,250)' # Gender 1

H_color = '#239b56' # Healthy
R_color = '#e74c3c' # Risky

In [None]:
df.columns = ['Age', 'Gender', 'ChestPainType', 'RestingBloodPressure', 'CholesterolLevel',
              'FastingBloodSugar', 'RestingECG', 'MaxHR', 'ExerciseInducedAngina', 'ST_Depression',
              'ST_Slope', 'NumMajorVessels', 'Thalassemia', 'Prediction']

### Primary Data Observation 

Now that we have **no duplicates among the data**, **our features are all numerical**, and **our feature names are meaningful enough**, it's time to **get familiar with the data through some data visualization steps**:

First, let's see whether our data is balanced w.r.t. the "**Age**" and "**Gender**" of the samples:

In [None]:
Labels = ['Gender 0', 'Gender 1']
v0 = len(df[df['Gender'] == 0])
v1 = len(df[df['Gender'] == 1])
Values = [v0, v1]

fig = go.Figure(data=[go.Pie(labels=Labels,
                             values=Values,
                             #hovertext=df['Gender'],
                             textfont_size=20,
                             #legendgrouptitle=dict(text='Gender'),
                             title=dict(position='top center',text ='Gender Distribution among the data'),
                             marker=dict(colors=[G0_color, G1_color])
                            )
                     ]
               )

fig.show()

Regarding the **Gender** feature, it is not clear what the values `0` and `1` mean. We cannot simply interpret `0` as **Female** and `1` as **Male**, or vice versa. So, we prefer to keep the vague `0` and `1` and use them accordingly. But it is obvious that <span style="color:red;font-size:20pt">**our data is imbalanced with respect to Gender distribution**</span>, which we need to address later in our preprocessing steps.


Let's take a look at the **Age distribution** as well:

In [None]:
def plot_histogram(data, col, color, title, shade=False, shade_options=None):
    '''
    Plot the histogram of the [col] feature of the [data] data frame in [color] color with the [title] title.
    If [shade] is True, [shade_options] = [x0, x1, fill_color] highlights the x-range from x0 to x1.
    '''    
    fig = go.Figure()
    fig.add_trace(go.Histogram(
        x=data[col],
        name=col,
        xbins=dict(
            start=data[col].min(),
            end=data[col].max(),
            size=1
        ),

        marker_color=color,
        opacity=0.75,
    ))

    fig.update_layout(
        title_text= title, # title of plot
        xaxis_title_text=col, # xaxis label
        yaxis_title_text='Quantity', # yaxis label
        bargap=0.2, # gap between bars of adjacent location coordinates
    )
    
    if shade:
        fig.add_vrect(
            x0=shade_options[0], x1=shade_options[1],
            fillcolor=shade_options[2], opacity=0.5,
            layer="below", line_width=0,
            )
    fig.update_layout(xaxis_range=[df[col].min(),df[col].max()]) # to force all figures to have equal x-axis range
    fig.show()
    
plot_histogram(df, 'Age', 'navy', 'Age Distribution of all the samples:', shade=True, shade_options=[40, 60, 'dimgray'])    

Beautiful!

The age feature range is [29, 77] and, as can be seen, **most of our samples** are **<u>between 40 and 60 years old</u>**. But the age distribution alone is not so informative.

Let's define some age groups and plot their distribution along with the gender distribution to get a clearer overview of these two important features:
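A minimal sketch of how such age groups could be built with `pd.cut`, assuming ten-year, left-closed bins that start at the minimum age of 29. The ages and the group labels below are made up for illustration.

```python
import pandas as pd

ages = pd.Series([29, 38, 39, 48, 55, 68, 69, 77], name="Age")

# Left-closed bins [29, 39), [39, 49), ...; the labels are illustrative.
bins = [29, 39, 49, 59, 69, 79]
labels = ["A", "B", "C", "D", "E"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_group.tolist())
```

Because the result is an ordered categorical, counting and sorting by group keeps the natural age order without any extra bookkeeping.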


In [None]:
df.head()

In [None]:
df['AgeGroup'] = ''

df.loc[(df['Age'] >= 29) & (df['Age'] < 39), 'AgeGroup'] = 'A'
df.loc[(df['Age'] >= 39) & (df['Age'] < 49), 'AgeGroup'] = 'B'
df.loc[(df['Age'] >= 49) & (df['Age'] < 59), 'AgeGroup'] = 'C'
df.loc[(df['Age'] >= 59) & (df['Age'] < 69), 'AgeGroup'] = 'D'
df.loc[(df['Age'] >= 69) & (df['Age'] < 79), 'AgeGroup'] = 'E'

In [None]:
Gender_0 = df[df['Gender'] == 0]
G0_Groups = Gender_0['AgeGroup'].value_counts().sort_index().tolist()
Gender_1 = df[df['Gender'] == 1]
G1_Groups = Gender_1['AgeGroup'].value_counts().sort_index().tolist()
Groups = sorted(df['AgeGroup'].unique().tolist())

fig = go.Figure(data=[
    go.Bar(name='Gender 0', x=Groups, y=G0_Groups, marker_color=G0_color),
    go.Bar(name='Gender 1', x=Groups, y=G1_Groups, marker_color=G1_color)
])
# overlay a smoothed line per gender in the same color as its bars
fig.add_trace(
    go.Scatter(
        x=Groups,
        y=G0_Groups,
        line=dict(color=G0_color, shape='spline'),
        marker=dict(opacity=0),
        showlegend=False
    ))
fig.add_trace(
    go.Scatter(
        x=Groups,
        y=G1_Groups,
        line=dict(color=G1_color, shape='spline'),
        marker=dict(opacity=0),
        showlegend=False
    ))

fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=Groups,
        ticktext=['29-38', '39-48', '49-58', '59-68', '69-78']),
    xaxis_title_text='Age Groups', # xaxis label
    yaxis_title_text='Quantity', # yaxis label
    barmode='group' # change the barchart mode
)

fig.show()

As you see, **both genders have almost the same age distribution**. So, we can be sure <span style="color:blue;font-size:20pt">**there is no bias due to different age distributions**</span>.

Another possible approach for further data investigation would be to separate our samples into **<span style="color:green;font-size:20pt">healthy</span>** and **<span style="color:red;font-size:20pt">risky</span>** samples based on the likelihood of a potential heart attack, and to see how they differ in various aspects and what insights we can obtain from these differences:

In [None]:
df['Label'] = ''

df.loc[df['Prediction'] == 0, 'Label'] = 'Healthy'
df.loc[df['Prediction'] == 1, 'Label'] = 'Risky'


healthies = df[df['Prediction'] == 0]
riskies = df[df['Prediction'] == 1]

In [None]:
from plotly.subplots import make_subplots

# plot_histogram creates and shows its own figure, so the two histograms appear one below the other
plot_histogram(healthies, 'Age', 'darkgreen', 'Age Distribution of healthy samples:', shade=True, shade_options=[41, 55, 'lightgreen'])
plot_histogram(riskies, 'Age', 'maroon', 'Age Distribution of risky samples:', shade=True, shade_options=[41, 55, 'salmon'])

Comparing the above figures, it is reasonable to conclude that **samples aged between 41 and 55** (inside the highlighted area) are **<u>more likely to have a heart attack</u>**. So, possibly, **<span style="color:darkred">there is an anti-correlation between the age of a sample and his/her risk of having a heart attack in the future(!)</span>**.

<span style="color:red;font-size:14pt;font-weight: bold">Such an insight is really questionable and may indicate a serious bias in the data </span>, don't you think so?

One of the important sample characteristics in our data is **gender**. So, we might need to know how our data is distributed w.r.t. this feature:
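The visual impression from the two histograms can be cross-checked numerically. The following sketch computes the share of risky samples per age bin and the correlation between age and the label. The toy values are made up, so only the method carries over to the real data, not the numbers.

```python
import pandas as pd

# Made-up sample: Age and a binary Prediction label (1 = risky).
toy = pd.DataFrame({
    "Age":        [35, 42, 47, 51, 58, 62, 66, 70],
    "Prediction": [1,  1,  1,  0,  1,  0,  0,  0],
})

# Share of risky samples per ten-year age bin.
toy["AgeBin"] = pd.cut(toy["Age"], bins=[29, 39, 49, 59, 69, 79], right=False)
risky_share = toy.groupby("AgeBin", observed=True)["Prediction"].mean()
print(risky_share)

# Pearson correlation between age and the label; a negative value would
# match the pattern described above.
r = toy["Age"].corr(toy["Prediction"])
print(round(r, 2))
```

Looking at shares rather than raw counts matters here, because an age range that simply contains more samples will dominate both histograms.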

In [None]:
H_Zeros= len(healthies[healthies['Gender'] == 0])
R_Zeros = len(riskies[riskies['Gender'] == 0])

H_Ones = len(healthies[healthies['Gender'] == 1])
R_Ones = len(riskies[riskies['Gender'] == 1])

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

fig.add_trace(
    go.Pie(labels=['Healthy', 'Risky'], values=[H_Zeros, R_Zeros], title='Gender 0'), row=1, col=1)

fig.add_trace(
    go.Pie(labels=['Healthy', 'Risky'], values=[H_Ones, R_Ones], title='Gender 1'), row=1, col=2)
# update_traces applies to every trace in the figure, so one call styles both pies
fig.update_traces(textfont_size=20, marker=dict(colors=[H_color, R_color]))

fig.show()

- As mentioned before, our data is somewhat <span style="color:red">**imbalanced with respect to the gender of patients**</span>.
- In addition, the **target feature values (being healthy or risky)** have **completely different distributions in the two genders**: 
  
  - the <span style="color:red">**majority of Gender 0 samples (75%)**</span> are labeled as <span style="color:red">**risky**</span>, while
  - the <span style="color:green">**majority of the Gender 1 samples (55.3%)**</span> are labeled as <span style="color:green">**healthy**</span>.
  
- So, either 
    - the **gender has a strong correlation with the target feature**
    - or, the sampling strategy of the <u>dataset is biased</u> towards <span style="color:red">**risky Gender 0 samples**</span> and <span style="color:green">**healthy Gender 1 samples**</span>.
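Per-gender proportions like these can be computed with a row-normalized contingency table. The sketch below uses made-up labels; the `Gender` and `Label` column names follow the notebook. Such a table quantifies the difference, but it cannot decide which of the two explanations above is the right one.

```python
import pandas as pd

# Made-up sample of gender codes and health labels.
toy = pd.DataFrame({
    "Gender": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "Label":  ["Risky", "Risky", "Risky", "Healthy",
               "Healthy", "Healthy", "Healthy", "Risky", "Risky", "Healthy"],
})

# Row-normalized contingency table: each row sums to 1.
shares = pd.crosstab(toy["Gender"], toy["Label"], normalize="index")
print(shares)
```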

## Feature correlations

Now, it is time to take a look at the existing correlations between our features.
In the following **correlation diagram**, each correlation between a pair of features is illustrated as a colored square. The value of the correlation lies in **[-1, 1]**, where a **bigger red square** expresses a **stronger negative correlation** between two features and a **bigger blue square** can be interpreted as a **stronger positive correlation**.

In this diagram, we are interested in two types of correlation:
1. A **strong correlation (positive or negative) between each input feature and the target feature (output)**. To this aim, we should pay attention to the bottom row or the rightmost column of the diagram and their squares (bigger is better).

2. We also need to pay attention to **strong correlations between each pair of input features**. Here, we hope **<u>not to see a strong correlation</u>** (the bigger the square, the worse it is), because each big square between two input features indicates **collinearity** between those features. In simple words, those two features present nearly the same information to the model, so one of them should be removed from the set of input features. Therefore, having such a situation among the input features leaves us with fewer features to feed to the model, which means there is less useful information among the data.
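Both checks can also be read off the correlation matrix directly. The following sketch uses synthetic data in which `x2` is almost a copy of `x1`; the column names and the 0.9 threshold are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: x2 is almost a copy of x1, so the pair is collinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})
toy["target"] = (toy["x1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

corr = toy.corr()

# 1. Absolute correlation of every input feature with the target, strongest first.
with_target = corr["target"].drop("target").abs().sort_values(ascending=False)

# 2. Pairs of input features whose absolute correlation exceeds a threshold.
threshold = 0.9  # arbitrary cut-off for this sketch
inputs = corr.drop(index="target", columns="target")
collinear = [
    (a, b)
    for i, a in enumerate(inputs.columns)
    for b in inputs.columns[i + 1:]
    if abs(inputs.loc[a, b]) > threshold
]

print(with_target)
print(collinear)
```

A numeric listing like this complements the diagram: the plot gives the overview, while the sorted values make borderline cases easier to judge.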

In [None]:
# Import the two methods from heatmap library
from heatmap import heatmap, corrplot

In [None]:
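plt.figure(figsize=(16, 16))
# AgeGroup and Label hold strings, so restrict the correlation matrix to numeric columns
corrplot(df.select_dtypes(include='number').corr(), size_scale=500);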

Based on the above diagram, input features can be categorized into 3 groups w.r.t. their correlation with the target feature (output):

1. `RestingBloodPressure`, `CholesterolLevel`, `FastingBloodSugar` and `RestingECG` features <span style="color:blue;font-size:20pt">**DO NOT have a strong correlation**</span> with the **target feature**.
2. `ChestPainType`, `ST_Slope` and `MaxHR` features have a <span style="color:red;font-size:20pt">**strong direct correlation**</span> with the **target feature**, which means <span style="color:red">**the higher their values, the more likely the patient is to have a heart attack in the future**</span>.
3. `ExerciseInducedAngina`, `ST_Depression`, `NumMajorVessels`, and `Thalassemia` features have a <span style="color:green;font-size:20pt">**strong reverse correlation**</span> with the **target feature**, which means <span style="color:green">**the higher their values, the less likely the patient is to have a heart attack in the future**</span>.

## More illustrations on features with a direct correlation with heart attack probability

As discussed in the second item of the above section, `ChestPainType`, `ST_Slope` and `MaxHR` are 3 features that seem to have a **direct correlation with the heart attack probability**.

The first feature, `ChestPainType` (chest pain type), can be considered as a <u>physical sign that something is wrong</u>. Generally speaking, when a person has pain in the chest, it is reasonable to suspect that the person is more prone to some kind of heart problem.

Since the `ChestPainType` feature <u>does <span style="color:red">**NOT**</span> include</u> any value that indicates **no chest pain**, we cannot determine the correlation between **having chest pain** and the **probability of having a heart attack in the future**. Instead, we can investigate **which types of chest pain can be considered as serious signs of a potential heart attack**:

In [None]:
feature = 'ChestPainType'

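# Category codes in sorted order; reindex both groups on them so the bars stay
# aligned even if a category is missing from one group
Groups = sorted(df[feature].unique().tolist())
H_values = healthies[feature].value_counts().reindex(Groups, fill_value=0).tolist()
R_values = riskies[feature].value_counts().reindex(Groups, fill_value=0).tolist()
print(Groups)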

fig = go.Figure(data=[
    go.Bar(name='Healthy', x = Groups, y = H_values, marker_color=H_color),
    go.Bar(name='Risky', x = Groups, y = R_values, marker_color=R_color)
]
               )

fig.add_trace(
    go.Scatter(
        x=[number - 0 for number in Groups],
        y=H_values,
        line=dict(color=H_color, shape ='spline'),
        marker=dict(opacity=0),
        showlegend=False
    ))
fig.add_trace(
    go.Scatter(
        x=[number + 0 for number in Groups],
        y=R_values,
        line=dict(color=R_color, shape ='spline'),
        marker=dict(opacity=0),
        showlegend=False
    ))

fig.update_layout(
    title='Chest Pain Type Distribution Comparison Between Healthy & Risky Samples',
    xaxis=dict(
        tickmode='array',
        tickvals=Groups,
        ticktext=['typical angina', 'atypical angina', 'non-anginal pain', 'asymptomatic'],
    ),
    xaxis_title_text='Chest Pain Type',  # x-axis label
    yaxis_title_text='Quantity',  # y-axis label
    barmode='group',  # draw the bars side by side
)

fig.show()

In [None]:
fig = px.scatter(df, x="Age", y="ChestPainType",
                color="Label", color_discrete_sequence= [H_color, R_color])

fig.update_traces(marker=dict(size=10,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                              selector=dict(mode='markers'))

fig.update_layout(
    yaxis=dict(
        tickmode='array',
        tickvals=[0, 1, 2, 3],
        ticktext=['typical angina', 'atypical angina', 'non-anginal pain', 'asymptomatic'],
    ),
    xaxis_title_text='Age',  # x-axis label
    yaxis_title_text='ChestPainType',  # y-axis label
    title='Age/Chest Pain Type Correlations with the Target Feature',
)

fig.show()


As can be seen in the above diagrams, having a <span style="color:green; font-size:20pt">**typical angina chest pain**</span> is **more common among <span style="color:green; font-size:20pt">healthy</span> samples**, 

while having a <span style="color:red;font-size:20pt">**non-anginal pain**</span> is **more likely in younger <span style="color:red;font-size:20pt">risky</span> samples**. 
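
Because the healthy and risky groups may differ in size, raw counts can hide how risky each chest pain type actually is. A minimal sketch of a per-category comparison follows. It runs on made-up rows, and it assumes the label column holds the strings `Healthy` and `Risky`; the notebook's actual encoding may differ, and `risky_share_by_category` is a hypothetical helper.

```python
import pandas as pd


def risky_share_by_category(frame, feature, label, risky_value):
    """Share of risky samples within each category of `feature`."""
    # normalize="index" turns each row of the cross-tabulation into proportions
    table = pd.crosstab(frame[feature], frame[label], normalize="index")
    return table[risky_value]


# Made-up rows for illustration only
demo_cp = pd.DataFrame({
    "ChestPainType": [0, 0, 0, 0, 1, 1, 2, 2, 2, 2],
    "Label": ["Healthy", "Healthy", "Healthy", "Risky", "Healthy",
              "Risky", "Risky", "Risky", "Risky", "Healthy"],
})

demo_share = risky_share_by_category(demo_cp, "ChestPainType", "Label", "Risky")
print(demo_share)
```

Reading shares instead of counts makes categories with few samples comparable to large ones, although shares based on very few rows should still be treated with caution.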

As described before, the second feature that has a **direct correlation with the heart attack probability** is `ST_Slope`.

In [None]:
feature = 'ST_Slope'

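# Reindex both groups on the sorted category codes so the bars stay aligned
Groups = sorted(df[feature].unique().tolist())
H_values = healthies[feature].value_counts().reindex(Groups, fill_value=0).tolist()
R_values = riskies[feature].value_counts().reindex(Groups, fill_value=0).tolist()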

fig = go.Figure(data=[
    go.Bar(name='Healthy', x = Groups, y = H_values, marker_color=H_color),
    go.Bar(name='Risky', x = Groups, y = R_values, marker_color=R_color)
]
               )

fig.add_trace(
    go.Scatter(
        x=[number - 0 for number in Groups],
        y=H_values,
        line=dict(color=H_color, shape ='spline'),
        marker=dict(opacity=0),
        showlegend=False
    ))
fig.add_trace(
    go.Scatter(
        x=[number + 0 for number in Groups],
        y=R_values,
        line=dict(color=R_color, shape ='spline'),
        marker=dict(opacity=0),
        showlegend=False
    ))

fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=Groups,
        ticktext=['downsloping', 'flat', 'upsloping'],
    ),
    xaxis_title_text='ST Slope Type',  # x-axis label
    yaxis_title_text='Quantity',  # y-axis label
    barmode='group',  # draw the bars side by side
)

fig.show()

As can be seen in the above diagram, the <span style="color:blue;font-size:15pt">downsloping ST (left group of bars)</span> value <span style="color:blue">**is seen about equally among risky and healthy samples**</span>, the <span style="color:green;font-size:15pt">flat ST slope (middle group of bars)</span> can be considered an <span style="color:green">**indication for healthy samples**</span>, and the <span style="color:red;font-size:15pt">upsloping ST (right group of bars)</span> can be considered an <span style="color:red">**indication for risky samples**</span>.

Now, it is time to take a look at the `MaxHR` difference between <span style="color:red">**risky**</span> and <span style="color:green">**healthy**</span> samples.

In [None]:
fig = px.scatter(df, x="Age", y="MaxHR",
                color="Label", color_discrete_sequence= [H_color, R_color])

fig.update_traces(marker=dict(size=10,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),

                  selector=dict(mode='markers'))


fig.add_shape(type="circle",
    xref="x", yref="y",
    x0=35, y0=135,
    x1=58, y1=195,
    opacity=0.2,
    fillcolor="green",
    line_color="green",
)

fig.add_shape(type="circle",
    xref="x", yref="y",
    x0=50, y0=85,
    x1=70, y1=170,
    opacity=0.2,
    fillcolor="red",
    line_color="red",
)

fig.update_layout(
    xaxis_title_text='Age',  # x-axis label
    yaxis_title_text='Max HR achieved during Thallium stress test',  # y-axis label
    title='Age/Max Thallium HR Correlations with the Target Feature',
)

fig.show()

In the above diagram, there are <span style="color:blue;font-size:18pt">**2 obvious clusters**</span> of <span style="color:green;font-size:18pt">**healthy (green) samples at top left**</span> and <span style="color:red;font-size:18pt">**risky (red) samples at bottom right**</span>, which shows that the `MaxHR` feature can be used as a useful input feature to discriminate healthy and risky samples from each other.

Now, it is time to investigate features that have a <span style="color:red;font-size:18pt">**strong reverse correlation**</span> with the **target feature** to see if they are also useful enough to be used in our prediction model.

Let's start with `ST_Depression` that indicates ST depression induced by exercise relative to rest:

In [None]:
import plotly.figure_factory as ff



def show_feature_dist(feature, binsize, rug=True, curve=False):
    H_values = healthies[feature]
    R_values = riskies[feature]

    hist_data = [H_values, R_values]

    group_labels = ['Healthy', 'Risky']

    fig = ff.create_distplot(hist_data,
                             group_labels,
                             bin_size=[binsize, binsize],
                             colors=[H_color, R_color],
                             show_hist =True,
                             curve_type='kde',
                             show_curve=curve,
                             show_rug = rug , 
                             #rug_text=['Healthy', 'Risky']
                            )

    fig.update_layout(
        xaxis_title_text=feature + ' Values',
        yaxis_title_text='Label Probability Density',
        title=feature + ' distribution for healthy and risky samples',
    )


    fig.show()
    
show_feature_dist('ST_Depression', 0.5, rug=False, curve=True)

Apparently, <span style="color:red;font-size:16pt">low `ST_Depression` values are more common among risky samples</span>, while <span style="color:green;font-size:16pt">healthy people are more likely to have higher values of this feature</span>.

The situation is almost the same for the `NumMajorVessels` feature:

In [None]:
show_feature_dist('NumMajorVessels', 1, rug=False, curve=True)

Let's take a look at the correlation between the **Cholesterol Level**, as **one of the known heart attack risk factors** and the target feature (heart attack prediction result) among the data:

In [None]:
show_feature_dist('CholesterolLevel', 3, rug=True, curve=True)

This diagram indicates:
1. Most of the samples in both groups (healthy/risky) have almost the same cholesterol level (somewhere between 200 and 300).
2. There are more healthy samples that have cholesterol levels higher than 250.

These insights suggest that the **cholesterol level is a neutral factor for heart attacks(!)**, which contradicts medical evidence, including the WHO report on heart attack risk factors. So, this is <span style="color:red;font-size:16pt;font-weight:bold">another indicator for a possible bias in the data</span>.
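
The second observation can be quantified as the fraction of each group that lies above a cut-off. The sketch below uses invented values and assumes a label column with the strings `Healthy` and `Risky`; the helper `share_above` and the 250 cut-off (taken from the observation above) are illustrative choices, not part of the original analysis.

```python
import pandas as pd


def share_above(frame, feature, label, cutoff):
    """Per label group: fraction of samples whose `feature` value exceeds `cutoff`."""
    return (frame[feature] > cutoff).groupby(frame[label]).mean()


# Invented cholesterol values for illustration only
demo_chol = pd.DataFrame({
    "CholesterolLevel": [210, 260, 300, 240, 220, 230, 270, 245],
    "Label": ["Healthy"] * 4 + ["Risky"] * 4,
})

demo_high_share = share_above(demo_chol, "CholesterolLevel", "Label", cutoff=250)
print(demo_high_share)
```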

### Conclusion

At this point, we have extracted several insights during our EDA process. Unfortunately, there are several indications that these insights contradict reliable and accepted medical evidence. For example:

- While medical reports, as well as our everyday experiences, show that the risk of having a heart attack rises as we get older, one of our extracted insights indicates that the risk of heart attack is at its maximum for patients between 41 and 55 years old, and that the likelihood of having a heart attack decreases after the age of 55!


- Medical reports reveal that although men tend to develop coronary artery disease earlier in life, after age 65 the risk of heart disease in women is almost the same as in men. In spite of this medical fact, our insights indicate that gender has a strong correlation with the risk of having heart attacks.


- According to numerous medical studies, at higher levels of cholesterol, the human body can develop fatty deposits in blood vessels. Eventually, these deposits grow and make it difficult for enough blood to flow through arteries. Sometimes, those deposits can break suddenly and form a clot that causes a heart attack or stroke. Nevertheless, in the previous diagram, with a drastic increase in the level of cholesterol, the risk of heart attack tends to zero!
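
The age-related contradiction in the first item can be inspected by computing the risky rate per age band together with the number of samples that supports each rate. This is a sketch under stated assumptions: the rows are invented, the label strings `Healthy`/`Risky` and the helper `risky_rate_by_age_band` are hypothetical, and the band edges simply mirror the 41 to 55 range mentioned above.

```python
import pandas as pd


def risky_rate_by_age_band(frame, age_col, label_col, risky_value, edges):
    """Risky rate and sample count for each age band defined by `edges`."""
    bands = pd.cut(frame[age_col], bins=edges)  # right-closed intervals, e.g. (40, 55]
    is_risky = frame[label_col].eq(risky_value)
    return is_risky.groupby(bands, observed=True).agg(["mean", "count"])


# Invented rows for illustration only
demo_age = pd.DataFrame({
    "Age": [30, 38, 45, 50, 52, 60, 63, 70],
    "Label": ["Healthy", "Risky", "Risky", "Risky",
              "Healthy", "Healthy", "Risky", "Healthy"],
})

demo_rates = risky_rate_by_age_band(demo_age, "Age", "Label", "Risky", edges=[0, 40, 55, 120])
print(demo_rates)
```

A band with a high rate but a small count is weak evidence on its own, which is one way a sampling bias can show up in this kind of table.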

Based on all the above contradictions, we can conclude that <span style="color:red;font-size:16pt;font-weight:bold">the provided data is not reliable enough to be used for the prediction of the heart attack probability</span>.

