<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Stage 02 - Exploratory Data Analysis
</div>

# 1. Import libraries

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sn

# 2. Read data

In [None]:
cleaned_data = pd.read_csv('../data/cleaned_data.csv')
cleaned_data

# 3. Questions

## Question 1: Which indicators are most strongly associated with heart attacks?

- **Purpose:** Understanding the factors associated with heart attacks helps us take preventative measures and adjust our daily routines to reduce these risks. It also helps medical professionals identify likely causes and treatment options for heart attacks.
- **How to answer:**
    - Select the relevant columns: every column whose values are only `Yes` or `No`, plus three multi-valued columns (`HadDiabetes`, `SmokerStatus`, `ECigaretteUsage`).
    - Preprocess the multi-valued columns so that they contain only `Yes` and `No`.
    - For each indicator, calculate the conditional probability of a heart attack given that the indicator is present.
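As a minimal sketch of the last step, the conditional probability can be estimated as the share of people with the indicator who also had a heart attack. The toy data and the helper name `heart_attack_rate_given` below are assumptions made for illustration; they are not part of the dataset or of the notebook's own code.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from the survey).
toy = pd.DataFrame({
    'HadHeartAttack': ['Yes', 'No', 'No', 'Yes', 'No', 'No'],
    'HadStroke':      ['Yes', 'Yes', 'No', 'No', 'No', 'No'],
})

def heart_attack_rate_given(df, indicator, value='Yes'):
    """Return P(HadHeartAttack = Yes | indicator = value) as a percentage."""
    has_indicator = df[indicator] == value
    has_both = has_indicator & (df['HadHeartAttack'] == 'Yes')
    # Share of people with the indicator who also had a heart attack
    return round(100 * has_both.sum() / has_indicator.sum(), 2)

stroke_rate = heart_attack_rate_given(toy, 'HadStroke')
print(stroke_rate)
```

In this toy frame two people report a stroke and one of them also reports a heart attack, so the estimate is 1 out of 2.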

### Preprocessing

**Preprocess columns having multiple values**

First, we will check the unique values of these columns.

In [None]:
heart_attack_df = cleaned_data[cleaned_data['HadHeartAttack'] == 'Yes']

print(heart_attack_df['HadDiabetes'].value_counts())
print('================================================')
print(heart_attack_df['SmokerStatus'].value_counts())
print('================================================')
print(heart_attack_df['ECigaretteUsage'].value_counts())

Then, we map these values so that each column contains only `Yes` or `No`.

In [None]:
cleaned_data_copy = cleaned_data.copy()

# Collapse each multi-valued column to Yes/No; values not listed are left unchanged
cleaned_data_copy['HadDiabetes'] = cleaned_data_copy['HadDiabetes'].replace({
    'Yes, but only during pregnancy (female)': 'Yes',
    'No, pre-diabetes or borderline diabetes': 'Yes',
})

cleaned_data_copy['SmokerStatus'] = cleaned_data_copy['SmokerStatus'].replace({
    'Never smoked': 'No',
    'Former smoker': 'Yes',
    'Current smoker - now smokes every day': 'Yes',
    'Current smoker - now smokes some days': 'Yes',
})

cleaned_data_copy['ECigaretteUsage'] = cleaned_data_copy['ECigaretteUsage'].replace({
    'Never used e-cigarettes in my entire life': 'No',
    'Not at all (right now)': 'No',
    'Use them some days': 'Yes',
    'Use them every day': 'Yes',
})
cleaned_data_copy
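Because `replace` silently leaves unlisted values untouched, it can be worth checking that nothing other than `Yes` and `No` remains after the mapping. The sketch below shows one way to do that on a toy column; the series `toy_smoker` and the label `'Unknown label'` are invented for illustration.

```python
import pandas as pd

# Toy column, invented for illustration; the labels mimic the survey's style.
toy_smoker = pd.Series(['Never smoked', 'Former smoker', 'Unknown label'])

# A deliberately incomplete mapping, to show what the check catches.
mapping = {'Never smoked': 'No', 'Former smoker': 'Yes'}
binarized = toy_smoker.replace(mapping)

# Any value outside {'Yes', 'No'} was not covered by the mapping.
leftover = sorted(set(binarized.unique()) - {'Yes', 'No'})
print(leftover)
```

An empty `leftover` list would indicate that the column is fully binarized.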

**Select rows having `HadHeartAttack = Yes`**

In [None]:
heart_attack_df = cleaned_data_copy[cleaned_data_copy['HadHeartAttack'] == 'Yes']

**Calculate the probability of a heart attack given each indicator that we believe may be associated with it.**

In [None]:
# Select columns having only yes and no values
yes_no_cols = heart_attack_df.columns[(heart_attack_df.isin(['Yes', 'No']).all()) & (heart_attack_df.columns != 'HadHeartAttack')]

# Separate yes_no_cols into two groups
# no_cols: columns where the indicator is the No value
# yes_cols: columns where the indicator is the Yes value
no_cols = ['PhysicalActivities', 'ChestScan']
yes_cols = yes_no_cols[~yes_no_cols.isin(no_cols)]

def count_yes(col):
    # Comparing and summing avoids a KeyError when a column has no Yes values
    return (col == 'Yes').sum()

def count_no(col):
    # Same approach for the No values
    return (col == 'No').sum()

# Calculate conditional probability
# Number of people who have each indicator
num_yes = cleaned_data_copy[yes_cols].agg(count_yes)
num_no = cleaned_data_copy[no_cols].agg(count_no)
num_has_indicator = pd.concat([num_yes, num_no])

# Number of people who have both the indicator and a heart attack
num_yes = heart_attack_df[yes_cols].agg(count_yes)
num_no = heart_attack_df[no_cols].agg(count_no)
num_has_indicator_heart_attack = pd.concat([num_yes, num_no])

prop_heart_attack_under_indicator = (num_has_indicator_heart_attack * 100 / num_has_indicator).round(2)
prop_heart_attack_under_indicator = prop_heart_attack_under_indicator.sort_values()

### Visualization

In [None]:
fig = px.bar(prop_heart_attack_under_indicator,
             x = prop_heart_attack_under_indicator.values,
             y = prop_heart_attack_under_indicator.index,
             title = 'The likelihood of a person experiencing a heart attack based on a specific indicator',
             labels = {'x': 'Probability (%)', 'index': 'Indicators'},
             range_x = (0, 100))
fig.update_layout(height=800, width=800)
fig.show()

### Observation

- `HadAngina` and `HadStroke` show the highest probabilities **(42.66% and 24.9% respectively)**. Respondents who have had angina or a stroke are therefore considerably more likely to have also had a heart attack.
- The next highest probabilities belong to `HadKidneyDisease` and `HadCOPD`, at **16.18% and 13.56%** respectively.
- `HadDiabetes` and `HadArthritis` follow, with probabilities of **9.64% and 8.32%** respectively.
- Conversely, `SmokerStatus`, `HadAsthma`, `AlcoholDrinkers` and `ECigaretteUsage` are associated with lower probabilities of a heart attack.
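These percentages are conditional probabilities, so judging how much an indicator matters may also require the overall heart attack rate as a reference point. The sketch below compares the two on toy data; the frame and the name `lift` are assumptions for illustration, and this comparison is not something the notebook itself computes.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from the survey).
toy = pd.DataFrame({
    'HadHeartAttack': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No'],
    'HadAngina':      ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No'],
})

had_attack = toy['HadHeartAttack'] == 'Yes'

# Overall heart attack rate across everyone, in percent
baseline = 100 * had_attack.mean()

# Heart attack rate among people with the indicator, in percent
conditional = 100 * had_attack[toy['HadAngina'] == 'Yes'].mean()

# Ratio above 1 means the indicator group has a higher rate than the baseline
lift = conditional / baseline
print(baseline, conditional, lift)
```

A ratio close to 1 would suggest that the indicator group barely differs from the population as a whole, even when its percentage looks sizeable on the chart.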

## Question 2:

- **Purpose:**
- **How to answer:**

### Preprocessing

### Visualization

### Observation

## Question 3:

- **Purpose:**
- **How to answer:**

### Preprocessing

### Visualization

### Observation