<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Stage 02 - Exploratory Data Analysis
</div>

# **1. Import libraries**

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# **2. Read data**

- The data was cleaned in the previous stage and saved to `../data/cleaned_data.csv`.
- We read the file into a DataFrame named `data`.

In [None]:
data = pd.read_csv('../data/cleaned_data.csv')
data

# **Overview**

Before conducting the analysis, let's examine the correlations between the variables in the dataset.

- **First**, we encode the categorical (`object`) columns as integers, because the correlation matrix can only be computed on numeric columns.


In [None]:
df = data.copy()
le = LabelEncoder()
for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = le.fit_transform(df[column].astype(str))
df.head(5)
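
To make the encoding step concrete, here is a minimal sketch on a made-up column (the values are illustrative and not read from the dataset). It shows that `LabelEncoder` assigns codes in sorted order of the labels, which need not match any natural ordering of the categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column, made up for illustration (not taken from the dataset)
toy = pd.Series(["Good", "Poor", "Excellent", "Fair", "Very good"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the labels in sorted order; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'Excellent': 0, 'Fair': 1, 'Good': 2, 'Poor': 3, 'Very good': 4}
print(codes.tolist())  # [2, 3, 0, 1, 4]
```

In this toy example `Very good` receives a larger code than `Poor`, so the codes do not follow the health ranking.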

- **Second**, we calculate the correlation matrix using the `corr()` function and visualize it as a **heatmap**.

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

# Draw heatmap
plt.figure(figsize=(30, 20))

# Create a triangular mask to hide the upper triangle of the heatmap
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Plot the correlation matrix as a heatmap
sn.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm')

# Add a title to the plot
plt.title('Correlation Matrix')

# Display the heatmap
plt.show()


**Observations:** The correlation matrix gives a general overview of how the columns relate to one another through their **correlation coefficients**, which helps frame the questions that follow. Note that label encoding assigns integer codes in sorted (alphabetical) order, so coefficients involving nominal columns with more than two categories should be read with caution.
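
A large heatmap can be hard to scan, so one possible complement is to rank the columns by the absolute value of their correlation with a single column of interest. The sketch below uses a small synthetic frame; the column names `target`, `a`, `b`, and `c` are placeholders, not columns of the dataset.

```python
import pandas as pd

# Synthetic numeric data, for illustration only
toy = pd.DataFrame({
    "target": [0, 0, 1, 1, 1, 0],
    "a": [1, 2, 5, 6, 7, 2],
    "b": [3, 3, 2, 1, 1, 3],
    "c": [1, 0, 1, 0, 1, 0],
})

# Correlation of every other column with the column of interest
corr_with_target = toy.corr()["target"].drop("target")

# Order by strength (absolute value) while keeping the sign in the output
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked.round(2))
```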

# **3. Questions**

## **Question 1: What is the overall comparison of general health statuses between males and females across different age groups?**

- **Purpose:** To compare general health statuses between males and females across age groups, and to identify any patterns, trends, or disparities in health between genders and age groups.
- **How to answer:** We extract the `Sex`, `GeneralHealth`, and `AgeCategory` columns, group the data by `Sex` and `AgeCategory`, and count the responses for each general health status. We then visualize the counts and interpret the results, discussing what they may mean for healthcare policies and interventions.

**1. Preprocessing**
- Step 1: Filter the DataFrame to include only the relevant columns for analysis (`Sex`, `GeneralHealth`, and `AgeCategory`)
- Step 2: Group the data by `Sex` and `AgeCategory` columns, and calculate the count of each health status using the `groupby()` and `value_counts()` functions.

In [None]:
#Filter the DataFrame to include only the relevant columns for analysis
df_health = data[["Sex", "GeneralHealth", "AgeCategory"]]
# Group the data by "Sex" and "AgeCategory" columns, and calculate the count of each health status
health_counts = df_health.groupby(["Sex", "AgeCategory"])["GeneralHealth"].value_counts().unstack().reset_index()
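
As a minimal sketch of what this `groupby()` / `value_counts()` / `unstack()` chain produces, the example below runs it on a made-up five-row frame. The use of `fill_value=0` is an optional addition here, so that combinations with no rows show `0` instead of `NaN`.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female", "Female"],
    "AgeCategory": ["Age 18 to 24"] * 5,
    "GeneralHealth": ["Good", "Good", "Poor", "Good", "Fair"],
})

counts = (
    toy.groupby(["Sex", "AgeCategory"])["GeneralHealth"]
    .value_counts()          # count of each health status within each group
    .unstack(fill_value=0)   # one column per health status
    .reset_index()           # turn the group keys back into ordinary columns
)
print(counts)
```

Each row of `counts` is one (`Sex`, `AgeCategory`) group, and each health status becomes its own column, which is the shape the plotting code expects.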

**2. Visualization**

In [None]:
# Filter the data for males and females
male_data = health_counts[health_counts["Sex"] == "Male"]
female_data = health_counts[health_counts["Sex"] == "Female"]

# Define the colors for each health status
colors = ['rgba(31, 119, 180, 0.7)', 'rgba(255, 127, 14, 0.7)', 'rgba(44, 160, 44, 0.7)',
          'rgba(214, 39, 40, 0.7)', 'rgba(148, 103, 189, 0.7)']

# Create subplots for males and females with shared y-axis
fig = make_subplots(rows=1, cols=2, subplot_titles=("Male", "Female"), shared_yaxes=True)

# Plot the stacked bars: one trace per health status in each subplot
statuses = ["Excellent", "Very good", "Good", "Fair", "Poor"]
for col, subset in enumerate([male_data, female_data], start=1):
    for i, status in enumerate(statuses):
        fig.add_trace(
            go.Bar(x=subset["AgeCategory"], y=subset[status], name=status, marker_color=colors[i],
                   legendgroup=status, showlegend=(col == 1)),  # one legend entry per status
            row=1, col=col)

# Customize the legend
fig.update_layout(legend=dict(x=1, y=1, traceorder="normal", bgcolor='rgba(0,0,0,0)'), showlegend=True)

# Customize the layout; barmode="stack" is needed because Plotly places bars side by side by default
fig.update_layout(title="General Health by Gender and Age Group", barmode="stack", yaxis_title="Count")
fig.update_xaxes(title_text="Age Category")  # label the x-axis of both subplots

# Show the plot
fig.show()

**3. Observation**

- In general, for every age group and both genders, a high proportion of respondents report `Good` health or better.
- Specific analysis:
    - The 18 to 24 age group appears to have the best overall health, with the lowest proportion of `Poor` health.
    - Health status tends to decline with age. For males, poor health increases markedly from age 50 onwards; for females, a noticeable decline begins from age 35 onwards.
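
These observations are phrased in terms of proportions, while the chart plots raw counts, so age groups with different numbers of respondents are not directly comparable. One way to check such statements, sketched here on made-up counts, is to convert each row to percentages before comparing.

```python
import pandas as pd

# Made-up counts per age group, for illustration only
counts = pd.DataFrame(
    {"Excellent": [30, 10], "Good": [50, 40], "Poor": [20, 50]},
    index=["Age 18 to 24", "Age 65 to 69"],
)

# Divide every row by its own total so that each row sums to 100%
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(1))
```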

## **Question 2: Which factors can influence heart attack?**

- **Purpose:** Understanding the factors that can influence the occurrence of a heart attack empowers us to take preventative measures and adjust our daily routines to mitigate these risk factors. Furthermore, it aids medical professionals in identifying the causes and treatment options for heart attack. 
- **How to answer:**
    - Choose appropriate columns: all columns that contain only `Yes` and `No` values, plus three multi-valued columns, `HadDiabetes`, `SmokerStatus`, and `ECigaretteUsage`.
    - Preprocess the multi-valued columns so that they contain only `Yes` and `No` values.
    - Calculate the conditional probability of a heart attack given the presence of a specific factor (a disease or a habit).
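
The quantity in the last step is the conditional probability P(heart attack | factor), that is, the number of people with both the factor and a heart attack divided by the number of people with the factor. A minimal sketch on ten made-up rows:

```python
import pandas as pd

# Ten made-up respondents, for illustration only
toy = pd.DataFrame({
    "HadAngina":      ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "HadHeartAttack": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

has_factor = toy["HadAngina"] == "Yes"
had_attack = toy["HadHeartAttack"] == "Yes"

# P(attack | factor) = count(factor and attack) / count(factor)
p_attack_given_factor = (has_factor & had_attack).sum() / has_factor.sum()
# Overall rate, useful as a baseline for comparison
p_attack_overall = had_attack.mean()

print(f"P(attack | angina) = {p_attack_given_factor:.2f}")  # 0.50
print(f"P(attack) overall  = {p_attack_overall:.2f}")       # 0.30
```

Comparing the conditional value with the overall rate shows how much more common heart attacks are among people with the factor in this toy sample.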

**1. Preprocessing**

**Preprocess columns having multiple values**

First, we will check the unique values of these columns.

In [None]:
heart_attack_df = data[data['HadHeartAttack'] == 'Yes']

print(heart_attack_df['HadDiabetes'].value_counts())
print('================================================')
print(heart_attack_df['SmokerStatus'].value_counts())
print('================================================')
print(heart_attack_df['ECigaretteUsage'].value_counts())

Then, we collapse these values into `Yes` and `No`. Note that `No, pre-diabetes or borderline diabetes` is counted as `Yes` here.

In [None]:
cleaned_data_copy = data.copy()

cleaned_data_copy['HadDiabetes'] = cleaned_data_copy['HadDiabetes'].replace(['Yes, but only during pregnancy (female)', 
                                                                             'No, pre-diabetes or borderline diabetes'], ['Yes', 'Yes'])

cleaned_data_copy['SmokerStatus'] = cleaned_data_copy['SmokerStatus'].replace(['Never smoked', 'Former smoker', 
                                                                               'Current smoker - now smokes every day', 
                                                                               'Current smoker - now smokes some days'], 
                                                                              ['No', 'Yes', 'Yes', 'Yes'])

cleaned_data_copy['ECigaretteUsage'] = cleaned_data_copy['ECigaretteUsage'].replace(['Never used e-cigarettes in my entire life', 
                                                                                     'Not at all (right now)', 'Use them some days', 
                                                                                     'Use them every day'], ['No', 'No', 'Yes', 'Yes'])
cleaned_data_copy
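
A possible safeguard after this kind of recoding is to confirm that no original label was left unmapped. The sketch below uses a dictionary for the mapping and a made-up series; the variable names `smoker_map` and `unmapped` are illustrative.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.Series([
    "Never smoked",
    "Former smoker",
    "Current smoker - now smokes some days",
])

# Old label -> new label, kept side by side so the pairs are easy to audit
smoker_map = {
    "Never smoked": "No",
    "Former smoker": "Yes",
    "Current smoker - now smokes every day": "Yes",
    "Current smoker - now smokes some days": "Yes",
}

binary = toy.replace(smoker_map)

# Any label other than Yes/No means a category was missed in the mapping
unmapped = set(binary.unique()) - {"Yes", "No"}
print(binary.tolist())  # ['No', 'Yes', 'Yes']
print(unmapped)         # set()
```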

**Select rows having `HadHeartAttack = Yes`**

In [None]:
heart_attack_df = cleaned_data_copy[cleaned_data_copy['HadHeartAttack'] == 'Yes']

**Calculate the conditional probability of a heart attack given each indicator that we believe may be associated with it.**

In [None]:
# Select columns having only yes and no values
yes_no_cols = heart_attack_df.columns[(heart_attack_df.isin(['Yes', 'No']).all()) & (heart_attack_df.columns != 'HadHeartAttack')]

# Separate yes_no_cols into two lists
# no_cols: columns for which the 'No' answer is treated as the indicator
# yes_cols: columns for which the 'Yes' answer is treated as the indicator
no_cols = ['PhysicalActivities', 'ChestScan']
yes_cols = yes_no_cols[~yes_no_cols.isin(no_cols)]

def count_yes(col):
    counts = col.value_counts()
    return counts['Yes']

def count_no(col):
    counts = col.value_counts()
    return counts['No']

# Calculate the conditional probability
# Number of people who have each indicator
num_yes = cleaned_data_copy[yes_cols].agg(count_yes)
num_no = cleaned_data_copy[no_cols].agg(count_no)
num_has_indicator = pd.concat([num_yes, num_no])

# Number of people who have both the indicator and a heart attack
num_yes = heart_attack_df[yes_cols].agg(count_yes)
num_no = heart_attack_df[no_cols].agg(count_no)
num_has_indicator_heart_attack = pd.concat([num_yes, num_no])

prop_heart_attack_under_indicator = (num_has_indicator_heart_attack * 100 / num_has_indicator).round(2)
prop_heart_attack_under_indicator = prop_heart_attack_under_indicator.sort_values(ascending=False)

**2. Visualization**

In [None]:
fig = px.bar(prop_heart_attack_under_indicator, x = prop_heart_attack_under_indicator.index, y = prop_heart_attack_under_indicator.values,  
             title = 'The likelihood of a person experiencing a heart attack based on a specific factor', labels = {'index': 'Indicators', 'y': 'Probability(%)'}, 
             range_y = (0, 100), color = prop_heart_attack_under_indicator.values)
fig.update_layout(height=500, width=1000)
fig.show()

**3. Observation**

- Individuals who have experienced `Angina` or `Stroke` show the highest likelihood of having had a heart attack.
- Individuals with conditions such as `Kidney Disease`, `COPD`, `Diabetes`, or `Arthritis` show a lower but still notable probability.
- The `Difficulty` columns also show an elevated probability, potentially linked to both physical and mental health issues.
- Conversely, factors such as `Smoker Status`, `Had Asthma`, `Alcohol Drinkers`, and `E-Cigarette Usage` are associated with a comparatively low probability of a heart attack.
- These figures describe associations in the data; on their own they do not show that a factor causes heart attacks.

## **Question 3: Can we predict whether a person has a heart disease based on certain features? Which features can be used to predict whether a person has a heart disease or not?**

- **Purpose:** To determine whether heart disease (here, the `HadHeartAttack` column) can be predicted from a set of selected features, and to explore how much predictive power those features have.
- **Significance:** Identifying the most important features can help researchers and healthcare professionals understand the factors associated with heart disease. The model's accuracy on the test set indicates how well it generalizes to new, unseen data, which can guide decisions about healthcare interventions and preventive measures.
- **How to answer:** We build a Random Forest model on the selected features. The data is split into training and testing sets, categorical features are one-hot encoded, and numeric features are scaled. We evaluate the model with 5-fold cross-validation on the training set and then measure its accuracy on the test set.

### Preprocessing

In [None]:
#Selecting features and target variable
categorical_features = ['RemovedTeeth','SmokerStatus','AgeCategory', 'AlcoholDrinkers',
'HadAngina', 'HadStroke', 'HadCOPD', 'HadKidneyDisease',
'HadArthritis', 'HadDiabetes', 'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands', 'ChestScan']
numeric_features = ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'WeightInKilograms']
target = 'HadHeartAttack'

#Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop(target, axis=1), data[target], test_size=0.2, random_state=42)

#Data preprocessing
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
numeric_transformer = Pipeline(steps=[('scaler', RobustScaler())])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features),
    ('num', numeric_transformer, numeric_features)])

#Build the Random Forest model (random_state fixed so that results are reproducible)
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', RandomForestClassifier(random_state=42))])

#Evaluate the model using Cross Validation
scores = cross_val_score(model, X_train, y_train, cv=5)

print("Accuracy of the model (Cross Validation): {:.2f}% (+/- {:.2f}%)".format(scores.mean() * 100, scores.std() * 100))

#Train the model
model.fit(X_train, y_train)

#Make predictions on the test set
y_pred = model.predict(X_test)

#Evaluate the prediction performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model on the test set: {:.2f}%".format(accuracy * 100))
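
Accuracy alone can look high when one class dominates the target. Whether `HadHeartAttack` is imbalanced is not checked in this notebook, so the following is only a sketch on synthetic labels: it compares plain accuracy with balanced accuracy for a baseline that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels: 95 "No" and 5 "Yes" (not the real class distribution)
y = np.array(["No"] * 95 + ["Yes"] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant for a majority-class baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_baseline = baseline.predict(X)

acc = accuracy_score(y, y_baseline)
bal_acc = balanced_accuracy_score(y, y_baseline)
print(f"accuracy: {acc:.2f}")               # 0.95
print(f"balanced accuracy: {bal_acc:.2f}")  # 0.50
```

If the real target turns out to be similarly skewed, comparing the Random Forest against such a baseline would show how much of its accuracy comes from the features.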

### Visualization

In [None]:
# Get the feature importances from the fitted classifier
importances = model.named_steps['classifier'].feature_importances_
# The classifier sees the one-hot encoded columns, so take the names from the fitted
# preprocessor (needs a scikit-learn version that provides get_feature_names_out)
feature_names = model.named_steps['preprocessor'].get_feature_names_out()
# Sort the importances and feature names in descending order
indices = np.argsort(importances)[::-1]
sorted_importances = importances[indices]
sorted_feature_names = feature_names[indices]

# Visualize the feature importances
# Define a custom color palette and repeat it so that every bar gets a color
custom_colors = ['#FFC300', '#FF5733', '#C70039', '#900C3F', '#581845']
bar_colors = [custom_colors[i % len(custom_colors)] for i in range(len(sorted_importances))]
# Create a bar chart figure with custom colors
fig = go.Figure(data=go.Bar(x=sorted_feature_names, y=sorted_importances, marker=dict(color=bar_colors)))
# Update the layout of the figure
fig.update_layout(
    title='Feature Importance',
    xaxis=dict(title='Features'),
    yaxis=dict(title='Importance')
)
# Show the figure
fig.show()
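
Because the categorical features are one-hot encoded, the classifier reports one importance per encoded column rather than per original feature. One possible way to get a per-feature view is to sum the importances of the columns that come from the same feature. The sketch below uses hypothetical encoded names and made-up importance values; the helper `original_name` is illustrative and assumes names of the form `<transformer>__<feature>_<category>`.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded column names and made-up importances, for illustration only
encoded_names = [
    "cat__HadAngina_No",
    "cat__HadAngina_Yes",
    "cat__SmokerStatus_Former smoker",
    "cat__SmokerStatus_Never smoked",
    "num__SleepHours",
]
importances = np.array([0.20, 0.25, 0.10, 0.05, 0.40])
original_features = ["HadAngina", "SmokerStatus", "SleepHours"]

def original_name(encoded):
    """Map an encoded column name back to the feature it was derived from."""
    stripped = encoded.split("__", 1)[1]  # drop the transformer prefix
    # Try longer names first in case one feature name is a prefix of another
    for name in sorted(original_features, key=len, reverse=True):
        if stripped == name or stripped.startswith(name + "_"):
            return name
    return stripped

# Group the encoded columns by their original feature and sum the importances
grouped = (
    pd.Series(importances, index=encoded_names)
    .groupby(original_name)
    .sum()
    .sort_values(ascending=False)
)
print(grouped.round(2))
```

Summing is only one convention; it favors features with many categories, so the grouped values are best read as a rough ranking.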

### Observation