<div style="text-align: center; background-color: #5A96E3; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Stage 02 - Exploratory Data Analysis
</div>

# **1. Import libraries**

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.preprocessing import LabelEncoder

# **2. Read data**

- The data was cleaned in the previous stage and saved to `../data/cleaned_data.csv`.
- We read that file into a DataFrame stored in the variable `data`.

In [None]:
data = pd.read_csv('../data/cleaned_data.csv')
data

# **Overview**

Before conducting the analysis, let's examine the correlations between the variables in the dataset.

- **First**, we encode the categorical (`object`) columns as integers so that every column can be included in the correlation matrix.


In [None]:
df = data.copy()
le = LabelEncoder()
for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = le.fit_transform(df[column].astype(str))
df.head(5)
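
One caveat about the encoding step above: `LabelEncoder` assigns integer codes in alphabetical order of the labels, so an ordinal column such as `GeneralHealth` would not be coded from worst to best. The sketch below shows one possible alternative using an ordered `pd.Categorical`. The toy DataFrame and the worst-to-best order in `health_order` are assumptions made for illustration (the labels are taken from the plotting code later in this notebook), not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy column; in the notebook this would be data["GeneralHealth"].
toy = pd.DataFrame(
    {"GeneralHealth": ["Good", "Poor", "Excellent", "Fair", "Very good"]}
)

# Assumed worst-to-best ordering of the health labels.
health_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]

# An ordered Categorical maps each label to its position in health_order,
# so a larger code always means better self-reported health.
toy["GeneralHealth_code"] = pd.Categorical(
    toy["GeneralHealth"], categories=health_order, ordered=True
).codes

print(toy)
```

With codes that follow the real order of the categories, the sign of a correlation coefficient involving this column becomes easier to interpret.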

- **Second**, we calculate the correlation matrix using the `corr()` method and visualize it as a **heatmap**.

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

# Draw heatmap
plt.figure(figsize=(30, 20))

# Create a triangular mask to hide the upper triangle of the heatmap
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Plot the correlation matrix as a heatmap
sn.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm')

# Add a title to the plot
plt.title('Correlation Matrix')

# Display the heatmap
plt.show()


**Observations:** The **correlation coefficients** in the heatmap give a general overview of how the columns relate to one another, which helps frame the questions below. For label-encoded columns with more than two categories, the coefficient depends on the integer order assigned by the encoder, so those values should be read as rough indications only.
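
A large heatmap can be hard to scan by eye, so it may help to list the strongest pairs directly. The following is a minimal sketch on made-up data: the `toy` DataFrame and its columns `a`, `b`, `c` are hypothetical stand-ins for the encoded `df`, and the variable names `lower` and `pairs` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the encoded DataFrame `df`.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),  # built to track `a` closely
    "c": rng.normal(size=200),                     # independent noise
})

corr = toy.corr()

# Keep only the strict lower triangle so each pair appears once
# and the diagonal (always 1.0) is excluded.
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Flatten to a Series indexed by (row, column), drop the masked cells,
# and sort by absolute strength of the coefficient.
pairs = lower.stack().dropna().sort_values(key=np.abs, ascending=False)

print(pairs.head(10))
```

Applied to the notebook's `correlation_matrix`, the same steps would produce a ranked list of column pairs that matches the lower triangle shown in the heatmap.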

# **3. Questions**

## Question 1: 

- **Purpose:**
- **How to answer:**

### Preprocessing

### Visualization

### Observation

## Question 2: What is the overall comparison of general health statuses between males and females across different age groups?

- **Purpose:** To compare general health status between males and females across age groups, and to identify any patterns, trends, or disparities in health between the sexes and between age groups.
- **How to answer:** We extract the `Sex`, `GeneralHealth`, and `AgeCategory` columns, group the data by `Sex` and `AgeCategory`, and count the respondents in each general health status. We then visualize the results and discuss what they imply for healthcare policies and interventions.

### Preprocessing

- Step 1: Filter the DataFrame to include only the relevant columns for analysis (`Sex`, `GeneralHealth`, and `AgeCategory`)
- Step 2: Group the data by `Sex` and `AgeCategory` columns, and calculate the count of each health status using the `groupby()` and `value_counts()` functions.

In [None]:
# Filter the DataFrame to include only the relevant columns for analysis
df_health = data[["Sex", "GeneralHealth", "AgeCategory"]]
# Group by "Sex" and "AgeCategory", count each health status, and pivot the statuses into columns
# (fill_value=0 records a status with no respondents in a group as 0 instead of NaN)
health_counts = df_health.groupby(["Sex", "AgeCategory"])["GeneralHealth"].value_counts().unstack(fill_value=0).reset_index()

### Visualization

In [None]:
# Filter the data for males and females
male_data = health_counts[health_counts["Sex"] == "Male"]
female_data = health_counts[health_counts["Sex"] == "Female"]

# Define the colors for each health status
colors = ['rgba(31, 119, 180, 0.7)', 'rgba(255, 127, 14, 0.7)', 'rgba(44, 160, 44, 0.7)',
          'rgba(214, 39, 40, 0.7)', 'rgba(148, 103, 189, 0.7)']

# Create subplots for males and females with shared y-axis
fig = make_subplots(rows=1, cols=2, subplot_titles=("Male", "Female"), shared_yaxes=True)

# Plot the bars for each health status in both panels.
# legendgroup ties the male and female traces together, and showlegend=False
# on the female traces keeps each status from appearing twice in the legend.
statuses = ["Excellent", "Very good", "Good", "Fair", "Poor"]
for i, status in enumerate(statuses):
    fig.add_trace(go.Bar(x=male_data["AgeCategory"], y=male_data[status], name=status,
                         legendgroup=status, marker_color=colors[i]), row=1, col=1)
    fig.add_trace(go.Bar(x=female_data["AgeCategory"], y=female_data[status], name=status,
                         legendgroup=status, showlegend=False, marker_color=colors[i]), row=1, col=2)

# Customize the legend
fig.update_layout(legend=dict(x=1, y=1, traceorder="normal", bgcolor='rgba(0,0,0,0)'), showlegend=True)

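# Customize the layout; barmode="stack" is needed for stacked bars (the default is grouped)
fig.update_layout(title="General Health by Gender and Age Group", yaxis_title="Count", barmode="stack")

# Label the x-axis of both subplots (xaxis_title alone only labels the first one)
fig.update_xaxes(title_text="Age Category")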

# Show the plot
fig.show()

### Observation

- In general, across all age groups and both sexes, the proportion of respondents reporting Good or better health remains high.
- Specific analysis:
    - The 18 to 24 age group has the best overall health, with the lowest proportion of Poor health.
    - Health status tends to decline with age. Among males, Poor health increases markedly from age 50 onwards; among females, a noticeable decline begins earlier, from age 35 onwards.
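
The observations above are phrased in terms of proportions, while the chart plots raw counts, so groups of different sizes are not directly comparable. One way to check these statements is to compute within-group shares. This is a minimal sketch on made-up rows: the `toy` DataFrame and its `"18-24"` label are hypothetical, and `health_share` is a name introduced here. In the notebook, the same chain would be applied to `df_health`.

```python
import pandas as pd

# Hypothetical toy rows standing in for `df_health`.
toy = pd.DataFrame({
    "Sex": ["Male"] * 4 + ["Female"] * 4,
    "AgeCategory": ["18-24"] * 8,
    "GeneralHealth": ["Good", "Good", "Poor", "Excellent",
                      "Good", "Fair", "Fair", "Very good"],
})

# normalize=True converts the counts into shares within each
# (Sex, AgeCategory) group, so every row of the result sums to 1.
health_share = (
    toy.groupby(["Sex", "AgeCategory"])["GeneralHealth"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

print(health_share.round(2))
```

Plotting these shares instead of counts would make the height of every stacked bar equal to 1, which makes the share of Poor health directly comparable across age groups and sexes.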

## Question 3:

- **Purpose:**
- **How to answer:**

### Preprocessing

### Visualization

### Observation