In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")

df = pd.read_csv("../gym_members_exercise_tracking.csv") 
df.head()

In [None]:
# Data types and non-null counts
df.info()

In [None]:
# Summary statistics (mean, std, quartiles, etc.)
df.describe()

In [None]:
# Most frequent value(s) per numeric column; ties produce additional rows
df.mode(numeric_only=True).head()

In [None]:
df.isna().sum()

## Table-Level EDA

- `df.info()` shows data types and non-null counts.
- `df.describe()` provides summary statistics (mean, std, min, quartiles, max) for numeric columns.
- `df.mode(numeric_only=True)` shows the most frequent value(s) of each numeric column.
- `df.isna().sum()` indicates how many missing values each column has.
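
The four checks above can also be condensed into a single overview table. The following is a minimal sketch on a small made-up frame (the `toy` data and the `overview` column names are illustrative assumptions, not values from the gym dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the gym dataset; all values are made up.
toy = pd.DataFrame({
    "Age": [25, 31, None, 44],
    "Gender": ["Male", "Female", "Female", None],
    "Calories_Burned": [850.0, 920.0, 1010.0, 760.0],
})

# One row per column: data type, missing count, missing share, distinct values.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
    "n_unique": toy.nunique(),  # NaN is not counted as a distinct value
})
print(overview)
```

Replacing `toy` with `df` should give the same overview for the real data.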

In [None]:
numeric_cols = [
    "Age",
    "Weight (kg)",
    "Height (m)",
    "Max_BPM",
    "Avg_BPM",
    "Resting_BPM",
    "Session_Duration (hours)",
    "Calories_Burned",
    "Fat_Percentage",
    "Water_Intake (liters)",
    "Workout_Frequency (days/week)",
    "BMI",
]

for col in numeric_cols:
    plt.figure(figsize=(6, 4))
    sns.histplot(data=df, x=col, bins=20)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.show()

## Numeric Distributions

Histograms show the shape and range of numeric features.  
Some variables (such as Calories_Burned and Session_Duration (hours)) are right-skewed:
most values are low and a few are very high.
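
Skew read off a histogram can be backed by a number. A minimal sketch using `DataFrame.skew()` on made-up data follows; the column names `symmetric` and `right_skewed` are hypothetical and only illustrate the sign convention (positive means a longer right tail):

```python
import pandas as pd

# Hypothetical toy data: one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 10.0],
})

# Sample skewness per column; values near 0 suggest a symmetric distribution.
skewness = toy.skew()
print(skewness.sort_values(ascending=False))
```

On the real data, `df[numeric_cols].skew()` would rank the features by how asymmetric they are.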


In [None]:
for col in numeric_cols:
    plt.figure(figsize=(6, 4))
    sns.boxplot(data=df, x=col)
    plt.title(f"Boxplot of {col}")
    plt.xlabel(col)
    plt.show()

## Boxplots and Outliers

Boxplots highlight the spread and outliers.  
We see some extreme values in features like Calories_Burned and Session_Duration (hours)
that we may handle during preprocessing.


In [None]:
cat_cols = ["Gender", "Experience_Level"]  

for col in cat_cols:
    plt.figure(figsize=(4, 4))
    sns.countplot(data=df, x=col)
    plt.title(f"Countplot of {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.show()


## Categorical Distributions

Countplots show how categories such as Gender and Experience_Level
are distributed.  
Most categories are reasonably represented, though some may be slightly imbalanced.
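
To put a number on "slightly imbalanced", relative frequencies can be computed with `value_counts(normalize=True)`. This is a sketch on made-up labels; the `imbalance_ratio` name is a hypothetical helper quantity, not something defined elsewhere in the notebook:

```python
import pandas as pd

# Hypothetical toy labels; the real column may use different values.
toy = pd.DataFrame({"Experience_Level": [1, 1, 2, 2, 2, 3]})

# Share of rows per category, ordered by category label.
shares = toy["Experience_Level"].value_counts(normalize=True).sort_index()

# Ratio of the largest to the smallest class share (1.0 means perfectly balanced).
imbalance_ratio = shares.max() / shares.min()
print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```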


In [None]:
plt.figure(figsize=(10, 8))
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap (Numeric Features)")
plt.show()

## Correlation

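The heatmap shows how numeric features relate.  
Some pairs are strongly correlated (e.g., BMI with Weight (kg), which is expected because
BMI is by definition weight divided by height squared), and some features show a
clear relationship with Calories_Burned, which is useful for modeling.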
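
A heatmap with many cells can be hard to scan for the target. One possible complement, sketched here on made-up values, is to rank features by the absolute Pearson correlation with Calories_Burned (the `target` and `corr_with_target` names are illustrative):

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration.
toy = pd.DataFrame({
    "Session_Duration (hours)": [0.5, 1.0, 1.5, 2.0, 1.2],
    "Age": [30, 45, 22, 38, 51],
    "Calories_Burned": [400, 820, 1250, 1600, 900],
})

target = "Calories_Burned"

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()[target]
    .drop(target)  # the target correlates perfectly with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative relationships near the top as well.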


In [None]:
subset_cols = ["Calories_Burned", "Session_Duration (hours)", "Age", "BMI", "Fat_Percentage"]
sns.pairplot(df[subset_cols])
plt.show()

## Correlation and Pairwise Relationships

- The correlation heatmap shows how strongly numeric features move together.
- The pairplot visualizes pairwise relationships and distributions for a subset 
  of important variables such as Calories_Burned, Session_Duration (hours), Age, BMI, 
  and Fat_Percentage.


In [None]:
# Show only the columns that actually contain missing values
missing = df.isna().sum()
missing[missing > 0]

In [25]:
col = "Calories_Burned"

Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1

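# Tukey fences: values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
outliers.shape  # (number of outlier rows, number of columns)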


(10, 15)

## Outliers in Calories_Burned

Using the IQR method, we found 10 rows with extreme Calories_Burned values.
These will be considered during preprocessing (keep, cap, or remove).
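
The keep, cap, and remove options can be compared side by side. Below is a minimal sketch on a made-up series; the `iqr_bounds` helper is hypothetical and simply wraps the same 1.5 * IQR rule used above:

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy values with one obvious extreme.
toy = pd.Series(
    [800, 850, 900, 950, 1000, 1050, 1100, 2500], name="Calories_Burned"
)

lower, upper = iqr_bounds(toy)
is_outlier = (toy < lower) | (toy > upper)

capped = toy.clip(lower=lower, upper=upper)  # cap: pull extremes to the fences
removed = toy[~is_outlier]                   # remove: drop the flagged rows

print(f"fences: [{lower}, {upper}], outliers flagged: {is_outlier.sum()}")
```

Capping keeps the row count unchanged, which may matter for a small dataset, while removal avoids introducing artificial values at the fences.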


In [None]:
df.groupby("Gender")["Calories_Burned"].mean()

In [None]:
df.groupby("Experience_Level")["Calories_Burned"].agg(["mean", "median", "count"])

In [None]:
df.groupby("Gender")["Fat_Percentage"].agg(["mean", "median"])

## Grouped Aggregations

Grouped stats by Gender and Experience_Level show how averages differ across groups,
for example in Calories_Burned and Fat_Percentage.
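
The separate group-bys above could also be combined into one table by grouping on both keys at once. This is a sketch on made-up rows using pandas named aggregation; the output column names `mean`, `median`, and `count` are chosen here for illustration:

```python
import pandas as pd

# Hypothetical toy rows; values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Experience_Level": [1, 2, 1, 2, 2],
    "Calories_Burned": [800, 1100, 700, 950, 1050],
})

# One row per (Gender, Experience_Level) combination.
summary = (
    toy.groupby(["Gender", "Experience_Level"])["Calories_Burned"]
    .agg(mean="mean", median="median", count="count")
    .reset_index()
)
print(summary)
```

Including `count` alongside the mean helps judge whether a difference between groups rests on enough rows to be trusted.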


## Trend Analysis

There is no date/time column, so time-based trend analysis is not applicable here.


In [None]:
plt.figure(figsize=(6, 4))
sns.scatterplot(data=df, x="BMI", y="Calories_Burned", hue="Gender")
plt.title("Calories Burned vs BMI by Gender")
plt.xlabel("BMI")
plt.ylabel("Calories Burned")
plt.show()

## Extra Analysis: BMI vs Calories_Burned

A scatterplot of BMI vs Calories_Burned (colored by Gender) helps explore how
BMI relates to calories burned per session and whether the pattern differs by gender.

## EDA Summary

- Dataset is clean with no major missing value issues.
- Numeric features show some skewness and a few outliers (e.g., Calories_Burned).
- Categorical features are reasonably balanced.
- Some features are strongly correlated (e.g., BMI with Weight (kg)).
- Grouped stats show meaningful differences across Gender and Experience_Level.

These insights will guide preprocessing and model selection.
