# 1. Exploratory data analysis

-----

The goal of this section is to study the structure of the dataset and extract basic summary metrics, such as the mean and median, to see how each feature behaves overall.

## 1.1. Load the data

--------------

In [None]:
%pip install "kagglehub[pandas-datasets]"

In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd
file_path = "Gym_Progress_Dataset.csv"  # name of the CSV file inside the Kaggle dataset

# Load the CSV directly from Kaggle into a pandas DataFrame
df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "rishabhagarwal997889/gym-progress-tracking-dataset-200-days",
    file_path,
)

df.head()  # show the first 5 rows

In [None]:
df.columns  # list the column names

- This is a first look at the dataset. Its features are the day on which the record was taken (200 days in total), the person's weight, the calorie intake for the day, the amount of protein consumed, the workout duration for the day, and the steps walked.

## 1.2. Descriptive statistics

------------

In [None]:
df.describe()  # count, mean, std, min, quartiles, and max for each numeric column

In [None]:
df.info()  # column names, dtypes, and non-null counts (info() prints by itself and returns None)

- Most features are numeric (float and int).

In [None]:
print(df.isnull().sum())  # number of null values in each column

- There are no null values, so no imputation or row removal is needed before continuing the analysis.
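To illustrate what this check reports when values are actually missing, here is a minimal sketch on a small made-up frame. The toy data is hypothetical and is not part of the Gym dataset; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the Kaggle data: one weight is deliberately missing.
toy = pd.DataFrame({
    "Weight_kg": [70.1, np.nan, 69.8],
    "Steps_Walked": [8000, 9500, 7200],
})

null_counts = toy.isnull().sum()  # missing values per column
print(null_counts)
print("Any missing values?", bool(null_counts.any()))
```

On the real `df`, every count is zero, which is why the notebook moves on without any cleaning step.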

## 1.3. Distribution visualization

-----------------

- Individual histogram for each numeric feature:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (15, 10)  # default figure size for later plots
numeric_cols = ['Weight_kg', 'Calories_Intake', 'Protein_Intake_g',
                'Workout_Duration_min', 'Steps_Walked']  # numeric features to plot

fig, axes = plt.subplots(3, 2, figsize=(15, 12))  # 3x2 grid of subplots
axes = axes.ravel()  # flatten the 3x2 array of axes to 1D, so it can be indexed as axes[0] .. axes[5]

for idx, col in enumerate(numeric_cols):  # idx is the position in the list, col is the column name
    # Histogram of the column using 30 bins
    axes[idx].hist(df[col], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=14, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=12)
    axes[idx].set_ylabel('Frequency', fontsize=12)

fig.delaxes(axes[5])  # remove the unused sixth subplot (only 5 features are plotted)

plt.tight_layout()  # automatically adjust the spacing between subplots
plt.show()

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(numeric_cols):
    # Histogram with a KDE curve overlaid; the KDE smooths the distribution to show its shape
    sns.histplot(df[col], bins=30, kde=True, color='steelblue', ax=axes[idx])
    axes[idx].set_title(f'Distribution of {col} with KDE', fontsize=14, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=12)
    axes[idx].set_ylabel('Frequency', fontsize=12)
fig.delaxes(axes[5])
plt.tight_layout()
plt.show()

- The weight distribution is the one closest to a normal distribution, and the calorie distribution also resembles one. The other features are far from normal.
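One way to put a number on "close to normal" is to look at skewness and excess kurtosis, both of which are near 0 for a normal distribution. The sketch below is an illustration only: the helper `skew_and_excess_kurtosis` is a hypothetical name, it uses simple population moments (pandas' own `skew()` and `kurt()` apply bias corrections and give slightly different values), and the two samples are synthetic stand-ins rather than the Gym data.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # variance
    m3 = np.mean(centered ** 3)  # third central moment
    m4 = np.mean(centered ** 4)  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Synthetic stand-ins (assumed shapes, not the real columns)
rng = np.random.default_rng(0)
bell_shaped = rng.normal(70, 2, 5000)  # bell-shaped, like the weight histogram
flat = rng.uniform(0, 1, 5000)         # flat, like the flatter histograms

print("bell-shaped:", skew_and_excess_kurtosis(bell_shaped))
print("flat:       ", skew_and_excess_kurtosis(flat))
```

With the notebook's data, the same helper could be called as `skew_and_excess_kurtosis(df[col])` for each entry of `numeric_cols`. A flat distribution shows up as clearly negative excess kurtosis.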

- The next step is the boxplot, which identifies outliers (points drawn beyond the whiskers).

![Alternative text](\images\Boxplot.png)
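The points a boxplot draws beyond its whiskers follow, by default in matplotlib and seaborn, the 1.5 × IQR rule: a value is flagged when it lies more than 1.5 interquartile ranges below the first quartile or above the third. A minimal sketch of that rule, where `iqr_outliers` is a hypothetical helper and the sample values are made up:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (x < lower) | (x > upper)
    return x[mask], (float(lower), float(upper))

# Made-up weights with one obviously extreme value
sample = [68, 69, 70, 70, 71, 72, 90]
outliers, fences = iqr_outliers(sample)
print("fences:", fences)
print("outliers:", outliers)
```

Applied to a column of the notebook (for example `iqr_outliers(df['Weight_kg'])`), this returns the same points that appear as isolated markers in the corresponding boxplot, assuming the plot uses the default whisker length.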

In [None]:
#Boxplots
fig, axes = plt.subplots(3, 2, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(numeric_cols):
    # Same grid as before, but drawing a boxplot of each feature instead of a histogram
    sns.boxplot(y=df[col], color='lightcoral', ax=axes[idx])
    axes[idx].set_title(f'Boxplot of {col}', fontsize=14, fontweight='bold')
    axes[idx].set_ylabel(col, fontsize=12)
    axes[idx].grid(axis='y', alpha=0.3)

fig.delaxes(axes[5])
plt.tight_layout()
plt.show()

The Weight_kg feature shows lower relative variation than the other features and has identifiable outliers. Steps_Walked and Calories_Intake show the highest variation in the data.
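"Relative variation" can be made concrete with the coefficient of variation, the standard deviation divided by the mean, which allows features measured in different units to be compared. This is a sketch under the assumption that every feature has a positive mean; `coefficient_of_variation` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    x = np.asarray(values, dtype=float)
    return float(x.std(ddof=1) / x.mean())

# Made-up values: weights vary little around their mean, steps vary a lot
toy_weight = [69, 70, 71]
toy_steps = [5000, 8000, 11000]

cv_weight = coefficient_of_variation(toy_weight)
cv_steps = coefficient_of_variation(toy_steps)
print(f"CV weight: {cv_weight:.4f}")
print(f"CV steps:  {cv_steps:.4f}")
```

On the notebook's data the same quantity is available for all features at once with `df[numeric_cols].std() / df[numeric_cols].mean()`.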

- To see where the data is concentrated, the next step is the violin plot, which combines a boxplot with a KDE.
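The KDE part of a violin plot is a smoothed density: each observation contributes a small Gaussian bump, and the bumps are averaged. The sketch below shows the idea in plain NumPy. The helper name `gaussian_kde_1d` is hypothetical, the bandwidth uses a normal-reference rule of thumb (an assumption; seaborn's default bandwidth may differ), and the sample is synthetic.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    x = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (assumption, not necessarily seaborn's default)
        bandwidth = 1.06 * x.std(ddof=1) * len(x) ** (-1 / 5)
    z = (grid[:, None] - x[None, :]) / bandwidth    # one row per grid point
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth         # average of the bumps

# Synthetic stand-in for a column such as Weight_kg
rng = np.random.default_rng(1)
samples = rng.normal(70, 2, 200)
grid = np.linspace(samples.min() - 10, samples.max() + 10, 2000)
density = gaussian_kde_1d(samples, grid)

# A density should integrate to about 1 over a wide enough grid
area = density.sum() * (grid[1] - grid[0])
print(f"Area under the estimated density: {area:.3f}")
```

The width of a violin at a given height is proportional to this density, which is why wide regions mark where the values concentrate.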

In [None]:
# Violin plots
fig, axes = plt.subplots(3, 2, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(numeric_cols):
    # Violin plot: shows where the values of each feature are concentrated
    sns.violinplot(y=df[col], color='mediumpurple', ax=axes[idx])
    axes[idx].set_title(f'Violin Plot of {col}', fontsize=14, fontweight='bold')
    axes[idx].set_ylabel(col, fontsize=12)

fig.delaxes(axes[5])
plt.tight_layout()
plt.show()

- Weight_kg: Almost perfectly symmetric, unimodal distribution centered at ~70 kg, consistent with an approximately normal shape. The values are tightly concentrated, with little dispersion.

- Other features: Protein_Intake_g, Workout_Duration_min and Steps_Walked show flatter, more uniform distributions without well-defined peaks, while Calories_Intake shows slight bimodality. This is consistent with the high variability observed in the boxplots.

# 2. Correlation analysis

---------------

## 2.1. Load the data

-----------

In [None]:
%pip install "kagglehub[pandas-datasets]"

In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd
file_path = "Gym_Progress_Dataset.csv"  # name of the CSV file inside the Kaggle dataset

# Load the CSV directly from Kaggle into a pandas DataFrame
df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "rishabhagarwal997889/gym-progress-tracking-dataset-200-days",
    file_path,
)

df.head()  # show the first 5 rows

## 2.2. Correlation Matrix

---------

The correlation matrix aims to verify the linear relationships between variables.

Negative values (-1 to 0) indicate negative correlation: when one variable increases, the other tends to decrease.

Positive values (0 to +1) indicate positive correlation: when one variable increases, the other also tends to increase.

Values close to 0 indicate weak or absent correlation between variables.
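The three cases above can be reproduced with a few lines of NumPy. The sketch computes the Pearson coefficient directly from its definition (the covariance divided by the product of the standard deviations); `pearson_r` is a hypothetical helper and the small arrays are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises exactly with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly as x rises
r_mixed = pearson_r(x, [3, 1, 4, 1, 5])  # no exact linear pattern

print(r_up, r_down, r_mixed)
```

`DataFrame.corr()` uses this same Pearson coefficient by default, computed for every pair of columns.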

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ['Weight_kg', 'Calories_Intake', 'Protein_Intake_g', 
                'Workout_Duration_min', 'Steps_Walked']  # numeric features
correlation_matrix = df[numeric_cols].corr()  # pairwise Pearson correlations between the features above
print(correlation_matrix)

- The matrix can be visualized as a heatmap. With the `coolwarm` palette centered at 0, red marks positive correlations, blue marks negative ones, and pale cells mark values close to 0:

In [None]:

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix,  # data to plot
            annot=True,  # write the value inside each cell
            cmap='coolwarm',  # diverging color scheme (blue to red)
            center=0,  # center the color scale at 0
            fmt='.3f',  # format values with 3 decimal places
            square=True,  # square cells
            linewidths=1,  # line between cells
            cbar_kws={'label': 'Correlation'})  # label for the color bar

plt.title('Correlation Matrix - Gym Dataset Variables',  #title
          fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

All correlations are extremely weak (below 0.14 in absolute value), indicating that the variables are practically uncorrelated with each other in the linear sense. Weight_kg shows almost zero correlation with all features (calories, protein, workout, steps), suggesting it is not linearly influenced by these factors. The strongest correlation found is Workout_Duration_min vs Steps_Walked (0.138), which is still very weak, so no pair of variables in the dataset has more than a minimal linear relationship.
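To get a feel for how uncertain a coefficient of this size is, one option is an approximate 95% confidence interval based on the Fisher z-transform. This is a sketch under stated assumptions: the sample size `n = 200` is inferred from the dataset name and should be replaced by `len(df)`, the method assumes roughly bivariate-normal data, and `fisher_ci` is a hypothetical helper.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate confidence interval for a Pearson r via the Fisher z-transform."""
    z = math.atanh(r)               # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on that scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# r taken from the text above; n = 200 is an assumption (use len(df) in practice)
low, high = fisher_ci(0.138, 200)
print(f"Approximate 95% interval for r = 0.138: [{low:.3f}, {high:.3f}]")
```

A wide interval whose lower end sits near zero would support reading the strongest correlation as very weak rather than as a meaningful effect.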

The correlations suggest that the data is synthetic, as stated by the dataset author. In real fitness data, we would expect noticeable correlations between weight and calorie intake, as well as between physical activity and body changes. The absence of these expected relationships is consistent with variables that were generated independently and at random, without modeling the causal relationships present in real training and nutrition scenarios.
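The difference between independently generated columns and causally linked ones can be shown with a small simulation. Everything here is hypothetical: the means, spreads, and the linear link between calories and weight are arbitrary illustrative choices, not properties of the Gym dataset or of real physiology.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200  # illustrative sample size

calories = rng.normal(2500, 300, n)

# Case 1: weight drawn independently of calories
weight_independent = rng.normal(70, 2, n)

# Case 2: weight depends on calories plus noise (arbitrary illustrative link)
weight_linked = 70 + 0.01 * (calories - 2500) + rng.normal(0, 1, n)

r_independent = np.corrcoef(calories, weight_independent)[0, 1]
r_linked = np.corrcoef(calories, weight_linked)[0, 1]

print(f"independent columns: r = {r_independent:.3f}")
print(f"linked columns:      r = {r_linked:.3f}")
```

Independently drawn columns give a coefficient near zero, like the values in the matrix above, while the linked column produces a clearly positive one.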

- The next step is to find out which factor has the most influence on weight loss and gain.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

#Correlation of all variables with weight
weight_correlation = df[numeric_cols].corr()['Weight_kg'].sort_values(ascending=False)
print("CORRELATION WITH WEIGHT (Weight_kg)")
print(weight_correlation)# shows only the weight correlations with other variables.
print("\n")

In [None]:
#Visualize weight correlations in bars
plt.figure(figsize=(10, 6))
weight_filtered = weight_correlation.drop('Weight_kg')  # drop the correlation of weight with itself (always 1)
weight_filtered.plot(kind='barh', color='steelblue')  # horizontal bar chart
plt.title('Correlation of Variables with Weight_kg', fontsize=16, fontweight='bold')
plt.xlabel('Correlation', fontsize=12)
plt.ylabel('Variables', fontsize=12)
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

- There is almost no relationship between any of the variables and weight.

- The next step is the scatter plot, which visualizes the relationship between two numerical variables. If the points fall close to a diagonal line, the correlation is strong.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

features = ['Calories_Intake', 'Protein_Intake_g', 'Workout_Duration_min', 'Steps_Walked']#features
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

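for idx, feature in enumerate(features):
    # Scatter plot of each feature (x axis) against weight (y axis)
    axes[idx].scatter(df[feature], df['Weight_kg'],
                     alpha=0.6,
                     color='steelblue',
                     edgecolors='black',
                     s=60)
    axes[idx].set_xlabel(feature, fontsize=12)
    axes[idx].set_ylabel('Weight_kg', fontsize=12)

plt.tight_layout()
plt.show()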

- The points are widely scattered, with no clear relationship. To confirm this, let's draw a regression line.
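A degree-1 `np.polyfit`, as used for the trend line, is an ordinary least-squares fit of a straight line. The sketch below computes the same slope and intercept from the closed-form formulas and checks them against `np.polyfit`; `fit_line` is a hypothetical helper and the four points are made up.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    slope = (dx * (y - y.mean())).sum() / (dx ** 2).sum()  # covariance / variance of x
    intercept = y.mean() - slope * x.mean()                # line passes through the means
    return float(slope), float(intercept)

# Made-up points lying close to a straight line
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

slope, intercept = fit_line(x, y)
np_slope, np_intercept = np.polyfit(x, y, 1)  # coefficients, highest degree first

print(f"closed form: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"np.polyfit:  slope={np_slope:.3f}, intercept={np_intercept:.3f}")
```

When the correlation is close to zero, as in this dataset, the fitted slope is close to zero as well and the trend line comes out nearly horizontal.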

In [None]:
features = ['Calories_Intake', 'Protein_Intake_g', 'Workout_Duration_min', 'Steps_Walked']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for idx, feature in enumerate(features):
    axes[idx].scatter(df[feature], df['Weight_kg'], alpha=0.6)#Scatter plot
    # Trend line (linear regression)
    z = np.polyfit(df[feature], df['Weight_kg'], 1)
    p = np.poly1d(z)
    axes[idx].plot(df[feature], p(df[feature]),
                   color='red',
                   linestyle='--',
                   linewidth=2,
                   label='Linear Trend')
    
    corr = df[feature].corr(df['Weight_kg'])
    axes[idx].set_title(f'{feature} vs Weight_kg\nCorrelation: {corr:.3f}',
                       fontsize=14, fontweight='bold')
    axes[idx].set_xlabel(feature, fontsize=12)
    axes[idx].set_ylabel('Weight_kg', fontsize=12)
    axes[idx].legend(loc='best')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

- Once again, the plots show no clear relationship between any of the features and weight.