# Bivariate & Multivariate Exploration – Weight Lifestyle Dataset
### Objectives
- Quantify joint relationships between behavioural, nutritional, and physiological features.
- Prioritise features for modelling by measuring their association with the weight target.
- Document categorical distributions to surface dominant lifestyle archetypes.

### Data Assets
- Cleaned dataset: `../data/dataset_cleaned.csv`
- Generated charts: `../plots/matrice_corr/`, `../plots/correlation/`



## Analysis Plan
### Step 1 – Load Prepared Observations
- Read the cleaned dataset from disk and keep shared paths available for every downstream step.

### Step 2 – Build Feature Correlation Matrix
- Compute pairwise Pearson correlations across all numeric features, including the weight target, so the full relationship structure can be inspected.

### Step 3 – Visualise Correlation Structure
- Render the complete correlation heatmap and persist it to the correlation portfolio for review.

### Step 4 – Measure Feature–Target Associations
- Calculate Spearman rank correlations and export scatter plots to inspect monotonic trends against `Weight (kg)`.

### Step 5 – Profile Categorical Variables
- Generate detailed and summary frequency tables to document dominant lifestyle categories across the cohort.

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from scipy.stats import spearmanr

# Fix shared resources with explicit paths to keep reruns reproducible across environments.
DATA_DIRECTORY: Path = Path("..") / "data"
CLEANED_DATASET_PATH: Path = DATA_DIRECTORY / "dataset_cleaned.csv"
PLOTS_MATRIX_DIR: Path = Path("..") / "plots" / "matrice_corr"
PLOTS_CORRELATION_DIR: Path = Path("..") / "plots" / "correlation"
PLOTS_MATRIX_PATH: Path = PLOTS_MATRIX_DIR / "correlation_matrix.png"


In [57]:
def compute_correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute the Pearson correlation matrix (the `DataFrame.corr` default) for all numeric features, including the target variable.
    
    Args:
        df: DataFrame with all features including target
        
    Returns:
        Correlation matrix including Weight (kg) column
    """
    # Select only numeric columns to avoid type errors
    numeric_df = df.select_dtypes(include=[np.number])
    
    return numeric_df.corr()

In [58]:
# Load cleaned dataset using centralised path for consistency across notebooks.
df = pd.read_csv(CLEANED_DATASET_PATH)

# Compute correlation matrix
corr_matrix = compute_correlation_matrix(df)
print(f"Correlation matrix shape: {corr_matrix.shape}")

# Show only unique feature pairs (lower triangle) sorted by absolute correlation strength
mask = np.tril(np.ones_like(corr_matrix, dtype=bool), k=-1)
corr_pairs = corr_matrix.where(mask).stack().reset_index()
corr_pairs.columns = ['Feature_1', 'Feature_2', 'Correlation']

corr_pairs_sorted = corr_pairs.sort_values(by='Correlation', key=lambda s: s.abs(), ascending=False)
corr_pairs_sorted.head(20)


Correlation matrix shape: (19, 19)


| Index | Feature_1 | Feature_2 | Correlation |
| --- | --- | --- | --- |
| 156 | pct_maxHR | Avg_BPM | 0.840785 |
| 35 | Experience_Level | Workout_Frequency (days/week) | 0.83644 |
| 33 | Experience_Level | Session_Duration (hours) | 0.758127 |
| 26 | Workout_Frequency (days/week) | Session_Duration (hours) | 0.638039 |
| 155 | pct_maxHR | Max_BPM | -0.559991 |
| 16 | Water_Intake (liters) | Weight (kg) | 0.397971 |
| 34 | Experience_Level | Water_Intake (liters) | 0.311843 |
| 20 | Water_Intake (liters) | Session_Duration (hours) | 0.287751 |
| 27 | Workout_Frequency (days/week) | Water_Intake (liters) | 0.241312 |
| 83 | cholesterol_mg | Session_Duration (hours) | 0.092592 |


NOTE: We decided to drop the variables cal_from_macros, pct_HRR, Carbs, Protein, and Fasts because they are too strongly correlated with other features.
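
The drop decision above can be made reproducible by listing every feature pair whose absolute correlation exceeds a cutoff. The following is a minimal sketch, not the procedure actually used in this notebook: the helper name `list_redundant_pairs` and the 0.9 cutoff are assumptions chosen for illustration, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd


def list_redundant_pairs(corr_matrix: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Return feature pairs whose absolute correlation is >= threshold.

    `threshold` is an illustrative value, not the cutoff used in this notebook.
    """
    # Keep the strict lower triangle so each pair appears exactly once.
    mask = np.tril(np.ones(corr_matrix.shape, dtype=bool), k=-1)
    pairs = corr_matrix.where(mask).stack().reset_index()
    pairs.columns = ["Feature_1", "Feature_2", "Correlation"]

    # NaN entries (masked cells) compare as False and are filtered out here.
    strong = pairs[pairs["Correlation"].abs() >= threshold]
    return strong.sort_values(
        by="Correlation", key=lambda s: s.abs(), ascending=False
    ).reset_index(drop=True)


# Synthetic demo: `b` is an exact linear function of `a`, `c` is only weakly related.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
redundant_pairs = list_redundant_pairs(demo.corr())
print(redundant_pairs)
```

Applied to `corr_matrix`, a table like this would show which member of each pair is a candidate for removal; which one to drop remains a modelling judgment.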

In [59]:
def plot_correlation_matrix(corr_matrix: pd.DataFrame, output_path: Path) -> None:
    """
    Render a complete heatmap of the feature correlation matrix and save to disk.
    
    Args:
        corr_matrix: Correlation matrix (all numeric features)
        output_path: File path where the plot should be saved
    """
    plt.figure(figsize=(14, 12))
    
    # Use diverging colormap to highlight both positive and negative correlations
    sns.heatmap(
        corr_matrix,
        annot=False,  # Too many values would clutter the plot
        cmap='coolwarm',
        center=0,
        vmin=-1,
        vmax=1,
        square=True,
        linewidths=0.5,
        cbar_kws={'shrink': 0.8, 'label': 'Correlation coefficient'}
    )
    
    plt.title('Correlation Matrix - All Numeric Features', fontsize=16, pad=20)
    plt.tight_layout()
    
    # Ensure directory exists to avoid save-time failures on fresh environments.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    plt.close()
    
    print(f"Correlation matrix saved to: {output_path}")

In [60]:
# Plot and save correlation matrix using the shared path constant.
plot_correlation_matrix(corr_matrix, PLOTS_MATRIX_PATH)


Correlation matrix saved to: ../plots/matrice_corr/correlation_matrix.png


In [61]:
def plot_spearman_correlation_with_target(df: pd.DataFrame, target_column: str, output_dir: Path) -> pd.DataFrame:
    """
    Compute Spearman correlation between each numeric feature and target variable,
    then create scatter plots for each relationship.
    
    Args:
        df: DataFrame with all features including target
        target_column: Name of the target variable column
        output_dir: Directory where correlation plots should be persisted
        
    Returns:
        DataFrame with features and their Spearman correlation with target
    """
        
    # Select only numeric columns to avoid type errors when calling scipy helpers.
    numeric_df = df.select_dtypes(include=[np.number])
    
    # Exclude target from features so correlations highlight candidate predictors only.
    features = numeric_df.drop(columns=[target_column], errors='ignore')
    
    # Quantify monotonic relationships to prioritise the strongest feature signals.
    correlations = []
    for col in features.columns:
        corr, pvalue = spearmanr(df[col], df[target_column], nan_policy='omit')
        correlations.append({
            'Feature': col,
            'Spearman_Correlation': corr,
            'P_Value': pvalue
        })
    
    corr_df = pd.DataFrame(correlations).sort_values(by='Spearman_Correlation', key=lambda s: s.abs(), ascending=False)
    
    # Guarantee directory exists so batch plotting never fails on a fresh workspace.
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Visualise each relationship to spot non-linear patterns or heteroscedasticity.
    for _, row in corr_df.iterrows():
        feature = str(row['Feature'])
        corr_value = float(row['Spearman_Correlation'])
        
        plt.figure(figsize=(10, 6))
        plt.scatter(df[feature], df[target_column], alpha=0.5, s=20)
        plt.xlabel(feature, fontsize=12)
        plt.ylabel(target_column, fontsize=12)
        # Use a real newline so the coefficient appears on a second title line.
        plt.title(f'{feature} vs {target_column}\nSpearman ρ = {corr_value:.3f}', fontsize=14)
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        
        # Sanitize filenames to avoid issues across operating systems.
        safe_filename = feature.replace(' ', '_').replace('/', '_').replace('(', '').replace(')', '')
        plot_path = output_dir / f'{safe_filename}_vs_Weight.png'
        plt.savefig(plot_path, dpi=300, bbox_inches='tight')
        plt.close()
    
    print(f"Created {len(corr_df)} correlation plots in {output_dir}")
    
    return corr_df


In [62]:
# Compute and plot Spearman correlations with target variable
spearman_results = plot_spearman_correlation_with_target(
    df=df,
    target_column='Weight (kg)',
    output_dir=PLOTS_CORRELATION_DIR
)

# Display correlation results with rounded p-values
spearman_results_display = spearman_results.copy()
spearman_results_display['P_Value'] = spearman_results_display['P_Value'].round(4)
spearman_results_display


Created 18 correlation plots in ../plots/correlation


| Index | Feature | Spearman_Correlation | P_Value |
| --- | --- | --- | --- |
| 5 | Water_Intake (liters) | 0.415222 | 0.0 |
| 7 | Experience_Level | 0.072177 | 0.0 |
| 1 | Max_BPM | 0.063006 | 0.0 |
| 15 | cook_time_min | -0.049948 | 0.0 |
| 6 | Workout_Frequency (days/week) | 0.048058 | 0.0 |
| 8 | Daily meals frequency | 0.045625 | 0.0 |
| 13 | serving_size_g | 0.044526 | 0.0 |
| 4 | Session_Duration (hours) | 0.043467 | 0.0 |
| 0 | Age | -0.041913 | 0.0 |
| 3 | Resting_BPM | -0.034753 | 0.0 |
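
Rounding to four decimals turns every very small p-value into `0.0`, which hides how small it actually is. One possible alternative is to switch to scientific notation below a floor value. This is a sketch only: the helper `format_p_value` and its `floor` parameter are hypothetical, and the demo values are synthetic rather than results from this dataset.

```python
import pandas as pd


def format_p_value(p: float, floor: float = 1e-4) -> str:
    """Format a p-value, keeping tiny values readable instead of rounding them to 0."""
    if pd.isna(p):
        return "n/a"
    if p < floor:
        # Scientific notation preserves the order of magnitude.
        return f"{p:.2e}"
    return f"{p:.4f}"


# Synthetic values for illustration only; they are not results from this dataset.
demo = pd.DataFrame({"Feature": ["x1", "x2"], "P_Value": [3.2e-12, 0.0417]})
demo["P_Value_display"] = demo["P_Value"].map(format_p_value)
print(demo)
```

The same mapping could be applied to `spearman_results['P_Value']` in place of `.round(4)`.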


The calorie feature is very strongly correlated with the target, with a near-linear relationship. It may give the models too much information, but it is part of a person's lifestyle and gives no indication of their height, so we keep the feature.
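
A near-linear relationship can be checked by comparing the Pearson and Spearman coefficients: when a relationship is monotonic and roughly linear the two are close, whereas a monotonic but curved relationship keeps Spearman high while Pearson falls. The sketch below uses synthetic arrays only; the helper name `linearity_gap` is hypothetical and nothing here is computed from the notebook's data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(seed=0)

# Synthetic stand-ins: these arrays are NOT drawn from the notebook's dataset.
x = rng.uniform(1.0, 5.0, size=500)
linear_y = 3.0 * x + rng.normal(0.0, 0.1, size=500)  # near-linear relation
curved_y = np.exp(2.0 * x)                            # monotonic but non-linear


def linearity_gap(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rho minus Pearson r; values near 0 suggest a roughly linear trend."""
    pearson_r = pearsonr(a, b)[0]
    spearman_rho = spearmanr(a, b)[0]
    return float(spearman_rho - pearson_r)


gap_linear = linearity_gap(x, linear_y)
gap_curved = linearity_gap(x, curved_y)
print(f"gap (linear): {gap_linear:.3f}")
print(f"gap (curved): {gap_curved:.3f}")
```

On the real data, the two arguments would be the calorie column and `df['Weight (kg)']`; the exact column name is not shown in this section, so it is left unspecified here.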

# Categorical Variable Analysis

## 1. Univariate Descriptive Statistics


In [63]:
def compute_categorical_frequencies(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """
    Compute absolute and relative frequencies for a categorical variable.
    
    Args:
        df: DataFrame containing the data
        column: Name of the categorical column to analyze
        
    Returns:
        DataFrame with absolute frequency, relative frequency, and cumulative frequency
    """
    # Capture absolute counts to keep the raw population scale accessible.
    freq_abs = df[column].value_counts().sort_index()
    
    # Convert counts to percentages so categories remain comparable across samples.
    freq_rel = (freq_abs / len(df) * 100).round(2)
    
    # Track cumulative share to highlight how quickly categories saturate the population.
    freq_cum = freq_rel.cumsum().round(2)
    
    # Bundle metrics into one table to feed the reporting cells directly.
    result = pd.DataFrame({
        'Fréquence absolue': freq_abs,
        'Fréquence relative (%)': freq_rel,
        'Fréquence cumulative (%)': freq_cum
    })
    
    # Append totals to simplify quick validation during presentations.
    result.loc['TOTAL'] = [freq_abs.sum(), 100.0, 100.0]
    
    return result
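
The function above divides by `len(df)`, while `value_counts()` omits missing values by default. If a column contained NaN, the relative frequencies would sum to less than 100 even though the TOTAL row reports 100. The cleaned dataset may well have no missing values, so this is a defensive sketch rather than a correction: the helper name `frequency_table_with_missing`, the `(missing)` label, and the English column names are all assumptions.

```python
import pandas as pd


def frequency_table_with_missing(series: pd.Series) -> pd.DataFrame:
    """Absolute, relative, and cumulative frequencies, counting NaN as its own row."""
    # Give missing values an explicit label so they are counted like any category.
    labelled = series.astype(object).where(series.notna(), "(missing)")
    counts = labelled.value_counts().sort_index()

    # Divide by the counted total so the shares always sum to 100.
    share = counts / counts.sum() * 100
    return pd.DataFrame({
        "count": counts,
        "share_pct": share.round(2),
        "cumulative_pct": share.cumsum().round(2),
    })


# Synthetic demo column with one missing entry.
demo_series = pd.Series(["Yoga", "HIIT", "Yoga", None, "Cardio"], name="Workout_Type")
demo_table = frequency_table_with_missing(demo_series)
print(demo_table)
```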


In [64]:
def analyze_categorical_variables(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute descriptive statistics for all categorical variables.
    
    Args:
        df: DataFrame containing the data
        
    Returns:
        DataFrame with mode, unique values count, and most frequent value percentage
    """
    # Restrict to categorical columns because numerics follow a dedicated analysis track.
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    
    stats = []
    for col in categorical_cols:
        mode_value = df[col].mode()[0]  # Favour the first mode to keep the summary deterministic.
        n_unique = df[col].nunique()
        mode_freq = (df[col] == mode_value).sum()
        mode_pct = (mode_freq / len(df) * 100).round(2)
        
        stats.append({
            'Variable': col,
            'Mode': mode_value,
            'Fréquence du mode': mode_freq,
            'Fréquence du mode (%)': mode_pct,
            'Valeurs uniques': n_unique
        })
    
    return pd.DataFrame(stats)


In [65]:
# Surface detailed frequencies for each categorical variable to uncover imbalances.
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

for col in categorical_cols:
    print(f"\n{'='*80}")
    print(f"Variable: {col}")
    print('='*80)
    freq_table = compute_categorical_frequencies(df, col)
    print(freq_table)
    print(f"\nMode: {df[col].mode()[0]}")
    print(f"Nombre de valeurs uniques: {df[col].nunique()}")



Variable: Gender
        Fréquence absolue  Fréquence relative (%)  Fréquence cumulative (%)
Gender                                                                     
Female            10028.0                   50.14                     50.14
Male               9972.0                   49.86                    100.00
TOTAL             20000.0                  100.00                    100.00

Mode: Female
Nombre de valeurs uniques: 2

Variable: Workout_Type
              Fréquence absolue  Fréquence relative (%)  \
Workout_Type                                              
Cardio                   4923.0                   24.62   
HIIT                     4974.0                   24.87   
Strength                 5071.0                   25.36   
Yoga                     5032.0                   25.16   
TOTAL                   20000.0                  100.00   

              Fréquence cumulative (%)  
Workout_Type                            
Cardio                           24.62 

In [66]:
# Build an aggregated view to compare categorical dominance at a glance.
categorical_stats = analyze_categorical_variables(df)
print("=== Vue d'ensemble des variables catégorielles ===\n")
categorical_stats

=== Vue d'ensemble des variables catégorielles ===



| Index | Variable | Mode | Fréquence du mode | Fréquence du mode (%) | Valeurs uniques |
| --- | --- | --- | --- | --- | --- |
| 0 | Gender | Female | 10028 | 50.14 | 2 |
| 1 | Workout_Type | Strength | 5071 | 25.36 | 4 |
| 2 | diet_type | Paleo | 3403 | 17.02 | 6 |
| 3 | cooking_method | Baked | 2953 | 14.76 | 7 |
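
The mode shares above sit close to an even split (for example 25.36% for a four-level variable), so it is worth checking whether any category is dominant beyond what chance would produce. A chi-square goodness-of-fit test against equal shares is one way to do that. This test is not part of the original notebook; the sketch below reuses the `Workout_Type` counts printed in the frequency tables earlier.

```python
from scipy.stats import chisquare

# Observed Workout_Type counts copied from the frequency table printed earlier.
observed = {"Cardio": 4923, "HIIT": 4974, "Strength": 5071, "Yoga": 5032}

# With no expected frequencies supplied, chisquare tests against equal shares.
statistic, p_value = chisquare(list(observed.values()))
print(f"chi2 = {statistic:.3f}, p = {p_value:.3f}")
```

A p-value above the chosen significance level would mean the data give no evidence that any workout type is over-represented.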
