# Predicting overtraining risk
## Goal: use a person's snapshot measurements to estimate whether they are at risk of overtraining, based on correlations between the variables.
### Key variables to analyze:
- **Resting_BPM**: a high resting heart rate can indicate physiological stress.
- **Fat_Percentage** and **BMI**: an imbalance can suggest a disrupted metabolism.
- **Workout_Frequency** and **Session_Duration**: excessive frequency or duration can be a risk factor.
- **Calories_Burned**: a value that is abnormally low or high relative to the group average.
- **Experience_Level**: beginners and experienced athletes do not face the same risks.

### Import

In [14]:
import pandas as pd
import numpy as np

### Consolidating the data
We have two datasets with the same columns; we concatenate them into a single DataFrame and save it as a CSV.

In [15]:
df1 = pd.read_csv('exercise_tracking.csv')
df1.head()

Unnamed: 0,Age,Gender,Weight (kg),Height (m),Max_BPM,Avg_BPM,Resting_BPM,Session_Duration (hours),Calories_Burned,Workout_Type,Fat_Percentage,Water_Intake (liters),Workout_Frequency (days/week),Experience_Level,BMI
0,56,Male,88.3,1.71,180,157,60,1.69,1313.0,Yoga,12.6,3.5,4,3,30.2
1,46,Female,74.9,1.53,179,151,66,1.3,883.0,HIIT,33.9,2.1,4,2,32.0
2,32,Female,68.1,1.66,167,122,54,1.11,677.0,Cardio,33.4,2.3,4,2,24.71
3,25,Male,53.2,1.7,190,164,56,0.59,532.0,Strength,28.8,2.1,3,1,18.41
4,38,Male,46.1,1.79,188,158,68,0.64,556.0,Strength,29.2,2.8,3,1,14.39


In [16]:
df2 = pd.read_csv('exercise_tracking_synthetic_data.csv')
df2.head()

Unnamed: 0,Age,Gender,Weight (kg),Height (m),Max_BPM,Avg_BPM,Resting_BPM,Session_Duration (hours),Calories_Burned,Workout_Type,Fat_Percentage,Water_Intake (liters),Workout_Frequency (days/week),Experience_Level,BMI
0,34.0,Female,86.7,1.86,174,152.0,74.0,1.12,712.0,Strength,12.8,2.4,5.0,2.0,14.31
1,26.0,Female,84.7,1.83,166,156.0,73.0,1.0,833.0,Strength,27.9,2.8,5.0,2.0,33.49
2,22.0,Male,64.8,1.85,187,166.0,64.0,1.24,1678.0,Cardio,28.7,1.9,3.0,2.0,12.73
3,54.0,Female,75.3,1.82,187,169.0,58.0,1.45,628.0,Cardio,31.8,2.4,4.0,1.0,20.37
4,34.0,Female,52.8,1.74,177,169.0,66.0,1.6,1286.0,Strength,26.4,3.2,4.0,2.0,20.83


In [17]:
df1.shape, df2.shape

((973, 15), (1800, 15))

In [18]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 973 entries, 0 to 972
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Age                            973 non-null    int64  
 1   Gender                         973 non-null    object 
 2   Weight (kg)                    973 non-null    float64
 3   Height (m)                     973 non-null    float64
 4   Max_BPM                        973 non-null    int64  
 5   Avg_BPM                        973 non-null    int64  
 6   Resting_BPM                    973 non-null    int64  
 7   Session_Duration (hours)       973 non-null    float64
 8   Calories_Burned                973 non-null    float64
 9   Workout_Type                   973 non-null    object 
 10  Fat_Percentage                 973 non-null    float64
 11  Water_Intake (liters)          973 non-null    float64
 12  Workout_Frequency (days/week)  973 non-null    int

In [19]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Age                            1790 non-null   float64
 1   Gender                         1729 non-null   object 
 2   Weight (kg)                    1778 non-null   float64
 3   Height (m)                     1774 non-null   float64
 4   Max_BPM                        1779 non-null   object 
 5   Avg_BPM                        1770 non-null   float64
 6   Resting_BPM                    1781 non-null   float64
 7   Session_Duration (hours)       1777 non-null   float64
 8   Calories_Burned                1777 non-null   float64
 9   Workout_Type                   1739 non-null   object 
 10  Fat_Percentage                 1784 non-null   float64
 11  Water_Intake (liters)          1776 non-null   float64
 12  Workout_Frequency (days/week)  1742 non-null   f

In [20]:
# Cast df2's columns to df1's dtypes, since df1's are the more consistent ones
for col in df1.columns:
    dtype = df1[col].dtype
    if dtype == 'int64':
        # errors='coerce' turns unparseable values into NaN; NaN is then filled with 0
        # so the int64 cast succeeds (note: missing values become 0, e.g. Age == 0)
        df2[col] = pd.to_numeric(df2[col], errors='coerce').fillna(0).astype('int64')
    elif dtype == 'float64':
        df2[col] = pd.to_numeric(df2[col], errors='coerce').fillna(0.0)
    else:
        df2[col] = df2[col].astype(dtype)

# Check the resulting dtypes
print(df2.dtypes)

Age                                int64
Gender                            object
Weight (kg)                      float64
Height (m)                       float64
Max_BPM                            int64
Avg_BPM                            int64
Resting_BPM                        int64
Session_Duration (hours)         float64
Calories_Burned                  float64
Workout_Type                      object
Fat_Percentage                   float64
Water_Intake (liters)            float64
Workout_Frequency (days/week)      int64
Experience_Level                   int64
BMI                              float64
dtype: object
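
The conversion above replaces missing or unparseable values with 0, which later surfaces as an age of 0. As a possible alternative, here is a minimal sketch on a hypothetical `raw` series (not the notebook's data, and not the author's method) that keeps such entries as `<NA>` by using pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical raw column mixing numeric strings, text and a missing value
raw = pd.Series(["174", "abc", None, "166"])

# errors='coerce' turns unparseable entries into NaN
numeric = pd.to_numeric(raw, errors="coerce")

# Option A (as in the cell above): fill with 0, then cast to int64
filled = numeric.fillna(0).astype("int64")

# Option B: the nullable integer dtype keeps missing entries as <NA>
nullable = numeric.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

With option B the missing entries remain distinguishable from real zeros, at the cost of having to handle `<NA>` explicitly in later computations.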


In [21]:
df = pd.concat([df1, df2], ignore_index=True)
df.shape

(2773, 15)
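
Before concatenating, it can help to confirm that both frames really share the same columns. The sketch below does this on two hypothetical two-row frames; `a`, `b` and the `Source` column are illustrative names that do not appear in the notebook.

```python
import pandas as pd

a = pd.DataFrame({"Age": [56, 46], "Resting_BPM": [60, 66]})
b = pd.DataFrame({"Age": [34, 26], "Resting_BPM": [74, 73]})

# Fail early if the column names or their order differ
assert list(a.columns) == list(b.columns)

# Tag each row with its origin so synthetic rows can be told apart later
combined = pd.concat(
    [a.assign(Source="original"), b.assign(Source="synthetic")],
    ignore_index=True,
)

print(combined.shape)
```

Keeping a provenance column is optional, but it makes it possible to evaluate a model separately on original and synthetic rows.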

In [22]:
df.to_csv('consolidated_exercise_tracking.csv', index=False)

## Target column
In this section we add the Y column that will serve as the classification target.\
Steps:
- Build the DataFrames that store the reference values for resting heart rate and body fat percentage.
- Use these reference tables to apply a first condition on the Resting_BPM column.

In [23]:
df.Age.min(), df.Age.max()

(0, 59)

In [35]:
ages = df.Age.unique()
ages.sort()
ages

array([ 0, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59])
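
The 0 in this list is not a real age: it comes from the earlier `fillna(0)` step applied to missing values. One possible way to count such placeholder zeros per column, sketched on a hypothetical frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [56, 0, 32, 0],
    "Resting_BPM": [60, 66, 0, 56],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Count exact zeros in the numeric columns only
zero_counts = (toy.select_dtypes("number") == 0).sum()
print(zero_counts.to_dict())

# Rows with a plausible age, if one chose to exclude the placeholders
valid_age = toy[toy["Age"] > 0]
print(len(valid_age))
```

Whether to drop, impute or keep these rows is a modelling decision; the sketch only makes the placeholders visible.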

In [None]:
# ============================================================================
# REFERENCE DATA: Age and Gender-Adjusted Normal Values
# ============================================================================

# 1. Resting Heart Rate Reference by Age
resting_hr_reference = pd.DataFrame({
    'Age_Min': [18, 26, 36, 46, 56, 66],
    'Age_Max': [25, 35, 45, 55, 65, 100],
    'Normal_Min': [60, 60, 62, 62, 64, 64],
    'Normal_Max': [73, 75, 76, 77, 78, 79],
    'Athlete_Max': [55, 57, 58, 59, 60, 62]
})

# 2. Body Fat Percentage Reference by Gender and Age
body_fat_reference = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female'],
    'Age_Min': [18, 30, 40, 50, 18, 30, 40, 50],
    'Age_Max': [29, 39, 49, 100, 29, 39, 49, 100],
    'Essential_Max': [5, 5, 5, 5, 13, 13, 13, 13],
    'Athlete_Max': [13, 13, 13, 13, 20, 20, 20, 20],
    'Fitness_Max': [17, 18, 19, 20, 24, 25, 26, 27],
    'Normal_Max': [24, 25, 26, 27, 31, 32, 33, 34],
    'High_Threshold': [25, 26, 27, 28, 32, 33, 34, 35]
})

# 3. Maximum Heart Rate Reference (simple 220 - age estimate, no further adjustment)
def get_max_hr_expected(age):
    """Calculate expected maximum heart rate by age"""
    return 220 - age

# 4. BMI Reference (standard for adults; each value is the upper bound of the named band)
bmi_reference = {
    'Underweight': 18.5,
    'Normal': 25.0,
    'Overweight': 30.0,
    'Obese': 35.0
}

In [25]:
def get_resting_hr_range(age):
    """Get normal resting HR range for age"""
    ref = resting_hr_reference[
        (resting_hr_reference['Age_Min'] <= age) & 
        (resting_hr_reference['Age_Max'] >= age)
    ]
    if len(ref) > 0:
        return ref.iloc[0]['Normal_Min'], ref.iloc[0]['Normal_Max'], ref.iloc[0]['Athlete_Max']
    return 60, 75, 55  # Default if age out of range

def get_body_fat_range(age, gender):
    """Get normal body fat range for age and gender"""
    ref = body_fat_reference[
        (body_fat_reference['Gender'] == gender) &
        (body_fat_reference['Age_Min'] <= age) & 
        (body_fat_reference['Age_Max'] >= age)
    ]
    if len(ref) > 0:
        return (ref.iloc[0]['Essential_Max'], ref.iloc[0]['Athlete_Max'], 
                ref.iloc[0]['Fitness_Max'], ref.iloc[0]['Normal_Max'], 
                ref.iloc[0]['High_Threshold'])
    # Default values if not found
    if gender == 'Male':
        return 5, 13, 17, 24, 25
    else:
        return 13, 20, 24, 31, 32
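
The lookup helpers above filter the reference table once per call. A possible vectorised variant, sketched here with `pd.IntervalIndex` (an alternative technique, not the notebook's approach), maps a whole array of ages to their band in one call. Ages that fall outside every band get index -1, which is where the default values would apply.

```python
import numpy as np
import pandas as pd

# Same resting heart rate bands as the reference table (subset of columns)
resting_hr_reference = pd.DataFrame({
    "Age_Min": [18, 26, 36, 46, 56, 66],
    "Age_Max": [25, 35, 45, 55, 65, 100],
    "Normal_Max": [73, 75, 76, 77, 78, 79],
})

# Closed on both sides to match the <= / >= comparisons in the helper
bands = pd.IntervalIndex.from_arrays(
    resting_hr_reference["Age_Min"].to_numpy(),
    resting_hr_reference["Age_Max"].to_numpy(),
    closed="both",
)

ages = np.array([0, 18, 25, 26, 59])
band_idx = bands.get_indexer(ages)  # -1 means no band matched

# Fall back to the helper's default Normal_Max of 75 when no band matches
normal_max = np.where(
    band_idx >= 0,
    resting_hr_reference["Normal_Max"].to_numpy()[band_idx],
    75,
)
print(band_idx.tolist())
print(normal_max.tolist())
```

Note that an age of 0 silently receives the default range here, just as it does in `get_resting_hr_range`.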

In [26]:
def calculate_overtraining_risk(row):
    """
    Calculate overtraining risk based on age and gender-adjusted reference values
    Returns: risk_binary (int), risk_level (str), risk_score (numeric), details (dict)
    """
    risk_score = 0
    details = {}
    
    age = row['Age']
    gender = row['Gender']
    
    # -------------------------------------------------------------------------
    # 1. RESTING HEART RATE ANALYSIS (Most important indicator)
    # -------------------------------------------------------------------------
    hr_min, hr_max, hr_athlete = get_resting_hr_range(age)
    resting_bpm = row['Resting_BPM']
    
    if resting_bpm > hr_max + 10:
        risk_score += 3  # Severely elevated
        details['Resting_HR'] = f"Severely elevated ({resting_bpm} vs normal {hr_min}-{hr_max})"
    elif resting_bpm > hr_max + 5:
        risk_score += 2  # Moderately elevated
        details['Resting_HR'] = f"Elevated ({resting_bpm} vs normal {hr_min}-{hr_max})"
    elif resting_bpm > hr_max:
        risk_score += 1  # Slightly elevated
        details['Resting_HR'] = f"Slightly elevated ({resting_bpm} vs normal {hr_min}-{hr_max})"
    else:
        details['Resting_HR'] = "Normal"
    
    # -------------------------------------------------------------------------
    # 2. HEART RATE RESERVE (Max HR - Resting HR)
    # -------------------------------------------------------------------------
    expected_max_hr = get_max_hr_expected(age)
    hr_reserve = row['Max_BPM'] - row['Resting_BPM']
    expected_reserve = expected_max_hr - hr_max
    
    # Low HR reserve indicates poor cardiovascular fitness or overtraining
    if hr_reserve < expected_reserve * 0.7:  # Less than 70% of expected
        risk_score += 2
        details['HR_Reserve'] = f"Low reserve ({hr_reserve} vs expected ~{expected_reserve})"
    elif hr_reserve < expected_reserve * 0.85:
        risk_score += 1
        details['HR_Reserve'] = f"Below average ({hr_reserve} vs expected ~{expected_reserve})"
    else:
        details['HR_Reserve'] = "Adequate"
    
    # -------------------------------------------------------------------------
    # 3. BODY FAT PERCENTAGE (Age and Gender Adjusted)
    # -------------------------------------------------------------------------
    essential, athlete, fitness, normal, high = get_body_fat_range(age, gender)
    fat_pct = row['Fat_Percentage']
    
    if fat_pct < essential:
        risk_score += 2  # Too low - hormonal issues, poor recovery
        details['Body_Fat'] = f"Dangerously low ({fat_pct}% vs min {essential}%)"
    elif fat_pct > high:
        risk_score += 1  # Too high - metabolic stress
        details['Body_Fat'] = f"High ({fat_pct}% vs normal max {normal}%)"
    elif fat_pct > normal:
        risk_score += 0.5
        details['Body_Fat'] = f"Above normal ({fat_pct}% vs normal max {normal}%)"
    else:
        details['Body_Fat'] = "Normal range"
    
    # -------------------------------------------------------------------------
    # 4. TRAINING VOLUME (Frequency × Duration)
    # -------------------------------------------------------------------------
    workout_freq = row['Workout_Frequency (days/week)']
    session_duration = row['Session_Duration (hours)']
    training_load = workout_freq * session_duration
    
    # High training load thresholds
    if training_load > 12:  # e.g., 6 days × 2+ hours
        risk_score += 3
        details['Training_Load'] = f"Excessive ({training_load:.1f} hrs/week)"
    elif training_load > 10:
        risk_score += 2
        details['Training_Load'] = f"Very high ({training_load:.1f} hrs/week)"
    elif training_load > 8:
        risk_score += 1
        details['Training_Load'] = f"High ({training_load:.1f} hrs/week)"
    else:
        details['Training_Load'] = "Moderate"
    
    # Single session too long
    if session_duration > 2.5:
        risk_score += 1
        details['Session_Duration'] = f"Very long sessions ({session_duration} hrs)"
    elif session_duration > 2.0:
        risk_score += 0.5
        details['Session_Duration'] = f"Long sessions ({session_duration} hrs)"
    
    # -------------------------------------------------------------------------
    # 5. RECOVERY INDICATORS
    # -------------------------------------------------------------------------
    # Water intake (hydration is crucial for recovery)
    water_intake = row['Water_Intake (liters)']
    if water_intake < 2.0:
        risk_score += 1
        details['Hydration'] = f"Low water intake ({water_intake}L)"
    elif water_intake < 2.5:
        risk_score += 0.5
        details['Hydration'] = f"Suboptimal hydration ({water_intake}L)"
    else:
        details['Hydration'] = "Adequate"
    
    # -------------------------------------------------------------------------
    # 6. EXPERIENCE LEVEL MODIFIER
    # -------------------------------------------------------------------------
    experience = row['Experience_Level']
    
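    # Beginners are more susceptible to overtraining.
    # Note: the multiplier scales only the points accumulated in sections 1-5;
    # the BMI and caloric-efficiency points (sections 7-8) are added afterwards.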
    if experience == 1:  # Beginner
        risk_score *= 1.3  # 30% increase in risk
        details['Experience'] = "Beginner (higher susceptibility)"
    elif experience == 2:  # Intermediate
        risk_score *= 1.1  # 10% increase
        details['Experience'] = "Intermediate"
    else:  # Advanced
        details['Experience'] = "Advanced"
    
    # -------------------------------------------------------------------------
    # 7. BMI CONSIDERATIONS
    # -------------------------------------------------------------------------
    bmi = row['BMI']
    if bmi < bmi_reference['Underweight']:
        risk_score += 1
        details['BMI'] = f"Underweight ({bmi:.1f})"
    elif bmi > bmi_reference['Overweight']:  # above 30.0, i.e. past the overweight band
        risk_score += 1
        details['BMI'] = f"Overweight ({bmi:.1f})"
    
    # -------------------------------------------------------------------------
    # 8. CALORIC EFFICIENCY
    # -------------------------------------------------------------------------
    calories_burned = row['Calories_Burned']
    calories_per_hour = calories_burned / session_duration if session_duration > 0 else 0
    
    # Very low calorie burn might indicate fatigue/inefficiency
    # (rows whose duration was imputed as 0 also land here, with 0 cal/hr)
    if calories_per_hour < 250:
        risk_score += 1
        details['Caloric_Efficiency'] = f"Low efficiency ({calories_per_hour:.0f} cal/hr)"
    
    # -------------------------------------------------------------------------
    # FINAL CLASSIFICATION
    # -------------------------------------------------------------------------
    if risk_score >= 7:
        risk_level = "HIGH"
        risk_binary = 1
    elif risk_score >= 4:
        risk_level = "MODERATE"
        risk_binary = 1
    else:
        risk_level = "LOW"
        risk_binary = 0
    
    return risk_binary, risk_level, risk_score, details
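
As a worked check of the scoring logic, the sketch below re-derives by hand the score of 2.10 that the results table reports for the sample row at index 1 (46-year-old female, Resting_BPM 66, Max_BPM 179, Fat_Percentage 33.9, Water_Intake 2.1 L, Experience_Level 2, BMI 32.0). The values are copied from that row and from the reference tables; the variable names are illustrative.

```python
# Values for the sample row (index 1) and its reference bands
age, resting_bpm, max_bpm = 46, 66, 179
hr_max = 77                      # Normal_Max for ages 46-55
normal_fat, high_fat = 33, 34    # Female, ages 40-49
fat_pct, water, bmi = 33.9, 2.1, 32.0

# Heart rate reserve: observed vs expected (113 >= 0.85 * 97, so no points)
hr_reserve = max_bpm - resting_bpm
expected_reserve = (220 - age) - hr_max

score = 0.0
score += 0.5 if normal_fat < fat_pct <= high_fat else 0.0  # body fat above normal
score += 0.5 if 2.0 <= water < 2.5 else 0.0                # suboptimal hydration
score *= 1.1                                               # intermediate experience
score += 1.0 if bmi > 30.0 else 0.0                        # BMI above 30, added after the multiplier

print(hr_reserve, expected_reserve, round(score, 2))
```

The resting heart rate, training load, session duration and caloric-efficiency checks contribute nothing for this row, so the total is (0.5 + 0.5) × 1.1 + 1 = 2.1, below the MODERATE threshold of 4.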

In [27]:
# ============================================================================
# APPLY TO DATAFRAME
# ============================================================================

def add_overtraining_risk_column(df):
    """
    Add overtraining risk columns to the dataframe
    
    Parameters:
    df: DataFrame with required columns
    
    Returns:
    DataFrame with added columns:
        - Overtraining_Risk (binary: 0 or 1)
        - Risk_Level (LOW, MODERATE, HIGH)
        - Risk_Score (continuous score)
        - Risk_Details (dictionary with explanation)
    """
    results = df.apply(calculate_overtraining_risk, axis=1, result_type='expand')
    results.columns = ['Overtraining_Risk', 'Risk_Level', 'Risk_Score', 'Risk_Details']
    
    df_with_risk = pd.concat([df, results], axis=1)
    
    return df_with_risk
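
`result_type='expand'` is what turns the tuple returned for each row into separate columns. A minimal illustration with a hypothetical scoring function `toy_score` (not the real risk function):

```python
import pandas as pd

def toy_score(row):
    """Return (flag, label, details) for one row; purely illustrative."""
    flag = int(row["Resting_BPM"] > 75)
    label = "HIGH" if flag else "LOW"
    return flag, label, {"Resting_BPM": row["Resting_BPM"]}

toy = pd.DataFrame({"Resting_BPM": [60, 80]})

# Each element of the returned tuple becomes its own column (0, 1, 2)
expanded = toy.apply(toy_score, axis=1, result_type="expand")
expanded.columns = ["Flag", "Label", "Details"]

out = pd.concat([toy, expanded], axis=1)
print(out[["Resting_BPM", "Flag", "Label"]])
```

Without `result_type='expand'`, `apply` would return a single Series of tuples, and the column renaming step would fail.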

In [36]:
new_df = add_overtraining_risk_column(df)
new_df

Unnamed: 0,Age,Gender,Weight (kg),Height (m),Max_BPM,Avg_BPM,Resting_BPM,Session_Duration (hours),Calories_Burned,Workout_Type,Fat_Percentage,Water_Intake (liters),Workout_Frequency (days/week),Experience_Level,BMI,Overtraining_Risk,Risk_Level,Risk_Score,Risk_Details
0,56,Male,88.3,1.71,180,157,60,1.69,1313.0,Yoga,12.6,3.5,4,3,30.20,0,LOW,1.00,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."
1,46,Female,74.9,1.53,179,151,66,1.30,883.0,HIIT,33.9,2.1,4,2,32.00,0,LOW,2.10,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."
2,32,Female,68.1,1.66,167,122,54,1.11,677.0,Cardio,33.4,2.3,4,2,24.71,0,LOW,1.65,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."
3,25,Male,53.2,1.70,190,164,56,0.59,532.0,Strength,28.8,2.1,3,1,18.41,0,LOW,2.95,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."
4,38,Male,46.1,1.79,188,158,68,0.64,556.0,Strength,29.2,2.8,3,1,14.39,0,LOW,2.30,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2768,54,Male,88.5,2.00,173,134,58,1.11,1388.0,HIIT,27.7,3.7,3,2,36.73,0,LOW,1.55,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."
2769,52,Male,84.3,1.69,164,169,54,0.77,1367.0,HIIT,32.6,2.9,3,2,15.11,0,LOW,2.10,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."
2770,47,Male,70.1,1.84,188,129,67,1.20,1261.0,Strength,28.4,2.5,3,2,17.99,0,LOW,2.10,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."
2771,35,Male,49.3,1.71,180,152,73,1.04,956.0,Cardio,32.9,1.7,4,3,12.65,0,LOW,3.00,"{'Resting_HR': 'Normal', 'HR_Reserve': 'Adequa..."


In [48]:
new_df[new_df['Risk_Level'] == 'LOW'].shape, new_df[new_df['Risk_Level'] == 'MODERATE'].shape

((2712, 19), (61, 19))

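**Note:** the classes are heavily imbalanced (2712 LOW vs. 61 MODERATE, and no HIGH rows), so we will need a way to increase the number of MODERATE samples before training; otherwise a classifier is likely to underfit the minority class.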
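
One simple option for increasing the number of MODERATE samples is random oversampling of the minority class with replacement. The sketch below uses plain pandas on a hypothetical 10-row frame; it is only one of several possible approaches and has not been run on the notebook's data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Risk_Score": [1.0, 2.1, 1.65, 2.95, 2.3, 1.55, 2.1, 3.0, 4.2, 5.1],
    "Risk_Level": ["LOW"] * 8 + ["MODERATE"] * 2,
})

counts = toy["Risk_Level"].value_counts()
majority_size = int(counts.max())

# Resample each class up to the size of the largest one;
# only classes smaller than that are drawn with replacement
balanced = pd.concat(
    [
        group.sample(
            n=majority_size,
            replace=len(group) < majority_size,
            random_state=0,
        )
        for _, group in toy.groupby("Risk_Level")
    ],
    ignore_index=True,
)

print(balanced["Risk_Level"].value_counts().to_dict())
```

If this route is taken, oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation set.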