# Gym Workout Pattern Mining - Data Augmentation

## Objective
Merge and augment two datasets to produce exercise-level data for Association Rules Mining.

**Input:**
- Dataset 1: gym_members_exercise_tracking.csv (973 users)
- Dataset 2: megaGymDataset.csv (2,918 exercises)

**Output:** gym_workout_sessions.csv (51,633 records in the run recorded below, one row per exercise performed in a session)

---

## CELL 1: Dataset 1 - Gym Members

| Column | Meaning |
|--------|--------|
| Age | Age (years) |
| Gender | Gender (Male/Female) |
| Weight (kg) | Body weight |
| Height (m) | Height |
| Max_BPM | Maximum heart rate |
| Avg_BPM | Average heart rate |
| Resting_BPM | Resting heart rate |
| Session_Duration (hours) | Duration of a workout session |
| Calories_Burned | Total calories burned |
| Workout_Type | Type: Strength/Cardio/HIIT/Yoga |
| Fat_Percentage | Body fat (%) |
| Water_Intake (liters) | Water intake |
| Workout_Frequency (days/week) | Workout days per week |
| Experience_Level | 1=Beginner, 2=Intermediate, 3=Expert |
| BMI | Body Mass Index |

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

# Load Dataset 1
gym_members = pd.read_csv('./data/gym_members_exercise_tracking.csv')
print(f"Dataset 1: {gym_members.shape[0]} users, {gym_members.shape[1]} columns")
print(f"Columns: {gym_members.columns.tolist()}")
print("\nFirst 5 rows:")
display(gym_members.head())
print(f"\nWorkout Types: {gym_members['Workout_Type'].value_counts().to_dict()}")

Dataset 1: 973 users, 15 columns
Columns: ['Age', 'Gender', 'Weight (kg)', 'Height (m)', 'Max_BPM', 'Avg_BPM', 'Resting_BPM', 'Session_Duration (hours)', 'Calories_Burned', 'Workout_Type', 'Fat_Percentage', 'Water_Intake (liters)', 'Workout_Frequency (days/week)', 'Experience_Level', 'BMI']

First 5 rows:


| | Age | Gender | Weight (kg) | Height (m) | Max_BPM | Avg_BPM | Resting_BPM | Session_Duration (hours) | Calories_Burned | Workout_Type | Fat_Percentage | Water_Intake (liters) | Workout_Frequency (days/week) | Experience_Level | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Male | 88.3 | 1.71 | 180 | 157 | 60 | 1.69 | 1313.0 | Yoga | 12.6 | 3.5 | 4 | 3 | 30.2 |
| 1 | 46 | Female | 74.9 | 1.53 | 179 | 151 | 66 | 1.3 | 883.0 | HIIT | 33.9 | 2.1 | 4 | 2 | 32.0 |
| 2 | 32 | Female | 68.1 | 1.66 | 167 | 122 | 54 | 1.11 | 677.0 | Cardio | 33.4 | 2.3 | 4 | 2 | 24.71 |
| 3 | 25 | Male | 53.2 | 1.7 | 190 | 164 | 56 | 0.59 | 532.0 | Strength | 28.8 | 2.1 | 3 | 1 | 18.41 |
| 4 | 38 | Male | 46.1 | 1.79 | 188 | 158 | 68 | 0.64 | 556.0 | Strength | 29.2 | 2.8 | 3 | 1 | 14.39 |



Workout Types: {'Strength': 258, 'Cardio': 255, 'Yoga': 239, 'HIIT': 221}
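
As a quick sanity check on the `BMI` column, the five rows printed above can be compared against the standard definition BMI = weight (kg) / height (m)². The sketch below hard-codes those five rows rather than reading the CSV, and the helper name `bmi` is illustrative, not part of the notebook.

```python
# Rows copied from the head() output above: (weight_kg, height_m, reported_bmi)
rows = [
    (88.3, 1.71, 30.20),
    (74.9, 1.53, 32.00),
    (68.1, 1.66, 24.71),
    (53.2, 1.70, 18.41),
    (46.1, 1.79, 14.39),
]

def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

for weight_kg, height_m, reported in rows:
    computed = bmi(weight_kg, height_m)
    # The reported value should agree with the computed one to two decimals
    print(f"{weight_kg} kg, {height_m} m: computed {computed:.2f}, reported {reported:.2f}")
```

For these five rows the computed and reported values agree to two decimals, which suggests `BMI` is derived from `Weight (kg)` and `Height (m)` rather than measured independently. Whether that holds for all 973 rows is not verified here.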


---

## CELL 2: Dataset 2 - MegaGym Exercises

| Column | Meaning |
|--------|--------|
| Title | Exercise name (e.g., Squat, Burpee) |
| Desc | Detailed description |
| Type | Strength/Stretching/Plyometrics/Powerlifting/Cardio/Olympic Weightlifting/Strongman |
| BodyPart | Muscle group (e.g., Chest, Back, Legs) |
| Equipment | Equipment (e.g., Barbell, Dumbbell, Bodyweight) |
| Level | Beginner/Intermediate/Expert |
| Rating | Rating, 0-10 |
| RatingDesc | Rating description |

In [2]:
# Load Dataset 2
mega_gym = pd.read_csv('./data/megaGymDataset.csv')
print(f"Dataset 2: {mega_gym.shape[0]} exercises, {mega_gym.shape[1]} columns")
print(f"Columns: {mega_gym.columns.tolist()}")
print("\nFirst 5 rows:")
display(mega_gym.head())
print(f"\nExercise Types: {mega_gym['Type'].value_counts().to_dict()}")

Dataset 2: 2918 exercises, 9 columns
Columns: ['Unnamed: 0', 'Title', 'Desc', 'Type', 'BodyPart', 'Equipment', 'Level', 'Rating', 'RatingDesc']

First 5 rows:


| | Unnamed: 0 | Title | Desc | Type | BodyPart | Equipment | Level | Rating | RatingDesc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Partner plank band row | The partner plank band row is an abdominal exe... | Strength | Abdominals | Bands | Intermediate | 0.0 | |
| 1 | 1 | Banded crunch isometric hold | The banded crunch isometric hold is an exercis... | Strength | Abdominals | Bands | Intermediate | | |
| 2 | 2 | FYR Banded Plank Jack | The banded plank jack is a variation on the pl... | Strength | Abdominals | Bands | Intermediate | | |
| 3 | 3 | Banded crunch | The banded crunch is an exercise targeting the... | Strength | Abdominals | Bands | Intermediate | | |
| 4 | 4 | Crunch | The crunch is a popular core exercise targetin... | Strength | Abdominals | Bands | Intermediate | | |



Exercise Types: {'Strength': 2545, 'Stretching': 147, 'Plyometrics': 97, 'Powerlifting': 37, 'Cardio': 35, 'Olympic Weightlifting': 35, 'Strongman': 22}
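
The type counts above determine how many candidate exercises each `Workout_Type` can draw from once the notebook's `type_mapping` (defined in the configuration of CELL 3) is applied. The sketch below recomputes those pool sizes from the printed counts only, before any `Level` filter; the names `pool_sizes` and `unmapped` are illustrative.

```python
# Exercise counts per Type, copied from the value_counts() output above
type_counts = {
    'Strength': 2545, 'Stretching': 147, 'Plyometrics': 97, 'Powerlifting': 37,
    'Cardio': 35, 'Olympic Weightlifting': 35, 'Strongman': 22,
}

# Same mapping as the notebook's configuration cell
type_mapping = {
    'Strength': ['Strength', 'Powerlifting', 'Strongman'],
    'Cardio': ['Cardio'],
    'HIIT': ['Plyometrics', 'Cardio'],
    'Yoga': ['Stretching'],
}

# Number of Dataset 2 rows available to each workout type (ignoring Level)
pool_sizes = {
    workout: sum(type_counts[t] for t in types)
    for workout, types in type_mapping.items()
}

# Dataset 2 types that no workout type maps to
mapped = {t for types in type_mapping.values() for t in types}
unmapped = sorted(set(type_counts) - mapped)

print(pool_sizes)
print(unmapped)
```

Under this mapping the Cardio pool has only 35 rows while Strength has 2,604, and the 35 `Olympic Weightlifting` rows are never sampled. The imbalance is worth keeping in mind when reading exercise-level rules for Cardio users.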


---

## CELL 3: Augmentation & Output

### Output Columns

| Column | Source/Calculation | Meaning |
|--------|-------------------|--------|
| User_ID | Dataset 1 (row index) | User ID |
| Session_ID | Generated: U{id}_S{num} | Session ID |
| Date | Generated: 2024-01-01 + offset | Session date |
| Workout_Time | Generated: Morning/Afternoon/Evening, cycled by session index | Time of day |
| Exercise_Name | Dataset 2 (Title), filtered by Type and Level | Exercise name |
| Workout_Type | Dataset 1 | Workout type |
| Sets | Generated: 3-4 (Strength), 1 (Cardio/HIIT), 3 (Yoga) | Number of sets |
| Reps | Generated: 8-12 (Strength), None (Cardio/HIIT), 10 (Yoga) | Number of reps |
| Duration_Minutes | Dataset 1: Session_Duration (hours) × 60 / num_exercises | Minutes per exercise |
| Calories | Calculated: MET × weight × (duration/60), × 1.07 for male users | Calories burned per exercise |
| Weight_kg | Dataset 1 | Body weight |
| Fat_pct | Dataset 1 (Fat_Percentage) | Body fat (%) |
| BMI | Dataset 1 | BMI |
| Goal | Calculated: f(BMI, Fat%, Gender) | Weight_Loss/Muscle_Gain/Fat_Loss/Fitness |
| Protein_Intake | Calculated: weight × (1.0-2.2 g/kg) based on goal | Protein (g/day) |
| Protein_Level | Derived: High (Muscle_Gain), Medium (Weight_Loss/Fat_Loss), Low or Medium (Fitness, split at 1.2 g/kg) | Protein level |
| Calories_Intake | Calculated: weight × (22-38 kcal/kg) based on goal | Calorie intake |
| Experience_Level | Dataset 1: 1→Beginner, 2→Intermediate, 3→Expert | Experience level |
| Gender | Dataset 1 | Gender |
| Age | Dataset 1 | Age |
| Frequency_per_week | Dataset 1 (Workout_Frequency) | Workout days per week |
| Success | Calculated: multi-factor score >= 60 → 1, else 0 | Success (0/1) |
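
To make the `Calories` row concrete, the sketch below works through the formula MET × weight × (duration/60) × 1.07 (male) and shows how the MET value is picked from an exercise name. `lookup_met_substring` mirrors the keyword test used by the notebook's `estimate_calories`. `lookup_met_word_start` is a hypothetical alternative, not something the notebook does, added only to show how a plain substring test can pick an unintended keyword (for example `'run'` inside "Crunch").

```python
import re

# Same keyword-to-MET table as the notebook's configuration cell
met_db = {
    'squat': 8.0, 'deadlift': 8.0, 'bench': 6.0, 'press': 6.0, 'pull': 8.0, 'row': 6.0,
    'run': 9.8, 'sprint': 12.0, 'cycl': 7.5, 'swim': 8.0, 'treadmill': 8.0,
    'burpee': 11.0, 'jump': 10.0, 'mountain': 9.0, 'box': 10.0,
    'yoga': 3.0, 'stretch': 2.5,
}
DEFAULT_MET = 6.0  # fallback used when no keyword matches

def lookup_met_substring(name: str) -> float:
    """Notebook behaviour: the first keyword found anywhere in the name wins."""
    lowered = name.lower()
    for kw, val in met_db.items():
        if kw in lowered:
            return val
    return DEFAULT_MET

def lookup_met_word_start(name: str) -> float:
    """Hypothetical variant: the keyword must begin a word (regex word boundary)."""
    lowered = name.lower()
    for kw, val in met_db.items():
        if re.search(r'\b' + re.escape(kw), lowered):
            return val
    return DEFAULT_MET

def calories(met: float, weight_kg: float, duration_min: float, is_male: bool) -> float:
    """kcal = MET x weight (kg) x duration (hours), with a 7% uplift for male users."""
    cal = met * weight_kg * (duration_min / 60)
    return round(cal * 1.07 if is_male else cal, 1)

# Worked example: an 80 kg male doing a squat variation for 15 minutes
print(calories(lookup_met_substring('Barbell Squat'), 80.0, 15.0, True))  # 171.2

# The two lookups disagree on names that merely contain a keyword
for name in ['Crunch', 'Trail Running/Walking']:
    print(name, lookup_met_substring(name), lookup_met_word_start(name))
```

With the substring test, "Crunch" receives the running MET of 9.8; with the word-start variant it falls back to the default 6.0, while "Trail Running/Walking" gets 9.8 either way. Switching lookups would change the `Calories` values in the saved file, so the numbers shown in this notebook correspond to the substring behaviour.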

### Success Scoring Formula
```
Score = Workout-Goal_alignment(30) + Frequency(25) + Protein(20) + Duration(15) + Experience(10)
Success = 1 if Score >= 60, else 0
```
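
The formula can be traced by hand on two rows from the preview output of this notebook. The sketch below restates the scoring rules from `calculate_success` as a function that returns the points per component; the name `score_breakdown` and its argument list are illustrative and do not appear in the notebook.

```python
def score_breakdown(goal, workout_type, freq, protein_level, duration_min, experience):
    """Return the points per component, following the rules in calculate_success."""
    # Workout-goal alignment (max 30)
    if (goal == 'Weight_Loss' and workout_type in ('Cardio', 'HIIT')) or \
       (goal == 'Muscle_Gain' and workout_type == 'Strength') or \
       (goal == 'Fat_Loss' and workout_type == 'HIIT'):
        alignment = 30
    elif goal == 'Fitness' and workout_type in ('Cardio', 'Yoga'):
        alignment = 25
    else:
        alignment = 10
    # Frequency (max 25)
    frequency = 25 if freq >= 5 else 20 if freq == 4 else 15 if freq == 3 else 5
    # Protein (max 20), judged relative to the goal
    if goal == 'Muscle_Gain':
        protein = {'High': 20, 'Medium': 10}.get(protein_level, 0)
    elif goal in ('Weight_Loss', 'Fat_Loss'):
        protein = 15 if protein_level in ('High', 'Medium') else 5
    else:
        protein = 10
    # Duration per exercise in minutes (max 15)
    duration = 15 if duration_min >= 30 else 10 if duration_min >= 20 else 5
    # Experience (max 10)
    exp = {'Expert': 10, 'Intermediate': 7}.get(experience, 3)

    parts = {'Alignment': alignment, 'Frequency': frequency, 'Protein': protein,
             'Duration': duration, 'Experience': exp}
    parts['Total'] = sum(parts.values())
    parts['Success'] = int(parts['Total'] >= 60)
    return parts

# U251_S00 from the preview: Weight_Loss + HIIT, 3 days/week, Medium protein, 15.6 min, Intermediate
print(score_breakdown('Weight_Loss', 'HIIT', 3, 'Medium', 15.6, 'Intermediate'))
# U123_S14 from the preview: Fat_Loss + Cardio, 4 days/week, Medium protein, 19.2 min, Intermediate
print(score_breakdown('Fat_Loss', 'Cardio', 4, 'Medium', 19.2, 'Intermediate'))
```

The first row totals 30 + 15 + 15 + 5 + 7 = 72 (Success = 1) and the second 10 + 20 + 15 + 5 + 7 = 57 (Success = 0), matching the `Success` column in the preview. Because `Duration_Minutes` is the per-exercise duration, most rows earn only 5 or 10 of the 15 duration points.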

In [3]:
# ===== CONFIGURATION =====
exp_mapping = {1: 'Beginner', 2: 'Intermediate', 3: 'Expert'}
type_mapping = {
    'Strength': ['Strength', 'Powerlifting', 'Strongman'],
    'Cardio': ['Cardio'],
    'HIIT': ['Plyometrics', 'Cardio'],
    'Yoga': ['Stretching']
}
met_db = {
    'squat': 8.0, 'deadlift': 8.0, 'bench': 6.0, 'press': 6.0, 'pull': 8.0, 'row': 6.0,
    'run': 9.8, 'sprint': 12.0, 'cycl': 7.5, 'swim': 8.0, 'treadmill': 8.0,
    'burpee': 11.0, 'jump': 10.0, 'mountain': 9.0, 'box': 10.0,
    'yoga': 3.0, 'stretch': 2.5
}

# ===== HELPER FUNCTIONS =====
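def estimate_calories(ex_name, duration_min, weight, is_male):
    # Default MET for exercises that match no keyword in met_db
    met = 6.0
    # The first keyword found anywhere in the lowercased name wins.
    # Caveat: this is a plain substring test, so 'run' also matches "Crunch"
    # and 'row' also matches "Throw".
    for kw, val in met_db.items():
        if kw in ex_name.lower():
            met = val
            break
    # kcal = MET × body weight (kg) × duration (hours); male users get a 7% uplift
    cal = met * weight * (duration_min / 60)
    return round(cal * 1.07 if is_male else cal, 1)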

def determine_goal(bmi, fat, gender):
    if bmi >= 30: return 'Weight_Loss'
    if bmi < 18.5: return 'Muscle_Gain'
    if (gender == 'Male' and fat > 25) or (gender == 'Female' and fat > 32): return 'Fat_Loss'
    return 'Fitness'

def calculate_nutrition(goal, weight):
    if goal == 'Muscle_Gain':
        protein = weight * np.random.uniform(1.8, 2.2)
        p_level, cal = 'High', weight * np.random.uniform(32, 38)
    elif goal in ['Weight_Loss', 'Fat_Loss']:
        protein = weight * np.random.uniform(1.4, 1.8)
        p_level, cal = 'Medium', weight * np.random.uniform(22, 28)
    else:
        protein = weight * np.random.uniform(1.0, 1.4)
        p_level, cal = 'Low' if protein < weight*1.2 else 'Medium', weight * np.random.uniform(28, 32)
    return {'Protein_Intake': round(protein, 1), 'Protein_Level': p_level, 'Calories_Intake': round(cal, 0)}

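def get_exercises(workout_type, exp_level, mega_df, n=4):
    exp_str = exp_mapping.get(exp_level, 'Beginner')
    types = type_mapping.get(workout_type, ['Strength'])
    # Prefer exercises that match both the workout type and the user's level
    filtered = mega_df[(mega_df['Type'].isin(types)) & (mega_df['Level'] == exp_str)]
    # Too few matches: drop the level filter and keep only the type filter
    if len(filtered) < n:
        filtered = mega_df[mega_df['Type'].isin(types)]
    # Nothing matches at all: fall back to a placeholder name
    if len(filtered) == 0:
        return ['General Exercise'] * n
    # Sample without replacement, capped at the number of available rows
    return filtered.sample(min(n, len(filtered)), replace=False)['Title'].tolist()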

def calculate_success(row):
    score = 0
    # Workout-Goal alignment (30)
    if (row['Goal']=='Weight_Loss' and row['Workout_Type'] in ['Cardio','HIIT']) or \
       (row['Goal']=='Muscle_Gain' and row['Workout_Type']=='Strength') or \
       (row['Goal']=='Fat_Loss' and row['Workout_Type']=='HIIT'):
        score += 30
    elif row['Goal']=='Fitness' and row['Workout_Type'] in ['Cardio','Yoga']:
        score += 25
    else:
        score += 10
    # Frequency (25)
    score += 25 if row['Frequency_per_week']>=5 else 20 if row['Frequency_per_week']==4 else 15 if row['Frequency_per_week']==3 else 5
    # Protein (20)
    if row['Goal']=='Muscle_Gain':
        score += 20 if row['Protein_Level']=='High' else 10 if row['Protein_Level']=='Medium' else 0
    elif row['Goal'] in ['Weight_Loss','Fat_Loss']:
        score += 15 if row['Protein_Level'] in ['High','Medium'] else 5
    else:
        score += 10
    # Duration (15)
    score += 15 if row['Duration_Minutes']>=30 else 10 if row['Duration_Minutes']>=20 else 5
    # Experience (10)
    score += 10 if row['Experience_Level']=='Expert' else 7 if row['Experience_Level']=='Intermediate' else 3
    return 1 if score >= 60 else 0

# ===== AUGMENTATION =====
print("Processing...")
all_sessions = []
for idx, user in gym_members.iterrows():
    if idx % 200 == 0: print(f"  {idx}/{len(gym_members)} users...")
    
    user_id, workout_type, exp = user.name, user['Workout_Type'], user['Experience_Level']
    freq, duration = user['Workout_Frequency (days/week)'], user['Session_Duration (hours)'] * 60
    weight, fat, bmi = user['Weight (kg)'], user['Fat_Percentage'], user['BMI']
    is_male, age, gender = (user['Gender']=='Male'), user['Age'], user['Gender']
    
    goal = determine_goal(bmi, fat, gender)
    nutrition = calculate_nutrition(goal, weight)
    
    num_sessions = int(freq * 4)
    start_date = datetime(2024, 1, 1)
    times = ['Morning', 'Afternoon', 'Evening']
    
    for s_idx in range(num_sessions):
        n_ex = np.random.randint(3, 6)
        exercises = get_exercises(workout_type, exp, mega_gym, n_ex)
        dur_per_ex = duration / len(exercises)
        # Days between sessions: floor(7 / freq), or 2 when freq is 0
        days_offset = s_idx * (7 // freq if freq > 0 else 2)
        s_date = start_date + timedelta(days=days_offset)
        w_time = times[s_idx % len(times)]
        
        for ex in exercises:
            cal = estimate_calories(ex, dur_per_ex, weight, is_male)
            if workout_type == 'Strength':
                # np.random.randint excludes the upper bound: sets are 3 or 4, reps are 8 to 12
                sets, reps = np.random.randint(3,5), np.random.randint(8,13)
            elif workout_type in ['Cardio','HIIT']:
                sets, reps = 1, None
            else:
                sets, reps = 3, 10
            
            all_sessions.append({
                'User_ID': user_id, 'Session_ID': f"U{user_id:03d}_S{s_idx:02d}",
                'Date': s_date.strftime('%Y-%m-%d'), 'Workout_Time': w_time,
                'Exercise_Name': ex, 'Workout_Type': workout_type,
                'Sets': sets, 'Reps': reps, 'Duration_Minutes': round(dur_per_ex, 1),
                'Calories': cal, 'Weight_kg': round(weight, 1), 'Fat_pct': round(fat, 1),
                'BMI': round(bmi, 1), 'Goal': goal,
                'Protein_Intake': nutrition['Protein_Intake'],
                'Protein_Level': nutrition['Protein_Level'],
                'Calories_Intake': nutrition['Calories_Intake'],
                'Experience_Level': exp_mapping[exp], 'Gender': gender,
                'Age': age, 'Frequency_per_week': freq
            })

augmented_df = pd.DataFrame(all_sessions)
augmented_df['Success'] = augmented_df.apply(calculate_success, axis=1)

print(f"\nCompleted: {len(augmented_df):,} records generated from {len(gym_members)} users")

# ===== STATISTICS =====
print(f"\nStatistics:")
print(f"  Users: {augmented_df['User_ID'].nunique():,}")
print(f"  Sessions: {augmented_df['Session_ID'].nunique():,}")
print(f"  Unique Exercises: {augmented_df['Exercise_Name'].nunique():,}")
print(f"  Date Range: {augmented_df['Date'].min()} to {augmented_df['Date'].max()}")
print(f"  Success Rate: {augmented_df['Success'].mean()*100:.1f}%")
print(f"\nGoal Distribution: {augmented_df['Goal'].value_counts().to_dict()}")
print(f"Protein Level: {augmented_df['Protein_Level'].value_counts().to_dict()}")
print(f"Workout Time: {augmented_df['Workout_Time'].value_counts().to_dict()}")

# ===== PREVIEW =====
print(f"\nPreview (10 random rows):")
display(augmented_df.sample(10).sort_values(['User_ID', 'Session_ID']))

# ===== SAVE =====
output = './data/gym_workout_sessions.csv'
augmented_df.to_csv(output, index=False)
print(f"\nSaved: {output} ({len(augmented_df):,} rows × {len(augmented_df.columns)} columns)")

Processing...
  0/973 users...
  200/973 users...
  400/973 users...
  600/973 users...
  800/973 users...

Completed: 51,633 records generated from 973 users

Statistics:
  Users: 973
  Sessions: 12,928
  Unique Exercises: 2,759
  Date Range: 2024-01-01 to 2024-01-23
  Success Rate: 55.8%

Goal Distribution: {'Fitness': 24076, 'Weight_Loss': 9785, 'Fat_Loss': 9353, 'Muscle_Gain': 8419}
Protein Level: {'Medium': 31855, 'Low': 11359, 'High': 8419}
Workout Time: {'Morning': 18445, 'Afternoon': 17185, 'Evening': 16003}

Preview (10 random rows):


| | User_ID | Session_ID | Date | Workout_Time | Exercise_Name | Workout_Type | Sets | Reps | Duration_Minutes | Calories | ... | BMI | Goal | Protein_Intake | Protein_Level | Calories_Intake | Experience_Level | Gender | Age | Frequency_per_week | Success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6527 | 123 | U123_S14 | 2024-01-15 | Evening | Trail Running/Walking | Cardio | 1 | | 19.2 | 328.8 | ... | 29.3 | Fat_Loss | 144.2 | Medium | 2316.0 | Intermediate | Male | 44 | 4 | 0 |
| 6935 | 130 | U130_S09 | 2024-01-19 | Morning | Lying Cable Curl - Gethin Variation | Strength | 3 | 8.0 | 18.6 | 84.1 | ... | 14.9 | Muscle_Gain | 94.9 | High | 1672.0 | Intermediate | Female | 49 | 3 | 1 |
| 13326 | 251 | U251_S00 | 2024-01-01 | Morning | Linear Depth Jump | HIIT | 1 | | 15.6 | 325.5 | ... | 45.1 | Weight_Loss | 184.5 | Medium | 2866.0 | Intermediate | Male | 33 | 3 | 1 |
| 25160 | 478 | U478_S06 | 2024-01-19 | Morning | Vertical Mountain Climber | HIIT | 1 | | 18.1 | 170.7 | ... | 21.3 | Fat_Loss | 103.2 | Medium | 1547.0 | Beginner | Male | 58 | 2 | 0 |
| 26841 | 511 | U511_S11 | 2024-01-12 | Evening | Single-arm kettlebell overhead squat | Strength | 4 | 11.0 | 23.5 | 289.2 | ... | 26.0 | Fitness | 109.1 | Medium | 2564.0 | Expert | Male | 39 | 5 | 1 |
| 40714 | 766 | U766_S11 | 2024-01-23 | Evening | Posterior Tibialis Stretch | Yoga | 3 | 10.0 | 29.4 | 103.8 | ... | 20.8 | Fitness | 97.0 | Medium | 2441.0 | Intermediate | Male | 45 | 3 | 1 |
| 41976 | 791 | U791_S02 | 2024-01-03 | Evening | Walking lunge | Strength | 4 | 9.0 | 27.0 | 172.5 | ... | 17.8 | Muscle_Gain | 130.4 | High | 1947.0 | Intermediate | Male | 39 | 4 | 1 |
| 46395 | 875 | U875_S03 | 2024-01-10 | Morning | Standing Two-Arm Overhead Throw | HIIT | 1 | | 13.7 | 103.7 | ... | 27.1 | Fitness | 79.3 | Low | 2172.0 | Beginner | Male | 22 | 2 | 0 |
| 47081 | 887 | U887_S02 | 2024-01-05 | Evening | Lying Hamstring | Yoga | 3 | 10.0 | 9.2 | 51.5 | ... | 18.5 | Muscle_Gain | 100.8 | High | 1756.0 | Beginner | Male | 55 | 3 | 0 |
| 48782 | 920 | U920_S08 | 2024-01-09 | Evening | Trail Running/Walking | Cardio | 1 | | 18.7 | 288.6 | ... | 33.6 | Weight_Loss | 140.2 | Medium | 2303.0 | Expert | Male | 51 | 4 | 1 |



Saved: ./data/gym_workout_sessions.csv (51,633 rows × 22 columns)
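
The reported date range (2024-01-01 to 2024-01-23) follows directly from the scheduling rule in the augmentation loop: `freq * 4` sessions per user, spaced `7 // freq` days apart. The sketch below reproduces that rule in isolation. The function name `session_dates` is illustrative, and the frequencies 2 to 5 are an assumption based on the values visible in the preview rows.

```python
from datetime import datetime, timedelta

def session_dates(freq, start=datetime(2024, 1, 1)):
    """Dates generated for one user, using the same rule as the augmentation loop."""
    num_sessions = int(freq * 4)            # four weeks' worth of sessions
    step = 7 // freq if freq > 0 else 2     # whole days between consecutive sessions
    return [
        (start + timedelta(days=s_idx * step)).strftime('%Y-%m-%d')
        for s_idx in range(num_sessions)
    ]

# Assumed frequency values (days/week); only 2-5 appear in the preview rows
for freq in (2, 3, 4, 5):
    dates = session_dates(freq)
    print(f"freq={freq}: {len(dates)} sessions, last on {dates[-1]}")
```

A user training 3 days per week gets 12 sessions two days apart, ending on 2024-01-23, which is the maximum date in the statistics. For 4 or 5 days per week the integer division gives a step of 1, so those users train on consecutive calendar days and finish in 16 or 20 days rather than four weeks. That matters if `Date` is later used for weekly aggregation.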
