# Preprocessing for Calories_Burned Model

Goal: prepare the gym dataset for machine learning by:
- choosing target and features
- cleaning missing values
- encoding categorical columns
- scaling numeric columns
- creating train/test splits


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

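# Load the dataset (the path is relative to the notebook's working directory)
df = pd.read_csv("../gym_members_exercise_tracking.csv")
df.head()  # preview the first five rows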


| | Age | Gender | Weight (kg) | Height (m) | Max_BPM | Avg_BPM | Resting_BPM | Session_Duration (hours) | Calories_Burned | Workout_Type | Fat_Percentage | Water_Intake (liters) | Workout_Frequency (days/week) | Experience_Level | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Male | 88.3 | 1.71 | 180 | 157 | 60 | 1.69 | 1313.0 | Yoga | 12.6 | 3.5 | 4 | 3 | 30.2 |
| 1 | 46 | Female | 74.9 | 1.53 | 179 | 151 | 66 | 1.3 | 883.0 | HIIT | 33.9 | 2.1 | 4 | 2 | 32.0 |
| 2 | 32 | Female | 68.1 | 1.66 | 167 | 122 | 54 | 1.11 | 677.0 | Cardio | 33.4 | 2.3 | 4 | 2 | 24.71 |
| 3 | 25 | Male | 53.2 | 1.7 | 190 | 164 | 56 | 0.59 | 532.0 | Strength | 28.8 | 2.1 | 3 | 1 | 18.41 |
| 4 | 38 | Male | 46.1 | 1.79 | 188 | 158 | 68 | 0.64 | 556.0 | Strength | 29.2 | 2.8 | 3 | 1 | 14.39 |


## Load dataset

We load the gym members dataset into a pandas DataFrame to start preprocessing.


In [5]:
target = "Calories_Burned"

numeric_features = [
    "Age",
    "BMI",
    "Weight (kg)",
    "Height (m)",
    "Max_BPM",
    "Avg_BPM",
    "Resting_BPM",
    "Session_Duration (hours)",
    "Fat_Percentage",
    "Water_Intake (liters)",
    "Workout_Frequency (days/week)",
]

categorical_features = [
    "Gender",
    "Workout_Type",
    "Experience_Level",
]

X = df[numeric_features + categorical_features]
y = df[target]


## Select target and features

- Target `y`: `Calories_Burned`
- Features `X`: 11 numeric columns (age, body measurements, heart rates, session duration, etc.) and 3 categorical columns (gender, workout type, experience level).

These features will be used to predict Calories_Burned.
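Before building `X`, it can help to confirm that the feature lists match the DataFrame. The sketch below runs that check on a small toy frame built from a few rows and columns of the preview above. The names `toy`, `missing`, `unused`, and `overlap` are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Toy frame using a subset of the notebook's columns and preview rows
toy = pd.DataFrame({
    "Age": [56, 46, 32],
    "Gender": ["Male", "Female", "Female"],
    "Workout_Type": ["Yoga", "HIIT", "Cardio"],
    "Experience_Level": [3, 2, 2],
    "Calories_Burned": [1313.0, 883.0, 677.0],
})

target = "Calories_Burned"
numeric_features = ["Age"]
categorical_features = ["Gender", "Workout_Type", "Experience_Level"]
selected = numeric_features + categorical_features

# Columns named in the lists but absent from the frame (typos, renamed columns)
missing = [c for c in selected if c not in toy.columns]
# Columns in the frame that are neither a feature nor the target
unused = [c for c in toy.columns if c not in selected + [target]]
# Columns accidentally listed as both numeric and categorical
overlap = set(numeric_features) & set(categorical_features)

print("missing:", missing, "| unused:", unused, "| overlap:", overlap)
```

If all three collections come back empty, every selected column exists, nothing was left out by accident, and no column is processed twice.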


In [6]:
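# Count missing values per feature. Only a cell's last expression is displayed,
# so wrap this line in print() to actually see the counts.
X.isna().sum()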

# Combine X and y, drop rows with any missing values
data = pd.concat([X, y], axis=1)
data = data.dropna()

X = data[numeric_features + categorical_features]
y = data[target]

X.shape, y.shape


((973, 14), (973,))

## Handle missing values

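We combine `X` and `y` before dropping rows so that a row removed for a missing value disappears from both, which keeps features and target aligned.  
Dropping rows keeps the example simple and avoids passing NaNs to the model; the result has 973 rows and 14 feature columns.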
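Dropping rows is not the only option. The sketch below contrasts it with median imputation via scikit-learn's `SimpleImputer` on a toy column. Imputation is a different technique from the one this notebook uses; it is shown only for comparison, and the `toy`, `dropped`, and `imputed` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric column with one missing value (not the real dataset)
toy = pd.DataFrame({"Age": [56.0, np.nan, 32.0, 25.0]})

# Option 1 (what this notebook does): drop rows containing NaN
dropped = toy.dropna()

# Option 2 (alternative): replace NaN with the column median
imputer = SimpleImputer(strategy="median")
imputed = toy.copy()
imputed["Age"] = imputer.fit_transform(toy[["Age"]]).ravel()

print(dropped.shape)            # one row fewer than the toy frame
print(imputed["Age"].tolist())  # NaN replaced by the median of the other values
```

Dropping loses rows but changes no values; imputation keeps every row but introduces estimated values. In a real workflow the imputer would be fitted on the training split only.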


In [7]:
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)


## Define preprocessing

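We use a `ColumnTransformer` to:
- standardize numeric features with `StandardScaler`
- one-hot encode categorical features with `OneHotEncoder`

Note: `Experience_Level` is stored as numbers, but we treat it as a categorical variable and one-hot encode it (levels like beginner/intermediate/advanced).

Setting `handle_unknown="ignore"` means a category that was not seen during fitting is encoded as all zeros instead of raising an error.

The result is a single numeric matrix in which every column is either a standardized numeric feature or a 0/1 indicator.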


In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

X_train_processed.shape, X_test_processed.shape


((778, 20), (195, 20))

## Train–test split and transform

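We split the data into train (80%) and test (20%) sets, with a fixed `random_state` so the split is reproducible.  
The preprocessor is fitted on the training data only and then applied to both sets, so the scaling statistics and category lists never draw on test rows. This gives `X_train_processed` and `X_test_processed` for the next modeling step.

The 20 output columns are consistent with 11 scaled numeric features plus 9 one-hot columns (2 for Gender, 4 for Workout_Type, 3 for Experience_Level), assuming the categories visible in the preview are the only ones present.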
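One way to carry the fit-on-train, apply-to-both pattern into modeling is to wrap the preprocessor and an estimator in a scikit-learn `Pipeline`. The sketch below does this on synthetic data. The choice of `LinearRegression`, the generated values, and the target formula are all assumptions made for illustration; they are not the notebook's actual model or data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the gym dataset (values are generated, not real)
rng = np.random.default_rng(42)
n = 100
toy = pd.DataFrame({
    "Session_Duration (hours)": rng.uniform(0.5, 2.0, n),
    "Avg_BPM": rng.integers(120, 170, n),
    "Workout_Type": rng.choice(["Yoga", "HIIT", "Cardio", "Strength"], n),
})
# Invented linear target plus noise, only so the model has something to learn
toy["Calories_Burned"] = (
    500 * toy["Session_Duration (hours)"]
    + 2 * toy["Avg_BPM"]
    + rng.normal(0, 10, n)
)

X_toy = toy.drop(columns="Calories_Burned")
y_toy = toy["Calories_Burned"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

model = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Session_Duration (hours)", "Avg_BPM"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Workout_Type"]),
    ])),
    ("regress", LinearRegression()),
])

# fit() fits the preprocessor on the training rows only, then the regressor
model.fit(X_tr, y_tr)
# score() applies the already-fitted preprocessor to the test rows
r2 = model.score(X_te, y_te)
print(f"Test R^2 on synthetic data: {r2:.3f}")
```

With a pipeline, the raw `X_train` and `X_test` can be passed directly, and the same object can be reused in cross-validation without refitting the preprocessor by hand.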
