# Preprocessing for Calories_Burned Model

Goal: prepare the gym dataset for machine learning by:
- choosing target and features
- encoding categorical columns
- scaling numeric columns
- creating train/test splits


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

df = pd.read_csv("../data/gym_members_exercise_tracking.csv")
df.head()

## Load dataset

We load the gym members dataset into a pandas DataFrame to start preprocessing.


In [None]:
target = "Calories_Burned"
numeric_features = [
    "Age",
    "BMI",
    "Weight (kg)",
    "Height (m)",
    "Max_BPM",
    "Avg_BPM",
    "Resting_BPM",
    "Session_Duration (hours)",
    "Fat_Percentage",
    "Water_Intake (liters)",
    "Workout_Frequency (days/week)",
]

categorical_features = [
    "Gender",
    "Workout_Type",
    "Experience_Level",
]

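# One-hot encoded but unscaled copy of the selected columns (exported in the last cell).
df_model = df[numeric_features + categorical_features + [target]]
df_model = pd.get_dummies(df_model, columns=categorical_features, drop_first=True)
# Note: X and y are redefined from the raw df in the next cell for the ColumnTransformer.
X = df_model.drop(columns=[target])
y = df_model[target]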

## Select target and features

- Target `y`: Calories_Burned  
- Features `X`: numeric (age, heart rates, BMI, etc.) and categorical (gender, workout type, experience level).

These features will be used to predict Calories_Burned.


In [None]:
X = df[numeric_features + categorical_features]
y = df[target]
X.shape, y.shape

In [None]:
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

## Define preprocessing

We use a ColumnTransformer to:
- standardize numeric features with StandardScaler
- one‑hot encode categorical features with OneHotEncoder

Note: Experience_Level is stored as numbers, but we treat it as a categorical variable and one‑hot encode it (levels like beginner/intermediate/advanced).

The result is a single numeric matrix for modeling: the scaled numeric columns followed by the one-hot indicator columns.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
X_train_processed.shape, X_test_processed.shape

## Train–test split and transform

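We split the data into train (80%) and test (20%) sets, with a fixed `random_state` for reproducibility.  
The preprocessor is fitted on the training data only, so the scaling statistics and category lists never see the test set, and is then applied to both sets. This gives `X_train_processed` and `X_test_processed` for the next modeling step.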
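
To make the fit-on-train, transform-both pattern concrete, here is a minimal sketch on one synthetic feature. The random data and the names `values`, `train_values`, `test_values`, and `scaler` are assumptions for the example. It only checks that the scaler's learned mean comes from the training split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data, invented for illustration only.
rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=10.0, size=(100, 1))

train_values, test_values = train_test_split(values, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_values)  # learns mean and std from the train split only
test_scaled = scaler.transform(test_values)        # reuses the train statistics, no refitting

print(train_values.shape, test_values.shape)
print(np.isclose(scaler.mean_[0], train_values.mean()))
print(train_scaled.mean(), test_scaled.mean())
```

The scaled training column has a mean of essentially zero, while the scaled test column generally does not. That is expected, because the test set is measured against the training statistics.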


In [None]:
# Saves the get_dummies version of the data (one-hot encoded, not scaled, not split).
df_model.to_csv("../data/preprocessed_gym_data.csv", index=False)