### Baseline Regression Model for Calories Burned

### Data loading and train–test split

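We load the preprocessed dataset (categorical features already one-hot encoded)
and split it into training and test sets, holding out 20% of the rows for
testing with a fixed `random_state` so the split is reproducible. The target
variable is `Calories_Burned`; the input features are the columns listed in
`feature_cols`.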


In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("../data/preprocessed_gym_data.csv")

feature_cols = [
    "Age",
    "BMI",
    "Weight (kg)",
    "Max_BPM",
    "Avg_BPM",
    "Session_Duration (hours)",
    "Workout_Frequency (days/week)",
    "Gender_Male",
    "Workout_Type_HIIT",
    "Experience_Level_2",
]

X = df[feature_cols]
y = df["Calories_Burned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
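As a quick sanity check on what `test_size=0.2` and `random_state=42` do, the sketch below applies the same call to a small toy frame. The column names and values here are illustrative assumptions, not taken from the gym dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the gym data; names and values are illustrative only
toy = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "target": range(100, 110),
})
X_toy = toy[["feature_a", "feature_b"]]
y_toy = toy["target"]

# Same arguments as the notebook cell: 20% held out, fixed seed
split_1 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_1
print("train rows:", len(X_tr), "test rows:", len(X_te))

# A fixed random_state yields the same partition on every call
split_is_reproducible = X_te.index.equals(split_2[1].index)
print("reproducible:", split_is_reproducible)
```

With 10 rows, this leaves 8 rows for training and 2 for testing, and repeating the call with the same seed selects the same test rows.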


### Baseline Linear Regression model and evaluation

We train a baseline `LinearRegression` model on the training data and evaluate
it on the test set using:

- **MAE (Mean Absolute Error)**: average absolute difference between
  predicted and actual calories burned (lower is better).
- **R² (coefficient of determination)**: proportion of variance in
  `Calories_Burned` explained by the model (closer to 1 is better).
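To make the two metrics concrete, here is a minimal worked example that computes them by hand and compares the result with scikit-learn. The four calorie values are hypothetical and chosen only to keep the arithmetic easy to follow; the `demo_` names are used so the notebook's own variables are not overwritten.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted calorie values (not from the dataset)
demo_true = np.array([300.0, 500.0, 700.0, 900.0])
demo_pred = np.array([320.0, 480.0, 710.0, 890.0])

# MAE: mean of the absolute residuals
demo_mae = np.mean(np.abs(demo_true - demo_pred))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((demo_true - demo_pred) ** 2)
ss_tot = np.sum((demo_true - demo_true.mean()) ** 2)
demo_r2 = 1.0 - ss_res / ss_tot

print("MAE (manual):", demo_mae)
print("R2  (manual):", demo_r2)

# The manual values should agree with scikit-learn's implementations
print("MAE (sklearn):", mean_absolute_error(demo_true, demo_pred))
print("R2  (sklearn):", r2_score(demo_true, demo_pred))
```

Here the absolute residuals are 20, 20, 10, and 10, so the MAE is 15. The residual sum of squares is 1000 against a total sum of squares of 200000, giving an R² of about 0.995.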


In [15]:
# Train baseline Linear Regression model and evaluate on test data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MAE:", mae)
print("R2 :", r2)


MAE: 29.776388174872018
R2 : 0.9809072290557926


In [None]:
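# classification_report only applies to classifiers; for a continuous target, report RMSE instead
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(y_test, y_pred) ** 0.5
print("RMSE:", rmse)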

In [None]:
import joblib

joblib.dump(model, "../models/calories_model.joblib")


['../models/calories_model.joblib']
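A saved model is typically reloaded with `joblib.load` before it is used for prediction. The sketch below shows the round trip on a toy model trained on synthetic data and written to a temporary directory, so it runs on its own. The file name `demo_model.joblib` and the synthetic data are assumptions for illustration; in this notebook the path would be `../models/calories_model.joblib`, and new inputs would need the same columns as `feature_cols`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for the notebook's model: synthetic data, not the gym dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 10.0
demo_model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(
    demo_model.predict(X_demo), reloaded_model.predict(X_demo)
)
print("Reloaded model reproduces predictions:", same_predictions)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.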