## Problem Statement
We built a Linear Regression model to predict an individual’s weight (kg) from:

* Height (cm)

* Age (years)

* Exercise frequency (times/week, encoded numerically)

This simple regression task addresses a core healthcare‐analytics scenario: understanding how basic demographics and lifestyle factors relate to body weight.

In [33]:
# 1) Install & import dependencies
!pip install kagglehub[pandas-datasets] --quiet

import os
import glob
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

import kagglehub

## Data Loading & Preprocessing
* Dataset: “Health and Lifestyle” from Kaggle, loaded via kagglehub.

* Column cleaning: converted all column names to lowercase with underscores.

* Feature selection: kept height_cm, age, and exercise_freq as predictors and weight_kg as the target.

* Encoding: transformed textual exercise_freq categories (e.g. “3-5 times/week”) into numeric midpoints (here 4.0); values the parser does not recognize become NaN.

* Missing values: dropped any rows with nulls in the four selected columns.

In [59]:
data_dir = kagglehub.dataset_download("sahilislam007/health-and-lifestyle-dataset")
print("Dataset directory:", data_dir)
print("Files:", os.listdir(data_dir))


Dataset directory: /kaggle/input/health-and-lifestyle-dataset
Files: ['synthetic_health_lifestyle_dataset.csv']


In [60]:
csv_path = glob.glob(os.path.join(data_dir, "*.csv"))[0]
print("Loading:", os.path.basename(csv_path))

df = pd.read_csv(csv_path)
print("Raw shape:", df.shape)
df.head()


Loading: synthetic_health_lifestyle_dataset.csv
Raw shape: (7500, 13)


| | ID | Age | Gender | Height_cm | Weight_kg | BMI | Smoker | Exercise_Freq | Diet_Quality | Alcohol_Consumption | Chronic_Disease | Stress_Level | Sleep_Hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | Other | 177.6 | 37.3 | 11.8 | Yes | NaN | Poor | NaN | No | 9 | 8.5 |
| 1 | 2 | 69 | Other | 169.3 | 70.7 | 24.7 | No | 1-2 times/week | Good | High | No | 2 | 5.9 |
| 2 | 3 | 46 | Female | 159.1 | 69.0 | 27.3 | No | Daily | Excellent | Moderate | No | 3 | 4.8 |
| 3 | 4 | 32 | Male | 170.6 | 76.4 | 26.3 | No | 3-5 times/week | Excellent | Moderate | No | 9 | 6.6 |
| 4 | 5 | 60 | Male | 158.4 | 60.4 | 24.1 | No | 3-5 times/week | Excellent | Low | Yes | 6 | 6.1 |


In [61]:
print("Columns before:", df.columns.tolist())

df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
print("Columns after: ", df.columns.tolist())


Columns before: ['ID', 'Age', 'Gender', 'Height_cm', 'Weight_kg', 'BMI', 'Smoker', 'Exercise_Freq', 'Diet_Quality', 'Alcohol_Consumption', 'Chronic_Disease', 'Stress_Level', 'Sleep_Hours']
Columns after:  ['id', 'age', 'gender', 'height_cm', 'weight_kg', 'bmi', 'smoker', 'exercise_freq', 'diet_quality', 'alcohol_consumption', 'chronic_disease', 'stress_level', 'sleep_hours']


In [63]:
import re
import numpy as np

def parse_exercise_freq(s):
    s = str(s).lower().strip()
    # never or zero
    if s in ('never', 'none', '0', '0 times/week'):
        return 0.0
    # range like "3-5"
    m = re.match(r'(\d+)\s*-\s*(\d+)', s)
    if m:
        return (float(m.group(1)) + float(m.group(2))) / 2.0
    # "6+"
    m2 = re.match(r'(\d+)\+', s)
    if m2:
        return float(m2.group(1))
    # anything else → NaN
    return np.nan

# apply it
df['exercise_freq_num'] = df['exercise_freq'].apply(parse_exercise_freq)

# how many failed to parse?
n_unmapped = df['exercise_freq_num'].isnull().sum()
print(f"{n_unmapped} rows couldn’t be parsed → will drop them")

# drop rows where parsing failed
df = df.dropna(subset=['exercise_freq_num'])


3804 rows couldn’t be parsed → will drop them
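The drop count is large relative to the 7,500 raw rows, so it may be worth checking which values fall through to NaN. The sketch below re-declares the parser (with `float('nan')` standing in for `np.nan`) and runs it on the category strings visible in the preview. `parse_with_daily` and its value of 7 sessions per week are hypothetical, an assumption for illustration rather than something this notebook ran.

```python
import math
import re

def parse_exercise_freq(s):
    """Same logic as the notebook's parser; float('nan') replaces np.nan."""
    s = str(s).lower().strip()
    if s in ('never', 'none', '0', '0 times/week'):
        return 0.0
    m = re.match(r'(\d+)\s*-\s*(\d+)', s)   # range such as "3-5"
    if m:
        return (float(m.group(1)) + float(m.group(2))) / 2.0
    m2 = re.match(r'(\d+)\+', s)            # open-ended such as "6+"
    if m2:
        return float(m2.group(1))
    return float('nan')

# Category strings seen in the df.head() preview, plus a missing value.
samples = ["1-2 times/week", "3-5 times/week", "Daily", float('nan')]
parsed = {str(s): parse_exercise_freq(s) for s in samples}
for label, value in parsed.items():
    print(f"{label!r:>20} -> {value}")

# Hypothetical extension (an assumption, not what the notebook ran):
# map "Daily" to 7 sessions per week instead of letting it become NaN.
def parse_with_daily(s, daily_value=7.0):
    if str(s).lower().strip() == 'daily':
        return daily_value
    return parse_exercise_freq(s)
```

With the parser as written, "Daily" matches none of the patterns and is dropped together with the genuinely missing entries. Adopting a mapping like the hypothetical one would change the row count and every downstream number, so the results reported here apply only to the original parser.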


In [65]:
data = df[['height_cm', 'age', 'exercise_freq_num', 'weight_kg']].dropna()
print("Final dataset shape:", data.shape)
data.head()


Final dataset shape: (3696, 4)


| | height_cm | age | exercise_freq_num | weight_kg |
|---|---|---|---|---|
| 1 | 169.3 | 69 | 1.5 | 70.7 |
| 3 | 170.6 | 32 | 4.0 | 76.4 |
| 4 | 158.4 | 60 | 4.0 | 60.4 |
| 6 | 152.5 | 38 | 1.5 | 88.0 |
| 9 | 162.0 | 40 | 1.5 | 77.4 |


## Modeling
Train/Test Split: 80/20 split using random_state=42 for reproducibility.

Algorithm: scikit-learn’s LinearRegression, yielding an intercept and one coefficient per feature.

Interpretation:

* Each coefficient represents the expected change in weight (kg) for a one-unit increase in the corresponding feature, holding others constant.

* The intercept is the predicted weight when all features are zero (not meaningful in isolation here, but part of the model).
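
Written out, the model has the following form. The symbols are notation introduced here for convenience: $\hat{w}$ is the predicted weight in kg, $h$ is height_cm, $a$ is age, $e$ is exercise_freq_num, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \beta_3$ are the three coefficients.

$$
\hat{w} = \beta_0 + \beta_1 h + \beta_2 a + \beta_3 e
$$

Holding $a$ and $e$ fixed, increasing $h$ by one unit changes $\hat{w}$ by exactly $\beta_1$, which is the "one-unit increase" reading described above.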

In [67]:
from sklearn.model_selection import train_test_split

X = data[['height_cm', 'age', 'exercise_freq_num']]
y = data['weight_kg']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42
)

In [68]:
print(f"Train samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}")


Train samples: 2956, Test samples: 740


In [69]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Model coefficients:")
for feat, coef in zip(X.columns, lr.coef_):
    print(f"  {feat}: {coef:.4f}")
print(f"Intercept: {lr.intercept_:.4f}")

Model coefficients:
  height_cm: -0.0068
  age: 0.0027
  exercise_freq_num: 0.0400
Intercept: 70.8792


## Coefficient Interpretation

- **Height (–0.0068):**  
  Each additional centimeter of height is associated with a **0.0068 kg decrease** in predicted weight, holding age and exercise constant. The effect is negligible (about 0.07 kg across a 10 cm difference) and runs against the usual expectation that taller people weigh more. Since the file name marks the dataset as synthetic, height and weight may have been generated almost independently; this is worth checking before reading anything into the sign.

- **Age (+0.0027):**  
  Each additional year of age adds about **0.0027 kg** to predicted weight, all else equal. This effect is very small and suggests age alone isn’t a strong linear driver of weight in this cohort.

- **Exercise Frequency (+0.0400):**  
  For each additional exercise session per week (on our numeric encoding), predicted weight increases by **0.04 kg**, holding height and age constant. One possible explanation is that more active individuals carry more muscle mass, but an effect this small may simply be noise and warrants deeper domain‐specific investigation.

- **Intercept (70.8792):**  
  This is the model’s predicted weight when all features are zero (height = 0 cm, age = 0 years, exercise = 0 times/week). While not meaningful in real life, it anchors the linear equation.
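
To make the size of these effects concrete, the sketch below plugs the printed coefficients into the linear equation by hand. The helper `predict_weight` and the example person (170 cm, 40 years, 4 sessions per week) are made up for illustration; only the four numbers come from the fit above.

```python
# Intercept and coefficients as printed by the fitted model above.
INTERCEPT = 70.8792
COEFS = {"height_cm": -0.0068, "age": 0.0027, "exercise_freq_num": 0.0400}

def predict_weight(height_cm, age, exercise_freq_num):
    """Evaluate the fitted linear equation for one (hypothetical) person."""
    features = {
        "height_cm": height_cm,
        "age": age,
        "exercise_freq_num": exercise_freq_num,
    }
    return INTERCEPT + sum(COEFS[name] * value for name, value in features.items())

example = predict_weight(170.0, 40, 4.0)   # hypothetical inputs
taller = predict_weight(180.0, 40, 4.0)    # same person, 10 cm taller

print(f"Predicted weight:       {example:.4f} kg")
print(f"Effect of +10 cm height: {taller - example:+.4f} kg")
```

The prediction moves by well under 0.1 kg for a 10 cm change in height, so the model outputs roughly the same weight for almost any realistic input.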


In [70]:
from sklearn.metrics import mean_squared_error

y_pred = lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error (MSE): {mse:.2f}")


Test Mean Squared Error (MSE): 209.89


## Model Evaluation
We predict on the test set and compute the Mean Squared Error (MSE), which comes to 209.89. Taking its square root gives an RMSE of about 14.49 kg, so a typical prediction misses the true weight by roughly 14.5 kg, which is far too high for reliable individual forecasting.
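
An RMSE is easier to judge against a reference point. The sketch below reproduces the square-root step and adds a hypothetical helper, `baseline_rmse`, that scores a model which always predicts the training-set mean. This baseline was not computed in the notebook, so no baseline value is claimed here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def baseline_rmse(y_train, y_test):
    """RMSE of a constant predictor that always outputs the training mean."""
    train_mean = sum(y_train) / len(y_train)
    return rmse(y_test, [train_mean] * len(y_test))

# The MSE reported above, converted to RMSE.
reported_mse = 209.89
reported_rmse = math.sqrt(reported_mse)
print(f"RMSE from reported MSE: {reported_rmse:.2f} kg")
```

In the notebook, `baseline_rmse(y_train.tolist(), y_test.tolist())` would give the comparison figure. If it lands close to 14.49 kg, the three features add little over simply predicting the mean weight.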

## Reflection and Next Steps
While our simple linear model yields interpretable coefficients, its high error and near-zero coefficients suggest that the three features carry little linear signal, whether because of non‐linearities or missing predictors. We recommend plotting residuals to check for systematic patterns, engineering interaction or polynomial features to capture curvature, scaling inputs (which matters most for regularized models), and comparing against regularized (Ridge/Lasso) or non-linear (Random Forest) algorithms. Finally, collaborating with healthcare experts to incorporate domain-specific variables, such as diet quality, will likely yield a more accurate and actionable weight‐prediction model. BMI needs care: it is computed from weight and height, so using it as a predictor would leak the target.
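
One way to run the suggested comparison is with cross-validation, sketched below. The data here is random placeholder data generated only so the block runs on its own; the model settings (`degree=2`, `alpha=1.0`, `n_estimators=50`) are arbitrary starting points, not tuned values, and the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Placeholder data (an assumption): in the notebook, use X and y instead.
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(170, 10, 300),        # stand-in for height_cm
    rng.integers(18, 80, 300),       # stand-in for age
    rng.choice([1.5, 4.0], 300),     # stand-in for exercise_freq_num
])
y_demo = rng.normal(70, 15, 300)     # stand-in for weight_kg

candidates = {
    "linear": LinearRegression(),
    # Polynomial and interaction terms, scaled, then Ridge-regularized.
    "ridge_poly2": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    ),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn returns negated RMSE so that higher is better; flip the sign.
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()
    print(f"{name:>14}: CV RMSE = {cv_rmse[name]:.2f}")
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would show whether any of these alternatives improves on the 14.49 kg test RMSE.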