# Lifestyle & Health Risk – Preprocessing

This notebook performs **data preprocessing** for modeling.

**Preprocessing steps:**
1. Data loading
2. Separation of features and target variable
3. Variable type identification
4. Encoding of categorical variables
5. Target variable encoding
6. Train/Test split
7. Standardization of numerical variables
8. Handling class imbalance (conservative SMOTE, training set only)
9. Summary of preprocessed data
10. Saving preprocessed data and preprocessing objects


In [None]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from imblearn.over_sampling import SMOTE
import pickle
import os
import warnings

# Suppress joblib/loky warnings about CPU cores (Windows issue)
os.environ['LOKY_MAX_CPU_COUNT'] = '4'  # Adjust based on your CPU cores
warnings.filterwarnings('ignore', category=UserWarning, module='joblib')

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# For reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## 1. Data Loading


In [13]:
# Load data
file_path = "Lifestyle_and_Health_Risk_Prediction_Synthetic_Dataset (1).csv"
df = pd.read_csv(file_path)

print("Dataset shape (rows, columns):", df.shape)
print("\nFirst rows:")
display(df.head())


Dataset shape (rows, columns): (5000, 12)

First rows:


Unnamed: 0,age,weight,height,exercise,sleep,sugar_intake,smoking,alcohol,married,profession,bmi,health_risk
0,56,67,195,low,6.1,medium,yes,yes,yes,office_worker,17.6,high
1,69,76,170,high,6.9,high,no,no,no,teacher,26.3,high
2,46,106,153,high,6.6,low,yes,no,no,artist,45.3,high
3,32,54,186,medium,8.5,medium,no,no,no,artist,15.6,low
4,60,98,195,high,8.0,low,no,no,yes,teacher,25.8,high


## 2. Separation of Features and Target Variable


In [3]:
# Separate features / target
target_col = "health_risk"
X = df.drop(columns=[target_col])
y = df[target_col]

print("Shape of X (features):", X.shape)
print("Shape of y (target):", y.shape)
print("\nTarget variable distribution:")
print(y.value_counts())
print("\nProportions:")
print(y.value_counts(normalize=True).round(3))


Shape of X (features): (5000, 11)
Shape of y (target): (5000,)

Target variable distribution:
health_risk
high    3490
low     1510
Name: count, dtype: int64

Proportions:
health_risk
high    0.698
low     0.302
Name: proportion, dtype: float64


## 3. Variable Type Identification


In [4]:
# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

print("Numerical variables:", numerical_cols)
print("Categorical variables:", categorical_cols)

# Display unique values for categorical variables
print("\nUnique values per categorical variable:")
for col in categorical_cols:
    print(f"\n{col}: {X[col].unique()}")
    print(f"  Number of unique values: {X[col].nunique()}")


Numerical variables: ['age', 'weight', 'height', 'sleep', 'bmi']
Categorical variables: ['exercise', 'sugar_intake', 'smoking', 'alcohol', 'married', 'profession']

Unique values per categorical variable:

exercise: ['low' 'high' 'medium' 'none']
  Number of unique values: 4

sugar_intake: ['medium' 'high' 'low']
  Number of unique values: 3

smoking: ['yes' 'no']
  Number of unique values: 2

alcohol: ['yes' 'no']
  Number of unique values: 2

married: ['yes' 'no']
  Number of unique values: 2

profession: ['office_worker' 'teacher' 'artist' 'farmer' 'driver' 'engineer' 'student'
 'doctor']
  Number of unique values: 8


## 4. Encoding of Categorical Variables

**Optimized encoding strategy:**
- **Binary variables (2 categories)**: Label Encoding (0/1) - one column is sufficient
  - Examples: `smoking`, `alcohol`, `married` (yes/no)
- **Multi-category variables (>2 categories)**: One-Hot Encoding
  - Examples: `exercise` (4 categories), `profession` (8 categories), `sugar_intake` (3 categories)


In [5]:
# Separate binary and multi-category variables
binary_cols = [col for col in categorical_cols if X[col].nunique() == 2]
multi_cat_cols = [col for col in categorical_cols if X[col].nunique() > 2]

print("Binary variables (Label Encoding 0/1):", binary_cols)
print("Multi-category variables (One-Hot Encoding):", multi_cat_cols)

# Encode binary variables with Label Encoding (0/1)
X_encoded = X.copy()
label_encoders = {}

for col in binary_cols:
    le = LabelEncoder()
    X_encoded[col] = le.fit_transform(X[col])
    label_encoders[col] = le
    print(f"\n{col} - Mapping:")
    for i, label in enumerate(le.classes_):
        print(f"  {label} -> {i}")

# One-Hot Encoding for multi-category variables
if len(multi_cat_cols) > 0:
    X_encoded = pd.get_dummies(X_encoded, columns=multi_cat_cols, drop_first=False)

print(f"\n{'='*60}")
print(f"Shape before encoding: {X.shape}")
print(f"Shape after encoding: {X_encoded.shape}")
print(f"Binary variables encoded: {len(binary_cols)} (→ {len(binary_cols)} columns)")
# One-hot encoding replaces each multi-category column, so add those columns back when counting
print(f"Multi-category variables encoded: {len(multi_cat_cols)} (→ {X_encoded.shape[1] - X.shape[1] + len(multi_cat_cols)} new columns)")

print("\nFirst columns after encoding:")
display(X_encoded.head())


Binary variables (Label Encoding 0/1): ['smoking', 'alcohol', 'married']
Multi-category variables (One-Hot Encoding): ['exercise', 'sugar_intake', 'profession']

smoking - Mapping:
  no -> 0
  yes -> 1

alcohol - Mapping:
  no -> 0
  yes -> 1

married - Mapping:
  no -> 0
  yes -> 1

Shape before encoding: (5000, 11)
Shape after encoding: (5000, 23)
Binary variables encoded: 3 (→ 3 columns)
Multi-category variables encoded: 3 (→ 15 new columns)

First columns after encoding:


Unnamed: 0,age,weight,height,sleep,smoking,alcohol,married,bmi,exercise_high,exercise_low,exercise_medium,exercise_none,sugar_intake_high,sugar_intake_low,sugar_intake_medium,profession_artist,profession_doctor,profession_driver,profession_engineer,profession_farmer,profession_office_worker,profession_student,profession_teacher
0,56,67,195,6.1,1,1,1,17.6,False,True,False,False,False,False,True,False,False,False,False,False,True,False,False
1,69,76,170,6.9,0,0,0,26.3,True,False,False,False,True,False,False,False,False,False,False,False,False,False,True
2,46,106,153,6.6,1,0,0,45.3,True,False,False,False,False,True,False,True,False,False,False,False,False,False,False
3,32,54,186,8.5,0,0,0,15.6,False,False,True,False,False,False,True,True,False,False,False,False,False,False,False
4,60,98,195,8.0,0,0,1,25.8,True,False,False,False,False,True,False,False,False,False,False,False,False,False,True
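
Because `pd.get_dummies` derives its columns from the values present in the frame it is given, data encoded later may produce fewer or differently ordered columns than the training frame. The following is a minimal sketch of one way to align such data with `DataFrame.reindex`. The toy frames and the names `train`, `new`, and `train_columns` are made up for illustration and are not part of this notebook.

```python
import pandas as pd

# Toy training frame (illustrative values only)
train = pd.DataFrame({"age": [30, 40, 50], "exercise": ["low", "high", "none"]})
train_encoded = pd.get_dummies(train, columns=["exercise"])
train_columns = train_encoded.columns.tolist()

# A new row that contains only one of the categories seen in training
new = pd.DataFrame({"age": [25], "exercise": ["high"]})
new_encoded = pd.get_dummies(new, columns=["exercise"])

# Align with the training layout: missing dummy columns are added and filled with False
new_aligned = new_encoded.reindex(columns=train_columns, fill_value=False)
print(new_aligned.columns.tolist())
```

Note that `reindex` also silently drops dummy columns for categories that were never seen in training, so unseen values may need a separate check.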


## 5. Target Variable Encoding

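The target variable `health_risk` must be encoded as integers. `LabelEncoder` assigns codes in alphabetical order of the class names, so 'high' becomes 0 and 'low' becomes 1. Keep this in mind when interpreting metrics: class 1 is the low-risk (minority) class.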


In [6]:
# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print("Label mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {label} -> {i}")

print(f"\nDistribution after encoding:")
unique, counts = np.unique(y_encoded, return_counts=True)
for val, count in zip(unique, counts):
    print(f"  {val}: {count} ({count/len(y_encoded)*100:.1f}%)")


Label mapping:
  high -> 0
  low -> 1

Distribution after encoding:
  0: 3490 (69.8%)
  1: 1510 (30.2%)
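
The mapping above follows from `LabelEncoder` sorting the class names. The sketch below reproduces that ordering in plain Python and contrasts it with an explicit mapping. The `explicit` dictionary is a hypothetical alternative for readers who prefer `high` as class 1; it is not what this notebook uses.

```python
labels = ["high", "low", "low", "high"]

# Codes assigned in sorted order of the class names, as LabelEncoder does
alphabetical = {label: code for code, label in enumerate(sorted(set(labels)))}
y_alphabetical = [alphabetical[label] for label in labels]

# Hypothetical explicit mapping that makes 'high' the positive class
explicit = {"low": 0, "high": 1}
y_explicit = [explicit[label] for label in labels]

print(alphabetical)
print(y_alphabetical, y_explicit)
```

Whichever mapping is used, downstream metrics such as precision and recall should state which code is treated as the positive class.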


## 6. Train/Test Split

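We split the data before standardization and SMOTE, so that the scaler statistics and the synthetic samples are derived from the training set only and nothing leaks from the test set.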


In [7]:
# Train/Test split (stratified to preserve target variable distribution)
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, 
    y_encoded, 
    test_size=0.2, 
    random_state=RANDOM_STATE,
    stratify=y_encoded  # Stratification to preserve distribution
)

print(f"Shape X_train: {X_train.shape}")
print(f"Shape X_test: {X_test.shape}")
print(f"Shape y_train: {y_train.shape}")
print(f"Shape y_test: {y_test.shape}")

print("\nDistribution in training set:")
unique, counts = np.unique(y_train, return_counts=True)
for val, count in zip(unique, counts):
    label = label_encoder.inverse_transform([val])[0]
    print(f"  {label} ({val}): {count} ({count/len(y_train)*100:.1f}%)")

print("\nDistribution in test set:")
unique, counts = np.unique(y_test, return_counts=True)
for val, count in zip(unique, counts):
    label = label_encoder.inverse_transform([val])[0]
    print(f"  {label} ({val}): {count} ({count/len(y_test)*100:.1f}%)")


Shape X_train: (4000, 23)
Shape X_test: (1000, 23)
Shape y_train: (4000,)
Shape y_test: (1000,)

Distribution in training set:
  high (0): 2792 (69.8%)
  low (1): 1208 (30.2%)

Distribution in test set:
  high (0): 698 (69.8%)
  low (1): 302 (30.2%)


## 7. Normalization/Standardization of Numerical Variables

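We standardize the numerical variables (zero mean, unit variance) so that they share a common scale. The scaler is fitted on the training set only and then applied to the test set; the binary and one-hot columns are left unchanged.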


In [8]:
# Identify numerical columns in X_encoded
# Original numerical columns are still present
numeric_cols_in_encoded = [col for col in numerical_cols if col in X_encoded.columns]

print("Numerical columns to standardize:", numeric_cols_in_encoded)

# Standardization
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Standardize only numerical columns
X_train_scaled[numeric_cols_in_encoded] = scaler.fit_transform(X_train[numeric_cols_in_encoded])
X_test_scaled[numeric_cols_in_encoded] = scaler.transform(X_test[numeric_cols_in_encoded])

print("\nStatistics before standardization (train):")
display(X_train[numeric_cols_in_encoded].describe())

print("\nStatistics after standardization (train):")
display(X_train_scaled[numeric_cols_in_encoded].describe())


Numerical columns to standardize: ['age', 'weight', 'height', 'sleep', 'bmi']

Statistics before standardization (train):


Unnamed: 0,age,weight,height,sleep,bmi
count,4000.0,4000.0,4000.0,4000.0,4000.0
mean,48.78375,77.44075,172.094,6.9872,26.8275
std,17.802464,18.77143,15.876387,1.432744,8.286519
min,18.0,45.0,145.0,3.0,11.4
25%,34.0,61.0,159.0,6.0,20.3
50%,49.0,77.0,172.0,7.0,25.9
75%,64.0,94.0,186.0,8.0,32.4
max,79.0,109.0,199.0,10.0,51.4



Statistics after standardization (train):


Unnamed: 0,age,weight,height,sleep,bmi
count,4000.0,4000.0,4000.0,4000.0,4000.0
mean,1.24345e-16,2.922107e-16,3.694822e-16,2.602363e-16,-7.01661e-17
std,1.000125,1.000125,1.000125,1.000125,1.000125
min,-1.729401,-1.728414,-1.706773,-2.78326,-1.861992
25%,-0.8305366,-0.8759485,-0.8248499,-0.6891138,-0.7878237
50%,0.01214871,-0.02348277,-0.005921483,0.008935025,-0.1119428
75%,0.8548341,0.8822621,0.8760015,0.7069839,0.6725619
max,1.697519,1.681449,1.69493,2.103082,2.965729
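
The standard deviation of 1.000125 in the table above, rather than exactly 1, is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With n = 4000 rows the ratio is sqrt(n / (n - 1)). A small stdlib sketch of the same effect, using made-up values:

```python
import math
import statistics

values = [18.0, 34.0, 49.0, 64.0, 79.0]  # illustrative values only

# Standardize with the population standard deviation, as StandardScaler does
mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)
scaled = [(v - mean) / pop_std for v in values]

# The sample standard deviation of the result is sqrt(n / (n - 1)), not 1
sample_std_after = statistics.stdev(scaled)
expected_ratio = math.sqrt(len(values) / (len(values) - 1))
print(sample_std_after, expected_ratio)

# Same ratio for the 4000 training rows in this notebook
print(round(math.sqrt(4000 / 3999), 6))
```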


## 8. Handling Class Imbalance with SMOTE Conservative

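The training set is imbalanced (69.8% high vs 30.2% low).

We use a **conservative SMOTE** setting, `sampling_strategy=0.6`, which oversamples the minority class (`low`) until it reaches 60% of the majority class size. This yields a 62.5/37.5 split rather than 50/50 and adds only 467 synthetic samples, which reduces overfitting risk while still improving minority class detection.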


In [9]:
# Distribution before balancing
print("Distribution BEFORE balancing:")
unique, counts = np.unique(y_train, return_counts=True)
for val, count in zip(unique, counts):
    label = label_encoder.inverse_transform([val])[0]
    print(f"  {label} ({val}): {count} ({count/len(y_train)*100:.1f}%)")

# Conservative SMOTE: oversample the minority class to 60% of the majority class size
# (minority/majority ratio = 0.6, i.e. a 62.5/37.5 split instead of 50/50)
print("\n✓ Using CONSERVATIVE SMOTE (minority/majority ratio 0.6) - adds fewer synthetic samples")
smote = SMOTE(sampling_strategy=0.6, random_state=RANDOM_STATE)  # minority = 0.6 x majority
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print(f"\nShape BEFORE - X_train: {X_train_scaled.shape}, y_train: {y_train.shape}")
print(f"Shape AFTER - X_train: {X_train_balanced.shape}, y_train: {y_train_balanced.shape}")

print("\nDistribution AFTER balancing:")
unique, counts = np.unique(y_train_balanced, return_counts=True)
for val, count in zip(unique, counts):
    label = label_encoder.inverse_transform([val])[0]
    print(f"  {label} ({val}): {count} ({count/len(y_train_balanced)*100:.1f}%)")

# Show how many synthetic samples were added
synthetic_added = len(y_train_balanced) - len(y_train)
print(f"\n📊 Synthetic samples added: {synthetic_added}")
# Minority count before resampling comes from y_train, after resampling from the balanced counts
print(f"   Original minority class: {np.bincount(y_train).min()}")
print(f"   Final minority class: {counts.min()}")


Distribution BEFORE balancing:
  high (0): 2792 (69.8%)
  low (1): 1208 (30.2%)

✓ Using CONSERVATIVE SMOTE (minority/majority ratio 0.6) - adds fewer synthetic samples

Shape BEFORE - X_train: (4000, 23), y_train: (4000,)
Shape AFTER - X_train: (4467, 23), y_train: (4467,)

Distribution AFTER balancing:
  high (0): 2792 (62.5%)
  low (1): 1675 (37.5%)

📊 Synthetic samples added: 467
   Original minority class: 1208
   Final minority class: 1675
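
When `sampling_strategy` is a float, imbalanced-learn interprets it as the desired ratio of minority to majority samples after resampling, not as the minority share of the whole set. The arithmetic can be sketched as below. The helper `smote_counts` is hypothetical, and the truncation to an integer is an assumption chosen to match the 467 samples reported above.

```python
def smote_counts(n_majority, n_minority, sampling_strategy):
    """Return (synthetic samples added, final minority count) for a ratio target."""
    # Assumption: the number of new samples is truncated to an integer
    added = int(n_majority * sampling_strategy - n_minority)
    return added, n_minority + added


# Training set counts from this notebook: 2792 high, 1208 low
added, final_minority = smote_counts(2792, 1208, 0.6)
share = final_minority / (2792 + final_minority)
print(added, final_minority, round(100 * share, 1))

# A true 60/40 split would need a ratio of 40/60, roughly 0.667
added_6040, final_6040 = smote_counts(2792, 1208, 2 / 3)
share_6040 = final_6040 / (2792 + final_6040)
print(added_6040, final_6040, round(100 * share_6040, 1))
```

Under these assumptions, a ratio of 0.6 gives a 62.5/37.5 split, while an actual 60/40 split would require a ratio near 0.667 and noticeably more synthetic samples.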


## 9. Summary of Preprocessed Data


In [10]:
print("=" * 60)
print("PREPROCESSING SUMMARY")
print("=" * 60)

print(f"\n1. Original data:")
print(f"   - Shape: {df.shape}")
print(f"   - Features: {X.shape[1]}")
print(f"   - Categorical variables: {len(categorical_cols)}")
print(f"   - Numerical variables: {len(numerical_cols)}")

print(f"\n2. After optimized encoding:")
print(f"   - Binary variables (Label Encoding): {len(binary_cols)} → {len(binary_cols)} columns")
print(f"   - Multi-category variables (One-Hot): {len(multi_cat_cols)} → {X_encoded.shape[1] - X.shape[1] + len(multi_cat_cols)} new columns")
# Full one-hot count = numerical columns + one dummy column per category of every categorical variable
print(f"   - Total features: {X_encoded.shape[1]} (instead of {len(numerical_cols) + sum(X[col].nunique() for col in categorical_cols)} with full One-Hot)")

print(f"\n3. Train/Test split:")
print(f"   - Train: {X_train.shape[0]} samples ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"   - Test: {X_test.shape[0]} samples ({X_test.shape[0]/len(df)*100:.1f}%)")

print(f"\n4. After standardization:")
print(f"   - Standardized columns: {len(numeric_cols_in_encoded)}")

print(f"\n5. After SMOTE (conservative):")
print(f"   - Train balanced: {X_train_balanced.shape[0]} samples")
print(f"   - Final features: {X_train_balanced.shape[1]}")

print(f"\n6. Data ready for modeling:")
print(f"   - X_train_balanced: {X_train_balanced.shape} (type: {type(X_train_balanced).__name__})")
print(f"   - y_train_balanced: {y_train_balanced.shape} (type: {type(y_train_balanced).__name__})")
print(f"   - X_test_scaled: {X_test_scaled.shape} (type: {type(X_test_scaled).__name__})")
print(f"   - y_test: {y_test.shape} (type: {type(y_test).__name__})")
print("=" * 60)


PREPROCESSING SUMMARY

1. Original data:
   - Shape: (5000, 12)
   - Features: 11
   - Categorical variables: 6
   - Numerical variables: 5

2. After optimized encoding:
   - Binary variables (Label Encoding): 3 → 3 columns
   - Multi-category variables (One-Hot): 3 → 15 new columns
   - Total features: 23 (instead of 26 with full One-Hot)

3. Train/Test split:
   - Train: 4000 samples (80.0%)
   - Test: 1000 samples (20.0%)

4. After standardization:
   - Standardized columns: 5

5. After SMOTE (conservative):
   - Train balanced: 4467 samples
   - Final features: 23

6. Data ready for modeling:
   - X_train_balanced: (4467, 23) (type: DataFrame)
   - y_train_balanced: (4467,) (type: ndarray)
   - X_test_scaled: (1000, 23) (type: DataFrame)
   - y_test: (1000,) (type: ndarray)


## 10. Saving Preprocessed Data and Preprocessing Objects

We save the preprocessed data and objects (scaler, label_encoder) so they can be reused later.


In [11]:
# Create directory to save preprocessed data
output_dir = "preprocessed_data"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Directory '{output_dir}' created.")

# Save preprocessed data (in numpy format for efficiency)
# Note: SMOTE returns X_train_balanced as a DataFrame when given one; y_train_balanced is a numpy array.
# np.asarray(..., dtype=np.float64) handles both DataFrames and arrays and casts the boolean
# one-hot columns to 0.0/1.0, so the saved files hold plain float arrays rather than object arrays.
X_train_balanced_array = np.asarray(X_train_balanced, dtype=np.float64)
X_test_scaled_array = np.asarray(X_test_scaled, dtype=np.float64)

np.save(os.path.join(output_dir, "X_train_balanced.npy"), X_train_balanced_array)
np.save(os.path.join(output_dir, "y_train_balanced.npy"), y_train_balanced)
np.save(os.path.join(output_dir, "X_test_scaled.npy"), X_test_scaled_array)
np.save(os.path.join(output_dir, "y_test.npy"), y_test)

# Save column names (from X_train_scaled which is a DataFrame)
feature_names = X_train_scaled.columns.tolist()
with open(os.path.join(output_dir, "feature_names.pkl"), "wb") as f:
    pickle.dump(feature_names, f)

# Save preprocessing objects
with open(os.path.join(output_dir, "scaler.pkl"), "wb") as f:
    pickle.dump(scaler, f)

with open(os.path.join(output_dir, "label_encoder.pkl"), "wb") as f:
    pickle.dump(label_encoder, f)

# Save column information
preprocessing_info = {
    "numerical_cols": numerical_cols,
    "categorical_cols": categorical_cols,
    "binary_cols": binary_cols,
    "multi_cat_cols": multi_cat_cols,
    "numeric_cols_in_encoded": numeric_cols_in_encoded,
    "feature_names": feature_names,
    "balancing_strategy": "smote_conservative"
}

# Save label encoders for binary variables
with open(os.path.join(output_dir, "label_encoders_binary.pkl"), "wb") as f:
    pickle.dump(label_encoders, f)

with open(os.path.join(output_dir, "preprocessing_info.pkl"), "wb") as f:
    pickle.dump(preprocessing_info, f)

print("✓ Data and objects saved successfully in 'preprocessed_data' directory")
print(f"\nSaved files:")
print(f"  - X_train_balanced.npy")
print(f"  - y_train_balanced.npy")
print(f"  - X_test_scaled.npy")
print(f"  - y_test.npy")
print(f"  - feature_names.pkl")
print(f"  - scaler.pkl")
print(f"  - label_encoder.pkl (target variable)")
print(f"  - label_encoders_binary.pkl (binary variables)")
print(f"  - preprocessing_info.pkl")


✓ Data and objects saved successfully in 'preprocessed_data' directory

Saved files:
  - X_train_balanced.npy
  - y_train_balanced.npy
  - X_test_scaled.npy
  - y_test.npy
  - feature_names.pkl
  - scaler.pkl
  - label_encoder.pkl (target variable)
  - label_encoders_binary.pkl (binary variables)
  - preprocessing_info.pkl
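
A later modeling notebook could read these files back roughly as follows. This is a minimal sketch: the helper `load_preprocessed` is hypothetical, and the demo writes tiny stand-in files to a temporary directory so that it runs without the real data.

```python
import os
import pickle
import tempfile

import numpy as np


def load_preprocessed(directory):
    """Load the four arrays and the feature names written by the saving step."""
    names = ("X_train_balanced", "y_train_balanced", "X_test_scaled", "y_test")
    data = {name: np.load(os.path.join(directory, f"{name}.npy")) for name in names}
    with open(os.path.join(directory, "feature_names.pkl"), "rb") as f:
        data["feature_names"] = pickle.load(f)
    return data


# Demo with stand-in files (shapes and values are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "X_train_balanced.npy"), np.zeros((4, 2)))
    np.save(os.path.join(tmp, "y_train_balanced.npy"), np.array([0, 1, 0, 1]))
    np.save(os.path.join(tmp, "X_test_scaled.npy"), np.zeros((2, 2)))
    np.save(os.path.join(tmp, "y_test.npy"), np.array([0, 1]))
    with open(os.path.join(tmp, "feature_names.pkl"), "wb") as f:
        pickle.dump(["age", "bmi"], f)

    loaded = load_preprocessed(tmp)

print(loaded["X_train_balanced"].shape, loaded["feature_names"])
```

The pickled scaler and encoders can be loaded the same way with `pickle.load`; pickles of scikit-learn objects are generally only safe to load with a compatible scikit-learn version and from a trusted source.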
