# 04 - Model Training (Gender-Specific XGBoost)
## Osteoporosis Risk Prediction Model
**DSGP Group 40** | Student: Isum Gamage (ID: 20242052)

This notebook trains gender-specific XGBoost models for male and female cohorts.


## Step 1: Install and Import Required Libraries

In [1]:
# Install required libraries (run once)
!pip install xgboost scikit-learn pandas numpy matplotlib seaborn shap joblib --upgrade

import warnings
warnings.filterwarnings('ignore')

import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

print("✓ All libraries imported successfully!")
print(f"XGBoost version: {xgb.__version__}")
print(f"Pandas version: {pd.__version__}")

Collecting numpy
  Using cached numpy-2.4.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.6 kB)
✓ All libraries imported successfully!
XGBoost version: 3.1.3
Pandas version: 2.3.3


## Step 2: Load Preprocessed Data
Load the cleaned dataset prepared in the previous notebooks. Categorical columns still hold string labels at this point; they are encoded in Step 3.

In [2]:
# Load the CSV from a local file via the Colab upload dialog
from google.colab import files
print("Upload your preprocessed dataset (osteoporosis_cleaned_reorganized.csv):")
uploaded = files.upload()

# Get the filename
filename = list(uploaded.keys())[0]
df = pd.read_csv(filename)

print(f"\n✓ Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nColumn names:\n{df.columns.tolist()}")
print(f"\nTarget variable distribution:")
print(df['Osteoporosis'].value_counts())

Upload your preprocessed dataset (osteoporosis_cleaned_reorganized.csv):


Saving dataset_loaded.csv to dataset_loaded.csv

✓ Dataset loaded successfully!
Shape: (1958, 16)

Column names:
['Id', 'Age', 'Gender', 'Hormonal Changes', 'Family History', 'Race/Ethnicity', 'Body Weight', 'Calcium Intake', 'Vitamin D Intake', 'Physical Activity', 'Smoking', 'Alcohol Consumption', 'Medical Conditions', 'Medications', 'Prior Fractures', 'Osteoporosis']

Target variable distribution:
Osteoporosis
0    979
1    979
Name: count, dtype: int64


## Step 3: Data Preprocessing (Feature Engineering)
Fill missing values, encode binary and multi-category features, standardize `Age`, and add interaction terms.

In [3]:
# Create a working copy
df_processed = df.copy()

# ===== STEP 3.1: Handle Missing Values =====
print("Step 3.1: Handling Missing Values")
print("-" * 60)

# Treat missing entries in these columns as an explicit 'None' category
for col in ['Alcohol Consumption', 'Medical Conditions', 'Medications']:
    if col in df_processed.columns:
        n_missing = df_processed[col].isnull().sum()
        # Assign the result back rather than calling fillna(inplace=True) on the
        # selected column, which newer pandas versions flag as chained assignment
        df_processed[col] = df_processed[col].fillna('None')
        print(f"✓ {col}: filled {n_missing} missing values")

# Verify no missing values remain
remaining_missing = df_processed.isnull().sum().sum()
print(f"\n✓ Remaining missing values: {remaining_missing}")

# ===== STEP 3.2: Label Encoding for Binary Features =====
print("\nStep 3.2: Binary Feature Encoding")
print("-" * 60)

binary_encoding = {
    'Gender': {'Male': 0, 'Female': 1},
    'Hormonal Changes': {'Normal': 0, 'Post-menopausal': 1},
    'Body Weight': {'Normal': 0, 'Underweight': 1},
    'Calcium Intake': {'Adequate': 0, 'Low': 1},
    # NOTE: the dataset column is named 'Vitamin D Intake', so this key does not match
    # and the column is left unencoded here (it is cast to 'category' in Steps 7 and 8)
    'Vitamin D': {'Sufficient': 0, 'Insufficient': 1},
    'Physical Activity': {'Active': 0, 'Sedentary': 1},
    'Smoking': {'No': 0, 'Yes': 1},
    'Prior Fractures': {'No': 0, 'Yes': 1},
    'Family History': {'No': 0, 'Yes': 1},
}

for col, mapping in binary_encoding.items():
    if col in df_processed.columns:
        df_processed[col] = df_processed[col].map(mapping)
        print(f"✓ {col}: encoded")

# ===== STEP 3.3: One-Hot Encoding for Multi-Category Features =====
print("\nStep 3.3: Multi-Category Feature Encoding")
print("-" * 60)

categorical_cols = ['Race/Ethnicity', 'Alcohol Consumption', 'Medical Conditions', 'Medications']

for col in categorical_cols:
    if col in df_processed.columns:
        # One-hot encode
        encoded = pd.get_dummies(df_processed[col], prefix=col, drop_first=False)
        df_processed = pd.concat([df_processed, encoded], axis=1)
        df_processed.drop(col, axis=1, inplace=True)
        print(f"✓ {col}: one-hot encoded")

# ===== STEP 3.4: Feature Scaling (Age) =====
print("\nStep 3.4: Feature Scaling")
print("-" * 60)

scaler = StandardScaler()
if 'Age' in df_processed.columns:
    df_processed['Age'] = scaler.fit_transform(df_processed[['Age']])
    print(f"✓ Age: standardized (mean=0, std=1)")

# ===== STEP 3.5: Interaction Terms =====
print("\nStep 3.5: Interaction Terms")
print("-" * 60)

if 'Age' in df_processed.columns and 'Hormonal Changes' in df_processed.columns:
    df_processed['Age_x_Hormonal'] = df_processed['Age'] * df_processed['Hormonal Changes']
    print("✓ Age × Hormonal Changes")

if 'Age' in df_processed.columns and 'Prior Fractures' in df_processed.columns:
    df_processed['Age_x_Fractures'] = df_processed['Age'] * df_processed['Prior Fractures']
    print("✓ Age × Prior Fractures")

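# NOTE: the dataset has no 'Vitamin D' column (it is named 'Vitamin D Intake'),
# so this condition is False on this data and the interaction term is not created
if 'Calcium Intake' in df_processed.columns and 'Vitamin D' in df_processed.columns:
    df_processed['Calcium_x_VitaminD'] = df_processed['Calcium Intake'] * df_processed['Vitamin D']
    print("✓ Calcium Intake × Vitamin D")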

print(f"\n✓ Feature engineering complete!")
print(f"Final dataset shape: {df_processed.shape}")

Step 3.1: Handling Missing Values
------------------------------------------------------------
✓ Alcohol Consumption: filled 988 missing values
✓ Medical Conditions: filled 647 missing values
✓ Medications: filled 985 missing values

✓ Remaining missing values: 0

Step 3.2: Binary Feature Encoding
------------------------------------------------------------
✓ Gender: encoded
✓ Hormonal Changes: encoded
✓ Body Weight: encoded
✓ Calcium Intake: encoded
✓ Physical Activity: encoded
✓ Smoking: encoded
✓ Prior Fractures: encoded
✓ Family History: encoded

Step 3.3: Multi-Category Feature Encoding
------------------------------------------------------------
✓ Race/Ethnicity: one-hot encoded
✓ Alcohol Consumption: one-hot encoded
✓ Medical Conditions: one-hot encoded
✓ Medications: one-hot encoded

Step 3.4: Feature Scaling
------------------------------------------------------------
✓ Age: standardized (mean=0, std=1)

Step 3.5: Interaction Terms
---------------------------------------------
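
`Series.map` with a dict returns NaN for any label that is not a key of the mapping, and the missing-value check in Step 3.1 runs before the encoding, so a label spelled differently from the mapping keys (or a mapping key that matches no column) would go unnoticed. The sketch below shows one possible post-encoding check. The toy frame, the toy mapping, and the helper `encode_binary` are hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_binary(frame, encoding):
    """Map binary labels to 0/1 and report anything the mapping did not cover.

    Hypothetical helper for illustration only.
    """
    frame = frame.copy()
    problems = {}
    for col, mapping in encoding.items():
        if col not in frame.columns:
            problems[col] = None  # mapping key matches no column
            continue
        # Labels present in the data but absent from the mapping keys
        unknown = sorted(set(frame[col].dropna().unique()) - set(mapping))
        if unknown:
            problems[col] = unknown
        frame[col] = frame[col].map(mapping)
    return frame, problems

# Toy data: 'Smoking' contains a label the mapping does not cover
toy = pd.DataFrame({
    'Smoking': ['No', 'Yes', 'yes'],
    'Vitamin D Intake': ['Sufficient', 'Insufficient', 'Sufficient'],
})
toy_encoding = {
    'Smoking': {'No': 0, 'Yes': 1},
    'Vitamin D': {'Sufficient': 0, 'Insufficient': 1},  # key differs from the column name
}

encoded, problems = encode_binary(toy, toy_encoding)
print(problems)
print(f"NaN values introduced in 'Smoking': {encoded['Smoking'].isna().sum()}")
```

Running a check like this right after Step 3.2 would surface both unmapped labels and mismatched column names before any model is trained.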

## Step 4: Separate Data by Gender

In [4]:
print("Step 4: Gender-Specific Data Separation")
print("=" * 60)

# Prepare features and target
X = df_processed.drop('Osteoporosis', axis=1)
y = df_processed['Osteoporosis']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Separate by gender (assuming Gender is encoded as 0=Male, 1=Female)
male_mask = X['Gender'] == 0
female_mask = X['Gender'] == 1

# Split data
X_male = X[male_mask].copy()
y_male = y[male_mask].copy()

X_female = X[female_mask].copy()
y_female = y[female_mask].copy()

print(f"\n✓ Male cohort: {X_male.shape[0]} samples")
print(f"  Risk distribution: {y_male.value_counts().to_dict()}")

print(f"\n✓ Female cohort: {X_female.shape[0]} samples")
print(f"  Risk distribution: {y_female.value_counts().to_dict()}")

Step 4: Gender-Specific Data Separation
Features shape: (1958, 23)
Target shape: (1958,)

✓ Male cohort: 992 samples
  Risk distribution: {1: 502, 0: 490}

✓ Female cohort: 966 samples
  Risk distribution: {0: 489, 1: 477}
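
After the separation, `Gender` holds a single value within each cohort, and the feature matrix still contains the `Id` column because no step above drops it. A constant column cannot contribute to any tree split, and an identifier may let a tree model pick up row order rather than risk factors. The sketch below shows one way to remove such columns; the toy frame and the helper `drop_uninformative` are hypothetical.

```python
import pandas as pd

def drop_uninformative(features, id_cols=('Id',)):
    """Drop identifier columns and columns with a single unique value.

    Hypothetical helper for illustration only.
    """
    to_drop = [c for c in id_cols if c in features.columns]
    to_drop += [
        c for c in features.columns
        if c not in to_drop and features[c].nunique(dropna=False) <= 1
    ]
    return features.drop(columns=to_drop), to_drop

# Toy male cohort: 'Gender' is constant, 'Id' is an identifier
toy_male = pd.DataFrame({
    'Id': [101, 102, 103],
    'Age': [0.5, -1.2, 0.7],
    'Gender': [0, 0, 0],
    'Smoking': [1, 0, 1],
})

cleaned, dropped = drop_uninformative(toy_male)
print(f"Dropped: {dropped}")
print(f"Remaining: {cleaned.columns.tolist()}")
```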


## Step 5: Train-Test Split (80-20 Stratified)

In [5]:
print("Step 5: Train-Test Split (80-20 Stratified)")
print("=" * 60)

# Male split
X_train_male, X_test_male, y_train_male, y_test_male = train_test_split(
    X_male, y_male, test_size=0.2, stratify=y_male, random_state=42
)

# Female split
X_train_female, X_test_female, y_train_female, y_test_female = train_test_split(
    X_female, y_female, test_size=0.2, stratify=y_female, random_state=42
)

print(f"\nMALE MODEL:")
print(f"  Train: {X_train_male.shape[0]} samples, Risk: {y_train_male.sum()} cases")
print(f"  Test: {X_test_male.shape[0]} samples, Risk: {y_test_male.sum()} cases")

print(f"\nFEMALE MODEL:")
print(f"  Train: {X_train_female.shape[0]} samples, Risk: {y_train_female.sum()} cases")
print(f"  Test: {X_test_female.shape[0]} samples, Risk: {y_test_female.sum()} cases")

print(f"\n✓ Data split complete!")

Step 5: Train-Test Split (80-20 Stratified)

MALE MODEL:
  Train: 793 samples, Risk: 401 cases
  Test: 199 samples, Risk: 101 cases

FEMALE MODEL:
  Train: 772 samples, Risk: 381 cases
  Test: 194 samples, Risk: 96 cases

✓ Data split complete!
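
Step 3.4 fits the `StandardScaler` on every row before this split, so the test rows contribute to the mean and standard deviation applied to the training data. Tree-based models are largely insensitive to monotonic rescaling, so the effect on XGBoost is probably small, but fitting on the training rows only keeps the held-out set untouched. A minimal sketch with synthetic ages (all values are made up, and the variable names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, perfectly balanced toy data
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Age': rng.integers(18, 91, size=200),
    'Osteoporosis': np.tile([0, 1], 100),
})

X_train, X_test, y_train, y_test = train_test_split(
    toy[['Age']], toy['Osteoporosis'],
    test_size=0.2, stratify=toy['Osteoporosis'], random_state=42,
)

# Fit on the training rows only, then apply the same transform to the test rows
age_scaler = StandardScaler()
X_train = X_train.assign(Age=age_scaler.fit_transform(X_train[['Age']]).ravel())
X_test = X_test.assign(Age=age_scaler.transform(X_test[['Age']]).ravel())

print(f"Train mean={X_train['Age'].mean():.3f}, std={X_train['Age'].std(ddof=0):.3f}")
print(f"Test mean={X_test['Age'].mean():.3f}")
```

The test mean is generally not exactly zero, which is expected: the test rows are scaled with statistics they did not influence.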


## Step 6: XGBoost Hyperparameters Configuration

In [6]:
print("Step 6: XGBoost Configuration")
print("=" * 60)

xgb_params = {
    'objective': 'binary:logistic',
    'max_depth': 6,
    'learning_rate': 0.05,
    'n_estimators': 150,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 3,
    'gamma': 0.1,
    'random_state': 42,
    'verbosity': 0,
    'eval_metric': 'logloss'
}

print("XGBoost Parameters:")
for key, value in xgb_params.items():
    print(f"  {key}: {value}")

Step 6: XGBoost Configuration
XGBoost Parameters:
  objective: binary:logistic
  max_depth: 6
  learning_rate: 0.05
  n_estimators: 150
  subsample: 0.8
  colsample_bytree: 0.8
  min_child_weight: 3
  gamma: 0.1
  random_state: 42
  verbosity: 0
  eval_metric: logloss
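
The `binary:logistic` objective outputs probabilities, and the `logloss` evaluation metric scores them with binary cross-entropy. As a worked illustration (the labels and probabilities below are made up), the metric can be computed by hand and compared with `sklearn.metrics.log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])  # predicted P(y = 1)

# Binary cross-entropy: -mean(y * log(p) + (1 - y) * log(1 - p))
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(f"Manual log loss:  {manual:.4f}")
print(f"sklearn log loss: {log_loss(y_true, y_prob):.4f}")
```

Lower values mean the predicted probabilities sit closer to the true labels; a confident wrong prediction is penalized much more heavily than an uncertain one.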


## Step 7: Train Male XGBoost Model

In [10]:
print("\nStep 7: Training Male XGBoost Model")
print("=" * 60)
print(f"Training samples: {X_train_male.shape[0]}")
print(f"Features: {X_train_male.shape[1]}\n")

# 'Vitamin D Intake' was not encoded in Step 3.2 because the mapping key there is 'Vitamin D',
# so it may still hold string labels. With enable_categorical=True, XGBoost accepts pandas
# 'category' columns, so convert the column unless it is already numeric or categorical.
vit_d = 'Vitamin D Intake'
if vit_d in X_train_male.columns:
    dtype = X_train_male[vit_d].dtype
    if not (pd.api.types.is_numeric_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype)):
        print(f"Warning: '{vit_d}' column has dtype {dtype}. Converting to 'category' dtype.")
        X_train_male[vit_d] = X_train_male[vit_d].astype('category')
        X_test_male[vit_d] = X_test_male[vit_d].astype('category')

# Initialize and train model
male_model = xgb.XGBClassifier(**xgb_params, enable_categorical=True)

male_model.fit(
    X_train_male, y_train_male,
    eval_set=[(X_train_male, y_train_male), (X_test_male, y_test_male)],
    verbose=False
)

# Make predictions
y_pred_male = male_model.predict(X_test_male)
y_pred_proba_male = male_model.predict_proba(X_test_male)[:, 1]

print(f"\n✓ Male model training complete!")
print(f"Predictions generated for {len(y_pred_male)} male test samples")


Step 7: Training Male XGBoost Model
Training samples: 793
Features: 23


✓ Male model training complete!
Predictions generated for 199 male test samples
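
Calling `astype('category')` separately on the training and test frames lets pandas infer the categories from each frame on its own, so a label absent from one split would yield a different category set and different integer codes. One way to guard against that, sketched here with toy data, is to build a single `CategoricalDtype` from the training column and apply it to both frames. The frames and the name `vit_d_dtype` are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({'Vitamin D Intake': ['Sufficient', 'Insufficient', 'Sufficient']})
test = pd.DataFrame({'Vitamin D Intake': ['Sufficient', 'Sufficient']})  # one label absent

# Define the categories once, from the training data, in a fixed order
vit_d_dtype = pd.CategoricalDtype(
    categories=sorted(train['Vitamin D Intake'].dropna().unique())
)
train['Vitamin D Intake'] = train['Vitamin D Intake'].astype(vit_d_dtype)
test['Vitamin D Intake'] = test['Vitamin D Intake'].astype(vit_d_dtype)

print(f"Categories: {list(vit_d_dtype.categories)}")
print(f"Train codes: {train['Vitamin D Intake'].cat.codes.tolist()}")
print(f"Test codes:  {test['Vitamin D Intake'].cat.codes.tolist()}")
```

With the shared dtype, `'Sufficient'` receives the same code in both frames even though the toy test split never contains `'Insufficient'`.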


## Step 8: Train Female XGBoost Model

In [12]:
print("\nStep 8: Training Female XGBoost Model")
print("=" * 60)
print(f"Training samples: {X_train_female.shape[0]}")
print(f"Features: {X_train_female.shape[1]}\n")

# 'Vitamin D Intake' was not encoded in Step 3.2 because the mapping key there is 'Vitamin D',
# so it may still hold string labels. With enable_categorical=True, XGBoost accepts pandas
# 'category' columns, so convert the column unless it is already numeric or categorical.
vit_d = 'Vitamin D Intake'
if vit_d in X_train_female.columns:
    dtype = X_train_female[vit_d].dtype
    if not (pd.api.types.is_numeric_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype)):
        print(f"Warning: '{vit_d}' column has dtype {dtype}. Converting to 'category' dtype.")
        X_train_female[vit_d] = X_train_female[vit_d].astype('category')
        X_test_female[vit_d] = X_test_female[vit_d].astype('category')

# Initialize and train model
female_model = xgb.XGBClassifier(**xgb_params, enable_categorical=True)

female_model.fit(
    X_train_female, y_train_female,
    eval_set=[(X_train_female, y_train_female), (X_test_female, y_test_female)],
    verbose=False
)

# Make predictions
y_pred_female = female_model.predict(X_test_female)
y_pred_proba_female = female_model.predict_proba(X_test_female)[:, 1]

print(f"\n✓ Female model training complete!")
print(f"Predictions generated for {len(y_pred_female)} female test samples")


Step 8: Training Female XGBoost Model
Training samples: 772
Features: 23


✓ Female model training complete!
Predictions generated for 194 female test samples


## Step 9: Save Models and Scaler

In [13]:
print("\nStep 9: Model Serialization")
print("=" * 60)

# Save models
male_model_filename = 'osteoporosis_male_model.pkl'
female_model_filename = 'osteoporosis_female_model.pkl'
scaler_filename = 'age_scaler.pkl'

joblib.dump(male_model, male_model_filename)
joblib.dump(female_model, female_model_filename)
joblib.dump(scaler, scaler_filename)

print(f"✓ Male model saved: {male_model_filename}")
print(f"✓ Female model saved: {female_model_filename}")
print(f"✓ Scaler saved: {scaler_filename}")

# Verify loading
loaded_male = joblib.load(male_model_filename)
loaded_female = joblib.load(female_model_filename)
loaded_scaler = joblib.load(scaler_filename)

print(f"\n✓ Models verified and ready for deployment!")


Step 9: Model Serialization
✓ Male model saved: osteoporosis_male_model.pkl
✓ Female model saved: osteoporosis_female_model.pkl
✓ Scaler saved: age_scaler.pkl

✓ Models verified and ready for deployment!
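
Reloading the files confirms that they deserialize, but not that the reloaded objects behave like the originals. A minimal sketch of a stricter round-trip check follows, shown for a scaler fitted on made-up ages and written to a temporary directory. The same comparison could be applied to the `predict_proba` output of the saved models.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages standing in for the real training column
ages = np.array([[25.0], [40.0], [55.0], [70.0], [85.0]])
scaler = StandardScaler().fit(ages)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'age_scaler.pkl')
    joblib.dump(scaler, path)
    reloaded = joblib.load(path)

# The reloaded scaler should reproduce the original transform
round_trip_ok = np.allclose(scaler.transform(ages), reloaded.transform(ages))
print(f"Round trip OK: {round_trip_ok}")
```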


## Summary

✅ **Model Training Complete!**

- ✓ Data preprocessed with feature engineering
- ✓ Gender-specific data separation (Male: 992, Female: 966)
- ✓ Male XGBoost model trained
- ✓ Female XGBoost model trained
- ✓ Models serialized and saved

**Next Steps:** Run `05_Model_Evaluation.ipynb` to evaluate model performance.