## 📖 About the Dataset

The dataset is a trimmed version of NHANES (National Health and Nutrition Examination Survey), conducted by the CDC.

* **Objective**: Classify each respondent as `Adult` (under 65) or `Senior` (65 and older) from physical exam measurements, lab results, and self-reported behavior.
* **Target Variable**: `age_group` (Adult or Senior)
* **Features Used**:

  * `RIDAGEYR`: Age in years
  * `RIAGENDR`: Gender (1 = Male, 2 = Female)
  * `PAQ605`: Vigorous work activity (1 = Yes, 2 = No)
  * `BMXBMI`: BMI
  * `LBXGLU`: Fasting blood glucose (mg/dL)
  * `DIQ010`: Doctor-diagnosed diabetes (1 = Yes, 2 = No, 3 = Borderline)
  * `LBXGLT`: Two-hour glucose from the oral glucose tolerance test (mg/dL)
  * `LBXIN`: Insulin

## 📦 Importing the Dependencies

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

In [19]:
# Load the dataset
df = pd.read_csv("Train_Data.csv")
df.head()

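|   | SEQN | RIDAGEYR | RIAGENDR | PAQ605 | BMXBMI | LBXGLU | DIQ010 | LBXGLT | LBXIN | age_group |
|---|------|----------|----------|--------|--------|--------|--------|--------|-------|-----------|
| 0 | 73564.0 | 61.0 | 2.0 | 2.0 | 35.7 | 110.0 | 2.0 | 150.0 | 14.91 | Adult |
| 1 | 73568.0 | 26.0 | 2.0 | 2.0 | 20.3 | 89.0 | 2.0 | 80.0 | 3.85 | Adult |
| 2 | 73576.0 | 16.0 | 1.0 | 2.0 | 23.2 | 89.0 | 2.0 | 68.0 | 6.14 | Adult |
| 3 | 73577.0 | 32.0 | 1.0 | 2.0 | 28.9 | 104.0 | NaN | 84.0 | 16.15 | Adult |
| 4 | 73580.0 | 38.0 | 2.0 | 1.0 | 35.9 | 103.0 | 2.0 | 81.0 | 10.92 | Adult |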


## Data Preprocessing

In [20]:
df = df.drop(columns=['SEQN'])  # Drop ID column
df.dropna(subset=['age_group'], inplace=True)  # Drop rows without labels
df['age_group'] = df['age_group'].map({'Adult': 0, 'Senior': 1})  # Encode target

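# Separate features and target.
# Note: X still contains RIDAGEYR (age in years), which directly determines
# age_group, so the model can read the label straight from this column.
X = df.drop(columns=['age_group'])
y = df['age_group']

# Handle missing values: fill each numeric column with its median
X = X.fillna(X.median(numeric_only=True))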
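Filling gaps with medians computed on the full table uses rows that later end up in the test split. A minimal sketch of one alternative, assuming scikit-learn's `SimpleImputer`: learn the medians on the training rows only and reuse them on the test rows. The `toy` frame and its values are made up for illustration and are not taken from `Train_Data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with NHANES-style column names; values are made up.
toy = pd.DataFrame({
    "BMXBMI": [35.7, 20.3, np.nan, 28.9, 35.9, 24.1, 30.2, 22.8],
    "LBXGLU": [110.0, 89.0, 89.0, np.nan, 103.0, 95.0, 120.0, 91.0],
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the medians from the training rows only...
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)
# ...and apply those same medians to the held-out rows.
test_filled = pd.DataFrame(
    imputer.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print("Medians learned from the training split:", imputer.statistics_)
print("Missing values left in train/test:",
      int(train_filled.isna().sum().sum()), int(test_filled.isna().sum().sum()))
```

With a dataset of this size the difference is usually small, but fitting the imputer on the training split keeps the test set untouched until evaluation.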

## Splitting the Data

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
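The `stratify=y` argument is what keeps the Adult/Senior ratio the same in both splits. A small sketch of how that could be checked, using made-up labels (`y_demo`, 16% positives) rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 840 "Adult" (0) and 160 "Senior" (1).
y_demo = np.array([0] * 840 + [1] * 160)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# With stratification, the share of the positive class should be nearly identical.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Senior share in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```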

## Model Training: Random Forest Classifier

In [22]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
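One quick way to see which columns a fitted forest relies on is its `feature_importances_` attribute. The sketch below uses synthetic data (the `demo` frame is an assumption, with the label defined as age 65 or older to mirror the notebook's target), so the numbers it prints are not the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features; values are randomly generated.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "RIDAGEYR": rng.integers(12, 81, size=n).astype(float),
    "BMXBMI": rng.normal(27, 5, size=n),
    "LBXGLU": rng.normal(100, 15, size=n),
})
labels = (demo["RIDAGEYR"] >= 65).astype(int)  # Senior = 65 and older

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(demo, labels)

# Impurity-based importances, one per column, sorted from most to least used.
importances = pd.Series(
    forest.feature_importances_, index=demo.columns
).sort_values(ascending=False)
print(importances)
```

Running the same two lines on the notebook's `model` and `X_train.columns` would show whether a single column dominates the forest's decisions.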

## 📊 Evaluation

In [26]:
# Training accuracy and F1 score
train_preds = model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_preds) * 100  # Convert to %
train_f1 = f1_score(y_train, train_preds)

# Testing accuracy and F1 score
test_preds = model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_preds) * 100  # Convert to %
test_f1 = f1_score(y_test, test_preds)

# Print metrics
print(f"Training Accuracy: {train_accuracy:.2f}%")
print(f"Training F1 Score: {train_f1:.4f}")
print(f"Test Accuracy: {test_accuracy:.2f}%")
print(f"Test F1 Score: {test_f1:.4f}")
print("\nClassification Report:\n", classification_report(y_test, test_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, test_preds))

Training Accuracy: 100.00%
Training F1 Score: 1.0000
Test Accuracy: 100.00%
Test F1 Score: 1.0000

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       328
           1       1.00      1.00      1.00        63

    accuracy                           1.00       391
   macro avg       1.00      1.00      1.00       391
weighted avg       1.00      1.00      1.00       391

Confusion Matrix:
 [[328   0]
 [  0  63]]
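Perfect scores on both splits are what one would expect when a feature encodes the label, and here `RIDAGEYR` does: `age_group` is defined by an age cutoff. A hedged sketch of a sanity check on synthetic data (not the real `Train_Data.csv`) compares held-out accuracy with and without the age column; `holdout_accuracy` is a helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on age; other columns are noise.
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "RIDAGEYR": rng.integers(12, 81, size=n).astype(float),
    "BMXBMI": rng.normal(27, 5, size=n),
    "LBXGLU": rng.normal(100, 15, size=n),
})
labels = (demo["RIDAGEYR"] >= 65).astype(int)

def holdout_accuracy(features):
    """Train a forest on 80% of the rows and score it on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, stratify=labels, test_size=0.2, random_state=42
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

acc_with_age = holdout_accuracy(demo)
acc_without_age = holdout_accuracy(demo.drop(columns=["RIDAGEYR"]))
print(f"Accuracy with RIDAGEYR:    {acc_with_age:.3f}")
print(f"Accuracy without RIDAGEYR: {acc_without_age:.3f}")
```

On the real data, the equivalent check is to drop `RIDAGEYR` from `X` before splitting and rerun the evaluation; the resulting scores would better reflect how well the remaining measurements predict the age group.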


## Save the Model

In [31]:
import joblib  # needed for dump; not included in the imports cell above

joblib.dump(model, 'nhanes_age_group_model.pkl')

['nhanes_age_group_model.pkl']
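A saved model is only useful if it loads back correctly. The following sketch, assuming `joblib.load` as the counterpart of `joblib.dump`, round-trips a small stand-in classifier through a temporary file and checks that predictions are unchanged. The data and the `demo_model.pkl` file name are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on random data (not the NHANES model).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should give identical predictions.
same = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

Pickled scikit-learn models are generally meant to be loaded with the same library version that saved them, so it helps to record the version alongside the file.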

## Making a Predictive System

In [30]:
# Pick one test row by position and use the same position for the true label
idx = 10
sample_input = X_test.iloc[[idx]]  # double brackets keep it a DataFrame, shape (1, n_features)

# Prediction
prediction = model.predict(sample_input)

print("Predicted class:", "Senior" if prediction[0] == 1 else "Adult")
print("Actual class   :", "Senior" if y_test.iloc[idx] == 1 else "Adult")

Predicted class: Adult
Actual class   : Adult
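For a new respondent who is not in the test set, the input has to be a one-row DataFrame with the same columns, in the same order, as the training data. The sketch below shows that pattern with `predict_proba` on a stand-in model; the column subset, the label rule, and the values in `new_person` are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data with a few NHANES-style columns (random values).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "RIAGENDR": rng.integers(1, 3, size=200).astype(float),
    "BMXBMI": rng.normal(27, 5, size=200),
    "LBXGLU": rng.normal(100, 15, size=200),
    "LBXIN": rng.normal(10, 4, size=200),
})
y_demo = (X_demo["LBXGLU"] > 105).astype(int)  # made-up label rule for the demo
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# One hypothetical respondent, reordered to match the training columns.
new_person = pd.DataFrame(
    [{"LBXIN": 12.3, "BMXBMI": 31.4, "RIAGENDR": 2.0, "LBXGLU": 118.0}]
)[X_demo.columns]

# predict_proba returns one probability per class, ordered as in clf.classes_.
proba = clf.predict_proba(new_person)[0]
class_names = {0: "Adult", 1: "Senior"}
for cls, p in zip(clf.classes_, proba):
    print(f"P({class_names[int(cls)]}) = {p:.2f}")
```

Probabilities give more context than a bare class label, for example when a respondent sits close to the decision boundary.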
