# Sleep Disorder Classification Project

**Course:** Data Science Final Project  
**Dataset:** Sleep Health and Lifestyle (374 samples, 13 columns including the ID and target)  
**Link:** https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset

**Goal:** Predict sleep disorders (None, Insomnia, Sleep Apnea) from health and lifestyle data  
**Problem Type:** Multi-class Classification (3 classes)  

---

## Project Overview
This analysis uses machine learning to predict sleep disorders based on:
- Demographics (age, gender, occupation)
- Sleep patterns (duration, quality)
- Health metrics (BMI, blood pressure, heart rate)
- Lifestyle factors (physical activity, stress level, daily steps)

**Key Feature:** Uses a scikit-learn **Pipeline** so that preprocessing and modeling run as one reproducible workflow

---
# STEP 1: Setup & Data Loading

**What this step does:**
- Import necessary libraries for data analysis and machine learning
- Load the Sleep Health dataset from CSV file
- Display basic information about the dataset (shape, columns, first rows)


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100

print("✓ All libraries imported successfully")

In [None]:
# Load dataset
df = pd.read_csv('Sleep_health_and_lifestyle_dataset.csv')

# Display basic information
print(f"Dataset Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nColumn Names:\n{df.columns.tolist()}")
print(f"\nFirst 5 Rows:")
df.head()

---
# STEP 2: Exploratory Data Analysis (EDA)

**What this step does:**
- Check for missing values in the dataset
- Visualize target variable distribution (Sleep Disorder)
- Explore numeric features with histograms
- Analyze correlations between features
- Compare features across different sleep disorder categories

**Key insights to find:**
- Are classes balanced or imbalanced?
- Which features show strong patterns?
- Are there correlations between features?
- Do features differ significantly across disorder types?

In [None]:
# Check missing values
print("Missing Values Summary:")
missing = df.isnull().sum()
print(missing[missing > 0])
print(f"\nNote: {df['Sleep Disorder'].isnull().sum()} NaN values in 'Sleep Disorder' represent 'None' (no disorder)")
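
The class-balance question listed above can be checked directly from the target proportions. The sketch below uses a tiny hypothetical stand-in for the `Sleep Disorder` column (not the real data) and counts NaN as "None", as described in the note above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['Sleep Disorder']; NaN means "no disorder"
toy = pd.DataFrame({
    "Sleep Disorder": [np.nan, np.nan, np.nan, "Insomnia", "Sleep Apnea", np.nan]
})

# Count NaN as its own class, then convert counts to proportions
target = toy["Sleep Disorder"].fillna("None")
proportions = target.value_counts(normalize=True)
print(proportions.round(2))

# Simple imbalance indicator: largest class share divided by smallest
imbalance_ratio = proportions.max() / proportions.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests reporting per-class metrics in addition to plain accuracy.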

In [None]:
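# Confusion matrices for the three tuned models (Step 5 evaluation).
# Requires lr_best, rf_best, dt_best (Step 4.3), le_target (Step 3.3) and
# X_test / y_test (Step 3.4), so run this cell only after those steps.
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
models = [
    (lr_best, 'Logistic Regression', 'Blues', 0),
    (rf_best, 'Random Forest', 'Greens', 1),
    (dt_best, 'Decision Tree', 'Oranges', 2),
]

for model, name, col, idx in models:
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap=col, ax=axes[idx],
                xticklabels=le_target.classes_, yticklabels=le_target.classes_)
    axes[idx].set_title(f"{name}\nAcc: {accuracy_score(y_test, y_pred):.4f}", fontweight='bold')
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('Actual')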

plt.tight_layout()
plt.show()


In [None]:
# Visualize numeric features
numeric_cols = ['Age', 'Sleep Duration', 'Quality of Sleep', 'Physical Activity Level', 
                'Stress Level', 'Heart Rate', 'Daily Steps']

fig, axes = plt.subplots(2, 4, figsize=(16, 6))
axes = axes.flatten()

for idx, col in enumerate(numeric_cols):
    df[col].hist(bins=15, ax=axes[idx], color='lightblue', edgecolor='black')
    axes[idx].set_title(col, fontsize=10, fontweight='bold')
    axes[idx].set_ylabel('Frequency', fontsize=8)

fig.delaxes(axes[7])
plt.tight_layout()
plt.show()

print("→ Most features show reasonable distributions")

In [None]:
# Correlation analysis
numeric_df = df[numeric_cols].copy()

plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Numeric Features', fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()

print("→ Quality of Sleep and Sleep Duration are positively correlated")
print("→ Stress Level negatively correlates with sleep quality")

In [None]:
# Compare numeric features by disorder type
numeric_features = ['Age', 'Sleep Duration', 'Quality of Sleep', 
                     'Physical Activity Level', 'Stress Level', 
                     'Heart Rate', 'Daily Steps']

# Create 2×4 subplot grid
fig, axes = plt.subplots(2, 4, figsize=(16, 6))

# Plot numeric features
for idx, col in enumerate(numeric_features):
    row, col_idx = divmod(idx, 4)
    sns.boxplot(data=df, x='Sleep Disorder', y=col, ax=axes[row, col_idx], 
            hue='Sleep Disorder', palette='Set2', legend=False)
    axes[row, col_idx].set_title(col, fontweight='bold', fontsize=10)
    axes[row, col_idx].tick_params(axis='x', rotation=45)

# Remove empty subplot
fig.delaxes(axes[1, 3])
plt.tight_layout()
plt.show()

print("→ Numeric features compared by sleep disorder")

In [None]:
# How categorical features relate to sleep disorders
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

sns.countplot(data=df, x='Gender', hue='Sleep Disorder', ax=axes[0])
sns.countplot(data=df, x='BMI Category', hue='Sleep Disorder', ax=axes[1])
sns.countplot(data=df, x='Occupation', hue='Sleep Disorder', ax=axes[2])
axes[2].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

---
# STEP 3: Data Preparation

## 3.1 Handle Missing Values

**What this step does:**
- Fill NaN values in 'Sleep Disorder' column with "None" (represents no disorder)
- Keep all 374 rows instead of dropping data

**Why fill instead of drop:**
- NaN doesn't mean missing data here: it means the person has no sleep disorder
- Dropping would lose 219 rows (roughly 59% of the dataset!)
- Dropping would also remove the entire "None" class, leaving the model nothing to learn "no disorder" from

In [None]:
# Create clean copy and fill NaN
df_clean = df.copy()
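# Assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas
df_clean['Sleep Disorder'] = df_clean['Sleep Disorder'].fillna('None')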

print(f"Before: {df.shape[0]} rows")
print(f"After filling NaN: {df_clean.shape[0]} rows")
print(f"\nNew Target Distribution:")
print(df_clean['Sleep Disorder'].value_counts())
print("\n✓ All data preserved")
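
As an alternative to filling afterwards, the label can be kept as text at load time. This is only a sketch: it assumes the CSV stores the literal text `None` for people without a disorder (which pandas may parse as NaN by default), and the inline CSV is made-up sample data rather than the real file.

```python
import io
import pandas as pd

# Made-up two-column sample mimicking the file layout (assumption)
csv_text = "Person ID,Sleep Disorder\n1,None\n2,Insomnia\n3,Sleep Apnea\n4,None\n"

# keep_default_na=False disables the built-in NaN markers;
# na_values=[""] still treats truly empty cells as missing
sample = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(sample["Sleep Disorder"].value_counts())
print("Missing labels:", sample["Sleep Disorder"].isna().sum())
```

Either approach ends with the same three text labels; filling after loading keeps the default parsing for every other column.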

## 3.2 Feature Engineering

**What this step does:**
- Split Blood Pressure into Systolic and Diastolic, and derive Pulse Pressure (their difference)
- Create Stress-Sleep Risk (interaction term)
- Convert BMI to numeric scale

**Why create these features:**
- **Blood Pressure Split**: Medical insight - systolic/diastolic have different meanings
- **Stress-Sleep Risk**: Captures combined effect (high stress + low sleep = high risk)
- **BMI Numeric**: Preserves ordering (Normal < Overweight < Obese)

**Result:** 3 engineering steps add 5 new columns (and replace the raw Blood Pressure text), giving the model more information to work with

In [None]:
# 1. Split Blood Pressure ("120/80" → Systolic, Diastolic, Pulse Pressure)
df_clean[['Systolic_BP', 'Diastolic_BP']] = df_clean['Blood Pressure'].str.split('/', expand=True).astype(int)
df_clean['Pulse_Pressure'] = df_clean['Systolic_BP'] - df_clean['Diastolic_BP']
df_clean.drop('Blood Pressure', axis=1, inplace=True)
print("✓ Created: Systolic_BP, Diastolic_BP, Pulse_Pressure")

# 2. Stress-Sleep Risk (interaction term)
df_clean['Stress_Sleep_Risk'] = df_clean['Stress Level'] * (10 - df_clean['Sleep Duration'])
print("✓ Created: Stress_Sleep_Risk")

# 3. BMI Numeric
bmi_map = {'Normal': 0, 'Normal Weight': 0, 'Overweight': 1, 'Obese': 2}
df_clean['BMI_Numeric'] = df_clean['BMI Category'].map(bmi_map)
print("✓ Created: BMI_Numeric")

print(f"\nTotal columns now: {df_clean.shape[1]}")
print("→ 5 engineered columns added to help the model learn better patterns")
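
Both the dictionary mapping and the string split rely on the raw values having the expected form. A minimal sanity check might look like the sketch below; the rows are invented, and "Underweight" is a hypothetical label used only to show what an unmapped category looks like.

```python
import pandas as pd

# Toy rows; "Underweight" is a hypothetical label absent from bmi_map
toy = pd.DataFrame({
    "BMI Category": ["Normal", "Overweight", "Underweight", "Obese"],
    "Blood Pressure": ["126/83", "140/90", "118/76", "135/88"],
})

bmi_map = {"Normal": 0, "Normal Weight": 0, "Overweight": 1, "Obese": 2}
toy["BMI_Numeric"] = toy["BMI Category"].map(bmi_map)

# .map() returns NaN for keys missing from the dictionary
unmapped = toy.loc[toy["BMI_Numeric"].isna(), "BMI Category"].unique().tolist()
print("Unmapped BMI labels:", unmapped)

# Blood pressure strings should match "<digits>/<digits>" before splitting
bad_bp = toy.loc[~toy["Blood Pressure"].str.fullmatch(r"\d+/\d+"), "Blood Pressure"]
print("Malformed blood pressure values:", bad_bp.tolist())
```

An unmapped label would otherwise surface later as a NaN inside the scaler or classifier, which is harder to trace back.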

## 3.3 Encode Target Variable & Prepare for Pipeline

**What this step does:**
- Encode target variable (Sleep Disorder) using LabelEncoder
- Identify numeric and categorical features for Pipeline
- Define feature lists that will be used by ColumnTransformer

**Why encode target separately:**
- Target encoding happens before Pipeline (Pipeline handles features only)
- Integer labels keep the target consistent across models, metrics, and plots
- LabelEncoder assigns codes in alphabetical order: "Insomnia" → 0, "None" → 1, "Sleep Apnea" → 2

**Pipeline preparation:**
- Numeric features will be scaled with StandardScaler
- Categorical features will be one-hot encoded
- Pipeline automates this in the next step

In [None]:
# Encode target variable
le_target = LabelEncoder()
y_encoded = le_target.fit_transform(df_clean['Sleep Disorder'])

print("Target Encoding:")
for i, class_name in enumerate(le_target.classes_):
    print(f"  {class_name} → {i}")

# Identify feature types
numeric_features = ['Age', 'Sleep Duration', 'Quality of Sleep', 
                   'Physical Activity Level', 'Stress Level', 
                   'Heart Rate', 'Daily Steps', 'Systolic_BP', 
                   'Diastolic_BP', 'Pulse_Pressure', 'Stress_Sleep_Risk', 'BMI_Numeric']

categorical_features = ['Gender', 'Occupation', 'BMI Category']

print(f"\n✓ Identified {len(numeric_features)} numeric features")
print(f"✓ Identified {len(categorical_features)} categorical features")
print("\n→ Ready for Pipeline preprocessing")
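
The integer codes come from `LabelEncoder` sorting the class names, which is easy to confirm on a few labels. The list below is illustrative only and is not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels using the three class names from this project
labels = ["None", "Insomnia", "Sleep Apnea", "None", "Insomnia"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the integer codes follow alphabetical order
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns predicted integers back into readable labels
decoded = encoder.inverse_transform([1, 2, 0])
print(list(decoded))
```

Keeping the fitted encoder around makes it possible to report predictions with their original names.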

## 3.4 Train-Test Split

**What this step does:**
- Separate features (X) from target variable (y)
- Remove non-predictive column (Person ID)
- Split data into training set (80%) and test set (20%)
- Use stratified split to maintain class proportions

**Why 80-20 split:**
- 80% (299 samples) provides enough data for training
- 20% (75 samples) provides reliable performance evaluation
- Standard practice in machine learning

**Why stratified:**
- Our classes are imbalanced (more "None" than disorders)
- Stratification ensures test set represents all classes fairly
- Without it, test set might randomly have too few of one class



In [None]:
# Separate features and target (use the label-encoded target from Step 3.3)
X = df_clean.drop(columns=['Sleep Disorder', 'Person ID'])
y = y_encoded

print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")

# Split 80-20 with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTrain: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"\nTrain class distribution: {pd.Series(y_train).value_counts().sort_index().tolist()}")
print(f"Test class distribution: {pd.Series(y_test).value_counts().sort_index().tolist()}")
print("\n→ Class proportions are preserved across train and test sets")
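
The effect of `stratify` can be seen on synthetic labels. This sketch assumes a 60/20/20 class mix chosen purely for illustration; it is not the real class distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 60% class 0, 20% class 1, 20% class 2
y_toy = np.array([0] * 60 + [1] * 20 + [2] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class shares in each split should be close to the overall 0.6 / 0.2 / 0.2
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("Train shares:", train_share.round(2))
print("Test shares: ", test_share.round(2))
```

Without `stratify`, the same check on a small test set can show noticeably different shares from one random seed to the next.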

## 3.5 Create Preprocessing Pipeline 🔧

**What this step does:**
- Create ColumnTransformer to handle numeric and categorical features separately
- Numeric features: Apply StandardScaler (mean=0, std=1)
- Categorical features: Apply OneHotEncoder (create binary columns)
- Bundle preprocessing into reusable pipeline component

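**Why use a Pipeline:**

### **1. Prevents Data Leakage Automatically**
- Manual: Must remember to `fit_transform(train)` and only `transform(test)`
- Pipeline: Fits the preprocessing on whatever data is passed to `fit` (the training set, or the training folds in cross-validation) and only transforms everything else
- This removes the risk of accidentally fitting the scaler or encoder on test data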

### **2. Reproducible & Reusable** 
- Save entire pipeline with model
- Apply same preprocessing to new data automatically
- No risk of forgetting preprocessing steps

### **3. Easier Integration with GridSearchCV** 
- Can tune preprocessing parameters alongside model parameters
- Everything happens in one cross-validation loop
- More efficient and less error-prone

In [None]:
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features)
    ],
    remainder='drop'  # Drop any columns not specified
)

print("✓ Preprocessing Pipeline Created")
print("\nPipeline Components:")
print(f"  1. StandardScaler → {len(numeric_features)} numeric features")
print(f"  2. OneHotEncoder → {len(categorical_features)} categorical features")
print("\n→ Pipeline will automatically fit on training data and transform both train and test")

---
# STEP 4: Modeling (Logistic Regression, Random Forest, Decision Tree)

**What this step does:**
- Create 3 distinct machine learning pipelines
- Compare performance using 5-Fold Cross-Validation
- Perform GridSearchCV for hyperparameter tuning

**Why this is important:**
- Logistic Regression provides a stable linear baseline
- Random Forest captures non-linear patterns and feature interactions, and is relatively robust to outliers
- Decision Tree offers clear, simple decision rules


In [None]:
# 4.1 Define the 3 Pipelines
lr_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=200))
])

rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

dt_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

pipelines = {
    'Logistic Regression': lr_pipeline,
    'Random Forest': rf_pipeline,
    'Decision Tree': dt_pipeline
}
print('✓ 3 Pipelines (LR, RF, DT) created successfully.')

In [None]:
# 4.2 Cross-Validation Comparison for ALL 3 Models
print('5-Fold Cross-Validation Scores (Accuracy):')
print('-' * 45)

# Run cross-validation once per model and reuse the scores for the plot
cv_means = {}
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    cv_means[name] = scores.mean()
    print(f'{name:20}: {scores.mean():.4f} (+/- {scores.std():.4f})')

# Plot for all 3
plt.figure(figsize=(10, 4))
plt.bar(list(cv_means.keys()), list(cv_means.values()),
        color=['skyblue', 'lightgreen', 'salmon'], edgecolor='black')
plt.title('Comparison of 3 Models (CV Accuracy)', fontweight='bold')
plt.ylabel('Mean Accuracy')
plt.ylim(0.8, 1.0)
plt.show()
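
Because the classes are imbalanced, accuracy alone can hide weak performance on the smaller classes. One possible extension, sketched here on synthetic data from `make_classification` (an assumption standing in for the real features), is to score each fold with macro-averaged F1 as well.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced 3-class data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3,
    weights=[0.6, 0.2, 0.2], random_state=42,
)

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=200, random_state=42)),
])

# Shuffled, stratified folds with a fixed seed for repeatable scores
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(toy_pipe, X_toy, y_toy, cv=cv,
                         scoring=["accuracy", "f1_macro"])

for metric in ("accuracy", "f1_macro"):
    scores = results[f"test_{metric}"]
    print(f"{metric:10}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

A large gap between the two metrics would point to one of the minority classes being predicted poorly.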

In [None]:
# 4.3 Hyperparameter Tuning for ALL 3 Models
print('Tuning Hyperparameters for 3 models...')

lr_grid = GridSearchCV(lr_pipeline, {'classifier__C': [0.1, 1, 10]}, cv=5).fit(X_train, y_train)
rf_grid = GridSearchCV(rf_pipeline, {'classifier__n_estimators': [100, 200], 'classifier__max_depth': [10, None]}, cv=5).fit(X_train, y_train)
dt_grid = GridSearchCV(dt_pipeline, {'classifier__max_depth': [5, 10, None]}, cv=5).fit(X_train, y_train)

# Define best estimators immediately to avoid name errors
lr_best = lr_grid.best_estimator_
rf_best = rf_grid.best_estimator_
dt_best = dt_grid.best_estimator_

print(f'Best LR Score: {lr_grid.best_score_:.4f}')
print(f'Best RF Score: {rf_grid.best_score_:.4f}')
print(f'Best DT Score: {dt_grid.best_score_:.4f}')
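
The best score alone does not say which setting won or how close the alternatives were. The following sketch, run on synthetic data with a hypothetical `toy_pipe`, shows how `best_params_` and `cv_results_` could be inspected after a grid search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the real training set
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=42
)

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])
grid = GridSearchCV(toy_pipe, {"classifier__max_depth": [3, 5, None]}, cv=5)
grid.fit(X_toy, y_toy)

print("Best parameters:", grid.best_params_)

# One row per parameter combination, best rank first
summary = pd.DataFrame(grid.cv_results_)[
    ["param_classifier__max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```

If several settings score within one standard deviation of each other, the simpler one is usually a reasonable choice.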

---
# STEP 5: Model Evaluation & Feature Importance

**What this step does:**
- Visualize per-class errors using 3 side-by-side Confusion Matrices
- Generate detailed classification reports
- Calculate Feature Importance for Tree models

**Why this is important:**
- Helps identify which specific sleep disorders are being misclassified
- Shows which lifestyle factors are the strongest predictors


In [None]:
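# Classification Reports for all 3 models
# Build the list here so the cell does not depend on an undefined name
final_models = [
    (lr_best, 'Logistic Regression'),
    (rf_best, 'Random Forest'),
    (dt_best, 'Decision Tree'),
]

for model, name in final_models:
    print(f"\n{name} - Classification Report:")
    print("-" * 40)
    print(classification_report(y_test, model.predict(X_test), target_names=le_target.classes_))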

---
# Project Summary & Final Conclusions

### 1. Model Comparison Overview
Through this multi-class classification project, we evaluated three distinct algorithms using a professional pipeline workflow. 

| Model | Accuracy (Test) | Strengths |
| :--- | :--- | :--- |
| **Logistic Regression** | **~94.67%** | Most consistent performance and simple for deployment. |
| **Random Forest** | **~94.67%** | Robust ensemble performance with high accuracy. |
| **Decision Tree** | **~89.33%** | Highly interpretable, though slightly lower accuracy. |

**Final Choice:** Due to its excellent balance of accuracy, speed, and cross-validation stability, the **Logistic Regression Pipeline** remains the recommended model for this dataset.

### 2. Key Discovery: Medical Predictors
The comparison of **Random Forest** and **Decision Tree** feature importances indicates that:
1. **BMI_Numeric:** The most important feature for identifying Sleep Apnea.
2. **Blood Pressure (Systolic/Diastolic):** Our engineered split features ranked among the top 5 in both tree models.
3. **Sleep Quality vs. Duration:** Both models agreed that the *quality* of sleep is often more predictive of a disorder than simple *duration*.

### 3. Workflow Success
- The **Pipeline Architecture** let us swap between 3 different models with minimal code changes while keeping preprocessing fitted on training data only, which guards against data leakage.
- **Hyperparameter Tuning** with GridSearchCV selected the best setting for each algorithm from a small parameter grid.

### 4. Limitations & Next Steps
- **Expand Data:** Collect more samples to better represent the 'Insomnia' and 'Sleep Apnea' groups.
- **Deployment:** Use the provided `.joblib` artifacts to build a real-time prediction service.
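
For the deployment step, a fitted pipeline can be saved and restored as a single `.joblib` file. This is a minimal sketch on synthetic data: the file name `sleep_model.joblib` and `toy_pipe` are hypothetical, and the notebook's actual artifacts may be named or built differently.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real training set
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=42)

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=200, random_state=42)),
]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sleep_model.joblib")  # hypothetical file name
    joblib.dump(toy_pipe, path)      # saves preprocessing and model together
    restored = joblib.load(path)     # restore in the serving process

same_predictions = bool((restored.predict(X_toy) == toy_pipe.predict(X_toy)).all())
print("Restored pipeline reproduces predictions:", same_predictions)
```

Saving the whole pipeline, rather than the classifier alone, means new data goes through the same scaling and encoding as the training data.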
