# **Data Gathering**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import kagglehub
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_columns', None)

path = kagglehub.dataset_download("uom190346a/sleep-health-and-lifestyle-dataset")
data = pd.read_csv(f"{path}/Sleep_health_and_lifestyle_dataset.csv")

display(data.head())

print("\nDataset Info:")
data.info()  # info() prints its summary directly and returns None

The dataset is downloaded using kagglehub.dataset_download() and loaded into a pandas DataFrame. We display basic information about the dataset using data.head() and data.info() to understand its structure.

# **Cleansing Data**

In [None]:
print("Missing Values:\n")
display(data.isnull().sum())

numeric_columns = data.select_dtypes(include=['float64', 'int64']).columns
for col in numeric_columns:
    # Assign the result back instead of filling in place on a column selection
    data[col] = data[col].fillna(data[col].median())

# Blood Pressure
data[['Systolic', 'Diastolic']] = data['Blood Pressure'].str.split('/', expand=True).astype(float)
data = data.drop('Blood Pressure', axis=1)

# Sleep Duration
data = data[(data['Sleep Duration'] >= 3) & (data['Sleep Duration'] <= 15)]

# Categorical variables
categorical_columns = ['Gender', 'BMI Category', 'Occupation']
data = pd.get_dummies(data, columns=categorical_columns, drop_first=True)

# Sleep Disorder (target variable)
le = LabelEncoder()
data['Sleep Disorder'] = le.fit_transform(data['Sleep Disorder'])
print("\nSleep Disorder Encoding Mapping:")
for category, encoded_value in dict(zip(le.classes_, le.transform(le.classes_))).items():
    print(f"{category}: {encoded_value}")

display(data.head())

The dataset covers sleep health and lifestyle factors: demographics (age, gender, occupation), sleep metrics (duration, quality), physiological measurements (blood pressure, heart rate), and behavioral factors (physical activity, stress level). Cleaning consists of filling missing numeric values with the column median, splitting the "Blood Pressure" column into "Systolic" and "Diastolic" components, keeping only rows with a sleep duration between 3 and 15, one-hot encoding "Gender," "BMI Category," and "Occupation," and label-encoding the target variable, "Sleep Disorder."

The cleaned and preprocessed data can be used to address the problem statement and data science questions. By analyzing the relationships between variables like stress levels, physical activity, sleep duration, and blood pressure, we can gain insights into how lifestyle and health metrics impact sleep disorders. For instance, we can explore whether higher stress levels are associated with poorer sleep quality, or if increased physical activity is correlated with longer sleep duration. Additionally, by building predictive models, we can identify which factors are most predictive of specific sleep disorders, such as insomnia or sleep apnea.
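The cleaning steps described above can be sketched on a tiny hand-made frame. The column names mirror the notebook, but the rows are invented for illustration. The `'No Disorder'` label is an assumption: depending on the pandas version, a literal `None` string in the CSV may be parsed as a missing value, so labeling it before encoding keeps the class mapping explicit. `astype("category")` stands in for `LabelEncoder` here; both assign codes in sorted label order.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the notebook's columns (not real data)
toy = pd.DataFrame({
    "Heart Rate": [70.0, np.nan, 80.0, 72.0],
    "Blood Pressure": ["120/80", "135/90", "125/82", "140/95"],
    "Sleep Disorder": ["Insomnia", np.nan, "Sleep Apnea", np.nan],
})

# Median imputation, assigned back to the columns rather than filled in place
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Split "systolic/diastolic" strings into two numeric columns
toy[["Systolic", "Diastolic"]] = (
    toy["Blood Pressure"].str.split("/", expand=True).astype(float)
)
toy = toy.drop(columns="Blood Pressure")

# Assumption: a missing target means "no disorder"; label it before encoding
toy["Sleep Disorder"] = toy["Sleep Disorder"].fillna("No Disorder")
target = toy["Sleep Disorder"].astype("category")
code_to_label = dict(enumerate(target.cat.categories))
toy["Sleep Disorder"] = target.cat.codes

print(code_to_label)
```

With an explicit label the codes follow alphabetical order (Insomnia, No Disorder, Sleep Apnea), which differs from the numbering referred to in the rest of this notebook, so any interpretation of "category 0, 1, 2" would need to be updated accordingly.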

# **Exploratory Data Analysis**

Problem Statement: How do lifestyle and health metrics impact sleep disorders among individuals? Can we identify patterns or predictors that influence sleep disorders, such as Insomnia or Sleep Apnea?

Data Science Questions:

1. What is the relationship between stress levels and sleep quality?
2. Are physical activity levels correlated with sleep duration?
3. Which factors (e.g., age, BMI category, heart rate) are most indicative of the presence of a sleep disorder?

In [None]:
# Correlation heatmap for numeric variables
plt.figure(figsize=(12, 10))
numeric_data = data.select_dtypes(include=['float64', 'int64'])
sns.heatmap(numeric_data.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()

The correlation heatmap summarizes the pairwise linear relationships among the numeric health and lifestyle variables. Notably, stress level is strongly negatively correlated with both sleep quality and sleep duration. Physical activity shows a weaker direct association with sleep duration, although it may influence sleep indirectly, for example through stress reduction. The encoded Sleep Disorder column also correlates with sleep quality, sleep duration, and blood pressure, but these coefficients should be read with caution: the class codes are arbitrary labels rather than an ordered scale, and a correlation does not establish causation. A predictive model is therefore a more suitable tool for relating these factors to sleep disorders and for identifying subgroups with specific risk factors.
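Because the encoded target holds class codes rather than quantities, one hedged alternative is to correlate each feature with a separate 0/1 indicator per class. The sketch below uses invented rows and string class labels purely for illustration; it is not the notebook's data or its encoding.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "Stress Level": [8, 7, 3, 4, 6, 8],
    "Sleep Duration": [6.0, 6.2, 8.0, 7.8, 6.5, 5.9],
    "Sleep Disorder": ["Insomnia", "Insomnia", "No Disorder",
                       "No Disorder", "Sleep Apnea", "Sleep Apnea"],
})

# One 0/1 indicator column per class, so no ordering of classes is implied
indicators = pd.get_dummies(toy["Sleep Disorder"], dtype=float)
features = toy[["Stress Level", "Sleep Duration"]]

# Correlate every feature with every class indicator
class_corr = pd.DataFrame({
    label: features.corrwith(indicators[label]) for label in indicators.columns
})
print(class_corr.round(2))
```

Each column of `class_corr` can then be read as "how does this feature differ for this class versus everyone else", which does not depend on how the classes are numbered.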

In [None]:
# Distribution of sleep disorders across BMI categories
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='Sleep Disorder', hue='BMI Category_Normal Weight')
plt.title('Distribution of Sleep Disorders by Normal Weight')
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='Sleep Disorder', hue='BMI Category_Obese')
plt.title('Distribution of Sleep Disorders by Obese')
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='Sleep Disorder', hue='BMI Category_Overweight')
plt.title('Distribution of Sleep Disorders by Overweight')
plt.xticks(rotation=45)
plt.show()

The graphs suggest an association between weight status and sleep disorders. In this dataset, individuals in the normal-weight group are less often labeled with Insomnia or Sleep Apnea than those in the overweight or obese groups. Because each plot uses a one-hot indicator column, it compares one BMI category against all others combined, so the counts should be read alongside the group sizes. The pattern is also observational and does not by itself show that weight causes sleep disorders.
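A row-normalized cross-tabulation is one way to compare disorder rates across BMI categories without being misled by group sizes. The sketch below uses invented rows with the raw (not one-hot) columns. It also assumes that the raw column contains both `Normal` and `Normal Weight` labels for the same group, which would explain why a baseline category was dropped by `drop_first=True`; that merge is an assumption to verify against the actual data.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "BMI Category": ["Normal", "Normal Weight", "Normal", "Overweight",
                     "Overweight", "Obese", "Overweight", "Normal"],
    "Sleep Disorder": ["No Disorder", "No Disorder", "No Disorder", "Insomnia",
                       "Sleep Apnea", "Sleep Apnea", "Insomnia", "Insomnia"],
})

# Assumption: "Normal" and "Normal Weight" describe the same group
toy["BMI Category"] = toy["BMI Category"].replace({"Normal Weight": "Normal"})

# Share of each disorder within each BMI category (every row sums to 1)
shares = pd.crosstab(toy["BMI Category"], toy["Sleep Disorder"], normalize="index")
print(shares.round(2))
```

The same table can be passed to `shares.plot(kind="bar", stacked=True)` to replace the three separate count plots with a single chart.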

In [None]:
# Distribution of Sleep Disorders
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='Sleep Disorder', hue='Sleep Disorder')
plt.title("Distribution of Sleep Disorders")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This bar plot shows the number of individuals in each encoded sleep disorder category (0, 1, 2). Most individuals fall into category 2. Given that insomnia is encoded as 0 and sleep apnea as 1, category 2 corresponds to individuals with no recorded sleep disorder, so the plot indicates that most people in the dataset have no disorder rather than a high prevalence of disorders. The encoding mapping printed during cleansing should be checked to confirm these definitions.

In [None]:
# Blood Pressure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

sns.histplot(data=data, x='Systolic', kde=True, ax=ax1)
ax1.set_title('Systolic Blood Pressure Distribution')

sns.histplot(data=data, x='Diastolic', kde=True, ax=ax2)
ax2.set_title('Diastolic Blood Pressure Distribution')

plt.tight_layout()
plt.show()

1. Systolic Blood Pressure Distribution: This histogram with a density curve shows the distribution of systolic blood pressure values. The distribution appears to be roughly normal, with a peak around 125 mmHg. The density curve helps visualize the underlying probability distribution.
2. Diastolic Blood Pressure Distribution: Similar to the systolic blood pressure distribution, this histogram shows the distribution of diastolic blood pressure values. The distribution is also approximately normal, with a peak around 80 mmHg.

In [None]:
# 1. Stress levels vs Sleep Quality
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Stress Level', y='Quality of Sleep', hue='Sleep Disorder')
plt.title('Stress Levels vs Sleep Quality')
plt.show()

This scatter plot illustrates a clear negative correlation between stress levels and sleep quality. As stress levels increase, sleep quality generally decreases. Individuals with insomnia (category 0) tend to have lower sleep quality, regardless of their stress level. Those with sleep apnea (category 1) also experience poorer sleep quality, especially at higher stress levels.
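Since both stress level and sleep quality are recorded as discrete scores, many points in the scatter plot overlap. A rank correlation and a per-level average can quantify the trend; the following is a minimal sketch on invented values, not the notebook's data.

```python
import pandas as pd

# Invented scores for illustration only
toy = pd.DataFrame({
    "Stress Level": [3, 4, 5, 6, 7, 8],
    "Quality of Sleep": [9, 8, 8, 7, 6, 5],
})

# Spearman's rho equals the Pearson correlation computed on the ranks
ranks = toy.rank()
rho = ranks["Stress Level"].corr(ranks["Quality of Sleep"])

# Mean sleep quality at each stress level, which overlapping points can hide
mean_quality = toy.groupby("Stress Level")["Quality of Sleep"].mean()

print(f"Spearman rho: {rho:.3f}")
print(mean_quality)
```

A rank-based measure is used here because it only assumes that the scores are ordered, not that the steps between them are equally spaced.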

In [None]:
# 2. Physical activity vs Sleep Duration
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Physical Activity Level', y='Sleep Duration', hue='Sleep Disorder')
plt.title('Physical Activity Levels vs Sleep Duration')
plt.show()

The relationship between physical activity levels and sleep duration is less straightforward. There is a slight positive trend, so higher physical activity may be associated with longer sleep duration, but the correlation is not strong. Individuals with insomnia tend to have shorter sleep durations regardless of their physical activity levels, while sleep duration among individuals with sleep apnea is more variable.
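A weak trend like this is sometimes easier to judge after grouping activity into bands and comparing average sleep duration per band. The sketch below uses invented values, and the bin edges and band names are hypothetical choices made only for illustration.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "Physical Activity Level": [30, 35, 45, 60, 60, 75, 80, 90],
    "Sleep Duration": [6.1, 6.5, 6.8, 7.2, 7.0, 7.4, 7.1, 8.0],
})

# Hypothetical bin edges; intervals are (0, 40], (40, 70], (70, 100]
bins = [0, 40, 70, 100]
labels = ["low", "medium", "high"]
toy["Activity Band"] = pd.cut(toy["Physical Activity Level"], bins=bins, labels=labels)

# Average sleep duration and number of rows in each band
summary = (
    toy.groupby("Activity Band", observed=True)["Sleep Duration"]
    .agg(["mean", "count"])
)
print(summary)
```

Reporting the count next to the mean makes it clear when a band is too small to support a conclusion.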

# **Modelling**

In [None]:
# Prepare features and target
X = data.drop(['Sleep Disorder'], axis=1)
y = data['Sleep Disorder']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize and train models
models = {
    'Random Forest': RandomForestClassifier(
        random_state=42,
        n_estimators=100,
        class_weight='balanced'
    ),
    'Logistic Regression': LogisticRegression(
        max_iter=1000,
        class_weight='balanced'
    )
}

results = {}
feature_importance = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        'model': model,
        'predictions': y_pred,
        'y_test': y_test
    }

    if name == 'Random Forest':
        feature_importance = dict(zip(X.columns, model.feature_importances_))


The code first separates the dataset into features (X) and the target variable (y), which is the "Sleep Disorder" column. Note that X still contains the Person ID column, which is a row identifier rather than a health or lifestyle measurement. The data is then split into stratified training and testing sets (80/20). Two models are initialized, a Random Forest Classifier and a Logistic Regression, both with `class_weight='balanced'` to compensate for class imbalance. Each model is trained on the training set and used to predict the test set, and the predictions are stored together with the true labels for later evaluation. For the Random Forest model, the code also extracts the feature importances to show which features contribute most to the predictions.
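As a hedged variation on the setup above, the sketch below drops the identifier column before training and wraps Logistic Regression in a scaling pipeline. The data is synthetic, and the pipeline is a suggested alternative, not the configuration that produced the results reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Person ID": np.arange(1, n + 1),
    "Stress Level": rng.integers(3, 9, size=n),
    "Sleep Duration": rng.normal(7.0, 0.8, size=n),
})
# Synthetic binary target tied to stress so the model has signal to learn
toy["Sleep Disorder"] = (toy["Stress Level"] >= 6).astype(int)

# Remove both the target and the row identifier from the feature matrix
X = toy.drop(columns=["Sleep Disorder", "Person ID"])
y = toy["Sleep Disorder"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fitted on the training split only, inside the pipeline
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on the synthetic data: {accuracy:.2f}")
```

Keeping the scaler inside the pipeline means the test split never influences the scaling parameters, and dropping `Person ID` prevents the model from learning from row order.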

# **Evaluation**

In [None]:
# For each model in results
for name, result in results.items():
    try:
        # Get the true and predicted values
        y_true = result['y_test'].copy()
        y_pred = result['predictions'].copy()

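        # Labels were integer-encoded during cleansing, so no NaN is expected here;
        # this guard is effectively a no-op that also casts the labels to strings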
        y_true = np.where(pd.isna(y_true), 'No Disorder', y_true)
        y_pred = np.where(pd.isna(y_pred), 'No Disorder', y_pred)

        print(f"\n{name} Results:")
        print("Classification Report:")
        print(classification_report(y_true, y_pred))

        # Get the sorted unique class labels to use as axis tick labels
        classes = np.unique(np.concatenate([y_true, y_pred]))

        # Calculate metrics
        precision = precision_score(y_true, y_pred, average='weighted')
        recall = recall_score(y_true, y_pred, average='weighted')
        f1 = f1_score(y_true, y_pred, average='weighted')

        print(f"\nMetrics Summary:")
        print(f"Precision: {precision:.3f}")
        print(f"Recall: {recall:.3f}")
        print(f"F1 Score: {f1:.3f}")

        # Plot confusion matrix
        plt.figure(figsize=(8, 6))
        cm = confusion_matrix(y_true, y_pred)
        sns.heatmap(cm,
                   annot=True,
                   fmt="d",
                   cmap="Blues",
                   xticklabels=classes,
                   yticklabels=classes)
        plt.title(f"{name} Confusion Matrix")
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()

    except Exception as e:
        print(f"Error in analysis for {name}: {str(e)}")
        print(f"y_test unique values: {np.unique(result['y_test'])}")
        print(f"predictions unique values: {np.unique(result['predictions'])}")

# Plot feature importance for Random Forest
plt.figure(figsize=(12, 6))
importance_df = pd.DataFrame({
    'Feature': feature_importance.keys(),
    'Importance': feature_importance.values()
}).sort_values('Importance', ascending=False)

sns.barplot(data=importance_df, x='Importance', y='Feature')
plt.title('Feature Importance (Random Forest)')
plt.tight_layout()
plt.show()

The code evaluates the performance of the trained models by calculating classification metrics like precision, recall, and F1-score for each model. It also visualizes confusion matrices to understand how well the models classified different sleep disorder categories, including "No Disorder." Additionally, for the Random Forest model, it plots feature importance to identify the most influential factors in predicting sleep disorders.
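To make the weighted metrics concrete, the sketch below derives per-class precision, recall, and F1 from a confusion matrix and then averages them using each class's share of the true labels. The matrix is invented for illustration and is not the output of either model.

```python
import numpy as np

# Invented 3-class confusion matrix: rows are true classes, columns are predictions
cm = np.array([
    [12, 2, 1],
    [3, 11, 1],
    [1, 0, 38],
])

true_pos = np.diag(cm)        # correct predictions per class
support = cm.sum(axis=1)      # number of true samples per class
predicted = cm.sum(axis=0)    # number of predictions per class

precision = true_pos / predicted
recall = true_pos / support
f1 = 2 * precision * recall / (precision + recall)

# Weighted averages use each class's share of the true labels as its weight
weights = support / support.sum()
weighted_precision = float(np.sum(weights * precision))
weighted_recall = float(np.sum(weights * recall))
weighted_f1 = float(np.sum(weights * f1))

print(f"Precision: {weighted_precision:.3f}")
print(f"Recall: {weighted_recall:.3f}")
print(f"F1 Score: {weighted_f1:.3f}")
```

One consequence of this weighting is that weighted recall always equals overall accuracy, so a large majority class can dominate all three summary numbers; the per-class values are worth checking alongside them.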

1. Random Forest Confusion Matrix

  The Random Forest model performed well in classifying sleep disorders, particularly in identifying individuals with no sleep disorder. Its weighted precision, recall, and F1-score are all 0.920. The model was somewhat less reliable at distinguishing insomnia from sleep apnea, which leaves room for improvement.

2. Logistic Regression Confusion Matrix

  The Logistic Regression model achieved a weighted precision of 0.880, a recall of 0.867, and an F1-score of 0.869. This is a solid result overall, though below the Random Forest's 0.920.

  The confusion matrix provides a detailed breakdown of the model's predictions. The model is most reliable for individuals with no sleep disorder (40 correctly classified). It also classifies most Sleep Apnea cases correctly, with a few misclassified as Insomnia. Its main weakness is separating Insomnia from Sleep Apnea, as the misclassifications in these two categories show.

3. Feature Importance

  The feature importance plot ranks Person ID as the most important feature. Person ID is a row identifier, not a health or lifestyle measurement, so its high ranking may reflect how the records are ordered in the dataset rather than a genuine predictor; it should be excluded from the features before the importances are interpreted. Among the remaining features, health metrics such as blood pressure and BMI category, together with occupation, rank highly. Lifestyle factors such as sleep duration, quality of sleep, stress level, and physical activity level also contribute to the model's predictions.

# **Conclusion**

Problem Statement:
How do lifestyle and health metrics impact sleep disorders among individuals? Can we identify patterns or predictors that influence sleep disorders, such as Insomnia or Sleep Apnea?

Data Science Questions:

1. What is the relationship between stress levels and sleep quality?
2. Are physical activity levels correlated with sleep duration?
3. Which factors (e.g., age, BMI category, heart rate) are most indicative of the presence of a sleep disorder?

Based on the analysis of the data, lifestyle and health metrics are clearly associated with sleep disorders in this dataset.

1. As expected, there is a strong negative correlation between stress levels and sleep quality. Higher stress levels are associated with poorer sleep quality, particularly for individuals with insomnia and sleep apnea.

2. Physical activity shows only a slight positive association with sleep duration. Individuals with insomnia tend to have shorter sleep durations, regardless of their physical activity levels.

3. The Random Forest feature importances point to several groups of predictors:
  * Person ID: This column ranks highest, but it is only a row identifier. Its ranking should be treated as an artifact of the data rather than a personal characteristic, and the column should be dropped before modelling.
  * Health Metrics: Blood pressure and BMI category are important predictors. Higher blood pressure and increased weight are often associated with sleep disorders.
  * Occupational Factors: The one-hot encoded occupation columns also contribute to the predictions, which may indicate that sleep disorder rates differ between occupations in this dataset.
  * Lifestyle Factors: Sleep duration, sleep quality, stress levels, and physical activity also contribute to the predictions.

# **Limitations & Recommendations**

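While the models provide useful insights, several limitations should be acknowledged. The feature set includes the Person ID identifier, the reported metrics rest on a single train/test split, and the Logistic Regression model is trained on unscaled features. Future work could remove identifier columns, validate the models with cross-validation, expand the dataset, incorporate additional features, and try more advanced machine learning techniques to improve predictive accuracy and generalizability.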

