In [2]:
import pandas as pd
import numpy as np

# Load the dataset from the CSV file
dataset = pd.read_csv('Sleep_health_and_lifestyle_dataset.csv')

# Function to generate a synthetic dataset by resampling each column with replacement.
# Note: columns are sampled independently, so relationships between columns are not preserved.
def generate_synthetic_data(dataset, num_samples=1000):
    synthetic_data = pd.DataFrame()

    for column in dataset.columns:
        synthetic_column = np.random.choice(dataset[column], size=num_samples, replace=True)
        synthetic_data[column] = synthetic_column

    return synthetic_data

# Generate synthetic dataset
synthetic_dataset = generate_synthetic_data(dataset, num_samples=1000)

# Save synthetic dataset to a CSV file
synthetic_dataset.to_csv('synthetic_data.csv', index=False)
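
Because `generate_synthetic_data` draws each column independently, a synthetic row can combine values from different people, so relationships between columns (for example between `BMI Category` and `Blood Pressure`) are not preserved. A minimal sketch of a row-wise bootstrap, which keeps each sampled record intact, is shown below. The function name `bootstrap_rows` and the toy frame are hypothetical and used only for illustration.

```python
import pandas as pd

def bootstrap_rows(dataset: pd.DataFrame, num_samples: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Resample whole rows with replacement so each record stays intact."""
    return dataset.sample(n=num_samples, replace=True, random_state=seed).reset_index(drop=True)

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'BMI Category': ['Normal', 'Obese', 'Overweight'],
    'Systolic': [120, 140, 130],
})
resampled = bootstrap_rows(toy, num_samples=10)

# Every resampled row is one of the original (BMI Category, Systolic) pairs
original_pairs = set(map(tuple, toy.to_numpy()))
print(all(tuple(row) in original_pairs for row in resampled.to_numpy()))
```

Whether row-wise or column-wise resampling is appropriate depends on what the synthetic data is meant to be used for.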

===============================================================================

#  **Sleep Disorder Prediction**

===============================================================================

# 1. Data Understanding

**Data:** This dataset contains sleep and cardiovascular metrics as well as lifestyle factors of close to 400 fictitious people.

**Background:** A health insurance company needs to identify whether or not a potential client is likely to have a sleep disorder. The company wants to use this information to determine the premium the client should pay.

**Objective:** Automatically identify potential sleep disorders.

**Problem Solution:** Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.

**The data contains the following columns:**

- `Person ID`
- `Gender`
- `Age`
- `Occupation`
- `Sleep Duration`: Average number of hours of sleep per day
- `Quality of Sleep`: A subjective rating on a 1-10 scale
- `Physical Activity Level`: Average number of minutes the person engages in physical activity daily
- `Stress Level`: A subjective rating on a 1-10 scale
- `BMI Category`
- `Blood Pressure`: Indicated as systolic pressure over diastolic pressure
- `Heart Rate`: In beats per minute
- `Daily Steps`
- `Sleep Disorder`: One of `None`, `Insomnia` or `Sleep Apnea`

Let's start with the first step: Data Understanding. We'll load the data and check the first few rows to understand its structure. We'll also look at the data types and check for any missing values.

In [68]:
import pandas as pd

data = pd.read_csv('Sleep_health_and_lifestyle_dataset.csv')
data.head()

data.info()

data.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           374 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB


Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
5,6,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,Insomnia
6,7,Male,29,Teacher,6.3,6,40,7,Obese,140/90,82,3500,Insomnia
7,8,Male,29,Doctor,7.8,7,75,6,Normal,120/80,70,8000,
8,9,Male,29,Doctor,7.8,7,75,6,Normal,120/80,70,8000,
9,10,Male,29,Doctor,7.8,7,75,6,Normal,120/80,70,8000,


The dataset contains 374 entries and 13 columns. Each entry represents a fictitious individual's health and sleep-related metrics. There are missing values in the 'Sleep Disorder' column, which we will encode as a "None" category. The data types are consistent with the data description provided.
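
A quick way to count missing values per column and fill them with a placeholder category is sketched below. The toy frame is hypothetical and only stands in for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'Age': [27, 28, 29],
    'Sleep Disorder': [np.nan, 'Sleep Apnea', 'Insomnia'],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Treat a missing disorder as its own 'None' category
toy['Sleep Disorder'] = toy['Sleep Disorder'].fillna('None')
print(toy['Sleep Disorder'].value_counts())
```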

In [87]:
distinct_values = data['BMI Category'].unique()
distinct_values

array(['Overweight', 'Normal', 'Obese', 'Normal Weight'], dtype=object)

In [76]:
y_test

array([0, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1, 1, 1,
       1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 1, 1, 1, 0,
       0, 2, 2, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 2, 1, 2, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 2, 1, 2, 0, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 1, 0, 1, 0, 0,
       1, 2, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1])

In [60]:
# Encoding missing values in 'Sleep Disorder' as 'None'
data['Sleep Disorder'] = data['Sleep Disorder'].fillna('None')

# Correcting the inconsistency in 'BMI Category'
data['BMI Category'] = data['BMI Category'].replace({'Normal Weight': 'Normal'})

# Splitting the 'Blood Pressure' column into 'Systolic' and 'Diastolic' columns
data['Systolic'] = data['Blood Pressure'].str.split('/').str[0].astype(int)
data['Diastolic'] = data['Blood Pressure'].str.split('/').str[1].astype(int)

data.drop(['Blood Pressure','Person ID'], axis=1, inplace=True)

Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,370,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
370,371,Female,59,Nurse,8.0,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
371,372,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
372,373,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
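
As an alternative to splitting the string twice, the `Blood Pressure` column can be split once with `expand=True`. This is a minimal sketch on a hypothetical toy frame, not the cell that produced the output above.

```python
import pandas as pd

# Toy frame with the same 'systolic/diastolic' string format (illustrative values)
toy = pd.DataFrame({'Blood Pressure': ['126/83', '125/80', '140/90']})

# Split once into two columns and convert both to integers
bp = toy['Blood Pressure'].str.split('/', expand=True).astype(int)
bp.columns = ['Systolic', 'Diastolic']

# Replace the original string column with the two numeric columns
toy = pd.concat([toy.drop(columns='Blood Pressure'), bp], axis=1)
print(toy)
```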


In [64]:
# Importing required libraries
import pickle
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

X = data.drop(['Sleep Disorder'], axis=1) 
y = data['Sleep Disorder']

# Label encoding for categorical variables in X
label_encoders = {}  # To store the encoder objects for potential inverse transformations later

for col in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le
    with open(f'label_encoder_{col}.pkl', 'wb') as f:
        pickle.dump(le, f)

# Encoding the target variable
le_target = LabelEncoder()
y = le_target.fit_transform(y)

with open('label_encoder_Sleep_Disorder.pkl', 'wb') as f:
    pickle.dump(le_target, f)
    
# 'BMI Category' was already encoded in the loop above. Re-fitting a LabelEncoder on the
# integer codes would discard the original category names, so reuse the fitted encoder.
le_bmi_category = label_encoders['BMI Category']

# Save the label encoder object for 'BMI Category'
with open('label_encoder_BMI_Category.pkl', 'wb') as f:
    pickle.dump(le_bmi_category, f)


Unnamed: 0,Age,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,Heart Rate,Daily Steps,Systolic,Diastolic
311,52,6.6,7,45,7,72,6000,130,85
304,51,6.1,6,90,8,75,10000,140,95
213,43,7.8,8,90,5,70,8000,130,85
261,45,6.6,7,45,4,65,6000,135,90
33,31,6.1,6,30,8,72,5000,125,80
...,...,...,...,...,...,...,...,...,...
108,37,7.8,8,70,4,68,7000,120,80
78,33,6.0,6,30,8,72,5000,125,80
197,43,6.5,6,45,7,72,6000,130,85
122,37,7.2,8,60,4,68,7000,115,75
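
`LabelEncoder` assigns integer codes in sorted order of the class labels, so for the three target values here `Insomnia`, `None` and `Sleep Apnea` map to 0, 1 and 2. The short sketch below, on a hypothetical list of labels, shows the mapping and how `inverse_transform` recovers the original names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of target labels
labels = ['None', 'Sleep Apnea', 'Insomnia', 'None']

le = LabelEncoder()
codes = le.fit_transform(labels)

print(list(le.classes_))                     # classes in sorted order
print(codes.tolist())                        # integer code for each label
print(le.inverse_transform(codes).tolist())  # back to the original names
```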


In [66]:
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

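# Numerical variables to scale, listed explicitly so this cell runs on its own
# (the label-encoded categorical columns are left unscaled)
num_vars = ['Age', 'Sleep Duration', 'Quality of Sleep', 'Physical Activity Level',
            'Stress Level', 'Heart Rate', 'Daily Steps', 'Systolic', 'Diastolic']

# Applying StandardScaler to numerical variables
scaler = StandardScaler()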

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[num_vars] = scaler.fit_transform(X_train[num_vars])
X_test_scaled[num_vars] = scaler.transform(X_test[num_vars])

with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
    
X_test_scaled

Unnamed: 0,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Heart Rate,Daily Steps,Systolic,Diastolic
278,0,0.923577,5,-1.321664,-1.095797,1.450972,1.485775,2,1.208391,1.935017,1.492108,1.690428
72,1,-1.075404,1,-1.321664,-1.095797,-1.450601,1.485775,0,0.465131,-1.154665,-0.446691,-0.738205
332,0,1.393926,2,1.614660,1.435203,-1.450601,-1.382474,0,-1.269143,-1.154665,-0.446691,-0.738205
128,1,-0.487469,3,0.210331,0.591536,0.000185,-0.235174,0,-0.525883,0.699144,0.199575,0.071339
150,0,-0.369881,0,1.103995,1.435203,0.967376,-1.382474,0,-0.773636,0.390176,-1.739224,-1.062023
...,...,...,...,...,...,...,...,...,...,...,...,...
323,0,1.276339,2,1.742327,1.435203,-1.450601,-1.382474,0,-1.269143,-1.154665,-0.446691,-0.738205
15,1,-1.545753,1,-1.449330,-1.095797,-1.450601,1.485775,0,-0.030376,0.699144,-1.092958,-0.738205
44,1,-1.310579,1,0.720996,-0.252130,0.725579,0.338475,0,-0.030376,0.699144,-1.092958,-0.738205
41,1,-1.310579,1,0.720996,-0.252130,0.725579,0.338475,0,-0.030376,0.699144,-1.092958,-0.738205


In [67]:
num_vars


['Age',
 'Sleep Duration',
 'Quality of Sleep',
 'Physical Activity Level',
 'Stress Level',
 'Heart Rate',
 'Daily Steps',
 'Systolic',
 'Diastolic']

In [63]:
X_train_set, X_test_set = X_train_scaled, X_test_scaled
model = LogisticRegression(max_iter=10000, class_weight='balanced')
model.fit(X_train_set, y_train)
with open('logistic_regression.pkl', 'wb') as f:
    pickle.dump(model, f)
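
A pickled model can be checked by loading it back and comparing predictions. The sketch below does the round trip in memory on synthetic data from `make_classification`; the data and the names `X_demo`, `y_demo` and `clf` are hypothetical and unrelated to the sleep dataset.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data standing in for the real training set
X_demo, y_demo = make_classification(n_samples=120, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)

clf = LogisticRegression(max_iter=10000, class_weight='balanced').fit(X_demo, y_demo)

# Serialize and restore the fitted model in memory
restored = pickle.loads(pickle.dumps(clf))

# The restored model should reproduce the original predictions
same = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to create them.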

## 4.5 Feature Importance

We will analyze the most important features according to the model.

The feature importances provided by the Random Forest model indicate that the most influential feature in predicting sleep disorder is `Systolic` blood pressure, with an importance of approximately 19%, followed by `Diastolic` blood pressure (15%), `BMI Category` (12%), `Occupation` (12%) and so on.
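
For reference, impurity-based importances can be read from a fitted `RandomForestClassifier` via its `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical feature names, so its numbers are unrelated to the percentages quoted above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=0)
feature_names = [f'feature_{i}' for i in range(5)]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Importances are normalized to sum to 1; sort them from most to least important
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances.round(3))
```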

# Executive Summary

- The `Logistic Regression` model displayed the highest average F1-score (88%) with minimal variability (±2%), indicating a strong and consistent predictive capability across various data subsets.
- The `Random Forest Classifier`, while having a lower average F1-score (84%) in cross-validation, demonstrated a superior F1-score of 95% on the test set after hyperparameter tuning.
- The `Support Vector Classifier (SVC)` and `Ridge Classifier` showed competitive performance, but with slightly higher variability in results.
- Confusion matrix analysis revealed that the Random Forest Classifier is particularly adept at identifying 'Sleep Apnea' and has shown improvement in identifying 'Insomnia' after hyperparameter tuning.
- Feature importance analysis indicated that `'Systolic'` blood pressure is the most significant predictor, followed by `'Diastolic'` blood pressure and `'BMI Category'`.
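
An average F1-score with its variability, as reported above, is typically obtained from cross-validation. The sketch below shows one way to compute it on synthetic data; the weighted F1 averaging and the 5-fold stratified split are assumptions, since this section does not state which settings were used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for the real training set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

# Assumed setup: 5 stratified folds, weighted F1 across the three classes
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=10000, class_weight='balanced'),
                         X_demo, y_demo, cv=cv, scoring='f1_weighted')

# Report the mean and standard deviation across folds
print(f'F1 (weighted): {scores.mean():.2f} ± {scores.std():.2f}')
```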

# Recommendations

- Adopt Logistic Regression as the primary model for setting health insurance premiums due to its stable and high F1-score, ensuring that premiums are based on reliable predictions.
- Further explore and fine-tune Random Forest given its high performance on the test set, especially to reduce overfitting as indicated by the variability in cross-validation scores.
- Consider systolic blood pressure as a key factor in premium calculation, as it is the most influential feature in predicting sleep disorders.
- Incorporate model insights into health assessments, using identified key features like blood pressure and BMI to guide risk evaluation and premium setting.
- Implement regular re-evaluation of the models with new data to maintain and improve predictive accuracy and reliability.

By leveraging these insights and recommendations, the health insurance company can more accurately assess the risk of sleep disorders among potential clients, leading to fairer and more precise insurance premium settings.