<a href="https://colab.research.google.com/github/AzlinRusnan/Sleep_Quality_Analysis/blob/main/Sleep_Duration_vs_Quality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Data Description**

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings("ignore")

file_path = '/content/gdrive/MyDrive/Sleep_health_and_lifestyle_dataset.csv'
df = pd.read_csv(file_path)

In [4]:
df.head()

Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea


##### **Columns Explanation:**

1. Person ID: An identifier for each individual.
2. Gender: The gender of the person (Male/Female).
3. Age: The age of the person in years.
4. Occupation: The occupation or profession of the person.
5. Sleep Duration (hours): The number of hours the person sleeps per day.
6. Quality of Sleep (scale: 1-10): A subjective rating of the quality of sleep, ranging from 1 to 10.
7. Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity daily.
8. Stress Level (scale: 1-10): A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
9. BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).
10. Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
11. Heart Rate (bpm): The resting heart rate of the person in beats per minute.
12. Daily Steps: The number of steps the person takes per day.
13. Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).

##### **Details about Sleep Disorder Column:**

1. None: The individual does not exhibit any specific sleep disorder.
2. Insomnia: The individual experiences difficulty falling asleep or staying asleep, leading to inadequate or poor-quality sleep.
3. Sleep Apnea: The individual suffers from pauses in breathing during sleep, resulting in disrupted sleep patterns and potential health risks.

##### **Checking the Column Names**

In [5]:
df.columns

Index(['Person ID', 'Gender', 'Age', 'Occupation', 'Sleep Duration',
       'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
       'BMI Category', 'Blood Pressure', 'Heart Rate', 'Daily Steps',
       'Sleep Disorder'],
      dtype='object')

##### **Checking the Total Number of Missing Values**

In [None]:
df.isnull().sum().to_frame().rename(columns={0:"Total No. of Missing Values"})

Unnamed: 0,Total No. of Missing Values
Person ID,0
Gender,0
Age,0
Occupation,0
Sleep Duration,0
Quality of Sleep,0
Physical Activity Level,0
Stress Level,0
BMI Category,0
Blood Pressure,0


**Note:**
The **Sleep Disorder** column shows 219 missing values, but these are not truly missing. In the raw CSV they are the literal string 'None', which `pd.read_csv` interprets as a missing-value marker by default and loads as NaN. To fix this, we replace NaN with the string "None".

In [6]:
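# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df['Sleep Disorder'] = df['Sleep Disorder'].fillna("None")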

print(df['Sleep Disorder'].value_counts())

Sleep Disorder
None           219
Sleep Apnea     78
Insomnia        77
Name: count, dtype: int64
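
As an alternative to filling the values after loading, the literal string can be preserved at read time. The sketch below uses a small in-memory CSV (made-up rows, not the real dataset) and assumes that only empty cells should count as missing. `keep_default_na` and `na_values` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; these rows are illustrative only.
csv_text = (
    "Person ID,Sleep Disorder\n"
    "1,None\n"
    "2,Sleep Apnea\n"
    "3,Insomnia\n"
    "4,None\n"
)

# Switch off the default NA markers (which include "None")
# and treat only empty cells as missing.
demo = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(demo["Sleep Disorder"].value_counts())
print(demo["Sleep Disorder"].isnull().sum())
```

Applied to the real file, the same two arguments would make the later `fillna` step unnecessary, provided no other column relies on the default markers.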


In [7]:
# Total No. of Missing Values after NaN replacement
df.isnull().sum().to_frame().rename(columns={0:"Total No. of Missing Values after NaN replacement"})

Unnamed: 0,Total No. of Missing Values after NaN replacement
Person ID,0
Gender,0
Age,0
Occupation,0
Sleep Duration,0
Quality of Sleep,0
Physical Activity Level,0
Stress Level,0
BMI Category,0
Blood Pressure,0


##### **Splitting Blood Pressure into Two Columns**

I split Blood Pressure into two numeric columns, Systolic and Diastolic. Left as a string, it would be one-hot encoded into a separate column for every distinct reading, which I don't see as meaningful.

In [8]:
if 'Blood Pressure' in df.columns:
    # Split the 'Blood Pressure' column into 'Systolic' and 'Diastolic' by splitting on the '/'
    df[['Systolic', 'Diastolic']] = df['Blood Pressure'].str.split('/', expand=True)

    # Convert the new columns to numeric type for analysis
    df['Systolic'] = pd.to_numeric(df['Systolic'], errors='coerce')
    df['Diastolic'] = pd.to_numeric(df['Diastolic'], errors='coerce')

    # Drop the original 'Blood Pressure' column now that we have numerical representations
    data = df.drop(columns=['Blood Pressure'])

data[['Systolic', 'Diastolic']].head()


Unnamed: 0,Systolic,Diastolic
0,126,83
1,125,80
2,125,80
3,140,90
4,140,90


In [9]:
# Drop unnecessary columns and encode categorical variables
data_new = data.drop(columns=['Person ID'])
data_new = pd.get_dummies(data_new, drop_first=True)

# Convert all boolean columns to integer (1 for True, 0 for False)
data_new = data_new.apply(lambda x: x.astype(int) if x.dtype == 'bool' else x)


data_new

Unnamed: 0,Age,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,Heart Rate,Daily Steps,Systolic,Diastolic,Gender_Male,...,Occupation_Sales Representative,Occupation_Salesperson,Occupation_Scientist,Occupation_Software Engineer,Occupation_Teacher,BMI Category_Normal Weight,BMI Category_Obese,BMI Category_Overweight,Sleep Disorder_None,Sleep Disorder_Sleep Apnea
0,27,6.1,6,42,6,77,4200,126,83,1,...,0,0,0,1,0,0,0,1,1,0
1,28,6.2,6,60,8,75,10000,125,80,1,...,0,0,0,0,0,0,0,0,1,0
2,28,6.2,6,60,8,75,10000,125,80,1,...,0,0,0,0,0,0,0,0,1,0
3,28,5.9,4,30,8,85,3000,140,90,1,...,1,0,0,0,0,0,1,0,0,1
4,28,5.9,4,30,8,85,3000,140,90,1,...,1,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,59,8.1,9,75,3,68,7000,140,95,0,...,0,0,0,0,0,0,0,1,0,1
370,59,8.0,9,75,3,68,7000,140,95,0,...,0,0,0,0,0,0,0,1,0,1
371,59,8.1,9,75,3,68,7000,140,95,0,...,0,0,0,0,0,0,0,1,0,1
372,59,8.1,9,75,3,68,7000,140,95,0,...,0,0,0,0,0,0,0,1,0,1
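
With `drop_first=True`, the first category of each variable (alphabetically first for string columns) gets no column of its own and becomes the baseline. That is why the table above has no `Sleep Disorder_Insomnia` column. A minimal sketch on a made-up four-row frame, not the real data:

```python
import pandas as pd

# Illustrative values only; the category names match the dataset's Sleep Disorder column.
toy = pd.DataFrame({"Sleep Disorder": ["None", "Insomnia", "Sleep Apnea", "None"]})

# drop_first=True removes the first category ("Insomnia"), which becomes the baseline.
# dtype=int yields 0/1 integers directly instead of booleans.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

A row with zeros in every remaining dummy column therefore belongs to the baseline category, and each coefficient on a dummy is read relative to that baseline.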


##### **Target Variable**

The target variable (dependent variable) should be continuous, so I chose Sleep Duration over Quality of Sleep as the target. This is because Quality of Sleep is rated on a 1–10 discrete ordinal scale, which may be better suited for ordinal regression.

## **The Models**

### **Multiple Linear Regression Model**

For learning purposes, I build the MLR model using two different approaches.

1. First, I will perform MLR using the statsmodels OLS model.
2. Second, I will perform MLR using sklearn's LinearRegression model.
3. Then, I will compare the results of the two approaches.

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
import statsmodels.api as sm

In [12]:
X = data_new.drop(columns=['Sleep Duration'])
y = data_new['Sleep Duration']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### **1. statsmodels OLS Model**

In [14]:
X_ols = sm.add_constant(X)

# Fit the OLS model
ols_model = sm.OLS(y, X_ols).fit()

# Display the summary of the OLS model, including coefficients, p-values, and other statistics
ols_summary = ols_model.summary()
ols_summary

0,1,2,3
Dep. Variable:,Sleep Duration,R-squared:,0.912
Model:,OLS,Adj. R-squared:,0.906
Method:,Least Squares,F-statistic:,150.4
Date:,"Fri, 08 Nov 2024",Prob (F-statistic):,5.26e-168
Time:,07:31:14,Log-Likelihood:,9.482
No. Observations:,374,AIC:,31.04
Df Residuals:,349,BIC:,129.1
Df Model:,24,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.5255,1.157,5.642,0.000,4.251,8.800
Age,0.0274,0.007,4.199,0.000,0.015,0.040
Quality of Sleep,0.2861,0.056,5.095,0.000,0.176,0.397
Physical Activity Level,0.0093,0.002,5.991,0.000,0.006,0.012
Stress Level,-0.1629,0.034,-4.766,0.000,-0.230,-0.096
Heart Rate,0.0333,0.010,3.265,0.001,0.013,0.053
Daily Steps,-0.0001,2.19e-05,-5.863,0.000,-0.000,-8.53e-05
Systolic,-0.1213,0.016,-7.363,0.000,-0.154,-0.089
Diastolic,0.1360,0.022,6.153,0.000,0.093,0.179

0,1,2,3
Omnibus:,32.19,Durbin-Watson:,1.182
Prob(Omnibus):,0.0,Jarque-Bera (JB):,56.567
Skew:,0.532,Prob(JB):,5.21e-13
Kurtosis:,4.58,Cond. No.,647000.0


### **2. sklearn's LinearRegression Model**

In [15]:
# Initialize and fit the regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, r2

(0.07433370912308548, 0.8883493778792415)

**Insights:**

An R² value of 0.888 on the test set indicates that the model explains about 88.8% of the variance in Sleep Duration for unseen data, suggesting a strong fit. The low MSE (about 0.074, in squared hours) also indicates that the model's predictions are close to the actual values on average.
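
Because MSE is expressed in squared hours, taking its square root gives an error in the same units as Sleep Duration. A quick check using the test-set MSE reported above:

```python
import math

mse = 0.07433370912308548  # test-set MSE reported above
rmse = math.sqrt(mse)      # root mean squared error, in hours

print(f"RMSE: {rmse:.3f} hours (about {rmse * 60:.0f} minutes)")
```

This works out to roughly 0.27 hours, a root-mean-square error of about 16 minutes.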

**Two Approaches Comparison**

1. sklearn's LinearRegression Model: This model was fitted on the training set produced by `train_test_split` and scored on the held-out test set. Its R² (0.888) therefore reflects the model's ability to generalize to unseen data, which gives a more realistic assessment of performance.

2. statsmodels OLS Model: The OLS model was fitted and evaluated on the entire dataset, resulting in a slightly higher R² (0.912). Because this is an in-sample figure, it tends to be optimistic and does not show how well the model generalizes to new data.

### **Variable Selection**

In [None]:
import statsmodels.api as sm

# Define a function to perform forward selection
def forward_selection(X, y, significance_level=0.05):
    initial_features = []
    remaining_features = list(X.columns)
    best_features = []
    while remaining_features:
        best_pval = float("inf")
        best_feature = None
        for feature in remaining_features:
            model = sm.OLS(y, sm.add_constant(X[initial_features + [feature]])).fit()
            pval = model.pvalues[feature]
            if pval < best_pval:
                best_pval = pval
                best_feature = feature
        if best_pval < significance_level:
            initial_features.append(best_feature)
            remaining_features.remove(best_feature)
            best_features.append(best_feature)
        else:
            break
    return best_features

# Define a function to perform backward elimination
def backward_elimination(X, y, significance_level=0.05):
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.iloc[1:]  # Exclude the intercept
        max_pval = pvalues.max()
        if max_pval > significance_level:
            excluded_feature = pvalues.idxmax()
            features.remove(excluded_feature)
        else:
            break
    return features

# Define a function to perform stepwise selection (combining forward and backward)
def stepwise_selection(X, y, significance_level=0.05):
    initial_features = []
    remaining_features = list(X.columns)
    best_features = []
    while remaining_features or initial_features:
        changed = False

        # Forward step
        forward_best_pval = float("inf")
        best_feature = None
        for feature in remaining_features:
            model = sm.OLS(y, sm.add_constant(X[initial_features + [feature]])).fit()
            pval = model.pvalues[feature]
            if pval < forward_best_pval:
                forward_best_pval = pval
                best_feature = feature
        if forward_best_pval < significance_level:
            initial_features.append(best_feature)
            remaining_features.remove(best_feature)
            best_features.append(best_feature)
            changed = True

        # Backward step (skipped until at least one feature has been selected,
        # so that no model is fitted on an empty predictor set)
        if initial_features:
            model = sm.OLS(y, sm.add_constant(X[initial_features])).fit()
            pvalues = model.pvalues.iloc[1:]  # Exclude the intercept
            max_pval = pvalues.max()
            if max_pval > significance_level:
                excluded_feature = pvalues.idxmax()
                initial_features.remove(excluded_feature)
                remaining_features.append(excluded_feature)
                best_features.remove(excluded_feature)
                changed = True

        if not changed:
            break
    return best_features

# Run variable selection methods
forward_selected_features = forward_selection(X_train, y_train)
backward_selected_features = backward_elimination(X_train, y_train)
stepwise_selected_features = stepwise_selection(X_train, y_train)

# Define a function to evaluate a model based on selected features
def evaluate_model(selected_features, X_train, X_test, y_train, y_test):
    # Fit model with selected features
    model = LinearRegression()
    model.fit(X_train[selected_features], y_train)

    # Make predictions and calculate performance metrics
    y_pred = model.predict(X_test[selected_features])
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Calculate adjusted R^2
    n = X_test[selected_features].shape[0]  # Number of observations
    p = len(selected_features)  # Number of predictors
    adjusted_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))

    return mse, r2, adjusted_r2

# Evaluate models for forward, backward, and stepwise selected features
forward_mse, forward_r2, forward_adjusted_r2 = evaluate_model(forward_selected_features, X_train, X_test, y_train, y_test)
backward_mse, backward_r2, backward_adjusted_r2 = evaluate_model(backward_selected_features, X_train, X_test, y_train, y_test)
stepwise_mse, stepwise_r2, stepwise_adjusted_r2 = evaluate_model(stepwise_selected_features, X_train, X_test, y_train, y_test)

# Print results
print("Forward Selected Features:\n" +
      "MSE: " + str(forward_mse) + "\nR2: " + str(forward_r2) + "\nAdjusted R2: " + str(forward_adjusted_r2) + "\n" + "\n".join(forward_selected_features) + "\n")
print("Backward Selected Features:\n" +
      "MSE: " + str(backward_mse) + "\nR2: " + str(backward_r2) + "\nAdjusted R2: " + str(backward_adjusted_r2) + "\n" + "\n".join(backward_selected_features) + "\n")
print("Stepwise Selected Features:\n" +
      "MSE: " + str(stepwise_mse) + "\nR2: " + str(stepwise_r2) + "\nAdjusted R2: " + str(stepwise_adjusted_r2) + "\n" + "\n".join(stepwise_selected_features))

Forward Selected Features:
MSE: 0.0999578095564472
R2: 0.849861499520685
Adjusted R2: 0.8148291827421782
Quality of Sleep
Occupation_Doctor
Occupation_Engineer
Physical Activity Level
Daily Steps
Stress Level
Occupation_Teacher
Occupation_Lawyer
Occupation_Salesperson
BMI Category_Normal Weight
Sleep Disorder_Sleep Apnea
BMI Category_Obese
Heart Rate
Occupation_Sales Representative

Backward Selected Features:
MSE: 0.07248583366880358
R2: 0.8911249213911443
Adjusted R2: 0.8586534067183277
Age
Quality of Sleep
Physical Activity Level
Stress Level
Heart Rate
Daily Steps
Systolic
Diastolic
Occupation_Doctor
Occupation_Engineer
Occupation_Lawyer
Occupation_Sales Representative
Occupation_Salesperson
Occupation_Software Engineer
Occupation_Teacher
BMI Category_Obese
BMI Category_Overweight

Stepwise Selected Features:
MSE: 0.10089702362812338
R2: 0.8484507824093731
Adjusted R2: 0.8161534081687477
Quality of Sleep
Occupation_Doctor
Occupation_Engineer
Physical Activity Level
Daily Steps
Stre

### **The Best MLR Model**

Backward Elimination appears to be the best model in this case:

- It has the lowest test MSE (0.0725, versus 0.1000 for forward and 0.1009 for stepwise selection), so its predictions are the most accurate.
- It has the highest R² (0.8911), meaning it explains the most variance in the target variable.
- It has the highest adjusted R² (0.8587), so the extra predictors it keeps (17, versus 14 for forward selection) are justified by the gain in fit.

Therefore, the **Backward Elimination model is the best MLR model.**
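
The adjusted R² values quoted above can be reproduced by hand from the printed R² scores. The check below assumes a test set of 75 rows (20% of 374, rounded up by `train_test_split`) and takes the predictor counts from the printed feature lists (17 for backward elimination, 14 for forward selection). The helper name `adjusted_r2` is introduced here for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_test = 75  # assumed test-set size: 20% of 374 rows, rounded up

backward_adj = adjusted_r2(0.8911249213911443, n=n_test, p=17)
forward_adj = adjusted_r2(0.849861499520685, n=n_test, p=14)

print("Backward:", backward_adj)
print("Forward: ", forward_adj)
```

Both results agree with the printed output, which confirms that the adjusted R² here is computed on the test set rather than on the training data.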

### **Model Diagnostics (Residual Plots)**