## Predicting whether a person has a sleep disorder using logistic regression
https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset/data

Initial library imports

In [65]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split

Loading our data. 

The "BMI Category" column contains both "Normal" and "Normal Weight", which denote the same category, so we map the latter to "Normal".

In [66]:
dt = pd.read_csv('Sleep_health_and_lifestyle_dataset.csv')
dt = dt.assign(BMI_Category=lambda x: x['BMI Category'].map({'Obese': "Obese", 'Normal': 'Normal', 'Overweight': 'Overweight', 'Normal Weight': 'Normal'}))
dt.drop(columns = "BMI Category", inplace= True)

In [67]:
dt.describe().round(2)

| | Person ID | Age | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | Heart Rate | Daily Steps |
|---|---|---|---|---|---|---|---|---|
| count | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 |
| mean | 187.5 | 42.18 | 7.13 | 7.31 | 59.17 | 5.39 | 70.17 | 6816.84 |
| std | 108.11 | 8.67 | 0.8 | 1.2 | 20.83 | 1.77 | 4.14 | 1617.92 |
| min | 1.0 | 27.0 | 5.8 | 4.0 | 30.0 | 3.0 | 65.0 | 3000.0 |
| 25% | 94.25 | 35.25 | 6.4 | 6.0 | 45.0 | 4.0 | 68.0 | 5600.0 |
| 50% | 187.5 | 43.0 | 7.2 | 7.0 | 60.0 | 5.0 | 70.0 | 7000.0 |
| 75% | 280.75 | 50.0 | 7.8 | 8.0 | 75.0 | 7.0 | 72.0 | 8000.0 |
| max | 374.0 | 59.0 | 8.5 | 9.0 | 90.0 | 8.0 | 86.0 | 10000.0 |


In [68]:
dt.head()

| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder | BMI_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | 126/83 | 77 | 4200 | | Overweight |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | 125/80 | 75 | 10000 | | Normal |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | 125/80 | 75 | 10000 | | Normal |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | 140/90 | 85 | 3000 | Sleep Apnea | Obese |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | 140/90 | 85 | 3000 | Sleep Apnea | Obese |


In [69]:
dt.isna().sum()

Person ID                  0
Gender                     0
Age                        0
Occupation                 0
Sleep Duration             0
Quality of Sleep           0
Physical Activity Level    0
Stress Level               0
Blood Pressure             0
Heart Rate                 0
Daily Steps                0
Sleep Disorder             0
BMI_Category               0
dtype: int64

There are no NaN values in our dataset.

In [70]:
dt['Blood Pressure'].unique()

array(['126/83', '125/80', '140/90', '120/80', '132/87', '130/86',
       '117/76', '118/76', '128/85', '131/86', '128/84', '115/75',
       '135/88', '129/84', '130/85', '115/78', '119/77', '121/79',
       '125/82', '135/90', '122/80', '142/92', '140/95', '139/91',
       '118/75'], dtype=object)

Grouping the data into categories

For blood pressure:
- Ideal: systolic (upper number) below 120, diastolic (lower number) below 80
- Normal: systolic in the range 120-129, diastolic in the range 80-84
- Otherwise, blood pressure is high

Ideal and normal readings are encoded as 0, and high readings as 1.
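
The thresholds above can also be applied by parsing each reading instead of listing values by hand. The sketch below is one possible interpretation, not the notebook's method: it assumes a reading is high as soon as either number reaches the high range (systolic 130 or more, or diastolic 85 or more), and `is_high_bp` is a hypothetical helper name. Under this reading, values such as '118/75' or '119/77' would be labelled 0, whereas the hard-coded list in the next cell labels them 1.

```python
import pandas as pd

def is_high_bp(reading: str) -> int:
    """Return 1 if a 'systolic/diastolic' string is in the high range, else 0.

    Assumption: either number reaching the high range makes the reading high.
    """
    systolic, diastolic = map(int, reading.split("/"))
    return int(systolic >= 130 or diastolic >= 85)

# Illustrative readings taken from the unique values listed above
readings = pd.Series(["120/80", "117/76", "128/85", "140/90"])
flags = readings.apply(is_high_bp)
print(flags.tolist())
```

On the real frame this would be applied as `dt['Blood Pressure'].apply(is_high_bp)` before the column is overwritten.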

In [71]:
dt['Blood Pressure']=dt['Blood Pressure'].apply(lambda x:0 if x in ['120/80','126/83','125/80','128/84','129/84','117/76','118/76','115/75','125/82','122/80'] else 1)
# pd.cut with an integer splits the observed range into equal-width intervals.
# The resulting Interval categories become integer codes in the encoding step below.
dt["Age"]=pd.cut(dt["Age"],4)                  # 4 age intervals
dt["Heart Rate"]=pd.cut(dt["Heart Rate"],4)    # 4 heart-rate intervals
dt["Daily Steps"]=pd.cut(dt["Daily Steps"],4)  # 4 daily-steps intervals
dt["Sleep Duration"]=pd.cut(dt["Sleep Duration"],3)    # 3 sleep-duration intervals
dt["Physical Activity Level"]=pd.cut(dt["Physical Activity Level"],4)  # 4 activity-level intervals
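
As a quick check of what `pd.cut` returns: with an integer number of bins it yields `Interval` categories, and integer bin indices appear only when `labels=False` is passed (or after a later encoding step). A minimal sketch on made-up ages rather than the dataset:

```python
import pandas as pd

ages = pd.Series([27, 35, 43, 50, 59])  # illustrative values only

intervals = pd.cut(ages, 4)             # categorical column of Interval objects
codes = pd.cut(ages, 4, labels=False)   # integer bin index for each value

print(intervals.cat.categories)  # the four equal-width intervals
print(codes.tolist())            # bin indices between 0 and 3
```

Passing `labels=False` directly would make the separate encoding of the binned columns unnecessary, at the cost of losing the readable interval labels.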

Converting non-numeric data into numbers and dropping "Person ID", since it is only an identifier and carries no predictive information.

In [72]:
from sklearn.preprocessing import LabelEncoder
LE=LabelEncoder()

categories=['Gender','Age','Occupation','Sleep Duration','Physical Activity Level','BMI_Category','Heart Rate','Daily Steps','Sleep Disorder']
for label in categories:
    dt[label]=LE.fit_transform(dt[label])

dt.drop(['Person ID'], axis=1, inplace=True)
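
Reusing a single `LabelEncoder` refits it on every column, so after the loop it only remembers the classes of the last column. If the integer codes need to be mapped back to their labels later, one option is to keep one encoder per column. The sketch below uses a toy frame with made-up rows, and the `encoders` dictionary is a hypothetical name:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "BMI_Category": ["Normal", "Obese", "Overweight"],
})

encoders = {}  # column name -> fitted LabelEncoder
for column in toy.columns:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

print(toy)
# Recover the original labels from the integer codes
print(encoders["BMI_Category"].inverse_transform([0, 1, 2]))
```

The scikit-learn documentation describes `LabelEncoder` as intended for target values; `OrdinalEncoder` is the equivalent meant for feature columns.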

Splitting our data into 75% for training and 25% for testing.

In [73]:
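# Note: iloc[:, -1] takes the last column, which is BMI_Category (appended by assign() above), not Sleep Disorder
x=dt.iloc[:,:-1]
y=dt.iloc[:,-1]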

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.25,random_state=123,shuffle=True)
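
Because `BMI_Category` was appended by `assign` and therefore sits in the last position, positional slicing depends on the column order. A minimal sketch of selecting the target by name instead, on a toy frame with made-up values; `stratify=y` is an optional addition that the cell above does not use:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with hypothetical, already-encoded values
toy = pd.DataFrame({
    "Stress Level": [6, 8, 8, 4, 5, 7, 3, 6],
    "Sleep Disorder": [0, 1, 1, 0, 0, 1, 0, 1],
    "BMI_Category": [1, 0, 0, 2, 1, 0, 2, 1],
})

target = "Sleep Disorder"
x = toy.drop(columns=target)  # every column except the target
y = toy[target]               # the target, selected by name

# stratify=y keeps the class proportions similar in both splits
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=123, stratify=y
)
print(list(x.columns), y.name)
```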

Training our model and checking its accuracy.

In [74]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model=LogisticRegression(max_iter=1000).fit(x_train,y_train)


print(f"Training score: {round(model.score(x_train,y_train)*100,2)}")
print(f"Testing score: {round(model.score(x_test,y_test)*100,2)}")

Training score: 98.57
Testing score: 97.87
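
Accuracy alone can hide which classes are being confused, particularly when the classes are imbalanced. A confusion matrix is one way to look at per-class errors. The sketch below uses made-up labels, not the notebook's predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic labels for illustration only
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2, 2]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc)
print(cm)
```

With the fitted model this would be `confusion_matrix(y_test, model.predict(x_test))`.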


The input feature has not been implemented yet.