# CSA vs. OSA: An Introductory Study Replication

This notebook replicates Table 1 and Table 3 of the following study: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4909617. It was the first data task our sponsor assigned, intended to help us understand the goal of the project and the features that matter most.

Table 1 describes demographics and sleep characteristics of the sleep study participants, classified into no sleep apnea, CSA, and OSA. Table 3 describes co-existing diseases and prescription drug use of sleep study participants.

Resulting tables can be found in the /src/study directory.

## Imports

In [None]:
import pandas as pd
import random
import numpy as np
from scipy import stats

# user defined methods
import sys
sys.path.append('../utils')

from data_exploration import calc_confidence_interval

## Replicating Table 1

In [2]:
random.seed(0)
np.random.seed(0)

df = pd.read_csv('../../data/raw/shhs1-dataset-0.20.0.csv', encoding='cp1252', engine='python')

In [3]:
table1 = ['bmi_s1', 'age_s1', 'gender', 'systbp', 'diasbp', 'ess_s1', 'ahi_a0h4', 'ahi_c0h4', 'ahi_o0h4']

In [4]:
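# keep only the Table 1 columns and drop incomplete rows
# (reassigning avoids calling dropna(inplace=True) on a slice of the original frame)
df = df[table1].dropna()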

Diagnosis of sleep apnea (AHI = apnea-hypopnea index, in events per hour of sleep):

* Obstructive sleep apnea (OSA): total AHI >= 5 and obstructive AHI > central AHI
* Central sleep apnea (CSA): central AHI >= 5 and central AHI > obstructive AHI
* No sleep apnea (No SA): total AHI < 5
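
The three criteria above do not cover every participant: a row with total AHI >= 5 whose central AHI is below 5 but still exceeds its obstructive AHI matches none of them. The sketch below applies the same rules to a few made-up rows with `np.select`; the toy values and the `'Unclassified'` label are assumptions for illustration, not part of the study.

```python
import numpy as np
import pandas as pd

# Toy AHI values (events/hour); column names mirror the SHHS variables used in this notebook.
toy = pd.DataFrame({
    'ahi_a0h4': [2.0, 12.0, 20.0, 8.0],   # total AHI
    'ahi_o0h4': [1.5, 10.0, 4.0, 3.0],    # obstructive AHI
    'ahi_c0h4': [0.5, 2.0, 16.0, 4.5],    # central AHI
})

# np.select takes the first condition that is True for each row
conditions = [
    toy['ahi_a0h4'] < 5,
    (toy['ahi_a0h4'] >= 5) & (toy['ahi_o0h4'] > toy['ahi_c0h4']),
    (toy['ahi_c0h4'] >= 5) & (toy['ahi_c0h4'] > toy['ahi_o0h4']),
]
toy['label'] = np.select(conditions, ['No SA', 'OSA', 'CSA'], default='Unclassified')

print(toy['label'].tolist())
# ['No SA', 'OSA', 'CSA', 'Unclassified']
```

The last toy row shows the gap: it has apnea by total AHI, but it fails both the OSA and the CSA rule.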

In [5]:
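# .copy() makes each group an independent frame, so adding columns later
# does not raise SettingWithCopyWarning
# OSA: TOTAL AHI >= 5 & OAHI > CAHI
osa = df[(df['ahi_a0h4'] >= 5) & (df['ahi_o0h4'] > df['ahi_c0h4'])].copy()
# CSA: CAHI >= 5 & CAHI > OAHI
csa = df[(df['ahi_c0h4'] >= 5) & (df['ahi_c0h4'] > df['ahi_o0h4'])].copy()
# No SA: TOTAL AHI < 5
no_sa = df[df['ahi_a0h4'] < 5].copy()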

In [6]:
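# bin central AHI into severity ranges; pd.cut intervals are right-closed by default:
# (0, 5], (5, 15], (15, 30], (30, 1000]
# the bins go into a separate Series so the numeric ahi_c0h4 column is not overwritten
csa_bins = pd.cut(csa['ahi_c0h4'], bins=[0, 5, 15, 30, 1000], labels=['0-5', '5-15', '15-30', '30+'])

# count the number of participants in each bin
csa_bins.value_counts()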

ahi_c0h4
5-15     97
15-30    36
30+      22
0-5       0
Name: count, dtype: int64
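
Because `pd.cut` uses right-closed intervals by default, a central AHI of exactly 5 would fall into the `0-5` bin even though it meets the CSA criterion. The count of 0 in that bin indicates no such value occurs here, but the edge behavior is worth knowing. A minimal sketch on made-up values compares the default with `right=False`:

```python
import pandas as pd

values = pd.Series([5.0, 5.1, 15.0, 30.0, 45.0])  # made-up central AHI values
bins = [0, 5, 15, 30, 1000]
labels = ['0-5', '5-15', '15-30', '30+']

# default (right=True): (0, 5], (5, 15], (15, 30], (30, 1000]
right_closed = pd.cut(values, bins=bins, labels=labels)
# right=False: [0, 5), [5, 15), [15, 30), [30, 1000)
left_closed = pd.cut(values, bins=bins, labels=labels, right=False)

print(right_closed.tolist())
# ['0-5', '5-15', '5-15', '15-30', '30+']
print(left_closed.tolist())
# ['5-15', '5-15', '15-30', '30+', '30+']
```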

In [7]:
print(len(no_sa))
print(len(osa))
print(len(csa))

2649
2503
155


In [None]:
# add labels based on the criteria above
osa.loc[:, 'label'] = 'OSA'
csa.loc[:, 'label'] = 'CSA'
no_sa.loc[:, 'label'] = 'No SA'

# combine the three dataframes
df = pd.concat([osa, csa, no_sa])

In [10]:
features = ['bmi_s1', 'age_s1', 'gender', 'systbp', 'diasbp', 'ess_s1', 'ahi_a0h4']

In [11]:
df_table1 = df[features + ['label']]
df_table1

Unnamed: 0,bmi_s1,age_s1,gender,systbp,diasbp,ess_s1,ahi_a0h4,label
1,32.950680,78,1,168.0,68.0,14.0,19.780220,OSA
2,24.114150,77,2,127.0,68.0,5.0,5.020921,OSA
6,29.983588,52,1,142.0,99.0,11.0,10.105263,OSA
8,25.817447,69,1,201.0,101.0,10.0,24.409673,OSA
11,25.401235,68,1,152.0,90.0,7.0,20.417335,OSA
...,...,...,...,...,...,...,...,...
5795,35.790598,71,2,126.0,73.0,5.0,2.284041,No SA
5796,21.957367,55,2,136.0,77.0,13.0,0.807537,No SA
5798,32.414213,54,2,118.0,66.0,7.0,1.878669,No SA
5801,24.228571,55,1,89.0,56.0,17.0,3.605769,No SA


In [12]:
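# .loc[:, :] returns a new object, so an inplace dropna on it would leave df_table1 unchanged;
# reassign instead
df_table1 = df_table1.dropna()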

In [13]:
df_table1

Unnamed: 0,bmi_s1,age_s1,gender,systbp,diasbp,ess_s1,ahi_a0h4,label
1,32.950680,78,1,168.0,68.0,14.0,19.780220,OSA
2,24.114150,77,2,127.0,68.0,5.0,5.020921,OSA
6,29.983588,52,1,142.0,99.0,11.0,10.105263,OSA
8,25.817447,69,1,201.0,101.0,10.0,24.409673,OSA
11,25.401235,68,1,152.0,90.0,7.0,20.417335,OSA
...,...,...,...,...,...,...,...,...
5795,35.790598,71,2,126.0,73.0,5.0,2.284041,No SA
5796,21.957367,55,2,136.0,77.0,13.0,0.807537,No SA
5798,32.414213,54,2,118.0,66.0,7.0,1.878669,No SA
5801,24.228571,55,1,89.0,56.0,17.0,3.605769,No SA


In [14]:
# Summary statistics per label: count, mean, std, min, quartiles, and max
df_table1.groupby('label').describe().T

Unnamed: 0,label,CSA,No SA,OSA
bmi_s1,count,155.0,2649.0,2503.0
bmi_s1,mean,29.437723,26.808917,29.33806
bmi_s1,std,4.883141,4.403147,5.342028
bmi_s1,min,21.023138,18.0,18.0
bmi_s1,25%,25.801037,23.828125,25.722984
bmi_s1,50%,28.509508,26.264784,28.675689
bmi_s1,75%,32.542248,29.23348,32.111039
bmi_s1,max,46.650769,50.0,50.0
age_s1,count,155.0,2649.0,2503.0
age_s1,mean,64.393548,61.050208,65.714343


In [None]:
# Rename the columns
df_table1 = df_table1.rename(columns={'bmi_s1': 'BMI', 'age_s1': 'Age', 'gender': 'Gender', 'systbp': 'Systolic BP', 'diasbp': 'Diastolic BP', 'ess_s1': 'ESS', 'ahi_a0h4': 'AHI'})

# Keep only the mean, std, and count per group
# means and stds are rounded to 1 decimal place
means = df_table1.groupby('label').mean().T.round(1)
stds = df_table1.groupby('label').std().T.round(1)
counts = df_table1.groupby('label').count().T

In [16]:
counts

label,CSA,No SA,OSA
BMI,155,2649,2503
Age,155,2649,2503
Gender,155,2649,2503
Systolic BP,155,2649,2503
Diastolic BP,155,2649,2503
ESS,155,2649,2503
AHI,155,2649,2503


In [17]:
means

label,CSA,No SA,OSA
BMI,29.4,26.8,29.3
Age,64.4,61.1,65.7
Gender,1.3,1.6,1.4
Systolic BP,129.8,124.8,129.5
Diastolic BP,75.5,73.0,74.0
ESS,8.1,7.3,8.2
AHI,18.5,2.0,18.5


In [19]:
stds

label,CSA,No SA,OSA
BMI,4.9,4.4,5.3
Age,11.3,11.2,10.5
Gender,0.5,0.5,0.5
Systolic BP,19.0,19.1,19.1
Diastolic BP,11.8,11.0,12.1
ESS,4.9,4.2,4.5
AHI,14.8,1.4,15.6
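
The summary tables above report means and standard deviations but no 95% confidence interval. One common way to add it is a t-based interval for each group mean, sketched below on made-up BMI values. The helper name `mean_ci` and the toy data are assumptions for illustration and are not taken from the project's `utils` module.

```python
import numpy as np
import pandas as pd
from scipy import stats


def mean_ci(values, confidence=0.95):
    """Return (mean, lower, upper) using a t-based interval for the mean."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    margin = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mean, mean - margin, mean + margin


# Made-up BMI values for two groups
toy = pd.DataFrame({
    'label': ['CSA'] * 4 + ['OSA'] * 4,
    'BMI': [28.0, 30.0, 29.0, 31.0, 27.0, 33.0, 29.0, 31.0],
})

# One row per group, with the mean and its interval bounds as columns
ci = pd.DataFrame(
    {label: mean_ci(group['BMI']) for label, group in toy.groupby('label')},
    index=['mean', 'lower', 'upper'],
).T

print(ci.round(2))
```

With group sizes in the thousands, as for No SA and OSA here, the t quantile is very close to 1.96, so a normal approximation would give nearly the same bounds.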


In [20]:
# Comparisons of continuous variables were made among 
# the 3 groups with one-way ANOVA with subsequent pairwise 
# Tukey HSD test

# ANOVA
# H0: The means of the groups are equal
# H1: The means of the groups are not equal
# p-value < 0.05, reject H0
# p-value > 0.05, fail to reject H0
# p-value = 0.05, marginal

variables = ["BMI", "Age", "Gender", "Average Systolic BP", "Average Diastolic BP", "Epworth Sleepiness Scale score", "AHI"]
col_name = ["bmi_s1", "age_s1", "gender", "systbp", "diasbp", "ess_s1", "ahi_a0h4"]

# Note: Gender is a categorical code, so its ANOVA result is shown only for completeness;
# a chi-square test is the more usual choice for a categorical variable.
for name, col in zip(variables, col_name):
    f_oneway = stats.f_oneway(osa[col], csa[col], no_sa[col])
    print(f'{name}\nstatistic: {f_oneway.statistic}, pvalue: {f_oneway.pvalue}\n')

BMI
statistic: 178.89195549666258, pvalue: 6.55236171672407e-76

Age
statistic: 119.9245383463436, pvalue: 1.1499882927179695e-51

Gender
statistic: 174.83909338791827, pvalue: 2.9274338921080018e-74

Average Systolic BP
statistic: 41.076643041870014, pvalue: 1.983292012261657e-18

Average Diastolic BP
statistic: 7.208300939239418, pvalue: 0.0007476898168833154

Epworth Sleepiness Scale score
statistic: 29.133155965242462, pvalue: 2.60988694668077e-13

AHI
statistic: 1469.7275857978243, pvalue: 0.0
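
The ANOVA above only says that at least one group mean differs; the pairwise Tukey HSD step mentioned in the cell comments is not run in this notebook. A minimal sketch of that step, assuming SciPy 1.8 or newer (which provides `scipy.stats.tukey_hsd`) and using synthetic samples rather than the SHHS data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for one continuous variable in the three groups
group_osa = rng.normal(29.3, 5.3, size=200)
group_csa = rng.normal(29.4, 4.9, size=50)
group_no_sa = rng.normal(26.8, 4.4, size=200)

# Pairwise comparisons of all group means
result = stats.tukey_hsd(group_osa, group_csa, group_no_sa)

print(result)         # table of pairwise differences with p-values and CIs
print(result.pvalue)  # 3x3 matrix; entry [i, j] compares group i with group j
```

In the notebook itself the three arguments would be the same column taken from `osa`, `csa`, and `no_sa`, in whatever order makes the index pairs easiest to read.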



## Replicating Table 3

In [2]:
df = pd.read_csv('../../data/raw/shhs1-dataset-0.20.0.csv', encoding='cp1252', engine='python')

In [3]:
# Split participants into No SA, OSA, and CSA groups, starting with No SA (total AHI < 5)
no_sa = df[df['ahi_a0h4'] < 5]
len(no_sa)

2830

In [4]:
osa = df[df['ahi_a0h4'] >= 5]
osa = osa[osa['ahi_o0h4'] > osa['ahi_c0h4']]
len(osa)

2632

In [5]:
csa = df[df['ahi_c0h4'] >= 5]
csa = csa[csa['ahi_c0h4'] > csa['ahi_o0h4']]
len(csa)

165

In [20]:
# We don't have periodic breathing (pb) data, so the Cheyne-Stokes respiration (CSR) column of Table 3 cannot be replicated

In [6]:
no_sa

Unnamed: 0,nsrrid,pptid,ecgdate,lvh3_1,lvh3_3,st4_1_3,st5_1_3,lvhst,mob1,part2deg,...,eoglqual,chinqual,oximqual,posqual,lightoff,oximet51,monitor_id,headbox_id,rcrdtime,psg_month
0,200001,1,,,,,,,,,...,4,4,4,4,1.0,96.0,18.0,18.0,7:16:00,6
3,200004,4,,,,,,,,,...,3,3,3,3,0.0,96.0,19.0,19.0,5:58:00,4
4,200005,5,,,,,,,,,...,4,4,4,4,0.0,96.0,18.0,18.0,7:57:00,3
5,200006,6,,,,,,,,,...,4,4,4,4,1.0,97.0,16.0,16.0,7:49:00,6
7,200008,8,,,,,,,,,...,3,3,4,4,1.0,97.0,17.0,17.0,7:59:00,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5795,205796,5831,-769.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,4,4,4,4,0.0,94.0,10.0,10.0,7:07:00,11
5796,205797,5832,-702.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4,4,4,4,1.0,98.0,7.0,64.0,8:29:00,10
5798,205799,5834,-907.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3,3,3,3,0.0,99.0,10.0,10.0,4:39:00,1
5801,205802,5837,-768.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4,3,4,4,0.0,95.0,10.0,10.0,7:30:00,10


In [7]:
variables = ['mi15', 'angina15', 'stroke15','cabg15', 'pacem15', 'copd15', 'asthma15', 'loop1', 'diuret1', 
'ccb1', 'beta1', 'ace1', 'lipid1', 'ohga1', 'warf1', 'asa1', 'nsaid1', 'benzod1']

In [8]:
# Define a dictionary to save the result
result = dict()
for column in variables:
    result[column] = calc_confidence_interval(no_sa, column)
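
The source of `calc_confidence_interval` is not shown in this notebook. In the outputs below, each bound sits within about 0.05 of its percentage. For CSA `mi15`, the half-width is 9.742 - 9.697 = 0.045, which matches 1.96 * sqrt(p(1 - p)/n) for p = 0.097 and n = 165 on the proportion scale rather than the percent scale. It may be worth checking whether the helper adds a proportion-scale margin to a percent-scale estimate. The sketch below shows a normal-approximation interval with all three values in percent, assuming 0/1-coded columns; the name `proportion_ci_percent` is hypothetical.

```python
import numpy as np
import pandas as pd


def proportion_ci_percent(series, z=1.96):
    """Normal-approximation CI for a 0/1 column; returns (percent, lower, upper)."""
    values = series.dropna()
    n = len(values)
    p = values.mean()                      # proportion in [0, 1]
    margin = z * np.sqrt(p * (1 - p) / n)  # margin on the same proportion scale
    # convert estimate and bounds to percent together so the units stay consistent
    return 100 * p, 100 * (p - margin), 100 * (p + margin)


# Made-up column: 16 positives out of 165
toy = pd.Series([1] * 16 + [0] * 149)
pct, lower, upper = proportion_ci_percent(toy)
print(round(pct, 2), round(lower, 2), round(upper, 2))
```

For small groups or rare conditions, a Wilson interval is often preferred over the normal approximation, which can give a negative lower bound.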

In [10]:
# Convert it into dataframe
no_sa_result = pd.DataFrame(result).T
no_sa_result.rename(columns={0:'No SA Percentage', 1:'No SA lower_bound', 2:'No SA upper_bound'}, inplace=True)

In [11]:
# Define a dictionary to save the result
result = dict()
for column in variables:
    result[column] = calc_confidence_interval(osa, column)    

In [12]:
# Convert it into dataframe
osa_result = pd.DataFrame(result).T
osa_result.rename(columns={0:'OSA Percentage', 1:'OSA lower_bound', 2:'OSA upper_bound'}, inplace=True)

In [13]:
# Define a dictionary to save the result
result = dict()
for column in variables:
    result[column] = calc_confidence_interval(csa, column)    

In [14]:
# Convert it into dataframe
csa_result = pd.DataFrame(result).T
csa_result.rename(columns={0:'CSA Percentage', 1:'CSA lower_bound', 2:'CSA upper_bound'}, inplace=True)

In [15]:
no_sa_result

Unnamed: 0,No SA Percentage,No SA lower_bound,No SA upper_bound
mi15,4.628975,4.621234,4.636717
angina15,5.512367,5.503959,5.520776
stroke15,2.402827,2.397185,2.408469
cabg15,2.650177,2.644259,2.656095
pacem15,0.353357,0.351171,0.355543
copd15,1.166078,1.162122,1.170033
asthma15,9.081272,9.070685,9.091859
loop1,2.791519,2.78545,2.797589
diuret1,13.003534,12.991141,13.015926
ccb1,11.201413,11.189794,11.213033


In [16]:
osa_result

Unnamed: 0,OSA Percentage,OSA lower_bound,OSA upper_bound
mi15,7.408815,7.398808,7.418821
angina15,8.510638,8.499978,8.521299
stroke15,4.027356,4.019845,4.034867
cabg15,4.293313,4.285569,4.301057
pacem15,1.519757,1.515083,1.524431
copd15,0.987842,0.984064,0.99162
asthma15,7.902736,7.892429,7.913042
loop1,5.965046,5.955997,5.974094
diuret1,18.351064,18.336276,18.365852
ccb1,16.641337,16.627108,16.655567


In [17]:
csa_result

Unnamed: 0,CSA Percentage,CSA lower_bound,CSA upper_bound
mi15,9.69697,9.651817,9.742122
angina15,14.545455,14.491659,14.59925
stroke15,4.242424,4.21167,4.273179
cabg15,8.484848,8.44233,8.527367
pacem15,3.030303,3.004147,3.056459
copd15,1.212121,1.195424,1.228818
asthma15,9.69697,9.651817,9.742122
loop1,12.727273,12.676419,12.778126
diuret1,21.212121,21.149743,21.2745
ccb1,20.606061,20.544344,20.667778


In [18]:
new_index_mapping = {'mi15': 'History of MI', 'angina15': 'Angina', 'stroke15': 'History of stroke', 'cabg15': 'History of CABG',
                    'pacem15': 'History of pacemaker', 'copd15': 'History of COPD', 'asthma15': 'History of Asthma',
                    'loop1': 'Loop diuretic', 'diuret1': 'Any diuretic', 'ccb1': 'Calcium channel blocker', 'beta1': 'Beta blocker',
                    'ace1': 'Ace inhibitor', 'lipid1': 'Anti-lipid', 'ohga1': 'Oral Hypoglycemic', 'warf1': 'Warfarin',
                    'asa1': 'Aspirin', 'nsaid1': 'Non-aspirin NSAID', 'benzod1': 'Benzodiazepine'}
csa_result = csa_result.rename(index=new_index_mapping)
osa_result = osa_result.rename(index=new_index_mapping)
no_sa_result = no_sa_result.rename(index=new_index_mapping)

In [19]:
csa_result

Unnamed: 0,CSA Percentage,CSA lower_bound,CSA upper_bound
History of MI,9.69697,9.651817,9.742122
Angina,14.545455,14.491659,14.59925
History of stroke,4.242424,4.21167,4.273179
History of CABG,8.484848,8.44233,8.527367
History of pacemaker,3.030303,3.004147,3.056459
History of COPD,1.212121,1.195424,1.228818
History of Asthma,9.69697,9.651817,9.742122
Loop diuretic,12.727273,12.676419,12.778126
Any diuretic,21.212121,21.149743,21.2745
Calcium channel blocker,20.606061,20.544344,20.667778


In [20]:
# join the three per-group result tables side by side
combined_df = pd.concat([no_sa_result, osa_result, csa_result], axis=1)

In [21]:
combined_df

Unnamed: 0,No SA Percentage,No SA lower_bound,No SA upper_bound,OSA Percentage,OSA lower_bound,OSA upper_bound,CSA Percentage,CSA lower_bound,CSA upper_bound
History of MI,4.628975,4.621234,4.636717,7.408815,7.398808,7.418821,9.69697,9.651817,9.742122
Angina,5.512367,5.503959,5.520776,8.510638,8.499978,8.521299,14.545455,14.491659,14.59925
History of stroke,2.402827,2.397185,2.408469,4.027356,4.019845,4.034867,4.242424,4.21167,4.273179
History of CABG,2.650177,2.644259,2.656095,4.293313,4.285569,4.301057,8.484848,8.44233,8.527367
History of pacemaker,0.353357,0.351171,0.355543,1.519757,1.515083,1.524431,3.030303,3.004147,3.056459
History of COPD,1.166078,1.162122,1.170033,0.987842,0.984064,0.99162,1.212121,1.195424,1.228818
History of Asthma,9.081272,9.070685,9.091859,7.902736,7.892429,7.913042,9.69697,9.651817,9.742122
Loop diuretic,2.791519,2.78545,2.797589,5.965046,5.955997,5.974094,12.727273,12.676419,12.778126
Any diuretic,13.003534,12.991141,13.015926,18.351064,18.336276,18.365852,21.212121,21.149743,21.2745
Calcium channel blocker,11.201413,11.189794,11.213033,16.641337,16.627108,16.655567,20.606061,20.544344,20.667778


In [22]:
df = combined_df[['No SA Percentage', 'OSA Percentage', 'CSA Percentage']]

In [23]:
df

Unnamed: 0,No SA Percentage,OSA Percentage,CSA Percentage
History of MI,4.628975,7.408815,9.69697
Angina,5.512367,8.510638,14.545455
History of stroke,2.402827,4.027356,4.242424
History of CABG,2.650177,4.293313,8.484848
History of pacemaker,0.353357,1.519757,3.030303
History of COPD,1.166078,0.987842,1.212121
History of Asthma,9.081272,7.902736,9.69697
Loop diuretic,2.791519,5.965046,12.727273
Any diuretic,13.003534,18.351064,21.212121
Calcium channel blocker,11.201413,16.641337,20.606061
