This notebook covers the replication of Table 1 and Table 3 of the following study: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4909617/#:~:text=The%20prevalence%20of%20CSA%20(defined,those%20aged%2065%20and%20older.&text=In%20a%20later%20cohort%20of,was%20appreciably%20higher%20(7.5%25)


# Replicating Table 1

In [15]:
import pandas as pd
import random
import numpy as np

random.seed(0)
np.random.seed(0)

df = pd.read_csv('../data/raw/shhs1-dataset-0.20.0.csv', encoding='cp1252', engine='python')

In [16]:
table1 = ['bmi_s1', 'age_s1', 'gender', 'systbp', 'diasbp', 'ess_s1', 'ahi_a0h4', 'ahi_c0h4', 'ahi_o0h4']

In [17]:
df = df[table1]

In [18]:
df.dropna(inplace=True)

* OSA: TOTOAL AHI >= 5 & OAHI > CAHI
* CSA: CAHI >= 5 & CAHO > OHAI
* No SA: TOTOAL AHI < 5

In [19]:
# OSA: TOTOAL AHI >= 5 & OAHI > CAHI
# CSA: CAHI >= 5 & CAHO > OHAI
# No SA: TOTOAL AHI < 5
osa = df[(df['ahi_a0h4'] >= 5) & (df['ahi_o0h4'] > df['ahi_c0h4'])]
csa = df[(df['ahi_c0h4'] >= 5) & (df['ahi_c0h4'] > df['ahi_o0h4'])]
no_sa = df[df['ahi_a0h4'] < 5]

In [20]:
# split csa based in these bins: bins=[0, 5, 15, 30, 1000]

csa['ahi_c0h4'] = pd.cut(csa['ahi_c0h4'], bins=[0, 5, 15, 30, 1000], labels=['0-5', '5-15', '15-30', '30+'])

# count the number of each bin
csa['ahi_c0h4'].value_counts()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  csa['ahi_c0h4'] = pd.cut(csa['ahi_c0h4'], bins=[0, 5, 15, 30, 1000], labels=['0-5', '5-15', '15-30', '30+'])


ahi_c0h4
5-15     97
15-30    36
30+      22
0-5       0
Name: count, dtype: int64

In [21]:
print(len(no_sa))
print(len(osa))
print(len(csa))

2649
2503
155


In [22]:
# add labels based on the criteria above
osa['label'] = 'OSA'
csa['label'] = 'CSA'
no_sa['label'] = 'No SA'

# combine the three dataframes
df = pd.concat([osa, csa, no_sa])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  osa['label'] = 'OSA'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  csa['label'] = 'CSA'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  no_sa['label'] = 'No SA'


In [23]:
features = ['bmi_s1', 'age_s1', 'gender', 'systbp', 'diasbp', 'ess_s1', 'ahi_a0h4']

In [24]:
df_table1 = df[features + ['label']]

In [25]:
df_table1

Unnamed: 0,bmi_s1,age_s1,gender,systbp,diasbp,ess_s1,ahi_a0h4,label
1,32.950680,78,1,168.0,68.0,14.0,19.780220,OSA
2,24.114150,77,2,127.0,68.0,5.0,5.020921,OSA
6,29.983588,52,1,142.0,99.0,11.0,10.105263,OSA
8,25.817447,69,1,201.0,101.0,10.0,24.409673,OSA
11,25.401235,68,1,152.0,90.0,7.0,20.417335,OSA
...,...,...,...,...,...,...,...,...
5795,35.790598,71,2,126.0,73.0,5.0,2.284041,No SA
5796,21.957367,55,2,136.0,77.0,13.0,0.807537,No SA
5798,32.414213,54,2,118.0,66.0,7.0,1.878669,No SA
5801,24.228571,55,1,89.0,56.0,17.0,3.605769,No SA


In [26]:
df_table1.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_table1.dropna(inplace=True)


In [27]:
df_table1

Unnamed: 0,bmi_s1,age_s1,gender,systbp,diasbp,ess_s1,ahi_a0h4,label
1,32.950680,78,1,168.0,68.0,14.0,19.780220,OSA
2,24.114150,77,2,127.0,68.0,5.0,5.020921,OSA
6,29.983588,52,1,142.0,99.0,11.0,10.105263,OSA
8,25.817447,69,1,201.0,101.0,10.0,24.409673,OSA
11,25.401235,68,1,152.0,90.0,7.0,20.417335,OSA
...,...,...,...,...,...,...,...,...
5795,35.790598,71,2,126.0,73.0,5.0,2.284041,No SA
5796,21.957367,55,2,136.0,77.0,13.0,0.807537,No SA
5798,32.414213,54,2,118.0,66.0,7.0,1.878669,No SA
5801,24.228571,55,1,89.0,56.0,17.0,3.605769,No SA


In [28]:
# Do statistics based on label, Do Mean, Std, and perbalance(95% CI)
df_table1.groupby('label').describe().T

# Rename the columns
df_table1.rename({'bmi_s1': 'BMI', 'age_s1': 'Age', 'gender': 'Gender', 'systbp': 'Systolic BP', 'diasbp': 'Diastolic BP', 'ess_s1': 'ESS', 'ahi_a0h4': 'AHI'}, axis=1, inplace=True)

# Keep only mean, std, and 95% CI
# round to 2 decimal places
means = df_table1.groupby('label').mean().T.round(1)
stds = df_table1.groupby('label').std().T.round(1)
counts = df_table1.groupby('label').count().T

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_table1.rename({'bmi_s1': 'BMI', 'age_s1': 'Age', 'gender': 'Gender', 'systbp': 'Systolic BP', 'diasbp': 'Diastolic BP', 'ess_s1': 'ESS', 'ahi_a0h4': 'AHI'}, axis=1, inplace=True)


In [29]:
counts

label,CSA,No SA,OSA
BMI,155,2649,2503
Age,155,2649,2503
Gender,155,2649,2503
Systolic BP,155,2649,2503
Diastolic BP,155,2649,2503
ESS,155,2649,2503
AHI,155,2649,2503


In [30]:
means

label,CSA,No SA,OSA
BMI,29.4,26.8,29.3
Age,64.4,61.1,65.7
Gender,1.3,1.6,1.4
Systolic BP,129.8,124.8,129.5
Diastolic BP,75.5,73.0,74.0
ESS,8.1,7.3,8.2
AHI,18.5,2.0,18.5


In [31]:
stds

label,CSA,No SA,OSA
BMI,4.9,4.4,5.3
Age,11.3,11.2,10.5
Gender,0.5,0.5,0.5
Systolic BP,19.0,19.1,19.1
Diastolic BP,11.8,11.0,12.1
ESS,4.9,4.2,4.5
AHI,14.8,1.4,15.6


In [32]:
# Comparisons of continuous variables were made among 
# the 3 groups with one-way ANOVA with subsequent pairwise 
# Tukey HSD test
from scipy import stats

# ANOVA
# H0: The means of the groups are equal
# H1: The means of the groups are not equal
# p-value < 0.05, reject H0
# p-value > 0.05, fail to reject H0
# p-value = 0.05, marginal

# BMI
print("BMI")
print(stats.f_oneway(osa['bmi_s1'], csa['bmi_s1'], no_sa['bmi_s1']))

# Age
print("Age")
print(stats.f_oneway(osa['age_s1'], csa['age_s1'], no_sa['age_s1']))

# gender
print("Gender")
print(stats.f_oneway(osa['gender'], csa['gender'], no_sa['gender']))

# systbp
print("Systbp")
print(stats.f_oneway(osa['systbp'], csa['systbp'], no_sa['systbp']))

# diasbp
print("Diasbp")
print(stats.f_oneway(osa['diasbp'], csa['diasbp'], no_sa['diasbp']))

# ess_s1
print("Ess_s1")
print(stats.f_oneway(osa['ess_s1'], csa['ess_s1'], no_sa['ess_s1']))

# ahi_a0h4
print("Ahi_a0h4")
print(stats.f_oneway(osa['ahi_a0h4'], csa['ahi_a0h4'], no_sa['ahi_a0h4']))

BMI
F_onewayResult(statistic=178.89195549666258, pvalue=6.55236171672407e-76)
Age
F_onewayResult(statistic=119.9245383463436, pvalue=1.1499882927179695e-51)
Gender
F_onewayResult(statistic=174.83909338791827, pvalue=2.9274338921080014e-74)
Systbp
F_onewayResult(statistic=41.076643041870014, pvalue=1.983292012261657e-18)
Diasbp
F_onewayResult(statistic=7.208300939239418, pvalue=0.0007476898168833154)
Ess_s1
F_onewayResult(statistic=29.133155965242462, pvalue=2.60988694668077e-13)
Ahi_a0h4
F_onewayResult(statistic=1469.7275857978243, pvalue=0.0)


# Table 3

In [33]:
import numpy as np
import pandas as pd


In [35]:
shhs1 = pd.read_csv('../data/raw/shhs1-dataset-0.20.0.csv')

  shhs1 = pd.read_csv('../data/raw/shhs1-dataset-0.20.0.csv')


In [36]:
# Get the No SA, OSA, CSA-G, and CSR people
no_sa = shhs1[shhs1['ahi_a0h4'] <= 5]
len(no_sa)

2831

In [37]:
osa = shhs1[shhs1['ahi_a0h4'] >= 5]
osa = osa[osa['ahi_o0h4'] > osa['ahi_c0h4']]
len(osa)

2632

In [38]:
csa = shhs1[shhs1['ahi_c0h4'] >= 5]
csa = csa[csa['ahi_c0h4'] > csa['ahi_o0h4']]
len(csa)

165

In [39]:
# Since we don't have the data for pb, there's no CSR data

In [40]:
variables = ['mi15', 'angina15',
'stroke15','cabg15',
'pacem15', 'copd15',
'asthma15', 'loop1', 'diuret1', 
'ccb1', 'beta1', 'ace1', 
'lipid1', 'ohga1',
'warf1', 'asa1', 'nsaid1',
'benzod1'
]

In [41]:
# Define a dictionary to save the result
result = dict()
for column in variables:
    # Define the specific value we want to calculate the percentage for
    specific_value = 1

    # Calculate the percentage of the specific value in the column
    percentage = (no_sa[column] == specific_value).mean() * 100

    # Step 3: Calculate the standard error and margin of error for the percentage
    n = len(no_sa)
    standard_error = np.sqrt((percentage / 100 * (1 - percentage / 100)) / n)
    margin_of_error = 1.96 * standard_error  # 1.96 corresponds to a 95% confidence interval

    # Step 4: Calculate the confidence interval
    lower_bound = percentage - margin_of_error
    upper_bound = percentage + margin_of_error
    result[column] = [percentage, lower_bound, upper_bound]
    
    

In [42]:
# Convert it into dataframe
no_sa_result = pd.DataFrame(result).T
no_sa_result.rename(columns={0:'No SA Percentage', 1:'No SA lower_bound', 2:'No SA upper_bound'}, inplace=True)

In [43]:
# Define a dictionary to save the result
result = dict()
for column in variables:
    # Define the specific value we want to calculate the percentage for
    specific_value = 1

    # Calculate the percentage of the specific value in the column
    percentage = (osa[column] == specific_value).mean() * 100

    # Step 3: Calculate the standard error and margin of error for the percentage
    n = len(no_sa)
    standard_error = np.sqrt((percentage / 100 * (1 - percentage / 100)) / n)
    margin_of_error = 1.96 * standard_error  # 1.96 corresponds to a 95% confidence interval

    # Step 4: Calculate the confidence interval
    lower_bound = percentage - margin_of_error
    upper_bound = percentage + margin_of_error
    result[column] = [percentage, lower_bound, upper_bound]
    
    

In [44]:
# Convert it into dataframe
osa_result = pd.DataFrame(result).T
osa_result.rename(columns={0:'OSA Percentage', 1:'OSA lower_bound', 2:'OSA upper_bound'}, inplace=True)

In [45]:
# Define a dictionary to save the result
result = dict()
for column in variables:
    # Define the specific value we want to calculate the percentage for
    specific_value = 1

    # Calculate the percentage of the specific value in the column
    percentage = (csa[column] == specific_value).mean() * 100

    # Step 3: Calculate the standard error and margin of error for the percentage
    n = len(no_sa)
    standard_error = np.sqrt((percentage / 100 * (1 - percentage / 100)) / n)
    margin_of_error = 1.96 * standard_error  # 1.96 corresponds to a 95% confidence interval

    # Step 4: Calculate the confidence interval
    lower_bound = percentage - margin_of_error
    upper_bound = percentage + margin_of_error
    result[column] = [percentage, lower_bound, upper_bound]
    

In [46]:
# Convert it into dataframe
csa_result = pd.DataFrame(result).T
csa_result.rename(columns={0:'CSA Percentage', 1:'CSA lower_bound', 2:'CSA upper_bound'}, inplace=True)

In [47]:
no_sa_result

Unnamed: 0,No SA Percentage,No SA lower_bound,No SA upper_bound
mi15,4.62734,4.619602,4.635079
angina15,5.51042,5.502015,5.518826
stroke15,2.401978,2.396338,2.407618
cabg15,2.649241,2.643325,2.655156
pacem15,0.353232,0.351047,0.355418
copd15,1.165666,1.161712,1.16962
asthma15,9.078064,9.067481,9.088647
loop1,2.790533,2.784466,2.796601
diuret1,12.99894,12.986552,13.011328
ccb1,11.197457,11.185841,11.209073


In [48]:
osa_result

Unnamed: 0,OSA Percentage,OSA lower_bound,OSA upper_bound
mi15,7.408815,7.399166,7.418463
angina15,8.510638,8.500359,8.520917
stroke15,4.027356,4.020113,4.034598
cabg15,4.293313,4.285846,4.30078
pacem15,1.519757,1.51525,1.524263
copd15,0.987842,0.984199,0.991485
asthma15,7.902736,7.892798,7.912674
loop1,5.965046,5.956321,5.97377
diuret1,18.351064,18.336805,18.365323
ccb1,16.641337,16.627617,16.655057


In [49]:
csa_result

Unnamed: 0,CSA Percentage,CSA lower_bound,CSA upper_bound
mi15,9.69697,9.686069,9.70787
angina15,14.545455,14.532467,14.558442
stroke15,4.242424,4.235,4.249849
cabg15,8.484848,8.474584,8.495113
pacem15,3.030303,3.023988,3.036618
copd15,1.212121,1.20809,1.216152
asthma15,9.69697,9.686069,9.70787
loop1,12.727273,12.714996,12.73955
diuret1,21.212121,21.197062,21.227181
ccb1,20.606061,20.591161,20.62096


In [50]:
new_index_mapping = {'mi15': 'History of MI', 'angina15': 'Angina', 'stroke15': 'History of stroke', 'cabg15': 'History of CABG',
                    'pacem15': 'History of pacemaker', 'copd15': 'History of COPD', 'asthma15': 'History of Asthma',
                    'loop1': 'Loop diuretic', 'diuret1': 'Any diuretic', 'ccb1': 'Calcium channel blocker', 'beta1': 'Beta blocker',
                    'ace1': 'Ace inhibitor', 'lipid1': 'Anti-lipid', 'ohga1': 'Oral Hypoglycemic', 'warf1': 'Warfarin',
                    'asa1': 'Aspirin', 'nsaid1': 'Non-aspirin NSAID', 'benzod1': 'Benzodiazepine'}
csa_result = csa_result.rename(index=new_index_mapping)
osa_result = osa_result.rename(index=new_index_mapping)
no_sa_result = no_sa_result.rename(index=new_index_mapping)

In [51]:
csa_result

Unnamed: 0,CSA Percentage,CSA lower_bound,CSA upper_bound
History of MI,9.69697,9.686069,9.70787
Angina,14.545455,14.532467,14.558442
History of stroke,4.242424,4.235,4.249849
History of CABG,8.484848,8.474584,8.495113
History of pacemaker,3.030303,3.023988,3.036618
History of COPD,1.212121,1.20809,1.216152
History of Asthma,9.69697,9.686069,9.70787
Loop diuretic,12.727273,12.714996,12.73955
Any diuretic,21.212121,21.197062,21.227181
Calcium channel blocker,20.606061,20.591161,20.62096


In [52]:
combined_df = pd.concat([no_sa_result, osa_result], axis=1)
combined_df = pd.concat([combined_df, csa_result], axis=1)

In [53]:
combined_df

Unnamed: 0,No SA Percentage,No SA lower_bound,No SA upper_bound,OSA Percentage,OSA lower_bound,OSA upper_bound,CSA Percentage,CSA lower_bound,CSA upper_bound
History of MI,4.62734,4.619602,4.635079,7.408815,7.399166,7.418463,9.69697,9.686069,9.70787
Angina,5.51042,5.502015,5.518826,8.510638,8.500359,8.520917,14.545455,14.532467,14.558442
History of stroke,2.401978,2.396338,2.407618,4.027356,4.020113,4.034598,4.242424,4.235,4.249849
History of CABG,2.649241,2.643325,2.655156,4.293313,4.285846,4.30078,8.484848,8.474584,8.495113
History of pacemaker,0.353232,0.351047,0.355418,1.519757,1.51525,1.524263,3.030303,3.023988,3.036618
History of COPD,1.165666,1.161712,1.16962,0.987842,0.984199,0.991485,1.212121,1.20809,1.216152
History of Asthma,9.078064,9.067481,9.088647,7.902736,7.892798,7.912674,9.69697,9.686069,9.70787
Loop diuretic,2.790533,2.784466,2.796601,5.965046,5.956321,5.97377,12.727273,12.714996,12.73955
Any diuretic,12.99894,12.986552,13.011328,18.351064,18.336805,18.365323,21.212121,21.197062,21.227181
Calcium channel blocker,11.197457,11.185841,11.209073,16.641337,16.627617,16.655057,20.606061,20.591161,20.62096


In [54]:
df = combined_df[['No SA Percentage', 'OSA Percentage', 'CSA Percentage']]

In [55]:
df

Unnamed: 0,No SA Percentage,OSA Percentage,CSA Percentage
History of MI,4.62734,7.408815,9.69697
Angina,5.51042,8.510638,14.545455
History of stroke,2.401978,4.027356,4.242424
History of CABG,2.649241,4.293313,8.484848
History of pacemaker,0.353232,1.519757,3.030303
History of COPD,1.165666,0.987842,1.212121
History of Asthma,9.078064,7.902736,9.69697
Loop diuretic,2.790533,5.965046,12.727273
Any diuretic,12.99894,18.351064,21.212121
Calcium channel blocker,11.197457,16.641337,20.606061
