# Creating the Physical Activity Exposure Variables & Replicating Dempsey et al. Results

### Sources:

> Dempsey PC, Rowlands A V, Strain T, et al. Physical activity volume, intensity, and incident cardiovascular disease. Eur Heart J. 2022;43(46):4789-4800. doi:10.1093/EURHEARTJ/EHAC613



## Creating Physical Activity Exposure Variables

In this section, I created the physical activity energy expenditure (PAEE) variable used for physical activity volume, as well as % MVPA, defined as the percent of PAEE accrued from activity above 125 mgs.

I then replicate the results of the recently published article “Physical activity volume, intensity, and incident cardiovascular disease” in the *European Heart Journal* to confirm that these variables were processed correctly.



## Converting ENMO into Physical Activity Energy Expenditure (PAEE) - Physical Activity Volume Variable

The following code shows how we convert the fraction of time spent at different levels of activity measured as mgs into PAEE, our physical activity volume exposure.
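
As a compact summary of the calculation carried out in the cells that follow, each person's PAEE can be written as a time-weighted sum over the 67 acceleration intervals. The symbols $f_{ik}$, $m_k$, $x$ and $g$ are notation introduced here for illustration only; the coefficients are the ones that appear in the conversion loop in this notebook.

$$
\mathrm{PAEE}_i = \sum_{k=1}^{67} f_{ik}\, g(m_k), \qquad
g(m) = -10.58 + 1.1176\,x + 2.9418\sqrt{x} - 0.00059277\,x^{2}, \qquad
x = 1.5 + 0.8517\,m
$$

Here $f_{ik}$ is the fraction of time person $i$ spends in interval $k$, $m_k$ is the midpoint of interval $k$ in mgs, and $g(m_k)$ is the PAEE at that midpoint in J/min/kg.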

In [None]:
# In Bash kernel
dx download PACOHORTprocessedPAVars.csv


In [None]:
# In R kernel

# Reading in csv as data frame
dat <- read.csv("PACOHORTprocessedPAVars.csv")


# Creating blank data frame with interval variables 1 to 67
# 67 matches the number of fraction-of-time-by-mg variables in the UKB (p90092 to p90158)
blankdf <- as.data.frame(matrix(NA, nrow = 96660, ncol = 67))
n <- 1:67
colnames(blankdf) <- paste("AccelInterval", n, sep = "")


# Merging these blank interval variables with the larger dataset
dat <- cbind(dat, blankdf)


In [None]:
# -------
# Creating a for loop to create accurate proportions of variables
# All variables in UKB are initially defined cumulatively (<=). This code transforms these inequalities into intervals
# For instance, AccelInterval2 (p90093 - p90092) is the proportion of time spent between 1 and 2 mgs (not <= 2 mgs as p90093 is)

# Getting indices for for loop - corresponding to each of the fraction of mgs variables in UK Biobank
summary(match("p90092", colnames(dat)))
summary(match("p90158", colnames(dat)))
# 265 to 331

# Getting indices of blank interval variables
summary(match("AccelInterval1", colnames(dat)))
summary(match("AccelInterval67", colnames(dat)))
# 428 to 494

# Running for loop to create these intervals from original variables
for(i in 1:66){
  
  # Start w p90093 - p90092 and continue - fill AccelInterval
  # START at AccelInterval2 for indexing!
  dat[ , i+428] <- dat[ , i+265] - dat[ , i+264]
  

  
}

# Then AccelInterval1 is simply the first one (still <= 1 mg)
dat$AccelInterval1 <- dat$p90092
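
The same differencing can be expressed without hard-coded column positions. The following is a minimal sketch on toy data, not the notebook's actual pipeline: the data frame `cum` and its values are hypothetical stand-ins for the cumulative (<=) UKB fraction variables.

```r
# Hypothetical cumulative fractions for 3 people across 4 thresholds
cum <- data.frame(
  p1 = c(0.10, 0.20, 0.05),
  p2 = c(0.40, 0.50, 0.30),
  p3 = c(0.90, 0.80, 0.70),
  p4 = c(1.00, 1.00, 1.00)
)
cum_mat <- as.matrix(cum)

# The first interval keeps its cumulative value; every later interval is
# the difference between a column and the column before it
intervals <- cbind(cum_mat[, 1], cum_mat[, -1] - cum_mat[, -ncol(cum_mat)])
colnames(intervals) <- paste0("AccelInterval", seq_len(ncol(intervals)))

# Sanity check: interval fractions add back up to the last cumulative value
stopifnot(all(abs(rowSums(intervals) - cum_mat[, ncol(cum_mat)]) < 1e-12))
intervals
```

Selecting the columns by name (for example with `match()` on the first and last variable names) and then differencing the resulting matrix would avoid having to update the offsets whenever columns are added to `dat`.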

In [None]:
# ---------
# Have AccelInterval1 to 67
# Now have to create X variable that will give the midpoint of each interval
# ----------

# Creating empty data frame with individuals x intervals as dimensions
object <- as.data.frame(matrix(NA, nrow = 96660, ncol = 67))


# Sequences correspond to midpoint values of mgs at each interval created in AccelInterval variables
# For instance, AccelInterval2 corresponds to mgs between 1 and 2, so midpoint is 1.5 and so on
object[ , 1:20] <- rep(seq(from = 1, to = 20, by = 1) - 0.5, each = 96660)
object[ , 21:36] <- rep(seq(from = 25, to = 100, by = 5) - 2.5, each = 96660)
object[ , 37:52] <- rep(seq(from = 125, to = 500, by = 25) - 12.5, each = 96660)
object[ , 53:67] <- rep(seq(from = 600, to = 2000, by = 100) - 50, each = 96660)


# Take all of these midpoint values as variables (person-invariant) and add to larger data frame
n <- 1:67
colnames(object) <- paste("X", n, sep = "")

# Merging object w/ larger DF
dat <- cbind(dat, object)

In [None]:
# ------
# NOW have X# variables and AccelInterval# variables
# Using X variables to convert to PAEE
# ------

# Creating PAEE Midpoint variables - which convert mg midpoints to PAEE
PAEEdf <- as.data.frame(matrix(NA, nrow = 96660, ncol = 67))
n <- 1:67

colnames(PAEEdf) <- paste("PAEEMidpoints", n, sep = "")


# Merging these variables w/ overall data frame
dat <- cbind(dat, PAEEdf)


# Getting column indices for for loop
summary(match("PAEEMidpoints1", colnames(dat)))
# 562

# Getting column indices for for loop of mg defined midpoints
summary(match("X1", colnames(dat)))
# 495

# for loop converts mg-defined midpoints in X1 to X67 to PAEE-defined midpoints in PAEEMidpoints1 to PAEEMidpoints67

for(i in 1:67){
  
  # Here the first dat should be indexed starting at FIRST PAEEdf variable
  # The dat in the equation should be indexed at the first X variable
  dat[ , i+561] <- (-10.58 + 1.1176*(1.5 + .8517*dat[ , i+494]) + 2.9418*sqrt((1.5 + .8517*dat[ , i+494])) - 0.00059277*((1.5 + .8517*dat[ , i+494])^2))
  
}
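
Because the midpoints are the same for every person, the conversion only needs to be evaluated 67 times. The sketch below wraps the formula from the loop above in a function and applies it to a vector of midpoints; the names `mg_to_paee`, `midpoints` and `paee_mid` are hypothetical and do not appear elsewhere in the notebook.

```r
# Conversion from an mg value to PAEE (J/min/kg), using the same
# coefficients as the loop above
mg_to_paee <- function(mg) {
  x <- 1.5 + 0.8517 * mg
  -10.58 + 1.1176 * x + 2.9418 * sqrt(x) - 0.00059277 * x^2
}

# Midpoints of the 67 intervals, built with the same sequences as X1 to X67
midpoints <- c(
  seq(1, 20, by = 1) - 0.5,
  seq(25, 100, by = 5) - 2.5,
  seq(125, 500, by = 25) - 12.5,
  seq(600, 2000, by = 100) - 50
)

paee_mid <- mg_to_paee(midpoints)
length(paee_mid)          # one PAEE value per interval
head(round(paee_mid, 2))  # PAEE at the lowest midpoints
```

In this sketch the formula returns a negative value at the lowest midpoints, which may be one reason why a person who spends almost all of their time at very low acceleration can end up with a negative overall PAEE.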

In [None]:
# ---------
# Now combine proportion in each and PAEE then SUM ALL TOGETHER FOR IND LEVEL PAEE
# ---------

# Creating PAEE Interval variables - combine proportion of time spent at intervals w/ PAEE midpoint value of interval
PAEEInts <- as.data.frame(matrix(NA, nrow = 96660, ncol = 67))

n <- 1:67
colnames(PAEEInts) <- paste("PAEEInterval", n, sep = "")


# Merging w/ overall dat
dat <- cbind(dat, PAEEInts)


# Getting col indices in for loop
summary(match("PAEEInterval1", colnames(dat)))
summary(match("AccelInterval1", colnames(dat)))
summary(match("PAEEMidpoints1", colnames(dat)))
# 629
# 428
# 562

# This simply takes the fraction of time spent in an interval (AccelIntervalX) and multiplies it by
# midpoint value of that interval in PAEE (PAEEMidpoints1) to yield total PAEE from each interval by individual
for(i in 1:67){
  dat[ , i + 628] <- dat[ , i + 427] * dat[ , i + 561]
}


# Getting col indices for rowSums
summary(match("PAEEInterval1", colnames(dat)))
summary(match("PAEEInterval67", colnames(dat)))
# 629 to 695

# Now summing across all of these PAEEInterval variables to get overall PAEE for each person in sample
dat$OverallPAEE <- rowSums(dat[ , c(629:695)])
# rowSums adds across the PAEEInterval columns, giving one total per person (row)
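
The multiply-then-sum step is equivalent to a matrix product between the interval fractions and the vector of midpoint PAEE values. A minimal sketch with hypothetical numbers (`frac` and `paee_mid` are illustrative and not taken from the data):

```r
# Hypothetical fractions of time (rows = people, columns = intervals)
frac <- matrix(c(0.7, 0.2, 0.1,
                 0.5, 0.3, 0.2),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, paste0("AccelInterval", 1:3)))

# Hypothetical PAEE value at each interval midpoint (J/min/kg)
paee_mid <- c(-4, 10, 50)

# Notebook approach: scale each column by its midpoint PAEE, then sum rows
per_interval <- sweep(frac, 2, paee_mid, `*`)
paee_loop <- rowSums(per_interval)

# Equivalent single matrix product
paee_mat <- as.vector(frac %*% paee_mid)

stopifnot(isTRUE(all.equal(unname(paee_loop), paee_mat)))
paee_mat
```

The matrix form avoids storing 67 person-invariant midpoint columns and 67 intermediate product columns in `dat`, at the cost of not keeping the per-interval contributions that the MVPA calculation uses later.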

In [None]:
# --------
# Converting PAEE to correct units
# Formula is in J/min/kg but converting to kJ/kg/day
# -------------

dat$OverallPAEETRANSFORM <- dat$OverallPAEE*1440/1000
# Just a linear transformation, so will have no effect on % MVPA or % vigorous (hence it's fine to do here)
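
The factor of 1440/1000 follows from the units alone: there are 1440 minutes in a day and 1000 J in a kJ. As a worked example with a hypothetical value of 25 J/min/kg:

$$
25\ \frac{\mathrm{J}}{\mathrm{min}\cdot\mathrm{kg}} \times \frac{1440\ \mathrm{min}}{1\ \mathrm{day}} \times \frac{1\ \mathrm{kJ}}{1000\ \mathrm{J}} = 36\ \frac{\mathrm{kJ}}{\mathrm{kg}\cdot\mathrm{day}}
$$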


## Creating % MVPA from Overall PAEE - PA Intensity Variable

After converting ENMO into PAEE to serve as the physical activity volume variable, I next created the physical activity intensity variable, which is the percent of overall PAEE from moderate-to-vigorous physical activity. This is defined as the percentage of PAEE accrued above 125 mgs.
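
The definition can be illustrated with a small sketch on made-up numbers. The matrix `paee_int` and the vector `lower_bound` are hypothetical; in the notebook the corresponding quantities are the `PAEEInterval` columns and the interval edges implied by the midpoints.

```r
# Hypothetical per-interval PAEE contributions for 3 people and 4 intervals
paee_int <- matrix(c(2, 5, 3, 0,
                     1, 4, 4, 1,
                     3, 3, 0, 0),
                   nrow = 3, byrow = TRUE)

# Lower edge of each interval in mgs (illustrative values)
lower_bound <- c(0, 100, 125, 150)

# Intervals that lie entirely above the 125 mg MVPA threshold
is_mvpa <- lower_bound >= 125

total_paee <- rowSums(paee_int)
mvpa_paee <- rowSums(paee_int[, is_mvpa, drop = FALSE])
percent_mvpa <- 100 * mvpa_paee / total_paee
percent_mvpa
```

The ratio is only meaningful when the denominator is positive; how to handle zero or negative totals is left out of this sketch.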

In [None]:
# --------
# Creating PA Intensity variable
# --------


# Getting col indices for rowSums for MVPA
summary(match("PAEEInterval37", colnames(dat)))
summary(match("PAEEInterval67", colnames(dat)))
# PAEEInterval37 has midpoint 112.5 mgs (100 to 125 mgs), so PAEEInterval38 is the first interval above 125 mgs for MVPA
# 665 to 695

# Summarizing PAEE ABOVE MVPA threshold
dat$MVPAPAEE <- rowSums(dat[ , c(666:695)])
# sum over all rows ABOVE 125 mgs for PAEEInterval


# -----------
# NOW computing % of overall PAEE from MVPA
# -----------

dat$PercentMVPA <- (dat$MVPAPAEE/dat$OverallPAEE)*100

# Write this data set to CSV
write.csv(dat, "PACOHORTprocessedPAVarsPAEE.csv")


In [None]:
# Bash kernel to upload dataset
dx upload PACOHORTprocessedPAVarsPAEE.csv

## Code for Replication of Dempsey Summary Tables

In order to verify that the PA volume and intensity variables were processed correctly, I compared our results with those in Table 1 of the Dempsey *et al.* article. This Table stratifies by sex and tertile of PAEE, so I followed the same process. The formally written up replication results are available in the Replications folder.
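
The stratification can also be written without copying cutoffs by hand. The following is a sketch on simulated data only (the values are not UK Biobank data, and `assign_tertile` is a hypothetical helper); it uses exact 1/3 and 2/3 quantiles, whereas the cell below uses 0.33 and 0.67, so group boundaries can differ slightly.

```r
set.seed(1)

# Simulated stand-in for the cohort
toy <- data.frame(
  Biological.Sex = rep(c("Male", "Female"), each = 300),
  OverallPAEETRANSFORM = c(rnorm(300, mean = 38, sd = 10),
                           rnorm(300, mean = 40, sd = 10))
)

# cut() uses right-closed intervals by default, matching "<=" cutoffs
assign_tertile <- function(x) {
  cuts <- quantile(x, probs = c(1/3, 2/3))
  cut(x, breaks = c(-Inf, cuts, Inf), labels = c("Tert1", "Tert2", "Tert3"))
}

# Sex-specific tertile (coded 1 to 3) for every person
toy$Tertile <- ave(toy$OverallPAEETRANSFORM, toy$Biological.Sex,
                   FUN = function(x) as.integer(assign_tertile(x)))

# Mean PAEE within each sex-by-tertile group
aggregate(OverallPAEETRANSFORM ~ Biological.Sex + Tertile, data = toy, FUN = mean)
```

Keeping the tertile as a column in one data frame means the same `aggregate()` call can be reused for other variables such as `p90012` or `PercentMVPA`.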

In [None]:
# --------------
# Replicating Table 1 of Dempsey Article
# Making sure my Overall Acceleration corresponds to ENMO
# Making sure my overall PAEE is consistent w/ theirs
# Making sure my % MVPA is consistent w/ theirs
# All look to be approximately correct - v close despite slightly different cohort inclusion criteria
# --------------

# Creating sex-stratified subsets
datMALE <- subset(dat, Biological.Sex == "Male")
datFEMALE <- subset(dat, Biological.Sex == "Female")

# Finding tertile cutoffs for PAEE
quantile(datMALE$OverallPAEETRANSFORM, probs = c(0.33, 0.67))
quantile(datFEMALE$OverallPAEETRANSFORM, probs = c(0.33, 0.67))
quantile(dat$OverallPAEETRANSFORM, probs = c(0.33, 0.67))
# Male = 32.57 and 42.32
# Female = 34.97 and 44.48
# Overall = 33.89 and 43.56

# Creating tertiles subgroups based on above cutoffs
# For Males
datMALETert1 <- subset(datMALE, datMALE$OverallPAEETRANSFORM <= 32.57)
datMALETert2 <- subset(datMALE, datMALE$OverallPAEETRANSFORM > 32.57 & datMALE$OverallPAEETRANSFORM <= 42.32)
datMALETert3 <- subset(datMALE, datMALE$OverallPAEETRANSFORM > 42.32)

# For Females
datFEMALETert1 <- subset(datFEMALE, datFEMALE$OverallPAEETRANSFORM <= 34.97)
datFEMALETert2 <- subset(datFEMALE, datFEMALE$OverallPAEETRANSFORM > 34.97 & datFEMALE$OverallPAEETRANSFORM <= 44.48)
datFEMALETert3 <- subset(datFEMALE, datFEMALE$OverallPAEETRANSFORM > 44.48)


# ------
# NOW comparing summary stats to those in their Table 1
# ------

# Male Tertile 1 Results
summary(datMALETert1$OverallPAEETRANSFORM)
summary(datMALETert1$p90012)
summary(datMALETert1$PercentMVPA)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.4334 23.4061 27.2864 26.2494 30.1469 32.5692 

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 3.55   17.26   19.82   19.22   21.68   89.96 

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.0000  0.2155  0.2783  0.2814  0.3444  0.7260 


sd(datMALETert1$p90012)
sd(datMALETert1$PercentMVPA)
# 3.57 - also almost exact same as ENMO
# 0.72 - v close to SD for their PercentMVPA too



summary(datMALETert2$OverallPAEETRANSFORM)
summary(datMALETert2$p90012)
summary(datMALETert2$PercentMVPA)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 32.57   34.90   37.27   37.32   39.72   42.32
# Not SUPER close here... 3 points lower than in paper - but seems within margin to me given sample diffs

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 18.15   24.79   26.45   26.55   28.11   77.35 
# ALSO just about perfect

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.07116 0.30825 0.36328 0.36624 0.42225 0.80355
# This one is PERFECT!

summary(datMALETert3$OverallPAEETRANSFORM)
summary(datMALETert3$OverallPAEE)
summary(datMALETert3$p90012)
summary(datMALETert3$PercentMVPA)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 42.32   45.26   49.03   51.37   55.01  149.07 
# A bit off - 3 points lower than mean

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 23.70   31.94   34.81   36.75   39.38   97.21
# Pretty good - like 1.5 points off

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.1269  0.3951  0.4546  0.4583  0.5185  0.8887 
# Almost EXACTLY right!

# --------
# REPEAT FOR FEMALE
# --------

summary(datFEMALETert1$OverallPAEETRANSFORM)
summary(datFEMALETert1$p90012)
summary(datFEMALETert1$PercentMVPA)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# -1.553  25.980  29.871  28.729  32.606  34.969 

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 2.55   18.73   21.19   20.56   22.99   35.19 

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.0000  0.2024  0.2606  0.2634  0.3220  0.7920 
# VERY CLOSE in ALL of these! Same pattern as in paper and all within ~ 2 (or 0.02) points!


summary(datFEMALETert2$OverallPAEETRANSFORM)
summary(datFEMALETert2$p90012)
summary(datFEMALETert2$PercentMVPA)

#Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 34.97   37.25   39.52   39.59   41.89   44.48 
# VERY close to perfect

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 18.38   25.99   27.54   27.64   29.16   47.19 
# JUST about perfect

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.08672 0.28891 0.34346 0.34550 0.39867 0.87086 
# About 2 points off - relatively close

summary(datFEMALETert3$OverallPAEETRANSFORM)
summary(datFEMALETert3$p90012)
summary(datFEMALETert3$PercentMVPA)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 44.48   47.41   51.14   53.13   56.58  118.46 
# Almost exactly right

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 23.09   32.87   35.54   37.13   39.58   97.11 
# Almost exactly right

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.1263  0.3786  0.4378  0.4390  0.4978  0.7991
# Differs by about a point

Given the differences in sample selection, the close agreement between our results and those of Dempsey *et al.* (the largest difference is only 3 points) provides further confirmation that we processed the variables correctly. We later truncate PAEE to ensure that no negative values remain in the sample.