# Notebook File Overview

In this notebook, we will import the *final_diabetes_ptsd* data frame that we created in the third notebook.

We will use the code snippet provided by *All of Us* to import the data frame from the workspace bucket.

We will then process and clean the data and create a summary table, also known as a *Table 1*.

# Add the code snippet from the All of Us R and Cloud Storage snippets

**Step 1: Run *Setup***

In [1]:
library(tidyverse)  # Data wrangling packages.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


**Step 2: Run the *copy_file_from_workspace_bucket.R* code snippet**

This will import **final_diabetes_ptsd.csv**

**NOTE: The new data frame will be called *my_dataframe***

In [2]:
# This snippet assumes that you run setup first

# This code copies a file from your Google Bucket into a dataframe

# replace 'test.csv' with the name of the file in your google bucket (don't delete the quotation marks)
name_of_file_in_bucket <- 'final_diabetes_ptsd.csv'

########################################################################
##
################# DON'T CHANGE FROM HERE ###############################
##
########################################################################

# Get the bucket name
my_bucket <- Sys.getenv('WORKSPACE_BUCKET')

# Copy the file from the bucket to the current workspace
system(paste0("gsutil cp ", my_bucket, "/data/", name_of_file_in_bucket, " ."), intern=T)

# Load the file into a dataframe
my_dataframe <- read_csv(name_of_file_in_bucket)


Rows: 9974 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, race, ethnicity, ptsd_doctor, ptsd_treatment, diabetes
dbl (2): person_id, age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# Inspect data

**Step 1: Use head() to see the first 6 rows of data**

In [3]:
head(my_dataframe)

|person_id |age     |gender  |race    |ethnicity          |ptsd_doctor |ptsd_treatment |diabetes |
|:---------|:-------|:-------|:-------|:------------------|:-----------|:--------------|:--------|
|`<dbl>`   |`<dbl>` |`<chr>` |`<chr>` |`<chr>`            |`<chr>`     |`<chr>`        |`<chr>`  |
|6778627   |41      |Male    |Asian   |Hispanic or Latino |Yes         |Yes            |No       |
|1358081   |34      |Female  |Asian   |Hispanic or Latino |Yes         |Yes            |No       |
|2684337   |50      |Female  |Asian   |Hispanic or Latino |Yes         |Yes            |No       |
|2057156   |34      |Female  |Asian   |Hispanic or Latino |No          |No             |No       |
|2991522   |27      |Female  |Asian   |Hispanic or Latino |No          |No             |No       |
|9986364   |34      |Female  |Asian   |Hispanic or Latino |Yes         |No             |No       |


**Step 2: Use str() to see the structure of the dataset**

In [4]:
str(my_dataframe)

spc_tbl_ [9,974 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ person_id     : num [1:9974] 6778627 1358081 2684337 2057156 2991522 ...
 $ age           : num [1:9974] 41 34 50 34 27 34 46 31 37 36 ...
 $ gender        : chr [1:9974] "Male" "Female" "Female" "Female" ...
 $ race          : chr [1:9974] "Asian" "Asian" "Asian" "Asian" ...
 $ ethnicity     : chr [1:9974] "Hispanic or Latino" "Hispanic or Latino" "Hispanic or Latino" "Hispanic or Latino" ...
 $ ptsd_doctor   : chr [1:9974] "Yes" "Yes" "Yes" "No" ...
 $ ptsd_treatment: chr [1:9974] "Yes" "Yes" "Yes" "No" ...
 $ diabetes      : chr [1:9974] "No" "No" "No" "No" ...
 - attr(*, "spec")=
  .. cols(
  ..   person_id = col_double(),
  ..   age = col_double(),
  ..   gender = col_character(),
  ..   race = col_character(),
  ..   ethnicity = col_character(),
  ..   ptsd_doctor = col_character(),
  ..   ptsd_treatment = col_character(),
  ..   diabetes = co
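
The column types reported by *str()* can also be declared when the file is read, which quiets the *Column specification* message and guards against a column being guessed differently in a later export. The sketch below is a minimal illustration and not part of the *All of Us* snippet: it writes a hypothetical two-row file (`demo_file`, with made-up values) and reads it into `demo_df` with an explicit `col_types`.

```r
library(readr)

# Hypothetical stand-in for final_diabetes_ptsd.csv (made-up rows, same column names)
demo_file <- tempfile(fileext = ".csv")
writeLines(c(
  "person_id,age,gender,race,ethnicity,ptsd_doctor,ptsd_treatment,diabetes",
  "1,41,Male,Asian,Hispanic or Latino,Yes,Yes,No",
  "2,34,Female,Asian,Not Hispanic or Latino,PMI: Skip,No,Yes"
), demo_file)

# Declare the two numeric columns; read every other column as character
demo_df <- read_csv(
  demo_file,
  col_types = cols(
    person_id = col_double(),
    age       = col_double(),
    .default  = col_character()
  )
)

# Show the column specification that was applied
spec(demo_df)
```

With the real file, the same `col_types` argument could be added to the `read_csv()` call in the snippet, if you are comfortable editing below the *DON'T CHANGE* banner.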

**Step 3: Make a quick table using *dplyr* and *tidyr* functions to see a summary of the values in each column**

In [5]:
library(dplyr)

# Quick summary of all categorical variables
my_dataframe %>%
  select(gender, race, ethnicity, ptsd_doctor, ptsd_treatment, diabetes) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  count(variable, value) %>%
  arrange(variable)

|variable       |value                  |n       |
|:--------------|:----------------------|:-------|
|`<chr>`        |`<chr>`                |`<int>` |
|diabetes       |No                     |9315    |
|diabetes       |Yes                    |659     |
|ethnicity      |Hispanic or Latino     |563     |
|ethnicity      |Not Hispanic or Latino |9411    |
|gender         |Female                 |8384    |
|gender         |Male                   |1590    |
|ptsd_doctor    |No                     |3788    |
|ptsd_doctor    |PMI: Skip              |35      |
|ptsd_doctor    |Yes                    |6151    |
|ptsd_treatment |No                     |4983    |
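
To see how the reshape-and-count step works, and to judge how rare a value such as *PMI: Skip* is, it can help to run the same pipeline on a tiny data frame and add a within-variable percentage. The `toy` data and the `percent` column below are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Made-up data with the same kind of answers as the survey columns
toy <- tibble(
  ptsd_doctor = c("Yes", "No", "PMI: Skip", "Yes"),
  diabetes    = c("No", "No", "Yes", "No")
)

toy_counts <- toy %>%
  # Stack all columns into (variable, value) pairs: one row per cell
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  # Count how often each value occurs within each variable
  count(variable, value) %>%
  # Express each count as a percentage of its own variable
  group_by(variable) %>%
  mutate(percent = round(100 * n / sum(n), 1)) %>%
  ungroup()

toy_counts
```

Because every column is stacked into a single `value` column, this approach works best when the selected columns share one type, as the six character columns do here.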


**FINDINGS: We need to remove the *PMI: Skip* responses from the survey questions, and we need to add *age groups* to improve our analysis**

## Clean data before creating our *Table 1*

**Step 1: Clean data - drop *PMI: Skip* and create a new variable called *age_group***

Note: the age range for our cohort is 25 to 65 years old.

In [6]:
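# Drop rows where either PTSD survey question was skipped, then bin age
my_table_one <- my_dataframe %>%
  filter(ptsd_doctor != "PMI: Skip" & ptsd_treatment != "PMI: Skip") %>%
  # right = FALSE makes each bin include its lower bound: [25, 35), [35, 45), ...
  # include.lowest = TRUE also places age 65 in the last bin, labeled "55-64"
  mutate(age_group = cut(age,
                         breaks = c(25, 35, 45, 55, 65),
                         labels = c("25-34", "35-44", "45-54", "55-64"),
                         right = FALSE,
                         include.lowest = TRUE))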

head(my_table_one)

|person_id |age     |gender  |race    |ethnicity          |ptsd_doctor |ptsd_treatment |diabetes |age_group |
|:---------|:-------|:-------|:-------|:------------------|:-----------|:--------------|:--------|:---------|
|`<dbl>`   |`<dbl>` |`<chr>` |`<chr>` |`<chr>`            |`<chr>`     |`<chr>`        |`<chr>`  |`<fct>`   |
|6778627   |41      |Male    |Asian   |Hispanic or Latino |Yes         |Yes            |No       |35-44     |
|1358081   |34      |Female  |Asian   |Hispanic or Latino |Yes         |Yes            |No       |25-34     |
|2684337   |50      |Female  |Asian   |Hispanic or Latino |Yes         |Yes            |No       |45-54     |
|2057156   |34      |Female  |Asian   |Hispanic or Latino |No          |No             |No       |25-34     |
|2991522   |27      |Female  |Asian   |Hispanic or Latino |No          |No             |No       |25-34     |
|9986364   |34      |Female  |Asian   |Hispanic or Latino |Yes         |No             |No       |25-34     |
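
Two details of this cleaning step are easy to miss: how *cut()* treats ages that sit exactly on a break, and what `!=` does with missing answers. The sketch below uses only base R and made-up values (`boundary_ages`, `answers`) to show both; it reuses the breaks and labels from the cell above.

```r
# Ages chosen to sit on or next to the bin edges (made-up values)
boundary_ages <- c(24, 25, 34, 35, 54, 55, 64, 65, 66)

binned <- cut(boundary_ages,
              breaks = c(25, 35, 45, 55, 65),
              labels = c("25-34", "35-44", "45-54", "55-64"),
              right = FALSE,          # bins are closed on the left: [25, 35), ...
              include.lowest = TRUE)  # with right = FALSE, this closes the last bin at 65

# Ages below 25 or above 65 fall outside every bin and become NA;
# age 65 lands in the bin labeled "55-64"
data.frame(age = boundary_ages, age_group = binned)

# Comparing a missing answer with != gives NA rather than TRUE
answers <- c("Yes", "PMI: Skip", NA)
answers != "PMI: Skip"
```

Since `filter()` keeps only rows where the condition is `TRUE`, any rows with a missing *ptsd_doctor* or *ptsd_treatment* value would be dropped along with the *PMI: Skip* rows. Comparing `nrow(my_dataframe)` with `nrow(my_table_one)` is a quick way to see how many rows were removed in total.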


**Step 2: Install and load the *tableone* and *kableExtra* packages**

In [7]:
#install.packages("tableone")
library(tableone)

#install.packages("kableExtra")
library(kableExtra)


Attaching package: ‘kableExtra’


The following object is masked from ‘package:dplyr’:

    group_rows




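**Step 3: Create our *Table 1* using the *CreateTableOne()* and *kableone()* functions from the *tableone* package**

*kable* is short for **K**nitr T**able**: *kableone()* formats our Table 1 with *knitr*'s *kable()* function, which produces a nicely formatted table.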

In [8]:
# Define variable types
catVars <- c("age_group", "gender", "race", "ethnicity", "ptsd_doctor", "ptsd_treatment")
contVars <- c("age")

# Create stratified Table 1
table1_stratified <- CreateTableOne(vars = c(contVars, catVars),
                                   strata = "diabetes",
                                   data = my_table_one,
                                   factorVars = catVars)


kableone(table1_stratified, caption = "Table 1. Baseline Characteristics by Diabetes Status")



Table: Table 1. Baseline Characteristics by Diabetes Status

|                                       |No           |Yes          |p      |test |
|:--------------------------------------|:------------|:------------|:------|:----|
|n                                      |9246         |650          |       |     |
|age (mean (SD))                        |39.09 (6.88) |42.24 (6.10) |<0.001 |     |
|age_group (%)                          |             |             |<0.001 |     |
|25-34                                  |2567 (27.8)  |88 (13.5)    |       |     |
|35-44                                  |4216 (45.6)  |275 (42.3)   |       |     |
|45-54                                  |2463 (26.6)  |287 (44.2)   |       |     |
|gender = Male (%)                      |1456 (15.7)  |120 (18.5)   |0.076  |     |
|race (%)                               |             |             |<0.001 |     |
|Asian                                  |191 ( 2.1)   |11 ( 1.7)    |       |     |
|Black or Afr
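
The *test* column is blank, which suggests the package defaults were used. As far as the *tableone* documentation describes, the default for categorical variables is `chisq.test()` with continuity correction. As a rough cross-check rather than a reproduction of the package internals, the gender row can be rebuilt from the counts printed above, assuming gender has only the two levels (Female and Male) seen in the earlier summary. The object name `gender_counts` is made up for this sketch.

```r
# 2 x 2 table rebuilt from the printed Table 1:
# n = 9246 (No) and 650 (Yes); Male = 1456 (No) and 120 (Yes)
gender_counts <- matrix(
  c(9246 - 1456, 1456,   # diabetes = No:  Female, Male
    650 - 120,   120),   # diabetes = Yes: Female, Male
  nrow = 2,
  dimnames = list(gender = c("Female", "Male"),
                  diabetes = c("No", "Yes"))
)

# Column percentages; the Male row corresponds to the "gender = Male (%)" row
round(100 * prop.table(gender_counts, margin = 2), 1)

# Chi-squared test; continuity correction is the base R default for 2 x 2 tables
gender_test <- chisq.test(gender_counts)
gender_test$p.value
```

The p-value from this call can be compared with the 0.076 reported in the gender row. A small p-value in a Table 1 describes how the groups differ at baseline; it does not by itself say anything about cause.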

**Step 4: Save your new table to your Jupyter Notebook files to download locally later**

Go to *File*, then *Open* to find the saved file.

**NOTE: This saves your table as a *Text* file (.txt)**

Other file extensions you can use:
* .md
* .markdown
* .Rmd

In [9]:
# Save Table 1 to your Jupyter Notebook files
kable_output <- kableone(table1_stratified, 
                        caption = "Table 1. Baseline Characteristics by Diabetes Status")

# Save as a plain-text (.txt) file
kable_output %>%
  save_kable("table1_baseline_characteristics.txt")