# Notebook File Overview

In this notebook, we import the data frames we created in the first two notebooks:
1. diabetes_control_ptsd.csv
2. diabetes_study_ptsd.csv

We will use the code snippet provided by *All of Us* to import our data frames from the workspace bucket. It looks similar to the snippet used to save data frames to the workspace bucket.

We will then process and clean the existing data frames:
* Create a new variable to designate which participants **have** and **do not have** diabetes
* Combine the two data frames into one data frame
* Rename the long PTSD survey question columns
* Calculate each participant's age from the current date and their date of birth, and store it in a new variable called *age*
* Select the columns we want in the final data frame we will analyze

# Add the code snippet from the *All of Us R and Cloud Storage snippets*

**Step 1: Run *Setup***

In [1]:
library(tidyverse)  # Data wrangling packages.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


**Step 2: Run the *copy_file_from_workspace_bucket.R* code snippet**

This will import **diabetes_control_ptsd.csv** and **diabetes_study_ptsd.csv** as data frames.

**NOTE: You will have to add another line of code to each step to get both files imported**
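
As an alternative to duplicating each line by hand, the copy commands can be generated from a vector of file names. The sketch below is illustrative only: `build_copy_command` is a hypothetical helper, and `gs://example-bucket` is a placeholder standing in for the value of `WORKSPACE_BUCKET`. It only builds the command strings, so it runs outside the workbench too.

```r
# Hypothetical helper (not part of the All of Us snippet): builds the gsutil
# command that copies one file from the bucket's "data" folder into the
# current working directory.
build_copy_command <- function(bucket, filename) {
  paste0("gsutil cp ", bucket, "/data/", filename, " .")
}

file_names <- c("diabetes_control_ptsd.csv", "diabetes_study_ptsd.csv")

# Placeholder bucket for illustration; in the workbench this would come from
# Sys.getenv("WORKSPACE_BUCKET").
example_bucket <- "gs://example-bucket"

copy_commands <- vapply(
  file_names, build_copy_command, character(1),
  bucket = example_bucket, USE.NAMES = FALSE
)
print(copy_commands)

# In the workbench, each command could then be run and each file read:
# for (cmd in copy_commands) system(cmd, intern = TRUE)
# data_frames <- lapply(file_names, readr::read_csv)
```

The snippet in the next cell takes the simpler route of repeating each line once per file, which is easier to follow with only two files.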

In [2]:
# This snippet assumes that you run setup first

# This code copies a file from your Google Bucket into a dataframe

# replace 'test.csv' with the name of the file in your google bucket (don't delete the quotation marks)
diabetes_control_ptsd <- 'diabetes_control_ptsd.csv'
diabetes_study_ptsd <- 'diabetes_study_ptsd.csv'

########################################################################
##
################# DON'T CHANGE FROM HERE ###############################
##
########################################################################

# Get the bucket name
my_bucket <- Sys.getenv('WORKSPACE_BUCKET')

# Copy the files from the bucket to the current workspace
system(paste0("gsutil cp ", my_bucket, "/data/", diabetes_control_ptsd, " ."), intern=T)
system(paste0("gsutil cp ", my_bucket, "/data/", diabetes_study_ptsd, " ."), intern=T)

# Load the file into a dataframe
diabetes_control_ptsd  <- read_csv(diabetes_control_ptsd)
diabetes_study_ptsd  <- read_csv(diabetes_study_ptsd)

head(diabetes_control_ptsd)
head(diabetes_study_ptsd)

[1mRows: [22m[34m9315[39m [1mColumns: [22m[34m7[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (6): gender, date_of_birth, race, ethnicity, Are you currently prescribe...
[32mdbl[39m (1): person_id

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1mRows: [22m[34m659[39m [1mColumns: [22m[34m7[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (6): gender, date_of_birth, race, ethnicity, Are you still seeing a doct...
[32mdbl[39m (1): person_id

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


person_id,gender,date_of_birth,race,ethnicity,Are you currently prescribed medications and/or receiving treatment for post-traumatic stress disorder (PTSD)?,Are you still seeing a doctor or health care provider for post-traumatic stress disorder (PTSD)?
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
6778627,Male,1984-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes
1358081,Female,1991-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes
2684337,Female,1975-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes
2057156,Female,1991-06-15 00:00:00 UTC,Asian,Hispanic or Latino,No,No
2991522,Female,1998-06-15 00:00:00 UTC,Asian,Hispanic or Latino,No,No
9986364,Female,1991-06-15 00:00:00 UTC,Asian,Hispanic or Latino,No,Yes


person_id,gender,date_of_birth,race,ethnicity,Are you still seeing a doctor or health care provider for post-traumatic stress disorder (PTSD)?,Are you currently prescribed medications and/or receiving treatment for post-traumatic stress disorder (PTSD)?
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1302288,Male,1976-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes
2000298,Male,1977-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes
1913213,Female,1986-06-15 00:00:00 UTC,Black or African American,Hispanic or Latino,Yes,No
7661404,Female,1984-06-15 00:00:00 UTC,Black or African American,Hispanic or Latino,Yes,Yes
2684487,Female,1991-06-15 00:00:00 UTC,Black or African American,Hispanic or Latino,No,No
3425591,Female,1978-06-15 00:00:00 UTC,Black or African American,Hispanic or Latino,Yes,Yes
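
The column-specification messages printed by `read_csv()` can be avoided by stating the column types explicitly. The sketch below uses a small made-up CSV string in place of the workspace files (the rows are invented for illustration) and declares `person_id` as a number and every other column as text, which mirrors the types readr guessed above.

```r
library(readr)

# Toy CSV text standing in for one of the workspace files (made-up rows).
toy_csv <- I(paste0(
  "person_id,gender,date_of_birth\n",
  "1,Female,1990-06-15 00:00:00 UTC\n",
  "2,Male,1980-06-15 00:00:00 UTC\n"
))

# Declaring the types up front documents what we expect and suppresses the
# column-specification message.
toy_df <- read_csv(
  toy_csv,
  col_types = cols(
    person_id = col_double(),
    .default = col_character()
  )
)

print(toy_df)
```

Passing `show_col_types = FALSE` instead would also quiet the message while still letting readr guess the types.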


# Clean and join the two dataframes

**Step 1: Add a new column called *diabetes* to each data frame, with values based on diabetes status**

We will use the mutate() function from the **dplyr** package to create new variables for each data frame.

In [3]:
library(dplyr)

diabetes_no_ptsd <- diabetes_control_ptsd %>%
  mutate(diabetes = "No")

diabetes_yes_ptsd <- diabetes_study_ptsd %>%
  mutate(diabetes = "Yes")

dim(diabetes_no_ptsd)
head(diabetes_no_ptsd)
dim(diabetes_yes_ptsd)
head(diabetes_yes_ptsd)

person_id,gender,date_of_birth,race,ethnicity,Are you currently prescribed medications and/or receiving treatment for post-traumatic stress disorder (PTSD)?,Are you still seeing a doctor or health care provider for post-traumatic stress disorder (PTSD)?,diabetes
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
6778627,Male,1984-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes,No
1358081,Female,1991-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes,No
2684337,Female,1975-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes,No
2057156,Female,1991-06-15 00:00:00 UTC,Asian,Hispanic or Latino,No,No,No
2991522,Female,1998-06-15 00:00:00 UTC,Asian,Hispanic or Latino,No,No,No
9986364,Female,1991-06-15 00:00:00 UTC,Asian,Hispanic or Latino,No,Yes,No


person_id,gender,date_of_birth,race,ethnicity,Are you still seeing a doctor or health care provider for post-traumatic stress disorder (PTSD)?,Are you currently prescribed medications and/or receiving treatment for post-traumatic stress disorder (PTSD)?,diabetes
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1302288,Male,1976-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes,Yes
2000298,Male,1977-06-15 00:00:00 UTC,Asian,Hispanic or Latino,Yes,Yes,Yes
1913213,Female,1986-06-15 00:00:00 UTC,Black or African American,Hispanic or Latino,Yes,No,Yes
7661404,Female,1984-06-15 00:00:00 UTC,Black or African American,Hispanic or Latino,Yes,Yes,Yes
2684487,Female,1991-06-15 00:00:00 UTC,Black or African American,Hispanic or Latino,No,No,Yes
3425591,Female,1978-06-15 00:00:00 UTC,Black or African American,Hispanic or Latino,Yes,Yes,Yes


**Step 2: Combine data frames and finalize dataset**

We will use several functions from the **dplyr** package to create our final analysis-ready dataset:

* Use the *bind_rows()* function to combine the diabetes_no_ptsd and diabetes_yes_ptsd data frames into one dataset
* Use the *rename()* function to give shorter, more manageable names to the long PTSD survey question columns
* Use the *mutate()* function to calculate participant age from their date of birth and create a new variable called *age*
* Use the *select()* function to reorder columns and remove the original date_of_birth column since we now have the calculated age


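We use bind_rows() instead of a join function because both data frames contain the same set of columns, so we can simply stack the rows. bind_rows() matches columns by name rather than by position, so it does not matter that the two PTSD question columns appear in a different order in each data frame. Stacking is simpler than joining and leaves less room for error.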
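
The sketch below shows how bind_rows() matches columns by name. The two toy tibbles contain made-up values and use short column names for readability; the second one deliberately lists its PTSD columns in the opposite order.

```r
library(dplyr)

# Toy data frames (invented values) with the same columns in a different order.
group_a <- tibble(
  person_id = c(1, 2),
  ptsd_treatment = c("Yes", "No"),
  ptsd_doctor = c("No", "No"),
  diabetes = "No"
)
group_b <- tibble(
  person_id = 3,
  ptsd_doctor = "Yes",
  ptsd_treatment = "No",
  diabetes = "Yes"
)

# Rows are stacked and values land in the column with the matching name;
# the column order of the first data frame is kept.
combined <- bind_rows(group_a, group_b)
print(combined)

# A quick sanity check: the group counts should add up to the total row count.
print(count(combined, diabetes))
```

On the real data, the same `count()` check should show the row counts of the two input data frames.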

In [4]:
final_diabetes_ptsd <- bind_rows(diabetes_no_ptsd, diabetes_yes_ptsd) %>%
  rename(
    ptsd_doctor = `Are you still seeing a doctor or health care provider for post-traumatic stress disorder (PTSD)?`,
    ptsd_treatment = `Are you currently prescribed medications and/or receiving treatment for post-traumatic stress disorder (PTSD)?`
  ) %>%
  mutate(
    age = floor(interval(ymd_hms(date_of_birth), today()) / years(1))
  ) %>%
  select(person_id, age, everything(), -date_of_birth)

dim(final_diabetes_ptsd)
head(final_diabetes_ptsd)

person_id,age,gender,race,ethnicity,ptsd_treatment,ptsd_doctor,diabetes
<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
6778627,41,Male,Asian,Hispanic or Latino,Yes,Yes,No
1358081,34,Female,Asian,Hispanic or Latino,Yes,Yes,No
2684337,50,Female,Asian,Hispanic or Latino,Yes,Yes,No
2057156,34,Female,Asian,Hispanic or Latino,No,No,No
2991522,27,Female,Asian,Hispanic or Latino,No,No,No
9986364,34,Female,Asian,Hispanic or Latino,No,Yes,No
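
Because the age calculation uses `today()`, the ages change depending on when the notebook is run. The sketch below repeats the same calculation against a fixed reference date, which is an assumption chosen for illustration (`reference_date` is not part of the notebook), so that the result is reproducible.

```r
library(lubridate)

# Two dates of birth in the same text format as the date_of_birth column.
date_of_birth <- c("1984-06-15 00:00:00 UTC", "1991-06-15 00:00:00 UTC")

# Assumed fixed reference date, used here in place of today().
reference_date <- ymd("2025-12-31")

# Same formula as in the cell above: the length of the interval between birth
# and the reference date, in whole years.
age <- floor(interval(ymd_hms(date_of_birth), reference_date) / years(1))
print(age)
```

Fixing the reference date in this way is one option when the analysis needs to give the same ages on every run.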


# Add the code snippet from the *All of Us R and Cloud Storage snippets*

**Run the *copy_file_to_workspace_bucket.R* code snippet**

In [5]:
# This snippet assumes that you run setup first

# This code saves your dataframe into a csv file in a "data" folder in Google Bucket

# Replace df with THE NAME OF YOUR DATAFRAME
my_dataframe <- final_diabetes_ptsd

# Replace 'test.csv' with THE NAME of the file you're going to store in the bucket (don't delete the quotation marks)
destination_filename <- 'final_diabetes_ptsd.csv'

########################################################################
##
################# DON'T CHANGE FROM HERE ###############################
##
########################################################################

# store the dataframe in current workspace
write_excel_csv(my_dataframe, destination_filename)

# Get the bucket name
my_bucket <- Sys.getenv('WORKSPACE_BUCKET')

# Copy the file from current workspace to the bucket
system(paste0("gsutil cp ./", destination_filename, " ", my_bucket, "/data/"), intern=T)

# Check if file is in the bucket
system(paste0("gsutil ls ", my_bucket, "/data/*.csv"), intern=T)
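
Before relying on the saved file, it can help to confirm that a data frame survives the write and read round trip. The sketch below does this locally with a made-up data frame and a temporary file, so it does not touch the workspace bucket; `toy_df` and `tmp_file` are illustrative names.

```r
library(readr)
library(tibble)

# Toy data frame (invented values) standing in for final_diabetes_ptsd.
toy_df <- tibble(
  person_id = c(1, 2),
  age = c(41, 34),
  diabetes = c("No", "Yes")
)

# Write with the same function the snippet uses, then read the file back.
tmp_file <- tempfile(fileext = ".csv")
write_excel_csv(toy_df, tmp_file)
read_back <- read_csv(tmp_file, show_col_types = FALSE)

# The row and column counts should be unchanged after the round trip.
print(dim(toy_df))
print(dim(read_back))
```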
