# Data Workflow

## Scenario

We are part of a research team studying whether the average number of hours an individual sleeps each night is associated with the development of hypertension in adults aged 40 years or older. The research team believes that the following factors could obscure such an association and should therefore be adjusted for:

* Average daily alcohol consumption `[alcohol_day]`
* Regular use of marijuana `[regular_marij]`
* Is participant currently a smoker `[smoke_now]`
* Is participant regularly physically active `[phys_active]`
* BMI Group (underweight, healthy, overweight, obese) `[bmi]`

To answer this research question, we will use data from the US National Health and Nutrition Examination Survey (NHANES), provided in the data file `sleep.csv`.

```{note}
Statistical methods for survey data are beyond the scope of this R workshop, so we will treat these data as if they were collected from randomly selected individuals.
```

# R Set-up

There are a few things you need to do every time you open R:

* Install/Load packages
* Read in data (import data into R)


## Install/Load Packages



We **install** packages in R once. We **load** the packages we are going to use every time we open/start R. We recommend that you write the installation/load code at the top of your code script. This makes it easy to see at a glance which packages you used.

```{note}
`#` is used to indicate a comment in R. This is a line which R will skip over and not try to run as code.
```

In [4]:
# Install your packages (first use only)
# install.packages("tidyverse")
# install.packages("janitor")
# install.packages("gtsummary")

# Load your packages
library(tidyverse)
library(janitor)
library(gtsummary)
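
If you are unsure whether a package has already been installed, one option is to check before loading it. The following is a minimal sketch using base R's `requireNamespace()`; the names `pkgs`, `installed` and `missing_pkgs` are illustrative only, and the install line is left commented out so nothing is installed without you choosing to.

```r
# Packages used in this workshop
pkgs <- c("tidyverse", "janitor", "gtsummary")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Report anything that still needs installing
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)
}
```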

## Load in your Data



There are many ways to read data into R, because there are many ways to save and store data. One of the more common formats is CSV (comma-separated values).

```{note}
When working with R, you need to know which directory (folder) you are working in so that you can point correctly to the file `sleep.csv`.
```
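
Before reading the file, it can help to confirm where R is looking. The sketch below uses only base R functions; `csv_found` is an illustrative name, and the code assumes `sleep.csv` is meant to sit directly in your working directory.

```r
# Show the folder R is currently working in
getwd()

# Check whether sleep.csv is in that folder (TRUE or FALSE)
csv_found <- file.exists("sleep.csv")
csv_found

# List the CSV files R can see in the working directory
list.files(pattern = "\\.csv$")
```

If `csv_found` is `FALSE`, either move the file into the working directory or give `read_csv()` the full path to the file.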

In [None]:
# Read in your data and assign this to a data frame called sleep
sleep <- read_csv("sleep.csv")

Notice the message R prints out. It indicates the delimiter (,), the type of each column, and the row and column count.

```{note}
By default, `read_csv()` uses the first row of the .csv file as the column names.
```
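
To see this behaviour in isolation, here is a small sketch that writes a made-up two-row CSV to a temporary file and reads it back. The file contents and the names `toy_path`, `toy` and `toy_raw` are hypothetical and are not part of the workshop data.

```r
library(readr)  # part of the tidyverse

# Write a tiny made-up CSV to a temporary file (not the workshop data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,sleep_hrs", "1,7", "2,6"), toy_path)

# Default: the first row supplies the column names
toy <- read_csv(toy_path, show_col_types = FALSE)
names(toy)

# With col_names = FALSE the first row is read as data instead
toy_raw <- read_csv(toy_path, col_names = FALSE, show_col_types = FALSE)
names(toy_raw)
```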

# Data Wrangling

## Viewing your data

It is generally **good practice** to look at your data to scan for things that might need fixing, for example:

* Text in numerical cells (e.g. “six” vs 6)
* Odd characters (e.g. >, <, /, etc.)
* Varied date formats (e.g. 01/16/2024, 16-1-24, Jan 16 2024, etc.)
* Coded variables (e.g. 99 = Missing)
* Typos (e.g. ys, yse)
* Variable capitalization or abbreviations (e.g. Yes, Y, yes)
* Mismatched variable types (e.g. numbers stored as text)

However, it is **bad practice** to look at your data for patterns or to see what research questions you should ask and test.

There are a couple of different ways you can look at your data:

* Glimpse shows each column's name, type, and first few entries:


In [None]:
glimpse(sleep)

* Head (Tail) shows the first (last) *n* rows:

In [None]:
head(sleep, n=2)
tail(sleep, n=3)

* View, which works in RStudio, opens the whole data file in a separate panel. To do this, either type `View(sleep)` or click on `sleep` in the Environment panel of RStudio.
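
Several of the problems listed at the start of this section, such as typos, inconsistent capitalization and coded missing values, can also be spotted by tabulating a column or checking its range. The sketch below uses a made-up data frame called `toy`; its values are invented for illustration and do not come from `sleep.csv`.

```r
library(dplyr)

# Made-up data with deliberate problems (not the workshop data)
toy <- tibble(
  smoke_now = c("Yes", "yes", "Y", "ys", "No", NA),
  sleep_hrs_night = c(7, 6, 99, 8, 5, 7)
)

# Tabulate a text column: typos and mixed capitalization appear as separate rows
smoke_counts <- count(toy, smoke_now)
smoke_counts

# Check the range of a numeric column: a coded value such as 99 stands out
sleep_range <- range(toy$sleep_hrs_night, na.rm = TRUE)
sleep_range
```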

## Data Cleaning and Preparation

### Clean Column Names

A good first step is to clean the column names, in case there are any odd symbols or mixed cases. The quickest way to do this is with the `clean_names()` function from the `janitor` package.

In [None]:
# Clean the column names and reassign the result to the data frame sleep
sleep <- clean_names(sleep)
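
To see what `clean_names()` does, it can help to compare column names before and after on a small example. The data frame `messy` below is made up for illustration; the actual column names in `sleep.csv` may look different.

```r
library(janitor)

# Made-up data frame with messy column names (not the workshop data)
messy <- data.frame(
  "Sleep Hrs Night" = c(7, 6),
  "BMI" = c("healthy", "obese"),
  check.names = FALSE
)

names(messy)                # names before cleaning
tidy <- clean_names(messy)
names(tidy)                 # names after cleaning
```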

### Trimming Data

It is often practical to trim your data set down to only the necessary columns (variables). We can easily do this using the `select()` function. We will keep only the variables mentioned in the Scenario.

*Note: it is best practice to preserve the original data frame, so when we trim we assign the result to a new data frame.*

In [None]:
# Trim sleep to select columns only and assign this to a new data frame sleep_trim
sleep_trim <- sleep %>%
  select(id,
    gender,
    age,
    hypertension,
    sleep_hrs_night,
    phys_active,
    smoke_now,
    alcohol_day,
    regular_marij,
    bmi)
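
The sketch below repeats the same pattern on a made-up data frame to show that assigning the result to a new name leaves the original untouched. The data frame `toy` and its column `extra_column` are hypothetical stand-ins for `sleep`.

```r
library(dplyr)

# Made-up data frame standing in for sleep (not the workshop data)
toy <- tibble(
  id = 1:3,
  age = c(45, 52, 61),
  bmi = c("healthy", "obese", "overweight"),
  extra_column = c("a", "b", "c")
)

# Keep only the columns we need and store the result under a new name
toy_trim <- toy %>%
  select(id, age, bmi)

names(toy_trim)  # trimmed data frame
names(toy)       # original is unchanged
```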