-
Notifications
You must be signed in to change notification settings - Fork 7
/
tagged_na_usage.Rmd
101 lines (83 loc) · 4.04 KB
/
tagged_na_usage.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
---
title: "missing data (tagged_na)"
output: html_document
vignette: >
%\VignetteIndexEntry{How to use tagged_na}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Missing data (`NA` --> `tagged_na`)
Base R supports only one type of NA ('not available') to represent missing
values. The CCHS, however, has several types of missing data or incomplete
category responses. `cchsflow` uses the `haven` package [`tagged_na()`](https://haven.tidyverse.org/reference/tagged_na.html)
to allow multiple missing data categories. `tagged_na()` adds an addition
character to the NA value, thereby allowing users to define additional missing
data types. `tagged_na()` applies only for numeric values, as character base
values can use any string to represent NA or missing data.
## CCHS approach to coding missing data
CCHS category values can change across variables and survey
cycles, but the most common values are:
| CCHS category value | Label |
|----------------|--------------|
| 6 | not applicable |
| 7 | don't know |
| 8 | refusal |
| 9 | not stated |
`cchsflow` recodes the four CCHS category missing data categories values into
two NA values that are commonly used for most studies. NA(a) = 'not applicable'
and NA(b) = 'missing'. Furthermore, variables may be entirely missing from a
specific cycle, which results in 'not asked' missing data that is not included
or coded in CCHS surveys, recoded to NA(a)
**Summary of `tagged_na` values and their corresponding CCHS category values.**
| cchsflow `tagged_na` | CCHS category value | Label |
|----------------|--------------|-----------------|
| NA(a) | 6 | not applicable |
| NA(b) | 7 | don't know |
| NA(b) | 8 | refusal |
| NA(b) | 9 | not stated |
| NA(c) | | question not asked in the survey cycle |
## Example `haven::tagged_na()`
```{r}
library(haven)
x <- c(1:5, tagged_na("a"), tagged_na("b"))
# Is used to read the tagged NA in most other functions they are still viewed as NA
na_tag(x)
```
```{r}
# Is used to print the na as well as their tag
print_tagged_na(x)
```
```{r}
# Tagged NA's work identically to regular NAs
x
```
```{r}
is.na(x)
```
## Using tagged_na for creating derived variables
See the [derived variables](derived_variables.html) and
[how to add variables](how_to_add_variables.html) for details on derived
variables and how to add them.
When creating derived variables from CCHS variables, it is important to
distinguish the different NA values. Certain derived variables include the use
of variables that may not be applicable to respondents. For example,
[smoking pack-years](https://big-life-lab.github.io/cchsflow/reference/pack_years_fun.html)
involves the use of smoking variables that may not be applicable to all CCHS
respondents (i.e. non-smokers who have never smoked). In this scenario,
respondents who had values of `NA(a)` for the various base smoking variables
would have a value of `NA(a)` for smoking pack-years. Respondents who had
identified as having smoked at one point in their lives, but had values of
`NA(b)` in the base smoking variables would have a value of `NA(b)` for smoking
pack-years as pack-years cannot be calculated due to missing values.
On the other hand, there are certain derived variables that use variables that
are applicable to all respondents. For example, [BMI](https://big-life-lab.github.io/cchsflow/reference/bmi_fun.html) uses CCHS
height and weight variables which are asked to all CCHS respondents. In this
scenario, all NA variables would be tagged as `NA(b)` as these variables are
applicable to everyone, and respondents with NA values would be classified as
missing.
When creating deriving variables, it is important to examine the base CCHS
variables to check for the presence of `NA(a)` and `NA(b)`. This can be done
by reviewing the [CCHS documentation](https://osf.io/g6d84/).