# output: github_document
# always_allow_html: true
This package of createdictionary was developed at the Neuroepidemiology Section (NES) of Laboratory of Epidemiology & Population Science (LEPS), NIA. The goal is to build a one-document dictionary from multiple datasets; the key values are extracted for each variable.
You can install the released version of createdictionary from CRAN with (This installation option is currently unavailable yet):
install.packages("createdictionary")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("LEPSNES/createdictionary")
This is a basic example which shows you how to solve a common problem:
## get folder path
df <- system.file(package = "haven")
## search all the files under the folder for sas dataset
dse <- flat2file(df, "*.sas7bdat")
## dic_value_extract_one_dataset_path uses file path as input
dse %>%
dic_value_extract_one_dataset_path()
#> # A tibble: 5 x 21
#> var_name label value_distinct mean sd largest_1 largest_2 largest_3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Sepal_L… <NA> 35 5.84… 0.82… 7.9 7.7 7.6
#> 2 Sepal_W… <NA> 23 3.05… 0.43… 4.4 4.2 4.1
#> 3 Petal_L… <NA> 43 3.758 1.76… 6.9 6.7 6.6
#> 4 Petal_W… <NA> 22 1.19… 0.76… 2.5 2.4 2.3
#> 5 Species <NA> 3 <NA> <NA> virgin versic setosa
#> # … with 13 more variables: smallest_1 <chr>, smallest_2 <chr>,
#> # smallest_3 <chr>, top1_value <chr>, top1_freq <chr>, top2_value <chr>,
#> # top2_freq <chr>, top3_value <chr>, top3_freq <chr>, num_NA <chr>,
#> # total_row <int>, file_name <chr>, dir_name <chr>
## dic_value_extract_one_dataset uses a data frame as input
dse[1] %>%
read_sas() %>%
dic_value_extract_one_dataset()
#> # A tibble: 5 x 18
#> var_name label value_distinct mean sd largest_1 largest_2 largest_3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Sepal_L… <NA> 35 5.84… 0.82… 7.9 7.7 7.6
#> 2 Sepal_W… <NA> 23 3.05… 0.43… 4.4 4.2 4.1
#> 3 Petal_L… <NA> 43 3.758 1.76… 6.9 6.7 6.6
#> 4 Petal_W… <NA> 22 1.19… 0.76… 2.5 2.4 2.3
#> 5 Species <NA> 3 <NA> <NA> virgin versic setosa
#> # … with 10 more variables: smallest_1 <chr>, smallest_2 <chr>,
#> # smallest_3 <chr>, top1_value <chr>, top1_freq <chr>, top2_value <chr>,
#> # top2_freq <chr>, top3_value <chr>, top3_freq <chr>, num_NA <chr>
## dic_value_extract_one_var process a variable each time
dse %>%
read_sas() %>%
imap_dfr( ~ dic_value_extract_one_var(.x, .y))
#> # A tibble: 5 x 18
#> var_name label value_distinct mean sd largest_1 largest_2 largest_3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Sepal_L… <NA> 35 5.84… 0.82… 7.9 7.7 7.6
#> 2 Sepal_W… <NA> 23 3.05… 0.43… 4.4 4.2 4.1
#> 3 Petal_L… <NA> 43 3.758 1.76… 6.9 6.7 6.6
#> 4 Petal_W… <NA> 22 1.19… 0.76… 2.5 2.4 2.3
#> 5 Species <NA> 3 <NA> <NA> virgin versic setosa
#> # … with 10 more variables: smallest_1 <chr>, smallest_2 <chr>,
#> # smallest_3 <chr>, top1_value <chr>, top1_freq <chr>, top2_value <chr>,
#> # top2_freq <chr>, top3_value <chr>, top3_freq <chr>, num_NA <chr>
df <- "/LSC/NES/study/CARDIA/core datasets/Y25/DATA"
## on windows, use this
## df <- "T:/LEPS/NES/study/CARDIA/core datasets/Y25/DATA"
dse <- flat2file(df, "*.sas7bdat")
dse[3] %>%
dic_value_extract_one_dataset_path()
#> # A tibble: 9 x 21
#> var_name label value_distinct mean sd largest_1 largest_2 largest_3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ID SUBJ… 3480 2653… 1132… 41681722… 41679622… 41678331…
#> 2 CENTER YEAR… 4 2.57… 1.12… 4 3 2
#> 3 HL7CRPS… C-RE… 2 <NA> <NA> F C <NA>
#> 4 HL7CRPT… C-RE… 3 <NA> <NA> S N C
#> 5 HL6CRPBN C-RE… 828 3.26… 6.26… 199 103 66.4
#> 6 HL7CRPC… C-RE… 1 1960… 0 1960-01-… <NA> <NA>
#> 7 HL7CRPR… C-RE… 1 1960… 0 1960-01-… <NA> <NA>
#> 8 HL7CRPA… C-RE… 1 1960… 0 1960-01-… <NA> <NA>
#> 9 HL6CRPF FLAG… 2 2 0 2 <NA> <NA>
#> # … with 13 more variables: smallest_1 <chr>, smallest_2 <chr>,
#> # smallest_3 <chr>, top1_value <chr>, top1_freq <chr>, top2_value <chr>,
#> # top2_freq <chr>, top3_value <chr>, top3_freq <chr>, num_NA <chr>,
#> # total_row <int>, file_name <chr>, dir_name <chr>
dse[1:3] %>%
map_dfr(dic_value_extract_one_dataset_path) %>%
slice_head(n=7)
#> # A tibble: 7 x 21
#> var_name label value_distinct mean sd largest_1 largest_2 largest_3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 short_ID <NA> 5114 2678… 1130… 41681 41679 41678
#> 2 Y15cact… <NA> 218 8.15… 97.2… 4426.505… 2292.617… 615.9132…
#> 3 Y15cact… <NA> 220 8.44… 86.7… 3530.067… 2114.105… 909.8505…
#> 4 Y15CABG <NA> 2 0 0 0 <NA> <NA>
#> 5 Y15pacer <NA> 2 0 0 0 <NA> <NA>
#> 6 Y15stent <NA> 3 0.00… 0.02… 1 0 <NA>
#> 7 Y15valve <NA> 3 0.00… 0.01… 1 0 <NA>
#> # … with 13 more variables: smallest_1 <chr>, smallest_2 <chr>,
#> # smallest_3 <chr>, top1_value <chr>, top1_freq <chr>, top2_value <chr>,
#> # top2_freq <chr>, top3_value <chr>, top3_freq <chr>, num_NA <chr>,
#> # total_row <int>, file_name <chr>, dir_name <chr>
Please check the help document for each function to see how to use them.
?flat2file
?dic_value_extract_one_dataset_path
?dic_value_extract_one_dataset
?dic_value_extract_one_var
future direction.
-
other dataset types except sas7bdat
- SPSS
- Stata
- CSV
- database table
-
handle possible file opening errors
-
handle possible value extraction errors
Hope this tool can be of somewhat help with your research.
any comments or suggestions are welcome!