Vignette for harmonizing full dataset #126

Open: wants to merge 5 commits into base: dev

kittychenn (Collaborator) commented

This vignette provides examples of how to harmonize variables using the full dataset. It's a good starting point to harmonize variables, but users can also implement different pipeline tools to harmonize data. Should we include pipeline methods in the example or leave it as is?

kittychenn added the enhancement (New feature or request) label on Sep 11, 2023
kittychenn changed the title from "Vignette for harmonize data" to "Vignette for harmonizing full dataset" on Sep 11, 2023
kittychenn added this to the V2.2 milestone on Sep 11, 2023

@JuanLiOHRI JuanLiOHRI left a comment

I think this vignette is clear. Only one small change from me: the current merged dataset is saved as .RData and I need an extra step to save it as .rds so I can work with it in targets. Not sure if we should just save it as .rds.
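
For anyone hitting the same extra step, here is a minimal base R sketch of the .RData to .rds conversion. The object name `cchs_merged`, its columns, and the temporary file paths are placeholders for illustration, not names used by the vignette.

```r
# Hypothetical stand-in for the merged dataset saved by the vignette.
cchs_merged <- data.frame(id = 1:3, sex = c(1, 2, 1))

# Temporary files stand in for the real .RData and .rds paths.
rdata_path <- tempfile(fileext = ".RData")
rds_path <- tempfile(fileext = ".rds")
save(cchs_merged, file = rdata_path)

# load() restores objects under their saved names, so load into a
# separate environment to avoid overwriting anything in the workspace.
env <- new.env()
loaded_names <- load(rdata_path, envir = env)

# saveRDS() stores a single object without its name; readRDS() returns it,
# so the caller chooses the name on the way back in.
saveRDS(env[[loaded_names[1]]], file = rds_path)
restored <- readRDS(rds_path)
```

Because `readRDS()` returns the object instead of assigning it into the workspace, a single-object .rds file is easier to read back under a name of the caller's choosing.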

Contributor

@yulric yulric left a comment

  1. Can you tell me what command you use to build and view the vignettes?
  2. I would move one of these code samples to the examples on the front page. I think it's an important use case for users. I would keep this vignette, though, and link to it from the home page example.


## Introduction

This vignette explains how you can transform variables across multiple CCHS datasets using the full datasets to the _cchsflow_ package. The full PUMF datasets can be found [here](https://odesi.ca/). A full harmonized dataset of all cchsflow variables
Contributor

Is it possible to link to the actual dataset on odesi?

Collaborator Author

There are a couple of CCHS cycles on odesi, but I wasn't sure whether to add links to each individual cycle in this vignette or have the general link.

Contributor

It looks like odesi updated their website (https://odesi.ca/en/browse). Maybe just add text that says to go to this link and search for the cycle you want?

vignettes/how_to_harmonize.Rmd (outdated, resolved)
vignettes/how_to_harmonize.Rmd (resolved)
To show outputs in first chunk only, fix 2011 and 2012 outputs, use sample data
yulric (Contributor) commented Oct 13, 2023

Sorry @kittychenn, I think you missed these comments,

  1. Can you tell me what command you use to build and view the vignettes?
  2. I would move one of these code samples to the examples on the front page. I think it's an important use case for users. I would keep this vignette, though, and link to it from the home page example.

yulric (Contributor) commented Oct 13, 2023

@reikookamoto Adding you to this PR on Doug's suggestion; it's good to get an "outside" perspective on this feature. For some reason I can't add you as a reviewer (you don't show up when I search your username). Maybe it's because you're not part of the team? In any case, I sent an invitation to join the GitHub team.

reikookamoto (Collaborator) commented Oct 13, 2023 via email

yulric requested review from DougManuel and reikookamoto and removed request for DougManuel on October 13, 2023 18:53
yulric (Contributor) commented Oct 13, 2023

Adding you to this PR on Doug's suggestion; it's good to get an "outside" perspective on this feature. For some reason I can't add you as a reviewer (you don't show up when I search your username). Maybe it's because you're not part of the team? In any case, I sent an invitation to join the GitHub team. - Yulric

Can you try adding me as a reviewer now? Reiko

Should be there now. I think it was because you weren't a collaborator on the repo, not because you weren't on the team.

kittychenn (Collaborator, Author) commented

@yulric I used knit to HTML to build and view the vignettes. The code sample is also available on the 'Get Started' page, so should I include it on the main page too?

Collaborator

@reikookamoto reikookamoto left a comment

Left my comments from an "outside" perspective @yulric


## Introduction

This vignette explains how you can transform variables across multiple CCHS datasets using the full datasets to the _cchsflow_ package. The full PUMF datasets can be found [here](https://odesi.ca/). A full harmonized dataset of all _cchsflow_ variables
Collaborator

Consider writing out the first instances of acronyms like CCHS and PUMF in full.


## Introduction

This vignette explains how you can transform variables across multiple CCHS datasets using the full datasets to the _cchsflow_ package. The full PUMF datasets can be found [here](https://odesi.ca/). A full harmonized dataset of all _cchsflow_ variables
Collaborator

Suggested change
- This vignette explains how you can transform variables across multiple CCHS datasets using the full datasets to the _cchsflow_ package. The full PUMF datasets can be found [here](https://odesi.ca/). A full harmonized dataset of all _cchsflow_ variables
+ This vignette explains how you can transform variables across multiple Canadian Community Health Survey (CCHS) cycles using complete datasets with the _cchsflow_ package. The Public Use Microdata Files (PUMF) containing the complete data can be found [here](https://odesi.ca/). A full harmonized dataset of all _cchsflow_ variables

I'm not sure if I've correctly described the relationship between CCHS and PUMF, but something like this would provide more context to someone new to this area of study.

This vignette explains how you can transform variables across multiple CCHS datasets using the full datasets to the _cchsflow_ package. The full PUMF datasets can be found [here](https://odesi.ca/). A full harmonized dataset of all _cchsflow_ variables
can be found [here](https://osf.io/j5wgu). With the original PUMF datasets, data file should be renamed such that it specifies the survey and cycle year, which follows the format of the _p sample data (ex. cchs2001_p, cchs2013_2014_p).

To harmonize the data files, the `rec_with_table()` function is used to transform the indicated variables.
Collaborator

Suggested change
- To harmonize the data files, the `rec_with_table()` function is used to transform the indicated variables.
+ To harmonize the data files, the `cchsflow::rec_with_table()` function is used to transform the indicated variables.

I know eventually we want users to use recodeflow::rec_with_table(), but, for the time being, we could specify the package name to avoid confusion.


## How to combine a single variable across multiple cycles

In this example, the sex variable from 2001 to 2018 CCHS datasets will be transformed and labeled using `rec_with_table()`, which is then combined into one dataset and labeled using `merge_rec_data()`.
Collaborator

I'm a little confused as to why we're harmonizing this variable from 2001 to 2018 when, in the previous section, users were advised not to harmonize data from cycles before 2014 with cycles from 2015 onwards.


### Option 1: Using _cchsflow_ variable_details sheet

When the variable argument in `rec_with_table()` is not specified, all variables listed in `variables.csv` and `variable_details.csv` will be transformed. In this example, all variables from the _cchsflow_ `variables.csv` and `variable_details.csv` sheets from 2001 to 2018 CCHS datasets will be transformed and labeled using `rec_with_table()`, which is then combined into one dataset and labeled using `merge_rec_data()`.
Collaborator

Where will variables.csv and variable_details.csv be on the user's computer when they install/load the package (i.e., expected file path)?

Contributor

@yulric yulric Nov 22, 2023

The sheets will be in the inst/extdata folder. `rec_with_table()` uses the sheets from that folder if the user does not pass in those parameters.
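
A short sketch of how a user could locate those files after installation, assuming the sheets ship under inst/extdata as described above. The contents of inst/ are copied to the top level of an installed package, so the installed path has no `inst` component. `system.file()` is base R and returns an empty string when the package or file is not found, so this runs whether or not _cchsflow_ is installed.

```r
# Look up the installed location of each sheet. The file names come from
# this thread; their presence in a given cchsflow version is an assumption.
sheet_names <- c("variables.csv", "variable_details.csv")
sheet_paths <- vapply(
  sheet_names,
  function(f) system.file("extdata", f, package = "cchsflow"),
  character(1)
)

# An empty string means the package or the file could not be found.
for (f in sheet_names) {
  if (nzchar(sheet_paths[[f]])) {
    message(f, " found at: ", sheet_paths[[f]])
  } else {
    message(f, " not found; is cchsflow installed?")
  }
}
```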


### Option 2: Using your own variable_details sheet

In this example, all variables from personalized `variables.csv` and `variable_details.csv` sheets from 2001 to 2018 CCHS datasets will be transformed and labeled using `rec_with_table()`, which is then combined into one dataset and labeled using `merge_rec_data()`.
Collaborator

I would consider showing the relationship between variables.csv and sample_variables and variable_details.csv and sample_variable_details. Is the user expected to do something like sample_variables <- readr::read_csv('variables.csv') in their workspace before using the personalized spreadsheets?
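
To make the question concrete, here is a sketch of the step being asked about, using base R's `read.csv()` in place of `readr::read_csv()` to avoid an extra dependency. The CSV contents and column names below are made up so the example runs anywhere; a real `variables.csv` would follow the _cchsflow_ sheet layout, which is not reproduced here.

```r
# Write a tiny stand-in CSV; in practice this would be the user's own
# variables.csv. The columns and values are illustrative only.
csv_path <- file.path(tempdir(), "variables.csv")
writeLines(c("variable,label", "example_var,Example label"), csv_path)

# Read the personalized sheet into the workspace as a data frame.
sample_variables <- read.csv(csv_path, stringsAsFactors = FALSE)
print(sample_variables)
```

Whether the vignette expects users to create `sample_variables` and `sample_variable_details` this way is exactly the open question; the thread does not confirm it.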

…so that pkgdown would not complain about including them in the references when building the documentation website
yulric (Contributor) commented Nov 22, 2023

@kittychenn Sorry about getting back so late. All of Reiko's suggestions look good, can you address them?

In addition I was building the website using the following commands,

devtools::document()
pkgdown::build_site()

and I'm getting an error in the getting_started.Rmd vignette. Are you able to reproduce it?

Finally, I pushed some commits to fix some of the website build issues.

Labels
enhancement New feature or request
4 participants