Merge branch 'main' of github.com:CoryMcCartan/causaltbl

CoryMcCartan · Mar 26, 2023 · 15f196c
2 parents 4621854 + 0fc842f
commit 15f196c

Showing 6 changed files with 276 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -10,3 +10,4 @@ pkgdown
 *.tmp
 *.bak
 *.swp
+inst/doc
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -18,6 +18,8 @@ Imports:
     stats
 Suggests: 
     dplyr,
+    knitr,
+    rmarkdown,
     testthat (>= 3.0.0)
 License: MIT + file LICENSE
 Encoding: UTF-8
@@ -27,3 +29,4 @@ Config/testthat/edition: 3
 URL: https://github.com/CoryMcCartan/causaltbl,
     http://corymccartan.com/causaltbl/
 BugReports: https://github.com/CoryMcCartan/causaltbl/issues
+VignetteBuilder: knitr
diff --git a/README.Rmd b/README.Rmd
@@ -34,3 +34,71 @@ You can install the development version of causaltbl from [GitHub](https://githu
 # install.packages("remotes")
 remotes::install_github("CoryMcCartan/causaltbl")
 ```
+
+## Using `causaltbl`
+
+A causal tibble, `causal_tbl`, is a data frame with attributes identifying which columns correspond to common inputs in causal inference analyses. At the most basic level, you can indicate the outcome and treatment columns. For more involved analyses, `causal_tbl`s can keep track of additional columns including multiple outcomes and multiple treatments.
+
+The primary entryway to `causaltbl` is through `tidycausal`. <!--- [`tidycausal`](https://corymccartan.com/tidycausal/) -->
+You can also create a `causal_tbl` directly via `causal_tbl()`.
+
+Suppose we have data from a simple difference-in-differences design. Our data looks like this:
+
+```{r}
+df <- data.frame(
+  id = c("a", "a", "a", "a", "b", "b", "b", "b"),
+  year = rep(2015:2018, 2),
+  trt = c(0, 0, 0, 0, 0, 0, 1, 1),
+  y = c(1, 3, 2, 3, 2, 4, 4, 5)
+)
+```
+
+There are two units (`id`), `a` and `b`. We have 4 yearly observations from 2015 to 2018 (`year`) for each unit. `a` is never treated and `b` is treated in 2017 and 2018 (`trt`). Some outcome (`y`) is measured yearly.
+
+We can first make a `causal_tbl` by passing `df` to `causal_tbl()`. We don't need to specify any options.
+
+```{r}
+library(causaltbl)
+did <- causal_tbl(df)
+```
+
+Now `did` is a `causal_tbl` version of `df`.
+
+```{r}
+did
+```
+
+To set the outcome, we can use the corresponding function, `set_outcome()`. `causal_tbl` uses tidy evaluation, so we can use the bare column name.
+
+```{r}
+did <- did |>
+    set_outcome(outcome = y)
+did
+```
+
+Similarly, we can indicate that `did` has a treatment column `trt` and a panel structure indexed by `id` and `year` with the corresponding `set_treatment()` and `set_panel()` functions.
+
+```{r}
+did <- did |>
+    set_treatment(treatment = trt) |>
+    set_panel(unit = id, time = year)
+did
+```
+
+These calls set attributes that other packages use downstream. We can retrieve them with the corresponding getters. For the outcome, `get_outcome()`:
+
+```{r}
+get_outcome(did)
+```
+For the treatment, `get_treatment()`:
+```{r}
+get_treatment(did)
+```
+
+And for the panel structure, `get_panel()`:
+```{r}
+get_panel(did)
+```
+
+For more information on using `causal_tbl`s or designing functions that use `causal_tbl`s, see the Advanced `causal_tbl` vignette.
+
diff --git a/README.md b/README.md
@@ -24,3 +24,134 @@ You can install the development version of causaltbl from
 # install.packages("remotes")
 remotes::install_github("CoryMcCartan/causaltbl")
 ```
+
+## Using `causaltbl`
+
+A causal tibble, `causal_tbl`, is a data frame with attributes
+identifying which columns correspond to common inputs in causal
+inference analyses. At the most basic level, you can indicate the
+outcome and treatment columns. For more involved analyses, `causal_tbl`s
+can keep track of additional columns including multiple outcomes and
+multiple treatments.
+
+The primary entryway to `causaltbl` is through `tidycausal`.
+<!--- [`tidycausal`](https://corymccartan.com/tidycausal/) --> You can also
+create a `causal_tbl` directly via `causal_tbl()`.
+
+Suppose we have data from a simple difference-in-differences design.
+Our data looks like this:
+
+``` r
+df <- data.frame(
+  id = c("a", "a", "a", "a", "b", "b", "b", "b"),
+  year = rep(2015:2018, 2),
+  trt = c(0, 0, 0, 0, 0, 0, 1, 1),
+  y = c(1, 3, 2, 3, 2, 4, 4, 5)
+)
+```
+
+There are two units (`id`), `a` and `b`. We have 4 yearly observations
+from 2015 to 2018 (`year`) for each unit. `a` is never treated and `b`
+is treated in 2017 and 2018 (`trt`). Some outcome (`y`) is measured
+yearly.
+
+We can first make a `causal_tbl` by passing `df` to `causal_tbl()`. We
+don’t need to specify any options.
+
+``` r
+library(causaltbl)
+did <- causal_tbl(df)
+```
+
+Now `did` is a `causal_tbl` version of `df`.
+
+``` r
+did
+#> # A <causal_tbl> [8 × 4]
+#>                          
+#>   id     year   trt     y
+#>   <chr> <int> <dbl> <dbl>
+#> 1 a      2015     0     1
+#> 2 a      2016     0     3
+#> 3 a      2017     0     2
+#> 4 a      2018     0     3
+#> 5 b      2015     0     2
+#> 6 b      2016     0     4
+#> 7 b      2017     1     4
+#> 8 b      2018     1     5
+```
+
+To set the outcome, we can use the corresponding function, `set_outcome()`.
+`causal_tbl` uses tidy evaluation, so we can use the bare column name.
+
+``` r
+did <- did |>
+    set_outcome(outcome = y)
+did
+#> # A <causal_tbl> [8 × 4]
+#>                     [out]
+#>   id     year   trt     y
+#>   <chr> <int> <dbl> <dbl>
+#> 1 a      2015     0     1
+#> 2 a      2016     0     3
+#> 3 a      2017     0     2
+#> 4 a      2018     0     3
+#> 5 b      2015     0     2
+#> 6 b      2016     0     4
+#> 7 b      2017     1     4
+#> 8 b      2018     1     5
+```
+
+Similarly, we can indicate that `did` has a treatment column `trt` and a
+panel structure indexed by `id` and `year` with the corresponding
+`set_treatment()` and `set_panel()` functions.
+
+``` r
+did <- did |>
+    set_treatment(treatment = trt) |>
+    set_panel(unit = id, time = year)
+did
+#> # A <causal_tbl> [8 × 4]
+#>   [unit] [time] [trt] [out]
+#>   id       year   trt     y
+#>   <chr>   <int> <dbl> <dbl>
+#> 1 a        2015     0     1
+#> 2 a        2016     0     3
+#> 3 a        2017     0     2
+#> 4 a        2018     0     3
+#> 5 b        2015     0     2
+#> 6 b        2016     0     4
+#> 7 b        2017     1     4
+#> 8 b        2018     1     5
+```
+
+These calls set attributes that other packages use downstream. We can
+retrieve them with the corresponding getters. For the outcome,
+`get_outcome()`:
+
+``` r
+get_outcome(did)
+#> [1] "y"
+```
+
+For the treatment, `get_treatment()`:
+
+``` r
+get_treatment(did)
+#>     y 
+#> "trt"
+```
+
+And for the panel structure, `get_panel()`:
+
+``` r
+get_panel(did)
+#> $unit
+#> [1] "id"
+#> 
+#> $time
+#> [1] "year"
+```
+
+For more information on using `causal_tbl`s or designing functions that
+use `causal_tbl`s, see the Advanced `causal_tbl` vignette.
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/advanced.Rmd b/vignettes/advanced.Rmd
@@ -0,0 +1,71 @@
+---
+title: "Advanced `causal_tbl`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Advanced `causal_tbl`}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+This vignette provides more specific details of how `causal_tbl` objects work and how to extend them. Most users won't need to know much about `causal_tbl`s except that they (1) are extensions of `tibble`s and (2) rely on a `causal_cols` attribute that makes things "just work". The `causal_cols` attribute records which columns play the role of each causal variable, such as the outcome or the treatment. The package provides various getter and setter functions for these.
+
+This vignette covers
+(1) how `causal_cols` works internally and
+(2) how to extend the type if your model needs shiny new causal variables.
+
+```{r setup}
+library(causaltbl)
+```
+
+## Internal Design of `causal_tbl`
+
+As in the README, here we use a simple difference-in-differences example: 8 observations for 2 units, across 4 years.
+
+```{r}
+df <- data.frame(
+  id = c("a", "a", "a", "a", "b", "b", "b", "b"),
+  year = rep(2015:2018, 2),
+  trt = c(0, 0, 0, 0, 0, 0, 1, 1),
+  y = c(1, 3, 2, 3, 2, 4, 4, 5)
+)
+```
+
+Here, when we create the `causal_tbl`, we can specify the outcome and treatment directly via `.outcome` and `.treatment`.
+```{r}
+did <- causal_tbl(df, .outcome = y, .treatment = trt)
+```
+
+All causal attributes can be recovered with `causal_cols()`:
+
+```{r}
+causal_cols(did)
+```
+
+Each element is a character vector of column names in the data frame. For some variables this vector should have length 1; for others, several columns may share that type.
+
+In our case, `causal_cols()` has two elements, `outcome` and `treatment`. The outcome entry is unnamed: it is just `"y"`. The treatment entry is `"trt"` with the name `y`, which records that `trt` is automatically associated with the outcome `y`.
+
+The optional `names()` of the columns within a particular element of `causal_cols` convey information on any associated variable. For example, the treatment variable is by default associated with a particular outcome. Likewise, a propensity score or outcome model is associated with a particular treatment or outcome variable.
+
+However, you are not limited to one treatment or one outcome. For example, if a package author were developing methods for causal inference with multiple continuous treatments, the `treatment` element of `causal_cols` could have an entry for each treatment column.
+
+Once set, these column names within `causal_cols` are automatically updated if columns are renamed, or set to `NULL` if columns are dropped. This reassignment happens automatically and silently in all cases.
+
+## Extending `causal_tbl` with new `causal_cols`
+
+Now, if you need something fancy, odds are you should implement a new entry in `causal_cols`. As we saw above, the current entries can be retrieved with `causal_cols()`. They can be set using `causal_cols() <- ...`.
+
+Each new entry to `causal_cols` should be a named list, where:
+
+- the name of the list denotes, in short form, what the thing is (e.g., for propensity scores, the name should be `pscores`)
+- each entry in the list denotes one of those things
+- each name of each entry indicates what that entry corresponds to
+
+It is the responsibility of implementers of particular methods to check that a `causal_tbl` has the necessary columns set, via helpers like `has_treatment()`, `has_outcome()`, etc.
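
Note on the `causal_cols` layout described in `vignettes/advanced.Rmd`: the structure can be sketched in base R without loading the package. This is only an illustration under stated assumptions, not the package's internal code. It uses a plain list as a stand-in for the attribute and follows the named character vector shape shown in the `get_treatment()` output. The `pscore` column and the `has_entry()` helper are hypothetical.

```r
# Stand-in for the `causal_cols` attribute: a named list whose entries
# are character vectors of column names. Names inside an entry record
# the variable that the column is associated with.
cols <- list(
  outcome   = "y",          # unnamed: just the outcome column
  treatment = c(y = "trt")  # named: `trt` is tied to outcome `y`
)

# Hypothetical new entry for propensity scores, named after the
# treatment it models. A `pscore` column is assumed to exist.
cols$pscores <- c(trt = "pscore")

# Hypothetical helper in the spirit of has_treatment()/has_outcome():
# TRUE when the entry exists and lists at least one column.
has_entry <- function(cols, what) {
  length(cols[[what]]) > 0
}

has_entry(cols, "pscores")     # entry was added above
has_entry(cols, "instrument")  # entry was never set
names(cols$pscores)            # the treatment this entry belongs to
```

A method that needs propensity scores could run a check like `has_entry()` before fitting and stop with an informative error when the entry is missing.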