---
title: "ggdotplotstats"
author: "Indrajeet Patil"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_width: 6
    fig.align: 'center'
    fig.asp: 0.618
    dpi: 300
    toc: true
    warning: FALSE
    message: FALSE
vignette: >
  %\VignetteIndexEntry{ggdotplotstats}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
## show me all columns
options(tibble.width = Inf, pillar.bold = TRUE, pillar.subtle_num = TRUE)
knitr::opts_chunk$set(
  collapse = TRUE,
  dpi = 300,
  warning = FALSE,
  message = FALSE,
  out.width = "100%",
  comment = "#>"
)
library(ggstatsplot)
```
---
You can cite this package/vignette as:
```{r citation, echo=FALSE, comment = ""}
citation("ggstatsplot")
```
---
Lifecycle: [![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)

The function `ggdotplotstats` is meant for **data exploration** and for making
**publication-ready dot plots/charts** with selected statistical details
embedded in the plot itself. In this vignette, we will explore several examples
of how to use it.

This function is a sister function of `gghistostats`; the difference is that it
expects a labeled numeric variable, that is, a numeric variable (`x`) paired
with a label variable (`y`).
## Distribution of a sample with `ggdotplotstats`
Let's begin with a very simple example from the `{ggplot2}` package
(`ggplot2::mpg`), a subset of the fuel economy data that the EPA makes available
on <http://fueleconomy.gov>.
```{r mpg}
## looking at the structure of the data using glimpse
dplyr::glimpse(ggplot2::mpg)
```
Let's say we want to visualize the distribution of mileage by car manufacturer.
```{r mpg2, fig.height = 7, fig.width = 9}
## for reproducibility
set.seed(123)
library(ggstatsplot)
## removing factor level with very few no. of observations
df <- dplyr::filter(ggplot2::mpg, cyl %in% c("4", "6"))
## creating a vector of colors using `paletteer` package
paletter_vector <-
  paletteer::paletteer_d(
    palette = "palettetown::venusaur",
    n = nlevels(as.factor(df$manufacturer)),
    type = "discrete"
  )
## plot
ggdotplotstats(
  data = df,
  x = cty,
  y = manufacturer,
  xlab = "city miles per gallon",
  ylab = "car manufacturer",
  test.value = 15.5,
  point.args = list(
    shape = 16,
    color = paletter_vector,
    size = 5
  ),
  title = "Distribution of mileage of cars",
  ggtheme = ggplot2::theme_dark()
)
```
## Grouped analysis with `grouped_ggdotplotstats`
What if we want to repeat the same analysis separately for engines with
different numbers of cylinders?

`{ggstatsplot}` provides a helper function for such cases:
`grouped_ggdotplotstats`. It is a thin wrapper that applies `ggdotplotstats`
across all **levels** of a specified **grouping variable** and then uses
`combine_plots` to combine the individual plots into a single plot.

Let's see how we can use this function to accomplish our task.
```{r grouped1, fig.height = 12, fig.width = 7}
## for reproducibility
set.seed(123)
## removing factor level with very few no. of observations
df <- dplyr::filter(ggplot2::mpg, cyl %in% c("4", "6"))
## plot
grouped_ggdotplotstats(
  ## arguments relevant for ggdotplotstats
  data = df,
  grouping.var = cyl, ## grouping variable
  x = cty,
  y = manufacturer,
  xlab = "city miles per gallon",
  ylab = "car manufacturer",
  type = "bayes", ## Bayesian test
  test.value = 15.5,
  ## arguments relevant for `combine_plots`
  annotation.args = list(title = "Fuel economy data"),
  plotgrid.args = list(nrow = 2)
)
```
## Grouped analysis with `{purrr}`
Although `grouped_ggdotplotstats` is a quick and dirty way to explore a large
amount of data with minimal effort, it comes with an important limitation:
reduced flexibility. For example, if we wanted to use a different `test.value`
for each number of cylinders, this is not possible with
`grouped_ggdotplotstats`. For cases like these, or to run different kinds of
tests for different levels of the grouping variable (robust for some,
parametric for others, Bayesian for the rest), it is better to use `{purrr}`.
See the associated vignette here:
<https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/purrr_examples.html>
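
As a minimal sketch of the idea (not taken from the linked vignette), the chunk
below splits the data by `cyl` and uses `purrr::pmap()` to give each subset its
own `test.value` and `type`. The specific test values and the object names
`df_list` and `plot_list` are illustrative assumptions, not recommendations.

```{r purrr_sketch, fig.height = 12, fig.width = 7}
## for reproducibility
set.seed(123)

## same subset as above, split into one data frame per cylinder count
df <- dplyr::filter(ggplot2::mpg, cyl %in% c(4, 6))
df_list <- split(df, df$cyl)

## one plot per level, each with its own (hypothetical) settings
plot_list <- purrr::pmap(
  .l = list(
    data = df_list,
    test.value = c(20, 16), ## illustrative values only
    type = c("parametric", "nonparametric"),
    title = paste(names(df_list), "cylinders")
  ),
  .f = function(data, test.value, type, title) {
    ggstatsplot::ggdotplotstats(
      data = data,
      x = cty,
      y = manufacturer,
      test.value = test.value,
      type = type,
      title = title
    )
  }
)

## combine the individual plots into a single plot
ggstatsplot::combine_plots(
  plotlist = plot_list,
  annotation.args = list(title = "Fuel economy data"),
  plotgrid.args = list(nrow = 2)
)
```

Wrapping `ggdotplotstats` in an anonymous function keeps `x` and `y` as bare
column names, so only the arguments that vary across levels need to be listed
in `.l`.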
## Summary of tests
**Central tendency measure**

Type | Measure | Function used
----------- | --------- | ------------------
Parametric | mean | `datawizard::describe_distribution`
Non-parametric | median | `datawizard::describe_distribution`
Robust | trimmed mean | `datawizard::describe_distribution`
Bayesian | MAP (maximum *a posteriori* probability) estimate | `datawizard::describe_distribution`

**Hypothesis testing**

Type | Test | Function used
------------------ | ------------------------- | -----
Parametric | One-sample Student's *t*-test | `stats::t.test`
Non-parametric | One-sample Wilcoxon test | `stats::wilcox.test`
Robust | Bootstrap-*t* method for one-sample test | `WRS2::trimcibt`
Bayesian | One-sample Student's *t*-test | `BayesFactor::ttestBF`

**Effect size estimation**

Type | Effect size | CI? | Function used
------------ | ----------------------- | --- | -----
Parametric | Cohen's *d*, Hedges' *g* | ✅ | `effectsize::cohens_d`, `effectsize::hedges_g`
Non-parametric | *r* (rank-biserial correlation) | ✅ | `effectsize::rank_biserial`
Robust | trimmed mean | ✅ | `WRS2::trimcibt`
Bayesian | $\delta_{posterior}$ | ✅ | `bayestestR::describe_posterior`
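
To make the parametric and non-parametric rows of the hypothesis-testing table
more concrete, here is a minimal base-R sketch of the kind of computation
involved. It assumes, as a simplification, that the test is run on the
per-label means of `x`; it uses the built-in `morley` data with a test value of
800, and the object names are illustrative.

```{r tests_sketch}
## per-experiment mean speed: one value per label (illustrative object name)
expt_means <- tapply(morley$Speed, morley$Expt, mean)

## parametric row: one-sample Student's t-test against the test value
t_res <- stats::t.test(expt_means, mu = 800)

## non-parametric row: one-sample Wilcoxon signed-rank test
w_res <- stats::wilcox.test(expt_means, mu = 800)

## collect the key numbers side by side
c(
  t = unname(t_res$statistic),
  p_t = t_res$p.value,
  p_wilcox = w_res$p.value
)
```

With only five label-level means, the two tests can disagree about
significance, which is one reason to report the effect size alongside the
*p*-value.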
## Reporting
If you wish to include statistical analysis results in a publication/report, the
ideal reporting practice will be a hybrid of two approaches:
- the `{ggstatsplot}` approach, where the plot contains both the visual and
numerical summaries about a statistical model, and
- the *standard* narrative approach, which provides interpretive context for the
reported statistics.
For example, consider the following plot:
```{r reporting}
ggdotplotstats(morley, Speed, Expt, test.value = 800)
```
The narrative context (assuming `type = "parametric"`) can complement this plot
either as a figure caption or in the main text:
> Student's *t*-test revealed that, across 5 experiments, the speed of light was
significantly different from the posited speed. The effect size $(g = 1.22)$ was
very large, as per Cohen’s (1988) conventions. The Bayes Factor for the same
analysis revealed that the data were `r round(exp(1.24), 2)` times more probable
under the alternative hypothesis as compared to the null hypothesis. This can be
considered moderate evidence (Jeffreys, 1961) in favor of the alternative
hypothesis.
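
As a rough check on the effect size quoted above, Hedges' *g* can be recomputed
by hand. This sketch assumes the test is run on the per-experiment means and
applies the standard small-sample correction factor to Cohen's *d*;
`effectsize::hedges_g` may differ in implementation details, and the object
names are illustrative.

```{r g_sketch}
## per-experiment mean speed (illustrative object name)
expt_means <- tapply(morley$Speed, morley$Expt, mean)

## one-sample Cohen's d relative to the posited speed of 800
d <- (mean(expt_means) - 800) / stats::sd(expt_means)

## small-sample correction factor for Hedges' g
df_t <- length(expt_means) - 1
j <- exp(lgamma(df_t / 2) - lgamma((df_t - 1) / 2)) / sqrt(df_t / 2)

## Hedges' g
g <- d * j
round(g, 2)

## Bayes factor implied by the log value used in the narrative, log(BF) = 1.24
round(exp(1.24), 2)
```

Under these assumptions, the hand computation rounds to the same $g = 1.22$
reported in the narrative.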
## Suggestions
If you find any bugs or have any suggestions/remarks, please file an issue on GitHub:
<https://github.com/IndrajeetPatil/ggstatsplot/issues>