---
title: "Anomaly Counts by Segment List #1 and Segment List #2"
output:
  ioslides_presentation:
    widescreen: true
    smaller: true
    fig_width: 10
    fig_height: 3.5
    css: styles.css
    logo: images/logo.png
---
```{r setup, include=FALSE}
# PLATFORM: Adobe Analytics
#
# The purpose of this script is to answer the question, "Did any of my metrics move ENOUGH,
# for any combination of two sets of segments, in the most recent time period that the
# movement doesn't look like it's just fluctuation due to noise?" It does this by using the
# 35 days *prior* to the period being assessed to build a forecast for the assessment
# period. The forecast uses exponential smoothing (Holt-Winters), and it predicts a specific
# value as well as a 95% confidence interval (an "upper" and a "lower" limit around the
# forecast value). The script relies on Adobe to do this work (but it could also be done in
# R). The script then looks at the *actual* values and flags any of them that fall OUTSIDE
# the confidence interval.
#
# This differs from the way that Adobe Analytics presents anomaly detection in a couple of ways:
#
# 1) It focuses JUST on the most recent period, even though it plots a longer trendline. It
#    ignores anomalies that occurred in the past, because it's focused on "did anything
#    happen LATELY?"
# 2) It shows a trendline that includes the previous period's data -- the data that is used
#    to create the forecast (and even earlier data, if desired).
#
# This script takes as inputs:
# - a set of metrics
# - a single segment, or a list of segments, that is applied to the entire report
# - two lists of segments that are then drilled down into
#
# To use this script, you will need an .Renviron file in your working directory when you start/
# re-start R that has your Adobe Analytics credentials and the RSID for the report suite being
# used. It should look like:
#
# ADOBE_KEY="[Your Adobe Key]"
# ADOBE_SECRET="[Your Adobe Secret]"
# RSID="[The RSID for the report suite being used]"
#
# Then, you will need to customize the various settings in the config.R file.
# What these settings are for and how to adjust them is documented in the comments of that file.
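# A minimal sketch, for reference, of how the same forecast-and-flag approach could
# be done locally in R instead of relying on Adobe's anomaly detection. Everything
# below is illustrative only: the data is synthetic, and none of these demo_*
# objects are used anywhere else in this script.
demo_baseline <- ts(rep(c(100, 110, 120, 115, 105, 95, 90), 5) + seq_len(35) * 0.5,
                    frequency = 7)                 # 35 days of "prior" data
demo_fit <- HoltWinters(demo_baseline)             # exponential smoothing fit
demo_forecast <- predict(demo_fit, n.ahead = 7,
                         prediction.interval = TRUE, level = 0.95)
# demo_forecast has columns "fit", "upr", and "lwr"; a day's actual value would be
# flagged as an anomaly if it falls outside [lwr, upr] -- above "upr" is a "good"
# anomaly, below "lwr" is a "bad" one.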
knitr::opts_chunk$set(echo = TRUE)
# Get a timestamp for when the script starts running. Ultimately, this will be written
# out to a file with end time so there is a record of how long it took the script to run.
script_start_time <- Sys.time()
# Load libraries
library(RSiteCatalyst)
library(tidyverse)
library(scales) # For getting commas in numbers on y-axes
library(stringr) # For wrapping strings in the axis labels
```
```{r settings, include=FALSE}
###############
# Settings
###############
# These are all sourced from config.R, so be sure to open that script
# and adjust settings there before running this one. These are called out
# as separate chunks just for code readability (hopefully).
knitr::read_chunk('config.R')
```
```{r metrics-list, include=FALSE}
```
```{r timeframes, include=FALSE}
```
```{r main-segment, include=FALSE}
```
```{r drilldown-segments, include=FALSE}
```
```{r default-theme, include=FALSE}
```
```{r functions, include=FALSE}
###############
# Functions
###############
####################
# Heatmap Creation Function
####################
summary_heatmap <- function(metric){
  # Get just the results for the metric of interest
  summary_table <- filter(segment_results_anomalies, metric_name == metric) %>%
    select(segment_1, segment_2, metric_good_anomalies, metric_bad_anomalies, metric_net_good_anomalies)
  # Convert the segment names to factors (required in order to order them) and
  # ensure they're ordered the same as set up in the config.
  summary_table$segment_1 <- factor(summary_table$segment_1,
                                    levels = rev(sapply(segment_drilldown_1, function(x) x$name)))
  summary_table$segment_2 <- factor(summary_table$segment_2,
                                    levels = sapply(segment_drilldown_2, function(x) x$name))
  # Get the details on how to format the metric in the box
  metric_format <- filter(metrics_list, metric_name == metric) %>%
    select(metric_format) %>% as.character()
  metric_decimals <- filter(metrics_list, metric_name == metric) %>%
    select(metric_decimals) %>% as.numeric()
  # Create the heatmap
  heatmap_plot <- ggplot(summary_table, aes(segment_2, segment_1)) +
    geom_tile(aes(fill = metric_net_good_anomalies)) +
    scale_fill_gradient2(low = "red", mid = "white", high = "green", limits = c(-5, 5)) +
    geom_text(aes(label = paste0("+", metric_good_anomalies)), nudge_y = 0.2) +
    geom_text(aes(label = paste0("-", metric_bad_anomalies)), nudge_y = -0.2) +
    scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
    scale_y_discrete(labels = function(x) str_wrap(x, width = 8)) +
    default_theme +
    theme(axis.text = element_text(size = 12, colour = "grey10"),
          panel.grid.major = element_blank(),
          legend.position = "none")
  # Return the plot explicitly
  heatmap_plot
}
# And, get one function that is shared with other .Rmd files.
knitr::read_chunk('anomaly_check_functions.R')
```
```{r assess-anomalies, include=FALSE}
```
```{r main, include=FALSE}
#######################
# Start of main functionality
#######################
# Get the values needed to authenticate from the .Renviron file
auth_key <- Sys.getenv("ADOBE_KEY")
auth_secret <- Sys.getenv("ADOBE_SECRET")
# Get the RSID we're going to use from the .Renviron file
rsid <- Sys.getenv("RSID")
# Authenticate
SCAuth(auth_key, auth_secret)
# Cycle through all possible combinations of the segments
# in segment_drilldown_1 and segment_drilldown_2. Think of this as a matrix for each
# metric that will show the total for each combination of segments from the two lists.
# Should this be doable without loops? Maybe. With lapply? I don't think it would
# change the number of API calls, and that's what the real performance drag is.
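# (For the record, the same combination matrix could be sketched with expand.grid() --
# illustrative only; the object below isn't used anywhere else:
#   combos <- expand.grid(s1 = seq_along(segment_drilldown_1),
#                         s2 = seq_along(segment_drilldown_2))
# -- but each combination would still require its own API call, so nested loops
# are just as good.)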
segment_results_anomalies <- data.frame(segment_1 = character(),
                                        segment_2 = character(),
                                        metric_name = character(),
                                        metric_good_anomalies = numeric(),
                                        metric_bad_anomalies = numeric(),
                                        metric_net_good_anomalies = numeric(),
                                        stringsAsFactors = FALSE)
# Initialize a counter for adding new rows to the data frame just created.
new_row <- 1
for(s1 in 1:length(segment_drilldown_1)){
  # Get the current segment 1 to be processed
  segment1_id <- segment_drilldown_1[[s1]]$seg_id
  segment1_name <- segment_drilldown_1[[s1]]$name
  for(s2 in 1:length(segment_drilldown_2)){
    # Get the current segment 2 to be processed
    segment2_id <- segment_drilldown_2[[s2]]$seg_id
    segment2_name <- segment_drilldown_2[[s2]]$name
    segments <- c(segments_all, segment1_id, segment2_id)
    # Get the metrics to be assessed for the "anomaly period." This data will include
    # the forecast values for the metrics, with that forecast based on the 35 days
    # preceding the start date.
    data_anomaly_trend <- QueueOvertime(rsid,
                                        date_start_anomaly_period,
                                        date_end,
                                        metrics_list$metric_id,
                                        date.granularity = "day",
                                        segment.id = segments,
                                        anomaly.detection = TRUE)
    # Get the result for each metric
    for(m in 1:nrow(metrics_list)){
      metric_id <- metrics_list[m, 1]
      metric_name <- metrics_list[m, 2]
      # Call the function that actually gets the anomaly counts
      anomaly_count <- assess_anomalies(metric_id, data_anomaly_trend)
      # Add the results to the data frame
      segment_results_anomalies[new_row, ] <- NA
      segment_results_anomalies$segment_1[new_row] <- segment1_name
      segment_results_anomalies$segment_2[new_row] <- segment2_name
      segment_results_anomalies$metric_name[new_row] <- metric_name
      segment_results_anomalies$metric_good_anomalies[new_row] <- anomaly_count$good_anomalies
      segment_results_anomalies$metric_bad_anomalies[new_row] <- anomaly_count$bad_anomalies
      segment_results_anomalies$metric_net_good_anomalies[new_row] <- anomaly_count$net_good_anomalies
      # Increment the counter so the next iteration will add another row
      new_row <- new_row + 1
    }
  }
}
# Save this data so that, when just tinkering with the output, the API pull above
# can be commented out and the load() call below used instead.
save(segment_results_anomalies, file = "data_anomaly_id_two_dimension_drilldown.Rda")
# load("data_anomaly_id_two_dimension_drilldown.Rda")
# RMarkdown doesn't do great with looping for output, so the sections below need to be
# constructed manually. This should be fairly quick to tweak. Note that summary_heatmap() takes
# as an input the 'metric_name' value, so this needs to be based on what was entered
# for 'metric_name' in the 'metrics_list' object in the Settings.
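# For example, to add a slide for a hypothetical "Page Views" metric (assuming it
# has been added to 'metrics_list' in config.R with metric_name = "Page Views"),
# copy one of the sections below this chunk and swap in the name:
#
#   ## Page Views
#   (summary text)
#   ```{r page-views, echo=FALSE, warning=FALSE}
#   heatmap_plot <- summary_heatmap("Page Views")
#   heatmap_plot
#   ```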
```
## Revenue
This summary highlights anomalies in Revenue by day for the most recent week. It compares a forecast of the results with the actual results to identify which days deviated a "significant" amount from the expected value. In each cell, the top number is the count of positive anomalies and the bottom number is the count of negative anomalies; the color indicates the count of _net positive_ (good minus bad) anomalies. This assessment **`r ifelse(include_weekends=="No","excludes","includes")`** weekend anomalies.
```{r revenue, echo=FALSE, warning=FALSE}
heatmap_plot <- summary_heatmap("Revenue")
heatmap_plot
```
## Orders
This summary highlights anomalies in Orders by day for the most recent week. It compares a forecast of the results with the actual results to identify which days deviated a "significant" amount from the expected value. In each cell, the top number is the count of positive anomalies and the bottom number is the count of negative anomalies; the color indicates the count of _net positive_ (good minus bad) anomalies. This assessment **`r ifelse(include_weekends=="No","excludes","includes")`** weekend anomalies.
```{r orders, echo=FALSE, warning=FALSE}
heatmap_plot <- summary_heatmap("Orders")
heatmap_plot
```
## Visits
This summary highlights anomalies in Visits by day for the most recent week. It compares a forecast of the results with the actual results to identify which days deviated a "significant" amount from the expected value. In each cell, the top number is the count of positive anomalies and the bottom number is the count of negative anomalies; the color indicates the count of _net positive_ (good minus bad) anomalies. This assessment **`r ifelse(include_weekends=="No","excludes","includes")`** weekend anomalies.
```{r visits, echo=FALSE, warning=FALSE}
heatmap_plot <- summary_heatmap("Visits")
heatmap_plot
```
## Conversion Rate
This summary highlights anomalies in Conversion Rate by day for the most recent week. It compares a forecast of the results with the actual results to identify which days deviated a "significant" amount from the expected value. In each cell, the top number is the count of positive anomalies and the bottom number is the count of negative anomalies; the color indicates the count of _net positive_ (good minus bad) anomalies. This assessment **`r ifelse(include_weekends=="No","excludes","includes")`** weekend anomalies.
```{r cvr, echo=FALSE, warning=FALSE}
heatmap_plot <- summary_heatmap("Conversion Rate")
heatmap_plot
```
```{r script_time, include=FALSE}
# Get a timestamp for when the script is essentially done and write the start and end times out
# to a file that can be checked to see how long it took the script to run.
script_end_time <- Sys.time()
# Force the units to minutes; subtracting two POSIXct values directly returns a
# difftime whose units (seconds, minutes, hours) vary with the magnitude.
script_duration <- difftime(script_end_time, script_start_time, units = "mins")
duration_message <- paste0("The script started running at ", script_start_time, " and finished running at ",
                           script_end_time, ". The total duration for the script to run was: ",
                           round(as.numeric(script_duration), 2), " minutes.")
write_file(duration_message, path = "script_duration_anomalies_two_dimension_drilldown.txt")
```