---
title: "Session reconstruction in R"
author: "Oliver Keyes"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{Session reconstruction in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---
## Session reconstruction in R
With trace data - from web logs, behavioural logs, really anything to do with user actions - reconstructing sessions (or 'sessionising') is essential. It lets an analyst divide up user actions into actual periods of sustained interaction, and from there compute a whole host of useful metrics,
from session length to bounce rate.
`reconstructr` is a library for just that - sessionisation and metric computation - in a way that keeps all the metadata about the events you're sessionising.
### Dividing events into sessions
The nature of a "session" has provided fodder for researchers for years. Most people take an approach based on inactivity thresholds; if someone does not take an action in N seconds, their session has ended and a new one begins on their next logged action.
Using reconstructr, we can conveniently divide events into sessions with the `sessionise` function. This takes a data.frame of events (along with specifiers of which column contains the user ID, and which column contains the timestamp) and a threshold value (in seconds). When the time between events by a user crosses that threshold, the session ends and a new one begins.
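To make the inactivity-threshold idea concrete, here is a minimal base R sketch of the same logic on a toy event log. It is not reconstructr's implementation: the data is invented, the session IDs are plain integers rather than hashes, and treating a gap strictly greater than the threshold as the session boundary is an assumption.

```r
# Toy event log; column names mirror the vignette's dataset, values are invented.
events <- data.frame(
  uuid = c("a", "a", "a", "b", "b"),
  timestamp = as.POSIXct(
    c("2014-01-07 00:00:00", "2014-01-07 00:10:00", "2014-01-07 01:00:00",
      "2014-01-07 00:05:00", "2014-01-07 00:06:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)
threshold <- 1800  # inactivity threshold, in seconds

# Order events by user, then by time within each user.
events <- events[order(events$uuid, events$timestamp), ]

# Gap (in seconds) between each row and the row before it.
gap <- c(NA, diff(as.numeric(events$timestamp)))

# A row opens a new session if it is the user's first event, or if the
# gap since their previous event exceeds the threshold (assumed strict).
first_for_user <- !duplicated(events$uuid)
new_session <- first_for_user | gap > threshold

# Running count of session starts gives an integer session ID.
events$session_id <- cumsum(new_session)

# The first event in a session has no predecessor, hence NA.
events$time_delta <- ifelse(new_session, NA, gap)

events
```

User "a" ends up with two sessions because the 50-minute gap before their third event is longer than the 30-minute threshold, while user "b" has a single two-event session.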
We can demonstrate this using reconstructr's inbuilt session dataset:
```{r, eval=FALSE}
library(reconstructr)
data("session_dataset")
str(session_dataset)
#> 'data.frame': 63524 obs. of 3 variables:
#> $ uuid : chr "47dc43895814861e21a2edf93348c826" "a736822df1890011694e7049cb3abef3" "674d2d00e096a3319874a4347caa1f4a" "f62d315398e6d04a3f2fa02e8ae42d49" ...
#> $ timestamp: POSIXlt, format: "2014-01-07 00:00:15" "2014-01-07 00:01:11" "2014-01-07 00:01:54" ...
#> $ url : chr "https://www.nasa.gov/history/mercury/mercury.html" "https://www.nasa.gov/images/ksclogosmall.gif" "https://www.nasa.gov/elv/hot.gif" "https://www.nasa.gov/facts/faq04.html" ...
sessionised_data <- sessionise(x = session_dataset, timestamp = timestamp, user_id = uuid,
                               threshold = 1800)
str(sessionised_data)
#> 'data.frame': 63524 obs. of 5 variables:
#> $ uuid : chr "0005839b3e8483d50870f61f50307fa7" "000b047bad36484451f12c114ab5eb28" "000b047bad36484451f12c114ab5eb28" "000b047bad36484451f12c114ab5eb28" ...
#> $ timestamp : POSIXlt, format: "2014-01-14 12:47:59" "2014-01-07 14:25:11" "2014-01-09 12:47:17" ...
#> $ url : chr "https://www.nasa.gov/history/apollo/images/footprint-logo.gif" "https://www.nasa.gov/ksc.html" "https://www.nasa.gov/biomed/threat/gif/beachmousefinsmall.gif" "https://www.nasa.gov/shuttle/resources/orbiters/atlantis.html" ...
#> $ session_id: chr "09cd65049020ed55472a2d8b1f47787e" "9dcb2f610297b3fe2c810907fa90fb8e" "70bcde51eff332d4ac820a90930f0f6e" "70bcde51eff332d4ac820a90930f0f6e" ...
#> $ time_delta: int NA NA NA 45 4 75 274 47 NA 28 ...
```
Sessionisation adds two new columns: 'session\_id', a unique per-session ID, and 'time\_delta', the time between an event and the previous event in the session. If the event was the first (or only) one in a session, that value will be `NA`.
Crucially, existing metadata (like URLs, or activity type) is carried along with the session information and not dropped.
From the sessionised data we can compute a whole host of useful metrics, many of which have convenience functions in this package.
### Session metrics
An important metric in session data is the bounce rate: the proportion of sessions that included only a single event. This represents (absent data quality issues) the share of sessions where a user took only one action and then simply left.
It can be computed with `bounce_rate`, which takes a sessionised dataset and produces the percentage of sessions resulting in bounces. Optionally (if you provide an argument for the `user_id` parameter) it produces the bounce rate on a per-user basis, rather than for the dataset overall:
```{r, eval=FALSE}
str(bounce_rate(sessionised_data))
#> num 20.7
str(bounce_rate(sessionised_data, user_id = uuid))
#> 'data.frame': 10000 obs. of 2 variables:
#> $ user_id : chr "0005839b3e8483d50870f61f50307fa7" "000b047bad36484451f12c114ab5eb28" "000b2bc1a5438d8d54d4fbec139a2fd5" "001b6e80a14ba8d809c4ff18cdbade40" ...
#> $ bounce_rate: num 100 14.3 0 100 100 ...
```
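For intuition, the bounce rate can also be worked out by hand. The sketch below uses base R on a small invented data frame with the same `uuid` and `session_id` column names; it illustrates the definition and is not how `bounce_rate` is necessarily implemented.

```r
# Hypothetical sessionised data: only the columns needed here, invented values.
sessions <- data.frame(
  uuid = c("a", "a", "a", "b", "c", "c"),
  session_id = c("s1", "s1", "s2", "s3", "s4", "s4"),
  stringsAsFactors = FALSE
)

# Count events per session; a bounce is a session with exactly one event.
events_per_session <- table(sessions$session_id)
is_bounce <- as.vector(events_per_session == 1)

# Overall bounce rate, as a percentage of sessions.
overall <- 100 * mean(is_bounce)

# Look up which user each session belongs to, then average per user.
session_user <- sessions$uuid[match(names(events_per_session), sessions$session_id)]
per_user <- vapply(
  split(is_bounce, session_user),
  function(b) 100 * mean(b),
  numeric(1)
)

overall
per_user
```

Two of the four sessions contain a single event, so the overall rate is 50. User "b" only ever bounced (100), while user "c" never did (0).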
`time_on_page` is very similar: it calculates the mean (or median) time between events, either for the dataset as a whole or, if the `by_session` parameter is TRUE, for each session:
```{r, eval=FALSE}
str(time_on_page(sessionised_data))
#> num 146
str(time_on_page(sessionised_data, by_session = TRUE))
#> 'data.frame': 22226 obs. of 2 variables:
#> $ session_id : chr "00011b1e098848edee7e50a2174fe6ef" "0001f6457a4d09a8c2092278fec89a89" "000451f0869b7eab3582c093ace0253d" "0004c56ace95f92ee12bf9552401f923" ...
#> $ time_on_page: num NaN NaN NaN NaN NaN ...
```
(It's not broken; it just so happens that the first few sessions contain no non-NA time deltas.)
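The `NaN` values are ordinary R behaviour rather than anything specific to this package: averaging a vector that has nothing left once the `NA`s are removed yields `NaN`. A small sketch with invented time deltas shows this, on the assumption that `time_on_page` drops `NA`s before averaging.

```r
# Hypothetical time deltas (in seconds) for two sessions; values are invented.
single_event <- c(NA_integer_)    # a bounce: its only event has no predecessor
three_events <- c(NA, 45L, 75L)   # first event is NA, followed by two real gaps

# With NAs removed, the bounce session has no values left to average.
bounce_mean <- mean(single_event, na.rm = TRUE)
multi_mean  <- mean(three_events, na.rm = TRUE)

bounce_mean  # NaN
multi_mean   # 60
```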
Finally, `session_count` and `session_length` provide easy ways of identifying how many sessions are in a sessionised dataset (overall, or on a per-user basis) and how long those sessions are, respectively.
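As a rough picture of what these two metrics measure, the base R sketch below counts sessions and sums time deltas on an invented data frame. Treating a session's length as the sum of its time deltas (that is, the time from its first to its last event) is an assumption made for illustration; the package's own definition may differ.

```r
# Hypothetical sessionised data with invented values.
sessions <- data.frame(
  uuid = c("a", "a", "a", "b"),
  session_id = c("s1", "s1", "s2", "s3"),
  time_delta = c(NA, 600, NA, NA),
  stringsAsFactors = FALSE
)

# Number of distinct sessions overall.
n_sessions <- length(unique(sessions$session_id))

# Number of distinct sessions per user.
sessions_per_user <- vapply(
  split(sessions$session_id, sessions$uuid),
  function(s) length(unique(s)),
  integer(1)
)

# Assumed definition of length: sum of the non-NA time deltas in each session,
# so single-event sessions come out with a length of 0.
length_by_session <- vapply(
  split(sessions$time_delta, sessions$session_id),
  sum,
  numeric(1),
  na.rm = TRUE
)

n_sessions
sessions_per_user
length_by_session
```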
### Other session functionality
If you have ideas for other functionality that would make processing sessions easier, the best approach
is to either [request it](https://github.com/Ironholds/reconstructr/issues) or [add it](https://github.com/Ironholds/reconstructr/pulls)!