---
title: "Introduction to webreadr"
author: "Oliver Keyes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to webreadr}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Reading web access logs

R, as a language, is used for analysing pretty much everything, from genomic data to financial information. It's also used to analyse website access logs, although it has historically lacked a good framework for doing so: the URL decoder isn't vectorised, the file readers don't have convenient defaults, and good luck normalising IP addresses at scale.
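
To see the first of those pain points in action: base R's `URLdecode` only processes a single string at a time, so decoding a whole column of URLs means vectorising it yourself. A minimal base-R sketch, with made-up example values:

```{r}
#URLdecode handles one string at a time, so wrap it with vapply
#to decode a vector (the URLs here are invented for illustration)
urls <- c("search%3Fq%3Dwebreadr", "example%20path%2Findex.html")
vapply(urls, URLdecode, character(1), USE.NAMES = FALSE)
```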

Enter `webreadr`, which contains convenient wrappers and functions for reading, munging and formatting data from access logs and other sources of web request data.

## File reading

Base R has `read.delim`, which is convenient but much slower for file reading than Hadley Wickham's [readr](https://github.com/hadley/readr) package. `webreadr` defines a set of wrapper functions around readr's `read_delim`, designed for common access log formats.

The most common historical log format is the [Combined Log Format](http://httpd.apache.org/docs/1.3/logs.html#combined); it's used as one of the default formats for [nginx](http://nginx.org/) and the [Varnish caching system](https://www.varnish-cache.org/docs/trunk/reference/varnishncsa.html). `webreadr` lets you read it in trivially with `read_combined`:

```{r}
library(webreadr)
#Read in an example file that comes with the webreadr package
data <- read_combined(system.file("extdata/combined_log.clf", package = "webreadr"))
#And if we look at the format...
str(data)
```

As you can see, the column types have been appropriately set, the date/times have been parsed, and sensible header names have been applied.

The same thing can be done with the Common Log Format, used by Apache's default configuration and as one of the defaults for Squid caching servers, using `read_clf`. The other Squid default format can be read with `read_squid`. Amazon's AWS log files are also supported, with `read_aws`, which includes automatic field detection, and S3 bucket access logs can be read with `read_s3`.
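
Each of these readers takes a path to a log file, just like `read_combined` above. As a quick sketch (the file paths here are hypothetical placeholders, so the chunk isn't evaluated):

```{r, eval = FALSE}
#Hypothetical file paths, purely for illustration
clf_data   <- read_clf("/var/log/apache2/access.log")
squid_data <- read_squid("/var/log/squid/access.log")
aws_data   <- read_aws("aws_log.txt")
s3_data    <- read_s3("s3_access_log.txt")
```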

## Splitting combined fields

One of the things you'll notice about the example above is the "request" field: it contains not only the actual asset requested, but also the HTTP method and the protocol used. That's pretty inconvenient for people looking to do something productive with the data.

Normally you'd split each field out into a list, curse, recombine the pieces into a data.frame, hope that doing so didn't hit R's memory limit during the "unlist" stage, and wait an absolute age for it to finish. Or you could just split them up directly into a data frame using `split_clf`:

```{r}
requests <- split_clf(data$request)
str(requests)
```

This is faster than manual splitting-and-data.frame-ing, easier on the end user, and less likely to end in unexpected segfaults with large datasets. It also works on the `uri` field within S3 access logs.
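
Once split, the fields are easy to work with directly; for example, tabulating the HTTP methods in the example data (assuming the `method` column shown by `str(requests)` above):

```{r}
#Count how often each HTTP method appears in the example data
table(requests$method)
```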

A similar function, `split_squid`, exists for the `status_code` field in files read in with `read_squid`, which suffers from a similar problem.
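
It works just like `split_clf`. A minimal sketch, using made-up Squid status values rather than a real log file:

```{r}
#Made-up Squid status values, purely for illustration
statuses <- c("TCP_MISS/200", "TCP_HIT/304")
str(split_squid(statuses))
```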

## Other ideas

If you have ideas for other URL handlers that would make access log processing easier, the best approach is to either [request it](https://github.com/Ironholds/webreadr/issues) or [add it](https://github.com/Ironholds/webreadr/pulls)!