Filter web logs according to the Public Radio Podcast Measurement Guidelines to generate accurate download statistics

#Public Radio Podcast Measurement Filter

In January 2016, the Public Radio Podcast Measurement Guidelines Version 1.1 were issued. These Guidelines specify criteria for filtering web server log files in order to produce realistic counts of unique show downloads.

This repo contains a script that performs the specified filtering using standard Linux shell utilities, and a second script that shows how useful statistics can be produced from the first script's output.

##Why?

Standard web analysis tools such as AWStats, Webalizer and GoAccess are inappropriate for producing accurate download counts for media like podcasts, which may be streamed in chunks and where multiple differing types of access request can be made for a single file. These problems are well explained in a video from Blubrry.

##What?

The filtering script is designed to run as part of a pipeline. It accepts log lines from stdin and, for those that meet the Public Radio Podcast Measurement (PRPM) Guidelines, writes data to stdout. It takes a POSIX-style regular expression as its sole argument; only URLs matching that regex are processed and included in the output.
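Before running a pattern against a large log, it can help to check it against a few URLs by hand. The sketch below uses `grep -E` as a stand-in matcher; that the filter interprets the regex in the same way is an assumption, and the sample URLs are made up for illustration.

```shell
# Pattern in POSIX extended regex syntax, with the [[:digit:]] class.
pattern='Episode[[:digit:]]{1,3}\.(mp3|ogg)$'

# Hypothetical URLs: the first two should match, the last two should not.
matches=$(printf '%s\n' \
    '/podcast/Episode7.mp3' \
    '/podcast/Episode123.ogg' \
    '/podcast/Episode7.mp3.torrent' \
    '/podcast/Trailer.mp3' |
    grep -E "$pattern")

printf '%s\n' "$matches"
```

The trailing `$` anchors the match to the end of the URL, which is why the `.torrent` entry is rejected.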

Data output consists of four tab-separated fields: the show URL, download date (YYYY-MM-DD), IP address of the downloader, and the User Agent string of the downloader.
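Because the output is plain tab-separated text, it can be summarised with standard tools. The following is a minimal sketch that counts downloads per show per day; the sample rows are hypothetical and stand in for real filter output.

```shell
# Hypothetical filter output: URL <TAB> date <TAB> IP <TAB> User Agent
sample=$(printf '%s\t%s\t%s\t%s\n' \
    '/podcast/Episode1.mp3' '2016-03-01' '203.0.113.5'  'AppleCoreMedia/1.0' \
    '/podcast/Episode1.mp3' '2016-03-01' '198.51.100.7' 'Mozilla/5.0' \
    '/podcast/Episode1.mp3' '2016-03-02' '203.0.113.5'  'AppleCoreMedia/1.0' \
    '/podcast/Episode2.mp3' '2016-03-02' '192.0.2.44'   'Overcast/2.0')

# Count rows per (date, URL) pair and print: date <TAB> URL <TAB> count
daily=$(printf '%s\n' "$sample" |
    awk -F'\t' '{ count[$2 "\t" $1]++ }
                END { for (k in count) print k "\t" count[k] }' |
    sort)

printf '%s\n' "$daily"
```

Setting the awk field separator to a tab matters here, since User Agent strings routinely contain spaces.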

##Usage

Basic sample usage to produce a unique download count for each show matching the regex:

    # cat access.log | ./ "Episode[[:digit:]]{1,3}\.(mp3|ogg)$" | cut -f1 | sort | uniq -c

See the example statistics script for how to process rotated logs, and how to use the filter's output to produce unique download counts over time, counts and geographic locations of your unique downloaders, and so on.
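As a rough illustration of reading rotated logs, the sketch below concatenates a mixed set of compressed and uncompressed files, oldest first. It is not taken from the example script. The file names follow a common logrotate layout and are assumptions, as is the availability of a gzip that passes uncompressed input through when given `-cdf` (GNU gzip does).

```shell
# Hypothetical rotated log set in a scratch directory:
# access.log (current), access.log.1 (previous), access.log.2.gz (older, compressed).
logdir=$(mktemp -d)
printf 'oldest line\n' | gzip -c > "$logdir/access.log.2.gz"
printf 'middle line\n' > "$logdir/access.log.1"
printf 'newest line\n' > "$logdir/access.log"

# gzip -cdf decompresses .gz files and copies plain files through unchanged,
# so a single command reads the whole set in chronological order.
combined=$(gzip -cdf "$logdir/access.log.2.gz" "$logdir/access.log.1" "$logdir/access.log")

printf '%s\n' "$combined"
rm -r "$logdir"
```

In practice the `gzip -cdf` command would take the place of `cat access.log` at the head of the pipeline.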

##Web Log Formats and Rotation

The filter will happily process web logs in the standard combined format, used by both Apache and Nginx. However, one of the PRPM Guidelines requires lines to be filtered on the basis of specific byte-range data, which the combined log format does not record by default. If you want your generated statistics to be completely PRPM Guideline compliant, you will need to alter your existing log format (or write out an additional custom log) so that the byte-range request data is appended to each line. That is, a format of:

    %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{Range}i\"
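To confirm that the extra field is being written, the Range value can be pulled out of a log line by splitting on double quotes. This is only a sketch: the sample line is invented, and the approach assumes no field contains an escaped quote. How the filter itself parses the line is not shown here.

```shell
# Hypothetical log line in the extended format, with the Range header last.
line='203.0.113.5 - - [01/Mar/2016:10:00:00 +0000] "GET /podcast/Episode1.mp3 HTTP/1.1" 206 2 "-" "AppleCoreMedia/1.0" "bytes=0-1"'

# Splitting on double quotes gives: $2 = request, $4 = referer,
# $6 = user agent, $8 = range.
range=$(printf '%s\n' "$line" | awk -F'"' '{ print $8 }')

printf '%s\n' "$range"
```

A request that sent no Range header is logged by Apache as `"-"` in that position.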

Note that many Linux distros configure web log rotation to retain only a year's worth of data. If you'd like to generate meaningful historical statistics using this script, refer to your distro's documentation for details on how to change this behaviour.

##Footnote

Public Radio released the guidelines implemented here at least in part due to frustration at the length of time the IAB were taking in producing their own 'industry standard' guidance. A document from the IAB Working Group was finally released in September 2016. It is left to the reader to determine whether this 'official' guidance adds anything of substantive merit to the process implemented here, or whether that Working Group was simply the usual tea drinking and cake eating boondoggle for committee participants.