oDL

Open Downloads (oDL) is built on the simple idea that podcast download numbers should be consistent and transparent across hosting companies, networks, and analytics providers.

oDL is an open source package that contains a simple spec, IP deny lists, and code to prepare log files and count podcast downloads in a scalable way.

oDL's goal is to move the podcast industry forward collectively by introducing a layer of trust between podcasters, providers, and advertisers: it replaces the black-box approach with an open system for counting and verifying download numbers.

Quickstart

Docker (preferred)

Note: With docker-compose, a volume is used to share files between the host and the container. The host path ./shared/data is mapped to the container path /var/lib/odl.

  1. Place the source events CSV file in the ./shared/data/ directory.

  2. Run odl to generate the analytics:

$ docker-compose run odl python run.py /var/lib/odl/<source_file.csv>

Results are written to the ./shared/data/<uid> directory.

Example:

$ mkdir -p ./shared/data
$ cp events.csv ./shared/data
$ docker-compose run odl python run.py /var/lib/odl/events.csv

Creating odl_odl_run ... done
Running the oDL pipeline from /tmp/odl/ec2e4eef293f414a936ce8e7821f8d0c.avro to /var/lib/odl/ec2e4eef293f414a936ce8e7821f8d0c

oDL run complete.
Downloads: 270
Output files written to path: /var/lib/odl/ec2e4eef293f414a936ce8e7821f8d0c

IPython

oDL is in the comment phase of development. For now, you'll need to clone and install from source.

$ git clone git@github.com:open-downloads/odl.git && cd odl
$ virtualenv venv
$ source venv/bin/activate
$ python -m pip install .
$ ipython

> from odl import prepare, pipeline
> prepare.run('path/to/events-input.csv', 'path/to/events.odl.avro')
> pipeline.run('path/to/events.odl.avro', 'path/to/events-output')

Running the oDL pipeline from path/to/events.odl.avro to path/to/events-output

oDL run complete.
Downloads: 13751
Output files written to path: path/to/events-output

Overview

oDL is a fundamentally new paradigm in podcasting, so we wanted to explain a little more about where oDL fits in the ecosystem.

Spec

An oDL Download meets the following criteria:

  • It's an HTTP GET request
  • The IP Address is not on the deny list
  • The User Agent is not on the deny list AND not a bot
  • It's not a request for only the first 2 bytes (i.e. Range: bytes=0-1), which indicates a streaming probe rather than a download
  • The User Agent/IP combination counts once per day (i.e. a fixed 24-hour period starting at midnight UTC)

That's it. It's intentionally simple to reduce confusion for all involved.
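To make the rules concrete, here's a minimal sketch of checking a single request against them. The deny lists, the is_bot heuristic, and the function names are hypothetical stand-ins for illustration, not the odl package's actual API.

DENIED_IPS = {'203.0.113.7'}            # hypothetical; oDL ships shared lists
DENIED_USER_AGENTS = {'ExampleBot/1.0'}

def is_bot(user_agent):
    # Stand-in heuristic; the spec relies on shared bot lists.
    return 'bot' in user_agent.lower()

def is_countable(method, ip, user_agent, range_start, range_end):
    if method != 'GET':
        return False
    if ip in DENIED_IPS:
        return False
    if user_agent in DENIED_USER_AGENTS or is_bot(user_agent):
        return False
    # A request for only the first two bytes (Range: bytes=0-1) is a
    # streaming probe, not a download.
    if range_start == 0 and range_end == 1:
        return False
    return True

def dedup_key(ip, user_agent, ts):
    # ts is a timezone-aware UTC datetime; each UA/IP pair counts once
    # per fixed UTC day.
    return (ip, user_agent, ts.date())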

Dealing with bots

Podcasts get many downloads from bots, servers, and things that just aren't human. We need to avoid counting these, but if every provider maintains its own User Agent and IP allow/deny lists, no two providers count the same way.

oDL uses common, publicly available lists to decide which User Agents and IP Addresses should be disallowed.

Out of Scope

oDL focuses on accuracy in cross-provider download numbers. Analytics Prefixes and hosting providers have access to different sets of data, so we take the intersection of this information. All counts are based on these seven data points:

  • IP Address
  • User Agent
  • HTTP Method
  • Timestamp
  • Episode Identifier (ID/Enclosure URL)
  • Byte Range Start
  • Byte Range End
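
For illustration, a single request reduced to these seven data points might look like the record below. The values are made up; the field names match the Avro schema in the Code section.

event = {
    'encoded_ip': '58ecdbd64d2fa9844d29557a35955a58',  # salted + hashed, never raw
    'user_agent': 'AppleCoreMedia/1.0.0.18C66',
    'http_method': 'GET',
    'timestamp': '2021-03-01T12:00:00+00:00',          # RFC 3339
    'episode_id': 'https://example.com/audio/ep1.mp3',
    'byte_range_start': 0,
    'byte_range_end': None,                            # open-ended range
}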

oDL does not take into account bytes coming off the server or even percentage of episode streamed. These numbers tend to conflate listening behavior with downloads and only podcast players can reliably report listening behavior.

Our goal is simply to weed out requests that did not make a good-faith effort to put an audio file on a podcast player.

Isn't this just IAB v2.0?

No. The goals are similar, but the tactics differ.

The IAB v2.0 spec is great, but it relies on wording around "Best Practices." We believe that a spec shouldn't have best practices. Two hosting providers, both IAB v2.0 certified, could have up to a 10% difference in download counts. This creates confusion for publishers, podcasters, and advertisers alike as to whose number is "correct."

IAB v2.0 is also expensive. It's up to $45k for a hosting provider to become certified. Competition is important, and this hurdle creates an undue burden on smaller companies.

oDL takes a transparent, open source approach. By using common IP and User Agent deny lists, we make the industry as a whole more accurate.

oDL and IAB v2.0 are not mutually exclusive. A provider may decide to do both.

Advanced Analytics

Many hosting providers offer advanced analytics, and we are all for it. oDL is not meant to reduce innovation in the podcast analytics space. Simplecast is fingerprinting downloads, Backtracks has its own approach, and many others have interesting ideas about how to use download data.

Advanced analytics methodologies are not consistent across hosting providers. Some providers use an allow list to count more downloads from IPs shared by many users, or use shorter attribution windows. These methods, while taken in good faith, inflate download numbers relative to providers that take a stricter approach.

The result is that ad-supported podcasters can make more or less money depending on which hosting provider they choose.

Our hope is that an oDL number sits beside the advanced analytics number and may be higher or lower, but the podcaster knows that it is consistent with their peers.

I'm a Hosting Provider and I want to support oDL.

oDL is a self-certifying spec, meaning that there is no formal process to become certified. The only requirement is that you let podcasters download their raw logs in the odl.avro format.

Users can then verify that the numbers reported in your dashboard match oDL's. Our goal in the future is to add a hosted verification service to https://odl.dev.

Code

A spec isn't much without code to run it all. oDL ships with a full implementation for counting downloads against server logs. It's built using Apache Beam to scale from counting thousands of downloads on your laptop to millions using a data processing engine like Apache Spark or Google Cloud Dataflow.

Prepare

The first step in running oDL is to prepare the data for the download counting job. oDL uses the following Avro schema for raw events:

Note: timestamps should be strings in RFC 3339 format.

[{
    "name": "encoded_ip",
    "type": "string"
}, {
    "name": "user_agent",
    "type": "string"
}, {
    "name": "http_method",
    "type": "string"
}, {
    "name": "timestamp",
    "type": "string"
}, {
    "name": "episode_id",
    "type": "string"
}, {
    "name": "byte_range_start",
    "type": ["int", "null"]
}, {
    "name": "byte_range_end",
    "type": ["int", "null"]
}]
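
Per the note above, timestamp values must be RFC 3339 strings. In Python, calling isoformat() on a timezone-aware datetime produces one:

> from datetime import datetime, timezone
> datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc).isoformat()
'2021-03-01T12:00:00+00:00'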

We ask that you encode the IP before creating the odl.avro file. We provide a helper to salt and hash the IP.

> from odl.prepare import get_ip_encoder
> encode = get_ip_encoder()
> encode('1.2.3.4')
'58ecdbd64d2fa9844d29557a35955a58'
> encode('1.2.3.5')
'1bcdcb404b16e046f3a13fc5563853d3'
> encode('1.2.3.5')
'1bcdcb404b16e046f3a13fc5563853d3'

By default, get_ip_encoder uses a random salt, so the same IP will encode differently on each run. If you need to run oDL over multiple files, pass your own salt:

> encode = get_ip_encoder(salt="this is a super secret key")
> encode('1.2.3.4')
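
For intuition, a salted-hash encoder can be as small as the sketch below. This is a hypothetical stand-in, not the actual get_ip_encoder implementation; the real helper's digest and encoding may differ.

import hashlib

def make_ip_encoder(salt):
    # Concatenate salt and IP, hash, and return the hex digest. A fixed
    # salt makes the mapping stable across runs, which is why you pass
    # your own salt when processing multiple files together.
    def encode(ip):
        return hashlib.md5((salt + ip).encode('utf-8')).hexdigest()
    return encode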

To actually write a file, you can use the following.

> from odl import prepare
> prepare.run('./path/to/log-events.json', './path/to/events.odl.avro', format="json")

prepare.run will raise an error if it can't create the output file.

Pipeline

The odl package uses Apache Beam under the hood to work on log files at any scale. On your local machine it looks like this:

> from odl import pipeline
> pipeline.run('path/to/events.odl.avro', 'path/to/events-output')
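
Scaling up is standard Apache Beam territory: the same pipeline can be pointed at a distributed runner via pipeline options. Whether pipeline.run forwards these options isn't shown here, so treat this as a generic Beam illustration with hypothetical project and bucket names.

from apache_beam.options.pipeline_options import PipelineOptions

# Generic Beam example: retarget a pipeline from the local DirectRunner
# to Google Cloud Dataflow. The project, region, and bucket values are
# hypothetical placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
)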

Closing

It's still early in the ball game here, but we hope open, transparent counting will improve podcasting for all.
