Tools used to create the data on TravisTorrent (
Latest commit 18ba889 Feb 9, 2017 @gousiosg gousiosg committed on GitHub Documentation fixes

This repository contains the tools used to generate the data on TravisTorrent. These include:

  1. Travis Poker (bin/travis_poker.rb), which checks en masse whether projects have a Travis build history,
  2. Travis Harvester (bin/travis_harvester.rb), which downloads Travis build logs,
  3. Travis BuildLog Analyzer (bin/buildlog_analysis.rb),
  4. Build Metadata Extractor (bin/build_data_extraction.rb).

Installing required dependencies

The following works on Debian Jessie:

$ apt-get install ruby ruby-dev bundler pkg-config libmysqlclient-dev
$ git clone
$ cd travistorrent-tools
$ bundle install

Running the data extraction process

The file projects.txt contains a list of non-toy, non-fork, active GitHub projects. It was retrieved from GHTorrent by running the query:

select u.login, p.name, p.language, count(*)
from projects p, users u, watchers w
where
    p.forked_from is null and
    p.deleted is false and
    w.repo_id = p.id and
    u.id = p.owner_id
group by p.id
having count(*) > 50
order by count(*) desc

You can then call the Travis Poker to see whether these projects use Travis CI or not. Projects will be annotated with a binary flag indicating this.

To further process the list generated by Travis Poker, do

grep "true" results.csv > travis_enabled
sed -i 's/\([^,]*\),\([^,]*\).*/\1 \2/' travis_enabled
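The same filtering step can be sketched in Ruby. This is only an illustration and not part of the repository: it assumes that each line of results.csv starts with the owner login and the project name, and that lines for Travis-enabled projects contain the string true, which is what the grep and sed commands above rely on. The helper name travis_enabled_pairs and the sample rows are made up.

```ruby
# Hypothetical helper mirroring the grep + sed pipeline above.
# Assumption: the first two comma-separated fields are owner and project
# name, and Travis-enabled rows contain the literal string "true".
def travis_enabled_pairs(lines)
  lines
    .select { |line| line.include?('true') }   # same idea as: grep "true"
    .map { |line| line.chomp.split(',', 3) }   # split off the first two fields
    .select { |fields| fields.size >= 2 }      # skip lines without a comma
    .map { |owner, name, _rest| "#{owner} #{name}" }
end

# Made-up sample rows, for illustration only.
sample = [
  "owner1,project1,Ruby,true\n",
  "owner2,project2,C,false\n"
]
puts travis_enabled_pairs(sample)
```

Like the grep call, this matches true anywhere in the line, so a project whose name contains "true" would also be selected.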

This list can now be passed to the Travis Harvester, for which we use parallel.

Retrieve the build logs of 20 GitHub projects simultaneously (beware: depending on your network connection, this puts a heavy load on Travis CI!):

cat travis_enabled | parallel -j 20 --colsep ' ' ruby bin/travis_harvester.rb

Extracting GitHub features about each build

To extract features for one project, do

 ruby -Ibin bin/build_data_extraction.rb stripe brushfire github-token

where github-token is a valid GitHub OAuth token used to download information about commits. To configure access to the required GHTorrent MySQL and MongoDB databases, copy config.yaml.tmpl to config.yaml and edit accordingly. You can have direct access to the GHTorrent MySQL and MongoDB databases using this link.

To extract features for multiple projects in parallel, you need

  • A file (project-list) of projects, in the format specified above
  • A file (token-list) of one or more GitHub tokens, one token per line

Then, run

./bin/project_token.rb project-list token-list | sort -R > projects-tokens
./bin/ -p 4 -d data projects-tokens

This will create a file with tokens equi-distributed to projects and a directory data, and start 4 instances of the build_data_extraction.rb script.
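One way to read "equi-distributed" is a round-robin assignment, sketched below. This is not the code of bin/project_token.rb. The function name distribute_tokens, the output format (project followed by its token) and the sample values are assumptions made for illustration.

```ruby
# Hypothetical sketch: assign tokens to projects round-robin so that each
# token serves roughly the same number of projects.
# Assumption: projects are "owner name" strings and tokens are plain strings.
def distribute_tokens(projects, tokens)
  raise ArgumentError, 'need at least one token' if tokens.empty?

  projects.each_with_index.map do |project, i|
    "#{project} #{tokens[i % tokens.size]}"
  end
end

# Made-up sample input, for illustration only.
projects = ['owner1 project1', 'owner2 project2', 'owner3 project3']
tokens   = %w[token-a token-b]
puts distribute_tokens(projects, tokens)
```

With n tokens, no token is assigned to more than one project above the average, which spreads GitHub API usage evenly across tokens.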

Analyzing Buildlogs

Our buildlog dispatcher handles everything that you typically want: it generates one convenient output file (a CSV) per project directory and automatically dispatches to the correct buildlog analyzer. You can start the per-project analysis (typically on a directory structure checked out through travis-harvester) via

ruby bin/buildlog_analysis.rb directory-of-project-to-analyze

To analyze all buildlogs, parallel helps us again:

ls build_logs | parallel -j 5 ruby bin/buildlog_analysis.rb "build_logs/{}"

Travis Breaking the Build

broken <- (errored|failed)
errored <- infrastructure
failed <- tests
canceled <- user abort

Breaking the Build

If any of the commands in the first four stages returns a non-zero exit code, Travis CI considers the build to be broken.

When any of the steps in the before_install, install or before_script stages fails with a non-zero exit code, the build is marked as errored.

When any of the steps in the script stage fails with a non-zero exit code, the build is marked as failed.

Note that the script section has different semantics from the other stages. When a step defined in script fails, the build doesn’t end right away; it continues to run the remaining steps before it fails the build.

Currently, neither the after_success nor the after_failure stage has any influence on the build result. Travis have plans to change this behaviour.
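The rules above can be summarised in a small Ruby sketch. It is an illustration of the described semantics only, not code from this repository or from Travis CI. The names build_status and broken?, the input format (a hash from stage name to the exit codes of its steps) and the :passed status for the remaining case are assumptions.

```ruby
# Stages whose failure marks the build as errored (per the text above).
SETUP_STAGES = %w[before_install install before_script].freeze

# Hypothetical sketch: derive a build status from per-stage exit codes.
# exit_codes maps a stage name to the list of exit codes of its steps.
def build_status(exit_codes)
  failed = ->(stage) { exit_codes.fetch(stage, []).any? { |code| code != 0 } }

  return :errored if SETUP_STAGES.any? { |stage| failed.call(stage) }
  return :failed  if failed.call('script')

  # after_success and after_failure are deliberately ignored here.
  :passed # assumed name for the non-broken case
end

# broken <- (errored|failed)
def broken?(status)
  %i[errored failed].include?(status)
end

puts build_status('install' => [0], 'script' => [0, 1, 0])
```

The script stage is modelled as a list of exit codes because all of its steps run even after one of them fails; the build is marked as failed if any of them is non-zero.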