
Conversation

@gravesm gravesm commented Nov 18, 2016

This PR refactors the application to use Postgres as the backend for request data. The pipeline itself has also been rewritten for better log processing performance.

Documentation for the new pipeline is located in the docs_src/ folder.

Mike Graves added 18 commits September 29, 2016 11:27
This is a significant refactoring of the pipeline in preparation for
moving to Postgres. Much of the code has been simplified and removed
since we no longer need to deal with adding a summary collection. Some
further optimizations have been made to push expensive operations as far
down the pipeline as possible. This allows for more aggressive filtering
up front that drastically cuts down on the work being performed per
request.
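
The filter-first ordering can be pictured as a chain of generators. Here is a minimal sketch of the idea; the stage names, fields, and bot pattern are hypothetical, not the pipeline's actual code:

```python
import re

BOT_PATTERN = re.compile(r"bot|crawler|spider", re.I)

def parse(lines):
    # Cheap parse: pull out only the fields the filters need.
    for line in lines:
        ip, agent = line.split("\t", 1)
        yield {"ip": ip, "user_agent": agent}

def filter_bots(requests):
    # Aggressive up-front filtering: drop bot traffic before any
    # expensive per-request work happens.
    for req in requests:
        if not BOT_PATTERN.search(req["user_agent"]):
            yield req

def enrich(requests):
    # Expensive stage pushed to the end, so it sees far fewer requests.
    for req in requests:
        req["country"] = expensive_geoip(req["ip"])
        yield req

def expensive_geoip(ip):
    return "US"  # stand-in for a costly lookup

lines = ["1.2.3.4\tMozilla/5.0", "5.6.7.8\tGooglebot/2.1"]
for req in enrich(filter_bots(parse(lines))):
    print(req)
```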

Date filtering is now done within the pipeline itself, which previously
relied on grep to pre-filter the requests fed to it. This simplifies the
pipeline's execution.
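
A sketch of what an in-pipeline date filter might look like; the field name and timestamp format are assumptions:

```python
from datetime import datetime

def filter_dates(requests, start, end):
    # Replaces the old grep pre-filter: keep only requests in the window.
    for req in requests:
        when = datetime.strptime(req["time"], "%d/%b/%Y:%H:%M:%S")
        if start <= when < end:
            yield req

reqs = [{"time": "29/Sep/2016:11:27:00"}, {"time": "01/Jan/2015:00:00:00"}]
window = (datetime(2016, 1, 1), datetime(2017, 1, 1))
print(list(filter_dates(reqs, *window)))  # only the 2016 request survives
```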

The dogpile.cache package has been added to further improve performance
for operations that would benefit from caching.
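
For reference, dogpile.cache's built-in in-memory backend wraps a function with a decorator. A minimal example; the cached function here is a hypothetical stand-in:

```python
from dogpile.cache import make_region

# In-memory backend that ships with dogpile.cache.
region = make_region().configure("dogpile.cache.memory")

@region.cache_on_arguments()
def country_for_ip(ip):
    # Stand-in for an expensive per-request operation: computed once
    # per distinct ip, then served from the in-process cache.
    print("computing", ip)
    return "US"

country_for_ip("1.2.3.4")  # computes
country_for_ip("1.2.3.4")  # served from cache, no print
```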
This commit provides the framework for inserting documents into
postgres. A memory cache is provided through dogpile to improve
performance when running over large numbers of requests. After some
time, most documents, authors, and DLCs will be cached, and requests can
be generated without going to the database at all. This should vastly
improve performance for long-running pipelines, especially where
postgres lives elsewhere on the network.
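
A sketch of how such a cached lookup might work with psycopg2; the table and column names (documents, handle, id) and the connection URI are assumptions, not the actual schema:

```python
import psycopg2
from dogpile.cache import make_region

region = make_region().configure("dogpile.cache.memory")
conn = psycopg2.connect("postgresql://localhost/oastats")  # assumed URI
conn.autocommit = True

@region.cache_on_arguments()
def document_id(handle):
    # Get-or-create a document row; after warm-up, repeat handles
    # resolve from memory without a network round trip.
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM documents WHERE handle = %s", (handle,))
        row = cur.fetchone()
        if row:
            return row[0]
        cur.execute("INSERT INTO documents (handle) VALUES (%s) RETURNING id",
                    (handle,))
        return cur.fetchone()[0]
```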
This adds a CSV-formatted output suitable for passing to the Postgres
COPY command. Only the requests table is populated in this way; the
underlying documents and identity tables get updated during the run.
This is necessary in order to deal with foreign keys, but the bulk of
the inserts will be from requests.
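
The COPY path might look something like this with psycopg2; the requests columns shown are illustrative, not the actual schema:

```python
import csv
import io

import psycopg2

# Accumulate request rows as CSV in memory (or a temp file for big runs).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([42, "2016-09-29 11:27:00", "1.2.3.4", "USA"])
buf.seek(0)

conn = psycopg2.connect("postgresql://localhost/oastats")  # assumed URI
with conn, conn.cursor() as cur:
    # One bulk COPY instead of per-row INSERTs for the requests table.
    cur.copy_expert(
        "COPY requests (document_id, datetime, ip, country) FROM STDIN WITH CSV",
        buf)
```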
This commit addresses the various errors that can occur due to bad data
entering the pipeline. Errors are logged as warnings, the offending
request is dropped, and processing continues.
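
In generator-pipeline terms, the drop-and-continue behavior might look like this sketch (stage names hypothetical):

```python
import logging

log = logging.getLogger("pipeline")

def apply_safely(stage, requests):
    # Run one pipeline stage; a request with bad data is logged as a
    # warning and dropped, and the run keeps going.
    for req in requests:
        try:
            yield stage(req)
        except (KeyError, ValueError) as exc:
            log.warning("Dropping request %r: %s", req, exc)
```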
The command line will now pull the database URI from an environment
variable called `OASTATS_DATABASE` if available.
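
A minimal sketch of that lookup; the fallback URI shown is only a placeholder, not the project's actual default:

```python
import os

# Prefer OASTATS_DATABASE when set; the fallback here is hypothetical.
database_uri = os.environ.get("OASTATS_DATABASE",
                              "postgresql://localhost/oastats")
```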
This adds full support for generating the summary collection from
postgres.
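
The summary step presumably rolls requests up per document. A purely illustrative example of that kind of aggregation, reusing the assumed schema from above (the real summary shape isn't spelled out in this PR description):

```python
# Illustrative rollup only; table, column, and field names are assumptions.
SUMMARY_SQL = """
SELECT d.handle,
       COUNT(*)                  AS downloads,
       COUNT(DISTINCT r.country) AS countries
FROM requests r
JOIN documents d ON d.id = r.document_id
GROUP BY d.handle
"""
```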
Some of the dates in the logs are malformed but still get through the
lazy match. arrow doesn't seem to care, so this change should filter out
the malformed dates.
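
One way to do that strict filtering with arrow, assuming Apache-style log timestamps (the exact format tokens are an assumption):

```python
import arrow

def parse_date(raw):
    # Require the full Apache-style timestamp layout; anything else is
    # treated as malformed and dropped by the caller.
    try:
        return arrow.get(raw, "DD/MMM/YYYY:HH:mm:ss Z")
    except (arrow.parser.ParserError, ValueError):
        return None

print(parse_date("29/Sep/2016:11:27:00 -0400"))  # parses
print(parse_date("29/Sep/201:11:27:00 -0400"))   # None: malformed year
```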
Author or DLC identities may be encountered that are missing one of
their two fields. The documents associated with these should still be
added, but the incomplete identities should be thrown out and corrected
later.
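
A sketch of the validation, with hypothetical names for the two required fields:

```python
def complete(identity, fields=("name", "mitid")):
    # Hypothetical field names; an identity missing either field is
    # thrown out while the document itself is still inserted.
    return all(identity.get(f) for f in fields)

request = {"handle": "1721.1/12345",
           "authors": [{"name": "Doe, J.", "mitid": "12345"},
                       {"name": "Roe, R."}]}  # incomplete: dropped
request["authors"] = [a for a in request["authors"] if complete(a)]
```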
This adds a load command that will take the requests data currently in
mongo and dump it into the new postgres data model. This is single-use
functionality that can be removed once the migration is complete.
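
The load step might look roughly like this with pymongo and psycopg2 (database, collection, field, and table names are all assumed):

```python
import psycopg2
from pymongo import MongoClient

mongo = MongoClient()["oastats"]                         # assumed DB name
pg = psycopg2.connect("postgresql://localhost/oastats")  # assumed URI

with pg, pg.cursor() as cur:
    for req in mongo.requests.find():
        # Field and column names are illustrative only.
        cur.execute(
            "INSERT INTO requests (datetime, ip, country) VALUES (%s, %s, %s)",
            (req.get("time"), req.get("ip"), req.get("country")))
```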
@coveralls

Coverage Status

Coverage decreased (-24.5%) to 69.412% when pulling c805170 on roast into f846671 on master.

@gravesm gravesm merged commit d2d6d3d into master Nov 18, 2016
@gravesm gravesm deleted the roast branch November 18, 2016 19:00