
Conversation

@gravesm gravesm commented Nov 18, 2016

This PR refactors the application to use Postgres as the backend for request data. The pipeline itself has also been rewritten for better log processing performance.

Documentation for the new pipeline is located in the docs_src/ folder.

Mike Graves added 18 commits September 29, 2016 11:27
This is a significant refactoring of the pipeline in preparation for
moving to Postgres. Much of the code has been simplified and removed
since we no longer need to deal with adding a summary collection. Some
further optimizations have been made to push expensive operations as far
down the pipeline as possible. This allows for more aggressive filtering
up front that drastically cuts down on the work being performed per
request.
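
The filter-first ordering can be pictured as a chain of generators. Here is a minimal sketch of the idea; the stage names, fields, and bot pattern are hypothetical, not the pipeline's actual code:

```python
import re

BOT_PATTERN = re.compile(r"bot|crawler|spider", re.I)

def parse(lines):
    # Cheap parse: pull out only the fields the filters need.
    for line in lines:
        ip, agent = line.split("\t", 1)
        yield {"ip": ip, "user_agent": agent}

def filter_bots(requests):
    # Aggressive up-front filtering: drop bot traffic before any
    # expensive per-request work happens.
    for req in requests:
        if not BOT_PATTERN.search(req["user_agent"]):
            yield req

def enrich(requests):
    # Expensive stage pushed to the end, so it sees far fewer requests.
    for req in requests:
        req["country"] = expensive_geoip(req["ip"])
        yield req

def expensive_geoip(ip):
    return "US"  # stand-in for a costly lookup

lines = ["1.2.3.4\tMozilla/5.0", "5.6.7.8\tGooglebot/2.1"]
for req in enrich(filter_bots(parse(lines))):
    print(req)
```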

Date filtering is now done within the pipeline itself, which previously
relied on grep to pre-filter the requests fed to it. This simplifies the
pipeline's execution.
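
A sketch of what an in-pipeline date filter might look like; the field name and timestamp format are assumptions:

```python
from datetime import datetime

def filter_dates(requests, start, end):
    # Replaces the old grep pre-filter: keep only requests in the window.
    for req in requests:
        when = datetime.strptime(req["time"], "%d/%b/%Y:%H:%M:%S")
        if start <= when < end:
            yield req

reqs = [{"time": "29/Sep/2016:11:27:00"}, {"time": "01/Jan/2015:00:00:00"}]
window = (datetime(2016, 1, 1), datetime(2017, 1, 1))
print(list(filter_dates(reqs, *window)))  # only the 2016 request survives
```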

The dogpile.cache package has been added to further improve performance
for operations that would benefit from caching.
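
For reference, dogpile.cache's built-in in-memory backend wraps a function with a decorator. A minimal example; the cached function here is a hypothetical stand-in:

```python
from dogpile.cache import make_region

# In-memory backend that ships with dogpile.cache.
region = make_region().configure("dogpile.cache.memory")

@region.cache_on_arguments()
def country_for_ip(ip):
    # Stand-in for an expensive per-request operation: computed once
    # per distinct ip, then served from the in-process cache.
    print("computing", ip)
    return "US"

country_for_ip("1.2.3.4")  # computes
country_for_ip("1.2.3.4")  # served from cache, no print
```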
This commit provides the framework for inserting documents into
postgres. A memory cache is provided through dogpile to improve
performance when running over large numbers of requests. After some
time, most documents, authors, and DLCs will be cached, and requests can
be generated without going to the database at all. This should vastly
improve performance for long-running pipelines, especially where
postgres lives elsewhere on the network.
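
A sketch of how such a cached lookup might work with psycopg2; the table and column names (documents, handle, id) and the connection URI are assumptions, not the actual schema:

```python
import psycopg2
from dogpile.cache import make_region

region = make_region().configure("dogpile.cache.memory")
conn = psycopg2.connect("postgresql://localhost/oastats")  # assumed URI
conn.autocommit = True

@region.cache_on_arguments()
def document_id(handle):
    # Get-or-create a document row; after warm-up, repeat handles
    # resolve from memory without a network round trip.
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM documents WHERE handle = %s", (handle,))
        row = cur.fetchone()
        if row:
            return row[0]
        cur.execute("INSERT INTO documents (handle) VALUES (%s) RETURNING id",
                    (handle,))
        return cur.fetchone()[0]
```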
This adds a CSV-formatted output suitable for passing to the Postgres
COPY command. Only the requests table is populated in this way; the
underlying documents and identity tables get updated during the run.
This is necessary in order to deal with foreign keys, but the bulk of
the inserts will be from requests.
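
The COPY path might look something like this with psycopg2; the requests columns shown are illustrative, not the actual schema:

```python
import csv
import io

import psycopg2

# Accumulate request rows as CSV in memory (or a temp file for big runs).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([42, "2016-09-29 11:27:00", "1.2.3.4", "USA"])
buf.seek(0)

conn = psycopg2.connect("postgresql://localhost/oastats")  # assumed URI
with conn, conn.cursor() as cur:
    # One bulk COPY instead of per-row INSERTs for the requests table.
    cur.copy_expert(
        "COPY requests (document_id, datetime, ip, country) FROM STDIN WITH CSV",
        buf)
```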
This commit addresses the various errors that can occur due to bad data
entering the pipeline. Errors are logged as warnings, the offending
request is dropped, and processing continues.
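
In generator-pipeline terms, the drop-and-continue behavior might look like this sketch (stage names hypothetical):

```python
import logging

log = logging.getLogger("pipeline")

def apply_safely(stage, requests):
    # Run one pipeline stage; a request with bad data is logged as a
    # warning and dropped, and the run keeps going.
    for req in requests:
        try:
            yield stage(req)
        except (KeyError, ValueError) as exc:
            log.warning("Dropping request %r: %s", req, exc)
```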
The command line will now pull the database URI from an environment
variable called `OASTATS_DATABASE` if available.
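
A minimal sketch of that lookup; the fallback URI shown is only a placeholder, not the project's actual default:

```python
import os

# Prefer OASTATS_DATABASE when set; the fallback here is hypothetical.
database_uri = os.environ.get("OASTATS_DATABASE",
                              "postgresql://localhost/oastats")
```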
This adds full support for generating the summary collection from
postgres.
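
The summary step presumably rolls requests up per document. A purely illustrative example of that kind of aggregation, reusing the assumed schema from above (the real summary shape isn't spelled out in this PR description):

```python
# Illustrative rollup only; table, column, and field names are assumptions.
SUMMARY_SQL = """
SELECT d.handle,
       COUNT(*)                  AS downloads,
       COUNT(DISTINCT r.country) AS countries
FROM requests r
JOIN documents d ON d.id = r.document_id
GROUP BY d.handle
"""
```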
Some of the dates in the logs are malformed but still get through the
lazy match. arrow doesn't seem to care, so this change should filter out
the malformed dates.
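
One way to do that strict filtering with arrow, assuming Apache-style log timestamps (the exact format tokens are an assumption):

```python
import arrow

def parse_date(raw):
    # Require the full Apache-style timestamp layout; anything else is
    # treated as malformed and dropped by the caller.
    try:
        return arrow.get(raw, "DD/MMM/YYYY:HH:mm:ss Z")
    except (arrow.parser.ParserError, ValueError):
        return None

print(parse_date("29/Sep/2016:11:27:00 -0400"))  # parses
print(parse_date("29/Sep/201:11:27:00 -0400"))   # None: malformed year
```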
Author or DLC identities may be encountered that are missing one of
their two fields. The documents associated with these should still be
added, but the incomplete identities should be thrown out and corrected
later.
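
A sketch of the validation, with hypothetical names for the two required fields:

```python
def complete(identity, fields=("name", "mitid")):
    # Hypothetical field names; an identity missing either field is
    # thrown out while the document itself is still inserted.
    return all(identity.get(f) for f in fields)

request = {"handle": "1721.1/12345",
           "authors": [{"name": "Doe, J.", "mitid": "12345"},
                       {"name": "Roe, R."}]}  # incomplete: dropped
request["authors"] = [a for a in request["authors"] if complete(a)]
```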
This adds a load command that will take the requests data currently in
mongo and dump it into the new postgres data model. This is single-use
functionality that can be removed once the migration is complete.
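
The load step might look roughly like this with pymongo and psycopg2 (database, collection, field, and table names are all assumed):

```python
import psycopg2
from pymongo import MongoClient

mongo = MongoClient()["oastats"]                         # assumed DB name
pg = psycopg2.connect("postgresql://localhost/oastats")  # assumed URI

with pg, pg.cursor() as cur:
    for req in mongo.requests.find():
        # Field and column names are illustrative only.
        cur.execute(
            "INSERT INTO requests (datetime, ip, country) VALUES (%s, %s, %s)",
            (req.get("time"), req.get("ip"), req.get("country")))
```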
@coveralls

Coverage Status

Coverage decreased (-24.5%) to 69.412% when pulling c805170 on roast into f846671 on master.

@gravesm gravesm merged commit d2d6d3d into master Nov 18, 2016
@gravesm gravesm deleted the roast branch November 18, 2016 19:00