This repository was archived by the owner on Mar 12, 2020. It is now read-only.
Merged
This is a significant refactoring of the pipeline in preparation for moving to Postgres. Much of the code has been simplified or removed, since we no longer need to deal with adding a summary collection. Expensive operations have been pushed as far down the pipeline as possible, which allows more aggressive filtering up front and drastically cuts down on the work performed per request. Date filtering is now done within the pipeline; this simplifies execution, which previously relied on grep to feed requests into the pipeline. The dogpile.cache package has been added to further improve performance for operations that benefit from caching.
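The ordering described above can be sketched with plain generators. This is only an illustration under assumptions: the log line format (an ISO timestamp followed by a path) and the stage names `parse`, `in_range` and `resolve` are hypothetical and are not taken from the repository.

```python
from datetime import date, datetime


def parse(lines):
    """Cheap stage: split each line into a timestamp and a path (hypothetical format)."""
    for line in lines:
        stamp, _, path = line.partition(" ")
        yield {"time": datetime.fromisoformat(stamp), "path": path}


def in_range(requests, start, end):
    """Date filter applied inside the pipeline, before any expensive work."""
    for req in requests:
        if start <= req["time"].date() <= end:
            yield req


def resolve(requests, seen):
    """Stand-in for an expensive stage; records which requests reach it."""
    for req in requests:
        seen.append(req["path"])
        yield req


LINES = [
    "2015-02-27T23:59:59 /a",
    "2015-03-01T08:00:00 /b",
    "2015-03-02T09:30:00 /c",
]


def run(lines, start, end):
    seen = []
    out = list(resolve(in_range(parse(lines), start, end), seen))
    return out, seen
```

Because the filter sits ahead of `resolve`, requests outside the date range never reach the expensive stage.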
This commit provides the framework for inserting documents into Postgres. A memory cache is provided through dogpile to improve performance while running over large numbers of requests. After some time most documents, authors and DLCs will be cached, and requests can be generated without going to the database at all. This should vastly improve performance over long-running pipelines, especially where Postgres lives elsewhere on the network.
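A minimal sketch of the caching idea, under stated assumptions: the commit uses dogpile.cache against Postgres, but to stay self-contained this example swaps in a plain dictionary for the cache and an in-memory SQLite database for the backend. The `DocumentStore` class, the `documents` table and the `handle` column are hypothetical names.

```python
import sqlite3


class DocumentStore:
    """Get-or-create lookup with an in-memory cache in front of the database."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}   # stand-in for a dogpile memory region
        self.queries = 0  # counts lookups that actually hit the database
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents "
            "(id INTEGER PRIMARY KEY, handle TEXT UNIQUE)"
        )

    def get_or_create(self, handle):
        if handle in self.cache:
            return self.cache[handle]
        self.queries += 1
        row = self.conn.execute(
            "SELECT id FROM documents WHERE handle = ?", (handle,)
        ).fetchone()
        if row is None:
            cur = self.conn.execute(
                "INSERT INTO documents (handle) VALUES (?)", (handle,)
            )
            doc_id = cur.lastrowid
        else:
            doc_id = row[0]
        self.cache[handle] = doc_id
        return doc_id
```

Once a handle has been seen, later requests for the same document resolve from the cache and never touch the database, which is where the gain over a networked Postgres would come from.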
This adds a CSV-formatted output suitable for passing to the Postgres COPY command. Only the requests table is populated in this way; the underlying documents and identity tables get updated during the run. This is necessary in order to deal with foreign keys, but the bulk of the inserts will be from requests.
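As a rough sketch of what such output could look like, the function below writes one CSV row per request. The column names (`status`, `country`, `url`, `document_id`) are assumptions for illustration and may not match the actual requests table.

```python
import csv
import io


def requests_to_csv(requests, out):
    """Write requests as CSV rows; document_id is the already-resolved foreign key."""
    writer = csv.writer(out, lineterminator="\n")
    for req in requests:
        writer.writerow(
            [req["status"], req["country"], req["url"], req["document_id"]]
        )


buf = io.StringIO()
requests_to_csv(
    [{"status": 200, "country": "USA", "url": "/a,b", "document_id": 7}], buf
)
```

Output of this shape could then be loaded with something like `COPY requests (status, country, url, document_id) FROM STDIN WITH (FORMAT csv)`, again assuming those column names.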
This commit addresses the various errors that can occur due to bad data entering the pipeline. Errors are logged as warnings, the offending request is dropped and processing continues.
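The log-and-drop behavior can be sketched as a wrapper around a pipeline stage. The names `safe_stage` and `to_status`, and the choice of which exceptions count as bad data, are assumptions made for this example.

```python
import logging

logger = logging.getLogger("pipeline")


def safe_stage(func, requests):
    """Apply func to each request; log a warning and drop the request on bad data."""
    for req in requests:
        try:
            yield func(req)
        except (KeyError, ValueError) as exc:
            logger.warning("Dropping request %r: %s", req, exc)


def to_status(req):
    """Hypothetical stage that fails on a missing or non-numeric status."""
    return int(req["status"])
```

A single malformed request produces one warning and is skipped, while the rest of the stream continues through the pipeline.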
The command line will now pull the database URI from an environment variable called `OASTATS_DATABASE` if available.
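One way this lookup might work is shown below. The helper name `database_uri` is hypothetical, and the precedence (an explicit command-line value wins over the environment) is an assumption, not something the commit message states.

```python
import os


def database_uri(cli_value=None, environ=os.environ):
    """Return the explicit value if given, else OASTATS_DATABASE, else None."""
    return cli_value or environ.get("OASTATS_DATABASE")
```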
This adds full support for generating the summary collection from Postgres.
Some of the dates in the logs are malformed and get through the lazy match. arrow does not seem to reject them, so this change should filter out the malformed dates.
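A sketch of stricter date validation, assuming the logs use an Apache-style timestamp such as `01/Mar/2015:08:00:00 -0500`. That format is an assumption, and this example uses the standard library rather than arrow so that it stays self-contained.

```python
import re
from datetime import datetime

# Fully anchored via fullmatch, so trailing or leading junk is rejected.
STRICT = re.compile(r"\d{2}/[A-Z][a-z]{2}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}")


def parse_strict(raw):
    """Return a datetime for a well-formed timestamp, or None otherwise."""
    if not STRICT.fullmatch(raw):
        return None
    try:
        return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        # Right shape but impossible value, for example day 99.
        return None
```

Checking the shape first and the values second means a lenient parser never gets the chance to accept a mangled date.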
Author or DLC identities may be encountered that are missing one of their two fields. The documents associated with these should still be added, but the incomplete identities should be thrown out and corrected later.
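The rule above could look roughly like this. The two field names (`id` and `name`) and the document shape are hypothetical, since the commit message does not name the fields.

```python
def complete_identities(identities):
    """Keep only identities that have both fields populated."""
    return [i for i in identities if i.get("id") and i.get("name")]


def clean_document(doc):
    """Return a copy of the document with incomplete authors and DLCs removed."""
    cleaned = dict(doc)
    cleaned["authors"] = complete_identities(doc.get("authors", []))
    cleaned["dlcs"] = complete_identities(doc.get("dlcs", []))
    return cleaned
```

The document itself always survives; only the identities that cannot be keyed reliably are left out.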
This adds a load command that takes the requests data currently in Mongo and dumps it into the new Postgres data model. This is single-use functionality that can be removed once the migration is complete.
Closes #57.
This PR refactors the application to use Postgres as the backend for request data. The pipeline itself has also been rewritten for better log processing performance.
Documentation for the new pipeline is located in the `docs_src/` folder.