Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENG-1987] indexer shenanigans #779

Merged

Conversation

aaxelb
Copy link
Contributor

@aaxelb aaxelb commented Oct 8, 2020

ok so the point of this is to make it so the elastic index can be populated directly from NormalizedData instead of AbstractCreativeWork, to make way for deleting ShareObject and all its kin (including AbstractCreativeWork) -- all without breaking the apps we've built that use the index directly (preprints and registries discover pages -- will be updated to use a new search api once that new api exists)

changes

  • docker-compose updates
    • add indexer service
    • move common environment variables to .docker-compose.env
    • tell celery its app
  • share.search tidy-up
    • cleaner indexer daemon threading, messaging
    • remove unhelpfully redundant code
    • move elastic mappings and fetching logic to IndexSetup classes
      • available IndexSetups added to entry_points in setup.py
      • allow configuring a different IndexSetup for each index
      • currently only one share_classic IndexSetup that just wraps
        existing behavior
    • move index creation/deletion to ElasticManager
  • ensure useful suids for all datums
    • require suids for pushed data
    • for compatibility, infer suids for data pushed from OSF
    • populate_osf_suids command to populate suids for existing OSF data
  • add FormattedMetadataRecord model
    • each suid has one FMR for each supported metadata format (see the list in setup.py under entry_points => share.metadata_formats)
    • the elasticsearch source document is now stored as an FMR during ingest, and the indexer daemon just pulls from that table (makes re-indexing much smoother)

todo

  • tests for ElasticManager
  • tests for IndexSetup(s)
  • (better) tests for SearchIndexerDaemon
  • postrend_backcompat IndexSetup that builds elastic documents from NormalizedData

Copy link
Member

@felliott felliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just a few comments.

share/util/osf.py Outdated Show resolved Hide resolved
@@ -10,3 +14,33 @@ def osf_sources():
).exclude(
user__username=settings.APPLICATION_USERNAME,
)


OSF_GUID_RE = re.compile(r'^https?://(?:[^.]+.)?osf.io/(?P<guid>[^/]+)/?$')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick 2: This may be intentional, but (?P<guid>[^/]+) will match an arbitrary number of non-/ characters. If you want to limit this to exactly 5, no more, no less, then you can replace the + with {5}: (?P<guid>[^/]{5})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that was kinda intentional -- i don't know if "exactly 5" will always be set in stone (do we have a way to tell how much of valid-guid-space is used/left?) and it seemed safe enough to allow any length, as long as it's not a nested path

api/normalizeddata/views.py Show resolved Hide resolved
share/ingest/ingester.py Outdated Show resolved Hide resolved
@aaxelb aaxelb force-pushed the feature/eng-1987--indexer-redo branch 6 times, most recently from c1082e9 to 2c6763c Compare October 22, 2020 14:55
@aaxelb aaxelb force-pushed the feature/eng-1987--indexer-redo branch 3 times, most recently from 2bd6ae6 to b0dde85 Compare November 30, 2020 20:53
@aaxelb aaxelb marked this pull request as ready for review December 7, 2020 18:52
@aaxelb aaxelb changed the title [wip][ENG-1987] indexer shenanigans [ENG-1987] indexer shenanigans Dec 7, 2020
- add `indexer` service
- move common environment variables to `.docker-compose.env`
- tell `celery` its app
- update quay.io image names
- cleaner indexer daemon threading, messaging
- remove unhelpfully redundant code
- move elastic mappings and fetching logic to IndexSetup classes
    - available IndexSetups added to entry_points in setup.py
    - allow configuring a different IndexSetup for each index
    - currently only one `share_classic` IndexSetup that just wraps
      existing behavior
- move index creation/deletion to ElasticManager
- tidy up `settings.ELASTICSEARCH`
- require suids for pushed data
- a `MutableGraph` can have a "central" node that the graph is "about"
- for compatibility, infer suids for data pushed from OSF
- `populate_osf_suids` command to populate suids for existing OSF data
- preserve full type lineage in `ShareV2SchemaType.type_lineage`
- require specifying through-fields for many-to-many relations
    - allows traversing m2m relations in MutableGraph
- add `share.metadata_formats` entry points
    - start out with just `sharev2_elastic`, for back-compatible
      elasticsearch documents
- allow getting a list of all entry points of a given namespace in
  share.util.extensions -- easy to loop through all formats
- add `FormattedMetadataRecord` model
    - given a suid, easy to store its most recent normalized datum in
      all available metadata formats
an alternate path to share_classic -- will allow rending ShareObject
while retaining back-compatibility for our discover pages
- python 3.6+, not 3.5
- other miscellany
- only run `ingest` on IngestJobs, not HarvestJobs
- make status names in english match status names in python
@aaxelb aaxelb force-pushed the feature/eng-1987--indexer-redo branch from 9f9502a to 9c00ff0 Compare December 7, 2020 20:15
@aaxelb aaxelb merged commit caf2ad4 into CenterForOpenScience:develop Dec 8, 2020
@aaxelb aaxelb deleted the feature/eng-1987--indexer-redo branch December 8, 2020 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants