[ENG-1987] indexer shenanigans #779

aaxelb · 2020-10-08T18:11:18Z

ok so the point of this is to make it so the elastic index can be populated directly from NormalizedData instead of AbstractCreativeWork, to make way for deleting ShareObject and all its kin (including AbstractCreativeWork) -- all without breaking the apps we've built that use the index directly (preprints and registries discover pages -- will be updated to use a new search api once that new api exists)

changes

docker-compose updates
- add indexer service
- move common environment variables to .docker-compose.env
- tell celery its app
share.search tidy-up
- cleaner indexer daemon threading, messaging
- remove unhelpfully redundant code
- move elastic mappings and fetching logic to IndexSetup classes
  - available IndexSetups added to entry_points in setup.py
  - allow configuring a different IndexSetup for each index
  - currently only one share_classic IndexSetup that just wraps
    existing behavior
- move index creation/deletion to ElasticManager
ensure useful suids for all datums
- require suids for pushed data
- for compatibility, infer suids for data pushed from OSF
- populate_osf_suids command to populate suids for existing OSF data
add FormattedMetadataRecord model
- each suid has one FMR for each supported metadata format (see the list in setup.py under entry_points => share.metadata_formats)
- the elasticsearch source document is now stored as an FMR during ingest, and the indexer daemon just pulls from that table (makes re-indexing much smoother)

todo

tests for ElasticManager
tests for IndexSetup(s)
(better) tests for SearchIndexerDaemon
postrend_backcompat IndexSetup that builds elastic documents from NormalizedData

felliott

Looking good! Just a few comments.

share/util/osf.py

felliott · 2020-10-11T20:48:26Z

share/util/osf.py

@@ -10,3 +14,33 @@ def osf_sources():
    ).exclude(
        user__username=settings.APPLICATION_USERNAME,
    )
+
+
+OSF_GUID_RE = re.compile(r'^https?://(?:[^.]+.)?osf.io/(?P<guid>[^/]+)/?$')


Nitpick 2: This may be intentional, but (?P<guid>[^/]+) will match an arbitrary number of non-/ characters. If you want to limit this to exactly 5, no more, no less, then you can replace the + with {5}: (?P<guid>[^/]{5})

yeah that was kinda intentional -- i don't know if "exactly 5" will always be set in stone (do we have a way to tell how much of valid-guid-space is used/left?) and it seemed safe enough to allow any length, as long as it's not a nested path

api/normalizeddata/views.py

share/ingest/ingester.py

- add `indexer` service - move common environment variables to `.docker-compose.env` - tell `celery` its app - update quay.io image names

- cleaner indexer daemon threading, messaging - remove unhelpfully redundant code - move elastic mappings and fetching logic to IndexSetup classes - available IndexSetups added to entry_points in setup.py - allow configuring a different IndexSetup for each index - currently only one `share_classic` IndexSetup that just wraps existing behavior - move index creation/deletion to ElasticManager - tidy up `settings.ELASTICSEARCH`

- require suids for pushed data - a `MutableGraph` can have a "central" node that the graph is "about" - for compatibility, infer suids for data pushed from OSF - `populate_osf_suids` command to populate suids for existing OSF data

- preserve full type lineage in `ShareV2SchemaType.type_lineage` - require specifying through-fields for many-to-many relations - allows traversing m2m relations in MutableGraph

- add `share.metadata_formats` entry points - start out with just `sharev2_elastic`, for back-compatible elasticsearch documents - allow getting a list of all entry points of a given namespace in share.util.extensions -- easy to loop through all formats - add `FormattedMetadataRecord` model - given a suid, easy to store its most recent normalized datum in all available metadata formats

an alternate path to share_classic -- will allow rending ShareObject while retaining back-compatibility for our discover pages

- python 3.6+, not 3.5 - other miscellany

- only run `ingest` on IngestJobs, not HarvestJobs - make status names in english match status names in python

felliott suggested changes Oct 14, 2020

View reviewed changes

aaxelb force-pushed the feature/eng-1987--indexer-redo branch 6 times, most recently from c1082e9 to 2c6763c Compare October 22, 2020 14:55

aaxelb force-pushed the feature/eng-1987--indexer-redo branch 3 times, most recently from 2bd6ae6 to b0dde85 Compare November 30, 2020 20:53

aaxelb marked this pull request as ready for review December 7, 2020 18:52

aaxelb changed the title ~~[wip][ENG-1987] indexer shenanigans~~ [ENG-1987] indexer shenanigans Dec 7, 2020

aaxelb added 9 commits December 7, 2020 14:22

docker-compose updates

efa5d83

- add `indexer` service - move common environment variables to `.docker-compose.env` - tell `celery` its app - update quay.io image names

suids for all datums

65013c0

- require suids for pushed data - a `MutableGraph` can have a "central" node that the graph is "about" - for compatibility, infer suids for data pushed from OSF - `populate_osf_suids` command to populate suids for existing OSF data

share.schema updates

a376947

- preserve full type lineage in `ShareV2SchemaType.type_lineage` - require specifying through-fields for many-to-many relations - allows traversing m2m relations in MutableGraph

mollify pip

41fef56

add postrend_backcompat index setup

1f0ed3d

an alternate path to share_classic -- will allow rending ShareObject while retaining back-compatibility for our discover pages

comment cleanup

d8330f2

- python 3.6+, not 3.5 - other miscellany

job cleanup

9c00ff0

- only run `ingest` on IngestJobs, not HarvestJobs - make status names in english match status names in python

aaxelb force-pushed the feature/eng-1987--indexer-redo branch from 9f9502a to 9c00ff0 Compare December 7, 2020 20:15

aaxelb merged commit caf2ad4 into CenterForOpenScience:develop Dec 8, 2020

aaxelb deleted the feature/eng-1987--indexer-redo branch December 8, 2020 19:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENG-1987] indexer shenanigans #779

[ENG-1987] indexer shenanigans #779

aaxelb commented Oct 8, 2020 •

edited

Loading

felliott left a comment

felliott Oct 11, 2020

aaxelb Oct 14, 2020

[ENG-1987] indexer shenanigans #779

[ENG-1987] indexer shenanigans #779

Conversation

aaxelb commented Oct 8, 2020 • edited Loading

changes

todo

felliott left a comment

Choose a reason for hiding this comment

felliott Oct 11, 2020

Choose a reason for hiding this comment

aaxelb Oct 14, 2020

Choose a reason for hiding this comment

aaxelb commented Oct 8, 2020 •

edited

Loading