[ENG-4380][ENG-4382] iri-based ingest and search (#806)
* wip: index-card-search api

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* less wip

* less wip...

* api/v3

* wip

* wip

* use ELASTIC_PASSWORD consistently

* fix: create "search" alias on initial setup

* wip

* wip

* wip

* wip

* wip-migrations

* wipwip

* wip (with piri tests)

* wip (partial digestion)

* wip (vocab, browse, sharev2)

* wip (derive)

* wipwip

* wip

* wip

* wip (with Indexcard)

* wip (with trove_iris index)

* wip

* wip (no openended, no fuzzy highlight)

* wip (trove_iris cardsearch)

* wip

* wip

* wip

* drop models: IngestJob, RegulatorLog

* wip

* wip (broken)

* wip (less broken)

* wip (identifier-based index)

* wip (populate identifier-based index)

* wip (deletable indexcards)

* fix: no break on no query

* wip (fix path indexing)

* wipwip

* wip (multi-phase backfill)

* wipwip

* wip (working value-search)

* wip (stabler sharev2_elastic backcompat)

* dockerfile: pin to bullseye

* restore unsafe settings, for the moment

* wip (valuesearch results as indexcards)

* text not iri

* osfmappy property paths

* legacy sharev2 blank node hackery

* slightly better sharev2-blanknode hackery

* pay a toll for lack of tests

* enable bingest

* reduce deadlock likelihood

* avoid mismatched iri

* smaller valuesearch

* legacy-sharev2 subjects

* fix broken harvest

* better search-admin experience

* shorter paths; capped keywords

* cardsearch sort

* fix schedule_harvests error

* is ok to search a still-filling index

* add periodic tasks back to admin

* index both full-iri and suff-uniq paths

* fix schedule_harvests (again)

* fix: handle multi-word queries

* add 'api/v3/' synonym to 'trove/'

* more focused default text search

* cleaner legacy_sharev2 extract

* osfmap updates

* loosen query-param name restrictions

* tidy up

* fix local card_iri

* harvester.key -> harvester_key

* remove SHARE_API_URL setting

* add suid info to indexed indexcard

* osfmap metadata update (subjects, collection)

* accept "sort=-relevance" same as no sort

* count skos labels as labels

* avoid duplicate indexcards

* add backfill rate to admin/search-indexes

* static-ish relatedPropertysearch

* suggest property iris in consistent order

* fix valueSearchText query

* remove unhelpful expense

* allow prefixed shortnames for iri params

* tidy

* make relatedPropertySearch to-one

* pagination links

* fix relatedPropertySearch (add intermediate search-result)

* relatedProperties (not relatedPropertySearch)

* filterValueSet with value info

* better handling unknown filter value

* daemon debug logs

* tidy digestive messaging

* keep the ontology special-case special

* plac8 flake8

* nicer rdf in admin interface

* fix: deletion via legacy is_deleted

* fix some tests

* fix legacy_sharev2 extract

* fix legacy_sharev2 extract for real this time

* fix: digestive_tract.extract return value

* better handle "sort by relevance to nothing"

* stop causing index refresh during backfill

* fix failing tests (and pull some threads)

* fix some tests (wip)

* add counts for related properties

* daemon never lived

* remove debugging change

* fix sharev2_elastic5 tests(?)

* wip/tmp

* support valuesearch on date property

* cardSearchFilter[date]=2023

* reverse order for value-search on dates

* stop mis-mapping dateWithdrawn

* value-search on date: exclude empty years

* fix: deleting records for a suid includes cards

* add display label to osfmap properties

* add filter operators: is-present, is-absent

e.g. `cardSearchFilter[funder][is-present]`

* stop putting osfmap types on legacy data

* always show queue size/rate in admin/search-indexes

* remove dead/unused code and concepts

* related-property-path

* set label label to "displayLabel"

* fix: make sure deleted indexcards are de-indexed

* better de-index of data for disabled sourceconfig

* de-index items without title/name/label

* fix: sharev2_elastic withdrawn should be bool

* fix: fill sharev2_elastic `type` for rdf-ingested cards

* skip storing all child tasks -- can be a long list

* tidier message logging

* sharev2_elastic: |-delimited subject lineage

* fix: include redundant references in jsonld

* lil niceties

* remove hasVersion from suggested properties

* prevent duplicates between v2_push and rdf tracts

* indexcard-based oaipmh

* fix: allow empty operator (`cardSearchFilter[foo][]`)

* fix: separate old (sharev2-push) and new (trove-rdf)

* admin: display raw datum in <pre>

* fix: accept fileName as sufficient for indexing

* fix: actually update datestamp when a RawDatum is seen again

* add "affiliation" to suggested preprint properties

* preprint affiliations are implicit via creator

* fix tests

* skip indexing invalid dates

* fix sentry `capture_message` usage

* fix more tests

* add search_params tests

* support "nonurgent" query param on ingest
aaxelb committed Aug 22, 2023
1 parent 880fa02 commit 39f4e9b
Showing 161 changed files with 13,792 additions and 2,528 deletions.
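
The commit log above sketches the shape of the new iri-based search API: a `trove/` route tree (with `api/v3/` as a synonym), `cardSearchFilter[...]` parameters that accept operators like `is-present`/`is-absent` and date values like `cardSearchFilter[date]=2023`, and a `sort` parameter that treats `-relevance` the same as no sort. As a rough illustration only -- the exact route name, the `cardSearchText` parameter, and the response shape are assumptions not confirmed by this diff -- a card-search request against a local deployment might look like:

```
# hypothetical query against the new index-card-search route; parameter
# spellings come from the commit messages above, everything else is assumed
curl -G 'http://localhost:8003/trove/index-card-search' \
  --data-urlencode 'cardSearchText=open science' \
  --data-urlencode 'cardSearchFilter[date]=2023' \
  --data-urlencode 'cardSearchFilter[funder][is-present]=' \
  --data-urlencode 'sort=-relevance'
```
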
2 changes: 1 addition & 1 deletion .docker-compose.env
@@ -7,7 +7,7 @@ LOGIN_REDIRECT_URL=http://localhost:8003/
OSF_API_URL=http://localhost:8000
RABBITMQ_HOST=rabbitmq
RABBITMQ_PORT=5672
SHARE_API_URL=http://web:8000/
SHARE_WEB_URL=http://localhost:8003/
LOG_LEVEL=INFO

# allow localhost and the localhost-loopback ip used by local osf
2 changes: 2 additions & 0 deletions ARCHITECTURE.md
@@ -1,4 +1,6 @@
# Architecture of SHARE/Trove
> NOTE: this document requires update (big ol' TODO)

This document is a starting point and reference to familiarize yourself with this codebase.

2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.10-slim as app
FROM python:3.10-slim-bullseye as app

RUN apt-get update \
&& apt-get install -y \
31 changes: 0 additions & 31 deletions api/ingestjobs/serializers.py

This file was deleted.

7 changes: 0 additions & 7 deletions api/ingestjobs/urls.py

This file was deleted.

17 changes: 0 additions & 17 deletions api/ingestjobs/views.py

This file was deleted.

2 changes: 1 addition & 1 deletion api/migrations/0001_ensure_application.py
@@ -20,7 +20,7 @@ def ensure_application(apps, schema_editor):
class Migration(migrations.Migration):

dependencies = [
('oauth2_provider', '0002_08_updates'),
('oauth2_provider', '0001_initial'),
('share', '0061_ensure_auto_users'),
]

48 changes: 22 additions & 26 deletions api/normalizeddata/views.py
@@ -1,24 +1,22 @@
import logging
import json

from django.db import transaction
from django.urls import reverse
from rest_framework import status
from rest_framework import generics
from rest_framework.exceptions import ValidationError
from rest_framework.response import Response
import sentry_sdk

from share import models
from share.tasks import ingest
from share.util import IDObfuscator
from share import models as share_db
from share.util.graph import MutableGraph
from share.util.osf import guess_osf_guid
from share.ingest.ingester import Ingester
from api.base.views import ShareViewSet
from api.normalizeddata.serializers import BasicNormalizedDataSerializer
from api.normalizeddata.serializers import FullNormalizedDataSerializer
from api.pagination import CursorPagination
from api.permissions import ReadOnlyOrTokenHasScopeOrIsAuthenticated
from trove import digestive_tract


logger = logging.getLogger(__name__)
@@ -66,9 +64,18 @@ def get_serializer_class(self):
return BasicNormalizedDataSerializer

def get_queryset(self):
return models.NormalizedData.objects.all()
return share_db.NormalizedData.objects.all()

def create(self, request, *args, **kwargs):
if share_db.FeatureFlag.objects.flag_is_up(share_db.FeatureFlag.IGNORE_SHAREV2_INGEST):
return Response({
'errors': [
{'detail': (
'this route was deprecated and has been removed'
f' (use {reverse("trove:ingest-rdf")} instead)'
)},
],
}, status=status.HTTP_410_GONE)
try:
return self._do_create(request, *args, **kwargs)
except Exception:
@@ -86,27 +93,16 @@ def _do_create(self, request, *args, **kwargs):
suid = guess_osf_guid(MutableGraph.from_jsonld(data))
if not suid:
raise ValidationError("'suid' is a required attribute")

with transaction.atomic():
# Hack for back-compat: Ingest halfway synchronously, then apply changes asynchronously
ingester = Ingester(data, suid).as_user(request.user).ingest(
pls_format_metadata=False,
)
ingester.job.reschedule(claim=True)

nd_id = models.NormalizedData.objects.filter(
raw=ingester.raw,
ingest_jobs=ingester.job
).order_by('-created_at').values_list('id', flat=True).first()

async_result = ingest.delay(job_id=ingester.job.id, urgent=True)

# TODO Use an actual serializer
_task_id = digestive_tract.swallow__sharev2_legacy(
from_user=request.user,
record=json.dumps(data, sort_keys=True),
record_identifier=suid,
transformer_key='v2_push',
urgent=True,
)
return Response({
'id': IDObfuscator.encode_id(nd_id, models.NormalizedData),
'type': 'NormalizedData',
'attributes': {
'task': async_result.id,
'ingest_job': request.build_absolute_uri(reverse('api:ingestjob-detail', args=[IDObfuscator.encode(ingester.job)])),
}
'task': _task_id,
},
}, status=status.HTTP_202_ACCEPTED)
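
For context on how clients hit this view after the change: a hedged sketch follows, where the endpoint path, auth header, and payload file are assumptions (the diff shows only the view body), but the status codes come straight from the code above.

```
# hypothetical push to the legacy normalizeddata route; expect HTTP 202 with a
# body whose "attributes" carry the digestive_tract task id, or HTTP 410 once
# the IGNORE_SHAREV2_INGEST feature flag is up (use trove:ingest-rdf instead)
curl -i -X POST 'http://localhost:8003/api/v2/normalizeddata/' \
  -H 'Authorization: Bearer THISISMYACCESSTOKEN' \
  -H 'Content-Type: application/vnd.api+json' \
  -d @normalized-datum.json
```
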
4 changes: 2 additions & 2 deletions api/search/views.py
@@ -39,10 +39,10 @@ def _handle_request(self, request):
except exceptions.IndexStrategyError as error:
raise http.Http404(str(error))
try:
response_json = specific_index.pls_handle_query__sharev2_backcompat(
response_json = specific_index.pls_handle_search__sharev2_backcompat(
request_body=request.data,
request_queryparams=queryparams,
)
return Response(data=response_json, headers={'Content-Type': 'application/json'})
except (exceptions.IndexStrategyError, NotImplementedError) as error:
return Response(status=418, data=str(error)) # TODO
return Response(status=400, data=str(error)) # TODO
1 change: 0 additions & 1 deletion api/urls.py
@@ -10,7 +10,6 @@
url('^$', RootView.as_view()),
url('^', include('api.banners.urls')),
url('^', include('api.formattedmetadatarecords.urls')),
url('^', include('api.ingestjobs.urls')),
url('^', include('api.normalizeddata.urls')),
url('^', include('api.rawdata.urls')),
url('^', include('api.sourceregistrations.urls')),
2 changes: 1 addition & 1 deletion api/views/feeds.py
@@ -65,7 +65,7 @@ def get_object(self, request):

def items(self, obj):
try:
json_response = self._index_strategy.pls_handle_query__sharev2_backcompat(
json_response = self._index_strategy.pls_handle_search__sharev2_backcompat(
request_body=obj,
)
except IndexStrategyError:
16 changes: 9 additions & 7 deletions api/views/workflow.py
@@ -1,16 +1,14 @@
import jsonschema

from django.db import transaction
from django.urls import reverse

from rest_framework import views, status
from rest_framework.exceptions import ParseError
from rest_framework.parsers import JSONParser
from rest_framework.renderers import JSONRenderer
from rest_framework.response import Response

from share.util import IDObfuscator
from share.ingest.ingester import Ingester
from trove import digestive_tract

from api import v1_schemas
from api.authentication import APIV1TokenBackPortAuthentication
@@ -113,9 +111,13 @@ def post(self, request, *args, **kwargs):
except KeyError:
return Response({'errors': 'Canonical URI not found in uris.', 'data': prelim_data}, status=status.HTTP_400_BAD_REQUEST)

ingester = Ingester(prelim_data, doc_id).as_user(request.user, 'v1_push').ingest_async(urgent=True)

_task_id = digestive_tract.swallow__sharev2_legacy(
from_user=request.user,
record=prelim_data,
record_identifier=doc_id,
transformer_key='v1_push',
urgent=True,
)
return Response({
'task_id': ingester.async_task.id,
'ingest_job': request.build_absolute_uri(reverse('api:ingestjob-detail', args=[IDObfuscator.encode(ingester.job)])),
'task_id': _task_id,
}, status=status.HTTP_202_ACCEPTED)
6 changes: 3 additions & 3 deletions docker-compose.yml
@@ -68,9 +68,9 @@ services:
find . -type d -exec chmod 750 \{\} \;;
find . -type f -exec chmod 640 \{\} \;;
echo "Waiting for Elasticsearch availability";
until curl -s --cacert config/certs/ca/ca.crt https://elastic8:9200 | grep -q "missing authentication credentials"; do sleep 30; done;
until curl -s --cacert config/certs/ca/ca.crt https://elastic8:9200 | grep -q "missing authentication credentials"; do sleep 1; done;
echo "Setting kibana_system password";
until curl -s -X POST --cacert config/certs/ca/ca.crt -u "elastic:$${ELASTIC_PASSWORD}" -H "Content-Type: application/json" https://elastic8:9200/_security/user/kibana_system/_password -d "{\"password\":\"$${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 10; done;
until curl -s -X POST --cacert config/certs/ca/ca.crt -u "elastic:$${ELASTIC_PASSWORD}" -H "Content-Type: application/json" https://elastic8:9200/_security/user/kibana_system/_password -d "{\"password\":\"$${KIBANA_PASSWORD}\"}" | grep -q "^{}"; do sleep 1; done;
echo "All done!";
'
healthcheck:
@@ -90,7 +90,7 @@ services:
- elastic8_data_vol:/usr/share/elasticsearch/data
- elastic8_cert_vol:/usr/share/elasticsearch/config/certs/
environment:
- ELASTIC_PASSWORD=secretsecret
- ELASTIC_PASSWORD=${ELASTIC_PASSWORD:-secretsecret}
- node.name=singlenode
- cluster.initial_master_nodes=singlenode
- ES_JAVA_OPTS=-Xms512m -Xmx512m # raise memory limits a lil bit
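
The `${ELASTIC_PASSWORD:-secretsecret}` form has docker-compose read `ELASTIC_PASSWORD` from the host environment (or a `.env` file) and fall back to `secretsecret` when it is unset, matching the `$${ELASTIC_PASSWORD}` the setup container's curl loop already uses. For example (hypothetical password; `elastic8` is the service this compose file defines):

```
# override the elasticsearch password for a local run instead of editing the compose file
ELASTIC_PASSWORD=my-local-secret docker-compose up -d elastic8
```
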
60 changes: 30 additions & 30 deletions how-to/run-locally.md
@@ -46,14 +46,14 @@ a shell running within SHARE's environment in docker:
docker-compose run --rm --no-deps worker bash
```
this will open a bash prompt within a temporary `worker` container -- from here we can
use SHARE's python environment, including django's `manage.py` and SHARE's own `sharectl`
utility (defined in `share/bin/`)
run commands within SHARE's environment, including django's `manage.py` and SHARE's own
`sharectl` utility (defined in `share/bin/`)

from the docker shell, use django's `migrate` command to set up tables in postgres:
from within that worker shell, use django's `migrate` command to set up tables in postgres:
```
python manage.py migrate
```
and use `sharectl` to set up indexes in elasticsearch:
...and use `sharectl` to set up indexes in elasticsearch:
```
sharectl search setup --initial
```
@@ -64,10 +64,35 @@ all other services can now be started from the host machine (upping `worker` ups
docker-compose up -d worker
```

## using with local [osf.io](https://github.com/CenterForOpenScience/osf.io)
0. [set up your local osf with docker](https://github.com/CenterForOpenScience/osf.io/blob/HEAD/README-docker-compose.md), if you haven't already
1. in a SHARE container, run `python manage.py add_local_osf_user` and copy the access token from the output.
```
# python manage.py add_local_osf_user
added user "my-local-osf" for local osf
access-token: THISISMYACCESSTOKENITISLONGANDINSCRUTABLEANDSECRET
```
2. add settings to your local osf's `website/settings/local.py`, including the access token from step 1:
```
SHARE_ENABLED = True
SHARE_PROVIDER_PREPEND = 'local'
SHARE_URL = 'http://192.168.168.167:8003/'
SHARE_API_TOKEN = 'THISISMYACCESSTOKENITISLONGANDINSCRUTABLEANDSECRET'
```
(you may need to restart osf services that use these settings)
3. use the osf admin interface at `http://localhost:8001` to connect osf providers (can skip this step if you're only interested in osf:Project records)
1. at `/provider_asset_files/create`, add a small icon (PNG or JPEG) with name `square_color_no_transparent` for the provider(s) you want
2. on each provider detail page (e.g. `/preprint_provider/<id>/`), click the "Setup Share Source" button
> TODO: streamline this process -- is the icon really necessary?
4. make things "public" on your local osf to start populating indexes


> TODO: make it easy to init a local deployment by harvesting data from share.osf.io
## handy commands

### start a shell in a container
there are several ways to open a shell with SHARE's python environment (including
there are several ways to open a shell with SHARE's environment (which has
django's `manage.py` and SHARE's own `sharectl` utility, defined in `share/bin/`)

if `worker` is already up, can open a shell within that container:
@@ -91,31 +116,6 @@ python manage.py shell_plus
## admin interface
http://localhost:8003/admin (username: "admin", password: "password")

## using with local [osf.io](https://github.com/CenterForOpenScience/osf.io)
0. [set up your local osf with docker](https://github.com/CenterForOpenScience/osf.io/blob/HEAD/README-docker-compose.md), if you haven't already
1. in a SHARE container, run `python manage.py add_local_osf_user` and copy the access token from the output.
```
# python manage.py add_local_osf_user
added user "my-local-osf" for local osf
access-token: THISISMYACCESSTOKENITISLONGANDINSCRUTABLEANDSECRET
```
2. add settings to your local osf's `website/settings/local.py`, including the access token from step 1:
```
SHARE_ENABLED = True
SHARE_PROVIDER_PREPEND = 'local'
SHARE_URL = 'http://192.168.168.167:8003/'
SHARE_API_TOKEN = 'THISISMYACCESSTOKENITISLONGANDINSCRUTABLEANDSECRET'
```
(you may need to restart osf services that use these settings)
3. use the osf admin interface at `http://localhost:8001` to connect osf providers (can skip this step if you're only interested in osf:Project records)
1. at `/provider_asset_files/create`, add a small icon (PNG or JPEG) with name `square_color_no_transparent` for the provider(s) you want
2. on each provider detail page (e.g. `/preprint_provider/<id>/`), click the "Setup Share Source" button
> TODO: streamline this process -- is the icon really necessary?
4. make things "public" on your local osf to start populating indexes


> TODO: once share.osf.io/oaipmh is reliable, make it easy to init a local deployment by harvesting data from there
## troubleshooting
- my containers keep mysteriously dying!
- does docker have enough memory? try giving it more
