Merge pull request #782 from aaxelb/docs/docker-quickstart
dock docs
aaxelb committed Feb 9, 2021
2 parents fec9537 + 8616361 commit f0d69a6
Showing 8 changed files with 260 additions and 22 deletions.
118 changes: 118 additions & 0 deletions README-docker-quickstart.md
@@ -0,0 +1,118 @@
# SHARE Quickstart or: How I Learned to Stop Worrying and Love the Dock

this guide sets up everything inside docker containers -- no dependencies
or python environments on your host machine.

## prerequisites
- [git](https://git-scm.com/)
- [docker](https://www.docker.com/) (including `docker-compose`)

## getting a local SHARE running

### 0. git the code
```
git clone https://github.com/CenterForOpenScience/SHARE.git share
```
the rest of this guide assumes your working directory is the SHARE repository root
(where the `docker-compose.yml` is):
```
cd ./share
```

### 1. download several bits
download docker images (depending on your internet connection, this may take a beat):
```
docker-compose pull
```
install python dependencies (in a shared docker volume):
```
docker-compose up requirements
```

### 2. structured data
there are two services that store persistent data: `postgres` and `elasticsearch`

let's start them from the host machine:
```
docker-compose up -d postgres elasticsearch
```

since we're not installing anything more on the host machine, it'll be useful to open
a shell running within SHARE's environment in docker:
```
docker-compose run --rm --no-deps worker bash
```
this will open a bash prompt within a temporary `worker` container -- from here we can
use SHARE's python environment, including django's `manage.py` and SHARE's own `sharectl`
utility (defined in `share/bin/`)

from the docker shell, use django's `migrate` command to set up tables in postgres:
```
python manage.py migrate
```
and use `sharectl` to set up indexes in elasticsearch:
```
sharectl search setup --initial
```
you will also need to initialize the "central taxonomy" of subjects:

```
python manage.py addsubjects
```

### 3. start 'em up
all other services can now be started (from the host machine):
```
docker-compose up -d rabbitmq worker web indexer frontend
```

## handy commands

### start a docker shell

this is the same command you ran in step 2:

```
docker-compose run --rm --no-deps worker bash
```

### start a django shell

this should be run inside the docker shell (see previous):

```
python manage.py shell_plus
```

## admin interface
http://localhost:8003/admin -- (admin/password)

## harvesting data
> TODO: once share.osf.io/oaipmh is reliable, make it easy to init a local deployment by harvesting data from there
> also TODO: put some thought into unconvoluting the whole harvest-scheduling, ingest-disabling system

for now, maybe grab a day of data from arxiv.org? at the moment, the `Source` needs to be marked
`canonical` for the system to ingest its data -- either:
- update it in the admin interface: http://localhost:8003/admin/share/source/
- update it from the django shell:
```
Source.objects.filter(name='org.arxiv').update(canonical=True)
```

next, choose a recent date, and start a harvest task for it from the docker shell:

```
sharectl schedule -t org.arxiv YYYY-MM-DD
```

you can watch its progress in several different ways:
- looking at task queues in the rabbitmq management interface at http://localhost:15673/ (guest/guest)
- following the `worker` container's logs: `docker-compose logs -f worker`
- checking the result count as you refresh the search interface at http://localhost:8003/share/discover
- watching `IngestJob` statuses update in the admin at http://localhost:8003/admin/share/ingestjob/ (admin/password)
- useful for debugging! if ingest fails, the `IngestJob` will contain the error type, message, and stack trace

## troubleshooting
- my containers keep mysteriously dying!
- does docker have enough memory? try giving it more
3 changes: 2 additions & 1 deletion docker-compose.yml
@@ -100,7 +100,8 @@ services:
- pip install -r requirements.txt -r dev-requirements.txt &&
(python3 -m compileall /usr/local/lib/python3.6 || true) &&
rm -Rf /python3.6/* &&
cp -Rf -p /usr/local/lib/python3.6 /
cp -Rf -p /usr/local/lib/python3.6 / &&
python3 setup.py develop
restart: 'no'
volumes:
- ./:/code:cached
3 changes: 2 additions & 1 deletion project/settings.py
@@ -308,7 +308,7 @@ def split(string, delim):
'no_ack': False, # WHY KOMBU THAT'S NOT HOW ENGLISH WORKS
},
# NOTE: "active" indexes will receive new records from the indexer daemon -- be sure they're set up first
'ACTIVE_INDEXES': split(os.environ.get('ELASTICSEARCH_ACTIVE_INDEXES', 'share_customtax_1'), ','),
'ACTIVE_INDEXES': split(os.environ.get('ELASTICSEARCH_ACTIVE_INDEXES', 'share_postrend_backcompat'), ','),
# NOTE: indexes here won't be created automatically -- run `sharectl search setup <index_name>` BEFORE the daemon starts
'INDEXES': {
'share_v3': {
@@ -541,6 +541,7 @@ def route_urgent_task(name, args, kwargs, options, task=None, **kw):
SUBJECTS_YAML = 'share/subjects.yaml'
SUBJECT_SYNONYMS_JSON = 'share/models/synonyms.json'

# if false, will skip disambiguation, building ChangeSets, and updating ShareObjects
SHARE_LEGACY_PIPELINE = os.environ.get('SHARE_LEGACY_PIPELINE', True)

HIDE_DEPRECATED_VIEWS = strtobool(os.environ.get('HIDE_DEPRECATED_VIEWS', 'False'))
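the `ACTIVE_INDEXES` line above works by splitting a comma-separated env var through the `split` helper defined earlier in `project/settings.py`. a standalone sketch of that pattern, with the new default index name (the real helper's exact behavior around empty segments is an assumption here):

```python
import os

def split(string, delim):
    # assumed shape of the settings helper: split on the delimiter,
    # dropping empty segments so a stray trailing comma is harmless
    return [s for s in string.split(delim) if s]

# mirrors the ACTIVE_INDEXES setting, including its new default
active_indexes = split(
    os.environ.get('ELASTICSEARCH_ACTIVE_INDEXES', 'share_postrend_backcompat'),
    ',',
)
```

so `ELASTICSEARCH_ACTIVE_INDEXES=share_v3,share_postrend_backcompat` would yield a two-element list, with the first entry treated as primary by `sharectl search setup --initial`.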
25 changes: 19 additions & 6 deletions share/bin/search.py
@@ -29,20 +29,33 @@ def search(args, argv):
@search.subcommand('Drop the Elasticsearch index')
def purge(args, argv):
"""
Usage: {0} search purge [options]
Options:
-i, --index=INDEX The name of the Elasticsearch index to use.
Usage: {0} search purge <index_names>...
"""
ElasticManager().delete_index(args.get('--index'))
for index_name in args['<index_names>']:
ElasticManager().delete_index(index_name)


@search.subcommand('Create indicies and apply mappings')
def setup(args, argv):
"""
Usage: {0} search setup <index_name>
{0} search setup --initial
"""
ElasticManager().create_index(args['<index_name>'])
is_initial = args.get('--initial')

if is_initial:
index_names = settings.ELASTICSEARCH['ACTIVE_INDEXES']
else:
index_names = [args['<index_name>']]

elastic_manager = ElasticManager()
for index_name in index_names:
print(f'creating elasticsearch index "{index_name}"...')
elastic_manager.create_index(index_name)

if is_initial:
primary_index = index_names[0]
elastic_manager.update_primary_alias(primary_index)


@search.subcommand('Start the search indexing daemon')
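the new `setup` subcommand branches on `--initial`: with the flag it creates every configured active index (and the first becomes the primary alias target), without it exactly the one index named on the command line. a toy mirror of just that branching, with `active_indexes` standing in for `settings.ELASTICSEARCH['ACTIVE_INDEXES']`:

```python
def resolve_index_names(args, active_indexes):
    # --initial: create every configured active index;
    # the caller then aliases the first one as primary
    if args.get('--initial'):
        return list(active_indexes)
    # otherwise: only the index named as a positional arg
    return [args['<index_name>']]
```

(`args` here imitates the docopt-style dict the subcommand receives.)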
2 changes: 1 addition & 1 deletion share/management/commands/addsubjects.py
@@ -32,4 +32,4 @@ def save_subjects(self, subjects):
} for s in subjects
]

Ingester(subjects).as_user(user).ingest()
Ingester(subjects, 'share/management/commands/addsubjects.py').as_user(user).ingest()
44 changes: 36 additions & 8 deletions share/search/elastic_manager.py
@@ -1,6 +1,7 @@
import logging

from elasticsearch import Elasticsearch, helpers as elastic_helpers
from elasticsearch.exceptions import NotFoundError

from django.conf import settings

@@ -39,19 +40,18 @@ def delete_index(self, index_name):
def create_index(self, index_name):
index_setup = self.get_index_setup(index_name)

if self.es_client.indices.exists(index_name):
raise ValueError(f'index already exists: {index_name}')

logger.debug('Ensuring Elasticsearch index %s', index_name)
self.es_client.indices.create(index_name, ignore=400)
self.es_client.indices.create(
index_name,
body={'settings': index_setup.index_settings},
)

logger.debug('Waiting for yellow status')
self.es_client.cluster.health(wait_for_status='yellow')

logger.info('Putting Elasticsearch settings')
self.es_client.indices.close(index=index_name)
try:
self.es_client.indices.put_settings(body=index_setup.index_settings, index=index_name)
finally:
self.es_client.indices.open(index=index_name)

logger.info('Putting Elasticsearch mappings')
for doc_type, mapping in index_setup.index_mappings.items():
logger.debug('Putting mapping for %s', doc_type)
@@ -81,3 +81,31 @@ def send_actions_sync(self, actions):

def refresh_indexes(self, index_names):
self.es_client.indices.refresh(index=','.join(index_names))

def update_primary_alias(self, primary_index_name):
alias = settings.ELASTICSEARCH['PRIMARY_INDEX']

previous_indexes = []

try:
existing_aliases = self.es_client.indices.get_alias(name=alias)
previous_indexes = list(existing_aliases.keys())
except NotFoundError:
pass

if previous_indexes == [primary_index_name]:
logger.info(f'index {primary_index_name} is already the primary')
return

logger.warn(f'removing aliases to {previous_indexes} and adding alias to {primary_index_name}')
delete_actions = [
{'remove': {'index': index_name, 'alias': alias}}
for index_name in previous_indexes
]
add_action = {'add': {'index': primary_index_name, 'alias': alias}}
self.es_client.indices.update_aliases(body={
'actions': [
*delete_actions,
add_action
],
})
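`update_primary_alias` boils down to building one atomic `update_aliases` body: a `remove` action for every index currently holding the alias, plus one `add` for the new primary. a standalone sketch of just that list-building, no elasticsearch client involved:

```python
def build_alias_actions(alias, previous_indexes, primary_index_name):
    # no-op when the alias already points (only) at the target index
    if previous_indexes == [primary_index_name]:
        return []
    # remove the alias from every index that currently holds it...
    delete_actions = [
        {'remove': {'index': index_name, 'alias': alias}}
        for index_name in previous_indexes
    ]
    # ...and add it to the new primary, all in one atomic request body
    add_action = {'add': {'index': primary_index_name, 'alias': alias}}
    return [*delete_actions, add_action]
```

sending removes and adds in a single `update_aliases` call is what keeps the swap atomic -- searchers never see a moment with no primary index.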
2 changes: 1 addition & 1 deletion share/tasks/jobs.py
@@ -332,7 +332,7 @@ def _consume_job(self, job, superfluous, force, apply_changes=True, index=True,
self._queue_for_indexing(job.suid, urgent)

# soon-to-be-rended ShareObject-based process:
if apply_changes:
if settings.SHARE_LEGACY_PIPELINE and apply_changes:
if graph is None:
graph = MutableGraph.from_jsonld(datum.data)
updated_work_ids = self._apply_changes(job, graph, datum)
85 changes: 81 additions & 4 deletions tests/share/search/test_elastic_manager.py
@@ -1,6 +1,7 @@
from unittest.mock import patch, call

import pytest
from elasticsearch.exceptions import NotFoundError

from share.search.elastic_manager import ElasticManager
from share.search.index_setup import ShareClassicIndexSetup, PostRendBackcompatIndexSetup
@@ -40,13 +41,15 @@ def test_get_index_setup(self, isolated_elastic_manager, index_name, expected_se
def test_create_index(self, isolated_elastic_manager, index_name):
index_setup = isolated_elastic_manager.get_index_setup(index_name)
mock_es_client = isolated_elastic_manager.es_client
mock_es_client.configure_mock(**{
'indices.exists.return_value': False,
})

isolated_elastic_manager.create_index(index_name)

mock_es_client.indices.create.assert_called_once_with(index_name, ignore=400)
mock_es_client.indices.put_settings.assert_called_once_with(
index=index_name,
body=index_setup.index_settings,
mock_es_client.indices.create.assert_called_once_with(
index_name,
body={'settings': index_setup.index_settings},
)
mock_es_client.indices.put_mapping.assert_has_calls([
call(
@@ -56,6 +59,19 @@ def test_create_index(self, isolated_elastic_manager, index_name):
) for doc_type, mapping in index_setup.index_mappings.items()
], any_order=True)

@pytest.mark.parametrize('index_name', [
'classic_index',
'postrend_index',
])
def test_create_index_already_exists(self, isolated_elastic_manager, index_name):
mock_es_client = isolated_elastic_manager.es_client
mock_es_client.configure_mock(**{
'indices.exists.return_value': True,
})

with pytest.raises(ValueError):
isolated_elastic_manager.create_index(index_name)

@pytest.mark.parametrize('index_name', [
'classic_index',
'postrend_index',
@@ -114,3 +130,64 @@ def test_refresh_indexes(self, isolated_elastic_manager, index_names, expected_a
isolated_elastic_manager.refresh_indexes(index_names)

mock_es_client.indices.refresh.assert_called_once_with(index=expected_arg)

@pytest.mark.parametrize('index_name', [
'classic_index',
'postrend_index',
])
def test_initial_update_primary_alias(self, isolated_elastic_manager, index_name, settings):
alias_name = settings.ELASTICSEARCH['PRIMARY_INDEX']
mock_es_client = isolated_elastic_manager.es_client
mock_es_client.configure_mock(**{
'indices.get_alias.side_effect': NotFoundError,
})

isolated_elastic_manager.update_primary_alias(index_name)

mock_es_client.indices.get_alias.assert_called_once_with(name=alias_name)
mock_es_client.indices.update_aliases.assert_called_once_with(
body={'actions': [
{'add': {'index': index_name, 'alias': alias_name}}
]}
)

@pytest.mark.parametrize('index_name', [
'classic_index',
'postrend_index',
])
def test_update_primary_alias(self, isolated_elastic_manager, index_name, settings):
alias_name = settings.ELASTICSEARCH['PRIMARY_INDEX']
mock_es_client = isolated_elastic_manager.es_client
mock_es_client.configure_mock(**{
'indices.get_alias.return_value': {
'old_primary': {'alias': alias_name},
},
})

isolated_elastic_manager.update_primary_alias(index_name)

mock_es_client.indices.get_alias.assert_called_once_with(name=alias_name)
mock_es_client.indices.update_aliases.assert_called_once_with(
body={'actions': [
{'remove': {'index': 'old_primary', 'alias': alias_name}},
{'add': {'index': index_name, 'alias': alias_name}},
]}
)

@pytest.mark.parametrize('index_name', [
'classic_index',
'postrend_index',
])
def test_unnecessary_update_primary_alias(self, isolated_elastic_manager, index_name, settings):
alias_name = settings.ELASTICSEARCH['PRIMARY_INDEX']
mock_es_client = isolated_elastic_manager.es_client
mock_es_client.configure_mock(**{
'indices.get_alias.return_value': {
index_name: {'alias': alias_name},
},
})

isolated_elastic_manager.update_primary_alias(index_name)

mock_es_client.indices.get_alias.assert_called_once_with(name=alias_name)
mock_es_client.indices.update_aliases.assert_not_called()
