Prerequisites: an Elasticsearch cluster to talk to (Docker or local).
To start the Docker Elasticsearch cluster, first raise the kernel limit on memory map areas, which Elasticsearch needs:

temporarily with `sysctl -w vm.max_map_count=262144`
or permanently by setting
`vm.max_map_count` to `262144` in `/etc/sysctl.conf`

then:

    docker-compose up

Finally, inside a virtualenv, install the Python dependencies: `pip install elasticsearch jupyter rstr`
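
If the containers exit shortly after starting, the kernel setting is a likely cause. The sketch below checks it from Python; it assumes a Linux host where the value is exposed at `/proc/sys/vm/max_map_count`, and the function names are illustrative only.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # the value quoted in the prerequisites


def is_sufficient(raw_value, minimum=REQUIRED_MAX_MAP_COUNT):
    """Return True if the text read from the sysctl file meets the minimum."""
    return int(raw_value.strip()) >= minimum


def current_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the live kernel value as text (assumes a Linux host)."""
    return Path(path).read_text()


# Usage on a Linux host:
#   is_sufficient(current_max_map_count())
```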

In [19]:
import json
from datetime import datetime
from uuid import uuid4
from copy import deepcopy

import fixtures

import elasticsearch
es1 = {'host': 'localhost', 'port': 9201}
es2 = {'host': 'localhost', 'port': 9202}
es = elasticsearch.Elasticsearch([es1, es2])

### Check we have redundancy:

Notice that successive requests are answered by both hosts (see the `name` field: `es01`, then `es02`).

In [2]:
for i in range(0, 9):
    print(es.info(pretty=True))

{'name': 'es01', 'cluster_name': 'es-docker-cluster', 'cluster_uuid': 'y2M8TuB0RhGkyhL5UZ_X6A', 'version': {'number': '7.10.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '51e9d6f22758d0374a0f3f5c6e8f3a7997850f96', 'build_date': '2020-11-09T21:30:33.964949Z', 'build_snapshot': False, 'lucene_version': '8.7.0', 'minimum_wire_compatibility_version': '6.8.0', 'minimum_index_compatibility_version': '6.0.0-beta1'}, 'tagline': 'You Know, for Search'}
{'name': 'es02', 'cluster_name': 'es-docker-cluster', 'cluster_uuid': 'y2M8TuB0RhGkyhL5UZ_X6A', 'version': {'number': '7.10.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '51e9d6f22758d0374a0f3f5c6e8f3a7997850f96', 'build_date': '2020-11-09T21:30:33.964949Z', 'build_snapshot': False, 'lucene_version': '8.7.0', 'minimum_wire_compatibility_version': '6.8.0', 'minimum_index_compatibility_version': '6.0.0-beta1'}, 'tagline': 'You Know, for Search'}
{'name': 'es01', 'cluster_name': 'es-docker-cluster', 'clu

Now stop one of the hosts (in terminal)

    docker stop es01

Then run the cell above again. This time every request is answered by es02, with no error surfacing in the notebook.
Afterwards, restart the container:

    docker start es01
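
The behaviour seen above can be pictured as round-robin host selection that skips hosts known to be down. The following is a toy model of that idea only, not the client's actual connection pool code; `pick_hosts` and its arguments are made up for illustration.

```python
from itertools import cycle


def pick_hosts(hosts, dead, requests):
    """Return the host chosen for each request, skipping hosts marked dead."""
    if all(host in dead for host in hosts):
        raise RuntimeError("no live hosts")
    rotation = cycle(hosts)
    chosen = []
    while len(chosen) < requests:
        host = next(rotation)
        if host not in dead:
            chosen.append(host)
    return chosen


# Both nodes up: requests alternate between them.
print(pick_hosts(["es01", "es02"], dead=set(), requests=4))
# es01 stopped: every request lands on es02.
print(pick_hosts(["es01", "es02"], dead={"es01"}, requests=4))
```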

In [39]:
# Clear up previous run
es.indices.delete('doaj-*') if es.indices.get('doaj-*') else print('Nothing to do')

{'acknowledged': True}

### Creating an index with custom dynamic mapping

The dynamic mappings have changed somewhat since ES 1.7. For reference, here's the old
default dynamic mapping:
```
'dynamic_templates': [
    {
        'default': {
            'match': '*', 'match_mapping_type': 'string', 'mapping': {
                'type': 'multi_field', 'fields': {
                    '{name}': {'type': '{dynamic_type}', 'index': 'analyzed', 'store': 'no'},
                    'exact': {'type': '{dynamic_type}', 'index': 'not_analyzed', 'store': 'yes'}
                }
            }
        }
    }
]
```
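The following gives us a near-equivalent `.exact` field of type **keyword**, which replaces the old `not_analyzed` string. One difference: the `lowercase` normalizer lowercases values in `.exact`, both at index time and in term-level queries, so matching on it is case-insensitive.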

Creating indexes explicitly also lets us give each one a shard count suited to its data size:
more shards for larger types to improve search performance. Replica counts would differ between dev and production.
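
One way to organise this is a small lookup that produces the `settings` block per type. This is only a sketch: the shard numbers, the type names other than `account` and `journal`, and the `production` flag are placeholders, not real DOAJ configuration.

```python
# Hypothetical sizing table; the numbers are placeholders for illustration.
SHARDS_BY_TYPE = {"account": 1, "journal": 2, "article": 4}
DEFAULT_SHARDS = 1


def index_settings(es_type, production=False):
    """Build the 'settings' part of a create-index body for one type."""
    return {
        "number_of_shards": SHARDS_BY_TYPE.get(es_type, DEFAULT_SHARDS),
        # Assumption: a single-node dev setup has nowhere to put replicas.
        "number_of_replicas": 1 if production else 0,
    }


print(index_settings("article", production=True))
```

The result can be dropped into the `'settings'` key of a create body such as the one in the next cell.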

In [40]:
CREATE_BODY = {
    'aliases': {
        'account': {}
    },
    'mappings': {
        'dynamic_templates': [
            {
                "strings": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "text",
                        "fields": {
                            "exact": {
                                "type": "keyword",
                                "normalizer": "lowercase"
                            }
                        }
                    }
                }
            }
        ]
    },
    'settings': {
        'number_of_shards': 4,
        'number_of_replicas': 1
    }
}

# todo: do we want to do a check on index init that it has the correct mappings?

# Use the create index api with the mapping
es.indices.create(index='doaj-account', body=CREATE_BODY)


{'acknowledged': True, 'shards_acknowledged': True, 'index': 'doaj-account'}
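
On the todo above about checking mappings at index init: a possible approach is to inspect the mapping the cluster reports. The helper below is a sketch that works on a plain dict shaped like the 7.x `es.indices.get_mapping` response (index name, then `mappings`, then `dynamic_templates`); treat that shape and the helper name as assumptions to verify against your client version.

```python
def has_exact_template(mapping_response, index):
    """True if a dynamic template maps strings to a field with a keyword 'exact' sub-field."""
    mappings = mapping_response.get(index, {}).get("mappings", {})
    for template in mappings.get("dynamic_templates", []):
        # Each template is a single-key dict: {template_name: rule}
        for rule in template.values():
            fields = rule.get("mapping", {}).get("fields", {})
            if (rule.get("match_mapping_type") == "string"
                    and fields.get("exact", {}).get("type") == "keyword"):
                return True
    return False


# Usage against a live cluster (not run here):
#   has_exact_template(es.indices.get_mapping(index='doaj-account'), 'doaj-account')
```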

### Put some data in the index

In [41]:
steve = {"api_key": uuid4().hex, "last_updated": "2021-04-27T09:49:11Z", "marketing_consent": False, "id": "steve", "role": ["admin", "api"], "created_date": "2014-09-10T15:53:50Z", "password": "pbkdf2:sha256:150000$o6pVxBxY$f8c25903211437b168af63b465c283942a9192f086fa77872a72cdaef0579c91", "email": "steve@example.com", "es_type": "account"}
bob =  {"api_key": uuid4().hex, "last_updated": "2021-04-27T09:49:11Z", "marketing_consent": False, "id": "bob", "role": ["publisher", "api"], "created_date": "2014-09-10T15:53:50Z", "password": "pbkdf2:sha256:150000$o6pVxBxY$f8c25903211437b168af63b465c283942a9192f086fa77872a72cdaef0579c91", "email": "bob@example.com", "es_type": "account"}

es.create(index='doaj-account', id='steve', body=steve)
es.create(index='doaj-account', id='bob', body=bob)

{'_index': 'doaj-account',
 '_type': '_doc',
 '_id': 'bob',
 '_version': 1,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 2, 'failed': 0},
 '_seq_no': 1,
 '_primary_term': 1}

In [42]:
# An additional create will cause a 409 conflict
try:
    es.create(index='doaj-account', id='steve', body=steve)
except elasticsearch.ConflictError as e:
    print(e)

ConflictError(409, 'version_conflict_engine_exception', '[steve]: version conflict, document already exists (current version [1])')


In [43]:
# Use es.index instead, which creates the document or overwrites an existing one
steve['last_updated'] = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
es.index(index='doaj-account', id='steve', body=steve)

{'_index': 'doaj-account',
 '_type': '_doc',
 '_id': 'steve',
 '_version': 2,
 'result': 'updated',
 '_shards': {'total': 2, 'successful': 2, 'failed': 0},
 '_seq_no': 2,
 '_primary_term': 1}

### Elasticsearch concurrency control (save validation)
https://www.elastic.co/guide/en/elasticsearch/reference/7.10/optimistic-concurrency-control.html

Ensure we reject a change when the document has been saved by someone else in the interim.

In [44]:
bob_retrieved = es.get('doaj-account', id='bob')
bob_retrieved

{'_index': 'doaj-account',
 '_type': '_doc',
 '_id': 'bob',
 '_version': 1,
 '_seq_no': 1,
 '_primary_term': 1,
 'found': True,
 '_source': {'api_key': 'b7ca4c9531fd46438ddffe91bbf38163',
  'last_updated': '2021-04-27T09:49:11Z',
  'marketing_consent': False,
  'id': 'bob',
  'role': ['publisher', 'api'],
  'created_date': '2014-09-10T15:53:50Z',
  'password': 'pbkdf2:sha256:150000$o6pVxBxY$f8c25903211437b168af63b465c283942a9192f086fa77872a72cdaef0579c91',
  'email': 'bob@example.com',
  'es_type': 'account'}}

In [45]:
# Bob's API key is updated before another user is finished updating Bob
bob_interjected = deepcopy(bob_retrieved).get('_source')
bob_interjected['api_key'] = uuid4().hex
es.index('doaj-account', id='bob', body=bob_interjected, if_seq_no=bob_retrieved['_seq_no'], if_primary_term=bob_retrieved['_primary_term'])

{'_index': 'doaj-account',
 '_type': '_doc',
 '_id': 'bob',
 '_version': 2,
 'result': 'updated',
 '_shards': {'total': 2, 'successful': 2, 'failed': 0},
 '_seq_no': 3,
 '_primary_term': 1}

In [46]:
# Then we try to carry on with our update of Bob, passing the sequence number and primary term we originally read
bob_retrieved['_source']['last_updated'] = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
try:
    es.index('doaj-account', id='bob', body=bob_retrieved['_source'], if_seq_no=bob_retrieved['_seq_no'], if_primary_term=bob_retrieved['_primary_term'])
except elasticsearch.ConflictError as e:
    print(e)

ConflictError(409, 'version_conflict_engine_exception', '[bob]: version conflict, required seqNo [1], primary term [1]. current document has seqNo [3] and primary term [1]')


In [47]:
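# Try again with the current sequence number and primary term.
# Note: this still writes our stale copy of _source, so the interjected api_key change is overwritten
# (see the api_key in the output). A real retry would re-apply our change to bob_uptodate['_source'].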
bob_uptodate = es.get('doaj-account', id='bob')
es.index('doaj-account', id='bob', body=bob_retrieved['_source'], if_seq_no=bob_uptodate['_seq_no'], if_primary_term=bob_uptodate['_primary_term'])
es.get('doaj-account', id='bob')

{'_index': 'doaj-account',
 '_type': '_doc',
 '_id': 'bob',
 '_version': 3,
 '_seq_no': 4,
 '_primary_term': 1,
 'found': True,
 '_source': {'api_key': 'b7ca4c9531fd46438ddffe91bbf38163',
  'last_updated': '2021-05-04T10:05:44Z',
  'marketing_consent': False,
  'id': 'bob',
  'role': ['publisher', 'api'],
  'created_date': '2014-09-10T15:53:50Z',
  'password': 'pbkdf2:sha256:150000$o6pVxBxY$f8c25903211437b168af63b465c283942a9192f086fa77872a72cdaef0579c91',
  'email': 'bob@example.com',
  'es_type': 'account'}}
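
A retry that preserves the other writer's change re-reads the document, re-applies only our own modification, and writes conditionally again. The sketch below shows that loop against an in-memory stand-in, so it runs without a cluster; `ToyStore`, `Conflict` and `update_with_retry` are invented for illustration, and the primary term is left out for brevity.

```python
class Conflict(Exception):
    """Stands in for elasticsearch.ConflictError in this toy model."""


class ToyStore:
    """In-memory stand-in for one index; tracks a sequence number per document."""

    def __init__(self):
        self._docs = {}
        self._next_seq = 0

    def get(self, doc_id):
        seq_no, source = self._docs[doc_id]
        return {"_seq_no": seq_no, "_source": dict(source)}

    def index(self, doc_id, source, if_seq_no=None):
        current = self._docs.get(doc_id)
        # Conditional write: reject if the stored sequence number has moved on.
        if if_seq_no is not None and (current is None or current[0] != if_seq_no):
            raise Conflict(doc_id)
        self._docs[doc_id] = (self._next_seq, dict(source))
        self._next_seq += 1


def update_with_retry(store, doc_id, change, attempts=3):
    """Re-read the document and re-apply `change` until the conditional write succeeds."""
    for _ in range(attempts):
        fresh = store.get(doc_id)
        source = fresh["_source"]
        change(source)
        try:
            store.index(doc_id, source, if_seq_no=fresh["_seq_no"])
            return source
        except Conflict:
            continue
    raise Conflict(doc_id)


store = ToyStore()
store.index("bob", {"api_key": "old", "last_updated": "t0"})
stale = store.get("bob")
# Someone else changes the API key while we hold the stale copy.
store.index("bob", {**stale["_source"], "api_key": "new"}, if_seq_no=stale["_seq_no"])
# Our update touches only last_updated, so the new API key survives.
result = update_with_retry(store, "bob", lambda doc: doc.update(last_updated="t1"))
print(result)
```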

In [48]:
# Match all search
es.search({'query': {'match_all': {}}}, index='doaj-account')


{'took': 2,
 'timed_out': False,
 '_shards': {'total': 4, 'successful': 4, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 2, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'doaj-account',
    '_type': '_doc',
    '_id': 'steve',
    '_score': 1.0,
    '_source': {'api_key': '0735e41b118d41fc98b2ac508f1b6148',
     'last_updated': '2021-05-04T10:05:25Z',
     'marketing_consent': False,
     'id': 'steve',
     'role': ['admin', 'api'],
     'created_date': '2014-09-10T15:53:50Z',
     'password': 'pbkdf2:sha256:150000$o6pVxBxY$f8c25903211437b168af63b465c283942a9192f086fa77872a72cdaef0579c91',
     'email': 'steve@example.com',
     'es_type': 'account'}},
   {'_index': 'doaj-account',
    '_type': '_doc',
    '_id': 'bob',
    '_score': 1.0,
    '_source': {'api_key': 'b7ca4c9531fd46438ddffe91bbf38163',
     'last_updated': '2021-05-04T10:05:44Z',
     'marketing_consent': False,
     'id': 'bob',
     'role': ['publisher', 'api'],
     'created_date': '2014-09-

### Pull account by API key

In [49]:
q = {
    'query': {
        'term': {'api_key.exact': steve['api_key']}
    }
}

es.search(q, index='doaj-account')

{'took': 3,
 'timed_out': False,
 '_shards': {'total': 4, 'successful': 4, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
  'max_score': 0.87546873,
  'hits': [{'_index': 'doaj-account',
    '_type': '_doc',
    '_id': 'steve',
    '_score': 0.87546873,
    '_source': {'api_key': '0735e41b118d41fc98b2ac508f1b6148',
     'last_updated': '2021-05-04T10:05:25Z',
     'marketing_consent': False,
     'id': 'steve',
     'role': ['admin', 'api'],
     'created_date': '2014-09-10T15:53:50Z',
     'password': 'pbkdf2:sha256:150000$o6pVxBxY$f8c25903211437b168af63b465c283942a9192f086fa77872a72cdaef0579c91',
     'email': 'steve@example.com',
     'es_type': 'account'}}]}}

### Aggregate on role - demonstrate on text field

In [50]:
q = {
    'query': {
        'match_all': {}
    },
    'aggs': {
        "by_role": {
            "terms": {
                "field": "role"
            }
        }
    }
}

try:
    es.search(q, index='doaj-account')
except elasticsearch.RequestError as e:
    print(e)

RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [role] in order to load field data by uninverting the inverted index. Note that this can use significant memory.')


In [51]:
# Try again with .exact field
q = {
    'query': {
        'match_all': {}
    },
    'aggs': {
        "by_role": {
            "terms": {
                "field": "role.exact"
            }
        }
    },
    'size': 0
}

es.search(q, index='doaj-account')

{'took': 9,
 'timed_out': False,
 '_shards': {'total': 4, 'successful': 4, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 2, 'relation': 'eq'},
  'max_score': None,
  'hits': []},
 'aggregations': {'by_role': {'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 0,
   'buckets': [{'key': 'api', 'doc_count': 2},
    {'key': 'admin', 'doc_count': 1},
    {'key': 'publisher', 'doc_count': 1}]}}}
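
The bucket list is usually easier to work with as a plain mapping. A minimal sketch, using a sample dict trimmed from the response above; `bucket_counts` is an illustrative helper, not part of the client.

```python
def bucket_counts(response, agg_name):
    """Flatten a terms aggregation into {key: doc_count}."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Trimmed copy of the search response shown above.
sample_response = {
    "aggregations": {
        "by_role": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "api", "doc_count": 2},
                {"key": "admin", "doc_count": 1},
                {"key": "publisher", "doc_count": 1},
            ],
        }
    }
}

print(bucket_counts(sample_response, "by_role"))
```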

### Index aliases for query simplification

An index alias can carry a filter, so that any search made through the alias only sees matching documents.
For example, we could have a 'view' on the DOAJ journal index that contains just the public
records.
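
Conceptually, searching through a filtered alias behaves like adding the alias filter to whatever query the caller sends. The sketch below builds that combined query by hand, purely to illustrate the idea; it is not how Elasticsearch implements aliases internally. The function name is made up, and the `admin.in_doaj` filter is the one used for `journal-public` in the cell that follows.

```python
def query_through_alias(body, alias_filter):
    """Return a search body that applies `alias_filter` on top of the caller's query."""
    return {
        **body,
        "query": {
            "bool": {
                # The caller's query still decides scoring...
                "must": [body.get("query", {"match_all": {}})],
                # ...while the alias filter only narrows the result set.
                "filter": [alias_filter],
            }
        },
    }


public_only = {"term": {"admin.in_doaj": True}}
print(query_through_alias({"query": {"match_all": {}}}, public_only))
```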

In [52]:
# Create an index for journals using the default dynamic mapping (for now)
J_INDEX = {
    'aliases': {
        'journal': {},
        'journal-public': {
            'filter': {
                'term': { 'admin.in_doaj': True }
            }
        }
    }
}
if not es.indices.exists('doaj-journal'):
    es.indices.create(index='doaj-journal', body=J_INDEX)

# Put a couple of journals that are in doaj and a couple out. This time we'll use bulk operations

js = fixtures.make_many_journal_sources(count=4, in_doaj=False)
for i in range(0, len(js)):
    js[i]['admin']['in_doaj'] = bool(i%2)

# Esprit has a to_bulk convenience function for this that we may wish to keep hold of.
bulk_instructions = [({'index': {'_id': j['id']}}, j) for j in js]
bulk_body = ''
for inst, data in bulk_instructions:
    bulk_body += json.dumps(inst) + '\n'
    bulk_body += json.dumps(data) + '\n'

es.bulk(bulk_body, 'journal')

{'took': 140,
 'errors': False,
 'items': [{'index': {'_index': 'doaj-journal',
    '_type': '_doc',
    '_id': 'journalid0',
    '_version': 1,
    'result': 'created',
    '_shards': {'total': 2, 'successful': 2, 'failed': 0},
    '_seq_no': 0,
    '_primary_term': 1,
    'status': 201}},
  {'index': {'_index': 'doaj-journal',
    '_type': '_doc',
    '_id': 'journalid1',
    '_version': 1,
    'result': 'created',
    '_shards': {'total': 2, 'successful': 2, 'failed': 0},
    '_seq_no': 1,
    '_primary_term': 1,
    'status': 201}},
  {'index': {'_index': 'doaj-journal',
    '_type': '_doc',
    '_id': 'journalid2',
    '_version': 1,
    'result': 'created',
    '_shards': {'total': 2, 'successful': 2, 'failed': 0},
    '_seq_no': 2,
    '_primary_term': 1,
    'status': 201}},
  {'index': {'_index': 'doaj-journal',
    '_type': '_doc',
    '_id': 'journalid3',
    '_version': 1,
    'result': 'created',
    '_shards': {'total': 2, 'successful': 2, 'failed': 0},
    '_seq_no': 3,
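
The bulk body assembled in the cell above could be wrapped in a small helper along the lines of the Esprit convenience function mentioned there. This is a sketch of the idea, not Esprit's actual implementation; the name `to_bulk` and the `id_field` parameter are assumptions.

```python
import json


def to_bulk(docs, id_field="id"):
    """Serialise documents as newline-delimited JSON for the bulk API ('index' actions)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_id": doc[id_field]}}))  # action line
        lines.append(json.dumps(doc))                                # source line
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


print(to_bulk([{"id": "journalid0", "admin": {"in_doaj": False}}]))
```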


In [53]:
# The unfiltered 'journal' alias returns all four journals; compare with 'journal-public' in the next cell,
# which exposes only the journals in doaj

es.search({'query': {'match_all': {}}}, index='journal')

{'took': 3,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 4, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'doaj-journal',
    '_type': '_doc',
    '_id': 'journalid0',
    '_score': 1.0,
    '_source': {'id': 'journalid0',
     'created_date': '2000-01-01T00:00:00Z',
     'last_manual_update': '2001-01-01T00:00:00Z',
     'last_updated': '2002-01-01T00:00:00Z',
     'admin': {'bulk_upload': 'bulk_1234567890',
      'current_application': 'qwertyuiop',
      'editor_group': 'editorgroup',
      'editor': 'associate',
      'in_doaj': False,
      'notes': [{'note': 'Second Note',
        'date': '2014-05-22T00:00:00Z',
        'id': '1234'},
       {'note': 'First Note', 'date': '2014-05-21T14:02:45Z', 'id': 'abcd'}],
      'owner': 'publisher',
      'related_applications': [{'application_id': 'asdfghjkl',
        'date_accepted': '2018-01-01T00:00:00Z'},
       {'application_id': 'zxcvbnm'}],
   

In [54]:
es.search({'query': {'match_all': {}}}, index='journal-public')

{'took': 4,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 2, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'doaj-journal',
    '_type': '_doc',
    '_id': 'journalid1',
    '_score': 1.0,
    '_source': {'id': 'journalid1',
     'created_date': '2000-01-01T00:00:00Z',
     'last_manual_update': '2001-01-01T00:00:00Z',
     'last_updated': '2002-01-01T00:00:00Z',
     'admin': {'bulk_upload': 'bulk_1234567890',
      'current_application': 'qwertyuiop',
      'editor_group': 'editorgroup',
      'editor': 'associate',
      'in_doaj': True,
      'notes': [{'note': 'Second Note',
        'date': '2014-05-22T00:00:00Z',
        'id': '1234'},
       {'note': 'First Note', 'date': '2014-05-21T14:02:45Z', 'id': 'abcd'}],
      'owner': 'publisher',
      'related_applications': [{'application_id': 'asdfghjkl',
        'date_accepted': '2018-01-01T00:00:00Z'},
       {'application_id': 'zxcvbnm'}],
    

### Analyzers, tokens, filters

In [None]:
### Journal mapping

In [None]:
# Old mapping

# New mapping

### Filtered query

``` json
# Old ISSN query
{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "terms": {
                                "index.issn.exact": [
                                    "1000-0000",
                                    "2000-0000"
                                ]
                            }
                        }
                    ]
                }
            }
        }
    }
}

# New ISSN query - "filtered" is removed and the filter clause moves inside a bool query

{
    "query": {
        "bool": {
            "filter": [
                {
                    "terms": {
                        "index.issn.exact": [
                            "1000-0000",
                            "2000-0000"
                        ]
                    }
                }
            ]
        }
    }
}
```
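
Because the change from `filtered` to `bool` is mechanical, existing query bodies could be converted in code. The following is a minimal sketch that handles only a top-level `filtered` query; the helper name is illustrative, and nested `filtered` clauses are deliberately left alone.

```python
def migrate_filtered(body):
    """Rewrite a top-level 'filtered' query as a 'bool' query; other bodies pass through."""
    query = body.get("query", {})
    if "filtered" not in query:
        return body
    old = query["filtered"]
    new_bool = {}
    if "query" in old:
        new_bool["must"] = [old["query"]]      # scored part
    if "filter" in old:
        new_bool["filter"] = [old["filter"]]   # non-scoring part
    return {**body, "query": {"bool": new_bool}}


old_issn_query = {
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"terms": {"index.issn.exact": ["1000-0000", "2000-0000"]}}
                    ]
                }
            }
        }
    }
}

print(migrate_filtered(old_issn_query))
```

The converted body keeps the inner `bool`/`must` wrapper from the old filter, which is still a valid query in filter context even though it could be flattened further.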