# DOAJ Elasticsearch / Indexing Presentation
Steve, Cottage Labs, 2022

Here I will cover the basics of our Elasticsearch document store, indexes in general, and how we use them and what that means for our system.

Elasticsearch runs on a separate machine from the DOAJ application. In production, my first step is to use the firewall to ensure only the right machines can talk to it. It stores all of our dynamic data, including our user accounts, which contain personally identifiable information such as email addresses. For that reason our machines are connected over a private network rather than over the internet.

I'll use Docker to run two Elasticsearch hosts on my laptop and connect to them with Python, so I can demonstrate what an index does for us.

In [2]:
import json
from datetime import datetime
from uuid import uuid4
from copy import deepcopy

import fixtures

import elasticsearch
es1 = {'scheme': 'http', 'host': 'localhost', 'port': 9201}
es2 = {'scheme': 'http', 'host': 'localhost', 'port': 9202}
es = elasticsearch.Elasticsearch([es1, es2])

In [9]:
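# Start from a clean slate: exists() returns a boolean, whereas get() raises
# NotFoundError when the index is missing
if es.indices.exists(index='doaj-es-demo'):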
    es.indices.delete(index='doaj-es-demo')
es.indices.create(index='doaj-es-demo')

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'doaj-es-demo'}

## An index is different from a database

A traditional database stores its data in tables. A table is much like a spreadsheet: it has column headers and a row for each record.

| ID  | Title                              | ISSN      |
|-----|------------------------------------|-----------|
| 001 | Journal of Spoons                  | 1234-5678 |
| 002 | Analytical Knitting                | 9876-5432 |
| 003 | Journal of Trivial Household Items | 1010-000X |
| 004 | Journal of Slow Growing Succulates | 3333-7777 |
| 005 | Humerous Taxidermy                 | 2468-1012 |

In a very basic database, you'd retrieve the full record with ISSN 9876-5432 by going through all of the rows, comparing the query with the value in the corresponding column. It's like reading the whole book to find the chapter you're interested in. Then you'd pull out the record by its ID.

Like the index in a book, a database index lets you search more efficiently by pointing directly from the information you're interested in to the ID of the record. There's less to search through, and the index can also be organised for efficient searching. For example, if the ISSNs are kept in sorted order you can do a binary search, halving the number of candidates with each comparison, which is much faster than looking through all of the records.
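To make the difference concrete, here is a minimal pure-Python sketch that needs no Elasticsearch at all. The names `scan_for_issn`, `issn_index` and `lookup_issn` are made up for this illustration, and a sorted list with `bisect` is only a stand-in for the more sophisticated structures a real database or Elasticsearch uses.

```python
from bisect import bisect_left

records = [
    {'ID': 1, 'Title': 'Journal of Spoons', 'ISSN': '1234-5678'},
    {'ID': 2, 'Title': 'Analytical Knitting', 'ISSN': '9876-5432'},
    {'ID': 3, 'Title': 'Journal of Trivial Household Items', 'ISSN': '1010-000X'},
]

def scan_for_issn(rows, issn):
    """Full scan: compare the query against every row until one matches."""
    for row in rows:
        if row['ISSN'] == issn:
            return row
    return None

# Build the index once: (ISSN, position in records) pairs, sorted by ISSN.
issn_index = sorted((row['ISSN'], pos) for pos, row in enumerate(records))
issn_keys = [issn for issn, _ in issn_index]

def lookup_issn(issn):
    """Indexed lookup: binary search the sorted keys, then fetch the record."""
    i = bisect_left(issn_keys, issn)
    if i < len(issn_keys) and issn_keys[i] == issn:
        return records[issn_index[i][1]]
    return None

print(scan_for_issn(records, '9876-5432'))
print(lookup_issn('9876-5432'))
```

Both calls find the same record. The scan does work proportional to the number of rows, while the indexed lookup needs only a handful of comparisons, at the cost of building and maintaining the index.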

### Structured vs unstructured data

In our records table above, we have three fields: the ID, the Title, and the ISSN. Since we know how long ISSNs are, we can make our database more efficient by explicitly specifying how much data that column can hold. We need to specify the size and type of data for each column when we create the table.
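As a rough sketch of what declaring the structure up front looks like, here is the same table created in SQLite through Python's built-in `sqlite3` module. SQLite is only a convenient stand-in here, not something DOAJ uses, and the table name `journal` is made up. Many database engines would declare the ISSN as a fixed-width `CHAR(9)` column; SQLite does not enforce declared lengths, so this sketch uses a `CHECK` constraint to get a similar effect.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# The shape of every record is fixed when the table is created.
conn.execute("""
    CREATE TABLE journal (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        issn  TEXT NOT NULL CHECK (length(issn) = 9)
    )
""")

insert = "INSERT INTO journal (id, title, issn) VALUES (?, ?, ?)"
conn.execute(insert, (1, 'Journal of Spoons', '1234-5678'))

# A value that doesn't fit the declared structure is refused.
try:
    conn.execute(insert, (2, 'Analytical Knitting', '9876-54321'))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM journal").fetchone()[0]
print(rejected, row_count)
```

The second insert fails because its ISSN is ten characters long, so only the first row ends up in the table.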

In Elasticsearch and other document indexes, we store the records in a different format, namely JSON (JavaScript Object Notation). JSON objects look much like dictionaries in Python. For example:

In [10]:
table = [
    {'ID': 1, 'Title': 'Journal of Spoons', 'ISSN': '1234-5678'},
    {'ID': 2, 'Title': 'Analytical Knitting', 'ISSN': '9876-5432'},
    {'ID': 3, 'Title': 'Journal of Trivial Household Items', 'ISSN': '1010-000X'},
    {'ID': 4, 'Title': 'Journal of Slow Growing Succulates', 'ISSN': '3333-7777'},
    {'ID': 5, 'Title': 'Humerous Taxidermy', 'ISSN': '2468-1012'}
]

# Put the data in the index one record at a time
[es.create(index='doaj-es-demo', body=json.dumps(row), id=row['ID']) for row in table]


[{'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '1',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 0,
  '_primary_term': 1},
 {'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '2',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 1,
  '_primary_term': 1},
 {'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '3',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 2,
  '_primary_term': 1},
 {'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '4',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 3,
  '_primary_term': 1},
 {'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '5',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 4,
  '_primary_term': 1}]
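The `json.dumps(row)` call in the cell above is what turns each Python dictionary into JSON text before it is sent to the index. The following sketch shows that round trip on its own, without talking to Elasticsearch. The `Publisher` field and its value are hypothetical, added only to illustrate that a JSON document can carry a field the others lack, which a fixed table would not allow without changing its definition.

```python
import json

row = {'ID': 3, 'Title': 'Journal of Trivial Household Items', 'ISSN': '1010-000X'}

# Serialise the dictionary to a JSON string, as done before indexing.
body = json.dumps(row)

# Parsing the string gives back an equal dictionary.
restored = json.loads(body)

# A second document with an extra, hypothetical field serialises just as well.
extended = dict(row, Publisher='Hypothetical Press')
extended_body = json.dumps(extended)

print(body)
print(extended_body)
```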

In [11]:
# match_all returns every document in the index
es.search(body={'query': {'match_all': {}}}, index='doaj-es-demo')

{'took': 8,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 5, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '1',
    '_score': 1.0,
    '_source': {'ID': 1, 'Title': 'Journal of Spoons', 'ISSN': '1234-5678'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '2',
    '_score': 1.0,
    '_source': {'ID': 2, 'Title': 'Analytical Knitting', 'ISSN': '9876-5432'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '3',
    '_score': 1.0,
    '_source': {'ID': 3,
     'Title': 'Journal of Trivial Household Items',
     'ISSN': '1010-000X'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '4',
    '_score': 1.0,
    '_source': {'ID': 4,
     'Title': 'Journal of Slow Growing Succulates',
     'ISSN': '3333-7777'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '5',
    '_score': 1.0,
    '_source': {'ID'