# DOAJ Elasticsearch / Indexing Presentation
_Steve, Cottage Labs, 2022_

Here I will cover the basics of our Elasticsearch document store, indexes in general, and how we use Elasticsearch and what that means for our system.

Elasticsearch runs on a separate machine from the DOAJ application. In production, my first step is to use the firewall to ensure only the right machines can talk to it. It stores all of our dynamic data, including our user accounts, which contain personally identifiable information such as email addresses. For that reason our machines are connected by a private network - meaning not over the internet.

I'll use Docker to run two Elasticsearch hosts on my laptop, and connect to them with Python so I can demonstrate what an index does for us.

In [3]:
import json

import elasticsearch
es1 = {'scheme': 'http', 'host': 'localhost', 'port': 9201}
es2 = {'scheme': 'http', 'host': 'localhost', 'port': 9202}
es = elasticsearch.Elasticsearch([es1, es2])

In [4]:
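# Start from a clean slate: remove any demo index left over from a previous run
# (ignore_unavailable avoids an error if the index doesn't exist yet)
es.indices.delete(index='doaj-es-demo', ignore_unavailable=True)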

{'acknowledged': True}

In [5]:
es.indices.create(index='doaj-es-demo')

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'doaj-es-demo'}

## An index is different from a database

A relational database stores its data in tables - much like a spreadsheet, a table has headers and rows for all of the data.

| ID  | Title                                             | ISSN      |
|-----|---------------------------------------------------|-----------|
| 001 | Journal of Spoons                                 | 1234-5678 |
| 002 | Analytical Knitting                               | 9876-5432 |
| 003 | Journal of Trivial Household Items and Their Uses | 1010-000X |
| 004 | Journal of Most Exciting Journals                 | 3333-7777 |
| 005 | Humerous Taxidermy                                | 2468-1012 |

In a very basic database, you'd retrieve the full record with ISSN 9876-5432 by going through all of the rows, comparing the query with the value in the corresponding column. It's like reading the whole book to find the chapter you're interested in. Then you'd pull out the record by its ID.

Like in a book, an **index** allows you to search more efficiently by pointing directly from the interesting information to the ID of the record. There's less to search through, and it can also be organised for efficient searching: e.g. if the ISSNs are kept in sorted order you can do a binary search (check the middle entry, discard the half that can't contain your ISSN, and repeat), which is much faster than looking through all of the records.
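
To make the difference concrete, here is a minimal sketch in plain Python (no Elasticsearch involved) comparing a full scan with an index lookup on the table above. The names `find_by_scan`, `issn_index` and `find_by_index` are made up for illustration, and a real database index is a more sophisticated structure than a sorted list.

```python
import bisect

# The records table from above, as (ID, Title, ISSN) rows.
rows = [
    (1, 'Journal of Spoons', '1234-5678'),
    (2, 'Analytical Knitting', '9876-5432'),
    (3, 'Journal of Trivial Household Items and Their Uses', '1010-000X'),
    (4, 'Journal of Most Exciting Journals', '3333-7777'),
    (5, 'Humerous Taxidermy', '2468-1012'),
]

def find_by_scan(issn):
    """No index: compare the query against every row until one matches."""
    for row in rows:
        if row[2] == issn:
            return row
    return None

# The index: just (ISSN, ID) pairs, kept sorted by ISSN.
issn_index = sorted((issn, rec_id) for rec_id, _title, issn in rows)
rows_by_id = {row[0]: row for row in rows}

def find_by_index(issn):
    """With an index: binary search the sorted ISSNs, then fetch the row by ID."""
    pos = bisect.bisect_left(issn_index, (issn,))
    if pos < len(issn_index) and issn_index[pos][0] == issn:
        return rows_by_id[issn_index[pos][1]]
    return None

print(find_by_scan('9876-5432'))
print(find_by_index('9876-5432'))
```

Both calls return the same record; the difference is that the scan touches every row in the worst case, while the binary search halves the remaining candidates at each step.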

### Structured vs unstructured data

In our records table above, we have 3 fields - the ID, the Title, and the ISSN. Since we know how long ISSNs are, we can make our database more efficient by explicitly specifying how much data that column can take. We need to specify the size and type of data for each column when we create the table.
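
As a rough illustration of declaring the structure up front, here is a sketch using SQLite, chosen only because it ships with Python. The table name `journals`, the column sizes, and the extra `country` column are assumptions for the example. Note that SQLite records the declared sizes but doesn't enforce them, whereas many other relational databases do.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Every column is declared with a type (and a size) before any data goes in.
conn.execute("""
    CREATE TABLE journals (
        id    INTEGER PRIMARY KEY,
        title VARCHAR(255) NOT NULL,
        issn  CHAR(9) NOT NULL
    )
""")
conn.execute("INSERT INTO journals VALUES (?, ?, ?)",
             (1, 'Journal of Spoons', '1234-5678'))

# The declared structure is stored with the table: (cid, name, type, notnull, default, pk)
columns = conn.execute("PRAGMA table_info(journals)").fetchall()
for column in columns:
    print(column[1], column[2])

# A column that was never declared is rejected outright.
error = None
try:
    conn.execute("INSERT INTO journals (id, title, issn, country) VALUES (?, ?, ?, ?)",
                 (2, 'Analytical Knitting', '9876-5432', 'GB'))
except sqlite3.OperationalError as e:
    error = str(e)
print(error)
```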

In Elasticsearch and other document indexes, we store the records in a different format, namely JSON - JavaScript Object Notation. This is also similar to how dictionaries look in Python. For example:

In [6]:
table = [
    {'ID': 1, 'Title': 'Journal of Spoons', 'ISSN': '1234-5678'},
    {'ID': 2, 'Title': 'Analytical Knitting', 'ISSN': '9876-5432'},
    {'ID': 3, 'Title': 'Journal of Trivial Household Items and Their Uses', 'ISSN': '1010-000X'},
    {'ID': 4, 'Title': 'Journal of Most Exciting Journals', 'ISSN': '3333-7777'},
    {'ID': 5, 'Title': 'Humerous Taxidermy', 'ISSN': '2468-1012'}
]

# Put the data in the index one record at a time
[es.create(index='doaj-es-demo', body=json.dumps(row), id=row['ID']) for row in table]


[{'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '1',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 0,
  '_primary_term': 1},
 {'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '2',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 1,
  '_primary_term': 1},
 {'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '3',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 2,
  '_primary_term': 1},
 {'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '4',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 3,
  '_primary_term': 1},
 {'_index': 'doaj-es-demo',
  '_type': '_doc',
  '_id': '5',
  '_version': 1,
  'result': 'created',
  '_shards': {'total': 2, 'successful': 1, 'failed': 0},
  '_seq_no': 4,
  '_primary_term': 1}]

In [11]:
# Check we can retrieve these records by doing a search.
# (This cell was re-run after record 6 was added further down, hence the total of 6.)
es.search({'query': {'match_all': {}}}, index='doaj-es-demo')

{'took': 5,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 6, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '1',
    '_score': 1.0,
    '_source': {'ID': 1, 'Title': 'Journal of Spoons', 'ISSN': '1234-5678'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '2',
    '_score': 1.0,
    '_source': {'ID': 2, 'Title': 'Analytical Knitting', 'ISSN': '9876-5432'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '3',
    '_score': 1.0,
    '_source': {'ID': 3,
     'Title': 'Journal of Trivial Household Items and Their Uses',
     'ISSN': '1010-000X'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '4',
    '_score': 1.0,
    '_source': {'ID': 4,
     'Title': 'Journal of Most Exciting Journals',
     'ISSN': '3333-7777'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '5',
    '_score': 1.0,
    '_

# Using an index for search - fulltext search

A document store lets us efficiently search for text content in our data, for example all of the article abstracts in DOAJ can be searched by words. Indexes do full-text search really well because they can analyse incoming text and build what's called an **inverted index** - a lookup from each word to the documents that contain it, along with how often it appears in each.

In order to treat all documents the same, we _normalise_ the text. Let's start with the following journal:

    Journal of Trivial Household Items and Their Uses

To make matching words easier, we change the letter case to be lowercase:

    journal of trivial household items and their uses

The next step is to ignore all of the boring words (stop words like _the_, _a_, _of_, etc.):

    journal trivial household items uses

Elasticsearch can also do additional natural language processing steps such as ASCII folding (`ö` -> `o`) and stemming (reducing words to their root form):

    journal trivial household item use

Then we store the frequency of each word in the inverted index (boringly, 1 of each in our case):

| Word      | Count |
|-----------|-------|
| household | 1     |
| item      | 1     |
| journal   | 1     |
| trivial   | 1     |
| use       | 1     |

More interestingly, _Journal of Most Exciting Journals_ would be represented something like this:

| Word     | Count |
|----------|-------|
| exciting | 1     |
| journal  | 2     |
| most     | 1     |
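
The normalisation steps and the two word-count tables above can be sketched in plain Python. This is only an illustration: the stop word list and `crude_stem` are made-up stand-ins, and real analysers are considerably more careful than this.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption, not Elasticsearch's own list).
STOP_WORDS = {'a', 'an', 'and', 'of', 'the', 'their'}

def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip a plural 's'."""
    if len(word) > 3 and word.endswith('s') and not word.endswith(('ss', 'us')):
        return word[:-1]
    return word

def normalise(text):
    words = text.lower().split()                       # lowercase and split into words
    words = [w for w in words if w not in STOP_WORDS]  # drop the stop words
    return [crude_stem(w) for w in words]              # reduce each word to its stem

tokens = normalise('Journal of Trivial Household Items and Their Uses')
print(tokens)

# Word counts for each title, like the tables above.
print(Counter(tokens))
print(Counter(normalise('Journal of Most Exciting Journals')))
```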

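Since this title has a higher count of the word stem `journal`, we'd expect it to be the best match when we search our Elasticsearch index, and for all of our _Journal of..._ documents to be returned as well. Elasticsearch applies the same normalisation steps to the search text as it does to the documents. Note that our demo index uses the default `standard` analyser, which lowercases the text but doesn't stem it, so `Journals` isn't counted as `journal` here. Scoring also favours shorter fields, which is why _Journal of Spoons_ comes out on top in the results below.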
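
Here is a toy version of the whole idea in plain Python: an inverted index built over our five titles, and a search that ranks documents by how often the query term appears. The helper names and the crude stemming rule are assumptions for illustration. Real scoring in Elasticsearch is more involved than a raw count, so treat the ordering here as a sketch of the principle only.

```python
from collections import Counter, defaultdict

titles = {
    1: 'Journal of Spoons',
    2: 'Analytical Knitting',
    3: 'Journal of Trivial Household Items and Their Uses',
    4: 'Journal of Most Exciting Journals',
    5: 'Humerous Taxidermy',
}

STOP_WORDS = {'a', 'an', 'and', 'of', 'the', 'their'}  # illustrative only

def normalise(text):
    """Lowercase, drop stop words, and strip a plural 's' as a crude stem."""
    words = (w for w in text.lower().split() if w not in STOP_WORDS)
    return [w[:-1] if w.endswith('s') and not w.endswith(('ss', 'us')) else w
            for w in words]

# Inverted index: term -> {document ID -> number of occurrences}
inverted = defaultdict(dict)
for doc_id, title in titles.items():
    for term, count in Counter(normalise(title)).items():
        inverted[term][doc_id] = count

def search(query):
    """Score each document by summing the counts of the query terms it contains."""
    scores = Counter()
    for term in normalise(query):
        for doc_id, count in inverted.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document ID
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(inverted['journal'])
print(search('Journal'))
```

The search only has to look at the entry for `journal` in the inverted index; it never reads the titles that don't contain the word.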

In [8]:
# Check results ordering on fulltext search
es.search({'query': {'query_string': {'query': 'Journal'}}}, index='doaj-es-demo')

{'took': 73,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 3, 'relation': 'eq'},
  'max_score': 0.60040116,
  'hits': [{'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '1',
    '_score': 0.60040116,
    '_source': {'ID': 1, 'Title': 'Journal of Spoons', 'ISSN': '1234-5678'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '4',
    '_score': 0.4889865,
    '_source': {'ID': 4,
     'Title': 'Journal of Most Exciting Journals',
     'ISSN': '3333-7777'}},
   {'_index': 'doaj-es-demo',
    '_type': '_doc',
    '_id': '3',
    '_score': 0.38251364,
    '_source': {'ID': 3,
     'Title': 'Journal of Trivial Household Items and Their Uses',
     'ISSN': '1010-000X'}}]}}

# Dynamic Data

Imagine we'd like to add a 4th column to our table, for the Publisher.

| ID  | Title              | ISSN      | Publisher    |
|-----|--------------------|-----------|--------------|
| 006 | Reciprocal impulse | 8888-8888 | Barry & Paul |

In a normal database table, we can't just decide to add this field - we have to explicitly add the column and its data type. Here, Elasticsearch is a little different - we can just add a new document with the additional field, and it will accept it.


In [9]:
# 6th record with additional field for publisher
rec = {'ID': 6, 'Title': 'Reciprocal impulse', 'ISSN': '8888-8888', 'Publisher': "Barry & Paul"}
es.create(index='doaj-es-demo', body=json.dumps(rec), id=rec['ID'])

{'_index': 'doaj-es-demo',
 '_type': '_doc',
 '_id': '6',
 '_version': 1,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 5,
 '_primary_term': 1}

In [10]:
# Retrieve the above record to show it's saved
es.get(index='doaj-es-demo', id=6)

{'_index': 'doaj-es-demo',
 '_type': '_doc',
 '_id': '6',
 '_version': 1,
 '_seq_no': 5,
 '_primary_term': 1,
 'found': True,
 '_source': {'ID': 6,
  'Title': 'Reciprocal impulse',
  'ISSN': '8888-8888',
  'Publisher': 'Barry & Paul'}}

This is all possible via **dynamic mapping** - Elasticsearch knows how to treat a new field based on the type of data it detects. In this case it worked out that the new field contains text, so the structure of our documents was updated to add this field.

In [12]:
es.indices.get_mapping(index='doaj-es-demo')

{'doaj-es-demo': {'mappings': {'properties': {'ID': {'type': 'long'},
    'ISSN': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'Publisher': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'Title': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}}}}}

You can see that when we added our documents, it interpreted all of the fields as text apart from the ID, which is a number - 'long' means a signed 64-bit integer. Each text field also gets a `keyword` sub-field, which holds the whole value unanalysed for exact matching and sorting. The data type affects the kind of analysis you can do on a particular field - e.g. numbers can be compared and searched by range, and text can be broken down into words.
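
The type detection can be imitated with a few lines of Python. This is a simplified sketch, not Elasticsearch's actual logic (which also handles dates, arrays and more); `guess_field_mapping` is a made-up name, and the rules below are only the ones needed to reproduce the mapping shown above.

```python
def guess_field_mapping(value):
    """Guess a mapping for a single JSON value, roughly as dynamic mapping would."""
    # bool must be checked before int, because bool is a subclass of int in Python
    if isinstance(value, bool):
        return {'type': 'boolean'}
    if isinstance(value, int):
        return {'type': 'long'}
    if isinstance(value, float):
        return {'type': 'float'}
    if isinstance(value, str):
        # Strings are analysed as text, with an unanalysed keyword sub-field
        return {'type': 'text',
                'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}
    if isinstance(value, dict):
        # Nested objects get their own set of properties
        return {'properties': {k: guess_field_mapping(v) for k, v in value.items()}}
    raise TypeError('No rule in this sketch for %r' % type(value))

rec = {'ID': 6, 'Title': 'Reciprocal impulse', 'ISSN': '8888-8888', 'Publisher': 'Barry & Paul'}
guessed = guess_field_mapping(rec)
print(guessed)
```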

## The DOAJ mapping

Instead of relying on the default dynamic mapping, for most of the data we store in the DOAJ we explicitly tell Elasticsearch what its mapping should be. This ensures the documents we upload are searchable with our interface. Here's a portion of the DOAJ's mapping:

```json
"apc": {
    "properties": {
        "has_apc": {
            "type": "boolean"
        },
        "max": {
            "properties": {
                "currency": {
                    "type": "text",
                    "fields": {
                        "exact": {
                            "type": "keyword",
                            "store": true
                        }
                    }
                },
                "price": {
                    "type": "long"
                }
            }
        },
        "url": {
            "type": "text",
            "fields": {
                "exact": {
                    "type": "keyword",
                    "store": true
                }
            }
        }
    }
},
```

We generate this mapping using our internal data structure, so it should always be in sync with whatever the code expects to work with.
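
To give a flavour of what generating a mapping from a data structure could look like, here is a hypothetical sketch. The `STRUCT` description, the `FIELD_TYPES` table and `struct_to_mapping` are all invented for this example and are not the DOAJ's actual code; they are just enough to reproduce the `apc` portion shown above.

```python
import copy
import json

# Hypothetical description of a record: field name -> kind, nested dicts for objects.
STRUCT = {
    'apc': {
        'has_apc': 'bool',
        'max': {'currency': 'unicode', 'price': 'integer'},
        'url': 'unicode',
    }
}

# Hypothetical lookup from our own kinds to Elasticsearch field definitions.
FIELD_TYPES = {
    'bool': {'type': 'boolean'},
    'integer': {'type': 'long'},
    'unicode': {'type': 'text',
                'fields': {'exact': {'type': 'keyword', 'store': True}}},
}

def struct_to_mapping(struct):
    """Walk the structure recursively and build the matching 'properties' tree."""
    properties = {}
    for name, kind in struct.items():
        if isinstance(kind, dict):
            properties[name] = struct_to_mapping(kind)
        else:
            # Copy so that fields never share one mutable definition
            properties[name] = copy.deepcopy(FIELD_TYPES[kind])
    return {'properties': properties}

mapping = struct_to_mapping(STRUCT)
print(json.dumps(mapping['properties'], indent=4))
```

Because the mapping is derived from the same description the code uses, a field can't be added to one without the other.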

### Changing the mapping

We've seen that Elasticsearch can, via the dynamic mapping, incorporate the addition of fields in the records. The same is true with the explicit mapping above, but what happens when you need to remove a field, or change its data type?

You can't just edit the mapping of an existing field, because the data already in the index was analysed and stored according to the old mapping. You can't just upload a new document with the new structure either, because it won't match the existing mapping. That's why these changes require a **re-index** - we create the new structure, move all of the records across from the old one, and index them again for search.
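
As a minimal sketch of the idea (not how the DOAJ's own re-index is implemented), the function below creates a new index with the new mapping and asks Elasticsearch to copy the documents across. It assumes an `es` client like the one created at the top of this notebook; `reindex_with_new_mapping` and the index name `doaj-es-demo-v2` are made up for illustration.

```python
def reindex_with_new_mapping(es, old_index, new_index, new_mapping):
    """Copy every document from old_index into new_index, created with new_mapping."""
    # 1. Create the destination index with the mapping we want.
    es.indices.create(index=new_index, body={'mappings': new_mapping})

    # 2. Ask Elasticsearch to copy all of the documents across; each one is
    #    analysed again according to the new mapping as it arrives.
    return es.reindex(
        body={'source': {'index': old_index}, 'dest': {'index': new_index}},
        wait_for_completion=True,
        refresh=True,
    )

# Example usage against the demo index (needs a running Elasticsearch):
# new_mapping = {'properties': {'ISSN': {'type': 'keyword'}}}
# reindex_with_new_mapping(es, 'doaj-es-demo', 'doaj-es-demo-v2', new_mapping)
```

The old index is left untouched, so the application can keep using it until the copy has finished and been checked.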

