# Elasticsearch Setup

For this project, we will be building an open-domain question answering system. There are three major components to such a system:

* Database

* Retriever

* Reader

In this notebook we will set up the first component, the *database*, for which we will be using Elasticsearch.

Before creating our Elasticsearch index, we need to load our data. We will be using *Meditations* by Marcus Aurelius. A clean version of the text can be found at:

https://raw.githubusercontent.com/jamescalam/transformers/main/data/text/meditations/clean.txt

We will download it using the requests library.

In [1]:
import requests

In [2]:
data = requests.get('https://raw.githubusercontent.com/jamescalam/transformers/main/data/text/meditations/clean.txt')
text = data.text.split('\n')

In [3]:
text[:3]

['From my grandfather Verus I learned good morals and the government of my temper.',
 'From the reputation and remembrance of my father, modesty and a manly character.',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.']
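Splitting on newlines keeps any blank lines in the file as empty strings, which would later be indexed as empty documents. A minimal sketch of a more defensive split is shown below. The helper name `split_paragraphs` is hypothetical, and dropping blank lines is an assumption that could change the paragraph count if the file contains any.

```python
def split_paragraphs(raw):
    """Split raw text on newlines, trimming whitespace and dropping blank lines."""
    return [line.strip() for line in raw.split('\n') if line.strip()]


sample = 'First paragraph.\n\n  Second paragraph.  \n'
paragraphs = split_paragraphs(sample)
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
```

In the notebook this could be called as `text = split_paragraphs(data.text)`, ideally after `data.raise_for_status()` so that a failed download raises an error instead of being split silently.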

Now we can move on to setting up an index in Elasticsearch. First, let's confirm that Elasticsearch is up and running.

In [4]:
requests.get('http://localhost:9200/_cluster/health').json()

{'cluster_name': 'docker-cluster',
 'status': 'yellow',
 'timed_out': False,
 'number_of_nodes': 1,
 'number_of_data_nodes': 1,
 'active_primary_shards': 5,
 'active_shards': 5,
 'relocating_shards': 0,
 'initializing_shards': 0,
 'unassigned_shards': 2,
 'delayed_unassigned_shards': 0,
 'number_of_pending_tasks': 0,
 'number_of_in_flight_fetch': 0,
 'task_max_waiting_in_queue_millis': 0,
 'active_shards_percent_as_number': 71.42857142857143}
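A yellow status means every primary shard is assigned but at least one replica is not, which is the usual state for a single-node cluster because a replica cannot live on the same node as its primary. The reported percentage can be reproduced from the shard counts in the response. The helper below is a hypothetical sketch; it assumes the percentage is active shards divided by the sum of active, initializing, and unassigned shards, and that yellow is acceptable for local development.

```python
def summarize_health(health):
    """Return (usable, active_percent) from a _cluster/health response dict."""
    total = (health['active_shards']
             + health['initializing_shards']
             + health['unassigned_shards'])
    active_percent = 100.0 * health['active_shards'] / total if total else 100.0
    # Assumption: 'green' and 'yellow' are both fine for a local, single-node setup.
    usable = health['status'] in ('green', 'yellow') and not health['timed_out']
    return usable, active_percent


# Values copied from the response shown above.
health = {'status': 'yellow', 'timed_out': False, 'active_shards': 5,
          'initializing_shards': 0, 'unassigned_shards': 2}
usable, percent = summarize_health(health)
print(usable, round(percent, 2))  # True 71.43
```

Here 5 active shards out of 7 in total gives roughly 71.43%, matching the value Elasticsearch reported.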

Next, we check which indices currently exist.

In [5]:
print(requests.get('http://localhost:9200/_cat/indices').text)

green  open .geoip_databases Tl4kr989SoSAJc76pbVoXw 1 0   41    0 39.6mb 39.6mb
yellow open squad_docs       8WDF2_ElTmuOWuL0dO4a-g 1 1 1204 1204 14.2mb 14.2mb
yellow open label            2VsgyMTHTi-TxfIu_TskJg 1 1    0    0   226b   226b
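The plain-text listing has no header row. Assuming the default column order of the _cat/indices endpoint (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size), a small parser can turn it into dictionaries. The name `parse_cat_indices` is hypothetical, and the column list is an assumption, since other Elasticsearch versions may report additional columns.

```python
# Assumed default column order for _cat/indices output.
COLUMNS = ['health', 'status', 'index', 'uuid', 'pri', 'rep',
           'docs.count', 'docs.deleted', 'store.size', 'pri.store.size']


def parse_cat_indices(text):
    """Parse whitespace-separated _cat/indices output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Output copied from the cell above.
sample = """green  open .geoip_databases Tl4kr989SoSAJc76pbVoXw 1 0   41    0 39.6mb 39.6mb
yellow open squad_docs       8WDF2_ElTmuOWuL0dO4a-g 1 1 1204 1204 14.2mb 14.2mb
yellow open label            2VsgyMTHTi-TxfIu_TskJg 1 1    0    0   226b   226b
"""
indices = parse_cat_indices(sample)
print([row['index'] for row in indices])
# ['.geoip_databases', 'squad_docs', 'label']
```

Manual parsing can be avoided altogether by requesting `http://localhost:9200/_cat/indices?format=json`, or the column names can be shown in the text output by appending `?v`.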



Now let's initialize a new index, *aurelius*, which we will use to store our *Meditations* dataset.

In [6]:
from haystack.document_stores import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='aurelius'
)

Listing the indices again confirms that the new, still empty *aurelius* index now exists.



In [7]:
print(requests.get('http://localhost:9200/_cat/indices').text)

green  open .geoip_databases Tl4kr989SoSAJc76pbVoXw 1 0   41    0 39.6mb 39.6mb
yellow open squad_docs       8WDF2_ElTmuOWuL0dO4a-g 1 1 1204 1204 14.2mb 14.2mb
yellow open aurelius         44OwWJ1dTGCkRUAYL7vEIg 1 1    0    0   226b   226b
yellow open label            2VsgyMTHTi-TxfIu_TskJg 1 1    0    0   226b   226b



Now we need to format our data into a list of dictionaries before passing it to Elasticsearch. Each dictionary stores the paragraph under a *content* key and its origin under *meta*:

```json
{
    "content": "<paragraph>",
    "meta": {
        "source": "meditations"
    }
}
```

In [8]:
data_json = [
    {
        'content': paragraph,
        'meta': {
            'source': 'meditations'
        }
    } for paragraph in text
]

In [9]:
data_json[:3]

[{'content': 'From my grandfather Verus I learned good morals and the government of my temper.',
  'meta': {'source': 'meditations'}},
 {'content': 'From the reputation and remembrance of my father, modesty and a manly character.',
  'meta': {'source': 'meditations'}},
 {'content': 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.',
  'meta': {'source': 'meditations'}}]

In [10]:
len(data_json)

507
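If more than one text needs to be indexed, the list comprehension above can be wrapped in a reusable function. This is only a sketch: `to_documents` and its `source` parameter are hypothetical names, and the function produces the same dictionaries as the cell above.

```python
def to_documents(paragraphs, source):
    """Wrap each paragraph in a dict with 'content' and 'meta' keys."""
    return [
        {'content': paragraph, 'meta': {'source': source}}
        for paragraph in paragraphs
    ]


docs = to_documents(['first paragraph', 'second paragraph'], source='meditations')
print(len(docs))        # 2
print(docs[0]['meta'])  # {'source': 'meditations'}
```

In the notebook, `data_json = to_documents(text, source='meditations')` would be equivalent to the comprehension.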

Now we simply write our data to Elasticsearch.

In [11]:
doc_store.write_documents(data_json)

And confirm that we have uploaded *507* items.

In [12]:
requests.get('http://localhost:9200/aurelius/_count').json()

{'count': 507,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}

Perfect!
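To make this check repeatable rather than visual, the count response can be compared against the number of documents we prepared. The helper `verify_count` below is a hypothetical sketch that also treats any failed shard as an error, since a partial count could otherwise look like missing documents.

```python
def verify_count(count_response, expected):
    """Check a _count response against the number of documents we wrote."""
    shards = count_response['_shards']
    if shards['failed'] > 0:
        raise RuntimeError(f"{shards['failed']} shard(s) failed during count")
    if count_response['count'] != expected:
        raise ValueError(
            f"expected {expected} documents, found {count_response['count']}"
        )
    return True


# Response copied from the cell above.
response = {'count': 507,
            '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
print(verify_count(response, expected=507))  # True
```

In the notebook, the first argument would be `requests.get('http://localhost:9200/aurelius/_count').json()` and the second `len(data_json)`.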