# Elasticsearch Setup

For this project, we will be building an open-domain question answering system. There are three major components to such a system:

* Database

* Retriever

* Reader

In this notebook we will set up the first part, the *database*, for which we will use Elasticsearch.

Before creating our Elasticsearch index, we need to load our data. We will be using *Meditations* by Marcus Aurelius.

### Elasticsearch setup

Installation instructions for Elasticsearch 7.17 (tar.gz archive) are available here:

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/targz.html

In [1]:
import requests

In [2]:
# Read the whole book, then split it into one paragraph per line
with open("./data/book.txt") as f:
    contents = f.read()
    text = contents.split('\n')

In [3]:
text[:3]

['From my grandfather Verus I learned good morals and the government of my temper.',
 'From the reputation and remembrance of my father, modesty and a manly character.',
 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.']
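
Splitting on `'\n'` keeps any blank lines in the file as empty strings, which would later be indexed as empty documents. A minimal sketch of a stricter split, assuming the file holds one paragraph per line as in the output above; `split_paragraphs` is a hypothetical helper name, not part of the original notebook:

```python
from typing import List


def split_paragraphs(contents: str) -> List[str]:
    """Split raw text into paragraphs, dropping blank lines.

    Hypothetical helper: assumes one paragraph per line.
    """
    # Strip surrounding whitespace and skip lines that are empty afterwards.
    return [line.strip() for line in contents.splitlines() if line.strip()]


sample = "First paragraph.\n\nSecond paragraph.  \n"
print(split_paragraphs(sample))  # ['First paragraph.', 'Second paragraph.']
```

If the file contains no blank lines, this returns the same list as the plain `split('\n')` call.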

Now we can move on to setting up an index in Elasticsearch. First, let's confirm that Elasticsearch is up and running.

In [4]:
requests.get('http://localhost:9200/_cluster/health').json()

{'cluster_name': 'elasticsearch',
 'status': 'yellow',
 'timed_out': False,
 'number_of_nodes': 1,
 'number_of_data_nodes': 1,
 'active_primary_shards': 6,
 'active_shards': 6,
 'relocating_shards': 0,
 'initializing_shards': 0,
 'unassigned_shards': 3,
 'delayed_unassigned_shards': 0,
 'number_of_pending_tasks': 0,
 'number_of_in_flight_fetch': 0,
 'task_max_waiting_in_queue_millis': 0,
 'active_shards_percent_as_number': 66.66666666666666}
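
A `yellow` status means that every primary shard is assigned but some replica shards are not. This is expected on a single-node cluster, because a replica cannot be placed on the same node as its primary. The check can be sketched as follows; `is_cluster_usable` is a hypothetical helper, and treating `yellow` as acceptable is an assumption that suits local development rather than production:

```python
from typing import Any, Dict


def is_cluster_usable(health: Dict[str, Any]) -> bool:
    """Return True if the cluster can serve reads and writes.

    Hypothetical helper operating on a `_cluster/health` response.
    green:  all primary and replica shards are assigned
    yellow: all primaries are assigned, some replicas are not
    red:    at least one primary shard is unassigned
    """
    status_ok = health.get("status") in {"green", "yellow"}
    return status_ok and not health.get("timed_out", False)


# Abbreviated copy of the response shown above
health = {"status": "yellow", "timed_out": False, "unassigned_shards": 3}
print(is_cluster_usable(health))  # True
```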

Next, check the currently active indices.

In [5]:
print(requests.get('http://localhost:9200/_cat/indices').text)

green  open .geoip_databases 3Nr4CVMhTpa9Niyg67r_Eg 1 0    42 0  40.3mb  40.3mb
yellow open harrypotter      BaO2yTCKTlyqTca9aOVKSQ 1 1 13670 4  27.2mb  27.2mb
yellow open meditation       itzMqP_zReyzF7bZJqYghg 1 1   507 0 321.5kb 321.5kb
yellow open label            S_tPNEOzTimfNFTHLC6OVw 1 1     0 0    226b    226b
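
The `_cat` endpoints return plain text intended for people to read. To work with the listing in code, each row can be split into named fields. This sketch assumes the default column order of `_cat/indices` for open indices (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size); `parse_cat_indices` is a hypothetical helper:

```python
from typing import Dict, List

# Assumed default column order of `_cat/indices` for open indices
COLUMNS = ["health", "status", "index", "uuid", "pri", "rep",
           "docs.count", "docs.deleted", "store.size", "pri.store.size"]


def parse_cat_indices(text: str) -> List[Dict[str, str]]:
    """Turn the plain-text index listing into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            # Skip blank lines and rows that do not match the assumed layout
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Two rows copied from the output above
listing = """\
green  open .geoip_databases 3Nr4CVMhTpa9Niyg67r_Eg 1 0    42 0  40.3mb  40.3mb
yellow open meditation       itzMqP_zReyzF7bZJqYghg 1 1   507 0 321.5kb 321.5kb
"""
for row in parse_cat_indices(listing):
    print(row["index"], row["docs.count"])
```

All values stay as strings here. Elasticsearch can also return the listing as JSON when `format=json` is passed as a query parameter, which avoids manual parsing.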



Now let's initialize a new index, *meditation*, which we will use to store our *Meditations* dataset.

In [14]:
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='meditation'
)



In [15]:
print(requests.get('http://localhost:9200/_cat/indices').text)

green  open .geoip_databases 3Nr4CVMhTpa9Niyg67r_Eg 1 0 42 0 40.3mb 40.3mb
yellow open meditation       itzMqP_zReyzF7bZJqYghg 1 1  0 0   226b   226b
yellow open label            S_tPNEOzTimfNFTHLC6OVw 1 1  0 0   226b   226b



Now we need to format our data into a list of dictionaries before passing it along to Elasticsearch. Each dictionary will have the following format:

```json
{
    "content": "<paragraph>",
    "meta": {
        "source": "meditations"
    }
}
```

In [7]:
data_json = [
    {
        'content': paragraph,
        'meta': {
            'source': 'meditations'
        }
    } for paragraph in text
]

In [8]:
data_json[:3]

[{'content': 'From my grandfather Verus I learned good morals and the government of my temper.',
  'meta': {'source': 'meditations'}},
 {'content': 'From the reputation and remembrance of my father, modesty and a manly character.',
  'meta': {'source': 'meditations'}},
 {'content': 'From my mother, piety and beneficence, and abstinence, not only from evil deeds, but even from evil thoughts; and further, simplicity in my way of living, far removed from the habits of the rich.',
  'meta': {'source': 'meditations'}}]

In [9]:
len(data_json)

507

Now we simply write our data to Elasticsearch.

In [23]:
doc_store.write_documents(data_json)

Finally, confirm that all *507* items have been uploaded.

In [10]:
requests.get('http://localhost:9200/meditation/_count').json()

{'count': 507,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
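
The same comparison can be expressed as a small check instead of reading the numbers by eye. This is a minimal sketch, assuming a `_count` response with the shape shown above; `upload_is_complete` is a hypothetical helper:

```python
from typing import Any, Dict


def upload_is_complete(count_response: Dict[str, Any], expected: int) -> bool:
    """Check that the index holds the expected number of documents
    and that no shard failed while counting (hypothetical helper)."""
    shards = count_response.get("_shards", {})
    return (count_response.get("count") == expected
            and shards.get("failed", 0) == 0)


# Copy of the response shown above
response = {"count": 507,
            "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0}}
print(upload_is_complete(response, expected=507))  # True
```

In the notebook, `expected` would be `len(data_json)` and `count_response` the JSON returned by the `_count` request.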