# Elasticsearch Setup

For this project, we will be building an open-domain question answering system. There are three major components to such a system:

* Database

* Retriever

* Reader

In this notebook we will set up the first part, the *database*, for which we will use Elasticsearch.

Before creating our Elasticsearch index, we need to load our data. We will extract the text from a local PDF, `./data/aiqrunbook.pdf`, and split it into lines.

### Elasticsearch setup

Installation instructions for the tar.gz archive (Elasticsearch 7.17): https://www.elastic.co/guide/en/elasticsearch/reference/7.17/targz.html

In [58]:
import requests

In [73]:
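from haystack.utils import clean_wiki_text
import PyPDF2

output = []
# Open the PDF in a context manager so the file handle is always closed.
# Note: these are the PyPDF2 1.x/2.x names; PyPDF2 3.x renamed them to
# PdfReader, len(reader.pages), reader.pages[i] and page.extract_text().
with open('./data/aiqrunbook.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    for i in range(pdfReader.numPages):
        page = pdfReader.getPage(i)
        contents = page.extractText()
        # Treat every extracted line as one "paragraph".
        output.extend(contents.split('\n'))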

In [74]:
output[:3]

['Comcast Confidential - Any unauthorized disclosure or use is strictly prohibited',
 'Comcast Confidential - Any unauthorized disclosure or use is strictly prohibited1.  AIQ Platform Failure Scenarios  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  2',
 '1.1  AIQ Production - CRDB Failure Scenarios  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  3']
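The first extracted lines come from the table of contents and carry long dot leaders plus a page number. A minimal sketch of how such lines could be tidied before indexing is shown below. The helper names `tidy_line` and `tidy_lines` are hypothetical and not part of the original notebook, and the rule "three or more dots at the end of a line form a leader" is an assumption that would also trim a trailing ellipsis.

```python
import re

# Hypothetical helpers, not part of the original notebook.
# Assumption: three or more (optionally space-separated) dots at the end of a
# line, optionally followed by a page number, are a table-of-contents leader.
_DOT_LEADER = re.compile(r"(?:\s*\.){3,}\s*\d*\s*$")


def tidy_line(line):
    """Remove a trailing dot leader (and page number) and outer whitespace."""
    return _DOT_LEADER.sub("", line).strip()


def tidy_lines(lines):
    """Tidy every line and drop the ones that end up empty."""
    tidied = (tidy_line(line) for line in lines)
    return [line for line in tidied if line]


sample = [
    "1.1  AIQ Production - CRDB Failure Scenarios  . . . . . . .  3",
    "   ",
    "Some ordinary sentence.",
]
print(tidy_lines(sample))
```

In the notebook this could be applied as `output = tidy_lines(output)` before building the documents, if dropping the leaders is desired.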

Now we can move on to setting up an index in Elasticsearch. First, let's confirm that Elasticsearch is up and running.

In [84]:
requests.get('http://localhost:9200/_cluster/health').json()

{'cluster_name': 'elasticsearch',
 'status': 'yellow',
 'timed_out': False,
 'number_of_nodes': 1,
 'number_of_data_nodes': 1,
 'active_primary_shards': 7,
 'active_shards': 7,
 'relocating_shards': 0,
 'initializing_shards': 0,
 'unassigned_shards': 4,
 'delayed_unassigned_shards': 0,
 'number_of_pending_tasks': 0,
 'number_of_in_flight_fetch': 0,
 'task_max_waiting_in_queue_millis': 0,
 'active_shards_percent_as_number': 63.63636363636363}
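The `yellow` status is expected on a single-node cluster: every primary shard is assigned, but replica shards cannot be placed on the same node as their primary, so they stay unassigned. The sketch below turns a health response into a short message. The function name `summarize_health` is hypothetical, and the dictionary passed in is a trimmed copy of the response shown above.

```python
def summarize_health(health):
    """Return a one-line interpretation of a _cluster/health response."""
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "green: all primary and replica shards are assigned"
    if status == "yellow":
        return (
            "yellow: all primaries assigned, "
            f"{unassigned} replica shard(s) unassigned"
        )
    if status == "red":
        return "red: at least one primary shard is unassigned"
    return f"unexpected status: {status!r}"


# Trimmed copy of the response from the cell above.
health = {"status": "yellow", "number_of_nodes": 1, "unassigned_shards": 4}
print(summarize_health(health))
```

For a local experiment like this one, `green` and `yellow` are both fine to work with; `red` would mean some data is unavailable.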

Next, we list the indices that currently exist.

In [76]:
print(requests.get('http://localhost:9200/_cat/indices').text)

yellow open myspace          8MD8-BrhRQiMcy5Ijh5KMQ 1 1  7 0 47.8kb 47.8kb
green  open .geoip_databases _ZyZk81hRuGagoxe3hFqrA 1 0 42 0 40.6mb 40.6mb
yellow open aiqrunbook       CK-BlBHuSY62eiDmqH98kA 1 1 29 0 55.2kb 55.2kb
yellow open label            kXxpF1SPQKqH5J3XpS20SA 1 1  0 0   226b   226b
yellow open opennlp          x3yf2sDBQRe0WCarTGthaQ 1 1 35 0 67.7kb 67.7kb
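The listing has no header row. By default the columns are health, status, index name, UUID, primary shards, replicas, document count, deleted documents, total store size, and primary store size. As a small sketch, the text can be parsed into dictionaries as follows; `parse_cat_indices` is a hypothetical helper, and it assumes the default column set (requesting `_cat/indices?format=json` is an alternative that avoids parsing).

```python
# Default column order of _cat/indices when no header is requested.
CAT_INDICES_COLUMNS = (
    "health", "status", "index", "uuid", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
)


def parse_cat_indices(text):
    """Parse the plain-text _cat/indices listing into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and rows that do not match the default layout.
        if len(fields) != len(CAT_INDICES_COLUMNS):
            continue
        rows.append(dict(zip(CAT_INDICES_COLUMNS, fields)))
    return rows


# Two rows copied from the listing above.
sample = (
    "green  open .geoip_databases _ZyZk81hRuGagoxe3hFqrA 1 0 42 0 40.6mb 40.6mb\n"
    "yellow open aiqrunbook       CK-BlBHuSY62eiDmqH98kA 1 1 29 0 55.2kb 55.2kb\n"
)
for row in parse_cat_indices(sample):
    print(row["index"], row["health"], row["docs.count"])
```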



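Now let's connect a Haystack document store to the *aiqrunbook* index, which we will use to store the runbook text. The listing above shows that this index already exists, so nothing new is created here.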

In [77]:
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='aiqrunbook'
)



In [78]:
print(requests.get('http://localhost:9200/_cat/indices').text)

yellow open myspace          8MD8-BrhRQiMcy5Ijh5KMQ 1 1  7 0 47.8kb 47.8kb
green  open .geoip_databases _ZyZk81hRuGagoxe3hFqrA 1 0 42 0 40.6mb 40.6mb
yellow open aiqrunbook       CK-BlBHuSY62eiDmqH98kA 1 1 29 0 55.2kb 55.2kb
yellow open label            kXxpF1SPQKqH5J3XpS20SA 1 1  0 0   226b   226b
yellow open opennlp          x3yf2sDBQRe0WCarTGthaQ 1 1 35 0 67.7kb 67.7kb



Now we need to format our data into a list of dictionaries before passing it along to Elasticsearch. We will create the format:

```json
{
    "content": "<paragraph>",
    "meta": {
        "source": "aiqrunbook"
    }
}
```

In [79]:
data_json = [
    {
        'content': clean_wiki_text(paragraph),
        'meta': {
            'source': 'aiqrunbook'
        }
    } for paragraph in output
]

In [80]:
data_json[:3]

[{'content': 'Comcast Confidential - Any unauthorized disclosure or use is strictly prohibited',
  'meta': {'source': 'aiqrunbook'}},
 {'content': 'Comcast Confidential - Any unauthorized disclosure or use is strictly prohibited1.  AIQ Platform Failure Scenarios  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  2',
  'meta': {'source': 'aiqrunbook'}},
 {'content': '1.1  AIQ Production - CRDB Failure Scenarios  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  3',
  'meta': {'source': 'aiqrunbook'}}]

In [81]:
len(data_json)

838
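The comprehension above keeps every extracted line, including ones that are empty after cleaning. A minimal sketch of a builder that skips such entries is shown below. `build_documents` and its `clean` parameter are hypothetical; in the notebook, `clean_wiki_text` could be passed as `clean`, while the default here is plain `str.strip` so that the example runs without Haystack.

```python
def build_documents(paragraphs, source, clean=str.strip):
    """Build Haystack-style dicts, skipping paragraphs that clean to nothing."""
    docs = []
    for paragraph in paragraphs:
        content = clean(paragraph)
        if not content:
            # Nothing useful to index for this line.
            continue
        docs.append({"content": content, "meta": {"source": source}})
    return docs


sample = ["First paragraph ", "", "   ", "Second paragraph"]
docs = build_documents(sample, source="aiqrunbook")
print(len(docs))
print(docs[0])
```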

Now we simply write our data to Elasticsearch.

In [82]:
doc_store.write_documents(data_json)

And confirm how many documents the index now holds.

In [83]:
requests.get('http://localhost:9200/aiqrunbook/_count').json()

{'count': 486,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}

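The index holds 486 documents, fewer than the 838 entries in `data_json`. A likely explanation is that Haystack derives document IDs from the content by default and overwrites duplicates, so repeated lines such as the confidentiality header are stored only once.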
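A quick way to sanity-check this number against the 838 entries in `data_json` is to count the distinct contents locally. The sketch below assumes that documents with identical content collapse into one when written (Haystack's default behaviour as far as we know, not something the notebook verifies), and `duplicate_report` is a hypothetical helper. Keep in mind that the index already held 29 documents before the write, so the two numbers need not match exactly.

```python
from collections import Counter


def duplicate_report(docs, top=3):
    """Return (number of distinct contents, the most repeated contents)."""
    counts = Counter(doc["content"] for doc in docs)
    return len(counts), counts.most_common(top)


# Tiny stand-in for data_json with one repeated header line.
sample = [
    {"content": "Confidential header", "meta": {"source": "aiqrunbook"}},
    {"content": "Confidential header", "meta": {"source": "aiqrunbook"}},
    {"content": "Actual paragraph", "meta": {"source": "aiqrunbook"}},
]
distinct, most_repeated = duplicate_report(sample)
print(distinct)
print(most_repeated)
```

Running `duplicate_report(data_json)` in the notebook would show how many distinct contents exist and which lines repeat most often.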