# Elasticsearch Setup

For this project, we will be building an open-domain question answering system. There are three major components to such a system:

* Database

* Retriever

* Reader

In this notebook we will set up the first part, the *database*, for which we will use Elasticsearch.

Before creating our Elasticsearch index, we need to load our data. We will use clean data from https://harrypotter.fandom.com/wiki/Main_Page

### Elasticsearch setup

Installation instructions for Elasticsearch 7.17 from a `.tar.gz` archive: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/targz.html

In [1]:
import requests

In [8]:
from haystack.utils import clean_wiki_text
import pandas as pd
harry = pd.read_csv("https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/harry_potter_wiki.csv")



Now we can move on to setting up an index in Elasticsearch. First, let's confirm that Elasticsearch is up and running.

In [4]:
requests.get('http://localhost:9200/_cluster/health').json()

{'cluster_name': 'elasticsearch',
 'status': 'yellow',
 'timed_out': False,
 'number_of_nodes': 1,
 'number_of_data_nodes': 1,
 'active_primary_shards': 5,
 'active_shards': 5,
 'relocating_shards': 0,
 'initializing_shards': 0,
 'unassigned_shards': 2,
 'delayed_unassigned_shards': 0,
 'number_of_pending_tasks': 0,
 'number_of_in_flight_fetch': 0,
 'task_max_waiting_in_queue_millis': 0,
 'active_shards_percent_as_number': 71.42857142857143}
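
The response above reports a `yellow` status, which means all primary shards are active but some replica shards are unassigned; on a single-node cluster this is common, since replicas have no second node to live on. As a minimal sketch of how such a response could be interpreted in code, the helper below maps the status to a short verdict. The name `summarize_health` is hypothetical and not part of Elasticsearch or Haystack.

```python
def summarize_health(health: dict) -> str:
    """Turn a parsed `_cluster/health` response into a one-line verdict."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "ok: all primary and replica shards are assigned"
    if status == "yellow":
        return f"usable: primaries are active, {unassigned} replica shard(s) unassigned"
    if status == "red":
        return "not usable: at least one primary shard is unassigned"
    return f"unknown status: {status!r}"


# A subset of the values from the response shown above.
sample = {"status": "yellow", "number_of_nodes": 1, "unassigned_shards": 2}
print(summarize_health(sample))
```

Against a live cluster, the same helper could be fed `requests.get('http://localhost:9200/_cluster/health').json()`.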

And check currently active indices.

In [5]:
print(requests.get('http://localhost:9200/_cat/indices').text)

green  open .geoip_databases 3Nr4CVMhTpa9Niyg67r_Eg 1 0  42 0  40.3mb  40.3mb
yellow open meditation       itzMqP_zReyzF7bZJqYghg 1 1 507 0 321.5kb 321.5kb
yellow open label            S_tPNEOzTimfNFTHLC6OVw 1 1   0 0    226b    226b
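
The plain-text listing above is easy to read but awkward to use programmatically. The sketch below parses it into dictionaries, assuming the default `_cat/indices` column order of Elasticsearch 7.x; `parse_cat_indices` and `COLUMNS` are hypothetical names introduced for illustration.

```python
from typing import Dict, List

# Assumed default column order of `_cat/indices` in Elasticsearch 7.x.
COLUMNS = ["health", "status", "index", "uuid", "pri", "rep",
           "docs.count", "docs.deleted", "store.size", "pri.store.size"]


def parse_cat_indices(text: str) -> List[Dict[str, str]]:
    """Split each whitespace-separated line into a column-name -> value dict."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(COLUMNS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(COLUMNS, parts)))
    return rows


# The listing shown above, pasted in as sample input.
sample = """\
green  open .geoip_databases 3Nr4CVMhTpa9Niyg67r_Eg 1 0  42 0  40.3mb  40.3mb
yellow open meditation       itzMqP_zReyzF7bZJqYghg 1 1 507 0 321.5kb 321.5kb
yellow open label            S_tPNEOzTimfNFTHLC6OVw 1 1   0 0    226b    226b
"""

indices = parse_cat_indices(sample)
doc_counts = {row["index"]: int(row["docs.count"]) for row in indices}
print(doc_counts)
```

The `_cat` APIs also accept a `format=json` query parameter, which avoids text parsing altogether when a live cluster is available.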



Now let's initialize a new index, *harrypotter*, which we will use to store our Harry Potter wiki dataset.

In [6]:
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='', password='',
    index='harrypotter'
)



In [7]:
print(requests.get('http://localhost:9200/_cat/indices').text)

green  open .geoip_databases 3Nr4CVMhTpa9Niyg67r_Eg 1 0  42 0  40.3mb  40.3mb
yellow open harrypotter      BaO2yTCKTlyqTca9aOVKSQ 1 1   0 0    226b    226b
yellow open meditation       itzMqP_zReyzF7bZJqYghg 1 1 507 0 321.5kb 321.5kb
yellow open label            S_tPNEOzTimfNFTHLC6OVw 1 1   0 0    226b    226b



Now we need to format our data into a list of dictionaries before passing it along to Elasticsearch. Each dictionary has the following format:

```python
{
    'content': '<paragraph>',
    'meta': {
        'name': '<document name>',
        'url': '<document url>'
    }
}
```

In [10]:
data_json = []

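for _, row in harry.iterrows():
    dic = {
        # strip wiki markup residue from the raw article text
        'content': clean_wiki_text(row.text),
        'meta': {
            # use row['name'] here: row.name would return the row's index label
            'name': row['name'],
            'url': row.url
        }
    }
    data_json.append(dic)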
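
Before writing a large batch, it can help to check that every entry has the expected shape. The following is a minimal sketch of such a check, assuming that a usable entry has a non-empty string under `content` and a dictionary under `meta`; `find_invalid_docs` is a hypothetical helper and the sample entries are made up.

```python
from typing import Any, Dict, List


def find_invalid_docs(docs: List[Dict[str, Any]]) -> List[int]:
    """Return the positions of entries that do not match the expected shape."""
    bad = []
    for i, doc in enumerate(docs):
        content = doc.get("content")
        meta = doc.get("meta")
        if not isinstance(content, str) or not content.strip():
            bad.append(i)  # missing, non-string, or whitespace-only content
        elif not isinstance(meta, dict):
            bad.append(i)  # meta must be a dictionary
    return bad


# Made-up entries for illustration only.
sample_docs = [
    {"content": "Some paragraph.", "meta": {"name": "a.txt", "url": "https://example.org/a"}},
    {"content": "   ", "meta": {"name": "b.txt", "url": "https://example.org/b"}},
    {"content": "Another paragraph.", "meta": None},
]
print(find_invalid_docs(sample_docs))
```

Calling `find_invalid_docs(data_json)` on the real list would return an empty list if every entry is well formed.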

In [11]:
data_json[:3]

[{'content': 'Gryffindor is one of the four Houses of Hogwarts School of Witchcraft and Wizardry and was founded by Godric Gryffindor. Gryffindor instructed the Sorting Hat to choose students possessing characteristics he most valued, such as courage, chivalry, and determination, to be sorted into his house. The emblematic animal is a lion, and its colours are scarlet and gold. Sir Nicholas de Mimsy-Porpington, also known as "\'\'Nearly Headless Nick\'\'" is the House ghost.\nGryffindor corresponds roughly to the element of fire, and it is for this reason that the colours scarlet and gold were chosen to represent the house. The colour of fire corresponds to that of a lion as well, with scarlet representing the mane and tail and gold representing the coat.\n\n\n==Traits==\nGryffindor\'s founder, Godric Gryffindor\nThe Gryffindor house emphasises the traits of courage as well as "\'\'daring, nerve, and chivalry,\'\'" and thus its members are generally regarded as brave, though sometimes 

In [12]:
len(data_json)

13674

Now we simply write our data to Elasticsearch.

In [13]:
doc_store.write_documents(data_json)



And check how many of the *13674* prepared items the index now holds.

In [15]:
requests.get('http://localhost:9200/harrypotter/_count').json()

{'count': 13670,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}

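The index holds 13670 documents, four fewer than the 13674 dictionaries we prepared. One plausible explanation, not verified here, is that entries with identical content were collapsed into a single document during the write.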
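
The count returned above (13670) is slightly lower than the 13674 dictionaries prepared earlier. The sketch below shows one way to check whether repeated content could account for a gap like this. It rests on the assumption that entries with identical `content` end up as a single stored document, which this notebook does not confirm; `count_duplicate_contents` is a hypothetical helper and the sample entries are made up.

```python
from collections import Counter
from typing import Any, Dict, List


def count_duplicate_contents(docs: List[Dict[str, Any]]) -> int:
    """Count surplus entries whose content repeats an earlier entry."""
    counts = Counter(doc["content"] for doc in docs)
    return sum(n - 1 for n in counts.values() if n > 1)


# Made-up entries for illustration only.
sample_docs = [
    {"content": "alpha", "meta": {}},
    {"content": "beta", "meta": {}},
    {"content": "alpha", "meta": {}},
]
surplus = count_duplicate_contents(sample_docs)
print(surplus)                     # entries that repeat earlier content
print(len(sample_docs) - surplus)  # distinct contents
```

If `count_duplicate_contents(data_json)` returned 4, that would be consistent with the assumption above, though it would not prove it.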