# Using the Elasticsearch client

We'll load some arXiv article data and index it with Elasticsearch.

The simplest way to run Elasticsearch locally is with [its Docker image](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html).

## Start client

Pass your Elasticsearch endpoint to the constructor if you don't want to use the default (localhost:9200).

In [1]:
from elasticsearch import Elasticsearch
es = Elasticsearch()
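
If the cluster does not live at the default address, the endpoint can be passed to the constructor. A minimal sketch, assuming a hypothetical `ES_ENDPOINT` environment variable (the client does not read it on its own) and a helper name, `es_hosts`, chosen for this example:

```python
import os

DEFAULT_ENDPOINT = 'localhost:9200'

def es_hosts(env_var='ES_ENDPOINT'):
  # env_var is a hypothetical name chosen for this sketch.
  # Accept a comma-separated list of endpoints, falling back to the default.
  raw = os.environ.get(env_var, '')
  hosts = [h.strip() for h in raw.split(',') if h.strip()]
  return hosts or [DEFAULT_ENDPOINT]

hosts = es_hosts()
print(hosts)
```

With the client version used in this notebook, the resulting list can be passed as `Elasticsearch(hosts)`.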

## Check out indices

In [2]:
# Start from a clean index; ignore the error if it does not exist yet
es.indices.delete('arxiv_document', ignore=[400, 404])
es.indices.create('arxiv_document')

In [3]:
es.indices.get_alias("*")

{'.kibana': {'aliases': {}},
 '.monitoring-es-6-2018.01.13': {'aliases': {}},
 '.monitoring-es-6-2018.01.14': {'aliases': {}},
 'arxiv_document': {'aliases': {}}}

## Load documents from arXiv

In [4]:
import arxiv
import json

In [5]:
%%time

search_query = 'machine learning'

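def paged_results(search_query, no_pages, per_page=1000):
  # Lazily yield one page (a list of results) per arxiv.query call
  for i in range(no_pages):
    yield arxiv.query(
      search_query,
      start=i * per_page,
      max_results=per_page)

# Fetch a single page first to check that the query works
results = arxiv.query(search_query=search_query,
                      max_results=1000)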

print('Query returned {} results'.format(len(results)))

Query returned 1000 results
CPU times: user 3.72 s, sys: 36 ms, total: 3.76 s
Wall time: 18 s


In [6]:
from tqdm import tqdm

no_pages = 100
per_page = 1000 
pages = paged_results(search_query, no_pages, per_page=per_page)
results = []

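for res_page in tqdm(pages, total=no_pages):
  n_papers = len(res_page)
  # A short page means we are at (or past) the end of the result set
  if n_papers != per_page:
    print('There are {} papers for this page'.format(n_papers))
  # Stop at the first empty page instead of requesting the rest
  if n_papers == 0:
    break
  results += res_page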

  3%|▎         | 3/100 [00:44<23:46, 14.71s/it]

There are 0 papers for this page
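
The loop above stops at the first empty page. The same stopping rule can live inside the generator, so the caller never sees an empty page. A sketch with a stand-in `fetch_page` function (hypothetical; in this notebook it would wrap `arxiv.query`) and fake data in place of the arXiv API:

```python
def pages_until_empty(fetch_page, max_pages, per_page=1000):
  # fetch_page(start, max_results) -> list; stop at the first empty page
  for i in range(max_pages):
    page = fetch_page(i * per_page, per_page)
    if not page:
      return
    yield page

# Stand-in data source: 2500 fake records instead of real API calls
fake_records = list(range(2500))

def fetch_page(start, max_results):
  return fake_records[start:start + max_results]

collected = []
for page in pages_until_empty(fetch_page, max_pages=100):
  collected += page

print(len(collected))  # 2500
```

Only four requests are made here: three non-empty pages and the empty one that ends the iteration.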


In [7]:
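def make_document(content, INDEX_NAME='arxiv_document'):
  # Build one bulk action: metadata keys start with '_',
  # everything else in `content` becomes the document body
  return {
    '_op_type': 'create',  # fail rather than overwrite an existing _id
    '_type': 'document',
    '_id': content['arxiv_url'],
    '_index': INDEX_NAME,
    **content
  }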
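
To see what the bulk helper will receive, here is the action built for a single record. The sample record is made up for illustration; real results carry many more fields.

```python
def make_document(content, INDEX_NAME='arxiv_document'):
  return {
    '_op_type': 'create',
    '_type': 'document',
    '_id': content['arxiv_url'],
    '_index': INDEX_NAME,
    **content
  }

# Made-up record, not real arXiv data
sample = {
  'arxiv_url': 'http://arxiv.org/abs/0000.00000v1',
  'title': 'A placeholder title',
}

action = make_document(sample)
print(action['_id'], action['_index'])
```

The keys starting with an underscore are treated as metadata by the bulk helper, and the remaining keys are stored as the document itself.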

## Load data to Elasticsearch

In [None]:
%%time

from elasticsearch.helpers import bulk

bulk(es, map(make_document, results))
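
Because the actions use `'_op_type': 'create'`, a document whose `_id` already exists in the index is rejected as a conflict. If the same paper shows up on more than one page, it may help to drop duplicates before indexing. A sketch, assuming each result is a dict with an `arxiv_url` key as in `make_document`; the helper name `dedupe_by_url` is ours:

```python
def dedupe_by_url(records):
  # Keep the first record seen for each arxiv_url, preserving order
  seen = set()
  for rec in records:
    url = rec['arxiv_url']
    if url not in seen:
      seen.add(url)
      yield rec

# Made-up records for illustration
records = [
  {'arxiv_url': 'http://arxiv.org/abs/0000.00001v1'},
  {'arxiv_url': 'http://arxiv.org/abs/0000.00002v1'},
  {'arxiv_url': 'http://arxiv.org/abs/0000.00001v1'},
]

unique = list(dedupe_by_url(records))
print(len(unique))  # 2
```

In this notebook the call would then read `bulk(es, map(make_document, dedupe_by_url(results)))`.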