Elasticsearch (ES) is built for searching collections of documents.  To improve retrieval quality, it calculates a relevance score for each document that matches a query and uses those scores to order the results it returns.  The overhead of these calculations can unnecessarily slow operations that involve many documents in a large index.

For creating and managing large numbers of documents, ES provides methods for "bulk" CRUD operations.  One of these, referred to as a "scroll" search, is used to download or copy an entire index, or a large portion of it.

A scroll search consists of first taking a snapshot of the documents that match a query, and then retrieving all the matched documents in successive batches.  Each response includes a scroll ID that is used to request the next batch.  The results are not returned in any particular order.  Scroll searches are used to copy an index to another location, to reindex an index, or to download an index for local processing.
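
At the level of the client API, a scroll search is one `search` call followed by repeated `scroll` calls until a batch comes back empty.  The following is only a minimal sketch of that loop, assuming a 5.x-style Python client whose `search`, `scroll`, and `clear_scroll` methods accept the keyword arguments shown.  The function name `scroll_all` and its default `scroll` and `size` values are made up for illustration; they are not part of the ES client.

```python
def scroll_all(client, index, query, scroll="2m", size=500):
    """Yield every hit that matches `query`, fetching one batch at a time.

    `scroll_all` is a hypothetical helper, not part of the ES client.
    """
    # The first request opens the scroll context and returns the first batch.
    page = client.search(index=index, body=query, scroll=scroll, size=size)
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            # Each follow-up request passes the scroll ID from the last response.
            page = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side search context when iteration ends.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
```

In practice the `helpers.scan` function of the Python client wraps this kind of loop, so you rarely need to write it yourself.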

The ES Python client makes scroll searches straightforward.  To use it with the ES enron index on the SSCC, you need a 5.x version of the client.  The following is a simple demonstration, which you can run, of a scroll search that retrieves _all_ email messages from the ES enron index.

In [1]:
# note that the package's helper methods are imported, too.
from elasticsearch import Elasticsearch, helpers 

In [2]:
# connecting to the enron index in ES
es=Elasticsearch('http://enron:spsdata@129.105.88.91:9200')  

In [3]:
# a query spec to match everything, i.e. to retrieve all messages
query={"query" : {"match_all" : {}}}    

In [4]:
scanner = helpers.scan(client=es, query=query, scroll="10m", index="",
                       doc_type="email", timeout="10m")

`scanner`, above, is a _generator_ that yields the matching enron documents one at a time as you iterate over it.

The following is 'raw' example code:  you'll need to enable it in order to run it.  It's raw because executing it can take a few minutes, depending on your computer and the speed of your internet connection.  Do you know how many email documents there are in the enron index?

What's in the square brackets in `selectdocs = [msg['_source'] for msg in scanner]` is a _list comprehension_, sometimes called a _listcomp_.  It creates a list called `selectdocs` that contains the `_source` dictionaries of the email documents.  Without the `['_source']` part, `selectdocs` would be a list of the complete hits returned by ES, each of which wraps the email document in its `_source` field together with metadata such as `_index` and `_id`.

What's `scanner`?  Is it a Python _generator_?

Note that if you wanted to process each document as it is returned, you could write a function that does what you need to each email message.  You could call it in a listcomp, or in an explicit loop that appends each processed message to a list of processed messages.
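
As a rough sketch of the explicit-loop approach, the example below applies a per-message function to each hit as it arrives.  The function `summarize` and the `sample_hits` generator are hypothetical: `sample_hits` stands in for `scanner` with made-up hits shaped like the enron documents, so the code runs without a connection to ES.

```python
def summarize(msg):
    """Hypothetical per-message processing: keep only sender and subject."""
    headers = msg["_source"]["headers"]
    return {"from": headers.get("From"), "subject": headers.get("Subject")}

# Stand-in for `scanner`: a generator of made-up hits, not real enron data.
sample_hits = (hit for hit in [
    {"_id": "1", "_source": {"headers": {"From": "a@example.com", "Subject": "Hello"}}},
    {"_id": "2", "_source": {"headers": {"From": "b@example.com"}}},
])

processed = []
for msg in sample_hits:
    # Each message is processed as soon as the generator yields it.
    processed.append(summarize(msg))
```

The listcomp form of the same loop would be `[summarize(msg) for msg in sample_hits]`.  Either way, only the processed results are kept in memory rather than the full hits.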

In [7]:
selectdocs = [msg['_source'] for msg in scanner]

In [8]:
len(selectdocs)

250762

In [9]:
selectdocs[0]

{'body': "Please click on the URL below for Enron's 2001 Holiday Schedule.\n\nhttp://home.enron.com:84/messaging/2001sched.jpg",
 'headers': {'Date': 'Mon, 13 Nov 2000 10:46:00 -0800 (PST)',
  'From': 'enron.announcements@enron.com',
  'Message-ID': '<32338077.1075857332643.JavaMail.evans@thyme>',
  'Subject': 'Holiday Schedule 2001',
  'To': 'enron.states@enron.com',
  'X-From': 'Enron Announcements',
  'X-To': 'Enron Employees United States',
  'X-bcc': '',
  'X-cc': ''},
 'mailbox': 'dean-c',
 'subFolder': 'all_documents'}

In [10]:
type(scanner)

generator
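
Because `scanner` is a generator, it can be iterated over only once: after the list comprehension has consumed it, a second pass yields nothing, and you would need to call `helpers.scan` again to repeat the retrieval.  The sketch below shows this behavior with a hypothetical stand-in generator, `fake_scanner`, rather than a real ES connection.

```python
def fake_scanner():
    """Hypothetical stand-in for helpers.scan: yields made-up hits lazily."""
    for i in range(3):
        yield {"_id": str(i), "_source": {"n": i}}

gen = fake_scanner()
first_pass = [hit["_source"] for hit in gen]
second_pass = [hit["_source"] for hit in gen]  # the generator is already exhausted

print(len(first_pass), len(second_pass))  # prints: 3 0
```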