"*Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements."*(https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html)

Remember to connect to the NU VPN before running the following code, which connects (remotely) to the Elasticsearch server and queries it.

In [1]:
# import elasticsearch module to be used to connect to the Elasticsearch server
from elasticsearch import Elasticsearch, helpers 

In [2]:
# create low-level client
es=Elasticsearch('http://enron:spsdata@129.105.88.91:9200')   #  ip address and port to connect to...

We are going to query documents of *type* `email` within the `enron` *index*. (https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html)

We demonstrate some sample queries of the Elasticsearch data store. Each query follows the same template:
 * Define the query as a dictionary `query = ...`
 * Use the `es` client to count the number of results returned by the query: `es.count(...)`
 * Use the `es` client to retrieve some or all of the results returned by the query: `es.search(...)`
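
The three-step template above could be wrapped in a pair of small helpers. This is only a sketch: `build_match_query` and `count_then_sample` are hypothetical names that do not appear in this notebook, and the keyword arguments (`index`, `doc_type`, `body`, `size`) assume the same elasticsearch-py client version used in the cells here.

```python
def build_match_query(field, text, source_fields=None):
    """Return a query body that matches `text` in `field`.

    `source_fields` optionally limits which parts of each document come back.
    """
    body = {"query": {"match": {field: text}}}
    if source_fields is not None:
        body["_source"] = source_fields
    return body


def count_then_sample(client, body, size=1, index="enron", doc_type="email"):
    """Count the matches for `body`, then fetch up to `size` of them.

    `client` is assumed to behave like the `es` object created above.
    """
    # Pass only the "query" part to count; "_source" is meant for search.
    count_body = {"query": body["query"]}
    total = client.count(index=index, doc_type=doc_type, body=count_body)["count"]
    response = client.search(size=size, index=index, doc_type=doc_type, body=body)
    return total, response["hits"]["hits"]
```

With a live connection, `count_then_sample(es, build_match_query("body", "silverpeak"))` would return the match count together with a one-element list of hits.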

In [3]:
# count the number of emails with silverpeak in the body
query={"query" : {"match" : {"body":"silverpeak"}}}
es.count(index='enron',doc_type='email',body=query)  

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'count': 14}

In [4]:
# get one such email....
es.search(size=1,index='enron',doc_type='email',body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '<30684138.1075841590968.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 12.64623,
    '_source': {'body': 'The ISO cut the one side of my wheel on HE24 on 4/13. \n\nI had a firm export at 4C (EPMI_CISO_BUNNY)  of 20 mw for a sale to powerex and a wheel EPMI_CISO_ TOBY at PALO going out at Silverpeak. \nThey cut the sale to POWEREX and the following imports. They cut the following:\n\nEPMI_CISO_BUNNY 20 to a 8\nEPMI_CISO_TOBY IMPORT 12 to 4. \n\nAll this while keeping my EPMI_CISO_TOBY Export at a 12. \n\nBasically they used both the WHEEL IMPORT/ FIRM IMPORT to Fill the WHEEL EXPORT.\n\nGeir ',
     'headers': {'Date': 'Sat, 14 Apr 2001 12:42:00 -0700 (PDT)',
      'From': 'geir.solberg@enron.com',
      'Message-ID': '<30684138.1075841590968.JavaMail.evans@thyme>',
      'Subject': 'Strange cut:',
      'To': 'volume_management_portland@enron.com',
      'X-From': 'Geir Solbe

In [5]:
# get emails with silverpeak in the body but just display the From field...
query={"_source": "headers.From", "query" : {"match" : {"body":"silverpeak"}}}
es.search(size=1,index='enron',doc_type='email',body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '<30684138.1075841590968.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 12.64623,
    '_source': {'headers': {'From': 'geir.solberg@enron.com'}},
    '_type': 'email'}],
  'max_score': 12.64623,
  'total': 14},
 'timed_out': False,
 'took': 2}

In [6]:
# get emails with silverpeak in the body but just display the From and To fields...
query={"_source": ["headers.From","headers.To"], "query" : {"match" : {"body":"silverpeak"}}}
es.search(size=1,index='enron',doc_type='email',body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '<30684138.1075841590968.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 12.64623,
    '_source': {'headers': {'From': 'geir.solberg@enron.com',
      'To': 'volume_management_portland@enron.com'}},
    '_type': 'email'}],
  'max_score': 12.64623,
  'total': 14},
 'timed_out': False,
 'took': 2}
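
The responses above are plain nested dictionaries, so the requested header fields can be pulled out with ordinary Python. The sketch below is illustrative only: `header_rows` is a hypothetical helper, and `sample_response` is a hand-trimmed copy of the output shown above rather than a live result.

```python
def header_rows(response, fields=("From", "To")):
    """Return one dict per hit holding the requested header fields."""
    rows = []
    for hit in response["hits"]["hits"]:
        headers = hit.get("_source", {}).get("headers", {})
        # Headers that were not returned become None instead of raising KeyError.
        rows.append({name: headers.get(name) for name in fields})
    return rows


# Trimmed copy of the search output shown above (not a live response).
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "<30684138.1075841590968.JavaMail.evans@thyme>",
                "_source": {
                    "headers": {
                        "From": "geir.solberg@enron.com",
                        "To": "volume_management_portland@enron.com",
                    }
                },
            }
        ],
        "total": 14,
    }
}

rows = header_rows(sample_response)
print(rows)
```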

In [7]:
# count the emails sent from geir.solberg@enron.com, then display one of them
# (only the value of the last expression in the cell is shown as output)
query={"query":{"nested":{"path":"headers","query":{"match":{"headers.From":"geir.solberg@enron.com"}}}}}
es.count(index='enron',doc_type='email',body=query)
es.search(size=1,index='enron',doc_type='email',body=query) # display one such email

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '<23384755.1075840632119.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 8.394843,
    '_source': {'body': 'I do not know if everybody knows about the change in the price for the SPS \nsale. The cost is now according to Roger\nPeak:  $31.20\nOff-Peak: $28.25\n\nGeir',
     'headers': {'Date': 'Mon, 23 Oct 2000 05:56:00 -0700 (PDT)',
      'From': 'geir.solberg@enron.com',
      'Message-ID': '<23384755.1075840632119.JavaMail.evans@thyme>',
      'Subject': 'SPS Firm buy.',
      'To': 'portland.shift@enron.com, john.forney@enron.com',
      'X-From': 'Geir Solberg',
      'X-To': 'Portland Shift, John M Forney',
      'X-bcc': '',
      'X-cc': ''},
     'mailbox': 'guzman-m',
     'subFolder': 'all_documents'},
    '_type': 'email'}],
  'max_score': 8.394843,
  'total': 79},
 'timed_out': False,
 'took': 2}
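
Queries on the header fields in this notebook are wrapped in a `nested` clause with `"path": "headers"`. A minimal sketch of building such bodies programmatically follows; `nested_query` and `match_any_field` are hypothetical helper names, and the dictionaries they return mirror the hand-written queries in the cells.

```python
def match_any_field(fields, text):
    """Return a multi_match clause that looks for `text` in any of `fields`."""
    return {"multi_match": {"fields": list(fields), "query": text}}


def nested_query(path, inner_query, source_fields=None):
    """Wrap `inner_query` so that it runs against the nested object at `path`."""
    body = {"query": {"nested": {"path": path, "query": inner_query}}}
    if source_fields is not None:
        body["_source"] = list(source_fields)
    return body


# Emails sent from one address (same shape as the query in the cell above).
sent_by = nested_query(
    "headers", {"match": {"headers.From": "geir.solberg@enron.com"}}
)

# Emails sent from or to that address, returning only the two header fields.
from_or_to = nested_query(
    "headers",
    match_any_field(["headers.From", "headers.To"], "geir.solberg@enron.com"),
    source_fields=["headers.From", "headers.To"],
)
```

Either dictionary can then be passed as `body=` to `es.count` or `es.search`, subject to the same client-version assumptions as the rest of the notebook.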

In [8]:
# get the number of emails with geir.solberg@enron.com in the body or Subject or both.
query={"query" : {"multi_match" : {"fields" : ["body", "Subject"],"query":"geir.solberg@enron.com"}}}
es.count(index='enron',doc_type='email',body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'count': 47688}

In [9]:
# display From/To fields of one email with geir.solberg@enron.com in the body or Subject or both.
query={"_source": ["headers.From","headers.To"], "query" : {"multi_match" : {"fields" : ["body", "Subject"],"query":"geir.solberg@enron.com"}}}
es.search(size=1,index='enron',doc_type='email',body=query) #display one such

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '<19714461.1075839999527.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 16.025543,
    '_source': {'headers': {'From': 'bill.williams@enron.com',
      'To': 'jack_todd@pgn.com'}},
    '_type': 'email'}],
  'max_score': 16.025543,
  'total': 47688},
 'timed_out': False,
 'took': 29}

In [10]:
# display the number of emails From or To geir.solberg@enron.com
query={"query":{"nested":{"path":"headers","query" : {"multi_match" : {"fields" : ["headers.From", "headers.To"],"query":"geir.solberg@enron.com"}}}}}
es.count(index='enron',doc_type='email',body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'count': 671}

In [11]:
# display emails From or To geir.solberg@enron.com
query={"_source":["headers.From","headers.To"],"query":{"nested":{"path":"headers","query" : {"multi_match" : {"fields" : ["headers.From", "headers.To"],"query":"geir.solberg@enron.com"}}}}}
es.search(size=0,index='enron',doc_type='email',body=query) # how many?
es.search(size=5,index='enron',doc_type='email',body=query) # display 5 of them...

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '<8103219.1075841579360.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 9.504709,
    '_source': {'headers': {'From': 'holden.salisbury@enron.com',
      'To': 'geir.solberg@enron.com'}},
    '_type': 'email'},
   {'_id': '<12227389.1075841585874.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 9.504709,
    '_source': {'headers': {'From': 'karen.buckley@enron.com',
      'To': 'geir.solberg@enron.com'}},
    '_type': 'email'},
   {'_id': '<19234234.1075840651338.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 9.121659,
    '_source': {'headers': {'From': 'virginia.thompson@enron.com',
      'To': 'geir.solberg@enron.com'}},
    '_type': 'email'},
   {'_id': '<31670730.1075841934305.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 9.121659,
    '_source': {'headers': {'From': 'kate.symes@enron.com',
      'To': 'geir.solberg@enron.com'}},
  

In [13]:
# get all the emails for the purposes of analyzing them as required in GrEx4
query={"query" : {"match_all" : {}}}

# http://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch
# get the number of documents in the enron index of type email
es.count(index='enron',doc_type='email',body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'count': 250762}

In [14]:
# get just one..
es.search(size=1,index='enron',doc_type='email',body=query) 

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '<32338077.1075857332643.JavaMail.evans@thyme>',
    '_index': 'enron',
    '_score': 1.0,
    '_source': {'body': "Please click on the URL below for Enron's 2001 Holiday Schedule.\n\nhttp://home.enron.com:84/messaging/2001sched.jpg",
     'headers': {'Date': 'Mon, 13 Nov 2000 10:46:00 -0800 (PST)',
      'From': 'enron.announcements@enron.com',
      'Message-ID': '<32338077.1075857332643.JavaMail.evans@thyme>',
      'Subject': 'Holiday Schedule 2001',
      'To': 'enron.states@enron.com',
      'X-From': 'Enron Announcements',
      'X-To': 'Enron Employees United States',
      'X-bcc': '',
      'X-cc': ''},
     'mailbox': 'dean-c',
     'subFolder': 'all_documents'},
    '_type': 'email'}],
  'max_score': 1.0,
  'total': 250762},
 'timed_out': False,
 'took': 4}

In [15]:
# can we get them all?
es.search(size=250000,index='enron',doc_type='email',body=query) 

GET http://129.105.88.91:9200/enron/email/_search?size=250000 [status:500 request:0.068s]


TransportError: TransportError(500, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [250000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
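
The error message states the rule directly: `from + size` must not exceed the result window, which is 10000 for this index. The sketch below only illustrates that arithmetic; `fits_result_window` and `page_requests` are hypothetical helpers, not part of elasticsearch-py.

```python
def fits_result_window(from_, size, max_result_window=10000):
    """Return True if a search with these paging values stays inside the window."""
    return from_ + size <= max_result_window


def page_requests(total, page_size, max_result_window=10000):
    """Yield (from_, size) pairs covering as many hits as the window allows."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Hits beyond the window cannot be reached with from/size paging at all.
    reachable = min(total, max_result_window)
    for start in range(0, reachable, page_size):
        yield start, min(page_size, reachable - start)


print(fits_result_window(0, 250000))      # False: the request made above
print(list(page_requests(250762, 4000)))  # [(0, 4000), (4000, 4000), (8000, 2000)]
```

Even with paging, only the first 10000 of the 250762 documents are reachable this way, which is why the message points to the scroll API for large result sets.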

In [16]:
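# http://elasticsearch-py.readthedocs.io/en/master/helpers.html?highlight=scan
# helpers.scan wraps the scroll API and yields every hit matching the query.
# NOTE: index="" does not restrict the scan to the enron index, which may be why
# the length reported below differs from the 250762 counted above; pass
# index="enron" to limit the scan to that index.
scanner = helpers.scan(client=es, query=query, scroll="10m", index="",
                       doc_type="email", timeout="10m")

# keep only the stored document (_source) of each hit
selectdocs = [msg['_source'] for msg in scanner]

len(selectdocs)  # expected 250762, the count returned above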

4118903
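
Because `helpers.scan` returns a generator, hits can also be processed lazily instead of being collected all at once. The following sketch assumes each hit has the same shape as the search results shown earlier (`_id` and `_source` with `body`, `headers`, `mailbox`, and `subFolder`); `flatten_hit` and `take_records` are hypothetical helpers introduced for illustration.

```python
from itertools import islice


def flatten_hit(hit):
    """Turn one hit into a flat dict with the header fields at the top level."""
    source = hit["_source"]
    record = {
        "id": hit.get("_id"),
        "mailbox": source.get("mailbox"),
        "subFolder": source.get("subFolder"),
        "body": source.get("body"),
    }
    # Copy header fields such as From, To, Subject, and Date alongside the rest.
    record.update(source.get("headers", {}))
    return record


def take_records(hits, limit):
    """Flatten at most `limit` hits, consuming no more of `hits` than needed."""
    return [flatten_hit(hit) for hit in islice(hits, limit)]
```

Calling `take_records(scanner, 1000)` on a fresh scanner would pull only the first 1000 hits from the server, which is convenient for trying out an analysis before loading everything.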

In [None]:
import pickle
with open('/Users/EdwardArroyo/selectdocs.pickle', 'wb') as handle:
    pickle.dump(selectdocs, handle)

'/Users/EdwardArroyo/Dropbox/NU (FA 17)/PREDICT 420/GrEx4/Examples/GrEx4 Files'
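
Once the documents are pickled, a later session can reload them without contacting the server. The round trip below is a sketch using a temporary directory and a tiny stand-in list; `save_docs` and `load_docs` are hypothetical helper names, and the real file path is whatever was passed to `open` above.

```python
import os
import pickle
import tempfile


def save_docs(docs, path):
    """Write `docs` to `path` in binary pickle format."""
    with open(path, "wb") as handle:
        pickle.dump(docs, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_docs(path):
    """Read back a list of documents written by save_docs."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in data with the same shape as one element of selectdocs.
docs = [{"body": "hello", "headers": {"From": "a@enron.com"}, "mailbox": "x"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "selectdocs.pickle")
    save_docs(docs, path)
    restored = load_docs(path)
```

Only load pickle files you created yourself, since unpickling can execute arbitrary code.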