## Initializing Elasticsearch

We begin by initializing Elasticsearch with the names of our index and type. You should already have Elasticsearch installed and have run the "bin/elasticsearch" command to start it. For more information, see: https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html.

Elasticsearch uses slightly different terminology from traditional databases.

The terms map as follows:

- Database = Index
- Table = Type
- Row = Document

Here we initialize a session object in which corpus is the name of the index and articles is its single type.

In [1]:
from snorkel import SnorkelSession
from snorkel.models import Document, Sentence
from elastics import elasticSession, printResults, deleteIndex

session = SnorkelSession()
eSearch = elasticSession("corpus", "articles")
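
To make the terminology concrete: in Elasticsearch versions that still support types, a single document is addressed by its index, type, and ID, much as a row is addressed by database, table, and primary key. The sketch below only builds that REST path. `doc_path` is a hypothetical helper written for illustration and is not part of the `elastics` module.

```python
def doc_path(index, doc_type, doc_id):
    """Return the REST path of one document: /<index>/<type>/<id>."""
    return "/{}/{}/{}".format(index, doc_type, doc_id)


# The first sentence of the corpus would live at this path.
print(doc_path("corpus", "articles", 1))  # prints /corpus/articles/1
```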

### Check Connection

Before we begin indexing, we can check our connection with Elasticsearch as well as our existing indices. 

In [2]:
eSearch.getIndices()

# To delete an index, pass its name; pass "_all" to delete every index.
# deleteIndex(indexName)

Index Information: 
 
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
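
The table above has the same column layout as the output of Elasticsearch's `_cat/indices?v` endpoint, so it is easy to turn into dictionaries when a value such as docs.count is needed programmatically. The sketch below assumes the table is available as a plain string. `parse_cat_table` is a hypothetical helper, and the sample row is the one this tutorial prints after indexing.

```python
def parse_cat_table(text):
    """Parse a whitespace-aligned table (header row plus data rows) into dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Pair each column name with the value in the same position of every row.
    return [dict(zip(header, line.split())) for line in lines[1:]]


sample = """
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   corpus VlTi5iDNQXi2cLp9sYAldQ   5   1      67820            0     45.6mb         45.6mb
"""
rows = parse_cat_table(sample)
print("{} {}".format(rows[0]["index"], rows[0]["docs.count"]))  # prints corpus 67820
```

A header with no rows, as in the output above, yields an empty list.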



## Indexing the Corpus

Now that we have an established connection, we can begin indexing. Here we define the fields that each document in our type will contain: a line number, the sentence text, and the ner_tags joined into a single string. Once everything has been indexed, we can check the index's status, including the number of documents it contains and its size.

In [3]:
count = 0
print("Beginning Indexing")
for doc in session.query(Document):
    for sent in doc.sentences:
        # count doubles as the line number and the key passed to elasticIndex
        count += 1
        body = {
            'lineNum': count,
            'sentence': sent.text,
            # Join the NER tags into a single space-separated string
            'candidates': ' '.join(tag.decode('utf-8') for tag in sent.ner_tags),
        }
        eSearch.elasticIndex(count, body)
print("Documents added %d" % count)
eSearch.getIndices()

Beginning Indexing
Documents added 67820
Index Information: 
 
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   corpus VlTi5iDNQXi2cLp9sYAldQ   5   1      67820            0     45.6mb         45.6mb
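
The body built inside the loop can be factored into a small function, which makes the tag-joining step easier to see in isolation. This is only a sketch: `build_body` is a hypothetical helper, the sample sentence and tags are hand-written, and the tags are assumed to be text strings already (the cell above decodes them from UTF-8 first).

```python
def build_body(line_num, text, ner_tags):
    """Build the document body for one sentence."""
    return {
        "lineNum": line_num,
        "sentence": text,
        # One tag per token, joined into a single space-separated string.
        "candidates": " ".join(ner_tags),
    }


body = build_body(1, "Jessie Mueller won a Tony.",
                  ["PERSON", "PERSON", "O", "O", "O", "O"])
print(body["candidates"])  # prints PERSON PERSON O O O O
```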



### Visualize Index

To better understand our data, we retrieve the mapping of the index. It shows that each document in articles has three fields: lineNum, candidates, and sentence.

In [4]:
eSearch.getIndexMap()

Index Mapping
{
  "corpus": {
    "mappings": {
      "articles": {
        "properties": {
          "lineNum": {
            "type": "long"
          }, 
          "candidates": {
            "fields": {
              "keyword": {
                "ignore_above": 256, 
                "type": "keyword"
              }
            }, 
            "type": "text"
          }, 
          "sentence": {
            "fields": {
              "keyword": {
                "ignore_above": 256, 
                "type": "keyword"
              }
            }, 
            "type": "text"
          }
        }
      }
    }
  }
}
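
If only the field names and types are of interest, they can be pulled out of the mapping with a few dictionary lookups. The sketch below assumes the mapping is available as a Python dict with the structure printed above. `field_types` is a hypothetical helper, not part of the `elastics` module.

```python
def field_types(mapping, index, doc_type):
    """Return {field name: field type} for one type in an index mapping."""
    properties = mapping[index]["mappings"][doc_type]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


# The same mapping as printed above, written as a Python dict.
keyword_subfield = {"keyword": {"type": "keyword", "ignore_above": 256}}
mapping = {
    "corpus": {"mappings": {"articles": {"properties": {
        "lineNum": {"type": "long"},
        "candidates": {"type": "text", "fields": keyword_subfield},
        "sentence": {"type": "text", "fields": keyword_subfield},
    }}}}
}

for name, ftype in sorted(field_types(mapping, "corpus", "articles").items()):
    print("{}: {}".format(name, ftype))
# prints:
# candidates: text
# lineNum: long
# sentence: text
```

The two text fields are analyzed for full-text search, while their keyword subfields keep the exact string for values of up to 256 characters.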


### Visualizing a Document

We can get any document by its line number. Here we get the first sentence in our corpus and print the value stored in each field.

In [5]:
result = eSearch.getDoc(1)
print("Candidates")
print(result['_source']['candidates'])
print("Sentence")
print(result['_source']['sentence'])
print("Sentence number")
print(result['_source']['lineNum'])


Candidates
O O O O O O O O O O O O O O O O O PERSON O PERSON PERSON O O O O O O O O O O O O ORG O O
Sentence
NEW YORK -- Theatergoers who check out "Beautiful" on tour won't get to see Tony winner Jessie Mueller but they may get the next best thing -- someone with her DNA.   
Sentence number
1


## Querying the Corpus

Once all our documents are indexed, we can perform a simple query. In this example, we query the sentence field for the words married OR children. Matches that contain both words in their entirety are scored higher. After performing the query, we print the results, which are sorted by score in descending order.

In [6]:
query = "married children"
field = "sentence"
searchResult = eSearch.searchIndex(field, query)
printResults(searchResult, field)

Number of hits 
2477
Result 1
-------------------
sentence
He is married with        three children.     
Result 2
-------------------
sentence
Family: Married to Columba Bush (1974), with three adult children.
Result 3
-------------------
sentence
Family: Married to Janet Huckabee (1974), with three adult children.
Result 4
-------------------
sentence
Family: Married to Stephanie Chafee (1990) with three children.
Result 5
-------------------
sentence
Family: Married to Mary Pat Foster (1986) with four children.   
Result 6
-------------------
sentence
Family: Married to Janet Huckabee (1974), with three adult children.
Result 7
-------------------
sentence
Family: Married to Janet Huckabee (1974), with three adult children.
Result 8
-------------------
sentence
Family: Married to Libby Rowland (1973), with four adult children.   
Result 9
-------------------
sentence
Family: Married to Janet Huckabee (1974), with three adult children.
Result 10
-------------------
sentence
Family: M
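
For reference, an OR search over one field corresponds to a match query in Elasticsearch's query DSL. The sketch below builds such a request body as a plain dict. Whether `searchIndex` sends exactly this body is an assumption, and `match_query` is a hypothetical helper.

```python
import json


def match_query(field, text, operator="or"):
    """Build a full-text match query body for a single field."""
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


# "or" matches sentences containing either word; "and" would require both.
print(json.dumps(match_query("sentence", "married children"),
                 indent=2, sort_keys=True))
```

With the "or" operator, a sentence containing both words still matches and typically receives a higher score than one containing only a single word.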

### Search between Candidates

We can also search between two candidates, which were defined as PERSON in the spousal tutorial. Specifically, we query for PERSON married PERSON, in that order. We specify the field in which the tags are stored (candidates), the tag (PERSON), the field that contains the sentence (sentence), the value we want to search for (married), and the distance. Distance is the number of positions the words may be apart from each other and still constitute a hit. A distance of one would require all three words to be side by side.

There is a slight discrepancy in the hits here because spaCy and Elasticsearch tokenize text differently. Ideally we would search the two pre-tokenized arrays directly, an issue I posted here:
https://stackoverflow.com/questions/45537916/elasticsearch-proximity-search-using-2-pre-tokenized-arrays

In [7]:
distance = 100
value = "married"
candField = "candidates"
candTag = "person"
sentField = "sentence"
result = eSearch.searchBetweenCandidates(candField, candTag, sentField, value, distance)
printResults(result, sentField)


Number of hits 
275
Result 1
-------------------
sentence
Lady Bird Johnson was born as Claudia Alta Taylor and married Lyndon B. Johnson in 1934.
Result 2
-------------------
sentence
Alex Gerrard - who's married to footballer Steven Gerrard, 35 - returned from her new US residence to her home town to indulge in a beauty treatment on Thursday.

Result 3
-------------------
sentence
Just over a year later, Kim Davis and the biological father of her twins, Thomas Dale McIntyre Jr., married in a ceremony on November 11, 2007, at Skye Bridge in Wolfe County.    
Result 4
-------------------
sentence
Williams married Van Veen on Saturday afternoon in a private ceremony at a ranch out West, her rep confirmed to the Los Angeles Times.   
Result 5
-------------------
sentence
Williams married Van Veen on Saturday afternoon in a private ceremony at a ranch out West, her rep confirmed to the Los Angeles Times.
Result 6
-------------------
sentence
Parenthood star Erika Christensen has married c
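
The tokenization discrepancy mentioned above would not arise if the check ran directly on aligned token and tag arrays. The sketch below shows that idea in plain Python, outside Elasticsearch. It is one interpretation of the distance rule described in this section (a distance of one means the words are adjacent), `between_candidates` is a hypothetical helper, and the sample tokens and tags are hand-written.

```python
def between_candidates(tokens, tags, tag, value, distance):
    """Return True if `value` has a `tag` token within `distance` positions
    on its left and another within `distance` positions on its right."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must be aligned")
    value = value.lower()
    for j, token in enumerate(tokens):
        if token.lower() != value:
            continue
        left = tags[max(0, j - distance):j]    # tags just before the value
        right = tags[j + 1:j + 1 + distance]   # tags just after the value
        if tag in left and tag in right:
            return True
    return False


tokens = ["Williams", "married", "Van", "Veen", "on", "Saturday"]
tags = ["PERSON", "O", "PERSON", "PERSON", "O", "O"]
print(between_candidates(tokens, tags, "PERSON", "married", 1))  # prints True
```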

### Search in between 

Lastly, we can search any single field for three values, where value1 must come before value2 and value2 before value3. Again, the distance is the number of positions allowed between the values.

In [8]:
value1 = "essential"
value2 = "legal"
value3 = "time"
field = "sentence"
# distance is reused from the previous cell (100)
result = eSearch.searchOrder(field, distance, value1, value2, value3)
printResults(result, field)

Number of hits 
1
Result 1
-------------------
sentence
Inheritors can only claim once, so it's essential they seek expert legal advice to ensure they get it right first time.   '
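
An ordered proximity search like this one can be expressed in Elasticsearch's query DSL as a span_near query with in_order set to true. The sketch below builds such a request body. Whether `searchOrder` uses this query is an assumption, `ordered_span_query` is a hypothetical helper, and how the tutorial's distance maps onto slop is not confirmed by the source.

```python
import json


def ordered_span_query(field, slop, *values):
    """Build a span_near query requiring `values` to appear in the given order."""
    # span_term is not analyzed, so the terms are lowercased here to match the
    # tokens a default text field stores (assuming the standard analyzer).
    clauses = [{"span_term": {field: value.lower()}} for value in values]
    return {"query": {"span_near": {
        "clauses": clauses,
        "slop": slop,       # unmatched positions tolerated inside the span
        "in_order": True,   # clauses must match in the order listed
    }}}


query = ordered_span_query("sentence", 100, "essential", "legal", "time")
print(json.dumps(query, indent=2, sort_keys=True))
```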
