#### These notebooks provide brief getting started notes on installing and using elasticsearch for searching free text documents in text, html, pdf and doc/docx format

First, download elasticsearch from https://www.elastic.co/downloads/elasticsearch and extract the archive.  No further installation is necessary.

You might need to install a java runtime, which you can do using [homebrew](http://stackoverflow.com/questions/24342886/how-to-install-java-8-on-mac) on mac.

Next you want the mapper attachments plugin, which will allow you to put pdfs and other documents into the database in a searchable format.

To do this, in the terminal `cd` to the directory where elasticsearch is installed (where you uncompressed the files), and run  `bin/plugin install mapper-attachments`

Follow the [hello world](https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments-helloworld.html) guidance to make sure everything is working

Next we need to design our 'database' - but to do this we need to learn some terminology.

'An index' = a bit like a database

'To index' = to put something in the database

'Type' = Similar to a table

'Document with properties' = row with columns

![image](https://www.elastic.co/assets/blt4852179309023a90/index-type-docs.png)

The first job is to set up a '[mapping](https://www.elastic.co/blog/found-elasticsearch-mapping-introduction)' ('schema') for our 'index'.  This tells elasticsearch what documents you're going to put into your index.

Note that we will be adding data into elasticsearch in JSON format, so our mappings are going to correspond to these JSON structures.  We're even going to use json for pdf and other binary documents by first encoding them into base64.
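
As a minimal sketch of that encoding step, the example below base64-encodes some bytes, embeds them in a JSON document and recovers them unchanged. The byte string is only a stand-in for the contents of a real PDF, and `example.pdf` is a made-up file name.

```python
import base64
import json

# Stand-in for the raw bytes of a PDF or Word file (illustrative only)
raw_bytes = b"%PDF-1.4 example bytes \x00\x01\x02"

# base64 turns arbitrary bytes into plain ASCII text, which JSON can carry
encoded = base64.b64encode(raw_bytes).decode("ascii")

# One JSON document holding the encoded file plus its name
document = {"my_attachment": encoded, "file_name": "example.pdf"}
payload = json.dumps(document)

# Decoding the JSON and then the base64 gives back the original bytes
decoded = base64.b64decode(json.loads(payload)["my_attachment"])
print(decoded == raw_bytes)
```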

The schema in Elasticsearch is a mapping that describes the fields in the JSON documents along with their data types, as well as how they should be indexed in the Lucene indexes that lie under the hood.

In the Elasticsearch documentation and related material, we often see the term “mapping type”, which is actually the name of the type inside the index, such as my_type and another_type in the figure above. When we talk about types in Elasticsearch, it is usually this definition of type. It is not to be confused with the type key inside each mapping definition, which determines how the data inside the documents is handled by Elasticsearch (e.g. as a date field, a string, etc.).
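
To make the two meanings of 'type' concrete, here is a small sketch using a made-up mapping written as a Python dict. The field names `title` and `published` are purely illustrative. The key directly under `mappings` is the mapping type, while the inner `type` keys are field datatypes.

```python
# A made-up mapping, for illustration only
mapping = {
    "mappings": {
        "my_type": {                              # mapping type (the 'table')
            "properties": {
                "title": {"type": "string"},      # field datatype
                "published": {"type": "date"},    # field datatype
            }
        }
    }
}

# Names of the mapping types defined in this index
mapping_types = sorted(mapping["mappings"])

# Datatype of each field inside the 'my_type' mapping type
field_datatypes = {
    name: spec["type"]
    for name, spec in mapping["mappings"]["my_type"]["properties"].items()
}

print(mapping_types)
print(field_datatypes)
```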

Note that we do not actually need to specify a schema - we can get elasticsearch to 'autodetect' it.  But it's probably better to be explicit, because then we understand what we're doing, and in any case elasticsearch sometimes guesses a field's type incorrectly.

**Getting started with our mapping**

We'll call the index (database) 'casedb'

We'll call the type (table) 'all_cases'

We'll call each document (row) 'onecase'.  

We then define columns, each of which has a type (like 'string' or 'attachment')

The following assumes you have elasticsearch running, which you can do by `cd`ing into the elasticsearch directory and typing `bin/elasticsearch`

In [3]:
#The logging code will get elasticsearch to print the curl commands being issued - you can run these directly on the command line if you'd like to
import logging
logger = logging.getLogger("elasticsearch.trace")
logger.setLevel("DEBUG")

#Set up an elasticsearch object
from elasticsearch import Elasticsearch
es = Elasticsearch()

The Python bindings are as close as possible to the REST API.
So if we want to do:

    curl 'localhost:9200/_cat/health?v'
    
we do

In [5]:
es.cat.health()

INFO:elasticsearch.trace:curl -XGET 'http://localhost:9200/_cat/health?pretty' -d ''
DEBUG:elasticsearch.trace:#[200] (0.040s)
#1462300537 19:35:37 elasticsearch yellow 1 1 5 5 0 0 5 0 - 50.0% 
#


u'1462300537 19:35:37 elasticsearch yellow 1 1 5 5 0 0 5 0 - 50.0% \n'
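
The raw line is easier to read once each value is paired with a column name. The sketch below assumes the column order that `_cat/health?v` prints as its header row on the Elasticsearch 2.x release used here. The names are an assumption rather than something shown in the output above, so check the header on your own version. `parse_cat_health` is a hypothetical helper, not part of the client.

```python
# Column names assumed from the header row of `_cat/health?v` (Elasticsearch 2.x)
HEALTH_COLUMNS = [
    "epoch", "timestamp", "cluster", "status", "node.total", "node.data",
    "shards", "pri", "relo", "init", "unassign", "pending_tasks",
    "max_task_wait_time", "active_shards_percent",
]

def parse_cat_health(line):
    """Pair each whitespace-separated value with its assumed column name."""
    return dict(zip(HEALTH_COLUMNS, line.split()))

health = parse_cat_health(
    "1462300537 19:35:37 elasticsearch yellow 1 1 5 5 0 0 5 0 - 50.0% \n"
)
print(health["status"], health["unassign"], health["active_shards_percent"])
```

Read this way, the `yellow` status goes with the 5 unassigned shards: on a single node the replica shards have nowhere to be allocated, which is normal for a local test setup.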

In [6]:
es.cat.indices()

INFO:elasticsearch.trace:curl -XGET 'http://localhost:9200/_cat/indices?pretty' -d ''
DEBUG:elasticsearch.trace:#[200] (0.045s)
#yellow open casedb 5 1 769 0 272.7mb 272.7mb 
#


u'yellow open casedb 5 1 769 0 272.7mb 272.7mb \n'

In [6]:
# Let's call the index (database) 'casedb'
es.indices.create("casedb")

INFO:elasticsearch.trace:curl -XPUT 'http://localhost:9200/casedb?pretty' -d ''
DEBUG:elasticsearch.trace:#[200] (0.027s)
#{
#  "acknowledged": true
#}


{u'acknowledged': True}

In [34]:
# Let's call the type (table) 'all_cases'
# Let's call each document (row) 'onecase'.  
# We then define columns, each of which has a type (like 'string' or 'attachment')

# What's confusing is that the 'fields' in this are actually meta fields.  

# Metafields are specific to nested objects like attachments or json documents

# The actual columns/'fields' are the properties

mappings = '''
{
    "mappings": {
        "all_cases": {
            "properties": {
                "my_attachment": {
                    "type": "attachment",
                    "fields": {
                        "content": {
                            "type": "string",
                            "term_vector": "with_positions_offsets",
                            "store": true
                        }
                    }
                },
                "file_name" :{
                    "type": "string"
                }
            }
        }
    }
}
'''

if es.indices.exists("casedb"):
    es.indices.delete(index="casedb")
    
es.indices.create(index='casedb', ignore=400, body=mappings)

{u'acknowledged': True}

In [35]:
# Now we are ready to put documents in the index
# This is harder than it seems because the documents must be in .json format
# with any files encoded in base64

def createEncodedTempFile(fname):
    """
    Creates a json file (tmp.json) that matches the mapping specified above
    
    """
    import base64
    import json

    # Read the file as bytes and base64-encode it so it can be embedded in JSON
    with open(fname, "rb") as source:
        file64 = base64.b64encode(source.read()).decode("ascii")

    data = { 'my_attachment': file64,
           'file_name' : fname }
    with open("tmp.json", 'w') as f:
        json.dump(data, f) # dump json to tmp file

In [36]:
# Test on a single pdf (each call overwrites tmp.json, so only the last file encoded is kept)
createEncodedTempFile("/Users/robinlinacre/Documents/python_projects/elasticsearch/downloaded_files/a-v-b-and-c-judgment.pdf")
fpath = "/Users/robinlinacre/Documents/python_projects/elasticsearch/downloaded_files/abu-hamza-others-judgment-05102012.pdf"
createEncodedTempFile(fpath)

In [37]:
# Add to index
import json
with open("tmp.json") as f:
    b = json.load(f)
# es.index(index="casedb", doc_type="all_cases", body=b)  this doesn't work properly because 
# it renders the file as a curl command

logger = logging.getLogger("elasticsearch.trace")
logger.setLevel("ERROR")
es.index(index="casedb", doc_type="all_cases", body=b, id=1) 

{u'_id': u'1',
 u'_index': u'casedb',
 u'_shards': {u'failed': 0, u'successful': 1, u'total': 2},
 u'_type': u'all_cases',
 u'_version': 1,
 u'created': True}
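
To index more than one file, one option is to skip the temporary file and build each document in memory. This is only a sketch: `iter_attachment_docs` and the extension list are hypothetical, not part of the elasticsearch client, and the `es.index` call is left as a comment because it needs a running server. The demo at the end uses a throwaway folder so that it runs anywhere.

```python
import base64
import os
import tempfile

# Extensions to pick up (an assumption for this sketch; adjust as needed)
EXTENSIONS = (".pdf", ".doc", ".docx", ".html", ".txt")

def iter_attachment_docs(folder):
    """Yield (id, document) pairs for each matching file in folder.

    Each document has the same shape as the mapping above:
    a base64-encoded 'my_attachment' and a 'file_name'.
    """
    names = sorted(n for n in os.listdir(folder) if n.lower().endswith(EXTENSIONS))
    for doc_id, name in enumerate(names, start=1):
        with open(os.path.join(folder, name), "rb") as source:
            encoded = base64.b64encode(source.read()).decode("ascii")
        yield doc_id, {"my_attachment": encoded, "file_name": name}

# With a running server, each pair could then be indexed like this:
# for doc_id, doc in iter_attachment_docs("downloaded_files"):
#     es.index(index="casedb", doc_type="all_cases", body=doc, id=doc_id)

# Small demo on a throwaway folder: one matching file, one that is skipped
demo_folder = tempfile.mkdtemp()
for demo_name, demo_bytes in [("example.txt", b"hello"), ("ignored.png", b"skip")]:
    with open(os.path.join(demo_folder, demo_name), "wb") as fh:
        fh.write(demo_bytes)

demo_docs = list(iter_attachment_docs(demo_folder))
print([doc["file_name"] for _, doc in demo_docs])
```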

In [7]:
#Now we're going to run a query against the document we just added
logger = logging.getLogger("elasticsearch.trace")
logger.setLevel("ERROR")

In [13]:
#This just styles the output to highlight the matches.
from IPython.core.display import HTML
HTML("""
<style>
.output_subarea.output_html.rendered_html em {
    font-weight:bold;
    background-color: yellow;
    
    }
</style>
""")

In [20]:
from IPython.display import HTML, display

b = {
    "fields": ["my_attachment.content"],
    "query": {
        "match": {
            "my_attachment.content": "hjj September 14 ago District gkk"
        }
    },
    "highlight": {
        "fields": {
            "my_attachment.content": {}
        }
    }
}

response = es.search(index="casedb", body=b)
for hit in response['hits']['hits']:
    for i in hit["highlight"]["my_attachment.content"]:
        display(HTML(i))
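
A hit is not guaranteed to carry a `highlight` entry, and when it does not, the loop above raises a `KeyError`. A more defensive variant might look like the sketch below. `collect_highlights` is a hypothetical helper, and `sample_response` is a hand-written stand-in shaped like a search response, not real output.

```python
def collect_highlights(response, field="my_attachment.content"):
    """Return (document id, fragment) pairs for every highlighted fragment.

    Hits without a 'highlight' entry for the field are skipped.
    """
    fragments = []
    for hit in response.get("hits", {}).get("hits", []):
        for fragment in hit.get("highlight", {}).get(field, []):
            fragments.append((hit.get("_id"), fragment))
    return fragments

# Hand-written stand-in for a search response (not real output)
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1",
             "highlight": {"my_attachment.content": ["on 14 <em>September</em>"]}},
            {"_id": "2"},  # a hit with no highlight entry
        ]
    }
}

highlights = collect_highlights(sample_response)
print(highlights)
```

In the notebook, each fragment could then be passed to `display(HTML(fragment))` as before.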