# Full-text search with Elasticsearch


### Inverted index

<img src="inverted_index.jpeg" alt="" style="width: 600px;"/>

* just like in a book's index
* commonly used search engine structure
* efficient lookup of terms
* usually stores positional information to enable phrase/proximity searches
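
The points above can be made concrete with a toy positional inverted index. This is only a minimal sketch to illustrate the idea, not how Elasticsearch (or Lucene) implements it; the function names and the sample documents are made up for the illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search_term(index, term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


def search_phrase(index, phrase):
    """Return the sorted ids of the documents where the terms are adjacent and in order."""
    terms = tokenize(phrase)
    if not terms:
        return []
    hits = []
    for doc_id, positions in index.get(terms[0], {}).items():
        for start in positions:
            # every following term must sit at the next position in the same document
            if all(
                start + offset in index.get(term, {}).get(doc_id, [])
                for offset, term in enumerate(terms)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


# made-up sample documents, for illustration only
docs = {
    1: "p53 expression in endometrial carcinoma",
    2: "regulation of p53 expression",
    3: "expression of the p53 gene",
}
index = build_inverted_index(docs)
print(search_term(index, "p53"))               # [1, 2, 3]
print(search_phrase(index, "p53 expression"))  # [1, 2]
```

Looking up a term is a single dictionary access, which is why the structure is efficient, and the stored positions are what make the phrase search possible: document 3 contains both terms but not next to each other, so it is not a phrase hit.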

### Indexing and Searching

<img src="indexing_searching.jpg" alt="" style="width: 400px;"/>


### Installing Elasticsearch

Elasticsearch is an open-source search server. It provides a distributed full-text search engine with an HTTP web interface and schema-free JSON documents. See the [documentation](https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html) for details.

Make sure you have a recent Java runtime installed (otherwise, [install Java](https://java.com/en/download/help/download_options.xml)):

    java -version

Download the Elasticsearch server from https://www.elastic.co/downloads/elasticsearch, e.g.:

    wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/zip/elasticsearch/2.1.0/elasticsearch-2.1.0.zip
    unzip elasticsearch-2.1.0.zip 
    
Give the cluster a unique name:

* edit the file `elasticsearch-2.1.0/config/elasticsearch.yml`
* uncomment line 17 and change it from `# cluster.name: my-application` to `cluster.name: a-unique-name-here`
    
Start Elasticsearch. The last command starts the Elasticsearch server; leave that terminal open and running.

    cd elasticsearch-2.1.0/bin/
    # Linux / macOS:
    ./elasticsearch
    # Windows: run elasticsearch.bat instead
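
Once the server is running, one way to check that it is reachable and that the new cluster name was picked up is to fetch the JSON document the node serves at its root URL. The following is a minimal sketch, assuming Python 3 and the default HTTP port 9200; `cluster_info` and `has_cluster_name` are hypothetical helper names, not part of any Elasticsearch API.

```python
import json
from urllib.request import urlopen


def cluster_info(base_url="http://localhost:9200", opener=urlopen):
    """Fetch and decode the JSON document served at the node's root URL.

    `opener` is only a hook so the function can be tried without a server.
    """
    response = opener(base_url)
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()


def has_cluster_name(info, expected_name):
    """Return True if the node reports the expected cluster name."""
    return info.get("cluster_name") == expected_name


# Usage, with the server from the previous step running:
#   info = cluster_info()
#   print(info["cluster_name"], info["version"]["number"])
#   print(has_cluster_name(info, "a-unique-name-here"))
```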

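To interact with Elasticsearch, we will use its Python client ([documentation](https://elasticsearch-py.readthedocs.org/en/master/)). Install it as usual, pinning the client to the 2.x series so that it matches the 2.1.0 server:

    pip install "elasticsearch>=2.0.0,<3.0.0"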
    
Let's start by indexing a single document into Elasticsearch and then searching for it:

In [21]:

    from __future__ import print_function
    from datetime import datetime
    from elasticsearch import Elasticsearch

    # get a connection to Elasticsearch server
    es = Elasticsearch()

    # index one document
    doc = {
        'author': 'epfl',
        'text': 'The Library is the centre of expertise for scientific and technical information, serving teaching and research at EPFL',
        'timestamp': datetime.now(),
    }
    res = es.index(index="test-index", doc_type='test_doc', id=1, body=doc)
    print('document created:', res['created'])

    # now search for it
    es.indices.refresh(index="test-index")
    res = es.search(index="test-index", body={"query": {"match_all": {}}})
    print("Got %d Hits:" % res['hits']['total'])
    for hit in res['hits']['hits']:
        print("[%(timestamp)s]   %(author)s: %(text)s" % hit["_source"])

Output:

    document created: True
    Got 3 Hits:
    [2015-11-26T12:23:12.368181]   epfl: The Library is the centre of expertise for scientific and technical information, serving teaching and research at EPFL
    [2015-11-26T12:25:57.100934]   epfl: The Library is the centre of expertise for scientific and technical information, serving teaching and research at EPFL
    [2015-11-26T12:26:03.614211]   epfl: The Library is the centre of expertise for scientific and technical information, serving teaching and research at EPFL


### Indexing more documents

In [22]:

    import csv

    # each row of the TSV file holds: pmid, title, authors, abstract
    with open('../3b_pubmed_rest_api/pubmed_results.tsv') as f:
        for pmid, title, authors, abstract in csv.reader(f, delimiter='\t'):
            doc = {
                'authors'  : authors,
                'pmid'     : pmid,
                'title'    : title,
                'abstract' : abstract,
                'timestamp': datetime.now(),
            }
            print('indexing document ', pmid)
            # no id is given, so Elasticsearch generates one for each document
            res = es.index(index="test-index2", doc_type='test_doc2', body=doc)
            print('indexed document: ', pmid,  res['created'])

Output:

    indexing document  26174762
    indexed document:  26174762 True
    indexing document  26577528
    indexed document:  26577528 True
    indexing document  25752701
    indexed document:  25752701 True
    indexing document  26291961
    indexed document:  26291961 True
    indexing document  26464464
    indexed document:  26464464 True
    indexing document  26430901
    indexed document:  26430901 True
    indexing document  26463923
    indexed document:  26463923 True
    indexing document  26059022
    indexed document:  26059022 True
    indexing document  26174106
    indexed document:  26174106 True
    indexing document  26497429
    indexed document:  26497429 True
    indexing document  26583177
    indexed document:  26583177 True
    indexing document  26583176
    indexed document:  26583176 True
    indexing document  25263447
    indexed document:  25263447 True
    indexing document  26343523
    indexed document:  26343523 True
    indexing document  26408402
    indexed document:  26408402 True
    indexing document  25172038
    indexed document:  25172038 True
    indexing document  25728
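
Indexing one document per request works for a small file. For larger ones, the Python client also offers `elasticsearch.helpers.bulk`, which sends many documents at once. The sketch below only shows how the same TSV rows could be turned into bulk actions; `tsv_to_actions` is a hypothetical helper name, the sample row is made up, and the `_type` key assumes the 2.x server and client used in this tutorial.

```python
import csv
import io
from datetime import datetime


def tsv_to_actions(lines, index_name="test-index2", doc_type="test_doc2"):
    """Yield one bulk action per TSV row (pmid, title, authors, abstract)."""
    for pmid, title, authors, abstract in csv.reader(lines, delimiter="\t"):
        yield {
            "_index": index_name,
            "_type": doc_type,
            "_source": {
                "authors": authors,
                "pmid": pmid,
                "title": title,
                "abstract": abstract,
                "timestamp": datetime.now(),
            },
        }


# made-up sample row, only to show the shape of an action
sample = io.StringIO("12345\tA sample title\tDoe J.\tA sample abstract\n")
for action in tsv_to_actions(sample):
    print(action["_index"], action["_source"]["pmid"])

# With a running server, the actions could be sent in bulk:
#   from elasticsearch import Elasticsearch, helpers
#   with open('../3b_pubmed_rest_api/pubmed_results.tsv') as f:
#       helpers.bulk(Elasticsearch(), tsv_to_actions(f))
```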

In [35]:

    # `interact` now lives in the ipywidgets package (formerly IPython.html.widgets)
    from ipywidgets import interact

    @interact(query="p53")
    def search(query):
        # full-text search on the title field
        res = es.search(index="test-index2", body={"query": {"match": {'title': query}}})

        print("searching for '%s', got %d hits:" % (query, res['hits']['total']))
        for hit in res['hits']['hits']:
            print("[%(pmid)s] %(title)s [%(authors)s]" % hit["_source"])

Output:

    searching for 'expression', got 2 hits:
    [25573363] Mdm2-dependent regulation of p53 expression during long-term potentiation. [Lisachev PD, Pustylnyak VO, Shtark MB.]
    [25788166] Significance of p53 expression in background endometrium in endometrial carcinoma. [Nguyen TT, Hachisuga T, Urabe R, Kurita T, Kagami S, Kawagoe T, Shimajiri S, Nabeshima K.]
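
The `match` query used above accepts a document as soon as one of the analyzed query terms occurs in the field. When the terms should appear next to each other and in order, the query DSL also has a `match_phrase` query, which relies on the positional information stored in the inverted index. The two small helpers below are hypothetical names that only build the request bodies, as a sketch of how the two queries differ in shape.

```python
def match_query(field, text):
    """Body for a `match` query: by default, any analyzed term of `text` may match."""
    return {"query": {"match": {field: text}}}


def match_phrase_query(field, text):
    """Body for a `match_phrase` query: the terms must be adjacent and in order."""
    return {"query": {"match_phrase": {field: text}}}


print(match_query("title", "p53 expression"))
print(match_phrase_query("title", "p53 expression"))

# With a running server and the index built above, either body could be used:
#   res = es.search(index="test-index2",
#                   body=match_phrase_query("title", "p53 expression"))
```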
