# Information Retrieval Practice

Elasticsearch is an open-source distributed search server built on top of Apache Lucene. It is a great tool for quickly building applications with full-text search capabilities. The core implementation is in Java, but it provides a REST interface that lets you interact with Elasticsearch from any programming language.


## Install Elasticsearch

To install Elasticsearch, download the package for your platform from the "Get Elasticsearch" section at https://www.elastic.co/es/start


![](./download.png)

Once downloaded, extract the tar.gz archive and run `bin/elasticsearch` (or `bin\elasticsearch.bat` on Windows). This launches the Elasticsearch server, which by default is accessible at [localhost:9200](http://localhost:9200).

## Querying Elasticsearch via Python

You can query Elasticsearch directly through its REST endpoint. However, the `elasticsearch-py` Python library makes this easier: it provides a wrapper around the REST endpoint that allows us to query the server from Python.

In [1]:
from elasticsearch import Elasticsearch

# Exercise 0: Indexing and Searching Demo for ElasticSearch

Now it's time to run a demo program. In this practice, we will create an inverted index of sample documents (indexing) and then use the Elasticsearch query grammar to search the documents (searching).

### Useful functions

Functions to facilitate the reading of the dataset

In [2]:
import os, io
from collections import namedtuple

# A document class with the following attributes
# filename: document filename
# text: body of document
# path: path of document
Doc = namedtuple('Doc', 'filename path text')

def read_doc(doc_path, encoding):
    '''
        reads a document from path
        input:
            - doc_path : path of document
            - encoding: encoding
        output: =>
            - doc: instance of Doc namedtuple
    '''
    filename = doc_path.split('/')[-1]
    # the context manager closes the file even if reading fails
    with io.open(doc_path, 'r', encoding = encoding) as fp:
        text = fp.read().strip()
    return Doc(filename = filename, text = text, path = doc_path)

def read_dataset(path, encoding = "ISO-8859-1"):
    '''
        reads multiple documents from path
        input:
            - path : root directory that contains the documents
            - encoding: encoding
        output: =>
            - docs: instances of Doc namedtuple returned as generator
    '''
    for root, dirs, files in os.walk(path):
        for doc_path in files:
            yield read_doc(root + '/' + doc_path, encoding)

##  Indexing

We will try to index the sample documents in `./sample_documents`. To index the documents, we first need to make a connection to **Elasticsearch**. 

Before we index the documents, we need to define the **configuration of the Elasticsearch index**. Here you can configure how text is analyzed (tokenizer, stemmer, lemmatizer, and so on) and which similarity function Elasticsearch will use to score search results.

The code below shows a simple configuration for this demo.
The configuration tells Elasticsearch that each document has three fields, `filename`, `path`, and `text`, and that we will use the `text` field for search. `my_analyzer` is used to parse the `text` field at indexing time and, as the search analyzer, to parse the search queries later on. `"index": False` in the `filename` and `path` fields tells Elasticsearch not to index these two fields; therefore, we cannot search them with queries.

The detailed documentation of analyzer can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html).

`"similarity": "boolean"` in the `text` field tells Elasticsearch to use the boolean similarity when scoring matches on the `text` field. Detailed documentation of the search functionality can be found [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html) and [here](https://www.elastic.co/guide/en/elasticsearch/guide/master/search-in-depth.html).


In [3]:
# configuration for indexing
settings = {
  "mappings": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
  },    
  "settings": {      
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase","stop"
          ],
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}
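To get a feel for what `my_analyzer` does to a text, here is a rough pure-Python imitation of its pipeline (character filter, then tokenizer, then token filters in order). It is only a sketch: `approximate_my_analyzer`, `STOP_WORDS` and `TAG_RE` are hypothetical names, the stop-word set is a small subset chosen for illustration, and the regular expression only approximates what `html_strip` really does.

```python
import re

# Assumption: a small illustrative subset of English stop words;
# the list used by the built-in `stop` filter is longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

# Matches any HTML tag except <b> and </b> (mirrors "escaped_tags": ["b"]).
TAG_RE = re.compile(r"</?(?!b\s*>)[a-zA-Z][^>]*>")

def approximate_my_analyzer(text):
    """Rough imitation of `my_analyzer`, for illustration only."""
    stripped = TAG_RE.sub(" ", text)         # char_filter: html_strip
    tokens = stripped.split()                # tokenizer: whitespace
    tokens = [t.lower() for t in tokens]     # filter: lowercase
    return [t for t in tokens if t not in STOP_WORDS]  # filter: stop

print(approximate_my_analyzer("<p>The President Obama visited <b>Ohio</b></p>"))
# ['president', 'obama', 'visited', '<b>ohio</b>']
```

Note that a whitespace tokenizer splits on whitespace only, so the escaped `<b>` tags (and any punctuation) stay attached to the neighbouring word.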

Now we will read the sample documents and index them into the `INDEX_NAME` index. The following two functions create the index and add the documents to it.


In [4]:
from elasticsearch import Elasticsearch

ES_HOSTS = ['http://localhost:9200']
INDEX_NAME = 'sample_index'
DOCS_PATH = 'practice_data/sample_documents'

def create_index(es_conn, index_name, settings):
    '''
        create index structure in elasticsearch server. 
        If index_name exists in the server, it will be removed, and new index will be created.
        input:
            - es_conn: elasticsearch connection object
            - index_name: name of index to create
            - settings: settings and mappings for index to create
        output: =>
            - None
    '''
    if es_conn.indices.exists(index = index_name):
        es_conn.indices.delete(index = index_name)
        print('index `{}` deleted'.format(index_name))
    es_conn.indices.create(index = index_name, body = settings)
    print('index `{}` created'.format(index_name))            
            
def build_index(es_conn, dataset, index_name, settings, DOC_TYPE='doc'):
    '''
        build index from a collection of documents
        input:
            - es_conn: elasticsearch connection object
            - dataset: iterable, collection of namedtuple Doc objects
            - index_name: name of the index where the documents will be stored
            - DOC_TYPE: type signature of documents
    '''
    # (re)create the index: an existing index with the same name is deleted first
    create_index(es_conn = es_conn, index_name = index_name, settings=settings)
    counter_read, counter_idx_failed = 0, 0 # counters

    # retrieve & index documents
    for doc in dataset:
        res = es_conn.index(
            index = index_name,
            id = doc.filename,
            body = doc._asdict())
        counter_read += 1

        if res['result'] != 'created':
            counter_idx_failed += 1
        else:
            print('indexed {} documents'.format(counter_read))

    print('indexed {} docs to index `{}`, failed to index {} docs'.format(
        counter_read,
        index_name,
        counter_idx_failed
    ))
    
    # refresh after indexing
    es_conn.indices.refresh(index=index_name)  

es_conn = Elasticsearch(ES_HOSTS)
dataset = read_dataset(DOCS_PATH)
build_index(es_conn, dataset, INDEX_NAME, settings)

index `sample_index` deleted
index `sample_index` created
indexed 1 documents
indexed 2 documents
indexed 3 documents
indexed 4 documents
indexed 5 documents
indexed 5 docs to index `sample_index`, failed to index 0 docs


We successfully created an inverted index for the sample documents in `./sample_documents`. It's time to search the documents with some queries.
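As a reminder of what an inverted index is, the following sketch builds a toy one in plain Python: a mapping from each term to the set of documents that contain it. This is a conceptual illustration under simplifying assumptions (lowercasing and whitespace splitting only), not how Elasticsearch stores its index; `build_inverted_index` and `toy_docs` are hypothetical names.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated term to the ids of the docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy documents, not the files in the practice data.
toy_docs = {
    "a.txt": "Obama met Hillary",
    "b.txt": "Obama gave a speech",
    "c.txt": "Hillary gave an interview",
}
toy_index = build_inverted_index(toy_docs)
print(sorted(toy_index["obama"]))  # ['a.txt', 'b.txt']
print(sorted(toy_index["gave"]))   # ['b.txt', 'c.txt']
```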

## Searching

**Elasticsearch** supports a dedicated query language (the Query DSL) that aims to replicate the query grammar of traditional search engines (Google Search supports a similar grammar):
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

To understand how the score of a result is computed, see: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html#explain

### Useful Functions

These functions will help you with the ElasticSearch output format in order to visualize the search results

In [5]:
def extract_response(res):
    if res is not None:
        for hit in res['hits']['hits']:
            filename = hit["_source"]["filename"]
            score = hit["_score"]
            
            yield (filename, score)

def print_result(query, res):
    # formatter of searched result
    matches = extract_response(res)
    if matches is not None:
        for match in sorted(matches, key = lambda x: -x[1]):
            print('{}, {}, {},\n'.format(
                query,
                match[0], # filename
                match[1], # score
            ))

We will now perform different types of queries.

First, a query with a single term:

In [6]:
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "bool": {
              "must": [
                {
                  "match": {"text": "Obama"}
                }
              ]
            }
          }
        }
    )
print_result("Obama", res)

Obama, doc1.txt, 1.0,

Obama, doc3.txt, 1.0,

Obama, doc5.txt, 1.0,

Obama, doc2.txt, 1.0,



Now a query for the documents that contain both terms:

In [7]:
# Boolean Query "Obama AND Hillary"
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "match" : {
              "text" : {
                "query" : "Obama Hillary",
                "operator" : "and"
              }
            }
          }
        }
    )
print_result("Obama AND Hillary", res)

Obama AND Hillary, doc1.txt, 2.0,



And now a query for the documents that contain one term but NOT the other:

In [8]:
# Boolean Query "Obama BUT NOT Hillary"
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "bool": {
              "must": [
                {
                    "match": {"text": "Obama"}
                }
              ],
              "must_not":[
                {
                    "match": {"text": "Hillary"}
                }
              ]
            }
          }
        }
    )
print_result("Obama BUT Hillary", res)

Obama BUT Hillary, doc3.txt, 1.0,

Obama BUT Hillary, doc5.txt, 1.0,

Obama BUT Hillary, doc2.txt, 1.0,



Finally, the default behaviour for queries with more than one term: OR.

In [9]:
# Boolean Query "Obama OR Hillary"
# default is OR
res = es_conn.search(index = INDEX_NAME,
    body={
          "query": {
            "match" : {
              "text" : {
                "query" : "Obama Hillary",
              }
            }
          }
        }
    )
print_result("Obama OR Hillary", res)

Obama OR Hillary, doc1.txt, 2.0,

Obama OR Hillary, doc3.txt, 1.0,

Obama OR Hillary, doc5.txt, 1.0,

Obama OR Hillary, doc4.txt, 1.0,

Obama OR Hillary, doc2.txt, 1.0,
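The scores above follow from the `boolean` similarity: each matching query term contributes its query boost (1.0 by default), so a document's score is simply the number of query terms it matches. The sketch below reproduces that behaviour with set operations on hypothetical postings lists; `postings` and `boolean_scores` are illustrative names and the document ids are made up.

```python
# Hypothetical postings lists (term -> documents containing it); not the real index.
postings = {
    "obama": {"d1", "d2", "d3"},
    "hillary": {"d1", "d4"},
}

def boolean_scores(terms, postings, operator="or"):
    """Return {doc: score}, where the score is the number of query terms matched."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return {}
    # AND keeps documents present in every postings list, OR in at least one.
    candidates = set.intersection(*sets) if operator == "and" else set.union(*sets)
    return {doc: float(sum(doc in s for s in sets)) for doc in candidates}

print(boolean_scores(["obama", "hillary"], postings, operator="and"))  # {'d1': 2.0}

# "must" + "must_not" corresponds to a set difference.
obama_not_hillary = postings["obama"] - postings["hillary"]
print(sorted(obama_not_hillary))  # ['d2', 'd3']
```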



# Exercise 1: Evaluating Results

We will show how the retrieved results can be evaluated with the **trec_eval** evaluation program.

**trec_eval** is the standard software for evaluating search engines with test collections.

## TREC_EVAL setup

First, we need to install `trec_eval`. To install it:

- unzip `trec_eval-master.zip`
- go to `trec_eval-master` folder
- run the shell command `make` to create the `trec_eval` binary file (if you are using Windows, you can install `make` from [here](http://gnuwin32.sourceforge.net/packages/make.htm))



Next, check the `government` folder which contains three things:

- A set of documents needed to be indexed, in the *documents* directory.
    
- A set of queries, also called 'topics', in the *topics/gov.topics* file. The format of a `*.topics` file is "query_id query_terms". For example, the first line of 'gov.topics' is
    
    `1 mining gold silver coal`
    
    which means that the query ID is *1* and the corresponding query is *mining gold silver coal*.

- A set of judgements, saying which documents are relevant for each query, in the *qrels/gov.qrels* file. The format of a `*.qrels` file is "query_id 0 document_name binary_relevance". For example, the first line of 'gov.qrels' is
    
    `1 0 G00-00-0681214 0`
    
    which means that the document `G00-00-0681214` is not relevant to the query with ID *1*. The binary relevance is *1* if the document is relevant to the query and *0* otherwise. Please ignore the second field, as it is always *0*.
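The two line formats described above can be parsed with a couple of small helpers. This is a minimal sketch assuming whitespace-separated fields exactly as in the examples; `parse_topic_line` and `parse_qrels_line` are hypothetical names, not part of the provided code.

```python
def parse_topic_line(line):
    """Split 'query_id query_terms' into (query_id, query_text)."""
    query_id, query_text = line.strip().split(maxsplit=1)
    return query_id, query_text

def parse_qrels_line(line):
    """Split 'query_id 0 document_name binary_relevance' into its useful parts."""
    query_id, _unused, doc_name, relevance = line.split()
    return query_id, doc_name, int(relevance)

print(parse_topic_line("1 mining gold silver coal"))
# ('1', 'mining gold silver coal')
print(parse_qrels_line("1 0 G00-00-0681214 0"))
# ('1', 'G00-00-0681214', 0)
```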

## Create new index

In the previous exercise, we created an inverted index of five sample documents. In this one, you will create a new index with the documents in the `government/documents` folder.

Note that the index name (`EVAL_INDEX_NAME`) must differ from the one used before, so that the documents in `government/documents` get their own separate index.

With the new settings defined below, your job is to create the new index by reusing the code from the previous exercise.

In [10]:
settings = {
  "mappings": {
      "properties": {
        "filename": {
          "type": "keyword",
          "index": False,
        },
        "path": {
          "type": "keyword",
          "index": False,
        },
        "text": {
          "type": "text",
          "similarity": "boolean",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
  },    
  "settings": {      
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "stop"
          ],
          "char_filter": [
            "html_strip"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

### Exercise 1.1: Create the new index

You can reuse the previous code

In [11]:
ES_HOSTS = ['http://localhost:9200']
EVAL_INDEX_NAME = 'government'
EVAL_DOCS_PATH = 'practice_data/government/documents'

# Your code here

### Exercise 1.2. Read topics and produce result file

Read the topics (queries) from the file `government/topics/gov.topics` and then search the documents indexed by **Elasticsearch**. You may choose any of the search algorithms.

Produce result file (e.g., *retrieved.txt*) according to **trec_eval** standard output format: 

`01 Q0 document1 0 1.23 my_IR_system1`

`01 Q0 document2 1 1.08 my_IR_system1`

where '01' is the query ID; 'Q0' is a constant that is ignored; 'documentX' is the name of the file; '0' (or '1' or some other integer) is the rank of this result; '1.23' (or '1.08' or some other number) is the score of this result; and 'my_IR_system1' is the name of your retrieval system. Note that **trec_eval** ignores the rank field: internally, ranks are assigned by sorting on the score field, with ties broken deterministically (using the file name).

**Now here's your first job**

1. read the `gov.topics` file line by line,
2. send each query to Elasticsearch,
3. write the output according to the format described above.
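As a hint for step 3, the sketch below turns a search response into lines in the result format. It assumes the response has the same `hits.hits` structure that `extract_response` reads and that the stored `filename` matches the document names used in the qrels file; `response_to_trec_lines` and `fake_res` are hypothetical names used for illustration.

```python
def response_to_trec_lines(query_id, res, run_name="my_IR_system1"):
    """Format the hits of one search response as trec_eval result lines."""
    # Sort by descending score so that the rank field is consistent with the scores.
    hits = sorted(res["hits"]["hits"], key=lambda hit: -hit["_score"])
    lines = []
    for rank, hit in enumerate(hits):
        lines.append("{} Q0 {} {} {} {}".format(
            query_id, hit["_source"]["filename"], rank, hit["_score"], run_name))
    return lines

# Hand-written response with the same shape as a real one (not real search output).
fake_res = {"hits": {"hits": [
    {"_source": {"filename": "document2"}, "_score": 1.08},
    {"_source": {"filename": "document1"}, "_score": 1.23},
]}}

for line in response_to_trec_lines("01", fake_res):
    print(line)
# 01 Q0 document1 0 1.23 my_IR_system1
# 01 Q0 document2 1 1.08 my_IR_system1
```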

In [12]:
# Your code here

### Exercise 1.3.  Evaluation

It's time to run **trec_eval**, which compares the relevance judgements provided in *gov.qrels* with your result file. (Hint: prefixing a shell command with **!** lets you run it from a Jupyter notebook, e.g. `!ls`.)

**trec_eval** evaluates the performance of your search engine. To evaluate your search results, you need two files: the retrieved result file and the ground truth file.
Let's say your retrieval result is saved in `retrieved.txt` and the ground truth is saved in `gov.qrels`. The performance of your retrieval can then be measured via:

```
./trec_eval-master/trec_eval gov.qrels retrieved.txt
```

In [13]:
# Your code here

If **trec_eval** runs correctly and produces numbers which you think are sensible, you are done with this part. You might want to look at the output, though, and get some understanding of what it means; later you will be asked to interpret this and to choose evaluation measures you prefer.

Running `./trec_eval-master/trec_eval -h` will list all the options available.
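To help interpret the numbers, here is a sketch of how two common ranking measures, precision at rank k and average precision, are usually defined for a single query. This is not the code of **trec_eval**, only the textbook definitions; the function names and the toy ranking are hypothetical.

```python
def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(doc in relevant for doc in ranked_docs[:k]) / k

def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents for the query."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Toy ranking: d1 and d3 are retrieved and relevant, d9 is relevant but never retrieved.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
print(precision_at_k(ranked, relevant, 2))            # 0.5
print(round(average_precision(ranked, relevant), 4))  # 0.5556
```

Averaging the average precision over all queries gives the mean average precision.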



# Improving the Index

The baseline retrieval proposed above offers rather low performance. To improve it, we can tune the index settings to include some of the NLP processing that we have learned (e.g., stemming, stop-word removal, ...).

To that end, review the documentation of analyzer [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html).


Although we could generate our own analyzers (as we did in the previous exercises with `my_analyzer`), Elasticsearch provides a set of predefined analyzers for the different languages. More information [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html).

In particular, we are going to use the `English Analyzer`

In addition, we can modify the index to use a more sophisticated similarity measure (e.g., `BM25`) than the `boolean` similarity used so far.
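To see why `BM25` is more informative than a match/no-match score, the sketch below computes the textbook Okapi BM25 contribution of a single query term. It is an illustration only: the exact formula implemented by Elasticsearch may differ in constant factors, `bm25_term_score` is a hypothetical name, and the numbers in the usage lines are made up. The defaults `k1=1.2` and `b=0.75` are the usual default parameter values.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document.

    tf: term frequency in the document, doc_len: document length in tokens,
    avg_doc_len: average document length, n_docs: number of documents,
    doc_freq: number of documents that contain the term.
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: long documents need more occurrences for the same score.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Made-up collection statistics, for illustration only.
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)  # True
```

Unlike the boolean similarity, the score grows with the term frequency (with diminishing returns), is higher for rare terms, and is lower for long documents.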

## Exercise 2.1 English Analyzer + BM25

Modify the settings to apply the `English Analyzer` and use the `BM25` similarity

In [14]:
new_settings = {  
# Your code here
}

With these new settings we will create a new index, generate a new result file, and evaluate it with `trec_eval`.

In [15]:
ES_HOSTS = ['http://localhost:9200']
EVAL_INDEX_NAME = 'government'
EVAL_DOCS_PATH = 'practice_data/government/documents'

es_conn = Elasticsearch(ES_HOSTS)
dataset = read_dataset(EVAL_DOCS_PATH)
build_index(es_conn, dataset, EVAL_INDEX_NAME, new_settings)

index `government` deleted
index `government` created
indexed 1 documents
indexed 2 documents
indexed 3 documents
indexed 4 documents
indexed 5 documents
indexed 6 documents
indexed 7 documents
indexed 8 documents
indexed 9 documents
indexed 10 documents
indexed 11 documents
indexed 12 documents
indexed 13 documents
indexed 14 documents
indexed 15 documents
indexed 16 documents
indexed 17 documents
indexed 18 documents
indexed 19 documents
indexed 20 documents
indexed 21 documents
indexed 22 documents
indexed 23 documents
indexed 24 documents
indexed 25 documents
indexed 26 documents
indexed 27 documents
indexed 28 documents
indexed 29 documents
indexed 30 documents
indexed 31 documents
indexed 32 documents
indexed 33 documents
indexed 34 documents
indexed 35 documents
indexed 36 documents
indexed 37 documents
indexed 38 documents
indexed 39 documents
indexed 40 documents
indexed 41 documents
indexed 42 documents
indexed 43 documents
indexed 44 documents
indexed 45 documents
indexed 46

In [None]:
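# Assumption: `queries`, `search` and `write_trec_file` are the objects you
# defined in your solution to Exercise 1.2.
es_conn = Elasticsearch(ES_HOSTS)
# the context manager closes the result file even if a query fails
with open("improved_retrieved.txt", "w+") as output_file:
    for query_id, query in queries:
        res = search(query, es_conn, EVAL_INDEX_NAME)
        write_trec_file(query_id, res, output_file)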

In [None]:
!./trec_eval-master/trec_eval ./practice_data/government/qrels/gov.qrels improved_retrieved.txt