# Session 1 - ElasticSearch - Zipf/Heaps laws

## 1 Running ElasticSearch

 During the first part of this session we will configure and run an ElasticSearch instance. 

**Read the first section of the documentation and follow its instructions.**

After following the instructions, test whether ElasticSearch is up and running with the script `elastic_test.py`.

The answer that you should get is the following:

In [1]:
%run elastic_test.py

b'{\n  "name" : "SJF0JlN",\n  "cluster_name" : "elasticsearch_qiaoruixiang",\n  "cluster_uuid" : "HvvvVQVCTOOmvhfRa3RKAw",\n  "version" : {\n    "number" : "6.2.4",\n    "build_hash" : "ccec39f",\n    "build_date" : "2018-04-12T20:37:28.497551Z",\n    "build_snapshot" : false,\n    "lucene_version" : "7.2.1",\n    "minimum_wire_compatibility_version" : "5.6.0",\n    "minimum_index_compatibility_version" : "5.0.0"\n  },\n  "tagline" : "You Know, for Search"\n}\n'


***

## 2 Indexing and querying

**Take a moment to read section 2.1 of the documentation.**

ElasticSearch is a database for storing documents (unlike tables in relational databases, documents do not need a predefined schema). The text in these documents can be processed so that queries go beyond exact matches, allowing complex queries, fuzzy matching, and ranking of documents with respect to how well they match.

This kind of database is behind search engines such as Google Search or Bing.

There are different ways of operating with ElasticSearch. It is deployed essentially as a web service with a REST API, so it can be accessed from basically any language that has a library for talking to HTTP servers. You have a link to the full documentation in the session document.
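Because the server is just an HTTP service, it can be queried without any ElasticSearch library. The following is a minimal sketch using only the Python standard library; it assumes the server listens on `localhost` at the default port 9200, and the helper name `server_info` is hypothetical (it is not part of the session scripts).

```python
import json
from urllib.request import urlopen

def server_info(host="localhost", port=9200, timeout=2):
    """Return the JSON banner served at the root endpoint, or None if unreachable.

    Assumption: the server uses plain HTTP on the given host and port.
    """
    url = "http://%s:%d/" % (host, port)
    try:
        with urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts,
        # ValueError covers a reply that is not valid JSON
        return None

info = server_info()
if info and "version" in info:
    print("ElasticSearch version:", info["version"]["number"])
else:
    print("ElasticSearch is not reachable")
```

The dictionary returned on success has the same fields as the answer shown in section 1 (`name`, `cluster_name`, `version`, `tagline`).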

We are going to use two Python libraries, `elasticsearch` and `elasticsearch-dsl`. Both give access to the ElasticSearch functionality while hiding the HTTP interactions behind a more programmer-friendly interface; the second one is more convenient for configuring indices and searching.

We are only going to see the essential elements for developing the session but feel free to learn a little bit more. 


First we need some text to index. For testing purposes we are going to use the Python library `loremipsum`, which we have to install first (if it is not installed already).

In [6]:
!pip3 install loremipsum --user  # Restart the kernel to be able to import the library in the next cells

Collecting loremipsum
  Cache entry deserialization failed, entry ignored
  Using cached https://files.pythonhosted.org/packages/55/8e/f75963c116c72bb81d2e22ec64ff3837e962cc89bae025ab60698dd83160/loremipsum-1.0.5.tar.gz
Building wheels for collected packages: loremipsum
  Running setup.py bdist_wheel for loremipsum ... done
  Stored in directory: /Users/qiaoruixiang/Library/Caches/pip/wheels/cd/4c/6c/33f9c3db3fcd070c9021f5b4c25d1b3ddd1596ccffbbc0d6ff
Successfully built loremipsum
Installing collected packages: loremipsum
Successfully installed loremipsum-1.0.5
Cache entry deserialization failed, entry ignored


To interact with ElasticSearch we need a client object of type `Elasticsearch`. If the server is running on localhost with the default configuration, we do not need to pass any parameters to the constructor.

In [7]:
from __future__ import print_function
from elasticsearch import Elasticsearch

client = Elasticsearch()

With this client you have a connection for operating with ElasticSearch. Now we will create an index. Both libraries provide index operations, but the one in `elasticsearch-dsl` is simpler to use.

In [8]:
from elasticsearch_dsl import Index

index = Index('test', using=client)

Now we create some random paragraphs

In [9]:
from loremipsum import get_paragraphs
text = get_paragraphs(10)
print(text[0])

ModuleNotFoundError: No module named 'loremipsum'

Now we can index the paragraphs in ElasticSearch using the `index` method of the client. We can indicate a document type, which allows grouping documents of the same kind inside an index. The document is passed as the `body` parameter in the form of a Python dictionary. The keys of the dictionary are the fields of the document; in this case we will have only one (`text`).

In [10]:
for t in text:
    client.index(index='test', doc_type='latin', body={'text': t})

NameError: name 'text' is not defined

Now we can search the documents

In [None]:
from elasticsearch_dsl import Search
s = Search(using=client, index='test')

s = s.query('match', text='Netus')

r = s.execute()

for v in r:
    print('ID= %s Text= %s' % (v.meta.id, v.text[:75]))

***

## 2.1 Anatomy of an indexing

Now we are ready to index some files. Download the two sets of files linked in the documentation (*20_newsgroups* and *novels*) and follow the instructions.

**Follow the instructions** and afterwards open the script `IndexFiles.py` to understand how the indexing is performed. You will see that, instead of inserting the documents one by one, the `bulk` method is used for more efficient indexing.
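To give an idea of what bulk indexing looks like, here is a minimal sketch of how a list of actions for `elasticsearch.helpers.bulk` could be built. It is not the code of `IndexFiles.py`: the generator name `bulk_actions`, the sample paragraphs, and the reuse of the `test` index and `latin` type from the previous cells are assumptions made for illustration.

```python
def bulk_actions(paragraphs, index="test", doc_type="latin"):
    """Yield one bulk 'index' action per paragraph (hypothetical helper)."""
    for p in paragraphs:
        yield {
            "_op_type": "index",     # operation to perform
            "_index": index,         # target index
            "_type": doc_type,       # document type inside the index
            "_source": {"text": p},  # the document itself
        }

actions = list(bulk_actions(["lorem ipsum", "dolor sit amet"]))
print(len(actions), actions[0]["_source"])

# With a running server and the client created earlier, all the actions
# could be sent in a single request:
# from elasticsearch.helpers import bulk
# bulk(client, actions)
```

Sending the documents in batches avoids one HTTP round trip per document, which is where the efficiency gain comes from.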

***

## 2.2 Looking for mr goodword

Now we are ready to query the documents. The script `SearchIndex.py` serves this purpose; you can invoke it with three flags:

* `--index` that corresponds to the index that holds the files
* `--text` that searches for a word in the text field of the documents of the index
* `--query` that allows using LUCENE syntax for querying the index


The last two flags are mutually exclusive; if both are given, the first of them (`--text`) takes precedence.

Lucene syntax allows Boolean operators in the query (`AND`, `OR`, `NOT`, always in upper case) and the fuzzy operator `~` followed by a number $n$, which matches the word while allowing up to $n$ mismatches in the string.
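As an illustration, the sketch below wraps a few Lucene expressions in a `query_string` query body, which is the ElasticSearch query type that accepts this syntax. The helper `lucene_query`, the example words, and the choice of `text` as the default field are assumptions; whether `SearchIndex.py` builds its queries this way should be checked in the script itself.

```python
def lucene_query(expression, default_field="text"):
    """Wrap a Lucene expression in a query_string request body (hypothetical helper)."""
    return {
        "query": {
            "query_string": {
                "query": expression,
                "default_field": default_field,
            }
        }
    }

examples = [
    "good AND evil",      # both words must appear
    "good OR evil",       # at least one of the words must appear
    "good AND NOT evil",  # the first word without the second
    "goodword~2",         # fuzzy match allowing up to 2 mismatches
]
for expression in examples:
    print(lucene_query(expression))
```

With a running server, a body like these could be passed to the client, for example `client.search(index='test', body=lucene_query('good AND evil'))`.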

**Follow the instructions** of the documentation and query the documents indexed. Browse the code and look into the documentation of `elasticsearch-dsl` to learn more about how a query is defined.

***

## 3 Zipf's and Heaps' Laws

Now we can work on the tasks for this session. You will have to test whether Zipf's and Heaps' laws hold for the documents that you have.

You will need a count of the words in all the documents. ElasticSearch allows querying these counts from the ids of the documents.

For example:

In [None]:
from elasticsearch.helpers import scan

# Search for all the documents and query the list of (word, frequency) of each one
# Totals are accumulated in a dictionary
voc = {}
sc = scan(client, index='test', doc_type='latin', query={"query" : {"match_all": {}}})
for s in sc:
    tv = client.termvectors(index='test', doc_type='latin', id=s['_id'], fields=['text'])
    if 'text' in tv['term_vectors']:
        for t in tv['term_vectors']['text']['terms']:
            if t in voc:
                voc[t] += tv['term_vectors']['text']['terms'][t]['term_freq']
            else:
                voc[t] = tv['term_vectors']['text']['terms'][t]['term_freq']

# Sort the (word, frequency) pairs by decreasing frequency once all documents are counted
lpal = sorted(voc.items(), reverse=True, key=lambda x: x[1])

pal, freq = [p for p, _ in lpal], [f for _, f in lpal]

Now we can plot the word frequencies (keep in mind that this text is artificially generated).

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=(20,10))
plt.bar(range(len(pal)), freq)
a= plt.xticks(range(len(pal)), pal, rotation='vertical')

The `CountWords.py` script will generate the list of words and their frequency for an index. 
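Once such a list of frequencies is available, one possible way to check a power law is a least-squares fit in log-log space. The sketch below is only an illustration under stated assumptions: it uses the simplest form of Zipf's law, $f = c \cdot \mathrm{rank}^{-a}$, the helper `fit_power_law` is hypothetical, and the frequencies are toy values constructed to follow the law exactly, not real counts.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(c) + slope * log(x); returns (c, slope)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    # Slope of the regression line in log-log space
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    c = math.exp(my - slope * mx)
    return c, slope

# Toy data: frequencies that follow f = 1000 / rank exactly (not real counts)
freqs = [1000 / r for r in range(1, 51)]
ranks = range(1, len(freqs) + 1)

c, slope = fit_power_law(ranks, freqs)
print("c = %.1f, exponent a = %.2f" % (c, -slope))
```

The same helper could be applied to Heaps' law, $V = k \cdot N^{\beta}$, by passing the total number of words $N$ as `xs` and the number of distinct words $V$ as `ys` for text samples of increasing size. With real data the fit will not be exact, so plotting the data and the fitted curve on log-log axes is a useful complement.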

**Follow the instructions** in the documentation and **pay attention** to the documentation that you have to deliver for this session. 