Handling large text collections with Elastic database
=====================================================

Estnltk has a database module that simplifies working with large corpora.
See wikipedia\_tutorial and tei\_tutorial for more information about
getting started with larger text document collections.

Estnltk database integrates with
[Elastic](https://www.elastic.co/downloads/elasticsearch), which is a
distributed RESTful schema-free JSON database, based on [Apache
Lucene](https://lucene.apache.org/). See this
[guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html)
for installation.

When the installation is complete, you can run Elastic (from the Elastic
folder) with the command:

    ./bin/elasticsearch

If you have trouble running Elastic, please refer to the [Elastic guide](https://www.elastic.co/guide/index.html).

Please consult that documentation before asking us: Estnltk has only a very thin wrapper around the [Elastic Python API](https://elasticsearch-py.readthedocs.org/en/master/).
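Before moving on, it can help to confirm that the server actually answers. The following is a minimal sketch using only the Python standard library. It assumes Elastic listens on http://localhost:9200 without authentication or TLS; the function name elastic_is_running is made up for this illustration and is not part of estnltk.

```python
import json
import urllib.error
import urllib.request


def elastic_is_running(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP server answers at `url` with a JSON body.

    Assumption: an unsecured Elastic node replies to a plain GET on its
    root URL with a JSON document describing the node.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            json.loads(response.read().decode("utf-8"))
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON reply.
        return False


if __name__ == "__main__":
    print("Elastic reachable:", elastic_is_running())
```

If this reports that the server is not reachable, the estnltk calls in the rest of this tutorial will fail with a connection error as well.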

Estnltk Elastic wrapper
-----------------------

To access the estnltk Elasticsearch wrappers:

    from estnltk.database.elastic import *

To create an index:

    index = create_index('demo_index')

By default the request goes to http://localhost:9200. If Elastic is not
running there, the call fails with a ConnectionError (connection refused).

Or to connect to an existing index:

    index = connect('demo_index')


To specify non-default arguments for the Elasticsearch connection, you can
either pass the parameters to one of the two functions above or create an
Index instance manually, passing the Elastic Python API client object as
the first parameter.

Both functions return an index object that has two important methods,
save and sentences:

    from estnltk import Text

    # Estonian for: "This is a demo sentence. Another one follows it."
    t = Text('See on demolause. Sellele järgneb veel üks.')
    index.save(t)

    # Note that the sentences generator returns estnltk Text objects by default.
    for sentence in index.sentences():
        print(sentence.lemmas)


To see the mapping and data structure in the elasticsearch index, refer
to the mappings.py file.

Iterating over corpora
----------------------

To iterate over the entire corpus, use the Index.sentences generator. In
the general case it is enough to do:

    index = connect('demo_index')
    for sentence in index.sentences():
        print(sentence.text)


Iterating over query results
----------------------------

To iterate over query results, pass the Elasticsearch query to the
sentences generator as the "query" parameter. The query should be a
dictionary as expected by the Elasticsearch Python API. It will be
transformed into JSON before being transmitted.
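As an illustration of what such a dictionary can look like, here is a minimal sketch in the standard Elasticsearch query DSL. The field name "lemmas" is an assumption (check mappings.py for the fields the index really has), and whether the wrapper expects the outer "query" key is also an assumption.

```python
import json

# Hypothetical field name: consult mappings.py for the real index structure.
query = {
    "query": {
        "match": {
            "lemmas": "karu"
        }
    }
}

# The dictionary is serialised to JSON before it is sent over HTTP.
payload = json.dumps(query, ensure_ascii=False)
print(payload)
```

A dictionary like this would then be passed as index.sentences(query=query).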

To simplify writing some queries, see the query\_helper module. It
defines the Word class that maps well to estnltk morphological analysis
results. The general workflow is:

1.  Define words to match with the Word class.
2.  Combine them with the boolean operators "&" and "|".
3.  Wrap them in a Grammar object.
4.  Get the query via the Grammar.query() method.
5.  Annotate the results with the Grammar.annotate() method that creates
    a layer that marks the matching words.

For example:

    # Grammar and Word are defined in the query_helper module and must be imported first.
    grammar = Grammar(Word(lemma='karu') & Word(lemma='jahimees') & Word(partofspeech='V'))
    for sentence in index.sentences(query=grammar.query()):
        grammar.annotate(sentence, 'result')


The results can be visualised with the PrettyPrinter class or worked on
using any other standard tools that work on estnltk layers.
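
To give an intuition for how combining words with "&" and "|" can turn into an Elasticsearch query, here is a self-contained toy sketch. It is not the query\_helper implementation: the classes ToyWord and Bool, the flat field names, and the exact shape of the generated query are all assumptions made for illustration. It only shows the general technique of overloading Python operators to build a bool query with "must" (AND) and "should" (OR) clauses.

```python
import json


def _combine(kind, left, right):
    """Merge two nodes into one Bool node, flattening nested nodes of the same kind."""
    children = []
    for node in (left, right):
        if isinstance(node, Bool) and node.kind == kind:
            children.extend(node.children)
        else:
            children.append(node)
    return Bool(kind, children)


class Node:
    """Base class for query fragments that support & and |."""

    def __and__(self, other):
        return _combine("must", self, other)

    def __or__(self, other):
        return _combine("should", self, other)

    def to_query(self):
        raise NotImplementedError


class ToyWord(Node):
    """Toy stand-in for a word constraint (hypothetical, not estnltk's Word)."""

    def __init__(self, **constraints):
        self.constraints = constraints

    def to_query(self):
        clauses = [{"match": {field: value}}
                   for field, value in sorted(self.constraints.items())]
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"must": clauses}}


class Bool(Node):
    """AND ("must") or OR ("should") over child nodes."""

    def __init__(self, kind, children):
        self.kind = kind
        self.children = children

    def to_query(self):
        body = {self.kind: [child.to_query() for child in self.children]}
        if self.kind == "should":
            # Require at least one alternative to match.
            body["minimum_should_match"] = 1
        return {"bool": body}


expr = ToyWord(lemma="karu") & ToyWord(lemma="jahimees") & ToyWord(partofspeech="V")
print(json.dumps({"query": expr.to_query()}, indent=2))
```

The toy version ignores details a real index would need, such as making sure that several constraints apply to the same word.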