# Question 7
***
JG Hanekom <br>
20780893 <br>
December <br>
***

# Elasticsearch

In this notebook we will set up an Elasticsearch server, read in Shakespeare's works, and analyze them to understand term vectors.

You may mix direct API calls, the Python API, or URL calls from Python; use whatever gives you access to the data.



### Install the necessary elasticsearch Python packages

In [1]:
!pip install 'elasticsearch<7.14.0'

# docs are here https://elasticsearch-py.readthedocs.io/en/v7.13.4/#

Collecting elasticsearch<7.14.0
  Downloading elasticsearch-7.13.4-py2.py3-none-any.whl (356 kB)

### Import packages

In [2]:
import os
import time
from elasticsearch import Elasticsearch
import numpy as np
import pandas as pd

## Set Up Elasticsearch Instance


In [3]:
%%bash

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
# Verify the archive against its published checksum before unpacking it
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
# Elasticsearch refuses to start as root, so hand the files to the daemon user
sudo chown -R daemon:daemon elasticsearch-7.9.2/

elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK


Run the instance as a daemon (background) process

In [4]:
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

Starting job # 0 in a separate thread.


In [5]:
# Sleep for a few seconds to let the instance start (needed when running the notebook end-to-end)
time.sleep(20)
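
A fixed 20-second sleep can be too short on a slow machine and wasteful on a fast one. As a minimal sketch of an alternative, the hypothetical helper below (`wait_until_ready` is not part of the original notebook) polls a readiness check until it succeeds or a timeout expires.

```python
import time


def wait_until_ready(is_ready, timeout=60.0, interval=1.0):
    """Poll is_ready() until it returns True or the timeout expires.

    is_ready is any zero-argument callable; exceptions it raises are
    treated as "not ready yet" so that refused connections do not abort
    the wait.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if is_ready():
                return True
        except Exception:
            # The server is probably still starting; keep polling
            pass
        time.sleep(interval)
    return False
```

With the client used in this notebook, the check could be something like `wait_until_ready(lambda: Elasticsearch(hosts=["http://localhost:9200"]).ping())`, assuming the node listens on the default port.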

Query the base endpoint to retrieve information about the cluster.

In [6]:
%%bash

curl -sX GET "localhost:9200/"

{
  "name" : "fdb2a45c639e",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "Sr60tNikQSas7AJaB7aaqw",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


### Data

Get the Shakespeare data 

In [7]:
%%bash 

wget 'https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json' -q

In [8]:
%%bash

head -5 shakespeare_6.0.json

{"index":{"_index":"shakespeare","_id":0}}
{"type":"act","line_id":1,"play_name":"Henry IV", "speech_number":"","line_number":"","speaker":"","text_entry":"ACT I"}
{"index":{"_index":"shakespeare","_id":1}}
{"type":"scene","line_id":2,"play_name":"Henry IV","speech_number":"","line_number":"","speaker":"","text_entry":"SCENE I. London. The palace."}
{"index":{"_index":"shakespeare","_id":2}}
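
The file is in the newline-delimited format expected by the bulk API: each action line (here always `index`) is followed by a line holding the document source. The sketch below pairs the lines up to make that structure explicit. The helper name `iter_bulk_pairs` is hypothetical, and the code assumes every action is an `index` action followed by exactly one source line, which holds for the sample shown above.

```python
import json


def iter_bulk_pairs(lines):
    """Yield (action, source) tuples from newline-delimited bulk lines."""
    pending_action = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        if pending_action is None:
            # First line of a pair: the action metadata
            pending_action = record
        else:
            # Second line of a pair: the document itself
            yield pending_action, record
            pending_action = None


# The first four lines of shakespeare_6.0.json, as printed above
sample = [
    '{"index":{"_index":"shakespeare","_id":0}}',
    '{"type":"act","line_id":1,"play_name":"Henry IV", "speech_number":"","line_number":"","speaker":"","text_entry":"ACT I"}',
    '{"index":{"_index":"shakespeare","_id":1}}',
    '{"type":"scene","line_id":2,"play_name":"Henry IV","speech_number":"","line_number":"","speaker":"","text_entry":"SCENE I. London. The palace."}',
]

pairs = list(iter_bulk_pairs(sample))
for action, source in pairs:
    print(action["index"]["_id"], source["text_entry"])
# 0 ACT I
# 1 SCENE I. London. The palace.
```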


In [9]:
from elasticsearch import helpers, Elasticsearch
import csv

ES_NODES = "http://localhost:9200"

es = Elasticsearch(hosts = [ES_NODES])
index_name = 'shakespeare'
doctype = 'shakespeare_works'
es.indices.delete(index=index_name, ignore=[400, 404])
es.indices.create(index=index_name, ignore=400, 
      body={
              "mappings": {
                  "properties" : {
                  "speaker": 
                    {"type": "keyword"},
                  "play_name": 
                    {"type": "keyword"},
                  "line_id": 
                    {"type": "integer"},
                  "speech_number": 
                    {"type": "integer"}, 
                  "text_entry":
                    {"term_vector": "with_positions_offsets",
                     "type": "text", 
                     "fielddata": True}
            }
      }}
  )
  

{'acknowledged': True, 'index': 'shakespeare', 'shards_acknowledged': True}

Bulk upload the data

In [10]:
! curl -s -q -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/_bulk?pretty' --data-binary @shakespeare_6.0.json 


In [11]:
! curl http://localhost:9200/_cat/indices

yellow open shakespeare xyVuQuZ7Sf2_w2RCWbOclA 1 1 111396 0 28.3mb 28.3mb
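
The `_cat/indices` line has no header, so the columns are easy to misread. The sketch below labels them, assuming the default column order of `_cat/indices` (appending `?v` to the URL would print the header row and confirm it). The names `CAT_INDICES_COLUMNS` and `parse_cat_indices_line` are introduced here for illustration only.

```python
# Assumed default column order of GET _cat/indices
CAT_INDICES_COLUMNS = [
    "health", "status", "index", "uuid", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
]


def parse_cat_indices_line(line):
    """Map one whitespace-separated _cat/indices line to named columns."""
    return dict(zip(CAT_INDICES_COLUMNS, line.split()))


row = parse_cat_indices_line(
    "yellow open shakespeare xyVuQuZ7Sf2_w2RCWbOclA 1 1 111396 0 28.3mb 28.3mb"
)
print(row["health"], row["pri"], row["rep"], row["docs.count"])
# yellow 1 1 111396
```

Read this way, the index has one primary shard, one configured replica, and 111396 documents. A `yellow` health status on a single-node setup like this one usually means the replica shard has no second node to be allocated to.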


### Extract term vectors
> 3. Use the notebook to demonstrate that you can observe the term vectors for words (submit your result
and code block). Describe what these term-vectors are computed on (how it relates to the index,
documents, and fields)? [2]

In [12]:
ids_all= [str(i) for i in range(0, 2)]

In [13]:
es.mtermvectors(index=index_name, ids=ids_all, term_statistics=True)

{'docs': [{'_id': '0',
   '_index': 'shakespeare',
   '_type': '_doc',
   '_version': 1,
   'found': True,
   'term_vectors': {'text_entry': {'field_statistics': {'doc_count': 111395,
      'sum_doc_freq': 792995,
      'sum_ttf': 820130},
     'terms': {'act': {'doc_freq': 296,
       'term_freq': 1,
       'tokens': [{'end_offset': 3, 'position': 0, 'start_offset': 0}],
       'ttf': 297},
      'i': {'doc_freq': 18301,
       'term_freq': 1,
       'tokens': [{'end_offset': 5, 'position': 1, 'start_offset': 4}],
       'ttf': 20120}}}},
   'took': 5},
  {'_id': '1',
   '_index': 'shakespeare',
   '_type': '_doc',
   '_version': 1,
   'found': True,
   'term_vectors': {'text_entry': {'field_statistics': {'doc_count': 111395,
      'sum_doc_freq': 792995,
      'sum_ttf': 820130},
     'terms': {'i': {'doc_freq': 18301,
       'term_freq': 1,
       'tokens': [{'end_offset': 7, 'position': 1, 'start_offset': 6}],
       'ttf': 20120},
      'london': {'doc_freq': 101,
       'term_fre
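
In the response above, term vectors are reported per document and per field: each entry under `terms` is a token the analyzer produced from that document's `text_entry` value. `term_freq` and `tokens` (positions and character offsets) describe the term inside that one document, while `doc_freq`, `ttf` and the `field_statistics` block are counted over the `text_entry` field of the documents in the index (more precisely, in the shard holding the document). The sketch below flattens the nested response into one row per document and term. `flatten_term_vectors` is a hypothetical helper, and the sample is document `0` copied from the output above.

```python
def flatten_term_vectors(response, field="text_entry"):
    """Turn an mtermvectors response into one flat row per (document, term)."""
    rows = []
    for doc in response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            rows.append({
                "doc_id": doc["_id"],
                "term": term,
                "term_freq": stats["term_freq"],   # occurrences in this document
                "doc_freq": stats.get("doc_freq"),  # documents containing the term
                "ttf": stats.get("ttf"),            # occurrences across all documents
                "positions": [t["position"] for t in stats.get("tokens", [])],
            })
    return rows


# Document 0 ("ACT I") as returned above
sample_response = {
    "docs": [{
        "_id": "0",
        "_index": "shakespeare",
        "found": True,
        "term_vectors": {"text_entry": {
            "field_statistics": {"doc_count": 111395,
                                 "sum_doc_freq": 792995,
                                 "sum_ttf": 820130},
            "terms": {
                "act": {"doc_freq": 296, "term_freq": 1, "ttf": 297,
                        "tokens": [{"end_offset": 3, "position": 0, "start_offset": 0}]},
                "i": {"doc_freq": 18301, "term_freq": 1, "ttf": 20120,
                      "tokens": [{"end_offset": 5, "position": 1, "start_offset": 4}]},
            },
        }},
    }]
}

for row in flatten_term_vectors(sample_response):
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(...)` if a table is more convenient.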

### Find a rare term
> 4. Identify a term that occurs seldom in all the texts (by querying ElasticSearch). Submit the word and
notebook cell block code that allowed you to find it. [2]

In [14]:
%%bash
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "genres": {
      "rare_terms": {
        "field": "text_entry",
        "max_doc_count": 1
      }
    }
  }
}
'

{
  "took" : 2280,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "shakespeare",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 1.0,
        "_source" : {
          "type" : "act",
          "line_id" : 1,
          "play_name" : "Henry IV",
          "speech_number" : "",
          "line_number" : "",
          "speaker" : "",
          "text_entry" : "ACT I"
        }
      },
      {
        "_index" : "shakespeare",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "type" : "scene",
          "line_id" : 2,
          "play_name" : "Henry IV",
          "speech_number" : "",
          "line_number" : "",
          "speaker" : "",
          "text_entry" : "SCENE I. London. The palace."
        }
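
The response above begins with the matching documents, so the aggregation buckets only appear at the very end of a large payload. As a hedged sketch, the same request can be built in Python with `"size": 0` so that only the aggregation is returned. The aggregation name `rare_words` and both helper functions are hypothetical names chosen for this example.

```python
def build_rare_terms_body(field, agg_name="rare_words", max_doc_count=1):
    """Build a search body that returns only a rare_terms aggregation."""
    return {
        "size": 0,  # skip the document hits; only the buckets are needed
        "aggs": {
            agg_name: {
                "rare_terms": {"field": field, "max_doc_count": max_doc_count}
            }
        },
    }


def rare_term_keys(response, agg_name="rare_words"):
    """Extract the bucket keys (the rare terms) from a search response."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [bucket["key"] for bucket in buckets]


body = build_rare_terms_body("text_entry")
```

Assuming the `es` client and index from earlier cells, the body could be sent with `es.search(index="shakespeare", body=body)` and the terms read back with `rare_term_keys(...)`. With `max_doc_count` set to 1, every returned term occurs in exactly one document.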




### Search for the term
> 5. The search function on ElasticSearch is already performing an inverted index, find a sentence in which
the term you identified as rare, is present. Submit the response and the code block that generated it. [2]


In [20]:
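# Query-string search restricted to the shakespeare index; take the text of the top hit
res = es.search(index=index_name, q="Honorificabilitudinitatibus")['hits']['hits'][0]['_source']['text_entry']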
print(res)

honorificabilitudinitatibus: thou art easier
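
The lookup is fast because the search runs against an inverted index, which maps each term to the documents containing it. The toy sketch below illustrates the idea on three lines taken from this notebook. It is not how Elasticsearch is implemented: the regex tokenizer is a simplified stand-in for the analyzer, and the document ids are list positions, not the `_id` values in the index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified analyzer)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document positions that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


documents = [
    "ACT I",
    "SCENE I. London. The palace.",
    "honorificabilitudinitatibus: thou art easier",
]

index = build_inverted_index(documents)
print(sorted(index["i"]))                            # [0, 1]
print(sorted(index["honorificabilitudinitatibus"]))  # [2]
```

A common term such as `i` points to many documents, while the rare term points to a single one, so finding its sentence is a direct lookup instead of a scan over every line.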
