# Elasticsearch

In this exercise, you'll first build an Elasticsearch index of a toy document collection, then request various term statistics from that index.

Remember to make sure that the Elasticsearch service is running (i.e., has been started in a terminal window).

See [this document](Elasticsearch.md) for help on Elasticsearch usage.

In [43]:
%pip install ipytest
%pip install elasticsearch==7.15

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [44]:
from elasticsearch import Elasticsearch
from typing import Dict, List, Optional

import ipytest
import pytest

ipytest.autoconfig()

This is to check that the Elasticsearch service is running on your machine.

In [45]:
es = Elasticsearch()

## Indexing

We use a toy data collection with 5 documents, each with title and content fields.

In [46]:
DOCS = [
    {"doc_id": "D1",
     "title": "First document",
     "content": "House on the hill"
    },
    {"doc_id": "D2",
     "title": "Second title",
     "content": "Downtown Stavanger is beautiful"
    },
    {"doc_id": "D3",
     "title": "First, second, and third",
     "content": "Never step on snakes"
    },
    {"doc_id": "D4",
     "title": "Document number four",
     "content": "House, house. It's a beautiful house you have"
    },
    {"doc_id": "D5",
     "title": "This document is the last document",
     "content": "There can be only one matching result"
    }
]

In [47]:
INDEX_SETTINGS = {  # single shard with a single replica
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        }
    }
}

In [48]:
INDEX_NAME = "test_e6-3"

In [49]:
if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)
es.indices.create(index=INDEX_NAME, settings=INDEX_SETTINGS["settings"])



{'acknowledged': True, 'shards_acknowledged': True, 'index': 'test_e6-3'}

Add documents in `DOC` to the index.

In [50]:
for doc in DOCS:
    es.index(index=INDEX_NAME, doc_type="_doc", id=doc["doc_id"],
             document={"title": doc["title"], "content": doc["content"]})

## Term statistics

Complete the methods below for getting various term statistics from the index.

Consult [this notebook](2-Elasticsearch.ipynb) for the interpretation of term vector statistics.

In [51]:
def get_doc_term_freqs(index_name: str, doc_id: str, field: str) -> Dict[str, int]:
    """Returns the terms along with their frequencies contained in a given document.

    Args:
        index_name: Name of index.
        doc_id: Document ID.
        field: Field name.

    Returns:
        Dict with terms as keys and corresponding frequencies (i.e.,
        number of occurrences within the given document field) as values.
    """
    tv = es.termvectors(index=index_name, doc_type="_doc", id=doc_id, fields=field, term_statistics=True)
    if tv["_id"] != doc_id:
        return None
    if field not in tv["term_vectors"]:
        return None
    term_freqs = {}
    for term, term_stat in tv["term_vectors"][field]["terms"].items():
        term_freqs[term] = term_stat["term_freq"]
    return term_freqs

In [52]:
def get_doc_field_len(index_name: str, doc_id: str, field: str) -> int:
    """Returns the length of a given document field.

    Length is defined as the total number of terms contained in that field.

    Args:
        index_name: Name of index.
        doc_id: Document ID.
        field: Field name.

    Returns:
        Field length.
    """
    term_freqs = get_doc_term_freqs(index_name, doc_id, field)
    if term_freqs is not None:
        return sum(term_freqs.values())
    return None

In [53]:
def get_doc_containing_term(index_name: str, field: str, term: str) -> Optional[str]:
    """Returns any document ID that contains term in a given field or None.

    Args:
        index_name: Name of index.
        field: Field name.
        term: Term.

    Returns:
        ID of a document that contains `term` or None.
    """
    # Use a boolean query to find a document that contains the term.
    hits = es.search(index=index_name, query={"match": {field: term}}).get("hits", {}).get("hits", {})
    return hits[0]["_id"] if len(hits) > 0 else None

In [54]:
def get_term_doc_count(index_name: str, field: str, term: str) -> int:
    """Returns the total number of documents that contain a given term within a specific field.

    Args:
        index_name: Name of index.
        field: Field name.
        term: Term.

    Returns:
        Number of documents that contain the given term within `field`.
    """
    # Find a document that contains the term.
    doc_id = get_doc_containing_term(index_name, field, term)
    if doc_id is None:
        return 0
    # Request term statistics for that document and extract the
    # requested information from there.
    tv = es.termvectors(index=index_name, doc_type="_doc", id=doc_id, fields=field, term_statistics=True)
    return tv["term_vectors"][field]["terms"][term]["doc_freq"]

In [55]:
def get_term_coll_freq(index_name: str, field: str, term: str) -> int:
    """Returns the total collection term frequency of a term in a given field.

    Args:
        index_name: Name of index.
        field: Field name.
        term: Term.

    Returns:
        Total number of occurrences of `term` in all documents within `field`.
    """
    # Find a document that contains the term.
    doc_id = get_doc_containing_term(index_name, field, term)
    if doc_id is None:
        return 0
    # Request term statistics for that document and extract the
    # requested information from there.
    tv = es.termvectors(index=index_name, doc_type="_doc", id=doc_id, fields=field, term_statistics=True)
    return tv["term_vectors"][field]["terms"][term]["ttf"]

Tests.

In [56]:
%%ipytest

def test_doc_term_freqs():
    assert get_doc_term_freqs(INDEX_NAME, "D2", "title") == {"second": 1, "title": 1}
    assert get_doc_term_freqs(INDEX_NAME, "D4", "content") == {"a": 1, "beautiful": 1, "have": 1,
                                                               "house": 3, "it's": 1, "you": 1}
def test_doc_field_len():
    assert get_doc_field_len(INDEX_NAME, "D2", "title") == 2
    assert get_doc_field_len(INDEX_NAME, "D4", "content") == 8

def test_doc_containing_term():
    assert get_doc_containing_term(INDEX_NAME, "title", "document") in ["D1", "D4", "D5"]
    assert get_doc_containing_term(INDEX_NAME, "content", "house") in ["D1", "D4"]

def test_term_doc_count():
    assert get_term_doc_count(INDEX_NAME, "title", "document") == 3
    assert get_term_doc_count(INDEX_NAME, "content", "house") == 2

def test_term_coll_freq():
    assert get_term_coll_freq(INDEX_NAME, "title", "this") == 1
    assert get_term_coll_freq(INDEX_NAME, "title", "document") == 4
    assert get_term_coll_freq(INDEX_NAME, "content", "house") == 4

[32m.[0m[32m.[0m

[31mF[0m[31mF[0m[31mF[0m[31m                                                                                        [100%][0m
[31m[1m_____________________________________ test_doc_containing_term _____________________________________[0m

    [94mdef[39;49;00m [92mtest_doc_containing_term[39;49;00m():[90m[39;49;00m
>       [94massert[39;49;00m get_doc_containing_term(INDEX_NAME, [33m"[39;49;00m[33mtitle[39;49;00m[33m"[39;49;00m, [33m"[39;49;00m[33mdocument[39;49;00m[33m"[39;49;00m) [95min[39;49;00m [[33m"[39;49;00m[33mD1[39;49;00m[33m"[39;49;00m, [33m"[39;49;00m[33mD4[39;49;00m[33m"[39;49;00m, [33m"[39;49;00m[33mD5[39;49;00m[33m"[39;49;00m][90m[39;49;00m
[1m[31mE       AssertionError: assert None in ['D1', 'D4', 'D5'][0m
[1m[31mE        +  where None = get_doc_containing_term('test_e6-3', 'title', 'document')[0m

[1m[31m/var/folders/bw/ll3q57bd36n7bgtnfj73wsr80000gn/T/ipykernel_8779/3276138731.py[0m:10: AssertionError
[31m[