This notebook investigates why the Elasticsearch results for our queries do not match the rankings annotated by Jelmer de Ronde.

In [33]:
import json
from collections import Counter
from pprint import pprint

In [4]:
# Loading all documents that are present in ES
with open("../resources/elasticsearch-documents.json", "r") as search_json:
    search_documents = {
        doc["id"]: doc
        for doc in json.load(search_json)
    }

# Loading the annotated rankings    
with open("../queries.json", "r") as queries_json:
    queries = json.load(queries_json)

In [4]:
list(search_documents.values())[0]

{'id': '5afabd996c9a78e915ab54655a6875c34787381e',
 'title': 'Powerpoint motivatietheorie',
 'language': 'en',
 'url': 'https://ndownloader.figshare.com/files/7026497',
 'text': 'Motivatietheorie – self detemination theory (SDT)\nRyan en Deci (2000)\n\n\n\n\n\n\n1\n\nDrie basisbehoeften\n\n\n\n\nRelatie\nCompetentie \nAutonomie \n\n\n\n\n2\n\n',
 'mime_type': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
 'collection_name': 'figshare',
 'humanized_mime_type': 'powerp.',
 'keywords': ['Utrecht University',
  'Universiteit Utrecht',
  'Education',
  'Higher Education',
  'Talentontwikkeling',
  'Motivatietheorie',
  'selfdetermination'],
 'item_id': 4308860,
 'item_url': 'https://api.figshare.com/v2/articles/4308860'}

In [5]:
# Check presence / absence of ranked documents in the Elasticsearch index
missing = set()
present = set()
duplicates_count = 0
for query in queries:
    items = query["items"]
    for item in items:
        hsh = item["hash"]
        if hsh not in search_documents:
            missing.add(hsh)
        else:
            if hsh in present:
                duplicates_count += 1
            else:
                present.add(hsh)

print("Missing:", len(missing))
print("Present:", len(present))
print("Duplicates:", duplicates_count)


Missing: 25
Present: 118
Duplicates: 8


So about 17% of the ranked documents (25 of 143 unique hashes) are missing from the Elasticsearch index. We should look into this at some point.
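
One way to follow up on the missing documents is to group the missing hashes by query, which shows whether a few queries account for most of the gap. The following is a minimal sketch, assuming the same `queries` and `search_documents` structures as loaded above. The helper name `missing_per_query` and the toy data are hypothetical and only serve as illustration.

```python
def missing_per_query(queries, search_documents):
    """Map each query label to the hashes of ranked items that are not in the index."""
    result = {}
    for query in queries:
        label = " / ".join(query["queries"])
        absent = [
            item["hash"] for item in query["items"]
            if item["hash"] not in search_documents
        ]
        if absent:
            result[label] = absent
    return result


# Toy data for illustration only, not taken from queries.json
toy_queries = [
    {"queries": ["leren leren"], "items": [{"hash": "a"}, {"hash": "b"}]},
    {"queries": ["feedback geven", "giving feedback"], "items": [{"hash": "c"}]},
]
toy_index = {"a": {}, "c": {}}

print(missing_per_query(toy_queries, toy_index))
```

On the real data the call would be `missing_per_query(queries, search_documents)`.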

Next, I'll analyse which documents Elasticsearch should retrieve, that is, what kinds of documents appear in the annotated query rankings.

In [49]:
# This builds an overview of which document types and which collections
# are present both in the annotated queries and in the Elasticsearch index

query_aggregation = {}
query_documents = []
for query in queries:
    aggregation = {
        "collections": {},
        "mime_types": {},
        "ids": []
    }
    for item in query["items"]:
        hsh = item["hash"]
        if hsh not in search_documents:  # filters out anything that is not in ES
            continue
        doc = search_documents[hsh]
        aggregation["ids"].append(doc["id"])
        query_documents.append(doc)
        # Count this document towards its collection and its mime type
        collection = doc["collection_name"]
        mime_type = doc["mime_type"]
        aggregation["collections"][collection] = aggregation["collections"].get(collection, 0) + 1
        aggregation["mime_types"][mime_type] = aggregation["mime_types"].get(mime_type, 0) + 1

    query_aggregation[" / ".join(query["queries"])] = aggregation

print(len(query_documents))

126


In [46]:
# Write the document types and collections per annotated query to a JSON file
with open("ranked_query_ids.json", "w") as json_file:
    json.dump(query_aggregation, json_file, indent=4)

{
    "leren leren": {
        "collections": {
            "leraar24": 3,
            "wur": 3,
            "figshare": 1,
            "wwmhbo": 3
        },
        "mime_types": {
            "video": 6,
            "application/pdf": 4
        }
    },
    "reflectievaardigheden": {
        "collections": {
            "wur": 3,
            "hbovpk": 1
        },
        "mime_types": {
            "video": 3,
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document": 1
        }
    },
    "feedback geven / giving feedback": {
        "collections": {
            "wur": 2,
            "stimuleringsmaatregel": 1
        },
        "mime_types": {
            "video": 2,
            "application/pdf": 1
        }
    },
    "genetische modificatie / biology transformation": {
        "collections": {
            "wur": 9
        },
        "mime_types": {
            "video": 9
        }
    },
    "harmonie / harmonie muziek": {
        "collections": {


In [37]:
# Look at which collections and mime types are present across queries and should be found by Elasticsearch.
# This gives a first indication of which collections and/or mime types to work on to improve accuracy.

collections = Counter()
mime_types = Counter()
for query, aggregates in query_aggregation.items():
    collections += Counter(aggregates["collections"])
    mime_types += Counter(aggregates["mime_types"])

print("Collections:", collections)
print()
print("Mime types:", mime_types)

Collections: Counter({'wwmhbo': 51, 'wur': 38, 'stimuleringsmaatregel': 27, 'leraar24': 6, 'figshare': 2, 'hbovpk': 2})

Mime types: Counter({'video': 47, 'application/pdf': 42, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 36, 'application/vnd.openxmlformats-officedocument.presentationml.presentation': 1})


We can see that content packages and WUR transcripts (the `wwmhbo` and `wur` collections) should be found most often. Together they make up about 70% (89 of 126) of the documents that should be found and that ES is currently able to find, because they are present in the index.
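
The shares behind this statement can be computed directly from the collection counts. This is a minimal sketch that copies the counts from the output above into a fresh `Counter`. In the notebook itself the existing `collections` counter could be used instead, and the names `total`, `shares` and `top_two` are introduced here for illustration.

```python
from collections import Counter

# Counts copied from the "Collections:" output above
collections = Counter({
    "wwmhbo": 51, "wur": 38, "stimuleringsmaatregel": 27,
    "leraar24": 6, "figshare": 2, "hbovpk": 2,
})

total = sum(collections.values())

# Share of each collection, largest first
shares = {name: count / total for name, count in collections.most_common()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Combined share of the two largest collections
top_two = collections["wwmhbo"] + collections["wur"]
print(f"wwmhbo + wur: {top_two} of {total} ({top_two / total:.1%})")
```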

Let's dig a little deeper into mime types per collection.

In [56]:
wwmhbo_application_count = len([
    doc for doc in query_documents
    if doc["collection_name"] == "wwmhbo" and 
    doc["mime_type"].startswith("application")
])

wur_video_count = len([
    doc for doc in query_documents
    if doc["collection_name"] == "wur" and 
    doc["mime_type"].startswith("video")
])

stimuleringsmaatregel_video_count = len([
    doc for doc in query_documents
    if doc["collection_name"] == "stimuleringsmaatregel" and 
    doc["mime_type"].startswith("video")
])

hbovpk_video_count = len([
    doc for doc in query_documents
    if doc["collection_name"] == "hbovpk" and 
    doc["mime_type"].startswith("video")
])

print("Wikiwijsmaken application count:", wwmhbo_application_count)
print("WUR (non-Kaldi) video count:", wur_video_count)
print("Stimuleringsmaatregel (Kaldi) video count:", stimuleringsmaatregel_video_count)
print("HBO VPK (Kaldi) video count:", hbovpk_video_count)

Wikiwijsmaken application count: 51
WUR (non-Kaldi) video count: 38
Stimuleringsmaatregel (Kaldi) video count: 3
HBO VPK (Kaldi) video count: 0
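
The four hand-written counts above could also be produced in one pass by cross-tabulating collection against mime type family. This is a minimal sketch, assuming documents shaped like those in `query_documents`. The function name `crosstab` and the toy documents are hypothetical. It groups on the part of the mime type before the first `/`, which mirrors the `startswith` checks used above.

```python
from collections import Counter


def crosstab(documents):
    """Count documents per (collection_name, mime type family) pair."""
    counts = Counter()
    for doc in documents:
        # "application/pdf" -> "application"; a bare "video" stays "video"
        family = doc["mime_type"].split("/")[0]
        counts[(doc["collection_name"], family)] += 1
    return counts


# Toy documents for illustration only
toy_documents = [
    {"collection_name": "wur", "mime_type": "video"},
    {"collection_name": "wur", "mime_type": "video"},
    {"collection_name": "wwmhbo", "mime_type": "application/pdf"},
    {
        "collection_name": "wwmhbo",
        "mime_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    },
]

counts = crosstab(toy_documents)
for (collection, family), count in sorted(counts.items()):
    print(collection, family, count)
```

On the real data the call would be `crosstab(query_documents)`, which also covers collections that were not checked by hand, such as `leraar24` and `figshare`.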


The WUR collection consists entirely of video transcripts that we did not create ourselves. So if incorrect transcripts turn out to be a problem during search, there is not much we can do about it.

The Wikiwijsmaken HBO content consists entirely of PDF and/or Microsoft Office documents.

As we can see, Kaldi's total contribution to the disappointing results can't be very large, because there are only 9 Kaldi videos among the 126 documents that should/could be found: 6 from Leraar24 and 3 from Stimuleringsmaatregel.