## Querying an ES index with a batch of files

This notebook demonstrates how to query an ES (Elasticsearch) index with the same query for a batch of files.

Imports

In [35]:
from source.data.input_data_reading import get_all_input_file_paths
from config import INPUT_FILES_DIR, ONTOLOGY_CORE_DIR

from source.env_setup.setup import connect_elasticsearch 
from config import IDX_NAME, PORT
from source.ontology_parsing.graph_utils import get_concepts_pref_labels
from source.ontology_parsing.data_loading import get_all_concept_file_paths, get_graphs_from_files

from source.result_saving.handle_input_files import classify_input_files

Retrieving all files to be processed. An example vocabulary for an input article was delivered by the web scraping team.
All inputs are stored in the `data/input_ttl_files` directory.

The article's title and abstract should be present in each file; their URIs are specified in the `config.py` file.

In [21]:
input_files_paths = get_all_input_file_paths(INPUT_FILES_DIR)
input_files_paths

['C:\\Users\\golik\\Desktop\\mgr\\semantic\\project\\opencs_paperclassification\\data\\input_ttl_files\\article0.ttl',
 'C:\\Users\\golik\\Desktop\\mgr\\semantic\\project\\opencs_paperclassification\\data\\input_ttl_files\\article1.ttl',
 'C:\\Users\\golik\\Desktop\\mgr\\semantic\\project\\opencs_paperclassification\\data\\input_ttl_files\\article2.ttl',
 'C:\\Users\\golik\\Desktop\\mgr\\semantic\\project\\opencs_paperclassification\\data\\input_ttl_files\\article3.ttl']
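
The implementation of `get_all_input_file_paths` is not shown in this notebook. As a rough illustration only, a helper that collects every `.ttl` file in a directory could look like the sketch below. The name `list_ttl_files` is hypothetical, and the real function may filter or order files differently.

```python
from pathlib import Path


def list_ttl_files(directory):
    """Return the sorted absolute paths of all .ttl files directly inside `directory`."""
    # Hypothetical stand-in for get_all_input_file_paths; assumes a flat directory.
    return sorted(str(path.resolve()) for path in Path(directory).glob("*.ttl"))
```

Calling `list_ttl_files(INPUT_FILES_DIR)` would then give a list of path strings similar in shape to the output above.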

Parsing the OpenCS ontology to get all concept names and preferred labels

In [32]:
# reading OpenCS files
files = get_all_concept_file_paths(ONTOLOGY_CORE_DIR)

# loading the files data into graphs with rdflib
graphs = get_graphs_from_files(files)

labels_to_concepts_names = get_concepts_pref_labels(graphs)
labels_to_concepts_names

{'C1': 'Computer science',
 'C10': 'Algorithm',
 'C100': 'Sequential optimization',
 'C101': 'Knowledge based society',
 'C102': 'Object matching',
 'C103': 'Mixed variables',
 'C104': 'Functional synthesis',
 'C105': 'Simultaneous learning',
 'C106': 'Traffic characteristic',
 'C107': 'Rest break',
 'C108': 'Nature inspired algorithm',
 'C109': 'Naturalistic driving',
 'C11': 'Embedded system',
 'C110': 'Passive marker',
 'C111': 'Visual feedback',
 'C112': 'Reconfigurability',
 'C113': 'Color response',
 'C114': 'REEM',
 'C115': 'crt0',
 'C116': 'Team strategy',
 'C117': 'Memory copy',
 'C118': 'Test query',
 'C119': 'Volumetric image',
 'C12': 'Knowledge management',
 'C120': 'Prosthetic hand',
 'C121': 'Systems simulation',
 'C122': 'Cellular automaton',
 'C123': 'Digital filter',
 'C124': 'Interrogative',
 'C125': 'Next-generation network',
 'C126': 'Logical consequence',
 'C127': 'Bookmarking',
 'C128': 'Global variable',
 'C129': 'Functional capacity evaluation',
 'C13': 'Data m

Preparing a query. The same query is used for all files, so it contains placeholders for the title (`#ARTICLE_TITLE`) and abstract (`#ARTICLE_ABSTRACT`) values. These placeholders are automatically replaced with the correct values for each processed file.

To learn more about constructing queries, see the `querying_es.ipynb` notebook. Below we provide a simple query to demonstrate batch processing of files.

In [7]:
query = {
  "query": {
    "dis_max": {
      "queries": [
        {
          "multi_match": {
            "query":       "#ARTICLE_TITLE",  # PLACEHOLDER for an article title value (the same query for all files)
            "type":        "most_fields",
            "fields":      ["prefLabel^3", "related", "broader"],
            "tie_breaker": 0.5
          }
        },
        {
          "multi_match": {
            "query":       "#ARTICLE_ABSTRACT",  # PLACEHOLDER for an article abstract value (the same query for all files)
            "type":        "most_fields",
            "fields":      ["prefLabel^3", "related", "broader"],
            "tie_breaker": 0.5
          }
        }
      ]
    }
  }
}
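
The placeholder replacement itself happens inside the project code and is not shown here. A minimal sketch of how such a substitution could work on a nested query dictionary is given below. The function name `fill_placeholders` is hypothetical, and the sketch assumes each placeholder is the entire string value, as in the query above.

```python
TITLE_PLACEHOLDER = "#ARTICLE_TITLE"
ABSTRACT_PLACEHOLDER = "#ARTICLE_ABSTRACT"


def fill_placeholders(template, title, abstract):
    """Return a new query in which both placeholders are replaced.

    Hypothetical helper: the template is rebuilt recursively, so the
    original dictionary can be reused for the next file.
    """
    replacements = {TITLE_PLACEHOLDER: title, ABSTRACT_PLACEHOLDER: abstract}

    def walk(node):
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        if isinstance(node, str):
            # Only strings that are exactly a placeholder are substituted.
            return replacements.get(node, node)
        return node  # numbers such as tie_breaker stay untouched

    return walk(template)


# Small illustrative template (not the full query from this notebook).
template = {"multi_match": {"query": "#ARTICLE_TITLE", "fields": ["prefLabel^3"]}}
filled = fill_placeholders(template, "Brain tumor segmentation", "An abstract.")
```

Rebuilding the dictionary instead of mutating it keeps the template intact, which matters when the same query is applied to many files in a row.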

`classify_input_files` queries an ES index with the same query for every file in a batch.
For each file in the `input_files_paths` list, it queries the index named `index_name` and retrieves a ranking of the `n` best results with their scores.

The results are saved in the `results` directory in two forms:
- the original file, with the object of the `hasDiscipline` predicate updated to **the best retrieved result** (e.g., `ocs:C19`), in the `results/results` directory
- a query result: a `.ttl` file with a vocabulary containing the data and the ranking retrieved by the query, in the `results/query_results` directory
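
To illustrate how the best result could be turned into a value such as `ocs:C19`, here is a hedged sketch. It assumes the ranking has the shape printed by the function (a list of dictionaries with `prefLabel` and `score`) and that the concept mapping goes from IDs to labels, as in `labels_to_concepts_names`. The function `best_concept_uri`, the assumption of unique labels, and the scores in the example are illustrative, not taken from the project.

```python
def best_concept_uri(ranking, ids_to_labels, prefix="ocs:"):
    """Return the prefixed concept ID of the highest-scoring hit, or None."""
    if not ranking:
        return None
    # Invert the ID -> label mapping; assumes preferred labels are unique.
    labels_to_ids = {label: concept_id for concept_id, label in ids_to_labels.items()}
    top_hit = max(ranking, key=lambda hit: hit["score"])
    concept_id = labels_to_ids.get(top_hit["prefLabel"][0])
    return f"{prefix}{concept_id}" if concept_id else None


ids_to_labels = {"C10": "Algorithm", "C11": "Embedded system"}
ranking = [  # made-up scores, for illustration only
    {"prefLabel": ["Embedded system"], "score": 12.5},
    {"prefLabel": ["Algorithm"], "score": 20.1},
]
best = best_concept_uri(ranking, ids_to_labels)  # 'ocs:C10'
```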

By default, the input files are deleted, since their updated versions are saved. Set the `move_original` parameter to `False` to change this behavior.
You can also set the `ask_to_override` parameter to `True` to make the function ask whether to save each result after displaying it ([y/n]).
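
The [y/n] confirmation can be pictured with the following sketch. It is only an assumption about how such a prompt might be written; `confirm_save` and its prompt text are hypothetical and do not come from the project code.

```python
def confirm_save(prompt="Save the result? [y/n] ", ask=input):
    """Keep asking until the answer is y or n; return True for y."""
    while True:
        answer = ask(prompt).strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        # Any other answer is ignored and the question is repeated.
```

Passing the input function as the `ask` argument makes the prompt easy to test without a real keyboard.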

The function assumes that ES is running (if not, start it, or see `env_setup.ipynb` for instructions) and that an index has already been built (see `loading_data_and_building_index.ipynb` for instructions).



In [34]:
with connect_elasticsearch({'host': 'localhost', 'port': PORT}) as es:
    classify_input_files(input_files_paths, labels_to_concepts_names, es, IDX_NAME, query, n=5, move_original=False, ask_to_override=True)

Yay Connected
###
Parsing a next file 1/4: 
Query result for a file article0:
[{'prefLabel': ['Sociology of the Internet'], 'score': 872.1316}, {'prefLabel': ['Suicide and the Internet'], 'score': 852.0229}, {'prefLabel': ['Outline of the Internet'], 'score': 836.8296}, {'prefLabel': ['Confusion of the inverse'], 'score': 833.3474}, {'prefLabel': ['Abundances of the elements'], 'score': 830.24585}]
###
Parsing a next file 2/4: 
Query result for a file article1:
[{'prefLabel': ['Brain tumor segmentation'], 'score': 379.25287}, {'prefLabel': ['Tumor segmentation'], 'score': 323.5705}, {'prefLabel': ['Liver tumor segmentation'], 'score': 321.0152}, {'prefLabel': ['Recurrent Gastrointestinal Stromal Tumor'], 'score': 302.49762}, {'prefLabel': ['Duodenal submucosal tumor'], 'score': 262.3456}]
###
Parsing a next file 3/4: 
Query result for a file article2:
[{'prefLabel': ['Suicide and the Internet'], 'score': 237.78304}, {'prefLabel': ['Programming in the large and programming in the small'

In [36]:
with connect_elasticsearch({'host': 'localhost', 'port': PORT}) as es:
    classify_input_files(input_files_paths, labels_to_concepts_names, es, IDX_NAME, query, n=5)

Yay Connected
###
Parsing a next file 1/4: 
Query result for a file article0:
[{'prefLabel': ['Sociology of the Internet'], 'score': 872.1316}, {'prefLabel': ['Suicide and the Internet'], 'score': 852.0229}, {'prefLabel': ['Outline of the Internet'], 'score': 836.8296}, {'prefLabel': ['Confusion of the inverse'], 'score': 833.3474}, {'prefLabel': ['Abundances of the elements'], 'score': 830.24585}]
Saving the result and updating the discipline with the best result.
###
Parsing a next file 2/4: 
Query result for a file article1:
[{'prefLabel': ['Brain tumor segmentation'], 'score': 379.25287}, {'prefLabel': ['Tumor segmentation'], 'score': 323.5705}, {'prefLabel': ['Liver tumor segmentation'], 'score': 321.0152}, {'prefLabel': ['Recurrent Gastrointestinal Stromal Tumor'], 'score': 302.49762}, {'prefLabel': ['Duodenal submucosal tumor'], 'score': 262.3456}]
Saving the result and updating the discipline with the best result.
###
Parsing a next file 3/4: 
Query result for a file article2:
