# Elasticsearch
Learn to insert documents into a collection and to query them in order to retrieve the ones most relevant to your query.

First we need to install the Elasticsearch Python library. We specifically need version 7.14.0, because our tests run against a server with an existing Elasticsearch installation and the client has to work with it.

In [None]:
!pip install elasticsearch==7.14.0

In [1]:
from elasticsearch import Elasticsearch
import json
import os

Now we load the config.json file, which provides the host, port, and credentials of the server. Make sure you put your surname in the `surname` field: it is read as config['surname'] and used as the name of your index.

In [3]:
# open the file
with open("../config.json", 'r') as config_file:
    config = json.load(config_file) # parse the JSON content of the file into a Python dict
    INDEX_PORT = config['port']
    INDEX_HOST = config['host']
    INDEX_USER = config['username']
    INDEX_PASS = config['psw']
    INDEX_NAME = config['surname']
    # format HOST and PORT into a full URL for queries
    INDEX_URL = 'http://{}:{}/'.format(INDEX_HOST, INDEX_PORT)
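
For reference, a config.json compatible with the cell above could look like the sketch below. All values are placeholders invented for illustration (the real host, port, and credentials are the ones provided for the server); only the key names come from the loading code.

```python
import json

# Placeholder values for illustration only: replace them with the real ones.
example_config = {
    "host": "localhost",
    "port": 9200,
    "username": "student",
    "psw": "secret",
    "surname": "rossi",
}

# This is how the content would look on disk in config.json.
config_text = json.dumps(example_config, indent=4)
print(config_text)

# Reading it back mirrors what the cell above does with the real file.
loaded = json.loads(config_text)
example_url = "http://{}:{}/".format(loaded["host"], loaded["port"])
print(example_url)
```

Since the surname is used as the index name, write it in lowercase: Elasticsearch rejects index names that contain uppercase letters.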

Next we define a function that creates an Elasticsearch index named after your surname, deleting any existing index with the same name.

In [4]:
def index_create():
    es = Elasticsearch(INDEX_URL, http_auth=(INDEX_USER, INDEX_PASS))
    if es.indices.exists(index=INDEX_NAME):
        # this means that an index under your surname is already present
        es.indices.delete(index=INDEX_NAME)
    es.indices.create(index=INDEX_NAME)
    return es

The next function inserts three sample documents into the index.

In [5]:
def insert_text_examples(es):
    docs = ["Trump u.s.a. NATO", "trump usa N.A.T.O.", "the cat sleeps"]
    for line in docs:
        document = {'line_content': line.strip()} # strip simply removes leading and trailing whitespaces
        es.index(index=INDEX_NAME, body=document) # es.index is the indexing function 

Now that the functions are defined, we call them to build and populate the index.

In [6]:
es = index_create()
insert_text_examples(es)

We can also inspect the content from the browser. Go to the URL [http://kddrtserver15.isti.cnr.it:7777/\<surname\>/_search?pretty](http://kddrtserver15.isti.cnr.it:7777/\<surname\>/_search?pretty) and enter the same username and password found in the config.json file.

Now let's try some queries.
Query the index by typing the input yourself.
Queries are modelled as dictionaries.

In [7]:
input_query = input('Insert a query: ').strip()
query_body = {'query': {'match': {'line_content': input_query}}}

res = es.search(index=INDEX_NAME, body=query_body)
for hit in res['hits']['hits']:
    print('score: {} - line: {}'.format(hit['_score'], hit['_source']['line_content']))

score: 0.9808291 - line: trump usa N.A.T.O.
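
Because a query is just a dictionary, it can be built by a small helper. The function below is a hypothetical convenience (the name `build_match_query` is not part of the Elasticsearch client). It produces a body equivalent to the one used above and lets you set the `operator` option of the match query, which defaults to `or`.

```python
def build_match_query(field, text, operator="or"):
    """Build the body of a match query on `field`.

    With operator="or" (the Elasticsearch default) a document matches if it
    contains at least one of the query terms; with operator="and" it must
    contain all of them.
    """
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


body = build_match_query("line_content", "trump nato", operator="and")
print(body)
# The resulting dict can be passed as-is: es.search(index=INDEX_NAME, body=body)
```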


Now let's create a function that runs some interesting queries on our index. We will see later how changing the index metadata (the field mapping) drastically changes the results.

In [8]:
def example_queries():
    queries = ["She is sleeping", "I am sleeping", "I live in the u.s.a.", "TRUMP"]
    for query in queries:
        query_body = {'query': {'match': {'line_content': query.strip()}}}

        res = es.search(index=INDEX_NAME, body=query_body)
        print("QUERY \"{}\":".format(query))
        for hit in res['hits']['hits']:
            print('score: {} - line: {}'.format(hit['_score'], hit['_source']['line_content']))
        print("================================================================================")
            
example_queries()

QUERY "She is sleeping":
QUERY "I am sleeping":
QUERY "I live in the u.s.a.":
score: 0.9808291 - line: Trump u.s.a. NATO
score: 0.9808291 - line: the cat sleeps
QUERY "TRUMP":
score: 0.4700036 - line: Trump u.s.a. NATO
score: 0.4700036 - line: trump usa N.A.T.O.


## Content Analyzer

In this section we show how to add a text analyzer to a field, and how it affects queries.

In [9]:
es = index_create() # re-create the index
mapping =  {
    "properties": {
        "line_content": { # field name: we decide this
            "type": "text", # data type of the field
            "analyzer": "english" # this is where we specify the analyzer type
        }
    }
}
es.indices.put_mapping(index=INDEX_NAME, body=mapping)

{'acknowledged': True}

In [10]:
insert_text_examples(es) # we insert again the same previous three elements

Try some queries and see how the results change compared to the earlier ones, which were run without setting an analyzer (text fields then use the default standard analyzer).

In [11]:
input_query = input('Insert a query: ').strip()
query_body = {'query': {'match': {'line_content': input_query}}}

res = es.search(index=INDEX_NAME, body=query_body)
for hit in res['hits']['hits']:
    print('score: {} - line: {}'.format(hit['_score'], hit['_source']['line_content']))

score: 0.9331132 - line: trump usa N.A.T.O.


In [12]:
example_queries() # let's run the same queries we did before

QUERY "She is sleeping":
score: 1.0925692 - line: the cat sleeps
QUERY "I am sleeping":
score: 1.0925692 - line: the cat sleeps
QUERY "I live in the u.s.a.":
score: 0.9331132 - line: Trump u.s.a. NATO
QUERY "TRUMP":
score: 0.4471386 - line: Trump u.s.a. NATO
score: 0.4471386 - line: trump usa N.A.T.O.
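
To see why "She is sleeping" now retrieves "the cat sleeps", it helps to look at the tokens an analyzer produces. The sketch below wraps the `indices.analyze` call of the Python client; the helper name `analyze_tokens` is an assumption made for this example, and it expects an already connected client such as the `es` object created above.

```python
def analyze_tokens(es_client, text, analyzer="standard"):
    """Return the tokens that `analyzer` produces for `text`.

    `es_client` is expected to be a connected Elasticsearch client.
    """
    response = es_client.indices.analyze(body={"analyzer": analyzer, "text": text})
    # Each entry of response["tokens"] describes one token; keep only its text.
    return [token_info["token"] for token_info in response["tokens"]]


# Possible usage against the server (not executed here):
# print(analyze_tokens(es, "She is sleeping", analyzer="standard"))
# print(analyze_tokens(es, "She is sleeping", analyzer="english"))
# print(analyze_tokens(es, "the cat sleeps", analyzer="english"))
```

With the english analyzer, stemming is expected to reduce "sleeping" and "sleeps" to a common stem, which would explain the new matches.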


## Basic Fields

In this section we show how to add more fields to our documents (in this case, the news source of the article).
We want two fields: one analyzed with the english analyzer, the other with the whitespace analyzer.

In [13]:
es = index_create() # let's re-create the index
mapping = {
    "properties":{
        "maintext": { # again, we choose the name of the properties
            "type": "text",
            "analyzer": "english" 
        },
        "source": { # this would be the news source that wrote the article
            "type": "text",
            "analyzer": "whitespace"
        }      
    }        
}
es.indices.put_mapping(index=INDEX_NAME, body=mapping)

{'acknowledged': True}
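
The whitespace analyzer only splits the text on whitespace: it does not lowercase and does not remove punctuation, so matching on `source` is case-sensitive. The snippet below is a rough pure-Python approximation of that behaviour (it does not call Elasticsearch, and `whitespace_tokens` is a name made up for this example), applied to the two source names that appear in the sample articles.

```python
def whitespace_tokens(text):
    """Approximate the whitespace analyzer: split on whitespace, keep case and punctuation."""
    return text.split()


sources = ["The New York Times", "The Herald-ir"]
for query in ["Herald-ir", "herald-ir", "The"]:
    query_tokens = set(whitespace_tokens(query))
    # A match query needs at least one token in common with the field.
    matches = [s for s in sources if query_tokens & set(whitespace_tokens(s))]
    print("{!r} -> {}".format(query, matches))
```

Under this approximation "herald-ir" matches nothing, while "The" matches both sources because the token "The" appears in each of them.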

Next we index some articles. Five articles are provided in the data/texts folder: each of the 5 files contains a JSON object with the fields needed for indexing.

In [14]:
texts_dir = "../data/texts" # directory path (named texts_dir to avoid shadowing the built-in dir)
for filename in os.listdir(texts_dir): # iterate over all the files in the directory
    f = os.path.join(texts_dir, filename) # join the directory path with the current filename of the iteration
    with open(f, 'r') as article_file: # open the file in read-only mode
        text = json.load(article_file) # load the json object containing the article
        document = {"maintext": text["maintext"], "source": text["source"]} # put the fields in the respective keys of a dictionary
        es.index(index=INDEX_NAME, body=document) 

Unique sources: "The New York Times", "The Herald-ir"

Some words to query for: "Leclerc", "leclerc", "the", "aircraft"

Try running some queries as shown here. Pay special attention to the "should" clause: an article is returned if it matches at least one of the listed clauses, and a clause that does not match does not exclude it. Each matching clause adds to the article's score.

In [15]:
source = input("Insert a news source: ").strip()
terms = input("Insert text terms: ").strip()
query_body = {
    "query": {
        "bool": {
            "should": [ # at least one of those has to match for the article to be considered
                {"match": {"maintext": terms}}, 
                {"match": {"source" : source}}
            ]
        }      
    }        
}
res = es.search(index=INDEX_NAME, body=query_body)
print ("Found {} results.".format(res['hits']['total']['value']))
for hit in res['hits']['hits']:
    print("=====================================================================")
    print ("score: {} source: {}".format(hit["_score"], hit["_source"]["source"]))
    print ("body: {}".format(hit["_source"]["maintext"])[:100])

Found 1 results.
score: 2.5437517 source: The Herald-ir
body: Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romp


## Date Handling

In this section we show how to handle dates: we add a date field to the mapping and restrict the results with a range query.

In [16]:
es = index_create()
mapping = {
    "properties":{
        "maintext": {
            "type": "text",
            "analyzer": "english"
        },
        "source": {
            "type": "text",
            "analyzer": "whitespace"
        },
        "pub-date": {
            "type": "date",
             "format": "yyyy-MM-dd"
        }
    }        
}
es.indices.put_mapping(index=INDEX_NAME, body=mapping)

{'acknowledged': True}

In [17]:
texts_dir = "../data/texts"
for filename in os.listdir(texts_dir):
    f = os.path.join(texts_dir, filename)
    with open(f, 'r') as article_file:
        text = json.load(article_file)
        document = {"maintext": text["maintext"], "source": text["source"], "pub-date": text["date"]}
        es.index(index=INDEX_NAME, body=document)

In [18]:
source = input("Insert a news source: ").strip()
terms = input("Insert text terms: ").strip()
query_body = {
    "query": {
        "bool": {
            "should": [{"match": {"maintext": terms}}, {"match": {"source": source}}],
            "minimum_should_match": 1,
            "must": [{"range": {"pub-date": {"lt":"2022-01-01"}}}]
        }      
    }        
}

res = es.search(index=INDEX_NAME, body=query_body)
print ("Found {} results.".format(res['hits']['total']['value']))
for hit in res['hits']['hits']:
    print ("score: {} source: {}".format(hit["_score"], hit["_source"]["source"]))
    print ("body: {}".format(hit["_source"]["maintext"])[:100])

Found 1 results.
score: 3.5437517 source: The Herald-ir
body: Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romp
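
Two details of the query above are worth spelling out. First, once a bool query contains a "must" (or "filter") clause, the "should" clauses become optional by default, which is why "minimum_should_match": 1 is set. Second, the score in the output is exactly 1.0 higher than in the previous section, which is consistent with the range clause in "must" contributing a constant score of 1.0. The sketch below (the helper name `build_filtered_query` is hypothetical) moves the date restriction into "filter", which limits the results in the same way without taking part in scoring.

```python
def build_filtered_query(terms, source, before_date):
    """Bool query: match text or source, and keep only articles before `before_date`."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"maintext": terms}},
                    {"match": {"source": source}},
                ],
                # Needed because "filter" makes the should clauses optional by default.
                "minimum_should_match": 1,
                # Filter context: restricts the results without affecting the score.
                "filter": [{"range": {"pub-date": {"lt": before_date}}}],
            }
        }
    }


query_filtered = build_filtered_query("Leclerc", "Herald-ir", "2022-01-01")
print(query_filtered)
# Possible usage: es.search(index=INDEX_NAME, body=query_filtered)
```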


## Boosting Fields

In [19]:
source = input("Insert a news source: ").strip()
terms = input("Insert text terms: ").strip()
query_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"maintext": terms}}, 
                {"match": {"source" : source}}
            ]
        }      
    }        
}
query_boosted = {
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "source": {
                            "query": source,
                            "boost": 3
                        }
                    }
                },
                {
                    "match": {
                        "maintext": {
                            "query": terms,
                        }
                    }
                },
            ]
        }
    }
}


for body in [query_body, query_boosted]:
    res = es.search(index=INDEX_NAME, body=body)
    print ("Found {} results.".format(res['hits']['total']['value']))
    for hit in res['hits']['hits']:
        print("=====================================================================")
        print ("score: {} source: {}".format(hit["_score"], hit["_source"]["source"]))
        print ("body: {}".format(hit["_source"]["maintext"])[:100])
    print("\n")

Found 1 results.
score: 2.5437517 source: The Herald-ir
body: Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romp


Found 1 results.
score: 2.5437517 source: The Herald-ir
body: Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romp
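
A boost multiplies the score contribution of the clause it is attached to, so it only changes the result when that clause actually matches. In the output above the two scores are identical, which suggests that for that input only the `maintext` clause matched (an assumption, since the typed input is not shown). The helpers below are hypothetical names used to build the same kind of boosted query more compactly.

```python
def boosted_match(field, text, boost=1.0):
    """Match clause on `field` whose score contribution is multiplied by `boost`."""
    return {"match": {field: {"query": text, "boost": boost}}}


def build_boosted_query(terms, source, source_boost=3):
    """Bool/should query where a match on the news source counts `source_boost` times."""
    return {
        "query": {
            "bool": {
                "should": [
                    boosted_match("source", source, source_boost),
                    boosted_match("maintext", terms),
                ]
            }
        }
    }


print(build_boosted_query("Leclerc", "Herald-ir"))
# Possible usage: es.search(index=INDEX_NAME, body=build_boosted_query(terms, source))
```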




## Score description: Explain

In [24]:
res = es.explain(id="DdTCcYUBPTSChHKmyjde", index=INDEX_NAME, body=query_boosted)
res

{'_index': 'bellomo',
 '_type': '_doc',
 '_id': 'DdTCcYUBPTSChHKmyjde',
 'matched': True,
 'explanation': {'value': 2.5437517,
  'description': 'sum of:',
  'details': [{'value': 2.5437517,
    'description': 'weight(maintext:leclerc in 0) [PerFieldSimilarity], result of:',
    'details': [{'value': 2.5437517,
      'description': 'score(freq=5.0), computed as boost * idf * tf from:',
      'details': [{'value': 2.2, 'description': 'boost', 'details': []},
       {'value': 1.3862944,
        'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:',
        'details': [{'value': 1,
          'description': 'n, number of documents containing term',
          'details': []},
         {'value': 5,
          'description': 'N, total number of documents with field',
          'details': []}]},
       {'value': 0.83405864,
        'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:',
        'details': [{'value': 5.0,
          'description': 
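
The numbers in the (truncated) explanation above can be checked by hand. The sketch below copies the reported values and applies the two formulas quoted in the output: the idf formula and `boost * idf * tf`. The tf value is taken as reported, because the inputs of its formula are cut off in the output. The boost of 2.2 is consistent with k1 + 1 for the default BM25 parameter k1 = 1.2, although the output shown here does not state that explicitly.

```python
import math

# Values copied from the explain output above.
boost = 2.2       # reported as 'boost'
n = 1             # number of documents containing the term
N = 5             # total number of documents with the field
tf = 0.83405864   # reported term-frequency component

# idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
idf = math.log(1 + (N - n + 0.5) / (n + 0.5))

# score, computed as boost * idf * tf
score = boost * idf * tf

print("idf   = {:.7f}".format(idf))
print("score = {:.7f}".format(score))
```

Up to rounding, both values agree with the ones reported by Elasticsearch (1.3862944 and 2.5437517).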

# Exercise
Run the code below to load a set of 500 articles and create an index for them.
The fields are:
- *maintext*: Textual content of the article
- *source*: News Source that wrote the article
- *author*: Person that wrote the article
- *date*: Publication date of the article, in the format yyyy-MM-dd (indexed below as `pub-date`)

UNIQUE SOURCES = ["La Repubblica", "La Stampa", "Il Corriere della Sera", "Rai News"]

UNIQUE AUTHORS = ["Paola Candreva", "Melchiorre Paccioretti", "Caterina De Luca", "Marcello Fiorentini", "Celestino Necci"]

The date range is 2019-12-05 -> 2020-02-10

In [25]:
with open("../data/articles.json", "r") as json_file:
    articles = json.load(json_file)
len(articles)

500

In [26]:
es = index_create()
mapping = {
    "properties":{
        "maintext": {
            "type": "text",
            "analyzer": "english"
        },
        "source": {
            "type": "text",
            "analyzer": "whitespace"
        },
        "pub-date": {
            "type": "date",
             "format": "yyyy-MM-dd"
        },
        "author": {
            "type": "text",
            "analyzer": "whitespace"
        },

    }        
}
es.indices.put_mapping(index=INDEX_NAME, body=mapping)
for a in articles:
    document = {"maintext": a["maintext"], "source": a["source"], "pub-date": a["date"], "author": a["author"]}
    es.index(index=INDEX_NAME, body=document)

# Exercise: try to create this query
Find the number of articles that were written by "Caterina De Luca" OR published by "Rai News", are dated between the 5th of December 2019 and the 25th of January 2020 (both included), and contain the word "world".

In [27]:
query_body = {} #TODO insert your query here
res = es.search(index=INDEX_NAME, body=query_body)
print ("Found {} results.".format(res['hits']['total']['value']))
for hit in res['hits']['hits']:
    print("=====================================================================")
    print ("score: {} source: {}, author: {}".format(hit["_score"], hit["_source"]["source"], hit["_source"]["author"]))
    print ("body: {}".format(hit["_source"]["maintext"])[:100])

Found 500 results.
score: 1.0 source: La Repubblica, author: Celestino Necci
body: In fact, some EUR 500 million would be missing from the appeal. Italia Viva continues to deman
score: 1.0 source: La Repubblica, author: Celestino Necci
body: About a possible alliance with the PD, the Pentastered leader reiterated: “We have a program t
score: 1.0 source: Rai News, author: Paola Candreva
body: “They were killed in a shooting: they tried to take possession of their guards's weapons but w
score: 1.0 source: La Stampa, author: Marcello Fiorentini
body: 
score: 1.0 source: La Repubblica, author: Melchiorre Paccioretti
body: The anniversary of the Immaculate Conception
score: 1.0 source: La Stampa, author: Paola Candreva
body: More than fifty countries that participated in the fifth edition of the International Conferen
score: 1.0 source: Il Corriere della Sera, author: Celestino Necci
body: Morgan's alarm comes as a result of reports that official government documents leaked online —
score: 

To the brave ones who finish early:
**How do I run the same query, without the "world" clause, but with the OR turned into a XOR?** (that is, the article is either written by "Caterina De Luca" or published by "Rai News", but not both at the same time)
So the new query to build is:
Find the number of articles that were either written by "Caterina De Luca" or published by "Rai News" (but not both), and are dated between the 5th of December 2019 and the 25th of January 2020 (both included).

_HINT_: similarly to "must", there is also a "must_not" construct.
