# Elasticsearch
Learn how to insert documents into a collection (an index) and how to query them to retrieve the most relevant ones.

First we need to install the Elasticsearch Python library. We pin version 7.14.0 specifically because our tests run against a server with a working Elasticsearch installation, and the client has to be compatible with it.

In [None]:
!pip install elasticsearch==7.14.0

In [1]:
from elasticsearch import Elasticsearch
import json
import os

Now we load the config.json file, which provides the host, port, and credentials of the server. Make sure you put your surname in the field config['surname'].

In [3]:
# open the file
with open("config.json", 'r') as config_file:
    config = json.load(config_file) # load the content of the file in a json object
    INDEX_PORT = config['port']
    INDEX_HOST = config['host']
    INDEX_USER = config['username']
    INDEX_PASS = config['psw']
    INDEX_NAME = config['surname']
    # format HOST and PORT into a full URL for queries
    INDEX_URL = 'http://{}:{}/'.format(INDEX_HOST, INDEX_PORT)
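
For reference, here is a minimal sketch of the shape config.json is assumed to have. The keys are the ones read in the cell above; every value is a placeholder, not the real server's settings, and `build_index_url` is a hypothetical helper name used only for illustration.

```python
import json
import os
import tempfile

# Placeholder values only: replace them with the ones you were given.
example_config = {
    "host": "localhost",
    "port": 9200,
    "username": "student",
    "psw": "changeme",
    "surname": "rossi",
}


def build_index_url(config):
    """Format host and port into the base URL passed to the client."""
    return "http://{}:{}/".format(config["host"], config["port"])


# Round-trip through a temporary file to mimic reading config.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    with open(path, "w") as config_file:
        json.dump(example_config, config_file)
    with open(path, "r") as config_file:
        loaded_config = json.load(config_file)

example_url = build_index_url(loaded_config)
print(example_url)
```

Since the surname is used as the index name, write it in lowercase: Elasticsearch does not accept uppercase letters in index names.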

Next we define the function that creates an Elasticsearch index, deleting any existing index with the same name (your surname).

In [4]:
def index_create():
    es = Elasticsearch(INDEX_URL, http_auth=(INDEX_USER, INDEX_PASS))
    if es.indices.exists(index=INDEX_NAME):
        # this means that an index under your surname is already present
        es.indices.delete(index=INDEX_NAME)
    es.indices.create(index=INDEX_NAME)
    return es

The next function puts three sample documents in the index

In [5]:
def insert_text_examples(es):
    docs = ["Trump u.s.a. NATO", "trump usa N.A.T.O.", "the cat sleeps"]
    for line in docs:
        document = {'line_content': line.strip()} # strip simply removes leading and trailing whitespaces
        es.index(index=INDEX_NAME, body=document) # es.index is the indexing function 

Now that the functions are defined, we call them to build and populate the index.

In [6]:
es = index_create()
insert_text_examples(es)

We can also inspect the content from the browser. Go to the URL http://kddrtserver15.isti.cnr.it:7777/\<surname\>/_search?pretty and enter the username and password found in the config.json file.

Now let's try some queries.
Run a few queries on our index, providing the input yourself.
Queries are modelled as dictionaries.

In [6]:
input_query = input('Insert a query: ').strip()
query_body = {'query': {'match': {'line_content': input_query}}}

res = es.search(index=INDEX_NAME, body=query_body)
for hit in res['hits']['hits']:
    print('score: {} - line: {}'.format(hit['_score'], hit['_source']['line_content']))

score: 0.4700036 - line: Trump u.s.a. NATO
score: 0.4700036 - line: trump usa N.A.T.O.


Now let's create a function that runs a few interesting queries on our index. We will see later how changing the index metadata (the mapping) changes the results drastically.

In [7]:
def example_queries():
    queries = ["She is sleeping", "I am sleeping", "I live in the u.s.a.", "TRUMP"]
    for query in queries:
        query_body = {'query': {'match': {'line_content': query.strip()}}}

        res = es.search(index=INDEX_NAME, body=query_body)
        print("QUERY \"{}\":".format(query))
        for hit in res['hits']['hits']:
            print('score: {} - line: {}'.format(hit['_score'], hit['_source']['line_content']))
        print("================================================================================")
            
example_queries()

QUERY "She is sleeping":
QUERY "I am sleeping":
QUERY "I live in the u.s.a.":
score: 0.9808291 - line: Trump u.s.a. NATO
score: 0.9808291 - line: the cat sleeps
QUERY "TRUMP":
score: 0.4700036 - line: Trump u.s.a. NATO
score: 0.4700036 - line: trump usa N.A.T.O.


## Content Analyzer

In this section we will show how to add a text analyzer to a field, and how it affects queries.

In [8]:
es = index_create() # re-create the index
mapping =  {
    "properties": { 
        "line_content": { # field name: we decide this
            "type": "text", # data type of this field
            "analyzer": "english" # this is where we specify the analyzer type
        }      
    }    
}
es.indices.put_mapping(index=INDEX_NAME, body=mapping)

{'acknowledged': True}

In [9]:
insert_text_examples(es) # we insert again the same previous three elements

Try some queries and see how the results differ from the earlier ones, which used the default analyzer because the mapping specified none.

In [10]:
input_query = input('Insert a query: ').strip()
query_body = {'query': {'match': {'line_content': input_query}}}

res = es.search(index=INDEX_NAME, body=query_body)
for hit in res['hits']['hits']:
    print('score: {} - line: {}'.format(hit['_score'], hit['_source']['line_content']))

score: 1.0925692 - line: the cat sleeps


In [11]:
example_queries() # let's run the same queries we did before

QUERY "She is sleeping":
score: 1.0925692 - line: the cat sleeps
QUERY "I am sleeping":
score: 1.0925692 - line: the cat sleeps
QUERY "I live in the u.s.a.":
score: 0.9331132 - line: Trump u.s.a. NATO
QUERY "TRUMP":
score: 0.4471386 - line: Trump u.s.a. NATO
score: 0.4471386 - line: trump usa N.A.T.O.
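
One way to see why the results changed is to look at the tokens each analyzer produces for the same text. The sketch below wraps the client's `indices.analyze` call; the helper name `analyzer_tokens` is hypothetical, and the function expects a connected client such as the `es` object created earlier.

```python
def analyzer_tokens(es, analyzer, text):
    """Return the list of tokens that `analyzer` produces for `text`."""
    response = es.indices.analyze(body={"analyzer": analyzer, "text": text})
    return [token["token"] for token in response["tokens"]]


# Usage against the live server (not executed here):
# for name in ["standard", "english"]:
#     print(name, analyzer_tokens(es, name, "She is sleeping"))
#     print(name, analyzer_tokens(es, name, "the cat sleeps"))
```

With the english analyzer one would expect "sleeping" and "sleeps" to be reduced to a common stem, which would explain why "She is sleeping" now retrieves "the cat sleeps".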


## Basic Fields

In this section we will show how to add more fields to our documents (in this case, the news source of the article).
We want two fields: one modelled with the english analyzer, the other with the whitespace analyzer.

In [12]:
es = index_create() # let's re-create the index
mapping = {
    "properties":{
        "maintext": { # again, we choose the name of the properties
            "type": "text",
            "analyzer": "english" 
        },
        "source": { # this would be the news source that wrote the article
            "type": "text",
            "analyzer": "whitespace"
        }      
    }        
}
es.indices.put_mapping(index=INDEX_NAME, body=mapping)

{'acknowledged': True}

Next we index some articles. Five articles are provided in the data/texts folder: each of the 5 files contains a JSON object with the fields needed for indexing.

In [13]:
dir = "data/texts" # directory path
for filename in os.listdir(dir): # iterate over all the files in the directory
    f = os.path.join(dir, filename) # join the directory path with the current filename of the iteration
    with open(f, 'r') as article_file: # open the file in read-only mode
        text = json.load(article_file) # load the json object containing the article
        document = {"maintext": text["maintext"], "source": text["source"]} # put the fields in the respective keys of a dictionary
        es.index(index=INDEX_NAME, body=document) 

Unique sources: "The New York Times", "The Herald-ir"

Some words to query for: "Leclerc", "leclerc", "the", "aircraft"

Try some queries like the one shown here. Pay special attention to the "should" clause: an article is returned if it matches at least one of the clauses, and an article that matches more of them scores higher. A clause that does not match does not exclude the article.

In [14]:
source = input("Insert a news source: ").strip()
terms = input("Insert text terms: ").strip()
query_body = {
    "query": {
        "bool": {
            "should": [ # at least one of those has to match for the article to be considered
                {"match": {"maintext": terms}}, 
                {"match": {"source" : source}}
            ]
        }      
    }        
}
res = es.search(index=INDEX_NAME, body=query_body)
print ("Found {} results.".format(res['hits']['total']['value']))
for hit in res['hits']['hits']:
    print("=====================================================================")
    print ("score: {} source: {}".format(hit["_score"], hit["_source"]["source"]))
    print ("body: {}".format(hit["_source"]["maintext"])[:100])

Found 3 results.
score: 1.331981 source: The Herald-ir
body: Luke O'Reilly with his mother Janet O'Brien Luke O'Reilly Jack Hall Ellis The Metro One Bar in
score: 1.0892314 source: The New York Times
body: The revival of supersonic passenger travel, thought to be long dead with the demise of Concord
score: 0.9655346 source: The Herald-ir
body: Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romp
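
The "at least one clause" behaviour can be made explicit with the `minimum_should_match` parameter of the bool query. The following is a minimal sketch; `should_query` is a hypothetical helper name, and the field names are the ones from the mapping above.

```python
def should_query(terms, source, minimum_should_match=1):
    """Build a bool query with two optional ("should") match clauses."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"maintext": terms}},
                    {"match": {"source": source}},
                ],
                # how many of the should clauses a document has to match
                "minimum_should_match": minimum_should_match,
            }
        }
    }


# At least one clause has to match (same behaviour as the query above).
either = should_query("aircraft", "The New York Times")
# Both clauses have to match.
both = should_query("aircraft", "The New York Times", minimum_should_match=2)
```

Either dictionary can be passed as `body` to `es.search`. Setting the value explicitly matters once other clause types such as "must" are added, because then the should clauses are no longer required by default.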


## Date Handling

In this section we will show how to deal with dates: we add a date field to the mapping and restrict a query to a date range.

In [15]:
es = index_create()
mapping = {
    "properties":{
        "maintext": {
            "type": "text",
            "analyzer": "english"
        },
        "source": {
            "type": "text",
            "analyzer": "whitespace"
        },
        "pub-date": {
            "type": "date",
            "format": "yyyy-MM-dd"
        }
    }        
}
es.indices.put_mapping(index=INDEX_NAME, body=mapping)

{'acknowledged': True}

In [16]:
dir = "data/texts" # same folder used earlier
for filename in os.listdir(dir):
    f = os.path.join(dir, filename)
    with open(f, 'r') as article_file:
        text = json.load(article_file)
        document = {"maintext": text["maintext"], "source": text["source"], "pub-date": text["date"]}
        es.index(index=INDEX_NAME, body=document)

In [17]:
source = input("Insert a news source: ").strip()
terms = input("Insert text terms: ").strip()
query_body = {
    "query": {
        "bool": {
            "should": [{"match": {"maintext": terms}}, {"match": {"source": source}}],
            "minimum_should_match": 1,
            "must": [{"range": {"pub-date": {"lt":"2022-01-01"}}}]
        }      
    }        
}

res = es.search(index=INDEX_NAME, body=query_body)
print ("Found {} results.".format(res['hits']['total']['value']))
for hit in res['hits']['hits']:
    print ("score: {} source: {}".format(hit["_score"], hit["_source"]["source"]))
    print ("body: {}".format(hit["_source"]["maintext"])[:100])

Found 4 results.
score: 2.425359 source: The Herald-ir
body: Luke O'Reilly with his mother Janet O'Brien Luke O'Reilly Jack Hall Ellis The Metro One Bar in
score: 2.0589128 source: The Herald-ir
body: Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romp
score: 1.0933781 source: The Herald-ir
body: Antonio Conte. Pic: PA
Head coach Antonio Conte does not think Chelsea are in the race to sign
score: 1.0933781 source: The Herald-ir
body: Hamid Sanambar
Gardai are hunting for a gunman who opened fire on a car in north Dublin - just


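## Boosting Fields

In this section we run the same query twice: once as before, and once with matches on the source field boosted by a factor of 3, so that they weigh more in the final score.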

In [18]:
source = input("Insert a news source: ").strip()
terms = input("Insert text terms: ").strip()
query_body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"maintext": terms}}, 
                {"match": {"source" : source}}
            ]
        }      
    }        
}
query_boosted = {
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "source": {
                            "query": source,
                            "boost": 3
                        }
                    }
                },
                {
                    "match": {
                        "maintext": {
                            "query": terms,
                        }
                    }
                },
            ]
        }
    }
}


for body in [query_body, query_boosted]:
    res = es.search(index=INDEX_NAME, body=body)
    print ("Found {} results.".format(res['hits']['total']['value']))
    for hit in res['hits']['hits']:
        print("=====================================================================")
        print ("score: {} source: {}".format(hit["_score"], hit["_source"]["source"]))
        print ("body: {}".format(hit["_source"]["maintext"])[:100])
    print("\n")

Found 5 results.
score: 3.3360603 source: The New York Times
body: The revival of supersonic passenger travel, thought to be long dead with the demise of Concord
score: 1.425359 source: The Herald-ir
body: Luke O'Reilly with his mother Janet O'Brien Luke O'Reilly Jack Hall Ellis The Metro One Bar in
score: 1.0589126 source: The Herald-ir
body: Charles Leclerc
Charles Leclerc registered the maiden win of his Formula One career after romp
score: 0.09337806 source: The Herald-ir
body: Antonio Conte. Pic: PA
Head coach Antonio Conte does not think Chelsea are in the race to sign
score: 0.09337806 source: The Herald-ir
body: Hamid Sanambar
Gardai are hunting for a gunman who opened fire on a car in north Dublin - just


Found 5 results.
score: 10.008182 source: The New York Times
body: The revival of supersonic passenger travel, thought to be long dead with the demise of Concord
score: 1.6121151 source: The Herald-ir
body: Luke O'Reilly with his mother Janet O'Brien Luke O'Reilly Jack Hall 
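
The effect of the boost can be checked by hand on the scores printed above. The sketch below rests on two assumptions that are not stated in the output: that the New York Times article matched only on the source clause, and that the Herald-ir score of 0.09337806 is what the source clause alone contributes to a Herald-ir article.

```python
BOOST = 3  # boost applied to the "source" clause in query_boosted

# Scores copied from the two result lists above.
nyt_plain, nyt_boosted = 3.3360603, 10.008182
herald_plain, herald_boosted = 1.425359, 1.6121151
# Assumption: this score comes from the source clause only.
herald_source_only = 0.09337806

# If only the source clause matched, the whole score is multiplied by the boost.
nyt_predicted = BOOST * nyt_plain

# Otherwise only the source part of the score is multiplied.
herald_text_part = herald_plain - herald_source_only
herald_predicted = herald_text_part + BOOST * herald_source_only

print(round(nyt_predicted, 5), nyt_boosted)
print(round(herald_predicted, 5), herald_boosted)
```

Under these assumptions the predicted values agree with the boosted scores to within rounding, which suggests that the boost multiplies the contribution of the boosted clause and leaves the other clause untouched.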

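## Score description: Explain

The explain API reports how the score of a single document was computed for a given query. The id below comes from the run shown here; replace it with the `_id` of a document in your own index.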

In [19]:
res = es.explain(id="kNRlr4QBPTSChHKmOhMc", index=INDEX_NAME, body=query_boosted)
res

{'_index': 'bellomo',
 '_type': '_doc',
 '_id': 'kNRlr4QBPTSChHKmOhMc',
 'matched': True,
 'explanation': {'value': 10.008182,
  'description': 'sum of:',
  'details': [{'value': 10.008182,
    'description': 'sum of:',
    'details': [{'value': 0.20509824,
      'description': 'weight(source:The in 0) [PerFieldSimilarity], result of:',
      'details': [{'value': 0.20509824,
        'description': 'score(freq=1.0), computed as boost * idf * tf from:',
        'details': [{'value': 6.6000004,
          'description': 'boost',
          'details': []},
         {'value': 0.087011375,
          'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:',
          'details': [{'value': 5,
            'description': 'n, number of documents containing term',
            'details': []},
           {'value': 5,
            'description': 'N, total number of documents with field',
            'details': []}]},
         {'value': 0.35714287,
          'description': 'tf, compute
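
The numbers in this (truncated) explanation can be reproduced with a few lines of arithmetic. The sketch below recomputes the idf from the formula printed in the output and multiplies it by the reported boost and tf values; all inputs are copied from the output above.

```python
import math

n = 5  # number of documents containing the term "The" in the source field
N = 5  # total number of documents with the field

# idf formula as printed in the explanation
idf = math.log(1 + (N - n + 0.5) / (n + 0.5))

boost = 6.6000004  # reported 'boost' value
tf = 0.35714287    # reported 'tf' value

# 'score(freq=1.0), computed as boost * idf * tf'
term_score = boost * idf * tf

print(idf, term_score)
```

Because the term "The" appears in all 5 documents, its idf is very small, so this term contributes little to the total score despite the boost. The reported boost of 6.6 is consistent with the query boost of 3 multiplied by 2.2; reading 2.2 as a constant added by the scoring function is an assumption, since the output does not say where it comes from.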

# Exercise
Run the code below to load a set of articles and create an index. It contains 500 articles.
The fields are:
- *maintext*: textual content of the article
- *source*: news source that published the article
- *author*: person who wrote the article
- *date*: publication date, in the format yyyy-MM-dd (stored in the index as *pub-date*)

UNIQUE SOURCES = ["La Repubblica", "La Stampa", "Il Corriere della Sera", "Rai News"]

UNIQUE AUTHORS = ["Paola Candreva", "Melchiorre Paccioretti", "Caterina De Luca", "Marcello Fiorentini", "Celestino Necci"]

The date range is 2019-12-05 -> 2020-02-10

In [20]:
with open("articles.json", "r") as json_file:
    articles = json.load(json_file)
len(articles)

500

In [21]:
es = index_create()
mapping = {
    "properties":{
        "maintext": {
            "type": "text",
            "analyzer": "english"
        },
        "source": {
            "type": "text",
            "analyzer": "whitespace"
        },
        "pub-date": {
            "type": "date",
            "format": "yyyy-MM-dd"
        },
        "author": {
            "type": "text",
            "analyzer": "whitespace"
        },

    }        
}
es.indices.put_mapping(index=INDEX_NAME, body=mapping)
for a in articles:
    document = {"maintext": a["maintext"], "source": a["source"], "pub-date": a["date"], "author": a["author"]}
    es.index(index=INDEX_NAME, body=document)
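
Indexing the articles one request at a time, as above, can be slow with larger collections. The Python client also ships a bulk helper, `elasticsearch.helpers.bulk`, which sends many documents per request. The following is a minimal sketch of the action generator it would consume; `bulk_actions` is a hypothetical name and the sample article is made up for illustration.

```python
def bulk_actions(index_name, articles):
    """Yield one bulk action per article, using the same fields as above."""
    for article in articles:
        yield {
            "_index": index_name,
            "_source": {
                "maintext": article["maintext"],
                "source": article["source"],
                "pub-date": article["date"],
                "author": article["author"],
            },
        }


# Made-up article, only to show the shape of a generated action.
sample_articles = [
    {"maintext": "Some text", "source": "Rai News",
     "date": "2020-01-01", "author": "Paola Candreva"},
]
sample_actions = list(bulk_actions("my_index", sample_articles))
print(sample_actions[0])

# Usage against the live server (not executed here):
# from elasticsearch import helpers
# helpers.bulk(es, bulk_actions(INDEX_NAME, articles))
```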

# EXERCISE: TRY TO CREATE THIS QUERY
Find the number of articles that were written by "Caterina De Luca" OR published by "Rai News", were published between the 5th of December 2019 and the 25th of January 2020 (both included), and contain the word "world".

In [22]:
query_body = {} #TODO insert your query here
res = es.search(index=INDEX_NAME, body=query_body)
print ("Found {} results.".format(res['hits']['total']['value']))
for hit in res['hits']['hits']:
    print("=====================================================================")
    print ("score: {} source: {}, author: {}".format(hit["_score"], hit["_source"]["source"], hit["_source"]["author"]))
    print ("body: {}".format(hit["_source"]["maintext"])[:100])

Found 500 results.
score: 1.0 source: Il Corriere della Sera, author: Melchiorre Paccioretti
body: After the controversy the executive corrects the norm on plastic. Medical devices and single-u
score: 1.0 source: Rai News, author: Celestino Necci
body: Trump attacks Trudeau: “Hypocrite, he has a double face.” Then cancel the press conference. “I
score: 1.0 source: Rai News, author: Caterina De Luca
body: The images arrive three months after the US President's visit to Otay Mesa Immigrant Detention
score: 1.0 source: La Repubblica, author: Paola Candreva
body: Zelimkhan Khangoshvili, former Chechen rebel commander, was murdered on 23 August in a park in
score: 1.0 source: Rai News, author: Marcello Fiorentini
body: Victims are civilians: the attacker took his own life
score: 1.0 source: La Repubblica, author: Melchiorre Paccioretti
body: The traditional party for lighting up the lights
score: 1.0 source: La Stampa, author: Celestino Necci
body: 250 events are planned throughout the coun

For the brave ones who finished early:
**How do I run the same query, without the "world" clause, but with the OR turned into a XOR?** (That is, the article is either written by "Caterina De Luca" or published by "Rai News", but not both.)
So the new query to build is:
Find the number of articles that were either written by "Caterina De Luca" or published by "Rai News", but not both, and were published between the 5th of December 2019 and the 25th of January 2020 (both included).

_HINT_: similarly to "must", there is also a "must_not" construct.
