# CAI Lab Session 4: Implementing search in the vector space model

In this session you will:

- Continue to work with the `arxiv` repository from last session
- Learn how to do atomic, conjunctive and disjunctive search with ElasticSearch
- Build an inverted index for the `arxiv` repository from last session (should fit in main memory)
- Implement search in the vector space model and compare it with ElasticSearch's built-in search mechanism
- Compare different implementations of search

## 1. Built-in search in ElasticSearch

ElasticSearch provides a search mechanism for querying an index.
The next code snippets show how to run an atomic query (a single term)
as well as conjunctive and disjunctive queries.

In [24]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from elasticsearch_dsl.query import Q


client = Elasticsearch("http://localhost:9200", request_timeout=1000)
s = Search(using=client, index='arxiv')

## atomic query
q = Q('query_string',query='computer')  # Feel free to change the word

s = s.query(q)  # add the query to the search object
response = s[:5].execute()  # execute the search and return the first 5 results
for r in response:  # only returns a specific number of results
    print('ID= %s SCORE=%s' % (r.meta.id,  r.meta.score))
    print('PATH= %s' % r.path)
    print('TEXT: %s' % r.text[:90])
    print()

ID= lpEAV4sBcL-B_pFl6BAW SCORE=3.2082987
PATH= ../../arxiv\cs.updates.on.arXiv.org/002772
TEXT: Limit computable functions can be characterized by Turing jumps on the input side or limit

ID= bpEAV4sBcL-B_pFl-19e SCORE=3.2082987
PATH= ../../arxiv\math.updates.on.arXiv.org/001904
TEXT: Limit computable functions can be characterized by Turing jumps on the input side or limit

ID= FpEAV4sBcL-B_pFl7CJ2 SCORE=3.1941829
PATH= ../../arxiv\cs.updates.on.arXiv.org/007252
TEXT: We study scheduling of computation tasks across $n$ workers in a large scale distributed l

ID= CpEAV4sBcL-B_pFl_Wcq SCORE=3.1941829
PATH= ../../arxiv\math.updates.on.arXiv.org/003852
TEXT: We study scheduling of computation tasks across $n$ workers in a large scale distributed l

ID= JJEAV4sBcL-B_pFl7SMA SCORE=3.1520977
PATH= ../../arxiv\cs.updates.on.arXiv.org/007522
TEXT: Grid Computing is an idea of a new kind of network technology in which research work in pr



In [25]:
## conjunctive query

client = Elasticsearch("http://localhost:9200", request_timeout=1000)
s = Search(using=client, index='arxiv')

q = Q('query_string',query='computer') & Q('query_string',query='magic')

s = s.query(q)
response = s[0:5].execute()
for r in response:  # only returns a specific number of results
    print(f'ID= {r.meta.id} SCORE={r.meta.score}')
    print(f'PATH= {r.path}')
    print(f'TEXT: {r.text[:90]}')
    print()

ID= hJEBV4sBcL-B_pFlB5Oa SCORE=14.512409
PATH= ../../arxiv\quant-ph.updates.on.arXiv.org/000650
TEXT: We give a new algorithm for computing the robustness of magic - a measure of the utility o

ID= n5EBV4sBcL-B_pFlB5Oa SCORE=14.512409
PATH= ../../arxiv\quant-ph.updates.on.arXiv.org/000677
TEXT: We give a new algorithm for computing the robustness of magic - a measure of the utility o

ID= bpEBV4sBcL-B_pFlCJeR SCORE=11.029575
PATH= ../../arxiv\quant-ph.updates.on.arXiv.org/001652
TEXT: A defining feature in the field of quantum computing is the potential of a quantum device 

ID= WpAAV4sBcL-B_pFl2M1W SCORE=10.839215
PATH= ../../arxiv\astro-ph.updates.on.arXiv.org/006224
TEXT: Context. PKS 1510-089 is a flat spectrum radio quasar strongly variable in the optical and

ID= GJAAV4sBcL-B_pFl2NDh SCORE=9.807085
PATH= ../../arxiv\astro-ph.updates.on.arXiv.org/006926
TEXT: PKS 1510-089 is a flat spectrum radio quasar strongly variable in the optical and GeV rang



In [26]:
## disjunctive query

client = Elasticsearch("http://localhost:9200", request_timeout=1000)
s = Search(using=client, index='arxiv')

q = Q('query_string',query='computer') | Q('query_string',query='magic')

s = s.query(q)
response = s[0:5].execute()
for r in response:  # only returns a specific number of results
    print(f'ID= {r.meta.id} SCORE={r.meta.score}')
    print(f'PATH= {r.path}')
    print(f'TEXT: {r.text[:90]}')
    print()

ID= hJEBV4sBcL-B_pFlB5Oa SCORE=14.512409
PATH= ../../arxiv\quant-ph.updates.on.arXiv.org/000650
TEXT: We give a new algorithm for computing the robustness of magic - a measure of the utility o

ID= n5EBV4sBcL-B_pFlB5Oa SCORE=14.512409
PATH= ../../arxiv\quant-ph.updates.on.arXiv.org/000677
TEXT: We give a new algorithm for computing the robustness of magic - a measure of the utility o

ID= JZAAV4sBcL-B_pFl4_72 SCORE=12.0883665
PATH= ../../arxiv\cond-mat.updates.on.arXiv.org/003482
TEXT: When two monolayers of graphene are stacked with a small relative twist angle, the resulti

ID= BpEAV4sBcL-B_pFl-FNF SCORE=11.981175
PATH= ../../arxiv\hep-th.updates.on.arXiv.org/000265
TEXT: We introduce the extended Freudenthal-Rosenfeld-Tits magic square based on six algebras: t

ID= uZEAV4sBcL-B_pFl-luA SCORE=11.981175
PATH= ../../arxiv\math.updates.on.arXiv.org/000955
TEXT: We introduce the extended Freudenthal-Rosenfeld-Tits magic square based on six algebras: t



## 2. Excruciatingly slow search

In class we have presented a _slow_ version of search that, given a search query $q$, loops over every document in the database
computing the cosine similarity between document and query. Once this is done, it sorts documents by their similarity w.r.t. $q$ and returns the top $r$
scoring ones. 

```
1. for each d in D:
    sim(d,q) = 0
    get vector representing d
    for each w in q:
        sim(d,q) += tf(d,w) * idf(w)
    normalize sim(d,q) by |d|*|q|
2. sort results by similarity
3. return top r docs
```
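
The pseudocode above can be tried on a tiny in-memory collection before touching ElasticSearch. The sketch below is only an illustration: the `docs` dictionary, its weights and the function name `slow_search_toy` are made up for the example, the weights stand in for tf(d,w) * idf(w), and the query is treated as a 0-1 vector.

```python
import math

# Hypothetical toy collection: doc id -> {term: tf-idf weight} (made-up values)
docs = {
    "d1": {"computer": 3.0, "science": 4.0},
    "d2": {"magic": 1.0},
    "d3": {"computer": 1.0, "magic": 1.0},
}


def slow_search_toy(docs, query, r):
    """Score every document against the query, then keep the top r."""
    terms = query.split()
    q_norm = math.sqrt(len(terms))  # |q| for a 0-1 query vector
    sims = {}
    for doc_id, weights in docs.items():  # 1. for each d in D
        d_norm = math.sqrt(sum(w * w for w in weights.values()))
        score = sum(weights.get(t, 0.0) for t in terms)  # one addition per query term
        sims[doc_id] = score / (d_norm * q_norm)  # normalize by |d|*|q|
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)  # 2. sort
    return ranked[:r]  # 3. return top r docs


top = slow_search_toy(docs, "computer magic", 2)
print(top)
```

Every document is visited even when it shares no term with the query, which is what makes this version slow on a large collection.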

A possible implementation can be found below. 

__Remark:__ _Note that some elements of the implementation below are specific to my own code, and you should adapt them to yours; in particular, the line_

```    weights = dict(normalize(tf_idf('arxiv', client, s['_id'], ndocs)))   # gets weights as a python dict of term -> weight ```

_obtains tf-idf weights by calling a function `tf_idf` that, given the index name, the client, a docid and the number of documents, returns a list of pairs (term, weight); `normalize` takes such a list and normalizes the weights so that the corresponding vector has length 1.
Obviously, you should adapt the code to your own implementations from previous sessions._
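
As a quick sanity check on what a `normalize`-style function is expected to do, the sketch below applies a stand-alone version to made-up (term, weight) pairs. The name `normalize_pairs` and the sample weights are illustrative assumptions, not part of the lab code; the guard against an all-zero vector is an extra precaution added for the example.

```python
import math


def normalize_pairs(pairs):
    """Scale (term, weight) pairs so the weight vector has Euclidean length 1."""
    norm = math.sqrt(sum(w * w for _, w in pairs))
    if norm == 0.0:  # empty or all-zero vector: nothing to scale
        return list(pairs)
    return [(t, w / norm) for t, w in pairs]


pairs = [("computer", 3.0), ("magic", 4.0)]  # made-up weights, norm is 5
unit = dict(normalize_pairs(pairs))
print(unit)  # {'computer': 0.6, 'magic': 0.8}
```

Once document vectors have length 1, the cosine similarity only needs the sum of the matching weights divided by the norm of the query.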


In [27]:
from elasticsearch.helpers import scan
from pprint import pprint
from elasticsearch import Elasticsearch
import tqdm
from colorama import Fore
import numpy as np  

In [28]:
def tf_idf(idx: str, client: Elasticsearch, doc_id: str, D: int) -> list:
    """
    Compute tf-idf for each term in a document with internal id doc_id
    """

    import math

    tv = client.termvectors(index=idx, id=doc_id, fields=['text'], term_statistics=True)
    tfidf = []
    
    if 'text' in tv['term_vectors']:
        max_word = max(tv['term_vectors']['text']['terms'], key=lambda x: tv['term_vectors']['text']['terms'][x]['term_freq'])
        max_fdj = tv['term_vectors']['text']['terms'][max_word]['term_freq']
        
        for word in tv['term_vectors']['text']['terms']:
        
            fdi = tv['term_vectors']['text']['terms'][word]['term_freq']    # term frequency in document
            dfi = tv['term_vectors']['text']['terms'][word]['doc_freq']     # number of documents containing term in entire corpus
            
            tf = fdi/max_fdj
            idf = math.log(D/dfi, 2)
            
            tfidf.append((word, tf*idf))
            
    return tfidf

In [29]:
def normalize(tdfidf: list) -> list:
    """
    Normalize tf-idf weights so that the resulting vector has length 1
    """

    import math

    norm_d = math.sqrt(sum([w**2 for _, w in tdfidf]))
    return [(t, w/norm_d) for t, w in tdfidf]

In [30]:
client = Elasticsearch("http://localhost:9200", request_timeout=1000)

r = 10  # only return r top docs
query = 'computer magic'

In [45]:
def slow_search(query: str, r: int) -> tuple:
    """
    Slow search using tf-idf.
    Returns (all docs sorted by similarity, top r docs, number of docs scored)
    """

    sims = dict()

    l2query  = np.sqrt(len(query.split()))  # l2 of query assuming 0-1 vector representation

    # get nr. of docs; just for the progress bar
    ndocs = int(client.cat.count(index='arxiv', format = "json")[0]['count'])  # D

    # scan through docs, compute cosine sim between query and each doc
    for s in tqdm.tqdm(scan(client, index='arxiv', query={"query" : {"match_all": {}}}), total=ndocs):
        docid = s['_id']   # use the ElasticSearch internal id as document id

        sims[docid] = 0.0
        weights = dict(normalize(tf_idf('arxiv', client, s['_id'], ndocs)))  # normalize weights for doc

        for w in query.split():  # gets terms as a list
            if w in weights:    # probably need to do something fancier to make sure that word is in vocabulary etc.
                sims[docid] += weights[w]   # accumulates if w in current doc

        # normalize sim: doc weights already have length 1, so only ||q||_2 is left
        sims[docid] /= l2query

    # now sort by cosine similarity
    sorted_answer = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    
    return sorted_answer, sorted_answer[:r], len(sorted_answer)

In [46]:
complete_answer, answer, total = slow_search(query, r)
pprint(answer)

100%|██████████| 58102/58102 [04:37<00:00, 209.54it/s]

[('hJEBV4sBcL-B_pFlB5Oa', 0.47752597051076895),
 ('n5EBV4sBcL-B_pFlB5Oa', 0.477263052236843),
 ('JZAAV4sBcL-B_pFl4_72', 0.38792349570755896),
 ('vZEBV4sBcL-B_pFlCJaR', 0.3023262889426185),
 ('BpEAV4sBcL-B_pFl-FNF', 0.26483163493536094),
 ('uZEAV4sBcL-B_pFl-luA', 0.26483163493536094),
 ('_JEAV4sBcL-B_pFl-FJF', 0.24832299488605178),
 ('fpEAV4sBcL-B_pFl-luA', 0.24832299488605178),
 ('LZAAV4sBcL-B_pFl1L09', 0.24062631526975578),
 ('eZAAV4sBcL-B_pFl07k6', 0.23778079125705068)]





In [47]:
nz = len([x for x, s in complete_answer if s>0])
print(f'There are {nz} docs with non-zero similarity out of {total}, i.e. {100.0*nz/total:.1f}%')

There are 140 docs with non-zero similarity out of 58102, i.e. 0.2%


## 3. Your tasks

---

**Exercise 1:**  

Make sure you understand the algorithm for implementing search described in the lecture notes, in both its slow and efficient versions. Give
the number of additions you need to perform in each version for the following toy example with a vocabulary of size 4 and four documents:

- $q = (0,1,1,0)$

- document-term matrix:
<center>


|        | t1  | t2  | t3  | t4  |
|--------|-----|-----|-----|-----|
| **d1** | 1.2 | 0.0 | 0.0 | 0.0 |
| **d2** | 0.7 | 0.3 | 1.5 | 0.1 |
| **d3** | 0.0 | 0.0 | 0.0 | 0.7 |
| **d4** | 2.0 | 0.0 | 0.0 | 0.0 |

</center>

---

With the slow (inefficient) algorithm we need a total of 8 additions. This version traverses the corpus document by document and, for each document, adds the contribution of every query word to that document's similarity. With 4 documents and a query containing 2 words (t2 and t3), each document requires 2 additions, which totals 4x2 = 8 additions.

With the fast algorithm, which uses the inverted file, we need only 2 additions, because we now traverse the weight matrix by columns (terms) instead of by rows (documents). For each query word we fetch its posting list (the list of documents that contain it) and add that word's contribution to the partial similarity of each document in the list. So, rather than having the complete similarity of some documents at a given point of the execution, we hold partial similarities for all candidate documents. Here we first take t2, which appears only in d2, and perform 1 addition; then we take t3, which again appears only in d2, and perform one more, for a total of 2.
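
The two counts can be checked mechanically. The sketch below is a toy illustration, not part of the lab code: it encodes the matrix and query from the exercise, counts one addition per accumulated weight, and leaves out the final normalization since it does not affect the number of additions. The variable names are chosen for the example.

```python
# Document-term matrix from the exercise (rows d1..d4, columns t1..t4)
matrix = {
    "d1": [1.2, 0.0, 0.0, 0.0],
    "d2": [0.7, 0.3, 1.5, 0.1],
    "d3": [0.0, 0.0, 0.0, 0.7],
    "d4": [2.0, 0.0, 0.0, 0.0],
}
q = [0, 1, 1, 0]
query_terms = [j for j, x in enumerate(q) if x]  # column indices of t2 and t3

# Slow version: every document, every query term
slow_sims = {d: 0.0 for d in matrix}
slow_adds = 0
for d, row in matrix.items():
    for j in query_terms:
        slow_sims[d] += row[j]
        slow_adds += 1

# Quick version: posting lists keep only the documents with a non-zero weight
postings = {
    j: {d: row[j] for d, row in matrix.items() if row[j] != 0.0}
    for j in range(len(q))
}
fast_sims = {}
fast_adds = 0
for j in query_terms:
    for d, w in postings[j].items():
        fast_sims[d] = fast_sims.get(d, 0.0) + w
        fast_adds += 1

print(slow_adds, fast_adds)  # 8 2
```

Both versions assign the same score to d2, the only document with a non-zero similarity.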


---


**Exercise 2:**

Implement the quick version; run both the slow and quick versions and report their running times (as a reference, on my old laptop it takes around 5m30s to run the slow version in the code above). Make sure both versions return the same answer. Note that you will need to build an inverted index in order to implement the efficient version as explained in class; this may take time, but it is done once for all queries and can be done "off-line".
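
One possible way to make the index construction truly "off-line" is to save the index to disk the first time it is built and reload it afterwards. The sketch below assumes the index is a plain Python dictionary and uses the standard `pickle` module; `load_or_build`, `build_toy_index` and the file name are hypothetical names for this example, and the toy postings are made up. Only unpickle files that you have written yourself.

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Return the object pickled at `path`, building and saving it first if missing."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    obj = build_fn()
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return obj


calls = []  # records how many times the expensive build actually runs


def build_toy_index():
    """Stand-in for the expensive index construction (made-up postings)."""
    calls.append(1)
    return {"computer": {"d1", "d3"}, "magic": {"d2", "d3"}}


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "toy_index.pkl")
    first = load_or_build(cache, build_toy_index)   # builds and saves
    second = load_or_build(cache, build_toy_index)  # loads from disk

print(first == second, len(calls))  # True 1
```

With this pattern, the cost of building the index is paid once, and later sessions only pay the cost of reading the file.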

In [41]:
def inverted_index(client: Elasticsearch, idx: str) -> dict:
  """
  Returns the inverted index as a dictionary. The inverted index is a dictionary mapping terms to the set of documents that contain the term

  ----------
  Parameters
  ----------
  client: Elasticsearch
      Elasticsearch client object
  idx: str
      Name of the index
  -------
  Returns
  -------
  dict
      Inverted index
  """

  D = int(client.cat.count(index=idx, format = "json")[0]['count'])
  posting_list = dict()
  
  print(f'There are {D} documents in the index. Start of the posting list construction...')
  for s in tqdm.tqdm(scan(client, index=idx, query={"query" : {"match_all": {}}}), total=D):
    docid = s['_id']
    tv = client.termvectors(index=idx, id=docid, fields=['text'], term_statistics=True)
    
    if 'text' in tv['term_vectors']:
      for t in tv['term_vectors']['text']['terms']:
        if t not in posting_list:
          posting_list[t] = set()
        posting_list[t].add(docid)
  print(Fore.GREEN + 'Posting list construction completed.')
  return posting_list

In [42]:
def inverted_index_search(query: str,  
                                 client: Elasticsearch,
                                 posting_list: dict, 
                                 D:int, 
                                 r: int) -> list:
  """
  Implement inverted file retrieval for a query and return top r results

  ----------
  Parameters
  ----------
  query: str
      Query string
  client: Elasticsearch
      Elasticsearch client object
  posting_list: dict
      Inverted index
  D: int
      Number of documents in the collection
  r: int
      Number of results to return
  -------
  Returns
  -------
  list
      Top r results
  """
  sims = dict()
  
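  for w in tqdm.tqdm(query.split()):
    # terms missing from the index get an empty posting list (avoids a KeyError)
    for d in posting_list.get(w, ()):
      weights = dict(normalize(tf_idf('arxiv', client, d, D)))  # unit-length tf-idf vector of d
      sims[d] = sims.get(d, 0.0) + weights[w]  # accumulate the partial similarity of d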
  
  l2query = np.sqrt(len(query.split()))  
  for d in sims:
    sims[d] /= l2query
    
  sorted_by_similarity = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
  return sorted_by_similarity[:r]

In [43]:
pst_list = inverted_index(client, 'arxiv')

There are 58102 documents in the index. Start of the posting list construction...


100%|██████████| 58102/58102 [04:34<00:00, 211.62it/s]


Posting list construction completed.


In [44]:
r = 10
query = 'computer magic'
D = int(client.cat.count(index='arxiv', format = "json")[0]['count'])
inverted_index_search(query, client, pst_list, D, r)

100%|██████████| 2/2 [00:00<00:00,  3.30it/s]


[('hJEBV4sBcL-B_pFlB5Oa', 0.47752597051076895),
 ('n5EBV4sBcL-B_pFlB5Oa', 0.477263052236843),
 ('JZAAV4sBcL-B_pFl4_72', 0.38792349570755896),
 ('vZEBV4sBcL-B_pFlCJaR', 0.3023262889426185),
 ('uZEAV4sBcL-B_pFl-luA', 0.26483163493536094),
 ('BpEAV4sBcL-B_pFl-FNF', 0.26483163493536094),
 ('_JEAV4sBcL-B_pFl-FJF', 0.24832299488605178),
 ('fpEAV4sBcL-B_pFl-luA', 0.24832299488605178),
 ('LZAAV4sBcL-B_pFl1L09', 0.24062631526975578),
 ('eZAAV4sBcL-B_pFl07k6', 0.23778079125705068)]

---


**Exercise 3:**

Compare the results for a few sample queries that you get from your quick version and ElasticSearch search. Do you get similar results? Which is faster?
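
When comparing, keep in mind that the scores themselves are not on the same scale: ElasticSearch uses its own similarity function by default (BM25 in recent versions), whereas our implementation computes a cosine over tf-idf weights. It is therefore more informative to compare the rankings. The sketch below is one simple option, with `overlap_at_k` being a helper name invented for this example; the ids are copied from the outputs shown earlier for the disjunctive ElasticSearch query and for the inverted-index search with the query 'computer magic'.

```python
def overlap_at_k(ranking_a, ranking_b, k):
    """Fraction of the top-k ids of ranking_a that also appear in the top-k of ranking_b."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / k if k else 0.0


# Top-5 ids copied from the outputs above (query: 'computer magic')
elastic_ids = ["hJEBV4sBcL-B_pFlB5Oa", "n5EBV4sBcL-B_pFlB5Oa", "JZAAV4sBcL-B_pFl4_72",
               "BpEAV4sBcL-B_pFl-FNF", "uZEAV4sBcL-B_pFl-luA"]
own_ids = ["hJEBV4sBcL-B_pFlB5Oa", "n5EBV4sBcL-B_pFlB5Oa", "JZAAV4sBcL-B_pFl4_72",
           "vZEBV4sBcL-B_pFlCJaR", "uZEAV4sBcL-B_pFl-luA"]

print(overlap_at_k(elastic_ids, own_ids, 5))  # 0.8
```

An overlap of 0.8 means that four of the five top documents coincide, even though the numeric scores differ.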

In [26]:
# import nltk
# nltk.download('words')

In [22]:
import nltk
from nltk.corpus import words
import random

In [None]:

def compare_query_execution_times(query: str, 
                                  client: Elasticsearch, 
                                  posting_list: dict, 
                                  ndocs:int,
                                  r: int) -> None:
  """
  Compare query execution times for the two implementations
  """
  import time
  
  start = time.time()
  Own_implementation = inverted_index_search(query, client, posting_list, ndocs, r)
  end = time.time()
  print(f'Inverted file implementation took {end-start:.2f} seconds')
  
  start = time.time()
  s = Search(using=client, index='arxiv')
  q = Q('query_string', query=query)
  s = s.query(q)
  Elasticsearch_implementation = s[:r].execute()
  end = time.time()
  print(f'Elasticsearch implementation took {end-start:.2f} seconds \n')

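  print("Let's compare the results of the two implementations: \n")
  # either implementation may return fewer than r hits, so stop at the shorter list
  n_shown = min(r, len(Own_implementation), len(Elasticsearch_implementation.hits))
  for i in range(n_shown):
    print(f'Inverted file implementation: {Own_implementation[i][0]} with score {Own_implementation[i][1]}')
    print(f'Elasticsearch implementation: {Elasticsearch_implementation[i].meta.id} with score {Elasticsearch_implementation[i].meta.score} \n')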


query = 'computer magic'
compare_query_execution_times(query, client, pst_list, D, r)

# english_words = words.words()

 

## 4. Rules of delivery

- To be solved in _pairs_.

- No plagiarism; don't discuss your work with other teams. You can ask others for help with simple things, such as recalling a python instruction or module, but nothing too specific to the session.

- If you feel you are spending much more time than the rest of your classmates, ask us for help. Questions can be asked either in person or by email, and you'll never be penalized for asking questions, no matter how stupid they look in retrospect.

- Write a short report listing the solutions to the exercises proposed. Include things like the important parts of your implementation (data structures used for representing objects, algorithms used, etc). You are welcome to add conclusions and findings that depart from what we asked you to do. We encourage you to discuss the difficulties you find; this lets us give you help and also improve the lab session for future editions.

- Convert the report to PDF. Make sure it has your names, date, and title. Include your code in your submission.

- Submit your work through the [raco](http://www.fib.upc.edu/en/serveis/raco.html) _before November 6th, 2023_.