# Lab 2. Building inverted index and answering queries

In this lab you are going to implement a standard document processing pipeline and then build a simple search engine based on it: starting from crawling documents, then building an inverted index, answering queries using this index, and organizing it as a simple web server.

# 1. Preprocessing

First, we need a unified approach to document preprocessing; the `Preprocessor` class below is responsible for it. Complete the code for the given functions (most of them are one-liners) and make sure you pass the tests. Use the `nltk` library.
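To make the expected behaviour concrete, here is a minimal, library-free sketch of the same pipeline. The regex tokenizer, the `toy_stem` suffix stripper and the abbreviated stop-word set are stand-ins invented for this illustration; they are much cruder than `nltk.word_tokenize` and the Porter stemmer that the lab asks for.

```python
import re

# abbreviated stop-word set, for illustration only
STOP_WORDS = {'a', 'an', 'and', 'be', 'is', 'that', 'the', 'to'}

def toy_tokenize(text):
    # split into runs of word characters and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def toy_stem(word):
    # hypothetical stand-in for a real stemmer: strip a few common suffixes
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_preprocess(text):
    result = []
    for token in toy_tokenize(text.lower()):
        # keep alphabetic tokens that are not stop words, then stem them
        if token.isalpha() and token not in STOP_WORDS:
            result.append(toy_stem(token))
    return result

print(toy_preprocess('To be, or not to be, that is the question'))
```

The order of the steps matters: lowercasing comes first so that `To` matches the stop word `to`, and filtering happens on the original token so that stop words are recognised before stemming changes them.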

In [0]:
import nltk 
nltk.download('punkt')

class Preprocessor:
    
    def __init__(self):
        self.stop_words = {'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'has', 'he', 'in', 'is', 'it', 'its',
                      'of', 'on', 'that', 'the', 'to', 'was', 'were', 'will', 'with'}
        self.ps = nltk.stem.PorterStemmer()

    
    def tokenize(self, text):
        #TODO word tokenize text using nltk lib
        return nltk.word_tokenize(text)

    
    def stem(self, word, stemmer):
        #TODO stem word using provided stemmer
        return stemmer.stem(word)

    
    def is_apt_word(self, word):
        #TODO check if word is appropriate - not a stop word and isalpha, 
        # i.e consists of letters, not punctuation, numbers, dates
        return word.isalpha() and word not in self.stop_words

    
    def preprocess(self, text):
        #TODO combine all previous methods together: tokenize lowercased text 
        # and stem it, ignoring not appropriate words
        result = []
        for word in self.tokenize(text.lower()):
            # filter on the original token, so stop words are matched before stemming
            if self.is_apt_word(word):
                result.append(self.stem(word, self.ps))

        return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1.1. Tests ##

In [0]:
prep = Preprocessor()
text = 'To be, or not to be, that is the question'

assert prep.tokenize(text) == ['To', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question']
assert prep.stem('retrieval', prep.ps) == 'retriev'
assert prep.is_apt_word('qwerty123') is False
assert prep.preprocess(text) == ['or', 'not', 'question']

# 2. Crawling and Indexing

## 2.1 Base classes

Here are some base classes we will need for writing our indexer. The code from the last lab's solution is given, but note that you will need to change some of it, namely the `parse` function. The reason is that it always parses the whole page, which we want to avoid when we only need the links, for example, or a specific portion of the text.
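One way to avoid a full parse is to extract only what the caller asks for. The sketch below is an assumption about how that could look, not the reference solution: it uses the standard library's `html.parser` instead of BeautifulSoup, and the `LinkExtractor` class and `extract_links` helper are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags and ignores everything else."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                # resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = ('<html><body><a href="/article/a1">A1</a><p>text</p>'
        '<a href="https://example.com/x">X</a></body></html>')
print(extract_links(page, 'https://www.reuters.com/news/us'))
```

The same idea carries over to BeautifulSoup: split `parse` into separate steps for links, images and visible text, and run only the step that is needed.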

In [0]:
import requests
from urllib.parse import quote
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse
import os


class Document:

  def __init__(self, url):
    self.url = url

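  def download(self):
    try:
      response = requests.get(self.url)
      if response.status_code == 200:
        self.content = response.content
        return True
      return False
    except requests.RequestException:
      # connection errors, timeouts and malformed URLs all count as a failed download
      return False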

  def persist(self, path):
    with open(os.path.join(path, quote(self.url).replace('/', '_')), 'wb') as f:
      f.write(self.content)

class HtmlDocument(Document):

    def normalize(self, href):
      if href is not None and href[:4] != 'http':
          href = urllib.parse.urljoin(self.url, href)
      return href

    def parse(self):
      #TODO change this method
      # download only if the content has not been fetched yet
      if not hasattr(self, 'content'):
        self.download()

      def tag_visible(element):
          if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
          if isinstance(element, Comment):
            return False
          return True
      model = BeautifulSoup(self.content, 'html.parser')
        
      self.anchors = []
      a = model.find_all('a')
      for anchor in a:
        href = self.normalize(anchor.get('href'))
        # if href[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt') : 
        #   continue
        text = anchor.text
        self.anchors.append((text, href))

      self.images = []
      i = model.find_all('img')
      for img in i:
        href = self.normalize(img.get('src'))
        self.images.append(href)
        
      texts = model.find_all(string=True)
      visible_texts = filter(tag_visible, texts)  
      self.text = u" ".join(t.strip() for t in visible_texts)

## 2.2 Main class

The main indexer logic is here. We organize it as a crawler generator that adds certain visited pages to the inverted index and saves them on disk.
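The generator idea can be illustrated without any network access. The sketch below walks a hypothetical in-memory link graph (`TOY_LINKS`) breadth-first and yields each page once; the graph and the function name are made up for the example, and the depth handling reflects one possible reading of the `depth` parameter.

```python
from collections import deque

# hypothetical link graph standing in for real pages: url -> outgoing links
TOY_LINKS = {
    'home': ['news', 'finance'],
    'news': ['article/1', 'home'],
    'finance': ['article/2'],
    'article/1': [],
    'article/2': ['article/3'],
    'article/3': [],
}

def bfs_pages(source, depth):
    """Yield pages reachable from source, visiting levels 0 .. depth-1."""
    queue = deque([(source, 0)])
    visited = set()
    while queue:
        url, url_depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        # enqueue outgoing links only while the next level is still allowed
        if url_depth + 1 < depth:
            for link in TOY_LINKS.get(url, []):
                queue.append((link, url_depth + 1))
        yield url

print(list(bfs_pages('home', 3)))
```

Because the function is a generator, the caller decides what to do with every page as soon as it is produced, for example index it, without waiting for the whole crawl to finish.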

- `crawl_generator_for_index` method crawls the given website in BFS order, starting from `source` and going no deeper than `depth`. It visits only inner pages (of the form https://www.reuters.com/...). To speed things up, it skips links to non-HTML content: '.pdf', '.mp3', '.avi', '.mp4', '.txt'. When it encounters an article page (of the form https://www.reuters.com/article/...), it saves the content to a file in the `collection_path` folder and populates the inverted index by calling the `index_doc` method. When done, it saves three resulting dictionaries on disk:
    - `doc_urls`, `doc_id:url`
    - `index`, `term:[collection_frequency, (doc_id_1, doc_freq_1), (doc_id_2, doc_freq_2), ...]`
    - `doc_lengths`, `doc_id:doc_length` 

    `limit` parameter is given for testing - if not `None`, break the loop when number of saved articles exceeds the `limit` and return without writing dictionaries to disk.
    
    
- `index_doc` method parses and preprocesses the content of a `doc` and adds it to the inverted index. Also keeps track of document lengths in a `doc_lengths` dictionary.
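The layout of the `index` and `doc_lengths` dictionaries is easiest to see on a tiny example. The sketch below assumes the documents are already preprocessed into term lists; `add_to_index` is a hypothetical helper, not part of the lab's API.

```python
from collections import Counter

def add_to_index(index, doc_lengths, doc_id, terms):
    """Add one preprocessed document (a list of terms) to the index."""
    doc_lengths[doc_id] = len(terms)
    for term, freq in Counter(terms).items():
        if term not in index:
            # element 0 is the collection frequency, postings follow
            index[term] = [0]
        index[term][0] += freq
        index[term].append((doc_id, freq))

index, doc_lengths = {}, {}
add_to_index(index, doc_lengths, 1, ['virus', 'china', 'virus'])
add_to_index(index, doc_lengths, 2, ['china', 'travel'])

print(index)
print(doc_lengths)
```

Here `index['china']` becomes `[2, (1, 1), (2, 1)]`: the term occurs twice in the collection, once in document 1 and once in document 2.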


**Bonus task \*** In real industrial systems a crawler would pass the links to the dedicated service that would load their contents in a bunch of parallel threads. Implement such a service - get urls as inputs, load page contents in parallel and return filenames on disk, which are then processed by indexer.
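A possible starting point for the bonus task is a thread pool from the standard library. This is a sketch under assumptions, not a reference solution: `download_all` and its `fetch` parameter are hypothetical names, and the fetcher is injected so the example can run offline with a fake one.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote

def download_all(urls, fetch, out_dir, max_workers=8):
    """Fetch urls in parallel and return {url: filename} for successful loads.

    `fetch` is a caller-supplied function url -> bytes (or None on failure).
    """
    os.makedirs(out_dir, exist_ok=True)

    def load_one(url):
        content = fetch(url)
        if content is None:
            return url, None
        # percent-encode the whole url so it is a valid file name
        path = os.path.join(out_dir, quote(url, safe=''))
        with open(path, 'wb') as f:
            f.write(content)
        return url, path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(load_one, urls))
    return {url: path for url, path in results if path is not None}

# usage with a fake fetcher, so the example needs no network
def fake_fetch(url):
    return None if 'broken' in url else ('content of ' + url).encode()

saved = download_all(['https://a.test/1', 'https://a.test/broken'],
                     fake_fetch, tempfile.mkdtemp())
print(sorted(saved))
```

In the real service, `fetch` could wrap `requests.get` and return `response.content` on status 200 and `None` otherwise; the indexer would then read the returned files from disk.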


In [0]:
from collections import Counter
from queue import Queue
import pickle
import os
import numpy as np

class Indexer:
  def __init__(self):      
    # dictionaries to populate
    self.doc_urls = {}        
    self.index = {}
    self.doc_lengths = {}
    # preprocessor
    self.prep = Preprocessor()
      
  def crawl_generator_for_index(self, source, depth, collection_path="collection", limit=None):
    q = Queue()
    q.put((source, 0))
    visited = set()
    i = 0
    while not q.empty():
      url, url_depth = q.get()
      if url is None:
        continue
      # visit only inner pages of reuters.com that have not been seen yet
      if url not in visited and url.startswith("https://www.reuters.com"):
        visited.add(url)
        try:
          doc = HtmlDocument(url)
          if not doc.download():
            continue
          doc.parse()
          if url_depth + 1 < depth:
            for _, href in doc.anchors:
              # skip links to non-HTML content to speed up crawling
              if href is not None and not href.lower().endswith(('.pdf', '.mp3', '.avi', '.mp4', '.txt')):
                q.put((href, url_depth + 1))

          # save the document on disk only if it is an article and no test limit is set
          if limit is None and doc.url.startswith("https://www.reuters.com/article/"):
            os.makedirs(collection_path, exist_ok=True)
            doc.persist(collection_path)

          i += 1
          # a bare return ends the generator; the page that hits the limit is not yielded
          if i == limit:
            return
          yield doc
        except FileNotFoundError as e:
          print("Analyzing", url, "led to FileNotFoundError")
  
  def save_index_config(self):
    """Saves the Indexer configuration components as pickle files: inverted index, document lengths, document urls

    >>> import pickle
    >>> indxr = Indexer()
    >>> indxr.save_index_config()
    Saving Indexer Config.....
    >>> with open('inverted_index.pickle', 'rb') as fp:
    ...   index = pickle.load(fp)
    >>> with open('doc_lengths.pickle', 'rb') as fp:
    ...   doc_len = pickle.load(fp)
    >>> with open('doc_urls.pickle', 'rb') as fp:
    ...   url = pickle.load(fp)
    >>> [doc_len, url, index]
    [{}, {}, {}]

    """
    print("Saving Indexer Config.....")
    with open('inverted_index.pickle', 'wb') as f:
      pickle.dump(self.index, f, protocol=pickle.HIGHEST_PROTOCOL)

    with open('doc_lengths.pickle', 'wb') as f:
      pickle.dump(self.doc_lengths, f, protocol=pickle.HIGHEST_PROTOCOL)

    with open('doc_urls.pickle', 'wb') as f:
      pickle.dump(self.doc_urls, f, protocol=pickle.HIGHEST_PROTOCOL)

  def index_doc(self, doc, doc_id):
    """Add document to index"""
    self.doc_urls[doc_id] = doc.url
    words = self.prep.preprocess(doc.text)
    self.doc_lengths[doc_id] = len(words)

    count = Counter(words)
    for term, freq in count.items():
      if term not in self.index:
        # element 0 is the collection frequency, postings follow
        self.index[term] = [freq, (doc_id, freq)]
      else:
        self.index[term][0] += freq
        self.index[term].append((doc_id, freq))

## 2.3. Tests ##

**In the test with a limit and a shallow depth, the indexer may end up empty, because it only saves article pages (`https://www.reuters.com/article/`).** <br>
For this test, all visited pages are indexed instead.

In [0]:
indexer = Indexer()
k = 1
for c in indexer.crawl_generator_for_index("https://www.reuters.com/news/us", 3, "test_collection", 5):
  # if "https://www.reuters.com/article/" in c.url:
  print(k, c.url)
  indexer.index_doc(c,k)
  k+=1
indexer.save_index_config()

assert type(indexer.index) is dict
assert type(indexer.index['reuter']) is list
assert type(indexer.index['reuter'][0]) is int
assert type(indexer.index['reuter'][1]) is tuple

1 https://www.reuters.com/news/us
2 https://www.reuters.com/
3 https://www.reuters.com/finance
4 https://www.reuters.com/finance/markets
Saving Indexer Config.....


## 2.4 Building index

In [0]:
indexer = Indexer()
k = 1
for c in indexer.crawl_generator_for_index("https://www.reuters.com/", 3, "docs_collection"):
    if "https://www.reuters.com/article/" in c.url:
      print(k, c.url)
      indexer.index_doc(c,k)
      k+=1
indexer.save_index_config()

1 https://www.reuters.com/article/us-china-health/u-s-and-others-tighten-curbs-on-travel-to-china-as-virus-toll-hits-213-idUSKBN1ZT374
2 https://www.reuters.com/article/us-china-health-business-impact/factbox-companies-feel-impact-of-coronavirus-outbreak-in-china-idUSKBN1ZU1LG
3 https://www.reuters.com/article/us-china-health-airlines/pilots-flight-attendants-demand-flights-to-china-stop-as-virus-fear-mounts-worldwide-idUSKBN1ZT33W
4 https://www.reuters.com/article/us-china-health-masks-safety/to-mask-or-not-to-mask-confusion-spreads-over-coronavirus-protection-idUSKBN1ZU0PH
5 https://www.reuters.com/article/us-usa-trump-impeachment/end-draws-near-in-trump-impeachment-trial-as-democrats-likely-to-fall-short-in-vote-idUSKBN1ZU1D2
6 https://www.reuters.com/article/us-britain-eu-union/brexit-day-britain-quits-eu-steps-into-transition-twilight-zone-idUSKBN1ZU00R
7 https://www.reuters.com/article/us-climatechange-investors-proxy/top-u-s-fund-firms-split-over-new-limits-on-shareholder-votes-

## 2.5 Index statistics

In [0]:
# load index, doc_lengths and doc_urls
with open('inverted_index.pickle', 'rb') as fp:
  index = pickle.load(fp)
with open('doc_lengths.pickle', 'rb') as fp:
  doc_lengths = pickle.load(fp)
with open('doc_urls.pickle', 'rb') as fp:
  doc_urls = pickle.load(fp)

In [0]:
print('Total index length', len(index))
print('\nTop terms by number of documents they appeared in:')
# note: every posting list starts with the collection frequency,
# so len(kv[1]) is the number of documents plus one
sorted_by_n_docs = sorted(index.items(), key=lambda kv: (len(kv[1]), kv[0]), reverse=True)
print([(sorted_by_n_docs[i][0], len(sorted_by_n_docs[i][1])) for i in range(20)])
print('\nTop terms by overall frequency:')
sorted_by_freq = sorted(index.items(), key=lambda kv: (kv[1][0], kv[0]), reverse=True)
print([(sorted_by_freq[i][0], sorted_by_freq[i][1][0]) for i in range(20)])

Total index length 13971

Top terms by number of documents they appeared in:
[('world', 721), ('use', 721), ('us', 721), ('unit', 721), ('tv', 721), ('thomson', 721), ('term', 721), ('tax', 721), ('state', 721), ('solut', 721), ('site', 721), ('see', 721), ('risk', 721), ('reuter', 721), ('quot', 721), ('privaci', 721), ('polit', 721), ('our', 721), ('newslett', 721), ('news', 721)]

Top terms by overall frequency:
[('reuter', 4970), ('s', 3521), ('said', 3026), ('thomson', 2172), ('all', 1910), ('more', 1617), ('delay', 1506), ('januari', 1486), ('have', 1473), ('advertis', 1456), ('solut', 1452), ('market', 1434), ('state', 1363), ('news', 1343), ('world', 1333), ('unit', 1233), ('govern', 1226), ('not', 1225), ('year', 1191), ('busi', 1146)]


# 3. Answering query

Now that the inverted index is built, it's time to use it for answering user queries. In this class there are two methods you need to implement:
- `boolean_retrieval`, the simplest form of document retrieval which returns a set of documents such that each one contains all query terms. Returns a set of document ids. Refer to *ch.1* of the book for details;
- `okapi_scoring`, Okapi BM25 ranking function - assigns scores to documents in the collection that are relevant to the user query. Returns a dictionary of scores, `doc_id:score`. Read about it in [Wikipedia](https://en.wikipedia.org/wiki/Okapi_BM25#The_ranking_function) and implement accordingly.

Both methods accept the `query` parameter as a dictionary, `term:frequency`.
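For reference, the BM25 score of a document $D$ for a query $Q = (q_1, \dots, q_n)$ can be written as below. Here $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents and $n(q_i)$ is the number of documents containing $q_i$. The IDF is given in its classic form, which is what the code in this section computes; some presentations add 1 inside the logarithm to keep it non-negative.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q_i) = \ln \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
```

The free parameters default to $k_1 = 1.2$ and $b = 0.75$ in the method signature.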

In [0]:
from collections import Counter, OrderedDict
import math

class QueryProcessing:
    
    @staticmethod
    def prepare_query(raw_query):
        prep = Preprocessor()
        # pre-process query the same way as documents
        query = prep.preprocess(raw_query)
        # count frequency
        return Counter(query)
    
    @staticmethod
    def boolean_retrieval(query, index):
      """retrieve a set of documents containing all query terms"""
      result = None
      for word in query:
        if word not in index:
          # a term absent from the index cannot occur in any document
          print(f"{word} not in any of indexed documents!!")
          return set()
        # postings start at position 1; position 0 holds the collection frequency
        word_docs = {doc_id for doc_id, _ in index[word][1:]}
        result = word_docs if result is None else result & word_docs
        if not result:
          print("No article(s) containing all query terms")
          return set()

      # an empty query matches nothing
      return result if result is not None else set()

    
    @staticmethod
    def okapi_scoring(query, doc_lengths, index, k1=1.2, b=0.75):
      """retrieve relevant documents with scores"""
      result_rank = {} #OrderedDict()
      N = len(doc_lengths) #total number of documents indexed
      avgdl = sum(doc_lengths.values())/N

      for doc_id, doc_len in doc_lengths.items():
        for word in query:
          word_in_doc = index.get(word)
          if word_in_doc is None: continue # query term is not in the index
          tf = dict(word_in_doc[1:]).get(doc_id) # term frequency in current document
          if tf is None: continue # current document does not contain the term

          in_n_documents = len(word_in_doc) - 1 # the number of documents containing current query word
          idf = np.log((N - in_n_documents + 0.5) / (in_n_documents + 0.5)) # inverse document frequency

          main_fraction = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * (doc_len / avgdl)) + tf)

          result_rank[doc_id] = result_rank.get(doc_id, 0) + idf * main_fraction

      #result_rank = sorted(result_rank.items(), key=lambda x: x[1],reverse=True)

      return result_rank

## 3.1 Tests 

In [0]:
test_doc_lengths = {1: 20, 2: 15, 3: 10, 4:20, 5:30}
test_index = {'x': [2, (1, 1), (2, 1)], 'y': [2, (1, 1), (3, 1)], 'z': [3, (2, 1), (4,2)]}


test_query1 = QueryProcessing.prepare_query('x z')
test_query2 = QueryProcessing.prepare_query('x y')


assert QueryProcessing.boolean_retrieval(test_query1, test_index) == {2}
assert QueryProcessing.boolean_retrieval(test_query2, test_index) == {1}
okapi_res = QueryProcessing.okapi_scoring(test_query2, test_doc_lengths, test_index)
assert all(k in okapi_res for k in (1,2,3))
assert not any(k in okapi_res for k in (4,5))
assert okapi_res[1] > okapi_res[3] > okapi_res[2]

In [0]:
okapi_res

{1: 0.6587606318860281, 2: 0.3681816620619555, 3: 0.41734538548269146}
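The score of document 1 can be reproduced by hand, which is a useful sanity check of the formula. The sketch below repeats the arithmetic for the test data with `math.log` instead of `np.log`; the variable names are chosen for this example only.

```python
import math

k1, b = 1.2, 0.75
doc_lengths = {1: 20, 2: 15, 3: 10, 4: 20, 5: 30}
N = len(doc_lengths)                       # 5 documents
avgdl = sum(doc_lengths.values()) / N      # 95 / 5 = 19.0

# both 'x' and 'y' occur once in document 1 and appear in 2 documents each
n_docs_with_term, tf, doc_len = 2, 1, doc_lengths[1]

idf = math.log((N - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))
tf_part = (k1 + 1) * tf / (k1 * (1 - b + b * doc_len / avgdl) + tf)

score_doc1 = 2 * idf * tf_part             # two query terms contribute equally
print(round(score_doc1, 4))
```

The result rounds to 0.6588, in agreement with the value for document 1 above. Document 3 outranks document 2 although each matches one term with the same frequency, because document 3 is shorter and BM25 penalises length.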

In [0]:
test_query1 = QueryProcessing.prepare_query('Organize engine')
r2 = QueryProcessing.boolean_retrieval(test_query1, index)

# 4. Setting up a server

**Bonus task \*** Organize the resulting search engine as a web service that takes a query from GET parameters and returns urls with scores as a `json` dictionary. Check that it works in a browser or with curl; the result should look something like this:
 
`> curl localhost:8080/?q=some_query_text
{ "url1" : 0.9, "url2": 0.8 }`

You can use one of the following tools for this task: https://www.acmesystems.it/python_http, http.server.ThreadingHTTPServer (3.7+) https://docs.python.org/3/library/http.server.html#http.server.SimpleHTTPRequestHandler
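A minimal sketch of such a service with `http.server` could look as follows. It is one possible shape, not a reference solution: `make_handler`, `serve` and `toy_search` are hypothetical names, and `toy_search` is a placeholder for a function that would call `prepare_query` and `okapi_scoring` and map document ids to urls.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse, parse_qs

def make_handler(search):
    """Build a request handler around `search`, a function query -> {url: score}."""

    class SearchHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # read the query text from the `q` GET parameter
            params = parse_qs(urlparse(self.path).query)
            raw_query = params.get('q', [''])[0]
            body = json.dumps(search(raw_query)).encode('utf-8')
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the example quiet

    return SearchHandler

def toy_search(raw_query):
    # hypothetical placeholder: gives every query term a fake url with score 1.0
    return {'https://example.com/' + term: 1.0 for term in raw_query.split()}

def serve(search, port=8080):
    # blocks until interrupted
    server = ThreadingHTTPServer(('localhost', port), make_handler(search))
    server.serve_forever()
```

Calling `serve(toy_search)` starts the server; `curl "localhost:8080/?q=brexit+vote"` should then return a JSON dictionary with one entry per query term.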

In [0]:
#TODO write a web-service that answers queries using inverted index


