# Lab 2. Building inverted index and answering queries

In this lab you are going to implement a standard document processing pipeline and then build a simple search engine based on it: starting from crawling documents, then building an inverted index, answering queries using this index, and organizing it as a simple web server.

# 1. Preprocessing

First, we need a unified approach to documents preprocessing, and this class is responsible for it. Complete the code for given functions (most of them are just one-liners) and make sure you pass the tests. Make use of `nltk` library.

In [1]:
import nltk
from nltk.corpus import stopwords

class Preprocessor:
    
    def __init__(self):
        self.stop_words = {'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'has', 'he', 'in', 'is', 'it', 'its',
                      'of', 'on', 'that', 'the', 'to', 'was', 'were', 'will', 'with'}
        self.ps = nltk.stem.PorterStemmer()

    
    def tokenize(self, text):
        #TODO word tokenize text using nltk lib
        return nltk.word_tokenize(text)

    
    def stem(self, word, stemmer):
        #TODO stem word using provided stemmer
        return stemmer.stem(word)

    
    def is_apt_word(self, word):
        #TODO check if word is appropriate - not a stop word and isalpha, 
        # i.e consists of letters, not punctuation, numbers, dates
        if word not in self.stop_words and word.isalpha():
            return True
        return False

    
    def preprocess(self, text):
        #TODO combine all previous methods together: tokenize lowercased text 
        # and stem it, ignoring not appropriate words
        text = text.lower()
        token = self.tokenize(text)
        stem = [self.stem(word, self.ps) for word in token]
        res = []
        for word in stem:
            if self.is_apt_word(word):
                res.append(word)
        return res

## 1.1. Tests ##

In [2]:
prep = Preprocessor()
text = 'To be, or not to be, that is the question'
assert prep.tokenize(text) == ['To', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question']
assert prep.stem('retrieval', prep.ps) == 'retriev'
assert prep.is_apt_word('qwerty123') is False
assert prep.preprocess(text) == ['or', 'not', 'question']

# 2. Crawling and Indexing

## 2.1 Base classes

Here are some base classes we will need for writing our indexer. The code from the last lab's solution is given, but note that you will need to change some of it, namely, the `parse` function. The reason is it always makes complete parsing, which we want to avoid when we only need links, for example, or a specific portion of text.

In [3]:
import requests
from urllib.parse import quote
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse
import os


class Document:

    def __init__(self, url):
        self.url = url

    def download(self):
        try:
            response = requests.get(self.url)
            if response.status_code == 200:
                self.content = response.content
                return True
            else:
                return False
        except:
            return False

    def persist(self, path):
        u = quote(self.url).replace('/', '_')
        p = os.path.join(f'{path}',u)
        with open(p, 'wb') as f:
            f.write(self.content)


class HtmlDocument(Document):

    def normalize(self, href):
        if href is not None and href[:4] != 'http':
            href = urllib.parse.urljoin(self.url, href)
        return href

    def parse(self):
        #TODO change this method
        def tag_visible(element):
            if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
                return False
            if isinstance(element, Comment):
                return False
            return True
            
        
        model = BeautifulSoup(self.content)
        
        self.anchors = []
        a = model.find_all('a')
        for anchor in a:
            href = self.normalize(anchor.get('href'))
            text = anchor.text
            self.anchors.append((text, href))
                        
        self.images = []
        i = model.find_all('img')
        for img in i:
            href = self.normalize(img.get('src'))
            self.images.append(href)
        
        texts = model.findAll(text=True)
        visible_texts = filter(tag_visible, texts)  
        self.text = u" ".join(t.strip() for t in visible_texts)

## 2.2 Main class

The main indexer logic is here. We organize it as a crawler generator that adds certain visited pages to inverted index and saves them on disk. 

- `crawl_generator_for_index` method crawles the given website doing BFS, starting from `source` within given `depth`. Considers only inner pages (of a form https://www.reuters.com/...) for visiting. To speed up, doesn't consider for visiting pages with content type other than html: '.pdf', '.mp3', '.avi', '.mp4', '.txt'. If encounters an article page (of a form https://www.reuters.com/article/...), saves its content in a file in `collection_path` folder and populates the inverted index calling `index_doc` method. When done, saves on disk three resulting dictionaries:
    - `doc_urls`, `doc_id:url`
    - `index`, `term:[collection_frequency, (doc_id_1, doc_freq_1), (doc_id_2, doc_freq_2), ...]`
    - `doc_lengths`, `doc_id:doc_length` 

    `limit` parameter is given for testing - if not `None`, break the loop when number of saved articles exceeds the `limit` and return without writing dictionaries to disk.
    
    
- `index_doc` method parses and preprocesses the content of a `doc` and adds it to the inverted index. Also keeps track of document lengths in a `doc_lengths` dictionary.


**Bonus task \*** In real industrial systems a crawler would pass the links to the dedicated service that would load their contents in a bunch of parallel threads. Implement such a service - get urls as inputs, load page contents in parallel and return filenames on disk, which are then processed by indexer.


In [4]:
from collections import Counter
from queue import Queue
import pickle
import os

class Indexer:

    def __init__(self):      
        # dictionaries to populate
        self.doc_urls = {}        
        self.index = {}
        self.doc_lengths = {}
        # preprocessor
        self.prep = Preprocessor()
    
    @staticmethod
    def is_valid(url):
        return 'www.reuters.com' in url and url[-4:] not in ['.pdf', '.mp3', '.avi', '.mp4', '.txt']
    
    @staticmethod
    def is_article(url):
        return  'www.reuters.com/article' in url
    
    def crawl_generator_for_index(self, source, depth, collection_path="collection", limit=None):        
        #TODO generate url-s for visiting
        if not os.path.exists(collection_path):
            os.makedirs(collection_path)
        
        q = Queue()
        q.put((source, 0))
        visited = set()
        while not q.empty():
            url, url_depth = q.get()
            if url and url not in visited and self.is_valid(url):
                visited.add(url)
                self.doc_urls[len(self.doc_urls)] = url
                try:
                    doc = HtmlDocument(url)
                    doc.download()
                    doc.parse()
                    for a in doc.anchors:
                        if url_depth + 1 < depth:
                            q.put((a[1], url_depth + 1))
                    # если там артикл, то надо это сохранить на диск, то есть сделать персист и распарсить в индекс
                    if self.is_article(url):
                        doc.persist(f'{collection_path}')
                        self.index_doc(doc, len(self.doc_urls)-1)
                    yield doc
                except FileNotFoundError as e:
                    print("Analyzing", url, "led to FileNotFoundError")
                except AttributeError:
                    continue
                    
        # when finished save dictionaries to disk
        with open('doc_urls.p', 'wb') as f:
            pickle.dump(self.doc_urls, f)
        with open('inverted_index.p', 'wb') as f:
            pickle.dump(self.index, f)
        with open('doc_lengths.p', 'wb') as f:
            pickle.dump(self.doc_lengths, f)
        
    def index_doc(self, doc, doc_id):
        #TODO add documents to index
        doc_processed = self.prep.preprocess(doc.text)
        self.doc_lengths[doc_id] = len(doc_processed)
        doc_processed = Counter(doc_processed)
        # потом проходимся по словам и заполняем индекс
        for term in doc_processed:
            if term not in self.index:
                # просто добавить новое слово
                self.index[term]=[1, (doc_id,doc_processed[term])]
            else:
                # иначе делаем апдейт
                old_values = self.index[term]
                old_values[0] = old_values[0]+1
                old_values.append((doc_id,doc_processed[term]))
                self.index[term] = old_values
    

## 2.3. Tests ##

In [5]:
indexer = Indexer()
k = 1
for c in indexer.crawl_generator_for_index("https://www.reuters.com/news/us", 2, "test_collection", 5):
    print(k, c.url)
    k+=1

assert type(indexer.index) is dict
assert type(indexer.index['reuter']) is list
assert type(indexer.index['reuter'][0]) is int
assert type(indexer.index['reuter'][1]) is tuple

1 https://www.reuters.com/news/us
2 https://www.reuters.com/
3 https://www.reuters.com/finance
4 https://www.reuters.com/finance/markets
5 https://www.reuters.com/news/world
6 https://www.reuters.com/politics
7 https://www.reuters.com/video
8 https://www.reuters.com/news/archive/worldNews
9 https://www.reuters.com/article/us-china-health-usa/trump-says-u-s-has-shut-down-coronavirus-threat-china-shuns-u-s-help-idUSKBN1ZW0OJ
10 https://www.reuters.com/news/archive/politicsNews
11 https://www.reuters.com/article/us-usa-trump-impeachment/after-controversial-trial-u-s-senate-poised-to-acquit-trump-idUSKBN1ZX1ER
12 https://www.reuters.com/article/us-usa-election/democratic-white-house-contenders-face-first-test-in-iowa-idUSKBN1ZX1G5
13 https://www.reuters.com/article/us-china-health-usa-california/u-s-confirms-11th-case-of-new-coronavirus-idUSKBN1ZW0WG
14 https://www.reuters.com/article/us-china-health-usa/u-s-will-send-more-flights-to-bring-back-citizens-from-hubei-province-pompeo-idUSKBN1Z

## 2.4 Building index

In [6]:
indexer = Indexer()
k = 1
for c in indexer.crawl_generator_for_index("https://www.reuters.com/", 3, "docs_collection"):
    print(k, c.url)
    k+=1

1 https://www.reuters.com/
2 https://www.reuters.com/home
3 https://www.reuters.com/finance
4 https://www.reuters.com/legal
5 https://www.reuters.com/finance/deals
6 https://www.reuters.com/subjects/aerospace-and-defense
7 https://www.reuters.com/subjects/banks
8 https://www.reuters.com/subjects/autos
9 https://www.reuters.com/finance/summits
10 https://www.reuters.com/z-factor
11 https://www.reuters.com/subjects/sustainable-business
12 https://www.reuters.com/the-world-at-work
13 https://www.reuters.com/finance/markets
14 https://www.reuters.com/finance/markets/us
15 https://www.reuters.com/finance/markets/europe
16 https://www.reuters.com/finance/markets/asia
17 https://www.reuters.com/finance/global-market-data
18 https://www.reuters.com/markets/stocks
19 https://www.reuters.com/markets/bonds
20 https://www.reuters.com/markets/currencies
21 https://www.reuters.com/markets/commodities
22 https://www.reuters.com/finance/funds
23 https://www.reuters.com/finance/EarningsUS
24 https://ww

103 https://www.reuters.com/article/us-europe-migrants-greece-lesbos/greek-police-fire-teargas-at-protesting-migrants-refugees-on-lesbos-idUSKBN1ZX1QW
104 https://www.reuters.com/article/us-britain-eu/brexit-trade-deal-clash-uk-and-eu-spar-over-rules-idUSKBN1ZW0UJ
105 https://www.reuters.com/article/us-ryanair-results/ryanair-talks-tough-on-compensation-as-737-max-woes-cloud-growth-target-idUSKBN1ZX0HH
106 https://www.reuters.com/article/us-israel-palestinians-economy-marriages/gazan-bridegrooms-end-up-in-jail-over-unpaid-debts-idUSKBN1ZX1Q4
107 https://www.reuters.com/news/picture/top-photos-of-the-day-idUSRTS30M6R
108 https://www.reuters.com/article/us-britain-eu-terms-factbox/what-britain-wants-johnson-outlines-post-brexit-trade-deal-idUSKBN1ZX1QE
109 https://www.reuters.com/article/us-britain-eu-germany/merkel-prepared-for-eu-treaty-changes-as-brexit-requires-bloc-to-be-more-competitive-idUSKBN1ZX1RP
110 https://www.reuters.com/article/us-britain-eu-finance/eu-says-financial-relati

172 http://www.reuters.com/on-the-case
173 https://www.reuters.com/article/us-otc-facebook/does-facebooks-550-million-settlement-change-the-privacy-class-action-game-idUSKBN1ZT33O
174 https://www.reuters.com/article/us-otc-adr/foreign-issuers-beware-toshiba-didnt-sponsor-adrs-but-investor-class-action-gets-green-light-idUSKBN1ZS318
175 https://www.reuters.com/article/us-otc-fca/doj-eyes-requirement-that-false-claims-act-whistleblowers-disclose-litigation-funding-idUSKBN1ZR2VU
176 https://www.reuters.com/news/archive/businessNews
177 https://www.reuters.com/article/us-astonmartin-stroll-funding/aston-martins-lifeline-buys-carmaker-time-as-suv-hits-road-idUSKBN1ZU22W
178 https://www.reuters.com/news/archive/innovationNews
179 https://www.reuters.com/article/us-taqa-m-a-adpower/abu-dhabi-power-to-take-control-of-taqa-in-asset-swap-idUSKBN1ZX0JB
180 https://www.reuters.com/article/us-usa-india-lng/petronet-lng-to-sign-2-5-billion-u-s-gas-deal-during-trumps-india-visit-idUSKBN1ZX0AK
181 htt

234 https://www.reuters.com/news/archive/autos-upclose?view=page&page=2&pageSize=10
235 https://www.reuters.com/article/china-health-philippines/philippines-duterte-says-xenophobia-against-chinese-must-stop-idUSL4N2A33N6
236 https://www.reuters.com/article/ryanair-results/update-4-ryanair-talks-tough-on-compensation-as-737-max-woes-cloud-growth-target-idUSL8N2A30NN
237 https://www.reuters.com/article/idUSL8N2A355X
238 https://www.reuters.com/summit/Investment20
239 https://www.reuters.com/article/us-investment-summit-protests/investors-wary-as-social-unrest-spreads-from-hong-kong-to-santiago-idUSKBN1XL0J6
240 https://www.reuters.com/finance/summits/past
241 https://www.reuters.com/summit/Investment19
242 https://www.reuters.com/summit/Commodities18
243 https://www.reuters.com/summit/Investment18
244 https://www.reuters.com/summit/Cybersecurity17
245 https://www.reuters.com/summit/Commodities17
246 https://www.reuters.com/summit/FinancialRegulation17
247 https://www.reuters.com/summit/R

306 https://www.reuters.com/article/europe-stocks/european-shares-inch-higher-on-brexit-relief-coronavirus-fears-cap-gains-idUSL4N2A32DG
307 https://www.reuters.com/article/us-europe-stocks/european-shares-drop-on-coronavirus-cases-weak-euro-zone-data-idUSKBN1ZU0WR
308 https://www.reuters.com/article/europe-stocks/update-3-european-shares-drop-on-coronavirus-cases-weak-euro-zone-data-idUSL4N2A03DI
309 https://www.reuters.com/article/us-europe-stocks-coronavirus-graphic/why-the-devil-coronavirus-has-hit-european-stocks-hard-idUSKBN1ZR23H
310 https://www.reuters.com/article/europe-stocks/european-shares-climb-in-early-trading-on-brexit-day-idUSL4N2A0365
311 https://www.reuters.com/article/europe-stocks/update-2-weak-earnings-hit-europe-amid-virus-fears-ftse-slides-as-boe-stands-pat-idUSL4N29Z2P6
312 https://www.reuters.com/article/us-europe-stocks/european-shares-slide-on-weak-earnings-china-virus-epidemic-idUSKBN1ZT0TN
313 https://www.reuters.com/article/europe-stocks/european-shares-sk

389 https://www.reuters.com/finance/markets/index?symbol=.AORD
390 https://www.reuters.com/finance/markets/index?symbol=.KS11
391 https://www.reuters.com/finance/markets/index?symbol=.SETI
392 https://www.reuters.com/finance/markets/index?symbol=.JKSE
393 https://www.reuters.com/finance/markets/index?symbol=.PSI
394 https://www.reuters.com/finance/markets/index?symbol=.SSEC
395 https://www.reuters.com/finance/markets/index?symbol=.BSESN
396 https://www.reuters.com/finance/markets/index?symbol=.FTFBMKLCI
397 https://www.reuters.com/finance/markets/index?symbol=.HNX30
398 https://www.reuters.com/finance/markets/index?symbol=.TRXFLDEUPU
399 https://www.reuters.com/sectors/energy
400 https://www.reuters.com/sectors/industries/overview?industryCode=193
401 https://www.reuters.com/sectors/industries/overview?industryCode=6
402 https://www.reuters.com/sectors/industries/overview?industryCode=190
403 https://www.reuters.com/sectors/basic-materials
404 https://www.reuters.com/sectors/industries

500 https://www.reuters.com/article/uk-usa-markets-dollar-analysis/death-cross-growth-abroad-threaten-u-s-dollar-idUSKBN1ZL0IN
501 https://www.reuters.com/article/uk-global-forex/yuan-weakens-safe-havens-gain-on-chinese-virus-concerns-idUSKBN1ZK022
502 https://www.reuters.com/article/uk-britain-sterling/sterling-rallies-after-uk-jobs-growth-weakens-case-for-rate-cut-idUSKBN1ZK0T1
503 https://www.reuters.com/article/uk-britain-sterling/sterling-falls-after-javid-comments-stoke-hard-brexit-fears-idUSKBN1ZJ0WZ
504 https://www.reuters.com/article/us-global-forex/dollar-gains-as-u-s-economic-strength-supports-sentiment-idUSKBN1ZJ037
505 https://www.reuters.com/article/uk-cftc-forex/speculators-cut-long-dollar-bets-to-19-month-low-cftc-reuters-idUSKBN1ZG2DX
506 https://www.reuters.com/article/us-global-forex/dollar-gains-on-u-s-economic-optimism-idUSKBN1ZG045
507 https://www.reuters.com/article/uk-britain-sterling/pound-reverses-gains-after-bleak-british-retail-sales-idUSKBN1ZG0X2
508 https:

583 https://www.reuters.com/finance/stocks/overview?symbol=SXI.N
584 https://www.reuters.com/finance/stocks/overview?symbol=HLIT.OQ
585 https://www.reuters.com/finance/stocks/overview?symbol=PCH.OQ
586 https://www.reuters.com/finance/stocks/overview?symbol=VVV.N
587 https://www.reuters.com/finance/stocks/overview?symbol=HIG.N
588 https://www.reuters.com/finance/stocks/overview?symbol=RBC.N
589 https://www.reuters.com/finance/stocks/overview?symbol=PAHC.OQ
590 https://www.reuters.com/finance/stocks/overview?symbol=MJCO.A
591 https://www.reuters.com/finance/stocks/overview?symbol=BECN.OQ
592 https://www.reuters.com/finance/stocks/overview?symbol=KRC.N
593 https://www.reuters.com/finance/stocks/overview?symbol=LUB.N
594 https://www.reuters.com/finance/stocks/overview?symbol=DLA.A
595 https://www.reuters.com/finance/stocks/overview?symbol=AFG.N
596 https://www.reuters.com/finance/stocks/overview?symbol=GOOGL.OQ
597 https://www.reuters.com/finance/stocks/overview?symbol=CBT.N
598 https://ww

658 https://www.reuters.com/article/us-usa-fundraisers-scampacs-side/how-scam-pacs-fall-through-the-cracks-of-u-s-regulators-idUSKBN1ZS29B
659 https://www.reuters.com/article/us-usa-courts-secrecy-regulators-special/special-report-how-secrecy-in-u-s-courts-hobbles-regulators-idUSKBN1ZF1G9
660 https://www.reuters.com/article/us-india-citizenship-protests-kanhaiyaku/in-india-a-firebrands-anti-modi-mantra-resonates-at-nationwide-protests-idUSKBN1ZE0LI
661 https://www.reuters.com/article/us-gold-mining-artisanal-explainer/what-is-artisanal-gold-and-why-is-it-booming-idUSKBN1ZE0YU
662 https://www.reuters.com/article/us-gold-africa-refineries-insight/race-to-refine-the-bid-to-clean-up-africas-gold-rush-idUSKBN1ZE0YG
663 https://www.reuters.com/article/us-usa-trade-supplychains-insight/u-s-bike-firms-face-uphill-slog-to-replace-chinese-supply-chains-idUSKBN1ZD1FV
664 https://www.reuters.com/article/us-china-aviation-comac-insight/chinas-bid-to-challenge-boeing-and-airbus-falters-idUSKBN1Z905N

744 https://www.reuters.com/article/us-china-health-pakistan/pakistan-resumes-flights-to-and-from-china-screens-passengers-for-virus-idUSKBN1ZX1RS
745 https://www.reuters.com/article/us-health-china-sport-factbox/factbox-events-affected-due-to-coronavirus-epidemic-idUSKBN1ZX0QB
746 https://www.reuters.com/article/us-china-health-latest/latest-on-the-coronavirus-spreading-in-china-and-beyond-idUSKBN1ZX054
747 https://www.reuters.com/article/us-brazil-coronavirus/brazil-draws-up-plan-to-evacuate-nationals-from-chinas-coronavirus-epicenter-idUSKBN1ZX1L0
748 https://www.reuters.com/article/us-china-health-xi/chinas-xi-says-coronavirus-control-the-most-important-task-idUSKBN1ZX1JH
749 https://www.reuters.com/article/us-china-health-pets/in-virus-stricken-wuhan-animal-lovers-break-into-homes-to-save-pets-idUSKBN1ZX1I2
750 https://www.reuters.com/article/us-china-health-japan-ship/japan-to-quarantine-cruise-ship-on-which-virus-patient-sailed-idUSKBN1ZX1GI
751 https://www.reuters.com/article/u

808 https://www.reuters.com/article/us-usa-election-iowa-caucuses-explainer/explainer-why-iowa-how-a-little-rural-state-picks-presidential-nominees-idUSKBN1ZX14X
809 https://www.reuters.com/article/us-usa-election-voters/ahead-of-crucial-vote-anxious-iowa-democrats-grapple-with-tough-choices-idUSKBN1ZW0MI
810 https://www.reuters.com/article/us-usa-election-iowa/volunteers-flock-to-iowa-for-high-stakes-democratic-nominating-contest-idUSKBN1ZW05O
811 https://www.reuters.com/article/us-usa-election-timeline/off-to-the-races-key-dates-on-the-u-s-presidential-election-calendar-idUSKBN1ZX157
812 https://www.reuters.com/article/us-usa-trump-impeachment-whatnext-factbo/factbox-trump-impeachment-what-happens-next-idUSKBN1ZW09W
813 https://www.reuters.com/article/us-usa-election-steyer/black-democratic-group-co-chair-endorses-steyer-in-south-carolina-idUSKBN1ZW0D6
814 https://www.reuters.com/article/us-usa-election-trump/trump-bloomberg-trade-schoolyard-taunts-as-spending-war-heats-up-idUSKBN1ZW

872 https://www.reuters.com/article/us-china-health-australia/australia-scientists-to-share-lab-grown-coronavirus-to-hasten-vaccine-efforts-idUSKBN1ZR2YD
873 https://www.reuters.com/article/us-science-footprints/karoo-firewalkers-dinosaurs-braved-south-africas-land-of-lava-idUSKBN1ZS2XS
874 https://www.reuters.com/article/us-food-tech-labmeat-shrimp/singapores-shiok-meats-hopes-to-hook-diners-with-lab-grown-shrimp-idUSKBN1ZR12N
875 https://www.reuters.com/article/us-chile-environment/chilean-scientists-scramble-to-save-last-of-desert-frogs-from-extinction-idUSKBN1ZR2KZ
876 https://www.reuters.com/article/us-china-health-science/china-science-database-scraps-paywall-to-aid-virus-battle-idUSKBN1ZS0T1
877 https://www.reuters.com/article/us-science-crater/did-asteroid-that-hit-australia-help-thaw-ancient-snowball-earth-idUSKBN1ZM39N
878 https://www.reuters.com/article/us-science-hair/hairy-situation-scientists-explain-how-stress-related-graying-occurs-idUSKBN1ZL2UB
879 https://www.reuters.

935 https://www.reuters.com/article/us-tech-ces-intel/intels-mobileye-demos-autonomous-car-equipped-only-with-cameras-no-other-sensors-idUSKBN1Z6091
936 https://www.reuters.com/article/us-tech-ces-qualcomm/qualcomm-launches-autonomous-driving-computer-aiming-to-hit-roads-by-2023-idUSKBN1Z51YH
937 https://www.reuters.com/article/us-tech-ces-amazon-com/amazon-to-showcase-its-transportation-drive-at-worlds-largest-tech-show-idUSKBN1Z51DF
938 https://www.reuters.com/article/us-tech-ces-washington/trump-administration-officials-to-talk-tech-policy-at-las-vegas-confab-idUSKBN1Z2235
939 https://www.reuters.com/news/picture/best-of-ces-idUSRTS2XEPS
940 https://www.reuters.com/news/archive/consumer-electronics-show?view=page&page=2&pageSize=10
941 https://www.reuters.com/article/us-ingenico-m-a-worldline-breakingviews/breakingviews-worldlines-latest-deal-stretches-payments-fervour-idUSKBN1ZX1MJ
942 https://www.reuters.com/article/us-china-virus-stocks-breakingviews/breakingviews-xis-market-watc

1021 https://www.reuters.com/video/watch/ryanair-says-max-woes-may-delay-growth-b-idOVBZ2DAIZ?chan=9qsux198
1022 https://www.reuters.com/video/watch/miners-face-funding-squeeze-as-investors-idOVBZ2ENHV?chan=9qsux198
1023 https://www.reuters.com/video/watch/breakingviews-tv-predictions-2020-rebuil-idRCV007S32?chan=9qsux198
1024 https://www.reuters.com/video/watch/uk-escapes-longest-factory-slowdown-sinc-idOVBZ2DEH7?chan=9qsux198
1025 https://www.reuters.com/video/watch/bloomberg-floats-wealth-tax-for-people-l-idRCV007S6P?chan=9qsux198
1026 https://www.reuters.com/video/watch/international-precautions-as-coronavirus-idOVBYSDMQZ?chan=9qsux198
1027 https://www.reuters.com/video/watch/virus-fears-weak-data-spark-600-pt-dow-p-idRCV007S3S?chan=9qsux198
1028 https://www.reuters.com/video/watch/week-ahead-jobs-google-earnings-and-iowa-idOVBYIEFM3?chan=9qsux198
1029 https://www.reuters.com/video/watch/this-market-sell-off-is-a-buying-opportu-idOVBYIEH6Z?chan=9qsux198
1030 https://www.reuters.com

1098 https://www.reuters.com/article/us-hedgefunds-valueact/valueacts-ubben-cheers-blackrocks-new-stance-on-climate-change-idUSKBN1ZE096
1099 https://www.reuters.com/article/us-money-divorce-finances/your-money-avoid-divorce-money-regrets-by-taking-control-now-idUSKBN1ZD0EA
1100 https://www.reuters.com/news/archive/personalfinance?view=page&page=2&pageSize=10
1101 https://www.reuters.com/subjects/life-lessons
1102 https://www.reuters.com/article/us-money-lifelessons-bellamyyoung/presidential-material-life-lessons-with-bellamy-young-idUSKBN1YL253
1103 https://www.reuters.com/article/us-money-lifelessons-harrisrosen/local-hero-florida-hotelier-harris-rosen-keeps-his-giving-close-to-home-idUSKBN1Y01VY
1104 https://www.reuters.com/article/us-money-lifelessons-brookeshields/a-model-life-life-lessons-with-brooke-shields-idUSKBN1W91C0
1105 https://www.reuters.com/article/us-usa-election-economy-factbox/as-trump-touts-gains-in-jobs-some-democrats-push-for-economic-overhaul-idUSKBN1ZR17A
1106 h

1161 https://www.reuters.com/news/archive/soccer-usa
1162 https://www.reuters.com/article/us-soccer-usa-tfc-bradley/torontos-bradley-faces-four-months-out-after-ankle-surgery-idUSKBN1ZJ2A7
1163 https://www.reuters.com/article/us-soccer-usa/lloyd-rapinoe-anchor-u-s-olympic-qualifying-roster-idUSKBN1ZG1ZO
1164 https://www.reuters.com/article/us-soccer-usa/general-manager-mcbride-sees-positive-future-for-u-s-mens-team-idUSKBN1ZC29P
1165 https://www.reuters.com/news/archive/soccer-brazil
1166 https://www.reuters.com/article/us-soccer-brazil/flamengo-big-winners-at-brazils-player-of-the-year-awards-idUSKBN1YE06O
1167 https://www.reuters.com/article/us-soccer-brazil-gre-inl-report/gremio-easily-overcome-city-rivals-internacional-2-0-idUSKBN1XE02F
1168 https://www.reuters.com/article/us-soccer-brazil-fla-cth-report/flamengo-win-4-1-and-continue-march-toward-serie-a-title-idUSKBN1XD0JK
1169 https://www.reuters.com/news/archive/soccer-england
1170 https://www.reuters.com/article/us-soccer-engla

1233 https://www.reuters.com/article/us-motor-f1-miami/f1-changes-planned-miami-gp-layout-after-local-opposition-idUSKBN1ZL02Z
1234 https://www.reuters.com/news/archive/football-nfl
1235 https://www.reuters.com/article/us-football-nfl-superbowl-boone/yankees-manager-boone-on-the-money-with-super-bowl-prediction-idUSKBN1ZX0CG
1236 https://www.reuters.com/news/archive/sport-cricket
1237 https://www.reuters.com/article/us-cricket-nzl-eng-ban/spectator-who-racially-abused-archer-banned-for-two-years-idUSKBN1ZC2D7
1238 https://www.reuters.com/article/us-australia-bushfires-cricket/australian-cricketers-paine-lyon-see-mind-blowing-fire-devastation-idUSKBN1Z8114
1239 https://www.reuters.com/article/us-cricket-test-aus-nzl/head-trusts-officials-on-bushfire-smoke-in-sydney-test-idUSKBN1YZ0EU
1240 https://www.reuters.com/news/archive/sport-sailing
1241 https://www.reuters.com/article/us-sailing-australia/sailing-comanche-grabs-early-lead-after-slow-start-in-sydney-hobart-race-idUSKBN1YU0GS
1242 

1294 https://www.reuters.com/news/archive/oddlyEnoughNews
1295 https://www.reuters.com/article/us-britain-politics-psychic/after-britain-appeals-for-weirdos-spoon-bender-uri-geller-applies-idUSKBN1Z71JY
1296 https://www.reuters.com/article/us-thailand-monk-cat/cat-vs-chants-friendly-feline-tests-buddhist-monks-patience-idUSKBN1Z20Z2
1297 https://www.reuters.com/article/us-solar-eclipse-egg-standing/egg-standing-test-goes-viral-as-ring-of-fire-eclipse-crosses-asia-idUSKBN1YU0MZ
1298 https://www.reuters.com/article/us-colorado-bankrobber/colorado-bank-robber-throws-cash-in-air-shouting-merry-christmas-idUSKBN1YT024
1299 https://www.reuters.com/article/us-indonesia-cats/indonesian-housewife-tackles-homelessness-for-250-feral-cats-idUSKBN1YR1G5
1300 https://www.reuters.com/article/us-usa-trade-wakanda/wakanda-free-trade-forever-fictional-nation-removed-from-u-s-trade-list-idUSKBN1YN0FN
1301 https://www.reuters.com/article/us-new-year-japan-zodiac-window-cleaners/you-dirty-rat-zodiac-window

1407 https://www.reuters.com/news/archive/BigStory10


## 2.5 Index statistics

In [5]:
# load index, doc_lengths and doc_urls
with open('inverted_index.p', 'rb') as fp:
    index = pickle.load(fp)
with open('doc_lengths.p', 'rb') as fp:
    doc_lengths = pickle.load(fp)
with open('doc_urls.p', 'rb') as fp:
    doc_urls = pickle.load(fp)

In [6]:
print('Total index length', len(index))
print('\nTop terms by number of documents they apperared in:')
sorted_by_n_docs = sorted(index.items(), key=lambda kv: (len(kv[1]), kv[0]), reverse=True)
print([(sorted_by_n_docs[i][0], len(sorted_by_n_docs[i][1])) for i in range(20)])
print('\nTop terms by overall frequency:')
sorted_by_freq = sorted(index.items(), key=lambda kv: (kv[1][0], kv[0]), reverse=True)
print([(sorted_by_freq[i][0], sorted_by_freq[i][1][0]) for i in range(20)])

Total index length 1996

Top terms by number of documents they apperared in:
[('world', 21), ('use', 21), ('us', 21), ('unit', 21), ('tv', 21), ('trust', 21), ('thomson', 21), ('term', 21), ('tax', 21), ('support', 21), ('state', 21), ('standard', 21), ('solut', 21), ('site', 21), ('see', 21), ('risk', 21), ('right', 21), ('reuter', 21), ('reserv', 21), ('report', 21)]

Top terms by overall frequency:
[('world', 20), ('use', 20), ('us', 20), ('unit', 20), ('tv', 20), ('trust', 20), ('thomson', 20), ('term', 20), ('tax', 20), ('support', 20), ('state', 20), ('standard', 20), ('solut', 20), ('site', 20), ('see', 20), ('risk', 20), ('right', 20), ('reuter', 20), ('reserv', 20), ('report', 20)]


# 3. Answering query

Now, given that we already have built the inverted index, it's time to utilize it for answering user queries. In this class there are two methods you need to implement:
- `boolean_retrieval`, the simplest form of document retrieval which returns a set of documents such that each one contains all query terms. Returns a set of document ids. Refer to *ch.1* of the book for details;
- `okapi_scoring`, Okapi BM25 ranking function - assigns scores to documents in the collection that are relevant to the user query. Returns a dictionary of scores, `doc_id:score`. Read about it in [Wikipedia](https://en.wikipedia.org/wiki/Okapi_BM25#The_ranking_function) and implement accordingly.

Both methods accept `query` parameter in a form of a dictionary, `term:frequency`

In [7]:
from collections import Counter
import math
from numpy import mean, log

class QueryProcessing:
    
    @staticmethod
    def prepare_query(raw_query):
        prep = Preprocessor()
        # pre-process query the same way as documents
        query = prep.preprocess(raw_query)
        # count frequency
        return Counter(query)
    
    @staticmethod
    def boolean_retrieval(query, index):
        def intersect(list1, list2):
            l1, l2 = 0, 0
            intersection = []
            while l1<len(list1) and l2<len(list2):
                if list1[l1]==list2[l2]:
                    intersection.append(list1[l1])
                    l1+=1
                    l2+=2
                elif list1[l1]>list2[l2]:
                    l2+=1
                else:
                    l1+=1
            return intersection
        #TODO retrieve a set of documents containing all query terms
        # step1. get postings lists of all documents containing the query terms
        postings = None
        for term in query:
            if postings:
                # retrieve new postings and intersect
                postings = intersect(postings, [t[0] for t in index[term][1:]])
            else:
                postings = [t[0] for t in index[term][1:]]
        # step2. Intersect them
        return set(postings)

    
    @staticmethod
    def okapi_scoring(query, doc_lengths, index, k1=1.2, b=0.75):
        #TODO retrieve relevant documents with scores
        scores = {}
        N = len(doc_lengths)
        avgdl = mean([doc_lengths[k] for k in doc_lengths])
        k1 = 2.
        b = 0.27
        for doc_id in doc_lengths:
            score = 0
            for term in query:
                tf = 0
                try:
                    for elem in index[term][1:]:
                        if elem[0]==doc_id:
                            tf = elem[1]
                            break
                    m1 = log((N-index[term][0]+0.5)/(index[term][0]+0.5))
                    m2 = (tf*(k1+1))/(tf+k1*(1-b+b*doc_lengths[doc_id]/avgdl))
                    score+=m1*m2
                except:
                    continue
            if score!=0:
                scores[doc_id] = score
        return scores

## 3.1 Tests 

In [8]:
test_doc_lengths = {1: 20, 2: 15, 3: 10, 4:20, 5:30}
test_index = {'x': [2, (1, 1), (2, 1)], 'y': [2, (1, 1), (3, 1)], 'z': [3, (2, 1), (4,2)]}


test_query1 = QueryProcessing.prepare_query('x z')
test_query2 = QueryProcessing.prepare_query('x y')


assert QueryProcessing.boolean_retrieval(test_query1, test_index) == {2}
assert QueryProcessing.boolean_retrieval(test_query2, test_index) == {1}
okapi_res = QueryProcessing.okapi_scoring(test_query2, test_doc_lengths, test_index)
assert all(k in okapi_res for k in (1,2,3))
assert not any(k in okapi_res for k in (4,5))
assert okapi_res[1] > okapi_res[3] > okapi_res[2]

print(okapi_res)

{1: 0.6666290402297231, 2: 0.3497249724181096, 3: 0.367835011265998}


# 4. Setting up a server

**Bonus task \*** Organize the resulting search engine as a web-service that gets a query from get-parameters and returns urls with scores as a `json` dictionary. Check its work in a browser of with curl, should look smth like this:
 
`> curl localhost:8080/?q=some_query_text
{ "url1" : 0.9, "url2": 0.8 }`

You can use one of the following tools for this task: https://www.acmesystems.it/python_http, http.server.ThreadingHTTPServer (3.7+) https://docs.python.org/3/library/http.server.html#http.server.SimpleHTTPRequestHandler

In [29]:
#TODO write a web-service that answers queries using inverted index
from http.server import BaseHTTPRequestHandler,HTTPServer
from urllib.parse import parse_qs, urlparse
import json
import pickle

with open('inverted_index.p', 'rb') as fp:
    index = pickle.load(fp)
with open('doc_lengths.p', 'rb') as fp:
    doc_lengths = pickle.load(fp)


PORT_NUMBER = 8080

#This class will handles any incoming request from
#the browser 
class myHandler(BaseHTTPRequestHandler):

    #Handler for the GET requests
    def do_GET(self):
        params = urlparse(self.path).query
        params = parse_qs(params)
        if 'q' in params:
            # get the query
            q = params['q'][0]
            # process query
            q = QueryProcessing.prepare_query(q)
            okapi_res = QueryProcessing.okapi_scoring(q, doc_lengths, index)
            # provide results back to user
            self.send_response(200)
            self.send_header('Content-type','text/html')
            self.end_headers()
            # Send the html message
            self.wfile.write(json.dumps(okapi_res).encode())
            return

try:
    #Create a web server and define the handler to manage the
    #incoming request
    server = HTTPServer(('', PORT_NUMBER), myHandler)
    print ('Started httpserver on port ' , PORT_NUMBER)

    #Wait forever for incoming htto requests
    server.serve_forever()

except KeyboardInterrupt:
    print ('^C received, shutting down the web server')
    server.socket.close()


Started httpserver on port  8080
okapi res  {32: 2.6781078273290206}


127.0.0.1 - - [31/Jan/2020 14:40:19] "GET /?q=fire HTTP/1.1" 200 -


okapi res  {}


127.0.0.1 - - [31/Jan/2020 14:40:28] "GET /?q=austria HTTP/1.1" 200 -


okapi res  {}


127.0.0.1 - - [31/Jan/2020 14:40:34] "GET /?q=australia HTTP/1.1" 200 -


okapi res  {}


127.0.0.1 - - [31/Jan/2020 14:40:50] "GET /?q=tree HTTP/1.1" 200 -


okapi res  {}


127.0.0.1 - - [31/Jan/2020 14:40:59] "GET /?q=street HTTP/1.1" 200 -


okapi res  {19: 2.6534862366503087}


127.0.0.1 - - [31/Jan/2020 14:41:24] "GET /?q=the%20burning%20problems HTTP/1.1" 200 -


^C received, shutting down the web server
