# Information Retrieval 1#
## Assignment 2: Retrieval models [100 points] ##

In this assignment you will get familiar with basic and advanced information retrieval concepts. You will implement different information retrieval ranking models and evaluate their performance.

We provide you with an Indri index. To query the index, you'll use a Python package ([pyndri](https://github.com/cvangysel/pyndri)) that allows easy access to the underlying document statistics.

For evaluation you'll use the [TREC Eval](https://github.com/usnistgov/trec_eval) utility, provided by the National Institute of Standards and Technology of the United States. TREC Eval is the de facto standard way to compute Information Retrieval measures and is frequently referenced in scientific papers.

This is a **groups-of-three assignment**; the deadline is **Wednesday, January 31st**. Code quality, informative comments, and convincing analysis of the results will be considered when grading. Submit your work through Blackboard; questions can be asked on the course [Piazza](https://piazza.com/university_of_amsterdam/spring2018/52041inr6y/home).

### Technicalities (must-read!) ###

The assignment directory is organized as follows:
   * `./assignment.ipynb` (this file): the description of the assignment.
   * `./index/`: the index we prepared for you.
   * `./ap_88_89/`: directory with ground-truth and evaluation sets:
      * `qrel_test`: test query relevance collection (**test set**).
      * `qrel_validation`: validation query relevance collection (**validation set**).
      * `topics_title`: semicolon-separated file with query identifiers and terms.

You will need the following software packages (tested with Python 3.5 inside [Anaconda](https://conda.io/docs/user-guide/install/index.html)):
   * Python 3.5 and Jupyter
   * Indri + Pyndri (Follow the installation instructions [here](https://github.com/nickvosk/pyndri/blob/master/README.md))
   * gensim [link](https://radimrehurek.com/gensim/install.html)
   * TREC Eval [link](https://github.com/usnistgov/trec_eval)

### TREC Eval primer ###
The TREC Eval utility can be downloaded and compiled as follows:

    git clone https://github.com/usnistgov/trec_eval.git
    cd trec_eval
    make

TREC Eval computes evaluation scores given two files: ground-truth information regarding relevant documents, named *query relevance* or *qrel*, and a ranking of documents for a set of queries, referred to as a *run*. The *qrel* will be supplied by us and should not be changed. For every retrieval model (or combinations thereof) you will generate a run of the top-1000 documents for every query. The format of the *run* file is as follows:

    $query_identifier Q0 $document_identifier $rank_of_document_for_query $query_document_similarity $run_identifier
    
where
   * `$query_identifier` is the unique identifier corresponding to a query (usually this follows a sequential numbering).
   * `Q0` is a legacy field that you can ignore.
   * `$document_identifier` corresponds to the unique identifier of a document (e.g., APXXXXXXX where AP denotes the collection and the Xs correspond to a unique numerical identifier).
   * `$rank_of_document_for_query` denotes the rank of the document for the particular query. This field is ignored by TREC Eval and is only maintained for legacy support. The ranks are computed by TREC Eval itself using the `$query_document_similarity` field (see next). However, it remains good practice to correctly compute this field.
   * `$query_document_similarity` is a score indicating the similarity between query and document where a higher score denotes greater similarity.
   * `$run_identifier` is an identifier of the run. This field is for your own convenience and has no purpose beyond bookkeeping.
   
For example, say we have two queries, `Q1` and `Q2`, and we rank three documents (`DOC1`, `DOC2`, `DOC3`). For query `Q1`, we find the following similarity scores: `score(Q1, DOC1) = 1.0`, `score(Q1, DOC2) = 0.5`, `score(Q1, DOC3) = 0.75`; and for `Q2`: `score(Q2, DOC1) = -0.1`, `score(Q2, DOC2) = 1.25`, `score(Q2, DOC3) = 0.0`. We can generate a run using the following snippet:

In [1]:
import logging
import sys
import os

def write_run(model_name, data, out_f,
              max_objects_per_query=sys.maxsize,
              skip_sorting=False):
    """
    Write a run to an output file.
    Parameters:
        - model_name: identifier of run.
        - data: dictionary mapping topic_id to object_assesments;
            object_assesments is an iterable (list or tuple) of
            (relevance, object_id) pairs.
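            Unless skip_sorting is set, the iterable is sorted in
            decreasing order of relevance before it is written.
        - out_f: output file stream.
        - max_objects_per_query: cut-off for number of objects per query.
        - skip_sorting: if True, assume the iterable is already sorted.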
    """
    for subject_id, object_assesments in data.items():
        if not object_assesments:
            logging.warning('Received empty ranking for %s; ignoring.',
                            subject_id)

            continue

        # Probe types, to make sure everything goes alright.
        # assert isinstance(object_assesments[0][0], float) or \
        #     isinstance(object_assesments[0][0], np.float32)
        assert isinstance(object_assesments[0][1], str) or \
            isinstance(object_assesments[0][1], bytes)

        if not skip_sorting:
            object_assesments = sorted(object_assesments, reverse=True)

        if max_objects_per_query < sys.maxsize:
            object_assesments = object_assesments[:max_objects_per_query]

        if isinstance(subject_id, bytes):
            subject_id = subject_id.decode('utf8')

        for rank, (relevance, object_id) in enumerate(object_assesments):
            if isinstance(object_id, bytes):
                object_id = object_id.decode('utf8')

            out_f.write(
                '{subject} Q0 {object} {rank} {relevance} '
                '{model_name}\n'.format(
                    subject=subject_id,
                    object=object_id,
                    rank=rank + 1,
                    relevance=relevance,
                    model_name=model_name))
            
# The following writes the run to standard output.
# In your code, you should write the runs to local
# storage in order to pass them to trec_eval.
write_run(
    model_name='example',
    data={
        'Q1': ((1.0, 'DOC1'), (0.5, 'DOC2'), (0.75, 'DOC3')),
        'Q2': ((-0.1, 'DOC1'), (1.25, 'DOC2'), (0.0, 'DOC3')),
    },
    out_f=sys.stdout,
    max_objects_per_query=1000)

Q1 Q0 DOC1 1 1.0 example
Q1 Q0 DOC3 2 0.75 example
Q1 Q0 DOC2 3 0.5 example
Q2 Q0 DOC2 1 1.25 example
Q2 Q0 DOC3 2 0.0 example
Q2 Q0 DOC1 3 -0.1 example


Now, imagine that we know that `DOC1` is relevant and `DOC3` is non-relevant for `Q1`. In addition, for `Q2` we only know of the relevance of `DOC3`. The query relevance file looks like:

    Q1 0 DOC1 1
    Q1 0 DOC3 0
    Q2 0 DOC3 1
    
We store the run and qrel in files `example.run` and `example.qrel` respectively on disk. We can now use TREC Eval to compute evaluation measures. In this example, we're only interested in Mean Average Precision and we'll only show this below for brevity. However, TREC Eval outputs much more information such as NDCG, recall, precision, etc.

    $ trec_eval -m all_trec -q example.qrel example.run | grep -E "^map\s"
    > map                   	Q1	1.0000
    > map                   	Q2	0.5000
    > map                   	all	0.7500
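
To see where these numbers come from, the following minimal sketch recomputes average precision by hand for the toy run and qrel above. The function name `average_precision` is our own and not part of TREC Eval, and the sketch assumes that unjudged documents count as non-relevant.

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list of document ids.

    ranking: document ids ordered by decreasing score.
    relevant: set of document ids judged relevant for the query.
    """
    if not relevant:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank

    return precision_sum / len(relevant)


# Rankings taken from the example run; relevant sets from the example qrel.
ap_q1 = average_precision(['DOC1', 'DOC3', 'DOC2'], {'DOC1'})
ap_q2 = average_precision(['DOC2', 'DOC3', 'DOC1'], {'DOC3'})
mean_ap = (ap_q1 + ap_q2) / 2

print(ap_q1, ap_q2, mean_ap)
```

For `Q2` the only relevant document sits at rank 2, so its precision at that rank is 1/2, which is why the mean over both queries is lower than the score for `Q1`.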
    
Now that we've discussed the output format of rankings and how you can compute evaluation measures from these rankings, we'll proceed with an overview of the indexing framework you'll use.

### Pyndri primer ###
For this assignment you will use [Pyndri](https://github.com/cvangysel/pyndri) [[1](https://arxiv.org/abs/1701.00749)], a Python interface for [Indri](https://www.lemurproject.org/indri.php). We have indexed the document collection and you can query the index using Pyndri. We will start by giving you some examples of what Pyndri can do.

First we read the document collection index with Pyndri:

In [2]:
import pyndri

index = pyndri.Index('index/')

The loaded index can be used to access a collection of documents in an easy manner. We'll give you some examples to get an idea of what it can do; it is up to you to figure out how to use it for the remainder of the assignment.

First let's look at the number of documents. Since Pyndri indexes the documents using incremental identifiers, we can simply take the difference between `index.maximum_document()` and `index.document_base()`:

In [3]:
print("There are %d documents in this collection." % (index.maximum_document() - index.document_base()))

There are 164597 documents in this collection.


Let's take the first document out of the collection and take a look at it:

In [4]:
example_document = index.document(index.document_base())
print(example_document)

('AP890425-0001', (1360, 192, 363, 0, 880, 0, 200, 0, 894, 412, 92160, 3, 192, 0, 363, 34, 1441, 0, 174134, 0, 200, 0, 894, 412, 2652, 0, 810, 107, 49, 4903, 420, 0, 1, 48, 35, 489, 0, 35, 687, 192, 243, 0, 249311, 1877, 0, 1651, 1174, 0, 2701, 117, 412, 0, 810, 391, 245233, 1225, 5838, 16, 0, 233156, 3496, 0, 393, 17, 0, 2435, 4819, 930, 0, 0, 200, 0, 894, 0, 22, 398, 145, 0, 3, 271, 115, 0, 1176, 2777, 292, 0, 725, 192, 0, 0, 50046, 0, 1901, 1130, 0, 192, 0, 408, 0, 243779, 0, 0, 553, 192, 0, 363, 0, 3747, 0, 0, 0, 0, 1176, 0, 1239, 0, 0, 1115, 17, 0, 0, 585, 192, 1963, 0, 0, 412, 54356, 0, 773, 0, 0, 0, 192, 0, 0, 1130, 0, 363, 0, 545, 192, 0, 1174, 1901, 1130, 0, 4, 398, 145, 39, 0, 577, 0, 355, 0, 491, 0, 6025, 0, 0, 193156, 88, 34, 437, 0, 0, 1852, 0, 828, 0, 1588, 0, 0, 0, 2615, 0, 0, 107, 49, 420, 0, 0, 190, 7, 714, 2701, 0, 237, 192, 157, 0, 412, 34, 437, 0, 0, 200, 6025, 26, 0, 0, 0, 0, 363, 0, 22, 398, 145, 0, 200, 638, 126222, 6018, 0, 880, 0, 0, 161, 0, 0, 319, 894, 2701, 

Here we see that a document consists of two things: a string representing the external document identifier and an integer list representing the identifiers of the words that make up the document. Pyndri uses integer representations for words or terms; thus a token_id is an integer that represents a word, whereas the token is the actual text of the word/term. Every id has a unique token and vice versa, with the exception of stop words: words so common that they are uninformative all receive the zero id.

To see some ids and their matching tokens, we take a look at the dictionary of the index:

In [5]:
token2id, id2token, _ = index.get_dictionary()
print(list(id2token.items())[:15])

[(1, 'new'), (2, 'percent'), (3, 'two'), (4, '1'), (5, 'people'), (6, 'million'), (7, '000'), (8, 'government'), (9, 'president'), (10, 'years'), (11, 'state'), (12, '2'), (13, 'states'), (14, 'three'), (15, 'time')]


Using this dictionary we can see the tokens for the (non-stop) words in our example document:

In [6]:
print([id2token[word_id] for word_id in example_document[1] if word_id > 0])

['52', 'students', 'arrested', 'takeover', 'university', 'massachusetts', 'building', 'fifty', 'two', 'students', 'arrested', 'tuesday', 'evening', 'occupying', 'university', 'massachusetts', 'building', 'overnight', 'protest', 'defense', 'department', 'funded', 'research', 'new', 'york', 'city', 'thousands', 'city', 'college', 'students', 'got', 'unscheduled', 'holiday', 'demonstrators', 'occupied', 'campus', 'administration', 'building', 'protest', 'possible', 'tuition', 'increases', 'prompting', 'officials', 'suspend', 'classes', '60', 'police', 'riot', 'gear', 'arrived', 'university', 'massachusetts', '5', 'p', 'm', 'two', 'hours', 'later', 'bus', 'drove', 'away', '29', 'students', 'camped', 'memorial', 'hall', 'students', 'charged', 'trespassing', '23', 'students', 'arrested', 'lying', 'bus', 'prevent', 'leaving', 'police', '300', 'students', 'stood', 'building', 'chanting', 'looking', 'students', 'hall', 'arrested', '35', 'students', 'occupied', 'memorial', 'hall', '1', 'p', 'm',

The reverse can also be done. Say we want to look for news about the "University of Massachusetts"; the tokens of that query can be converted to ids using the reverse dictionary:

In [7]:
query_tokens = index.tokenize("University of Massachusetts")
print("Query by tokens:", query_tokens)
query_id_tokens = [token2id.get(query_token,0) for query_token in query_tokens]
print("Query by ids with stopwords:", query_id_tokens)
query_id_tokens = [word_id for word_id in query_id_tokens if word_id > 0]
print("Query by ids without stopwords:", query_id_tokens)

Query by tokens: ['university', '', 'massachusetts']
Query by ids with stopwords: [200, 0, 894]
Query by ids without stopwords: [200, 894]


Naturally, we can now match the document and query in the id space. Let's see how often a word from the query occurs in our example document:

In [8]:
matching_words = sum([True for word_id in example_document[1] if word_id in query_id_tokens])
print("Document %s has %d word matches with query: \"%s\"." % (example_document[0], matching_words, ' '.join(query_tokens)))
print("Document %s and query \"%s\" have a %.01f%% overlap." % (example_document[0], ' '.join(query_tokens),matching_words/float(len(example_document[1]))*100))

Document AP890425-0001 has 13 word matches with query: "university  massachusetts".
Document AP890425-0001 and query "university  massachusetts" have a 2.5% overlap.


While this is certainly not everything Pyndri can do, it should give you an idea of how to use it. Please take a look at the [examples](https://github.com/cvangysel/pyndri), as they will help you a lot with this assignment.

**CAUTION**: Avoid printing out the whole index in this Notebook as it will generate a lot of output and is likely to corrupt the Notebook.

### Parsing the query file
You can parse the query file (`ap_88_89/topics_title`) using the following snippet:

In [9]:
import collections
import io
import logging
import sys
import time

def parse_topics(file_or_files,
                 max_topics=sys.maxsize, delimiter=';'):
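    # Test for None first: in Python 3, `None >= 0` raises a TypeError.
    assert max_topics is None or max_topics >= 0
    if max_topics is None:
        # 0 disables the cut-off in the loop below.
        max_topics = 0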

    topics = collections.OrderedDict()

    if not isinstance(file_or_files, list) and \
            not isinstance(file_or_files, tuple):
        if hasattr(file_or_files, '__iter__'):
            file_or_files = list(file_or_files)
        else:
            file_or_files = [file_or_files]

    for f in file_or_files:
        assert isinstance(f, io.IOBase)

        for line in f:
            assert(isinstance(line, str))

            line = line.strip()

            if not line:
                continue

            topic_id, terms = line.split(delimiter, 1)

            if topic_id in topics and (topics[topic_id] != terms):
                logging.error('Duplicate topic "%s" (%s vs. %s).',
                              topic_id,
                              topics[topic_id],
                              terms)

            topics[topic_id] = terms

            if max_topics > 0 and len(topics) >= max_topics:
                break

    return topics

with open('./ap_88_89/topics_title', 'r') as f_topics:
    print(parse_topics([f_topics]))

OrderedDict([('51', 'Airbus Subsidies'), ('52', 'South African Sanctions'), ('53', 'Leveraged Buyouts'), ('54', 'Satellite Launch Contracts'), ('55', 'Insider Trading'), ('56', 'Prime (Lending) Rate Moves, Predictions'), ('57', 'MCI'), ('58', 'Rail Strikes'), ('59', 'Weather Related Fatalities'), ('60', 'Merit-Pay vs. Seniority'), ('61', 'Israeli Role in Iran-Contra Affair'), ('62', "Military Coups D'etat"), ('63', 'Machine Translation'), ('64', 'Hostage-Taking'), ('65', 'Information Retrieval Systems'), ('66', 'Natural Language Processing'), ('67', 'Politically Motivated Civil Disturbances'), ('68', 'Health Hazards from Fine-Diameter Fibers'), ('69', 'Attempts to Revive the SALT II Treaty'), ('70', 'Surrogate Motherhood'), ('71', 'Border Incursions'), ('72', 'Demographic Shifts in the U.S.'), ('73', 'Demographic Shifts across National Boundaries'), ('74', 'Conflicting Policy'), ('75', 'Automation'), ('76', 'U.S. Constitution - Original Intent'), ('77', 'Poaching'), ('78', 'Greenpeace'

In [10]:
# Logic for allowing the import of notebooks.
import io, os, sys, types
from IPython import get_ipython
from nbformat import read
from IPython.core.interactiveshell import InteractiveShell

def find_notebook(fullname, path=None):
    """find a notebook, given its fully qualified name and an optional path

    This turns "foo.bar" into "foo/bar.ipynb"
    and tries turning "Foo_Bar" into "Foo Bar" if Foo_Bar
    does not exist.
    """
    name = fullname.rsplit('.', 1)[-1]
    if not path:
        path = ['']
    for d in path:
        nb_path = os.path.join(d, name + ".ipynb")
        if os.path.isfile(nb_path):
            return nb_path
        # let import Notebook_Name find "Notebook Name.ipynb"
        nb_path = nb_path.replace("_", " ")
        if os.path.isfile(nb_path):
            return nb_path

class NotebookLoader(object):
    """Module Loader for Jupyter Notebooks"""
    def __init__(self, path=None):
        self.shell = InteractiveShell.instance()
        self.path = path

    def load_module(self, fullname):
        """import a notebook as a module"""
        path = find_notebook(fullname, self.path)

        print ("importing Jupyter notebook from %s" % path)

        # load the notebook object
        with io.open(path, 'r', encoding='utf-8') as f:
            nb = read(f, 4)


        # create the module and add it to sys.modules
        # if name in sys.modules:
        #    return sys.modules[name]
        mod = types.ModuleType(fullname)
        mod.__file__ = path
        mod.__loader__ = self
        mod.__dict__['get_ipython'] = get_ipython
        sys.modules[fullname] = mod

        # extra work to ensure that magics that would affect the user_ns
        # actually affect the notebook module's ns
        save_user_ns = self.shell.user_ns
        self.shell.user_ns = mod.__dict__

        try:
            for cell in nb.cells:
                if cell.cell_type == 'code':
                    # transform the input to executable Python
                    code = self.shell.input_transformer_manager.transform_cell(cell.source)
                    # run the code in the module
                    exec(code, mod.__dict__)
        finally:
            self.shell.user_ns = save_user_ns
        return mod
    
class NotebookFinder(object):
    """Module finder that locates Jupyter Notebooks"""
    def __init__(self):
        self.loaders = {}

    def find_module(self, fullname, path=None):
        nb_path = find_notebook(fullname, path)
        if not nb_path:
            return

        key = path
        if path:
            # lists aren't hashable
            key = os.path.sep.join(path)

        if key not in self.loaders:
            self.loaders[key] = NotebookLoader(path)
        return self.loaders[key]

sys.meta_path.append(NotebookFinder())

### Task 1: Implement and compare lexical IR methods [35 points] ### 

In this task you will implement a number of lexical methods for IR using the **Pyndri** framework. Then you will evaluate these methods on the dataset we have provided using **TREC Eval**.

Use the **Pyndri** framework to get statistics of the documents (term frequency, document frequency, collection frequency; **you are not allowed to use the query functionality of Pyndri**) and implement the following scoring methods in **Python**:

- [TF-IDF](http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html) and 
- [BM25](http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html) with k1=1.2 and b=0.75. **[5 points]**
- Language models ([survey](https://drive.google.com/file/d/0B-zklbckv9CHc0c3b245UW90NE0/view))
    - Jelinek-Mercer (explore different values of 𝛌 in the range [0.1, 0.5, 0.9]). **[5 points]**
    - Dirichlet Prior (explore different values of 𝛍 [500, 1000, 1500]). **[5 points]**
    - Absolute discounting (explore different values of 𝛅 in the range [0.1, 0.5, 0.9]). **[5 points]**
    - [Positional Language Models](http://sifaka.cs.uiuc.edu/~ylv2/pub/sigir09-plm.pdf) define a language model for each position of a document, and score a document based on the scores of its PLMs. The PLM is estimated based on propagated counts of words within a document through a proximity-based density function, which both captures proximity heuristics and achieves an effect of “soft” passage retrieval. Implement the PLM, all five kernels, but only the Best position strategy to score documents. Use 𝛔 equal to 50, and Dirichlet smoothing with 𝛍 optimized on the validation set (decide how to optimize this value yourself and motivate your decision in the report). **[10 points]**
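
As a rough guide to the three smoothing methods listed above, the sketch below writes out the smoothed probability of a single query term under their usual textbook definitions. All function and parameter names are our own and purely illustrative. `p_c` stands for the collection probability of the term (collection frequency divided by total collection length); the sketch assumes `p_c > 0` and `doc_len > 0`. Note that some references put 𝛌 on the collection model rather than on the document model, so check which convention you are following.

```python
import math


def jelinek_mercer(tf, doc_len, p_c, lam):
    """Linear interpolation. Assumption: lam weights the document model."""
    return lam * (tf / doc_len) + (1.0 - lam) * p_c


def dirichlet(tf, doc_len, p_c, mu):
    """Dirichlet prior smoothing with pseudo-count mass mu."""
    return (tf + mu * p_c) / (doc_len + mu)


def absolute_discounting(tf, doc_len, unique_terms, p_c, delta):
    """Subtract delta from every seen count and give the freed mass
    (delta * unique_terms / doc_len) to the collection model."""
    return (max(tf - delta, 0.0) + delta * unique_terms * p_c) / doc_len


# Made-up statistics for one term in one document.
tf, doc_len, unique_terms, p_c = 3, 100, 60, 0.001

p_jm = jelinek_mercer(tf, doc_len, p_c, lam=0.5)
p_dir = dirichlet(tf, doc_len, p_c, mu=1000)
p_ad = absolute_discounting(tf, doc_len, unique_terms, p_c, delta=0.5)

# A query is scored by summing the log-probabilities of its terms.
for name, p in (('JM', p_jm), ('Dirichlet', p_dir), ('AbsDisc', p_ad)):
    print(name, p, math.log(p))
```

In all three methods a term that does not occur in the document (`tf = 0`) still receives a non-zero probability, which is what keeps the log score finite.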
    
Implement the above methods and report evaluation measures (on the test set) using the hyperparameter values you optimized on the validation set (also report the values of the hyperparameters). Use TREC Eval to obtain the results and report on `NDCG@10`, Mean Average Precision (`MAP@1000`), `Precision@5` and `Recall@1000`.

For the language models, create plots showing `NDCG@10` with varying values of the parameters. You can do this by chaining small scripts using shell scripting (preferred) or by executing trec_eval using Python's `subprocess`.
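
If you go the `subprocess` route, a minimal sketch could look like the following. The helper names `parse_trec_eval_output` and `run_trec_eval` are our own, and the sketch assumes that the `trec_eval` binary is on your `PATH` and that it prints whitespace-separated `measure`, `query id`, `value` lines as in the example output shown earlier.

```python
import subprocess


def parse_trec_eval_output(text):
    """Parse trec_eval output into {measure: {query_id: value}}.

    Lines whose value is not numeric (e.g. the run identifier) are skipped.
    """
    results = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        measure, query_id, value = fields
        try:
            value = float(value)
        except ValueError:
            continue
        results.setdefault(measure, {})[query_id] = value
    return results


def run_trec_eval(qrel_path, run_path, binary='trec_eval'):
    """Run trec_eval with per-query output and return the parsed measures."""
    completed = subprocess.run(
        [binary, '-m', 'all_trec', '-q', qrel_path, run_path],
        stdout=subprocess.PIPE, check=True, universal_newlines=True)
    return parse_trec_eval_output(completed.stdout)


# The parser can be tried without the binary on the example output.
sample = 'map\tQ1\t1.0000\nmap\tQ2\t0.5000\nmap\tall\t0.7500\n'
parsed = parse_trec_eval_output(sample)
print(parsed['map']['all'])
```

Keeping the per-query values (the `-q` flag) is useful beyond plotting: the same dictionaries can feed the significance tests and the per-query analysis.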

Compute significance of the results using a [two-tailed paired Student t-test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html) **[5 points]**. Be wary of false rejection of the null hypothesis caused by the [multiple comparisons problem](https://en.wikipedia.org/wiki/Multiple_comparisons_problem). There are multiple ways to mitigate this problem and it is up to you to choose one.
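
One simple way to mitigate the multiple comparisons problem is the Bonferroni correction: with `m` comparisons, each individual test is carried out at level `alpha / m`. The sketch below assumes the p-values have already been computed (for example with `scipy.stats.ttest_rel` on per-query scores); the function name and the p-values are made up for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return, per comparison, whether the null hypothesis is rejected
    at family-wise level alpha using the Bonferroni correction."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values for three pairwise model comparisons.
example_p_values = {
    'tfidf_vs_bm25': 0.004,
    'bm25_vs_dirichlet': 0.03,
    'tfidf_vs_dirichlet': 0.0001,
}
decisions = bonferroni_reject(example_p_values)
print(decisions)
```

Here the second comparison would be significant at an uncorrected level of 0.05 but is not after correction. Bonferroni is conservative; other procedures exist, and the choice remains yours to motivate.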

Analyse the results by identifying specific queries where different methods succeed or fail and discuss possible reasons that cause these differences. This is *very important* in order to understand how the different retrieval functions behave.

**NOTE**: Don’t forget to use log computations in your calculations to avoid underflows. 
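
The following toy example illustrates why: multiplying many small probabilities underflows to zero in floating point, whereas summing their logarithms stays well within range. The probabilities are made up for illustration.

```python
import math

# One hundred made-up term probabilities of 1e-5 each.
term_probabilities = [1e-5] * 100

# Naive product: the true value (1e-500) is far below what a float can hold.
product = 1.0
for p in term_probabilities:
    product *= p

# Sum of logs: same ranking information, no underflow.
log_score = sum(math.log(p) for p in term_probabilities)

print(product)
print(log_score)
```

Because the logarithm is monotonic, ranking documents by the sum of log-probabilities gives the same order as ranking by the product would, had it been representable.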

**IMPORTANT**: You should structure your code around the helper functions we provide below.

In [11]:
with open('./ap_88_89/topics_title', 'r') as f_topics:
    queries = parse_topics([f_topics])

index = pyndri.Index('index/')

num_documents = index.maximum_document() - index.document_base()

dictionary = pyndri.extract_dictionary(index)

tokenized_queries = {
    query_id: [dictionary.translate_token(token)
               for token in index.tokenize(query_string)
               if dictionary.has_token(token)]
    for query_id, query_string in queries.items()}

query_term_ids = set(
    query_term_id
    for query_term_ids in tokenized_queries.values()
    for query_term_id in query_term_ids)

print('Gathering statistics about', len(query_term_ids), 'terms.')

# inverted index creation.

start_time = time.time()

document_lengths = {}
unique_terms_per_document = {}

inverted_index = collections.defaultdict(dict)
collection_frequencies = collections.defaultdict(int)

total_terms = 0

for int_doc_id in range(index.document_base(), index.maximum_document()):
    ext_doc_id, doc_token_ids = index.document(int_doc_id)

    document_bow = collections.Counter(
        token_id for token_id in doc_token_ids
        if token_id > 0)
    document_length = sum(document_bow.values())

    document_lengths[int_doc_id] = document_length
    total_terms += document_length

    unique_terms_per_document[int_doc_id] = len(document_bow)

    for query_term_id in query_term_ids:
        assert query_term_id is not None

        document_term_frequency = document_bow.get(query_term_id, 0)

        if document_term_frequency == 0:
            continue

        collection_frequencies[query_term_id] += document_term_frequency
        inverted_index[query_term_id][int_doc_id] = document_term_frequency

avg_doc_length = total_terms / num_documents

print('Inverted index creation took', time.time() - start_time, 'seconds.')

Gathering statistics about 456 terms.
Inverted index creation took 40.925872802734375 seconds.


In [12]:
def run_retrieval(model_name, score_fn):
    """
    Runs a retrieval method for all the queries and writes the TREC-friendly results in a file.
    
    :param model_name: the name of the model (a string)
    :param score_fn: the scoring function (a function - see below for an example) 
    """
    run_out_path = '{}.run'.format(model_name)

    if os.path.exists(run_out_path):
        return

    retrieval_start_time = time.time()

    print('Retrieving using', model_name)

    data = collections.defaultdict(list)
    scores = collections.defaultdict(lambda: collections.defaultdict(lambda: 0))
    

    # TODO: fill the data dictionary. Use entire query collection in the end
    # The dictionary data should have the form: query_id --> [(document_score, external_doc_id)]    
    for query_id, query in list(tokenized_queries.items())[0:5]:
        for query_term_id in query:
            for int_doc_id in inverted_index[query_term_id]:
                ext_doc_id = index.document(int_doc_id)[0]
                doc_term_freq = inverted_index[query_term_id][int_doc_id]
                
                scores[query_id][ext_doc_id] += score_fn(int_doc_id, query_term_id, doc_term_freq)
    
    for query_id in scores:
        for ext_doc_id in scores[query_id]:
            data[query_id].append((scores[query_id][ext_doc_id], ext_doc_id))

    
    with open(run_out_path, 'w') as f_out:
        write_run(
            model_name=model_name,
            data=data,
            out_f=f_out,
            max_objects_per_query=1000)

In [13]:
# For now I put the first bit in the same file, because it's a bit tricky to split it up. Let's discuss this.
# In the end we need to have just the functions in the global scope,
# or possibly classes for each scoring method.

# We have at our disposal the following:
# index: Pyndri index of the entire collection, useful for translating between tokens and terms, etc.
# document_lengths: dict of document lengths in the form `doc_id`: `doc_length`
# unique_terms_per_document: dict of document vocabulary size in the form `doc_id`: `vocabulary_size`
# inverted_index: dict of query term frequencies per doc in the form `query_term_id`: {`doc_id`: `doc_term_frequency`}
# collection_frequencies: dict of accumulated query term frequencies in the form `query_term_id`: `sum(doc_term_frequencies)`
# avg_doc_length: average document length over the entire collection

In [14]:
import math

In [18]:
class TFIDF:
    """Scoring class for the tf-idf method.
    
    Note:
        The log sublinear transform of the term frequencies is used as a benchmark.
    
    Attributes:
        index: pyndri index for the entire collection.
        inverted_index: dict of term frequencies per document.
        col_freq: dict of term frequencies for the entire collection.
        col_size: number of documents in the collection.
        tf_transform: string denoting possible sublinear tf transformations. accepted values are: log
    """
    
    def __init__(self, index: pyndri.Index, inverted_index: collections.defaultdict(dict), col_freq: collections.defaultdict(int), tf_transform: str):
        """Initialize tf-idf scoring function.
        
        Args:
            index: pyndri index for the entire collection.
            inverted_index: dict of term frequencies per document.
            col_freq: dict of term frequencies for the entire collection.
            tf_transform: string denoting possible sublinear tf transformations. accepted values are: `log`
        """
        self.index = index
        self.inverted_index = inverted_index
        self.col_freq = col_freq
        self.col_size = index.maximum_document() - index.document_base()
        self.tf_transform = tf_transform

    def score(self, int_doc_id: int, query_term_id: int, doc_term_freq: float) -> float:
        """Scoring method for a document and a query term.

        Args:
            int_doc_id: the document id.
            query_term_id: the query term id (assuming you have split the query to tokens).
            doc_term_freq: the document term frequency of the query term.
        """
        if self.tf_transform == 'log':
            wtf = self.log_tf(doc_term_freq)
        else:
            raise ValueError('Unsupported term frequency transformation specified: {}'.format(self.tf_transform))
        idf = self.idf(int_doc_id, query_term_id)
        
        return wtf * idf
    
    def log_tf(self, doc_term_freq: int)-> float:
        """Apply sublinear transformation to document query term frequency.
        
        Args:
            doc_term_freq: the document term frequency for the query term.
            
        Return:
            Log sublinear transformation.
        """
        return 1 + math.log(doc_term_freq)
    
    def idf(self, int_doc_id: int, query_term_id: int) -> float:
        """Calculate inverse document frequency.
        
        Args:
            int_doc_id: pyndri index internal document id (unused).
            query_term_id: pyndri query term id.
        Return:
            Inverse document frequency, log(N / df).
        """
        return math.log(self.col_size) - math.log(self.df(query_term_id))
    
    def df(self, query_term_id: int) -> int:
        """Calculate document frequency of query term.
        
        Args:
            query_term_id: pyndri query term id.
        Return:
            Length of the inverted index for the query.
        """
        return len(self.inverted_index[query_term_id])

In [None]:
tfidf = TFIDF(index, inverted_index, collection_frequencies, 'log')
run_retrieval('tfidf', tfidf.score)

In [33]:
class BM25:
    """Scoring class for the BM25 method.
    
    Note:
        The method neglects the term that normalizes query term frequencies because the
        queries presented here are typically short. Also, average document length is
        computed relative to the entire collection, as opposed to the average length
        of the documents that contain one or more query terms. This makes the score
        generalisable to the whole collection.
    Attributes:
        index: pyndri index for the entire collection.
        inverted_index: dict of term frequencies per document.
        k: tuning parameter that calibrates the document term frequency scaling.
        b: tuning parameter which calibrates the document length scaling.
        avg_len: average document length for the entire collection.
        col_size: number of documents in the collection.
    """
    
    def __init__(self, index, inverted_index, k, b, avg_len):
        """Initialize BM25 scoring method.
        
        Args: 
            index: pyndri index for the entire collection.
            inverted_index: dict of term frequencies per document.
            k: tuning parameter that calibrates the document term frequency scaling.
            b: tuning parameter which calibrates the document length scaling.
            avg_len: average document length for the entire collection.
        """
        self.index = index
        self.inverted_index = inverted_index
        self.k = k
        self.b = b
        self.avg_len = avg_len
        self.col_size = index.maximum_document() - index.document_base()
        
    def score(self, int_doc_id: int, query_term_id: int, doc_term_freq: float)-> float:
        """Compute the score for a document and a query term.
        
        Args:
            int_doc_id: the document id.
            query_term_id: the query term id (assuming you have split the query to tokens).
            doc_term_freq: the document term frequency of the query term.
        """
        
        wtf = self.wtf(int_doc_id, query_term_id, doc_term_freq)
        idf = self.idf(int_doc_id, query_term_id)
        
        return wtf * idf
        
    def wtf(self, int_doc_id: int, query_term_id: int, doc_term_freq: int) -> float:
        """Compute the term frequency term in the score.
        
        Args:
            int_doc_id: the document id.
            query_term_id: the query term id (assuming you have split the query to tokens).
            doc_term_freq: the document term frequency of the query term.
            
        Return:
            Term frequency weight.
        """
        return self.num(doc_term_freq) / self.denom(int_doc_id, doc_term_freq)
        
    def num(self, doc_term_freq: int) -> float:
        """Numerator of the first term.
        
        Args:
            doc_term_freq: the document term frequency of the query term.
            
        Return:
            Term frequency scaled by the `k` parameter.
        """
        return (self.k + 1) * doc_term_freq
    
    def denom(self, int_doc_id: int, doc_term_freq: int) -> float:
        """Denominator of the first term.
        
        Args:
            int_doc_id: the document id.
            doc_term_freq: the document term frequency of the query term.
            
        Return:
            term frequency normalized by document length according to parameters `k` and `b`.
        """
        doc_len = len(self.index.document(int_doc_id)[1])
        # Use the instance's own `b` rather than relying on a global variable named `b`.
        return self.k * ((1 - self.b) + self.b * (doc_len / self.avg_len)) + doc_term_freq
    
    def df(self, query_term_id: int) -> int:
        """Calculate document frequency of query term.
        
        Args:
            query_term_id: pyndri query term id.
        Return:
            Length of the inverted index for the query.
        """
        return len(self.inverted_index[query_term_id])
        
    def idf(self, int_doc_id: int, query_term_id: int) -> float:
        """Calculate inverted document frequency.
        
        Args:
            int_doc_id: pyndri index internal document id.
            query_term_id: pyndri query term id.
        Return:
            Inverted document frequency.
        """
        return math.log(self.col_size) - math.log(self.df(query_term_id))

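For reference, the per-term score computed by `BM25.score` can be written out as a single formula. This is only a transcription of the code above, not an extra requirement. Here $tf_{t,d}$ stands for `doc_term_freq`, $|d|$ for the document length, $avgdl$ for `avg_len`, $N$ for `col_size` and $df_t$ for the value returned by `df`; these symbol names are chosen here for illustration.

```latex
\mathrm{score}(t, d) =
  \frac{(k + 1)\, tf_{t,d}}
       {k \left( (1 - b) + b \, \frac{|d|}{avgdl} \right) + tf_{t,d}}
  \cdot \left( \log N - \log df_t \right)
```

The first factor corresponds to `wtf` (that is, `num` divided by `denom`) and the second factor to `idf`.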
In [34]:
k1 = 1.2
b = 0.75
bm25 = BM25(index, inverted_index, k1, b, avg_doc_length)
run_retrieval('bm25', bm25.score)

Retrieving using bm25

In [16]:
# combining the two functions above:

# TODO implement the rest of the retrieval functions 

# TODO implement tools to help you with the analysis of the results.

### Task 2: Latent Semantic Models (LSMs) [15 points] ###

In this task you will experiment with applying distributional semantics methods ([LSI](http://lsa3.colorado.edu/papers/JASIS.lsi.90.pdf) **[5 points]** and [LDA](https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf) **[5 points]**) for retrieval.

You do not need to implement LSI or LDA on your own. Instead, you can use [gensim](http://radimrehurek.com/gensim/index.html). An example of how to integrate Pyndri with Gensim for word2vec can be found [here](https://github.com/cvangysel/pyndri/blob/master/examples/word2vec.py). For the remaining latent vector space models, you will need to implement connector classes (such as `IndriSentences`) yourself.

In order to use a latent semantic model for retrieval, you need to:
   * build a representation of the query **q**,
   * build a representation of the document **d**,
   * calculate the similarity between **q** and **d** (e.g., cosine similarity, KL-divergence).
     
The exact implementation here depends on the latent semantic model you are using. 
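The three steps above can be sketched in a few lines once query and document representations exist. The example below is a minimal illustration, not a prescribed solution: it assumes that both representations are sparse vectors given as lists of `(index, weight)` pairs (the shape gensim models typically return for a bag-of-words input), and the function names `cosine_similarity` and `rerank` are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors of (index, weight) pairs."""
    a = dict(vec_a)
    b = dict(vec_b)
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero representation carries no information; treat as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query_vec, doc_vecs):
    """Sort candidate documents by decreasing similarity to the query.

    doc_vecs maps a document id to its sparse representation.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data standing in for real LSI/LDA representations.
query = [(0, 1.0), (1, 0.0)]
candidates = {"d1": [(0, 0.5)], "d2": [(1, 1.0)], "d3": [(0, 1.0), (1, 1.0)]}
ranking = rerank(query, candidates)
```

Cosine similarity is a natural fit for LSI vectors; for LDA topic distributions a divergence-based measure may be more appropriate, which is one of the choices you are asked to motivate.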
   
Each of these LSMs comes with various hyperparameters to tune. Choose the parameter values and explicitly mention the reasons that led you to these decisions. You can use the validation set to optimize any hyperparameters you see fit; motivate your decisions. In addition, mention clearly how the query/document representations were constructed for each LSM and explain your choices.

In this experiment, you will first obtain an initial top-1000 ranking for each query using TF-IDF in **Task 1**, and then re-rank the documents using the LSMs. Use TREC Eval to obtain the results and report on `NDCG@10`, Mean Average Precision (`MAP@1000`), `Precision@5` and `Recall@1000`.
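TREC Eval reads rankings from a run file in which every line has the form `query_id Q0 document_id rank score run_tag`. The sketch below shows one way to produce such lines from re-ranked scores. The function name `format_trec_run` and the input layout (a dict mapping each query id to a dict of document scores) are assumptions made for this illustration; adapt them to whatever your own retrieval code produces.

```python
def format_trec_run(run_tag, scores_per_query, max_rank=1000):
    """Return TREC run-file lines for the given per-query document scores.

    scores_per_query: dict mapping query id -> dict of external doc id -> score.
    """
    lines = []
    for query_id in sorted(scores_per_query):
        # Highest score first; ties are broken by document id for determinism.
        ranked = sorted(scores_per_query[query_id].items(),
                        key=lambda item: (-item[1], item[0]))
        for rank, (doc_id, score) in enumerate(ranked[:max_rank], start=1):
            lines.append("{} Q0 {} {} {} {}".format(
                query_id, doc_id, rank, score, run_tag))
    return lines


# Toy scores for a single query; real ids come from the index and topics file.
example_lines = format_trec_run("lsi", {"51": {"AP1": 0.5, "AP2": 1.5}})
```

Writing `"\n".join(example_lines)` to a file then gives a run that can be evaluated against `qrel_validation` or `qrel_test`.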

Perform significance testing **[5 points]** (similar to Task 1) within the class of semantic matching methods.
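As an illustration of what a per-query significance test involves, the sketch below computes the statistic of a paired t-test from two lists of per-query scores (for example, average precision of LSI and of LDA on the same queries, in the same order). The choice of a paired t-test is an assumption made here; use whichever test Task 1 prescribes. The function name `paired_t_statistic` is hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on per-query scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two queries are needed")
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Made-up per-query scores, only to show the call.
t_value, dof = paired_t_statistic([0.5, 0.6, 0.7, 0.8], [0.4, 0.4, 0.6, 0.5])
```

If SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with a p-value. When comparing several methods pairwise, remember to account for multiple comparisons.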

### Task 3:  Word embeddings for ranking [20 points] (open-ended) ###

First create word embeddings on the corpus we provided using [word2vec](http://arxiv.org/abs/1411.2738) -- [gensim implementation](https://radimrehurek.com/gensim/models/word2vec.html). You should extract the indexed documents using pyndri and provide them to gensim for training a model (see example [here](https://github.com/nickvosk/pyndri/blob/master/examples/word2vec.py)).
   
This is an open-ended task. It is left up to you to decide how you will combine word embeddings to derive query and document representations. Note that since we provide the implementation for training word2vec, you will be graded based on your creativity in combining word embeddings to build query and document representations.
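A common baseline to compare more creative combinations against is the unweighted average of the word vectors in a query or document. The sketch below assumes a plain dict from token to vector as a stand-in for a trained word2vec model; the name `average_embedding` is hypothetical, and averaging is only one possible choice, not the expected answer.

```python
def average_embedding(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding.

    Returns None when no token is in the vocabulary, so the caller can
    decide how to score such a query or document.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embeddings; real vectors come from the trained model.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
centroid = average_embedding(["a", "b", "unseen"], toy_embeddings)
```

Out-of-vocabulary tokens are skipped here; weighting each vector by a term statistic such as IDF is one of many variations worth considering.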

Note: If you want to experiment with pre-trained word embeddings on a different corpus, you can use the word embeddings we provide alongside the assignment (./data/reduced_vectors_google.txt.tar.gz). These are the [google word2vec word embeddings](https://code.google.com/archive/p/word2vec/), reduced to only the words that appear in the document collection we use in this assignment.

### Task 4: Learning to rank (LTR) [15 points] (open-ended) ###

In this task you will get an introduction to learning to rank for information retrieval.

You can explore different ways for devising features for the model. Obviously, you can use the retrieval methods you implemented in Task 1, Task 2 and Task 3 as features. Think about other features you can use (e.g. query/document length). Creativity on devising new features and providing motivation for them will be taken into account when grading.

For every query, first create a document candidate set from the top-1000 documents retrieved with TF-IDF, and subsequently compute features given a query and a document. Note that the feature values of different retrieval methods are likely to be distributed differently.
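Because the scores of different retrieval methods live on different scales, it usually helps to normalise each feature before training. The sketch below applies min-max scaling to the values of one feature over the candidate set of a single query. Both the choice of min-max scaling and the name `min_max_normalise` are assumptions made for this illustration; z-score standardisation is an equally reasonable alternative.

```python
def min_max_normalise(values):
    """Scale a list of feature values for one query into the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature carries no ranking signal for this query.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Made-up scores of one feature for three candidate documents.
scaled = min_max_normalise([2.0, 4.0, 6.0])
```

Normalising per query, rather than over the whole training set, keeps queries with generally high or low scores comparable to each other.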

You are advised to start with a pointwise learning to rank algorithm, e.g. logistic regression, as implemented in [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
Train your LTR model using 10-fold cross validation on the test set. More advanced learning to rank algorithms will be appreciated when grading.
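When cross-validating a ranking model, folds are normally formed over queries rather than over individual query-document pairs, so that all documents of a query end up on the same side of the split. The sketch below shows one way to generate such folds; the name `query_folds` and the round-robin assignment are assumptions made for this illustration.

```python
def query_folds(query_ids, n_folds=10):
    """Yield (train_queries, test_queries) pairs, one per fold."""
    ordered = sorted(query_ids)
    # Round-robin assignment keeps fold sizes within one query of each other.
    folds = [ordered[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [query for i, fold in enumerate(folds)
                 if i != held_out for query in fold]
        yield train, test


# Twenty made-up query ids split into ten folds.
splits = list(query_folds(range(20), n_folds=10))
```

For each split, the feature vectors of the training queries are used to fit the model and the held-out queries are ranked and evaluated; averaging the per-fold metrics gives the cross-validated score.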

### Task 5: Write a report [15 points; instant FAIL if not provided] ###

The report should be a PDF file created using the [sigconf ACM template](https://www.acm.org/publications/proceedings-template) and will determine a significant part of your grade.

   * It should explain what you have implemented, motivate your experiments and detail what you expect to learn from them. **[10 points]**
   * Lastly, provide a convincing analysis of your results and conclude the report accordingly. **[10 points]**
      * Do all methods perform similarly on all queries? Why?
      * Is there a single retrieval model that outperforms all other retrieval models (i.e., silver bullet)?
      * ...

**Hand in the report and your self-contained implementation source files.** Only send us the files that matter, organized in a well-documented zip/tgz file with clear instructions on how to reproduce your results. That is, we want to be able to regenerate all your results with minimal effort. You can assume that the index and ground-truth information is present in the same file structure as the one we have provided.
