# Information Retrieval Exercise 3 Notebook 


This is the template notebook for Exercise 3. The specification for the exercise and the corresponding Exercise 3 Quiz submission instance are available on the Moodle page of the course.

This exercise builds upon Exercise 2, and assumes that you are now familiar with concepts we have introduced in both Exercise 1 and Exercise 2, including:
 - [PyTerrier operators](https://pyterrier.readthedocs.io/en/latest/operators.html)
 - [PyTerrier apply transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html)
 - [PyTerrier pt.Experiment()](https://pyterrier.readthedocs.io/en/latest/experiments.html)


## PyTerrier Setup

First, let's install PyTerrier as usual. 

In [None]:
!pip install python-terrier lightgbm==2.2.3


Let's start PyTerrier:

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

# we require a specific version of LightGBM for this exercise
import lightgbm
assert lightgbm.__version__ == '2.2.3'



terrier-assemblies 5.5  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.5  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.6.0 has loaded Terrier 5.5 (built by craigmacdonald on 2021-05-20 13:12)


In [None]:
# patch location of topics and qrels
def _filter_on_qid_type(self, component, variant):
  import pandas as pd
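  if component == "topics":
    data = self.get_topics("all")
  elif component == "qrels":
    data = self.get_qrels("all")
  else:
    # guard against `data` being undefined for any other component
    raise ValueError("Unsupported component '%s'" % component)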
  qid2type = pd.read_csv("http://mirror.ir-datasets.com/79737768b3be1aa07b14691aa54802c5", names=["qid", "type"], sep=" ")
  qid2type["qid"] = qid2type.apply(lambda row: row["qid"].split("-")[1], axis=1)
  rtr = data.merge(qid2type[qid2type["type"] == variant], on=["qid"])
  if len(rtr) == 0:
    raise ValueError("No such topic type '%s'" % variant)
  rtr.drop(columns=['type'], inplace=True)
  return (rtr, "direct")

dataset = pt.get_dataset("trec-wt-2004")
for t in ["np", "td", "hp"]:
  dataset.locations["qrels"][t] = _filter_on_qid_type
  dataset.locations["topics"][t] = _filter_on_qid_type
dataset.locations["qrels"]["all"] = ('04.qrels.web.mixed.txt', "http://www.dcs.gla.ac.uk/~craigm/04.qrels.web.mixed.txt")
dataset.locations["topics_prefixed"]["all"] = ('Web2004.query.stream.trecformat.txt', "http://www.dcs.gla.ac.uk/~craigm/Web2004.query.stream.trecformat.txt", "trec")

## Index, Topics & Qrels for Exercise 3

You will need your login & password credentials from Exercise 2. We will again be using the "50pct" and "trec-wt-2004" datasets from Exercise 2.


In [None]:
UNAME = "2576183s"
PWORD = "9c8d7804"

from pyterrier.datasets import STANDARD_TERRIER_INDEX_FILES, RemoteDataset

# we will again be using the "50pct" and "trec-wt-2004" datasets
Fiftypct = pt.get_dataset("50pct",  user=UNAME, password=PWORD)
dotgov_topicsqrels = pt.get_dataset("trec-wt-2004")

On the other hand, you will be using a slightly updated index for Exercise 3. It is a bit bigger than the Exercise 2 index, hence it takes about 2-3 minutes to download to Colab. 

We also remove the Ex2 index, if it is found (this will only apply if you are not running on Colab). 

In [None]:
def removeEx2Index():
  import os
  indexdir = os.path.join(Fiftypct.corpus_home, "index")
  if os.path.exists(os.path.join(indexdir, "data.properties")) and not os.path.exists(os.path.join(indexdir, "data-pagerank.oos")):
    #this branch only occurs if the index from IRM Ex2 is found  
    print("WARNING: I have detected and removed an Ex2 index - if you are still working on Ex2, results will be identical, but " +
          "querying time will be a bit longer")
    print("To restore the original Ex2 index, you can delete %s and rerun the Ex2 notebook" % indexdir)
    import shutil
    shutil.rmtree(indexdir)

removeEx2Index()

indexref = Fiftypct.get_index(variant="ex2")
index = pt.IndexFactory.of(indexref)


Let's check out the new index. Compared to the index we used for Exercise 2, you can see that this index has `Field Names: [TITLE, ELSE]`, which means that we can provide statistics about how many times each term occurs in the title of each document (the "TITLE" field), vs the rest of the document (the "ELSE" field). Refer to Lecture 8 for more information about fields.
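
To make the idea of per-field statistics concrete, here is a minimal sketch in plain Python. It is not how Terrier computes or stores these statistics; the toy document and the `field_term_frequencies` helper are invented purely for illustration, and tokenisation is assumed to be lowercase whitespace splitting.

```python
from collections import Counter

def field_term_frequencies(fields: dict) -> dict:
    """Count lowercase, whitespace-separated terms separately for each field."""
    return {name: Counter(text.lower().split()) for name, text in fields.items()}

# A made-up document, split into the two fields named by the index.
toy_doc = {
    "TITLE": "Chemical Safety Board",
    "ELSE": "The board investigates chemical accidents and chemical hazards",
}

tf = field_term_frequencies(toy_doc)
print(tf["TITLE"]["chemical"])  # occurrences of the term in the title
print(tf["ELSE"]["chemical"])   # occurrences in the rest of the document
```

A field-aware weighting model can then score a title match differently from a match in the rest of the document.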

Let's also display the keys in the meta index - this is the metadata that we have stored for each document. You can see that we are storing the "url" and the "body" (content) of the document. These will particularly come in handy for Q2 and Q3 of Exercise 3, respectively.


In [None]:
print(index.getCollectionStatistics())
print("In the meta index: " + str(index.getMetaIndex().getKeys()))

Finally, these are all of the topics and qrels (including the training and validation datasets) that you will need to conduct Exercise 3.

In [None]:
tr_topics = Fiftypct.get_topics("training")
va_topics = Fiftypct.get_topics("validation")

tr_qrels = Fiftypct.get_qrels("training")
va_qrels = Fiftypct.get_qrels("validation")

test_topics = dotgov_topicsqrels.get_topics("hp")
test_qrels = dotgov_topicsqrels.get_qrels("hp")

## Baseline Setup

Here we introduce the BatchRetrieve transformer that we use for our baseline. Note that:
 - We are using PL2 as our weighting model to generate the sample (the candidate set of documents to re-rank).
 - We expose more document metadata, namely "url" and "body" for each document retrieved, which you will need to deploy your two new features. 
 - By setting `verbose=True`, we display a progress bar while retrieval executes.

In [None]:
firstpassUB = pt.BatchRetrieve(index, wmodel="PL2", metadata=["docno", "url", "body"], verbose=True)


Let's see the resulting output - you can see that there are now "url" and "body" attributes for each retrieved document. (We also display a progress bar, enabled by `verbose=True`.)

In [None]:
firstpassUB.search("chemical reactions")

# Standard list of features

Let's introduce the list of features we need to deploy a baseline learning-to-rank approach.

In [None]:
pagerankfile = indexref.toString().replace(".properties", "-pagerank.oos")
features = [
    "SAMPLE", #ie PL2
    "WMODEL:SingleFieldModel(BM25,0)", #BM25 title
    "QI:StaticFeature(OIS,%s)" % pagerankfile,
]

stdfeatures = pt.FeaturesBatchRetrieve(index, features, verbose=True)
stage12 = firstpassUB >> stdfeatures

This is our feature set. We will be using FeaturesBatchRetrieve to compute these extra features on the fly. Let's see the output. You can see that there is now a "features" column.

In [None]:
stage12.search("chemical reactions").head(2)

Let's look at the features in more detail. There are 3 numbers for each document, in the same order as the `features` list: the first is the PL2 score (1.27555456e+01 == 12.7555), the second is the BM25 score on the title field, and the third is the PageRank (a link analysis feature, discussed in more detail in Lecture 10).

In [None]:
stage12.search("chemical reactions").head(1).iloc[0]["features"]
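
Because the feature values are positional, it can help to attach labels to them when inspecting a single document. The snippet below is only an illustration in plain Python: the labels are our own, and apart from the PL2 score quoted above, the values are placeholders rather than real output.

```python
# Labels follow the order of the `features` list defined earlier (our own naming).
feature_names = ["PL2 (sample)", "BM25 on TITLE", "PageRank"]

# Only the first value comes from the text above; the other two are placeholders.
example_vector = [1.27555456e+01, 0.0, 0.0]

named = dict(zip(feature_names, example_vector))
for name, value in named.items():
    print(f"{name}: {value:.4f}")
```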

# Q1

You now have everything you need to attempt Q1. You will need to refer to the specification, and to PyTerrier's [learning to rank documentation](https://pyterrier.readthedocs.io/en/latest/ltr.html).

You should use a LightGBM LambdaMART implementation (*not* XGBoost), instantiated using the configuration suggested in the PyTerrier documentation.

Hints:
 - You will need to use the provided separate “training” and “validation” topic sets and qrels to train the learning-to-rank.
 - There is no need to vary the configuration of LightGBM from that in the documentation.

In [None]:
#YOUR SOLUTION

# Q2 - URL Length Features

In this block, please provide your code for Q2 concerning your two URL Length features, namely URL Length by counting slashes (URL-slashes) and URL Length through using the type of the URL (URL-type). There are different possible URL length features that you could implement (see specification). Do carefully read and follow the Exercise 3 specification before starting the implementation of the features.

Some hints:

 - You will need to use a [pt.apply function](https://pyterrier.readthedocs.io/en/latest/apply.html) for computing your URL feature(s). The dataframe of results obtained from the upstream transformer has all of the information you need.

 - You can use a `**` operator for combining feature sets.

 - Refer to the PyTerrier learning to rank documentation concerning `feature_importances_` for obtaining feature importances.

 - You may wish to refer to Python's [`urlparse()`](https://docs.python.org/3/library/urllib.parse.html) function.

 - Use Python assertions to test that your feature implementation(s) give the expected results.
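
As a small illustration of the last two hints, the sketch below shows how `urlparse()` splits a URL into components and how assertions can check the result. The URL is hypothetical (it is not taken from the collection), and the snippet deliberately stops short of defining either feature, since their exact definitions are given in the specification.

```python
from urllib.parse import urlparse

# Hypothetical URL, used only to show what urlparse() returns.
parts = urlparse("http://www.example.gov/about/contact/index.html?lang=en")

print(parts.scheme)  # the protocol
print(parts.netloc)  # the host name
print(parts.path)    # the path component
print(parts.query)   # the query string, without the leading "?"

# Assertions document the behaviour you expect and fail loudly if it changes.
assert parts.netloc == "www.example.gov"
assert parts.path == "/about/contact/index.html"
```

The same pattern applies to your own feature functions: pick a handful of URLs whose expected values you can work out by hand, and assert on them.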


## Q2 (a) URL-Slashes Feature

In this block you should define your URL-Slashes feature, and test it. 

In [None]:
#YOUR SOLUTION

#### (i) URL-Slashes as a PL2 re-ranker

Now you should evaluate your URL-slashes score by re-ranking PL2. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

#### (ii) URL-Slashes within an LTR model

Now you should evaluate your URL-slashes score as a feature within a new learned model. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

## Q2 (b) URL Type Feature

In this block you should define your URL Type feature and test it.

In [None]:
#YOUR SOLUTION

#### (i) URL Type as a PL2 re-ranker

Now you should evaluate your URL type score by re-ranking PL2. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

#### (ii) URL Type within an LTR model

Now you should evaluate your URL type score as a feature within a new learned model. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

# Q3 Proximity Search Feature

Now you will implement a new query-dependent feature, using the MinDist() function, as discussed in the specification. Do carefully read the specification before starting the implementation.

Hints:
 - Again, remember to use assertions to test your feature implementations.
 - Refer to the PyTerrier learning to rank documentation concerning `feature_importances_` for obtaining feature importances.

As mentioned in the specification, you should implement a function called avgmindist(), which takes the text of the query and the text of the document, and returns a score for the document, i.e. it must conform to the following Python specification:
```python
def avgmindist(query : str, document : str) -> float
```

NB: There are particular specific requirements for your implementations of MinDist() and avgmindist() that are detailed in the specification.
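
For intuition only, the sketch below computes the smallest gap between the positions of two terms in a token list. It is a generic illustration of a proximity measure, not the MinDist() of the specification: the function name `smallest_gap`, the whitespace tokenisation, and returning `None` when a term is absent are all assumptions made here, and the specification's requirements take precedence wherever they differ.

```python
from typing import List, Optional

def smallest_gap(tokens: List[str], term_a: str, term_b: str) -> Optional[int]:
    """Smallest absolute difference between any position of term_a
    and any position of term_b in the token list."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    if not pos_a or not pos_b:
        return None  # assumption: the specification says what to do in this case
    return min(abs(i - j) for i in pos_a for j in pos_b)

# Made-up text; "dom" occurs at positions 1 and 8, "surface" at position 5.
tokens = "the dom sits below the surface near another dom".split()
print(smallest_gap(tokens, "dom", "surface"))
```

Working through a tiny example like this by hand is a good source of assertions for your own implementation.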

In [None]:
#YOUR AVGMINDIST IMPLEMENTATION

def avgmindist(query : str, document : str) -> float:
  #update your implementation here.
  return 0.0

You should test your implementation yourself; however, to allow us to verify your implementation, we have created 9 test cases. Please run `run_test_cases()` and use its responses to answer the relevant quiz questions.



In [None]:
#DO NOT ALTER THIS CELL
TEST_CASES = [
  ('fermilab directory', 45, 567257), #1
  ('webcam', 45, 567257), #2
  ('DOM surface', 384034, 388292), #3
  ('DOM surface', 45, 384034), #4
  ('DOM surface document', 388292, 384034), #5
  ('DOM software AMANDA', 639302, 384034), #6
  ('fermilab directory', 388292, 384034), #7
  ('trigger data', 596532, 639302), #8
  ('underlying hardware', 384034, 333649) #9
]

def run_test_cases():
  docno=0
  body=3
  for i, (query, docid1, docid2) in enumerate(TEST_CASES):
    meta1 = index.getMetaIndex().getAllItems(docid1)
    meta2 = index.getMetaIndex().getAllItems(docid2)
    s1 = avgmindist(query, meta1[body])
    s2 = avgmindist(query, meta2[body])
    if s1 > s2:
      result = meta1[docno]
      cmpD = "%s > %s" % (meta1[docno],meta2[docno])
    elif s2 > s1:
      result = meta2[docno]
      cmpD = "%s > %s" % (meta2[docno],meta1[docno])
    else:
      result = "EQUAL"
      cmpD = "%s == %s" % (meta1[docno],meta2[docno])
    print("TEST CASE %d result %s " % (i+1, result))

run_test_cases()

You should now integrate your avgmindist() function into a new LTR model, and compare its MAP & P@5 performance to the LTR baseline. You can now answer the corresponding quiz questions.
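
As a reminder of what the two measures capture, here is a minimal pure-Python sketch of P@k and average precision for a single query, assuming binary relevance (MAP is the mean of average precision over all queries). The document identifiers are invented, and in the notebook itself you would normally let `pt.Experiment()` compute these measures rather than code them by hand.

```python
from typing import List, Set

def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Sum of the precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Toy ranking: relevant documents appear at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
print(precision_at_k(ranking, relevant, 5))
print(average_precision(ranking, relevant))
```

P@5 only looks at the top of the ranking, whereas average precision rewards placing every relevant document as early as possible, so the two measures can disagree about which system is better.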

In [None]:
#YOUR SOLUTION

# Q4 A 5-feature Learning-to-Rank Model

You will now experiment with the LightGBM LambdaMART technique, including both of your added features (URL Type and AvgMinDist) alongside the 3 initial features, including the PL2 sample (5 features in total).

You need to learn a *new* model when using your final selection of 5 features.

Evaluate the performance of your resulting LTR system in comparison to the LTR baseline and answer the quiz questions.

In [None]:
#YOUR SOLUTION

# That's all Folks

**Submission Instructions:** Complete this notebook. All your answers to Exercise 3 must be submitted on the Exercise 3 Quiz instance on Moodle with your completed notebook (showing both your solutions and the results of their executions).


Your answers to the Quiz questions along with your .ipynb notebook file (showing code and outputs) must be submitted by **Friday 18th June 2021, 4:30pm**.