# Information Retrieval Exercise 3 Notebook 


This is the template notebook for Exercise 3. The specification for the exercise and the corresponding Exercise 3 Quiz submission instance are available on the Moodle page of the course.

This exercise builds upon Exercise 2, and assumes that you are now familiar with concepts we have introduced in both Exercise 1 and Exercise 2, including:
 - [PyTerrier operators](https://pyterrier.readthedocs.io/en/latest/operators.html)
 - [PyTerrier apply transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html)
 - [PyTerrier pt.Experiment()](https://pyterrier.readthedocs.io/en/latest/experiments.html)


## PyTerrier Setup

First, let's install PyTerrier as usual. 

In [None]:
!pip install python-terrier lightgbm==2.2.3


Let's start PyTerrier:

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

# we require a specific version of LightGBM for this exercise
import lightgbm
assert lightgbm.__version__ == '2.2.3'



terrier-assemblies 5.5  jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.5  jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.6.0 has loaded Terrier 5.5 (built by craigmacdonald on 2021-05-20 13:12)


In [None]:
# patch location of topics and qrels
def _filter_on_qid_type(self, component, variant):
  import pandas as pd
  if component == "topics":
    data = self.get_topics("all")
  elif component == "qrels":
    data = self.get_qrels("all")
  qid2type = pd.read_csv("http://mirror.ir-datasets.com/79737768b3be1aa07b14691aa54802c5", names=["qid", "type"], sep=" ")
  qid2type["qid"] = qid2type.apply(lambda row: row["qid"].split("-")[1], axis=1)
  rtr = data.merge(qid2type[qid2type["type"] == variant], on=["qid"])
  if len(rtr) == 0:
    raise ValueError("No such topic type '%s'" % variant)
  rtr.drop(columns=['type'], inplace=True)
  return (rtr, "direct")

dataset = pt.get_dataset("trec-wt-2004")
for t in ["np", "td", "hp"]:
  dataset.locations["qrels"][t] = _filter_on_qid_type
  dataset.locations["topics"][t] = _filter_on_qid_type
dataset.locations["qrels"]["all"] = ('04.qrels.web.mixed.txt', "http://www.dcs.gla.ac.uk/~craigm/04.qrels.web.mixed.txt")
dataset.locations["topics_prefixed"]["all"] = ('Web2004.query.stream.trecformat.txt', "http://www.dcs.gla.ac.uk/~craigm/Web2004.query.stream.trecformat.txt", "trec")

## Index, Topics & Qrels for Exercise 3

You will need your login and password credentials from Exercise 2. We will again be using the "50pct" and "trec-wt-2004" datasets from Exercise 2.


In [None]:
UNAME="2576183s"
PWORD="9c8d7804"

from pyterrier.datasets import STANDARD_TERRIER_INDEX_FILES, RemoteDataset

# we will again be using the "50pct" and "trec-wt-2004" datasets
Fiftypct = pt.get_dataset("50pct",  user=UNAME, password=PWORD)
dotgov_topicsqrels = pt.get_dataset("trec-wt-2004")

On the other hand, you will be using a slightly updated index for Exercise 3. It is a bit bigger than the Exercise 2 index, hence it takes about 2-3 minutes to download to Colab. 

We also remove the Ex2 index, if it is found (this will only apply if you are not running on Colab). 

In [None]:
def removeEx2Index():
  import os
  indexdir = os.path.join(Fiftypct.corpus_home, "index")
  if os.path.exists(os.path.join(indexdir, "data.properties")) and not os.path.exists(os.path.join(indexdir, "data-pagerank.oos")):
    #this branch only occurs if the index from IRM Ex2 is found  
    print("WARNING: I have detected and removed an Ex2 index - if you are still working on Ex2, results will be identical, but " +
          "querying time will be a bit longer")
    print("To restore the original Ex2 index, you can delete %s and rerun the Ex2 notebook" % indexdir)
    import shutil
    shutil.rmtree(indexdir)

removeEx2Index()

indexref = Fiftypct.get_index(variant="ex2")
index = pt.IndexFactory.of(indexref)


Downloading 50pct index to /root/.pyterrier/corpora/50pct/index




01:16:15.119 [main] WARN  o.t.s.BaseCompressingMetaIndex - Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 860.9 MiB of memory would be required.


Let's check out the new index. Compared to the index we used for Exercise 2, you can see that this index has `Field Names: [TITLE, ELSE]`, which means that we can provide statistics about how many times each term occurs in the title of each document (the "TITLE" field), vs the rest of the document (the "ELSE" field). Refer to Lecture 8 for more information about fields.

Let's also display the keys in the meta index - this is the metadata that we have stored for each document. You can see that we are storing the "url" and the "body" (content) of the document. These will particularly come in handy for Q2 and Q3 of Exercise 3, respectively.


In [None]:
print(index.getCollectionStatistics())
print("In the meta index: " + str(index.getMetaIndex().getKeys()))

Number of documents: 807775
Number of terms: 2043788
Number of postings: 177737957
Number of fields: 2
Number of tokens: 572916194
Field names: [TITLE, ELSE]
Positions:   true

In the meta index: ['docno', 'url', 'title', 'body']


Finally, these are all of the topics and qrels (including the training and validation datasets) that you will need to conduct Exercise 3.

In [None]:
tr_topics = Fiftypct.get_topics("training")
va_topics = Fiftypct.get_topics("validation")

tr_qrels = Fiftypct.get_qrels("training")
va_qrels = Fiftypct.get_qrels("validation")

test_topics = dotgov_topicsqrels.get_topics("hp")
test_qrels = dotgov_topicsqrels.get_qrels("hp")

Downloading 50pct topics to /root/.pyterrier/corpora/50pct/training.topics
Downloading 50pct topics to /root/.pyterrier/corpora/50pct/validation.topics
Downloading 50pct qrels to /root/.pyterrier/corpora/50pct/training.qrels
Downloading 50pct qrels to /root/.pyterrier/corpora/50pct/validation.qrels
Downloading trec-wt-2004 topics_prefixed to /root/.pyterrier/corpora/trec-wt-2004/Web2004.query.stream.trecformat.txt
Downloading trec-wt-2004 qrels to /root/.pyterrier/corpora/trec-wt-2004/04.qrels.web.mixed.txt




In [None]:
test_topics.head()

| | qid | query |
|---|---|---|
| 0 | 6 | philadelphia streets |
| 1 | 7 | togo embassy |
| 2 | 9 | baltimore |
| 3 | 17 | secure linux |
| 4 | 29 | grand canyon monitoring and research center |


## Baseline Setup

We introduce here the BatchRetrieve for our baseline. Note that:
 - We are using PL2 as our weighting model to generate the sample (the candidate set of documents to re-rank).
 - We expose more document metadata, namely "url" and "body" for each document retrieved, which you will need to deploy your two new features. 
 - By setting `verbose=True`, we display a progress bar while retrieval executes.

In [None]:
firstpassUB = pt.BatchRetrieve(index, wmodel="PL2", metadata=["docno", "url", "body"], verbose=True)


Let's see the resulting output. You can see that there are now "url" and "body" attributes for each retrieved document. (A progress bar is also displayed, enabled by `verbose=True`.)

In [None]:
firstpassUB.search("chemical reactions")





Unnamed: 0,qid,docid,docno,url,body,rank,score,query
0,1,513586,G18-38-1767991,http://www.boulder.nist.gov/div838/tar/file03....,NIST - Physical and Chemical Properties Divi...,0,12.755546,chemical reactions
1,1,38544,G01-14-2537005,http://www.labtrain.noaa.gov/shemtfa/chemhaz/n...,. ...,1,11.906524,chemical reactions
2,1,707122,G26-06-3754605,http://www.aps.anl.gov/xfd/tech/safetyenvelope...,APS Experiment Safety Envelope 6: Chemicals ...,2,11.877550,chemical reactions
3,1,382754,G13-59-3981168,http://response.restoration.noaa.gov/chemaids/...,"""); } else { document.write(...",3,11.858475,chemical reactions
4,1,70292,G02-16-2617043,http://www.symp14.nist.gov/PDF/COR04MAY.PDF,A Database of Chemical Reactions Designed to A...,4,11.731490,chemical reactions
...,...,...,...,...,...,...,...,...
995,1,246965,G08-68-4141101,http://en-env.llnl.gov/asd/pinatub.html,The Chemical and Radiative Effects of the Moun...,995,6.290707,chemical reactions
996,1,611136,G22-04-3955177,http://eospso.gsfc.nasa.gov/ftp_docs/Ch7.pdf,Chapter 7 ...,996,6.289822,chemical reactions
997,1,594957,G21-38-0191596,http://www.oit.doe.gov/news/oittimes/wn02/wn02...,search ...,997,6.287830,chemical reactions
998,1,280944,G09-85-3411646,http://www.ig.doe.gov/pdf/chemfina.pdf,INS-O-00-01 I N S P E C T I O N ...,998,6.287759,chemical reactions


# Standard list of features

Let's introduce the list of features we need to deploy a baseline learning-to-rank approach.

In [None]:
pagerankfile = indexref.toString().replace(".properties", "-pagerank.oos")
features = [
    "SAMPLE", #ie PL2
    "WMODEL:SingleFieldModel(BM25,0)", #BM25 title
    "QI:StaticFeature(OIS,%s)" % pagerankfile,
]

stdfeatures = pt.FeaturesBatchRetrieve(index, features, verbose=True)
stage12 = firstpassUB >> stdfeatures

In [None]:
from google.colab import drive
drive.mount('/content/drive')

This is our feature set. We will be using FeaturesBatchRetrieve to compute these extra features on the fly. Let's see the output. You can see that there is now a "features" column.

In [None]:
stage12.search("chemical reactions").head(2)





Unnamed: 0,qid,query,docid,rank,features,docno,score
0,1,chemical reactions,513586,0,"[12.755545561073266, 3.0924078763629836, 0.000...",G18-38-1767991,12.755546
1,1,chemical reactions,38544,1,"[11.90652405775751, 10.789390732195702, 0.0002...",G01-14-2537005,11.906524


Let's look in more detail at the features. There are 3 numbers for each document. The first is the PL2 score (1.27555456e+01 == 12.7555), the second is the BM25 score computed on the TITLE field only, and the third is the PageRank score (a link analysis feature, discussed in more detail in Lecture 10).

In [None]:
stage12.search("chemical reactions").head(1).iloc[0]["features"]





array([1.27555456e+01, 3.09240788e+00, 1.05668333e-04])
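To keep track of which position holds which feature, the vector can be paired with names. This is a minimal sketch, assuming the positions follow the order of the `features` list defined earlier; the label strings and the `label_features` helper are illustrative and not part of PyTerrier.

```python
# Illustrative labels, assumed to be in the same order as the `features` list.
FEATURE_NAMES = ["PL2 (sample)", "BM25 (TITLE field)", "PageRank"]

def label_features(vector, names=FEATURE_NAMES):
    """Pair each feature value with a human-readable name."""
    if len(vector) != len(names):
        raise ValueError("expected %d features, got %d" % (len(names), len(vector)))
    return dict(zip(names, vector))

# Values copied from the array shown above.
labelled = label_features([1.27555456e+01, 3.09240788e+00, 1.05668333e-04])
for name, value in labelled.items():
    print("%-20s %.6f" % (name, value))
```

The length check guards against silently mislabelling a vector once further features are added to the pipeline.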

# Q1

You now have everything you need to attempt Q1. You will need to refer to the specification, and to PyTerrier's [learning to rank documentation](https://pyterrier.readthedocs.io/en/latest/ltr.html).

You should use a LightGBM LambdaMART implementation (*not* XGBoost), instantiated using the configuration suggested in the PyTerrier documentation.

Hints:
 - You will need to use the provided separate “training” and “validation” topic sets and qrels to train the learning-to-rank.
 - There is no need to vary the configuration of LightGBM from that in the documentation.

In [None]:
#YOUR SOLUTION
import lightgbm as lgb
# this configures LightGBM as LambdaMART
lmart_l = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)
lmart_l_pipe = stage12 >> pt.ltr.apply_learned_model(lmart_l, form="ltr")
lmart_l_pipe.fit(tr_topics, tr_qrels, va_topics, va_qrels)







[1]	valid_0's ndcg@1: 0.277778
[2]	valid_0's ndcg@1: 0.351852
[3]	valid_0's ndcg@1: 0.351852
[4]	valid_0's ndcg@1: 0.388889
[5]	valid_0's ndcg@1: 0.388889
[6]	valid_0's ndcg@1: 0.407407
[7]	valid_0's ndcg@1: 0.407407
[8]	valid_0's ndcg@1: 0.407407
[9]	valid_0's ndcg@1: 0.388889
[10]	valid_0's ndcg@1: 0.388889


In [None]:
performance_map = pt.Experiment(
    [firstpassUB, lmart_l_pipe],
    test_topics,
    test_qrels,
    eval_metrics=['map'],
    round={"map" : 4 },
    names=["PL2", "LambdaMART (LightGBM)"],
    baseline = 0
)
performance_map


In [None]:
performance_P_5 = pt.Experiment(
    [firstpassUB, lmart_l_pipe],
    test_topics,
    test_qrels,
    eval_metrics=['P_5'],
    round={"P_5" : 4 },
    names=["PL2",  "LambdaMART (LightGBM)" ],
    baseline = 0
)
performance_P_5

# Q2 - URL Length Features

In this block, please provide your code for Q2 concerning your two URL Length features, namely URL Length by counting slashes (URL-slashes) and URL Length through using the type of the URL (URL-type). There are different possible URL length features that you could implement (see specification). Do carefully read and follow the Exercise 3 specification before starting the implementation of the features.

Some hints:

 - You will need to use a [pt.apply function](https://pyterrier.readthedocs.io/en/latest/apply.html) for computing your URL feature(s). The dataframe of results obtained from the upstream transformer has all of the information you need.

 - You can use a `**` operator for combining feature sets.

 - Refer to the PyTerrier learning to rank documentation concerning `feature_importances_` for obtaining feature importances.

 - You may wish to refer to Python's [`urlparse()`](https://docs.python.org/3/library/urllib.parse.html) function.

 - Use Python assertions to test that your feature implementation(s) give the expected results.
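
The last two hints can be combined. The sketch below is not a solution to Q2: it only shows how `urlparse()` splits a URL into components and how assertions can record expected behaviour. The helper `path_depth` and its notion of depth (the number of non-empty path segments) are assumptions made for illustration; the specification defines the actual features.

```python
from urllib.parse import urlparse

def path_depth(url):
    """Hypothetical helper: number of non-empty segments in the URL path."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

# urlparse() separates the scheme, the host and the path.
parts = urlparse("http://www.atsdr.cdc.gov/toxprofiles/tp4-c1.pdf")
assert parts.scheme == "http"
assert parts.netloc == "www.atsdr.cdc.gov"
assert parts.path == "/toxprofiles/tp4-c1.pdf"

# Assertions double as small, self-checking test cases.
assert path_depth("http://www.atsdr.cdc.gov/") == 0
assert path_depth("http://www.atsdr.cdc.gov/toxprofiles/") == 1
assert path_depth("http://www.atsdr.cdc.gov/toxprofiles/tp4-c1.pdf") == 2
```

Working on the parsed path rather than the raw string keeps the slashes of `http://` out of the count, which may or may not be what the specification asks for.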


## Q2 (a) URL-Slashes Feature

In this block you should define your URL-Slashes feature, and test it. 

In [None]:
#YOUR SOLUTION

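def URL_slashes(row):
  # `row` is one result row carrying the "url" metadata; every "/" is counted,
  # including the two that follow the scheme ("http://")
  count = row["url"].count("/")
  return count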

print(("http://www.atsdr.cdc.gov/toxprofiles/tp4-c1.pdf").count("/"))

4


#### (i) URL-Slashes as a PL2 re-ranker

Now you should evaluate your URL-slashes score by re-ranking PL2. You can now answer the corresponding quiz questions.
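
When a feature replaces the retrieval score, documents are ordered by that feature in descending order, so it matters whether larger or smaller values should win. The following sketch is independent of PyTerrier and uses plain dictionaries to illustrate the effect; the `rerank` helper, the document numbers and the URLs are all hypothetical.

```python
def rerank(results, score_fn):
    """Replace each result's score with score_fn(result), then sort descending."""
    rescored = [dict(r, score=float(score_fn(r))) for r in results]
    rescored.sort(key=lambda r: r["score"], reverse=True)
    for rank, r in enumerate(rescored):
        r["rank"] = rank
    return rescored

results = [
    {"docno": "d1", "url": "http://example.gov/", "score": 9.1},
    {"docno": "d2", "url": "http://example.gov/a/b/c/page.html", "score": 8.7},
]

# Using the raw slash count as the score puts the deeper URL first...
by_count = rerank(results, lambda r: r["url"].count("/"))
# ...whereas negating the count puts the shorter URL first.
by_negated = rerank(results, lambda r: -r["url"].count("/"))

print([r["docno"] for r in by_count])    # ['d2', 'd1']
print([r["docno"] for r in by_negated])  # ['d1', 'd2']
```

Which direction is appropriate depends on the feature definition in the specification.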

In [None]:
#YOUR SOLUTION

pipeline_url_slash = firstpassUB >> pt.apply.doc_score(URL_slashes)

performance_url_slash = pt.Experiment(
    [firstpassUB, pipeline_url_slash],
    test_topics,
    test_qrels,
    eval_metrics=['map','P_5'],
    round={"map" : 4 ,"P_5" : 4 },
    names=["PL2",  "Rerank - URL Slash" ],
    baseline = 0
)
performance_url_slash





| | name | map | P_5 | map + | map - | map p-value | P_5 + | P_5 - | P_5 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PL2 | 0.2251 | 0.0693 | | | | | | |
| 1 | LambdaMART (LightGBM) | 0.0022 | 0.0 | 0.0 | 70.0 | 3.386873e-08 | 0.0 | 26.0 | 2.216676e-08 |


In [None]:
pipeline_url_slash.search('cryption').head()





Unnamed: 0,qid,docid,docno,url,body,score,query,rank
1,1,494954,G17-68-2584616,http://www.ncs.gov/n2/content/technote/tnv7n4/...,OFFICE OF THE MANAGER ...,7,cryption,0
0,1,434993,G15-50-1054100,http://cs-www.ncsl.nist.gov/publications/nistp...,"References[BOCK 88] Peter Bocker, ISDN The Int...",6,cryption,1
6,1,457024,G16-34-3764782,http://w3.access.gpo.gov/bxa/ear/txt/734.txt,Part 734--Scope of the Export Administration R...,6,cryption,2
7,1,424551,G15-11-3633588,http://cs-www.ncsl.nist.gov/publications/nistp...,Special Publication 800-41 Guidelines on Firew...,6,cryption,3
8,1,427549,G15-22-3805523,http://cs-www.ncsl.nist.gov/publications/nistp...,Security Issues in the Database Language SQLW....,6,cryption,4
3,1,418765,G14-90-3191980,http://cs-www.ncsl.nist.gov/publications/nistb...,November 1997INTERNET ...,5,cryption,5
5,1,88187,G02-78-3621877,http://cs-www.ncsl.nist.gov/ipsec/papers/aes-d...,Network Working Group ...,5,cryption,6
9,1,567214,G20-36-0506919,http://cs-www.ncsl.nist.gov/staff/jansen/IEEEa...,I N T E L L I G E N T A G E N T SAgents for ...,5,cryption,7
2,1,515873,G18-46-1865362,http://socialsecurity.gov/employer/Repfal00.pdf,SSA / I R SSocial SecurityAdministrationIntern...,4,cryption,8
4,1,580563,G20-84-2477732,http://itos.gsfc.nasa.gov/ITOS/remcmd.pdf,Remote Commanding Documentation ...,4,cryption,9


#### (ii) URL-Slashes within an LTR model

Now you should evaluate your URL-slashes score as a feature within a new learned model. You can now answer the corresponding quiz questions.
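
Conceptually, the `**` (feature union) operator appends the value produced by each additional transformer to the document's existing feature vector. The sketch below illustrates that idea with plain lists and does not use PyTerrier; `feature_union` is a hypothetical helper and the URL is a made-up example.

```python
def feature_union(doc, *feature_fns):
    """Return the document's existing features followed by one value per function."""
    return list(doc.get("features", [])) + [float(fn(doc)) for fn in feature_fns]

doc = {
    "url": "http://example.gov/a/b/page.html",  # made-up URL
    "features": [12.7555, 3.0924, 0.0001],      # PL2, BM25 (title), PageRank (rounded)
}

# Append the slash count as a fourth feature.
combined = feature_union(doc, lambda d: d["url"].count("/"))
print(combined)
assert len(combined) == 4
```

A learned model trained on the combined vector therefore sees four inputs per document instead of three.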

In [None]:
# pipeline_url_slash_LTR = firstpassUB>>(stdfeatures **  pt.apply.doc_score(URL_slashes))
# pipeline_url_slash_LTR.search("chemical reactions").head(1).iloc[0]['features']





array([1.27555456e+01, 3.09240788e+00, 1.05668333e-04, 5.00000000e+00])

In [None]:
#YOUR SOLUTION

pipeline_url_slash_LTR = firstpassUB>>(stdfeatures **  pt.apply.doc_score(URL_slashes))

lmart_l_2 = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)
lmart_l_pipe_2 = pipeline_url_slash_LTR >> pt.ltr.apply_learned_model(lmart_l_2, form="ltr")
lmart_l_pipe_2.fit(tr_topics, tr_qrels, va_topics, va_qrels)

performance_url_slash_LTR = pt.Experiment(
    [lmart_l_pipe, lmart_l_pipe_2],
    test_topics,
    test_qrels,
    eval_metrics=['map','P_5'],
    round={"map" : 4 ,"P_5" : 4 },
    names=["LTR (3 features)",  "LTR (4 features)" ],
    baseline = 0
)

performance_url_slash_LTR







[1]	valid_0's ndcg@1: 0.277778
[2]	valid_0's ndcg@1: 0.296296
[3]	valid_0's ndcg@1: 0.333333
[4]	valid_0's ndcg@1: 0.351852
[5]	valid_0's ndcg@1: 0.388889
[6]	valid_0's ndcg@1: 0.407407
[7]	valid_0's ndcg@1: 0.407407
[8]	valid_0's ndcg@1: 0.592593
[9]	valid_0's ndcg@1: 0.666667
[10]	valid_0's ndcg@1: 0.685185






| | name | map | P_5 | map + | map - | map p-value | P_5 + | P_5 - | P_5 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LTR (3 features) | 0.4107 | 0.1147 | | | | | | |
| 1 | LTR (4 features) | 0.4577 | 0.1253 | 23.0 | 27.0 | 0.312466 | 9.0 | 6.0 | 0.349206 |


In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=400)
rf_pipe = pipeline_url_slash_LTR >> pt.ltr.apply_learned_model(rf)
rf_pipe.fit(tr_topics, tr_qrels)





In [None]:
# stage12.search("chemical reactions").head(1).iloc[0]["features"]





array([1.27555456e+01, 3.09240788e+00, 1.05668333e-04])

In [None]:
# pipeline_url_slash_LTR.search("chemical reactions").head(1).iloc[0]["features"]





array([1.27555456e+01, 3.09240788e+00, 1.05668333e-04, 5.00000000e+00])

In [None]:
rf.feature_importances_


array([0.27572948, 0.32656943, 0.35602904, 0.04167205])
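
Pairing each importance with a feature name makes the array easier to read. This is a small sketch using the values printed above; the labels are illustrative and assume the order of the three standard features followed by URL-Slashes.

```python
# Values copied from the `rf.feature_importances_` output above.
importances = [0.27572948, 0.32656943, 0.35602904, 0.04167205]
# Illustrative labels; the order is assumed to follow the feature pipeline.
names = ["PL2 (sample)", "BM25 (TITLE field)", "PageRank", "URL-Slashes"]

# Sort from most to least important.
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print("%-20s %.4f" % (name, value))
```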

## Q2 (b) URL Type Feature

In this block you should define your URL Type feature and test it.

In [None]:
#YOUR SOLUTION

def URL_type(row):
  # map the URL to a category based on its slash count and its ending;
  # URLs that match no rule (e.g. exactly 3 slashes) keep category 0
  category = 0
  count = URL_slashes(row)

  if count == 2:
    category = 1
  elif count == 4:
    category = 2
  elif count > 4 and (row['url'].endswith('index.html') or row['url'].endswith('/')):
    category = 3
  elif count > 4 and not row['url'].endswith('index.html'):
    category = 4
  return category

print(URL_type({'url':"http://www.atsdr.cdc.gov/toxprofiles/tp4-c1.pdf"}))

2


#### (i) URL Type as a PL2 re-ranker

Now you should evaluate your URL type score by re-ranking PL2. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION
pipeline_url_type = firstpassUB >> pt.apply.doc_score(URL_type)

performance_url_type = pt.Experiment(
    [firstpassUB, pipeline_url_type],
    test_topics,
    test_qrels,
    eval_metrics=['map','P_5'],
    round={"map" : 4 ,"P_5" : 4 },
    names=["PL2",  "Rerank - URL Type" ],
    baseline = 0
)
performance_url_type





| | name | map | P_5 | map + | map - | map p-value | P_5 + | P_5 - | P_5 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PL2 | 0.2251 | 0.0693 | | | | | | |
| 1 | Rerank - URL Type | 0.0014 | 0.0 | 0.0 | 70.0 | 2.996545e-08 | 0.0 | 26.0 | 2.216676e-08 |


In [None]:
pipeline_url_type.search('aaie').head()





Unnamed: 0,qid,docid,docno,url,body,score,query,rank
2,1,301428,G10-61-1895354,http://www.cdpr.ca.gov/docs/ipminov/01awards.htm,The 2001 IPM Innovators Awards The 2001 A...,4,aaie,0
3,1,375914,G13-35-3399834,http://www.cdpr.ca.gov/docs/pressrls/9pestinno...,Media Contacts: Glenn Brank 916/445-3974 ...,4,aaie,1
0,1,543541,G19-52-0995113,http://sunshine.jpl.nasa.gov/AAIE%20Site%20%c4...,AAIE Photo Album The Jet Propulsion Labo...,3,aaie,2
1,1,88532,G02-80-0379929,http://sunshine.jpl.nasa.gov/1rst%20Tier/Photo...,C ol o Photo Album This section is f...,2,aaie,3
4,1,51341,G01-54-3873617,http://goldmine.cde.ca.gov/calendar/,BODY { margin-left : 0; margin-...,2,aaie,4


#### (ii) URL Type within an LTR model

Now you should evaluate your URL type score as a feature within a new learned model. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION
pipeline_url_type_LTR = firstpassUB>>(stdfeatures **  pt.apply.doc_score(URL_type))

lmart_l_3 = lgb.LGBMRanker(task="train",
    min_data_in_leaf=1,
    min_sum_hessian_in_leaf=100,
    max_bin=255,
    num_leaves=7,
    objective="lambdarank",
    metric="ndcg",
    ndcg_eval_at=[1, 3, 5, 10],
    learning_rate= .1,
    importance_type="gain",
    num_iterations=10)
lmart_l_pipe_3 = pipeline_url_type_LTR >> pt.ltr.apply_learned_model(lmart_l_3, form="ltr")
lmart_l_pipe_3.fit(tr_topics, tr_qrels, va_topics, va_qrels)

performance_url_type_LTR = pt.Experiment(
    [lmart_l_pipe, lmart_l_pipe_3],
    test_topics,
    test_qrels,
    eval_metrics=['map','P_5'],
    round={"map" : 4 ,"P_5" : 4 },
    names=["LTR (3 features)",  "LTR (4 features)" ],
    baseline = 0
)

performance_url_type_LTR







[1]	valid_0's ndcg@1: 0.277778
[2]	valid_0's ndcg@1: 0.351852
[3]	valid_0's ndcg@1: 0.351852
[4]	valid_0's ndcg@1: 0.388889
[5]	valid_0's ndcg@1: 0.388889
[6]	valid_0's ndcg@1: 0.388889
[7]	valid_0's ndcg@1: 0.37037
[8]	valid_0's ndcg@1: 0.37037
[9]	valid_0's ndcg@1: 0.37037
[10]	valid_0's ndcg@1: 0.425926






| | name | map | P_5 | map + | map - | map p-value | P_5 + | P_5 - | P_5 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LTR (3 features) | 0.4107 | 0.1147 | | | | | | |
| 1 | LTR (4 features) | 0.4236 | 0.128 | 26.0 | 11.0 | 0.623603 | 6.0 | 1.0 | 0.058259 |


In [None]:
rf2 = RandomForestRegressor(n_estimators=400)
rf_pipe2 = pipeline_url_type_LTR >> pt.ltr.apply_learned_model(rf2)
rf_pipe2.fit(tr_topics, tr_qrels)





In [None]:
rf2.feature_importances_

array([0.26629381, 0.28105373, 0.33691975, 0.11573271])

# Q3 Proximity Search Feature

Now you will implement a new query-dependent feature, using the MinDist() function, as discussed in the specification. Do carefully read the specification before starting the implementation.

Hints:
 - Again, remember to use assertions to test your feature implementations.
 - Refer to the PyTerrier learning to rank documentation concerning `feature_importances_` for obtaining feature importances.

As mentioned in the specification, you should implement a function called avgmindist(), which takes the text of the query and the text of the document, and returns a score for the document, i.e. it must conform to the following Python specification:
```python
def avgmindist(query : str, document : str) -> float
```

NB: There are specific requirements for your implementations of MinDist() and avgmindist() that are detailed in the specification.
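
As an illustration of the general idea only, the sketch below computes the smallest distance, in token positions, between occurrences of two terms. The function name `min_dist`, the whitespace tokenisation, the lowercasing, and the return value of `None` for a missing term are all assumptions made here. They are not the requirements of the specification, which takes precedence.

```python
def min_dist(term_a, term_b, tokens):
    """Smallest absolute difference between positions of term_a and term_b.

    Returns None if either term does not occur (an assumption of this sketch).
    """
    best = None
    last_a = last_b = None
    for pos, tok in enumerate(tokens):
        # Remember the most recent position of each term.
        if tok == term_a:
            last_a = pos
        if tok == term_b:
            last_b = pos
        # Once both terms have been seen, the closest pair always involves
        # the most recent occurrence of one of them.
        if last_a is not None and last_b is not None:
            dist = abs(last_a - last_b)
            if best is None or dist < best:
                best = dist
    return best

tokens = "the fermilab staff directory lists fermilab employees".lower().split()
print(min_dist("fermilab", "directory", tokens))  # 2
print(min_dist("fermilab", "webcam", tokens))     # None
```

A single pass that tracks the latest position of each term avoids comparing every pair of occurrences.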

In [None]:
#YOUR AVGMINDIST IMPLEMENTATION


  

def avgmindist(query, document ):
 
 
  return 0.0

You should test your implementation yourself; however, to allow us to verify your implementation, we have created 9 test cases. Please run `run_test_cases()` and use its responses to answer the relevant quiz questions.



In [None]:
i = index.getMetaIndex().getAllItems(567257)
i[3]

'World\n          Wide Web at Fermilab     |     World\n          Wide Web Group     |     Computing\n          Division     |     Fermilab\n          at Work     |     Fermilab\n          Home     \n         ______________________________________________________________________________________________________________  \n         Computing Division \n         \n       \n     \n    \n      Professional\n    Home Pages At Fermilab  \n      \n     \n    \n     Employees and users at Fermilab may wish to have a  ;personal ;\n      home page which lists professional information about themself and links to\n      their projects and papers. Home page authors must read and follow the\n        Fermilab\n        Policy on Computing .  \n     These pages are supported by the Computing Division. Your\n      professional home page may be served from your department or experiment\n      web server or you may choose to have it served from  fnalu .\n     \n     If you wish to restrict access to your w

In [None]:
#DO NOT ALTER THIS CELL
TEST_CASES = [
  ('fermilab directory', 45, 567257), #1
  ('webcam', 45, 567257), #2
  ('DOM surface', 384034, 388292), #3
  ('DOM surface', 45, 384034), #4
  ('DOM surface document', 388292, 384034), #5
  ('DOM software AMANDA', 639302, 384034), #6
  ('fermilab directory', 388292, 384034), #7
  ('trigger data', 596532, 639302), #8
  ('underlying hardware', 384034, 333649) #9
]

def run_test_cases():
  docno=0
  body=3
  for i, (query, docid1, docid2) in enumerate(TEST_CASES):
    meta1 = index.getMetaIndex().getAllItems(docid1)
    meta2 = index.getMetaIndex().getAllItems(docid2)
    s1 = avgmindist(query, meta1[body])
    s2 = avgmindist(query, meta2[body])
    if s1 > s2:
      result = meta1[docno]
      cmpD = "%s > %s" % (meta1[docno],meta2[docno])
    elif s2 > s1:
      result = meta2[docno]
      cmpD = "%s > %s" % (meta2[docno],meta1[docno])
    else:
      result = "EQUAL"
      cmpD = "%s == %s" % (meta1[docno],meta2[docno])
    print("TEST CASE %d result %s " % (i+1, result))

run_test_cases()

You should now integrate your avgmindist() function into a new LTR model, and compare its MAP & P@5 performance to the LTR baseline. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

# Q4 A 5-feature Learning-to-Rank Model

You will now experiment with the LightGBM LambdaMART technique where you include both your added features (URL Type and AvgMinDist) alongside the 3 initial features, including the PL2 sample (5 features in total).

You need to learn a *new* model when using your final selection of 5 features.

Evaluate the performance of your resulting LTR system in comparison to the LTR baseline and answer the quiz questions.

In [None]:
#YOUR SOLUTION

# That's all Folks

**Submission Instructions:** Complete this notebook. All your answers to Exercise 3 must be submitted on the Exercise 3 Quiz instance on Moodle with your completed notebook (showing both your solutions and the results of their executions).


Your answers to the Quiz questions along with your .ipynb notebook file (showing code and outputs) must be submitted by **Friday 18th June 2021, 4:30pm**.