# Scaffolding project

_DSAIT4050: Information retrieval lecture, TU Delft_

Welcome to the **DSAIT4050: Information retrieval** lecture!

This project is a gentle introduction to information retrieval (IR). You do not need any prior knowledge of IR for this task; only some Python programming skills are required.

## Getting started

Under the hood, this notebook uses a library called **PyTerrier**. Please check out the first part of our _Introduction to PyTerrier_ series to learn how to install it. You do not need to interact with PyTerrier directly for now; instead, we provide simple utility functions you can use. Feel free to have a look at how these are implemented, but it's not required.

**Task 1**: Install PyTerrier (see the `01-setup.ipynb` notebook).

Now you should be able to import the utility functions. A dataset will be downloaded and indexed automatically (this will take a minute).


In [11]:
from util import search, evaluate, evaluate_all

Now that we have loaded the data, you can run search queries. For example:


In [6]:
search("what is the meaning of life")



| | qid | docid | docno | text | rank | score | query |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 284036 | 327334_2 | If life did not suck sometime it would not be ... | 0 | 25.241914 | what is the meaning of life |
| 1 | 1 | 24563 | 514843_5 | To live until we die,and have a meaningful. li... | 1 | 24.42892 | what is the meaning of life |
| 2 | 1 | 183199 | 3286609_30 | To make my mom and dad's life meaningful. | 2 | 24.42892 | what is the meaning of life |
| 3 | 1 | 338875 | 1977360_8 | Change it to " What makes life meaningful?" | 3 | 24.42892 | what is the meaning of life |
| 4 | 1 | 146534 | 422602_7 | to have a meaningful life or a life of meaning... | 4 | 22.001176 | what is the meaning of life |
| 5 | 1 | 37272 | 3770850_1 | Oh... I get it.... "thes"?. . Thes. is the abb... | 5 | 21.88286 | what is the meaning of life |
| 6 | 1 | 142099 | 3352535_8 | whats the purpose of life? | 6 | 21.260102 | what is the meaning of life |
| 7 | 1 | 169083 | 2534143_13 | whats life with no exitment! | 7 | 21.260102 | what is the meaning of life |
| 8 | 1 | 338872 | 1977360_5 | Perhaps "what can you do to make your life mor... | 8 | 21.04054 | what is the meaning of life |
| 9 | 1 | 104996 | 4351526_3 | the meaning of life is whatever you ascribe to... | 9 | 19.934543 | what is the meaning of life |


The result is a list of ten documents from the corpus, ordered by how relevant they are to our query according to the search engine: `rank` 0 is the best match, and `score` decreases as the rank increases.

## Query rewriting

The goal of this task is to come up with a way of **rewriting queries** such that the search engine can "understand" them better.

In order to do this, let's first take a look at some example queries from our dataset. We represent these queries using a `pandas.DataFrame`, where the first column corresponds to the **query ID** and the second column corresponds to the **query**:


In [7]:
from scaffolding.util import DATASET
import pandas as pd

example_queries = pd.DataFrame(
    [
        [
            "443848",
            "does anybody know where i could get a free guide on how to train a siberian husky",
        ],
        [
            "1783010",
            "what is blaphsemy",
        ],
        [
            "2838988",
            "how can i get a cork out of not into a wine bottle without a corkscrew",
        ],
    ],
    columns=["qid", "query"],
)

antique/test/non-offensive documents: 100%|██████████| 403666/403666 [00:35<00:00, 11382.04it/s]


Since these queries are taken from the dataset, we can **evaluate the performance** of our search engine on these queries. This means that we know which documents the system should retrieve for each query.

You can use the following evaluation function to do this. It takes your queries and returns a score (mean average precision; you will learn about this later). For now, all you need to know is that the higher this score, the better the system works.

Let's evaluate the queries we have:


In [8]:
print("score:", evaluate(example_queries))

score: 0.07906002902973568
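
For intuition about the score above, the following sketch computes mean average precision on a toy example. It is not the code behind `evaluate`: the helper names, the toy rankings, and the relevance labels are all made up for illustration, and the real evaluation may differ in details such as rank cutoffs.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list of document IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this position: relevant documents seen so far / position.
            precision_sum += hits / position
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision values."""
    if not rankings:
        return 0.0
    scores = [
        average_precision(ranking, relevant.get(qid, set()))
        for qid, ranking in rankings.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two queries with three retrieved documents each.
toy_rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
toy_relevant = {"q1": {"d1", "d3"}, "q2": {"d5"}}

print("toy MAP:", mean_average_precision(toy_rankings, toy_relevant))
```

A relevant document contributes more when it appears near the top of the ranking, which is why moving relevant documents up raises the score.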


Now it's up to you to figure out if and how it's possible to make the search engine perform better on these queries. How would you query a search engine if you wanted to know about these topics? Experiment a bit.

**Task 2**: Try to manually come up with ways to rewrite or reformulate the queries so the performance improves.

**Important**: Make sure that the query IDs match! Otherwise, evaluation will not work.
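
A small sanity check can catch mismatched query IDs before you evaluate. The helper below is a hypothetical sketch (it is not part of `util`) and assumes the IDs are comparable as strings, as in the example queries.

```python
from typing import Iterable


def check_query_ids(original_ids: Iterable[str], rewritten_ids: Iterable[str]) -> None:
    """Raise ValueError if the two collections do not contain the same query IDs."""
    original = set(map(str, original_ids))
    rewritten = set(map(str, rewritten_ids))
    missing = original - rewritten
    unexpected = rewritten - original
    if missing or unexpected:
        raise ValueError(
            f"query IDs differ: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )


# Passes silently because both lists contain the same IDs.
check_query_ids(["443848", "1783010", "2838988"], ["2838988", "443848", "1783010"])
```

With DataFrames like the ones in this notebook, you could pass the `qid` columns directly, for example `check_query_ids(example_queries["qid"], example_queries_rewritten["qid"])`.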


In [9]:
example_queries_rewritten = pd.DataFrame(
    [
        [
            "443848",
            "guide train siberian husky dog",
        ],
        [
            "1783010",
            "what is blasphemi",
        ],
        [
            "2838988",
            "remove get cork from wine bottle without corkscrew",
        ],
    ],
    columns=["qid", "query"],
)

print("score after rewriting:", evaluate(example_queries_rewritten))

score after rewriting: 0.11060433045979755


## An automatic approach

In this last part, we'll try to come up with an automatic approach to query rewriting. Use your findings from Task 2 for this.

**Task 3**: Implement a function that automatically rewrites any input query.

You can use any approach or library you want for this task. However, keep in mind that simple ideas often work well!
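
As one example of a simple idea, the sketch below drops filler words from a query, similar to the manual rewrites in Task 2. The word list and the function name are illustrative assumptions, not a recommended configuration, and whether this improves the score has to be checked with `evaluate`.

```python
# Hypothetical, hand-picked filler words; not derived from the dataset.
FILLER_WORDS = frozenset(
    {
        "a", "an", "the", "i", "on", "to", "of", "is", "does", "do",
        "anybody", "know", "where", "what", "could", "can", "get", "how",
    }
)


def drop_filler_words(query: str) -> str:
    """Remove filler words; fall back to the original query if nothing is left."""
    kept = [word for word in query.split() if word not in FILLER_WORDS]
    return " ".join(kept) if kept else query


print(
    drop_filler_words(
        "does anybody know where i could get a free guide on how to train a siberian husky"
    )
)
```

The fallback avoids sending an empty query to the search engine when every word is on the list.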


In [121]:
import re
import nltk
import pyterrier as pt
import pkg_resources
from symspellpy import SymSpell, Verbosity
from nltk.stem.snowball import SnowballStemmer

nltk.download('stopwords')

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, 0, 1)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_bigram_dictionary(dictionary_path, 0, 2)

stemmer = SnowballStemmer("english", ignore_stopwords=True)

tokenizer = pt.java.autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()


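def fix_not_apostrophe(query: str) -> str:
    # Rejoin "n't" contractions whose apostrophe appears as a space in the
    # queries, e.g. "doesn t " -> "doesnt ". The trailing space in the pattern
    # means a contraction at the very end of a query is left unchanged.
    return re.sub(pattern=r"n t ", repl="nt ", string=query)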


def rewrite_query(query: str) -> str:
    fixed = fix_not_apostrophe(query)
    spellchecked_words = []

    for word in fixed.split():
        if len(word) < 3:
            spellchecked_words.append(word)
        else:
            suggestions = sym_spell.lookup(word, Verbosity.ALL, max_edit_distance=2)
            added = 0

            for i in range(min(2, len(suggestions))):
                if suggestions[i].term == word:
                    spellchecked_words.append(word)
                    added += 1
                    break
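                if "'" in suggestions[i].term:
                    # Skip suggestions that contain an apostrophe (e.g. "doesn't"),
                    # since the queries themselves contain none. The skip still
                    # counts as handled, so the fallback below is not triggered.
                    added += 1
                    continue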
                spellchecked_words.append(suggestions[i].term)
                added += 1

            if added == 0:
                spellchecked_words.append(word)

    stemmed_words = [stemmer.stem(word) for word in spellchecked_words]

    improved_query = " ".join(stemmed_words)

    # suggestions = sym_spell.lookup_compound(fixed, ignore_non_words=True, max_edit_distance=2)
    # spellchecked_query = suggestions[0].term
    print(f"{query} -> {fixed} -> {improved_query}")

    return improved_query

[nltk_data] Downloading package stopwords to /home/jim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
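
The pattern `n t ` in `fix_not_apostrophe` requires a trailing space, so it does not match a contraction at the very end of a query. A possible variant, sketched here with a word boundary instead of the trailing space, is shown below; the function name is hypothetical, and the pattern can still misfire on rare inputs where a word ending in "n" is followed by a standalone "t".

```python
import re

# "n t" preceded by a word character and followed by a word boundary,
# so the standalone "t" may be followed by a space or by the end of the query.
_NT_CONTRACTION = re.compile(r"(?<=\w)n t\b")


def join_nt_contractions(query: str) -> str:
    """Rejoin contractions such as "doesn t" or "can t" into "doesnt" / "cant"."""
    return _NT_CONTRACTION.sub("nt", query)


print(join_nt_contractions("why doesn t the water fall off earth"))
print(join_nt_contractions("i just can t"))
```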


This time, we'll evaluate on _all_ queries in the dataset. This will give us a more general result:


In [57]:
print("score:", evaluate_all())

score: 0.06179994498738492


Are you able to improve the overall performance using your rewriting approach?


In [122]:
print("score after rewriting:", evaluate_all(rewrite_query))

how can we get concentration onsomething -> how can we get concentration onsomething -> how can we get concentr someth
why doesn t the water fall off earth if it s round -> why doesnt the water fall off earth if it s round -> why does the water fall off earth if it s round
how do i determine the charge of the iron ion in fecl3 -> how do i determine the charge of the iron ion in fecl3 -> how do i determin the charg of the iron ion in feel felt
i have mice how do i get rid of them humanely -> i have mice how do i get rid of them humanely -> i have mice how do i get rid of them human
what does see leaflet mean on ept pregnancy test -> what does see leaflet mean on ept pregnancy test -> what does see leaflet mean on est pet pregnanc test
what is innate immunity -> what is innate immunity -> what is innat immun
how can i lose 30 pounds by june3 -> how can i lose 30 pounds by june3 -> how can i lose 30 pound by june june
what are the words to write the sound of raindrops moving train scribbl
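
In the log above, tokens that contain digits, such as `fecl3` and `june3`, are each replaced by two dictionary words. One possible guard, sketched here under the assumption that such tokens are better left untouched, wraps the spelling correction in a check. The names `correct_safely` and `fake_corrector` are hypothetical; `fake_corrector` only stands in for a real lookup such as the SymSpell call used earlier.

```python
from typing import Callable, List


def correct_safely(word: str, correct: Callable[[str], List[str]]) -> List[str]:
    """Return the word unchanged if it is short or contains a digit; otherwise correct it."""
    if len(word) < 3 or any(ch.isdigit() for ch in word):
        return [word]
    # Fall back to the original word if the corrector returns nothing.
    return correct(word) or [word]


def fake_corrector(word: str) -> List[str]:
    """Hypothetical stand-in for a spelling lookup, with one hard-coded correction."""
    return {"blaphsemy": ["blasphemy"]}.get(word, [word])


for token in ["fecl3", "june3", "blaphsemy"]:
    print(token, "->", correct_safely(token, fake_corrector))
```

Whether this guard helps the overall score is an open question that `evaluate_all` can answer.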