# Scaffolding project

Welcome to the __IN4325: Information Retrieval__ lecture!

This project is a gentle introduction to information retrieval (IR). You do not need any prior knowledge of IR for this task; only some Python programming skills are required.

## Getting started
Under the hood, this notebook uses a library called __PyTerrier__. Please check out the first part of our _Introduction to PyTerrier_ series to learn how to install PyTerrier. However, you do not need to interact with PyTerrier directly for now; instead, we provide simple utility functions you can use. Feel free to have a look at how these are implemented, but it's not required.

__Task 1__: Install PyTerrier (see the `01-setup.ipynb` notebook).

Now you should be able to import the utility functions. The first time you do this, a dataset will be downloaded and indexed automatically (this will take a minute). If you have any issues running this cell, try removing the `index` directory (if it exists) and restarting the kernel of this notebook.

In [2]:
from util import search, evaluate, evaluate_all

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8



Now that we have loaded the data, you can run search queries. For example:

In [3]:
search("what is the meaning of life")

| | qid | docid | docno | text | rank | score | query |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 252644 | 3702072_14 | there is no meaning to life, it is only the af... | 0 | 15.56633 | what is the meaning of life |
| 1 | 1 | 217423 | 1761229_12 | the meaning of life is to search for the meani... | 1 | 15.334672 | what is the meaning of life |
| 2 | 1 | 169345 | 3619417_9 | The meaning of life is simple: "There is life ... | 2 | 15.015069 | what is the meaning of life |
| 3 | 1 | 91240 | 1142437_3 | Life doesn't require meaning. Life need have ... | 3 | 14.797602 | what is the meaning of life |
| 4 | 1 | 203572 | 3958274_0 | no meaning.. what else have meaning?. if meani... | 4 | 14.732545 | what is the meaning of life |
| 5 | 1 | 121641 | 239942_7 | Life only takes on the meaning that we give it... | 5 | 14.730245 | what is the meaning of life |
| 6 | 1 | 102700 | 783019_0 | Biology is the study of life. Bio means life a... | 6 | 14.471157 | what is the meaning of life |
| 7 | 1 | 398376 | 1630688_0 | Donia means : life on earth... . hayat means l... | 7 | 14.407119 | what is the meaning of life |
| 8 | 1 | 183988 | 2056871_14 | to find out why people constantly ask this que... | 8 | 14.205518 | what is the meaning of life |
| 9 | 1 | 169340 | 3619417_4 | life mean to do some thing for other | 9 | 14.161821 | what is the meaning of life |


What you get here is a list of ten documents from the corpus that are ordered by how relevant they are to our query (according to the search engine).

## Query rewriting
The goal of this task is to come up with a way of __rewriting queries__ such that the search engine can "understand" them better.

In order to do this, let's first take a look at some example queries from our dataset. We represent these queries using a `pandas.DataFrame`, where the first column corresponds to the __query ID__ and the second column corresponds to the __query__:

In [4]:
import pandas as pd

example_queries = pd.DataFrame(
    [
        [
            "443848",
            "does anybody know where i could get a free guide on how to train a siberian husky",
        ],
        [
            "1783010",
            "what is blaphsemy",
        ],
        [
            "2838988",
            "how can i get a cork out of not into a wine bottle without a corkscrew",
        ],
    ],
    columns=["qid", "query"],
)

Since these queries are taken from the dataset, we can __evaluate the performance__ of our search engine on these queries. This means that we know which documents the system should retrieve for each query.

You can use the following evaluation function to do this. It takes your queries and returns a score (mean average precision -- you will learn about this later). For now, all you need to know is that the higher this score, the better the system works.
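To make the metric a little less abstract, here is a minimal sketch of how average precision can be computed for a single ranked list, and how the mean over several queries gives mean average precision. The function names and the toy data are made up for illustration; the `evaluate` helper may differ in its details (for example, in how it handles cutoffs).

```python
def average_precision(ranked_docnos, relevant_docnos):
    """Average precision of one ranked list, given the relevant documents."""
    relevant = set(relevant_docnos)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            # Precision at the rank of each relevant document
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(ranking, rel) for ranking, rel in runs) / len(runs)


# Toy data (hypothetical document IDs)
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),              # AP = (1/2) / 1
]
print(mean_average_precision(toy_runs))
```

Relevant documents that appear near the top of the ranking contribute more to the score, which is why rewriting a query so that good documents move up can improve it.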

Let's evaluate the queries we have:

In [5]:
print("score:", evaluate(example_queries))

score: 0.5223984383435115


Now it's up to you to figure out if and how it's possible to make the search engine perform better on these queries. How would you query a search engine if you wanted to know about these topics? Experiment a bit.

__Task 2__: Try to manually come up with ways to rewrite or reformulate the queries so the performance improves.

__Important__: Make sure that the query IDs match! Otherwise, evaluation will not work.

In [6]:
example_queries_rewritten = pd.DataFrame(
    [
        [
            "443848",
            "free guide on how to train siberian husky", # TODO: add rewritten query here
        ],
        [
            "1783010",
            "blasphemy what", # TODO: add rewritten query here
        ],
        [
            "2838988",
            "how can i get a cork out of wine bottle without corkscrew", # TODO: add rewritten query here
        ],
    ],
    columns=["qid", "query"],
)
print("score after rewriting:", evaluate(example_queries_rewritten))

score after rewriting: 0.538588696112892


## An automatic approach

In this last part, we'll try to come up with an automatic approach to query rewriting. Use your findings from Task 2 for this.

__Task 3__: Implement a function that automatically rewrites any input query.

You can use any approach or library you want for this task. However, keep in mind that simple ideas often work well!
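
As one example of such a simple idea, the sketch below drops a small, hand-picked set of function words from the query. The word list and the function name `drop_stopwords` are assumptions made for illustration; whether this helps on this dataset is something to check with the evaluation function.

```python
# A deliberately small, hand-picked stopword list (an assumption for illustration).
STOPWORDS = {"a", "an", "the", "i", "is", "of", "on", "to", "what", "how", "can", "do", "does"}


def drop_stopwords(query: str) -> str:
    """Remove stopwords, but keep the original query if nothing would remain."""
    kept = [word for word in query.lower().split() if word not in STOPWORDS]
    return " ".join(kept) if kept else query


print(drop_stopwords("what is the meaning of life"))
```

Falling back to the original query avoids sending an empty string to the search engine when a query consists only of stopwords.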

In [7]:
# Cell for importing python packages to the current kernel

import sys
!{sys.executable} -m pip install pyenchant
!{sys.executable} -m pip install autocorrect
!{sys.executable} -m pip install nltk

import nltk
nltk.download('words')




[nltk_data] Downloading package words to /home/steve/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

Notes from lecture about queries:

- Average query length is 30 on specific websites, 2.8 for more general searches
    - Maybe make query length depend on specificity of words in the initial query
- Find similar earlier queries
- Noisy channel probabilistic model for spell checking
- Context improvement:
    - Use WordNet to find context (siberian husky vs nigerian husky for example)
    - Word co-occurrence in corpus for same improvement (more naive)
- Look if uppercase version of word is in system as acronym
- Markov models (for autocomplete)
- mAP as a grading metric
- Seven vs Sven paper implementation:
    - Output -> Fully Connected Layer -> Recurrent Layer -> Embeddings -> Words
    -        -> Features
- Look into multi-language searching
- Learning Multiple Intent Representations for Search Queries (llm paper)
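
The spell-checking idea from these notes can be sketched in the style of Peter Norvig's well-known toy corrector, which is a heavily simplified noisy channel model: known words that need fewer edits are preferred (a crude error model), and ties are broken by how often a word occurs (a crude language model). The word counts below are made up; in practice they would come from the indexed collection.

```python
from collections import Counter
import string

# Made-up term frequencies; in practice, count the words in the corpus.
WORD_COUNTS = Counter({"blasphemy": 12, "what": 900, "is": 2500, "husky": 40})


def edits1(word):
    """All strings that are one delete, transpose, replace or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(candidates):
    """Keep only candidates that occur in the vocabulary."""
    return {w for w in candidates if w in WORD_COUNTS}


def correct(word):
    """Prefer known words with fewer edits, then the most frequent candidate."""
    if word in WORD_COUNTS:
        return word
    one_edit = edits1(word)
    candidates = known(one_edit)
    if not candidates:
        # Nothing within one edit: try everything within two edits.
        candidates = known(e2 for e1 in one_edit for e2 in edits1(e1))
    if not candidates:
        return word  # Leave unknown words untouched
    return max(candidates, key=WORD_COUNTS.get)


print(" ".join(correct(w) for w in "what is blaphsemy".split()))
```

Building the vocabulary from the corpus itself, rather than from a general dictionary, keeps corrections within the set of words the search engine can actually match.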


Notes from lectures about ranking:

- Sample from documents that are retrieved when a query is executed to find words that can be used to expand said query.
    - Only grab documents that contain all keywords.
- Use bigram (sliding window of size 2) to find documents containing the same bigram
    - Split bigrams back into unigrams if no bigram is found from the query
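
The bigram idea from these notes could look roughly like the sketch below: slide a window of size 2 over the query, collect the documents that contain any of the resulting bigrams, and fall back to single words if no bigram matches. The toy documents and all function names are invented for illustration.

```python
def tokens(text):
    """Lowercase whitespace tokenization (kept deliberately simple)."""
    return text.lower().split()


def bigrams(words):
    """Sliding window of size 2 over a list of words."""
    return list(zip(words, words[1:]))


def match_documents(query, documents):
    """Return docnos matching a query bigram, else docnos matching any query word."""
    query_words = tokens(query)
    query_bigrams = set(bigrams(query_words))
    bigram_hits = [
        docno for docno, text in documents.items()
        if query_bigrams & set(bigrams(tokens(text)))
    ]
    if bigram_hits:
        return bigram_hits
    # Fallback: no bigram found anywhere, so match on unigrams instead.
    return [
        docno for docno, text in documents.items()
        if set(query_words) & set(tokens(text))
    ]


toy_documents = {
    "d1": "training a siberian husky takes patience",
    "d2": "a husky is a sled dog",
    "d3": "wine bottles and corks",
}
print(match_documents("siberian husky", toy_documents))
print(match_documents("husky wine", toy_documents))
```

The first query matches only the document that contains the exact phrase, while the second has no matching bigram and therefore falls back to matching individual words.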


In [8]:
import os
import json
import pandas as pd
import pyterrier as pt
import numpy as np
import re
from autocorrect import Speller
from nltk.corpus import words

data = pt.datasets.get_dataset("irds:antique/test/non-offensive")
corp_iter = data.get_corpus_iter(verbose=False)

idf_map = {}
doc_lens = []

p = os.path.join(os.getcwd(), 'my_index')
if not os.path.exists(p):
    os.mkdir(p)

def seperate_words(quer):
    return [re.sub(r"[!.,?']$", "", word).lower() for word in quer.split(' ')] # Split words in document text
        
# Build an inverted index in memory (word -> {docno: count}) and write
# the per-document term counts to disk as one JSON file per document.
for item in corp_iter:
    text_words = seperate_words(item['text'])
    doc_lens.append(len(text_words)) # Store document lengths; their count is the total number of documents used for IDF
    
    docno = item['docno']
    
    f_path = os.path.join(p, docno + '.json')
    
    if os.path.exists(f_path):
        os.remove(f_path)
    
    doc_map = {}
    
    for word in text_words:
        entr = idf_map.setdefault(word, {}) # Add word to mapping
        entr.setdefault(docno, 0) # Update word occurrence
        entr[docno] = entr[docno] + 1
        doc_map.setdefault(word, 0)
        doc_map[word] = doc_map[word] + 1
        
    with open(f_path, 'w') as f:
        json.dump(doc_map, f)

# Add an IDF score to each entry (the IDF variant commonly used with BM25)
for key in list(idf_map.keys()):
    entr = idf_map[key]
    docs_containing = len(entr.keys()) # Number of documents containing the word
    total_docs = len(doc_lens) # Total number of documents
    idf = np.log(1 + ((total_docs - docs_containing + 0.5) / (docs_containing + 0.5)))
    idf_map[key] = (idf, entr)
    
spell = Speller()
word_set = set(words.words())
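
The IDF weight computed in the cell above has the form commonly used with BM25. As a small, self-contained illustration (toy documents, `math.log` instead of `np.log`, and hypothetical names), a word that occurs in few documents receives a higher weight than one that occurs in most of them:

```python
import math


def bm25_idf(total_docs: int, docs_containing: int) -> float:
    """Same expression as in the indexing cell, using the standard library."""
    return math.log(1 + (total_docs - docs_containing + 0.5) / (docs_containing + 0.5))


# Toy corpus (made up for illustration)
toy_docs = ["the husky runs", "the cork is stuck", "the meaning of life"]

# Count in how many documents each word occurs
document_frequency = {}
for doc in toy_docs:
    for word in set(doc.split()):
        document_frequency[word] = document_frequency.get(word, 0) + 1

for word in ("the", "husky"):
    print(word, round(bm25_idf(len(toy_docs), document_frequency[word]), 3))
```

Because of the `1 +` inside the logarithm, the weight stays positive even for a word that appears in every document.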

In [22]:
def score_word(local_word):
    # IDF of the word, or 0 if the word does not occur in the corpus
    return idf_map[local_word][0] if local_word in idf_map.keys() else 0

def rewrite_query(query: str) -> str:
    parts = [part for part in seperate_words(query)]
    # Sort the query words by IDF, lowest (least informative) first
    scored_parts = sorted([(score_word(word), index, word) for index, word in enumerate(parts)])
    
    total_score = sum([part[0] for part in scored_parts])
    boundary = total_score * 0.1
    
    # Find the first position where the cumulative IDF exceeds 10% of the total
    # and keep everything from one position before it onwards
    sub_split = []
    for i, part in enumerate(scored_parts):
        summation = sum([s[0] for s in scored_parts[0:(i+1)]])
        if summation > boundary:
            sub_split = scored_parts[max(0, (i-1)):len(scored_parts)]
            break
    
    # Restore the original word order of the kept words
    sub_split = [item[1] for item in sorted([(part[1], part[2]) for part in sub_split])]
    
    # Count, for each document, how many of the kept query words it contains
    query_co_oc = {}
    for word in sub_split:
        if word not in idf_map.keys():
            continue
        for key in idf_map[word][1].keys():
            query_co_oc.setdefault(key, 0)
            query_co_oc[key] = query_co_oc[key] + 1
    query_co_oc = list(sorted(list(query_co_oc.items()), key=lambda x: x[1], reverse=True))
    top_k = query_co_oc[0:10]
    
    # Merge the term counts of the top 10 documents: word -> [set of docnos, total count]
    merged_doc_map = {}
    for pair in top_k:
        f_path = os.path.join(p, pair[0] + '.json')
        with open(f_path, 'r') as f:
            d = json.load(f)
            for key in d.keys():
                merged_doc_map.setdefault(key, [set(), 0])
                merged_doc_map[key][0].add(pair[0])
                merged_doc_map[key][1] = merged_doc_map[key][1] + d[key]
    
    # Keep words that occur in more than 5 of the top documents, lowest total count first
    sorted_co_oc = list(sorted(filter(lambda y: len(y[1][0]) > 5, merged_doc_map.items()), key=lambda x: x[1][1], reverse=False))
    
    select_count = 30 - len(sub_split)
    selected = list(filter(lambda x: len(x) > 0, [re.sub(r"['\'!?.,]", "", tu[0]) for tu in sorted_co_oc]))[0:select_count]
    
    query = " ".join(sub_split)
    # Candidate expansion terms; computed here but not appended to the returned query
    addition = " ".join(selected)
    
    return query

rewrite_query("does anybody know where i could get a free guide on how to train a siberian husky")

'does anybody know where could get free guide on how train siberian husky'

This time, we'll evaluate on _all_ queries in the dataset. This will give us a more general result:

In [19]:
print("score:", evaluate_all())

score: 0.4570052947026987


Are you able to improve the overall performance using your rewriting approach?

In [23]:
print("score after rewriting", evaluate_all(rewrite_query))

score after rewriting 0.45670174444542516
