# MSMARCO Processing
This notebook is an interactive environment where I index MSMARCO Document with PyTerrier and test how PyTerrier works.
Each section documents its own purpose.

In [1]:
import pyterrier as pt

In [2]:
import os
from itertools import islice

## STEP 0. The absolute path
The backend needs this absolute path because it does not process the entire collection itself.
Instead, I process the collection here, so that the backend uses the already processed version. Thus, I just pass it the
absolute path to where this collection is located. I tried using a relative path at first, but it did not work, for a reason I did not identify.
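
One possible explanation for the relative-path trouble (an assumption, not something I verified) is that a relative path is resolved against the current working directory, which can differ between this notebook and the backend process. Below is a minimal sketch of resolving a path against an explicit base directory instead. `resolve_index_path` and its `base_dir` parameter are hypothetical names used only for illustration.

```python
import os

def resolve_index_path(relative_path, base_dir=None):
    """Return an absolute path for relative_path, resolved against base_dir.

    Hypothetical helper: if base_dir is omitted, the current working
    directory is used, which is also what os.path.abspath does on its own.
    """
    if base_dir is None:
        base_dir = os.getcwd()
    return os.path.abspath(os.path.join(base_dir, relative_path))

# The relative location below is only an example.
print(resolve_index_path("indices/msmarco-document"))
```

Passing the directory of the backend's entry script as `base_dir` would make the result independent of where the process was started from.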

In [5]:
#From https://stackoverflow.com/a/51523
absolute_indices_path = os.path.abspath("E:/code/ua/s2/info556/pyterrier_ui/backend/indices/msmarco-document")

## STEP 1. Get the dataset
We must first get the dataset. The [ir-datasets](https://ir-datasets.com/) site explains how to load msmarco-document through PyTerrier on [this page](https://ir-datasets.com/msmarco-document.html).

In [6]:
dataset = pt.get_dataset("irds:msmarco-document")

## STEP 2. Create an indexer that will process the index
This object works through the dataset. Note that below I left it pointing at "absolute_indices_path", where the already-downloaded index is located,
because downloading it again takes time.

PyTerrier provides documentation for [IterDictIndexer](https://pyterrier.readthedocs.io/en/latest/terrier/api.html#pyterrier.terrier.IterDictIndexer). PyTerrier also offers some Google Colab notebooks; I used [this one](https://colab.research.google.com/github/terrier-org/pyterrier/blob/master/examples/notebooks/indexing.ipynb) as a reference.

### What the fields do
I am not certain about every field, but here is my understanding of them.
 - text_attrs: The documentation says this sets which columns are indexed. I included the title as well as the body, since it might be relevant.
 - fields: The documentation says that it "allows application of weighting models such as BM25F". I was not sure whether that covers plain BM25 as well, or whether I would lose weighting altogether without it, which matters for the "Related Terms" bag, so I set it to "True" just in case.
 - blocks: Without this, I could use neither constraints nor phrase search, so I had to re-index with it set to "True".
 - meta: The documentation describes it as "what metadata for each document to record in the index". I think this is what keeps the three fields available at retrieval time.

In [13]:
# STEP 2: Create an indexer that captures blocks to allow phrase search (constraints)
indexer = pt.IterDictIndexer(
    absolute_indices_path,
    text_attrs=["body", "title"],
    fields=True,
    blocks=True, #Allows phrase search
    meta=["docno", "title", "url"],
    threads=1
) #"./indices/msmarco-document"

In [14]:
# Looking at an example just to make sure we capture everything right
example = next(iter(dataset.get_corpus_iter()))
print(example)

msmarco-document documents:   0%|                                                                                                                                  | 0/3213835 [00:00<?, ?it/s]

{'url': 'https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR', 'title': 'The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation.?', 'body': 'Science & Mathematics Physics\nThe hot glowing surfaces of stars emit energy in the form of electromagnetic radiation.?\nIt is a good approximation to assume that the emissivity e is equal to 1 for these surfaces.\nFind the radius of the star Rigel, the bright blue star in the constellation Orion that radiates energy at a rate of 2.7 x 10^32 W and has a surface temperature of 11,000 K. Assume that the star is spherical.\nUse σ =... show more\nFollow 3 answers\nAnswers\nRelevance\nRating\nNewest\nOldest\nBest Answer: Stefan-Boltzmann law states that the energy flux by radiation is proportional to the forth power of the temperature: q = ε · σ · T^4 The total energy flux at a spherical surface of Radius R is Q = q·π·R² = ε·σ·T^4·π·R² Hence the radius is R = √ ( Q / (ε·σ·T^4·π) ) = √ ( 2.7x10+32 W / (1 · 




## STEP 4. PROCESSING THE TEXT COLLECTION
The two functions below process the text. This part was troublesome at first:
I thought the indexer was unable to retrieve the text because of a formatting problem or some other error, since it reported "Indexed _ empty documents".
However, it turned out to work fine.

It did require me to close other processes on my computer, though, because the first attempt crashed.

In [16]:
# Return at most the first 500 characters of doc[prop], or "" if the key is missing.
# Trying to index the full text of the entire collection doesn't work:
# it is too much information and the operation never finishes.
def safe_text(doc, prop):
    return doc.get(prop, "")[:500]

In [5]:
# This generator yields each document in a shortened form.
# Before, it also helped me check that the documents were in the format PyTerrier expects.
# In the end, it looks a little redundant, but it still works just fine.
def get_shortened_documents():
    for doc in dataset.get_corpus_iter():
        yield {
            "docno": doc["docno"],
            "title": safe_text(doc, "title"),
            "body": safe_text(doc, "body"),
            "url": doc.get("url", "")
        }
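
To sanity-check the truncation without walking through the whole collection, `islice` can take just the first few records from the generator. This is a minimal sketch on a made-up toy corpus that stands in for `dataset.get_corpus_iter()`. The `limit` parameter and the `shorten` name are assumptions added for illustration (the cell above hard-codes 500).

```python
from itertools import islice

def safe_text(doc, prop, limit=500):
    """Return at most `limit` characters of doc[prop], or "" if the key is missing."""
    return doc.get(prop, "")[:limit]

# Toy stand-in for dataset.get_corpus_iter(); these records are invented.
toy_corpus = [
    {"docno": "D1", "title": "A title", "body": "x" * 2000, "url": "http://example.com/1"},
    {"docno": "D2", "body": "short body"},      # no title and no url
    {"docno": "D3", "title": "", "body": ""},   # no text at all
]

def shorten(docs):
    """Yield the shortened form of each document, as in the cell above."""
    for doc in docs:
        yield {
            "docno": doc["docno"],
            "title": safe_text(doc, "title"),
            "body": safe_text(doc, "body"),
            "url": doc.get("url", ""),
        }

# islice stops after two records, so the third document is never processed.
first_two = list(islice(shorten(toy_corpus), 2))
for record in first_two:
    print(record["docno"], len(record["title"]), len(record["body"]))
```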

In [18]:
#This is the actual indexing process. It takes a long time.
index_ref = indexer.index(get_shortened_documents())

msmarco-document documents:   0%|                                                                                                                      | 71/3213835 [00:00<7:52:14, 113.42it/s]



msmarco-document documents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3213835/3213835 [25:26<00:00, 2104.83it/s]


15:15:32.391 [main] WARN org.terrier.structures.indexing.Indexer -- Indexed 11401 empty documents


In [None]:
# If the index already exists at the absolute path, loading it instead of re-indexing saves some time.
index_ref = pt.IndexRef.of(absolute_indices_path)
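
Before choosing between re-indexing and loading, it can help to check whether an index is already on disk. This sketch assumes that a built Terrier index directory contains a `data.properties` file. `has_terrier_index` is a hypothetical helper name, and the temporary directory only simulates an index folder.

```python
import os
import tempfile

def has_terrier_index(index_dir):
    """Return True if index_dir looks like it holds a built Terrier index.

    Assumption: the presence of data.properties is used as the marker.
    """
    return os.path.isfile(os.path.join(index_dir, "data.properties"))

with tempfile.TemporaryDirectory() as tmp:
    before = has_terrier_index(tmp)
    # Simulate an index by creating an empty marker file.
    open(os.path.join(tmp, "data.properties"), "w").close()
    after = has_terrier_index(tmp)

print(before, after)
```

In this notebook the check would be applied to `absolute_indices_path` to decide whether to call `indexer.index(...)` or `pt.IndexRef.of(...)`.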

## Step 5: Creating an index on top of this indexing process
Now I can load an index from this reference; the index is what will be passed on to a "Retriever" object.

In [19]:
index = pt.IndexFactory.of(index_ref)



In [20]:
#Looking at the statistics to see if it processed anything.
print(index.getCollectionStatistics())

Number of documents: 3213835
Number of terms: 2729640
Number of postings: 123294510
Number of fields: 2
Number of tokens: 173995000
Field names: [body, title]
Positions:   true



## STEP 6. Creating the BM25 Retriever

In [21]:
bm25 = pt.terrier.Retriever(index, wmodel="BM25")

## STEP 7. Performing the first search
The very first search triggers another long processing step, so it is important to get it done before using the retriever in the server.
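
One way to get that first search out of the way is to run a throwaway query when the server starts. This is a minimal sketch only: `warm_up` is a hypothetical helper and `fake_search` is a stand-in so the example runs without an index. In the backend, the callable would be something like `bm25.search`.

```python
import time

def warm_up(search_fn, query="warm up"):
    """Run one throwaway query and return how long it took, in seconds."""
    start = time.perf_counter()
    search_fn(query)
    return time.perf_counter() - start

def fake_search(query):
    """Stand-in for a real retriever call; it only simulates a short delay."""
    time.sleep(0.01)
    return []

elapsed = warm_up(fake_search)
print(f"first search took {elapsed:.3f}s")
```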

In [37]:
search_results = bm25.search("+\"mario bros\"")

In [38]:
#Just visualizing what the results look like
search_results

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,634373,D3066401,0,50.862053,"+""mario bros"""
1,1,1301281,D1999271,1,50.464653,"+""mario bros"""
2,1,109112,D2498469,2,49.908528,"+""mario bros"""
3,1,1682776,D3309046,3,49.602092,"+""mario bros"""
4,1,2464639,D2338450,4,48.710064,"+""mario bros"""
...,...,...,...,...,...,...
124,1,2581731,D15021,124,25.943301,"+""mario bros"""
125,1,2740548,D2128771,125,25.049945,"+""mario bros"""
126,1,125355,D1122728,126,24.542866,"+""mario bros"""
127,1,2247587,D2362656,127,24.216066,"+""mario bros"""


In [34]:
#Another example of a previous search I did.
search_results[:5]

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,1337154,D1197453,0,25.711631,+dancing with the +stars
1,1,3115804,D546002,1,25.699398,+dancing with the +stars
2,1,2288669,D2922614,2,25.350747,+dancing with the +stars
3,1,1480379,D2828163,3,25.10623,+dancing with the +stars
4,1,2230419,D2466103,4,25.098821,+dancing with the +stars


# PART 2: Inspecting the results and how to show them
This part is about figuring out how to get at the result data so that I can return the search results to the frontend.

In [24]:
#"get_text" from https://pyterrier.readthedocs.io/en/latest/text.html
text_getter = pt.text.get_text(
    indexlike=dataset,
    metadata=["body", "title", "url"]
) 

In [25]:
search_data = text_getter(search_results[:20])

In [26]:
search_data

Unnamed: 0,qid,docid,docno,rank,score,query,body,title,url
0,1,1337154,D1197453,0,25.711631,+dancing with the +stars,Benwf83 339 Contributions\nWho won Season 4 of...,Who won Season 4 of Dancing with the Stars?,http://www.answers.com/Q/How_many_seasons_a_ye...
1,1,3115804,D546002,1,25.699398,+dancing with the +stars,Home Celebrity Q&A ‘Dancing With the Stars’ Sa...,âDancing With the Starsâ Salaries,http://americanprofile.com/articles/dancing-wi...
2,1,2288669,D2922614,2,25.350747,+dancing with the +stars,How Much Money Will the Stars Make on Dancing ...,How Much Money Will the Stars Make on Dancing ...,http://gawker.com/5627611/how-much-money-will-...
3,1,1480379,D2828163,3,25.10623,+dancing with the +stars,Who Went Home On Dancing with the Stars 2014 L...,Who Went Home On Dancing with the Stars 2014 L...,http://gossipandgab.com/43327/who-went-home-on...
4,1,2230419,D2466103,4,25.098821,+dancing with the +stars,"On Dancing With the Stars tonight, it’s Finale...",Dancing With the Stars Freestyle Night (Recap ...,http://guardianlv.com/2014/05/dancing-with-the...
5,1,2725870,D2566136,5,25.047994,+dancing with the +stars,When Does Dancing with the Stars 2017 Start?\n...,When Does Dancing with the Stars 2017 Start? S...,http://gossipandgab.com/130056/when-does-danci...
6,1,854279,D2566138,6,24.933088,+dancing with the +stars,When Does Dancing with the Stars 2016 Start?\n...,When Does Dancing with the Stars 2016 Start? S...,http://gossipandgab.com/99798/when-does-dancin...
7,1,1641339,D1690723,7,24.746394,+dancing with the +stars,Dancing with the Stars News\nMost Recent Most ...,Dancing with the Stars News,http://www.buddytv.com/dancing-with-the-stars....
8,1,1568776,D2516688,8,24.518309,+dancing with the +stars,New 'Dancing With the Stars' cast is...\nBy Li...,New 'Dancing With the Stars' cast is...,http://www.cnn.com/2016/08/30/entertainment/da...
9,1,1709393,D458962,9,24.428796,+dancing with the +stars,All Tickets > Music Tickets > Rock & Pop > Dan...,Dancing with the Stars Tickets,http://www.ticketmaster.com/Dancing-with-the-S...


In [27]:
search_data[:5][["body", "title", "url"]]

Unnamed: 0,body,title,url
0,Benwf83 339 Contributions\nWho won Season 4 of...,Who won Season 4 of Dancing with the Stars?,http://www.answers.com/Q/How_many_seasons_a_ye...
1,Home Celebrity Q&A ‘Dancing With the Stars’ Sa...,âDancing With the Starsâ Salaries,http://americanprofile.com/articles/dancing-wi...
2,How Much Money Will the Stars Make on Dancing ...,How Much Money Will the Stars Make on Dancing ...,http://gawker.com/5627611/how-much-money-will-...
3,Who Went Home On Dancing with the Stars 2014 L...,Who Went Home On Dancing with the Stars 2014 L...,http://gossipandgab.com/43327/who-went-home-on...
4,"On Dancing With the Stars tonight, it’s Finale...",Dancing With the Stars Freestyle Night (Recap ...,http://guardianlv.com/2014/05/dancing-with-the...
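
To hand rows like these to the frontend, one option is to keep only the needed columns and serialize them as JSON. This is a sketch under assumptions: `to_payload` and its default `fields` are hypothetical, and the `rows` list is a hand-copied stand-in for what `search_data.to_dict(orient="records")` would give (the URL is shown truncated, exactly as in the output above).

```python
import json

def to_payload(rows, fields=("docno", "title", "url", "score")):
    """Serialize the selected fields of each result row as a JSON array."""
    trimmed = [{field: row.get(field) for field in fields} for row in rows]
    return json.dumps(trimmed, ensure_ascii=False)

# Stand-in for search_data.to_dict(orient="records"), copied from row 0 above.
rows = [
    {
        "qid": "1",
        "docno": "D1197453",
        "rank": 0,
        "score": 25.711631,
        "title": "Who won Season 4 of Dancing with the Stars?",
        "url": "http://www.answers.com/Q/How_many_seasons_a_ye...",
    },
]

payload = to_payload(rows)
print(payload)
```

Columns such as `qid` and `rank` are dropped here only to keep the payload small; whether the frontend needs them is a design choice.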


# PART 3: Query Expansion module
Since the "Related Words" bag will show some suggestions, I needed to see how query expansion works. [This article](https://pyterrier.readthedocs.io/en/latest/terrier/rewrite.html) in PyTerrier's documentation covers query rewriting and expansion.

I chose the method that looked simplest for obtaining an expanded query. Below I try it out and look at the expansion result.

In [73]:
bo1 = pt.rewrite.Bo1QueryExpansion(index)
original_query = "+dancing with the +stars"
expansion = bo1(bm25.search(original_query))
expansion

Unnamed: 0,qid,query_0,query
0,1,+dancing with the +stars,applypipeline:off danc^1.420494110 star^1.2824...


In [74]:
# Here I take the value in "query" and print it out. The expansion adds terms, each with a very specific weight.
# Also, the terms are not necessarily real words, but shortened (stemmed) forms that "cast a wider net", so to speak.
new_words = expansion["query"].item()
new_words

'applypipeline:off danc^1.420494110 star^1.282471458 who^0.099931968 abc^0.075553253 salari^0.056309564 celebr^0.036232875 last^0.000000000 realiti^0.000000000 palin^0.000000000 kipp^0.000000000'

## STEP 1: Using Regex to capture the words in the expanded query.
In the "Human Language Technology I" class, I learned how to use regex with Python's "re" library. The [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) was my preferred resource, followed by [re - Regular expression operations](https://docs.python.org/3/library/re.html).

Also, to test whether a regex works correctly, I use [Pythex](https://pythex.org/) and [Debuggex](https://www.debuggex.com/) to create and test the patterns. Make sure that you set Debuggex to "Python", because regex engines vary slightly.

In [3]:
import re

## STEP 1.2: Capturing words in the query expansion
The query expansion yields a new query that adds terms, but it may also include terms from the original query, and I did not want to suggest those. Thus, I capture all words in the expanded query using a regex, which you can see below. In the backend, I store these words and create a Set() with the words in the original query.

A Set() is a data structure that allows O(1) average-time checks for whether an item is in the collection. So I take the items captured from the expanded query and check which of them are not in the "Original Query" set. Those are the new terms that I send to the frontend as suggestions for the user.
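
The set-based check can be sketched in a few lines. The term lists here are made up and unstemmed, purely for illustration. Because the comparison is an exact string match, a stemmed form such as "danc" would not be recognised as "dancing".

```python
# Words from the original query, stored in a set for fast membership checks.
original_terms = {"dancing", "with", "the", "stars"}

# Made-up expanded terms (unstemmed for readability).
expanded_terms = ["dancing", "stars", "salary", "celebrity"]

# Keep only the terms the user did not already type.
suggestions = [term for term in expanded_terms if term not in original_terms]
print(suggestions)
```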

### Pattern Description:
 - **CLAUSE 1:** (?P<term>\[a-zA-Z0-9]+) - This is a capture group that I named "term", because it captures a term. The "\[a-zA-Z0-9]+" block means that I want to capture one or more alphanumeric characters.
 - \\^ - This little segment means that the term must be followed by the "^" character. This is because every term is followed by a weight, which is introduced by this character (e.g. rock^0.333).
 - **CLAUSE 2:** (?P<weight>\[0-9\.]+) - This block captures the "weight" part of each term. The "\[0-9\.]+" block matches one or more characters that are each a digit or a ".", since weights are usually decimals. Technically, this block would accept something like "...", but since PyTerrier is the one that provides this value, I find that unlikely to happen, so it's good enough for my task.

You can see this pattern in Pythex [here](https://pythex.org/?regex=%28%3FP%3Cterm%3E%5Ba-zA-Z0-9%5D%2B%29%5C%5E%28%3FP%3Cweight%3E%5B0-9%5C.%5D%2B%29&test_string=applypipeline%3Aoff+danc%5E1.420494110+star%5E1.282471458+who%5E0.099931968+abc%5E0.075553253+salari%5E0.056309564+celebr%5E0.036232875+last%5E0.000000000+realiti%5E0.000000000+palin%5E0.000000000+kipp%5E0.000000000&mode=finditer)
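
As a small sketch of what the named groups buy you, `finditer` gives access to each match by group name, which makes it easy to build a term-to-weight mapping. The input string is a shortened copy of the expanded query shown earlier, and the dictionary is just one possible way to store the result.

```python
import re

# Shortened copy of the expanded query printed earlier in this notebook.
expanded = "applypipeline:off danc^1.420494110 star^1.282471458 who^0.099931968"

pattern = re.compile(r"(?P<term>[a-zA-Z0-9]+)\^(?P<weight>[0-9\.]+)")

# Map each term to its weight as a float, reading both parts by group name.
weights = {
    match.group("term"): float(match.group("weight"))
    for match in pattern.finditer(expanded)
}
print(weights)
```

Note that "applypipeline:off" is skipped because neither part of it is followed by "^".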

In [76]:
find_term_and_weights = r"(?P<term>[a-zA-Z0-9]+)\^(?P<weight>[0-9\.]+)"
findtw = re.compile(find_term_and_weights)

matches = findtw.findall(new_words)
matches

[('danc', '1.420494110'),
 ('star', '1.282471458'),
 ('who', '0.099931968'),
 ('abc', '0.075553253'),
 ('salari', '0.056309564'),
 ('celebr', '0.036232875'),
 ('last', '0.000000000'),
 ('realiti', '0.000000000'),
 ('palin', '0.000000000'),
 ('kipp', '0.000000000')]

## STEP 1.3: Retrieving the words from the query pre-expansion
I built a regex that captures all words in the original query, so that I can create a Set() of them.
That way, I can take the words from the expanded query and check which ones were already in the original.

That is, any term that isn't in the "Original Words" Set() must be an addition made by PyTerrier!

### Pattern Description:
 - (?<=\[+|-]) - This lookbehind checks that a term is preceded by the "+" or "-" sign, which mark constraints (the former forces a word to appear, while the latter forbids it). Inside a character class, "|" is a literal character, so this also accepts a preceding "|".
 - **CLAUSE 1:** (?:(?<=\[+|-])\[A-Za-z0-9\.\-\_]+) - This overall expression contains the lookbehind above. The "\[A-Za-z0-9\.\-\_]+" section means I want any sequence of one or more alphanumeric characters, dots, hyphens, or underscores, in any combination. The "?:" makes the group non-capturing, since I don't need to retrieve or reference it later.
 - **CLAUSE 2:** (?:\[A-Za-z0-9\.\-\_]+(?=\[\^|\"])) - This one is loaded as well. It reuses the same "\[A-Za-z0-9\.\-\_]+" specification to capture terms. The new part is the "(?=\[\^|\"])" lookahead: it means that I am looking for a word that is followed by the ^ sign or the " sign. This helps capture words that have a weight, or words just before a closing double quote, as used for phrase search.
 - **CLAUSE 3:** (?<=\b)\[A-Za-z]+(?=\b) - This one captures words that have no constraint character, no double quote, and no ^ sign with a weight following them, that is, words that stand by themselves. The "\b" token means "word boundary" and marks where a word starts or ends. Note that I don't capture numbers here, because I had some trouble setting that up right.

This pattern can be seen in pythex [here](https://pythex.org/?regex=%28%3FP%3Cword%3E%28%3F%3A%28%3F%3C%3D%5B%2B%7C-%5D%29%5BA-Za-z0-9%5C.%5C-%5C_%5D%2B%29%7C%28%3F%3A%5BA-Za-z0-9%5C.%5C-%5C_%5D%2B%28%3F%3D%5B%5C%5E%7C%5C%22%5D%29%29%7C%28%3F%3C%3D%5Cb%29%5BA-Za-z%5D%2B%28%3F%3D%5Cb%29%29&test_string=%22danc%22%5E1.420494110+star%5E1.282471458+who%5E0.099931968+abc%5E0.075553253+salari%5E0.056309564+celebr%5E0.036232875+last%5E0.000000000+realiti%5E0.000000000+palin%5E0.000000000+kipp%5E0.000000000+%2Bparty+-fern+tern+23.23%5E0.128942&mode=finditer)
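
To see how the three clauses behave on the kinds of queries used in this notebook, here is a small sketch that runs the same pattern over a few inputs. The sample queries other than the first two are made up. The last one illustrates the limitation mentioned above: a bare number is not captured.

```python
import re

extract_jtw = re.compile(
    r"(?P<word>(?:(?<=[+|-])[A-Za-z0-9\.\-\_]+)|(?:[A-Za-z0-9\.\-\_]+(?=[\^|\"]))|(?<=\b)[A-Za-z]+(?=\b))"
)

sample_queries = [
    "+dancing with the +stars",  # constraints plus plain words
    '+"mario bros"',             # required phrase
    "-fern tern",                # forbidden word plus a plain word (made up)
    "route 66",                  # plain word plus a bare number (made up)
]

# With a single capture group, findall returns a list of strings per query.
results = {query: extract_jtw.findall(query) for query in sample_queries}
for query, words in results.items():
    print(repr(query), "->", words)
```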

In [83]:
extract_just_the_words = r"(?P<word>(?:(?<=[+|-])[A-Za-z0-9\.\-\_]+)|(?:[A-Za-z0-9\.\-\_]+(?=[\^|\"]))|(?<=\b)[A-Za-z]+(?=\b))"
#r"((?<=[+|-])[A-Za-z0-9\.\-\_]+)|([A-Za-z0-9\.\-\_]+(?=[\^|\"]))|(?<=\b)[A-Za-z]+(?=\b)"
extract_jtw = re.compile(extract_just_the_words)

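oq_matches = extract_jtw.findall(original_query)
set_oq = set(oq_matches) #From https://stackoverflow.com/a/15768778
# Keep only the expanded terms that are not in the original query, rounding each weight to 2 decimals.
# The check must use each tuple's own term; "matches[0]" is a whole tuple and is never in the set.
# Note: the expanded terms are stemmed ("danc"), so they do not match unstemmed originals ("dancing").
extended_words = [(term, round(float(weight), 2)) for (term, weight) in matches if term not in set_oq]
extended_words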

[('danc', 1.42),
 ('star', 1.28),
 ('who', 0.1),
 ('abc', 0.08),
 ('salari', 0.06),
 ('celebr', 0.04),
 ('last', 0.0),
 ('realiti', 0.0),
 ('palin', 0.0),
 ('kipp', 0.0)]