# Alternative Second Term Project: ARQMath Collection, Answer Retrieval Task

“In a recent study, Mansouri et al. found that 20% of mathematical queries in a general-purpose search engine were expressed as well-formed questions, a rate ten times higher than that for all queries submitted. Results such as these and the presence of Community Question Answering sites such as Math Stack Exchange suggest there is interest in finding answers to mathematical questions posed in natural language, using both text and mathematical notation.” [1]

“[ARQMath](https://www.cs.rit.edu/~dprl/ARQMath/) is a co-operative evaluation exercise aiming to advance math-aware search and the semantic analysis of mathematical notation and texts. **ARQMath is being run for the second time at CLEF 2021.** An overview paper (including results) from ARQMath 2020 is available along with participant papers in the [CLEF 2020 working notes](http://ceur-ws.org/Vol-2696).” [2]

 ![Answer Retrieval Task](https://www.cs.rit.edu/~dprl/ARQMath/assets/images/screen-shot-2019-09-09-at-11.11.57-pm-2656x1229.png)

Your tasks, reviewed by your colleagues and the course instructors, are the following:

1.   *Implement a supervised ranked retrieval system* [3, Chapter 15], which will produce a list of answers from the ARQMath collection in descending order of relevance to a query from the ARQMath collection. You SHOULD use training and validation relevance judgements from the ARQMath collection in your information retrieval system. Test judgements MUST only be used for the evaluation of your information retrieval system.

2.   *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises.  
     *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).

3.   *Reach at least 1.2% mean average precision* [3, Section 8.4] with your system on the ARQMath collection. You are encouraged to use techniques for tokenization [3, Section 2.2], document representation [3, Section 6.4], tolerant retrieval [3, Chapter 3], relevance feedback and query expansion [3, Chapter 9], learning to rank [3, Chapter 15], and others discussed in the course.

4.   _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).

The best student systems will enter the ARQMath competition and help develop the new search engine for [the Math StackExchange question answering forum](http://math.stackexchange.com/). This is not only useful, but also a nice reference for your CVs!

[1] Zanibbi, R. et al. [Overview of ARQMath 2020 (Updated Working Notes Version): CLEF Lab on Answer Retrieval for Questions on Math](http://ceur-ws.org/Vol-2696/paper_271.pdf). In: *Working Notes of CLEF 2020-Conference and Labs of the Evaluation Forum*. 2020.

[2] Zanibbi, R. et al. [*ARQMath: Answer Retrieval for Questions on Math*](https://www.cs.rit.edu/~dprl/ARQMath/index.html). Rochester Institute of Technology. 2021.

[3] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to information retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge university press, 2008.

## Loading the ARQMath collection

First, we will install [our library](https://gitlab.fi.muni.cz/xstefan3/pv211-utils) and load the ARQMath collection. If you are interested, you can take a peek at [how we preprocessed the raw ARQMath collection](https://drive.google.com/file/d/1ZFJyBHUuMe4CkwV1HGKYg_F-Fk_PSW9R/view) to the final dataset that we will be using.

In [1]:
#%%capture
#! pip install git+https://github.com/MIR-MU/pv211-utils.git
#! pip install gensim==3.6.0

The questions and answers from the ARQMath collection, and the queries from the answer retrieval task of ARQMath 2020, contain both text and mathematical formulae. We have prepared several encodings of the text and mathematical formulae, which you can choose from:

- `text` – Plain text, which contains no mathematical formulae. *Nice and easy*, but you are losing all information about the math:

    > Finding value of  such that ...

- `text+latex` – Plain text with mathematical formulae in LaTeX surrounded by dollar signs. Still quite nice to work with:

    > Finding value of \$c\$ such that ...

- `text+tangentl` – Plain text with mathematical formulae in [the mathtuples format][5] of [the Tangent-L system][6]. Like LaTeX, the mathtuples format encodes how a mathematical formula looks, but is fuzzier in order to improve recall.

    > Finding value of #(start)# #(v!c,!0,-)# #(v!c,!0)# #(end)# such that ...

- `text+prefix` – Plain text with mathematical formulae in [the prefix format][1]. Unlike LaTeX, which encodes how a mathematical formula looks, the prefix format encodes the semantic content of the formulae using [the Polish notation][2].

    > Finding value of V!𝑐 such that ...

- `xhtml+latex` – XHTML text with mathematical formulae in LaTeX, surrounded by the `<span class="math-container">` tags:

    > ``` html
    > <p>Finding value of <span class="math-container">$c$</span> such that ...
    > ```

- `xhtml+pmml` – XHTML text with mathematical formulae in the [Presentation MathML][4] XML format, which encodes how a mathematical formula looks:

    > ``` html
    > <p>Finding value of <math><mi>c</mi></math> such that ...
    > ```

- `xhtml+cmml` – XHTML text with mathematical formulae in the [Content MathML][3] XML format, which encodes the semantic content of a formula. This format is *much more difficult to work with*, but it allows you to represent mathematical formulae structurally and use XML Retrieval [3, Chapter 10].

    > ``` html
    > <p>Finding value of <math><ci>𝑐</ci></math> such that ...
    > ```

 [1]: http://ceur-ws.org/Vol-2696/paper_235.pdf#page=5
 [2]: https://en.wikipedia.org/wiki/Polish_notation
 [3]: https://www.w3.org/TR/MathML2/chapter4.html
 [4]: https://www.w3.org/TR/MathML2/chapter3.html
 [5]: https://github.com/fwtompa/mathtuples
 [6]: http://ceur-ws.org/Vol-2936/paper-05.pdf#page=3

In [2]:
text_format = 'text'
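
To illustrate what working with one of these encodings might look like, here is a minimal tokenizer sketch for the `text+latex` format. It assumes that every formula is delimited by single dollar signs, as in the example above; the names `TOKEN_PATTERN` and `tokenize_text_latex` are hypothetical and not part of the course library.

```python
import re

# Matches either a whole $...$ formula or a run of word characters.
TOKEN_PATTERN = re.compile(r"\$[^$]+\$|\w+")


def tokenize_text_latex(text):
    """Split a `text+latex` string into lowercase words and whole formulae."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        token = match.group(0)
        if token.startswith("$"):
            tokens.append(token)  # keep the formula verbatim, case matters in math
        else:
            tokens.append(token.lower())
    return tokens


example_tokens = tokenize_text_latex("Finding value of $c$ such that ...")
print(example_tokens)
```

Keeping each formula as a single token is only one possible design; you could just as well split formulae into smaller LaTeX tokens to match partial formulae.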

### Loading the answers

Next, we will define a class named `Answer` that will represent a preprocessed answer from the ARQMath 2020 collection. Tokenization and preprocessing of the `body` attribute of the individual answers as well as the creative use of the `upvotes` and `is_accepted` attributes is left to your imagination and craftsmanship.

We will load answers into the `answers` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each answer is an instance of the `Answer` class that we have just defined.

In [3]:
from pv211_utils.datasets import ArqmathDataset
data = ArqmathDataset(year=2021, text_format=text_format)
answers = data.load_answers()



Computing MD5: /home/batman/.cache/pv211-utils/0342d654e429f56600fded6e7794be00
MD5 matches: /home/batman/.cache/pv211-utils/0342d654e429f56600fded6e7794be00


In [4]:
print('\n'.join(repr(answer) for answer in list(answers.values())[:3]))
print('...')
print('\n'.join(repr(answer) for answer in list(answers.values())[-3:]))

<Document 4 “More or Less is a BBC Radio 4 programme about math ...”>
<Document 7 “You use a proof by contradiction. Basically, you s ...”>
<Document 9 “Suppose no one ever taught you the names for ordin ...”>
...
<Document 3058133 “First take some such that , for example take . The ...”>
<Document 3058134 “The answer is NO. Take and . Consider Then , in th ...”>
<Document 3058136 “Firstly, it’s trivial that Next, we see that which ...”>


For a demonstration, we will load [the accepted answer from the image above][1].

 [1]: https://math.stackexchange.com/a/30741

In [5]:
answer = answers['30741']
answer

<Document 30741 “No need to use Taylor series, this can be derived  ...”>

In [6]:
print(answer.body)

No need to use Taylor series, this can be derived in a similar way to the formula for geometric series. Let's find a general formula for the following sum:    Notice that  \begin{align*} S_{m}-rS_{m} & = -mr^{m+1}+\sum_{n=1}^{m}r^{n}\\   & = -mr^{m+1}+\frac{r-r^{m+1}}{1-r} \\ & =\frac{mr^{m+2}-(m+1)r^{m+1}+r}{1-r}. \end{align*} Hence   This equality holds for any , but in your case we have  and a factor of  in front of the sum.    That is  \begin{align*} \sum_{n=1}^{\infty}\frac{2n}{3^{n+1}}  & = \frac{2}{3}\lim_{m\rightarrow\infty}\frac{m\left(\frac{1}{3}\right)^{m+2}-(m+1)\left(\frac{1}{3}\right)^{m+1}+\left(\frac{1}{3}\right)}{\left(1-\left(\frac{1}{3}\right)\right)^{2}} \\ & =\frac{2}{3}\frac{\left(\frac{1}{3}\right)}{\left(\frac{2}{3}\right)^{2}} \\ & =\frac{1}{2}. \end{align*}  Added note:    We can define   Then the sum above considered is , and the geometric series is .  We can evaluate  by using a similar trick, and considering .  This will then equal a combination of  and  wh

In [7]:
print(answer.upvotes)

318


In [8]:
print(answer.is_accepted)

True
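
One possible use of the `upvotes` and `is_accepted` attributes is to boost the text-similarity score of an answer. The sketch below is only an illustration under assumptions: `ToyAnswer` is a hypothetical stand-in for the `Answer` class, and the logarithmic boost with an `accepted_bonus` of 0.5 is an arbitrary choice that you would have to tune on the validation queries.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyAnswer:
    """Hypothetical stand-in exposing only the attributes used below."""
    upvotes: int
    is_accepted: bool


def quality_boost(answer, accepted_bonus=0.5):
    """Return a multiplicative boost that grows slowly with upvotes."""
    # log1p dampens large vote counts; negative vote counts are clipped to zero.
    boost = 1.0 + math.log1p(max(answer.upvotes, 0))
    if answer.is_accepted:
        boost += accepted_bonus
    return boost


def rescore(text_score, answer):
    """Combine a text-similarity score with the quality boost."""
    return text_score * quality_boost(answer)


popular = ToyAnswer(upvotes=318, is_accepted=True)
obscure = ToyAnswer(upvotes=0, is_accepted=False)
print(rescore(0.2, popular) > rescore(0.2, obscure))
```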


### Loading the questions

Next, we will define a class named `Question` that will represent a preprocessed question from the ARQMath 2020 collection. Tokenization and preprocessing of the `title` and `body` attributes of the individual questions as well as the creative use of the `tags`, `upvotes`, `views`, and `answers` attributes is left to your imagination and craftsmanship.

We will not be returning these questions from our search engine, but we could use them for example to look up similar existing questions to a query and then return the answers to these existing questions.

We will load questions into the `questions` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each question is an instance of the `Question` class that we have just defined.

In [10]:
questions = data.load_questions()
answer_to_question = {
    answer: question
    for question in questions.values()
    for answer in question.answers
}

Computing MD5: /home/batman/.cache/pv211-utils/0342d654e429f56600fded6e7794be00
MD5 matches: /home/batman/.cache/pv211-utils/0342d654e429f56600fded6e7794be00
Computing MD5: /home/batman/.cache/pv211-utils/1ae90b86bee40f7de4d572ad7ec04955
MD5 matches: /home/batman/.cache/pv211-utils/1ae90b86bee40f7de4d572ad7ec04955


In [11]:
print('\n'.join(repr(question) for question in list(questions.values())[:3]))
print('...')
print('\n'.join(repr(question) for question in list(questions.values())[-3:]))

<Document 1 “Can someone explain to me how there can be differe ...”>
<Document 3 “mathfactor is one I listen to. Does anyone else ha ...”>
<Document 5 “I have read a few proofs that is irrational. I hav ...”>
...
<Document 3058122 “On have the spherical parametrization . Is it mean ...”>
<Document 3058135 “I had a question in my exam and they asked to prov ...”>
<Document 3062816 “I was trying to solve the following problem: Let b ...”>


For a demonstration, we will load [the question from the image above][1].

 [1]: https://math.stackexchange.com/q/30732

In [12]:
question = questions['30732']
question

<Document 30732 “How can I evaluate ? I know the answer thanks to W ...”>

In [13]:
print(question.title)

How can I evaluate ?


In [14]:
print(question.body)

How can I evaluate ? I know the answer thanks to Wolfram Alpha, but I'm more concerned with how I can derive that answer. It cites tests to prove that it is convergent, but my class has never learned these before. So I feel that there must be a simpler method.  In general, how can I evaluate


In [15]:
print(question.tags)

['sequences-and-series', 'convergence-divergence', 'power-series', 'faq']


In [16]:
print(question.upvotes)

360


In [17]:
print(question.views)

37953


In [18]:
print(question.answers)

[<Document 30741 “No need to use Taylor series, this can be derived  ...”>, <Document 223857 “If you want a solution that doesn't require deriva ...”>, <Document 30746 “As indicated in other answers, you can reduce this ...”>, <Document 30747 “Factor out the . Then write It is easy to show tha ...”>, <Document 81635 “My favorite proof of this is in this paper of Roge ...”>, <Document 30734 “Hints You know (don't you?) the formula for for Ta ...”>, <Document 223850 “Note that , i.e., a geometric series, which conver ...”>, <Document 30736 “You can find by differentiation. Just notice that  ...”>, <Document 539711 “Consider the generating function If we let , then  ...”>, <Document 548068 “Let be It's easy to prove that for , the sums sati ...”>, <Document 879374 “In fact, For , we have”>, <Document 639269 “Note that is the number ways to choose items of ty ...”>, <Document 820130 “I assume that the to be less than . Now, consider, ...”>, <Document 1063667 “I first encountered this sum w

In [19]:
print([answer for answer in question.answers if answer.is_accepted])

[<Document 30741 “No need to use Taylor series, this can be derived  ...”>]
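
The idea of looking up existing questions that are similar to a query and returning their answers could be sketched as follows. This is a toy illustration under assumptions: `ToyQuestion` is a hypothetical stand-in for the `Question` class, answers are represented by their identifiers, and Jaccard word overlap stands in for whatever similarity measure you actually choose.

```python
from dataclasses import dataclass, field


@dataclass
class ToyQuestion:
    """Hypothetical stand-in with just a title and a list of answer ids."""
    title: str
    answers: list = field(default_factory=list)


def jaccard(first, second):
    """Word-overlap similarity between two strings."""
    a, b = set(first.lower().split()), set(second.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def answers_via_similar_questions(query_text, questions):
    """Yield the answers of the questions, most similar question first."""
    ranked = sorted(questions, key=lambda q: jaccard(query_text, q.title), reverse=True)
    seen = set()
    for question in ranked:
        for answer in question.answers:
            if answer not in seen:  # an answer is returned at most once
                seen.add(answer)
                yield answer


toy_questions = [
    ToyQuestion("How can I evaluate this infinite series", ["30741", "30746"]),
    ToyQuestion("Proof that the square root of two is irrational", ["7"]),
]
result = list(answers_via_similar_questions("evaluate an infinite series", toy_questions))
print(result)
```

Within a single question, the answers keep their original order here; you might instead order them by `upvotes` or `is_accepted`.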


### Loading the queries
Next, we will define a class named `Query` that will represent a preprocessed query from the answer retrieval task of ARQMath 2020. Tokenization and preprocessing of the `title` and `body` attributes of the individual queries as well as the creative use of the `tags` attribute is left to your imagination and craftsmanship.

We will load queries into the `train_queries` and `validation_queries` [ordered dictionaries](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each query is an instance of the `Query` class that we have just defined. You should use `train_queries`, `validation_queries`, and *relevance judgements* (see the next section) for training your supervised information retrieval system.

If you are training just a single machine learning model without any early stopping or hyperparameter optimization, you can use `bigger_train_queries` as the input.

If you are training a single machine learning model with early stopping or hyperparameter optimization, you can use `train_queries` for training your model and `validation_queries` to stop early or to select the optimal hyperparameters for your model. You can then use `bigger_train_queries` to train the model with the best number of epochs or the best hyperparameters.

In [20]:
from collections import OrderedDict
from itertools import chain

train_queries = data.load_train_queries()
validation_queries = data.load_validation_queries()

bigger_train_queries = OrderedDict(chain(train_queries.items(), validation_queries.items()))
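
The workflow described above (select hyperparameters on the validation split, then retrain on the bigger training split) might be sketched like this. Everything in the block is a hypothetical toy: the "model" is a single decision threshold, the data are `(feature, label)` pairs, and `fit`, `score`, and `select_hyperparameter` are illustrative names rather than functions of the course library.

```python
def select_hyperparameter(candidates, fit, score, train, validation):
    """Fit on `train` for each candidate and keep the best one on `validation`."""
    best_value, best_score = None, float("-inf")
    for value in candidates:
        model = fit(train, value)
        current = score(model, validation)
        if current > best_score:
            best_value, best_score = value, current
    return best_value


# Toy stand-ins: the "model" is a threshold and the data are (feature, label) pairs.
def fit(data, margin):
    positives = [x for x, label in data if label]
    return min(positives) - margin  # decision threshold


def score(threshold, data):
    """Accuracy of the threshold classifier on the given data."""
    return sum((x >= threshold) == label for x, label in data) / len(data)


toy_train = [(0.9, True), (0.8, True), (0.2, False)]
toy_validation = [(0.7, True), (0.5, False)]
best_margin = select_hyperparameter([0.0, 0.15, 0.4], fit, score, toy_train, toy_validation)

# Retrain on the union of both splits using the selected value.
final_model = fit(toy_train + toy_validation, best_margin)
print(best_margin, final_model)
```

In your system, `toy_train` and `toy_validation` would correspond to `train_queries` and `validation_queries` with their judgements, and the final retraining step to `bigger_train_queries`.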

In [21]:
print('\n'.join(repr(query) for query in list(train_queries.values())[:3]))
print('...')
print('\n'.join(repr(query) for query in list(train_queries.values())[-3:]))

<Query 18 “Evaluate using Cesáro-Stolz theorem. I know there  ...”>
<Query 89 “Is there any known complete parametrization of the ...”>
<Query 49 “I came across an exercise in which we are asked to ...”>
...
<Query 303 “Theorem- Up to isomorphism, the only noncommutativ ...”>
<Query 373 “Show that is irrational if is notperfect square, u ...”>
<Query 374 “Find all monic complex polynomials such that . My  ...”>


For a demonstration, we will look at query number 5. This is a query that is relatively easy to answer using just the text of the query, not the mathematical formulae. The user is asking for a computational solution to an interesting puzzle.

In [22]:
query = validation_queries[5]
query

<Query 5 “A family has two children. Given that one of the c ...”>

In [23]:
print(query.title)

A family has two children. Given that one of the children is a boy, what is the probability that both children are boys?


In [24]:
print(query.body)

A family has two children. Given that one of the children is a boy, what is the probability that both children are boys?   I was doing this question using conditional probability formula.   Suppose, (1) is the event, that the first child is a boy, and (2) is the event that the second child is a boy.  Then the probability of the second child to be boy given that first child is a boys by formula,  ...since second child to be boy doesn't depend on first child and vice versa. Please provide the detailed solution and correct me if I am wrong.


In [25]:
print(query.tags)

['probability', 'proof-verification', 'conditional-probability']


### Loading the relevance judgements
Next, we will load train and validation relevance judgements into the `train_judgements` and `validation_judgements` sets. Relevance judgements specify which answers are relevant to which queries. You should use relevance judgements for training your supervised information retrieval system.


If you are training just a single machine learning model without any early stopping or hyperparameter optimization, you can use `bigger_train_judgements` as the input.

If you are training a single machine learning model with early stopping or hyperparameter optimization, you can use `train_judgements` for training your model and `validation_judgements` to stop early or to select the optimal hyperparameters for your model. You can then use `bigger_train_judgements` to train the model with the best number of epochs or the best hyperparameters.

In [26]:
from pv211_utils.arqmath.loader import load_judgements

train_judgements = data.load_train_judgements()
validation_judgements = data.load_validation_judgements()

bigger_train_judgements = train_judgements | validation_judgements

Computing MD5: /home/batman/.cache/pv211-utils/0342d654e429f56600fded6e7794be00
MD5 matches: /home/batman/.cache/pv211-utils/0342d654e429f56600fded6e7794be00


In [27]:
len(bigger_train_judgements)

4747
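
If you train a learning-to-rank model, you will probably need labeled examples rather than a bare set of judgements. The following sketch assumes that a judgement set contains only relevant `(query, answer)` pairs and treats every other answer as non-relevant, which is an assumption, since unjudged answers may in fact be relevant. The function name `build_training_pairs` and the string identifiers are hypothetical.

```python
import random
from collections import defaultdict


def build_training_pairs(judgements, all_answers, negatives_per_positive=1, seed=0):
    """Return (query, answer, label) triples with randomly sampled negatives."""
    rng = random.Random(seed)
    relevant = defaultdict(set)
    for query, answer in judgements:
        relevant[query].add(answer)

    triples = []
    for query, positives in relevant.items():
        # Unjudged answers are only *assumed* to be non-relevant.
        pool = [a for a in all_answers if a not in positives]
        for answer in positives:
            triples.append((query, answer, 1))
            count = min(negatives_per_positive, len(pool))
            for negative in rng.sample(pool, count):
                triples.append((query, negative, 0))
    return triples


toy_judgements = {("q1", "a1"), ("q1", "a2"), ("q2", "a3")}
toy_answers = ["a1", "a2", "a3", "a4"]
triples = build_training_pairs(toy_judgements, toy_answers)
print(len(triples))
```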

For a demonstration, we will look at query number 5 and show a relevant answer to the query and a non-relevant answer to the query.

In [28]:
query = validation_queries[5]
relevant_answer = answers['1037824']
irrelevant_answer = answers['432200']

In [29]:
query

<Query 5 “A family has two children. Given that one of the c ...”>

In [30]:
relevant_answer

<Document 1037824 “If he has more daughters than sons, Below are the  ...”>

In [31]:
irrelevant_answer

<Document 432200 “It is interesting that everyone is considering tha ...”>

In [32]:
(query, relevant_answer) in train_judgements

False

In [34]:
(query, irrelevant_answer) in train_judgements

False

## Implementation of your information retrieval system
Next, we will define a class named `IRSystem` that will represent your information retrieval system. Your class must define a method named `search` that takes a query and returns answers in descending order of relevance to the query.

The example implementation returns answers in decreasing order of the TF-IDF cosine similarity between the answer and the query. You can use the example implementation as a basis of your system, or you can replace it with your own implementation.
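
To make the idea of ranking by TF-IDF cosine similarity concrete, here is a self-contained toy sketch in plain Python. It is not the example implementation from the course library: `ToyTfidfSystem` is a hypothetical class that works with plain strings and answer identifiers instead of `Query` and `Answer` objects, and the smoothed IDF formula is just one common variant.

```python
import math
from collections import Counter


class ToyTfidfSystem:
    """Hypothetical TF-IDF system over a dict of {answer_id: text}."""

    def __init__(self, documents):
        self.documents = documents
        tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
        document_frequency = Counter(
            term for tokens in tokenized.values() for term in set(tokens)
        )
        total = len(documents)
        # Smoothed inverse document frequency, always positive.
        self.idf = {
            term: math.log((1 + total) / (1 + df)) + 1
            for term, df in document_frequency.items()
        }
        self.vectors = {doc_id: self._vectorize(tokens) for doc_id, tokens in tokenized.items()}

    def _vectorize(self, tokens):
        """Return an L2-normalized TF-IDF vector as a sparse dict."""
        counts = Counter(token for token in tokens if token in self.idf)
        weights = {term: tf * self.idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {term: w / norm for term, w in weights.items()} if norm else {}

    def search(self, query_text):
        """Yield answer ids in descending order of cosine similarity."""
        query_vector = self._vectorize(query_text.lower().split())
        scores = {
            doc_id: sum(weight * vector.get(term, 0.0) for term, weight in query_vector.items())
            for doc_id, vector in self.vectors.items()
        }
        yield from sorted(scores, key=scores.get, reverse=True)


toy_system = ToyTfidfSystem({
    "a": "geometric series sum formula",
    "b": "proof by contradiction",
    "c": "conditional probability of two boys",
})
ranking = list(toy_system.search("probability that both children are boys"))
print(ranking[0])
```

Because both vectors are normalized to unit length, the dot product in `search` equals the cosine similarity. Scoring every answer for every query, as done here, would be slow on the full collection; an inverted index would be the usual remedy.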

## Evaluation
Finally, we will evaluate your information retrieval system using [the Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP) evaluation measure.
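
For intuition, MAP can be computed by hand as in the sketch below, which follows the textbook definition [3, Section 8.4]: the average precision of a query is the mean of the precision values at the ranks of its relevant results, and relevant results that are never retrieved contribute zero. The function names and toy data are hypothetical, and the evaluation class used in the next cell may differ in details such as a rank cutoff.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the ranks of relevant results."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, relevant_by_query):
    """Mean of the average precision over all queries."""
    values = [average_precision(rankings[q], relevant_by_query[q]) for q in rankings]
    return sum(values) / len(values) if values else 0.0


toy_rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
toy_relevant = {"q1": {"a", "c"}, "q2": {"y"}}
print(mean_average_precision(toy_rankings, toy_relevant))
```

For `q1`, the relevant answers appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2; for `q2`, it is 1/2.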

In [1]:
from pv211_utils.arqmath.leaderboard import ArqmathLeaderboard
from pv211_utils.arqmath.eval import ArqmathEvaluation
from pv211_utils.preprocessing import preprocessing
from pv211_utils.systems import BM25PlusSystem

submit_result = False
author_name = 'Surname, Name'

preprocessing = preprocessing.SimpleDocPreprocessing()
system = BM25PlusSystem(answers, preprocessing)

print('Initializing your system ...')
test_queries = data.load_test_queries()
test_judgements = data.load_test_judgements()
evaluation = ArqmathEvaluation(system, test_judgements, 10, ArqmathLeaderboard(), author_name, num_workers=1)
evaluation.evaluate(test_queries, submit_result)
