# ARQMath-2 corpus for Document Maps by Vítek Novotný

In this notebook, we are going to prepare a JSON ARQMath-2 corpus for [the document maps visual information retrieval system][1].

 [1]: https://www.fi.muni.cz/~xpetr2/document-maps/

In [1]:
%%capture
! pip install git+https://github.com/MIR-MU/ARQMath-eval@0.0.21
! pip install git+https://gitlab.fi.muni.cz/xstefan3/pv211-utils.git@spring2021
! pip install gdown==3.13.0
! pip install gensim==4.0.1

## Loading the system run

First, we will load the run for our SCM system.

For simplicity, we will only load the first ten topics of ARQMath-2 (A.201 through A.210).

In [2]:
def is_allowed(topic_id: str) -> bool:
    # Topic identifiers look like 'A.201'; drop the 'A.' prefix to get the number.
    topic_number = int(topic_id[2:])
    return 201 <= topic_number <= 210

In [3]:
from pathlib import Path

class System:
    def __init__(self, path: Path):
        self.path = path
        self.parsed_run = dict()
        with self.path.open('rt') as f:
            lines = [line.strip().split() for line in f]
            for line in lines:
                topic_id, result_id, *_, rank, __, run_name = line
                if not is_allowed(topic_id):
                    continue
                if topic_id not in self.parsed_run:
                    self.parsed_run[topic_id] = dict()
                self.parsed_run[topic_id][result_id] = 1.0 / int(rank)
        self.run_name = run_name
    
    def __repr__(self) -> str:
        return f'{self.run_name} loaded from {self.path}'
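
To make the parsing in `System` concrete, here is a minimal sketch that applies the same unpacking to a few made-up run lines. The column layout (topic, result, optional middle columns, rank, score, run name) is an assumption inferred from the unpacking above, and the helper `parse_run_lines` and all identifiers in the data are hypothetical.

```python
from typing import Dict, Iterable


def parse_run_lines(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Parse whitespace-separated run lines into {topic: {result: score}}."""
    parsed: Dict[str, Dict[str, float]] = {}
    for line in lines:
        # Same unpacking as in System.__init__: any middle columns are ignored.
        topic_id, result_id, *_, rank, _score, _run_name = line.strip().split()
        # The stored score is the reciprocal rank, not the system's own score.
        parsed.setdefault(topic_id, {})[result_id] = 1.0 / int(rank)
    return parsed


# Hypothetical run lines; the identifiers and scores are made up.
example_lines = [
    'A.201\t111\t1\t9.5\tRun_example',
    'A.201\t222\t2\t7.1\tRun_example',
    'A.202\t333\t1\t8.0\tRun_example',
]
parsed_example = parse_run_lines(example_lines)
print(parsed_example)
# {'A.201': {'111': 1.0, '222': 0.5}, 'A.202': {'333': 1.0}}
```

Storing `1.0 / rank` keeps the ordering of the run while discarding the original score scale.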

In [4]:
from gdown import cached_download

def download(file_id: str, path: str):
    cached_download(url=f'https://drive.google.com/uc?id={file_id}', path=path)

download('1glXyz72Nah7uPXJTMbvEcoqy7cJ8PRLn', '2021-MIRMU-task1-Novotny-auto-both-A.tsv')

File exists: 2021-MIRMU-task1-Novotny-auto-both-A.tsv


In [5]:
system = System(Path('2021-MIRMU-task1-Novotny-auto-both-A.tsv'))
system

Run_Novotny_2021_0 loaded from 2021-MIRMU-task1-Novotny-auto-both-A.tsv

## Loading the queries and answers

Next, we will load the ARQMath-2 queries and the ARQMath collection questions and answers.

For each topic, we will only use the top five results returned by our system:

In [6]:
top_results = dict()
for topic_id, answers in system.parsed_run.items():
    top_results[topic_id] = []
    top_answers = sorted(answers.items(), key=lambda x: (-x[1], x[0]))[:5]
    for answer_id, answer_score in top_answers:
        top_results[topic_id].append(answer_id)

print(f'{len(top_results)} topics with {sum(map(len, top_results.values()))} answers')

10 topics with 50 answers
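
The selection of the top results relies on a compound sort key: descending score first, then ascending identifier to break ties deterministically. A small sketch on toy data (the helper `top_k` and the scores are hypothetical) shows the effect:

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int = 5) -> List[str]:
    """Return the k best result ids, breaking ties by ascending id."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [result_id for result_id, _ in ranked[:k]]


# Toy scores: 'a' and 'b' are tied, so the identifier decides their order.
toy_scores = {'b': 0.5, 'a': 0.5, 'c': 1.0, 'd': 0.25}
toy_top = top_k(toy_scores, k=3)
print(toy_top)  # ['c', 'a', 'b']
```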


In [7]:
text_format = 'text+prefix'

In [8]:
from typing import List

from pv211_utils.arqmath.entities import ArqmathQueryBase
from pv211_utils.arqmath.loader import load_queries

class Topic(ArqmathQueryBase):
    def __init__(self, query_id: int, title: str, body: str, tags: List[str]):
        super().__init__(query_id, title, body, tags)

queries = load_queries(text_format, Topic, subset=None, year=2021)

In [9]:
from itertools import chain

from pv211_utils.arqmath.entities import ArqmathAnswerBase
from pv211_utils.arqmath.loader import load_answers

class Answer(ArqmathAnswerBase):
    def __init__(self, document_id: str, body: str, upvotes: int,
                 is_accepted: bool):
        super().__init__(document_id, body, upvotes, is_accepted)

answers = load_answers(text_format, Answer)
top_answers = load_answers(text_format, Answer,
                           filter_document_ids=set(chain(*top_results.values())))

Computing MD5: /home/xnovot32/.cache/gdown/https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-1_BvB7ZoiblHTwx8qrs1GQGkPThH3Tdf3
MD5 matches: /home/xnovot32/.cache/gdown/https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-1_BvB7ZoiblHTwx8qrs1GQGkPThH3Tdf3
Computing MD5: /home/xnovot32/.cache/gdown/https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-1_BvB7ZoiblHTwx8qrs1GQGkPThH3Tdf3
MD5 matches: /home/xnovot32/.cache/gdown/https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-1_BvB7ZoiblHTwx8qrs1GQGkPThH3Tdf3


In [10]:
from pv211_utils.arqmath.entities import ArqmathQuestionBase
from pv211_utils.arqmath.loader import load_questions

class Question(ArqmathQuestionBase):
    def __init__(self, document_id: str, title: str, body: str, tags: List[str],
                 upvotes: int, views: int, answers: List[Answer]):
        super().__init__(document_id, title, body, tags, upvotes, views, answers)

questions = load_questions(text_format, answers, Question)

Computing MD5: /home/xnovot32/.cache/gdown/https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-16UO3BH-qFUUNj6AyM0zSuCUt7T7BB0hO
MD5 matches: /home/xnovot32/.cache/gdown/https-COLON--SLASH--SLASH-drive.google.com-SLASH-uc-QUESTION-id-EQUAL-16UO3BH-qFUUNj6AyM0zSuCUt7T7BB0hO


## Preprocessing

Next, we will create a dictionary, a TF-IDF model, and a term similarity matrix following [the notebook of the SCM system][2].

 [2]: https://colab.research.google.com/drive/1LACGqdkUUeprHGTrEoocWoSOavpfP5Ki

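We will define our text preprocessing function. It relies on a mapping from answers to their parent questions, so that each answer is tokenized together with the body, tags, and title of the question it answers.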

In [11]:
answers_to_questions = {
    answer: question
    for question in questions.values()
    for answer in question.answers
}

In [12]:
from typing import Union

from gensim.utils import simple_preprocess

def preprocess(document: Union[Topic, Question, Answer]) -> List[str]:
    """
    Tokenizes a document into lower-case text tokens and upper-case math tokens.

    Parameters
    ----------
    document: Topic or Question or Answer
        The document.
    
    Returns
    -------
    list of str
        The lower-case text tokens and upper-case math tokens.

    """
    tokens = []
    text = [document.body]
    title_weight = 3
    if hasattr(document, 'tags'):
        text += document.tags
    if hasattr(document, 'title'):
        text += [document.title] * title_weight
    if document in answers_to_questions:
        question = answers_to_questions[document]
        text += [question.body]
        text += question.tags
        text += [question.title] * title_weight
    for coarse_grained_token in chain(*map(str.split, text)):
        if len(coarse_grained_token) > 1 and coarse_grained_token[1] == '!':
            token = coarse_grained_token.upper()  # a mathematical token
            tokens.append(token)
        else:
            for fine_grained_token in simple_preprocess(coarse_grained_token):
                token = fine_grained_token.lower()  # a text token
                tokens.append(token)
    tokens = filter(lambda x: x, tokens)  # filter out empty tokens
    tokens = list(tokens)
    return tokens
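
The split between math tokens and text tokens can be illustrated without the ARQMath data. The sketch below is a simplified stand-in: it uses a plain regular expression in place of gensim's `simple_preprocess` (so, unlike the real function, it does not drop very short or very long tokens), and the prefixed tokens in the input are illustrative examples rather than actual corpus content.

```python
import re
from typing import List


def sketch_tokenize(text: str) -> List[str]:
    """Split text into upper-case math tokens and lower-case text tokens."""
    tokens: List[str] = []
    for coarse in text.split():
        if len(coarse) > 1 and coarse[1] == '!':
            # A prefixed math token such as 'V!x' is kept whole and upper-cased.
            tokens.append(coarse.upper())
        else:
            # Stand-in for gensim's simple_preprocess, NOT the real function.
            tokens.extend(re.findall(r'[a-z]+', coarse.lower()))
    return tokens


toy_tokens = sketch_tokenize('Prove that V!x O!plus N!1 is Positive')
print(toy_tokens)
# ['prove', 'that', 'V!X', 'O!PLUS', 'N!1', 'is', 'positive']
```

Because math tokens are upper-cased and text tokens are lower-cased, the two kinds of token cannot collide in the dictionary.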



We will produce a dictionary.

In [13]:
from multiprocessing import Pool

from gensim.corpora import Dictionary
from tqdm.notebook import tqdm

with Pool(None) as pool:
    document_bodies = pool.imap(preprocess, tqdm(answers.values(), desc='Producing a dictionary'))
    dictionary = Dictionary(document_bodies)
    dictionary.filter_extremes()

Producing a dictionary:   0%|          | 0/1445495 [00:00<?, ?it/s]

In [14]:
from itertools import chain

top_document_bodies = map(preprocess, top_answers.values())
top_query_bodies = map(preprocess, [topic for topic_id, topic in queries.items() if f'A.{topic_id}' in system.parsed_run])
top_dictionary = Dictionary(chain(top_document_bodies, top_query_bodies))

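We will produce two TF-IDF models over the same dictionary, one for documents and one for queries, which differ only in their SMART weighting schemes: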

In [15]:
from gensim.models import TfidfModel

tfidf_model_documents = TfidfModel(dictionary=dictionary, slope=0.2, smartirs='Lnu')
tfidf_model_queries = TfidfModel(dictionary=dictionary, slope=0.2, smartirs='ltb')

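We will load a precomputed term similarity matrix. The commented-out cells that follow show how such a matrix can be built from word embeddings and saved.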

In [16]:
from gensim.similarities import SparseTermSimilarityMatrix

%mkdir -p data/novotny
download('1R_XCVLBAyV9K_9h79R1jbC0kKSIQf2KT', 'data/novotny/word-similarities-100000')
word_similarity_matrix = SparseTermSimilarityMatrix.load('data/novotny/word-similarities-100000')

File exists: data/novotny/word-similarities-100000


In [17]:
# from gensim.models.keyedvectors import KeyedVectors

# if not (Path('data')/'novotny'/'medium-vectors').exists():
#     %mkdir -p data/novotny
#     download('1L6yz4cTyrPZgb-gkpLfAw-XTUVOK4tpZ', 'data/novotny/medium-vectors.zip')
#     ! unzip data/novotny/medium-vectors.zip -d data/novotny
# word_vectors = KeyedVectors.load('data/novotny/medium-vectors')

In [18]:
# from gensim.similarities.annoy import AnnoyIndexer

# annoy_indexer = AnnoyIndexer(word_vectors, num_trees=1)

In [19]:
# from gensim.similarities import WordEmbeddingSimilarityIndex

# word_similarities = WordEmbeddingSimilarityIndex(
#     word_vectors,
#     threshold=-1.0,
#     exponent=4.0,
#     kwargs={'indexer': annoy_indexer},
# )

In [20]:
# from gensim.similarities import SparseTermSimilarityMatrix

# word_similarity_matrix = SparseTermSimilarityMatrix(
#     word_similarities,
#     dictionary,
#     tfidf_model_documents,
#     symmetric=True,
#     dominant=True,
#     nonzero_limit=100,
# )

In [21]:
# word_similarity_matrix.save('data/novotny/word-similarities-100000')

In [22]:
from scipy.sparse import dok_matrix

word_similarity_matrix = dok_matrix(word_similarity_matrix.matrix)

## Producing the JSON corpus

Finally, we will produce the JSON ARQMath-2 corpus.

In [29]:
corpus = {'version': '1'}

In [30]:
corpus['results'] = dict()

for topic_id, result_ids in top_results.items():  # avoid shadowing `answers`
    topic_id = f'Topic {topic_id}'
    corpus['results'][topic_id] = result_ids

In [31]:
corpus['dictionary'] = dict()

for token, token_id in dictionary.token2id.items():
    if token not in top_dictionary.token2id:
        continue
    corpus['dictionary'][token_id] = token

In [32]:
corpus['word_similarities'] = dict()

term1_ids, term2_ids = word_similarity_matrix.nonzero()
term1_ids, term2_ids = map(int, term1_ids), map(int, term2_ids)
for term1_id, term2_id in zip(term1_ids, term2_ids):
    # dictionary[...] fills the lazily built id2token mapping before the lookup.
    term1, term2 = dictionary[term1_id], dictionary[term2_id]
    if term1 not in top_dictionary.token2id:
        continue
    if term2 not in top_dictionary.token2id:
        continue
    if term1_id >= term2_id:
        continue
    if term1_id not in corpus['word_similarities']:
        corpus['word_similarities'][term1_id] = dict()
    word_similarity = word_similarity_matrix[term1_id, term2_id]
    word_similarity = float(word_similarity)
    corpus['word_similarities'][term1_id][term2_id] = word_similarity
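
The loop above keeps only entries with `term1_id < term2_id`, that is, the part of the matrix strictly above the diagonal. The following sketch shows how a consumer could read such a structure back, assuming the matrix is symmetric and has a unit diagonal (both are assumptions here, since the diagonal is not stored). The helpers `upper_triangle` and `lookup` and the toy matrix are hypothetical.

```python
from typing import Dict, List


def upper_triangle(matrix: List[List[float]]) -> Dict[int, Dict[int, float]]:
    """Keep only nonzero entries strictly above the diagonal."""
    stored: Dict[int, Dict[int, float]] = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if i < j and value != 0.0:
                stored.setdefault(i, {})[j] = float(value)
    return stored


def lookup(stored: Dict[int, Dict[int, float]], i: int, j: int) -> float:
    """Read a similarity back, assuming symmetry and a unit diagonal."""
    if i == j:
        return 1.0  # assumption: every term is fully similar to itself
    lo, hi = min(i, j), max(i, j)
    return stored.get(lo, {}).get(hi, 0.0)


toy_matrix = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.6],
    [0.0, 0.6, 1.0],
]
toy_stored = upper_triangle(toy_matrix)
print(toy_stored)                # {0: {1: 0.3}, 1: {2: 0.6}}
print(lookup(toy_stored, 2, 1))  # 0.6
```

Storing one triangle roughly halves the number of similarity entries in the JSON file.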

In [34]:
corpus['texts'] = dict()
corpus['texts_bow'] = dict()

for topic_id, topic in queries.items():
    topic_id = f'A.{topic_id}'
    if topic_id not in system.parsed_run:
        continue
    topic_id = f'Topic {topic_id}'
    topic_tokens = preprocess(topic)
    corpus['texts'][topic_id] = []
    for token in topic_tokens:
        if token not in dictionary.token2id:
            continue
        token_id = dictionary.token2id[token]
        assert token in top_dictionary.token2id
        corpus['texts'][topic_id].append(str(token_id))
    topic_vector = dictionary.doc2bow(topic_tokens)
    topic_vector = tfidf_model_queries[topic_vector]
    corpus['texts_bow'][topic_id] = dict()
    for term_id, term_weight in topic_vector:
        term = dictionary[term_id]  # safe even if id2token is not yet populated
        assert term in top_dictionary.token2id
        corpus['texts_bow'][topic_id][term_id] = term_weight

for answer_id, answer in top_answers.items():
    answer_vector = dictionary.doc2bow(preprocess(answer))
    answer_vector = tfidf_model_documents[answer_vector]
    corpus['texts'][answer_id] = []
    corpus['texts_bow'][answer_id] = dict()
    for term_id, term_weight in answer_vector:
        term = dictionary[term_id]  # safe even if id2token is not yet populated
        assert term in top_dictionary.token2id
        corpus['texts'][answer_id].append(str(term_id))
        corpus['texts_bow'][answer_id][term_id] = term_weight

In [35]:
import json

with open('document-maps-corpus.json', 'wt') as f:
    json.dump(corpus, f, sort_keys=True, indent=4)

%ls -lh document-maps-corpus.json

-rw-r--r--. 1 xnovot32 student 602K  4. čec 16.03 document-maps-corpus.json
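
As a sanity check on the layout, the sketch below shows one way a consumer might read the corpus back and compare a topic with an answer. It makes several assumptions: that SCM refers to the soft cosine measure, that the omitted diagonal of the term similarity matrix is 1, and that the miniature corpus (with made-up identifiers and weights) mirrors the real file. Note that JSON turns integer dictionary keys into strings, so they have to be converted back after loading.

```python
import json
import math
from typing import Dict


def soft_cosine(x: Dict[int, float], y: Dict[int, float],
                similarities: Dict[int, Dict[int, float]]) -> float:
    """Soft cosine of two sparse vectors given upper-triangle similarities."""
    def sim(i: int, j: int) -> float:
        if i == j:
            return 1.0  # assumption: unit diagonal, which the corpus omits
        lo, hi = min(i, j), max(i, j)
        return similarities.get(lo, {}).get(hi, 0.0)

    def inner(a: Dict[int, float], b: Dict[int, float]) -> float:
        return sum(wa * sim(i, j) * wb
                   for i, wa in a.items() for j, wb in b.items())

    norm = math.sqrt(inner(x, x) * inner(y, y))
    return inner(x, y) / norm if norm else 0.0


def int_keys(mapping: dict) -> dict:
    """Convert the string keys that JSON produces back to integers."""
    return {int(key): value for key, value in mapping.items()}


# A hypothetical miniature corpus in the layout produced above.
toy_corpus = {
    'texts_bow': {'Topic A.201': {0: 1.0}, '111': {1: 1.0}, '222': {2: 1.0}},
    'word_similarities': {0: {1: 0.5}},
}
loaded = json.loads(json.dumps(toy_corpus))  # integer keys come back as strings
similarities = {i: int_keys(row)
                for i, row in int_keys(loaded['word_similarities']).items()}
bows = {name: int_keys(bow) for name, bow in loaded['texts_bow'].items()}

print(soft_cosine(bows['Topic A.201'], bows['111'], similarities))  # 0.5
print(soft_cosine(bows['Topic A.201'], bows['222'], similarities))  # 0.0
```

The topic and answer `111` share no term, yet they receive a nonzero score because terms 0 and 1 are similar; this is the behaviour that distinguishes the soft cosine measure from the ordinary cosine.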
