# Evaluation Example
This notebook shows how to evaluate a model, using a simple TF-IDF model built with scikit-learn as the example.

## TF-IDF Test model

In [1]:
import scipy
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from src import init_data
from arqmath_code.topic_file_reader import Topic
from arqmath_code.Entities.Post import Answer
from typing import List
import re

In [2]:
topic_reader, data_reader = init_data(task=1)

reading users
reading comments
reading votes
reading post links
reading posts


In [38]:
def title_tf_idf_model(query: Topic, answers: List[Answer]) -> List[tuple[int, float]]:
    answer_bodys: List[str] = [answer.body for answer in answers]
    training_set: List[str] = answer_bodys.copy()
    training_set.append(query.title)

    vectorizer: TfidfVectorizer = TfidfVectorizer()
    vectorizer.fit(training_set)
    query_vector: scipy.sparse.csr_matrix = vectorizer.transform([query.title])
    word_term_matrix: scipy.sparse.csr_matrix = vectorizer.transform(answer_bodys)
    cos_sims: np.ndarray = cosine_similarity(query_vector, word_term_matrix)
    # pair each answer index with its score, sort by score (best first), keep the top 1000
    ranking: List[tuple[int, float]] = sorted(enumerate(cos_sims[0]), key=lambda pair: pair[1], reverse=True)[:1000]
    return ranking

def binary_tag_retrieval(query: Topic) -> List[Answer]:
    questions = list(set([question for tag in query.lst_tags for question in data_reader.get_question_of_tag(tag=tag)]))
    questions = filter(lambda question: question.answers is not None, questions)
    return [answer for single_question in questions for answer in single_question.answers]


def clean_post(query: Topic, answers: List[Answer]) -> tuple[Topic, List[Answer]]:
    query.title = re.sub(r"</?(p|span)[^>]*>", "", query.title)
    query.question = re.sub(r"</?(p|span)[^>]*>", "", query.question)
    for answer in answers:
        answer.body = re.sub(r"</?(p|span)[^>]*>", "", answer.body)
    return query, answers
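
To see what the regular expression in `clean_post` removes, here is a small sketch on an invented HTML snippet. One thing it shows is that the pattern also strips tags whose names merely start with `p`, such as `<pre>`. `STRICT_TAG_PATTERN` is a hypothetical variant, not part of the notebook, that adds a word boundary to avoid this.

```python
import re

# Pattern used by clean_post above.
TAG_PATTERN = re.compile(r"</?(p|span)[^>]*>")
# Hypothetical stricter variant (not in the notebook): \b stops "p" from
# matching the start of longer tag names such as <pre>.
STRICT_TAG_PATTERN = re.compile(r"</?(p|span)\b[^>]*>")

# Invented sample text, only for illustration.
sample = '<p>Let <span class="math-container">x^2</span> be <pre>code</pre></p>'

cleaned = TAG_PATTERN.sub("", sample)
cleaned_strict = STRICT_TAG_PATTERN.sub("", sample)

print(cleaned)         # Let x^2 be code
print(cleaned_strict)  # Let x^2 be <pre>code</pre>
```

Whether stripping `<pre>` matters depends on the posts in the collection, so the stricter pattern is only a suggestion.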

In [39]:
# retrieve Ranking
test_topic: Topic = topic_reader.get_topic('A.301')
answers: List[Answer] = binary_tag_retrieval(test_topic)
test_topic, answers = clean_post(test_topic, answers=answers)
ranking = title_tf_idf_model(query=test_topic, answers=answers)
ranking

[(130043, 0.7480850652590038),
 (22275, 0.7226301265950487),
 (115360, 0.720254697052057),
 (121637, 0.7100482990228669),
 (146408, 0.6994569035989084),
 (146560, 0.6785593559975694),
 (25235, 0.6465412416307746),
 (49078, 0.6334861978286853),
 (45154, 0.6079706198174982),
 (49079, 0.5901755599983786),
 (94429, 0.5811625634076902),
 (35287, 0.578443391072904),
 (46049, 0.5775151445782488),
 (24322, 0.5773428535034383),
 (141944, 0.5754797383089231),
 (61766, 0.5692501198451573),
 (28850, 0.5691115620063324),
 (143500, 0.5687758109160417),
 (140654, 0.5680495034940503),
 (54960, 0.5650201783446158),
 (151201, 0.5626210945204969),
 (154901, 0.5624934590697046),
 (25237, 0.5622719432440964),
 (72399, 0.5572678417823526),
 (118868, 0.5554166495088814),
 (117654, 0.5542611695350558),
 (114004, 0.5506869786880706),
 (142874, 0.5502775580397614),
 (114007, 0.5472653559290036),
 (154902, 0.5334137599553129),
 (46035, 0.5287540190003677),
 (65252, 0.5274849500179186),
 (39519, 0.526937263154112
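
The ranking step can be tried in isolation. The following minimal sketch repeats the logic of `title_tf_idf_model` on three invented sentences instead of ARQmath posts; `query_title` and `answer_bodies` are stand-ins for `query.title` and the answer bodies. Note that the first element of each tuple is a position in the answer list, not a post ID, which is why the next section maps it through `answers[answer_idx].post_id`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for query.title and the answer bodies (invented for illustration).
query_title = "derivative of a norm"
answer_bodies = [
    "the integral of a polynomial",
    "the derivative of a norm is linear",
    "eigenvalues of a symmetric matrix",
]

# Fit the vocabulary on the answers plus the query, as title_tf_idf_model does.
vectorizer = TfidfVectorizer()
vectorizer.fit(answer_bodies + [query_title])

query_vector = vectorizer.transform([query_title])
answer_matrix = vectorizer.transform(answer_bodies)

# One row of similarities with shape (1, number of answers).
cos_sims = cosine_similarity(query_vector, answer_matrix)

# Pair each answer index with its score and sort by score, best first.
ranking = sorted(enumerate(cos_sims[0]), key=lambda pair: pair[1], reverse=True)
print(ranking[0][0])  # index of the best-matching answer
```

Here the second sentence shares the most terms with the query, so index 1 comes first.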

## ARQmath evaluation format
To evaluate the output of a retrieval pipeline, the results have to be saved in the ARQmath TSV format, which is explained further in [ArqMath2020-EvaluationProtocols.V1.1.pdf](documentation/ArqMath2020-EvaluationProtocols.V1.1.pdf):
```
Query_Id Post_Id Rank Score Run_Number
```
As shown below, I suggest building a pandas DataFrame and then saving it as a .tsv file in ./results (this folder is listed in .gitignore and is therefore not present in the repository), so that the folder structure looks like this:
![../documentation/resutls_folders.png](../documentation/resutls_folders.png)

In [40]:
import pandas as pd

In [41]:
df_dict = {
    "Query_Id": [test_topic.topic_id for i in range(len(ranking))],
    "Post_Id": [answers[answer_idx].post_id for answer_idx, score in ranking],
    "Rank": [i for i in range(len(ranking))],
    "Score": [score for answer_idx, score in ranking],
    "Run_Number": [0 for i in range(len(ranking))]
}
df = pd.DataFrame(df_dict)
df

|  | Query_Id | Post_Id | Rank | Score | Run_Number |
|---|---|---|---|---|---|
| 0 | A.301 | 242489 | 0 | 0.748085 | 0 |
| 1 | A.301 | 133987 | 1 | 0.722630 | 0 |
| 2 | A.301 | 583792 | 2 | 0.720255 | 0 |
| 3 | A.301 | 1603654 | 3 | 0.710048 | 0 |
| 4 | A.301 | 2049554 | 4 | 0.699457 | 0 |
| ... | ... | ... | ... | ... | ... |
| 995 | A.301 | 2896245 | 995 | 0.204031 | 0 |
| 996 | A.301 | 346082 | 996 | 0.204002 | 0 |
| 997 | A.301 | 1313526 | 997 | 0.203914 | 0 |
| 998 | A.301 | 1643912 | 998 | 0.203615 | 0 |


In [42]:
df.to_csv(path_or_buf="../results/model_results/tf-idf-test.tsv", sep='\t', index=False)
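
Before running the evaluation it can help to sanity-check a saved run. The sketch below is an assumption about what is worth checking, not part of the ARQmath tooling: `check_run_frame` is a hypothetical helper that verifies the column names used above, looks for duplicate posts per query, and confirms that scores do not increase with rank. It reads an in-memory string holding the first two rows of the table above instead of the real file.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["Query_Id", "Post_Id", "Rank", "Score", "Run_Number"]


def check_run_frame(run: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if list(run.columns) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(run.columns)}")
        return problems
    for query_id, group in run.groupby("Query_Id"):
        if group["Post_Id"].duplicated().any():
            problems.append(f"{query_id}: duplicate Post_Id")
        # Scores ordered by rank should never go up.
        if not group.sort_values("Rank")["Score"].is_monotonic_decreasing:
            problems.append(f"{query_id}: scores do not decrease with rank")
    return problems


# Same layout as the file written above, reduced to two rows.
toy_tsv = (
    "Query_Id\tPost_Id\tRank\tScore\tRun_Number\n"
    "A.301\t242489\t0\t0.748085\t0\n"
    "A.301\t133987\t1\t0.722630\t0\n"
)
run = pd.read_csv(io.StringIO(toy_tsv), sep="\t")
problems = check_run_frame(run)
print(problems)
```

For the real file, the `io.StringIO(...)` argument would be replaced by the path passed to `df.to_csv`.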

Take a look at the answer post with the best score:

In [43]:
data_reader.post_parser.map_just_answers[242489].body

"\\def\\R{\\mathbb R}\\def\\norm#1{\\left\\|#1\\right\\|}\\def\\abs#1{\\left|#1\\right|}\\def\\sp#1{\\left\\langle#1\\right\\rangle}Recall that for a function F\\colon \\R^n \\to \\R^n to be differentiable at x \\in \\R^n there must exist a linear DF(x) \\colon \\R^n \\to \\R^n such that  F(x+h) = F(x) + DF(x)h + o(h), \\qquad h \\to 0  For the given F we have for x,h\\in \\R^n \\begin{align*}   F(x+h) - F(x) &amp;= \\norm{x+h}^2(x+h) - \\norm x^2x\\\\          &amp;= \\sp{x+h,x+h}(x+h) - \\norm x^2x\\\\          &amp;= \\bigl(\\norm x^2 + 2\\sp{x,h} + \\norm h^2\\bigr)(x+h) - \\norm x^2 x\\\\          &amp;= \\norm x^2h + 2\\sp{x,h}x + \\norm h^2x + 2\\sp{x,h}h + \\norm h^2 h \\end{align*} Note that DF(x)h := \\norm x^2 h + 2\\sp{x,h}x is linear in h and  \\begin{align*}   \\norm{\\norm h^2x + 2\\sp{x,h}h + \\norm h^2 h} &amp;\\le \\norm h^2\\norm x + 2\\norm x\\norm h^2    + \\norm h^3\\\\    &amp;= \\norm h \\bigl(2\\norm h\\norm x + \\norm h^2\\bigr)\\\\    &amp;= o(\\norm h), \\qq

## Evaluating Results

In [35]:
from arqmath_code.evaluation.task1 import arqmath_to_prime_task1
from arqmath_code.evaluation.task1 import task1_get_results

As a first step, the results have to be converted to TREC format:

In [44]:
qrel_dictionary = arqmath_to_prime_task1.read_qrel_to_dictionary("../arqmath_dataset/evaluation/Task 1/Qrel Files/qrel_task1_2022_all.tsv")
arqmath_to_prime_task1.convert_result_files_to_trec(submission_dir="../results/model_results/", qrel_result_dic=qrel_dictionary, prim_dir="../results/ARQmath_prim/", trec_dir="../results/ARQmath_trec/")
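
The internals of `convert_result_files_to_trec` are not shown in this notebook. As a rough illustration of the general idea only, the sketch below assumes that the conversion keeps the posts that have a relevance judgement and writes them in the six-column run layout that trec_eval reads (`query_id Q0 doc_id rank score run_tag`). `to_trec_lines`, the `judged` set, and the `run0` tag are hypothetical names made up for this example.

```python
def to_trec_lines(rows, judged, run_tag="run0"):
    """Format (query_id, post_id, score) rows as TREC run lines.

    Rows whose (query_id, post_id) pair is not in `judged` are dropped,
    and ranks are reassigned from 1 in the order the rows are given.
    """
    lines = []
    rank_per_query = {}
    for query_id, post_id, score in rows:
        if (query_id, post_id) not in judged:
            continue
        rank = rank_per_query.get(query_id, 0) + 1
        rank_per_query[query_id] = rank
        lines.append(f"{query_id} Q0 {post_id} {rank} {score} {run_tag}")
    return lines


# First three rows of the run above; the judged set is invented.
rows = [
    ("A.301", 242489, 0.748085),
    ("A.301", 133987, 0.72263),
    ("A.301", 583792, 0.720255),
]
judged = {("A.301", 242489), ("A.301", 583792)}

trec_lines = to_trec_lines(rows, judged)
for line in trec_lines:
    print(line)
```

The real function may differ in details such as the run tag or how ranks are assigned.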

In the second step, the actual performance metrics can be computed. This requires trec_eval to be installed. On macOS this can be done via Homebrew.
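
Since the evaluation depends on an external binary, a quick check that it can be found avoids a confusing failure later. This sketch assumes that the `trec_eval_tool="trec_eval"` argument used below is resolved through `PATH`; `find_tool` is a hypothetical helper built on the standard library's `shutil.which`.

```python
import shutil


def find_tool(name: str = "trec_eval"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


tool_path = find_tool()
if tool_path is None:
    print("trec_eval not found on PATH; install it before running the next cell")
else:
    print(f"using trec_eval at {tool_path}")
```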

In [45]:
number_topics = 78.0
task1_get_results.get_result(trec_eval_tool="trec_eval", qre_file_path="../arqmath_dataset/evaluation/Task 1/Qrel Files/qrel_task1_2022_all.tsv", prim_result_dir="../results/ARQmath_prim/", evaluation_result_file="../results/test1.tsv", number_topics=number_topics)

-----------
['ndcg                  ', 'A.301', '0.1866']
-----------
['ndcg                  ', 'all', '0.1866']
-----------
['map                   ', 'A.301', '0.0515']
-----------
['map                   ', 'all', '0.0515']
-----------
['P_10                  ', 'A.301', '0.1000']
-----------
['P_10                  ', 'all', '0.1000']
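
To make the `P_10` value concrete: precision at 10 is the fraction of the ten highest-ranked answers that count as relevant, so 0.1000 corresponds to one relevant answer among the top ten. The following minimal sketch uses invented relevance flags and is not the trec_eval implementation; `precision_at_k` is a hypothetical helper.

```python
def precision_at_k(relevance_flags, k=10):
    """Fraction of the first k ranked items that are relevant.

    Divides by k even when fewer than k items are ranked.
    """
    top = relevance_flags[:k]
    return sum(1 for flag in top if flag) / k


# Invented flags for a ranked list: only the third answer is relevant.
flags = [False, False, True] + [False] * 7
print(precision_at_k(flags))  # 0.1
```

Which answers count as relevant depends on the relevance threshold applied to the qrel file, which this sketch leaves out.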
