# Case studies

1. **Gold standard**: `mine50-andor` contains the 50 most recent articles from [arxiv.org in both the cs.LG and stat.ML categories](https://arxiv.org/list/cs.LG/recent) dated between 2022-10-24 and 2022-10-25; the search returned 570 results at the time the dataset was created. The selection keeps articles that belong to cs.LG `or` (cs.LG `and` stat.ML).

2. `mine50` contains the 50 most recent articles from [arxiv.org in both the cs.LG and stat.ML categories](https://arxiv.org/list/cs.LG/recent) dated between 2022-10-24 and 2022-10-25; the search returned 570 results at the time the dataset was created. The search results are sorted by date in descending order.

    !!! note
        The date being queried is the last-updated date, not the date the paper was submitted.

3. `mine50-csLG` contains the results obtained with the same method as `mine50`, but without requiring articles to be in both cs.LG and stat.ML.
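
The selection rules above can be sketched as filters over article metadata. This is a minimal illustration under stated assumptions, not the code used to build the corpora: the `articles` records, the `updated` field, and the `select`, `in_both`, and `in_andor` helpers are all hypothetical.

```python
from datetime import date

# Hypothetical article records; the real corpora were built from arXiv search results.
articles = [
    {"id": "a1", "categories": {"cs.LG", "stat.ML"}, "updated": date(2022, 10, 25)},
    {"id": "a2", "categories": {"cs.LG"}, "updated": date(2022, 10, 24)},
    {"id": "a3", "categories": {"stat.ML"}, "updated": date(2022, 10, 25)},
    {"id": "a4", "categories": {"cs.LG", "cs.CV"}, "updated": date(2022, 10, 23)},
]


def in_both(cats):
    """cs.LG and stat.ML (the rule described for `mine50`)."""
    return {"cs.LG", "stat.ML"} <= cats


def in_andor(cats):
    """cs.LG or (cs.LG and stat.ML), read literally (the `mine50-andor` rule)."""
    return "cs.LG" in cats or in_both(cats)


def select(records, rule, start, end, limit=50):
    """Keep records matching `rule` with start <= updated <= end, newest first."""
    hits = [r for r in records if rule(r["categories"]) and start <= r["updated"] <= end]
    hits.sort(key=lambda r: r["updated"], reverse=True)  # descending by last-updated date
    return [r["id"] for r in hits[:limit]]


window = (date(2022, 10, 24), date(2022, 10, 25))
both_ids = select(articles, in_both, *window)
andor_ids = select(articles, in_andor, *window)
print(both_ids)
print(andor_ids)
```

Read literally, `cs.LG or (cs.LG and stat.ML)` accepts every cs.LG article, so in this sketch the and/or rule differs from the `mine50` rule only by also admitting cs.LG-only articles.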

## Evaluating ReproScreener on the manually labeled (gold standard) dataset

In [1]:
import pandas as pd
from IPython.display import display
from pathlib import Path

path_corpus_andor = Path("../case-studies/arxiv-corpus/mine50-andor/")

dtypes_repro = {'id': str, 'link_count': float, 'found_links': str}
eval_andor = pd.read_csv(path_corpus_andor / 'output/repro_eval_tex.csv', dtype=dtypes_repro)[['id', 'link_count', 'found_links']]

The first 5 articles where ReproScreener found potential code/repository links:

In [2]:
eval_andor_links = eval_andor[eval_andor['link_count'] > 0]
eval_andor_links.head()

| | id | link_count | found_links |
|---|---|---|---|
| 4 | 1909.00931 | 3.0 | `['https://github.com/lanwuwei/Twitter-URL-Corp...` |
| 8 | 2009.01947 | 1.0 | `['https://gitlab.com/luciacyx/nm-adaptive-code...` |
| 9 | 2010.04261 | 1.0 | `['https://github.com/goodfeli/dlbook_notation/']` |
| 11 | 2011.11576 | 5.0 | `['https://github.com/jpbrooks/conjecturing.', ...` |
| 12 | 2012.09302 | 1.0 | `['https://github.com/ain-soph/trojanzoo}.']` |


Below are the scores from the manually labeled dataset of 50 articles. The columns are:

- `article_link_avail`: Whether a link to the code/repository could be found in the article.
- `pwc_link_avail`: Whether a link to the code/repository could be found on the Papers With Code (`pwc`) website.
- `pwc_link_match`: Whether the code/repository link found on the Papers With Code (`pwc`) website matches the link found in the article (that is, whether the links behind the previous two columns agree).
- `result_replication_code_avail`: Whether code to replicate the specific experiments presented in the article was available. This measures whether the code is specific to the experiments in the article rather than just a generic implementation of the model (for example, part of a tool or package). If code is not available, this defaults to false.
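
One way to sanity-check labels against these definitions is sketched below. It assumes two constraints that the descriptions suggest but do not state outright: `pwc_link_match` can be 1 only when both link columns are 1, and `result_replication_code_avail` can be 1 only when at least one link was found. The sample rows are six papers copied from the manual evaluation output in this section; `find_inconsistent` is a hypothetical helper, not part of ReproScreener.

```python
import pandas as pd

# Six rows copied from the manual evaluation output shown in this section.
labels = pd.DataFrame(
    {
        "paper": ["2010.04261", "2010.04855", "2011.11576",
                  "2012.09302", "2101.07354", "2102.11887"],
        "article_link_avail": [0, 0, 1, 1, 0, 0],
        "pwc_link_avail": [0, 0, 1, 1, 0, 0],
        "pwc_link_match": [0, 0, 0, 1, 0, 0],
        "result_replication_code_avail": [0, 0, 0, 1, 0, 0],
    }
)


def find_inconsistent(df):
    """Return rows that violate the two assumed constraints (hypothetical helper)."""
    both_links = (df["article_link_avail"] == 1) & (df["pwc_link_avail"] == 1)
    any_link = (df["article_link_avail"] == 1) | (df["pwc_link_avail"] == 1)
    # Assumption: a match needs a link on both sides.
    bad_match = (df["pwc_link_match"] == 1) & ~both_links
    # Assumption: replication code needs at least one link to exist.
    bad_code = (df["result_replication_code_avail"] == 1) & ~any_link
    return df[bad_match | bad_code]


inconsistent = find_inconsistent(labels)
print(len(inconsistent))
```

An empty result only means the labels are consistent with the assumed constraints; it says nothing about whether the labels themselves are correct.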

In [3]:
manual = pd.read_csv("./manual_eval.csv")
manual_df_numerical = manual[['paper', 'article_link_avail', 'pwc_link_avail', 'pwc_link_match', 'result_replication_code_avail']]
manual_df_numerical = manual_df_numerical.drop(index=[0,51]) # drop first row (summary) and last row (totals)
manual_df_numerical = manual_df_numerical.fillna(0) # fill NaN with 0
dtypes_manual = {'paper': str, 'article_link_avail': float, 'pwc_link_avail': float, 'pwc_link_match': float, 'result_replication_code_avail': float}
manual_df_numerical = manual_df_numerical.astype(dtypes_manual) # cast to the dtypes above (paper as str, scores as float)
manual_df_numerical[9:15]

| | paper | article_link_avail | pwc_link_avail | pwc_link_match | result_replication_code_avail |
|---|---|---|---|---|---|
| 10 | 2010.04261 | 0.0 | 0.0 | 0.0 | 0.0 |
| 11 | 2010.04855 | 0.0 | 0.0 | 0.0 | 0.0 |
| 12 | 2011.11576 | 1.0 | 1.0 | 0.0 | 0.0 |
| 13 | 2012.09302 | 1.0 | 1.0 | 1.0 | 1.0 |
| 14 | 2101.07354 | 0.0 | 0.0 | 0.0 | 0.0 |
| 15 | 2102.11887 | 0.0 | 0.0 | 0.0 | 0.0 |


Tally of manual evaluation of the 50 articles:

In [4]:
manual_df_numerical.sum(axis=0, numeric_only=True)

article_link_avail               23.0
pwc_link_avail                   22.0
pwc_link_match                   19.0
result_replication_code_avail    20.0
dtype: float64

In [5]:
manual_vs_repro = manual_df_numerical.merge(eval_andor_links, left_on='paper', right_on='id', how='left')
# manual_df_numerical.article_link_avail.sum(), manual_df_numerical.result_replication_code_avail.sum()
print(f"Manual evaluation found links in {manual_vs_repro.article_link_avail.sum()} papers, ReproScreener found links in {(manual_vs_repro.link_count>0).sum()} papers")

Manual evaluation found links in 23.0 papers, ReproScreener found links in 21 papers
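
Similar totals do not show whether the two methods agree on the same papers. A per-paper cross-tabulation is one way to check. The sketch below uses four papers whose values appear in the outputs in this section; 2010.04855 does not appear among the ReproScreener rows with links, so its `link_count` is taken to be 0 here, which is an inference rather than a reported value. The `manual` and `repro` axis names are illustrative.

```python
import pandas as pd

# Four papers taken from the outputs shown in this section.
# link_count for 2010.04855 is assumed to be 0 (it is absent from the link list).
sample = pd.DataFrame(
    {
        "paper": ["2010.04261", "2010.04855", "2011.11576", "2012.09302"],
        "article_link_avail": [0.0, 0.0, 1.0, 1.0],
        "link_count": [1.0, 0.0, 5.0, 1.0],
    }
)

manual_found = sample["article_link_avail"] > 0
repro_found = sample["link_count"] > 0

# Rows: manual label; columns: ReproScreener result.
agreement = pd.crosstab(manual_found, repro_found, rownames=["manual"], colnames=["repro"])
print(agreement)

# Papers where the two methods disagree.
disagree = sample.loc[manual_found != repro_found, "paper"].tolist()
print(disagree)
```

In this small sample, 2010.04261 is flagged by ReproScreener but not by the manual label; the link found for it is `https://github.com/goodfeli/dlbook_notation/`. Applying the same cross-tabulation to `manual_vs_repro` would give the per-paper picture for all 50 articles.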
