# Virtual Document (VDoc) on SCROLLS datasets 

In this notebook we demonstrate how to run the vdoc code on the Qasper dataset ([Dasigi et al., 2021](https://arxiv.org/abs/2105.03011)) which is part of the SCROLLS suite ([Shaham et al. 2022](https://arxiv.org/abs/2201.03533)).



## Prerequisites

It is assumed that the user has already setup a Python 3.11 environment and installed the requirements in `requirements.txt` following the README file.

In addition to that, we will need to install the Python package `wget`

In [1]:
%pip install wget

Collecting wget
  Using cached wget-3.2-py3-none-any.whl
Installing collected packages: wget
Successfully installed wget-3.2
Note: you may need to restart the kernel to use updated packages.


And to make sure we have NLTK tokenizer

In [2]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/yosimass/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Experimental Setup

Here we define the configuration for our experiment:

In [3]:

# scrolls base url
scrolls_url = 'https://huggingface.co/datasets/tau/scrolls/resolve/main'

# dataset name, file and split
dataset_name = "qasper"
dataset_zip_file = 'quality.zip'
dataset_split = "validation"

# vdoc settings
model_name = "sentence-transformers/all-MiniLM-L12-v2"
model_token_limit = 2400
max_new_tokens = 0
passage_len = 512
order = "doc"
queries_limit=500

# input and output
import os
input_file = os.path.join(dataset_name, dataset_zip_file[:dataset_zip_file.index(".")], f"{dataset_split}.jsonl")
output_file = os.path.join(dataset_name, "output.csv")


Now we turn to downloading Qasper from [SCROLLS](https://www.scrolls-benchmark.com/). 

In [4]:
import wget
import zipfile


wget.download(f"{scrolls_url}/{dataset_zip_file}")
with zipfile.ZipFile(dataset_zip_file, 'r') as zip_ref:
    zip_ref.extractall(dataset_name)

100% [........................................................................] 31252462 / 31252462

## Running a VDoc experiment

In [5]:
from vdoc_eval import vdoc

vdoc(dataset="scrolls",
     input_file=input_file,
     output_file=output_file,
     model_name=model_name,
     model_token_limit=model_token_limit,
     max_new_tokens=max_new_tokens,
     passage_len=passage_len,
     order=order,
     queries_limit=queries_limit)


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/yosimass/PycharmProjects/vdoc_paper/venv/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/Users/yosimass/PycharmProjects/vdoc_paper/venv/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/Users/yosimass/PycharmProjects/vdoc_paper/venv/lib/python3.11/site-packages/ipykernel/ker

qasper/output.csv - Total count 500. bad documents 0. total vdoc 499. bad vdocs 0
