# Virtual Document (VDoc) on SCROLLS datasets 

In this notebook, we demonstrate how to run the vdoc code on the Qasper dataset ([Dasigi et al., 2021](https://arxiv.org/abs/2105.03011)), which is part of the SCROLLS suite ([Shaham et al., 2022](https://arxiv.org/abs/2201.03533)).



## Prerequisites

We assume that you have already set up a Python 3.11 environment and installed the requirements in `requirements.txt`, following the README file.

In addition, we need to install the Python package `wget`:

In [None]:
%pip install wget

We also need to make sure the NLTK `punkt_tab` tokenizer data is available:

In [None]:
import nltk
nltk.download('punkt_tab')

## Experimental Setup

Here we define the configuration for our experiment:

In [None]:

# scrolls base url
scrolls_url = 'https://huggingface.co/datasets/tau/scrolls/resolve/main'

# dataset name, file and split
dataset_name = "qasper"
dataset_zip_file = 'qasper.zip'
dataset_split = "validation"

# vdoc settings
model_name = "sentence-transformers/all-MiniLM-L12-v2"
model_token_limit = 2400
max_new_tokens = 0
passage_len = 512
order = "doc"

# input and output
import os
input_file = os.path.join(dataset_name, dataset_zip_file[:dataset_zip_file.index(".")], f"{dataset_split}.jsonl")
output_file = os.path.join(dataset_name, "output.csv")
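
The input path above is built from three pieces: the extraction folder, the archive name without its extension, and the split name. As a minimal sketch of the same construction, the hypothetical helper `scrolls_input_path` below (not part of the vdoc code) uses `pathlib` to strip the extension. It assumes, as the configuration cell already does, that the archive unpacks into a folder named after itself.

```python
import os
from pathlib import Path


def scrolls_input_path(dataset_name, dataset_zip_file, dataset_split):
    """Build the expected path of a split file after extraction.

    Hypothetical helper for illustration; assumes the archive unpacks
    into a folder that carries the archive's own name.
    """
    # "qasper.zip" -> "qasper"
    archive_stem = Path(dataset_zip_file).stem
    return os.path.join(dataset_name, archive_stem, f"{dataset_split}.jsonl")


print(scrolls_input_path("qasper", "qasper.zip", "validation"))
```

Note that `Path.stem` removes only the last extension, whereas slicing at the first `.` cuts at the first dot. The two agree for archive names with a single dot.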


Next, we download the Qasper archive from [SCROLLS](https://www.scrolls-benchmark.com/) and unpack it:

In [None]:
import wget
import zipfile


wget.download(f"{scrolls_url}/{dataset_zip_file}")
with zipfile.ZipFile(dataset_zip_file, 'r') as zip_ref:
    zip_ref.extractall(dataset_name)
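
Re-running the cell above fetches the archive again. If that is undesirable, one possible variant is sketched below using only the standard library: it skips the download when the zip file is already on disk and then extracts it. The helper name `fetch_and_extract` is hypothetical and not part of the vdoc code.

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(url, zip_path, target_dir):
    """Download url to zip_path unless it already exists, then unpack it.

    Hypothetical helper for illustration. Returns the list of member
    names contained in the archive.
    """
    if not os.path.exists(zip_path):
        # Only hit the network when the archive is not cached locally.
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
        return zip_ref.namelist()


# Usage with the notebook's variables (left commented out here because it
# needs network access):
# fetch_and_extract(f"{scrolls_url}/{dataset_zip_file}", dataset_zip_file, dataset_name)
```

The returned member names are a quick way to confirm which folder and split files the archive actually contains.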

## Running a VDoc experiment
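
Before launching the run, it can be worth confirming that the extracted split file exists and parses as JSON Lines. The sketch below reads the first few records without assuming any particular field names. `peek_jsonl` is a hypothetical helper, not part of the vdoc code.

```python
import json
from itertools import islice


def peek_jsonl(path, n=1):
    """Return the first n non-empty records of a JSON Lines file as dicts."""
    with open(path, "r", encoding="utf-8") as f:
        non_empty = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_empty, n)]


# Usage with the notebook's variables: print each field, truncating long strings.
# for record in peek_jsonl(input_file):
#     for key, value in record.items():
#         preview = value[:80] if isinstance(value, str) else value
#         print(f"{key}: {preview!r}")
```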

In [None]:
from vdoc_eval import vdoc

vdoc(dataset="scrolls",
     input_file=input_file,
     output_file=output_file,
     model_name=model_name,
     model_token_limit=model_token_limit,
     max_new_tokens=max_new_tokens,
     passage_len=passage_len,
     order=order)
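
Once the run finishes, the results should be in `output_file`. This notebook does not describe the column layout of that CSV, so the sketch below only reports the first row and the number of rows after it. It assumes the first row is a header, which may not hold for the actual output. `summarize_csv` is a hypothetical helper, not part of the vdoc code.

```python
import csv


def summarize_csv(path):
    """Return the first row and the count of remaining rows in a CSV file.

    Assumption: the first row is a header. If the file has no header,
    the first data row is returned in its place.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count


# Usage with the notebook's variables:
# header, row_count = summarize_csv(output_file)
# print(header, row_count)
```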