# Mini-tutorial: RepLiQA samples with associated PDFs

This tutorial assumes familiarity with the [`datasets`](https://huggingface.co/docs/datasets) library from Hugging Face.

## Everything except the PDFs

The extracted text and associated annotations for every released RepLiQA split are accessible through the standard `datasets.load_dataset` call.

In [None]:
import datasets
cache_dir = './hf_cache'  # Edit for wherever you wish the cache to be.
repliqa = datasets.load_dataset("ServiceNow/repliqa", cache_dir=cache_dir)
repliqa

At the time of writing, only `repliqa_0` had been released, so running the above gave:
```
DatasetDict({
    repliqa_0: Dataset({
        features: ['document_id', 'document_topic', 'document_path', 'document_extracted', 'question_id', 'question', 'answer', 'long_answer'],
        num_rows: 17955
    })
})
```
The first element (i.e., `repliqa["repliqa_0"][0]`) is:
```
{
    'document_id': 'kiqpsbuw',
    'document_topic': 'Small and Medium Enterprises',
    'document_path': 'pdfs/repliqa_0/kiqpsbuw.pdf',
    'document_extracted': 'Founder Journeys: The Personal and Professional Journey of SME Entrepreneurs \n\nFuzhou is a vibrant hub of entrepreneurial spirit and innovation, and stories of resilience, creativity, and determination weaved throughout small and medium enterprises (SMEs). [...]',
    'question_id': 'kiqpsbuw-q1',
    'question': 'What motivated Zhao Wei to found WeTech?',
    'answer': 'Zhao was motivated by his belief in doing well by doing good.',
    'long_answer': 'My motivation comes from doing well by doing good, explained Zhao while sipping\nlocally-sourced green tea as his energy booster. His work contributes to protecting the earth\nfor future generations - something which kept him going even when times got difficult.'
}
```
Note that the full `"document_extracted"` field is 6452 characters long and has thus been truncated above.
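
Each row corresponds to one question, and several rows may refer to the same document. As a minimal sketch of how rows could be regrouped per document, the helper below (`group_questions_by_document` is a hypothetical name, not part of RepLiQA or `datasets`) maps each `document_id` to its `question_id` values. It is demonstrated on toy rows with made-up identifiers so that it runs without downloading anything.

```python
from collections import defaultdict
from typing import Iterable


def group_questions_by_document(samples: Iterable[dict]) -> dict[str, list[str]]:
    """Map each document_id to the question_ids that refer to it, in row order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for sample in samples:
        grouped[sample["document_id"]].append(sample["question_id"])
    return dict(grouped)


# Toy rows with made-up identifiers; they only mimic the two fields used above.
toy_samples = [
    {"document_id": "doc_a", "question_id": "doc_a-q1"},
    {"document_id": "doc_a", "question_id": "doc_a-q2"},
    {"document_id": "doc_b", "question_id": "doc_b-q1"},
]
questions_by_document = group_questions_by_document(toy_samples)
print(questions_by_document)
```

With the real data, one would presumably pass `repliqa["repliqa_0"]` instead of `toy_samples`.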

## Accessing the PDFs

The PDFs are also stored on Hugging Face, but extra steps are required to actually get them. This tutorial presents two approaches.

### Approach 1: get the PDFs one-by-one

In [None]:
import os
import huggingface_hub

def get_path_to_local_pdf_cache(sample: dict[str, str]) -> str:
    # Split e.g. 'pdfs/repliqa_0/kiqpsbuw.pdf' into its subfolder and filename.
    filename = os.path.basename(sample["document_path"])
    subfolder = os.path.dirname(sample["document_path"])
    return huggingface_hub.hf_hub_download(
        repo_id="ServiceNow/repliqa", filename=filename, subfolder=subfolder,
        repo_type="dataset", cache_dir=cache_dir,
    )

path_to_local_pdf_cache = get_path_to_local_pdf_cache(repliqa["repliqa_0"][0])
path_to_local_pdf_cache

The PDF associated with the 0-th sample of `repliqa_0` should now be available at `path_to_local_pdf_cache`. Note that `repliqa["repliqa_0"][1]` uses the same document as `repliqa["repliqa_0"][0]`, so calling `get_path_to_local_pdf_cache(repliqa["repliqa_0"][1])` will not download the document again; storage is handled by Hugging Face's caching mechanism.
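
Since several samples can share one document, it may be convenient to list each PDF path only once before fetching many of them. The following is a small sketch, not part of the original tutorial: `unique_in_order` is a hypothetical helper that removes duplicates while keeping the first-seen order, shown here on made-up paths.

```python
from typing import Iterable


def unique_in_order(paths: Iterable[str]) -> list[str]:
    """Drop duplicate paths while preserving the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(paths))


# Made-up paths standing in for the "document_path" column.
toy_paths = ["pdfs/split_x/a.pdf", "pdfs/split_x/a.pdf", "pdfs/split_x/b.pdf"]
unique_paths = unique_in_order(toy_paths)
print(unique_paths)
```

With the real data, one could call `unique_in_order(repliqa["repliqa_0"]["document_path"])` and then pass `{"document_path": path}` to `get_path_to_local_pdf_cache` for each resulting path.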

### Approach 2: get all PDFs at once

In [None]:
local_dir = "./repliqa"  # Edit for wherever you wish to store the snapshot.
snapshot_path = huggingface_hub.snapshot_download(repo_id="ServiceNow/repliqa", repo_type="dataset", local_dir=local_dir, cache_dir=cache_dir)

def get_path_to_local_pdf_snapshot(sample: dict[str, str]) -> str:
    return os.path.join(snapshot_path, sample["document_path"])

path_to_local_pdf_snapshot = get_path_to_local_pdf_snapshot(repliqa["repliqa_0"][0])
path_to_local_pdf_snapshot

Here the whole dataset repository, including all PDFs, is downloaded to `local_dir` (i.e., not to Hugging Face's cache). If the call to `huggingface_hub.snapshot_download` above fails, try running the same command again: it should skip the files that were already downloaded and continue from there.
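
After an interrupted download, it can be useful to check which PDFs are still absent from the snapshot folder. The sketch below is one possible way to do so, under the assumption that each `document_path` is relative to the snapshot root; `find_missing_pdfs` is a hypothetical helper, demonstrated on a temporary folder with made-up file names.

```python
import os
import tempfile
from typing import Iterable


def find_missing_pdfs(document_paths: Iterable[str], snapshot_root: str) -> list[str]:
    """Return the sorted, de-duplicated relative paths with no file under snapshot_root."""
    return sorted(
        {
            path
            for path in document_paths
            if not os.path.isfile(os.path.join(snapshot_root, path))
        }
    )


# Self-contained demo: a temporary folder plays the role of the snapshot.
with tempfile.TemporaryDirectory() as fake_snapshot:
    os.makedirs(os.path.join(fake_snapshot, "pdfs", "split_x"))
    # Create one empty placeholder file so that exactly one path is present.
    with open(os.path.join(fake_snapshot, "pdfs", "split_x", "present.pdf"), "wb"):
        pass
    missing = find_missing_pdfs(
        ["pdfs/split_x/present.pdf", "pdfs/split_x/absent.pdf", "pdfs/split_x/absent.pdf"],
        fake_snapshot,
    )
print(missing)
```

With the real data, `find_missing_pdfs(repliqa["repliqa_0"]["document_path"], snapshot_path)` returning an empty list would suggest that every referenced PDF is in place.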