# 1 - Downloading The Pile - Deduplicated and MiniPile

**Objectives:**
- [x] Download "The Pile - Deduplicated" to a specified directory
- [x] Download "MiniPile" to a specified directory

We want to download the datasets [EleutherAI/the_pile_deduplicated](https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated) and [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) through HuggingFace to a specific directory.<br>
HuggingFace can be a bit stubborn in that it stores datasets and their caches in a default directory (typically under `~/.cache/huggingface`).<br>
That location is not an option here, so we need to implement a custom download function based on the HuggingFace API.<br>
Thankfully, the HuggingFace-Hub's `snapshot_download` function performs much of the heavy lifting for us in that regard.

In [None]:
import os
from pathlib import Path
from huggingface_hub import snapshot_download

In [None]:
down_dir = "/vol/tmp/koppelmm"

In [None]:
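import time

def download_dataset(down_dir: str, target_folder: str, cache_folder: str, repo_id: str,
                     max_retries: int = 10) -> None:
    # Download a dataset snapshot to down_dir/target_folder and keep the
    # HuggingFace cache in down_dir/cache_folder instead of the default location.
    down_dir = Path(down_dir)
    target_dir = down_dir / target_folder
    cache_dir = down_dir / cache_folder

    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)

    print(f"Downloading {repo_id}...")

    # I tried fiddling with os.environ and wanted to use the load_dataset function,
    # but that's really not needed; snapshot_download suffices.
    # Large downloads can fail midway, so retry a bounded number of times
    # instead of looping forever and silently swallowing every error.
    for attempt in range(1, max_retries + 1):
        try:
            snapshot_download(repo_id, repo_type="dataset", cache_dir=str(cache_dir), local_dir=str(target_dir))
            return
        except Exception as e:
            if attempt == max_retries:
                raise
            print(f"Attempt {attempt}/{max_retries} failed ({e!r}), retrying...")
            time.sleep(min(2 ** attempt, 60))  # exponential backoff, capped at 60 s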

With the download logic in place, we can go right ahead and download the datasets to their intended directories.

In [None]:
download_dataset(down_dir=down_dir, target_folder="Pile_Deduplicated", 
                 cache_folder="Pile_Deduplicated_Cache", repo_id="EleutherAI/the_pile_deduplicated")

[The Pile Deduplicated](https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated) is stated to measure $451$ GB at $134,318,121$ rows.<br>
The dataset consists of a train split only. Each row holds a single attribute, `text`.

In [None]:
download_dataset(down_dir=down_dir, target_folder="MiniPile", 
                 cache_folder="MiniPile_Cache", repo_id="JeanKaddour/minipile")

[MiniPile](https://huggingface.co/datasets/JeanKaddour/minipile) is stated to measure $3.18$ GB at $1,000,000:500:10,000$ train:val:test rows.<br>
Again, each row holds a single attribute, `text`.
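
To compare the stated sizes against what actually landed on disk, a small helper can sum the file sizes below a target folder. This is a minimal sketch: `directory_size_gb` is a hypothetical helper name, not part of `huggingface_hub`, and it assumes that GB means $10^9$ bytes. The result may differ somewhat from the dataset card figure, since the download folder can also contain metadata files and the card may use a different unit convention.

```python
import os
from pathlib import Path

def directory_size_gb(path: str) -> float:
    """Sum the sizes of all regular files below path, in GB (assumed 10^9 bytes)."""
    total_bytes = 0
    for root, _, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            # Skip symlinks so that linked files are not counted twice.
            if not file_path.is_symlink():
                total_bytes += file_path.stat().st_size
    return total_bytes / 1e9
```

For example, `directory_size_gb(str(Path(down_dir) / "MiniPile"))` would report the size of the MiniPile download folder used above.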

MiniPile's training set therefore contains $0.745\%$ of the rows of The Pile Deduplicated, at $0.705\%$ of its stated size on disk.
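
The two percentages follow directly from the figures stated on the dataset cards. The short calculation below reproduces them; the variable names are illustrative, and the MiniPile size is the stated total across all splits, which is dominated by the training split.

```python
# Figures as stated on the two dataset cards.
pile_rows = 134_318_121
pile_gb = 451
minipile_train_rows = 1_000_000
minipile_gb = 3.18

# Share of The Pile Deduplicated covered by MiniPile, in percent.
row_share = 100 * minipile_train_rows / pile_rows
size_share = 100 * minipile_gb / pile_gb

print(f"Row share:  {row_share:.3f}%")
print(f"Size share: {size_share:.3f}%")
```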