References:

- [Data Persistence](https://huggingface.co/docs/hub/spaces-sdks-docker#data-persistence)
- [Create and manage a repository](https://huggingface.co/docs/huggingface_hub/guides/repository)
- [Managing local and online repositories](https://huggingface.co/docs/huggingface_hub/v0.13.4/en/package_reference/repository)
- [git-lfs tutorial](https://sabicalija.github.io/git-lfs-intro/)

In [1]:
%cd ../datasets

/workspaces/llm-playground/datasets


In [2]:
from huggingface_hub import Repository

In [3]:
!git lfs --help

GIT-LFS(1)                                                          GIT-LFS(1)

NAME
       git-lfs - Work with large files in Git repositories

SYNOPSIS
       git lfs <command> [<args>]

DESCRIPTION
       Git LFS is a system for managing and versioning large files in
       association with a Git repository. Instead of storing the large files
       within the Git repository as blobs, Git LFS stores special "pointer
       files" in the repository, while storing the actual file contents on a
       Git LFS server. The contents of the large file are downloaded
       automatically when needed, for example when a Git branch containing the
       large file is checked out.

       Git LFS works by using a "smudge" filter to look up the large file
       contents based on the pointer file, and a "clean" filter to create a
       new version of the pointer file when the large file’s contents change.
       It also uses a pre-push hook to upload the large file contents to the
       Git L

In [4]:
# https://huggingface.co/datasets/hoskinson-center/proof-pile/
# skip_lfs_files=True clones only the small Git LFS pointer files, not the large payloads.
repo = Repository(local_dir="proof-pile", clone_from="hoskinson-center/proof-pile", repo_type='dataset', skip_lfs_files=True)

/workspaces/llm-playground/datasets/proof-pile is already a clone of https://huggingface.co/datasets/hoskinson-center/proof-pile. Make sure you pull the latest changes with `repo.git_pull()`.


In [5]:
!git lfs pull --help

git lfs pull [options] [<remote>]

Download Git LFS objects for the currently checked out ref, and update
the working copy with the downloaded content if required.

This is equivalent to running the following 2 commands:

git lfs fetch [options] [<remote>]
git lfs checkout

Options:

-I <paths>:
--include=<paths>:
   Specify lfs.fetchinclude just for this invocation; see "Include and exclude"
-X <paths>:
--exclude=<paths>:
   Specify lfs.fetchexclude just for this invocation; see "Include and exclude"

Include and exclude
-------------------

You can configure Git LFS to only fetch objects to satisfy references in
certain paths of the repo, and/or to exclude certain paths of the repo,
to reduce the time you spend downloading things you do not use.

In your Git configuration or in a .lfsconfig file, you may set either
or both of lfs.fetchinclude and lfs.fetchexclude to comma-separated
lists of paths. If lfs.fetchinclude is defined, Git LFS objects will
only be fetched if their path matches one 

In [6]:
%cd proof-pile

/workspaces/llm-playground/datasets/proof-pile


In [7]:
repo.git_pull()

In [8]:
!ls -lhta train

total 96K
drwxrwxrwx+ 2 codespace codespace 4.0K Apr 16 08:16 .
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_8.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_9.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_19.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_2.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_20.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_3.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_4.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_5.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_6.jsonl.gz
-rw-rw-rw-  1 codespace codespace  134 Apr 16 08:16 proofpile_train_7.jsonl.gz
drwxrwxrwx+ 6 codespace codespace 4.0K Apr 16 08:16 ..
-rw-rw-rw-  1 codespace codespace 1.4K Apr 16 08:16 .gitattributes
-rw-rw-rw-  1 codespac
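
The 134-byte entries in the listing above are Git LFS pointer files rather than the real archives, because the clone used `skip_lfs_files=True`. A minimal sketch for telling pointers apart from downloaded payloads, assuming the v1 pointer format; the helper names `is_lfs_pointer` and `list_lfs_pointers` are hypothetical and not part of `huggingface_hub` or `git-lfs`.

```python
from pathlib import Path

# First line of every Git LFS v1 pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts like a Git LFS pointer."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def list_lfs_pointers(directory):
    """Return the sorted names of files in `directory` that are still pointers."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and is_lfs_pointer(p)
    )
```

Calling `list_lfs_pointers("train")` from inside the clone would list the shards that have not been pulled yet.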

In [9]:
%%bash
git lfs pull --include="dev/proofpile_dev.jsonl.gz"
ls -lhta dev

total 155M
-rw-rw-rw-  1 codespace codespace 155M Apr 16 08:28 proofpile_dev.jsonl.gz
drwxrwxrwx+ 2 codespace codespace 4.0K Apr 16 08:28 .
drwxrwxrwx+ 6 codespace codespace 4.0K Apr 16 08:16 ..
-rw-rw-rw-  1 codespace codespace   59 Apr 16 08:16 .gitattributes
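
The same selective download can be driven from Python instead of a `%%bash` cell. This is a sketch, assuming `git` and `git-lfs` are on the `PATH`; `build_lfs_pull_cmd` and `lfs_pull` are hypothetical helper names. Multiple paths are joined with commas, following the comma-separated format the help text describes for `lfs.fetchinclude` and `lfs.fetchexclude`.

```python
import subprocess


def build_lfs_pull_cmd(include=(), exclude=()):
    """Build the argv for `git lfs pull` limited to the given path patterns."""
    cmd = ["git", "lfs", "pull"]
    if include:
        cmd.append("--include=" + ",".join(include))
    if exclude:
        cmd.append("--exclude=" + ",".join(exclude))
    return cmd


def lfs_pull(repo_dir, include=(), exclude=()):
    """Run the pull inside `repo_dir`; raises CalledProcessError on failure."""
    subprocess.run(build_lfs_pull_cmd(include, exclude), cwd=repo_dir, check=True)
```

For example, `lfs_pull("proof-pile", include=["dev/proofpile_dev.jsonl.gz"])` would issue the same command as the cell above.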


In [10]:
%%bash
git lfs pointer --file="dev/proofpile_dev.jsonl.gz"
ls -lhta dev

Git LFS pointer for dev/proofpile_dev.jsonl.gz



version https://git-lfs.github.com/spec/v1
oid sha256:9a33bde2feabb2421936bbde65ab15692ba814812bb2cf0e82a77b35e07e0b9b
size 161818020
total 155M
-rw-rw-rw-  1 codespace codespace 155M Apr 16 08:28 proofpile_dev.jsonl.gz
drwxrwxrwx+ 2 codespace codespace 4.0K Apr 16 08:28 .
drwxrwxrwx+ 6 codespace codespace 4.0K Apr 16 08:16 ..
-rw-rw-rw-  1 codespace codespace   59 Apr 16 08:16 .gitattributes
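
The pointer printed above has three lines: a spec version, the SHA-256 of the file contents, and the size in bytes. A sketch that reproduces this text with the standard library, assuming only the fields shown in the output; `format_lfs_pointer` and `lfs_pointer_for_file` are hypothetical names. With a nine-digit size the text comes to 134 bytes, which is consistent with the placeholder files listed in `train`.

```python
import hashlib


def format_lfs_pointer(oid_hex, size):
    """Format the three-line pointer text for a SHA-256 hex digest and byte size."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid_hex}\n"
        f"size {size}\n"
    )


def lfs_pointer_for_file(path, chunk_size=1 << 20):
    """Hash the file in chunks and return the pointer text describing it."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return format_lfs_pointer(digest.hexdigest(), size)
```

Comparing `lfs_pointer_for_file("dev/proofpile_dev.jsonl.gz")` with the output of `git lfs pointer` is one way to check that a download is intact.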


In [11]:
!pwd

/workspaces/llm-playground/datasets/proof-pile


In [12]:
%cd ../../

/workspaces/llm-playground


In [13]:
import os
HF_TOKEN = os.environ.get("HUGGINGFACE_TOKEN")

In [14]:
# https://huggingface.co/datasets/utensil/storage
# HF_TOKEN is None if HUGGINGFACE_TOKEN is unset; a token with write access is needed to push.
repo = Repository(local_dir="storage", clone_from="utensil/storage", repo_type='dataset', skip_lfs_files=True, use_auth_token=HF_TOKEN)

Cloning https://huggingface.co/datasets/utensil/storage into local empty directory.


In [15]:
!touch storage/test.txt

In [16]:
repo.push_to_hub()

In [17]:
!date --rfc-3339=seconds > storage/time.txt

In [18]:
!cat storage/time.txt

2023-04-16 12:03:25+00:00


In [19]:
!pwd

/workspaces/llm-playground


In [20]:
!python helper/upload.py

Working directory changed to: /workspaces/llm-playground/helper/..
/workspaces/llm-playground/storage is already a clone of https://huggingface.co/datasets/utensil/storage. Make sure you pull the latest changes with `repo.git_pull()`.
To https://huggingface.co/datasets/utensil/storage
   7ea436a..39774b0  main -> main

Upload succeeded.
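
`helper/upload.py` itself is not reproduced in this notebook. Judging only from its output, it switches to the project root, reuses the existing `storage` clone, and pushes. The following is a hypothetical sketch of such a script, not the author's actual code; the function names and the commit message are assumptions. `Repository` has since been deprecated in `huggingface_hub` in favor of HTTP-based methods such as `HfApi.upload_file`, so this mirrors the notebook's approach rather than current practice.

```python
import os
from pathlib import Path


def project_root(script_path):
    """Return the parent of the script's directory (assumed layout: <root>/helper/upload.py)."""
    return Path(script_path).resolve().parent.parent


def upload(root, token=None):
    """Pull, then commit and push everything in <root>/storage (hypothetical helper)."""
    # Imported lazily so project_root() works without huggingface_hub installed.
    from huggingface_hub import Repository

    os.chdir(root)
    print(f"Working directory changed to: {root}")
    repo = Repository(
        local_dir="storage",
        clone_from="utensil/storage",
        repo_type="dataset",
        skip_lfs_files=True,
        use_auth_token=token,
    )
    repo.git_pull()
    repo.push_to_hub(commit_message="Update storage")  # assumed commit message
    print("Upload succeeded.")


# Example call from a script located at <root>/helper/upload.py:
# upload(project_root(__file__), token=os.environ.get("HUGGINGFACE_TOKEN"))
```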
