# Dataset preparation for Lepton pre-training

This notebook walks the subscriber through downloading a dataset from Hugging Face, preparing it with the Lepton tokenizer, and uploading the prepared dataset to their S3 bucket so that it can be used in a training job with the Mindbeam-Lepton pre-training algorithm.

## Install required packages

In [28]:
!pip install jsonargparse pandas pyarrow gitpython boto3 tqdm numpy lightning tokenizers==0.19.1

Collecting tokenizers==0.19.1
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from tokenizers==0.19.1)
  Downloading huggingface_hub-0.28.1-py3-none-any.whl.metadata (13 kB)
Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 127.7 MB/s eta 0:00:00
Downloading huggingface_hub-0.28.1-py3-none-any.whl (464 kB)
Installing collected packages: huggingface-hub, tokenizers
Successfully installed huggingface-hub-0.28.1 tokenizers-0.19.1
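Because `tokenizers` is pinned to 0.19.1, it can be useful to confirm which version the kernel actually sees before continuing. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the Lepton tooling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Compare what the kernel sees against the pin used in the pip command above.
found = installed_version("tokenizers")
if found != "0.19.1":
    print(f"Expected tokenizers 0.19.1, found {found}; consider restarting the kernel.")
```

If the version does not match right after installation, restarting the notebook kernel is usually enough for the new package to be picked up.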


## Install git and git-lfs

In [10]:
# !sudo apt-get install git git-lfs  # For Ubuntu/Debian
# or
# brew install git git-lfs     # For macOS
# or
# For Amazon SageMaker, run the commands below
!curl -Lo /tmp/git-lfs.tar.gz https://github.com/git-lfs/git-lfs/releases/download/v3.4.1/git-lfs-linux-amd64-v3.4.1.tar.gz
!tar -xf /tmp/git-lfs.tar.gz -C /tmp
!sudo /tmp/git-lfs-3.4.1/install.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 4814k  100 4814k    0     0  68.7M      0 --:--:-- --:--:-- --:--:-- 68.7M
Updated Git hooks.
Git LFS initialized.


In [13]:
# check if git-lfs has been successfully installed
!git lfs version

git-lfs/3.4.1 (GitHub; linux amd64; go 1.20.11; git 0898dcbc)


## Dataset preparation

First, set your Hugging Face token as an environment variable so that the dataset can be downloaded.

In [22]:
import os
os.environ['HF_TOKEN'] = '<your-hugging-face-token>'  # placeholder: never commit a real token to a shared notebook
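A token written directly into a cell is easy to leak when the notebook is shared or committed. As an alternative, the sketch below prompts for the token without echoing it and leaves any existing `HF_TOKEN` untouched. The helper name `ensure_hf_token` is hypothetical, and the sketch assumes an interactive session where `getpass` can prompt.

```python
import os
from getpass import getpass


def ensure_hf_token(prompt=getpass):
    """Return HF_TOKEN, asking for it with a hidden prompt only if it is not already set."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        token = prompt("Hugging Face token: ").strip()
        if not token:
            raise ValueError("No Hugging Face token provided")
        # Child processes started from the notebook (for example `!python ...`) inherit this.
        os.environ["HF_TOKEN"] = token
    return token


# In a notebook cell, call: ensure_hf_token()
```

The `prompt` parameter exists only so that the function can be exercised without a terminal; in normal use the default is sufficient.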

Now run the dataset preparation script. Make sure that `destination_path` is a local directory with enough free space to hold the prepared data. `s3_bucket` is the name of your S3 bucket, and `s3_prefix` is the key prefix (directory) under the bucket where the prepared dataset is uploaded.
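Before starting a long preparation run, it may help to check how much free space the destination filesystem has. The sketch below uses only the standard library; the helper names and the 20 GiB threshold are illustrative assumptions, not values required by the preparation script.

```python
import shutil
from pathlib import Path


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    p = Path(path).resolve()
    # The destination may not exist yet, so walk up to the nearest existing ancestor.
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / 2**30


def has_room(path, required_gib):
    """Return True if at least `required_gib` GiB are free for `path`."""
    return free_gib(path) >= required_gib


# Hypothetical threshold: adjust it to the size of your dataset.
REQUIRED_GIB = 20
if not has_room("/tmp/prepared_data", REQUIRED_GIB):
    print(f"Only {free_gib('/tmp/prepared_data'):.1f} GiB free; choose a larger volume.")
```

The same check can be pointed at the directory where the raw dataset is cloned, since the download and the prepared output may live on different volumes.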

In [None]:
!python -m prepare_dataset.custom_preparer \
    "roneneldan/TinyStories" \
    --destination_path /tmp/prepared_data \
    --tokenizer_path tokenizer/pints \
    --s3_bucket "sagemaker-us-east-1-975050170529" \
    --s3_prefix "datasets/prepared"

Updated Git hooks.
Git LFS initialized.
Cloning into 'data/raw/TinyStories'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 69 (delta 0), reused 0 (delta 0), pack-reused 66 (from 1)
Unpacking objects: 100% (69/69), 14.51 KiB | 1.61 MiB/s, done.
Filtering content: 100% (11/11), 7.09 GiB | 255.85 MiB/s, done.
Found 4 .txt files
Found 5 .parquet files
====FILES FOUND====
['/home/ec2-user/SageMaker/lepton-aws-marketplace/data/raw/TinyStories/TinyStories-train.txt', '/home/ec2-user/SageMaker/lepton-aws-marketplace/data/raw/TinyStories/TinyStories-valid.txt', '/home/ec2-user/SageMaker/lepton-aws-marketplace/data/raw/TinyStories/TinyStoriesV2-GPT4-train.txt', '/home/ec2-user/SageMaker/lepton-aws-marketplace/data/raw/TinyStories/TinyStoriesV2-GPT4-valid.txt', '/home/ec2-user/SageMaker/lepton-aws-marketplace/data/raw/TinyStories/data/train-00000-of-00004-2d5a1467fff1081b.parquet', '/ho