# Dataset preparation for Lepton pre-training

This notebook shows the subscriber how to download a dataset from Hugging Face, prepare it with the Lepton tokenizer, and upload the prepared dataset to their S3 bucket so it can be used in a training job with the Mindbeam-Lepton pre-training algorithm.

## Install required packages

In [None]:
!pip install jsonargparse pandas pyarrow gitpython boto3 tqdm numpy lightning tokenizers==0.19.1

## Install git and git-lfs

In [None]:
# !sudo apt-get install git git-lfs  # For Ubuntu/Debian
# or
# brew install git git-lfs     # For macOS
# or
# On Amazon SageMaker, run the commands below instead
!curl -Lo /tmp/git-lfs.tar.gz https://github.com/git-lfs/git-lfs/releases/download/v3.4.1/git-lfs-linux-amd64-v3.4.1.tar.gz
!tar -xf /tmp/git-lfs.tar.gz -C /tmp
!sudo /tmp/git-lfs-3.4.1/install.sh

In [None]:
# Check that git-lfs was installed successfully
!git lfs version

## Dataset preparation

First, export your Hugging Face token so that the dataset can be downloaded.

In [None]:
import os
os.environ['HF_TOKEN'] = '<hf_your_token>'
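
If you would rather not paste the token into a notebook cell, where it may end up saved with the notebook, one option is to prompt for it at run time. The sketch below is a suggestion, not part of the preparation script: `ensure_hf_token` is a hypothetical helper name, and it assumes that downstream code reads the token from the `HF_TOKEN` environment variable as in the cell above.

```python
import os
from getpass import getpass


def ensure_hf_token(env_var="HF_TOKEN", prompt=getpass):
    """Return the token stored in `env_var`, prompting once if it is unset.

    `ensure_hf_token` is an illustrative helper, not part of any library.
    """
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the input, so the token is not echoed into the cell output.
        token = prompt("Hugging Face token: ").strip()
        if not token:
            raise ValueError("No Hugging Face token provided")
        os.environ[env_var] = token
    return token
```

Calling `ensure_hf_token()` in a cell prompts only when `HF_TOKEN` is not already set in the environment.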

Now run the dataset preparation script. Make sure that `destination_path` is a local directory with enough free storage to hold the prepared data. `s3_bucket` is the name of your S3 bucket, and `s3_prefix` is the directory under the bucket where the prepared dataset will be uploaded.
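
Before launching the script, it may help to check how much free space the destination filesystem has. The following is a minimal sketch using only the standard library. `free_gib` and `has_room` are hypothetical helper names, and the amount of space you need depends on your dataset, so the threshold is yours to choose.

```python
import shutil
from pathlib import Path


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`.

    If `path` does not exist yet, the nearest existing parent is used.
    """
    p = Path(path).resolve()
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / 2**30


def has_room(path, required_gib):
    """True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib
```

For example, `has_room("/tmp/prepared_data", 50)` reports whether roughly 50 GiB are available where the prepared data would be written.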

In [None]:
!python -m prepare_dataset.custom_preparer \
    "<org_name>/<Dataset_name>" \
    --destination_path /tmp/prepared_data \
    --tokenizer_path tokenizer/pints \
    --s3_bucket "<my_s3_bucket_name>" \
    --s3_prefix "<dataset_directory_inside_s3_bucket>"
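
Once the script finishes, you may want to confirm that the upload landed where you expect. The sketch below counts the objects under a prefix and sums their sizes with the boto3 `list_objects_v2` paginator. `summarize_s3_prefix` is a hypothetical helper, not something the preparation script provides, and it assumes your AWS credentials are already configured for boto3.

```python
def summarize_s3_prefix(bucket, prefix, client=None):
    """Return (object_count, total_bytes) for all keys under `prefix`.

    Illustrative helper; pass `client` to reuse an existing boto3 S3 client.
    """
    if client is None:
        import boto3  # imported lazily so a stub client can be used instead

        client = boto3.client("s3")

    count = 0
    total_bytes = 0
    # The paginator follows continuation tokens, so large listings are covered.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching keys have no "Contents" entry.
        for obj in page.get("Contents", []):
            count += 1
            total_bytes += obj["Size"]
    return count, total_bytes
```

Calling it with the same values passed to `--s3_bucket` and `--s3_prefix` should return a non-zero object count if the upload succeeded.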