### Foundational Model Finetuning dataset preparation

At the end of this notebook, you will have datasets that can be used to train and optimize a foundational model, as demonstrated in [this notebook](../dataset_prepare/foundational_model_finetuning.ipynb).

### FIXME

1. Choose between the default and a custom dataset in FIXME 1: `default` for the dataset used in this tutorial notebook, `custom` for a different dataset
1. Assign the path of DATA_DIR in FIXME 2
1. Assign cloud credentials in FIXME 3

You need to download the ImageNet2012 dataset and organize it into train/, val/, and test/ folders. The train and val folders should be extracted and placed in $DATA_DIR/imagenet.

The data can be downloaded by following the instructions here:
[MMPretrain Imagenet Download Instructions](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html) 

Go to the official [Download page](http://www.image-net.org/download-images). Find the download links for ILSVRC2012 and download the following two files.

* ILSVRC2012_img_train.tar (~138GB)
    * For training images: untar the class folders into the `train` folder so that it contains 1000 folders, one per class.
* ILSVRC2012_img_val.tar (~6.3GB)
    * For validation images: move the images into their respective class folders. You can use the [valprep](https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh) script for this.

You can also use this shell script to perform both of the steps above: [extract_ILSVRC script](https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh).

The contents of the final train/ and val/ folders should look like this:

**./train**

    n07693725
    ...
    n07614500

**./val**

    n07693725
    ...
    n07614500
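
The validation step described above (moving each flat validation image into its class folder) can be sketched in Python. This is a minimal illustration, not the linked `valprep.sh` script: the function name `sort_val_images` and the `image_to_class` mapping are hypothetical, and you would need to build the mapping yourself from the ImageNet validation annotations.

```python
import shutil
from pathlib import Path


def sort_val_images(val_dir, image_to_class):
    """Move flat validation images into sub-folders named after their class.

    image_to_class is a hypothetical dict mapping an image file name to the
    name of its class folder, e.g. {"some_image.JPEG": "n07693725"}.
    Returns the number of images that were moved.
    """
    val_dir = Path(val_dir)
    moved = 0
    for image_name, class_id in image_to_class.items():
        source = val_dir / image_name
        if not source.is_file():
            continue  # skip images that are missing or were already moved
        class_dir = val_dir / class_id
        class_dir.mkdir(exist_ok=True)
        shutil.move(str(source), str(class_dir / image_name))
        moved += 1
    return moved
```

Because files that are no longer in the top-level folder are skipped, running the function a second time with the same mapping moves nothing.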


If you have not downloaded the ImageNet 2012 dataset, the cells below download and prepare the default sample dataset instead.

**Note:** This notebook example uses a two-class subset (cats and dogs) instead of all 1000 classes. If you are using a custom dataset other than ImageNet, update the `dataset.data` config with a `classes` field that points to a file with the class names. Refer to the documentation for more details on the classes text file. Update `num_classes` under `model.head` accordingly. For reference, see `train_cats_dogs.yaml` in the specs of `classification_pyt` under the parent directory, which gives an example of fine-tuning on a 2-class dataset.
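
For a custom dataset, the class names can usually be read from the sub-folder names of the training directory. The sketch below writes them to a text file under the assumption that the classes file holds one class name per line. That format is an assumption here, so check it against the documentation. `write_classes_file` is a hypothetical helper name.

```python
from pathlib import Path


def write_classes_file(train_dir, classes_file):
    """Write the sorted class-folder names found in train_dir to classes_file.

    Assumes (not confirmed by this notebook) one class name per line.
    Returns the list of class names that was written.
    """
    class_names = sorted(p.name for p in Path(train_dir).iterdir() if p.is_dir())
    Path(classes_file).write_text("\n".join(class_names) + "\n")
    return class_names
```

The length of the returned list is the value you would then use for `num_classes` under `model.head`.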

In [None]:
import os

In [None]:
dataset_to_be_used = "default" #FIXME1 
DATA_DIR = "/data/fm/nvdino_v2" #FIXME2
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR
print(f"DATA_DIR: {DATA_DIR}")

### Dataset download and pre-processing <a class="anchor" id="head-1"></a>

In [None]:
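if dataset_to_be_used == "default":
    # Quote the URL so the shell does not treat "?" as a glob character
    !wget "https://www.dropbox.com/s/wml49yrtdo53mie/cats_dogs_dataset_reorg.zip?dl=0" -O $DATA_DIR/cats_dogs_dataset.zip
    # Extract quietly, overwrite existing files, then delete the archive
    !unzip -qo $DATA_DIR/cats_dogs_dataset.zip -d $DATA_DIR/
    !rm $DATA_DIR/cats_dogs_dataset.zip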

In [None]:
# Verify the dataset is downloaded
assert os.path.exists(f"{DATA_DIR}/cats_dogs_dataset/training_set/training_set"), "Training Dataset Not Found. Please check properly."
assert os.path.exists(f"{DATA_DIR}/cats_dogs_dataset/val_set/val_set"), "Val Dataset Not Found. Please check properly."
assert os.path.exists(f"{DATA_DIR}/cats_dogs_dataset/test_set/test_set"), "Test Dataset Not Found. Please check properly."
assert len(os.listdir(f"{DATA_DIR}/cats_dogs_dataset/training_set/training_set")) == 2, "Dataset validation failed. Sample dataset should have 2 classes."

# Rename the nested folders to images_train / images_val / images_test
!mv $DATA_DIR/cats_dogs_dataset/training_set/training_set $DATA_DIR/cats_dogs_dataset/training_set/images_train
!mv $DATA_DIR/cats_dogs_dataset/val_set/val_set $DATA_DIR/cats_dogs_dataset/val_set/images_val
!mv $DATA_DIR/cats_dogs_dataset/test_set/test_set $DATA_DIR/cats_dogs_dataset/test_set/images_test

assert os.path.exists(f"{DATA_DIR}/cats_dogs_dataset/training_set/images_train")
assert os.path.exists(f"{DATA_DIR}/cats_dogs_dataset/val_set/images_val")
assert os.path.exists(f"{DATA_DIR}/cats_dogs_dataset/test_set/images_test")
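
Beyond checking that the folders exist, it can help to confirm that every class folder actually contains images. The following is a minimal sketch. `count_images_per_class` is a hypothetical helper, and the list of file extensions is an assumption about how the images are stored.

```python
import os


def count_images_per_class(split_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return {class_folder_name: number_of_image_files} for one dataset split."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files next to the class folders
        counts[class_name] = sum(
            1 for f in os.listdir(class_dir) if f.lower().endswith(extensions)
        )
    return counts
```

For example, `count_images_per_class(f"{DATA_DIR}/cats_dogs_dataset/training_set/images_train")` should return two entries for the sample dataset, and a count of zero for any class would point to a failed extraction.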

### Create Tar files to upload

In [None]:
!mkdir -p $DATA_DIR/cloud_folders/data/nvdinov2_train $DATA_DIR/cloud_folders/data/nvdinov2_val $DATA_DIR/cloud_folders/data/nvdinov2_test

!tar -C $DATA_DIR/cats_dogs_dataset/training_set -czf $DATA_DIR/cloud_folders/data/nvdinov2_train/images_train.tar.gz images_train
!tar -C $DATA_DIR/cats_dogs_dataset/val_set -czf $DATA_DIR/cloud_folders/data/nvdinov2_val/images_val.tar.gz images_val
!tar -C $DATA_DIR/cats_dogs_dataset/test_set -czf $DATA_DIR/cloud_folders/data/nvdinov2_test/images_test.tar.gz images_test
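
Because each archive is created with `-C`, it should contain a single top-level folder (`images_train`, `images_val`, or `images_test`) rather than the full local path. A quick way to confirm this before uploading is sketched below with the standard `tarfile` module. `top_level_entries` is a hypothetical helper name.

```python
import tarfile


def top_level_entries(archive_path):
    """Return the set of top-level names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return {name.split("/", 1)[0] for name in archive.getnames()}
```

For the training archive created above, the expected result is a set containing only `images_train`.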

### Final step: Upload the /data folder to your cloud storage and move on to running the API requests example notebooks
When you list the contents of your bucket, it should contain a /data folder with the subfolders uploaded by the cell below (nvdinov2_train_tiny, nvdinov2_val_tiny, nvdinov2_test_tiny).

In [None]:
!python3 -m pip install --upgrade awscli
ACCESS_KEY=FIXME3.1
SECRET_KEY=FIXME3.2
BUCKET_NAME=FIXME3.3
!AWS_ACCESS_KEY_ID={ACCESS_KEY} AWS_SECRET_ACCESS_KEY={SECRET_KEY} aws s3 cp {DATA_DIR}/cloud_folders/data/nvdinov2_train s3://{BUCKET_NAME}/data/nvdinov2_train_tiny --recursive
!AWS_ACCESS_KEY_ID={ACCESS_KEY} AWS_SECRET_ACCESS_KEY={SECRET_KEY} aws s3 cp {DATA_DIR}/cloud_folders/data/nvdinov2_val s3://{BUCKET_NAME}/data/nvdinov2_val_tiny --recursive
!AWS_ACCESS_KEY_ID={ACCESS_KEY} AWS_SECRET_ACCESS_KEY={SECRET_KEY} aws s3 cp {DATA_DIR}/cloud_folders/data/nvdinov2_test s3://{BUCKET_NAME}/data/nvdinov2_test_tiny --recursive
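
The same upload can also be driven from Python, which keeps the credentials out of the command line itself. This is only a sketch under assumptions: `build_upload_command` and `upload` are hypothetical helpers, and the AWS CLI installed above must be available on the `PATH`.

```python
import os
import subprocess


def build_upload_command(local_dir, bucket_name, remote_prefix):
    """Build the `aws s3 cp ... --recursive` argument list for one folder."""
    destination = f"s3://{bucket_name}/{remote_prefix.strip('/')}"
    return ["aws", "s3", "cp", local_dir, destination, "--recursive"]


def upload(local_dir, bucket_name, remote_prefix, access_key, secret_key):
    """Run the upload with credentials passed through the environment."""
    env = dict(
        os.environ,
        AWS_ACCESS_KEY_ID=access_key,
        AWS_SECRET_ACCESS_KEY=secret_key,
    )
    # check=True raises CalledProcessError if the AWS CLI reports a failure
    subprocess.run(
        build_upload_command(local_dir, bucket_name, remote_prefix),
        env=env,
        check=True,
    )
```

Calling `upload(f"{DATA_DIR}/cloud_folders/data/nvdinov2_train", BUCKET_NAME, "data/nvdinov2_train_tiny", ACCESS_KEY, SECRET_KEY)` would mirror the first command in the cell above.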

In [None]:
# Paths inside the bucket, matching the upload destinations used above
train_dataset_path = "/data/nvdinov2_train_tiny"
eval_dataset_path = "/data/nvdinov2_val_tiny"
test_dataset_path = "/data/nvdinov2_test_tiny"