Dataset Preparation


Data Storage Format

At present, two data storage formats are supported:

  1. Store images / video frames directly on the hard disk.
  2. Make LMDB files, which accelerate IO and decompression speed during training.

How to Use

The data storage format is selected by modifying the configuration YAML file. The snippets below show the relevant fields for two video datasets; other datasets, such as PairedImageDataset, are configured in the same way.

  1. Directly read disk data.

    type: VideoTestDataset
    dataroot_gt: ./train_sharp
    dataroot_lq: ./train_sharp_bicubic/X4/
    io_backend:
      type: disk
  2. Use LMDB. We need to make the LMDB files before using them; please refer to the LMDB Description. Note that we add meta information to the original LMDB, and the specific binary contents are also different, so LMDB files from other sources cannot be used directly.

    type: REDSDataset
    dataroot_gt: /cluster/work/cvl/videosr/REDS/train_sharp_with_val.lmdb 
    dataroot_lq: /cluster/work/cvl/videosr/REDS/train_sharp_bicubic_with_val.lmdb 
    io_backend:
      type: lmdb

How to Implement

The implementation calls the elegant FileClient design in mmcv. To be compatible with BasicSR, we have made some changes to the interface (mainly to adapt to LMDB). See file_client.py for details.

When implementing our own dataloader, we can easily call these interfaces to support different data storage formats. Please refer to PairedImageDataset for more details.
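The core idea behind the file client is simple backend dispatch. Here is a minimal pure-Python sketch of that pattern (a simplified illustration only, not the actual file_client.py; the real client also implements an lmdb backend and extra keyword arguments):

```python
from pathlib import Path


class DiskBackend:
    """Reads raw bytes directly from the file system."""

    def get(self, filepath):
        return Path(filepath).read_bytes()


class FileClient:
    """Minimal sketch of a backend-dispatching file client.

    Illustrative only: it shows why dataset code can stay identical
    across storage formats -- only the backend name changes.
    """

    _backends = {'disk': DiskBackend}

    def __init__(self, backend='disk', **kwargs):
        if backend not in self._backends:
            raise ValueError(f'Unsupported backend: {backend}')
        self.client = self._backends[backend](**kwargs)

    def get(self, filepath):
        # Returns the raw (possibly compressed) bytes; decoding into an
        # image array is left to the dataset class.
        return self.client.get(filepath)
```

A dataset then only calls `client.get(path)` and decodes the returned bytes, regardless of where they came from.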

LMDB Description

During training, we use LMDB to speed up IO and CPU decompression. (During testing, the data is usually limited, so LMDB is generally unnecessary.) The actual acceleration depends on the machine configuration, and the following factors affect the speed:

  1. Some machines clean the cache regularly, and LMDB relies on the cache mechanism. Therefore, if the data fails to be cached, you need to check it. After running the command free -h, the cache occupied by LMDB is recorded under the buff/cache entry.
  2. Whether the machine has enough memory to hold the whole LMDB dataset. If not, the speed suffers because the cache must be constantly refreshed.
  3. Caching the LMDB dataset for the first time may affect the training speed. So before training, you can enter the LMDB dataset directory and warm the cache with: cat data.mdb > /dev/null.

In addition to the standard LMDB file (data.mdb and lock.mdb), we also add meta_info.txt to record additional information. Here is an example:

Folder Structure

DIV2K_train_HR_sub.lmdb
├── data.mdb
├── lock.mdb
├── meta_info.txt

Meta Information

meta_info.txt: we use a plain txt file for readability. The contents are:

0001_s001.png (480,480,3) 1
0001_s002.png (480,480,3) 1
0001_s003.png (480,480,3) 1
0001_s004.png (480,480,3) 1
...

Each line records an image with three fields, which indicate:

  • Image name (with suffix): 0001_s001.png
  • Image size: (480,480,3) represents a 480x480x3 image
  • Other parameters (BasicSR uses the cv2 compression level for PNG): in restoration tasks, we usually use the PNG format, so 1 means the PNG compression level CV_IMWRITE_PNG_COMPRESSION is 1. It can be an integer in [0, 9]; a larger value indicates stronger compression, that is, smaller storage space but longer compression time.
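A meta_info.txt line can be split back into its three fields with a few lines of Python (an illustrative helper, not part of BasicSR):

```python
def parse_meta_info_line(line):
    """Parse one meta_info.txt line, e.g.
    '0001_s001.png (480,480,3) 1'
    into (image_name, (h, w, c), compression_level).
    """
    name, shape_str, level = line.split()
    # '(480,480,3)' -> (480, 480, 3)
    shape = tuple(int(v) for v in shape_str.strip('()').split(','))
    return name, shape, int(level)
```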

Binary Content

For convenience, the binary content stored in the LMDB dataset is the image encoded by cv2: cv2.imencode('.png', img, [cv2.IMWRITE_PNG_COMPRESSION, compress_level]). You can control the compression level through compress_level, balancing storage space and reading speed (including decompression).
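PNG compression is deflate-based, so the trade-off that compress_level controls can be seen with plain zlib, without OpenCV (an illustrative sketch; cv2 handles the actual PNG encoding):

```python
import zlib

# Stand-in for raw pixel data (repetitive, like many image regions).
payload = bytes(range(256)) * 64

fast = zlib.compress(payload, 1)   # low level: quicker, usually larger
small = zlib.compress(payload, 9)  # high level: slower, usually smaller

# Either way, decompression recovers the original bytes exactly
# (PNG is lossless), so only size and speed change with the level.
assert zlib.decompress(fast) == payload
assert zlib.decompress(small) == payload
```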

How to Make LMDB

We provide a script to make LMDB files. Before running the script, modify the corresponding parameters accordingly. At present, we support the DIV2K, REDS and Vimeo90K datasets; other datasets can be made in a similar way.

    python scripts/data_preparation/create_lmdb.py

Data Pre-fetcher

Apart from using LMDB for speed-up, we can use a data pre-fetcher. Please refer to prefetch_dataloader for the implementation.
It is enabled by setting prefetch_mode in the configuration file. Currently, three modes are provided:

  1. None. The data pre-fetcher is not used by default. If you already use LMDB or the IO is fast enough, you can set it to None.

    prefetch_mode: ~
  2. prefetch_mode: cuda. Use the CUDA prefetcher. Please see NVIDIA/apex for more details. It occupies more GPU memory. Note that in this mode, you must also set pin_memory=True.

    prefetch_mode: cuda
    pin_memory: true
  3. prefetch_mode: cpu. Use the CPU prefetcher. Please see IgorSusmelj/pytorch-styleguide for more details. (In my tests, this mode does not bring acceleration.)

    prefetch_mode: cpu
    num_prefetch_queue: 1  # 1 by default
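The CPU pre-fetcher boils down to a background thread keeping a bounded queue of batches ready while the main loop trains. A minimal stdlib sketch of that pattern (illustrative only; the real implementation lives in prefetch_dataloader and wraps a PyTorch DataLoader):

```python
import queue
import threading


class CPUPrefetcher:
    """Sketch of a CPU pre-fetcher.

    A daemon thread pulls batches from `loader` and keeps up to
    `num_prefetch_queue` of them ready; the training loop pops from
    the queue instead of waiting on IO.
    """

    def __init__(self, loader, num_prefetch_queue=1):
        self.queue = queue.Queue(maxsize=num_prefetch_queue)
        self.thread = threading.Thread(
            target=self._worker, args=(iter(loader),), daemon=True)
        self.thread.start()

    def _worker(self, it):
        for batch in it:
            self.queue.put(batch)  # blocks when the queue is full
        self.queue.put(None)       # sentinel: end of epoch

    def __iter__(self):
        return self

    def __next__(self):
        batch = self.queue.get()
        if batch is None:
            raise StopIteration
        return batch
```

Whether this helps depends on how much IO and decoding time actually overlaps with the training step, which matches the observation above that it brings no speed-up on some setups.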

Video Super-Resolution

It is recommended to symlink the dataset root to datasets with the command ln -s xxx yyy. If your folder structure is different, you may need to change the corresponding paths in config files.

REDS

Official website.
We regroup the training and validation datasets into one folder. The original training dataset has 240 clips (000 to 239), and we rename the validation clips to 240 to 269.

Validation Partition

The official validation partition and that used in EDVR for competition are different:

| name | clips | total number |
| --- | --- | --- |
| REDSOfficial | [240, 269] | 30 clips |
| REDS4 | 000, 011, 015, 020 clips from the original training set | 4 clips |

All the remaining clips are used for training. Note that it is not required to explicitly separate the training and validation datasets; the dataloader does that.
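The two partitions above can be reproduced with a few lines of Python (illustrative only; clip names follow the regrouped 000 to 269 naming):

```python
# All 270 regrouped REDS clips, named '000' ... '269'.
all_clips = {f'{i:03d}' for i in range(270)}

reds4_val = {'000', '011', '015', '020'}               # REDS4
official_val = {f'{i:03d}' for i in range(240, 270)}   # REDSOfficial

# Training clips are simply everything outside the chosen validation set.
reds4_train = sorted(all_clips - reds4_val)
official_train = sorted(all_clips - official_val)

print(len(reds4_train), len(official_train))  # 266 240
```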

Preparation Steps

  1. Download the datasets from the official website.
  2. Regroup the training and validation datasets: python scripts/data_preparation/regroup_reds_dataset.py
  3. [Optional] Make LMDB files when necessary. Please refer to LMDB Description. python scripts/data_preparation/create_lmdb.py. Use the create_lmdb_for_reds function and remember to modify the paths and configurations accordingly.
  4. Test the dataloader with the script tests/test_reds_dataset.py. Remember to modify the paths and configurations accordingly.

Vimeo90K

Official webpage

  1. Download the dataset: Septuplets dataset --> The original training + test set (82GB). This is the ground-truth (GT). A sep_trainlist.txt file in the downloaded zip lists the training samples.
  2. Generate the low-resolution images. (TODO) The low-resolution images in the Vimeo90K test dataset are generated with the MATLAB bicubic downsampling kernel. Use the script data_scripts/generate_LR_Vimeo90K.m (run in MATLAB) to generate them.
  3. [Optional] Make LMDB files when necessary. Please refer to LMDB Description. python scripts/data_preparation/create_lmdb.py. Use the create_lmdb_for_vimeo90k function and remember to modify the paths and configurations accordingly.
  4. Test the dataloader with the script tests/test_vimeo90k_dataset.py. Remember to modify the paths and configurations accordingly.