A plug-in replacement for DataLoader to load ImageNet disk-sequentially in PyTorch.

A plug-in ImageNet DataLoader for PyTorch. It uses tensorpack's sequential loading to read data quickly even from an HDD.



If you use pip's editable install, you can fix bugs I have probably introduced:

git clone
cd sequential-imagenet-dataloader
pip install -e .

To start, you must set the environment variable IMAGENET to point to wherever you have saved the ILSVRC2012 dataset. You must also set the TENSORPACK_DATASET environment variable, because tensorpack may download some things itself.
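As a quick sanity check before going further, something like the following can confirm that both variables are visible to Python. This is only a sketch: the helper name `missing_env_vars` is made up for illustration and is not part of this package.

```python
import os

# The two variables this README asks you to set.
REQUIRED_VARS = ("IMAGENET", "TENSORPACK_DATASET")


def missing_env_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `environ`.

    Hypothetical helper for illustration; not provided by this package.
    """
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Please set: " + ", ".join(missing))
    else:
        for name in REQUIRED_VARS:
            print(name, "=", os.environ[name])
```

Running it in the same shell you will train from catches the common case of exporting the variables in a different session.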


Before you can train anything, you have to run the preprocessing script to create the large LMDB binary files. They are written to the directory your IMAGENET environment variable points to, and they take up 140G for train, plus more for val.
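Since the files are large, it may be worth checking free space first. The sketch below uses only the standard library; `has_room` is a hypothetical helper, and it assumes "140G" means 140 × 10^9 bytes, which may be slightly off if the figure was reported in GiB.

```python
import os
import shutil


def has_room(path, needed_gb):
    """Return True if the filesystem holding `path` has at least `needed_gb` GB free.

    Hypothetical helper for illustration; 1 GB is taken as 10**9 bytes.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * 10**9


if __name__ == "__main__":
    # Falls back to the current directory if IMAGENET is not set.
    target = os.environ.get("IMAGENET", ".")
    # 140 covers train only; val needs additional space on top of this.
    print("Enough room for train LMDB:", has_room(target, 140))
```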


Wherever the DataLoader is defined in your PyTorch code, replace it with this package's loader (ImagenetLoader in the snippet below). Note that you can't call it with exactly the same arguments. For example, this would be the substitution in the PyTorch ImageNet example:

    #train_loader = torch.utils.data.DataLoader(
    #    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    #    num_workers=args.workers, pin_memory=True, sampler=train_sampler)
    train_loader = ImagenetLoader('train', batch_size=args.batch_size, num_workers=args.workers)

You may need to tune the number of workers to use to get best results.
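One way to compare worker counts is to time a fixed number of batches for each setting. The helper below is a minimal sketch, not part of this package: `seconds_per_batch` is a made-up name, and it works on any iterable, so it can be tried without the dataset.

```python
import time
from itertools import islice


def seconds_per_batch(loader, n_batches=50):
    """Average wall-clock seconds to fetch one batch from `loader`.

    Hypothetical helper for illustration. The measurement includes any
    start-up cost of the loader's workers, so use enough batches to
    amortize it.
    """
    start = time.perf_counter()
    fetched = sum(1 for _ in islice(iter(loader), n_batches))
    elapsed = time.perf_counter() - start
    if fetched == 0:
        raise ValueError("loader produced no batches")
    return elapsed / fetched


# Sketch of a sweep, assuming the constructor call shown above and a
# preprocessed dataset; the worker counts and batch size are arbitrary:
#
# for workers in (4, 8, 16):
#     loader = ImagenetLoader('train', batch_size=256, num_workers=workers)
#     print(workers, seconds_per_batch(loader))
```

This measures loading alone; the setting that feeds the GPUs best during real training may differ, since the training loop competes for CPU time.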


Running the PyTorch ImageNet example on the server I work on, which has no SSD but four Titan X GPUs, I get an average minibatch time of 5.3s. Using this iterator to feed examples, I get about 0.59s per minibatch, so 54 minutes per epoch; 90 epochs should take about 73 hours, which is fast enough to get results. A ResNet-18 converged to 69% top-1 and 89% top-5 accuracy, which appears to be the standard for that architecture.
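The timing estimate can be reproduced with a few lines of arithmetic. This sketch assumes a batch size of 256 (the default in the PyTorch ImageNet example; the batch size actually used is not stated here) and the 1,281,167 images of the ILSVRC2012 training set. It counts training batches only, so validation passes and other per-epoch overhead would add to the result.

```python
import math

TRAIN_IMAGES = 1_281_167  # ILSVRC2012 training set size
BATCH_SIZE = 256          # assumption: default of the PyTorch ImageNet example
EPOCHS = 90


def epoch_minutes(sec_per_batch, images=TRAIN_IMAGES, batch_size=BATCH_SIZE):
    """Minutes spent on training batches in one epoch."""
    batches = math.ceil(images / batch_size)
    return batches * sec_per_batch / 60


def total_hours(sec_per_batch, epochs=EPOCHS):
    """Hours spent on training batches over `epochs` epochs."""
    return epochs * epoch_minutes(sec_per_batch) / 60


if __name__ == "__main__":
    print("minutes per epoch at 0.59s:", round(epoch_minutes(0.59), 1))
    print("hours for 90 epochs at 0.59s:", round(total_hours(0.59), 1))
    print("speed-up over 5.3s:", round(5.3 / 0.59, 1))
```

Under these assumptions the training batches alone come to roughly 49 minutes per epoch and about 74 hours for 90 epochs, close to the 73-hour figure, and the loader is about 9 times faster per minibatch than the baseline.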

The Titan Xs still look a little hungry if we're running on all four, but it's fast enough to work with.
