Dataset Pruning for Transfer Learning [NeurIPS 2023]

Welcome to the official implementation of the NeurIPS 2023 paper Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning by Yihua Zhang*, Yimeng Zhang*, Aochuan Chen*, Jinghan Jia, Jiancheng Liu, Gaowen Liu, Mingyi Hong, Shiyu Chang, and Sijia Liu. This work introduces two dataset pruning techniques, Label Mapping (LM) and Feature Mapping (FM), which leverage the source-target domain mapping to select the source data most relevant to the target task.

Requirements

You can install the necessary Python packages with:

pip install -r requirements.txt

Note that, to accelerate model training, this code repository is built on FFCV; please refer to its official website for installation instructions. Our argument system is built with fastargs, and we provide a revised version here. Installation of the latest fastargs is handled automatically by the command above.
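
For reference, at the time of writing FFCV's quickstart suggests a conda-based installation along the following lines; treat this as a convenience pointer and defer to the official instructions if they have changed:

conda create -y -n ffcv python=3.9 cupy pkg-config libjpeg-turbo opencv pytorch torchvision cudatoolkit=11.3 numba -c pytorch -c conda-forge
conda activate ffcv
pip install ffcv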

Datasets

We study 9 commonly used transfer learning datasets and use FFCV to accelerate data loading and preprocessing. For most datasets, we provide the preprocessed data (.beton files) in this link. Please download the data and place it in the data folder. Datasets not provided there are downloaded automatically by PyTorch.
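
If you would like to convert a dataset into the .beton format yourself, FFCV's DatasetWriter handles this. Below is a minimal sketch; the field names ("image", "label"), the output path, and the max_resolution value are illustrative assumptions, not necessarily what this repository's loaders expect:

from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField
from torchvision.datasets import OxfordIIITPet

# Any indexed dataset returning (PIL image, int label) pairs works here.
dataset = OxfordIIITPet(root="./data", split="trainval", download=True)

writer = DatasetWriter("./data/oxfordpets/ffcv/train.beton", {
    "image": RGBImageField(max_resolution=256),
    "label": IntField(),
})
writer.from_indexed_dataset(dataset)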

For Flowers102, DTD, UCF101, Food101, EuroSAT, OxfordPets, StanfordCars, and SUN397, we use the dataset split configuration from CoOp. For the other datasets, we use the official splits provided by PyTorch.

Code Structure

The source code is organized as follows:

  • configs: contains the default parameters for each dataset
  • src: contains the source code for the proposed methods
    • algorithm: contains the core algorithms used by our method and the baselines
    • auxiliary: contains the executable files that generate intermediate results, e.g., the pruned data and the image features
    • data: contains the data loaders for each dataset
    • experiments: contains the main executable files to run the experiments
    • tools: contains the tools and utilities for the experiments
  • arguments: contains the dataset argument definitions

Usage

In this section, we provide the instructions to reproduce the results in our paper.

Pretrain on ImageNet

We first pretrain the surrogate model (ResNet-18) on ImageNet using the following command:

python src/experiment/imagenet_train_from_scratch.py --config-file configs/imagenet_train_from_scratch/rn18_16.json 

You can change the surrogate model architecture via the --network.architecture argument.
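
For example, to train a ResNet-50 surrogate instead (assuming resnet50 is a supported architecture name, in line with the resnet18 value used below):

python src/experiment/imagenet_train_from_scratch.py --config-file configs/imagenet_train_from_scratch/rn18_16.json --network.architecture resnet50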

Prune the Source Dataset using LM

We then prune the source dataset by 10% to 90% with a step size of 10% using LM with the following command:

python src/auxiliary/lm_selection_for_imagenet.py --cfg.data_path PATH_TO_DOWNSTREAM_TRAINING_DATA --cfg.source_train_label_path PATH_TO_IMAGENET_TRAINING_LABEL --cfg.source_val_label_path PATH_TO_IMAGENET_VALIDATION_LABEL --cfg.architecture resnet18 --cfg.pretrained_ckpt PATH_TO_PRETRAINED_CKPT --cfg.retain_class_nums 900,800,700,600,500,400,300,200,100 --cfg.write_path files/class_selection/oxfordpets 

Please note that the first parameter is the path to the training data (.beton file) of the target dataset. The second and third parameters are the paths to the generated label indices for each ImageNet data sample; these files are downloaded automatically along with the ImageNet .beton files (see the Datasets section).

You can also generate your own label index files using the src/auxiliary/get_label_and_indices.py file.
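
For intuition, LM ranks the source classes by how often the source-pretrained surrogate model predicts them on the target training data and keeps the most frequently predicted classes. Below is a minimal sketch of this idea using generic PyTorch components; it illustrates the principle and is not the repository's exact implementation:

import torch
from collections import Counter

@torch.no_grad()
def select_source_classes(model, target_loader, retain_num, device="cuda"):
    # Count how often each source (ImageNet) class is predicted on target images.
    model.eval().to(device)
    counts = Counter()
    for images, _ in target_loader:
        preds = model(images.to(device)).argmax(dim=1)
        counts.update(preds.cpu().tolist())
    # Keep the source classes most frequently "hit" by the target data.
    return [cls for cls, _ in counts.most_common(retain_num)]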

Prune the Source Dataset using FM

We can also prune the source dataset using FM. Unlike LM, we first need to extract the features of each data sample in both the source and target datasets with the surrogate model. Below is an example of how to generate the features of the source dataset.

python src/auxiliary/feature_gen.py --cfg.data_path PATH_TO_IMAGENET_TRAINING_DATA --cfg.dataset imagenet --cfg.architecture resnet18 --cfg.pretrained_ckpt PATH_TO_PRETRAINED_CKPT --cfg.write_path PATH_TO_FEATURES
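
Conceptually, feature generation runs the surrogate backbone without its classification head over a dataset and stores the penultimate-layer features. A hedged sketch of this step (not the actual feature_gen.py logic):

import torch

@torch.no_grad()
def extract_features(model, loader, device="cuda"):
    # Drop the final classification layer to expose penultimate-layer features.
    backbone = torch.nn.Sequential(*list(model.children())[:-1]).eval().to(device)
    feats = []
    for images, _ in loader:
        feats.append(backbone(images.to(device)).flatten(1).cpu())
    return torch.cat(feats)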

Next, with the features of the source and target dataset, we can prune the source dataset using FM with the following command:

python src/auxiliary/fm_selection_for_imagenet.py --dataset.src_train_fx_path PATH_TO_SOURCE_TRAINING_FEATURES --dataset.tgt_train_fx_path PATH_TO_TARGET_TRAINING_FEATURES --dataset.src_train_id_path PATH_TO_SOURCE_DATA_CLUSTER_MAPPING --dataset.src_val_id_path PATH_TO_TARGET_DATA_CLUSTER_MAPPING 

Note that the first two parameters are generated by the src/auxiliary/feature_gen.py file. The last two parameters are the clustering results, which indicate the cluster to which each data sample belongs.
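
For intuition, FM clusters the source features and retains the source samples whose clusters lie closest to the target data in feature space. The simplified sketch below assumes k-means clustering and cosine similarity as the matching score; the repository's exact clustering and scoring may differ:

import numpy as np
from sklearn.cluster import KMeans

def fm_select(src_feats, tgt_feats, num_clusters=1000, retain_clusters=500):
    # Cluster the source features.
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(src_feats)
    # Score each source cluster by cosine similarity between its centroid
    # and the mean target feature.
    tgt_centroid = tgt_feats.mean(axis=0)
    centers = km.cluster_centers_
    sims = centers @ tgt_centroid / (
        np.linalg.norm(centers, axis=1) * np.linalg.norm(tgt_centroid) + 1e-12)
    keep = np.argsort(-sims)[:retain_clusters]
    # Return the indices of source samples in the retained clusters.
    return np.nonzero(np.isin(km.labels_, keep))[0]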

Model Pretrain with Pruned Source Dataset

We then pretrain the large model with the pruned source dataset obtained by either LM or FM. We use the same file to pretrain this model as the one used to pretrain the surrogate model. The only differences are that we need to specify --dataset.prune 1 to indicate that the source dataset is pruned, and pass the selected training and testing data indices via --dataset.indices.training and --dataset.indices.testing. Below we provide an example of pretraining on the pruned source dataset obtained by LM (use --network.architecture to select a larger model, e.g., ResNet-101).

python src/experiment/imagenet_train_from_scratch.py --config-file configs/imagenet_train_from_scratch/rn18_16.json --dataset.prune 1 --dataset.indices.training files/class_selection/oxfordpets_flm_train_top${cls_num}.indices --dataset.indices.testing files/class_selection/oxfordpets_flm_val_top${cls_num}.indices
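
Here ${cls_num} is the number of retained source classes, matching the values passed to --cfg.retain_class_nums above. For example, to sweep over all pruning ratios with a shell loop:

for cls_num in 900 800 700 600 500 400 300 200 100; do
  python src/experiment/imagenet_train_from_scratch.py --config-file configs/imagenet_train_from_scratch/rn18_16.json --dataset.prune 1 --dataset.indices.training files/class_selection/oxfordpets_flm_train_top${cls_num}.indices --dataset.indices.testing files/class_selection/oxfordpets_flm_val_top${cls_num}.indices
done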

Downstream Finetune with the Pretrained Model

We then finetune the pretrained model on the target dataset. Below we provide an example of how to finetune the pretrained ResNet-101 on OxfordPets.

python src/experiment/imagenet_transfer_to_downstream.py --config-file configs/imagenet_transfer_to_downstream/oxfordpets_rn101_ff.json --dataset.train_path ./data/oxfordpets/ffcv/train_400_10_90.beton --dataset.test_path ./data/oxfordpets/ffcv/test_400_10_90.beton --network.pretrained_ckpt PATH_TO_PRETRAINED_CKPT --exp.identifier oxfordpets_rn101_ff

Downstream Train from Scratch

We also provide the option to train the model from scratch on the target dataset. Below we provide an example of how to train the ResNet-101 from scratch on OxfordPets.

python src/experiment/downstream_train_from_scratch.py --config-file configs/downstream_train_from_scratch/oxfordpets_rn101.json --dataset.train_path ../data/oxfordpets/ffcv/train_400_10_90.beton --dataset.test_path ../data/oxfordpets/ffcv/test_400_10_90.beton
