
OpenOOD v1.5 change log


This page goes through every update we made from OpenOOD v1.0 to v1.5.

  • A leaderboard with more accurate, comprehensive results

    • We are hosting a leaderboard to track SOTA performance and progress of the OOD detection community (currently focusing on image classification).
    • In v1.5 we further emphasize the study of full-spectrum detection, which considers OOD generalization and OOD detection at the same time. This is reflected by the two full-spectrum benchmarks in the leaderboard. Please refer to our new report for more details.
  • A unified, easy-to-use evaluator

    • We introduced a new evaluator with which you can get OOD detection results as long as you 1) provide a pre-trained base classifier, 2) specify the ID dataset the classifier was trained on, and 3) specify or provide a desired postprocessor (e.g., MSP). The evaluator encodes pre-defined OOD data splits to ensure fair and consistent evaluation, and it also handles full-spectrum detection. A minimal usage sketch follows below.
    • Check out this Colab tutorial!
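
A minimal usage sketch of the evaluator, assuming the openood.evaluation_api.Evaluator interface shipped with v1.5 (the checkpoint path and some argument values below are illustrative and may differ in your setup):

```python
# Minimal sketch of the new unified evaluator (v1.5 interface assumed).
import torch
from openood.evaluation_api import Evaluator
from openood.networks import ResNet18_32x32  # CIFAR-scale ResNet shipped with OpenOOD

# 1) provide a pre-trained base classifier
net = ResNet18_32x32(num_classes=10)
net.load_state_dict(torch.load('./cifar10_resnet18_ckpt.pth'))  # hypothetical path
net.cuda().eval()

evaluator = Evaluator(
    net,
    id_name='cifar10',         # 2) the ID dataset the classifier was trained on
    data_root='./data',        # pre-defined ID/OOD splits are stored here
    postprocessor_name='msp',  # 3) a desired postprocessor, e.g., MSP
    batch_size=200,
)

# standard OOD detection; pass fsood=True for full-spectrum detection
metrics = evaluator.eval_ood(fsood=False)
print(metrics)
```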
  • Reconfigure data splits

    • Validation OOD splits. In v1.0 the OOD validation samples leaked information about the OOD test samples. For example, the val OOD split for CIFAR-10 contained samples from all 100 categories of CIFAR-100. A more realistic setup is that the val OOD categories do not overlap with the test OOD distributions. To this end, we randomly selected 20 categories from the Tiny ImageNet (TIN) test OOD split to serve as the new val OOD split for CIFAR-10/100. These 20 categories are removed from the original TIN test OOD split.
    • OE data. In v1.0 the OE data for CIFAR overlapped with the TIN test OOD split: all 10,000 training images from TIN, covering all 200 TIN categories, were used for OE training. In v1.5, we started from the 800 remaining categories of ImageNet (apart from the 200 categories of TIN) and filtered out categories that may relate to CIFAR-10/100 based on WordNet (a sketch of such filtering follows after this list). This results in a dataset with 597 categories, which we use as the new OE dataset for CIFAR experiments.
    • ImageNet OOD splits. We first removed MNIST and SVHN from the far-OOD splits, as these low-resolution images of numerical digits are too far from ImageNet in both semantic and non-semantic aspects and do not represent a meaningful test. We also removed ImageNet-O from the near-OOD splits, as it was adversarially constructed by targeting the MSP detector on a ResNet-50 model and is known to cause biased evaluation for other detectors and models (reference). Instead, we added to the near-OOD splits images carefully selected from ImageNet-21K (reference). We also included a recent new OOD dataset, NINCO, for evaluation. Finally, we reorganized the near-OOD and far-OOD splits based on difficulty: for ImageNet, which consists of 1,000 diverse visual categories, it is difficult to define "near-OOD" or "far-OOD"; the splits are better thought of as "hard-OOD" and "easy-OOD".
    • ImageNet-200 ID dataset. In our experiments we found that methods can show distinct patterns when shifting from small-scale datasets (e.g., CIFAR) to large-scale datasets (e.g., ImageNet). While ImageNet results are certainly more meaningful, the full ImageNet-1K can still be too large for quick development, especially for training methods. Therefore we introduced ImageNet-200 as a new ID dataset, obtained by sampling 200 categories from ImageNet-1K (it shares the same 200 categories as ImageNet-R). The OOD splits for ImageNet-200 are exactly the same as those for ImageNet-1K, which allows a straightforward comparison between ImageNet-200 and ImageNet-1K to demonstrate methods' scalability.
    • Again, see our new report for more details on this part!
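
For illustration, the WordNet-based screening mentioned above could look like the sketch below. This is not the exact script used to build the v1.5 OE dataset: the similarity measure, the threshold, and the imagenet800_wnids list are illustrative assumptions.

```python
# Sketch of screening ImageNet categories against CIFAR classes via WordNet.
# Requires nltk with the WordNet corpus downloaded.
from nltk.corpus import wordnet as wn

# CIFAR-10 class names; for CIFAR-100 the list would have 100 entries.
cifar10_classes = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]
cifar_synsets = [s for name in cifar10_classes for s in wn.synsets(name, pos="n")]

def wnid_to_synset(wnid: str):
    """ImageNet wnids look like 'n02084071'; the digits are a WordNet noun offset."""
    return wn.synset_from_pos_and_offset("n", int(wnid[1:]))

def related_to_cifar(wnid: str, threshold: float = 0.4) -> bool:
    """Flag an ImageNet category whose synset is close to any CIFAR class."""
    syn = wnid_to_synset(wnid)
    return any((syn.path_similarity(c) or 0.0) >= threshold for c in cifar_synsets)

# imagenet800_wnids: hypothetical list of the 800 ImageNet-1K wnids outside TIN.
# oe_wnids = [w for w in imagenet800_wnids if not related_to_cifar(w)]
```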
  • Support for new architectures on ImageNet-1K

    • ViT-B-16 and Swin-T (a loading sketch follows below)
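
A sketch of evaluating one of the newly supported backbones, assuming torchvision (>= 0.13) provides the pre-trained weights and that the evaluator accepts any classifier returning ImageNet-1K logits; the Evaluator arguments mirror the CIFAR sketch above:

```python
import torchvision.models as models
from openood.evaluation_api import Evaluator

# ViT-B-16 with the standard ImageNet-1K weights; swap in models.swin_t(...) for Swin-T.
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1).cuda().eval()
# swin = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1).cuda().eval()

evaluator = Evaluator(vit, id_name='imagenet', data_root='./data',
                      postprocessor_name='msp', batch_size=64)
metrics = evaluator.eval_ood()
```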
  • New methods

  • Fixed bugs and improved efficiency for several methods

    • The OOD evaluator's automatic hyperparameter search pipeline was wrong (see #156)
    • Previous evaluation of certain methods (e.g., KLM) used ID/OOD test samples for setup or hyperparameter search, which leaks test information.
    • Improved efficiency and removed redundancy for several postprocessors.
    • Fixed OE training bugs and recovered the expected performance (in the v1.0 report its performance was poor).
    • ODIN's and G-ODIN's postprocessors were using CIFAR-10's input standard deviation for all datasets (see the input-preprocessing sketch after this list).
    • Enabled seeding for multiple independent training runs.
    • Added validation for OpenGAN training.
    • Fixed the wrong loss of MCD training.
    • Fixed several bugs in the whole pipeline.
    • Enabled hyperparameter search for ReAct and KNN. The search lists are taken from the respective papers.
    • The "class_labels" in the provided imglist files of MOS (here) were wrong. Also, MOS's code is designed to work with fine-tuning; training from scratch (which was the case in v1.0 implementation) won't work desirably. MOS evaluator was also wrong: Val ID rather than test ID data was used.