WIP: Documentation & Typos #1629

Draft · wants to merge 32 commits into master
Conversation

@jneuendorf commented Mar 21, 2024

This PR is still in progress and not yet "clean for review" - just some things I stumbled across. I opened it already so you know there is progress on certain aspects and others don't have to worry about them 😉

Relates to #886

@AntonioCarta (Collaborator)

Thank you! ping me once you are ready for a review.

@jneuendorf marked this pull request as draft on March 23, 2024 13:37
@jneuendorf (Author) commented Mar 23, 2024

@AntonioCarta Thanks for the quick reply!

I have a question not directly related to documentation: In a codebase using a past Avalanche version there is this import: from avalanche.benchmarks.utils import make_classification_dataset. From what I could see in the git history, this function was deprecated and no longer exists in Avalanche. What is the equivalent in the current version (0.5.0)?

from avalanche.benchmarks.utils import as_classification_dataset seems to be the successor. However, it's not quite clear to me how it can be used equivalently. _init_transform_groups was used in make_classification_dataset and is still a "private" function in the current code, but there is no public function that uses it (only private ones, e.g. _make_taskaware_classification_dataset). So the question is:

What's the reason that only transform_groups is passed to as_classification_dataset instead of all the kwargs of make_classification_dataset? Would you be fine with reintroducing the former kwargs and calling _init_transform_groups? Otherwise, I think the best alternative is to make _init_transform_groups public so users can call

as_classification_dataset(
  dataset,
  transform_groups=init_transform_groups(
    target_transform=lambda x: x,  # or whatever
    ...
  ),
)

This could also be achieved via a class method (the constructor seems a bit complex already), e.g.

as_classification_dataset(
  dataset,
  transform_groups=TransformGroups.create(
    target_transform=lambda x: x,  # or whatever
    ...
  ),
)

Kind regards 🙂

PS: Providing a dict to as_classification_dataset as in the docs does not comply with the TransformGroup type.

@AntonioCarta (Collaborator)

Thank you for highlighting these issues. The methods changed over time because at first AvalancheDataset only supported classification datasets. Now, a classification dataset is an AvalancheDataset where:

  • the data returns triplets <x, y, t>
  • it has a targets DataAttribute
  • it has a targets_task_labels DataAttribute

Therefore, the goal of as_classification_dataset should be to make sure that the data has the correct attributes that are expected from classification datasets. Transformation groups in theory should be defined before this step. But I agree that it may be useful to define them at this stage as long as it's not too complex or confusing.
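
For concreteness, a minimal sketch of what those requirements look like in practice, assuming data is an already correctly built classification AvalancheDataset:

# minimal sketch: `data` is assumed to be a correctly built classification dataset
x, y, t = data[0]                   # items are <x, y, t> triplets
print(data.targets[0])              # per-sample class label (DataAttribute)
print(data.targets_task_labels[0])  # per-sample task label (DataAttribute)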

Would you be fine with reintroducing the former kwargs and calling _init_transform_groups? Otherwise, I think the best alternative is to make _init_transform_groups public so users can call

I think it makes sense to add the arguments to as_classification_dataset. Transformation groups are a bit verbose to create from scratch.
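
For illustration, a rough sketch of what that extended signature could look like (purely hypothetical, not the current API):

# hypothetical sketch of the proposed signature - not the current API
def as_classification_dataset(
    dataset,
    *,
    transform=None,           # per-group transform kwargs, as in the old
    target_transform=None,    # make_classification_dataset
    transform_groups=None,    # or a ready-made TransformGroups, as today
    task_labels=None,
    targets=None,
):
    ...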

Otherwise, I think the best alternative is to make _init_transform_groups

I like this less. Classification transform groups are different from other transformations, and TransformGroups tries to be agnostic to these differences (as much as possible). Also, it would not solve the verbosity problem.

PS: Providing a dict to as_classification_dataset as in the docs does not comply with the TransformGroup type.

This is probably a mistake from the earlier API version.

@coveralls

Pull Request Test Coverage Report for Build 8382229381

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.806%

Totals Coverage Status
Change from base Build 8098020118: 0.0%
Covered Lines: 14756
Relevant Lines: 28483

💛 - Coveralls

@AntonioCarta (Collaborator)

By the way, if you have other comments about usability or pain points in the API/documentation, do let me know. Fixing these issues takes a lot of time so we can't do everything quickly, but we try to keep track of them and improve over time.

@AntonioCarta (Collaborator)

Hi @jneuendorf, I see that the PR is still a draft, but if there are no blocking issues I think we could merge it. Doc improvements can easily be split into multiple PRs.

@jneuendorf (Author)

Hi @jneuendorf, I see that the PR is still a draft, but if there are no blocking issues I think we could merge it. Doc improvements can easily be split into multiple PRs.

Hi, your GitHub comment email got lost among the other ones from GitHub, like Coveralls etc. 🙈 You're right that some things could already be merged, since later amendments can go into another PR.

So feel free to review/cherry-pick the changes and decide which ones are ok for you. 😉

@jneuendorf (Author) commented May 23, 2024

EDIT

Following the hint at How to Create an AvalancheDataset, I got it working like this:

import torch
from torch import Tensor
from avalanche.benchmarks.utils import _make_taskaware_tensor_classification_dataset

def generate_class(
    label: int | float,
    n_samples: int = 10,
) -> tuple[Tensor, Tensor]:
    ...

class0_x, class0_y = generate_class(label=0)
class1_x, class1_y = generate_class(label=1)

dataset = _make_taskaware_tensor_classification_dataset(
    torch.concat([class0_x, class1_x]),
    torch.concat([class0_y, class1_y]),
)

print('targets', dataset.targets)
# FlatData (len=20,subset=True,cat=False,cf=True)
# 	list (len=20)

The statement

...make_tensor_classification_dataset for tensor datasets; all of these will automatically create the targets and targets_task_labels attributes.

guided me in the right direction. However, it was not obvious

  1. that one should use a "private" helper function
  2. that subclassing a TensorDataset with a targets property like below is not equivalent to using a benchmark utility.

I think it would be helpful to extend the documentation with examples for creating datasets from typical sources:

  1. from tensors
  2. from (x, y) tuples
  3. from torchvision datasets
  4. from custom datasets

I guess some useful information can be found at Benchmarks, so maybe it can be linked from "How to Create an AvalancheDataset".

Depending on what use cases are reasonable and Avalanche is willing to implement, there could be different convenience utilities - this seems to be the current approach. Maybe all of the dataset/benchmark creation utilities can be boiled down to (1) "from tensors" (I admit I haven't fully understood all the nomenclature of the different dataset and scenario types and experience attributes, e.g. supervised vs. detection vs. supervised-detection, or task labels vs. class labels).


I currently have a problem with training a benchmark from a custom dataset:

My custom dataset is a PyTorch TensorDataset that implements ISupportedClassificationDataset, i.e. the targets property.

from functools import cached_property

import torch
# ISupportedClassificationDataset comes from avalanche.benchmarks.utils
# (the exact submodule may vary between versions)

class ClassificationMixin(torch.utils.data.Dataset, ISupportedClassificationDataset):
    @cached_property
    def targets(self):
        return []  # whatever...

class TensorDataset(torch.utils.data.TensorDataset, ClassificationMixin):
    ...

This way, it (and its modified splits) is compatible with as_classification_dataset(...). However, when I try to train the following benchmark

benchmark = class_incremental_benchmark(
    {
        'train': as_classification_dataset(train_dataset),
        'test': as_classification_dataset(test_dataset),
    },
    num_experiences=1,
)

I get the following error:

...
    cl_strategy.train(experience)
  File "avalanche/avalanche/training/templates/base_sgd.py", line 211, in train
    super().train(experiences, eval_streams, **kwargs)
  File "avalanche/avalanche/training/templates/base.py", line 162, in train
    self._before_training_exp(**kwargs)
  File "avalanche/avalanche/training/templates/base_sgd.py", line 291, in _before_training_exp
    self.make_train_dataloader(**kwargs)
  File "avalanche/avalanche/training/templates/base_sgd.py", line 456, in make_train_dataloader
    self.dataloader = TaskBalancedDataLoader(
                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "avalanche/avalanche/benchmarks/utils/data_loader.py", line 404, in __init__
    task_labels_field = getattr(data, "targets_task_labels")
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ClassificationDataset' object has no attribute 'targets_task_labels'

For me, this is a hint that there is some protocol mismatch: class_incremental_benchmark accepts a dataset created by as_classification_dataset but its training doesn't work. So am I correct that

  • either ISupportedClassificationDataset (and/or similar) should also require a targets_task_labels method
  • or the base SGD should not rely on a data loader that depends on targets_task_labels

?

@AntonioCarta (Collaborator)

I think it would be helpful to extend the documentation with examples for creating datasets from typical sources:

Yes, we need to update that notebook. Most of the time doing AvalancheDataset(your_data) should work. You don't need the old benchmark builders (from tensors, from files, ...) once you have converted your data into an AvalancheDataset with appropriate attributes (needed to easily split it).

For me, this is a hint that there is some protocol mismatch: class_incremental_benchmark accepts a dataset created by as_classification_dataset but its training doesn't work.

This is a missing feature. Basically, I implemented the datasets without task labels, but many strategies still require them even when not strictly necessary. We have to fix this problem. Early versions of the datasets always had task labels, which was confusing for some people.

@jneuendorf (Author)

Yes, we need to update that notebook. Most of the time doing AvalancheDataset(your_data) should work. You don't need the old benchmark builders (from tensors, from files, ...) once you have converted your data into an AvalancheDataset with appropriate attributes (needed to easily split it).

Okay. Based on your answer and How to Create an AvalancheDataset I tried using

import torch
from torch.utils.data.dataset import TensorDataset
from avalanche.benchmarks.utils import AvalancheDataset
# DataAttribute and ConstantSequence must also be imported from the
# avalanche.benchmarks.utils submodules (exact paths depend on the version)

# Create a dataset of 100 data points described by 22 features + 1 class label
x_data = torch.rand(100, 22)
y_data = torch.randint(0, 5, (100,))

# Create the Dataset
torch_data = TensorDataset(x_data, y_data)

avl_data = AvalancheDataset(
  torch_data,
  data_attributes=[
    DataAttribute(y_data, "targets"),
    DataAttribute(ConstantSequence(0, len(y_data)), "targets_task_labels", use_in_getitem=True),
  ],
)

instead of

# ...
avl_data = _make_taskaware_tensor_classification_dataset(x_data, y_data)

This is "with appropriate attributes" but quite cumbersome and typing does not work because .targets does not exist on AvalancheDataset, for example. Also the logic in _make_taskaware_tensor_classification_dataset is more powerful as it handles more cases.

Suggestion 1: Dataset Hierarchy

So it would be really useful to add a dataset hierarchy to the documentation, i.e. the inheritance chain, along with the required data attributes (and more?).

Suggestion 2: Dataset construction

We should think about whether the structure of Suggestion 1 is ideal. If so, it would be consistent to require the user to use the dataset class for the corresponding scenario, e.g.

TaskAwareClassificationDataset(
  # ...
  task_labels=[1, 2, 3],
)

As with the "old benchmark builders", I would try to hide internal complexity from the user, such as DataAttribute or ConstantSequence.

More advanced benchmark builders could still exist with correct typing (e.g. via @overload) - also as classmethods on AvalancheDataset for nicer imports.

@AntonioCarta (Collaborator)

Thanks for the feedback.

The API for the benchmarks is quite cumbersome for a lot of reasons (it changed over time and was supposed to be only an internal tool). We need a better API for attributes and transformations. If you have some proposals I would be happy to discuss them with you.
IMO it's better to discuss that in a separate issue so that we can track it separately.

So it would be really useful to add a dataset hierarchy to the documentation, i.e. the inheritance chain, along with the required data attributes (and more?).

IMO the issue is that CL benchmarks are quite difficult and have an overloaded nomenclature. Many Avalanche users are new to CL so they don't have a complete understanding of the CL literature, and we assume they do.

Avalanche models different benchmarks by adding attributes to datasets/experiences and splitting experiences appropriately. CL Benchmarks have 4 large properties: task-awareness, online, boundary-awareness, supervision. These are explained in the FZTH. There is no strict hierarchy in Avalanche. Other CL frameworks tried to provide a strict hierarchy (e.g. sequoia) but we don't do it.

I would add other relevant examples to the FZTH notebook if you feel anything is missing or there needs to be more explanation.

the logic in _make_taskaware_tensor_classification_dataset is more powerful as it handles more cases

The issue with that approach is that you are mixing the dataset creation with the benchmark creation and you need to make a lot of assumptions about the datasets. It is hard to generalize it to all possible settings (e.g. sequence classification, videos, graphs, ...). We also wanted to give the possibility of creating class-incremental benchmarks without task labels, which was confusing for many users. I agree with you that the new API is not as powerful yet, but I believe it's easier to provide high level methods to manage transforms/attributes than it is to provide truly general benchmark generators.

@AntonioCarta (Collaborator)

For me, this is a hint that there is some protocol mismatch: class_incremental_benchmark accepts a dataset created by as_classification_dataset but its training doesn't work.

FYI #1652 fixes this error. Now you should be able to use class-incremental benchmarks without task labels with strategies (as long as they don't need task labels).

This is an example of the benchmark creation:

    # imports not shown: MNIST from torchvision.datasets; Compose, ToTensor,
    # Normalize from torchvision.transforms; the Avalanche helpers come from
    # avalanche.benchmarks (exact submodules may vary by version)
    # create pytorch dataset
    train_data = MNIST(root=default_dataset_location("mnist"), train=True)
    test_data = MNIST(root=default_dataset_location("mnist"), train=False)

    # prepare transformations
    train_transform = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
    eval_transform = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
    tgroups = TransformGroups({"train": train_transform, "eval": eval_transform})

    # create Avalanche datasets with targets attributes (needed to split by class)
    da = DataAttribute(train_data.targets, "targets")
    train_data = make_avalanche_dataset(
        train_data, data_attributes=[da], transform_groups=tgroups
    )

    da = DataAttribute(test_data.targets, "targets")
    test_data = make_avalanche_dataset(
        test_data, data_attributes=[da], transform_groups=tgroups
    )

    # create benchmark
    bm = class_incremental_benchmark(
        {"train": train_data, "test": test_data}, num_experiences=5
    )

@jneuendorf (Author) commented Jun 3, 2024

@AntonioCarta Thanks for your thorough reply. 😊 I will see if the latest version works for me.

Generally, I agree with your thoughts. 👍 And I must admit that this issue's discussion has gone out of scope. Since I am neither a CL nor an Avalanche expert, most of my comments were questions.

So I thought Slack would be a better place for discussion and tried the Slack Invite Link but it appears to be dead. In case you still use Slack for discussion, could you provide a new invite link? 🙏


One note (we could further discuss on Slack 😉):

CL Benchmarks have 4 large properties: ... There is no strict hierarchy in Avalanche.

I understand the difficulty/infeasibility of creating an abstraction for different benchmark scenarios. I meant the hierarchy of datasets only. Maybe adding (data) attributes like targets in benchmarks is common enough to justify a built-in subclass of AvalancheDataset (SupervisedDataset). If implemented correctly, the user could create a custom benchmark that uses a dataset inheriting from multiple built-in datasets.
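
For illustration, a rough sketch of what such a subclass could look like (hypothetical - SupervisedDataset does not exist, and the exact constructor arguments and import paths may differ):

# hypothetical sketch - SupervisedDataset is not an existing Avalanche class
from avalanche.benchmarks.utils import AvalancheDataset, DataAttribute

class SupervisedDataset(AvalancheDataset):
    """An AvalancheDataset that always carries a `targets` DataAttribute."""

    def __init__(self, dataset, targets, **kwargs):
        super().__init__(
            dataset,
            data_attributes=[DataAttribute(targets, "targets")],
            **kwargs,
        )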

@AntonioCarta (Collaborator)

So I thought Slack would be a better place for discussion and tried the Slack Invite Link but it appears to be dead. In case you still use Slack for discussion, could you provide a new invite link? 🙏

Where did you get this link? The one on the website seems to be working.

Anyway, I think it's better to discuss on GitHub because Slack only shows the last 90 days of messages. It's easier for us to track discussions here. I check them periodically and eventually close them if they don't lead to actionable items.

I understand the difficulty/infeasibility of creating an abstraction for different benchmark scenarios. I meant the hierarchy of datasets only. Maybe adding (data) attributes like targets in benchmarks is common enough to justify a built-in subclass of AvalancheDataset (SupervisedDataset). If implemented correctly, the user could create a custom benchmark that uses a dataset inheriting from multiple built-in datasets.

For scenarios we have Protocols (example). Again, keep in mind that a scenario may need multiple protocols (e.g. task-aware + boundary-aware).

For datasets we would need some work to clean the hierarchy. We have some helpers (example) but I admit it's not clear for the user which one they should use.

@jneuendorf (Author) commented Jun 6, 2024

So I thought Slack would be a better place for discussion and tried the Slack Invite Link but it appears to be dead. In case you still use Slack for discussion, could you provide a new invite link? 🙏

Where did you get this link? The one on the website seems to be working.

I tried both: the one directly from the sidebar and the one at the bottom of The People - also across different machines. But neither has worked.

For scenarios we have Protocols (example). Again, keep in mind that a scenario may need multiple protocols (e.g. task-aware + boundary-aware).

For datasets we would need some work to clean the hierarchy. We have some helpers (example) but I admit it's not clear for the user which one they should use.

I tried to get a better picture of the structure and relations between scenarios, streams, experiences, and their interfaces/protocols, ABCs, and implementations. Here is what I came up with:

Link to editable PlantUML

This is neither complete nor an exact mirror of the actual code base (e.g. ClassesTimelineCLExperience; some conceptual interfaces were added according to the wording and for consistency).

Some things I figured out:

  1. Everything seems to be working with DatasetExperience rather than CLExperience
  2. Some dataset definitions are not used
  3. Some experience protocols are not used
  4. Some protocols are basically ABCs as they provide method/property implementations.

Remarks / Suggestions

  • I like the idea of having protocols, as they let the user ship their own implementations.
  • Avalanche experience properties should be ABCs/mixins implementing the protocols. This would very nicely support the idea "that a scenario may need multiple protocols (e.g. task-aware + boundary-aware)."
  • Merge CLExperience and DatasetExperience into one class to be used for implementations, since they are used as a single unit. There could be an experience protocol for typing, as well.
  • Adjust dataset definitions according to the above points. The protocol/class structure could match that of experiences. This way, it would be more obvious for the user which parts are required for a certain setting, e.g.
    • task-aware + boundary-aware $\leadsto$
      • scenario type class Scenario(scenarios.TaskAware, scenarios.BoundaryAware): ...
      • experience type class Experience(experiences.TaskAware, experiences.BoundaryAware): ...
      • dataset type class Dataset(datasets.TaskAware, datasets.BoundaryAware): ...
    • Of course, this could also be implemented as factory functions, as is already done. With overloading, intersection typing should be doable (basically function composition instead of multi-inheritance classes). A rough sketch of the mixin/protocol idea follows below.
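
A rough sketch of what such a mixin/protocol composition could look like (class and property names are illustrative, not the actual Avalanche classes):

# hypothetical sketch - names are illustrative, not actual Avalanche classes
from collections.abc import Sequence
from typing import Protocol, runtime_checkable

@runtime_checkable
class TaskAware(Protocol):
    @property
    def task_labels(self) -> Sequence[int]: ...

@runtime_checkable
class BoundaryAware(Protocol):
    @property
    def at_task_boundary(self) -> bool: ...

class TaskAwareMixin:
    """Default implementation of the TaskAware protocol."""

    _task_labels: Sequence[int] = ()

    @property
    def task_labels(self) -> Sequence[int]:
        return self._task_labels

# a setting that is task-aware + boundary-aware would then combine the mixins:
# class Experience(TaskAwareMixin, BoundaryAwareMixin, DatasetExperience): ...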

Let me know what you think 😉

@AntonioCarta (Collaborator)

Everything seems to be working with DatasetExperience rather than CLExperience

The CLExperience provides a bit of additional flexibility. We have some minimal support for reinforcement learning, which is a scenario where we don't have a dataset. We also have avalanche-rl, which was supported by an older Avalanche version but is currently unmaintained. For example, methods like progressive neural networks should work with task-aware RL scenarios without any modifications.

Avalanche experience properties should be ABCs/mixins implementing the protocols. This would very nicely support the idea "that a scenario may need multiple protocols (e.g. task-aware + boundary-aware)."

Yes, but I would keep this as an internal API. Users should be using high-level methods, because I think most people would be very confused by mixins.

Adjust dataset definitions according to the above points. The protocol/class structure could match that of experiences. This way, it would be more obvious for the user which parts are required for a certain setting, e.g.

I tend to agree with this, but keep in mind that some protocols don't need to be propagated. For example, scenarios/experiences can be BoundaryAware, but the dataset is not BoundaryAware since this is a property of the stream, not the data (unlike task and class labels).
There are also some older benchmarks that respect the protocols but were implemented before their definition so they have a slightly different class structure. I would not touch those to avoid introducing bugs.

With overloading, intersection typing should be doable (basically function composition instead of multi-inheritance classes)

I'm not an expert on Python typing, but last time I checked I remember that there were some serious limitations (e.g. protocol runtime checks not verifying attributes or the full signatures of functions). We also have the additional challenge of having to support older Python versions (for example, right now the CI fails due to typing import errors).
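
As a small illustration of that limitation (plain Python, nothing Avalanche-specific): a runtime protocol check only verifies that the members exist, not their signatures or return types.

from typing import Protocol, runtime_checkable

@runtime_checkable
class HasTargets(Protocol):
    def targets(self) -> list[int]: ...

class Wrong:
    def targets(self, unexpected_arg: str) -> str:  # wrong signature and type
        return unexpected_arg

print(isinstance(Wrong(), HasTargets))  # True - the mismatch goes unnoticed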
