Fixing issue #771: concat_datasets_sequentially discards classes #844

Merged
merged 10 commits into ContinualAI:master on Jan 3, 2022

Conversation

@ashok-arjun (Contributor) commented Dec 5, 2021

Fixes #771 raised by @AndreaCossu

The problem was that class_mapping was not built correctly, so it did not respect new_class_id = class_mapping[original_class_id].

Now, for each dataset, a list of size (largest original class index + 1) is created, and only the classes actually present in that dataset are filled in with their remapped IDs.
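
To illustrate the remapping idea, here is a minimal sketch (variable names are illustrative and not necessarily the exact ones in the patch):

# Sketch of the remapping logic: classes from each dataset are mapped to a
# contiguous range of IDs across all datasets.
next_remapped_id = 0
classes_per_dataset = []
for dataset_classes in [[0, 1], [2, 3], [4, 5], [0, 1]]:
    # class_mapping has size (largest original class index + 1); only the
    # classes actually present in this dataset are filled in.
    class_mapping = [-1] * (max(dataset_classes) + 1)
    remapped = []
    for original_class_id in sorted(dataset_classes):
        class_mapping[original_class_id] = next_remapped_id
        remapped.append(next_remapped_id)
        next_remapped_id += 1
    # Each sample's label is then remapped as
    # new_class_id = class_mapping[original_class_id].
    classes_per_dataset.append(remapped)

print(classes_per_dataset)  # [[0, 1], [2, 3], [4, 5], [6, 7]]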

Test:

import torch
from torch.utils.data import TensorDataset
from avalanche.benchmarks.utils import concat_datasets_sequentially, AvalancheDataset

#  4 training datasets and 4 test datasets with targets [0,1], [2,3], [4,5], [0,1] respectively.
#  The final output should remap the classes into the range [0-7].

train = [
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(2, 4, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(4, 6, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,)))),
]

test = [
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(2, 4, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(4, 6, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,)))),
]

final_train, final_test, classes = concat_datasets_sequentially(train, test)

# this is ok -> [[0, 1], [2, 3], [4, 5], [6, 7]]
print("Classes: ", classes) 

# before this fix, this printed {0, 1}; expected {0, 1, 2, 3, 4, 5, 6, 7}
print("Final Train: ", set([el[1] for el in final_train]))

Output:

Classes:  [[0, 1], [2, 3], [4, 5], [6, 7]]
Final Train:  {0, 1, 2, 3, 4, 5, 6, 7}

@AndreaCossu (Collaborator)

@lrzpellegrini can you check? seems ok to me!

@lrzpellegrini (Collaborator)

Hi there, line

dataset_classes = set([el[1] for el in train_set])

is a bottleneck and the performance hit may be huge for big datasets. Use the dataset.targets field instead.
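
For example, something along these lines (just a sketch; train_set here is an illustrative toy dataset):

import torch
from torch.utils.data import TensorDataset
from avalanche.benchmarks.utils import AvalancheDataset

# Hypothetical stand-in for one of the concatenated datasets.
train_set = AvalancheDataset(
    TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,))))

# Reads only the precomputed targets field, so no sample is ever loaded,
# unlike set([el[1] for el in train_set]), which iterates the whole dataset.
dataset_classes = set(int(t) for t in train_set.targets)
print(dataset_classes)  # {0, 1}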

Also, consider including a regression unit test to check that everything is in order!

@ashok-arjun (Contributor, Author) commented Dec 23, 2021

> Hi there, line
>
> dataset_classes = set([el[1] for el in train_set])
>
> is a bottleneck and the performance hit may be huge for big datasets. Use the dataset.targets field instead.
>
> Also, consider including a regression unit test to check that everything is in order!

Thanks for the review!

Yes, I'll make the edit regarding dataset.targets.
I'm not sure how to add a regression test, since the function is specifically meant for classification (as the docstring also suggests).

@AndreaCossu (Collaborator)

I think a simple unit test checking for the correct presence of the labels in the final dataset and for the total number of patterns would be enough. You can add it inside this test file: create a fake dataset and check that concat_datasets_sequentially returns a correct dataset.
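
For instance, a possible shape for such a test (a sketch only, with illustrative names; not necessarily the test that was actually added to the PR):

import unittest

import torch
from torch.utils.data import TensorDataset
from avalanche.benchmarks.utils import AvalancheDataset, \
    concat_datasets_sequentially


class ConcatDatasetsSequentiallyTests(unittest.TestCase):
    def test_classes_are_remapped(self):
        # Helper to build a fake classification dataset with labels in [low, high).
        def make_dataset(low, high):
            return AvalancheDataset(TensorDataset(
                torch.randn(20, 10), torch.randint(low, high, (20,))))

        train = [make_dataset(0, 2), make_dataset(2, 4), make_dataset(0, 2)]
        test = [make_dataset(0, 2), make_dataset(2, 4), make_dataset(0, 2)]

        final_train, final_test, classes = concat_datasets_sequentially(
            train, test)

        # Per-dataset classes should be remapped to a contiguous range.
        self.assertEqual([[0, 1], [2, 3], [4, 5]], classes)
        # No pattern should be lost by the concatenation.
        self.assertEqual(sum(len(d) for d in train), len(final_train))
        self.assertEqual(sum(len(d) for d in test), len(final_test))
        # Every remapped label should actually appear in the final dataset.
        self.assertEqual(set(range(6)), {int(el[1]) for el in final_train})


if __name__ == "__main__":
    unittest.main()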

@coveralls commented Dec 25, 2021

Pull Request Test Coverage Report for Build 1649403949

  • 19 of 19 (100.0%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.007%) to 79.833%

Files with Coverage Reduction | New Missed Lines | %
avalanche/benchmarks/scenarios/generic_definitions.py | 2 | 85.37%

Totals (Coverage Status):
  • Change from base Build 1621854592: 0.007%
  • Covered Lines: 11464
  • Relevant Lines: 14360

💛 - Coveralls

@ashok-arjun (Contributor, Author) commented Dec 25, 2021

@lrzpellegrini @AndreaCossu Regarding dataset.targets: it gives a list of tensors containing the class of each example in the dataset. Here we want the unique classes of the dataset, so we need to convert it to a set. But converting dataset.targets directly to a set does not work.

Illustration from the example used above:

print(type(dataset.targets))
print(dataset.targets)
<class 'list'>
[tensor(0), tensor(1), tensor(1), tensor(1), tensor(0), tensor(1), tensor(0), tensor(1), tensor(1), tensor(0), tensor(0), tensor(1), tensor(1), tensor(1), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0)]
print(type(set(dataset.targets)))
print(set(dataset.targets))
<class 'set'>
{tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(1), tensor(0), tensor(0), tensor(1), tensor(0), tensor(1), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(1)}

As seen, dataset.targets cannot be converted into a set directly: the 0-dim tensors are hashed by object identity rather than by value, so duplicates are not removed and this leads to errors / wrong results.

On the other hand, mapping the targets list through .item() and then performing the same operation returns the expected result:

classes = set(map(lambda x: x.item(), dataset.targets))
print(type(classes))
print(classes)
<class 'set'>
{0, 1}

Is there a better way to get the unique classes of a dataset?

@lrzpellegrini (Collaborator)

Hi @ashok-arjun. Converting using .item() works fine when the targets are PyTorch Tensors, but it doesn't work when they are plain Python ints (which have no .item() method).

I recommend using:

set(map(int, dataset.targets))

which should work fine in both cases.

Alas, there is no better way to achieve this.
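
For instance, a quick check (illustrative only) that the int conversion handles both cases:

import torch

# Targets as 0-dim tensors (e.g. coming from a TensorDataset)...
tensor_targets = [torch.tensor(0), torch.tensor(1), torch.tensor(1)]
print(set(map(int, tensor_targets)))  # {0, 1}

# ...and targets as plain Python ints, where .item() would raise AttributeError.
int_targets = [0, 1, 1]
print(set(map(int, int_targets)))  # {0, 1}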

@ashok-arjun (Contributor, Author)

@lrzpellegrini @AndreaCossu I've added the tests.

@ContinualAI-bot (Collaborator)

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

tests/test_avalanche_dataset.py:1329:5: E303 too many blank lines (2)
tests/test_avalanche_dataset.py:1331:63: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1331:81: E501 line too long (97 > 80 characters)
tests/test_avalanche_dataset.py:1332:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1332:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1332:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1333:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1333:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1333:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1334:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1334:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1334:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1337:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1337:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1338:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1338:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1339:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1339:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1340:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1340:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1343:81: E501 line too long (84 > 80 characters)
tests/test_avalanche_dataset.py:1356:1: E302 expected 2 blank lines, found 1
3       E128 continuation line under-indented for visual indent
8       E231 missing whitespace after ','
1       E302 expected 2 blank lines, found 1
1       E303 too many blank lines (2)
9       E501 line too long (97 > 80 characters)

@ContinualAI-bot (Collaborator)

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

tests/test_avalanche_dataset.py:1332:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1333:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1334:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1335:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1336:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1337:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1341:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1343:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1345:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1347:49: E127 continuation line over-indented for visual indent
7       E127 continuation line over-indented for visual indent
3       E128 continuation line under-indented for visual indent

@ashok-arjun (Contributor, Author)

@AndreaCossu @lrzpellegrini Can you please tell me how to resolve these two errors?

@AndreaCossu (Collaborator)

Thanks @ashok-arjun! I adjusted the line indentation highlighted in the comment above. It looks good now; we can merge.

@AndreaCossu AndreaCossu merged commit bf67e20 into ContinualAI:master Jan 3, 2022
@ashok-arjun (Contributor, Author)

@AndreaCossu Thanks a ton for the help! :))
