Fixing issue #771: concat_datasets_sequentially discards classes #844

Merged
merged 10 commits into ContinualAI:master on Jan 3, 2022

Conversation

@ashok-arjun (Contributor) commented Dec 5, 2021

Fixes #771 raised by @AndreaCossu

The problem was that class_mapping was not built correctly, so it did not respect new_class_id = class_mapping[original_class_id].

Now, for each dataset, a list of size (largest original class index + 1) is created, and only the classes actually present in that dataset are filled in with their remapped IDs.
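
To illustrate the remapping idea, here is a minimal sketch (variable names are illustrative and not necessarily the exact ones in the patch):

# Sketch of the remapping logic: classes from each dataset are mapped to a
# contiguous range of IDs across all datasets.
next_remapped_id = 0
classes_per_dataset = []
for dataset_classes in [[0, 1], [2, 3], [4, 5], [0, 1]]:
    # class_mapping has size (largest original class index + 1); only the
    # classes actually present in this dataset are filled in.
    class_mapping = [-1] * (max(dataset_classes) + 1)
    remapped = []
    for original_class_id in sorted(dataset_classes):
        class_mapping[original_class_id] = next_remapped_id
        remapped.append(next_remapped_id)
        next_remapped_id += 1
    # Each sample's label is then remapped as
    # new_class_id = class_mapping[original_class_id].
    classes_per_dataset.append(remapped)

print(classes_per_dataset)  # [[0, 1], [2, 3], [4, 5], [6, 7]]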

Test:

import torch
from torch.utils.data import TensorDataset
from avalanche.benchmarks.utils import concat_datasets_sequentially, AvalancheDataset

#  4 training datasets and 4 test datasets with targets [0,1], [2,3], [4,5], [0,1] respectively.
#  The final output should remap the classes into the range [0-7].

train = [
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(2, 4, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(4, 6, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,)))),
]

test = [
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(2, 4, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(4, 6, (20,)))),
    AvalancheDataset(TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,)))),
]

final_train, final_test, classes = concat_datasets_sequentially(train, test)

# this is ok -> [[0, 1], [2, 3], [4, 5], [6, 7]]
print("Classes: ", classes) 

# before this fix, this printed {0, 1}; expected {0, 1, 2, 3, 4, 5, 6, 7}
print("Final Train: ", set([el[1] for el in final_train]))

Output:

Classes:  [[0, 1], [2, 3], [4, 5], [6, 7]]
Final Train:  {0, 1, 2, 3, 4, 5, 6, 7}

@AndreaCossu (Collaborator)

@lrzpellegrini can you check? seems ok to me!

@lrzpellegrini (Collaborator)

Hi there, line

dataset_classes = set([el[1] for el in train_set])

is a bottleneck and the performance hit may be huge for big datasets. Use the dataset.targets field instead.
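
For example, something along these lines (just a sketch; train_set here is an illustrative toy dataset):

import torch
from torch.utils.data import TensorDataset
from avalanche.benchmarks.utils import AvalancheDataset

# Hypothetical stand-in for one of the concatenated datasets.
train_set = AvalancheDataset(
    TensorDataset(torch.randn(20, 10), torch.randint(0, 2, (20,))))

# Reads only the precomputed targets field, so no sample is ever loaded,
# unlike set([el[1] for el in train_set]), which iterates the whole dataset.
dataset_classes = set(int(t) for t in train_set.targets)
print(dataset_classes)  # {0, 1}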

Also, consider including a regression unit test to check that everything is in order!

@ashok-arjun (Contributor, Author) commented Dec 23, 2021

> Hi there, line
>
> dataset_classes = set([el[1] for el in train_set])
>
> is a bottleneck and the performance hit may be huge for big datasets. Use the dataset.targets field instead.
>
> Also, consider including a regression unit test to check that everything is in order!

Thanks for the review!

Yes, I'll make the edit regarding dataset.targets.
I'm not sure how to add a regression test, since the function is specifically meant for classification (as the docstring also suggests).

@AndreaCossu (Collaborator)

I think a simple unit test checking for the correct presence of the labels in the final dataset and for the total number of patterns would be enough. You can add it inside this test file: create a fake dataset and check that concat_datasets_sequentially returns a correct dataset.
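
For instance, a possible shape for such a test (a sketch only, with illustrative names; not necessarily the test that was actually added to the PR):

import unittest

import torch
from torch.utils.data import TensorDataset
from avalanche.benchmarks.utils import AvalancheDataset, \
    concat_datasets_sequentially


class ConcatDatasetsSequentiallyTests(unittest.TestCase):
    def test_classes_are_remapped(self):
        # Helper to build a fake classification dataset with labels in [low, high).
        def make_dataset(low, high):
            return AvalancheDataset(TensorDataset(
                torch.randn(20, 10), torch.randint(low, high, (20,))))

        train = [make_dataset(0, 2), make_dataset(2, 4), make_dataset(0, 2)]
        test = [make_dataset(0, 2), make_dataset(2, 4), make_dataset(0, 2)]

        final_train, final_test, classes = concat_datasets_sequentially(
            train, test)

        # Per-dataset classes should be remapped to a contiguous range.
        self.assertEqual([[0, 1], [2, 3], [4, 5]], classes)
        # No pattern should be lost by the concatenation.
        self.assertEqual(sum(len(d) for d in train), len(final_train))
        self.assertEqual(sum(len(d) for d in test), len(final_test))
        # Every remapped label should actually appear in the final dataset.
        self.assertEqual(set(range(6)), {int(el[1]) for el in final_train})


if __name__ == "__main__":
    unittest.main()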

@coveralls commented Dec 25, 2021

Pull Request Test Coverage Report for Build 1649403949

  • 19 of 19 (100.0%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.007%) to 79.833%

Files with Coverage Reduction | New Missed Lines | %
avalanche/benchmarks/scenarios/generic_definitions.py | 2 | 85.37%

Totals (Coverage Status):
  • Change from base Build 1621854592: 0.007%
  • Covered Lines: 11464
  • Relevant Lines: 14360

💛 - Coveralls

@ashok-arjun (Contributor, Author) commented Dec 25, 2021

@lrzpellegrini @AndreaCossu Regarding dataset.targets: it gives a list of tensors containing the class of each example in the dataset. Here we want the unique classes of the dataset, so we need to convert it to a set. But converting dataset.targets directly to a set does not work.

Illustration from the example used above:

print(type(dataset.targets))
print(dataset.targets)
<class 'list'>
[tensor(0), tensor(1), tensor(1), tensor(1), tensor(0), tensor(1), tensor(0), tensor(1), tensor(1), tensor(0), tensor(0), tensor(1), tensor(1), tensor(1), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0)]
print(type(set(dataset.targets)))
print(set(dataset.targets))
<class 'set'>
{tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(1), tensor(0), tensor(0), tensor(1), tensor(0), tensor(1), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(1)}

As seen, dataset.targets cannot be converted into a set directly: the 0-dim tensors are hashed by object identity rather than by value, so duplicates are not removed and this leads to errors / wrong results.

On the other hand, mapping the targets list through .item() and then performing the same operation returns the expected result:

classes = set(map(lambda x: x.item(), dataset.targets))
print(type(classes))
print(classes)
<class 'set'>
{0, 1}

Is there a better way to get the unique classes of a dataset?

@lrzpellegrini (Collaborator)

Hi @ashok-arjun. Converting using .item() works fine when the targets are PyTorch Tensors, but it doesn't work when they are plain Python ints (which have no .item() method).

I recommend using:

set(map(int, dataset.targets))

which should work fine in both cases.

Alas, there is no better way to achieve this.
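
For instance, a quick check (illustrative only) that the int conversion handles both cases:

import torch

# Targets as 0-dim tensors (e.g. coming from a TensorDataset)...
tensor_targets = [torch.tensor(0), torch.tensor(1), torch.tensor(1)]
print(set(map(int, tensor_targets)))  # {0, 1}

# ...and targets as plain Python ints, where .item() would raise AttributeError.
int_targets = [0, 1, 1]
print(set(map(int, int_targets)))  # {0, 1}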

@ashok-arjun (Contributor, Author)

@lrzpellegrini @AndreaCossu I've added the tests.

@ContinualAI-bot (Collaborator)

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

tests/test_avalanche_dataset.py:1329:5: E303 too many blank lines (2)
tests/test_avalanche_dataset.py:1331:63: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1331:81: E501 line too long (97 > 80 characters)
tests/test_avalanche_dataset.py:1332:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1332:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1332:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1333:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1333:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1333:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1334:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1334:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1334:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1337:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1337:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1338:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1338:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1339:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1339:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1340:62: E231 missing whitespace after ','
tests/test_avalanche_dataset.py:1340:81: E501 line too long (96 > 80 characters)
tests/test_avalanche_dataset.py:1343:81: E501 line too long (84 > 80 characters)
tests/test_avalanche_dataset.py:1356:1: E302 expected 2 blank lines, found 1
3       E128 continuation line under-indented for visual indent
8       E231 missing whitespace after ','
1       E302 expected 2 blank lines, found 1
1       E303 too many blank lines (2)
9       E501 line too long (97 > 80 characters)

@ContinualAI-bot (Collaborator)

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

tests/test_avalanche_dataset.py:1332:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1333:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1334:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1335:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1336:17: E128 continuation line under-indented for visual indent
tests/test_avalanche_dataset.py:1337:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1341:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1343:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1345:49: E127 continuation line over-indented for visual indent
tests/test_avalanche_dataset.py:1347:49: E127 continuation line over-indented for visual indent
7       E127 continuation line over-indented for visual indent
3       E128 continuation line under-indented for visual indent

@ashok-arjun (Contributor, Author)

@AndreaCossu @lrzpellegrini Can you please tell me how to resolve these two errors?

@AndreaCossu (Collaborator)

Thanks @ashok-arjun! I adjusted the line indentation highlighted in the comment above. It looks good now; we can merge.

@AndreaCossu AndreaCossu merged commit bf67e20 into ContinualAI:master Jan 3, 2022
@ashok-arjun (Contributor, Author)

@AndreaCossu Thanks a ton for the help! :))
