Discrepancy in types of document subjects #502

Closed
juhoinkinen opened this issue Jul 1, 2021 · 1 comment · Fixed by #510

@juhoinkinen
Member

PR #501 was a quick fix for training the SVC backend on a fulltext corpus, but it did not address the underlying cause:

...DocumentDirectory defines uris for documents as a set, while in SVC it was assumed that uris are a list or another subscriptable type (which is true for DocumentFile)

Right now I am not sure whether there is a reason for subjects to be a set from DocumentDirectory but a list from DocumentFile, so I'm just reporting this as a possible bug.
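
A minimal sketch of the failure mode, for illustration only. The variable and function names here are hypothetical stand-ins, not Annif's actual classes or code; the point is only that indexing works on a list and raises `TypeError` on a set.

```python
# Hypothetical example data: the same subject URIs as a list and as a set.
uris_from_file = ["http://example.org/subj/1", "http://example.org/subj/2"]
uris_from_directory = {"http://example.org/subj/1", "http://example.org/subj/2"}


def first_subject(uris):
    """Pick one subject by indexing, as code that assumes a list would do."""
    return uris[0]


def first_subject_safe(uris):
    """Pick one subject from any iterable of URIs.

    Sorting makes the choice deterministic for sets too. Note that this is
    an assumption of the sketch: for a list it returns the smallest URI,
    not necessarily the one in position 0.
    """
    return sorted(uris)[0]


print(first_subject(uris_from_file))  # indexing a list works

try:
    first_subject(uris_from_directory)
except TypeError as err:
    print("set is not subscriptable:", err)

# The iterable-based version accepts both container types.
print(first_subject_safe(uris_from_file))
print(first_subject_safe(uris_from_directory))
```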

@juhoinkinen juhoinkinen added the bug label Jul 1, 2021
@juhoinkinen juhoinkinen added this to the Short term milestone Jul 1, 2021
@osma
Member

osma commented Aug 3, 2021

You are right, this is an inconsistency. It could be resolved either way. And it should be resolved, since it already caused a bug in a release!

It's quite common to assume that the order of subjects doesn't matter, and this is the case for all the current algorithms (SVC being a special case, since it can handle only one subject per document). Also, the vector representation for subjects is order-agnostic. So I think it would make sense to change DocumentFile to use a set instead of a list.
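
To illustrate the order-agnostic point, here is a rough sketch of a multi-hot subject vector. The function `subjects_to_vector` and the toy vocabulary are assumptions made for this example; the real vector code in the project may look different.

```python
def subjects_to_vector(uris, vocabulary):
    """Multi-hot encode subject URIs against a fixed vocabulary order.

    Hypothetical helper: accepts any iterable of URIs (list or set).
    """
    present = set(uris)
    return [1 if uri in present else 0 for uri in vocabulary]


# Hypothetical three-subject vocabulary.
vocabulary = ["s1", "s2", "s3"]

from_list = subjects_to_vector(["s3", "s1"], vocabulary)
from_reordered_list = subjects_to_vector(["s1", "s3"], vocabulary)
from_set = subjects_to_vector({"s1", "s3"}, vocabulary)

# All three inputs produce the same vector, so the input order is irrelevant.
print(from_list, from_reordered_list, from_set)
```

Since the encoded result is identical for a list in any order and for a set, a set loses nothing that such a representation would use.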
