Use set as container of uris instead of list in DocumentFile #510

Merged

@juhoinkinen juhoinkinen commented Aug 12, 2021

Makes the type of the uris container in DocumentFile and DocumentDirectory the same, i.e. set.

Closes #502.

@juhoinkinen juhoinkinen added this to the 0.54 milestone Aug 12, 2021
@juhoinkinen
Member Author

Simply changing the list to a set in DocumentFile makes an SVC test fail occasionally (about half the time). When a document has multiple subjects, training no longer necessarily takes "arkeologit" as the target subject; a random one is taken instead. Previously "arkeologit" was always the target subject because it is the first subject listed in the training file, e.g. in here.
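
The order dependence described above can be illustrated with a small sketch. It is not Annif's SVC code: the function names `first_listed_subject` and `stable_subject` are hypothetical, and plain labels stand in for real URIs. It relies only on the fact that a Python list preserves insertion order, whereas a set of strings has no guaranteed iteration order (string hashing is randomized per process unless PYTHONHASHSEED is fixed).

```python
def first_listed_subject(uris):
    """Return the first subject of an ordered sequence (list or tuple)."""
    return uris[0]


def stable_subject(uris):
    """Return the same subject on every run for any iterable of URI strings.

    Taking the minimum avoids relying on set iteration order, which is not
    guaranteed to be the same between interpreter runs.
    """
    return min(uris)


# Hypothetical subject labels, used here in place of real URIs.
as_list = ["arkeologit", "zikkuratit", "arkeologia"]
as_set = set(as_list)

print(first_listed_subject(as_list))  # always the first listed subject
print(stable_subject(as_set))         # always the lexicographically smallest
```

Any rule that does not depend on iteration order would serve; taking the minimum is just the simplest one to show.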

Increasing the number of requested subjects to 50 helps to ensure "arkeologit" is always one of the suggested subjects (even 40 is not enough), but I wonder if there would be a better way to handle this?

One possibility would be to tweak the SVC training to work with many target subjects instead of taking a random one, by making as many copies of the text as there are URIs, with something like:

for uri in doc.uris:
    # one training example per subject URI
    texts.append(doc.text)
    classes.append(uri)
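
A self-contained sketch of this idea follows. The `Document` namedtuple and the `expand_training_pairs` function are hypothetical stand-ins, not Annif's classes. Sorting the URIs is an added assumption that keeps the output order reproducible when `uris` is a set.

```python
from collections import namedtuple

# Hypothetical stand-in for a corpus document: a text plus a set of subject URIs.
Document = namedtuple("Document", ["text", "uris"])


def expand_training_pairs(docs):
    """Emit one (text, class) pair per subject URI of every document."""
    texts, classes = [], []
    for doc in docs:
        # Sort so the result does not depend on set iteration order.
        for uri in sorted(doc.uris):
            texts.append(doc.text)
            classes.append(uri)
    return texts, classes


docs = [
    Document("text about excavations", {"arkeologit", "arkeologia"}),
    Document("text about ziggurats", {"zikkuratit"}),
]
texts, classes = expand_training_pairs(docs)
print(list(zip(classes, texts)))
```

A document with N subjects contributes N identical training texts, so documents with many subjects weigh more in training than they do when a single target is chosen. That is one way such a change could shift results.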

@juhoinkinen juhoinkinen requested a review from osma August 12, 2021 13:02
@osma
Member

osma commented Aug 12, 2021

That's a bit unfortunate... In part it stems from using a test corpus in the SVC unit tests that is not really intended for multiclass classification.

The fix you suggested is possible but does it really help? It would change the way SVC works, perhaps not a lot, but it would affect the results in a paper on Libris DDC classification I'm writing :) Not that it matters much since this change would be in a specific release, and the paper can specify the version of Annif that was used.

Can you make a PR implementing this change (perhaps just adding to this one?), so that I could check that it won't adversely affect the results I'm getting with SVC on the Libris-DDC data set?

@juhoinkinen
Member Author

A simple option would be to change the suggest test to use some other input text that always gives a predictable subject.

@osma
Member

osma commented Aug 12, 2021

That's a great idea - do you have anything specific in mind? Maybe there is some subject in the training set that appears alone in some document.

It wouldn't hurt to have another test corpus for SVC and similar multiclass algorithms, but that's a bit more work...

@sonarcloud

sonarcloud bot commented Aug 12, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no information
Duplication: 0.0%

@codecov

codecov bot commented Aug 12, 2021

Codecov Report

Merging #510 (11dfc63) into master (02111ca) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #510   +/-   ##
=======================================
  Coverage   99.51%   99.51%           
=======================================
  Files          82       82           
  Lines        5771     5809   +38     
=======================================
+ Hits         5743     5781   +38     
  Misses         28       28           

Impacted Files                  Coverage Δ
annif/corpus/document.py        100.00% <100.00%> (ø)
tests/test_backend_svc.py       100.00% <100.00%> (ø)
annif/backend/nn_ensemble.py    99.40% <0.00%> (-0.60%) ⬇️
annif/backend/stwfsa.py         100.00% <0.00%> (+1.56%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 02111ca...11dfc63.

@juhoinkinen
Member Author

I thought it would have been trivial to produce an input text for which SVC always gives the same subject, but it was not... A document about zikkuratit is the only document with the "zikkuratit" subject, but to make the unit test pass reliably the input text had to contain only "zikkuratit"; adding basically anything else gave many other subject suggestions, and sometimes the "zikkuratit" subject was missing. However, using a short text including "arkeologia" and getting 20 suggestions seems to work.
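
The kind of check described here, asserting that an expected subject appears among the top N suggestions rather than at a fixed rank, might look like the sketch below. The `(uri, score)` tuple format, the `top_n_uris` helper, and the scores are made up for illustration and do not reflect Annif's test code or suggestion API.

```python
def top_n_uris(suggestions, limit):
    """Return the URIs of the `limit` highest-scoring (uri, score) pairs."""
    ranked = sorted(suggestions, key=lambda pair: pair[1], reverse=True)
    return {uri for uri, _score in ranked[:limit]}


# Invented scores; a real backend would produce these.
suggestions = [("zikkuratit", 0.12), ("arkeologia", 0.31), ("arkeologit", 0.27)]

# Membership in the top N tolerates rank changes between training runs.
assert "arkeologia" in top_n_uris(suggestions, limit=2)
```

A larger limit makes such a test more tolerant of training variation, at the cost of checking less about the ranking itself.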

(There is a similar indeterminacy issue with the stwfsa suggest test that I've encountered ~5 times, but I don't think it's related to any recent changes; it has been appearing for some time already.)

@juhoinkinen juhoinkinen marked this pull request as ready for review August 12, 2021 19:34
Member

@osma osma left a comment

So much effort for such a small change! Looks good to me.

@juhoinkinen juhoinkinen merged commit 5efde6b into master Aug 13, 2021
@juhoinkinen juhoinkinen deleted the issue502-discrepancy-in-types-of-document-subjects branch August 13, 2021 07:40
Successfully merging this pull request may close these issues.

Discrepancy in types of document subjects