Truncating long documents #46

Open
juhoinkinen opened this issue May 17, 2021 · 5 comments

@juhoinkinen

Hi,
I found out that when using YAKE for long documents, it can be advantageous to truncate them in advance.

We have a test set of theses and dissertations (766 documents averaging 196k characters, or 22k words). When those documents are used as a gold standard for evaluating YAKE (or its integration in our application), an F1@5 score of 0.29 is reached. However, if the documents are first truncated to a fixed length of 15000 characters, a better score of 0.33 is reached.

Since this is such a simple way to possibly improve results, maybe a parameter/option for truncating input text could be added directly to YAKE? Or, better yet, could the term position feature be tuned to suit long texts better, for example by somehow giving even more importance to the beginning of the document?
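
The truncation described above can be sketched in a few lines of Python. This is only an illustration, not an existing YAKE option: `truncate_text`, `extract_keywords_truncated`, and `demo_with_yake` are hypothetical helper names, and the 15000-character default simply mirrors the figure reported in this comment. The only YAKE calls used are `yake.KeywordExtractor(...)` and its `extract_keywords(text)` method.

```python
def truncate_text(text: str, max_chars: int = 15000) -> str:
    """Return at most max_chars characters, preferring a cut at whitespace."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the cut falls inside a word, back up to the last whitespace (if any).
    if not text[max_chars].isspace():
        last_space = max(cut.rfind(" "), cut.rfind("\n"))
        if last_space > 0:
            cut = cut[:last_space]
    return cut.rstrip()


def extract_keywords_truncated(text, extractor, max_chars=15000):
    """Truncate first, then delegate to any object with extract_keywords(text)."""
    return extractor.extract_keywords(truncate_text(text, max_chars))


def demo_with_yake(long_text: str):
    """Hypothetical usage with YAKE; requires the yake package to be installed."""
    import yake  # imported lazily so the helpers above work without YAKE

    extractor = yake.KeywordExtractor(lan="en", n=3, top=5)
    for result in extract_keywords_truncated(long_text, extractor):
        print(result)
```

Cutting at whitespace rather than at an exact character offset avoids feeding a partial word to the extractor. Whether 15000 characters is a good limit for other corpora is an open question; it would need to be tuned against a gold standard as was done here.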

@prateekkrjain

@juhoinkinen ,

I also think this is an issue, because YAKE has the T_position feature, which is based on the indices of the sentences a term was found in, under the hypothesis that the most important words appear at the top of the document.

So any term that appears more frequently towards the end of the document will be ranked as less important. In an ML-based research paper, for example, terms such as "metrics", "accuracy", and "precision" appear mainly towards the end.

But how do you plan to merge the lists of keywords we get from the segmented documents?
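
To make the position feature mentioned above concrete, here is a minimal sketch based on the formula given in the YAKE paper, T_position = ln(ln(3 + Median(Sen_t))), where Sen_t is the set of sentence indices in which the term occurs. The function name `t_position` and the example indices are made up for illustration, and the library's actual implementation may differ in details.

```python
import math
from statistics import median


def t_position(sentence_indices):
    """Position feature as given in the YAKE paper: ln(ln(3 + median(Sen_t)))."""
    return math.log(math.log(3 + median(sentence_indices)))


# A term seen only in the first few sentences (hypothetical indices).
early = t_position([0, 1, 2])

# A term seen only near the end of a long document (hypothetical indices).
late = t_position([950, 975, 990])

# Larger values count against a term, since YAKE treats lower final scores
# as more relevant.
print(f"early={early:.3f}  late={late:.3f}")

# The double logarithm flattens quickly: doubling the sentence index deep
# inside a long document changes the feature only slightly.
print(f"sentence 1000: {t_position([1000]):.3f}  sentence 2000: {t_position([2000]):.3f}")
```

Because the feature grows so slowly for large indices, most of a very long document ends up with nearly the same position weight, which may be one reason truncation changes the results.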

@arianpasquali
Contributor

Hi @juhoinkinen and @prateekkrjain.
This is an interesting topic to discuss, @rncampos.

@arianpasquali
Contributor

Hi @juhoinkinen.

Providing the parameter for truncating is something to consider. Would you be willing to suggest a PR for that?

@arianpasquali
Contributor

@prateekkrjain

In this case I would probably break the document into sections and process each section separately.
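
One possible way to handle sections separately and then merge the results is sketched below. Nothing here is part of YAKE: `split_into_segments` and `merge_keywords` are hypothetical helpers, splitting on blank lines is an assumption about the input, and the sketch assumes each per-segment result is a list of `(keyword, score)` pairs where a lower score means a more relevant keyword.

```python
def split_into_segments(text, max_chars=15000):
    """Pack paragraphs (separated by blank lines) into segments of about max_chars.

    A single paragraph longer than max_chars becomes its own oversized segment.
    """
    segments, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        if current and size + len(paragraph) > max_chars:
            segments.append("\n\n".join(current))
            current, size = [], 0
        current.append(paragraph)
        size += len(paragraph) + 2  # +2 for the blank-line separator
    if current:
        segments.append("\n\n".join(current))
    return segments


def merge_keywords(per_segment_results, top=5):
    """Merge (keyword, score) lists, keeping the best (lowest) score per keyword."""
    best = {}
    for results in per_segment_results:
        for keyword, score in results:
            key = keyword.lower()
            if key not in best or score < best[key]:
                best[key] = score
    return sorted(best.items(), key=lambda item: item[1])[:top]
```

Keeping the minimum score is only one merging rule; averaging, or counting in how many segments a keyword appears, are alternatives. Scores from different segments are computed from different statistics, so comparing them directly is a simplification.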

@juhoinkinen
Author

> Providing the parameter for truncating is something to consider. Would you be willing to suggest a PR for that?

At the moment I can't, but if I have more time at some point I could take a look at this.
