Filtering Words not working in build_corpus method #38

Closed
ettoreaquino opened this issue Apr 27, 2023 · 1 comment

@ettoreaquino
Contributor

Description

While building a Corpus with litstudy.build_corpus(), I found that the remove_words argument does not behave as expected:

import litstudy

# db holds the documents loaded earlier (loading step not shown in this report)
remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]
Corpus = litstudy.build_corpus(docs=db,
                               remove_words=remove_words,
                               min_word_length=3,
                               min_docs=10,
                               max_docs_ratio=0.75,
                               max_tokens=1000,
                               replace_words=None,
                               custom_bigrams=None,
                               ngram_threshold=0.7)

Expected behavior

Passing a list of words in remove_words should filter them out of each document's word frequency vector, and therefore out of the Corpus object. This can be verified by checking that none of them appear in Corpus.dictionary.items():

remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]
assert [] == [item for item in Corpus.dictionary.items() if item[1] in remove_words]
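
For reference, here is a minimal pure-Python sketch of the filtering semantics I would expect. It is not litstudy's implementation: the helper names (filter_tokens, build_dictionary), the case-insensitive matching, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter

def filter_tokens(docs, remove_words):
    """Drop every token listed in remove_words (case-insensitive, an assumption)."""
    blocked = {w.lower() for w in remove_words}
    return [[t.lower() for t in doc if t.lower() not in blocked] for doc in docs]

def build_dictionary(docs):
    """Map integer ids to tokens, mirroring the (id, token) pairs checked above."""
    vocab = sorted({t for doc in docs for t in doc})
    return dict(enumerate(vocab))

# Hypothetical toy documents, already tokenised; not data from this report.
toy_docs = [
    ["Recent", "years", "saw", "growth", "in", "topic", "models"],
    ["The", "authors", "thank", "John", "Wiley", "for", "topic", "data"],
]
remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]

filtered = filter_tokens(toy_docs, remove_words)
frequencies = [Counter(doc) for doc in filtered]  # per-document word frequency vectors
dictionary = build_dictionary(filtered)

# Same check as in the report: no removed word may survive in the dictionary.
leftover = [item for item in dictionary.items() if item[1] in remove_words]
assert leftover == []
```

Printing leftover instead of asserting on it also shows exactly which words slipped through, which is more informative than a bare AssertionError.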

Observed behavior

The words are not removed at all:

remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]
assert [] == [item for item in Corpus.dictionary.items() if item[1] in remove_words]

Raises:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[28], line 2
      1 remove_words = ["institution", "authors", "licensee", "recent", "years", "john", "wiley"]
----> 2 assert [] == [item for item in Corpus.dictionary.items() if item[1] in remove_words]

AssertionError: 
@stijnh stijnh added the bug Something isn't working label Apr 28, 2023
@stijnh stijnh self-assigned this Apr 28, 2023
@stijnh
Member

stijnh commented May 1, 2023

Thanks. Fixed by #39.

@stijnh stijnh closed this as completed May 1, 2023