Context snippet returns first occurrence even if the word appears as a substring #5

Open
sethdandridge opened this issue Jun 3, 2019 · 1 comment


sethdandridge commented Jun 3, 2019

You have a small bug in NYT-first-said.parsers.simple_scrape.context: if the word appears as a substring of a longer word before it appears on its own, the context snippet is built around that first occurrence and not the standalone word.

This bug shows up when a new word appears first in the plural (with an s at the end) and then in the singular: the snippet always returns the context of the plural, since str.find() returns the index of the first occurrence. See: https://twitter.com/NYT_first_said/status/1135591139413778433

One possible fix would be to find the shortest word (token) in the article that contains the new word and use that to determine the snippet:

def context(content, word):
    # collect every whitespace-delimited token that contains the new word
    tokens_containing_word = [token for token in content.split() if word in token]
    # you also might want to write a custom key function here that calculates length after
    # removing punctuation, otherwise "crocodyliforms" is the same length as "crocodyliform."
    context_token = min(tokens_containing_word, key=len)
    loc = content.find(context_token)
    # existing logic proceeds...
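To make the punctuation caveat in the comments concrete, here is a rough sketch of the same shortest-token idea with a length comparison that ignores surrounding punctuation. The function name find_context_start is hypothetical and not part of the repo. It also records each token's offset while scanning, on the assumption that a second content.find() call could land on the longer token again when the shortest token is a prefix of an earlier one (as "crocodyliform" is of "crocodyliforms").

```python
import re
import string


def find_context_start(content, word):
    """Return the offset of the shortest token containing `word`, or -1.

    Hypothetical helper: token length is measured after stripping leading
    and trailing ASCII punctuation (string.punctuation does not cover
    typographic quotes), and ties go to the earliest token.
    """
    best = None  # (length without punctuation, offset in content)
    # \S+ yields the same tokens as content.split(), plus their offsets.
    for match in re.finditer(r"\S+", content):
        token = match.group()
        if word not in token:
            continue
        stripped_len = len(token.strip(string.punctuation))
        if best is None or stripped_len < best[0]:
            best = (stripped_len, match.start())
    return -1 if best is None else best[1]
```

The returned offset could stand in for loc in the snippet above; returning -1 when nothing matches mirrors str.find(), which is a choice made for this sketch and not something the existing code is known to rely on.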
@MaxBittker
Owner

I agree with your analysis & solution, thanks for opening the detailed issue!

I don't have time to fix this right now, but I will get to it eventually. I suspect there are some edge cases related to the way I split/tokenize.
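
For reference, one possible way to avoid depending on the tokenizer at all would be a regex search that only accepts the word when it is not flanked by other word characters. This is only a sketch under that assumption: standalone_offset is a made-up name, and it has its own edge cases, since a hyphenated compound such as "crocodyliform-like" still counts as a standalone match.

```python
import re


def standalone_offset(content, word):
    """Return the offset of the first standalone occurrence of `word`, or -1.

    Hypothetical helper: "standalone" here means not directly preceded or
    followed by a word character (letters, digits, underscore).
    """
    pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)")
    match = pattern.search(content)
    return match.start() if match else -1
```

With this approach the plural is skipped because the trailing "s" is a word character, while punctuation directly after the word does not block the match.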
