[BUG] #59
Comments
Hi @BradenAnderson, I will be honest here: based on the cases I checked, I thought it was working okay, though not great. Your statement:
I will not deny this claim. There is still a lot of work to be done, and that is why the version is 0.x. Possible investigation points for very bad corrections could be the
The default model (bert-base-cased) might not have been trained on tweets, whose syntax is very different from, for example, news text. Expecting the library to work out of the box on the entire set of use cases is a big ask. Try to debug the code (it is not that complex), or if you want you can create a fork and play around with the code to see what is causing the issue.
Hi Raajat,
Thanks for providing this response. I was honestly thinking that after I described the issue, whoever responded would point out some error in my implementation that was leading to these results. I have tried to debug with no luck, but I will look through the links and information you provided and try again. If I have some extra time, it would be fun to fork and try some more in-depth debugging.
Hope my original post didn't come across as harsh. I was really thinking that I had just been using the library wrong, so in my description I was trying to give lots of detail on what I had done and the results. Like you said, you are taking on a big challenge, as building something like this is no easy task. If I can't get it working this time around, I'd be happy to try again sometime in the future. I'll keep looking at it, and if I come up with any insight I'll send over another message.
Thanks again for the response.
-Braden
@BradenAnderson, I finally got the time to look at your question again. I think you should try passing one of these models to the pipeline through the model_name config key:

import html
import contextualSpellCheck  # importing registers the "contextual spellchecker" factory
import spacy

nlp = spacy.load("en_core_web_sm")
# max_edit_dist is a good parameter to play with and see the effect on the results
nlp.add_pipe("contextual spellchecker", config={"model_name": "vinai/bertweet-base", "max_edit_dist": 4})
sent = html.unescape(<SENTENCE>)
doc = nlp(sent)

I programmatically checked 15 sentences; the results are in this table: Output Table

For the example you mentioned earlier:

doc = nlp("@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #antalya #turkey sunday #throwback love happy happy love happy love")
print(doc._.suggestions_spellCheck)  # {antalya: 'italy'}
print(doc._.outcome_spellCheck)  # @user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #italy #turkey sunday #throwback love happy happy love happy love

I hope this works for your use case. Try out different models and different edit distances as well, and do let me know your observations.
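To get a feel for what a max_edit_dist of 4 allows, here is a minimal sketch of an edit-distance cutoff. It assumes the threshold is a plain Levenshtein distance between the original word and each candidate, which this thread does not confirm; `levenshtein` and `within_edit_distance` are hypothetical helper names, not part of contextualSpellCheck.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # prev[j] is the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def within_edit_distance(word: str, candidates: List[str], max_edit_dist: int) -> List[str]:
    """Keep only the candidates within `max_edit_dist` edits of `word` (assumed cutoff rule)."""
    return [c for c in candidates if levenshtein(word, c) <= max_edit_dist]


print(levenshtein("antalya", "italy"))  # 3
print(levenshtein("flirt", "#"))        # 5
```

Under that assumption, "italy" sits 3 edits from "antalya" and so passes a cutoff of 4, while a suggestion such as "#" for "flirt" sits 5 edits away and would be filtered out.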
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Describe the bug
I apologize in advance if the issue I am about to describe is simply some kind of user error rather than an actual bug. Please understand this is my first time using spaCy and contextualSpellCheck. I believe I am using them correctly, but there is always the chance I am not.
That said, my application uses contextualSpellCheck to check the spelling in tweets and recommend fixes for misspelled words. In doing this, I have found that the spelling corrections are almost always incorrect, and often completely illogical.
For example, in the tweet:
"@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #antalya #turkey sunday #throwback love happy happy love happy love"
contextual spell check indicates the word "flirt" is misspelled (which it is not) and recommends the illogical spelling correction of "#".
Please see the image below for a few more examples.
I have a dataset of over 20k tweets, and have created a function that will process a given number of these tweets using Spacy and contextualSpellCheck. The function stores all top spelling correction options in a csv file. Using my function (link provided below) you can easily reproduce this issue and create as many examples of these incorrect spelling suggestions as needed (using my code will make it very easy to produce the problem).
To Reproduce
Expected behavior
I expected the spelling recommendations to be at least reasonable. In many cases spelling suggestions involve changing a correctly spelled readable word to simply a punctuation mark like "." or "#".
Version (please complete the following information):
Additional information
Please be aware that the way I have structured the code to capture all of this information about what contextualSpellCheck is doing requires a lot of RAM. I do not recommend running the test driver function on more than 50 tweets at one time.
As I mentioned at the beginning I am inexperienced with both Spacy and contextualSpellCheck. I have shared my current use case and implementation and I welcome any advice on how to better use these tools. Beyond getting spellcheck working, I want to find a way to process all of these tweets without generating so many doc objects, as I believe this is what is causing so much RAM usage, which has led to slow performance and crashes.
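One way to keep memory flat is to write each tweet's suggestions to the CSV as soon as they are computed, instead of collecting doc objects first. The sketch below is only an illustration under assumptions: `stream_suggestions` and `write_suggestions` are hypothetical names, and `check` stands for any callable that maps a tweet to a plain dict of word to suggestion.

```python
import csv
from typing import Callable, Dict, Iterable, Iterator, Tuple


def stream_suggestions(
    tweets: Iterable[str],
    check: Callable[[str], Dict[str, str]],
) -> Iterator[Tuple[int, str, str]]:
    """Yield (tweet_index, word, suggestion) one tweet at a time.

    Only plain strings leave this function, so nothing produced by
    `check` for an earlier tweet has to stay in memory.
    """
    for idx, text in enumerate(tweets):
        for word, suggestion in check(text).items():
            yield idx, word, suggestion


def write_suggestions(
    tweets: Iterable[str],
    check: Callable[[str], Dict[str, str]],
    path: str,
) -> int:
    """Stream every suggestion straight to a CSV file; return the number of rows written."""
    rows = 0
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["tweet_index", "word", "suggestion"])
        for row in stream_suggestions(tweets, check):
            writer.writerow(row)
            rows += 1
    return rows
```

With the pipeline from this thread, `check` might be something like `lambda t: {str(k): v for k, v in nlp(t)._.suggestions_spellCheck.items()}`. That wrapper is an untested assumption; converting the keys to strings is what lets each doc be released once its row is written.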
Despite all of that, as far as I can tell the contextualSpellCheck is not giving reasonable spelling recommendations for this tweet processing application. I have dedicated a significant amount of time to troubleshooting this, and I finally decided to raise the issue with you guys. I would really like to use your tool if it can be used for this task. Please help me understand if this is an actual bug, or some kind of user error.
Thank you,
Braden