
[BUG] #59

Closed · BradenAnderson opened this issue Mar 29, 2021 · 4 comments
Labels: question, usage, wontfix

@BradenAnderson commented Mar 29, 2021
Describe the bug

I apologize in advance if the issue I am about to describe is simply some kind of user error rather than an actual bug. This is my first time using spaCy and contextualSpellCheck, and while I believe I am using them correctly, there is always the chance I am not.

That said, my application uses contextualSpellCheck to check the spelling in tweets and recommend fixes for misspelled words. In doing this, I have found that the spelling corrections are almost always incorrect, and oftentimes completely illogical.

For example, in the tweet:

"@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #antalya #turkey sunday #throwback love happy happy love happy love"

contextualSpellCheck flags the word "flirt" as misspelled (which it is not) and recommends the illogical correction "#".

Please see the image below for a few more examples.

[Image: illogical_spelling_corrections]

I have a dataset of over 20k tweets and have written a function that processes a given number of them using spaCy and contextualSpellCheck, storing every top spelling-correction option in a CSV file. Using this function (link provided below), you can easily reproduce the issue and generate as many examples of these incorrect suggestions as needed.

To Reproduce

Steps to reproduce the behavior:

1. Download the colab notebook here: https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/01_2_Data_Cleaning_Spacy_and_Spellcheck.ipynb

2. Download the tweet data here: https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Train_Test_Datasets/train_tweets_with_emojis_clean.csv

3. Download two more supporting data files from these two links:

https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Supporting_Data_Files/contractions.csv

https://github.com/BradenAnderson/Twitter-Sentiment-Analysis/blob/main/Supporting_Data_Files/sms_speak.csv

4. Run the colab notebook. The test driver is called in cell 47 and will generate an output CSV file showing all the recommendations contextualSpellCheck made. You can change the test "start" and "end" indexes in the test driver function call to run the test on a different set of tweets and create a new CSV file.

5. Inspect the output CSV file to determine whether the spelling suggestions are reasonable. The output CSV file will have a name formatted as:

num1_to_num2_spellcheck_test_results.csv

where num1 and num2 are the start and end tweet indexes you passed to the test driver function.

Expected behavior
I expected the spelling recommendations to be at least reasonable. In many cases, the suggestions involve changing a correctly spelled, readable word to a punctuation mark such as "." or "#".

Version (please complete the following information):

  • contextualSpellCheck version: 0.4.1
  • spaCy version: 3.0.5

Additional information

Please be aware that the way I have structured the code to capture all of this information about what contextualSpellCheck is doing requires a lot of RAM. I do not recommend running the test driver function on more than 50 tweets at a time.

As I mentioned at the beginning, I am inexperienced with both spaCy and contextualSpellCheck. I have shared my current use case and implementation, and I welcome any advice on how to use these tools better. Beyond getting spellcheck working, I want to find a way to process all of these tweets without generating so many Doc objects, as I believe this is what is causing the heavy RAM usage that has led to slow performance and crashes.
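
For what it's worth, one approach I have been experimenting with is spaCy's nlp.pipe, which streams documents in batches instead of building one Doc per nlp() call and keeping them all around. A rough sketch (the tweet list and batch size here are placeholders, not my actual code):

import spacy
import contextualSpellCheck

nlp = spacy.load("en_core_web_sm")
contextualSpellCheck.add_to_pipe(nlp)  # registers the spellchecker component

tweets = ["first tweet text", "second tweet text"]  # in practice, the ~20k tweets

# nlp.pipe yields Docs lazily in batches, so each processed Doc can be
# garbage-collected once its suggestions have been written out.
for doc in nlp.pipe(tweets, batch_size=32):
    if doc._.performed_spellCheck:
        print(doc._.suggestions_spellCheck)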

Despite all of that, as far as I can tell the contextualSpellCheck is not giving reasonable spelling recommendations for this tweet processing application. I have dedicated a significant amount of time to troubleshooting this, and I finally decided to raise the issue with you guys. I would really like to use your tool if it can be used for this task. Please help me understand if this is an actual bug, or some kind of user error.

Thank you,
Braden

BradenAnderson added the bug label Mar 29, 2021
@R1j1t (Owner) commented Mar 31, 2021

Hi @BradenAnderson. I will be honest here: based on the cases I checked, I thought it was working okay, though not great.

Your statement:

That said, my application uses contextualSpellCheck to check the spelling in tweets and recommend fixes for misspelled words. In doing this, I have found that the spelling corrections are almost always incorrect, and oftentimes completely illogical.

I will not deny this claim. There is still a lot of work to be done, and that is why the version is 0.x. Possible investigation points for very bad corrections are the [MASK]-filling BERT model and the spell-error identification (pending, unfixed in #44). At present, the library defaults to bert-base-cased. The current logic for spelling correction is as follows (see the sketch after the list):

  1. Provide a spaCy model: this breaks the sentence into tokens. As this model is trained on a particular language (tweet-specific models also exist), it knows the nuances.
  2. Check each token against the transformer model's vocab: if the token is not present, consider it a spelling error.
  3. Mask the OOV word and use the transformer model to predict words to replace the mask.
  4. Check the edit distance to see which candidate is closest syntactically.
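
A minimal sketch of those four steps (this is an illustration, not the library's actual implementation; the example sentence and the hand-rolled edit-distance function are just for demonstration):

import spacy
from transformers import AutoTokenizer, pipeline

MODEL = "bert-base-cased"
nlp = spacy.load("en_core_web_sm")                        # step 1: spaCy tokenization
vocab = AutoTokenizer.from_pretrained(MODEL).get_vocab()
fill_mask = pipeline("fill-mask", model=MODEL)

def edit_distance(a, b):
    # plain dynamic-programming Levenshtein distance (used in step 4)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

doc = nlp("Income was $9.4 milion compared to the prior year.")
for tok in doc:
    if tok.is_alpha and tok.text not in vocab:            # step 2: OOV => suspected misspelling
        masked = doc.text.replace(tok.text, "[MASK]", 1)
        candidates = fill_mask(masked)                    # step 3: BERT predicts the masked word
        best = min(candidates, key=lambda c: edit_distance(tok.text, c["token_str"]))
        print(tok.text, "->", best["token_str"])          # e.g. milion -> million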

The default model (bert-base-cased) was likely not trained on tweets, whose syntax is very different from, say, news text. Expecting the library to work out of the box on the entire set of use cases is a big ask; try to debug the code (it is not that complex), or if you want, you can create a fork and play around with the code to see what is causing the issue.

@BradenAnderson (Author) commented Apr 1, 2021

Hi Raajat,

Thanks for providing this response. I was honestly expecting that after I described the issue, whoever responded would point out some error I had made in my implementation that was leading to these results.

I have tried to debug with no luck, but I will look through the links and information you provided and try again. If I have some extra time, it would be fun to fork the project and do some more in-depth debugging.

I hope my original post didn't come across as harsh. I really thought I had just been using the library wrong, so in my description I tried to give plenty of detail on what I had done and the results. Like you said, you are taking on a big challenge; building something like this is no easy task. If I can't get it working this time around, I'd be happy to try again sometime in the future.

I'll keep looking at it, and if I come up with any insight I'll send over another message. Thanks again for the response.

-Braden

@R1j1t (Owner) commented May 29, 2021

@BradenAnderson, I finally got the time to look at your question again. I think you should try passing one of these models to contextualSpellCheck. See the pseudo-code below for reference:

import html
import spacy
import contextualSpellCheck  # importing registers the "contextual spellchecker" factory

nlp = spacy.load("en_core_web_sm")
# max_edit_dist is a good parameter to play with and see the effect on the results
nlp.add_pipe("contextual spellchecker", config={"model_name": "vinai/bertweet-base", "max_edit_dist": 4})

sent = html.unescape(<SENTENCE>)  # <SENTENCE> is a placeholder for your tweet text
doc = nlp(sent)
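
(For context: vinai/bertweet-base is a BERT model pretrained on English tweets, so its vocabulary should match tweet syntax far better than the default bert-base-cased.)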

I programmatically checked 15 sentences, and the results are in the table below:

Output Table

| ID in Dataset | original sentence | misspell | suggested |
|---|---|---|---|
| 18223 | the places i'll go. #photography #smileyface #smile #eggs #toilet #flush #food | | |
| 9544 | #siilyfaces #family #cousins #love #lasvegas #fremontstreet @ freemont st expeirence | | |
| 10450 | 2of2 needs to know about you being the problem and not the solution for our black community, @user like mateen. @user | | |
| 11018 | about to study for my next few #speeches and work on product development. | | |
| 6658 | astounded with #catfish. no one knows who they are really talking to over the net. #weird #naive #lonely #sick | | |
| 4562 | a sign of the evil to come... #epic #anger #shout #man #manga awork by anna riley awork art | | |
| 4029 | yayy!!! that show i definitely on my list of things!!! | | |
| 2519 | @user @user what a wonderful photo, i like the spontaneity of expressions | | |
| 21900 | i miss my boyfriend already #relationshipgoals #missing | | |
| 5333 | about to sta @user book thinking differently. families at my school are reading, too! #education | | |
| 2893 | @user #micommunity - launching on 20th june @user | | |
| 9225 | there becomes a time in life where you just got to stop caring and just go with what's meant to happen and be at peace with it. # truth | | |
| 17287 | can #lighttherapy help with or #depression? #altwaystoheal #healthy is #happy !! | | |
| 24312 | i imagine it would be a lot like this. #imaginary #conversations #life #style #lifestyle | | |
| 1558 | missed having cute costa dates;) | | |

For the example you mentioned earlier:

doc = nlp("@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #antalya #turkey sunday #throwback love happy happy love happy love")
print(doc._.suggestions_spellCheck)    #{antalya: 'italy'}
print(doc._.outcome_spellCheck)        #@user all #smiles when #media is silly joke flirt mischief excitement #pressconference in #italy #turkey sunday #throwback love happy happy love happy love
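
If it helps with debugging, the extension also exposes the ranked candidates behind each suggestion via doc._.score_spellCheck (per the project README), so you can see what the model considered before the final pick:

# continuing from the doc above
for token, candidates in doc._.score_spellCheck.items():
    print(token, candidates[:3])  # top candidates with their model scores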

I hope this works for your use case; try out different models and different edit distances as well. Do let me know your observations.

R1j1t added the question and usage labels and removed the bug label Jun 5, 2021
stale bot commented Jul 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the wontfix label Jul 5, 2021
stale bot closed this as completed Jul 12, 2021