Using a dictionary to generate blacklists #43

Closed
jaumeortola opened this issue Aug 6, 2019 · 10 comments

@jaumeortola
Contributor

To generate a good blacklist, a dictionary or spell-checker can be used.

We can add to the blacklist all the misspelled words (or unusual proper nouns) we find on Wikipedia.

The main problem with this approach is word tokenization. Tokenization must be applied consistently in every step of the process (see #41), including the dictionary lookup. We need to know in advance how the dictionary tokenizes words, and if that differs from our own tokenization, some adjustment may be necessary.

I'm very skeptical that you will be able to achieve this kind of language expertise for more than a few languages. Using existing grammar and spelling checkers is a better solution.
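For illustration, here is a minimal sketch of the dictionary lookup described above. It is not code from this repo: `tokenize`, `build_blacklist`, and the token pattern are hypothetical, and the dictionary is assumed to be a plain list of word forms. The point it tries to show is that the same tokenizer must be applied to the text and must match how the dictionary splits words.

```python
import re

# Hypothetical tokenizer: runs of letters, optionally joined by an
# apostrophe or hyphen. This rule is an assumption; it has to match
# the tokenization the dictionary itself was built with.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_blacklist(sentences, dictionary):
    """Return every token in the sentences that the dictionary lacks."""
    known = {word.lower() for word in dictionary}
    unknown = set()
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in known:
                unknown.add(token)
    return sorted(unknown)


# Toy data, made up for the example.
sentences = ["The cat sat on the mat.", "Teh cat visited Zzyzx."]
dictionary = ["the", "cat", "sat", "on", "mat", "visited"]
blacklist = build_blacklist(sentences, dictionary)
print(blacklist)
```

With a real spell-checker the set membership test would be replaced by that checker's own lookup, which is where the language-specific tokenization issues tend to show up.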

@nukeador

nukeador commented Aug 6, 2019

Is this something we should integrate into the script, or could we use it externally to generate lists, as we do with the external word-usage script?

@jaumeortola
Contributor Author

It will be very language-dependent (dependent on the resources available for each language and on word tokenization). So let's keep it external.

@nukeador

nukeador commented Aug 6, 2019

Great. Are these dictionaries more helpful for generating blacklists or whitelists? How would that work?

If it's for blacklisting, then we only need to improve our docs to explain how to use the dictionaries, when available, in combination with @dabinat's cvtools (or integrate them into these tools?).

@jaumeortola
Contributor Author

I guess a blacklist is the lesser evil.

The blacklist creation should be an independent script.

@nukeador

nukeador commented Aug 7, 2019

@dabinat is this something that can potentially be integrated into your tool?

@dabinat

dabinat commented Aug 7, 2019

@nukeador There’s no harm in adding the feature, but with > 1 million sentences, it seems like maintaining a blacklist would be a lot of work.

@nukeador

nukeador commented Aug 7, 2019

What do you propose?

I don't think we need to maintain an up-to-date one ourselves, but rather provide the tools for communities to generate one. A blacklist has proven key to improving the quality of the final extraction.

@dabinat

dabinat commented Aug 7, 2019

I remember you looked at filtering little-used words before but it filtered too many. Perhaps combine that with a letter sequence check and/or word length check so it’s only filtering out the least-pronounceable ones.
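A rough sketch of what such a combined filter could look like, purely as an illustration. The function names, the vowel set, the consonant-run length of 4, and the length limit of 12 are all assumptions picked for the example, and they would need tuning per language.

```python
import re
from collections import Counter

# Assumption: an English-like vowel set. Any other letter, including
# accented ones, is treated as a consonant by this heuristic.
VOWELS = "aeiouy"
CONSONANT_RUN = re.compile(rf"[^{VOWELS}\W\d_]{{4,}}")
WORD_RE = re.compile(r"[^\W\d_]+")


def hard_to_pronounce(word, max_len=12):
    """Flag very long words and words with 4+ consonants in a row."""
    return len(word) > max_len or CONSONANT_RUN.search(word) is not None


def rare_unpronounceable(sentences, max_count=1, max_len=12):
    """Blacklist only words that are both rare and hard to pronounce."""
    counts = Counter(
        word for sentence in sentences for word in WORD_RE.findall(sentence.lower())
    )
    return sorted(
        word
        for word, count in counts.items()
        if count <= max_count and hard_to_pronounce(word, max_len)
    )


# Toy corpus, made up for the example.
corpus = [
    "The cat saw the dog.",
    "Szczecin is far from the cat.",
    "Incomprehensibilities abound.",
]
flagged = rare_unpronounceable(corpus)
print(flagged)
```

Rare but ordinary words such as "dog" or "abound" pass through here, because rarity alone is no longer enough to get a word blacklisted.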

@nukeador

nukeador commented Aug 7, 2019

Filtering less-used words proved to be OKish for German and Spanish; it's a matter of deciding where to put the cutoff. But yes, combining it with other methods will definitely yield more high-quality sentences.
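One way to reason about where to put the cutoff is to count how many sentences survive at each candidate value. The sketch below is only an illustration under assumptions: `sentences_kept_by_cutoff` is a hypothetical helper, the tokenizer is a bare letter-run regex, and a sentence is dropped as soon as any of its words occurs fewer than `cutoff` times in the corpus.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[^\W\d_]+")


def sentences_kept_by_cutoff(sentences, cutoffs):
    """Map each cutoff to the number of sentences with no word below it."""
    tokenized = [WORD_RE.findall(sentence.lower()) for sentence in sentences]
    counts = Counter(word for words in tokenized for word in words)
    kept = {}
    for cutoff in cutoffs:
        kept[cutoff] = sum(
            1 for words in tokenized if all(counts[word] >= cutoff for word in words)
        )
    return kept


# Toy corpus, made up for the example.
corpus = ["the cat sat", "the cat ran", "the dog sat", "a zyzzyva ran"]
report = sentences_kept_by_cutoff(corpus, cutoffs=[1, 2, 3])
print(report)
```

On a real extraction the interesting part is the shape of this curve: a cutoff just before the number of kept sentences drops sharply is a plausible starting point for each language.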

@MichaelKohler
Member

There are multiple ways to come up with the blacklist. I'd like to keep this simple and keep the concern of actually generating the blacklist out of this repo. However, I'm happy to take a PR that adds a method other than the existing one to the README.
