Using a dictionary to generate blacklists #43

jaumeortola · 2019-08-06T13:37:16Z

To generate a good blacklist, a dictionary or spell-checker can be used.

We can add to the blacklist all the misspelled words (or unusual proper nouns) we find in Wikipedia.

The main problem with this approach is word tokenization. Tokenization must be applied consistently at every step of the process (see #41), including the dictionary lookup. The tokenization used by the dictionary must be known in advance, and if it differs from ours, some adjustment may be necessary.

I'm very skeptical that you will be able to achieve this kind of language expertise for more than a few languages. Using existing grammar and spelling checkers is a better solution.
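As a rough illustration of the dictionary lookup, here is a minimal sketch. The `tokenize` function, its regular expression, and the tiny word list are assumptions made up for the example, not the project's actual tokenizer or a real dictionary. The point it tries to show is that the same tokenizer is applied to both the dictionary entries and the sentences, so the lookups are comparable.

```python
import re

# Hypothetical tokenizer: letters, optionally joined by an internal
# apostrophe or hyphen. A real setup would have to match whatever
# tokenization the dictionary or spell-checker itself uses.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_blacklist(sentences, dictionary_words):
    """Return every token in the sentences that the dictionary does not know."""
    known = set()
    for entry in dictionary_words:
        # Tokenize dictionary entries with the same function as the text.
        known.update(tokenize(entry))

    unknown = set()
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in known:
                unknown.add(token)
    return sorted(unknown)


# Toy data standing in for a real dictionary and a Wikipedia extract.
dictionary = ["the", "cat", "sat", "on", "mat", "well-known"]
sentences = ["The cat sat on the mat.", "The qzxv sat on Wolverhampton."]

blacklist = build_blacklist(sentences, dictionary)
print(blacklist)
```

With a real spell-checker the `known` set would be replaced by that tool's own lookup, and the tokenizer would need adjusting to whatever that tool expects.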

nukeador · 2019-08-06T14:00:47Z

Is this something we should integrate into the script, or could we use it externally to generate lists, as we do with the external word usage script?

jaumeortola · 2019-08-06T15:52:41Z

It will be very language-dependent (dependent on the resources available for each language and on word tokenization). So let's keep it external.

nukeador · 2019-08-06T16:33:54Z

Great. Are these dictionaries more helpful for generating blacklists or whitelists? How would that work?

If it's for blacklisting, we should then only improve our docs to explain how to use them, when available, in combination with @dabinat's cvtools (or integrate them into these tools?)

jaumeortola · 2019-08-06T20:47:42Z

I guess a blacklist is the lesser evil.

The blacklist creation should be an independent script.

nukeador · 2019-08-07T12:27:40Z

@dabinat is this something that can potentially be integrated into your tool?

dabinat · 2019-08-07T17:34:45Z

@nukeador There’s no harm in adding the feature, but with > 1 million sentences, it seems like maintaining a blacklist would be a lot of work.

nukeador · 2019-08-07T18:13:20Z

What do you propose?

I don't think we need to maintain an updated one, but rather provide the tools for communities to generate one. It has proven key to improving the quality of the final extraction.

dabinat · 2019-08-07T20:32:15Z

I remember you looked at filtering little-used words before but it filtered too many. Perhaps combine that with a letter sequence check and/or word length check so it’s only filtering out the least-pronounceable ones.
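A possible sketch of that combination, purely as an illustration: a word is only blacklisted if it is rare and also looks hard to pronounce. The thresholds, the consonant-run pattern, and the function names are all assumptions invented for this example, and the consonant set is clearly language-dependent.

```python
import re
from collections import Counter

# All three values are assumed for illustration and would need tuning
# per language.
MAX_FREQ = 1      # words seen at most this often count as "rare"
MAX_LENGTH = 15   # words longer than this are treated as hard to read aloud
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}")


def hard_to_pronounce(word):
    """Heuristic: very long, or containing five or more consonants in a row."""
    return len(word) > MAX_LENGTH or CONSONANT_RUN.search(word) is not None


def rare_and_hard(words):
    """Blacklist only the words that are both rare and hard to pronounce."""
    counts = Counter(word.lower() for word in words)
    return sorted(
        word
        for word, count in counts.items()
        if count <= MAX_FREQ and hard_to_pronounce(word)
    )


words = (
    "the cat saw the zebra near the mxyzptlk and "
    "the pneumonoultramicroscopic lab"
).split()

result = rare_and_hard(words)
print(result)
```

Here "zebra" is just as rare as the two flagged words, but it passes both pronounceability checks, so it stays off the list.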

nukeador · 2019-08-07T20:41:57Z

Less-used words proved to be OKish for German and Spanish; it's a matter of deciding where to put the cut. But yes, combining this with other methods will definitely yield more quality sentences.
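To get a feel for where to put the cut, something like the following could be used to compare candidate thresholds. It is only a sketch: the toy corpus, the whitespace tokenization, and the `words_below_cut` helper are assumptions for the example, not the existing word usage script.

```python
from collections import Counter


def words_below_cut(counts, cut):
    """Return the words that occur fewer than `cut` times."""
    return sorted(word for word, count in counts.items() if count < cut)


# Toy corpus; a real run would read the extracted sentences instead.
corpus = [
    "el gato come",
    "el perro come",
    "el gato duerme",
    "un ornitorrinco duerme",
]

# Plain whitespace splitting is assumed here for brevity.
counts = Counter(word for sentence in corpus for word in sentence.lower().split())

for cut in (2, 3):
    print(cut, words_below_cut(counts, cut))
```

Raising the cut from 2 to 3 doubles the list in this toy corpus and starts to include ordinary words, which is the trade-off to watch for on real data.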

MichaelKohler · 2020-01-25T23:54:12Z

There are multiple ways to come up with the blacklist. I'd like to keep this simple here, and keep the concern of actually generating the blacklist out of this repo. However, I'm happy to take a PR which adds a method other than the existing one to the README.

MichaelKohler added the discussion, enhancement, extract-improvements, and rules labels on Jan 3, 2020

MichaelKohler closed this as completed Jan 25, 2020
