
Italian word blacklist #51

Merged

3 commits merged on Oct 8, 2019

Conversation

@Mte90 (Contributor) commented Sep 3, 2019

I zipped it because I had some issues uploading to GitHub.
The list was generated following the instructions in the readme.

  • I am getting 1001868 sentences, versus 21249410 on the previous run
  • blacklist created with the tool
Mte90 added 2 commits Sep 2, 2019
@nukeador (Contributor) commented Sep 4, 2019

Thanks for this.

Can we add the following information to this PR for the record?

  • How many sentences are you getting?
  • How did you create the blacklist? (Specify the criteria, e.g. words with <80 repetitions.)
  • Get 2-3 additional native speakers (ideally some linguists) to comment here with the estimated error rate. You can share a few samples of 500 random sentences from your output with them.
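The sampling step in the last bullet can be sketched in Python (a minimal sketch; the function name, file path, and fixed seed are illustrative, not part of the project's tooling):

```python
import random

def sample_sentences(path, k=500, seed=0):
    """Draw k distinct random sentences (one per line) for reviewers."""
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    rng = random.Random(seed)  # fixed seed: every reviewer sees the same sample
    return rng.sample(sentences, min(k, len(sentences)))
```

Using a fixed seed means the same 500 sentences can be regenerated and shared with each reviewer.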
@Mte90 (Contributor, Author) commented Sep 4, 2019

I am waiting for the scan, which is very slow, to get the final number; the criterion is <80 repetitions.

@Mte90 (Contributor, Author) commented Sep 4, 2019

I am getting 1001868 sentences with this blacklist :-)

@Mte90 (Contributor, Author) commented Sep 4, 2019

A list of 500 sentences is at https://pastebin.com/sWhjMa0r, and I have already asked the l10n team to review it.

@Mte90 (Contributor, Author) commented Sep 4, 2019

We had a review: there are some German words like Scleswig/Aarhus, the French Pere Lachaise, and a duplicate sentence at 441 and 441.
So I guess the blacklist doesn't work so well, because those terms don't seem easy to find in Italian.
But if we accept a very low error rate I think it is enough; I am waiting for a second review.

@Mte90 (Contributor, Author) commented Sep 4, 2019

Another volunteer found more French and English terms in that list, so maybe the blacklist generator doesn't work so well, because those terms are mentioned only once.

@nukeador (Contributor) commented Sep 4, 2019

Can you check whether the words that were found are inside the blacklist? Let's make sure the blacklist is being applied.

@Mte90 (Contributor, Author) commented Sep 5, 2019

They are not inside the blacklist. Maybe the blacklist tool is not working on the full Wikipedia dump but only on the extraction produced by this tool, which changes every time it is executed.
So if I run it again, there will probably be other words that are not included.

@nukeador (Contributor) commented Sep 5, 2019

Sorry, I don't understand. What is it that you feel is not working?

@Mte90 (Contributor, Author) commented Sep 5, 2019

  • We download the Wikipedia dump
  • We extract the content with this tool, which is part of the original Wikipedia project
  • From that extraction we run the blacklist generator
  • We run this tool again, this time with the blacklist generated before
  • The tool generates a new extraction from Wikipedia using the blacklist

This is the flow, but the Wikipedia content is different every time, so the blacklist will never be aligned with the content.
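The mismatch described above can be shown with a toy example, assuming the blacklist is simply the set of words at or below a frequency threshold (the data and function name are hypothetical, not the real tooling):

```python
from collections import Counter

def build_blacklist(sentences, max_freq=1):
    """Collect words appearing max_freq times or fewer in this extraction."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w for w, c in counts.items() if c <= max_freq}

# Two different random extractions of the same wiki content:
run_a = ["il gatto dorme", "il cane corre", "zyxwvut appare qui"]
run_b = ["il gatto corre", "parolarara appare qui"]

blacklist = build_blacklist(run_a)
# "zyxwvut" is caught, but "parolarara" occurs only in run_b,
# so a blacklist built from run_a can never contain it.
```

Building the blacklist once from the full set of Wikipedia sentences, rather than from one random extraction, avoids this run-to-run drift.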

@nukeador (Contributor) commented Sep 5, 2019

Once you have the Wikipedia dump extracted, you need to generate a txt file with all the Wikipedia sentences, and that file is then used to generate the blacklist. You need to do this before step 3 in the readme, and then do step 3 (I should probably add this to the readme).

Since you generate the blacklist from the whole set of Wikipedia sentences, it should work on any Wikipedia extraction you do.

What's the size and number of sentences when you run the:

cd ../common-voice-wiki-scraper
cargo run -- extract -d ../wikiextractor/text/ --no_check >> wiki.en.all.txt
@Mte90 (Contributor, Author) commented Sep 6, 2019

I see now that the command is different: it includes --no_check for the blacklist. I am trying again; I will update the PR.

Anyway, I think that including a few bash scripts that do everything automatically, instead of copying the commands from the readme, would be easier to use and would avoid this kind of issue if, for example, the readme is updated often.

@nukeador (Contributor) commented Sep 6, 2019

Yes, we want to fully automate the process; consider the current version a developer preview ;-)

I'll update the readme with a note.

@Mte90 (Contributor, Author) commented Sep 6, 2019

I tried again just now, following all the commands for Italian, but word_usage still does not include the term mentioned in #51 (comment).
The list is the same as before, so I don't know how to move on.

@nukeador (Contributor) commented Sep 10, 2019

Is it when using the cvtools script that you are having problems? Which command is not producing the expected output?

@Mte90 (Contributor, Author) commented Sep 10, 2019

Yes, cvtools doesn't add the terms I wrote about before. In the txt file generated by the tool, the term is present only once.
Maybe it doesn't detect a term that is present only once?

@nukeador (Contributor) commented Sep 10, 2019

The resulting blacklist should have a list of words, one per line, only once per word.

Basically you tell the script to take all wikipedia text, get the words with less than XX repetitions and create a list of these words for you.

By doing:

python3 ./word_usage.py -i ../common-voice-wiki-scraper/wiki.en.all.txt >> word_usage.en.txt

You get all the words and how many times each appears. Then you need to make a decision, let's say 80 repetitions or less, and create the final blacklist:

python3 ./word_usage.py -i ../common-voice-wiki-scraper/wiki.en.all.txt --max-frequency 80 --show-words-only >> ../common-voice-wiki-scraper/src/rules/disallowed_words/english.txt

That should get you a functional blacklist that will be picked up by the scraper script if placed in your language folder, so when you run:

cargo run -- extract -l english -d ../wikiextractor/text/ >> wiki.en.txt

The script will look for both rules and blacklist in the language specified.
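The two commands above boil down to: count every word's occurrences, then keep only the words at or below the chosen threshold. A rough Python equivalent (a sketch of the idea, not the actual word_usage.py code; the real tokenization will differ):

```python
from collections import Counter
import re

def word_usage(text):
    """Step 1: map each word to its number of occurrences, case-insensitively."""
    return Counter(re.findall(r"\w+", text.lower()))

def make_blacklist(text, max_frequency=80):
    """Step 2 (the --show-words-only run): words at or below the threshold."""
    return sorted(w for w, c in word_usage(text).items() if c <= max_frequency)
```

Sentences containing any word from the resulting list are then rejected by the scraper.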

@Mte90 (Contributor, Author) commented Sep 12, 2019

I used the same commands available in the readme, which are the same ones you shared now.
I think cvtools doesn't detect everything. I can share the 2.6 GB generated file, compressed: https://send.firefox.com/download/6ce9ab4e498fea10/#AFaOyDGvJafwx848o0_HHQ

@nukeador (Contributor) commented Sep 12, 2019

@dabinat, is this something you can help Daniele with, to figure out what's failing with cvtools?

@dabinat commented Sep 13, 2019

@Mte90 Can you give me an example of a word that wasn't detected? And you're certain the word definitely doesn't appear more than 80 times?

@Mte90 (Contributor, Author) commented Sep 14, 2019

#51 (comment)
Those are the terms that appear only around once in the Wikipedia dump.

@Mte90 (Contributor, Author) commented Sep 24, 2019

Ping for @dabinat :-)

@dabinat commented Sep 25, 2019

Those are the terms that appear only around once in the Wikipedia dump

Did you actually check this? I downloaded the Italian wiki dump, ran it through the word_usage script and it identified 369 instances of the word "aarhus". I also manually checked the whole file and stopped counting after I hit 81 instances (the file was huge).

So the script was correct to not add it to the blacklist. I don't see a bug here.

@Mte90 (Contributor, Author) commented Sep 26, 2019

I don't understand why it is not detecting that word in my case. I hadn't checked for case-sensitivity differences. Anyway, can you send me your blacklist file so I can upload it here?

@dabinat commented Sep 26, 2019

Can you try running the blacklist script again? I've made changes to it since you last ran it. I don't think those changes will have affected your particular issue but let's make sure we're both looking at the same results.

@Mte90 (Contributor, Author) commented Oct 1, 2019

I tested it now and found:

  • nell'aarhus / sull'aarhus

and many others containing '. The ' needs to be split: as in English, it truncates words, so the word after it needs to be parsed both as a single word and as part of the truncated form.

So the script needs to extract the word after ' and add it to the list.
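The fix described here can be sketched as tokenizing on apostrophes as well as whitespace, so an elided form like nell'aarhus yields both halves as separate words (a sketch of the idea only; the actual change lives in dabinat/cvtools#3):

```python
import re

def tokenize(sentence):
    """Split on whitespace and apostrophes (straight or curly), so
    "nell'aarhus" is counted as the two words "nell" and "aarhus"."""
    return [t for t in re.split(r"[\s'\u2019]+", sentence.lower()) if t]
```

With this tokenization, a rare foreign word hidden behind an Italian elided article still reaches the frequency counter on its own.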

@Mte90 (Contributor, Author) commented Oct 1, 2019

dabinat/cvtools#3: I opened a PR against the script that fixes my issue :-)

@Mte90 (Contributor, Author) commented Oct 1, 2019

PR updated :-D I hope it can be merged now so I can focus on other things.

@nukeador (Contributor) commented Oct 2, 2019

Thanks for all this work @Mte90 !

Before we merge, it would be good to document here what I mentioned earlier, for future reference:

  • How many sentences are you getting with these rules and blacklist?
  • How did you create the blacklist? (Specify the criteria, e.g. words with <80 repetitions.)
  • Get 2-3 additional native speakers (ideally some linguists) to comment here with the estimated error rate. You can share a few samples of 500 random sentences from your output with them.

Thanks again!

@Mte90 (Contributor, Author) commented Oct 3, 2019

An extract of the first 100 sentences: https://pastebin.com/wpzcizyE. I will ask someone else for reviews.
The sentences were created with a criterion of <80 repetitions.
I got 1001896 sentences with the blacklist.

@dag7dev commented Oct 6, 2019

Line number, full sentence, and reason:

  • Line 18: La vita di Abbie Hoffman è documentata nel film Steal this Movie. (Steal This Movie is the name of the movie, but it's in English.)
  • Line 20: Pete Townshend affermò successivamente di essere d'accordo con Hoffman riguardo l'imprigionamento di John Sinclair. (Pete Townshend, Hoffman and John Sinclair aren't so common; I don't know if it is a good idea to merge them.)
  • Line 71: Nello stesso anno dirige il paradossale "Leningrad Cowboys Go America", road movie musicale. (Same as line 18.)
  • Line 75: A Barcellona si immerse nello studio del "Sefer yetzirah" e dei suoi numerosi commentari. ("Sefer yetzirah" is not even English.)
@Mte90 (Contributor, Author) commented Oct 6, 2019

I think our error rate is very good (compared to before). I am wondering why the words inside quotation marks are not detected by the tool; maybe I can improve my PR.

@nukeador (Contributor) commented Oct 7, 2019

Some proper names in English are OK; that's expected and nearly impossible to detect. Also, there might be cases where we want to see how Italian speakers pronounce "John" or "Burger King".

@Mte90 (Contributor, Author) commented Oct 7, 2019

So I think this can be approved, and maybe later we can look at improvements to the tool that generated the blacklist.

@nukeador (Contributor) commented Oct 7, 2019

To follow the agreed quality checks, can we get 1-2 additional native speakers (ideally some linguists) to comment here with the estimated error rate?

Thanks!

@Sav22999 commented Oct 7, 2019

I report the same sentences as @dag7dev (lines 18, 20, 71, 75), and in addition:

  • Line 95: Tale rappresentazione iconica è ottenuta attraverso i file '.info'. (How should I read ".info"?)
@dag7dev commented Oct 8, 2019

I think we should read "dot" as "punto", as a normal Italian speaker would do.

@Sonic0 commented Oct 8, 2019

I agree with the others, I didn't find any other issue!

@Mte90 (Contributor, Author) commented Oct 8, 2019

Out of 1170 words in the first 50 sentences, only 17 non-Italian words were reported (including people's names). That is roughly a 1% error rate (and it includes names that are accepted), which I think is quite good.

@nukeador (Contributor) commented Oct 8, 2019

  • Total number of sentences: 1001896
  • Blacklist criteria: <80 repetitions
  • Estimated error rate: 1%

Merging now; thanks everyone for this work. I'll add this extraction to our TODO. Since we are currently transitioning our only full-time developer, this might take a bit.

Thanks for your patience.

@nukeador nukeador merged commit abaa89a into Common-Voice:master Oct 8, 2019
@nukeador (Contributor) commented Oct 14, 2019

@Mte90 We are getting just 937900 sentences after running the extraction for Italian. That is a significant difference, and we wanted to check with you before adding them.

@Mte90 (Contributor, Author) commented Oct 14, 2019

I don't know why, but I think it is good enough for the moment anyway.

@nukeador nukeador referenced this pull request Oct 14, 2019