
[arm] can't use wikipron because of a potential read timeout? Can we use a Wiktionary dump? #299

Closed
jhdeov opened this issue Dec 29, 2020 · 16 comments
Labels: enhancement (New feature or request)

@jhdeov (Contributor) commented Dec 29, 2020

Hello

I can use the terminal version of WikiPron to scrape a small language like Amharic [amh], and to scrape a big one like French. But when I try to run it on Armenian [hye or arm], the code just stops running after an hour and outputs nothing -- there aren't even any errors thrown. I suspect the code is hitting a read timeout error and then skipping it.

I suspect there's a read timeout error because in the past I used another Wiktionary extractor, Wiktextract, and it took 9-12 hours to scrape the Armenian words (just 17k words). I suspect the Armenian entries are so oddly dispersed across Wiktionary that it takes a while for some scrapers to find them. Granted, Wiktextract was using a Wiktionary dump, and that's how it managed to eventually work. Can WikiPron work over a Wiktionary dump, or does it need an active internet connection?

@kylebgorman (Collaborator)

Hi Hossep,

So first off, we release a regular "dump" of WikiPron run over every language and dialect we know of, and we update it at least annually. The Armenian ones were made this summer, when we added separate "dialects" (Western and Eastern are sometimes distinguished). I plan to update the entire collection shortly. This "dump" is here. This seems to us a better compromise than extracting from a frozen dump: UniMorph does that, for instance, and they've set things up so that it's impossible to advance beyond 2017! FWIW, I will kick off an overall rescrape next week (it runs for about a week, using a fast and reliable connection at my lab) and update this dumped data set.

FWIW, the latency you see has little to do with "searching" in any relevant sense. The code hits the backend for a pre-computed list of all Armenian headwords (4k or so at a time), which takes no more than a few seconds. The vast majority of the time is spent waiting for the server to serve up the actual HTML of the pages for those headwords. Finally, our requests identify WikiPron as a bot (the polite thing to do), so the administrators can traffic-shape us if necessary.
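(A hedged illustration of that kind of headword query: the MediaWiki API exposes pre-computed category listings. The exact category name and batch size WikiPron uses are assumptions here, not confirmed details.)

# Ask the Wiktionary API for one batch of Armenian headwords; the
# category name and the 500-per-request batch size are illustrative.
curl 'https://en.wiktionary.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Armenian_terms_with_IPA_pronunciation&cmlimit=500&format=json'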

The extraction does sometimes hang, though when the server is actually the problem it usually hangs up on us, which generates a fatal exception. I've not seen your particular pattern. Note that Python buffers IO (you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results). The script that generates the "dump" catches hangs, and it generally runs in one go over about a week, from my office, with a very fast and reliable internet connection.

We have an open "enhancement" issue (#253) to make scrapes restartable. It's not clear how to do it but I think it's a good idea.
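(A hedged sketch of the kind of catch-and-restart wrapper such a dump script might use; this is not the actual script, and restarting from scratch is exactly what #253 would improve on.)

# Re-run the scrape until it exits cleanly; each retry starts over
# from zero because the redirection truncates the output file.
until wikipron --phonetic hye > hye_phonetic.tsv; do
    echo 'scrape failed; retrying in 60s' >&2
    sleep 60
done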

@jhdeov (Contributor, author) commented Dec 29, 2020

I see, thanks for the explanation.

To give more context, this is what my terminal shows for the different languages.

For example, when I try Amharic, it just works:

$ wikipron amh > amh.tsv
INFO: Language: "Amharic"
INFO: No cut-off date specified
$

The output file has content in it.

French also works, and it prints progress milestones:

$ wikipron fra > fr.tsv
INFO: Language: "French"
INFO: No cut-off date specified
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
INFO: 300 pronunciations scraped
INFO: 400 pronunciations scraped
INFO: 500 pronunciations scraped
...

But Armenian just finishes after an hour and gives me an empty file:

$ wikipron hye > hye1.tsv
INFO: Language: "Armenian"
INFO: No cut-off date specified
$

Regarding this statement:

(you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results)

Can you recommend any command to run? If the entire process would take a week on my residential internet connection, then I can just sit tight for your eventual rescrape :D

@lfashby (Collaborator) commented Dec 29, 2020

Hello,
Perhaps this behavior is due to all Armenian entries being 'phonetic' transcriptions? wikipron arm will run for about an hour and output nothing, since no Armenian transcriptions are contained within slashes (/.../). Running wikipron arm --phonetic appears to work as intended.

@kylebgorman (Collaborator)

Oh yes, that is a factor. There are a few hundred phonemic transcriptions (phonemic is the default), but they're few and far between.

To disable buffering from the command line, I believe you can do

PYTHONUNBUFFERED=1 your_python_command

(Not tested.)
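(Building on that untested suggestion, a minimal sketch using the wikipron invocation from earlier in this thread; python -u is the interpreter's equivalent flag. The output file name is illustrative.)

# Force unbuffered output so the INFO progress lines appear immediately.
PYTHONUNBUFFERED=1 wikipron --phonetic hye > hye_phonetic.tsv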

@jhdeov (Contributor, author) commented Dec 29, 2020

Huh, cool. I'm not surprised.

The Armenian IPA entries are all done via a script that converts a cleaned-up version of the orthographic form, explained here.

For example, for the orthographic word անտերունչ, the IPA is [ɑntɛˈɾunt͡ʃʰ], but the edit page just has:

===Pronunciation===
* {{audio|hy|Hy-անտերունչ.ogg|Audio}}
{{hy-pron}}

I'm trying it now and it's working :D

$ wikipron arm > hye1.tsv --phonetic
INFO: Language: "Armenian"
INFO: No cut-off date specified
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
....

Though I thought that your code would extract both phonemic and phonetic transcriptions? Maybe I misunderstood the "Phonetic versus phonemic transcription" section of your paper.

Does your scraped dataset mark any of the phonemic transcriptions? Because if so, their entries could be fixed by a Wiktionary user (like me).

@kylebgorman (Collaborator)

The way things are set up, we just scrape twice, once for phonemic (literally just whatever is in /.../) and once for phonetic ([...]), and store them in separate files. Some languages have a lot of one and very few of the other; if a file ends up with fewer than 100 entries, we don't include it in the "big scrape" database. But Armenian has enough phonemic transcriptions (for both dialects) to make the cut.
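(A minimal sketch of that double scrape; the output file names are illustrative, not the big-scrape script's actual naming scheme.)

# Phonemic (/.../) is the default; phonetic ([...]) needs the flag.
wikipron hye > hye_phonemic.tsv
wikipron --phonetic hye > hye_phonetic.tsv
# Files with fewer than 100 entries are dropped from the database.
wc -l hye_phonemic.tsv hye_phonetic.tsv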

What do you mean by "mark any of the phonemic transcriptions"? It does extract them: the big scrape will have a file with a name like arm_e_phonemic.tsv for the Eastern dialect, and so on.

PS: I didn't realize you could put > foo before a flag and it would still work. I would have written wikipron --phonetic arm > hye1.tsv and wouldn't have guessed that what you did would work, but I guess the shell truly parses > as a low-precedence operator, huh.
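(For the record, that is standard shell behavior, independent of Python: redirections are stripped from the word list before the command runs, so both of these pass wikipron identical arguments.)

wikipron arm > hye1.tsv --phonetic
wikipron --phonetic arm > hye1.tsv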

@jhdeov (Contributor, author) commented Dec 29, 2020

But Armenian has enough phonemic transcriptions (for both dialects) to make the cut.

But your dataset only has "phonetic" and "phonetic filtered" (and it's unclear what "phonetic filtered" means).

PS: Yeah that surprised me too :D

@kylebgorman (Collaborator) commented Dec 29, 2020 via email

@jhdeov (Contributor, author) commented Dec 29, 2020

Side note: is there any documentation of what you mean by "phonetic filtered" in your scraped datasets?

@kylebgorman (Collaborator) commented Dec 29, 2020 via email

@jhdeov (Contributor, author) commented Dec 29, 2020

Oh, so at some point you manually convert the original material from Wiktionary into the segments in the phone list? But is there documentation on the specific graphemes (or transcribed phonemes) that you are converting from and to?

@kylebgorman (Collaborator) commented Dec 29, 2020 via email

kylebgorman added the enhancement label Dec 29, 2020
@kylebgorman (Collaborator)

Linking this to #253 and closing for now.

@jhdeov (Contributor, author) commented Dec 29, 2020

Ah, thanks for the clarification. The README doesn't mention the words "case" or "filter", though.

@kylebgorman (Collaborator) commented Dec 29, 2020 via email

@jhdeov (Contributor, author) commented Dec 29, 2020

Well, I found the ACL paper pretty helpful in understanding almost everything about how to interpret (and assess the correctness of) your arm scrape. The README filled some holes the ACL doc left open.

I would update the README myself, but I'd fear saying something wrong :/

PS: "before long." is a new construction for me.
