
[arm] can't use wikipron because of a potential read timeout? Can we use a Wiktionary dump? #299

Closed
jhdeov opened this issue Dec 29, 2020 · 16 comments
Labels: enhancement (New feature or request)

@jhdeov (Contributor) commented Dec 29, 2020

Hello

I can use the terminal version of WikiPron to scrape a small language like Amharic [amh], and to scrape a big one like French. But when I try to run it on Armenian [hye or arm], the code just stops running after an hour and outputs nothing -- there aren't even any errors thrown. I suspect the code is hitting a read timeout error and then skipping it.

I suspect there's a read timeout error because in the past I used another Wiktionary extractor, Wiktextract, and it took 9-12 hours to scrape the Armenian words (just 17k words). I suspect the Armenian entries are so oddly dispersed across Wiktionary that it takes a while for some scrapers to find them. Granted, Wiktextract was using a Wiktionary dump, and that's how it managed to eventually work. Can WikiPron work over a Wiktionary dump, or does it need an active internet connection?

@kylebgorman (Collaborator)

Hi Hossep,

So first off, we release a regular "dump" of WikiPron run over every language and dialect we know of, and we update it at least annually. The Armenian ones were made this summer, when we added separate "dialects" (Western and Eastern are sometimes distinguished). I plan to update the entire collection shortly. This "dump" is here. This seems to us a better compromise than extracting from a frozen dump: UniMorph does that, for instance, and they've set things up so that it's impossible to advance beyond 2017! FWIW, I will kick off an overall rescrape next week (it runs for about a week, using a fast and reliable connection at my lab) and update this dumped data set.

FWIW, the latency you see has little to do with "searching" in any relevant sense. The code hits the backend for a pre-computed list of all Armenian headwords (4k or so at a time), which takes no more than a few seconds. The vast majority of the time is spent waiting for the server to serve up the actual HTML of the pages for those headwords. Finally, our requests identify WikiPron as a bot (the polite thing to do), so the administrators can traffic-shape us if necessary.
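(A hedged illustration of that kind of headword query: the MediaWiki API exposes pre-computed category listings. The exact category name and batch size WikiPron uses are assumptions here, not confirmed details.)

# Ask the Wiktionary API for one batch of Armenian headwords; the
# category name and the 500-per-request batch size are illustrative.
curl 'https://en.wiktionary.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Armenian_terms_with_IPA_pronunciation&cmlimit=500&format=json'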

The extraction does sometimes hang, though when the server is actually the problem it usually hangs up on us, which generates a fatal exception. I've not seen your particular pattern. Note that Python buffers IO (you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results). The script that generates the "dump" catches hangs, and it generally runs in one go over about a week, from my office, with a very fast and reliable internet connection.

We have an open "enhancement" issue (#253) to make scrapes restartable. It's not clear how to do it but I think it's a good idea.
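(A hedged sketch of the kind of catch-and-restart wrapper such a dump script might use; this is not the actual script, and restarting from scratch is exactly what #253 would improve on.)

# Re-run the scrape until it exits cleanly; each retry starts over
# from zero because the redirection truncates the output file.
until wikipron --phonetic hye > hye_phonetic.tsv; do
    echo 'scrape failed; retrying in 60s' >&2
    sleep 60
done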

@jhdeov (Contributor, author) commented Dec 29, 2020

I see, thanks for the explanation.

To give more context, this is what my terminal shows for the different languages.

For example, when I try Amharic, it just works:

$ wikipron amh > amh.tsv
INFO: Language: "Amharic"
INFO: No cut-off date specified
$

The output file has content in it.

French also works, and it prints progress milestones:

$ wikipron fra > fr.tsv
INFO: Language: "French"
INFO: No cut-off date specified
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
INFO: 300 pronunciations scraped
INFO: 400 pronunciations scraped
INFO: 500 pronunciations scraped
...

But Armenian just finishes after an hour and gives me an empty file:

$ wikipron hye > hye1.tsv
INFO: Language: "Armenian"
INFO: No cut-off date specified
$

Regarding this statement:

(you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results)

Can you recommend any command to run? If the entire process would take a week on my residential internet connection, then I can just sit tight for your eventual rescrape :D

@lfashby (Collaborator) commented Dec 29, 2020

Hello,
Perhaps this behavior is due to all Armenian entries being 'phonetic' transcriptions? wikipron arm will run for about an hour and output nothing, since no Armenian transcriptions are contained within slashes (/.../). Running wikipron arm --phonetic appears to work as intended.

@kylebgorman (Collaborator)

Oh yes, that is a factor. There are a few hundred phonemic transcriptions (phonemic is the default), but they're few and far between.

To disable buffering from the command line, I believe you can do

PYTHONUNBUFFERED=1 your_python_command

(Not tested.)
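(Building on that untested suggestion, a minimal sketch using the wikipron invocation from earlier in this thread; python -u is the interpreter's equivalent flag. The output file name is illustrative.)

# Force unbuffered output so the INFO progress lines appear immediately.
PYTHONUNBUFFERED=1 wikipron --phonetic hye > hye_phonetic.tsv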

@jhdeov (Contributor, author) commented Dec 29, 2020

Huh, cool. I'm not surprised.

The Armenian IPA entries are all done via a script that converts a cleaned-up version of the orthographic form, explained here.

For example, for the orthographic word անտերունչ, the IPA is [ɑntɛˈɾunt͡ʃʰ], but the edit page just has:

===Pronunciation===
* {{audio|hy|Hy-անտերունչ.ogg|Audio}}
{{hy-pron}}

I'm trying it now and it's working :D

$ wikipron arm > hye1.tsv --phonetic
INFO: Language: "Armenian"
INFO: No cut-off date specified
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
....

Though I thought that your code would extract both phonemic and phonetic transcriptions? Maybe I misunderstood the "Phonetic versus phonemic transcription" section of your paper.

Does your scraped dataset mark any of the phonemic transcriptions? Because if so, their entries could be fixed by a Wiktionary user (like me).

@kylebgorman (Collaborator)

The way things are set up, we just scrape twice, once for phonemic (literally just whatever is in /.../) and once for phonetic ([...]), and store them in separate files. Some languages have a lot of one and very few of the other; if a file ends up with fewer than 100 entries, we don't include it in the "big scrape" database. But Armenian has enough phonemic transcriptions (for both dialects) to make the cut.
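(A minimal sketch of that double scrape; the output file names are illustrative, not the big-scrape script's actual naming scheme.)

# Phonemic (/.../) is the default; phonetic ([...]) needs the flag.
wikipron hye > hye_phonemic.tsv
wikipron --phonetic hye > hye_phonetic.tsv
# Files with fewer than 100 entries are dropped from the database.
wc -l hye_phonemic.tsv hye_phonetic.tsv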

What do you mean by "mark any of the phonemic transcriptions"? It does extract them: the big scrape will have a file with a name like arm_e_phonemic.tsv for the Eastern dialect, and so on.

PS: I didn't realize you could put > foo before a flag and it would still work. I would have written wikipron --phonetic arm > hye1.tsv and wouldn't have guessed that what you did would work, but I guess the shell truly parses > as a low-precedence operator, huh.
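(For the record, that is standard shell behavior, independent of Python: redirections are stripped from the word list before the command runs, so both of these pass wikipron identical arguments.)

wikipron arm > hye1.tsv --phonetic
wikipron --phonetic arm > hye1.tsv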

@jhdeov (Contributor, author) commented Dec 29, 2020

But Armenian has enough phonemic transcriptions (for both dialects) to make the cut.

But your dataset only has "phonetic" and "phonetic filtered" (and it's unclear what "phonetic filtered" means).

PS: Yeah that surprised me too :D

@kylebgorman (Collaborator) commented Dec 29, 2020 via email

@jhdeov (Contributor, author) commented Dec 29, 2020

Side note: is there any documentation of what you mean by "phonetic filtered" in your scraped datasets?

@kylebgorman (Collaborator) commented Dec 29, 2020 via email

@jhdeov (Contributor, author) commented Dec 29, 2020

Oh, so at some point you manually convert the original material from Wiktionary into the segments in the phone list? But is there documentation on the specific graphemes (or transcribed phonemes) that you are converting from and to?

@kylebgorman (Collaborator) commented Dec 29, 2020 via email

kylebgorman added the enhancement label Dec 29, 2020
@kylebgorman (Collaborator)

Linking this to #253 and closing for now.

@jhdeov (Contributor, author) commented Dec 29, 2020

Ah, thanks for the clarification. The README doesn't mention the words "case" or "filter", though.

@kylebgorman (Collaborator) commented Dec 29, 2020 via email

@jhdeov (Contributor, author) commented Dec 29, 2020

Well, I found the ACL paper pretty helpful in understanding almost everything about how to interpret (and assess the correctness of) your arm scrape. The README filled some holes the ACL doc left open.

I would update the README myself, but I'd fear saying something wrong :/

PS: "before long." is a new construction for me.
