[arm] can't use wikipron because of potential readtimeout? Can we use a wiktionary dump? #299
Hi Hossep,

So first off, we release a regular dump of Wiktionary run over every language and dialect we know of, updated at least annually. The Armenian ones were made this summer, when we added separate "dialects" (Western and Eastern are sometimes distinguished). I plan to update the entire collection shortly. This "dump" is here. This seems to us a good compromise over using a frozen dump for extraction: UniMorph, for instance, does that, and they've set it up so that it's impossible to advance beyond 2017! FWIW, I will kick off an overall "rescrape" next week (it runs for about a week, using a fast and reliable connection at my lab) and update this dumped data set.

FWIW, the latency you see has little to do with "searching" in any relevant sense. The code hits the backend for a pre-computed list of all Armenian headwords (4k or so at a time), which takes no more than a few seconds. The vast majority of the time is spent waiting for the server to serve up the actual HTML of the pages for those headwords. Finally, our requests identify WikiPron as a bot (the polite thing to do), so the administrators can traffic-shape us if necessary.

The extraction does sometimes hang, though if the server itself hangs, it usually hangs up the connection, which generates a fatal exception. I've not seen your particular pattern. Note that Python buffers IO (you can control this with command-line arguments to the interpreter and/or environment variables, which may help you get more immediate results). The script that generates the "dump" catches hangs, and it generally runs in one go over about a week, from my office, with a very fast and reliable internet connection. We have an open "enhancement" issue (#253) to make scrapes restartable. It's not clear how to do it, but I think it's a good idea.
I see, thanks for the explanation. To give more context, this is what my terminal shows for the different languages. For example, when I try Amharic, it just works: the output file has content in it. French also works, and it uses the cut-off points. But Armenian just finishes in an hour and gives me an empty file.
Regarding this statement:
Can you recommend any command to run? If the entire process would take a week on my residential internet connection, then I can just sit tight for your eventual rescrape :D
Hello,
Oh yes, that is a factor. There are a few hundred phonemic transcriptions (that's the default), but they're few and far between. To disable buffering from the command line, I believe you can do PYTHONUNBUFFERED=1 your_python_command (not tested).
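As an illustrative sketch (not part of WikiPron itself): PYTHONUNBUFFERED is documented as equivalent to the interpreter's -u flag, and on Python 3.7+ unbuffered mode sets the write_through flag on the text layer of sys.stdout, so this check should print True when the variable is in effect:

```python
import os
import subprocess
import sys

# Launch a child interpreter with PYTHONUNBUFFERED=1 (equivalent to `python -u`).
# In unbuffered mode, writes pass straight through to the OS instead of
# sitting in a buffer, which is why progress output appears immediately.
env = dict(os.environ, PYTHONUNBUFFERED="1")
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.write_through)"],
    env=env,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

Without the variable (and with output going to a pipe), the same check reports False, because stdout defaults to block buffering.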
Huh, cool. I'm not surprised. The Armenian IPA entries are all done via a script that converts a cleaned-up version of the orthographic form, explained here. For example, for the orthographic word անտերունչ, the IPA is [ɑntɛˈɾunt͡ʃʰ], but the edit page contains just:
I'm trying it now and it's working :D
Though I thought that your code would extract both phonemic and phonetic transcriptions? Maybe I misunderstood the "Phonetic versus phonemic transcription" section of your paper. Does your scraped data set mark any of the phonemic transcriptions? Because if so, then their entries should be fixed by a Wiktionary user (like me).
The way things are set up, we just scrape twice, once for phonemic (literally just whatever is in / /) and once for phonetic ([ ]), and store them in separate files. Some languages have a lot of one and very few of the other; if a file ends up with fewer than 100 entries, we don't include it in the "big scrape" database. But Armenian has enough phonemic transcriptions (for both dialects) to make the cut. What do you mean by "mark any of the phonemic transcriptions"? It does extract them: the big scrape will have a file with a name like

PS: I didn't realize you could have > foo before a flag and Python would know that worked. I would have written
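The / / versus [ ] convention described above can be sketched as a tiny classifier (an illustration only, not WikiPron's actual extraction code):

```python
import re


def classify(pron: str) -> str:
    """Classify a raw Wiktionary pronunciation string by its delimiters:
    slashes mark phonemic transcriptions, square brackets phonetic ones."""
    if re.fullmatch(r"/[^/]+/", pron):
        return "phonemic"
    if re.fullmatch(r"\[[^\]]+\]", pron):
        return "phonetic"
    return "unknown"


print(classify("/bɑːk/"))          # phonemic
print(classify("[ɑntɛˈɾunt͡ʃʰ]"))  # phonetic
```

In the scenario above, each of the two scraping passes keeps only one of these two classes and writes it to its own file.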
But your dataset only has "phonetic" and "phonetic filtered" (and it's unclear what "phonetic filtered" means).

PS: Yeah, that surprised me too :D
Oh no, you're right; I misread this earlier.
Side note: is there any documentation on what you mean by "phonetic-filtered" in your scraped datasets?
No, but I think it's obvious enough: it uses the appropriate phonelists in the `phones` directory to filter entries.
Oh, so at some point you manually convert the original material from Wiktionary into the segments in the phone list? But is there documentation on the specific graphemes (or transcribed phonemes) that you are converting from and to?
No. It's a filter. If any phone in the entry isn't present in the phonelist, we drop the pronunciation entry from the corresponding "filtered" file. This happens during the "big scrape" procedure that generates the static database. It's not part of the command-line tool itself, though you could filter that way yourself if you wanted to.

For instance, Bach is given two RP pronunciations: /bɑːx, bɑːk/. We throw the former out of the filtered file because /x/ isn't a phoneme of modern English (sorry, SPE). No transformations are applied: we don't have a mechanism for that yet.
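The filtering step described above can be sketched like this (an illustrative reimplementation, not the actual WikiPron code; in the released TSV data, phones within a pronunciation are space-separated):

```python
def filter_entries(entries, phonelist):
    """Keep only (word, pronunciation) pairs whose every space-separated
    phone appears in the language's phonelist; all other pairs are dropped,
    as with /b ɑː x/ for English above. No transformations are applied."""
    phoneset = set(phonelist)
    return [
        (word, pron)
        for word, pron in entries
        if all(phone in phoneset for phone in pron.split())
    ]


# Toy phonelist for illustration; the real ones live under data/phones.
english_phones = ["b", "ɑː", "k"]
entries = [("Bach", "b ɑː x"), ("Bach", "b ɑː k")]
print(filter_entries(entries, english_phones))  # [('Bach', 'b ɑː k')]
```

Here the /x/ variant fails the membership check and is excluded from the "filtered" output, while the /k/ variant survives.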
The only graphemic transformation applied is optional case-folding. For more information, see data/src, the README, and the code therein.

K
Linking this to #253 and closing for now.
Ah, thanks for the clarification. The README doesn't mention the words "case" or "filter", though.
You're welcome to update it. It's really only for maintainers, though: three people have used that code so far. Also, the filtering mechanism isn't "released" yet; we'll do that before long.
Well, I found the ACL paper pretty helpful in understanding almost everything about how to interpret (and assess) how correct your arm scrape was. The README covered some holes that weren't present in the ACL doc. I would update the README myself, but I would fear saying something wrong :/

PS: "before long" is a new construction for me.
Hello
I can use the terminal version of WikiPron to scrape a small language like Amharic [amh], and to scrape a big one like French. But when I try to run it on Armenian [hye or arm], the code just stops running after an hour and outputs nothing; there aren't even any errors thrown. I suspect the code is hitting a read-timeout error and then skipping it.

I suspect there's a read-timeout error because, in the past, I used another Wiktionary extractor, Wiktextract, and it took 9-12 hours to scrape the Armenian words (just 17k of them). I suspect the Armenian entries are so oddly dispersed across Wiktionary that it takes a while for some scrapers to find them. Granted, Wiktextract was using a Wiktionary dump, and that's how it managed to eventually work. Can WikiPron work over a Wiktionary dump, or does it need to actively use an internet connection?