Add config options to big scrape, separate scraping and writing within big scrape. #76
Conversation
These are what the English, Spanish and Portuguese entries look like in languages.json.
This is a link to the README table so you can see how Portuguese looks with dialect info in the Wiktionary name column. I don't think it would make sense to add another column for dialect info, as the table is already quite wide and it would only apply to a few languages. This is a link to the tsv of the README table. It contains mostly the same information as the README, but just lists the file name in the link column instead of a relative path in a markdown link. I could potentially add headers if that is useful. A separate column for dialects may be relevant in this format.
Just wanted to comment on some of the API choices before I dive into code. This is a lot of stuff at once, congrats on getting it all together so fast.
languages/wikipron/src/codes.py

```python
"casefold": None,
"no_stress": True,
"no_syllable_boundaries": True,
"cut_off_date": CUT_OFF_DATE
```
apply reflow-er? I see there's no trailing comma here ;)
languages/wikipron/src/codes.py

```diff
 with open("languages.json", "r") as source:
     prev_languages = json.load(source)
 with open("iso-639-3_20190408.tsv", "r") as source:
     iso_list = csv.reader(source, delimiter="\t")
-for (lang_page_title, total_pages) in _cat_members():
+for (lang_page_title) in _cat_members():
```
remove extraneous parens here (or I think reflow-er will take care of that)
languages/wikipron/src/codes.py

```python
# Set CUT_OFF_DATE below.
# CUT_OFF_DATE can be no later than the date
# at which you plan to begin scraping.
CUT_OFF_DATE = "2019-11-01"
```
I'm a little confused why this is given as a default here. Why not use `datetime.datetime.today().isoformat()` or something? (For our particular case where we really want to do a cut-off-date'd scrape we can just search-and-replace on `languages.json`, right?)
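For what it's worth, a minimal sketch of that dynamic default. Note that `datetime.datetime.today().isoformat()` includes the time of day, while `datetime.date.today().isoformat()` yields just a `YYYY-MM-DD` string like the hard-coded one:

```python
import datetime

# A dynamic default instead of a hard-coded "2019-11-01":
# date.today().isoformat() gives just the date portion, e.g. "2019-11-01".
CUT_OFF_DATE = datetime.date.today().isoformat()
```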
For some reason I interpreted #68 as “Language entries in `languages.json` should contain all config options that do not use default values set in `config.py`.”
So I guess I wasted some time doing that. It would be simpler if the language entries looked like they did before:
```json
"est": {
    "iso639_name": "Estonian",
    "wiktionary_name": "Estonian",
    "wiktionary_code": "et",
    "casefold": true
},
```

Or this if a dialect is used:

```json
"eng": {
    "iso639_name": "English",
    "wiktionary_name": "English",
    "wiktionary_code": "en",
    "casefold": true,
    "dialect": {
        "us": "US | General American",
        "uk": "UK | Received Pronunciation"
    }
},
```
And then `config_settings` in `scrape.py` would handle setting `no_stress`, `no_syllable_boundaries`, and `cut_off_date`, since they are always the same. Then we can just set `cut_off_date` with `datetime` in `scrape.py`.
languages/wikipron/src/codes.py

```diff
@@ -21,6 +21,7 @@
 * Dialect information may also need to be added manually
 """

+# TODO Generate and use a lookup table for the iso639-3 tsv file
```
Generic minor comment throughout this change: write TSV when you're talking about it in a comment (okay to have a variable name like `tsv` cuz snake_casing conventions).
languages/wikipron/src/scrape.py

```python
# Removes TSV files with less than 100 entries.
if phonemic_count < 100:
    logging.info(
        '"%s", has less than 100 entries in phonemic transcription.',
```
This is going to say:

> "Amharic", has less than 100 entries...

Is the comma there intended?
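For illustration, here is how that %-style placeholder renders (the stray comma comes straight from the format string):

```python
# The format string from the snippet above, filled in with "Amharic".
message = '"%s", has less than 100 entries in phonemic transcription.' % "Amharic"
# message is now: "Amharic", has less than 100 entries in phonemic transcription.
```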
languages/wikipron/src/scrape.py

```python
if __name__ == "__main__":
    logging.basicConfig(
        format="%(filename)s %(levelname)s: %(asctime)s - %(message)s",
        handlers=[
            logging.FileHandler("scraping.log", mode="w"),
```
possibly put filename path here into a module-level constant
languages/wikipron/src/write.py

```python
def _readme_insert(wiki_name, row_string):
    with open("../tsv/README.md", "r+") as source:
```
suggest this path as a module-level constant
languages/wikipron/src/write.py

```python
# Rewrite the entire README.
source.seek(0)
source.truncate()
source.write("".join(readme_list))
```
If each element in `readme_list` were a line with its trailing newline you could use the more-efficient `writelines` (which takes an iterator of strings and doesn't build up the whole big string at once; note it doesn't add newlines itself). But this part is so complex I'm afraid to ask you to kick the hornet's nest...
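A small sketch of the suggestion, using made-up rows and a throwaway temp file rather than the real `../tsv/README.md`:

```python
import os
import tempfile

# Invented sample rows; each element already carries its newline.
readme_list = ["| Language | Entries |\n", "| Portuguese | 10000 |\n"]

path = os.path.join(tempfile.mkdtemp(), "README.md")
with open(path, "w") as sink:
    # writelines consumes the iterable directly, so no single big
    # joined string is built up in memory first.
    sink.writelines(readme_list)

with open(path, "r") as source:
    content = source.read()
```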
I'll look into this.
languages/wikipron/src/write.py

```python
    file.rindex("_") + 1 : file.index(".")
].capitalize()
wiki_name = languages[iso639_code]["wiktionary_name"]
# Assumes we will not remove eng and spa tsv files
```
Is this comment still current? Will it become irrelevant shortly? If so remove.
languages/wikipron/src/write.py

```python
def sorting(ele):
    return ele[3]

readme_tsv_list.sort(key=sorting)
for lang_row in readme_tsv_list:
```
Strongly suggest you use `csv.writer` for this (you just need to tell it you're using a `"\t"` separator). Then call its `writerows` method with `readme_tsv_list` as an argument and you're free of this loop too!
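A sketch of that suggestion with invented sample rows, writing to an in-memory buffer instead of the real file:

```python
import csv
import io

# Hypothetical rows in the shape of readme_tsv_list.
readme_tsv_list = [
    ["por_phonemic.tsv", "por", "Portuguese", "True", "phonemic", "10000"],
    ["est_phonemic.tsv", "est", "Estonian", "True", "phonemic", "5000"],
]

sink = io.StringIO()
writer = csv.writer(sink, delimiter="\t")
writer.writerows(readme_tsv_list)  # replaces the manual formatting loop
first_line = sink.getvalue().splitlines()[0]
```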
Reviews scattered throughout...only serious concerns are about file naming etc. I'm traveling this weekend so ping on email if you need timely attention. Things looking quite good overall.
Okay, I'll start working on this tomorrow and just reply to a few comments now.
In addition to changing the name of `readme_tsv.tsv` I think I should also change where it gets outputted. Right now it just gets placed in the `src` directory, but I presume the `tsv` directory is the more appropriate place.
Not sure. Maybe the parent directory of src/ and tsv/? That way there's no risk of it being confused with an actual language file.
K
…On Thu, Oct 24, 2019 at 11:12 PM Lucas Ashby ***@***.***> wrote:
In addition to changing the name of readme_tsv.tsv I think I should also
change where it gets outputted. Right now it just gets placed in the src
directory, but I presume the tsv directory is the more appropriate place.
I think these changes touch on all the comments Kyle made. Aside from formatting and file name stuff, the big differences are:
Couple minor things left, but looking good. @jacksonllee, do you have any final thoughts on this?
```python
# Sort by wiktionary language name,
# with phonemic entries before phonetic
def sorting(ele):
```
still just use a lambda here.
As in:

```python
languages_summary_list.sort(key=lambda ele: ele[3] + ele[5])
readme_list.sort(key=lambda ele: ele[3] + ele[5])
```

Or something along the lines of:

```python
f = lambda ele: ele[3] + ele[5]
languages_summary_list.sort(key=f)
readme_list.sort(key=f)
```

The latter is 'DRY-er' but isn't assigning a lambda to a variable an anti-pattern?
languages/wikipron/src/scrape.py

```python
os.remove(phonemic_path)
if os.path.exists(phonetic_path):
```
Just noticed this. This is a "dark pattern" because it is theoretically possible for the file to blink in or out of existence between the check for existence and the remove. The recommended pattern is instead to do a `try: os.remove(...)` and then catch and ignore the exception (`OSError` probably) if it fails. I realize this is an annoying one...
Found this from googling:

```python
from contextlib import suppress

with suppress(OSError):
    os.remove(filename)
```

Seems nicer than ignoring the exception. Let me know if you have a preference. Is it the `os.path.exists()` check that makes the file potentially blink in or out of existence? Do I need to add this to all `os.remove()` statements or just the ones that involve `os.path.exists()`?
languages/wikipron/src/scrape.py

```python
phonemic_readme_row_string = (
    "| " + " | ".join(phonemic_row) + " |\n"
return
# Remove files for languages that returned nothing
```
period for consistency on this comment I suppose
languages/wikipron/src/scrape.py

```python
    "| " + " | ".join(phonemic_row) + " |\n"
return
# Remove files for languages that returned nothing
elif phonemic_count == 0 and phonetic_count == 0:
```
probably `if not phonemic_count and not phonetic_count` or something like that.
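Spelled out with a hypothetical helper name (not in the PR), the truthiness version is equivalent to the explicit `== 0` comparison for these counts:

```python
def nothing_scraped(phonemic_count, phonetic_count):
    # not 0 is True and any positive count is truthy, so this matches
    # phonemic_count == 0 and phonetic_count == 0 for non-negative ints.
    return not phonemic_count and not phonetic_count
```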
Former is better and hopefully the reflower can make it pretty too.
…On Fri, Oct 25, 2019 at 6:06 PM Lucas Ashby ***@***.***> wrote:

In languages/wikipron/src/generate_summary.py:

```python
row = [
    iso639_code,
    languages[iso639_code]["iso639_name"],
    wiki_name,
    str(languages[iso639_code]["casefold"]),
    transcription_level,
    str(num_of_entries),
]
# TSV and README have different first column.
languages_summary_list.append([file] + row)
readme_list.append([f"[TSV]({file})"] + row)

# Sort by wiktionary language name,
# with phonemic entries before phonetic
def sorting(ele):
```
I'm not sure what reflower or reflow-er is and searching 'reflower python' doesn't seem to bring up anything useful.
Sorry, I’m referring to ‘white’.
I made the above changes and ran white. There were a few things I could do with the
Just a few minor things (plus a new, separate bug unrelated to this PR).
languages/wikipron/README.md

```diff
-Wiktionary languages with over 100 entries. It writes the results of those
-scrape calls to TSVs and generates a [README](./tsv/README.md) with selected
-information regarding the contents of those TSVs and the configuration settings
+Wiktionary languages with over 100 entries. [write.py](./src/write.py)
```
`write.py` -> `generate_summary.py`?
languages/wikipron/README.md

```markdown
2. Run [scrape.py](./src/scrape.py).
3. Run [write.py](./src/write.py).
```
`generate_summary.py`
```python
readme_list = []
languages_summary_list = []
path = "../tsv"
for file in os.listdir(path):
```
`file` -> `file_path` or something?
```
adaga	a d a ɡ a
adenina	a d e n i n a
adeus	a d e w s
adição	a d͡ ʒ i s ɐ̃ w
```
Oops, we forgot about having to correctly handle the tie bar for affricates. I'm opening an issue for this now. (Edit: Just opened #78)
You mean like `if not phonemic_count and not phonetic_count` as in the if block below it? I could potentially set up `_call_scrape()` so it always returns an int and then remove the files when that int is less than 100, but the idea behind having these separate if blocks, and the reason `_call_scrape()` can return either an int or None, is so we can log the different ways languages can fail.
I don't know if this is a good case for returning heterogeneous types. If you want to do that, any of the following would be preferable IMO:

1. raise an exception
2. return 0 as a default
3. return -1 if it's important to distinguish between that and 0

I think option 2 is probably the best option.
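A hypothetical sketch of option 2 (names here are invented, not the PR's actual `_call_scrape` signature): always return an int, with 0 standing in for "nothing scraped":

```python
def entry_count(scraped_entries):
    # Returns an int in every case: 0 both for a failed scrape (None)
    # and for an empty result, rather than mixing int and None.
    if scraped_entries is None:
        return 0
    return len(scraped_entries)
```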
…On Sun, Oct 27, 2019 at 11:21 AM Lucas Ashby ***@***.***> wrote:
boolean coercion here?
You mean like if not phonemic_count and not phonetic count as in the if
block below it?
I could potentially set up _call_scrape() so it always returns an int and
then remove the files when that int is less than 100, but the idea behind
having these separate if blocks and the reason _call_scrape() can return
either an int or None is so we can log the different ways languages can
fail.
languages/wikipron/README.md

```diff
 that were passed to scrape. [languages.json](./src/languages.json) provides
 [scrape.py](./src/scrape.py) with a dictionary containing the information it
 needs to call scrape on all Wiktionary languages with over 100 entries as well
 as to generate the previously mentioned [README](./tsv/README.md).
 [codes.py](./src/codes.py) is used to generate
 [languages.json](./src/languages.json). It queries Wiktionary for all languages
 with over 100 entries. It also outputs
-[failed\_langauges.json](./src/failed_languages.json), a list of languages on
+[unmatched\_langauges.json](./src/unmatched_languages.json), a list of languages on
```
langauges
languages/wikipron/src/codes.py

```diff
-failed_languages = {}
-with open("languages.json", "r") as source:
+unmatched_languages = {}
+with open(LANGUAGES_PATH, "r") as source:
     prev_languages = json.load(source)
 with open("iso-639-3_20190408.tsv", "r") as source:
```
Module-level constant?
languages/wikipron/src/codes.py

```python
failed_dict = json.dumps(failed_languages, indent=4)
failed.write(failed_dict)
# All languages that failed to be matched with data in ISO 639 TSV file.
with open("unmatched_languages.json", "w") as unmatched:
```
module-level constant
```python
languages_summary_list.append([file_path] + row)
readme_list.append([f"[TSV]({file_path})"] + row)

# Sort by wiktionary language name,
```
Wiktionary
```python
# Sort by wiktionary language name,
# with phonemic entries before phonetic.
languages_summary_list.sort(key=lambda ele: ele[3] + ele[5])
```
Okay, so now you're using this twice, so probably define it outside of main as `_3_and_5` or something like that?
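A sketch of such a module-level key (the name and the tuple form are my choice here; a tuple key like `(ele[3], ele[5])` sorts by name then level, which can differ from the concatenated `ele[3] + ele[5]` in unusual edge cases):

```python
def _wiktionary_name_and_level(ele):
    # Index 3 is the Wiktionary language name, index 5 the transcription
    # level, so "phonemic" rows sort before "phonetic" within a language.
    return (ele[3], ele[5])

# Invented sample rows in the shape described in the review.
rows = [
    ["f1", "est", "Estonian", "Estonian", "True", "phonetic", "900"],
    ["f2", "est", "Estonian", "Estonian", "True", "phonemic", "800"],
]
rows.sort(key=_wiktionary_name_and_level)
```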
languages/wikipron/src/scrape.py

```python
    return 0


def build_config_and_filter_scrape_results(
```
maybe too long, also maybe you want `_` at the beginning
Other than that, looks good to me.
LGTM. please use squash-and-merge.
(Sorry for the wall of text...)
These changes are meant to address issues #66, #67, #68 as well as a few suggestions made in the comments of pull request #61.
Here are the larger changes introduced by this pull request:
- Separated the scraping and writing part of `scrape.py` (formerly `scrape_and_write.py`). `write.py` now generates the README table by inspecting the contents of the `tsv/` directory. In the process it also creates a tsv, `readme_tsv.tsv`, with similar information as is in the README table.
- Added `no_stress`, `no_syllable_boundaries`, and `cut_off_date` options to languages in `languages.json`. Changed `codes.py` to specify default values for these options when adding new languages to `languages.json` and to copy over previously set values for these options. `cut_off_date` should now be set in `codes.py` prior to running `codes.py`; I've updated the README in languages/wikipron with those instructions.
- Added the `dialect` config option (and `require_dialect_label` option) to English, Spanish and Portuguese, and changed `scrape.py` to handle when one or more dialects are specified for a language. (Ran this new code on Portuguese, because it is a smaller language, to generate some sample data.)

The only small changes worth noting are:

- `scrape.py` will now also output to `scraping.log`, which I've added to `.gitignore`. This way finding the languages that failed to be scraped is a bit easier (no need to scroll through the console). It also outputs the language dict from `languages.json` in the error message for languages that failed to be scraped within our set amount of retries, so it is a bit easier to build a temporary `languages.json` with the failed languages.
- `scrape.py` will now remove files with less than 100 entries. (TSVs with less than 100 entries have been removed.)

I have a few questions regarding dialects that I'd like your thoughts on:
1. How dialects are handled in `languages.json`. I've set up `languages.json` in the following way: the keys (`_bz`, `_po`) in `dialect` serve as a sort of extension when naming the dialect tsv files (`por_bz_phonetic.tsv`, for example) and help with easy access to the dialect strings (`"Brazil"`) in `write.py`. If you'd like me to change any of the keys because you'd like different extensions for certain dialects let me know. They can be longer than two letters. I'll provide links to the English, Spanish, and Portuguese entries in `languages.json` as a separate comment so you can review the keys and dialect strings I'm using.
2. How dialects are handled in `scrape.py`. `scrape.py` will first scrape a language entirely and then scrape for dialects if any are specified. This means it will scrape `por` (Portuguese) with no dialect, then `por` with `"Brazil"` as the dialect and then `por` with `"Portugal"` as the dialect. Is there any reason to scrape for `por` with no dialect when we are specifying a dialect? Do we want to keep the tsvs generated from previously scraping `por` (or `eng`/`spa`) with no dialect?
3. In `scrape.py`, I moved a lot of what was in `main()` to a separate function in order to handle dialects. There may well be a better way of handling dialects than the way I've tried to do it and I'm open to suggestions on how to improve it.
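As a concrete illustration of the dialect-key naming scheme described above (the helper name is hypothetical, not part of the PR):

```python
def dialect_tsv_name(iso_code, dialect_key, level):
    # e.g. "por" + "_bz" + "phonetic" -> "por_bz_phonetic.tsv",
    # matching the naming scheme described for dialect tsv files.
    return f"{iso_code}{dialect_key}_{level}.tsv"
```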