Add config options to big scrape, separate scraping and writing within big scrape. #76

Merged
13 commits merged into CUNY-CL:master on Oct 29, 2019

Conversation

@lfashby (Collaborator) commented Oct 25, 2019

(Sorry for the wall of text...)
These changes are meant to address issues #66, #67, and #68, as well as a few suggestions made in the comments of pull request #61.

Here are the larger changes introduced by this pull request:

  • Separated the scraping and writing parts of scrape.py (formerly scrape_and_write.py).

    • write.py now generates the README table by inspecting the contents of the tsv/ directory. In the process it also creates a TSV file, readme_tsv.tsv, with similar information to the README table.
  • Added no_stress, no_syllable_boundaries, and cut_off_date options to languages in languages.json.

    • Modified codes.py to specify default values for these options when adding new languages to languages.json and to copy over previously set values for these options.
    • cut_off_date should now be set in codes.py prior to running it; I’ve updated the README in languages/wikipron with those instructions.
  • Added dialect config option (and require_dialect_label option) to English, Spanish and Portuguese.

    • Restructured scrape.py to handle when one or more dialects are specified for a language. (Ran this new code on Portuguese, because it is a smaller language, to generate some sample data.)
    • The README table now includes dialect information in the Wiktionary language name column.

The only small changes worth noting are:

  • Logging in scrape.py will now also output to scraping.log, which I’ve added to .gitignore. This way, finding the languages that failed to be scraped is a bit easier (no need to scroll through the console). It also outputs the language dict from languages.json in the error message for languages that failed to be scraped within our set number of retries, so it is a bit easier to build a temporary languages.json with the failed languages.
  • scrape.py will now remove files with fewer than 100 entries. (TSVs with fewer than 100 entries have been removed.)

I have a few questions regarding dialects that I'd like your thoughts on:

  • How dialects are handled in languages.json.

    • I added dialects to languages.json in the following way:
      "por": {
          ...
          "require_dialect_label": true,
          "dialect": {
              "_bz": "Brazil",
              "_po": "Portugal"
          },
          ...
      },
      
    • The keys (_bz, _po) in dialect serve as a sort of extension when naming the dialect TSV files (por_bz_phonetic.tsv, for example) and help with easy access to the dialect strings ("Brazil") in write.py. If you'd like me to change any of the keys because you'd like different extensions for certain dialects, let me know. They can be longer than two letters. I'll provide links to the English, Spanish, and Portuguese entries in languages.json as a separate comment so you can review the keys and dialect strings I'm using.
    • Is there a process for finding which dialects are frequently used within a given Wiktionary language category? (Aside from just checking entries and seeing whether dialect information is specified.)
  • How dialects are handled in scrape.py.

    • As written, scrape.py will first scrape a language entirely and then scrape for dialects if any are specified. This means it will scrape por (Portuguese) with no dialect, then por with "Brazil" as the dialect, and then por with "Portugal" as the dialect. Is there any reason to scrape por with no dialect when we are specifying a dialect? Do we want to keep the TSVs generated from previously scraping por (or eng/spa) with no dialect?
    • Within scrape.py, I moved a lot of what was in main() to a separate function in order to handle dialects (the flow is sketched below this list). There may well be a better way of handling dialects than the way I've tried, and I'm open to suggestions on how to improve it.
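
A rough sketch of that flow (simplified; _scrape_one_language here is just an illustrative stand-in for the function factored out of main()):

    import json

    def _scrape_one_language(iso_code, language, dialect=None):
        """Illustrative stand-in for the function factored out of main()."""
        ...

    with open("languages.json", "r") as source:
        languages = json.load(source)

    for iso_code, language in languages.items():
        # Scrape the language as a whole, with no dialect restriction.
        _scrape_one_language(iso_code, language)
        # Then scrape once per dialect, e.g. "_bz": "Brazil" -> por_bz_phonetic.tsv.
        for dialect_key, dialect_name in language.get("dialect", {}).items():
            _scrape_one_language(iso_code, language, dialect=dialect_name)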

@lfashby (Collaborator, Author) commented Oct 25, 2019

This is a link to the README table so you can see how Portuguese looks with dialect info in the Wiktionary name column. I don't think it would make sense to add another column for dialect info as the table is already quite wide and it would only apply to a few languages.

This is a link to the tsv of the README table. It contains mostly the same information as is in the README, but just lists the file name in the link column instead of a relative path in a markdown link. I could potentially add headers if that is useful. A separate column for dialects may be relevant in this format.

@kylebgorman (Collaborator):

Just wanted to comment on some of the API choices before I dive into code. This is a lot of stuff at once, congrats on getting it all together so fast.

  • I don't care for the name readme_tsv.tsv; it duplicates "tsv" (duh) and the fact that it is the same data as the README table is irrelevant to me. What is relevant to me as a reader is that it's a TSV file that contains the counts of data for all the languages. (languages.tsv maybe?)

  • Can you rename write.py to something more informative? Maybe summarize.py or generate_summary.py since that's what it does?

  • I personally have no use for the "all-dialectal" versions of English, Portuguese, and Spanish, and if someone wants them, they should just concatenate (and sort) the dialectal ones, right? So I'd suggest that if dialect is specified it should pre-empt an all-dialects scrape. (If someone really wanted one they could add THAT to the dialect specification in languages.json.)

  • Your approach to adding dialect suffixes is fine, but I would suggest that your code should add the _ between the language and the dialect code, rather than including it in the JSON file, which feels like an abstraction leak. So it would look like

    "dialect": {
    "us": "US | General American",
    "uk": "UK | Received Pronunciation"
    },

but would still produce en_us_phonemic.tsv etc.
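
A minimal sketch of what that could look like (the helper name and arguments are assumptions, not the actual code):

    def _dialect_file_name(wiktionary_code, dialect_key, transcription_level):
        # The underscore separator lives in the code, not in languages.json,
        # so a "us" key still yields "en_us_phonemic.tsv".
        return f"{wiktionary_code}_{dialect_key}_{transcription_level}.tsv"

    assert _dialect_file_name("en", "us", "phonemic") == "en_us_phonemic.tsv"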

"casefold": None,
"no_stress": True,
"no_syllable_boundaries": True,
"cut_off_date": CUT_OFF_DATE
Collaborator:

apply reflow-er? I see there's no trailing comma here ;)

with open("languages.json", "r") as source:
prev_languages = json.load(source)
with open("iso-639-3_20190408.tsv", "r") as source:
iso_list = csv.reader(source, delimiter="\t")
for (lang_page_title, total_pages) in _cat_members():
for (lang_page_title) in _cat_members():
Collaborator:

remove extraneous parens here (or I think reflow-er will take care of that)

# Set CUT_OFF_DATE below.
# CUT_OFF_DATE can be no later than the date
# at which you plan to begin scraping.
CUT_OFF_DATE = "2019-11-01"
Collaborator:

I'm a little confused why this is given as a default here. Why not use datetime.datetime.today().isoformat() or something? (For our particular case where we really want to do a cut-off-date'd scrape we can just search-and-replace on languages.json, right?)
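
For instance (a minimal sketch of that default in codes.py):

    import datetime

    # Default the cut-off to the moment codes.py is run,
    # rather than a hard-coded literal date.
    CUT_OFF_DATE = datetime.datetime.today().isoformat()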

Collaborator (Author):

For some reason I interpreted #68 as “Language entries in languages.json should contain all config options that do not use default values set in config.py.”

So I guess I wasted some time doing that. It would be simpler if the language entries looked like they did before:

    "est": {
        "iso639_name": "Estonian",
        "wiktionary_name": "Estonian",
        "wiktionary_code": "et",
        "casefold": true
    },

Or this if a dialect is used:

    "eng": {
        "iso639_name": "English",
        "wiktionary_name": "English",
        "wiktionary_code": "en",
        "casefold": true,
        "dialect": {
            "us": "US | General American",
            "uk": "UK | Received Pronunciation"
        }
    },

And then config_settings in scrape.py would handle setting no_stress, no_syllable_boundaries, and cut_off_date, since they are always the same. Then we can just set cut_off_date with datetime in scrape.py.
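
Something along these lines, as a sketch only (the exact signature of config_settings isn't settled):

    import datetime

    def config_settings(language):
        """Fills in the options that are always the same for every language."""
        config = dict(language)
        config.setdefault("no_stress", True)
        config.setdefault("no_syllable_boundaries", True)
        config.setdefault("cut_off_date", datetime.date.today().isoformat())
        return config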

@@ -21,6 +21,7 @@
* Dialect information may also need to be added manually
"""

# TODO Generate and use a lookup table for the iso639-3 tsv file
Collaborator:

Generic minor comment throughout this change: TSV when you're talking about it in a comment (okay to have a variable name like tsv cuz snake_casing conventions).

# Removes TSV files with less than 100 entries.
if phonemic_count < 100:
    logging.info(
        '"%s", has less than 100 entries in phonemic transcription.',
Collaborator:

This is going to say:

Amharic, has less than 100 entries...

Is the comma there intended?



if __name__ == "__main__":
    logging.basicConfig(
        format="%(filename)s %(levelname)s: %(asctime)s - %(message)s",
        handlers=[
            logging.FileHandler("scraping.log", mode="w"),
Collaborator:

possibly put filename path here into a module-level constant



def _readme_insert(wiki_name, row_string):
    with open("../tsv/README.md", "r+") as source:
Collaborator:

suggest this path as a module-level constant
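
For example (a sketch; README_PATH is an assumed name):

    # Module-level constant so the path is defined in exactly one place.
    README_PATH = "../tsv/README.md"

    def _readme_insert(wiki_name, row_string):
        with open(README_PATH, "r+") as source:
            ...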

# Rewrite the entire README.
source.seek(0)
source.truncate()
source.write("".join(readme_list))
Collaborator:

If each element in readme_list were already a complete line (ending in a newline) you could use the more-efficient writelines (which takes an iterable of strings and doesn't build up the whole big string at once). But this part is so complex I'm afraid to ask you to kick the hornet's nest...
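
A sketch of that alternative (assuming each element of readme_list already ends in a newline):

    # Rewrite the entire README without first joining everything into one string.
    with open("../tsv/README.md", "r+") as source:
        source.seek(0)
        source.truncate()
        source.writelines(readme_list)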

Collaborator (Author):

I'll look into this.

file.rindex("_") + 1 : file.index(".")
].capitalize()
wiki_name = languages[iso639_code]["wiktionary_name"]
# Assumes we will not remove eng and spa tsv files
Collaborator:

Is this comment still current? Will it become irrelevant shortly? If so remove.

def sorting(ele):
    return ele[3]
readme_tsv_list.sort(key=sorting)
for lang_row in readme_tsv_list:
Collaborator:

Strongly suggest you use csv.writer for this (you just need to tell it you're using a "\t" separator). Then call its writerows method with readme_tsv_list as an argument and you're free of this loop too!
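
A sketch of that suggestion ("languages.tsv" stands in for whatever the summary file ends up being called):

    import csv

    with open("languages.tsv", "w", newline="") as sink:
        writer = csv.writer(sink, delimiter="\t")
        # One row per language; no manual loop needed.
        writer.writerows(readme_tsv_list)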

@kylebgorman (Collaborator) left a review:

Reviews scattered throughout...only serious concerns are about file naming etc. I'm traveling this weekend so ping on email if you need timely attention. Things looking quite good overall.

@lfashby (Collaborator, Author) commented Oct 25, 2019

Okay, I'll start working on this tomorrow and just reply to a few comments now.

@lfashby (Collaborator, Author) commented Oct 25, 2019

In addition to changing the name of readme_tsv.tsv, I think I should also change where it gets written. Right now it just gets placed in the src directory, but I presume the tsv directory is the more appropriate place.

@kylebgorman (Collaborator) commented Oct 25, 2019 via email

@lfashby (Collaborator, Author) commented Oct 25, 2019

I think these changes touch on all the comments Kyle made. Aside from formatting and file name stuff, the big differences are:

  • I renamed and reworked write.py (now generate_summary.py) to be more efficient. Formerly it would rewrite the README (by reading it in, splitting the rows into a list of lists, inserting a new row, converting the list of lists into a big string and writing that string) every time we iterated through a file in the tsv directory. Now it just constructs a list of lists once, converts those internal lists to strings once, and writes an all new README once. It's perhaps still not ideal, but I feel better about it.

  • I added what I thought were important paths as constants to codes.py. (Not sure if I'm using module-level constants correctly.)

  • I removed English, Spanish, and Portuguese TSVs that don't use a dialect (in preparation for the next scrape).

@kylebgorman (Collaborator) left a review:

Couple minor things left, but looking good. @jacksonllee, do you have any final thoughts on this?


# Sort by wiktionary language name,
# with phonemic entries before phonetic
def sorting(ele):
Collaborator:

still just use a lambda here.

Collaborator (Author):

As in:

languages_summary_list.sort(key=lambda ele: ele[3] + ele[5])
readme_list.sort(key=lambda ele: ele[3] + ele[5])

Or something along the lines of:

f = lambda ele: ele[3] + ele[5]
languages_summary_list.sort(key=f)
readme_list.sort(key=f)

The latter is 'DRY-er' but isn't assigning a lambda to a variable an anti-pattern?

os.remove(phonemic_path)
if os.path.exists(phonetic_path):
Collaborator:

Just noticed this. This is a "dark pattern" because it is theoretically possible for the file to blink in or out of existence between the check for existence and the remove. The recommended pattern is instead to do a try: os.remove(...) and then catch and ignore the exception (OSError probably) if it fails. I realize this is an annoying one...
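
Concretely, something like (using phonetic_path from the snippet above):

    import os

    # Attempt the removal and ignore the error if the file is already gone.
    try:
        os.remove(phonetic_path)
    except OSError:
        pass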

Collaborator (Author):

Found this from googling:

from contextlib import suppress

with suppress(OSError):
    os.remove(filename)

Seems nicer than an explicit try/except that ignores the exception. Let me know if you have a preference. Is it the os.path.exists() check that makes the file potentially blink in or out of existence? Do I need to add this to all os.remove() statements or just the ones that involve os.path.exists()?

phonemic_readme_row_string = (
"| " + " | ".join(phonemic_row) + " |\n"
return
# Remove files for languages that returned nothing
Collaborator:

period for consistency on this comment I suppose

"| " + " | ".join(phonemic_row) + " |\n"
return
# Remove files for languages that returned nothing
elif phonemic_count == 0 and phonetic_count == 0:
Collaborator:

probably if not phonemic_count and not phonetic_count or something like that.

@kylebgorman (Collaborator) commented Oct 25, 2019 via email

@lfashby (Collaborator, Author) commented Oct 25, 2019

I'm not sure what reflower or reflow-er is and searching 'reflower python' doesn't seem to bring up anything useful.

@kylebgorman (Collaborator) commented Oct 26, 2019 via email

… in generate_summary.py, reworked removing files in scrape.py
@lfashby (Collaborator, Author) commented Oct 26, 2019

I made the above changes and ran white. There were a few things I could do with the os.remove try-except block but I just went with the most straightforward method and added some comments above it.

@jacksonllee (Collaborator) left a review:

Just a few minor things (plus a new, separate bug unrelated to this PR).

Wiktionary languages with over 100 entries. It writes the results of those
scrape calls to TSVs and generates a [README](./tsv/README.md) with selected
information regarding the contents of those TSVs and the configuration settings
Wiktionary languages with over 100 entries. [write.py](./src/write.py)
Collaborator:

write.py -> generate_summary.py?

2. Run [scrape.py](./src/scrape.py).
3. Run [write.py](./src/write.py).
Collaborator:

generate_summary.py

readme_list = []
languages_summary_list = []
path = "../tsv"
for file in os.listdir(path):
Collaborator:

file -> file_path or something?

adaga a d a ɡ a
adenina a d e n i n a
adeus a d e w s
adição a d͡ ʒ i s ɐ̃ w
Collaborator:

Oops, we forgot about having to correctly handle the tie bar for affricates. I'm opening an issue for this now. (Edit: Just opened #78)

@lfashby (Collaborator, Author) commented Oct 27, 2019

> boolean coercion here?

You mean like if not phonemic_count and not phonetic_count as in the if block below it?

I could potentially set up _call_scrape() so it always returns an int and then remove the files when that int is less than 100, but the idea behind having these separate if blocks and the reason _call_scrape() can return either an int or None is so we can log the different ways languages can fail.
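
Very roughly, the distinction I mean (a simplified sketch; the real code tracks phonemic and phonetic counts separately):

    count = _call_scrape(language, config)
    if count is None:
        # The scrape itself failed (e.g. it ran out of retries); log the
        # language dict so a temporary languages.json is easy to rebuild.
        logging.error("Failed to scrape: %s", language)
    elif not count:
        # The scrape ran but returned nothing; remove the empty files.
        ...
    elif count < 100:
        # Fewer than 100 entries; remove the files as too small to keep.
        ...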

@kylebgorman (Collaborator) commented Oct 27, 2019 via email

that were passed to scrape. [languages.json](./src/languages.json) provides
[scrape.py](./src/scrape.py) with a dictionary containing the information it
needs to call scrape on all Wiktionary languages with over 100 entries as well
as to generate the previously mentioned [README](./tsv/README.md).
[codes.py](./src/codes.py) is used to generate
[languages.json](./src/languages.json). It queries Wiktionary for all languages
with over 100 entries. It also outputs
[failed\_langauges.json](./src/failed_languages.json), a list of languages on
[unmatched\_langauges.json](./src/unmatched_languages.json), a list of languages on
Collaborator:

langauges

failed_languages = {}
with open("languages.json", "r") as source:
unmatched_languages = {}
with open(LANGUAGES_PATH, "r") as source:
prev_languages = json.load(source)
with open("iso-639-3_20190408.tsv", "r") as source:
Collaborator:

Module-level constant?

failed_dict = json.dumps(failed_languages, indent=4)
failed.write(failed_dict)
# All languages that failed to be matched with data in ISO 639 TSV file.
with open("unmatched_languages.json", "w") as unmatched:
Collaborator:

module-level constant

languages_summary_list.append([file_path] + row)
readme_list.append([f"[TSV]({file_path})"] + row)

# Sort by wiktionary language name,
Collaborator:

Wiktionary


# Sort by wiktionary language name,
# with phonemic entries before phonetic.
languages_summary_list.sort(key=lambda ele: ele[3] + ele[5])
Collaborator:

Okay, so now you're using this twice, so probably define it outside of main as _3_and_5 or something like that?
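
For example (a sketch):

    # Module-level key function shared by both sorts.
    def _3_and_5(ele):
        # Sort by Wiktionary language name, with phonemic entries before phonetic.
        return ele[3] + ele[5]

    languages_summary_list.sort(key=_3_and_5)
    readme_list.sort(key=_3_and_5)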

return 0


def build_config_and_filter_scrape_results(
Collaborator:

maybe too long, also maybe you want _ at the beginning

@kylebgorman (Collaborator):

Other than that, looks good to me.

@kylebgorman self-requested a review October 28, 2019 20:10
@kylebgorman (Collaborator) left a review:

LGTM. Please use squash-and-merge.

@lfashby merged commit 8010798 into CUNY-CL:master Oct 29, 2019
@lfashby deleted the config_options branch October 29, 2019 00:07