Reduced eng_us_phonemic.phones to eng branch (#336) · CUNY-CL/wikipron@f5c05d0

Reduced eng_us_phonemic.phones to eng branch (#336)

* Update to Latin pron selector (#183)

* minor change to latin extraction function, rescraped Latin

* potential fix to lat scraping issue

* raw scrape of latin

* postprocessing of new latin data

* updated changelog, fixed line length error

* rescrape of latin

* postprocessing of updated latin data

* [pox] Scraped Polabian. (#186)

* [pox] Scraped Polabian.

Note: The ISO 639-3 code is `pox`, the older ISO 639-2 code is `sla`.

* Updated CHANGELOG.

* [mnc] Scraped Manchu. (#185)

* [mnc] Scraped Manchu.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Merged Whitelist functionality with src/scrape.py. Now checks for pre… (#184)

* Merged Whitelist functionality with src/scrape.py. Now checks for presence of whitelist and writes separate tsv as {original file name}_filtered.tsv. Update generate_summary to reflect if file is filtered through a whitelist. CHANGELOG and README update accordingly.

* Style tweaks and cleanup.

* Updated generalized_split and postprocess to reflect automatic whitelist processing in scrape. Fixed dialect issue in generate_summary.

* Previous edits didn't carry.

* Cleanup typo mistakes. Added error handling to scrape.py.

* Style clean-up.

* Fixed style issues.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>
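
The whitelist filtering described in #184 can be pictured roughly as follows. This is only a minimal sketch, not the code in `src/scrape.py`: the function names `filter_rows` and `filtered_name` are hypothetical, and it assumes each TSV row holds a word and a space-separated pronunciation.

```python
from typing import Iterable, Iterator, List, Set


def filter_rows(
    rows: Iterable[List[str]], allowed: Set[str]
) -> Iterator[List[str]]:
    """Yields only rows whose pronunciation uses whitelisted segments."""
    for word, pron in rows:
        # A row survives only if every segment is in the whitelist.
        if all(segment in allowed for segment in pron.split()):
            yield [word, pron]


def filtered_name(path: str) -> str:
    """Maps `name.tsv` to `name_filtered.tsv` (assumed naming scheme)."""
    stem = path[: -len(".tsv")] if path.endswith(".tsv") else path
    return f"{stem}_filtered.tsv"
```

Keeping the filtered rows in a separate `_filtered.tsv` file leaves the unfiltered scrape untouched, which matches the behavior the commit message describes.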

* Imperial aramaic (#187)

* [arc] Listing the correct scripts for Imperial Aramaic:

1. The original Aramaic script (`armi`).
2. The square script as in Biblical Aramaic (`hebr`).
3. Classical Syriac/Assyrian Neo-Aramaic (`syrc`) descended from (1).

This correctly assigns the entries to their respective lexicons. Most
pronunciations are available for (2), with only a small number of
entries for (1) and (3).

* [arc] Listing the correct scripts for Imperial Aramaic (continuing the
previous commit which was partial):

1. The original Aramaic script (`armi`).
2. The square script as in Biblical Aramaic (`hebr`).
3. Classical Syriac/Assyrian Neo-Aramaic (`syrc`) descended from (1).

This correctly assigns the entries to their respective lexicons. Most
pronunciations are available for (2), with only a small number of
entries for (1) and (3).

* Updated CHANGELOG with #186.

* Add --no-tone flag (#188)

* tentative solution for tone removal

* updates changelog, ran white on test_config.py

* remove print statement from test_config.py

* partial replace of codepoints with chars, adds nfd/nfc conversion

* reworks import statements

* updates _TONES_REGEX

* ran white on config.py

* updates to conversions and adds comments

* fixes to scrape.py comment length

* converted test_config.py no_tone tests to nfd strings

* modifies no_tone process not to skip removing superscript parentheses around non-tone superscript chars
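
As a rough illustration of the tone-removal idea in #188, the sketch below decomposes to NFD, strips tone marks with a regular expression, and recomposes to NFC. The pattern here is an assumption made for illustration (Chao tone letters and superscript digits only); it is not the project's `_TONES_REGEX`, and `strip_tones` is a hypothetical name.

```python
import re
import unicodedata

# Illustrative pattern only: Chao tone letters (U+02E5 to U+02E9) and
# superscript digits. The real pattern may cover more characters.
TONE_PATTERN = re.compile(
    "[\u02E5-\u02E9\u2070\u00B9\u00B2\u00B3\u2074-\u2079]+"
)


def strip_tones(pron: str) -> str:
    """Removes tone marks from a pronunciation string."""
    # Decompose first so that combining marks are separate code points.
    decomposed = unicodedata.normalize("NFD", pron)
    stripped = TONE_PATTERN.sub("", decomposed)
    # Recompose so the output is in a single, predictable form.
    return unicodedata.normalize("NFC", stripped)
```

The NFD step would start to matter if combining tone diacritics were added to the pattern, since they are only guaranteed to appear as separate code points after decomposition.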

* Rename (#192)

* [geo] Rescrape post-bot.

Closes #138.

* Add changelog

* Rename.

* Update CHANGELOG

* Revert "[geo] Rescrape post-bot."

This reverts commit 4a151b13e0e03e7a4aecb7dad29c1de9c2230f10.

* Flattens directory structure for data. (#194)

* Flattens directory structure for data.

The non-wiki data is moved to the new `wikipron-extras` (https://github.com/kylebgorman/wikipron-extras) repository.

Closes #193.

* Add PR number to changelog.

* "Imperial"

* [geo] Rescrape post-bot. (#191)

* [geo] Rescrape post-bot.

Closes #138.

* Add changelog

* Update changelog

* [geo] Add whitelist and re-scrape.

* Renames for merge.

* Add link to guidelines

* [hun] Adds whitelist.

* Simplify postprocess

* Enforces consistent style in logging using %r. (#196)

* Enforces consistent style in logging using %r.

* Updates CHANGELOG

* Fixes a double-quoted logging var.
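
For readers unfamiliar with the `%r` convention from #196, a minimal sketch follows. The file name is hypothetical; the point is only that the value is passed as a logging argument and rendered with `repr`, so no manual quoting is needed.

```python
import logging

logging.basicConfig(level=logging.INFO)

path = "tsv/example_phonemic.tsv"  # Hypothetical file name.

# The argument is passed separately, so formatting is deferred until the
# record is actually emitted; %r renders it with repr(), quotes included.
logging.info("Writing %r", path)
```

Wrapping `%r` in extra quotes would show the quotes twice, which is presumably the kind of "double-quoted logging var" fixed above.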

* Filtering (#199)

* [rum] Add whitelist and rescrape.

* [eng] Adds English rescrape.

* [dut] Adds Dutch rescrape.

* [gre] Adds Greek rescrape.

* Updates scrape path for phonetic filtering.

Closes #195.

* [rum] Adds Romanian rescrape.

* [arm] Adds Armenian rescrape.

* [gre] Adds Greek rescrape (second try).

* [arm] Adds Armenian dialects + rescrapes.

Closes #197.

* Adds CHANGELOG changes.

* [spa] Adds Spanish rescrape.

* Postprocess and regenerate summaries.

* [aar, bdq, jje, lsi] discovers new languages and scrapes them. (#202)

* Added tyv to languagecodes.py (#203)

* adds tuvan to languagecodes.py

* updates changelog

* Fall scrape (#204)

* [aar, bdq, jje, lsi] discovers new languages and scrapes them.

* Fall scrape.

* Fuller bib information

Fills out the bibliography entry for the WikiPron paper.

* Updates to codes.py (#205)

* updated languages.json and json files for translating between Wiktionary code and ISO code

* updates codes.py and languagecodes.py

* modifies test_languagecodes.py to reduce redundancy with codes.py

* small formatting fixes

* updates changelog

* logging statement formatting

* Update README.md

Fixes formatting issue in table. Not sure why this had to be done manually...

* ENH rename '.whitelist' as '.phones' (#207)

* Uses %r everywhere in `data/src`. (#210)

* Nepali support (#211)

* Uses %r everywhere in `data/src`.

* [nep] Adds Nepali data.

Closes #209.

* Update changelog

* [fre] Adds phoneme list (#213)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* [izh] Scrape and add Ingrian. (#215)

* [izh] Scrape and add Ingrian.

* Updated CHANGELOG.

* [ban] Splitting Balinese into Latin and Balinese scripts. (#214)

* [ban] Splitting Balinese into Latin and Balinese scripts.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [kir] Split Kyrgyz into Cyrillic and Arabic scripts. (#216)

* [kir] Split Kyrgyz into Cyrillic and Arabic scripts.

* Updated.

* Added fre_phonemic_filtered.tsv (#217)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Refresh the database size counter. (#220)

* [khb] Customized extractor and re-scraping of Lü. (#219)

* [khb] Adding customized extractor for Lü.

* [khb] Re-scraping and updating the data and summaries.

* Updated CHANGELOG.

* Reordered imports.

* [khb] Adding scrape smoke test.

* Resorted.

* FIX specify UTF-8 in handling text files (#221)

It looks like Windows users have run into encoding errors:
they hit `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in
position 2882: character maps to <undefined>` when pip installing
wikipron, with the error triggered in setup.py.
While we're at it, we specify UTF-8 encoding for all open() calls
for text processing as well.

Co-authored-by: jacksonllee <jacksonlunlee@gmail.com>
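
The fix in #221 amounts to never relying on the platform's default encoding. A minimal sketch, with hypothetical helper names `write_text` and `read_text`:

```python
def write_text(path: str, text: str) -> None:
    """Writes text as UTF-8 regardless of the platform default."""
    with open(path, "w", encoding="utf-8") as sink:
        sink.write(text)


def read_text(path: str) -> str:
    """Reads text as UTF-8 regardless of the platform default."""
    with open(path, "r", encoding="utf-8") as source:
        return source.read()
```

Without the explicit `encoding` argument, `open()` uses a locale-dependent default, which on some Windows setups is a legacy code page that cannot decode IPA text.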

* [mga] New scrape: Middle Irish. (#224)

* [mga] New scrape: Middle Irish.

* Updated CHANGELOG.

* [cos] New scrape: Corsican. (#222)

* [cos] Add Corsican to the language code registry.

* [cos] Scraped Corsican and updated the language descriptions.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [okm] New scrape: Middle Korean (#223)

* [okm] Adding ISO 639-3-only Middle Korean: Korean, Middle (10th–16th centuries).

* [okm] New scrape of Middle Korean and update of indices and descriptions.

* Updated CHANGELOG.

* Fixed typo.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [opt] New scrape: Old Portuguese (aka Galician-Portuguese). (#225)

* Adding Old Portuguese (aka Galician-Portuguese) codes.

* [opt] New scrape.

* [opt] Updated summaries.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Added Serbo-Croatian phonemes and filtered TSV files. (#227)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [shn] Custom extractor and new scrape for Shan (#229)

* [shn] Adding customized extractor for Shan.

* [shn] Adding smoke test.

* [shn] New scrape for Shan.

* [shn] Updated descriptions.

* Updated CHANGELOG.

* [tyv] New scrape: Tuvan (#228)

* [tyv] Tuvan scrape.

* [tyv] Updated descriptions.

This also fixes a buggy previous merge of `okm`.

* [tyv] Filtering Tuvan to use Cyrillic script only.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Reorganizes tests and adds a few initial tests for the data side (#226)

* moves wikipron module tests into subdirectory

* reformatting of test_version.py

* adds outline of test for data naming conventions, removes nonsense from src/scrape.py

* basic framework for testing file creation involved in big scrape

* renamed file naming test and added comments

* reorganizes tests directory, adds test for generate_summary.py

* fix formatting in test_version.py

* revises and renames file for testing scrape

* fixes pathing issue in init

* adds some typing to new tests

* changes open statements to use proper encoding

* potential solution to circleci module error

* approaching a circleci import solution?

* updates changelog

* [hbs] Fix file naming.

* Update README.md

* [gre] Takes advantage of upstream consistency fix.

Closes #198.

* [lat] Split Latin into its dialects (#233)

* ENH handle Latin dialects

* RM remove unwanted Latin TSVs

* ENH add Latin dialect TSVs

* ENH postprocess Latin dialect TSVs

* ENH update data summary readme

* MAINT update changelog

* Add HTTP User-Agent header to API calls (#234)

* add http headers for get requests

* add http headers for get requests in tests/

* change wikipron/scrape.py code to avoid circular imports
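
To show what attaching a User-Agent header to an API call looks like (#234), here is a sketch using only the standard library. It is an assumption-laden illustration: the project's own HTTP client and header value may differ, and both the `USER_AGENT` string and the `build_request` helper are hypothetical. No request is actually sent.

```python
import urllib.request

# Hypothetical identifier; a real one should name the tool and a contact.
USER_AGENT = "example-scraper/0.1 (contact@example.org)"


def build_request(url: str) -> urllib.request.Request:
    """Builds a GET request that identifies the client to the server."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("https://en.wiktionary.org/w/api.php")
```

Defining the header value in one place also makes it easy to reuse in tests, which the second sub-commit above suggests was needed.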

* updated requirements.txt to have the latest dependencies (#238) (#239)

* Update requirements.txt

* Update CHANGELOG.md

* Added support for Python-3.9 (#236) (#240)

* Update requirements.txt

* Update CHANGELOG.md

* Update config.yml

* Update setup.py

* Update CHANGELOG.md

* Add black formatting (#242)

* add black formatting (fix #237)

* update changelog

* [kmr] New scrape: Northern Kurdish (#243)

* [kmr] Adding an entry for Northern Kurdish.

* [kmr] Adding an ISO mapping for Northern Kurdish.

* [kmr] Fresh scrape.

* [kmr] Updated description and summaries.

* [kmr] Updated CHANGELOG.

* [kmr] Lower-cased version.

* [kmr] Silly. Source should be lower-case.

* Update CHANGELOG.md

Minor style fixes to CHANGELOG

* MAINT reorganize changelog (#244)

* Add logging for dialect support for languages requiring custom extraction logic (#245)

* ENH alert the use of custom logic when dialect is specific

* MAINT update changelog

* Add a script to facilitate the creation of .phones files (#246)

* ENH add script to tally phones/phonemes in a TSV

* DOC update readme for the .phones files

* MAINT update changelog

* DOC comments in list_phones.py

* MAINT update changelog

* DOC update docstrings and readme
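
The tallying script added in #246 can be approximated as below. This is a sketch under stated assumptions rather than the contents of `list_phones.py`: it assumes two tab-separated columns with space-separated segments in the second, and `count_phones` is a hypothetical name.

```python
import collections
from typing import Dict, Iterable


def count_phones(lines: Iterable[str]) -> Dict[str, int]:
    """Counts how often each segment occurs in the pronunciation column."""
    counts: Dict[str, int] = collections.Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # Skips blank lines.
        _word, pron = line.split("\t", 1)
        counts.update(pron.split())
    return counts
```

Segments with very low counts are natural candidates for review when drafting a `.phones` file, since they are often transcription errors.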

* Use mypy for type checking (#247)

* ISSUE-241: Ignoring 'env' and '.idea' directories

* ISSUE-241: Added 'mypy' to 'requirements.txt'

* ISSUE-241: Added 'Type checking' step to CircleCI

* ISSUE-241: Fixed mypy issues

* ISSUE-241: Updated documentation

* ISSUE-241: Added mypy to the correct 'requirements.txt'

* ISSUE-241: Ran Black formatter

Also updated the contribution guidelines to include this as a step

* ISSUE-241: Markups

ISSUE-241: Markup - Alphabetised 'requirements.txt'
ISSUE-241: Markup - Log invalid page title
ISSUE-241: Markup - Alphabetised 'test_scrape.py' imports
ISSUE-241: Markup - Added explanatory comment
ISSUE-241: Markup - Improved 'config_dict' typing
ISSUE-241: Markup - Improved 'scrape.py' typing

* ISSUE-241: Markup - Using logger interpolation

* ISSUE-241: Markups

* ISSUE-241: Markup - Added working dir to Circle CI config

* split tildes; resort (#250)

* split tildes; resort

* update CHANGELOG.md

* Improve CircleCI workflow with orbs (#249)

* Convert to matrix CircleCI workflow

* Fix typo in parameter

* Add missing job name

* Add CircleCI test storage

* Add Python orb and caching

* Fix orb command

* Set Python deps install to global scope

* Bump up Python orb version

* Fix command nesting

* Add package manager to orb command

* Fix pyenv cache failure

* Fix pyenv cache

* Add workspace cache for pip packages

* Fix username typo

* Fix permission error

* Test pre-built CircleCI Docker image

* Test missing site-packages

* Test missing Python dir

* Add verbose pip list

* Add pre test jobs

* Fix parameter substitution in description

* Fix extraneous run

* Add parametrized flake8 and black jobs

* Fix parameter passing

* Fix unreferenced parameter

* Fix pre-test Docker image tag

* Show xml coverage

* Add pre-test Python cache

* Create tsv directory

* Chown /home to circleci

* Fix store_results path

* Rename pre-test jobs

* Improve CircleCI configuration

Add Python orb, matrix jobs and rework workflow structure

* Improve CircleCI configuration

Add Python orb, matrix jobs and rework workflow structure

* Bump up pre-build Python version to 3.9

* Add mypy to pre-build jobs

* Add mypy to build required jobs

* Change pip3 to pip

* Add PR to CHANGELOG.md

* Disable circleci user chowning /home

* Revert "Disable circleci user chowning /home"

This reverts commit eed32d6f3ab9c2094a642cc23967c536ad5bddb5.

* Disable pyenv creation

* Revert "Disable pyenv creation"

This reverts commit 68297c21c1c2f4dc67e2bc9bd7972adbeea3878b.

* Disable pyenv creation

* Test pip cache renewal

* Revert "Test pip cache renewal"

This reverts commit b4772307ded407da0fedfc4320b3594f66d366fa.

Cache works as intended, references
https://github.com/kylebgorman/wikipron/pull/249#discussion_r511582495.

Co-authored-by: Jackson L. Lee <jacksonlunlee@gmail.com>

* Small path changes on the data side, rework of test_scrape.py (#251)

* rework some paths on data side, simplify test_scrape.py

* revert changes to test_summary.py

* updates changelog

* Adding a sanity check for valid IPA (#248)

* Check that the phones/phonemes are valid IPA.

* Only print the bad characters.

* Updated CHANGELOG.

* Reformatted the file using black.

* Reran black with line length limit.

* Phonemes, rather than phones.

* Sorted the packages alphabetically.

* Re-arranged imports.

* Moved ipapy into data-specific requirements file.

* Adding dependency on absl-py (for logging) and factoring out the phoneme
checking functionality into its own function.

* Added a link to IPA chart.

* Removed absl-py.

* Use internal logger.

* Check the logging level.

* Moving to global logger.

Thanks Kyle!

* reformatted.

* Cosmetic: fixed warning message.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Style fixes for list_phones (#254)

* Style fixes for list_phones

* Ran black formatter.

* Remove `<5`.

* Negative flags are renamed to positive statements (#141) (#255)

* Negative flags in cli.py are renamed to positive statements. In order to accommodate this change, Wikipron/config.py and tests/test_wikipron/test_config.py are also edited accordingly.

* positive flags are added and negative flags are renamed to positive ones.

* changelog is updated.

* style edit

* fix redundancy

Co-authored-by: unknown <Yeonju@NYCMAXASIKKAW10.ad.insidemedia.net>

* Clean up flag help and eliminate remaining double negatives (#257)

* Work on flags:

1. Flag help should be short, because people don't read it very
carefully and it's not formatted for multi-sentence input. This shortens
all the flags to a single, consistent name. Because dialect and
segmentation require more information, these details have been moved
into a prominent position in the README instead.
2. The tone and space flags are given negative versions, cf. what Yeonju
did earlier.

* Eliminates double-negative in skip-spaces.

* Updates changelog.

* Updates tests, config, core.

* Fixed missing test_scrape change.

* Adds test for TSV splitting (#256)

* fixes to split.py and postprocess before adding tests

* cleanup of test_split

* updated a few comments in test_split

* revert needless changes to postprocess and split

* minor comment update in test_split

* updates changelog

* Updates data side to use new flags (#258)

* quick fix to small oversight in test_extract.py

* data side uses new flags

* updates changelog, removes config_factory from test_extract.py

* [ita] Adds phoneme list, filtered phonemic TSV file (#261)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [ady] Adds phone list, filtered Adyghe data. (#263)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Moves `list_phones.py`. (#266)

* Moves `list_phones.py`.

Closes #265.

* Add changelog

* [bul] Adds phone list, filtered Bulgarian data (#267)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Adds Icelandic phone list, filtered Icelandic data. (#270)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [slv] Adds Slovenian phoneme list, filtered TSV data. (#273)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Adds normalization to `list_phones.py`, corrects bugs relating to `ipapy` (#275)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Adds Welsh .phones lists, filtered TSV data (#276)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [yue] Handle Cantonese for scraping (#277)

* ENH handle Cantonese for scraping

* MAINT update changelog

* DOC explain Cantonese pron XPath template

* Updates `data/phones/README.md` with instructions to re-scrape (#281)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [vie] Adds Vietnamese `.phones` files, `.tsv` files (#283)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [hin] Adds `phones` file, updated/new TSV files. (#284)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [hbs] Fixes Serbo-Croatian phoneme lists. Re-scrapes data. (#288)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [ofs] Scraped Old Frisian. (#294)

* Add Old Frisian to the configuration.

* Mark "ofs" as ISO639-3 language code.

* Fixed language name.

* Added phonemic pronunciations.

* Updated.

* [aar] Rescraped Afar. (#291)

* [aar] Rescraped.

* Updated.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [dng] Scraped Dungan. (#293)

* [dng] Scraped Dungan.

* Updated.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [bre] Rescraped Breton. (#292)

* [bre] Rescraped.

* Updated.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Covering grammar script (#297)

* Adds covering grammar generator, for future QA work.

Also moves `list_phones.py` to the src directory, which makes sense to
me.

* Changelog update.

* Updates CHANGELOG with issue number.

* Update README.md

Fix syntax highlighting hints.

* [ltg] Scraped Latgalian. (#296)

* [ltg] Scraped Latgalian.

* Forgot to include the actual data.

* Updated.

* Removes reconstructions (#302)

* Adds covering grammar generator, for future QA work.

Also moves `list_phones.py` to the src directory, which makes sense to
me.

* Changelog update.

* Updates CHANGELOG with issue number.

* Skips reconstructions during scraping.

Then, rescrapes Latin to take advantage of this.

* Adds number to changelog.

* Updates CHANGELOG for >>> junk.

* Rescrapes Armenian. (#303)

Closes #301.

* [por] Adds phones files, rescraped TSV files. (#304)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [bur] Adds Burmese phone list, re-scraped Burmese data. (#305)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [mdf] Scrape Moksha + slightly more flexible default pron selector. (#295)

* Support some Moksha pronunciations that reside under "p", rather than
"li".

* Scrape.

* Attempt to fix the test.

* Updated.

* Split the PR into two items.

* [jpn] Adds Japanese .phones file and updated TSV files (#307)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Updates segments version (#308)

* updates segments version and adds test for vietnamese tones

* updates changelog

* [ger] Adds German Phone list, filtered TSV file (#309)

* Create German Phonelist

* Updated CHANGELOG.md

* incorporate updates in README.md, and added missing ger_phone* files

* Adds some whitespace to German phone list comments. (#310)

* [aze] Adds Azerbaijani phone lists and updated TSV data (#312)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [tur] Adds Turkish phone list and updated TSV data (#314)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [afr] adds phone list for Afrikaans and updated TSV files (#316)

* adds afr phone list and rescrapes

* Updated CHANGELOG.md

* [mlt] Adds Maltese phones file and updates TSV data (#318)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Frequency code tire-kick (#320)

* Frequency code tire-kick:

1. Increases typing.
2. No longer overwrites the .tsv files: adds `_freq.tsv` suffix instead.
3. Adds Khmer to the JSON config file.
4. Adds `shared_tasks` subdirectory for targeted config files.
5. Updates README.
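
Item 2 above (writing to a `_freq.tsv` file rather than overwriting the input) could look roughly like the sketch below. This is an illustration only, not the repository's code: the helper name `freq_path` and the example file name are hypothetical.

```python
from pathlib import Path


def freq_path(tsv_path: str) -> Path:
    """Returns a sibling path ending in `_freq.tsv`; the input file is untouched."""
    path = Path(tsv_path)
    # `stem` drops the final `.tsv`, so the suffix is appended to the base name.
    return path.with_name(f"{path.stem}_freq.tsv")


# Hypothetical file name, for illustration only.
print(freq_path("data/tsv/khm_phonemic.tsv"))
```

Deriving the output name from the input keeps the two files side by side, so the original scrape remains available if the frequency step has to be rerun.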

* [lav] Adds Latvian phone list and updated TSV data (#322)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [khm] Adds Khmer phones and updated TSV data (#327)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

* Updates changelog

* Adds Khmer phones and updated TSV data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Moves Latin phonelist to Classical Latin. (#326)

Also undertook a light reorg.

* [nob] Adds Østnorsk (Bokmål) phones and updated TSV data (#330)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

* Updates changelog

* Adds Khmer phones and updated TSV data

* Updates changelog

* Adds Østnorsk (Bokmål) phones and updated TSV data

* Updates changelog

* Fixes typo

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Add English link to language list for frequencies. (#332)

* Partial scrape (#334)

* scrape up to cantonese

* raw partial scrape - excludes yue, rus, cmn

* post-processing on partial scrape, src README fix

* re-ran generate_summary.py after resolving conflicts

* revert comment in scrape.py

* updates changelog, resolves formatting error

* Updates `data/phones/README.md` (#333)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

* Updates changelog

* Adds Khmer phones and updated TSV data

* Updates changelog

* Adds Østnorsk (Bokmål) phones and updated TSV data

* Updates changelog

* Fixes typo

* Update data/phones/README.md

* Update changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [arm] cleaned up armenian phones (#331)

* cleaned up armenian phones

* cleaned up armenian phones (with more tidying up)

* cleaning up armenian (fixed changelog)

I had written the update on the wrong spot on the changelog + I added the issue number

* uncommented accidental gaps

* added voiceless allophones

* added missing geminate affricates

* reduced branch

* final changes for commit to original branch

Co-authored-by: Lucas Ashby <lfeashby@gmail.com>
Co-authored-by: Alexander Gutkin <35786058+agutkin@users.noreply.github.com>
Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>
Co-authored-by: Travis Bartley <Travismbartley@gmail.com>
Co-authored-by: Jackson L. Lee <jacksonlunlee@gmail.com>
Co-authored-by: ajmalanoski <71616036+ajmalanoski@users.noreply.github.com>
Co-authored-by: Alireza <Alirezasampoor@gmail.com>
Co-authored-by: Biswaroop Bhattacharjee <biswaroop08@gmail.com>
Co-authored-by: Muhammad Fakhri Putra Supriyadi <fakhriputra123s@gmail.com>
Co-authored-by: Ben Fernandes <dev.benfernandes@gmail.com>
Co-authored-by: Jim Regan <jaoregan@tcd.ie>
Co-authored-by: platipo <enrico.paganin@mail.com>
Co-authored-by: yeonju123 <yeonju123@gmail.com>
Co-authored-by: unknown <Yeonju@NYCMAXASIKKAW10.ad.insidemedia.net>
Co-authored-by: Hossep Dolatian <hovdeov@gmail.com>

16 people committed Jan 26, 2021

1 parent 88829cf commit f5c05d0

.circleci/config.yml

@@ -1,52 +1,91 @@
-    version: 2
+    version: 2.1
+    orbs:
+      python: circleci/python@1.2.0
-    workflows:
-      version: 2
-      test:
-        jobs:
-          - build-python-3.6
-          - build-python-3.7
-          - build-python-3.8
     jobs:
-      build-python-3.6: &template
+      pre-build:
+        description: Install and run a Python standalone package
+        parameters:
+          command-name:
+            type: string
+          command-run:
+            type: string
         docker:
-          - image: python:3.6
+          - image: cimg/python:3.9
+            #auth:
+            #  username: $DOCKERHUB_USERNAME
+            #  password: $DOCKERHUB_PASSWORD
         steps:
           - checkout
           - run:
-              name: Build source distribution and install package from it
-              working_directory: ~/project/
-              # Ensure we can build a source distribution that can correctly install.
-              # "python setup.py sdist" creates dist/wikipron-x.y.z.tar.gz
-              command: |
-                  pip install --progress-bar off --upgrade pip setuptools
-                  python setup.py sdist
-                  pip install dist/`ls dist/ | grep .tar.gz`
+              name: Create custom requirements
+              command: grep << parameters.command-name >> requirements.txt > << parameters.command-name >>_requirements.txt
+          - python/install-packages:
+              pkg-manager: pip
+              pip-dependency-file: << parameters.command-name >>_requirements.txt
+              cache-version: << parameters.command-name >>-v1
           - run:
-              name: Install the full development requirements
-              working_directory: ~/project/
-              command: pip install --progress-bar off -r requirements.txt
+              working_directory: ~/
+              command: << parameters.command-run >>
+      build-python:
+        parameters:
+          python-version:
+            type: string
+        docker:
+          - image: cimg/python:<< parameters.python-version >>
+            #auth:
+            #  username: $DOCKERHUB_USERNAME
+            #  password: $DOCKERHUB_PASSWORD
+        steps:
+          - checkout
+          - python/install-packages:
+              pkg-manager: pip
+              pre-install-steps:
+                  - run:
+                      name: Build source distribution and install package from it
+                      command: |
+                          pip install --progress-bar off --upgrade pip setuptools
+                          python setup.py sdist
+                          pip install dist/*.tar.gz
           - run:
               name: Show installed Python packages
-              command: pip list
-          - run:
-              name: Lint
-              working_directory: ~/
-              # Avoid being able to import wikipron by relative import.
-              # Test code by importing the *installed* wikipron in site-packages.
-              command: flake8 project/setup.py project/wikipron project/tests
+              command: pip list -v
           - run:
               name: Run python tests
               working_directory: ~/
               # Avoid being able to import wikipron by relative import.
               # Test code by importing the *installed* wikipron in site-packages.
-              command: pytest -vv project/tests
-      build-python-3.7:
-        <<: *template
-        docker:
-          - image: python:3.7
-      build-python-3.8:
-        <<: *template
-        docker:
-          - image: python:3.8
+              command: |
+                  sudo chown circleci:circleci /home
+                  pytest -vv project/tests --junitxml /tmp/testxml/report.xml
+          - store_test_results:
+              path: /tmp/testxml/
+    workflows:
+      version: 2
+      build-and-test:
+        jobs:
+          - pre-build:
+              name: flake8
+              command-name: flake8
+              command-run: flake8 project/setup.py project/wikipron project/tests
+          - pre-build:
+              name: black
+              command-name: black
+              command-run: black --line-length=79 --check project/setup.py project/wikipron project/tests project/data
+          - pre-build:
+              name: mypy
+              command-name: mypy
+              command-run: mypy --ignore-missing-imports project/wikipron project/tests project/data
+          - build-python:
+              requires:
+                - flake8
+                - black
+                - mypy
+              matrix:
+                parameters:
+                  python-version: ["3.6", "3.7", "3.8", "3.9"]
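
The "Create custom requirements" step in the `pre-build` job uses `grep` to copy only the lines of `requirements.txt` that mention one tool into `<tool>_requirements.txt`, so each lint job installs just that tool. A minimal Python sketch of the same filtering follows, assuming a plain substring match is enough; the function name `filter_requirements` and the version pins are hypothetical and not taken from the repository.

```python
from typing import Iterable, List


def filter_requirements(lines: Iterable[str], tool: str) -> List[str]:
    """Keeps only the requirement lines that mention `tool`."""
    # Substring match; this agrees with grep for names that contain no
    # regular-expression metacharacters, such as flake8, black, and mypy.
    return [line for line in lines if tool in line]


# Hypothetical pins, for illustration only.
requirements = ["black==20.8b1\n", "flake8==3.8.4\n", "mypy==0.790\n"]
print(filter_requirements(requirements, "flake8"))
```

As with `grep`, the match is not anchored, so any other line that happens to contain the tool name (a plugin pin, for example) would be kept as well.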

.gitignore

@@ -5,4 +5,7 @@ __pycache__/
     *.egg-info/
     *.log
     **/tars
-    **/freq_tsvs
+    **/freq_tsvs
+    env/
+    .idea/

CHANGELOG.md

            
    @@ -10,56 +10,178 @@ Versioning](http://semver.org/spec/v2.0.0.html).
  
    Unreleased

    ----------

    ### Added

    -   Adds two Vietnamese dialects to `languages.json`. (\#139)

    -   Adds whitelisting capabilities to `postprocess`. (\#152)

    -   Adds whitelists for Dutch, English, Greek, Latin, Korean, and Spanish.

    ### Under `data/`

    #### Added

    -   Added Afrikaans whitelists, filtered TSV file, rescraped phonemic and phonetic TSV files. (\#311)

    -   Added German whitelists and filtered TSV file. (\#285)

    -   Added whitelisting capabilities to `postprocess`. (\#152)

    -   Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish.

        (\#158, etc.)

    -   Improves printing in the README table. (\#145)

    -   Renames data directory `data`. (\#147)

    -   Logged dialect configuration if specified. (\#133)

    -   Handled additional language codes. (\#132, \#148)

    -   Added `--no-skip-spaces-word` and `--no-skip-spaces-pron` flag. (\#135)

    -   Added typing to big scrape code. (\#140)

    -   Added argparse to allow limiting 'big scrape' to individual languages

        with `--restriction` flag. (\#154)

    -   Added Manchu (`mnc`). (\#185)

    -   Added Polabian (`pox`). (\#186)

    -   Added `aar`, `bdq`, `jje`, and `lsi`. (\#202)

    -   Added `tyv` to `languagecodes.py` (\#203, \#205)

    -   Added `bcl`, `egl`, `izh`, `ltg`, `azg`, `kir` and `mga` to `languagecodes.py`. (\#205)

    -   Added `nep` to `languagecodes.py`. (\#206)

    -   Added Ingrian (`izh`). (\#215)

    -   Added French phoneme list and filtered TSV file. (\#213, \#217)

    -   Added Corsican (`cos`). (\#222)

    -   Added Middle Korean (`okm`). (\#223)

    -   Added Middle Irish (`mga`). (\#224)

    -   Added Old Portuguese (`opt`). (\#225)

    -   Added Serbo-Croatian phoneme list and filtered TSV files. (\#227)

    -   Added Tuvan (`tyv`). (\#228)

    -   Added Shan (`shn`) with custom extraction. (\#229)

    -   Added Northern Kurdish (`kmr`). (\#243)

    -   Added a script to facilitate the creation of a `.phones` file. (\#246)

    -   Added IPA validity checks for phonemes. (\#248)

    -   Split multiple pronunciations joined by tilde in `eng_us_phonetic`.

    -   Added Italian phoneme list and filtered TSV file. (\#260, \#261)

    -   Added Adyghe phone list and filtered TSV file. (\#262, \#263)

    -   Added Bulgarian phoneme list and filtered TSV file. (\#264, \#267)

    -   Added Icelandic phoneme list and filtered TSV file. (\#269, \#270)

    -   Added Slovenian phoneme list and filtered TSV file. (\#271, \#273)

    -   Added normalization to `list_phones.py`. Corrected errors relating to

        `ipapy` (\#275)

    -   Added Welsh .phones lists and filtered TSV files. (\#274, \#276)

    -   Added draft of covering grammar script. (\#297)

    -   Updated `data/phones/README.md` with instructions to re-scrape. (\#279, \#281)

    -   Added Vietnamese `.phones` files and re-scraped and filtered `.tsv` files.

        (\#278, \#283)

    -   Added Hindi `.phones` files and the re-scraped and filtered `.tsv` files.

        (\#282, \#284)

    -   Added Old Frisian (`ofs`). (\#294)

    -   Added Dungan (`dng`). (\#293)

    -   Added Latgalian (`ltg`). (\#296)

    -   Added draft of covering grammar script. (\#297)

    -   Added Portuguese `.phones` files and re-scraped data. (\#290, \#304)

    -   Added Japanese `.phones` files and re-scraped data. (\#230, \#307)

    -   Added Moksha (`mdf`). (\#295)

    -   Added Azerbaijani `.phones` files and re-scraped data. (\#306, \#312)

    -   Added Turkish `.phones` file and re-scraped data. (\#313, \#314)

    -   Added Maltese `.phones` file and re-scraped data. (\#317, \#318)

    -   Added Latvian `.phones` file and re-scraped data. (\#321, \#322)

    -   Added Khmer `.phones` file and re-scraped data. (\#324, \#327)

    -   Added Østnorsk (Bokmål) `.phones` file and re-scraped data. (\#324, \#327)

    -   Several languages added to `languagecodes.py`. (\#334)

    #### Changed

    -   Edited the arm_e_phonetic.phones and arm_w_phonetic.phones files. (\#298)

    -   Improved printing in the README table. (\#145)

    -   Renamed data directory `data`. (\#147)

    -   Split `may` into Latin and Arabic files. (\#164)

    -   Split `pan` into Gurmukhi and Shahmukhī. (\#169)

    -   Split `uig` into Perso-Arabic and Cyrillic. (\#173)

    -   Allowing ASCII apostrophes (0x27) in spellings. (\#172).

    -   Only allowing Latin spellings in Maltese lexicon. (\#166).

    -   Only allowed Latin spellings in Maltese lexicon. (\#166).

    -   Split `mon` into Cyrillic and Mongol Bichig (\#179).

    -   Added Vietnamese extraction function. (\#181)

    ### Deprecated

    ### Removed

    ### Fixed

    ### Security

    -   Merged `whitelist.py` into 'big scrape' script. `src/scrape.py` now checks for

        existence of whitelist file during scrape to create second filtered TSV.

        New TSV placed under `tsv/\*\_filtered.tsv`. (\#154).

    -   Updated `generate_summary.py` to reflect presence of 'filtered' tsv. (\#154)

    -   Imperial Aramaic (`arc`) split into three scripts properly. (\#187)

    -   Flattened data directory structure. (\#194)

    -   Updated Georgian (`geo`) to take advantage of upstream bot-based

        consistency fixes. (\#138)

    -   Split `arm` into Eastern and Western dialects. (\#197)

    -   Rescraped files with new whitelists. (\#199)

    -   Updated logging statements for consistency. (\#196)

    -   Renamed `.whitelist` file extension name as `.phones`. (\#207)

    -   Split `ban` into Latin and Balinese scripts. (\#214)

    -   Split `kir` into Cyrillic and Arabic. (\#216)

    -   Split Latin (`lat`) into its dialects. (\#233)

    -   Added MyPy coverage for `wikipron`, `tests` and `data` directories. (\#247)

    -   Modified paths in `codes.py`, `scrape.py`, and `split.py`. (\#251, \#256)

    -   Modified config flags in `languages.json` and `scrape.py`. (\#258)

    -   Edited Serbo-Croatian `.phones` file to list all vowel/pitch accent

        combinations. Re-scraped Serbo-Croatian data. (\#288)

    -   Moved `list_phones.py` to parent directory. (\#265, \#266)

    -   Moved `list_phones.py` to `src` directory. (\#297)

    -   Frequencies code no longer overwrites TSV files. (\#320)

    -   Updated `data/phones/README.md` to specify that `.phones` files should be

        in NFC normalization form. (\#333)

    -   Kurdish (`kur`) and Opata (`opt`) removed from `languages.json`. (\#334)

    #### Fixed

    -   Fixed path issue with phonetic whitelisted files. (\#195)

    ### Under `wikipron/` and Elsewhere

    #### Added

    -   Added positive flags for stress, syllable boundaries, tones, segment to `cli.py`. (\#141)

    -   Added positive flags for space skipping to `cli.py`. (\#257)

    -   Added two Vietnamese dialects to `languages.json`. (\#139)

    -   Handled additional language codes. (\#132, \#148)

    -   Added `--no-skip-spaces-word` and `--no-skip-spaces-pron` flag. (\#135)

    -   Allowed ASCII apostrophes (0x27) in spellings. (\#172).

    -   Added Vietnamese extraction function. (\#181).

    -   Modified pron selector in Latin extraction function. (\#183).

    -   Added `--no-tone` flag. (\#188)

    -   Customized extractor and new scraped prons for `khb`. (\#219)

    -   Added `tests/test_data` directory containing two tests. (\#226, \#251)

    -   Added HTTP User-Agent header to API calls to Wiktionary. (\#234)

    -   Added support for python 3.9 (\#240)

    -   Added black style formatting to `.circleci/config.yml`. (\#242)

    -   Added logging for scraping a language with `--dialect` specified

        that requires its custom extraction logic. (\#245)

    -   Improved CircleCI workflow with orbs. (\#249)

    -   Added `test_split.py` to `tests/test_data`. (\#256)

    -   Handled Cantonese for scraping. (\#277)

    -   Added exclusion for reconstructions. (\#302)

    -   Added Vietnamese contour tone grouping test in `tests/test_config.py` (\#308)

    #### Changed

    -   Renamed arguments to positive statements in `wikipron/config.py` and edited `_get_process_pron` function accordingly. (\#141, \#257)

    -   Changed testing values used in `tests/test_config.py` in order to accommodate the addition of positive flags. (\#141)

    -   Specified UTF-8 encoding in handling text files. (\#221)

    -   Moved previous contents of `tests` into `tests/test_wikipron` (\#226)

    -   Updated the packages version numbers in requirements.txt to their latest according to PyPI (\#239)

    -   Updated the default pron selector to also look for IPA strings under paragraphs in addition to list items. (\#295)

    -   Updated segments package version to 2.2.0 (\#308)

    #### Deprecated

    #### Removed

    -   Moved Wiktionary querying functions from `test_languagecodes.py` to `codes.py` (\#205)

    #### Fixed

    #### Security

    [1.1.0] - 2020-03-03

    --------------------

    ### Added

    #### Added

    -   Added the extraction function for Mandarin Chinese and its scraped data. (\#124)

    -   Integrated Wortschatz frequencies. (\#122)

    ### Changed

    #### Changed

    -   Updated the Japanese extraction function and Japanese data. (\#129)

    -   Updated all scraped Wiktionary data and frequency data. (\#127, \#128)

    -   Generalized the splitting script in the big scrape. (\#123)

    -   Moved small file removal to `generate_summary.py`. (\#119)

    -   Updated Russian data. (\#115)

    ### Fixed

    #### Fixed

    -   Avoided and logged error in case of pron processing failure. (\#130)

    [1.0.0] - 2019-11-29

    ----------------------

    ### Added

    #### Added

    -   Handled Japanese. (\#109, \#114)

    -   Handled Latin, for which the actual graphemes cannot be the Wiktionary

    @@ -73,21 +195,21 @@ Unreleased
  
    -   Resolved Wiktionary language names for languages with at least 100

        pronunciation entries. (\#52, \#55)

    ### Changed

    #### Changed

    -   Removed duplicate <word, pronunciation> pairs in the persisted data. (\#85, \#111, \#116)

    -   Split Welsh into Northern Wales and Southern dialects in the persisted data. (\#110)

    -   Factored out casefolding. (\#102)

    -   Split Serbo-Croatian into Cyrillic and Latin TSVs. (\#96)

    -   Generalized word and pronunciation extraction. (\#88)

    ### Removed

    #### Removed

    -   Removed the timeout in smoke tests. (\#107)

    -   Removed the `output` option. (\#82)

    -   Removed the `require_dialect_label` option. (\#77)

    ### Fixed

    #### Fixed

    -   Skipped pronunciations with a dash. (\#106)

    -   Skipped empty pronunciations in scraping. (\#59)

    @@ -97,15 +219,15 @@ Unreleased
  
        `title="wikipedia:<language> phonology"` to cover previously unhandled

        languages (e.g., Estonian and Slovak). (\#49)

    ### Security

    #### Security

    -   Avoided using `exec` to retrieve the version string. Used `pkg_resources`

        instead. (\#63)

    [0.1.1] - 2019-08-14

    ----------------------

    ### Fixed

    #### Fixed

    -   Fixed import bug. (\#45)
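
Several entries above describe filtering scraped TSV data through a `.phones` (formerly whitelist) file and keeping those files in NFC normalization form. The following is a minimal sketch of that idea, not the project's actual implementation. The function names `load_phones` and `filter_rows`, the `#` comment convention, and the assumption that pronunciations are space-separated segments are all hypothetical and chosen for illustration only.

```python
import unicodedata
from typing import Iterable, Iterator, Set, Tuple


def load_phones(lines: Iterable[str]) -> Set[str]:
    """Collects one NFC-normalized phone per non-empty line.

    Assumption: text after '#' is a comment. The real `.phones`
    format may differ.
    """
    phones: Set[str] = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            phones.add(unicodedata.normalize("NFC", entry))
    return phones


def filter_rows(
    rows: Iterable[Tuple[str, str]], phones: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yields (word, pron) pairs whose segments are all in `phones`.

    Assumption: segments within a pronunciation are separated by spaces.
    """
    for word, pron in rows:
        segments = unicodedata.normalize("NFC", pron).split()
        # Rows with an empty pronunciation are dropped as well.
        if segments and all(seg in phones for seg in segments):
            yield word, pron
```

Under these assumptions, a row is kept only when every segment of its pronunciation appears in the phone list, which is one plausible way to produce a `*_filtered.tsv` alongside the unfiltered file.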
