Improve CircleCI workflow #249

Merged
merged 47 commits into from
Oct 25, 2020

Conversation

@platipo (Contributor) commented Oct 24, 2020

  • Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

Hi @jacksonllee, I saw #235 and decided to hop on the renovation bandwagon for the CircleCI job. I'm working on the Orbtoberfest challenges and saw an opportunity to improve the current configuration:

  • add the matrix building syntax for improved readability
  • add parallel testing for faster pipelines
  • add Python orb to remove maintenance burden
  • cache pipeline dependencies

@platipo platipo marked this pull request as ready for review October 24, 2020 17:00
@platipo (Contributor Author) commented Oct 24, 2020

One last point is left:

  • add parallel testing for faster pipelines

In my opinion it is not worth applying CircleCI parallelism, because there are only a few tests and it would require splitting the build-python job into two halves: the former would build the package and its requirements, while the latter would run the tests. This split is required to avoid unknown behavior in the cache save.

Let me know if you want to test it anyway.

@platipo (Contributor Author) commented Oct 24, 2020

I reworked the entire workflow:

new workflow

If either black or flake8 fails, the other jobs won't start. Building the package and installing dependencies are now cached, so the run needs less time to complete.

To integrate this PR with the work of #247 you can just modify the workflow like:

workflows:
  version: 2
  build-and-test:
    jobs:
      - pre-build:
          name: flake8
          command-name: flake8
          command-run: flake8 project/setup.py project/wikipron project/tests
      - pre-build:
          name: black
          command-name: black
          command-run: black --line-length=79 --check project/setup.py project/wikipron project/tests project/data
      - pre-build:
          name: mypy
          command-name: mypy
          command-run: mypy
      - build-python:
          requires:
            - flake8
            - black
            - mypy
          matrix:
            parameters:
              python-version: ["3.6", "3.7", "3.8", "3.9"]
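
As a rough illustration of what a matrix stanza like the one above expresses, the sketch below expands a parameter mapping into one assignment per combination of values. It is a conceptual model only, not CircleCI's implementation; the function name `expand_matrix` is hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, List


def expand_matrix(parameters: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yields one parameter assignment per combination of values.

    Hypothetical helper for illustration; CircleCI does this internally.
    """
    names = list(parameters)
    # product() builds the Cartesian product of all value lists.
    for values in product(*(parameters[name] for name in names)):
        yield dict(zip(names, values))


# Same parameter values as the build-python matrix in the workflow above.
matrix = {"python-version": ["3.6", "3.7", "3.8", "3.9"]}
for assignment in expand_matrix(matrix):
    print(assignment)
```

With a single parameter this yields four assignments, one per Python version, which is why one build-python entry in the workflow stands in for four separate jobs.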

@platipo (Contributor Author) commented Oct 24, 2020

Do I have to do anything about "Updated Unreleased in CHANGELOG.md to reflect the changes in code or data."?

@platipo platipo changed the title Convert to matrix CircleCI workflow Improve CircleCI workflow Oct 24, 2020
@jacksonllee jacksonllee self-assigned this Oct 25, 2020
@jacksonllee (Collaborator) left a comment

Thank you @platipo for this contribution! The CircleCI orbs are new to me. I think I see how they can be useful, as this pull request has shown. I like how the checks for flake8/black/etc. common to all Python versions are factored out and required to pass prior to running the test suite. Cached environments for less build time are also nice.

As for the changelog, would you mind adding a new entry like Improved CircleCI workflow with orbs. (#249) to CHANGELOG.md?

Is there a label I should add to this pull request for the purpose of Orbtoberfest? Let me know and I'll do it.

.circleci/config.yml (outdated)
Comment on lines +17 to +19
#auth:
# username: $DOCKERHUB_USERNAME
# password: $DOCKERHUB_PASSWORD
Collaborator

I think we can safely remove these lines. I don't envision the need for passing in Docker Hub credentials for this project.

Contributor Author

Starting from 1st November, CircleCI will limit unauthenticated Docker pulls; you can read more here.

Collaborator

I see -- thanks for flagging this! In that case, I agree it's good to leave these commented-out lines here as a reminder to ourselves.

.circleci/config.yml (outdated)
pip install dist/*.tar.gz
- run:
name: Show installed Python packages
command: pip3 list -v
Collaborator

pip (used in the previous step) instead of pip3 for consistency?

@platipo (Contributor Author), Oct 25, 2020

I changed the command to pip3 because in a former image pip showed both Python 2 and 3 packages. Do you also want me to remove the -v flag, which I used for debugging the package location? This is an output sample

docker:
- image: python:3.9
command: |
sudo chown circleci:circleci /home
Collaborator

Why do we need to run chown on /home here? A comment just above would help.

@platipo (Contributor Author), Oct 25, 2020

I noticed that the first tests were failing because they try to create a directory in /home, which is owned by root.
I chose a quick and dirty fix to make the tests pass; of course, the correct fix is to move the test directory to either a unique temporary folder via the tempfile stdlib module or a fixed directory in a user-owned path like ~/tsv

Collaborator

Oh yes, we do have tests with file I/O like that. Your quick fix here seems reasonable for now.

Note to self: I did mention using tempfile when the relevant tests were being checked in. This issue will be fixed when we get to refactoring the relevant code later.
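
The tempfile approach mentioned in this thread could look roughly like the sketch below. It is not the project's actual test code: the helper `write_tsv` and the test name are hypothetical stand-ins for any test that does file I/O, and the sample row is made up.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def write_tsv(directory: Path, name: str, rows: List[Tuple[str, str]]) -> Path:
    """Hypothetical helper: writes (word, pronunciation) rows as a TSV file."""
    path = directory / name
    with open(path, "w", encoding="utf-8") as sink:
        for word, pron in rows:
            print(f"{word}\t{pron}", file=sink)
    return path


def test_write_tsv_in_temporary_directory() -> None:
    # The directory is created under the system temp location, which the
    # current user can write to, and is deleted when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_tsv(Path(tmp), "example.tsv", [("cat", "k æ t")])
        assert path.read_text(encoding="utf-8") == "cat\tk æ t\n"
    # Nothing is left behind, so no chown on /home is needed.
    assert not path.exists()
```

Because the temporary directory is always writable by the user running the tests, a test written this way should not depend on who owns /home in the CI image.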

sudo chown circleci:circleci /home
pytest -vv project/tests --junitxml /tmp/testxml/report.xml
- store_test_results:
path: /tmp/testxml/
Collaborator

I notice that you've added --junitxml /tmp/testxml/report.xml to the pytest command above to generate a test report. Does this store_test_results step put the report somewhere? How do we access it?

Contributor Author

I added --junitxml /tmp/testxml/report.xml and store_test_results because they are needed to split the tests for parallelization. You can access the results by clicking on a build job and then selecting the test tab: for example

@platipo (Contributor Author), Oct 25, 2020

It is also useful for reading cleanly formatted failure messages; here is an example of a failing test.

Collaborator

Interesting -- I must admit that I've been looking at test failures just by scrolling through the log on the "Steps" tab.
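
For anyone who wants to inspect the report discussed in this thread outside the CircleCI UI, the sketch below lists failed test cases from a JUnit-style XML string. The sample report is hand-written in the usual JUnit layout and is not output from this project; the function name `failed_tests` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hand-written sample in the JUnit XML layout (an assumption for illustration).
SAMPLE_REPORT = """<testsuites>
  <testsuite name="pytest" tests="2" failures="1" errors="0" skipped="0">
    <testcase classname="tests.test_example" name="test_ok" time="0.01"/>
    <testcase classname="tests.test_example" name="test_bad" time="0.02">
      <failure message="assert 1 == 2">traceback text</failure>
    </testcase>
  </testsuite>
</testsuites>
"""


def failed_tests(report_xml: str) -> List[Tuple[str, str]]:
    """Returns (test id, message) pairs for failed or errored test cases."""
    root = ET.fromstring(report_xml)
    results = []
    # iter() finds <testcase> elements at any depth, so this works whether
    # the root element is <testsuites> or a single <testsuite>.
    for case in root.iter("testcase"):
        for tag in ("failure", "error"):
            problem = case.find(tag)
            if problem is not None:
                test_id = f"{case.get('classname')}::{case.get('name')}"
                results.append((test_id, problem.get("message", "")))
    return results


print(failed_tests(SAMPLE_REPORT))
```

To try it on a real run, the contents of the file written by --junitxml (here /tmp/testxml/report.xml) could be read and passed to the same function.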

@jacksonllee jacksonllee removed their assignment Oct 25, 2020
@jacksonllee (Collaborator) commented

Heads-up that #247, which adds a mypy step to the CircleCI config, has just been merged.

@platipo (Contributor Author) commented Oct 25, 2020

Thank you @jacksonllee, I rebased the branch on master and added the mypy step; this is still a WIP but I'm adding the finishing touches 🚀

@platipo (Contributor Author) commented Oct 25, 2020

Unless you want the workflow with separate parallel tests, I think I am finished, provided there are no other fixes to make.

If the tests become longer and longer, the new workflow could become:

proposed workflow

In my opinion it is not worth it because there is a context switch to take into account from the build job to the test job for workspace save, but we can try...

@platipo (Contributor Author) commented Oct 25, 2020

Is there a label I should add to this pull request for the purpose of Orbtoberfest? Let me know and I'll do it.

Yeah, I need the Orbtoberfest label. Thank you!

- pre-build:
name: mypy
command-name: mypy
command-run: mypy --ignore-missing-imports project/wikipron project/tests project/data
Collaborator

Thank you for adding the mypy pre-build as well!

# Test code by importing the *installed* wikipron in site-packages.
command: black --line-length=79 --check project/setup.py project/wikipron project/tests project/data
name: Show installed Python packages
command: pip list -v
Collaborator

The -v flag to be more verbose for where things are is nice. Thanks for adding it!

@jacksonllee (Collaborator) left a comment

LGTM! Thank you very much @platipo for showing us the CircleCI orb magic and patiently walking me through the details!

I went ahead to fix the merge conflict for CHANGELOG.md to save another round of back-and-forth and not take up more of your time. Thank you again!

@jacksonllee jacksonllee merged commit 13be941 into CUNY-CL:master Oct 25, 2020
kylebgorman added a commit that referenced this pull request Jan 26, 2021
* Update to Latin pron selector (#183)

* minor change to latin extraction function, rescraped Latin

* potential fix to lat scraping issue

* raw scrape of latin

* postprocessing of new latin data

* updated changelog, fixed line length error

* rescrape of latin

* postprocessing of updated latin data

* [pox] Scraped Polabian. (#186)

* [pox] Scraped Polabian.

Note: The ISO 639-3 code is `pox`, the older ISO 639-2 code is `sla`.

* Updated CHANGELOG.

* [mnc] Scraped Manchu. (#185)

* [mnc] Scraped Manchu.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Merged Whitelist functionality with src/scrape.py. Now checks for pre… (#184)

* Merged Whitelist functionality with src/scrape.py. Now checks for presence of whitelist and writes separate tsv as {original file name}_filtered.tsv. Update generate_summary to reflect if file is filtered through a whitelist. CHANGELOG and README update accordingly.

* Style tweaks and cleanup.

* Updated generalized_split and postprocess to reflect automatic whitelist processing in scrape. Fixed dialect issue in generate_summary.

* Previous edits didn't carry.

* Cleanup typo mistakes. Added error handling to scrape.py.

* Style clean-up.

* Fixed style issues.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Imperial aramaic (#187)

* [arc] Listing the correct scripts for Imperial Aramaic:

1.  The original Aramaic script (`armi`).
2.  The square script as in Biblical Aramaic (`hebr`).
3.  Classical Syriac/Assyrian Neo-Aramaic (`syrc`) descended from (1).

This correctly assigns the entries to their respective lexicons. Most
of pronunciations are available for (2), with very minor number of
entries for (1) and (3).

* [arc] Listing the correct scripts for Imperial Aramaic (continuing the
previous commit which was partial):

1.  The original Aramaic script (`armi`).
2.  The square script as in Biblical Aramaic (`hebr`).
3.  Classical Syriac/Assyrian Neo-Aramaic (`syrc`) descended from (1).

This correctly assigns the entries to their respective lexicons. Most
of pronunciations are available for (2), with very minor number of
entries for (1) and (3).

* Updated CHANGELOG with #186.

* Add --no-tone flag (#188)

* tentative solution for tone removal

* updates changelog, ran white on test_config.py

* remove print statement from test_config.py

* partial replace of codepoints with chars, adds nfd/nfc conversion

* reworks import statements

* updates _TONES_REGEX

* ran white on config.py

* updates to conversions and adds comments

* fixes to scrape.py comment length

* converted test_config.py no_tone tests to nfd strings

* modifies no_tone process not to skip removing superscript parentheses around non-tone superscript chars

* Rename (#192)

* [geo] Rescrape post-bot.

Closes #138.

* Add changelog

* Rename.

* Update CHANGELOG

* Revert "[geo] Rescrape post-bot."

This reverts commit 4a151b13e0e03e7a4aecb7dad29c1de9c2230f10.

* Flattens directory structure for data. (#194)

* Flattens directory structure for data.

The non-wiki data is moved to the new `wikipron-extras` (https://github.com/kylebgorman/wikipron-extras) repository.

Closes #193.

* Add PR number to changelog.

* "Imperial"

* [geo] Rescrape post-bot. (#191)

* [geo] Rescrape post-bot.

Closes #138.

* Add changelog

* Update changelog

* [geo] Add whitelist and re-scrape.

* Renames for merge.

* Add link to guidelines

* [hun] Adds whitelist.

* Simplify postprocess

* Enforces consistent style in logging using %r. (#196)

* Enforces consistent style in logging using %r.

* Updates CHANGELOG

* Fixes a double-quoted logging var.

* Filtering (#199)

* [rum] Add whitelist and rescrape.

* [eng] Adds English rescrape.

* [dut] Adds Dutch rescrape.

* [gre] Adds Greek rescrape.

* [gre] Adds Greek rescrape.

* Updates scrape path for phonetic filtering.

Closes #195.

* [rum] Adds Romanian rescrape.

* [arm] Adds Armenian rescrape.

* [gre] Adds Greek rescrape (second try).

* [arm] Adds Armenian dialects + rescrapes.

Closes #197.

* Adds CHANGELOG changes.

* [spa] Adds Spanish rescrape.

* Postprocess and regenerate summaries.

* [aar, bdq, jje, lsi] discovers new languages and scrapes them. (#202)

* Added tyv to languagecodes.py (#203)

* adds tuvan to languagecodes.py

* updates changelog

* Fall scrape (#204)

* [aar, bdq, jje, lsi] discovers new languages and scrapes them.

* Fall scrape.

* Fuller bib information

Fills out the bibliography entry for the WikiPron paper.

* Updates to codes.py (#205)

* updated languages.json and json files for translating between Wiktionary code and iso code

* updates codes.py and languagecodes.py

* modifies test_languagecodes.py to reduce redundancy with codes.py

* small formatting fixes

* updates changelog

* logging statement formatting

* Update README.md

Fixes formatting issue in table. Not sure why this had to be done manually...

* ENH rename '.whitelist' as '.phones' (#207)

* Uses %r everywhere in `data/src`. (#210)

* Nepali support (#211)

* Uses %r everywhere in `data/src`.

* [nep] Adds Nepali data.

Closes #209.

* Update changelog

* [fre] Adds phoneme list (#213)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* [izh] Scrape and add Ingrian. (#215)

* [izh] Scrape and add Ingrian.

* Updated CHANGELOG.

* [ban] Splitting Balinese into Latin and Balinese scripts. (#214)

* [ban] Splitting Balinese into Latin and Balinese scripts.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [kir] Split Kyrgyz into Cyrillic and Arabic scripts. (#216)

* [kir] Split Kyrgyz into Cyrillic and Arabic scripts.

* Updated.

* Added fre_phonemic_filtered.tsv (#217)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Refresh the database size counter. (#220)

* [khb] Customized extractor and re-scraping of Lü. (#219)

* [khb] Adding customized extractor for Lü.

* [khb] Re-scraping and updating the data and summaries.

* Updated CHANGELOG.

* Reordered imports.

* [khb] Adding scrape smoke test.

* Resorted.

* FIX specify UTF-8 in handling text files (#221)

It looks like Windows users have encountered encodings --
they hit `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in
position 2882: character maps to <undefined>` when pip installing
wikipron, the error triggered at setup.py.
While we're at it, we specify UTF-8 encoding for all open() calls
for text processing as well.

Co-authored-by: jacksonllee <jacksonlunlee@gmail.com>

* [mga] New scrape: Middle Irish. (#224)

* [mga] New scrape: Middle Irish.

* Updated CHANGELOG.

* [cos] New scrape: Corsican. (#222)

* [cos] Add Corsican to the language code registry.

* [cos] Scraped Corsican and updated the language descriptions.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [okm] New scrape: Middle Korean (#223)

* [okm] Adding ISO 639-3-only Middle Korean: Korean, Middle (10th–16th centuries).

* [okm] New scrape of Middle Korean and update of indices and descriptions.

* Updated CHANGELOG.

* Fixed typo.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [opt] New scrape: Old Portuguese (aka Galician-Portuguese). (#225)

* Adding Old Portuguese (aka Galician-Portuguese) codes.

* [opt] New scrape.

* [opt] Updated summaries.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Added Serbo-Croatian phonemes and filtered TSV files. (#227)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [shn] Custom extractor and new scrape for Shan (#229)

* [shn] Adding customized extractor for Shan.

* [shn] Adding smoke test.

* [shn] New scrape for Shan.

* [shn] Updated descriptions.

* Updated CHANGELOG.

* [tyv] New scrape: Tuvan (#228)

* [tyv] Tuvan scrape.

* [tyv] Updated descriptions.

This also fixes a buggy previous merge of `okm`.

* [tyv] Filtering Tuvan to use Cyrillic script only.

* Updated CHANGELOG.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Reorganizes tests and adds a few initial tests for the data side (#226)

* moves wikipron module tests into subdirectory

* reformatting of test_version.py

* adds outline of test for data naming conventions, removes nonsense from src/scrape.py

* basic framework for testing file creation involved in big scrape

* renamed file naming test and added comments

* reorganizes tests directory, adds test for generate_summary.py

* fix formatting in test_version.py

* revises and renames file for testing scrape

* fixes pathing issue in init

* adds some typing to new tests

* changes open statements to use proper encoding

* potential solution to circleci module error

* approaching a circleci import solution?

* updates changelog

* [hbs] Fix file naming.

* Update README.md

* Update README.md

* [gre] Takes advantage of upstream consistency fix.

Closes #198.

* [lat] Split Latin into its dialects (#233)

* ENH handle Latin dialects

* RM remove unwanted Latin TSVs

* ENH add Latin dialect TSVs

* ENH postprocess Latin dialect TSVs

* ENH update data summary readme

* MAINT update changelog

* MAINT update changelog

* Add HTTP User-Agent header to API calls (#234)

* add http headers for get requests

* add http headers for get requests in tests/

* change wikipron/scrape.py code to avoid circular imports

* updated requirements.txt to have the latest dependencies (#238) (#239)

* Update requirements.txt

* Update CHANGELOG.md

* Update CHANGELOG.md

* Added support for Python-3.9 (#236) (#240)

* Update requirements.txt

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update config.yml

* Update setup.py

* Update CHANGELOG.md

* Add black formatting (#242)

* add black formatting (fix #237)

* update changelog

* [kmr] New scrape: Northern Kurdish (#243)

* [kmr] Adding an entry for Northern Kurdish.

* [kmr] Adding an ISO mapping for Northern Kurdish.

* [kmr] Fresh scrape.

* [kmr] Updated description and summaries.

* [kmr] Updated CHANGELOG.

* [kmr] Lower-cased version.

* [kmr] Silly. Source should be lower-case.

* Update CHANGELOG.md

Minor style fixes to CHANGELOG

* MAINT reorganize changelog (#244)

* Add logging for dialect support for languages requiring custom extraction logic (#245)

* ENH alert the use of custom logic when dialect is specific

* MAINT update changelog

* Add a script to facilitate the creation of .phones files (#246)

* ENH add script to tally phones/phonemes in a TSV

* DOC update readme for the .phones files

* MAINT update changelog

* DOC comments in list_phones.py

* MAINT update changelog

* DOC update docstrings and readme

* Use mypy for type checking (#247)

* ISSUE-241: Ignoring 'env' and '.idea' directories

* ISSUE-241: Added 'mypy' to 'requirements.txt'

* ISSUE-241: Added 'Type checking' step to CircleCI

* ISSUE-241: Fixed mypy issues

* ISSUE-241: Updated documentation

* ISSUE-241: Added mypy to the correct 'requirements.txt'

* ISSUE-241: Ran Black formatter

Also updated the contribution guidelines to include this as a step

* ISSUE-241: Markups

ISSUE-241: Markup - Alphabetised 'requirements.txt'
ISSUE-241: Markup - Log invalid page title
ISSUE-241: Markup - Alphabetised 'test_scrape.py' imports
ISSUE-241: Markup - Added explanatory comment
ISSUE-241: Markup - Improved 'config_dict' typing
ISSUE-241: Markup - Improved 'scrape.py' typing

* ISSUE-241: Markup - Using logger interpolation

* ISSUE-241: Markups

* ISSUE-241: Markup - Added working dir to Circle CI config

* split tildes; resort (#250)

* split tildes; resort

* update CHANGELOG.md

* Improve CircleCI workflow with orbs (#249)

* Convert to matrix CircleCI workflow

* Fix typo in parameter

* Add missing job name

* Add CircleCI test storage

* Add Python orb and caching

* Fix orb command

* Set Python deps install to global scope

* Bump up Python orb version

* Fix command nesting

* Add package manager to orb command

* Fix pyenv cache failure

* Fix pyenv cache

* Add workspace cache for pip packages

* Fix username typo

* Fix permission error

* Test pre-built CircleCI Docker image

* Test missing site-packages

* Test missing Python dir

* Add verbose pip list

* Add pre test jobs

* Fix parameter substitution in description

* Fix extraneous run

* Add parametrized flake8 and black jobs

* Fix parameter passing

* Fix unreferenced parameter

* Fix pre-test Docker image tag

* Show xml coverage

* Add pre-test Python cache

* Create tsv directory

* Chown /home to circleci

* Fix store_results path

* Rename pre-test jobs

* Improve CircleCI configuration

Add Python orb, matrix jobs and rework workflow structure

* Improve CircleCI configuration

Add Python orb, matrix jobs and rework workflow structure

* Bump up pre-build Python version to 3.9

* Add mypy to pre-build jobs

* Add mypy to build required jobs

* Change pip3 to pip

* Add PR to CHANGELOG.md

* Disable circleci user chowning /home

* Revert "Disable circleci user chowning /home"

This reverts commit eed32d6f3ab9c2094a642cc23967c536ad5bddb5.

* Disable pyenv creation

* Revert "Disable pyenv creation"

This reverts commit 68297c21c1c2f4dc67e2bc9bd7972adbeea3878b.

* Disable pyenv creation

* Test pip cache renewal

* Revert "Test pip cache renewal"

This reverts commit b4772307ded407da0fedfc4320b3594f66d366fa.

Cache works as intended, references
https://github.com/kylebgorman/wikipron/pull/249#discussion_r511582495.

Co-authored-by: Jackson L. Lee <jacksonlunlee@gmail.com>

* Small path changes on the data side, rework of test_scrape.py (#251)

* rework some paths on data side, simplify test_scrape.py

* revert changes to test_summary.py

* updates changelog

* Adding a sanity check for valid IPA (#248)

* Check that the phones/phonemes are valid IPA.

* Only print the bad characters.

* Updated CHANGELOG.

* Reformatted the file using black.

* Reran black with line length limit.

* Phonemes, rather than phones.

* Sorted the packages alphabetically.

* Re-arranged imports.

* Moved ipapy into data-specific requirements file.

* Adding dependency on absl-py (for logging) and factoring out the phoneme
checking functionality into its own function.

* Added a link to IPA chart.

* Removed absl-py.

* Use internal logger.

* Check the logging level.

* Moving to global logger.

Thanks Kyle!

* reformatted.

* Cosmetic: fixed warning message.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Style fixes for list_phones (#254)

* Style fixes for list_phones

* Ran black formatter.

* Remove `<5`.

* Negative flags are renamed to positive statements (#141) (#255)

* Negative flags in cli.py are renamed to positive statements. In order to accommodate this change, Wikipron/config.py and tests/test_wikipron/test_config.py are also edited accordingly.

* positive flags are added and negative flags are renamed to positive ones.

* changelog is updated.

* style edit

* fix fix redundancy

Co-authored-by: unknown <Yeonju@NYCMAXASIKKAW10.ad.insidemedia.net>

* Clean up flag help and eliminate remaining double negatives (#257)

* Work on flags:

1. Flag help should be short, because people don't read it very
carefully and it's not formatted for multi-sentence input. This shortens
all the flags to a single, consistent name. Because dialect and
segmentation require more information, these details have been moved
into a prominent position in the README instead.
2. The tone and space flags are given negative versions, cf. what Yeonju
did earlier.

* Eliminates double-negative in skip-spaces.

* Updates changelog.

* Updates tests, config, core.

* Fixed missing test_scrape change.

* Adds test for TSV splitting (#256)

* fixes to split.py and postprocess before adding tests

* cleanup of test_split

* updated a few comments in test_split

* revert needless changes to postprocess and split

* minor comment update in test_split

* updates changelog

* Updates data side to use new flags (#258)

* quick fix to small oversight in test_extract.py

* data side uses new flags

* updates changelog, removes config_factory from text_extract.py

* [ita] Adds phoneme list, filtered phonemic TSV file (#261)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [ady] Adds phone list, filtered Adyghe data. (#263)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Moves `list_phones.py`. (#266)

* Moves `list_phones.py`.

Closes #265.

* Add changelog

* [bul] Adds phone list, filtered Bulgarian data (#267)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Adds Icelandic phone list, filtered Icelandic data. (#270)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [slv] Adds Slovenian phoneme list, filtered TSV data. (#273)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Adds normalization to `list_phones.py`, corrects bugs relating to `ipapy` (#275)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Adds Welsh .phones lists, filtered TSV data (#276)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [yue] Handle Cantonese for scraping (#277)

* ENH handle Cantonese for scraping

* MAINT update changelog

* DOC explain Cantonese pron XPath template

* Updates `data/phones/README.md` with instructions to re-scrape (#281)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [vie] Adds Vietnamese `.phones` files, `.tsv` files (#283)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [hin] Adds `phones` file, updated/new TSV files. (#284)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [hbs] Fixes Serbo-Croatian phoneme lists. Re-scrapes data. (#288)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [ofs] Scraped Old Frisian. (#294)

* Add Old Frisian to the configuration.

* Mark "ofs" as ISO639-3 language code.

* Fixed language name.

* Added phonemic pronunciations.

* Updated.

* [aar] Rescraped Afar. (#291)

* [aar] Rescraped.

* Updated.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [dng] Scraped Dungan. (#293)

* [dng] Scraped Dungan.

* Updated.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [bre] Rescraped Breton. (#292)

* [bre] Rescraped.

* Updated.

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Covering grammar script (#297)

* Adds covering grammar generator, for future QA work.

Also moves `list_phones.py` to the src directory, which makes sense to
me.

* Changelog update.

* Updates CHANGELOG with issue number.

* Update README.md

Fix syntax highlighting hints.

* [ltg] Scraped Latgalian. (#296)

* [ltg] Scraped Latgalian.

* Forgot to include the actual data.

* Updated.

* Removes reconstructions (#302)

* Adds covering grammar generator, for future QA work.

Also moves `list_phones.py` to the src directory, which makes sense to
me.

* Changelog update.

* Updates CHANGELOG with issue number.

* Skips reconstructions during scraping.

Then, rescrapes Latin to take advantage of this.

* Adds number to changelog.

* Updates CHANGELOG for >>> junk.

* Rescrapes Armenian. (#303)

Closes #301.

* [por] Adds phones files, rescraped TSV files. (#304)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [bur] Adds Burmese phone list, re-scraped Burmese data. (#305)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [mdf] Scrape Moksha + slightly more flexible default pron selector. (#295)

* Support some Moksha pronunciations that reside under "p", rather than
"li".

* Scrape.

* Attempt to fix the test.

* Updated.

* Split the PR into two items.

* [jpn] Adds Japanese .phones file and updated TSV files (#307)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Updates segments version (#308)

* updates segments version and adds test for vietnamese tones

* updates changelog

* [ger] Adds German Phone list, filtered TSV file (#309)

* Create German Phonelist

* Updated CHANGELOG.md

* incorporate updates in README.md, and added missing ger_phone* files

* Adds some whitespace to German phone list comments. (#310)

* [aze] Adds Azerbaijani phone lists and updated TSV data (#312)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [tur] Adds Turkish phone list and updated TSV data (#314)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [afr] adds phone list for Afrikaans and updated TSV files (#316)

* adds afr phone list and rescrapes

* Updated CHANGELOG.md

* [mlt] Adds Maltese phones file and updates TSV data (#318)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Frequency code tire-kick (#320)

* Frequency code tire-kick:

1. Increases typing.
2. No longer overwrites the .tsv files: adds `_freq.tsv` suffix instead.
3. Adds Khmer to JSON config. file.
4. Adds `shared_tasks` subdirectory for targeted config files.
5. Updates README.

* [lav] Adds Latvian phone list and updated TSV data (#322)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [khm] Adds Khmer phones and updated TSV data (#327)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

* Updates changelog

* Adds Khmer phones and updated TSV data

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Moves Latin phonelist to Classical Latin. (#326)

Also undertook a light reorg.

* [nob] Adds Østnorsk (Bokmål) phones and updated TSV data (#330)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

* Updates changelog

* Adds Khmer phones and updated TSV data

* Updates changelog

* Adds Østnorsk (Bokmål) phones and updated TSV data

* Updates changelog

* Fixes typo

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* Add English link to language list for frequencies. (#332)

* Partial scrape (#334)

* scrape up to Cantonese

* raw partial scrape - excludes yue, rus, cmn

* post-processing on partial scrape, src README fix

* re-ran generate_summary.py after resolving conflicts

* revert comment in scrape.py

* updates changelog, resolves formatting error

* Updates `data/phones/README.md` (#333)

* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3becdbf8a4ec35285ffbfbe9a419fad5123e.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

* Updates changelog

* Adds Khmer phones and updated TSV data

* Updates changelog

* Adds Østnorsk (Bokmål) phones and updated TSV data

* Updates changelog

* Fixes typo

* Update data/phones/README.md

* Update changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>

* [arm] cleaned up Armenian phones (#331)

* cleaned up Armenian phones

* cleaned up Armenian phones (with more tidying up)

* cleaning up Armenian (fixed changelog)

I had written the update on the wrong spot on the changelog + I added the issue number

* uncommented accidental gaps

* uncommented accidental gaps

* added voiceless allophones

* added missing geminate affricates

* reduced branch

* reduced branch

* final changes for commit to original branch

Co-authored-by: Lucas Ashby <lfeashby@gmail.com>
Co-authored-by: Alexander Gutkin <35786058+agutkin@users.noreply.github.com>
Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>
Co-authored-by: Travis Bartley <Travismbartley@gmail.com>
Co-authored-by: Jackson L. Lee <jacksonlunlee@gmail.com>
Co-authored-by: ajmalanoski <71616036+ajmalanoski@users.noreply.github.com>
Co-authored-by: Alireza <Alirezasampoor@gmail.com>
Co-authored-by: Biswaroop Bhattacharjee <biswaroop08@gmail.com>
Co-authored-by: Muhammad Fakhri Putra Supriyadi <fakhriputra123s@gmail.com>
Co-authored-by: Ben Fernandes <dev.benfernandes@gmail.com>
Co-authored-by: Jim Regan <jaoregan@tcd.ie>
Co-authored-by: platipo <enrico.paganin@mail.com>
Co-authored-by: yeonju123 <yeonju123@gmail.com>
Co-authored-by: unknown <Yeonju@NYCMAXASIKKAW10.ad.insidemedia.net>
Co-authored-by: Hossep Dolatian <hovdeov@gmail.com>