[slv] Fixes Slovenian normalization. Re-scrapes Slovenian. #356

Merged
merged 117 commits into from
Feb 10, 2021

Conversation

Collaborator

@ajmalanoski ajmalanoski commented Feb 10, 2021

  • Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

Also adds a script to change a file's Unicode normalization.
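For readers unfamiliar with the idea, here is a minimal sketch of what changing a file's Unicode normalization can look like. It is not the actual data/src/normalize.py from this PR; the function names `normalize_text` and `normalize_file` and the NFC default are assumptions made for illustration, and only the standard library's `unicodedata.normalize` is used.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Returns text converted to the given Unicode normalization form.

    form is one of "NFC", "NFD", "NFKC", or "NFKD".
    """
    return unicodedata.normalize(form, text)


def normalize_file(path: str, form: str = "NFC") -> None:
    """Rewrites the UTF-8 file at path in place using the given form.

    Hypothetical helper: reads the whole file, normalizes it, writes it back.
    """
    with open(path, "r", encoding="utf-8") as source:
        text = source.read()
    with open(path, "w", encoding="utf-8") as sink:
        sink.write(normalize_text(text, form))
```

A phone list that mixes precomposed and decomposed characters could then be made consistent with a call such as `normalize_file("some_list.phones", "NFC")`. The path here is a placeholder, and this conversation does not say which form the project actually uses.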

ajmalanoski and others added 30 commits October 2, 2020 10:48
I don't know where this file came from...
@ajmalanoski
Collaborator Author

Oh, I just remembered I forgot to do the postprocessing and summaries. I'll fix that right now

Collaborator

@kylebgorman kylebgorman left a comment

Sorry, but lots of finicky complaints about src/normalize.py; I'm trying to preserve a standard style (which is roughly PEP-8) throughout.

data/src/normalize.py: 10 resolved review comments, 9 of them outdated
@kylebgorman
Collaborator

Can you run black -l79 over normalize.py and try again? That seems to be the broken test.

@ajmalanoski
Collaborator Author

I don't know why the black test is still failing. Every line is under 79 characters now.

@kylebgorman
Collaborator

kylebgorman commented Feb 10, 2021

Strange. It checks (and can change) things other than line length, so maybe something else is at play. Or maybe you need to add + commit again?

@ajmalanoski
Collaborator Author

Ah, I hadn't seen your previous comment when I wrote my last comment. I ran the black command and it's working now.

Collaborator

@kylebgorman kylebgorman left a comment

LGTM. I'll let you push the big green button...it's very satisfying.

@ajmalanoski ajmalanoski merged commit b75d605 into CUNY-CL:master Feb 10, 2021
@ajmalanoski ajmalanoski deleted the slv branch February 12, 2021 04:31
ajmalanoski added a commit to ajmalanoski/wikipron that referenced this pull request Feb 12, 2021
ajmalanoski added a commit that referenced this pull request Feb 12, 2021
* Added French phonemic phones list. Added filter French phonemic tsv.

* Added French phonemic phones.

* Updated Changelog.

* Added phones

* Added filtered phonemic wordlist

* Added Serbo-Croatian phonemes and filtered TSV files.

* Updated summaries for Serbo-Croatian phones.

* Updated CHANGELOG.

* Fixed formatting of Serbo-Croat phones file and CHANGELOG.

* Updated fork to match upstream.

* Updated fork to match upstream

* Delete .DS_Store

I don't know where this file came from...

* Delete .DS_Store

* Delete hbs_phonemic_phones.txt

* Delete .DS_Store

* [ita] Adds phoneme list, filtered phonemic TSV file

* Updates CHANGELOG

* Adds updated README and language summary

* Updates CHANGELOG with issue number for Italian phone list

* Adds Adyghe phones, filtered Adyghe data

* Updated CHANGELOG

* Adds Bulgarian phone list, filtered Bulgarian data

* Postprocesses with filtered Bulgarian data

* Updates changelog

* Adds Icelandic phones, filtered TSV file

* Updates changelog

* Adds Slovenian phones, filtered Slovenian data

* Updates changelog

* Add normalization to list_phones.py

* Updates changelog

* Reformats list_phones.py

* Adds Welsh phoneme lists, filtered Welsh TSV data

* Updates changelog

* Updates  with instructions to re-scrape

* Updates changelog

* Updates

* Updates data/phones/README.md

* Adds Vietnamese phones, Vietnamese TSV files

* Updates changelog

* Adds Hindi  file, new/updated TSV files

* Updates changelog

* Fixes Serbo-Croatian phones

* Updates CHANGELOG

* Revert "Adds Hindi  file, new/updated TSV files"

This reverts commit 964c3be.

* Adds Portuguese .phones files, re-scraped TSV data

* Rescrapes Portuguese data

* Updates changelog

* Adds Burmese phones, updated Burmese data

* Updates changelog

* Adds Japanese phone list. Rescrapes Japanese data

* Updates changelog

* Removes data/tsv/jpn_hira_phonemic.tsv

* Adds Azerbaijani phones, updated TSV data

* Updates changelog

* Adds Turkish phones, rescraped Turkish data

* Updates changelog

* Adds Maltese phones, updated data

* Updates changelog

* Adds Latvian phones, updated Latvian data

* Updates changelog

* Adds Khmer phones and updated TSV data

* Updates changelog

* Adds Østnorsk (Bokmål) phones and updated TSV data

* Updates changelog

* Fixes typo

* Update data/phones/README.md

* Update changelog

* Re-scrapes Armenian data. Fixes error in West Armenian phone list

* Updates changelog

* Attempts to fix data/phones/README.md

* Fixes paths in data/phones/README.md

* Fixes links in data/phones/HOWTO.md

* Fixes paths in data/src/generate_phones_summary.py

* Updates changelog

* Adds normalization instructions in data/phones/HOWTO.md

* Fixes equal signs in changelog

* Updates changelog

* Updates data/src/normalize.py to make it more efficient. Additionally, adds a shebang to make it executable

* Fixes spacing in data/src/normalize.py

* Updates changelog. Fixes path typo for #356

* Updates data/src/normalize.py doc

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>
ajmalanoski added a commit that referenced this pull request Mar 9, 2021
* Adds script to change file unicode normalization. Fixes normalization in Slovene phone lists. Re-scrapes Slovene.

* Updates changelog

* Postprocessing after Slovene scrape

* Fixes style in data/src/normalize.py

* Fixes style in data/src/normalize.py

* Fixes style in data/src/normalize.py (again)

* Fixes line length in data/src/normalize.py…I hope

* Ran black on data/src/normalize

* Adds normalization instructions in data/phones/HOWTO.md

* Fixes equal signs in changelog

* Updates changelog

* Updates data/src/normalize.py to make it more efficient. Additionally, adds a shebang to make it executable

* Fixes normalization command in step 5

* Fixes spacing in data/src/normalize.py

* Updates changelog. Fixes path typo for #356

* Adds CG for Georgian. Fixes errors/misleading aspects of Georgian phonelist

* Updates changelog

* Fixes typo in changelog

* Fixes taps in Georgian CG

* Postprocessing after Georgian phonelist edits

* Fixes typo in geo_phonemic.phones

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>
ajmalanoski added a commit that referenced this pull request Mar 17, 2021
* Fixes typo in Georgian covering grammar

* Updates changelog

* Adds missing character in Georgian covering grammar

* Updates changelog

* Changes spaces to tabs

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>
kylebgorman added a commit that referenced this pull request Mar 23, 2021
* Adds Japanese covering grammar (Hiragana, phonetic)

* Updates changelog

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>
ajmalanoski added a commit that referenced this pull request Mar 25, 2021
* Fix data/src/generate_phones_summary.py

* Fixes data/src/generate_phones_summary.py (2nd attempt)

* Updates changelog

* Updates tests/test_data/test_summary.py

* Stylistic fix to data/src/generate_phones_summary.py

Co-authored-by: Kyle Gorman <kylebgorman@gmail.com>