Filtering (#199)
* [rum] Add whitelist and rescrape.

* [eng] Adds English rescrape.

* [dut] Adds Dutch rescrape.

* [gre] Adds Greek rescrape.

* [gre] Adds Greek rescrape.

* Updates scrape path for phonetic filtering.

Closes #195.

* [rum] Adds Romanian rescrape.

* [arm] Adds Armenian rescrape.

* [gre] Adds Greek rescrape (second try).

* [arm] Adds Armenian dialects + rescrapes.

Closes #197.

* Adds CHANGELOG changes.

* [spa] Adds Spanish rescrape.

* Postprocess and regenerate summaries.
kylebgorman committed Aug 7, 2020
1 parent 91b554b commit 4af0af1
Showing 36 changed files with 383,811 additions and 50,529 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -42,6 +42,9 @@ Unreleased
 - Flattened data directory structure. (\#194)
 - Updated Georgian (`geo`) to take advantage of upstream bot-based
   consistency fixes. (\#138)
+- Fixes path issue with phonetic whitelisted files. (\#195)
+- Split `arm` into Eastern and Western dialects. (\#197)
+- Rescraped files with new whitelists. (\#199)
 - Updates logging statements for consistency. (\#196)
 
 ### Deprecated
41 changes: 26 additions & 15 deletions data/README.md
@@ -10,7 +10,10 @@
 | [TSV](tsv/grc_phonemic.tsv) | grc | Ancient Greek (to 1453) | Ancient Greek | True | Phonemic | 73,879 |
 | [TSV](tsv/ara_phonemic.tsv) | ara | Arabic | Arabic | False | Phonemic | 5,102 |
 | [TSV](tsv/arc_hebr_phonemic.tsv) | arc | Imperial Aramaic (700-300 BCE); Official Aramaic (700-300 BCE) | Aramaic (Hebrew) | False | Phonemic | 1,156 |
-| [TSV](tsv/arm_phonetic.tsv) | arm | Armenian | Armenian | True | Phonetic | 13,656 |
+| [TSV](tsv/arm_e_phonetic.tsv) | arm | Armenian | Armenian (Eastern Armenian) | True | Phonetic | 13,702 |
+| [TSV](tsv/arm_e_phonetic_filtered.tsv) | arm | Armenian | Armenian (Eastern Armenian) | True | Phonetic_filtered | 13,258 |
+| [TSV](tsv/arm_w_phonetic.tsv) | arm | Armenian | Armenian (Western Armenian) | True | Phonetic | 742 |
+| [TSV](tsv/arm_w_phonetic_filtered.tsv) | arm | Armenian | Armenian (Western Armenian) | True | Phonetic_filtered | 599 |
 | [TSV](tsv/asm_phonemic.tsv) | asm | Assamese | Assamese | False | Phonemic | 2,209 |
 | [TSV](tsv/ast_phonetic.tsv) | ast | Asturian | Asturian | True | Phonetic | 129 |
 | [TSV](tsv/aze_latn_phonemic.tsv) | aze | Azerbaijani | Azerbaijani (Latin) | True | Phonemic | 210 |
@@ -44,14 +47,17 @@
 | [TSV](tsv/dan_phonemic.tsv) | dan | Danish | Danish | True | Phonemic | 3,490 |
 | [TSV](tsv/dan_phonetic.tsv) | dan | Danish | Danish | True | Phonetic | 4,116 |
 | [TSV](tsv/sce_phonemic.tsv) | sce | Dongxiang | Dongxiang | True | Phonemic | 117 |
-| [TSV](tsv/dut_phonemic.tsv) | dut | Dutch; Flemish | Dutch | True | Phonemic | 25,922 |
-| [TSV](tsv/dut_phonetic.tsv) | dut | Dutch; Flemish | Dutch | True | Phonetic | 632 |
+| [TSV](tsv/dut_phonemic.tsv) | dut | Dutch; Flemish | Dutch | True | Phonemic | 27,712 |
+| [TSV](tsv/dut_phonemic_filtered.tsv) | dut | Dutch; Flemish | Dutch | True | Phonemic_filtered | 27,661 |
+| [TSV](tsv/dut_phonetic.tsv) | dut | Dutch; Flemish | Dutch | True | Phonetic | 641 |
 | [TSV](tsv/dzo_phonemic.tsv) | dzo | Dzongkha | Dzongkha | False | Phonemic | 188 |
 | [TSV](tsv/egy_phonemic.tsv) | egy | Egyptian (Ancient) | Egyptian | False | Phonemic | 2,785 |
-| [TSV](tsv/eng_uk_phonemic.tsv) | eng | English | English (UK, Received Pronunciation) | True | Phonemic | 55,050 |
-| [TSV](tsv/eng_uk_phonetic.tsv) | eng | English | English (UK, Received Pronunciation) | True | Phonetic | 1,243 |
-| [TSV](tsv/eng_us_phonemic.tsv) | eng | English | English (US, General American) | True | Phonemic | 51,199 |
-| [TSV](tsv/eng_us_phonetic.tsv) | eng | English | English (US, General American) | True | Phonetic | 1,588 |
+| [TSV](tsv/eng_uk_phonemic.tsv) | eng | English | English (UK, Received Pronunciation) | True | Phonemic | 56,674 |
+| [TSV](tsv/eng_uk_phonemic_filtered.tsv) | eng | English | English (UK, Received Pronunciation) | True | Phonemic_filtered | 55,962 |
+| [TSV](tsv/eng_uk_phonetic.tsv) | eng | English | English (UK, Received Pronunciation) | True | Phonetic | 1,264 |
+| [TSV](tsv/eng_us_phonemic.tsv) | eng | English | English (US, General American) | True | Phonemic | 53,011 |
+| [TSV](tsv/eng_us_phonemic_filtered.tsv) | eng | English | English (US, General American) | True | Phonemic_filtered | 52,208 |
+| [TSV](tsv/eng_us_phonetic.tsv) | eng | English | English (US, General American) | True | Phonetic | 1,604 |
 | [TSV](tsv/epo_phonemic.tsv) | epo | Esperanto | Esperanto | True | Phonemic | 14,289 |
 | [TSV](tsv/est_phonemic.tsv) | est | Estonian | Estonian | True | Phonemic | 348 |
 | [TSV](tsv/fao_phonemic.tsv) | fao | Faroese | Faroese | True | Phonemic | 1,647 |
@@ -69,8 +75,9 @@
 | [TSV](tsv/ger_phonetic.tsv) | ger | German | German | True | Phonetic | 9,885 |
 | [TSV](tsv/got_phonemic.tsv) | got | Gothic | Gothic | True | Phonemic | 665 |
 | [TSV](tsv/got_phonetic.tsv) | got | Gothic | Gothic | True | Phonetic | 122 |
-| [TSV](tsv/gre_phonemic.tsv) | gre | Modern Greek (1453-) | Greek | True | Phonemic | 9,578 |
-| [TSV](tsv/gre_phonetic.tsv) | gre | Modern Greek (1453-) | Greek | True | Phonetic | 405 |
+| [TSV](tsv/gre_phonemic.tsv) | gre | Modern Greek (1453-) | Greek | True | Phonemic | 10,037 |
+| [TSV](tsv/gre_phonemic_filtered.tsv) | gre | Modern Greek (1453-) | Greek | True | Phonemic_filtered | 9,829 |
+| [TSV](tsv/gre_phonetic.tsv) | gre | Modern Greek (1453-) | Greek | True | Phonetic | 404 |
 | [TSV](tsv/afb_phonemic.tsv) | afb | Gulf Arabic | Gulf Arabic | False | Phonemic | 432 |
 | [TSV](tsv/hts_phonemic.tsv) | hts | Hadza | Hadza | True | Phonemic | 273 |
 | [TSV](tsv/haw_phonemic.tsv) | haw | Hawaiian | Hawaiian | True | Phonemic | 441 |
@@ -82,6 +89,7 @@
 | [TSV](tsv/hin_phonemic.tsv) | hin | Hindi | Hindi | False | Phonemic | 8,255 |
 | [TSV](tsv/hin_phonetic.tsv) | hin | Hindi | Hindi | False | Phonetic | 214 |
 | [TSV](tsv/hun_phonetic.tsv) | hun | Hungarian | Hungarian | True | Phonetic | 49,665 |
+| [TSV](tsv/hun_phonetic_filtered.tsv) | hun | Hungarian | Hungarian | True | Phonetic_filtered | 49,665 |
 | [TSV](tsv/hrx_phonemic.tsv) | hrx | Hunsrik | Hunsrik | True | Phonemic | 1,248 |
 | [TSV](tsv/ice_phonemic.tsv) | ice | Icelandic | Icelandic | True | Phonemic | 9,437 |
 | [TSV](tsv/ice_phonetic.tsv) | ice | Icelandic | Icelandic | True | Phonetic | 339 |
@@ -176,8 +184,9 @@
 | [TSV](tsv/por_po_phonemic.tsv) | por | Portuguese | Portuguese (Portugal) | True | Phonemic | 9,633 |
 | [TSV](tsv/por_po_phonetic.tsv) | por | Portuguese | Portuguese (Portugal) | True | Phonetic | 356 |
 | [TSV](tsv/pan_guru_phonemic.tsv) | pan | Panjabi | Punjabi (Gurmukhi) | False | Phonemic | 113 |
-| [TSV](tsv/rum_phonemic.tsv) | rum | Romanian; Moldavian; Moldovan | Romanian | True | Phonemic | 3,175 |
-| [TSV](tsv/rum_phonetic.tsv) | rum | Romanian; Moldavian; Moldovan | Romanian | True | Phonetic | 4,805 |
+| [TSV](tsv/rum_phonemic.tsv) | rum | Romanian; Moldavian; Moldovan | Romanian | True | Phonemic | 3,654 |
+| [TSV](tsv/rum_phonetic.tsv) | rum | Romanian; Moldavian; Moldovan | Romanian | True | Phonetic | 6,308 |
+| [TSV](tsv/rum_phonetic_filtered.tsv) | rum | Romanian; Moldavian; Moldovan | Romanian | True | Phonetic_filtered | 6,201 |
 | [TSV](tsv/rus_phonetic.tsv) | rus | Russian | Russian | True | Phonetic | 389,128 |
 | [TSV](tsv/san_phonemic.tsv) | san | Sanskrit | Sanskrit | False | Phonemic | 4,614 |
 | [TSV](tsv/san_phonetic.tsv) | san | Sanskrit | Sanskrit | False | Phonetic | 381 |
@@ -193,10 +202,12 @@
 | [TSV](tsv/slo_phonemic.tsv) | slo | Slovak | Slovak | True | Phonemic | 3,724 |
 | [TSV](tsv/slo_phonetic.tsv) | slo | Slovak | Slovak | True | Phonetic | 456 |
 | [TSV](tsv/slv_phonemic.tsv) | slv | Slovenian | Slovene | True | Phonemic | 4,360 |
-| [TSV](tsv/spa_ca_phonemic.tsv) | spa | Spanish; Castilian | Spanish (Castilian) | True | Phonemic | 55,623 |
-| [TSV](tsv/spa_ca_phonetic.tsv) | spa | Spanish; Castilian | Spanish (Castilian) | True | Phonetic | 47,950 |
-| [TSV](tsv/spa_la_phonemic.tsv) | spa | Spanish; Castilian | Spanish (Latin America) | True | Phonemic | 44,487 |
-| [TSV](tsv/spa_la_phonetic.tsv) | spa | Spanish; Castilian | Spanish (Latin America) | True | Phonetic | 38,392 |
+| [TSV](tsv/spa_ca_phonemic.tsv) | spa | Spanish; Castilian | Spanish (Castilian) | True | Phonemic | 57,472 |
+| [TSV](tsv/spa_ca_phonemic_filtered.tsv) | spa | Spanish; Castilian | Spanish (Castilian) | True | Phonemic_filtered | 57,342 |
+| [TSV](tsv/spa_ca_phonetic.tsv) | spa | Spanish; Castilian | Spanish (Castilian) | True | Phonetic | 49,410 |
+| [TSV](tsv/spa_la_phonemic.tsv) | spa | Spanish; Castilian | Spanish (Latin America) | True | Phonemic | 45,948 |
+| [TSV](tsv/spa_la_phonemic_filtered.tsv) | spa | Spanish; Castilian | Spanish (Latin America) | True | Phonemic_filtered | 45,880 |
+| [TSV](tsv/spa_la_phonetic.tsv) | spa | Spanish; Castilian | Spanish (Latin America) | True | Phonetic | 39,581 |
 | [TSV](tsv/srn_phonemic.tsv) | srn | Sranan Tongo | Sranan Tongo | True | Phonemic | 157 |
 | [TSV](tsv/swe_phonemic.tsv) | swe | Swedish | Swedish | True | Phonemic | 3,113 |
 | [TSV](tsv/swe_phonetic.tsv) | swe | Swedish | Swedish | True | Phonetic | 154 |
41 changes: 26 additions & 15 deletions data/languages_summary.tsv
@@ -8,7 +8,10 @@ ale_phonemic.tsv ale Aleut Aleut True Phonemic 104
 grc_phonemic.tsv grc Ancient Greek (to 1453) Ancient Greek True Phonemic 73879
 ara_phonemic.tsv ara Arabic Arabic False Phonemic 5102
 arc_hebr_phonemic.tsv arc Imperial Aramaic (700-300 BCE); Official Aramaic (700-300 BCE) Aramaic (Hebrew) False Phonemic 1156
-arm_phonetic.tsv arm Armenian Armenian True Phonetic 13656
+arm_e_phonetic.tsv arm Armenian Armenian (Eastern Armenian) True Phonetic 13702
+arm_e_phonetic_filtered.tsv arm Armenian Armenian (Eastern Armenian) True Phonetic_filtered 13258
+arm_w_phonetic.tsv arm Armenian Armenian (Western Armenian) True Phonetic 742
+arm_w_phonetic_filtered.tsv arm Armenian Armenian (Western Armenian) True Phonetic_filtered 599
 asm_phonemic.tsv asm Assamese Assamese False Phonemic 2209
 ast_phonetic.tsv ast Asturian Asturian True Phonetic 129
 aze_latn_phonemic.tsv aze Azerbaijani Azerbaijani (Latin) True Phonemic 210
@@ -42,14 +45,17 @@ dlm_phonemic.tsv dlm Dalmatian Dalmatian True Phonemic 176
 dan_phonemic.tsv dan Danish Danish True Phonemic 3490
 dan_phonetic.tsv dan Danish Danish True Phonetic 4116
 sce_phonemic.tsv sce Dongxiang Dongxiang True Phonemic 117
-dut_phonemic.tsv dut Dutch; Flemish Dutch True Phonemic 25922
-dut_phonetic.tsv dut Dutch; Flemish Dutch True Phonetic 632
+dut_phonemic.tsv dut Dutch; Flemish Dutch True Phonemic 27712
+dut_phonemic_filtered.tsv dut Dutch; Flemish Dutch True Phonemic_filtered 27661
+dut_phonetic.tsv dut Dutch; Flemish Dutch True Phonetic 641
 dzo_phonemic.tsv dzo Dzongkha Dzongkha False Phonemic 188
 egy_phonemic.tsv egy Egyptian (Ancient) Egyptian False Phonemic 2785
-eng_uk_phonemic.tsv eng English English (UK, Received Pronunciation) True Phonemic 55050
-eng_uk_phonetic.tsv eng English English (UK, Received Pronunciation) True Phonetic 1243
-eng_us_phonemic.tsv eng English English (US, General American) True Phonemic 51199
-eng_us_phonetic.tsv eng English English (US, General American) True Phonetic 1588
+eng_uk_phonemic.tsv eng English English (UK, Received Pronunciation) True Phonemic 56674
+eng_uk_phonemic_filtered.tsv eng English English (UK, Received Pronunciation) True Phonemic_filtered 55962
+eng_uk_phonetic.tsv eng English English (UK, Received Pronunciation) True Phonetic 1264
+eng_us_phonemic.tsv eng English English (US, General American) True Phonemic 53011
+eng_us_phonemic_filtered.tsv eng English English (US, General American) True Phonemic_filtered 52208
+eng_us_phonetic.tsv eng English English (US, General American) True Phonetic 1604
 epo_phonemic.tsv epo Esperanto Esperanto True Phonemic 14289
 est_phonemic.tsv est Estonian Estonian True Phonemic 348
 fao_phonemic.tsv fao Faroese Faroese True Phonemic 1647
@@ -67,8 +73,9 @@ ger_phonemic.tsv ger German German True Phonemic 30872
 ger_phonetic.tsv ger German German True Phonetic 9885
 got_phonemic.tsv got Gothic Gothic True Phonemic 665
 got_phonetic.tsv got Gothic Gothic True Phonetic 122
-gre_phonemic.tsv gre Modern Greek (1453-) Greek True Phonemic 9578
-gre_phonetic.tsv gre Modern Greek (1453-) Greek True Phonetic 405
+gre_phonemic.tsv gre Modern Greek (1453-) Greek True Phonemic 10037
+gre_phonemic_filtered.tsv gre Modern Greek (1453-) Greek True Phonemic_filtered 9829
+gre_phonetic.tsv gre Modern Greek (1453-) Greek True Phonetic 404
 afb_phonemic.tsv afb Gulf Arabic Gulf Arabic False Phonemic 432
 hts_phonemic.tsv hts Hadza Hadza True Phonemic 273
 haw_phonemic.tsv haw Hawaiian Hawaiian True Phonemic 441
@@ -80,6 +87,7 @@ acw_phonetic.tsv acw Hijazi Arabic Hijazi Arabic False Phonetic 127
 hin_phonemic.tsv hin Hindi Hindi False Phonemic 8255
 hin_phonetic.tsv hin Hindi Hindi False Phonetic 214
 hun_phonetic.tsv hun Hungarian Hungarian True Phonetic 49665
+hun_phonetic_filtered.tsv hun Hungarian Hungarian True Phonetic_filtered 49665
 hrx_phonemic.tsv hrx Hunsrik Hunsrik True Phonemic 1248
 ice_phonemic.tsv ice Icelandic Icelandic True Phonemic 9437
 ice_phonetic.tsv ice Icelandic Icelandic True Phonetic 339
@@ -174,8 +182,9 @@ por_bz_phonetic.tsv por Portuguese Portuguese (Brazil) True Phonetic 396
 por_po_phonemic.tsv por Portuguese Portuguese (Portugal) True Phonemic 9633
 por_po_phonetic.tsv por Portuguese Portuguese (Portugal) True Phonetic 356
 pan_guru_phonemic.tsv pan Panjabi Punjabi (Gurmukhi) False Phonemic 113
-rum_phonemic.tsv rum Romanian; Moldavian; Moldovan Romanian True Phonemic 3175
-rum_phonetic.tsv rum Romanian; Moldavian; Moldovan Romanian True Phonetic 4805
+rum_phonemic.tsv rum Romanian; Moldavian; Moldovan Romanian True Phonemic 3654
+rum_phonetic.tsv rum Romanian; Moldavian; Moldovan Romanian True Phonetic 6308
+rum_phonetic_filtered.tsv rum Romanian; Moldavian; Moldovan Romanian True Phonetic_filtered 6201
 rus_phonetic.tsv rus Russian Russian True Phonetic 389128
 san_phonemic.tsv san Sanskrit Sanskrit False Phonemic 4614
 san_phonetic.tsv san Sanskrit Sanskrit False Phonetic 381
@@ -191,10 +200,12 @@ scn_phonetic.tsv scn Sicilian Sicilian True Phonetic 161
 slo_phonemic.tsv slo Slovak Slovak True Phonemic 3724
 slo_phonetic.tsv slo Slovak Slovak True Phonetic 456
 slv_phonemic.tsv slv Slovenian Slovene True Phonemic 4360
-spa_ca_phonemic.tsv spa Spanish; Castilian Spanish (Castilian) True Phonemic 55623
-spa_ca_phonetic.tsv spa Spanish; Castilian Spanish (Castilian) True Phonetic 47950
-spa_la_phonemic.tsv spa Spanish; Castilian Spanish (Latin America) True Phonemic 44487
-spa_la_phonetic.tsv spa Spanish; Castilian Spanish (Latin America) True Phonetic 38392
+spa_ca_phonemic.tsv spa Spanish; Castilian Spanish (Castilian) True Phonemic 57472
+spa_ca_phonemic_filtered.tsv spa Spanish; Castilian Spanish (Castilian) True Phonemic_filtered 57342
+spa_ca_phonetic.tsv spa Spanish; Castilian Spanish (Castilian) True Phonetic 49410
+spa_la_phonemic.tsv spa Spanish; Castilian Spanish (Latin America) True Phonemic 45948
+spa_la_phonemic_filtered.tsv spa Spanish; Castilian Spanish (Latin America) True Phonemic_filtered 45880
+spa_la_phonetic.tsv spa Spanish; Castilian Spanish (Latin America) True Phonetic 39581
 srn_phonemic.tsv srn Sranan Tongo Sranan Tongo True Phonemic 157
 swe_phonemic.tsv swe Swedish Swedish True Phonemic 3113
 swe_phonetic.tsv swe Swedish Swedish True Phonetic 154
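Reviewer note: the summary rows pair each unfiltered file with its `_filtered` counterpart, so the effect of a whitelist can be read off by subtracting the two counts. The sketch below is one way to do that. It assumes the real `languages_summary.tsv` is tab-delimited with the count in the seventh column (the tabs appear as spaces on this page), and `entries_removed` is a hypothetical helper, not part of the repository.

```python
import csv
import io
from typing import Dict, Iterable, List

# Rows copied from this diff, with tabs restored (an assumption about the file).
SAMPLE = "\n".join(
    [
        "arm_w_phonetic.tsv\tarm\tArmenian\tArmenian (Western Armenian)\tTrue\tPhonetic\t742",
        "arm_w_phonetic_filtered.tsv\tarm\tArmenian\tArmenian (Western Armenian)\tTrue\tPhonetic_filtered\t599",
        "hun_phonetic.tsv\thun\tHungarian\tHungarian\tTrue\tPhonetic\t49665",
        "hun_phonetic_filtered.tsv\thun\tHungarian\tHungarian\tTrue\tPhonetic_filtered\t49665",
    ]
)

SUFFIX = "_filtered.tsv"


def entries_removed(rows: Iterable[List[str]]) -> Dict[str, int]:
    """Maps each unfiltered file name to the number of entries its whitelist dropped."""
    counts = {row[0]: int(row[6]) for row in rows}
    removed = {}
    for name, count in counts.items():
        if name.endswith(SUFFIX):
            base = name[: -len(SUFFIX)] + ".tsv"
            if base in counts:
                removed[base] = counts[base] - count
    return removed


removed = entries_removed(csv.reader(io.StringIO(SAMPLE), delimiter="\t"))
print(removed)
```

On these rows the Western Armenian whitelist drops 143 entries, while the Hungarian one drops none.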
6 changes: 5 additions & 1 deletion data/src/languages.json
@@ -62,7 +62,11 @@
         "iso639_name": "Armenian",
         "wiktionary_name": "Armenian",
         "wiktionary_code": "hy",
-        "casefold": true
+        "casefold": true,
+        "dialect": {
+            "w": "Western Armenian",
+            "e": "Eastern Armenian"
+        }
     },
     "rup": {
         "iso639_name": "Macedo-Romanian",
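Reviewer note: the new `dialect` keys line up with the file names in the summary tables (`arm_e_phonetic.tsv`, `arm_w_phonetic.tsv`). The sketch below spells out that naming rule as inferred from those tables. `tsv_names` is a hypothetical helper, and this may not be how `scrape.py` actually builds its paths.

```python
from typing import Any, Dict, List


def tsv_names(key: str, settings: Dict[str, Any], kind: str = "phonetic") -> List[str]:
    """Returns one TSV name per dialect, or a single name if no dialects are set."""
    dialects = settings.get("dialect")
    if not dialects:
        return [f"{key}_{kind}.tsv"]
    # Sorted only to make the output order deterministic for this example.
    return [f"{key}_{dialect_key}_{kind}.tsv" for dialect_key in sorted(dialects)]


# Settings copied from the hunk above.
arm = {
    "iso639_name": "Armenian",
    "wiktionary_name": "Armenian",
    "wiktionary_code": "hy",
    "casefold": True,
    "dialect": {"w": "Western Armenian", "e": "Eastern Armenian"},
}

names = tsv_names("arm", arm)
print(names)
```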
10 changes: 3 additions & 7 deletions data/src/scrape.py
@@ -19,7 +19,7 @@
 
 
 def _whitelist_reader(path: str) -> Iterator[str]:
-    # Read-in whitelist file.
+    # Reads whitelist file.
     with open(path, "r") as source:
         for line in source:
             line = re.sub(r"\s*#.*$", "", line)  # Removes comments from line.
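Reviewer note: the hunk cuts off after the comment-stripping substitution. The sketch below shows how that same regular expression behaves on a few sample lines. `read_whitelist` is a hypothetical stand-in that takes an iterable of lines rather than a path, and skipping blank lines is an assumption, since the rest of `_whitelist_reader` is not shown here.

```python
import re
from typing import Iterable, Iterator


def read_whitelist(lines: Iterable[str]) -> Iterator[str]:
    """Yields one entry per line, ignoring `#` comments."""
    for line in lines:
        # Same substitution as in the hunk: drops a trailing comment and
        # any whitespace in front of it.
        line = re.sub(r"\s*#.*$", "", line)
        entry = line.strip()
        if entry:  # Assumed: blank and comment-only lines are skipped.
            yield entry


sample = ["a  # open vowel\n", "# comment-only line\n", "\n", "k\n"]
phones = frozenset(read_whitelist(sample))
print(sorted(phones))
```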
@@ -74,7 +74,6 @@ def _call_scrape(
 # Pauses execution for 10 min.
 time.sleep(600)
 # Log and remove TSVs for languages that failed.
-# to be scraped within 10 retries.
 logging.info(
     'Failed to scrape "%s" within 10 retries. %s',
     lang_settings["key"],
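Reviewer note: the comments and log message in this hunk describe a retry policy of up to 10 attempts with a 10-minute pause after each failure. The sketch below is a generic version of that pattern, not the body of `_call_scrape`. `scrape_with_retries` is a hypothetical name, the `ConnectionError` caught here is an assumption (the real exception types are outside this hunk), and the `sleep` parameter exists only so the example runs without waiting.

```python
import logging
import time
from typing import Callable, List


def scrape_with_retries(
    scrape: Callable[[], None],
    retries: int = 10,
    pause_seconds: float = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Runs `scrape` until it succeeds or `retries` attempts have failed."""
    for attempt in range(1, retries + 1):
        try:
            scrape()
            return True
        except ConnectionError:  # Assumed exception type.
            logging.info("Attempt %d failed; pausing.", attempt)
            sleep(pause_seconds)
    logging.info("Failed to scrape within %d retries.", retries)
    return False


# Usage with a scraper that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_scrape() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated failure")


pauses: List[float] = []
succeeded = scrape_with_retries(flaky_scrape, sleep=pauses.append)
print(succeeded, pauses)
```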
@@ -130,7 +129,7 @@ def _build_scraping_config(
     config_settings["key"],
     whitelist_phonetic,
 )
-phonetic_path_filtered = f"{whitelist_path_affix}phonetic.whitelist"
+phonetic_path_filtered = f"{path_affix}phonetic_filtered.tsv"
 phone_set = frozenset(_whitelist_reader(whitelist_phonetic))
 _call_scrape(
     config_settings,
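Reviewer note: this one-line change appears to be the path fix for #195. The sketch below evaluates the old and new expressions side by side. The two affix values are hypothetical, chosen only to match the `tsv/rum_phonetic_filtered.tsv` entry in the README, and the real values are built elsewhere in `_build_scraping_config`.

```python
def old_filtered_path(whitelist_path_affix: str) -> str:
    """The removed expression: it produces a whitelist path, not a TSV path."""
    return f"{whitelist_path_affix}phonetic.whitelist"


def new_filtered_path(path_affix: str) -> str:
    """The added expression: it produces the filtered TSV path."""
    return f"{path_affix}phonetic_filtered.tsv"


# Hypothetical affix values for Romanian.
old_path = old_filtered_path("whitelist/rum_")
new_path = new_filtered_path("tsv/rum_")
print(old_path)
print(new_path)
```

Under these assumed affixes, the old expression names the whitelist file itself as the output for filtered entries, while the new one names a TSV alongside the unfiltered scrape.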
@@ -212,14 +211,11 @@ def main(args: argparse.Namespace) -> None:
     datefmt="%d-%b-%y %H:%M:%S",
     level="INFO",
 )
-
 parser = argparse.ArgumentParser()
 parser.add_argument(
     "--restriction",
     type=str,
     nargs="+",
     help="Specify language restrictions for scrape",
 )
-args = parser.parse_args()
-
-main(args)
+main(parser.parse_args())
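Reviewer note: for reference, `nargs="+"` makes `--restriction` collect one or more values into a list, and the attribute is `None` when the flag is omitted. The sketch below rebuilds the same parser in a hypothetical `build_parser` helper and parses explicit argument lists instead of `sys.argv`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Builds a parser with the same --restriction flag as the hunk above."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--restriction",
        type=str,
        nargs="+",
        help="Specify language restrictions for scrape",
    )
    return parser


restricted = build_parser().parse_args(["--restriction", "arm", "gre"])
unrestricted = build_parser().parse_args([])
print(restricted.restriction, unrestricted.restriction)
```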