Integrates Wortschatz frequencies #122

lfashby · 2019-12-15T22:19:04Z

This PR adds two scripts and a JSON file to a new directory, src/frequencies. grab_wortschatz_data.py automatically downloads and unpacks the Wortschatz tarballs, and merge.py merges the Wortschatz word frequency counts into our TSVs, so that we end up with three-column TSVs (our old TSVs are replaced):

алтын	ɑ ɫ t ɯ̞ n	745

This is currently done for 61 languages; I'll see if I can add more before our next big scrape.

For words that we scrape but that Wortschatz has no frequency count for, the third column is set to 0.
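
A minimal sketch of the merge behavior described above, not the actual merge.py. The function name `merge_frequencies`, the in-memory inputs, and the second row (`foo`) are made up for illustration; only the алтын row and the default of 0 come from this PR.

```python
import csv
import io


def merge_frequencies(pron_rows, freq_by_word):
    """Appends a frequency column to (word, pronunciation) rows.

    Words that are missing from freq_by_word get a count of 0.
    """
    return [
        (word, pron, freq_by_word.get(word, 0)) for word, pron in pron_rows
    ]


# Stand-in for a scraped two-column TSV; the "foo" row is made up.
pron_tsv = "алтын\tɑ ɫ t ɯ̞ n\nfoo\tf u\n"
# Stand-in for the Wortschatz counts; "foo" is deliberately absent.
freqs = {"алтын": 745}

rows = list(csv.reader(io.StringIO(pron_tsv), delimiter="\t"))
merged = merge_frequencies(rows, freqs)
for word, pron, count in merged:
    print(f"{word}\t{pron}\t{count}")
```

Using `dict.get` with a default keeps every scraped word in the output, so the merged TSV has the same number of rows as the original.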

I added the tars and freq_tsvs directories (both created when running grab_wortschatz_data.py) to .gitignore rather than deleting them automatically once merge.py completes. We download about 10 gigabytes of Wortschatz data, and it's a pain to re-download it all every time we want to test these two scripts.

kylebgorman

Nit and a question, but this basically looks good to me.

kylebgorman · 2019-12-15T22:28:48Z

languages/wikipron/src/frequencies/grab_wortschatz_data.py

+import tarfile
+import time
+
+with open("wortschatz_languages.json", "r") as langs:


why is this at global scope?

Good question!

kylebgorman · 2019-12-15T22:32:33Z

Basically looks good to me. I don't understand the *.tar.gz in the .gitignore though. I get why you want to keep the tarballs around, but maybe just add the directory to .gitignore rather than ignoring every file of that type?

lfashby · 2019-12-15T22:58:53Z

I experimented with different ways of grabbing the Wortschatz data (including with cURL) to see if I could resolve the 404 response problem, but generally got the same behavior. The solution in grab_wortschatz_data.py is not ideal, but it does eventually collect all the tarballs we need, and it only needs to be run once.
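
For readers unfamiliar with the workaround, here is one way a retry-on-404 download could look. This is a sketch under assumptions, not the code in grab_wortschatz_data.py: it assumes the 404s are transient, uses the standard library's `urllib` rather than whatever the script actually uses, and the name `fetch_with_retry` and its parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(
    url, attempts=5, wait_seconds=2.0, opener=urllib.request.urlopen
):
    """Returns the response body, retrying when the server answers 404.

    `opener` is injectable so the retry logic can be exercised offline.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only a 404 is retried; any other HTTP error propagates.
            # The last failed attempt re-raises instead of sleeping.
            if err.code != 404 or attempt == attempts:
                raise
            time.sleep(wait_seconds)
```

Capping the number of attempts means a tarball that is genuinely missing fails loudly instead of looping forever.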

kylebgorman

Approving/LGTM, one more minor nit.

kylebgorman · 2019-12-15T23:25:00Z

languages/wikipron/src/frequencies/grab_wortschatz_data.py

@@ -51,6 +48,9 @@ def unpack():


 def main():
+    with open("wortschatz_languages.json", "r") as langs:


make this string a global variable at the top?

Or perhaps I should add it to the other path constants in src/codes.py?

kylebgorman · 2019-12-15T23:33:42Z

That'd be fine too, whatever you think is more consistent.

On Sun, Dec 15, 2019 at 6:28 PM Lucas Ashby wrote:

> Or perhaps I should add it to the other path constants in src/codes.py?
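
The nit above amounts to naming the path once instead of writing it inline in main(). A rough sketch of that pattern follows; the constant name `LANGUAGES_PATH` and the helper `load_languages` are hypothetical, and whether the constant lives in the script or in src/codes.py is left open, as in the thread.

```python
import json

# Hypothetical name; the thread only settles that the path should be a
# named constant rather than a string literal inside main().
LANGUAGES_PATH = "wortschatz_languages.json"


def load_languages(path=LANGUAGES_PATH):
    """Reads the language table from the given JSON file."""
    with open(path, "r") as source:
        return json.load(source)
```

Passing the path as a defaulted argument also keeps the file read out of global scope, which was the earlier review concern.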

lfashby added 5 commits December 15, 2019 16:37

First pass successfully integrating Wortschatz data

c48a389

Added comments to new scripts

3afee5d

Cleaned up comments

e33e250

Final cleanup

bab7474

added shebang

63490a7

lfashby changed the title ~~Integrates Wortshatz frequencies~~ Integrates Wortschatz frequencies Dec 15, 2019

lfashby requested a review from kylebgorman December 15, 2019 22:23

kylebgorman reviewed Dec 15, 2019

changes to .gitignore and grab_wortschatz_data.py

1cebb53

kylebgorman approved these changes Dec 15, 2019

added path to wortschatz dictionary as global constant

d9acc80

lfashby merged commit 55ddd85 into CUNY-CL:master Dec 15, 2019

lfashby deleted the frequencies branch December 25, 2019 16:32
