-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrates Wortschatz frequencies #122
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit and a question, but this basically looks good to me.
import tarfile | ||
import time | ||
|
||
with open("wortschatz_languages.json", "r") as langs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this at global scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question!
Basically looks good to me. I don't understand the *.tar.gz in the .gitignore though. I get why you want to keep the tarballs around, but maybe just add the directory as a .gitignore, not all filetypes like that? |
I experimented with different ways of grabbing the Wortschatz data (including with cURL) to see if I could resolve the 404 response problem, but generally got the same behavior. The solution in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving/LGTM, one more minor nit.
@@ -51,6 +48,9 @@ def unpack(): | |||
|
|||
|
|||
def main(): | |||
with open("wortschatz_languages.json", "r") as langs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this string a global variable at the top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps I should add it to the other path constants in src/codes.py
?
That'd be fine too, whatever you think is more consistent.
…On Sun, Dec 15, 2019 at 6:28 PM Lucas Ashby ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In languages/wikipron/src/frequencies/grab_wortschatz_data.py
<https://github.com/kylebgorman/wikipron/pull/122#discussion_r358014869>:
> @@ -51,6 +48,9 @@ def unpack():
def main():
+ with open("wortschatz_languages.json", "r") as langs:
Or perhaps I should add it to the other path constants in src/codes.py?
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<https://github.com/kylebgorman/wikipron/pull/122?email_source=notifications&email_token=AABG4OOKL37LWXNIPAXX6P3QY24P7A5CNFSM4J3CKGC2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCPHG4XA#discussion_r358014869>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABG4ON7ZX4I6ZO2U6THPCDQY24P7ANCNFSM4J3CKGCQ>
.
|
This PR adds two scripts and a json file to a new directory
src/frequencies
.grab_wortschatz_data.py
automatically downloads and unpacks Wortschatz tars andmerge.py
merges in the Wortschatz word frequency counts with our TSVs such that we end up with three column TSVs (our old TSVs are replaced):This is currently done for 61 languages, I'll see if I can add more before our next big scrape.
For words that we scrape and that Wortschatz doesn't have a frequency count for, the third column is set to 0.
I added the
tars
andfreq_tsvs
directories (both created when runninggrab_wortschatz_data.py
) to.gitignore
as opposed to automatically deleting them oncemerge.py
completes because we download about 10 gigabytes worth of Wortschatz stuff and it's a pain to have to re-download it all every time we want to test these two scripts.