Integrates Wortschatz frequencies #122

Merged: 7 commits, Dec 15, 2019

@lfashby lfashby commented Dec 15, 2019

This PR adds two scripts and a JSON file to a new directory, src/frequencies. grab_wortschatz_data.py automatically downloads and unpacks the Wortschatz tarballs, and merge.py merges the Wortschatz word frequency counts into our TSVs, so that we end up with three-column TSVs (our old TSVs are replaced):

алтын	ɑ ɫ t ɯ̞ n	745

This is currently done for 61 languages; I'll see if I can add more before our next big scrape.

For words that we scrape and that Wortschatz doesn't have a frequency count for, the third column is set to 0.
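
The merge rule described above (append the count, fall back to 0 for unseen words) can be sketched as follows. This is only an illustration, not the code in merge.py: the function name `merge_frequencies` is hypothetical, the frequencies are assumed to be already loaded into a dict, and the second input row is a made-up placeholder.

```python
def merge_frequencies(pron_lines, freq_by_word):
    """Yields 'word<TAB>pron<TAB>count' lines from 'word<TAB>pron' lines."""
    for line in pron_lines:
        word, pron = line.rstrip("\n").split("\t")
        # Words with no known frequency get a count of 0.
        yield f"{word}\t{pron}\t{freq_by_word.get(word, 0)}"


# The first row is the example from this PR; the second is a placeholder
# standing in for a word that has no frequency count.
freqs = {"алтын": 745}
rows = ["алтын\tɑ ɫ t ɯ̞ n\n", "placeholder\tp\n"]
merged = list(merge_frequencies(rows, freqs))
```

Passing the frequencies in as a dict keeps the sketch independent of how the Wortschatz files are actually laid out on disk.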

I added the tars and freq_tsvs directories (both created when running grab_wortschatz_data.py) to .gitignore rather than automatically deleting them once merge.py completes. We download about 10 gigabytes of Wortschatz data, and it's a pain to re-download it all every time we want to test these two scripts.

@lfashby changed the title from "Integrates Wortshatz frequencies" to "Integrates Wortschatz frequencies" on Dec 15, 2019

@kylebgorman kylebgorman left a comment

Nit and a question, but this basically looks good to me.

import tarfile
import time

with open("wortschatz_languages.json", "r") as langs:
Collaborator

why is this at global scope?

Collaborator Author

Good question!

@kylebgorman

Basically looks good to me. I don't understand the *.tar.gz in the .gitignore, though. I get why you want to keep the tarballs around, but maybe just add the directory to the .gitignore instead of ignoring the whole file type?

@lfashby

lfashby commented Dec 15, 2019

I experimented with different ways of grabbing the Wortschatz data (including with cURL) to see if I could resolve the 404 response problem, but generally got the same behavior. The solution in grab_wortschatz_data.py is not ideal but it does at least eventually collect all the tarballs we need and only needs to be run once.
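
For readers unfamiliar with the pattern, one common way to cope with intermittent 404 responses is a bounded retry loop. The sketch below is an assumption about the general approach, not the code in grab_wortschatz_data.py: the names `fetch_bytes` and `download_with_retries`, the attempt limit, and the delay are all hypothetical.

```python
import time
import urllib.error
import urllib.request


def fetch_bytes(url):
    """Downloads a URL and returns the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_with_retries(url, fetch=fetch_bytes, max_attempts=5, delay=1.0):
    """Fetches url, retrying on HTTP 404 up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Only a 404 is retried; other errors, and a 404 on the
            # final attempt, propagate to the caller.
            if err.code != 404 or attempt == max_attempts:
                raise
            time.sleep(delay)
```

The `fetch` parameter exists so the retry logic can be exercised without touching the network.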


@kylebgorman kylebgorman left a comment

Approving/LGTM, one more minor nit.

@@ -51,6 +48,9 @@ def unpack():


def main():
    with open("wortschatz_languages.json", "r") as langs:
Collaborator

make this string a global variable at the top?

Collaborator Author

Or perhaps I should add it to the other path constants in src/codes.py?
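
Both suggestions in this thread come down to naming the path once instead of repeating the literal, while keeping the file read inside a function. A minimal sketch, assuming a hypothetical constant `LANGUAGES_PATH` and helper `load_languages` (neither appears in the PR), and assuming nothing about the JSON beyond it being valid:

```python
import json

# Hypothetical constant name; it could equally live in a shared module
# of path constants.
LANGUAGES_PATH = "wortschatz_languages.json"


def load_languages(path=LANGUAGES_PATH):
    """Reads the language table when called, not at import time."""
    with open(path, "r", encoding="utf-8") as source:
        return json.load(source)
```

Defining only the string at module level means importing the script has no side effects; the file is opened only when `load_languages` is called, for example from `main()`.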

@kylebgorman

kylebgorman commented Dec 15, 2019 via email

@lfashby lfashby merged commit 55ddd85 into CUNY-CL:master Dec 15, 2019
@lfashby lfashby deleted the frequencies branch December 25, 2019 16:32