Add Stanford GloVe Embeddings Datasets #26

Closed
jonthegeek opened this issue Oct 15, 2019 · 2 comments
Labels
enhancement New feature or request

Comments

@jonthegeek
Contributor

I'd like to add the GloVe pre-trained word vectors, for use in tidymodels/textrecipes#20

The datasets are available here: https://nlp.stanford.edu/projects/glove/

There are 4 downloads, which break down like this:

  • glove.6B.zip = 4 datasets
  • glove.42B.300d.zip = 1 dataset
  • glove.840B.300d.zip = 1 dataset
  • glove.twitter.27B.zip = 4 datasets

The first one is all I'm directly in need of right now, but it feels worthwhile to work out a standard for all of them while I'm at it.

I don't want to make the functions too complicated to understand, but it feels like maybe it should be one set of textdata functions (download_glove, process_glove, dataset_glove), with arguments about the specifics (something like dataset_glove({normal stuff plus}, token_set, dimensions)).
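As a rough illustration of what a processing step like process_glove has to handle: the GloVe files are plain text, one token per line followed by its space-separated vector components. The sketch below is in Python rather than R and is not the textdata implementation; the function name `parse_glove_lines` and the sample data are hypothetical. It splits each line from the right, on the assumption that a token may itself contain spaces.

```python
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str], dimensions: int) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines into a token -> vector mapping.

    Hypothetical helper for illustration; not part of textdata.
    """
    embeddings: Dict[str, List[float]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split from the right so a token containing spaces stays intact.
        parts = line.rsplit(" ", dimensions)
        if len(parts) != dimensions + 1:
            raise ValueError(
                f"expected a token plus {dimensions} values, got: {line!r}"
            )
        token, values = parts[0], parts[1:]
        embeddings[token] = [float(v) for v in values]
    return embeddings


# Made-up 3-dimensional sample, not real GloVe values.
sample = ["the 0.1 -0.2 0.3", "cat 0.4 0.5 -0.6"]
vectors = parse_glove_lines(sample, dimensions=3)
print(vectors["cat"])
```

Passing the dimension count explicitly is a design choice here: it lets the parser reject malformed lines instead of silently producing vectors of the wrong length.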

Let me know what you think and I can knock this out (I'm doing it anyway for personal/work use, so formalizing it won't be a lot of extra work).

@EmilHvitfeldt
Owner

This sounds good.

It looks like each download comes with everything zipped, so I would create 4 user-facing functions. Let's prefix them with embedding_, so we get embedding_glove6b(), embedding_glove42b(), etc.

I did a little writeup of what should be done to make a new step work:
https://emilhvitfeldt.github.io/textdata/articles/How-to-add-a-data-set.html

If you need examples of how this procedure works, look at this commit 7ce4e42.

Please feel free to ping me if you have any questions or problems.

EmilHvitfeldt added the enhancement label Oct 15, 2019
@jonthegeek
Contributor Author

Ok, that sounds good. The downloads will be separate, but then I'll put a parameter in the dataset_ function to just load the appropriate sub-dataset (for 6b and 27b). I should have a PR for this within the next couple hours, depending on what other distractions come up.
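A minimal sketch of the "parameter picks the sub-dataset" idea, written in Python for illustration only; it is not the code from the eventual PR. The helper name `glove_file_name` and its arguments are hypothetical. The dimension lists and the `glove.<token set>.<N>d.txt` file naming reflect the files distributed on the Stanford GloVe page.

```python
from typing import Optional

# Vector sizes available in each GloVe download.
GLOVE_DIMENSIONS = {
    "6B": (50, 100, 200, 300),
    "42B": (300,),
    "840B": (300,),
    "twitter.27B": (25, 50, 100, 200),
}


def glove_file_name(token_set: str, dimensions: Optional[int] = None) -> str:
    """Return the text file to load for a token set and vector size.

    Hypothetical helper: validates the requested size against what the
    download actually contains before building the file name.
    """
    if token_set not in GLOVE_DIMENSIONS:
        raise ValueError(f"unknown token set: {token_set!r}")
    available = GLOVE_DIMENSIONS[token_set]
    if dimensions is None:
        # Only unambiguous when the download holds a single dataset.
        if len(available) != 1:
            raise ValueError(f"choose dimensions from {available}")
        dimensions = available[0]
    elif dimensions not in available:
        raise ValueError(
            f"{token_set} has no {dimensions}d vectors; choose from {available}"
        )
    return f"glove.{token_set}.{dimensions}d.txt"


print(glove_file_name("6B", 100))
print(glove_file_name("42B"))
```

With this shape, the single-dataset downloads (42B, 840B) need no extra argument, while 6B and twitter.27B require the caller to say which size they want.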

EmilHvitfeldt added a commit that referenced this issue Oct 16, 2019
Added Stanford GloVe embeddings. Closes #26.
2 participants