Add Stanford GloVe Embeddings Datasets #26

jonthegeek · 2019-10-15T16:07:57Z

I'd like to add the GloVe pre-trained word vectors, for use in tidymodels/textrecipes#20

The datasets are available here: https://nlp.stanford.edu/projects/glove/

There are 4 downloads, which break down like this:

glove.6B.zip = 4 datasets
glove.42B.300d.zip = 1 dataset
glove.840B.300d.zip = 1 dataset
glove.twitter.27B.zip = 4 datasets

The first one is all I'm directly in need of right now, but it feels worthwhile to work out a standard for all of them while I'm at it.

I don't want to make the functions too complicated to understand, but it feels like there should be one set of textdata functions (download_glove, process_glove, dataset_glove), with arguments for the specifics, something like dataset_glove({normal stuff plus}, token_set, dimensions).
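As a rough illustration of the single-interface idea, here is a minimal sketch of how a (token_set, dimensions) pair could be validated and resolved to one file. It is written in Python purely for illustration (textdata itself is an R package); `GLOVE_DIMENSIONS` and `glove_file_name` are hypothetical names, and the per-archive dimension lists are taken from the GloVe project page rather than from this thread, so they should be double-checked against the actual downloads.

```python
# Hypothetical sketch, not textdata's API. Dimension lists are an assumption
# based on the GloVe project page; verify them against the real archives.
GLOVE_DIMENSIONS = {
    "6B": (50, 100, 200, 300),
    "42B": (300,),
    "840B": (300,),
    "twitter.27B": (25, 50, 100, 200),
}


def glove_file_name(token_set: str, dimensions: int) -> str:
    """Return the text file name for one GloVe sub-dataset.

    Raises ValueError when the token set is unknown or when the requested
    dimensionality is not shipped in that token set's archive.
    """
    if token_set not in GLOVE_DIMENSIONS:
        known = ", ".join(GLOVE_DIMENSIONS)
        raise ValueError(f"unknown token_set {token_set!r}; expected one of: {known}")
    allowed = GLOVE_DIMENSIONS[token_set]
    if dimensions not in allowed:
        raise ValueError(
            f"token_set {token_set!r} has no {dimensions}d vectors; choose from {allowed}"
        )
    return f"glove.{token_set}.{dimensions}d.txt"
```

With a lookup like this, the two extra arguments stay easy to explain: an invalid combination fails early with a message listing the valid choices, instead of failing later during download or parsing.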

Let me know what you think and I can knock this out (I'm doing it anyway for personal/work use, so formalizing it won't be a lot of extra work).

EmilHvitfeldt · 2019-10-15T16:39:01Z

This sounds good.

It looks like each download comes with everything zipped, so I would create 4 user-facing functions. Let's prefix them with embedding_, so we get embedding_glove6b(), embedding_glove42b(), etc.

I did a little writeup of what should be done to make a new step work:
https://emilhvitfeldt.github.io/textdata/articles/How-to-add-a-data-set.html

If you need examples of how this procedure works, look at commit 7ce4e42.

Please feel free to ping me if you have any questions or problems.

jonthegeek · 2019-10-15T17:18:47Z

Ok, that sounds good. The downloads will be separate, but then I'll put a parameter in the dataset_ function to just load the appropriate sub-dataset (for 6b and 27b). I should have a PR for this within the next couple hours, depending on what other distractions come up.
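The plan above (one download per archive, with a parameter that picks the sub-dataset) can be sketched roughly as follows. This is Python for illustration only, not the R code from the eventual PR; `load_glove_member` is a hypothetical name, and the member pattern `glove.6B.<d>d.txt` is an assumption about how the 6B archive is laid out. The parsing relies on the GloVe text format, where each line is a token followed by space-separated numbers.

```python
import io
import zipfile


def load_glove_member(archive, dimensions):
    """Read one sub-dataset from a glove.6B-style zip into {token: vector}.

    `archive` is a path or a binary file-like object. The member name
    pattern below is an assumption made for this sketch.
    """
    member = f"glove.6B.{dimensions}d.txt"
    vectors = {}
    with zipfile.ZipFile(archive) as zf:
        if member not in zf.namelist():
            raise ValueError(f"{member} not found in archive")
        with zf.open(member) as raw:
            # Decode the compressed stream line by line instead of
            # extracting the whole file to disk first.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                if not line.strip():
                    continue  # skip blank lines
                parts = line.rstrip("\n").split(" ")
                token, values = parts[0], parts[1:]
                if len(values) != dimensions:
                    raise ValueError(
                        f"expected {dimensions} values for {token!r}, got {len(values)}"
                    )
                vectors[token] = [float(v) for v in values]
    return vectors
```

Only the requested member is read, so the other sub-datasets in the same download are never parsed.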

Added Stanford GloVe embeddings. Closes #26.

EmilHvitfeldt added the "enhancement" (New feature or request) label Oct 15, 2019

EmilHvitfeldt closed this as completed in 2b523a7 Oct 16, 2019

EmilHvitfeldt added a commit that referenced this issue Oct 16, 2019

Merge pull request #27 from jonthegeek/glove

2b5e9f7

Added Stanford GloVe embeddings. Closes #26.
