Conversation

@jsl-models
Collaborator

No description provided.

@maziyarpanahi
Contributor

@luca-martial
Is the size of the model correct? It says it's less than 1MB, but the smallest model is 120MB+: https://fauconnier.github.io/#data

@luca-martial
Contributor

@maziyarpanahi Yes, it seems strange, but the outputs looked good on sample text when I tried it. This is what I did to save it; do you see any red flags?

!wget https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin

embeddings = WordEmbeddings() \
    .setStoragePath("frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin", "BINARY") \
    .setDimension(1000) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Saving embeddings
embeddings.write().overwrite().save("word2vec_wiki_1000_fr")

# Zipping to upload to models hub
import shutil
shutil.make_archive("word2vec_wiki_1000_fr", 'zip', "word2vec_wiki_1000_fr")

@maziyarpanahi
Contributor

> @maziyarpanahi yes seems strange, but the outputs looked good on sample text when I tried it. This is what I did to save it, do you see any red flags?
>
> !wget https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin
>
> embeddings = WordEmbeddings() \
>     .setStoragePath("frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin", "BINARY") \
>     .setDimension(1000) \
>     .setInputCols(["document", "token"]) \
>     .setOutputCol("embeddings")
>
> # Saving embeddings
> embeddings.write().overwrite().save("word2vec_wiki_1000_fr")
>
> # Zipping to upload to models hub
> shutil.make_archive("word2vec_wiki_1000_fr", 'zip', "word2vec_wiki_1000_fr")

Thanks @luca-martial

It seems fine. If the file word2vec_wiki_1000_fr.zip is actually larger than 1MB, then it's just a bad size calculation on the Models Hub.

@luca-martial
Contributor

luca-martial commented Jan 26, 2022

@maziyarpanahi No, the zip file on my machine is 4.0K, so I was wondering whether the WordEmbeddings annotator was responsible for compressing the original binary so much. The unzipped version is 24.0K.

@maziyarpanahi
Contributor

> https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin

OK, so the Python make_archive part might be the problem. If the directory where you saved the model (word2vec_wiki_1000_fr) is only 4KB instead of roughly the size of the .bin file (which is 200MB+), then the save or the archiving failed; it's practically impossible to compress 200MB+ down to 4KB.
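A quick way to catch this kind of failure before uploading is to compare the on-disk size of the saved model directory against the source .bin file. A minimal sketch using only the standard library (the directory and file names here are throwaway placeholders, not the real model paths):

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Demo with a throwaway directory; with the real model you would compare
# dir_size("word2vec_wiki_1000_fr") against os.path.getsize of the .bin file.
os.makedirs("model_dir_demo", exist_ok=True)
with open(os.path.join("model_dir_demo", "part-00000"), "wb") as f:
    f.write(b"\x00" * 1024)

print(dir_size("model_dir_demo"))  # 1024
```

If the saved directory comes out orders of magnitude smaller than the source embeddings file, the save step failed and archiving it is pointless.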

@luca-martial
Contributor

So what I was trying to say is that, before archiving, the result of embeddings.write().overwrite().save("word2vec_wiki_1000_fr") is a folder of 24.0K. That's why I'm wondering if I'm using the WordEmbeddings annotator in the wrong way.

@maziyarpanahi
Contributor

> So what I was trying to say is that before archiving, the results of: embeddings.write().overwrite().save("word2vec_wiki_1000_fr") is a folder of 24.0k, that's why I'm wondering if I'm using the WordEmbeddings annotator in the wrong way

I can't say for sure what went wrong, but the save definitely failed: the saved directory should be almost exactly the same size as the source file. This is how it's done in Scala; you also forgot to set storageRef, though that shouldn't be the problem here:

val embeddings = new WordEmbeddings()
   .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT)
   .setStorageRef("glove_4d")
   .setDimension(4)
   .setInputCols("document", "token")
   .setOutputCol("embeddings")

My guess is that the .bin file might not be compatible.

@josejuanmartinez
Contributor

josejuanmartinez commented Jan 26, 2022

@maziyarpanahi The models were perfectly compatible; what Luca was saying is that they were working in Spark NLP but not being saved properly. The problem here was saving the model before fitting. After creating a pipeline, fitting it, and saving the model with model.stages[-2].write().overwrite().save(), it worked. @luca-martial will reupload again.
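For readers hitting the same issue, the working sequence looks roughly like this. This is a sketch, not a verified script: the DocumentAssembler/Tokenizer stages and the sample DataFrame are illustrative additions, and in this three-stage pipeline the fitted embeddings stage is stages[-1] rather than stages[-2]:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddings

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
embeddings = WordEmbeddings() \
    .setStoragePath("frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin", "BINARY") \
    .setDimension(1000) \
    .setStorageRef("word2vec_wiki_1000_fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document, tokenizer, embeddings])
sample = spark.createDataFrame([["Bonjour le monde"]], ["text"])

# .fit() is what turns the WordEmbeddings Approach into a fitted
# WordEmbeddingsModel that actually carries the loaded vectors.
model = pipeline.fit(sample)

# Save the fitted embeddings stage, not the unfitted Approach.
model.stages[-1].write().overwrite().save("word2vec_wiki_1000_fr")
```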

@maziyarpanahi
Contributor

Oh sorry, I missed the .fit() part! Yes, calling .fit() is absolutely required, even on its own without a pipeline, in order to get the Model rather than saving the Approach.

@luca-martial
Contributor

Thanks guys, I'll close this PR and reupload now.

@maziyarpanahi maziyarpanahi deleted the 2022-01-25-word2vec_wiki_1000_fr_IXU3ddx8yOHfvm599lB4hVi3 branch September 13, 2022 12:46