Conversation

@jsl-models
Collaborator

No description provided.

@maziyarpanahi
Contributor

@luca-martial
Is the size of the model correct? It says it's less than 1MB, but the smallest model is 120MB+: https://fauconnier.github.io/#data

@luca-martial
Contributor

@maziyarpanahi Yes, it seems strange, but the outputs looked good on sample text when I tried it. This is what I did to save it; do you see any red flags?

!wget https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin

embeddings = WordEmbeddings() \
    .setStoragePath("frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin", "BINARY") \
    .setDimension(1000) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Saving embeddings
embeddings.write().overwrite().save("word2vec_wiki_1000_fr")

# Zipping to upload to models hub
import shutil
shutil.make_archive("word2vec_wiki_1000_fr", 'zip', "word2vec_wiki_1000_fr")

@maziyarpanahi
Contributor

> @maziyarpanahi yes seems strange, but the outputs looked good on sample text when I tried it. This is what I did to save it, do you see any red flags?
>
> !wget https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin
>
> embeddings = WordEmbeddings() \
>     .setStoragePath("frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin", "BINARY") \
>     .setDimension(1000) \
>     .setInputCols(["document", "token"]) \
>     .setOutputCol("embeddings")
>
> # Saving embeddings
> embeddings.write().overwrite().save("word2vec_wiki_1000_fr")
>
> # Zipping to upload to models hub
> shutil.make_archive("word2vec_wiki_1000_fr", 'zip', "word2vec_wiki_1000_fr")

Thanks @luca-martial

It seems fine. If the file word2vec_wiki_1000_fr.zip is actually larger than 1MB, then it's just a bad size calculation on the Models Hub.

@luca-martial
Contributor

luca-martial commented Jan 26, 2022

@maziyarpanahi No, the zip file on my machine is 4.0K, so I was wondering whether the WordEmbeddings annotator was responsible for compressing the original binary so much. The unzipped version is 24.0K.

@maziyarpanahi
Contributor

> https://s3.us-east-2.amazonaws.com/embeddings.net/embeddings/frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin

OK, so the Python make_archive part might be the problem. If the directory where you saved the model (word2vec_wiki_1000_fr) is only 4KB instead of roughly the size of the .bin file (which is 200MB+), then the save or the archiving failed; it's practically impossible to compress 200MB+ down to 4KB.
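A quick way to catch this kind of failure before uploading is to compare the on-disk size of the saved model directory against the source .bin file. A minimal sketch using only the standard library (the directory and file names here are throwaway placeholders, not the real model paths):

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Demo with a throwaway directory; with the real model you would compare
# dir_size("word2vec_wiki_1000_fr") against os.path.getsize of the .bin file.
os.makedirs("model_dir_demo", exist_ok=True)
with open(os.path.join("model_dir_demo", "part-00000"), "wb") as f:
    f.write(b"\x00" * 1024)

print(dir_size("model_dir_demo"))  # 1024
```

If the saved directory comes out orders of magnitude smaller than the source embeddings file, the save step failed and archiving it is pointless.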

@luca-martial
Contributor

So what I was trying to say is that, before archiving, the result of embeddings.write().overwrite().save("word2vec_wiki_1000_fr") is a folder of 24.0K. That's why I'm wondering if I'm using the WordEmbeddings annotator in the wrong way.

@maziyarpanahi
Contributor

> So what I was trying to say is that before archiving, the results of: embeddings.write().overwrite().save("word2vec_wiki_1000_fr") is a folder of 24.0k, that's why I'm wondering if I'm using the WordEmbeddings annotator in the wrong way

I can't say for sure what went wrong, but the save definitely failed: the saved directory should be almost exactly the same size as the source file. This is how it's done in Scala; you also forgot to set storageRef, though that shouldn't be the problem here:

val embeddings = new WordEmbeddings()
   .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT)
   .setStorageRef("glove_4d")
   .setDimension(4)
   .setInputCols("document", "token")
   .setOutputCol("embeddings")

My guess is that the .bin file might not be compatible.

@josejuanmartinez
Contributor

josejuanmartinez commented Jan 26, 2022

@maziyarpanahi The models were perfectly compatible; what Luca was saying is that they were working in Spark NLP but not being saved properly. The problem here was saving the model before fitting. After creating a pipeline, fitting it, and saving the model with model.stages[-2].write().overwrite().save(), it worked. @luca-martial will reupload again.
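For readers hitting the same issue, the working sequence looks roughly like this. This is a sketch, not a verified script: the DocumentAssembler/Tokenizer stages and the sample DataFrame are illustrative additions, and in this three-stage pipeline the fitted embeddings stage is stages[-1] rather than stages[-2]:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddings

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
embeddings = WordEmbeddings() \
    .setStoragePath("frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin", "BINARY") \
    .setDimension(1000) \
    .setStorageRef("word2vec_wiki_1000_fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document, tokenizer, embeddings])
sample = spark.createDataFrame([["Bonjour le monde"]], ["text"])

# .fit() is what turns the WordEmbeddings Approach into a fitted
# WordEmbeddingsModel that actually carries the loaded vectors.
model = pipeline.fit(sample)

# Save the fitted embeddings stage, not the unfitted Approach.
model.stages[-1].write().overwrite().save("word2vec_wiki_1000_fr")
```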

@maziyarpanahi
Contributor

Oh sorry, I missed the .fit() part! Yes, calling .fit() is absolutely required, even on its own without a pipeline, in order to get the Model rather than saving the Approach.

@luca-martial
Contributor

Thanks guys, I'll close this PR and reupload now.

@maziyarpanahi maziyarpanahi deleted the 2022-01-25-word2vec_wiki_1000_fr_IXU3ddx8yOHfvm599lB4hVi3 branch September 13, 2022 12:46