🌟 New Features
`gensim.models.fasttext.load_facebook_model` function: load the full model (slower, more CPU/memory intensive, supports training continuation)
```python
>>> from gensim.models.fasttext import load_facebook_model
>>> from gensim.test.utils import datapath
>>>
>>> cap_path = datapath("crime-and-punishment.bin")
>>> fb_model = load_facebook_model(cap_path)
>>>
>>> 'landlord' in fb_model.wv.vocab  # Word is out of vocabulary
False
>>> oov_term = fb_model.wv['landlord']
>>>
>>> 'landlady' in fb_model.wv.vocab  # Word is in the vocabulary
True
>>> iv_term = fb_model.wv['landlady']
>>>
>>> new_sent = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'flies']]
>>> fb_model.build_vocab(new_sent, update=True)
>>> fb_model.train(sentences=new_sent, total_examples=len(new_sent), epochs=5)
```
`gensim.models.fasttext.load_facebook_vectors` function: load embeddings only (faster, less CPU/memory usage, does not support training continuation)
```python
>>> from gensim.models.fasttext import load_facebook_vectors
>>>
>>> fbkv = load_facebook_vectors(cap_path)
>>>
>>> 'landlord' in fbkv.vocab  # Word is out of vocabulary
False
>>> oov_vector = fbkv['landlord']
>>>
>>> 'landlady' in fbkv.vocab  # Word is in the vocabulary
True
>>> iv_vector = fbkv['landlady']
```
🔴 Bug fixes
- Fix unicode error when loading FastText vocabulary (@mpenkov, #2390)
- Avoid division by zero in fasttext_inner.pyx (@mpenkov, #2404)
- Avoid incorrect filename inference when loading model (@mpenkov, #2408)
- Handle invalid unicode when loading native FastText models (@mpenkov, #2411)
- Avoid divide by zero when calculating vectors for terms with no ngrams (@mpenkov, #2411)
📚 Tutorial and doc improvements
⚠️ Changes in FastText behavior
Out-of-vocab word handling
To achieve consistency with the reference implementation from Facebook, the FastText model will now always report any word, out-of-vocabulary or not, as being in the model, and will always return some vector for any word.

- `'any_word' in ft_model` will always return True. Previously, it returned True only if the full word was in the vocabulary. (To test whether a full word is in the known vocabulary, consult the vocabulary directly: `'any_word' in ft_model.wv.vocab` will return False if the full word wasn't learned during model training.)
- `ft_model['any_word']` will always return a vector. Previously, it raised a KeyError for OOV words when the model had no vectors for any ngrams of the word.
- If no ngrams from the term are present in the model, or if no ngrams could be extracted from the term, a vector pointing to the origin will be returned. Previously, a vector of NaN (not a number) was returned as a consequence of a divide-by-zero problem.
- Models may use more memory, or take longer for word-vector lookup, especially after training on smaller corpora, where the previous non-compliant behavior discarded some ngrams from consideration.
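The origin-vector change above can be sketched in plain Python. This is an illustrative toy, not gensim's actual implementation; the function names and the 4-dimensional vectors are hypothetical, and only the averaging behavior mirrors the changelog:

```python
# Illustrative sketch (NOT gensim's real code) of the new FastText lookup rule:
# a word's vector is the average of its known ngram vectors, and a word with
# no known ngrams now maps to the origin instead of NaN.

def char_ngrams(word, min_n=3, max_n=6):
    """Extract character ngrams FastText-style, from the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def word_vector(word, ngram_vectors, dim=4):
    """Average the vectors of the word's known ngrams."""
    hits = [ngram_vectors[ng] for ng in char_ngrams(word) if ng in ngram_vectors]
    if not hits:
        # New behavior: no known ngrams -> vector at the origin.
        # The old code divided by len(hits) == 0, producing NaN.
        return [0.0] * dim
    return [sum(component) / len(hits) for component in zip(*hits)]

# A toy model that only knows the ngrams of "landlady".
ngram_vectors = {ng: [1.0, 2.0, 3.0, 4.0] for ng in char_ngrams("landlady")}

print(word_vector("landlady", ngram_vectors))  # known ngrams -> averaged vector
print(word_vector("qqq", ngram_vectors))       # no known ngrams -> origin, not NaN
```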
Loading models in Facebook .bin format
The `gensim.models.FastText.load_fasttext_format` function (deprecated) now loads the entire model contained in the .bin file, including the shallow neural network that enables training continuation.
Loading this NN requires more CPU and RAM than before.
Since this function is deprecated, consider using one of its alternatives (see below).
Furthermore, you must now pass the full path to the file to load, including the file extension.
Previously, if you specified a model path that ended with anything other than .bin, the code automatically appended .bin to the path before loading the model.
This behavior was confusing, so we removed it.
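The removed path handling can be illustrated with a small sketch. This is hypothetical code, not gensim's actual implementation; it only mimics the before/after behavior described above:

```python
import os

def resolve_model_path_old(path):
    # Removed behavior: silently append ".bin" when the extension was missing,
    # making it unclear which file would actually be opened.
    _, ext = os.path.splitext(path)
    return path if ext == ".bin" else path + ".bin"

def resolve_model_path_new(path):
    # Current behavior: the caller passes the full path, extension included,
    # and it is used exactly as given.
    return path

print(resolve_model_path_old("/tmp/crime-and-punishment"))  # gets ".bin" appended
print(resolve_model_path_new("/tmp/crime-and-punishment.bin"))  # used verbatim
```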
⚠️ Deprecations (will be removed in the next major release)
- `gensim.models.FastText.load_fasttext_format`: use `load_facebook_vectors` to load embeddings only (faster, less CPU/memory usage, does not support training continuation) and `load_facebook_model` to load the full model (slower, more CPU/memory intensive, supports training continuation)
- `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
- `gensim.scripts.make_wiki` and related scripts (all obsoleted by the new native `gensim.scripts.segment_wiki`)
- "deprecated" functions and attributes
- `gensim.utils` → `gensim.utils.utils` (old imports will continue to work)