@menshikh-iv released this Sep 20, 2018

3.6.0, 2018-09-20

🌟 New features

  • File-based training for *2Vec models (@persiyanov, #2127 & #2078 & #2048)

    Blog post / Jupyter tutorial.

    New training mode for *2Vec models (word2vec, doc2vec, fasttext) that allows model training to scale linearly with the number of cores (full GIL elimination). This is the result of our Google Summer of Code 2018 project by Dmitry Persiyanov.

    Benchmark on the full English Wikipedia, Intel(R) Xeon(R) CPU @ 2.30GHz 32 cores (GCE cloud), MKL BLAS:

    Model     Queue-based [sec]  File-based [sec]  Speed-up  Accuracy (queue-based)  Accuracy (file-based)
    Word2Vec  9230               2437              3.79x     0.754 (± 0.003)         0.750 (± 0.001)
    Doc2Vec   18264              2889              6.32x     0.721 (± 0.002)         0.683 (± 0.003)
    FastText  16361              10625             1.54x     0.642 (± 0.002)         0.660 (± 0.001)
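
The speed-up column appears to be the queue-based time divided by the file-based time. A minimal sketch that recomputes it from the timings in the benchmark above (the `timings` dict and `speedup` helper are illustrative names, not part of gensim):

```python
# Timings in seconds, copied from the benchmark table above:
# (queue-based version, file-based version)
timings = {
    "Word2Vec": (9230, 2437),
    "Doc2Vec": (18264, 2889),
    "FastText": (16361, 10625),
}


def speedup(queue_sec, file_sec):
    """Return how many times faster the file-based run was."""
    return queue_sec / file_sec


# Round to two decimals, the precision used in the table.
speedups = {name: round(speedup(q, f), 2) for name, (q, f) in timings.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

Note that these ratios compare the two training modes on the same 32-core machine; they are not speed-ups relative to a single core.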

    Usage:

    import gensim.downloader as api
    from multiprocessing import cpu_count
    from gensim.utils import save_as_line_sentence
    from gensim.test.utils import get_tmpfile
    from gensim.models import Word2Vec, Doc2Vec, FastText
    
    
    # Convert any corpus to the needed format: 1 document per line, words delimited by " "
    corpus = api.load("text8")
    corpus_fname = get_tmpfile("text8-file-sentence.txt")
    save_as_line_sentence(corpus, corpus_fname)
    
    # Choose num of cores that you want to use (let's use all, models scale linearly now!)
    num_cores = cpu_count()
    
    # Train models using all cores
    w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
    d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
    ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)
    

    Read the notebook tutorial for a full description.
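
The corpus file expected by `corpus_file` is plain text: one document per line, words delimited by a single space. The sketch below illustrates that layout with the standard library only. `write_line_sentence` and `read_line_sentence` are hypothetical helpers written for this example; in real code you would normally use `save_as_line_sentence` as in the usage snippet above.

```python
import os
import tempfile


def write_line_sentence(documents, path):
    """Hypothetical helper: write one document per line, tokens joined by a space."""
    with open(path, "w", encoding="utf-8") as fout:
        for tokens in documents:
            fout.write(" ".join(tokens) + "\n")


def read_line_sentence(path):
    """Hypothetical helper: read the file back into a list of token lists."""
    with open(path, encoding="utf-8") as fin:
        return [line.split() for line in fin]


# A toy corpus: each document is already tokenized into a list of words.
docs = [["human", "interface", "computer"], ["graph", "minors", "survey"]]

path = os.path.join(tempfile.mkdtemp(), "toy-corpus.txt")
write_line_sentence(docs, path)

# Inspect the raw file content and check that it round-trips.
with open(path, encoding="utf-8") as fin:
    raw = fin.read()
roundtrip = read_line_sentence(path)
```

Because tokens are separated by spaces and documents by newlines, the tokens themselves must not contain whitespace.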

👍 Improvements

🔴 Bug fixes

📚 Tutorial and doc improvements

  • Update docstring with new analogy evaluation method (@akutuzov, #2130)
  • Improve prune_at parameter description for gensim.corpora.Dictionary (@yxonic, #2128)
  • Fix default -> auto prior parameter in documentation for lda-related models (@Laubeee, #2156)
  • Use heading instead of bold style in gensim.models.translation_matrix (@nzw0301, #2164)
  • Fix quote of vocabulary from gensim.models.Word2Vec (@nzw0301, #2161)
  • Replace deprecated parameters with new in docstring of gensim.models.Doc2Vec (@xuhdev, #2165)
  • Fix formula in Mallet documentation (@Laubeee, #2186)
  • Fix minor semantic issue in docs for Phrases (@RunHorst, #2148)
  • Fix typo in documentation (@KenjiOhtsuka, #2157)
  • Additional documentation fixes (@piskvorky, #2121)

⚠️ Deprecations (will be removed in the next major release)

  • Remove

    • gensim.models.wrappers.fasttext (obsoleted by the new native gensim.models.fasttext implementation)
    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki (all of these obsoleted by the new native gensim.scripts.segment_wiki implementation)
    • "deprecated" functions and attributes
  • Move

    • gensim.scripts.make_wikicorpus -> gensim.scripts.make_wiki.py
    • gensim.summarization -> gensim.models.summarization
    • gensim.topic_coherence -> gensim.models._coherence
    • gensim.utils -> gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* -> gensim.utils.text_utils
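
To prepare for the removals, it can help to check whether your own code still imports any module from the "Remove" list. Below is a rough sketch using Python's standard `ast` module. The `DEPRECATED_MODULES` tuple is copied from the list above, while `find_deprecated_imports` is a hypothetical helper written for this example and is not shipped with gensim.

```python
import ast

# Module paths taken from the "Remove" list above.
DEPRECATED_MODULES = (
    "gensim.models.wrappers.fasttext",
    "gensim.examples",
    "gensim.nosy",
    "gensim.scripts.word2vec_standalone",
    "gensim.scripts.make_wiki_lemma",
    "gensim.scripts.make_wiki_online",
    "gensim.scripts.make_wiki_online_lemma",
    "gensim.scripts.make_wiki_online_nodebug",
    "gensim.scripts.make_wiki",
)


def _is_deprecated(module):
    """True if `module` is a deprecated module or one of its submodules."""
    return any(module == dep or module.startswith(dep + ".") for dep in DEPRECATED_MODULES)


def find_deprecated_imports(source):
    """Return sorted (line number, module) pairs, at most one per import statement."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Also catch the "from gensim.scripts import make_wiki" style.
            modules = [node.module] + [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        match = next((m for m in modules if _is_deprecated(m)), None)
        if match is not None:
            hits.append((node.lineno, match))
    return sorted(hits)


sample = (
    "from gensim.models.wrappers.fasttext import FastText\n"
    "from gensim.models import Word2Vec\n"
    "from gensim.scripts import make_wiki\n"
    "import gensim.nosy\n"
)
hits = find_deprecated_imports(sample)
for lineno, module in hits:
    print(f"line {lineno}: {module} is scheduled for removal")
```

The scan only parses the source text and never imports gensim, so it works regardless of which gensim version is installed. It does not detect dynamic imports such as `importlib.import_module(...)`.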