Released by @menshikh-iv on Jan 18, 2019

3.7.0, 2019-01-18

🌟 New features

  • Fast Online NMF (@anotherbugmaster, #2007)

    • Benchmark wiki-english-20171001

      Model                      Perplexity  Coherence  L2 norm  Train time (minutes)
      LDA                        4727.07     -2.514     7.372    138
      NMF                        975.74      -2.814     7.265    73
      NMF (with regularization)  985.57      -2.436     7.269    441
    • Simple to use (same interface as LdaModel)

      from gensim.models.nmf import Nmf
      from gensim.corpora import Dictionary
      import gensim.downloader as api
      
      text8 = api.load('text8')
      
      dictionary = Dictionary(text8)
      dictionary.filter_extremes()
      
      corpus = [
          dictionary.doc2bow(doc) for doc in text8
      ]
      
      nmf = Nmf(
          corpus=corpus,
          num_topics=5,
          id2word=dictionary,
          chunksize=2000,
          passes=5,
          random_state=42,
      )
      
      nmf.show_topics()
      """
      [(0, '0.007*"km" + 0.006*"est" + 0.006*"islands" + 0.004*"league" + 0.004*"rate" + 0.004*"female" + 0.004*"economy" + 0.003*"male" + 0.003*"team" + 0.003*"elections"'),
       (1, '0.006*"actor" + 0.006*"player" + 0.004*"bwv" + 0.004*"writer" + 0.004*"actress" + 0.004*"singer" + 0.003*"emperor" + 0.003*"jewish" + 0.003*"italian" + 0.003*"prize"'),
       (2, '0.036*"college" + 0.007*"institute" + 0.004*"jewish" + 0.004*"universidad" + 0.003*"engineering" + 0.003*"colleges" + 0.003*"connecticut" + 0.003*"technical" + 0.003*"jews" + 0.003*"universities"'),
       (3, '0.016*"import" + 0.008*"insubstantial" + 0.007*"y" + 0.006*"soviet" + 0.004*"energy" + 0.004*"info" + 0.003*"duplicate" + 0.003*"function" + 0.003*"z" + 0.003*"jargon"'),
       (4, '0.005*"software" + 0.004*"games" + 0.004*"windows" + 0.003*"microsoft" + 0.003*"films" + 0.003*"apple" + 0.003*"video" + 0.002*"album" + 0.002*"fiction" + 0.002*"characters"')]
      """
    • See also:

  • Major improvements to FastText compatibility (@mpenkov, #2313)

    from gensim.models import FastText
    
    # 'cc.ru.300.bin' - Russian Facebook FT model trained on Common Crawl
    # Can be downloaded from https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.bin.gz
    
    model = FastText.load_fasttext_format("cc.ru.300.bin")
    
    # The fixed hash function produces the same output as FB FastText and works correctly for non-Latin languages (for example, Russian)
    assert "мяу" in model.wv.vocab  # 'мяу' - vocab word
    model.wv.most_similar("мяу")
    """
    [('Мяу', 0.6820122003555298),
     ('МЯУ', 0.6373013257980347),
     ('мяу-мяу', 0.593108594417572),
     ('кис-кис', 0.5899622440338135),
     ('гав', 0.5866007804870605),
     ('Кис-кис', 0.5798211097717285),
     ('Кис-кис-кис', 0.5742273330688477),
     ('Мяу-мяу', 0.5699705481529236),
     ('хрю-хрю', 0.5508339405059814),
     ('ав-ав', 0.5479759573936462)]
    """
    
    assert "котогород" not in model.wv.vocab  # 'котогород' - out-of-vocab word
    model.wv.most_similar("котогород", topn=3)
    """
    [('автогород', 0.5463314652442932),
     ('ТагилНовокузнецкНовомосковскНовороссийскНовосибирскНовотроицкНовочеркасскНовошахтинскНовый',
      0.5423436164855957),
     ('областьНовосибирскБарабинскБердскБолотноеИскитимКарасукКаргатКуйбышевКупиноОбьТатарскТогучинЧерепаново',
      0.5377570390701294)]
    """
    
    # The full model is now loaded, so training can continue
    
    from gensim.test.utils import datapath
    from smart_open import smart_open
    
    with smart_open(datapath("crime-and-punishment.txt"), encoding="utf-8") as infile:  # russian text
        corpus = [line.strip().split() for line in infile]
    
    model.train(corpus, total_examples=len(corpus), epochs=5)
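The hash-function fix mentioned in the comments above can be illustrated with a small sketch. It assumes, as in the reference fastText C++ code to the best of our knowledge, that n-gram bytes are hashed with a 32-bit FNV-1a variant in which each byte is treated as a signed char and sign-extended before the XOR. The names `fnv1a_32` and `bucket_index` and the `num_buckets` default are hypothetical and are not gensim's API. For pure ASCII input the two variants agree, which would explain why a mismatch only shows up for non-Latin text.

```python
FNV_OFFSET = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619     # 32-bit FNV prime
MASK32 = 0xFFFFFFFF


def fnv1a_32(data, sign_extend=True):
    """Hash a bytes object with 32-bit FNV-1a.

    With sign_extend=True, bytes >= 0x80 are sign-extended to 32 bits
    before the XOR, mimicking a C cast of int8_t to uint32_t
    (assumption: this matches the reference fastText behaviour).
    """
    h = FNV_OFFSET
    for byte in data:
        if sign_extend and byte >= 0x80:
            byte |= 0xFFFFFF00  # emulate uint32_t(int8_t(byte))
        h = ((h ^ byte) * FNV_PRIME) & MASK32
    return h


def bucket_index(ngram, num_buckets=2000000):
    """Map a character n-gram to a bucket (hypothetical helper)."""
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets


if __name__ == "__main__":
    ascii_ngram = "cat".encode("utf-8")
    cyrillic_ngram = "мяу".encode("utf-8")
    # ASCII bytes are all < 0x80, so both variants agree.
    print(fnv1a_32(ascii_ngram, True) == fnv1a_32(ascii_ngram, False))
    # UTF-8 Cyrillic bytes are >= 0x80, so sign extension changes the state.
    print(fnv1a_32(cyrillic_ngram, True), fnv1a_32(cyrillic_ngram, False))
    print(bucket_index("мяу"))
```

A loader that skipped the sign extension would look up out-of-vocabulary n-grams in the wrong buckets for any text whose UTF-8 encoding contains bytes at or above 0x80.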
  • Similarity search improvements (@Witiko, #2016)

    • Add similarity search using the Levenshtein distance in gensim.similarities.LevenshteinSimilarityIndex

    • Performance optimizations to gensim.similarities.SoftCosineSimilarity (full benchmark)

      dictionary size  corpus size  speedup
      1000             100          1.0×
      1000             1000         53.4×
      1000             100000       156784.8×
      100000           100          3.8×
      100000           1000         405.8×
      100000           100000       66262.0×
    • See updated soft-cosine tutorial for more information and usage examples
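To make the connection between the two similarity items above concrete, here is a minimal pure-Python sketch of a soft cosine measure whose term-to-term similarities come from the Levenshtein distance. It is only an illustration: the function names are hypothetical, the normalization `1 - distance / max_len` is an assumption rather than the formula gensim uses, and gensim's own classes work on sparse matrices and are far faster.

```python
from math import sqrt


def levenshtein(a, b):
    """Edit distance between two strings (unit cost per edit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def term_similarity(a, b):
    """Similarity in [0, 1]; 1.0 for identical terms (assumed formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def soft_cosine(x, y):
    """Soft cosine between two bags of words given as {term: weight}."""
    def form(u, v):
        # Bilinear form u^T S v with S[t1][t2] = term_similarity(t1, t2).
        return sum(
            wu * wv * term_similarity(tu, tv)
            for tu, wu in u.items()
            for tv, wv in v.items()
        )
    denom = sqrt(form(x, x)) * sqrt(form(y, y))
    return form(x, y) / denom if denom else 0.0


if __name__ == "__main__":
    # Plain cosine similarity would be 0 here: the documents share no term.
    print(soft_cosine({"color": 1.0}, {"colour": 1.0}))
```

The two single-term documents score 5/6 (about 0.83) rather than 0, because "color" and "colour" are one edit apart. That is the effect a Levenshtein-based term similarity index is meant to give a soft cosine query.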

  • Add Python 3.7 support (@menshikh-iv, #2211)

👍 Improvements

Optimizations
  • Reduce Phraser memory usage (drop frequencies) (@jenishah, #2208)
  • Reduce memory consumption of summarizer (@horpto, #2298)
  • Replace the slow inline equivalent of mean_absolute_difference with the fast version (@horpto, #2284)
  • Reuse precalculated updated prior in ldamodel.update_dir_prior (@horpto, #2274)
  • Improve KeyedVector.wmdistance (@horpto, #2326)
  • Optimize remove_unreachable_nodes in gensim.summarization (@horpto, #2263)
  • Optimize mz_entropy from gensim.summarization (@horpto, #2267)
  • Improve filter_extremes methods in Dictionary and HashDictionary (@horpto, #2303)
Additions
Cleanup

🔴 Bug fixes

📚 Tutorial and doc improvements

⚠️ Deprecations (will be removed in the next major release)

  • Remove

    • gensim.models.wrappers.fasttext (obsoleted by the new native gensim.models.fasttext implementation)
    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki (all of these obsoleted by the new native gensim.scripts.segment_wiki implementation)
    • "deprecated" functions and attributes
  • Move

    • gensim.scripts.make_wikicorpus → gensim.scripts.make_wiki.py
    • gensim.summarization → gensim.models.summarization
    • gensim.topic_coherence → gensim.models._coherence
    • gensim.utils → gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* → gensim.utils.text_utils