
Merge branch 'release-3.7.0'

menshikh-iv committed Jan 18, 2019
2 parents 355ecc6 + 7d84b7e commit 42e47a3a476245b6d78ff3167bd9d5937d9d91e8
Showing with 57,324 additions and 16,406 deletions.
  1. +1 −1 .circleci/config.yml
  2. +2 −0 .gitignore
  3. +12 −1 .travis.yml
  4. +217 −5 CHANGELOG.md
  5. +19 −0 MANIFEST.in
  6. +17 −23 README.md
  7. +5 −0 appveyor.yml
  8. +152 −0 docs/fasttext-notes.md
  9. +1 −1 docs/notebooks/FastText_Tutorial.ipynb
  10. +1 −1 docs/notebooks/Poincare Evaluation.ipynb
  11. +4 −3 docs/notebooks/Tensorboard_visualizations.ipynb
  12. +2 −2 docs/notebooks/Topics_and_Transformations.ipynb
  13. +2 −2 docs/notebooks/WMD_tutorial.ipynb
  14. +2 −2 docs/notebooks/Wordrank_comparisons.ipynb
  15. +1,749 −0 docs/notebooks/nmf_tutorial.ipynb
  16. +1,312 −0 docs/notebooks/nmf_wikipedia.ipynb
  17. +4,605 −0 docs/notebooks/soft_cosine_benchmark.ipynb
  18. +72 −53 docs/notebooks/soft_cosine_tutorial.ipynb
  19. +3 −3 docs/notebooks/translation_matrix.ipynb
  20. +1 −0 docs/src/apiref.rst
  21. +11 −7 docs/src/changes_080.rst
  22. +3 −3 docs/src/conf.py
  23. +7 −3 docs/src/dist_lda.rst
  24. +29 −15 docs/src/dist_lsi.rst
  25. +9 −0 docs/src/models/nmf.rst
  26. +106 −76 docs/src/simserver.rst
  27. +120 −85 docs/src/tut1.rst
  28. +82 −52 docs/src/tut2.rst
  29. +48 −31 docs/src/tut3.rst
  30. +31 −19 docs/src/tutorial.rst
  31. +46 −14 docs/src/wiki.rst
  32. +0 −405 ez_setup.py
  33. +2 −9 gensim/__init__.py
  34. +2,811 −3,053 gensim/_matutils.c
  35. +1,363 −1,467 gensim/corpora/_mmreader.c
  36. +3 −3 gensim/corpora/_mmreader.pyx
  37. +5 −3 gensim/corpora/bleicorpus.py
  38. +175 −107 gensim/corpora/dictionary.py
  39. +37 −30 gensim/corpora/hashdictionary.py
  40. +24 −18 gensim/corpora/indexedcorpus.py
  41. +38 −34 gensim/corpora/lowcorpus.py
  42. +38 −31 gensim/corpora/malletcorpus.py
  43. +17 −15 gensim/corpora/mmcorpus.py
  44. +39 −30 gensim/corpora/sharded_corpus.py
  45. +3 −0 gensim/corpora/svmlightcorpus.py
  46. +23 −20 gensim/corpora/textcorpus.py
  47. +20 −13 gensim/corpora/ucicorpus.py
  48. +37 −28 gensim/corpora/wikicorpus.py
  49. +55 −37 gensim/downloader.py
  50. +47 −37 gensim/interfaces.py
  51. +18 −12 gensim/matutils.py
  52. +8 −6 gensim/models/__init__.py
  53. +218 −0 gensim/models/_fasttext_bin.py
  54. +7,075 −1,168 gensim/models/_utils_any2vec.c
  55. +38 −0 gensim/models/_utils_any2vec.pyx
  56. +64 −60 gensim/models/atmodel.py
  57. +5 −5 gensim/models/base_any2vec.py
  58. +1 −1 gensim/models/basemodel.py
  59. +68 −59 gensim/models/callbacks.py
  60. +38 −32 gensim/models/coherencemodel.py
  61. +26 −18 gensim/models/deprecated/doc2vec.py
  62. +45 −35 gensim/models/deprecated/fasttext.py
  63. +9 −6 gensim/models/deprecated/fasttext_wrapper.py
  64. +116 −77 gensim/models/deprecated/keyedvectors.py
  65. +2 −2 gensim/models/deprecated/old_saveload.py
  66. +66 −43 gensim/models/deprecated/word2vec.py
  67. +42 −29 gensim/models/doc2vec.py
  68. +1,661 −1,340 gensim/models/doc2vec_corpusfile.cpp
  69. +10 −7 gensim/models/doc2vec_corpusfile.pyx
  70. +933 −780 gensim/models/doc2vec_inner.c
  71. +285 −365 gensim/models/fasttext.py
  72. +1,313 −1,002 gensim/models/fasttext_corpusfile.cpp
  73. +7 −5 gensim/models/fasttext_corpusfile.pyx
  74. +920 −782 gensim/models/fasttext_inner.c
  75. +52 −46 gensim/models/hdpmodel.py
  76. +506 −207 gensim/models/keyedvectors.py
  77. +4 −2 gensim/models/lda_dispatcher.py
  78. +4 −2 gensim/models/lda_worker.py
  79. +101 −89 gensim/models/ldamodel.py
  80. +34 −29 gensim/models/ldamulticore.py
  81. +21 −15 gensim/models/ldaseqmodel.py
  82. +13 −11 gensim/models/logentropy_model.py
  83. +7 −5 gensim/models/lsi_dispatcher.py
  84. +7 −5 gensim/models/lsi_worker.py
  85. +23 −19 gensim/models/lsimodel.py
  86. +656 −0 gensim/models/nmf.py
  87. +21,794 −0 gensim/models/nmf_pgd.c
  88. +166 −0 gensim/models/nmf_pgd.pyx
  89. +3 −3 gensim/models/normmodel.py
  90. +191 −160 gensim/models/phrases.py
  91. +139 −114 gensim/models/poincare.py
  92. +32 −24 gensim/models/rpmodel.py
  93. +16 −14 gensim/models/tfidfmodel.py
  94. +80 −60 gensim/models/translation_matrix.py
  95. +86 −31 gensim/models/utils_any2vec.py
  96. +80 −59 gensim/models/word2vec.py
  97. +1,551 −1,326 gensim/models/word2vec_corpusfile.cpp
  98. +2 −2 gensim/models/word2vec_corpusfile.pxd
  99. +9 −7 gensim/models/word2vec_corpusfile.pyx
  100. +935 −781 gensim/models/word2vec_inner.c
  101. +13 −11 gensim/models/wrappers/dtmmodel.py
  102. +5 −3 gensim/models/wrappers/fasttext.py
  103. +39 −19 gensim/models/wrappers/ldamallet.py
  104. +32 −38 gensim/models/wrappers/ldavowpalwabbit.py
  105. +4 −4 gensim/models/wrappers/varembed.py
  106. +8 −7 gensim/models/wrappers/wordrank.py
  107. +101 −92 gensim/parsing/porter.py
  108. +73 −49 gensim/parsing/preprocessing.py
  109. +12 −12 gensim/scripts/glove2word2vec.py
  110. +10 −8 gensim/scripts/make_wikicorpus.py
  111. +5 −3 gensim/scripts/package_info.py
  112. +16 −12 gensim/scripts/segment_wiki.py
  113. +2 −0 gensim/similarities/__init__.py
  114. +181 −169 gensim/similarities/docsim.py
  115. +28 −24 gensim/similarities/index.py
  116. +153 −0 gensim/similarities/levenshtein.py
  117. +394 −0 gensim/similarities/termsim.py
  118. +17 −15 gensim/sklearn_api/atmodel.py
  119. +7 −5 gensim/sklearn_api/d2vmodel.py
  120. +25 −21 gensim/sklearn_api/ftmodel.py
  121. +8 −6 gensim/sklearn_api/hdp.py
  122. +8 −6 gensim/sklearn_api/ldamodel.py
  123. +11 −9 gensim/sklearn_api/ldaseqmodel.py
  124. +17 −15 gensim/sklearn_api/lsimodel.py
  125. +36 −17 gensim/sklearn_api/phrases.py
  126. +10 −8 gensim/sklearn_api/rpmodel.py
  127. +13 −11 gensim/sklearn_api/text2bow.py
  128. +8 −6 gensim/sklearn_api/tfidf.py
  129. +12 −10 gensim/sklearn_api/w2vmodel.py
  130. +153 −62 gensim/summarization/bm25.py
  131. +13 −9 gensim/summarization/commons.py
  132. +70 −179 gensim/summarization/graph.py
  133. +19 −20 gensim/summarization/keywords.py
  134. +53 −21 gensim/summarization/mz_entropy.py
  135. +42 −27 gensim/summarization/pagerank_weighted.py
  136. +52 −46 gensim/summarization/summarizer.py
  137. +2 −2 gensim/summarization/syntactic_unit.py
  138. +41 −32 gensim/summarization/textcleaner.py
  139. +2 −2 gensim/test/simspeed.py
  140. +1 −1 gensim/test/test_api.py
  141. +65 −5 gensim/test/test_corpora.py
  142. +40 −1 gensim/test/test_corpora_dictionary.py
  143. BIN gensim/test/test_data/compatible-hash-false.model
  144. BIN gensim/test/test_data/compatible-hash-true.model
  145. BIN gensim/test/test_data/crime-and-punishment.bin
  146. +5 −0 gensim/test/test_data/crime-and-punishment.txt
  147. +292 −0 gensim/test/test_data/crime-and-punishment.vec
  148. BIN gensim/test/test_data/phraser-3.6.0.model
  149. BIN gensim/test/test_data/phrases-3.6.0.model
  150. BIN gensim/test/test_data/phrases-transformer-new-v3-5-0.pkl
  151. BIN gensim/test/test_data/phrases-transformer-v3-5-0.pkl
  152. +1 −0 gensim/test/test_data/toy-data.txt
  153. BIN gensim/test/test_data/toy-model.bin
  154. +23 −0 gensim/test/test_data/toy-model.vec
  155. +74 −74 gensim/test/test_doc2vec.py
  156. +2 −2 gensim/test/test_dtm.py
  157. +442 −109 gensim/test/test_fasttext.py
  158. +9 −0 gensim/test/test_fasttext_wrapper.py
  159. +6 −6 gensim/test/test_keras_integration.py
  160. +88 −53 gensim/test/test_keyedvectors.py
  161. +60 −0 gensim/test/test_lda_callback.py
  162. +54 −1 gensim/test/test_ldamallet_wrapper.py
  163. +9 −9 gensim/test/test_ldamodel.py
  164. +8 −8 gensim/test/test_matutils.py
  165. +1 −1 gensim/test/test_miislita.py
  166. +159 −0 gensim/test/test_nmf.py
  167. +16 −1 gensim/test/test_phrases.py
  168. +8 −8 gensim/test/test_sharded_corpus.py
  169. +410 −33 gensim/test/test_similarities.py
  170. +86 −3 gensim/test/test_sklearn_api.py
  171. +117 −0 gensim/test/test_summarization.py
  172. +89 −0 gensim/test/test_utils.py
  173. +3 −3 gensim/test/test_varembed_wrapper.py
  174. +125 −123 gensim/test/test_word2vec.py
  175. +1 −1 gensim/test/test_wordrank_wrapper.py
  176. +67 −53 gensim/test/utils.py
  177. +5 −3 gensim/topic_coherence/aggregation.py
  178. +55 −46 gensim/topic_coherence/direct_confirmation_measure.py
  179. +59 −53 gensim/topic_coherence/indirect_confirmation_measure.py
  180. +116 −87 gensim/topic_coherence/probability_estimation.py
  181. +28 −22 gensim/topic_coherence/segmentation.py
  182. +42 −34 gensim/topic_coherence/text_analysis.py
  183. +121 −80 gensim/utils.py
  184. +3 −3 gensim/viz/poincare.py
  185. +16 −11 setup.py
  186. +34 −4 tox.ini
  187. +1 −1 tutorials.md
@@ -30,7 +30,7 @@ jobs:
name: Build documentation
command: |
source venv/bin/activate
tox -e docs -vv
tox -e compile,docs -vv
- store_artifacts:
path: docs/src/_build
@@ -7,6 +7,8 @@
*.o
*.so
*.pyc
*.pyo
*.pyd

# Packages #
############
@@ -13,7 +13,10 @@ language: python
matrix:
include:
- python: '2.7'
env: TOXENV="flake8"
env: TOXENV="flake8,flake8-docs"

- python: '3.6'
env: TOXENV="flake8,flake8-docs"

- python: '2.7'
env: TOXENV="py27-linux"
@@ -24,5 +27,13 @@ matrix:
- python: '3.6'
env: TOXENV="py36-linux"

- python: '3.7'
env:
- TOXENV="py37-linux"
- BOTO_CONFIG="/dev/null"
dist: xenial
sudo: true


install: pip install tox
script: tox -vv


@@ -4,17 +4,36 @@ include CHANGELOG.md
include COPYING
include COPYING.LESSER
include ez_setup.py

include gensim/models/voidptr.h
include gensim/models/fast_line_sentence.h

include gensim/models/word2vec_inner.c
include gensim/models/word2vec_inner.pyx
include gensim/models/word2vec_inner.pxd
include gensim/models/word2vec_corpusfile.cpp
include gensim/models/word2vec_corpusfile.pyx
include gensim/models/word2vec_corpusfile.pxd

include gensim/models/doc2vec_inner.c
include gensim/models/doc2vec_inner.pyx
include gensim/models/doc2vec_inner.pxd
include gensim/models/doc2vec_corpusfile.cpp
include gensim/models/doc2vec_corpusfile.pyx

include gensim/models/fasttext_inner.c
include gensim/models/fasttext_inner.pyx
include gensim/models/fasttext_inner.pxd
include gensim/models/fasttext_corpusfile.cpp
include gensim/models/fasttext_corpusfile.pyx

include gensim/models/_utils_any2vec.c
include gensim/models/_utils_any2vec.pyx
include gensim/corpora/_mmreader.c
include gensim/corpora/_mmreader.pyx
include gensim/_matutils.c
include gensim/_matutils.pyx

include gensim/models/nmf_pgd.c
include gensim/models/nmf_pgd.pyx

@@ -119,29 +119,23 @@ Documentation
Adopters
--------



| Name | Logo | URL | Description |
|----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| RaRe Technologies | ![rare](docs/src/readme_images/rare.png) | [rare-technologies.com](http://rare-technologies.com) | Machine learning & NLP consulting and training. Creators and maintainers of Gensim. |
| Mindseye | ![mindseye](docs/src/readme_images/mindseye.png) | [mindseye.com](http://www.mindseyesolutions.com/) | Similarities in legal documents |
| Talentpair | ![talent-pair](docs/src/readme_images/talent-pair.png) | [talentpair.com](http://talentpair.com) | Data science driving high-touch recruiting |
| Tailwind | ![tailwind](docs/src/readme_images/tailwind.png)| [Tailwindapp.com](https://www.tailwindapp.com/)| Post interesting and relevant content to Pinterest |
| Issuu | ![issuu](docs/src/readme_images/issuu.png) | [Issuu.com](https://issuu.com/)| Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about.
| Sports Authority | ![sports-authority](docs/src/readme_images/sports-authority.png) | [sportsauthority.com](https://en.wikipedia.org/wiki/Sports_Authority)| Text mining of customer surveys and social media sources |
| Search Metrics | ![search-metrics](docs/src/readme_images/search-metrics.png) | [searchmetrics.com](http://www.searchmetrics.com/)| Gensim word2vec used for entity disambiguation in Search Engine Optimisation
| Cisco Security | ![cisco](docs/src/readme_images/cisco.png) | [cisco.com](http://www.cisco.com/c/en/us/products/security/index.html)| Large-scale fraud detection
| 12K Research | ![12k](docs/src/readme_images/12k.png)| [12k.co](https://12k.co/)| Document similarity analysis on media articles
| National Institutes of Health | ![nih](docs/src/readme_images/nih.png) | [github/NIHOPA](https://github.com/NIHOPA/pipeline_word2vec)| Processing grants and publications with word2vec
| Codeq LLC | ![codeq](docs/src/readme_images/codeq.png) | [codeq.com](https://codeq.com)| Document classification with word2vec
| Mass Cognition | ![mass-cognition](docs/src/readme_images/mass-cognition.png) | [masscognition.com](http://www.masscognition.com/) | Topic analysis service for consumer text data and general text data |
| Stillwater Supercomputing | ![stillwater](docs/src/readme_images/stillwater.png) | [stillwater-sc.com](http://www.stillwater-sc.com/) | Document comprehension and association with word2vec |
| Channel 4 | ![channel4](docs/src/readme_images/channel4.png) | [channel4.com](http://www.channel4.com/) | Recommendation engine |
| Amazon | ![amazon](docs/src/readme_images/amazon.png) | [amazon.com](http://www.amazon.com/) | Document similarity|
| SiteGround Hosting | ![siteground](docs/src/readme_images/siteground.png) | [siteground.com](https://www.siteground.com/) | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| Juju | ![juju](docs/src/readme_images/juju.png) | [www.juju.com](http://www.juju.com/) | Provide non-obvious related job suggestions. |
| NLPub | ![nlpub](docs/src/readme_images/nlpub.png) | [nlpub.org](https://nlpub.org/) | Distributional semantic models including word2vec. |
|Capital One | ![capitalone](docs/src/readme_images/capitalone.png) | [www.capitalone.com](https://www.capitalone.com/) | Topic modeling for customer complaints exploration. |
| Company | Logo | Industry | Use of Gensim |
|---------|------|----------|---------------|
| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! |
| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. |
| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. |
| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. |
| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. |
| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. |
| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. |
| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. |
| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. |
| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png)| Media | Document similarity analysis on media articles. |
| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. |
| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. |

-------

@@ -28,6 +28,11 @@ environment:
PYTHON_ARCH: "64"
TOXENV: "py36-win"

- PYTHON: "C:\\Python37-x64"
PYTHON_VERSION: "3.7.0"
PYTHON_ARCH: "64"
TOXENV: "py37-win"

init:
- "ECHO %PYTHON% %PYTHON_VERSION% %PYTHON_ARCH%"
- "ECHO \"%APPVEYOR_SCHEDULED_BUILD%\""
@@ -0,0 +1,152 @@
FastText Notes
==============

The implementation is split across several submodules:

- models.fasttext
- models.keyedvectors (includes FastText-specific code, which arguably belongs elsewhere)
- models.word2vec (superclasses)
- models.base_any2vec (superclasses)

The implementation consists of several key classes:

1. models.fasttext.FastTextVocab: the vocabulary
2. models.keyedvectors.FastTextKeyedVectors: the vectors
3. models.fasttext.FastTextTrainables: the underlying neural network
4. models.fasttext.FastText: ties everything together

FastTextVocab
-------------

Seems to be an entirely redundant class.
Inherits from models.word2vec.Word2VecVocab, adding no new functionality.

FastTextKeyedVectors
--------------------

Inheritance hierarchy:

1. FastTextKeyedVectors
2. WordEmbeddingsKeyedVectors. Implements word similarity, e.g. cosine similarity, WMD, etc.
3. BaseKeyedVectors (abstract base class)
4. utils.SaveLoad

There are many attributes.

Inherited from BaseKeyedVectors:

- vectors: a 2D numpy array. Flexible number of rows (0 by default). Number of columns equals vector dimensionality.
- vocab: a dictionary. Keys are words. Items are Vocab instances: these are essentially namedtuples that contain an index and a count. The former is the index of a term in the entire vocab. The latter is the number of times the term occurs.
- vector_size (dimensionality)
- index2entity
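
The vocab mapping described above can be sketched in a few lines. This is illustrative only: gensim's real Vocab entries carry more fields than shown here, and the helper name `build_vocab` is made up for this sketch.

```python
from collections import namedtuple

# Illustrative stand-in for gensim's Vocab entries: `index` is the term's
# position in the vocabulary, `count` is how often the term occurs.
Vocab = namedtuple("Vocab", "index count")

def build_vocab(corpus):
    """Build a word -> Vocab mapping from a tokenized corpus."""
    counts = {}
    for sentence in corpus:
        for word in sentence:
            counts[word] = counts.get(word, 0) + 1
    # Sort by descending frequency, so frequent words get low indices.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {w: Vocab(index=i, count=counts[w]) for i, w in enumerate(ordered)}
```

For example, `build_vocab([["a", "b", "a"], ["a", "c"]])["a"]` gives `Vocab(index=0, count=3)`.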

Inherited from WordEmbeddingsKeyedVectors:

- vectors_norm
- index2word

Added by FastTextKeyedVectors:

- vectors_vocab: 2D array. Rows are vectors. Columns correspond to vector dimensions. Initialized in FastTextTrainables.init_ngrams_weights. Reset in reset_ngrams_weights. Referred to as syn0_vocab in fasttext_inner.pyx. These are vectors for every word in the vocabulary.
- vectors_vocab_norm: looks unused, see _clear_post_train method.
- vectors_ngrams: 2D array. Each row is a bucket. Columns correspond to vector dimensions. Initialized in init_ngrams_weights function. Initialized in _load_vectors method when reading from native FB binary. Modified in reset_ngrams_weights method. This is the first matrix loaded from the native binary files.
- vectors_ngrams_norm: looks unused, see _clear_post_train method.
- buckets_word: A hashmap. Keyed by the index of a term in the vocab. Each value is an array, where each element is an integer that corresponds to a bucket. Initialized in init_ngrams_weights function
- hash2index: A hashmap. Keys are hashes of ngrams. Values are the number of ngrams (?). Initialized in init_ngrams_weights function.
- min_n: minimum ngram length
- max_n: maximum ngram length
- num_ngram_vectors: initialized in the init_ngrams_weights function

The init_ngrams_weights method looks like an internal method of FastTextTrainables.
It gets called as part of the prepare_weights method, which is effectively part of the FastText constructor.
The above attributes are initialized to None in the FastTextKeyedVectors class constructor.
Unfortunately, their real initialization happens in an entirely different module, models.fasttext - another indication of poor separation of concerns.

Some questions:

- What is the x_lockf stuff? Why is it used only by the fast C implementation?
- How are vectors_vocab and vectors_ngrams different? vectors_vocab contains vectors for the entire vocabulary, while vectors_ngrams contains vectors for each _bucket_.
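
To make the bucket idea concrete, here is a minimal sketch of fastText-style character ngram extraction and hashing. Python's built-in hash() stands in for the FNV-style hash gensim actually uses, and the bucket count is illustrative, so the bucket ids here will not match gensim's.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character ngrams for `word`, with angle brackets
    marking word boundaries (as in fastText)."""
    extended = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

def bucket_of(ngram, num_buckets=2_000_000):
    # gensim/fastText use an FNV-1a-style hash here; Python's built-in
    # hash() is only a stand-in for this sketch.
    return hash(ngram) % num_buckets
```

For example, `char_ngrams("where", 3, 3)` gives `['<wh', 'whe', 'her', 'ere', 're>']`; each ngram is then mapped to a row of vectors_ngrams via `bucket_of`, which is why that matrix has one row per bucket rather than per ngram.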


FastTextTrainables
------------------

[Link](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastTextTrainables)

This is a neural network that learns the vectors for the FastText embedding.
Mostly inherits from its [Word2Vec parent](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2VecTrainables).
Adds logic for calculating and maintaining ngram weights.

Key attributes:

- hashfxn: function for randomly initializing weights. Defaults to the built-in hash()
- layer1_size: The size of the inner layer of the NN. Equal to the vector dimensionality. Set in the Word2VecTrainables constructor.
- seed: The random generator seed used in reset_weights and update_weights
- syn1: The inner layer of the NN. Each row corresponds to a term in the vocabulary. Columns correspond to weights of the inner layer. There are layer1_size such weights. Set in the reset_weights and update_weights methods, only if hierarchical sampling is used.
- syn1neg: Similar to syn1, but only set if negative sampling is used.
- vectors_lockf: A one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones.
- vectors_vocab_lockf: Similar to vectors_lockf; initialized as ones(len(model.trainables.vectors), dtype=REAL)
- vectors_ngrams_lockf: initialized as ones((self.bucket, wv.vector_size), dtype=REAL)

The lockf stuff looks like it gets used by the fast C implementation.
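
Judging by the names, a lock factor acts as a per-vector gradient gate. A sketch under that assumption (variable and function names here are illustrative, not gensim's internals):

```python
import numpy as np

def apply_update(vectors, lockf, row, grad):
    """Scale the gradient for `row` by its lock factor before applying it.

    A lock factor of 0.0 freezes that vector; 1.0 (the default) trains it.
    """
    vectors[row] += lockf[row] * grad

vectors = np.zeros((2, 4), dtype=np.float32)
lockf = np.ones(2, dtype=np.float32)
lockf[1] = 0.0  # freeze the second vector

apply_update(vectors, lockf, 0, np.full(4, 0.5, dtype=np.float32))
apply_update(vectors, lockf, 1, np.full(4, 0.5, dtype=np.float32))
# vectors[0] has moved; vectors[1] is still all zeros
```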

The inheritance hierarchy here is:

1. FastTextTrainables
2. Word2VecTrainables
3. utils.SaveLoad

FastText
--------

Inheritance hierarchy:

1. FastText
2. BaseWordEmbeddingsModel: vocabulary management plus a ton of deprecated attrs
3. BaseAny2VecModel: logging and training functionality
4. utils.SaveLoad: for loading and saving

Lots of attributes (many inherited from superclasses).

From BaseAny2VecModel:

- workers
- vector_size
- epochs
- callbacks
- batch_words
- kv
- vocabulary
- trainables

From BaseWordEmbeddingsModel:

- alpha
- min_alpha
- min_alpha_yet_reached
- window
- random
- hs
- negative
- ns_exponent
- cbow_mean
- compute_loss
- running_training_loss
- corpus_count
- corpus_total_words
- neg_labels

FastText attributes:

- wv: FastTextKeyedVectors. Used instead of .kv
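
Given the attributes above, composing a vector for an out-of-vocabulary word presumably reduces to averaging the bucket vectors of its character ngrams. A sketch under that assumption (the hash, bucket count, and function name are illustrative stand-ins for gensim's internals):

```python
import numpy as np

def oov_vector(word, vectors_ngrams, min_n=3, max_n=6, num_buckets=2000):
    """Average the ngram bucket vectors of `word` to get its embedding.

    Python's hash() stands in for the FNV-style hash fastText uses.
    """
    extended = "<" + word + ">"
    rows = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            rows.append(hash(extended[i:i + n]) % num_buckets)
    if not rows:
        raise KeyError("word too short to produce any ngrams")
    return vectors_ngrams[rows].mean(axis=0)
```

This is why any string, even one never seen in training, still gets a vector: its ngrams always hash into existing buckets of vectors_ngrams.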

Logging
-------

The logging seems to be inheritance-based.
It may be better to refactor this using aggregation instead of inheritance in the future.
The benefits would be leaner classes with fewer responsibilities and better separation of concerns.
@@ -134,7 +134,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the folllowing parameters from the original word2vec - \n",
"Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec - \n",
" - model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)\n",
" - size: Size of embeddings to be learnt (Default 100)\n",
" - alpha: Initial learning rate (Default 0.025)\n",
@@ -1706,7 +1706,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"1. The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguitities in communication with the authors.\n",
"1. The model can be investigated further to understand why it doesn't produce results as good as the paper. It is possible that this might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguities in communication with the authors.\n",
"2. Optimizing the training process further - with a model size of 50 dimensions and a dataset with ~700k relations and ~80k nodes, the Gensim implementation takes around 45 seconds to complete an epoch (~15k relations per second), whereas the open source C++ implementation takes around 1/6th the time (~95k relations per second).\n",
"3. Implementing the variant of the model mentioned in the paper for symmetric graphs and evaluating on the scientific collaboration datasets described earlier in the report."
]
