Merge branch 'release-3.0.0'
menshikh-iv committed Sep 27, 2017
2 parents 8b8669d + af646c4 commit 351bdef
Showing 182 changed files with 23,312 additions and 4,747 deletions.
39 changes: 19 additions & 20 deletions .travis.yml
@@ -1,23 +1,22 @@
sudo: false

cache:
apt: true
directories:
- $HOME/.cache/pip
- $HOME/.ccache

dist: trusty
language: python
python:
- "2.7"
- "3.5"
- "3.6"
before_install:
- wget 'http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh' -O miniconda.sh
- chmod +x miniconda.sh
- ./miniconda.sh -b
- export PATH=/home/travis/miniconda2/bin:$PATH
- conda update --yes conda
install:
- conda create --yes -n gensim-test python=$TRAVIS_PYTHON_VERSION pip atlas numpy==1.11.3 scipy==0.18.1
- source activate gensim-test
- python setup.py install
- pip install .[test]
script:
- pip freeze
- python setup.py test
- pip install flake8
- continuous_integration/travis/flake8_diff.sh


matrix:
include:
- env: PYTHON_VERSION="2.7" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="yes"
- env: PYTHON_VERSION="2.7" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"
- env: PYTHON_VERSION="3.5" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"
- env: PYTHON_VERSION="3.6" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"


install: source continuous_integration/travis/install.sh
script: bash continuous_integration/travis/run.sh
49 changes: 49 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,54 @@
Changes
===========
## 3.0.0, 2017-09-27


:star2: New features:
* Add unsupervised FastText to Gensim (@chinmayapancholi13, [#1525](https://github.com/RaRe-Technologies/gensim/pull/1525)); see the usage sketch after this list
* Add sklearn API for gensim models (@chinmayapancholi13, [#1462](https://github.com/RaRe-Technologies/gensim/pull/1462))
* Add callback metrics for LdaModel and integration with Visdom (@parulsethi, [#1399](https://github.com/RaRe-Technologies/gensim/pull/1399))
* Add TranslationMatrix model (@robotcator, [#1434](https://github.com/RaRe-Technologies/gensim/pull/1434))
* Add word2vec-based coherence. Fix #1380 (@macks22, [#1530](https://github.com/RaRe-Technologies/gensim/pull/1530))
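
For context, here is a minimal usage sketch of the new pure-Python FastText training. This is a hedged illustration rather than code from this release: the `gensim.models.fasttext` module path and the pre-4.0 parameter names (`size`, `iter`) are assumptions, and the tiny corpus is invented.

```python
# Illustrative sketch only: train FastText character-n-gram embeddings
# on a toy corpus.
from gensim.models.fasttext import FastText

sentences = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
]

# `size`, `min_count` and `iter` follow the gensim 3.x naming assumed here;
# check the FastText docstring for the exact signature.
model = FastText(sentences, size=50, window=3, min_count=1, iter=10)

# Because FastText composes vectors from character n-grams, a word that
# never appeared in training can still be queried.
print(model.wv["computers"][:5])
```

Out-of-vocabulary lookups like the last line are the main practical difference from plain word2vec; relatedly, `FastTextKeyedVectors.__contains__` (updated elsewhere in this release) reports whether a vector can be produced for a given word.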


:+1: Improvements:
* Add 'diagonal' parameter for LdaModel.diff (@parulsethi, [#1448](https://github.com/RaRe-Technologies/gensim/pull/1448)); see the sketch after this list
* Add 'score' function for SklLdaModel (@chinmayapancholi13, [#1445](https://github.com/RaRe-Technologies/gensim/pull/1445))
* Update sklearn API for gensim models (@chinmayapancholi13, [#1473](https://github.com/RaRe-Technologies/gensim/pull/1473)) [:warning: breaks backward compatibility]
* Add CoherenceModel to LdaModel.top_topics. Fix #1128 (@macks22, [#1427](https://github.com/RaRe-Technologies/gensim/pull/1427))
* Add dendrogram viz for topics and JS metric (@parulsethi, [#1484](https://github.com/RaRe-Technologies/gensim/pull/1484))
* Add topic network viz (@parulsethi, [#1536](https://github.com/RaRe-Technologies/gensim/pull/1536))
* Replace viewitems with iteritems. Fix #1495 (@HodorTheCoder, [#1508](https://github.com/RaRe-Technologies/gensim/pull/1508))
* Fix Travis config and add style-checking for IPython notebooks. Fix #1518, #1520 (@menshikh-iv, [#1522](https://github.com/RaRe-Technologies/gensim/pull/1522))
* Remove mutable args from definitions. Fix #1561 (@zsef123, [#1562](https://github.com/RaRe-Technologies/gensim/pull/1562))
* Add AppVeyor for all PRs. Fix #1565 (@menshikh-iv, [#1565](https://github.com/RaRe-Technologies/gensim/pull/1565))
* Refactor code to follow PEP8. Partially fix #1521 (@zsef123, [#1550](https://github.com/RaRe-Technologies/gensim/pull/1550))
* Refactor code to follow PEP8 with additional restrictions. Fix #1521 (@menshikh-iv, [#1569](https://github.com/RaRe-Technologies/gensim/pull/1569))
* Update FastTextKeyedVectors.\_\_contains\_\_ (@ELind77, [#1499](https://github.com/RaRe-Technologies/gensim/pull/1499))
* Update WikiCorpus tokenization. Fix #1534 (@roopalgarg, [#1537](https://github.com/RaRe-Technologies/gensim/pull/1537))
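
To make the `diagonal` option concrete, here is a hedged sketch of comparing two models topic-by-topic. The toy corpus is invented, and the `distance` and `num_words` arguments are assumptions about the signature rather than something taken from this diff.

```python
# Illustrative sketch only: compare two small LDA models topic-by-topic.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "dog", "mouse"], ["dog", "bone", "cat"], ["mouse", "cheese"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda_a = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
lda_b = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# With diagonal=True the full topic-vs-topic distance matrix collapses to
# its diagonal: topic i of lda_a is compared only with topic i of lda_b.
mdiff, annotation = lda_a.diff(lda_b, distance="jaccard", num_words=10, diagonal=True)
print(mdiff)
```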


:red_circle: Bug fixes:
* Remove round in LdaSeqModel.print_topic. Fix #1480 (@menshikh-iv, [#1547](https://github.com/RaRe-Technologies/gensim/pull/1547))
* Fix TextCorpus.sample_text (@menshikh-iv, [#1548](https://github.com/RaRe-Technologies/gensim/pull/1548))
* Fix Mallet wrapper and tests for HDPTransform (@menshikh-iv, [#1555](https://github.com/RaRe-Technologies/gensim/pull/1555))
* Fix incorrect initialization of ShardedCorpus with a generator. Fix #1511 (@karkkainenk1, [#1512](https://github.com/RaRe-Technologies/gensim/pull/1512))
* Add verification when summarize_corpus returns null. Fix #1531 (@fbarrios, [#1570](https://github.com/RaRe-Technologies/gensim/pull/1570))
* Fix doctag unicode problem. Fix #1543 (@englhardt, [#1544](https://github.com/RaRe-Technologies/gensim/pull/1544))
* Fix Translation Matrix (@robotcator, [#1594](https://github.com/RaRe-Technologies/gensim/pull/1594))
* Add trainable flag to KeyedVectors.get_embedding_layer. Fix #1557 (@zsef123, [#1558](https://github.com/RaRe-Technologies/gensim/pull/1558))


:books: Tutorial and doc improvements:
* Update exception text in TextCorpus.sample_text. Partially fix #308 (@vlejd, [#1444](https://github.com/RaRe-Technologies/gensim/pull/1444))
* Remove extra filter_token from tutorial (@VorontsovIE, [#1502](https://github.com/RaRe-Technologies/gensim/pull/1502))
* Update Doc2Vec-IMDB notebook (@pahdo, [#1476](https://github.com/RaRe-Technologies/gensim/pull/1476))
* Add Google Tag Manager for site (@yardos, [#1556](https://github.com/RaRe-Technologies/gensim/pull/1556))
* Update docstring explaining lack of multistream support in WikiCorpus. Fix #1496 (@polm and @menshikh-iv, [#1515](https://github.com/RaRe-Technologies/gensim/pull/1515))
* Fix PathLineSentences docstring (@gojomo)
* Fix typos in the Translation Matrix notebook (@robotcator, [#1598](https://github.com/RaRe-Technologies/gensim/pull/1598))


## 2.3.0, 2017-07-25


2 changes: 2 additions & 0 deletions README.md
@@ -137,6 +137,8 @@ Adopters
| Amazon | <img src="http://g-ec2.images-amazon.com/images/G/01/social/api-share/amazon_logo_500500._V323939215_.png" width="100"> | [amazon.com](http://www.amazon.com/) | Document similarity|
| SiteGround Hosting | <img src="https://www.siteground.com/img/knox/logos/siteground.png" width="100"> | [siteground.com](https://www.siteground.com/) | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| Juju | <img src="https://d5k1a84rm5hwo.cloudfront.net/img/juju_home_logo.png" width="100"> | [www.juju.com](http://www.juju.com/) | Provide non-obvious related job suggestions. |
| NLPub | <img src="https://nlpub.org/images/thumb/a/aa/NLPub.svg/240px-NLPub.svg.png" width="100"> | [nlpub.org](https://nlpub.org/) | Distributional semantic models including word2vec. |
|Capital One | <img src="https://s3.amazonaws.com/fjds/member/original/1245173/C1_Core_NG_RGB_R_%281%29.PNG?1456169388" width="200"> | [www.capitalone.com](https://www.capitalone.com/) | Topic modeling for customer complaints exploration. |

-------

1 change: 0 additions & 1 deletion appveyor.yml
@@ -50,7 +50,6 @@ install:
- "python -c \"import struct; print(struct.calcsize('P') * 8)\""

# Install the build and runtime dependencies of the project.
# Install the build and runtime dependencies of the project.
- "%CMD_IN_ENV% pip install --timeout=60 --trusted-host 28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com -r continuous_integration/appveyor/requirements.txt"
- "%CMD_IN_ENV% python setup.py bdist_wheel bdist_wininst"
- ps: "ls dist"
60 changes: 38 additions & 22 deletions continuous_integration/travis/flake8_diff.sh
@@ -19,18 +19,18 @@ set -e
set -o pipefail

PROJECT=RaRe-Technologies/gensim
PROJECT_URL=https://github.com/$PROJECT.git
PROJECT_URL=https://github.com/${PROJECT}.git

# Find the remote with the project name (upstream in most cases)
REMOTE=$(git remote -v | grep $PROJECT | cut -f1 | head -1 || echo '')
REMOTE=$(git remote -v | grep ${PROJECT} | cut -f1 | head -1 || echo '')

# Add a temporary remote if needed. For example this is necessary when
# Travis is configured to run in a fork. In this case 'origin' is the
# fork and not the reference repo we want to diff against.
if [[ -z "$REMOTE" ]]; then
TMP_REMOTE=tmp_reference_upstream
REMOTE=$TMP_REMOTE
git remote add $REMOTE $PROJECT_URL
REMOTE=${TMP_REMOTE}
git remote add ${REMOTE} ${PROJECT_URL}
fi

echo "Remotes:"
@@ -56,15 +56,15 @@ if [[ "$TRAVIS" == "true" ]]; then
echo "New branch, no commit range from Travis so passing this test by convention"
exit 0
fi
COMMIT_RANGE=$TRAVIS_COMMIT_RANGE
COMMIT_RANGE=${TRAVIS_COMMIT_RANGE}
fi
else
# We want to fetch the code as it is in the PR branch and not
# the result of the merge into develop. This way line numbers
# reported by Travis will match with the local code.
LOCAL_BRANCH_REF=travis_pr_$TRAVIS_PULL_REQUEST
LOCAL_BRANCH_REF=travis_pr_${TRAVIS_PULL_REQUEST}
# In Travis the PR target is always origin
git fetch origin pull/$TRAVIS_PULL_REQUEST/head:refs/$LOCAL_BRANCH_REF
git fetch origin pull/${TRAVIS_PULL_REQUEST}/head:refs/${LOCAL_BRANCH_REF}
fi
fi

@@ -76,49 +76,55 @@ if [[ -z "$COMMIT_RANGE" ]]; then
fi
echo -e "\nLast 2 commits in $LOCAL_BRANCH_REF:"
echo '--------------------------------------------------------------------------------'
git log -2 $LOCAL_BRANCH_REF
git log -2 ${LOCAL_BRANCH_REF}

REMOTE_MASTER_REF="$REMOTE/develop"
# Make sure that $REMOTE_MASTER_REF is a valid reference
echo -e "\nFetching $REMOTE_MASTER_REF"
echo '--------------------------------------------------------------------------------'
git fetch $REMOTE develop:refs/remotes/$REMOTE_MASTER_REF
LOCAL_BRANCH_SHORT_HASH=$(git rev-parse --short $LOCAL_BRANCH_REF)
REMOTE_MASTER_SHORT_HASH=$(git rev-parse --short $REMOTE_MASTER_REF)
git fetch ${REMOTE} develop:refs/remotes/${REMOTE_MASTER_REF}
LOCAL_BRANCH_SHORT_HASH=$(git rev-parse --short ${LOCAL_BRANCH_REF})
REMOTE_MASTER_SHORT_HASH=$(git rev-parse --short ${REMOTE_MASTER_REF})

COMMIT=$(git merge-base $LOCAL_BRANCH_REF $REMOTE_MASTER_REF) || \
echo "No common ancestor found for $(git show $LOCAL_BRANCH_REF -q) and $(git show $REMOTE_MASTER_REF -q)"
COMMIT=$(git merge-base ${LOCAL_BRANCH_REF} ${REMOTE_MASTER_REF}) || \
echo "No common ancestor found for $(git show ${LOCAL_BRANCH_REF} -q) and $(git show ${REMOTE_MASTER_REF} -q)"

if [ -z "$COMMIT" ]; then
exit 1
fi

COMMIT_SHORT_HASH=$(git rev-parse --short $COMMIT)
COMMIT_SHORT_HASH=$(git rev-parse --short ${COMMIT})

echo -e "\nCommon ancestor between $LOCAL_BRANCH_REF ($LOCAL_BRANCH_SHORT_HASH)"\
"and $REMOTE_MASTER_REF ($REMOTE_MASTER_SHORT_HASH) is $COMMIT_SHORT_HASH:"
echo '--------------------------------------------------------------------------------'
git show --no-patch $COMMIT_SHORT_HASH
git show --no-patch ${COMMIT_SHORT_HASH}

COMMIT_RANGE="$COMMIT_SHORT_HASH..$LOCAL_BRANCH_SHORT_HASH"

if [[ -n "$TMP_REMOTE" ]]; then
git remote remove $TMP_REMOTE
git remote remove ${TMP_REMOTE}
fi

else
echo "Got the commit range from Travis: $COMMIT_RANGE"
fi

echo -e '\nRunning flake8 on the diff in the range' "$COMMIT_RANGE" \
"($(git rev-list $COMMIT_RANGE | wc -l) commit(s)):"
"($(git rev-list ${COMMIT_RANGE} | wc -l) commit(s)):"
echo '--------------------------------------------------------------------------------'

# We ignore files from sklearn/externals.
# Excluding vec files since they contain non-utf8 content and flake8 raises exception for non-utf8 input
# We need the following command to exit with 0 hence the echo in case
# there is no match
MODIFIED_FILES="$(git diff --name-only $COMMIT_RANGE -- . ':(exclude)*.vec' || echo "no_match")"
MODIFIED_PY_FILES="$(git diff --name-only ${COMMIT_RANGE} | grep '[a-zA-Z0-9]*.py$' || echo "no_match")"
MODIFIED_IPYNB_FILES="$(git diff --name-only ${COMMIT_RANGE} | grep '[a-zA-Z0-9]*.ipynb$' || echo "no_match")"


echo "*.py files: " ${MODIFIED_PY_FILES}
echo "*.ipynb files: " ${MODIFIED_IPYNB_FILES}


check_files() {
files="$1"
@@ -127,13 +133,23 @@ check_files() {
if [ -n "$files" ]; then
# Conservative approach: diff without context (--unified=0) so that code
# that was not changed does not create failures
git diff --unified=0 $COMMIT_RANGE -- $files | flake8 --diff --show-source $options
git diff --unified=0 ${COMMIT_RANGE} -- ${files} | flake8 --diff --show-source ${options}
fi
}

if [[ "$MODIFIED_FILES" == "no_match" ]]; then
echo "No file has been modified"
if [[ "$MODIFIED_PY_FILES" == "no_match" ]]; then
echo "No .py files has been modified"
else
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.txt,*.csv,*.vec,Dockerfile*,*.c,*.pyx,*.inc"
check_files "$(echo "$MODIFIED_PY_FILES" )" "--ignore=E501,E731,E12,W503"
fi
echo -e "No problem detected by flake8\n"

if [[ "$MODIFIED_IPYNB_FILES" == "no_match" ]]; then
echo "No .ipynb file has been modified"
else
for fname in ${MODIFIED_IPYNB_FILES}
do
echo "File: $fname"
jupyter nbconvert --to script --stdout ${fname} | flake8 - --show-source --ignore=E501,E731,E12,W503,E402 --builtins=get_ipython || true
done
fi
13 changes: 13 additions & 0 deletions continuous_integration/travis/install.sh
@@ -0,0 +1,13 @@
#!/bin/bash

set -e

deactivate
wget 'http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh' -O miniconda.sh
chmod +x miniconda.sh && ./miniconda.sh -b
export PATH=/home/travis/miniconda2/bin:$PATH
conda update --yes conda


conda create --yes -n gensim-test python=${PYTHON_VERSION} pip atlas flake8 jupyter numpy==${NUMPY_VERSION} scipy==${SCIPY_VERSION} && source activate gensim-test
pip install . && pip install .[test]
11 changes: 11 additions & 0 deletions continuous_integration/travis/run.sh
@@ -0,0 +1,11 @@
#!/bin/bash

set -e

pip freeze

if [[ "$ONLY_CODESTYLE" == "yes" ]]; then
continuous_integration/travis/flake8_diff.sh
else
python setup.py test
fi
2 changes: 1 addition & 1 deletion docker/start_jupyter_notebook.sh
@@ -4,4 +4,4 @@ PORT=$1
NOTEBOOK_DIR=/gensim/docs/notebooks
DEFAULT_URL=/notebooks/gensim%20Quick%20Start.ipynb

jupyter notebook --no-browser --ip=* --port=$PORT --allow-root --notebook-dir=$NOTEBOOK_DIR --NotebookApp.token=\"\" --NotebookApp.default_url=$DEFAULT_URL
jupyter notebook --no-browser --ip=* --port=${PORT} --allow-root --notebook-dir=${NOTEBOOK_DIR} --NotebookApp.token=\"\" --NotebookApp.default_url=${DEFAULT_URL}
Binary file added docs/notebooks/Coherence.gif
Binary file added docs/notebooks/Convergence.gif
5 changes: 1 addition & 4 deletions docs/notebooks/Corpora_and_Vector_Spaces.ipynb
@@ -354,7 +354,7 @@
"source": [
"Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.\n",
"\n",
"We are going to create the dictionary from the mycorpus.txt file without loading the entire file into memory. Then, we will generate the list of token ids to remove from this dictionary by querying the dictionary for the token ids of the stop words, and by querying the document frequencies dictionary (dictionary.dfs) for token ids that only appear once. Finally, we will filter these token ids out of our dictionary and call dictionary.compactify() to remove the gaps in the token id series."
"We are going to create the dictionary from the mycorpus.txt file without loading the entire file into memory. Then, we will generate the list of token ids to remove from this dictionary by querying the dictionary for the token ids of the stop words, and by querying the document frequencies dictionary (`dictionary.dfs`) for token ids that only appear once. Finally, we will filter these token ids out of our dictionary. Keep in mind that `dictionary.filter_tokens` (and some other functions such as `dictionary.add_document`) will call `dictionary.compactify()` to remove the gaps in the token id series thus enumeration of remaining tokens can be changed."
]
},
{
@@ -385,9 +385,6 @@
"\n",
"# remove stop words and words that appear only once\n",
"dictionary.filter_tokens(stop_ids + once_ids)\n",
"\n",
"# remove gaps in id sequence after words that were removed\n",
"dictionary.compactify()\n",
"print(dictionary)"
]
},
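
The tutorial cell above now relies on `dictionary.filter_tokens` doing the compaction itself. Below is a minimal sketch of the workflow described in that paragraph; the stoplist is an illustrative stand-in, while the `mycorpus.txt` name comes from the tutorial text.

```python
# Sketch of the tutorial workflow: stream a Dictionary from a text file,
# then drop stop words and tokens that occur only once.
from gensim import corpora

stoplist = set("for a of the and to in".split())  # illustrative stoplist

# One document per line; the file is never loaded into memory as a whole.
dictionary = corpora.Dictionary(
    line.lower().split() for line in open("mycorpus.txt")
)

stop_ids = [
    dictionary.token2id[word] for word in stoplist
    if word in dictionary.token2id
]
once_ids = [
    token_id for token_id, doc_freq in dictionary.dfs.items() if doc_freq == 1
]

# filter_tokens() removes the ids and compacts the id range internally,
# so no separate compactify() call is needed.
dictionary.filter_tokens(stop_ids + once_ids)
print(dictionary)
```
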
Binary file added docs/notebooks/Diff.gif
