Merge branch 'release-3.0.0'
menshikh-iv committed Sep 27, 2017
2 parents 8b8669d + af646c4 commit 351bdef
Showing 182 changed files with 23,312 additions and 4,747 deletions.
39 changes: 19 additions & 20 deletions .travis.yml
@@ -1,23 +1,22 @@
sudo: false

cache:
apt: true
directories:
- $HOME/.cache/pip
- $HOME/.ccache

dist: trusty
language: python
python:
- "2.7"
- "3.5"
- "3.6"
before_install:
- wget 'http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh' -O miniconda.sh
- chmod +x miniconda.sh
- ./miniconda.sh -b
- export PATH=/home/travis/miniconda2/bin:$PATH
- conda update --yes conda
install:
- conda create --yes -n gensim-test python=$TRAVIS_PYTHON_VERSION pip atlas numpy==1.11.3 scipy==0.18.1
- source activate gensim-test
- python setup.py install
- pip install .[test]
script:
- pip freeze
- python setup.py test
- pip install flake8
- continuous_integration/travis/flake8_diff.sh


matrix:
include:
- env: PYTHON_VERSION="2.7" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="yes"
- env: PYTHON_VERSION="2.7" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"
- env: PYTHON_VERSION="3.5" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"
- env: PYTHON_VERSION="3.6" NUMPY_VERSION="1.11.3" SCIPY_VERSION="0.18.1" ONLY_CODESTYLE="no"


install: source continuous_integration/travis/install.sh
script: bash continuous_integration/travis/run.sh
49 changes: 49 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,54 @@
Changes
===========
## 3.0.0, 2017-09-27


:star2: New features:
* Add unsupervised FastText to Gensim (@chinmayapancholi13, [#1525](https://github.com/RaRe-Technologies/gensim/pull/1525)); see the usage sketch after this list
* Add sklearn API for gensim models (@chinmayapancholi13, [#1462](https://github.com/RaRe-Technologies/gensim/pull/1462))
* Add callback metrics for LdaModel and integration with Visdom (@parulsethi, [#1399](https://github.com/RaRe-Technologies/gensim/pull/1399))
* Add TranslationMatrix model (@robotcator, [#1434](https://github.com/RaRe-Technologies/gensim/pull/1434))
* Add word2vec-based coherence. Fix #1380 (@macks22, [#1530](https://github.com/RaRe-Technologies/gensim/pull/1530))
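
For context, here is a minimal usage sketch of the new pure-Python FastText training. This is a hedged illustration rather than code from this release: the `gensim.models.fasttext` module path and the pre-4.0 parameter names (`size`, `iter`) are assumptions, and the tiny corpus is invented.

```python
# Illustrative sketch only: train FastText character-n-gram embeddings
# on a toy corpus.
from gensim.models.fasttext import FastText

sentences = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
]

# `size`, `min_count` and `iter` follow the gensim 3.x naming assumed here;
# check the FastText docstring for the exact signature.
model = FastText(sentences, size=50, window=3, min_count=1, iter=10)

# Because FastText composes vectors from character n-grams, a word that
# never appeared in training can still be queried.
print(model.wv["computers"][:5])
```

Out-of-vocabulary lookups like the last line are the main practical difference from plain word2vec; relatedly, `FastTextKeyedVectors.__contains__` (updated elsewhere in this release) reports whether a vector can be produced for a given word.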


:+1: Improvements:
* Add 'diagonal' parameter for LdaModel.diff (@parulsethi, [#1448](https://github.com/RaRe-Technologies/gensim/pull/1448)); see the sketch after this list
* Add 'score' function for SklLdaModel (@chinmayapancholi13, [#1445](https://github.com/RaRe-Technologies/gensim/pull/1445))
* Update sklearn API for gensim models (@chinmayapancholi13, [#1473](https://github.com/RaRe-Technologies/gensim/pull/1473)) [:warning: breaks backward compatibility]
* Add CoherenceModel to LdaModel.top_topics. Fix #1128 (@macks22, [#1427](https://github.com/RaRe-Technologies/gensim/pull/1427))
* Add dendrogram viz for topics and JS metric (@parulsethi, [#1484](https://github.com/RaRe-Technologies/gensim/pull/1484))
* Add topic network viz (@parulsethi, [#1536](https://github.com/RaRe-Technologies/gensim/pull/1536))
* Replace viewitems with iteritems. Fix #1495 (@HodorTheCoder, [#1508](https://github.com/RaRe-Technologies/gensim/pull/1508))
* Fix Travis config and add style-checking for IPython notebooks. Fix #1518, #1520 (@menshikh-iv, [#1522](https://github.com/RaRe-Technologies/gensim/pull/1522))
* Remove mutable args from definitions. Fix #1561 (@zsef123, [#1562](https://github.com/RaRe-Technologies/gensim/pull/1562))
* Add AppVeyor for all PRs. Fix #1565 (@menshikh-iv, [#1565](https://github.com/RaRe-Technologies/gensim/pull/1565))
* Refactor code to follow PEP8. Partially fix #1521 (@zsef123, [#1550](https://github.com/RaRe-Technologies/gensim/pull/1550))
* Refactor code to follow PEP8 with additional restrictions. Fix #1521 (@menshikh-iv, [#1569](https://github.com/RaRe-Technologies/gensim/pull/1569))
* Update FastTextKeyedVectors.\_\_contains\_\_ (@ELind77, [#1499](https://github.com/RaRe-Technologies/gensim/pull/1499))
* Update WikiCorpus tokenization. Fix #1534 (@roopalgarg, [#1537](https://github.com/RaRe-Technologies/gensim/pull/1537))
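
To make the `diagonal` option concrete, here is a hedged sketch of comparing two models topic-by-topic. The toy corpus is invented, and the `distance` and `num_words` arguments are assumptions about the signature rather than something taken from this diff.

```python
# Illustrative sketch only: compare two small LDA models topic-by-topic.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "dog", "mouse"], ["dog", "bone", "cat"], ["mouse", "cheese"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda_a = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
lda_b = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# With diagonal=True the full topic-vs-topic distance matrix collapses to
# its diagonal: topic i of lda_a is compared only with topic i of lda_b.
mdiff, annotation = lda_a.diff(lda_b, distance="jaccard", num_words=10, diagonal=True)
print(mdiff)
```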


:red_circle: Bug fixes:
* Remove round in LdaSeqModel.print_topic. Fix #1480 (@menshikh-iv, [#1547](https://github.com/RaRe-Technologies/gensim/pull/1547))
* Fix TextCorpus.sample_text (@menshikh-iv, [#1548](https://github.com/RaRe-Technologies/gensim/pull/1548))
* Fix Mallet wrapper and tests for HDPTransform (@menshikh-iv, [#1555](https://github.com/RaRe-Technologies/gensim/pull/1555))
* Fix incorrect initialization of ShardedCorpus with a generator. Fix #1511 (@karkkainenk1, [#1512](https://github.com/RaRe-Technologies/gensim/pull/1512))
* Add verification when summarize_corpus returns null. Fix #1531 (@fbarrios, [#1570](https://github.com/RaRe-Technologies/gensim/pull/1570))
* Fix doctag unicode problem. Fix #1543 (@englhardt, [#1544](https://github.com/RaRe-Technologies/gensim/pull/1544))
* Fix Translation Matrix (@robotcator, [#1594](https://github.com/RaRe-Technologies/gensim/pull/1594))
* Add trainable flag to KeyedVectors.get_embedding_layer. Fix #1557 (@zsef123, [#1558](https://github.com/RaRe-Technologies/gensim/pull/1558))


:books: Tutorial and doc improvements:
* Update exception text in TextCorpus.sample_text. Partially fix #308 (@vlejd, [#1444](https://github.com/RaRe-Technologies/gensim/pull/1444))
* Remove extra filter_token from tutorial (@VorontsovIE, [#1502](https://github.com/RaRe-Technologies/gensim/pull/1502))
* Update Doc2Vec-IMDB notebook (@pahdo, [#1476](https://github.com/RaRe-Technologies/gensim/pull/1476))
* Add Google Tag Manager for site (@yardos, [#1556](https://github.com/RaRe-Technologies/gensim/pull/1556))
* Update docstring explaining lack of multistream support in WikiCorpus. Fix #1496 (@polm and @menshikh-iv, [#1515](https://github.com/RaRe-Technologies/gensim/pull/1515))
* Fix PathLineSentences docstring (@gojomo)
* Fix typos in the Translation Matrix notebook (@robotcator, [#1598](https://github.com/RaRe-Technologies/gensim/pull/1598))


## 2.3.0, 2017-07-25


2 changes: 2 additions & 0 deletions README.md
@@ -137,6 +137,8 @@ Adopters
| Amazon | <img src="http://g-ec2.images-amazon.com/images/G/01/social/api-share/amazon_logo_500500._V323939215_.png" width="100"> | [amazon.com](http://www.amazon.com/) | Document similarity|
| SiteGround Hosting | <img src="https://www.siteground.com/img/knox/logos/siteground.png" width="100"> | [siteground.com](https://www.siteground.com/) | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| Juju | <img src="https://d5k1a84rm5hwo.cloudfront.net/img/juju_home_logo.png" width="100"> | [www.juju.com](http://www.juju.com/) | Provide non-obvious related job suggestions. |
| NLPub | <img src="https://nlpub.org/images/thumb/a/aa/NLPub.svg/240px-NLPub.svg.png" width="100"> | [nlpub.org](https://nlpub.org/) | Distributional semantic models including word2vec. |
|Capital One | <img src="https://s3.amazonaws.com/fjds/member/original/1245173/C1_Core_NG_RGB_R_%281%29.PNG?1456169388" width="200"> | [www.capitalone.com](https://www.capitalone.com/) | Topic modeling for customer complaints exploration. |

-------

1 change: 0 additions & 1 deletion appveyor.yml
@@ -50,7 +50,6 @@ install:
- "python -c \"import struct; print(struct.calcsize('P') * 8)\""

# Install the build and runtime dependencies of the project.
# Install the build and runtime dependencies of the project.
- "%CMD_IN_ENV% pip install --timeout=60 --trusted-host 28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com -r continuous_integration/appveyor/requirements.txt"
- "%CMD_IN_ENV% python setup.py bdist_wheel bdist_wininst"
- ps: "ls dist"
60 changes: 38 additions & 22 deletions continuous_integration/travis/flake8_diff.sh
@@ -19,18 +19,18 @@ set -e
set -o pipefail

PROJECT=RaRe-Technologies/gensim
PROJECT_URL=https://github.com/$PROJECT.git
PROJECT_URL=https://github.com/${PROJECT}.git

# Find the remote with the project name (upstream in most cases)
REMOTE=$(git remote -v | grep $PROJECT | cut -f1 | head -1 || echo '')
REMOTE=$(git remote -v | grep ${PROJECT} | cut -f1 | head -1 || echo '')

# Add a temporary remote if needed. For example this is necessary when
# Travis is configured to run in a fork. In this case 'origin' is the
# fork and not the reference repo we want to diff against.
if [[ -z "$REMOTE" ]]; then
TMP_REMOTE=tmp_reference_upstream
REMOTE=$TMP_REMOTE
git remote add $REMOTE $PROJECT_URL
REMOTE=${TMP_REMOTE}
git remote add ${REMOTE} ${PROJECT_URL}
fi

echo "Remotes:"
@@ -56,15 +56,15 @@ if [[ "$TRAVIS" == "true" ]]; then
echo "New branch, no commit range from Travis so passing this test by convention"
exit 0
fi
COMMIT_RANGE=$TRAVIS_COMMIT_RANGE
COMMIT_RANGE=${TRAVIS_COMMIT_RANGE}
fi
else
# We want to fetch the code as it is in the PR branch and not
# the result of the merge into develop. This way line numbers
# reported by Travis will match with the local code.
LOCAL_BRANCH_REF=travis_pr_$TRAVIS_PULL_REQUEST
LOCAL_BRANCH_REF=travis_pr_${TRAVIS_PULL_REQUEST}
# In Travis the PR target is always origin
git fetch origin pull/$TRAVIS_PULL_REQUEST/head:refs/$LOCAL_BRANCH_REF
git fetch origin pull/${TRAVIS_PULL_REQUEST}/head:refs/${LOCAL_BRANCH_REF}
fi
fi

@@ -76,49 +76,55 @@ if [[ -z "$COMMIT_RANGE" ]]; then
fi
echo -e "\nLast 2 commits in $LOCAL_BRANCH_REF:"
echo '--------------------------------------------------------------------------------'
git log -2 $LOCAL_BRANCH_REF
git log -2 ${LOCAL_BRANCH_REF}

REMOTE_MASTER_REF="$REMOTE/develop"
# Make sure that $REMOTE_MASTER_REF is a valid reference
echo -e "\nFetching $REMOTE_MASTER_REF"
echo '--------------------------------------------------------------------------------'
git fetch $REMOTE develop:refs/remotes/$REMOTE_MASTER_REF
LOCAL_BRANCH_SHORT_HASH=$(git rev-parse --short $LOCAL_BRANCH_REF)
REMOTE_MASTER_SHORT_HASH=$(git rev-parse --short $REMOTE_MASTER_REF)
git fetch ${REMOTE} develop:refs/remotes/${REMOTE_MASTER_REF}
LOCAL_BRANCH_SHORT_HASH=$(git rev-parse --short ${LOCAL_BRANCH_REF})
REMOTE_MASTER_SHORT_HASH=$(git rev-parse --short ${REMOTE_MASTER_REF})

COMMIT=$(git merge-base $LOCAL_BRANCH_REF $REMOTE_MASTER_REF) || \
echo "No common ancestor found for $(git show $LOCAL_BRANCH_REF -q) and $(git show $REMOTE_MASTER_REF -q)"
COMMIT=$(git merge-base ${LOCAL_BRANCH_REF} ${REMOTE_MASTER_REF}) || \
echo "No common ancestor found for $(git show ${LOCAL_BRANCH_REF} -q) and $(git show ${REMOTE_MASTER_REF} -q)"

if [ -z "$COMMIT" ]; then
exit 1
fi

COMMIT_SHORT_HASH=$(git rev-parse --short $COMMIT)
COMMIT_SHORT_HASH=$(git rev-parse --short ${COMMIT})

echo -e "\nCommon ancestor between $LOCAL_BRANCH_REF ($LOCAL_BRANCH_SHORT_HASH)"\
"and $REMOTE_MASTER_REF ($REMOTE_MASTER_SHORT_HASH) is $COMMIT_SHORT_HASH:"
echo '--------------------------------------------------------------------------------'
git show --no-patch $COMMIT_SHORT_HASH
git show --no-patch ${COMMIT_SHORT_HASH}

COMMIT_RANGE="$COMMIT_SHORT_HASH..$LOCAL_BRANCH_SHORT_HASH"

if [[ -n "$TMP_REMOTE" ]]; then
git remote remove $TMP_REMOTE
git remote remove ${TMP_REMOTE}
fi

else
echo "Got the commit range from Travis: $COMMIT_RANGE"
fi

echo -e '\nRunning flake8 on the diff in the range' "$COMMIT_RANGE" \
"($(git rev-list $COMMIT_RANGE | wc -l) commit(s)):"
"($(git rev-list ${COMMIT_RANGE} | wc -l) commit(s)):"
echo '--------------------------------------------------------------------------------'

# We ignore files from sklearn/externals.
# Excluding vec files since they contain non-utf8 content and flake8 raises exception for non-utf8 input
# We need the following command to exit with 0 hence the echo in case
# there is no match
MODIFIED_FILES="$(git diff --name-only $COMMIT_RANGE -- . ':(exclude)*.vec' || echo "no_match")"
MODIFIED_PY_FILES="$(git diff --name-only ${COMMIT_RANGE} | grep '[a-zA-Z0-9]*.py$' || echo "no_match")"
MODIFIED_IPYNB_FILES="$(git diff --name-only ${COMMIT_RANGE} | grep '[a-zA-Z0-9]*.ipynb$' || echo "no_match")"


echo "*.py files: " ${MODIFIED_PY_FILES}
echo "*.ipynb files: " ${MODIFIED_IPYNB_FILES}


check_files() {
files="$1"
@@ -127,13 +133,23 @@ check_files() {
if [ -n "$files" ]; then
# Conservative approach: diff without context (--unified=0) so that code
# that was not changed does not create failures
git diff --unified=0 $COMMIT_RANGE -- $files | flake8 --diff --show-source $options
git diff --unified=0 ${COMMIT_RANGE} -- ${files} | flake8 --diff --show-source ${options}
fi
}

if [[ "$MODIFIED_FILES" == "no_match" ]]; then
echo "No file has been modified"
if [[ "$MODIFIED_PY_FILES" == "no_match" ]]; then
echo "No .py files has been modified"
else
check_files "$(echo "$MODIFIED_FILES" )" "--ignore=E501,E731,E12,W503 --exclude=*.sh,*.md,*.yml,*.rst,*.ipynb,*.txt,*.csv,*.vec,Dockerfile*,*.c,*.pyx,*.inc"
check_files "$(echo "$MODIFIED_PY_FILES" )" "--ignore=E501,E731,E12,W503"
fi
echo -e "No problem detected by flake8\n"

if [[ "$MODIFIED_IPYNB_FILES" == "no_match" ]]; then
echo "No .ipynb file has been modified"
else
for fname in ${MODIFIED_IPYNB_FILES}
do
echo "File: $fname"
jupyter nbconvert --to script --stdout ${fname} | flake8 - --show-source --ignore=E501,E731,E12,W503,E402 --builtins=get_ipython || true
done
fi
13 changes: 13 additions & 0 deletions continuous_integration/travis/install.sh
@@ -0,0 +1,13 @@
#!/bin/bash

set -e

deactivate
wget 'http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh' -O miniconda.sh
chmod +x miniconda.sh && ./miniconda.sh -b
export PATH=/home/travis/miniconda2/bin:$PATH
conda update --yes conda


conda create --yes -n gensim-test python=${PYTHON_VERSION} pip atlas flake8 jupyter numpy==${NUMPY_VERSION} scipy==${SCIPY_VERSION} && source activate gensim-test
pip install . && pip install .[test]
11 changes: 11 additions & 0 deletions continuous_integration/travis/run.sh
@@ -0,0 +1,11 @@
#!/bin/bash

set -e

pip freeze

if [[ "$ONLY_CODESTYLE" == "yes" ]]; then
continuous_integration/travis/flake8_diff.sh
else
python setup.py test
fi
2 changes: 1 addition & 1 deletion docker/start_jupyter_notebook.sh
@@ -4,4 +4,4 @@ PORT=$1
NOTEBOOK_DIR=/gensim/docs/notebooks
DEFAULT_URL=/notebooks/gensim%20Quick%20Start.ipynb

jupyter notebook --no-browser --ip=* --port=$PORT --allow-root --notebook-dir=$NOTEBOOK_DIR --NotebookApp.token=\"\" --NotebookApp.default_url=$DEFAULT_URL
jupyter notebook --no-browser --ip=* --port=${PORT} --allow-root --notebook-dir=${NOTEBOOK_DIR} --NotebookApp.token=\"\" --NotebookApp.default_url=${DEFAULT_URL}
Binary file added docs/notebooks/Coherence.gif
Binary file added docs/notebooks/Convergence.gif
5 changes: 1 addition & 4 deletions docs/notebooks/Corpora_and_Vector_Spaces.ipynb
@@ -354,7 +354,7 @@
"source": [
"Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.\n",
"\n",
"We are going to create the dictionary from the mycorpus.txt file without loading the entire file into memory. Then, we will generate the list of token ids to remove from this dictionary by querying the dictionary for the token ids of the stop words, and by querying the document frequencies dictionary (dictionary.dfs) for token ids that only appear once. Finally, we will filter these token ids out of our dictionary and call dictionary.compactify() to remove the gaps in the token id series."
"We are going to create the dictionary from the mycorpus.txt file without loading the entire file into memory. Then, we will generate the list of token ids to remove from this dictionary by querying the dictionary for the token ids of the stop words, and by querying the document frequencies dictionary (`dictionary.dfs`) for token ids that only appear once. Finally, we will filter these token ids out of our dictionary. Keep in mind that `dictionary.filter_tokens` (and some other functions such as `dictionary.add_document`) will call `dictionary.compactify()` to remove the gaps in the token id series thus enumeration of remaining tokens can be changed."
]
},
{
@@ -385,9 +385,6 @@
"\n",
"# remove stop words and words that appear only once\n",
"dictionary.filter_tokens(stop_ids + once_ids)\n",
"\n",
"# remove gaps in id sequence after words that were removed\n",
"dictionary.compactify()\n",
"print(dictionary)"
]
},
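
The tutorial cell above now relies on `dictionary.filter_tokens` doing the compaction itself. Below is a minimal sketch of the workflow described in that paragraph; the stoplist is an illustrative stand-in, while the `mycorpus.txt` name comes from the tutorial text.

```python
# Sketch of the tutorial workflow: stream a Dictionary from a text file,
# then drop stop words and tokens that occur only once.
from gensim import corpora

stoplist = set("for a of the and to in".split())  # illustrative stoplist

# One document per line; the file is never loaded into memory as a whole.
dictionary = corpora.Dictionary(
    line.lower().split() for line in open("mycorpus.txt")
)

stop_ids = [
    dictionary.token2id[word] for word in stoplist
    if word in dictionary.token2id
]
once_ids = [
    token_id for token_id, doc_freq in dictionary.dfs.items() if doc_freq == 1
]

# filter_tokens() removes the ids and compacts the id range internally,
# so no separate compactify() call is needed.
dictionary.filter_tokens(stop_ids + once_ids)
print(dictionary)
```
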
Binary file added docs/notebooks/Diff.gif
