Flexible API for nearest neighbour search #548

Closed
anhldbk wants to merge 7 commits

anhldbk commented Nov 25, 2015

This is an implementation for issue #527

I've added neighbor_sim.py to the gensim/models directory, in which I define the NeighborIndexer classes.

from gensim.models.doc2vec import Doc2Vec
from gensim.models.word2vec import Word2Vec


class NeighborIndexer(object):
    """
    Base class for k-NN libraries
    """

    def __init__(self, model=None, **kwargs):
        """
        Create a new indexer
        :param model:   Instance of Doc2Vec or Word2Vec
        :param kwargs:  Additional named parameters
        """
        # Doc2Vec subclasses Word2Vec in gensim, so check for Doc2Vec first.
        # A model of None falls through to the error branch as well.
        if isinstance(model, Doc2Vec):
            self._init_doc2vec_(model)
        elif isinstance(model, Word2Vec):
            self._init_word2vec_(model)
        else:
            raise ValueError("Invalid model parameter. Please provide model with an instance of Doc2Vec or Word2Vec")
        self.start_indexing()

    def _init_doc2vec_(self, model):
        docvecs = model.docvecs
        docvecs.init_sims()
        size = len(docvecs.doctag_syn0norm)

        for i in range(size):
            self.add_item(docvecs.offset2doctag[i], docvecs.doctag_syn0norm[i])

    def _init_word2vec_(self, model):
        # raise Exception("Not supported at the moment")
        model.init_sims()
        size = len(model.syn0norm)

        for i in range(size):
            self.add_item(model.index2word[i], model.syn0norm[i])

    def get_item(self, label):
        """
        Get an item's vector by its label
        :param label: The label
        :return: The item's vector if it exists. Otherwise, returns None
        """
        pass

    def add_item(self, label, vector):
        """
        Add an item to index
        :param label: the label of the item (must be unique).
        :param vector: the item's vector.
        """
        pass

    def start_indexing(self):
        """
        Start the indexing operation. This method is supposed to run after all items are added.
        It may take time to finish.
        """
        pass

    def get_nearest_items(self, vector, top_n=10):
        """
        Get nearest items for an item described by its vector
        :param  vector: The vector
        :param  top_n: Number of nearest items to get
        :return: A list of tuples (label, similarity) for the nearest ones
        """
        pass

    def save(self, file_name):
        """
        Dump internal data to disk
        :param file_name:   Output file path
        :return: Returns true on success. Otherwise, returns false
        """
        pass

    def load(self, file_name):
        """
        Load previously dumped info
        :param file_name:  Input file path
        :return: Returns true on success. Otherwise, returns false
        """
        pass
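
To make the interface above concrete, here is a minimal sketch of what a subclass could look like. `BruteForceIndexer` is a hypothetical name and is not part of this pull request. It does exact cosine search in pure Python and is written standalone, without the model argument, so that it runs without gensim. It only mirrors the method names and return shapes described in the docstrings above.

```python
import math
import pickle


class BruteForceIndexer(object):
    """Hypothetical exact-search indexer mirroring the NeighborIndexer method names."""

    def __init__(self):
        self._labels = []      # labels in insertion order
        self._vectors = []     # unit-length vectors, parallel to _labels
        self._positions = {}   # label -> position in the two lists above

    @staticmethod
    def _normalize(vector):
        # Scale to unit length so that a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        return [x / norm for x in vector]

    def add_item(self, label, vector):
        if label in self._positions:
            raise ValueError("duplicate label: %r" % (label,))
        self._positions[label] = len(self._labels)
        self._labels.append(label)
        self._vectors.append(self._normalize(vector))

    def get_item(self, label):
        position = self._positions.get(label)
        return None if position is None else self._vectors[position]

    def start_indexing(self):
        # Nothing to build for exact search; an approximate index
        # would construct its internal structures here.
        pass

    def get_nearest_items(self, vector, top_n=10):
        query = self._normalize(vector)
        scored = [
            (label, sum(a * b for a, b in zip(query, stored)))
            for label, stored in zip(self._labels, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n]

    def save(self, file_name):
        with open(file_name, "wb") as handle:
            pickle.dump((self._labels, self._vectors), handle)
        return True

    def load(self, file_name):
        with open(file_name, "rb") as handle:
            self._labels, self._vectors = pickle.load(handle)
        self._positions = {label: i for i, label in enumerate(self._labels)}
        return True
```

An exact indexer like this is useful mainly as a reference: an approximate indexer can be checked against it on a small data set.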

To make integration easier, I've modified the most_similar() functions in doc2vec.py and word2vec.py by adding an optional parameter, indexer.

For example:

# at line #409, doc2vec.py, the new function looks like this
def most_similar(self, positive=[], negative=[], topn=10, clip_start=0, clip_end=None, indexer=None):
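
How an optional indexer parameter could plug into most_similar() can be sketched as follows. This is a simplified standalone function, not the code from this pull request: the dict-based `vectors` argument and the exact-search fallback are assumptions made for illustration, and unlike gensim it does not exclude the query labels from the results.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (norm_a * norm_b)


def most_similar(vectors, positive, negative=(), topn=10, indexer=None):
    """Rank the labels in `vectors` (dict: label -> list of floats) against a query.

    The query is the mean of the positive vectors and the negated negative vectors.
    """
    dim = len(next(iter(vectors.values())))
    mean = [0.0] * dim
    for weight, labels in ((1.0, positive), (-1.0, negative)):
        for label in labels:
            for i, value in enumerate(vectors[label]):
                mean[i] += weight * value
    count = (len(positive) + len(negative)) or 1
    mean = [value / count for value in mean]

    if indexer is not None:
        # Delegate the neighbour search to the supplied indexer.
        return indexer.get_nearest_items(mean, top_n=topn)

    # Default path: exact search over every stored vector.
    ranked = sorted(
        ((label, cosine(mean, vector)) for label, vector in vectors.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]
```

Keeping indexer=None as the default means existing callers take the exact-search path unchanged.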

To use the indexers, you may use the snippet below:

from gensim.models.doc2vec import Doc2Vec
from gensim.models.neighbor_sim import AnnoyIndexer

# fname is the path to a previously saved Doc2Vec model
model = Doc2Vec.load(fname)

# create an indexer
# features_num = size of each document's vector
# You may need to tune the *tree_size* parameter
indexer = AnnoyIndexer(model=model, features_num=100, tree_size=300)

# the indexing operation may take time to finish

document = "Here comes the sun"
# infer a vector for the new document (assumes Doc2Vec.infer_vector is what was intended here)
vector = model.infer_vector(document.lower().split())

# get the most similar documents
similar_documents = model.most_similar([vector], indexer=indexer)

Currently, only an Annoy-based NeighborIndexer is supported.

- utils.HAS_PATTERN, has also changed to utils.has_pattern()
* Word2vec allows non-strict unicode error handling (ignore or replace). (Gordon Mohr, #466)
0.12.3, 05/11/2015

piskvorky Nov 25, 2015

Member

Why this change?

anhldbk Nov 25, 2015

@piskvorky Sorry, that was just an error during the merge. Please disregard it.

@piskvorky piskvorky changed the title from Anh le to Flexible API for nearest neighbour search Nov 25, 2015

Member

piskvorky commented Nov 25, 2015

Thanks @anhldbk ! Will review asap :)

Member

piskvorky commented Nov 25, 2015

What's up with the commits by @tmylk in this PR?

Also, the commits (including yours) seem to be doubled. Let me know if you need assistance there -- what happened?

anhldbk commented Nov 27, 2015

@piskvorky Me? Actually I've just made my first pull request on github. lol It's ok now.

Contributor

tmylk commented Jan 9, 2016

@anhldbk It would be good to include this in the next release. To that end, could you please:

  • update CHANGELOG.txt with a description of your changes
  • add regression tests for most_similar to ensure its behavior didn't change. The existing tests either don't check the outcome [1], are commented out [2], or are not robust [3]
  • resolve the merge conflicts in this pull request (follow the GitHub instructions)
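
A regression test of the kind requested above could be sketched like this. Everything here is hypothetical: the tiny `VECTORS` table and the stand-in `most_similar` function only illustrate the pattern (pin the ranking and the scores of the default path, and check that passing indexer=None changes nothing). Real tests would call the model's own most_similar on a small trained model.

```python
import math
import unittest

# Tiny fixed "model" (hypothetical data), chosen so the ranking is unambiguous.
VECTORS = {"sun": [1.0, 0.0], "star": [1.0, 1.0], "moon": [0.0, 1.0]}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (norm_a * norm_b)


def most_similar(query, topn=10, indexer=None):
    """Stand-in for the method under test."""
    if indexer is not None:
        return indexer.get_nearest_items(query, top_n=topn)
    ranked = sorted(
        ((label, cosine(query, vector)) for label, vector in VECTORS.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]


class TestMostSimilarRegression(unittest.TestCase):
    def test_default_path_is_pinned(self):
        # Check both the order and the similarity values, not just the length.
        result = most_similar([1.0, 0.0], topn=2)
        self.assertEqual([label for label, _ in result], ["sun", "star"])
        self.assertAlmostEqual(result[0][1], 1.0)
        self.assertAlmostEqual(result[1][1], 1.0 / math.sqrt(2.0))

    def test_indexer_none_matches_omitted_argument(self):
        # The new keyword must not change results when it is left unset.
        self.assertEqual(
            most_similar([0.0, 1.0], indexer=None),
            most_similar([0.0, 1.0]),
        )
```

The test case can be run with the standard `python -m unittest` runner once it is placed in a test module.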

anhldbk commented Jan 11, 2016

@tmylk I'm glad to hear that. It's actually the first time I've worked on a GitHub project. Could you please tell me how to fulfill your request?

I'm thinking of these steps:

  • Pull the latest source from the main branch
  • Modify the source to include my logic
  • Update CHANGELOG.txt
  • Run the tests. How do I run them, and what should I do with the results?
  • Resolve any conflicts
  • Create a new pull request and close this one

Is that OK?

Contributor

tmylk commented Jan 24, 2016

@piskvorky ping

@piskvorky piskvorky self-assigned this Jan 24, 2016

anhldbk commented Feb 4, 2016

@tmylk So what have I got to do?

Member

piskvorky commented Feb 4, 2016

@anhldbk yes, your plan sounds good. I don't know why the commits are doubled in this PR -- creating a new pull request, branching off latest develop, should solve this.

More information can be found at: https://github.com/piskvorky/gensim/wiki/Ideas-&-Feature-proposals#integrate-a-fast-k-nn-library
Implemented by: Anh Le (anhldbk@gmail.com)

piskvorky Feb 4, 2016

Member

This belongs in the file header (see other source files in gensim).

Implemented by: Anh Le (anhldbk@gmail.com)
Website: http://bigsonata.com

piskvorky Feb 4, 2016

Member

No such personal plugs in gensim.

Website: http://bigsonata.com
Release 1: November 2015

piskvorky Feb 4, 2016

Member

Again, belongs either to the file header, or -- even better -- delete completely (history changes are handled by git).

    def add_item(self, label, vector):
        """
        Add an item to index
        :param label: the label of the item (must be unique).

piskvorky Feb 4, 2016

Member

We don't use Java-style docstrings in gensim. Either describe the params in free text, or structure them using the Google docstring style.
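
For illustration, the same methods written with Google-style docstrings might look like the sketch below. The dict-based storage and the ValueError on duplicate labels are assumptions added only so the example runs; they are not taken from the pull request.

```python
class NeighborIndexer(object):
    """Base class for k-NN libraries (docstring-style illustration only)."""

    def __init__(self):
        # Assumed in-memory storage, used only to make the sketch runnable.
        self._items = {}

    def add_item(self, label, vector):
        """Add an item to the index.

        Args:
            label: Label of the item; must be unique within the index.
            vector: The item's vector.

        Raises:
            ValueError: If `label` was already added (an assumed behavior).
        """
        if label in self._items:
            raise ValueError("duplicate label: %r" % (label,))
        self._items[label] = vector

    def get_item(self, label):
        """Get an item's vector by its label.

        Args:
            label: Label of the item to look up.

        Returns:
            The item's vector, or None if the label is unknown.
        """
        return self._items.get(label)
```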

anhldbk commented Feb 25, 2016

@piskvorky I'm gonna close this PR to make a new one.
