Flexible API for nearest neighbour search #548

Closed
wants to merge 7 commits into from

@anhldbk

anhldbk commented Nov 25, 2015

This is an implementation for issue #527

I've added neighbor_sim.py in directory gensim/models in which I defined NeighborIndexer classes.

from gensim.models.doc2vec import Doc2Vec
from gensim.models.word2vec import Word2Vec


class NeighborIndexer(object):
    """
    Base class for k-NN libraries
    """

    def __init__(self, model=None, **kwargs):
        """
        Create a new indexer
        :param model:   Instance of Doc2Vec or Word2Vec
        :param kwargs:  Additional named parameters
        """
        # Doc2Vec subclasses Word2Vec, so it has to be checked first
        if isinstance(model, Doc2Vec):
            self._init_doc2vec_(model)
        elif isinstance(model, Word2Vec):
            self._init_word2vec_(model)
        else:
            # also covers model=None
            raise TypeError("Invalid model parameter. Please provide model with an instance of Doc2Vec or Word2Vec")
        self.start_indexing()

    def _init_doc2vec_(self, model):
        docvecs = model.docvecs
        docvecs.init_sims()
        size = len(docvecs.doctag_syn0norm)

        for i in range(size):
            self.add_item(docvecs.offset2doctag[i], docvecs.doctag_syn0norm[i])

    def _init_word2vec_(self, model):
        model.init_sims()
        size = len(model.syn0norm)

        for i in range(size):
            self.add_item(model.index2word[i], model.syn0norm[i])

    def get_item(self, label):
        """
        Get an item's vector by its label
        :param label: The label
        :return: The item's vector if the label is indexed. Otherwise, returns None
        """
        pass

    def add_item(self, label, vector):
        """
        Add an item to index
        :param label: the label of the item (must be unique).
        :param vector: the item's vector.
        """
        pass

    def start_indexing(self):
        """
        Start the indexing operation. This method is supposed to run after all items are added.
        It may take time to finish.
        """
        pass

    def get_nearest_items(self, vector, top_n=10):
        """
        Get nearest items for an item described by its vector
        :param vector: The vector
        :param top_n: Number of nearest items to get
        :return: A list of tuples (label, similarity) for the nearest ones
        """
        pass

    def save(self, file_name):
        """
        Dump internal data to disk
        :param file_name: Output file path
        :return: Returns True on success. Otherwise, returns False
        """
        pass

    def load(self, file_name):
        """
        Load previously dumped data
        :param file_name: Input file path
        :return: Returns True on success. Otherwise, returns False
        """
        pass
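
To make the interface above concrete, here is a minimal sketch of an indexer that offers the same methods using exact (brute-force) cosine search. It is not part of this PR: `BruteForceIndexer` is a hypothetical name, it is a standalone class that does not take a gensim model, and it stores its data as JSON purely for illustration.

```python
import json
import math


class BruteForceIndexer(object):
    """Hypothetical exact-search indexer exposing the NeighborIndexer methods."""

    def __init__(self):
        self._labels = []     # item labels, in insertion order
        self._vectors = []    # unit-length vectors, parallel to _labels
        self._positions = {}  # label -> position in the two lists above

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        return [x / norm for x in vector]

    def add_item(self, label, vector):
        if label in self._positions:
            raise ValueError("duplicate label: %r" % (label,))
        self._positions[label] = len(self._labels)
        self._labels.append(label)
        self._vectors.append(self._normalize(vector))

    def get_item(self, label):
        position = self._positions.get(label)
        return None if position is None else self._vectors[position]

    def start_indexing(self):
        # Exact search needs no index structure, so there is nothing to build.
        pass

    def get_nearest_items(self, vector, top_n=10):
        query = self._normalize(vector)
        # The dot product of two unit vectors is their cosine similarity.
        scored = [
            (label, sum(q * x for q, x in zip(query, item)))
            for label, item in zip(self._labels, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n]

    def save(self, file_name):
        with open(file_name, "w") as handle:
            json.dump({"labels": self._labels, "vectors": self._vectors}, handle)
        return True

    def load(self, file_name):
        with open(file_name) as handle:
            data = json.load(handle)
        self._labels = data["labels"]
        self._vectors = data["vectors"]
        self._positions = {label: i for i, label in enumerate(self._labels)}
        return True
```

An exact indexer like this is slow on large models, but it can serve as a reference when checking the results of an approximate one.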

To make the integration process easier, I've modified the most_similar() functions in doc2vec.py and word2vec.py by adding an optional indexer parameter.

For example:

# at line #409, doc2vec.py, the new function looks like this
def most_similar(self, positive=[], negative=[], topn=10, clip_start=0, clip_end=None, indexer=None):
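
How the optional indexer parameter might change the control flow can be sketched as follows. This is a simplified, standalone function and not the code from the PR: it assumes the queries are raw vectors (gensim also accepts labels), it leaves out clip_start and clip_end, and `unit` is a hypothetical helper.

```python
import math


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar(vectors, positive=(), negative=(), topn=10, indexer=None):
    """Rank labels by cosine similarity to the mean of the query vectors.

    `vectors` maps label -> unit-length vector. `positive` and `negative`
    hold raw query vectors. When `indexer` is given, the ranking is delegated
    to its get_nearest_items() method instead of scanning every vector.
    """
    weighted = [(unit(v), 1.0) for v in positive] + [(unit(v), -1.0) for v in negative]
    if not weighted:
        raise ValueError("cannot compute similarity with no input")
    dims = len(weighted[0][0])
    mean = unit([sum(w * v[i] for v, w in weighted) / len(weighted) for i in range(dims)])

    if indexer is not None:
        # Approximate path: let the k-NN index answer the query.
        return indexer.get_nearest_items(mean, top_n=topn)

    # Exact path: compare the query against every stored vector.
    scored = [(label, sum(m * x for m, x in zip(mean, vec))) for label, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]
```

Because indexer defaults to None, existing callers keep the exact behaviour and only opt into the approximate search explicitly.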

To use the indexers, you may use the snippet below:

from gensim.models.doc2vec import Doc2Vec
from gensim.models.neighbor_sim import AnnoyIndexer

model = Doc2Vec.load(fname)

# create an indexer; the indexing operation may take time to finish
# features_num = size of each document's vector
# you may need to tune the tree_size parameter
indexer = AnnoyIndexer(model=model, features_num=100, tree_size=300)

# infer a vector for a new document
document = "Here comes the sun"
vector = model.infer_vector(document.lower().split())

# get the most similar documents
similar_documents = model.docvecs.most_similar([vector], indexer=indexer)

Currently, only Annoy-based NeighborIndexer is supported.
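
One detail an Annoy-based indexer has to handle is that Annoy's angular metric returns a distance, while get_nearest_items() is documented to return similarities. To my knowledge Annoy defines the angular distance as the Euclidean distance between the normalized vectors, i.e. sqrt(2 * (1 - cos)), so the conversion could look like the sketch below. Both function names are hypothetical, and `angular_distance` only mimics that definition so the example runs without Annoy installed.

```python
import math


def angular_distance(u, v):
    """Euclidean distance between u and v after scaling both to unit length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return math.sqrt(sum((a / norm_u - b / norm_v) ** 2 for a, b in zip(u, v)))


def distance_to_similarity(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


# Identical directions give similarity 1, orthogonal ones 0, opposite ones -1.
for other in ([2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]):
    d = angular_distance([1.0, 0.0], other)
    print(other, round(distance_to_similarity(d), 6))
```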

- utils.HAS_PATTERN, has also changed to utils.has_pattern()
* Word2vec allows non-strict unicode error handling (ignore or replace). (Gordon Mohr, #466)

0.12.3, 05/11/2015

@piskvorky

piskvorky Nov 25, 2015

Member

Why this change?

@anhldbk

anhldbk Nov 25, 2015

Author

@piskvorky Sorry, just an error during merging process. Pls discard it.

@piskvorky piskvorky changed the title from "Anh le" to "Flexible API for nearest neighbour search" on Nov 25, 2015
@piskvorky
Member

piskvorky commented Nov 25, 2015

Thanks @anhldbk ! Will review asap :)

@piskvorky
Member

piskvorky commented Nov 25, 2015

What's up with the commits by @tmylk in this PR?

Also, the commits (including yours) seem to be doubled. Let me know if you need assistance there -- what happened?

@anhldbk
Author

anhldbk commented Nov 27, 2015

@piskvorky Me? Actually I've just made my first pull request on github. lol It's ok now.

@tmylk
Contributor

tmylk commented Jan 9, 2016

@anhldbk Would be good to include this in the next release. For that purpose could you please:

  • update CHANGELOG.txt with description of your changes
  • add regression tests for most_similar to ensure its output didn't change. The existing tests either don't test the outcome [1], are commented out [2], or are not robust [3]
  • resolve the merge conflicts in this pull request (follow github instructions)
@anhldbk
Author

anhldbk commented Jan 11, 2016

@tmylk I'm glad to hear that. Actually it's the first time ever I've worked on a Github project. Would you pls tell me how to fulfill your request?

I'm thinking of these steps:

  • Pull the latest source from the main branch
  • Modify the source to include my logic.
  • Modify CHANGELOG.txt
  • Run the tests --> How to run the tests? What will I do with the results?
  • Resolve any conflicts
  • Create a new pull request and close this one.

Is that ok ?

@tmylk
Contributor

tmylk commented Jan 24, 2016

@piskvorky ping

@piskvorky piskvorky self-assigned this Jan 24, 2016
@anhldbk
Author

anhldbk commented Feb 4, 2016

@tmylk So what have I got to do?

@piskvorky
Member

piskvorky commented Feb 4, 2016

@anhldbk yes, your plan sounds good. I don't know why the commits are doubled in this PR -- creating a new pull request, branching off latest develop, should solve this.

More information can be found at: https://github.com/piskvorky/gensim/wiki/Ideas-&-Feature-proposals#integrate-a-fast-k-nn-library
Implemented by: Anh Le (anhldbk@gmail.com)

@piskvorky

piskvorky Feb 4, 2016

Member

Belongs in the file header (see other source files in gensim).

Implemented by: Anh Le (anhldbk@gmail.com)
Website: http://bigsonata.com

@piskvorky

piskvorky Feb 4, 2016

Member

No such personal plugs in gensim.

Website: http://bigsonata.com
Release 1: November 2015

@piskvorky

piskvorky Feb 4, 2016

Member

Again, belongs either to the file header, or -- even better -- delete completely (history changes are handled by git).

    def add_item(self, label, vector):
        """
        Add an item to index
        :param label: the label of the item (must be unique).

@piskvorky

piskvorky Feb 4, 2016

Member

We don't use Java-style docstrings in gensim. Either describe the params in free text, or structure them using the Google docstring style.
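
For reference, the add_item docstring rewritten in the Google style mentioned above might look like this. It is only a sketch: `ExampleIndexer` and its dict-based storage are hypothetical and exist just to make the snippet runnable.

```python
class ExampleIndexer(object):
    """Hypothetical indexer used only to illustrate the docstring layout."""

    def __init__(self):
        self._items = {}  # label -> vector

    def add_item(self, label, vector):
        """Add an item to the index.

        Args:
            label: Unique label of the item.
            vector: The item's vector.

        Raises:
            ValueError: If `label` is already present in the index.
        """
        if label in self._items:
            raise ValueError("duplicate label: %r" % (label,))
        self._items[label] = vector
```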

@anhldbk
Author

anhldbk commented Feb 25, 2016

@piskvorky I'm gonna close this PR to make a new one.

3 participants