Flexible API for nearest neighbour search #548
This is an implementation for issue #527
I've added neighbor_sim.py to gensim/models; it defines the NeighborIndexer classes.
```python
# Imports needed by the type checks below
from gensim.models.doc2vec import Doc2Vec
from gensim.models.word2vec import Word2Vec


class NeighborIndexer(object):
    """
    Base class for k-NN libraries
    """

    def __init__(self, model=None, **kwargs):
        """
        Create a new indexer
        :param model: Instance of Doc2Vec or Word2Vec
        :param kwargs: Additional named parameters
        """
        if model is None:
            raise Exception("Invalid model parameter. Please provide model with an instance of Doc2Vec or Word2Vec")
        if type(model) is Doc2Vec:
            self._init_doc2vec_(model)
        elif type(model) is Word2Vec:
            self._init_word2vec_(model)
        else:
            raise Exception("Invalid model parameter. Please provide model with an instance of Doc2Vec or Word2Vec")
        self.start_indexing()

    def _init_doc2vec_(self, model):
        # Index every normalized document vector under its doctag
        docvecs = model.docvecs
        docvecs.init_sims()
        size = len(docvecs.doctag_syn0norm)
        for i in range(size):
            self.add_item(docvecs.offset2doctag[i], docvecs.doctag_syn0norm[i])

    def _init_word2vec_(self, model):
        # raise Exception("Not supported at the moment")
        # Index every normalized word vector under its word
        model.init_sims()
        size = len(model.syn0norm)
        for i in range(size):
            self.add_item(model.index2word[i], model.syn0norm[i])

    def get_item(self, label):
        """
        Get an item's vector by its label
        :param label: The label
        :return: The item's vector if it exists. Otherwise, returns None
        """
        pass

    def add_item(self, label, vector):
        """
        Add an item to index
        :param label: the label of the item (must be unique).
        :param vector: the item's vector.
        """
        pass

    def start_indexing(self):
        """
        Start the indexing operation.
        This method is supposed to run after all items are added.
        It may take time to finish.
        """
        pass

    def get_nearest_items(self, vector, top_n=10):
        """
        Get nearest items for an item described by its vector
        :param vector: The vector
        :param top_n: Number of nearest items to get
        :return: A list of tuples (label, similarity) for the nearest ones
        """
        pass

    def save(self, file_name):
        """
        Dump internal data to disk
        :param file_name: Output file path
        :return: Returns true on success. Otherwise, returns false
        """
        pass

    def load(self, file_name):
        """
        Load previously dumped info
        :param file_name: Input file path
        :return: Returns true on success. Otherwise, returns false
        """
        pass
```
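To illustrate the interface, here is a minimal sketch of what a concrete indexer could look like. `BruteForceIndexer` is a hypothetical name and is not part of this PR: it does an exact linear scan with cosine similarity instead of calling a k-NN library, and it is written as a standalone class (no gensim model in the constructor) so it runs on its own.

```python
import math
import pickle


class BruteForceIndexer(object):
    """Hypothetical exact indexer that follows the NeighborIndexer method names."""

    def __init__(self):
        self._vectors = {}  # label -> unit-length vector

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0.0:
            raise ValueError("cannot index a zero vector")
        return [x / norm for x in vector]

    def add_item(self, label, vector):
        if label in self._vectors:
            raise ValueError("label must be unique: %r" % (label,))
        self._vectors[label] = self._normalize(vector)

    def get_item(self, label):
        # Returns the stored (normalized) vector, or None if the label is unknown
        return self._vectors.get(label)

    def start_indexing(self):
        # A linear scan has nothing to build; a real k-NN backend would build here
        pass

    def get_nearest_items(self, vector, top_n=10):
        query = self._normalize(vector)
        # Dot product of unit vectors equals cosine similarity
        scored = [
            (label, sum(a * b for a, b in zip(query, stored)))
            for label, stored in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n]

    def save(self, file_name):
        with open(file_name, "wb") as handle:
            pickle.dump(self._vectors, handle)
        return True

    def load(self, file_name):
        with open(file_name, "rb") as handle:
            self._vectors = pickle.load(handle)
        return True


indexer = BruteForceIndexer()
indexer.add_item("sun", [1.0, 0.0])
indexer.add_item("moon", [0.0, 1.0])
indexer.add_item("star", [1.0, 1.0])
indexer.start_indexing()
nearest = indexer.get_nearest_items([2.0, 0.1], top_n=2)
print(nearest)
```

An exact scan like this is only practical for small collections, but it can serve as a reference when checking the results of an approximate backend.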
To make integration easier, I've modified the most_similar() functions in doc2vec.py and word2vec.py by adding an optional indexer parameter.
```python
# at line #409, doc2vec.py, the new function looks like this
def most_similar(self, positive=[], negative=[], topn=10, clip_start=0, clip_end=None, indexer=None):
    ...  # function body not shown in this description
```
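The body of the modified function is not shown here, so the following is only a guess at the dispatch logic: when an indexer is supplied, delegate to its `get_nearest_items`; otherwise fall back to an exact scan. `most_similar_sketch`, `cosine`, and `FakeIndexer` are hypothetical names used for illustration, and the real function also handles positive/negative lists and clipping, which this sketch leaves out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_sketch(vectors, query, topn=10, indexer=None):
    """vectors: dict mapping label -> vector; query: the vector to match."""
    if indexer is not None:
        # Assumed behaviour: hand the query to the indexer and return its answer
        return indexer.get_nearest_items(query, top_n=topn)
    # Default path: exact comparison against every stored vector
    scored = [(label, cosine(query, vec)) for label, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


class FakeIndexer(object):
    """Stand-in indexer that records how it was called."""

    def __init__(self):
        self.calls = []

    def get_nearest_items(self, vector, top_n=10):
        self.calls.append((tuple(vector), top_n))
        return [("from-index", 1.0)][:top_n]


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
exact = most_similar_sketch(vectors, [1.0, 0.2], topn=1)
fake = FakeIndexer()
approx = most_similar_sketch(vectors, [1.0, 0.2], topn=1, indexer=fake)
print(exact, approx)
```

Keeping `indexer=None` as the default means existing callers would see no change in behaviour.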
To use the indexers, you may use the snippet below:
```python
from gensim.models.doc2vec import Doc2Vec
from gensim.models.neighbor_sim import AnnoyIndexer

model = Doc2Vec.load(fname)

# create an indexer
# features_num = size of each document's vector
# You may need to optimize your *tree_size* parameter
indexer = AnnoyIndexer(model=model, features_num=100, tree_size=300)
# it may take time to finish the indexing operation

document = "Here comes the sun"
# infer a vector for the new document from its tokens
vector = model.infer_vector(document.lower().split())

# get most similar documents
similar_documents = model.most_similar([vector], indexer=indexer)
```
Currently, only Annoy-based NeighborIndexer is supported.
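Since `get_nearest_items` is documented to return (label, similarity) pairs while Annoy reports distances, an Annoy-based indexer needs some conversion between the two. How this PR does it is not shown; the sketch below assumes Annoy's angular metric, where the distance between two vectors is sqrt(2(1 - cos)), and inverts that formula. `angular_distance_to_similarity` and `angular_distance` are hypothetical helper names.

```python
import math


def angular_distance(a, b):
    """Euclidean distance between the unit-normalized vectors (Annoy's 'angular')."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.sqrt(
        sum((x / norm_a - y / norm_b) ** 2 for x, y in zip(a, b))
    )


def angular_distance_to_similarity(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


a = [1.0, 0.0]
b = [1.0, 1.0]
distance = angular_distance(a, b)
similarity = angular_distance_to_similarity(distance)
print(distance, similarity)
```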
@anhldbk Would be good to include this in the next release. For that purpose could you please:
@tmylk I'm glad to hear that. Actually, this is the first time I've worked on a GitHub project. Would you please tell me how to fulfill your request?
I'm thinking of these steps:
Is that ok?