Error “too many values to unpack” when trying to get similarities in Gensim using LDA model #2644

brandatastudio · 2019-10-23T12:22:55Z

Problem description

I'm trying to use a trained LDA model to compute the similarity between the model's documents, stored in corpus, and new documents unseen by the model.

Steps/code/corpus to reproduce

I'm using an Anaconda environment with Python 3.7 and gensim 3.8.0. My data is a DataFrame that I split into a test set and a training set; both have this structure:

X_test and X_train DataFrame format:

 id                                            alltext  
1710  3264537  [exmodelo, karen, mcdougal, asegura, mantuvo, ...   
8211  3272079  [grupo, socialista, pionero, supone, apoyar, n...   
1885  3263933  [parte, entrenador, zaragoza, javier, aguirre,...   
2481  3263744  [fans, hielo, fuego, saga, literaria, dio, pie...   
2975  3265302  [actividad, busca, repetir, tres, ediciones, a...

The text is already preprocessed.

This is the code I use for creating my model

import gensim
from gensim import corpora, similarities
from pprint import pprint

id2word = corpora.Dictionary(X_train["alltext"])
texts = X_train["alltext"]
corpus = [id2word.doc2bow(text) for text in texts]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
    id2word=id2word,
    num_topics=20,
    random_state=100, 
    update_every=1, 
    chunksize=400, 
    passes=10, 
    alpha='auto',
    per_word_topics=True)

Up to here, everything works fine. I can use

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

to get my topics.

The problem comes when I try to compute the similarity between a new document and the corpus. Here is the code I'm using:

newddoc = X_test["alltext"][2730] #I get a particular instance of the test_set
new_doc_freq_vector = id2word.doc2bow(newddoc)  #vectorize its list of words
model_vec= lda_model[new_doc_freq_vector] #run the trained model on it
index = similarities.MatrixSimilarity(lda_model[corpus]) # error
sims = index[model_vec] #error

In the last two lines, I get this error:

-------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-110-352248c464f8> in <module>
      4 
      5 #index = Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)) #the first argument, the place where the
----> 6 index = similarities.MatrixSimilarity(lda_model[corpus]) # works if we use just corpus instead of lda_model[corpus]
      7 index = similarities.MatrixSimilarity(model_vec)
      8 #sims = index[model_vec] #works if we use index[new_doc_freq_vector] instead of model_vec

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\similarities\docsim.py in __init__(self, corpus, num_best, dtype, num_features, chunksize, corpus_len)
    776                 "scanning corpus to determine the number of features (consider setting `num_features` explicitly)"
    777             )
--> 778             num_features = 1 + utils.get_max_id(corpus)
    779 
    780         self.num_features = num_features

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in get_max_id(corpus)
    734     for document in corpus:
    735         if document:
--> 736             maxid = max(maxid, max(fieldid for fieldid, _ in document))
    737     return maxid
    738 

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in <genexpr>(.0)
    734     for document in corpus:
    735         if document:
--> 736             maxid = max(maxid, max(fieldid for fieldid, _ in document))
    737     return maxid
    738 

ValueError: too many values to unpack (expected 2)

Versions

Anaconda environment, Python 3.7, gensim 3.8.0.
Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)


brandatastudio · 2019-10-23T12:31:17Z

Things I have tried to solve this:

Using

Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)).

But it did not work; I got the same error.

If I replace lda_model[corpus] with corpus, and index[model_vec] with index[new_doc_freq_vector], similarities.MatrixSimilarity() works. But I believe it does not give the proper result, because it does not include the model information. The fact that it works tells me the problem has something to do with data types (?). If I print lda_model[corpus] I get

<gensim.interfaces.TransformedCorpus object at 0x00000221ECA8E148>
I have no idea what this means, though.

piskvorky · 2019-11-04T12:14:07Z

Can you show the output of:

print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)

Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing.

In general it's always a good idea to:

post your logs at INFO level, and
eyeball samples of data going in / coming out of your pipeline at various points (the log will show some of that too).
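
The two suggestions above can be sketched as follows. This is a minimal sketch using only the standard library: the log format string is just a common choice, and `peek` is a hypothetical helper name, not part of gensim. The sample documents are made up.

```python
import logging
from itertools import islice

# Send INFO-level log records to the console so that gensim's progress
# messages become visible while training and indexing.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)
# basicConfig does nothing if logging was already configured elsewhere,
# so also set the level on the "gensim" logger explicitly.
logging.getLogger("gensim").setLevel(logging.INFO)


def peek(iterable, n=3):
    """Return the first n items of any iterable (hypothetical helper)."""
    return list(islice(iterable, n))


# Iterating sidesteps label-versus-position surprises from pandas indexing.
sample = peek(iter([["grupo", "socialista"], ["parte", "entrenador"]]), n=1)
print(sample)
```

Calling `peek` on the tokenized texts, the bag-of-words corpus, and the transformed corpus in turn is one way to check the data at each stage of the pipeline.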

CelesteM · 2019-11-06T23:51:50Z

Have you resolved the problem? I'm facing something similar here.

mpenkov · 2019-11-10T03:51:53Z

@brandatastudio Ping. Could you please provide the info requested in this comment?

brandatastudio · 2019-11-10T09:52:33Z

Hello everyone, sorry for the delay. I have been so immersed in the project that I could not find the time earlier. Yes, I think I found the cause, though I have not developed an algorithm to overcome it. I answered this same question in my own Stack Overflow post, with the help of other users: https://stackoverflow.com/questions/58522356/error-too-many-values-to-unpack-when-trying-to-get-similiraties-in-gensim-usin/58566190#58566190. Basically, You need to transform the output of the lda before applying similarity, because the output of the function is different from lsi's (check the post for more details). That is my hypothesis about the origin of the problem (if someone else has more insight or more validated conclusions, please feel free to post or correct me). As for the solution, I don't have one implemented yet, but I will start my search from here when I eventually tackle the problem: https://www.kaggle.com/ktattan/lda-and-document-similarity. Right now I am focused on investigating other aspects of my recommender system using LSI, but eventually I will try to create one with LDA. When I do, I will be sure to update here. Hope this helps; I don't have much more to offer at this time.

piskvorky · 2019-11-10T10:18:31Z

You need to transform the output of the lda before applying similarity, because the output of the function is different from lsi's

The output of LDA is in the exact same format as LSI: Gensim's sparse vector format.

If you're seeing something different (and you're reasonably certain it's not some user error on your side), please open a bug report.

brandatastudio · 2019-11-10T10:55:24Z

You need to transform the output of the lda before applying similarity, because the output of the function is different from lsi's

The output of LDA is in the exact same format as LSI: Gensim's sparse vector format.

If you're seeing something different (and you're reasonably certain it's not some user error on your side), please open a bug report.

For a vectorized document such as this one

when you apply a trained LDA model, the output is this

and when you apply a trained LSI model, the output is this

Both use the same number of topics for each of the trained models, the same corpus, and the same vectorized input document. I thought this was justified, since the mathematical output of each model should be different, as explained on page 4 of this paper, which describes the processing applied to each model's output to perform recommendation: http://ceur-ws.org/Vol-1815/paper4.pdf.

If you are telling me that these lines of code should have the same output, to then calculate similarity like this

then obviously there is a bug, because the LDA output is not in the same format as the LSI output.

Just for completeness, here is the code used to train the LSI and LDA models, respectively.

lsi_model = gensim.models.LsiModel(corpus=vectorized_corpus, id2word=id2word, num_topics=20, chunksize=400, power_iters = 10)

lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus,
                                           id2word=id2word, 
                                           num_topics=20,  
                                           random_state=100, 
                                           update_every=1, 
                                           chunksize=400,
                                           passes=10, 
                                           alpha='auto', 
                                           per_word_topics=True)

Please confirm that you are sure their output should be in the same format. If so, I need to report a bug or share my code with someone who can determine for sure that there is no user error on my part, although I think I already shared it in this issue's original post.

piskvorky · 2019-11-10T11:10:42Z

You're explicitly requesting per_word_topics=True, which changes the output format. By default, LSI and LDA have the same output format.
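
To make the format change concrete, here is a minimal sketch in plain Python (no gensim needed) of why the indexing step fails. `get_max_id` below copies the loop shown in the traceback; its starting value of -1 is an assumption. The two sample documents are made-up data shaped like the default output and, as far as I understand it, like the three-part output produced with per_word_topics=True.

```python
def get_max_id(corpus):
    """Same loop as the gensim.utils.get_max_id lines in the traceback."""
    maxid = -1  # assumed starting value
    for document in corpus:
        if document:
            maxid = max(maxid, max(fieldid for fieldid, _ in document))
    return maxid


# Default output: one sparse vector, a list of (topic_id, weight) pairs.
plain_doc = [(0, 0.61), (3, 0.27), (7, 0.12)]

# Output with per_word_topics=True: a 3-tuple per document
# (document topics, word -> topic ids, word -> (topic id, phi) pairs).
# All numbers here are invented for illustration.
per_word_doc = (
    [(0, 0.61), (3, 0.27), (7, 0.12)],
    [(11, [0, 3]), (42, [7])],
    [(11, [(0, 0.9), (3, 0.1)]), (42, [(7, 1.0)])],
)

print(get_max_id([plain_doc]))  # works: every item is an (id, weight) pair

try:
    get_max_id([per_word_doc])
except ValueError as err:
    # The first item of the 3-tuple is a whole list of three pairs, so it
    # cannot be unpacked into `fieldid, _`.
    message = str(err)
    print("ValueError:", message)
```

The loop expects each document to be a flat list of pairs, so a document that is itself a tuple of lists breaks the `fieldid, _` unpacking.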

See also the documentation at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics

Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither. @brandatastudio can you open a PR to fix that docstring? CC @mpenkov @menshikh-iv .

brandatastudio · 2019-11-10T11:16:13Z

You're explicitly requesting per_word_topics=True, which changes the output format. By default, LSI and LDA have the same output format.

See also the documentation at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics

Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither. CC @mpenkov @menshikh-iv .

I don't understand what you mean by " Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither. "

I will try using per_word_topics = False, then calculate the similarity and get back to you. Hopefully that solves this issue entirely.

piskvorky · 2019-11-10T11:17:59Z

Don't just try random parameters randomly. Choose the parameters that match your goal. Why did you set the (non-default) per_word_topics=True in the first place?

In any case, a PR fixing the docstring will be welcome.

brandatastudio · 2019-11-10T11:21:19Z

Sorry, I closed it by accident. I need to revisit this problem, but not right now. I'm busy at the moment and will get back to you ASAP. PS: what do you mean by the docstring?

piskvorky · 2019-11-10T11:54:35Z

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

brandatastudio · 2019-11-10T13:36:54Z

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

so if I understand correctly, you are not asking me to fix it. Not sure if you are talking to me there or not.

The reason I chose the value True is that this was my first experiment with the library; I had never used it before. Honestly, I don't think the documentation explains how that argument affects the output and the similarity calculation (in the documentation, similarity is demonstrated with LSI as the example model; I have never seen an example of similarity calculation with LDA in gensim). I used True because I thought the argument just added more information to the model, but I was not truly sure what it did (I don't see a clear explanation of that argument anywhere; the one here https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics is not simple enough for me, I guess).

I did not fully understand it and still don't, so I thought I would just try it out. Seeing that the experiment failed, I posted this question here and on Stack Overflow to get feedback from people who know better than me, because I thought they would know whether that was the problem. Sorry if it seems like random testing.

Seeing how you mentioned that argument affecting the format: could it be that, instead of giving me the topic probabilities for a document as LSI did, that argument causes the model to output the topic probability of each word in the document, and that is why the output is different? If so, using False should solve the problem, right? If not, an explanation of what that argument does would indeed be helpful, both for me and for the documentation in general.
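
As an illustration of the question above, here is a minimal sketch of keeping only the document-level topic probabilities when a result also carries per-word information. It assumes the per_word_topics=True result is a 3-tuple whose first element is the usual list of (topic_id, probability) pairs. `document_topics_only` is a hypothetical helper, not a gensim function, and the sample values are made up.

```python
def document_topics_only(model_output):
    """Return just the (topic_id, probability) list from an LDA result.

    Hypothetical helper: accepts either the default sparse vector (a list)
    or the 3-tuple assumed to be produced when per_word_topics=True.
    """
    if isinstance(model_output, tuple) and len(model_output) == 3:
        return model_output[0]
    return model_output


# Made-up results in the two shapes discussed in this thread.
default_output = [(0, 0.61), (3, 0.27)]
per_word_output = (
    [(0, 0.61), (3, 0.27)],          # document topics
    [(11, [0, 3])],                  # word id -> topic ids
    [(11, [(0, 0.9), (3, 0.1)])],    # word id -> (topic id, phi) pairs
)

print(document_topics_only(default_output))
print(document_topics_only(per_word_output))
```

Under that assumption, both calls yield the same sparse vector, which is the shape a similarity index expects. Training the model without per_word_topics=True avoids the need for such a helper altogether.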

PS: I will soon provide the output requested before. I thought the problem was simpler, so I did not give it priority. Sorry for the delay.

Can you show the output of:
print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)
Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing.

In general it's always a good idea to:

post your logs at INFO level, and

eyeball samples of data going in / coming out of your pipeline at various points (the log will show some of that too).

brandatastudio · 2019-11-10T14:18:06Z

Output requested

texts output

Vectorized corpus ( corpus in the original question, I have renamed my code since then ) output

here up to three objects

Newdoc output

new_doc_freq_vec

and new model vec ( model vec in the original question )

Index gives me no output, it's an error. The one mentioned on the question originally.


brandatastudio · 2019-11-10T14:39:57Z

Looks like that was the cause of the problem, similarity was effectively calculated using LDA after changing that argument to False.

Here is what the code looks like:

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus,
    id2word=id2word,
    num_topics=20,
    random_state=100,
    update_every=1,
    chunksize=400,
    passes=10,
    alpha='auto')
new_doc_vector again

the output for the new_model_vec again

the index_output now

and the similarity calculation output

Thank you for your help.

piskvorky · 2019-11-10T17:49:49Z

No problem.

so if I understand correctly, you are not asking me to fix it. Not sure if you are talking to me there or not.

Yeah I meant for you to fix the docstring for LdaModel.get_document_topics, if you can. The current phrasing is too opaque, and fixing the docstring should be fairly trivial. It would help others looking at the documentation in the future.

mpenkov · 2019-12-21T03:04:16Z

@piskvorky Sounds like the most important part of this issue is:

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

Right? Can we gloss over everything else?

piskvorky · 2019-12-21T09:12:38Z

Yes, the documentation is weird and confusing.

mpenkov added the need info Not enough information for reproduce an issue, need more info from author label Nov 10, 2019

brandatastudio closed this as completed Nov 10, 2019

brandatastudio reopened this Nov 10, 2019

mpenkov added documentation Current issue related to documentation and removed need info Not enough information for reproduce an issue, need more info from author labels Dec 21, 2019
