Error “too many values to unpack” when trying to get similarities in Gensim using LDA model #2644

Open
brandatastudio opened this issue Oct 23, 2019 · 18 comments
Labels
documentation Current issue related to documentation

Comments

@brandatastudio

brandatastudio commented Oct 23, 2019

Problem description

I'm trying to use a trained LDA model to compare similarity between the model's documents, stored in corpus, and new documents unseen by the model.

Steps/code/corpus to reproduce

I'm using an Anaconda environment with Python 3.7 and gensim 3.8.0. My data is a dataframe that I split into a test set and a training set; both have this structure:

X_test and X_train dataframe format:

 id                                            alltext  
1710  3264537  [exmodelo, karen, mcdougal, asegura, mantuvo, ...   
8211  3272079  [grupo, socialista, pionero, supone, apoyar, n...   
1885  3263933  [parte, entrenador, zaragoza, javier, aguirre,...   
2481  3263744  [fans, hielo, fuego, saga, literaria, dio, pie...   
2975  3265302  [actividad, busca, repetir, tres, ediciones, a... 

The text is already preprocessed.

This is the code I use for creating my model

import gensim
from gensim import corpora

# Build the dictionary and the bag-of-words corpus from the tokenized training texts
id2word = corpora.Dictionary(X_train["alltext"])
texts = X_train["alltext"]
corpus = [id2word.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
    id2word=id2word,
    num_topics=20,
    random_state=100,
    update_every=1,
    chunksize=400,
    passes=10,
    alpha='auto',
    per_word_topics=True)

Up to here, everything works fine. I can use

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

to get my topics.

The problem comes when I try to compare similarity between a new document and the corpus. Here is the code I'm using:

newddoc = X_test["alltext"][2730] #I get a particular instance of the test_set
new_doc_freq_vector = id2word.doc2bow(newddoc)  #vectorize its list of words
model_vec= lda_model[new_doc_freq_vector] #run the trained model on it
index = similarities.MatrixSimilarity(lda_model[corpus]) # error
sims = index[model_vec] #error

In the last two lines, I get this error:

-------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-110-352248c464f8> in <module>
      4 
      5 #index = Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)) #the first argument, the place where the
----> 6 index = similarities.MatrixSimilarity(lda_model[corpus]) # works if we use just corpus instead of lda_model[corpus]
      7 index = similarities.MatrixSimilarity(model_vec)
      8 #sims = index[model_vec] #works if we use index[new_doc_freq_vector] instead of model_vec

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\similarities\docsim.py in __init__(self, corpus, num_best, dtype, num_features, chunksize, corpus_len)
    776                 "scanning corpus to determine the number of features (consider setting `num_features` explicitly)"
    777             )
--> 778             num_features = 1 + utils.get_max_id(corpus)
    779 
    780         self.num_features = num_features

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in get_max_id(corpus)
    734     for document in corpus:
    735         if document:
--> 736             maxid = max(maxid, max(fieldid for fieldid, _ in document))
    737     return maxid
    738 

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in <genexpr>(.0)
    734     for document in corpus:
    735         if document:
--> 736             maxid = max(maxid, max(fieldid for fieldid, _ in document))
    737     return maxid
    738 

ValueError: too many values to unpack (expected 2)

Versions

Anaconda environment, Python 3.7, gensim 3.8.0.
Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
@brandatastudio
Author

brandatastudio commented Oct 23, 2019

Things I have tried to solve this:

  1. Using

Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)).

But it did not work; I got the same error.

  2. If I replace lda_model[corpus] with corpus, and index[model_vec] with index[new_doc_freq_vector], similarities.MatrixSimilarity() works. But I believe it does not give the proper result, because it does not contain the model information. The fact that it works tells me the problem has something to do with data types (?). If I print lda_model[corpus] I get

<gensim.interfaces.TransformedCorpus object at 0x00000221ECA8E148>

I have no idea what this means, though.
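
For readers puzzled by the printed TransformedCorpus object: it is a lazy wrapper, so nothing is computed until it is iterated, at which point each document is passed through the model one at a time. Below is a minimal stdlib-only sketch of that idea. The class name LazyTransformedCorpus and the toy double_weights transform are hypothetical stand-ins for illustration, not gensim's implementation.

```python
class LazyTransformedCorpus:
    """Hypothetical stand-in for a lazily transformed corpus."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        # The transform runs only when someone iterates over the wrapper.
        for doc in self.corpus:
            yield self.transform(doc)


def double_weights(bow):
    """Toy transform standing in for a trained model (assumption)."""
    return [(term_id, 2 * weight) for term_id, weight in bow]


toy_corpus = [[(0, 1), (3, 2)], [(1, 1)]]
lazy = LazyTransformedCorpus(double_weights, toy_corpus)

# Printing the wrapper shows only an object representation, as in the issue.
print(lazy)

# Iterating (here via list) is what actually produces the transformed documents.
transformed = list(lazy)
print(transformed)
```

So the object representation is not itself a sign of an error; to inspect the contents, iterate over the wrapper and look at the first few items.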

@piskvorky
Owner

piskvorky commented Nov 4, 2019

Can you show the output of:

print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)

Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing.

In general it's always a good idea to:

  1. post your logs at INFO level, and
  2. eyeball samples of data going in / coming out of your pipeline at various points (the log will show some of that too).
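
The two suggestions above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the helper name peek and the toy corpus are hypothetical, and the log format string is just one reasonable choice.

```python
import itertools
import logging

# Log at INFO level; gensim reports its progress through the standard logging module.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,  # replace handlers that are already installed (Python 3.8+)
)


def peek(iterable, n=3):
    """Return the first n items of any iterable, including lazy ones."""
    return list(itertools.islice(iterable, n))


# Hypothetical bag-of-words corpus standing in for the real one.
toy_corpus = [[(0, 1), (2, 1)], [(1, 3)], [(0, 2), (1, 1)], [(2, 5)]]

sample = peek(toy_corpus, 2)
logging.info("first documents: %s", sample)
```

Using islice instead of slicing avoids depending on how a particular container (a pandas Series, a streamed corpus) implements indexing.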

@CelesteM

CelesteM commented Nov 6, 2019

Have you resolved the problem? I'm facing something similar here.

@mpenkov added the "need info" label (Not enough information for reproduce an issue, need more info from author) Nov 10, 2019
@mpenkov
Collaborator

mpenkov commented Nov 10, 2019

@brandatastudio Ping. Could you please provide the info requested in this comment?

@brandatastudio
Author

brandatastudio commented Nov 10, 2019

Hello everyone, sorry for the delay; I have been so immersed in the project that I could not find time earlier. Yes, I think I found the cause, though I have not developed a way to overcome it. I answered this same question in my own Stack Overflow post, with the help of other users: https://stackoverflow.com/questions/58522356/error-too-many-values-to-unpack-when-trying-to-get-similiraties-in-gensim-usin/58566190#58566190

Basically, you need to transform the output of the LDA model before applying similarity, because the output of the function is different from LSI's (check the post for more details). That is my hypothesis about the origin of the problem (if someone has more insight or better-validated conclusions, please feel free to post or correct me).

As for the solution, I don't have one implemented yet, but I will start my search from here when I eventually tackle the problem: https://www.kaggle.com/ktattan/lda-and-document-similarity

Right now I am focused on investigating other aspects of my recommender system using LSI, but eventually I will try to create one with LDA. When I do, I will be sure to update here. Hope this helps; I don't have much more to offer at this time.

@piskvorky
Owner

piskvorky commented Nov 10, 2019

You need to transform the output of the lda before applying similarity, because the output of the function is different from lsi's

The output of LDA is in the exact same format as LSI: Gensim's sparse vector format.

If you're seeing something different (and you're reasonably certain it's not some user error on your side), please open a bug report.

@brandatastudio
Author

brandatastudio commented Nov 10, 2019

You need to transform the output of the lda before applying similarity, because the output of the function is different from lsi's

The output of LDA is in the exact same format as LSI: Gensim's sparse vector format.

If you're seeing something different (and you're reasonably certain it's not some user error on your side), please open a bug report.

For a vectorized document such as this one:

[image: vectorizedlsi]

when you apply a trained LDA model, the output is this:

[image: outputlda]

and when you apply a trained LSI model, the output is this:

[image: outputlsi]

Both use the same number of topics for each of the trained models, the same corpus, and the same vectorized input document. I thought this was justified, since the mathematical output of each should be different, as explained on page 4 of this paper, which discusses the processing applied to each model's output to perform recommendation: http://ceur-ws.org/Vol-1815/paper4.pdf

If you are telling me that these lines of code should have the same output, so that similarity can then be calculated like this:

[image: similarity calculus]

then obviously there is a bug, because the LDA output is not in the same format as the LSI output.

Just for completeness, here is the code used for calculating both lsi and lda models respectively.

lsi_model = gensim.models.LsiModel(corpus=vectorized_corpus, id2word=id2word, num_topics=20, chunksize=400, power_iters = 10)

lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus,
                                           id2word=id2word, 
                                           num_topics=20,  
                                           random_state=100, 
                                           update_every=1, 
                                           chunksize=400,
                                           passes=10, 
                                           alpha='auto', 
                                           per_word_topics=True)

Please confirm that you are sure their output should be in the same format. If that is the case, I need to report a bug or share my code with someone who can determine for certain that there is no user error on my part, although I think I already shared it in this issue.

@piskvorky
Owner

piskvorky commented Nov 10, 2019

You're explicitly requesting per_word_topics=True, which changes the output format. By default, LSI and LDA have the same output format.

See also the documentation at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics

Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither. @brandatastudio can you open a PR to fix that docstring? CC @mpenkov @menshikh-iv .
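
To illustrate why the changed output format triggers the exact error in the traceback, here is a stdlib-only sketch. The get_max_id function mirrors the loop shown in the traceback (the initial value -1 is an assumption), and the sample vectors are invented for illustration. It also assumes that per_word_topics=True makes the model return a 3-tuple per document whose first element is the usual topic distribution, which is how gensim 3.8 appears to behave; the docstring discussed above is unclear on this point.

```python
def get_max_id(corpus):
    """Mirror of the loop shown in the traceback (gensim/utils.py)."""
    maxid = -1  # assumed starting value
    for document in corpus:
        if document:
            maxid = max(maxid, max(fieldid for fieldid, _ in document))
    return maxid


# Default output: one sparse vector per document, a list of (topic_id, probability).
default_output = [(0, 0.1), (4, 0.6), (7, 0.3)]

# Assumed output with per_word_topics=True: a 3-tuple whose first element is
# the sparse vector above; the other two carry per-word information.
per_word_output = (
    [(0, 0.1), (4, 0.6), (7, 0.3)],
    [(12, [4, 7]), (35, [4])],
    [(12, [(4, 0.9), (7, 0.1)]), (35, [(4, 1.0)])],
)

# Works: every item of the document is a 2-tuple.
max_topic_id = get_max_id([default_output])
print(max_topic_id)

# Fails: the first item of the "document" is a whole list of three pairs,
# which cannot be unpacked into (fieldid, _).
try:
    get_max_id([per_word_output])
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

The loop expects each document to be a flat list of pairs, so anything that nests the topic distribution one level deeper breaks the unpacking.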

@brandatastudio
Author

You're explicitly requesting per_word_topics=True, which changes the output format. By default, LSI and LDA have the same output format.

See also the documentation at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics

Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither. CC @mpenkov @menshikh-iv .

I don't understand what you are referring to with "Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither."

I will try using per_word_topics=False, then calculate similarity again, and get back to you. Hopefully that resolves this issue entirely.

@piskvorky
Owner

piskvorky commented Nov 10, 2019

Don't just try random parameters randomly. Choose the parameters that match your goal. Why did you set the (non-default) per_word_topics=True in the first place?

In any case, a PR fixing the docstring will be welcome.

@brandatastudio
Author

Sorry, I closed it by accident. I need to revisit this problem, but I'm busy right now; I will get back to you ASAP. PS: what do you mean by the docstring?

@piskvorky
Owner

piskvorky commented Nov 10, 2019

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

@brandatastudio
Author

brandatastudio commented Nov 10, 2019

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

So if I understand correctly, you are not asking me to fix it. I'm not sure whether you were talking to me there or not.

The reason I chose True for that value is that this was my first experiment with the library; I had never used it before. Honestly, I don't think the documentation explains how that argument affects the output and the similarity calculation (in the documentation, similarity is demonstrated with LSI as the example model; I have never seen an example of similarity calculation with gensim's LDA). I used True because I thought the argument just added more information to the model, but I was not really sure what it did (I don't see a clear explanation of that argument anywhere; the one here https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics is not dumbed down enough for me, I guess).

I did not fully understand it and still don't, so I thought I would just try it out. When the experiment failed, I posted this question here and on Stack Overflow to get feedback from people who know better than me, because I thought they would know whether that sort of thing was the problem. Sorry if it seems like random testing.

Seeing how you mentioned that argument affecting the format: could it be that, instead of giving me the topic probabilities for a document as LSI did, that argument causes the model to output the topic probabilities of each word in the document, and that is why the output is different? If so, using False should solve the problem, right? If not, an explanation of what that argument does would indeed be helpful, both for me and for the documentation in general.

PS: I will soon provide the output requested before. I thought the problem was simpler, so I did not give it priority. Sorry for the delay.

Can you show the output of:

print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)

Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing.

In general it's always a good idea to:

  1. post your logs at INFO level, and
  2. eyeball samples of data going in / coming out of your pipeline at various points (the log will show some of that too).

@brandatastudio
Author

brandatastudio commented Nov 10, 2019

Output requested

texts output:
[image: texts output]

Vectorized corpus output (corpus in the original question; I have renamed it in my code since then):
[image: vectorizedcorpusoutput]

Here, up to three objects:
[image: vectorized corpus to three]

newddoc output:
[image: newdoc output]

new_doc_freq_vector output:
[image: new_doc_frq_vector_output]

New model vec output (model_vec in the original question):
[image: outputlda]

index gives me no output; it raises the error mentioned in the original question.

@brandatastudio
Author

It looks like that was the cause of the problem: similarity was calculated correctly with LDA after changing that argument to False.

Here is how the code looks now:

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=400,
                                            passes=10,
                                            alpha='auto')

The new_doc_freq_vector output again:
[image: new_doc_freq_vector output_solution]

The output for new_model_vec again:
[image: new_model_vec output]

The index output now:
[image: index output]

And the similarity calculation output:
[image: sim and sims sorted output]
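
For readers who need to keep per_word_topics=True, a possible alternative is to pull out only the topic distribution before computing similarity. The following is a stdlib-only sketch, not gensim code: the helper names document_topics, to_dense and cosine are hypothetical, the vectors are invented, and it assumes the per-word output is a tuple whose first element is the (topic_id, probability) list. It uses cosine similarity, the measure that MatrixSimilarity is documented to compute.

```python
import math


def document_topics(model_output):
    """Return the (topic_id, probability) list from either output shape."""
    # Assumption: per_word_topics=True yields a tuple whose first element
    # is the topic distribution; the default output is already that list.
    if isinstance(model_output, tuple):
        return model_output[0]
    return model_output


def to_dense(sparse_vec, num_topics):
    """Expand a sparse (id, value) list into a fixed-length dense list."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = prob
    return dense


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


NUM_TOPICS = 4

# Invented topic distributions for two indexed documents (default format).
corpus_topics = [
    [(0, 0.9), (1, 0.1)],
    [(2, 0.5), (3, 0.5)],
]

# Invented query output in the assumed per_word_topics=True shape.
query_output = ([(0, 0.8), (1, 0.2)], [], [])

query = to_dense(document_topics(query_output), NUM_TOPICS)
sims = [cosine(query, to_dense(doc, NUM_TOPICS)) for doc in corpus_topics]

# Rank indexed documents from most to least similar.
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
print(ranked)
```

With a real model, calling lda_model.get_document_topics(bow) directly should also return the plain topic distribution, since its per_word_topics parameter defaults to False; treat that as something to verify against your gensim version.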

Thank you for your help.

@piskvorky
Owner

piskvorky commented Nov 10, 2019

No problem.

so if I understand correctly, you are not asking me to fix it. Not sure if you are talking to me there or not.

Yeah I meant for you to fix the docstring for LdaModel.get_document_topics, if you can. The current phrasing is too opaque, and fixing the docstring should be fairly trivial. It would help others looking at the documentation in the future.

@mpenkov
Collaborator

mpenkov commented Dec 21, 2019

@piskvorky Sounds like the most important part of this issue is:

I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when per_word_topics=True, neither of which matches your output. So the docstring should be improved: made correct and clear.

Right? Can we gloss over everything else?

@piskvorky
Owner

piskvorky commented Dec 21, 2019

Yes, the documentation is weird and confusing.

@mpenkov added the "documentation" label (Current issue related to documentation) and removed the "need info" label Dec 21, 2019