Error “too many values to unpack” when trying to get similarities in Gensim using LDA model #2644
Comments
Things I have tried to solve this:
`Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word))`, but it did not work; the same error was raised.
`<gensim.interfaces.TransformedCorpus object at 0x00000221ECA8E148>`
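(Side note: a `TransformedCorpus` is a lazy wrapper, so printing it only shows its repr. A quick way to inspect the actual transformed documents, assuming the `lda_model` and `corpus` from this thread:)

```python
# lda_model[corpus] is evaluated lazily; iterating yields one transformed
# document at a time ((topic_id, prob) pairs, or a 3-tuple when the model
# was built with per_word_topics=True).
for i, doc in enumerate(lda_model[corpus]):
    print(doc)
    if i >= 2:  # inspect just the first three documents
        break
```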
Can you show the output of:

```python
print(texts[:10])
print(corpus[:10])
print(newddoc)
print(model_vec)
print(index)
```

Pandas has a pretty bizarre indexing and iterating/slicing system, with many un-Pythonic gotchas, so I'm not sure what your code is actually doing. In general it's always a good idea to:
Have you resolved the problem? I'm facing something similar here.
@brandatastudio Ping. Could you please provide the info requested in this comment?
Hello everyone, sorry for the delay; I've been so immersed in the project that I haven't had time until now. Yes, I think I found the cause, though I have not developed an algorithm to overcome it. I answered my own Stack Overflow post, https://stackoverflow.com/questions/58522356/error-too-many-values-to-unpack-when-trying-to-get-similiraties-in-gensim-usin/58566190#58566190, with the help of other Stack Overflow users. Basically, you need to transform the output of the LDA model before applying similarity, because the output of the function is different from LSI's (check the post for more details). That's my hypothesis about the origin of the problem (if someone else has more insight and better-validated conclusions, please feel welcome to post or correct me). As for the solution, I don't have one implemented yet, but when I eventually tackle the problem I will start my search from here: https://www.kaggle.com/ktattan/lda-and-document-similarity. Right now I am focused on investigating other aspects of my recommender system using LSI, but eventually I will try to build one with LDA; when I do, I will be sure to post an update here. Hope this helps; I don't have much more to offer at the moment.
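(For reference, a minimal sketch of that "transform before similarity" idea, assuming the `lda_model`, `corpus`, `id2word` and a new bag-of-words vector `new_bow` from this thread; this is not the author's actual code:)

```python
from gensim import similarities

# get_document_topics() with its default per_word_topics=False returns plain
# (topic_id, probability) pairs, which the similarity classes expect.
topic_corpus = [lda_model.get_document_topics(bow) for bow in corpus]

# Build a cosine-similarity index over the topic space.
index = similarities.MatrixSimilarity(topic_corpus,
                                      num_features=lda_model.num_topics)

# Query: similarity of a new document against every training document.
new_vec = lda_model.get_document_topics(new_bow)
sims = index[new_vec]
```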
The output of LDA is in the exact same format as LSI: Gensim's sparse vector format. If you're seeing something different (and you're reasonably certain it's not some user error on your side), please open a bug report.
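(For illustration: with default settings, both transformations emit that same sparse format. Names are assumed from the thread and the output values are made up:)

```python
# Both models map a bag-of-words vector to a list of (id, weight) 2-tuples.
print(lsi_model[new_bow])  # e.g. [(0, 0.42), (1, -0.13), ...]
print(lda_model[new_bow])  # e.g. [(0, 0.07), (3, 0.91)] — same (id, weight) pairs
```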
When I apply the trained LDA model to a vectorized document, the output differs from what the trained LSI model produces, both using the same number of topics for each trained model, the same corpus, and the same vectorized input document. I thought this was justified, since the mathematical output of each should be different, as explained on page 4 of this paper, which describes the processing applied to each model's output to perform recommendation: http://ceur-ws.org/Vol-1815/paper4.pdf. If you are telling me that these lines of code should produce output in the same format for calculating similarity, then obviously there is a bug, because the LDA output does not match the LSI output in format. Just for completeness, here is the code used for training the LSI and LDA models respectively.
Please confirm that you are sure their output should be in the same format. If that is the case, I need to proceed and report a bug, or share my code with someone who can determine for sure that there is no user error on my part, although I think I already shared it in this question's post.
You're explicitly requesting `per_word_topics=True`. See also the documentation at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics. Although that docstring shows two contradictory return value types for `per_word_topics=True` – which one is correct? And the listing by @brandatastudio above actually matches neither.
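(For context, a sketch of what the two settings return, based on the documented behaviour rather than this thread's code; `lda_model` and `new_bow` are assumed to exist:)

```python
# Default: a flat list of (topic_id, probability) pairs.
doc_topics = lda_model.get_document_topics(new_bow)
# e.g. [(2, 0.63), (5, 0.31)]

# per_word_topics=True: a 3-tuple instead of a flat list.
doc_topics, word_topics, phi_values = lda_model.get_document_topics(
    new_bow, per_word_topics=True
)
# doc_topics  -> [(topic_id, prob), ...] as above
# word_topics -> [(word_id, [topic_id, ...]), ...] most relevant topics per word
# phi_values  -> [(word_id, [(topic_id, phi), ...]), ...] per-word topic weights
```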
I don't understand what you mean by "Although that docstring shows two contradictory return value types for per_word_topics=True – which one is correct? And the listing by @brandatastudio above actually matches neither." I will try using `per_word_topics=False`, then try to calculate similarity, and get back to you; hopefully that resolves this entire issue.
Don't just try parameters at random. Choose the parameters that match your goal. Why did you set the (non-default) `per_word_topics=True` in the first place?
In any case, a PR fixing the docstring will be welcome.
Sorry, I closed it by accident. I need to revisit this problem, but not right now; I'm busy at the moment and will get back to you ASAP. PS: what do you mean by the docstring?
I mean the docstring of https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics shows two contradictory return values for the case when `per_word_topics=True`.
So if I understand correctly, you are not asking me to fix it? Not sure if you are talking to me there or not. The reason I chose `per_word_topics=True` was that this was my first experiment with the library; I had never used it before. Honestly, I don't think the documentation explains how that argument affects the output and the similarity-calculation process (in the documentation, similarity is always demonstrated with LSI as the example model; I have never seen an example of similarity calculation with LDA in gensim). I used True because I thought the argument just added more information to the model, but I was not truly sure what it did (I don't see a clear explanation of it anywhere; the one at https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics is not dumbed down enough for me, I guess). I did not fully understand it, and still don't, so I thought I would just try it out. Seeing that the experiment failed, I posted this question here and on Stack Overflow to get feedback from people who know better than me, because I thought they would know whether that sort of thing was the problem. Sorry if it seemed like random testing.

Seeing how you referenced that argument affecting the format: could it be that, instead of giving me the topic probabilities for a document as LSI did, that argument causes the model to output the topic probability of each of the words in the document, and that is the reason the output is different? If that is the case, using False should solve the problem, right? If not, an explanation of what that argument does would indeed be helpful, both for me and for the documentation in general.

PS: I will soon provide the output requested before; I thought the problem was simpler, so I did not give it priority. Sorry for the delay.
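(If that hypothesis is right, a minimal reproduction would look something like the sketch below; names are hypothetical. Note that `per_word_topics` is set at model construction and is honoured by `lda_model[bow]`:)

```python
from gensim.models import LdaModel

# With per_word_topics=True, applying the model to a single document yields
# a (doc_topics, word_topics, phi_values) triple instead of (id, prob) pairs.
# Similarity-index code that unpacks 2-tuples then raises
# "too many values to unpack".
lda_pwt = LdaModel(corpus=corpus, id2word=id2word, num_topics=10,
                   per_word_topics=True)
print(lda_pwt[corpus[0]])       # a 3-tuple, not a list of (id, prob) pairs
print(len(lda_pwt[corpus[0]]))  # 3
```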
No problem.
Yeah, I meant for you to fix the docstring of `get_document_topics`.
@piskvorky Sounds like the most important part of this issue is the contradictory docstring for `get_document_topics` with `per_word_topics=True`.
Right? Can we gloss over everything else?
Yes, the documentation is weird and confusing.
Problem description
I'm trying to use a trained LDA model to compare similarity between the model's documents (stored in `corpus`) and new documents unseen by the model.
Steps/code/corpus to reproduce
I'm using an Anaconda environment with Python 3.7 and gensim 3.8.0. I have my data as a dataframe that I split into a training set and a test set; both have this structure:
X_test and Xtrain dataframe format (already preprocessed).
This is the code I use for creating my model:
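(A reconstruction sketch of a typical gensim LDA setup consistent with this thread — all names and parameter values are assumptions, not the original code:)

```python
from gensim import corpora
from gensim.models import LdaModel

# texts: list of token lists, one per training document (already preprocessed)
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

# per_word_topics=True is the (non-default) setting discussed in this thread
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=20,
                     passes=10,
                     random_state=100,
                     per_word_topics=True)
```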
Up to this point, everything works fine; I can effectively query the model to get my topics.
The problem comes when I try to compare similarity between a new document and the corpus. Here is the code I'm using:
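(A sketch of the kind of call that reproduces this, based on the attempt quoted at the top of the thread; variable names are assumptions:)

```python
from gensim.similarities import Similarity

# new_doc: a new, preprocessed token list unseen by the model
new_bow = id2word.doc2bow(new_doc)
model_vec = lda_model[new_bow]

# Because the model was built with per_word_topics=True, lda_model[...]
# yields 3-tuples, so these two lines raise "too many values to unpack".
index = Similarity('model/indexes/similarity_index_01',
                   lda_model[corpus], num_features=len(id2word))
sims = index[model_vec]
```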
In the last two lines, I get the "too many values to unpack" error from the title.
Versions
Anaconda environment with Python 3.7 and gensim 3.8.0.