All vectors in lsi_model[corpus_tfidf] are not of num_topics length passed to LsiModel #2501

Closed
dpalbrecht opened this issue May 24, 2019 · 2 comments

Comments

@dpalbrecht

I am using a combination of TF-IDF and LSI, as shown in this tutorial, to arrive at corpus_lsi = lsi[corpus_tfidf]. I then need to unpack this TransformedCorpus into a num_documents x num_topics matrix, but cannot, because not all of the individual vectors in corpus_lsi have the num_topics length that I passed to models.LsiModel.

I have opted to use 1000 topics but, 4 out of 5 times, not every document vector has 1000 elements: 1 or 2 of them have 999 or 998 instead. I tried setting num_topics to 990, but the same thing happens at the lower value. I am unable to provide the dataset I am using, but below is the code I have written:

This works just fine:

import numpy as np
from gensim import corpora
from gensim.models import LsiModel, TfidfModel

id2word = corpora.Dictionary(text_dict)
texts = text_dict

# Term Document Frequency (bag of words)
corpus = [id2word.doc2bow(text) for text in texts]

# TF-IDF
tfidf = TfidfModel(corpus, smartirs='ntc')
corpus_tfidf = tfidf[corpus]

# LSI: 1000 Topics
lsi_model = LsiModel(corpus=corpus_tfidf, id2word=id2word,
                     num_topics=1000, decay=0.5)

corpus_lsi = lsi_model[corpus_tfidf]

Converting corpus_lsi to a matrix runs without error, but does not work as expected:

static_corpus_lsi = np.array([[j[1] for j in i] for i in corpus_lsi])

Most of the time, static_corpus_lsi does not have the shape num_documents x 1000 that I expect. Some document vectors have length 1000, and 1 or 2 are slightly shorter.

Versions Info:

Linux-4.14.79+-x86_64-with-Ubuntu-18.04-bionic
('Python', '2.7.15rc1 (default, Nov 12 2018, 14:31:15) \n[GCC 7.3.0]')
('NumPy', '1.16.3')
('SciPy', '1.2.1')
('gensim', '3.6.0')
('FAST_VERSION', 1)
@piskvorky
Owner

piskvorky commented May 25, 2019

The vectors returned by lsi[corpus] are sparse, meaning explicit zeros are omitted. That's probably the reason why some vectors appear "shorter" to you.

Check out gensim.matutils.corpus2dense and "Q3: How do you calculate the matrix V in LSI space?" in the Gensim FAQ, and more generally the Gensim Core Concepts introduction.
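To make the sparse-vector point concrete, here is a minimal, dependency-free sketch. The toy data and the helper name to_dense_rows are made up for illustration and are not part of Gensim. It only assumes that each document is a list of (topic_id, weight) pairs with zero entries left out, which is what the comment above describes.

```python
def to_dense_rows(sparse_corpus, num_topics):
    """Expand sparse (topic_id, weight) vectors into fixed-length rows.

    Hypothetical helper for illustration only; not a Gensim function.
    """
    rows = []
    for doc in sparse_corpus:
        row = [0.0] * num_topics          # start from all zeros
        for topic_id, weight in doc:
            row[topic_id] = weight        # place each weight at its topic index
        rows.append(row)
    return rows


# Toy stand-in for corpus_lsi (made-up values): 3 documents, 4 topics.
toy_corpus_lsi = [
    [(0, 0.7), (1, -0.2), (2, 0.1), (3, 0.4)],
    [(0, 0.3), (2, -0.5), (3, 0.2)],      # topic 1 is zero, so it is omitted
    [(1, 0.9)],                           # only topic 1 is non-zero
]

# The approach from the question keeps only the values, so rows end up ragged
# and the topic positions are lost.
naive = [[weight for _, weight in doc] for doc in toy_corpus_lsi]

# Filling by topic_id keeps every row at num_topics elements.
dense = to_dense_rows(toy_corpus_lsi, num_topics=4)

print([len(row) for row in naive])
print([len(row) for row in dense])
```

With Gensim itself, the same conversion should be available as gensim.matutils.corpus2dense(corpus_lsi, num_terms=lsi_model.num_topics). It returns a num_topics x num_documents array, so a transpose (.T) would be needed to get the num_documents x num_topics layout asked for in the question.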

@dpalbrecht
Author

Great, thank you!

2 participants