I am using a combination of TF-IDF and LSI as shown in this tutorial to arrive at corpus_lsi = lsi[corpus_tfidf]. I then need to unpack this TransformedCorpus into a num_documents x num_topics matrix, but I cannot, because not all of the individual vectors in corpus_lsi have the length num_topics that I set in models.LsiModel.
I have opted to use 1000 topics but, roughly 4 times out of 5, not every document vector has 1000 elements: 1 or 2 of them have only 999 or 998. I tried setting num_topics to 990, but the same issue persists at the lower value. I am unable to provide the dataset I am using, but below is the code I have written:
This works just fine:
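from gensim import corpora
from gensim.models import LsiModel, TfidfModel

# text_dict is assumed to be an iterable of tokenized documents (lists of tokens)
id2word = corpora.Dictionary(text_dict)
texts = text_dict
# Term Document Frequency (bag of words)
corpus = [id2word.doc2bow(text) for text in texts]
# TF-IDF
tfidf = TfidfModel(corpus, smartirs='ntc')
corpus_tfidf = tfidf[corpus]
# LSI: 1000 Topics
lsi_model = LsiModel(corpus=corpus_tfidf, id2word=id2word,
                     num_topics=1000, decay=0.5)
# Apply the trained LSI model to the TF-IDF corpus
corpus_lsi = lsi_model[corpus_tfidf]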
Converting corpus_lsi to a matrix runs without error, but does not work as expected:
static_corpus_lsi = np.array([[j[1] for j in i] for i in corpus_lsi])
Most of the time, static_corpus_lsi does not have the shape num_documents x 1000 that I expect. Some document vectors have length 1000, while 1 or 2 are slightly shorter.
The vectors returned by lsi[corpus] are sparse, meaning explicit zeros are omitted. That's probably the reason why some vectors appear "shorter" to you.
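To illustrate the point about sparse vectors, here is a minimal sketch of one way to build the dense matrix by placing each weight at its topic index instead of relying on list position. The helper name sparse_docs_to_dense and the toy data are hypothetical and only stand in for corpus_lsi; the sketch assumes each document is a list of (topic_id, weight) pairs, as in the question. gensim also ships gensim.matutils.corpus2dense and gensim.matutils.sparse2full for this purpose, which may be preferable in practice.

```python
import numpy as np


def sparse_docs_to_dense(sparse_docs, num_topics):
    """Build a num_documents x num_topics array from (topic_id, weight) pairs.

    Hypothetical helper: topics missing from a document's sparse vector
    are left at 0.0 instead of shifting the remaining values left.
    """
    sparse_docs = list(sparse_docs)  # materialize a streamed corpus once
    dense = np.zeros((len(sparse_docs), num_topics), dtype=np.float64)
    for row, doc in enumerate(sparse_docs):
        for topic_id, weight in doc:
            dense[row, topic_id] = weight
    return dense


# Toy stand-in for corpus_lsi (made-up values): the second document
# omits topic 1, and the third document has no entries at all.
toy_corpus_lsi = [
    [(0, 0.5), (1, -0.2), (2, 0.1)],
    [(0, 0.3), (2, 0.7)],
    [],
]

dense = sparse_docs_to_dense(toy_corpus_lsi, num_topics=3)
print(dense.shape)
```

With the real data, the equivalent call would pass corpus_lsi and num_topics=1000. Indexing by topic_id matters here: taking only j[1] from each pair, as in the original list comprehension, discards the information about which topic was omitted.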