id2word returns weird numbers instead of words with get_topic_terms #2845

arshad115 · 2020-05-21T19:56:38Z

Problem description

When I fetch the top terms per topic from my LDA model, I get numbers instead of words. The corpus doesn't contain any numbers. The same code returns the top words correctly for other models, but for the model trained on this corpus the results are always numbers.

Code for training:

    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(texts)  # texts: list of documents, each split into words
    dictionary.filter_extremes(no_above=0.8, no_below=3)
    dictionary.compactify()

    corpus = [dictionary.doc2bow(text) for text in texts]
    model = LdaMulticore(corpus=corpus, num_topics=num_topics, passes=5)

Versions

gensim 3.8.1

I used this code to get the top words, which works for other models.

    import numpy as np

    def get_topic_top_words(lda_model, topic_id, nr_top_words=5):
        """Return the top words for topic_id from lda_model."""
        # get_topic_terms yields (word_id, probability) pairs
        id_tuples = lda_model.get_topic_terms(topic_id, topn=nr_top_words)
        # column 0 holds the ids; the array is float, so ids come out as e.g. 14.0
        word_ids = np.array(id_tuples)[:, 0]
        words = map(lambda id_: lda_model.id2word[id_], word_ids)
        return words

Output:

For topic 0, the top words are: 9143.0, 3313.0, 14517.0, 17358.0, 306.0, 8569.0, 2560.0, 20018.0, 320.0, 2807.0, 2838.0, 14388.0.
For topic 1, the top words are: 14.0, 405.0, 496.0, 90.0, 440.0, 270.0, 417.0, 143.0, 274.0, 145.0, 193.0, 817.0.
For topic 2, the top words are: 496.0, 405.0, 14.0, 90.0, 1713.0, 40.0, 199.0, 71.0, 193.0, 931.0, 248.0, 145.0.
For topic 3, the top words are: 47065.0, 143.0, 1080.0, 14.0, 1598.0, 90.0, 147.0, 272.0, 900.0, 248.0, 595.0, 2580.0.
For topic 4, the top words are: 14.0, 1335.0, 1061.0, 339.0, 959.0, 2366.0, 809.0, 352.0, 147.0, 1312.0, 268.0, 989.0.
For topic 5, the top words are: 9143.0, 19553.0, 14.0, 759.0, 437.0, 22.0, 456.0, 405.0, 1971.0, 1704.0, 1821.0, 27175.0.
For topic 6, the top words are: 2560.0, 1229.0, 14.0, 93649.0, 817.0, 496.0, 71.0, 143.0, 270.0, 231.0, 914.0, 455.0.
For topic 7, the top words are: 417.0, 817.0, 150.0, 2644.0, 14.0, 147.0, 405.0, 622.0, 183.0, 2535.0, 517.0, 496.0.
For topic 8, the top words are: 183.0, 469.0, 14.0, 1331.0, 143.0, 389.0, 696.0, 332.0, 39.0, 104.0, 405.0, 886.0.
For topic 9, the top words are: 14.0, 921.0, 22.0, 405.0, 545.0, 417.0, 183.0, 24.0, 484.0, 71.0, 231.0, 143.0.

arshad115 · 2020-05-25T05:19:35Z

I forgot to pass the dictionary to the LDA model (the id2word argument), so it prints the token ids instead of words. You can still get the words by looking each id up in the dictionary.
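For anyone hitting the same thing, here is a rough sketch of both options: passing the dictionary at training time, and mapping ids back to words afterwards. The function names `train_lda_with_vocab` and `terms_to_words` are made up for illustration and are not part of gensim; the gensim imports sit inside the training function only so the lookup helper can be tried without gensim installed.

```python
def train_lda_with_vocab(texts, num_topics, passes=5):
    """Train LdaMulticore with the dictionary attached (hypothetical helper)."""
    # Imported lazily so terms_to_words below works without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(texts)
    dictionary.filter_extremes(no_above=0.8, no_below=3)
    corpus = [dictionary.doc2bow(text) for text in texts]
    # id2word=dictionary is the argument that was missing from the original call.
    model = LdaMulticore(corpus=corpus, id2word=dictionary,
                         num_topics=num_topics, passes=passes)
    return model, dictionary


def terms_to_words(topic_terms, id2word):
    """Map (token_id, probability) pairs to words (hypothetical helper).

    id2word can be any id -> word mapping that supports [] lookup,
    such as a gensim Dictionary or a plain dict.
    """
    # int() guards against ids that were turned into floats along the way.
    return [id2word[int(token_id)] for token_id, _ in topic_terms]
```

With an already trained model that lacks the mapping, something like `terms_to_words(model.get_topic_terms(0, topn=5), dictionary)` should give the words for topic 0, assuming `dictionary` is the same one used to build the corpus.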

arshad115 changed the title ~~get_topic_terms returns weird numbers instead of words~~ id2word returns weird numbers instead of words with get_topic_terms May 21, 2020

arshad115 closed this as completed May 25, 2020

mpenkov mentioned this issue Oct 28, 2020

Update changelog for 4.0.0 release #2981

Merged
