id2word returns weird numbers instead of words with get_topic_terms #2845

Closed
arshad115 opened this issue May 21, 2020 · 1 comment


@arshad115

Problem description

When I request the top terms of each topic from my LDA model, I get numbers instead of words. The corpus doesn't contain any numbers. Other models return the top words correctly, but the model trained on this corpus always returns numbers.

Code for training:

    dictionary = Dictionary(texts)  # texts is a list of documents, each split into words
    dictionary.filter_extremes(no_above=0.8, no_below=3)
    dictionary.compactify()

    corpus = [dictionary.doc2bow(text) for text in texts]
    model = LdaMulticore(corpus=corpus, num_topics=num_topics, passes=5)

Versions

version 3.8.1

I used this code to get the top words, which works for other models.

def get_topic_top_words(lda_model, topic_id, nr_top_words=5):
    """Return the top words for topic_id from lda_model."""
    # get_topic_terms returns (word_id, probability) pairs. Unpack them directly:
    # np.array(id_tuples)[:, 0] would coerce the integer ids to floats (9143 -> 9143.0).
    id_tuples = lda_model.get_topic_terms(topic_id, topn=nr_top_words)
    return [lda_model.id2word[word_id] for word_id, _ in id_tuples]

Output:

For topic 0, the top words are: 9143.0, 3313.0, 14517.0, 17358.0, 306.0, 8569.0, 2560.0, 20018.0, 320.0, 2807.0, 2838.0, 14388.0.
For topic 1, the top words are: 14.0, 405.0, 496.0, 90.0, 440.0, 270.0, 417.0, 143.0, 274.0, 145.0, 193.0, 817.0.
For topic 2, the top words are: 496.0, 405.0, 14.0, 90.0, 1713.0, 40.0, 199.0, 71.0, 193.0, 931.0, 248.0, 145.0.
For topic 3, the top words are: 47065.0, 143.0, 1080.0, 14.0, 1598.0, 90.0, 147.0, 272.0, 900.0, 248.0, 595.0, 2580.0.
For topic 4, the top words are: 14.0, 1335.0, 1061.0, 339.0, 959.0, 2366.0, 809.0, 352.0, 147.0, 1312.0, 268.0, 989.0.
For topic 5, the top words are: 9143.0, 19553.0, 14.0, 759.0, 437.0, 22.0, 456.0, 405.0, 1971.0, 1704.0, 1821.0, 27175.0.
For topic 6, the top words are: 2560.0, 1229.0, 14.0, 93649.0, 817.0, 496.0, 71.0, 143.0, 270.0, 231.0, 914.0, 455.0.
For topic 7, the top words are: 417.0, 817.0, 150.0, 2644.0, 14.0, 147.0, 405.0, 622.0, 183.0, 2535.0, 517.0, 496.0.
For topic 8, the top words are: 183.0, 469.0, 14.0, 1331.0, 143.0, 389.0, 696.0, 332.0, 39.0, 104.0, 405.0, 886.0.
For topic 9, the top words are: 14.0, 921.0, 22.0, 405.0, 545.0, 417.0, 183.0, 24.0, 484.0, 71.0, 231.0, 143.0.
@arshad115 changed the title from "get_topic_terms returns weird numbers instead of words" to "id2word returns weird numbers instead of words with get_topic_terms" on May 21, 2020
@arshad115
Author

I forgot to pass the dictionary to the LDA model (the id2word argument), so it prints the token ids instead of the words. For a model that is already trained, you can still get the words by looking up each id in the dictionary.
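
For anyone who hits the same thing, here is a minimal sketch of the lookup described above. It uses only the standard library: the plain dict `id2token` and the sample `topic_terms` list are hypothetical stand-ins for the gensim `Dictionary` and for the output of `get_topic_terms`, and the helper name `ids_to_words` is made up for illustration. The more direct fix is probably to retrain with the dictionary passed in, as noted in the first comment of the block.

```python
# Direct fix when training (gensim, not executed here):
#   model = LdaMulticore(corpus=corpus, id2word=dictionary,
#                        num_topics=num_topics, passes=5)
#
# Workaround for an already trained model: map the ids yourself.

def ids_to_words(topic_terms, id2token):
    """Map (word_id, probability) pairs to (word, probability) pairs.

    int() guards against ids that were coerced to floats, for example
    by passing the pairs through a NumPy array.
    """
    return [(id2token[int(word_id)], float(prob)) for word_id, prob in topic_terms]


# Hypothetical sample data standing in for `dictionary` and
# `model.get_topic_terms(topic_id, topn=3)`.
id2token = {0: "topic", 1: "model", 2: "corpus"}
topic_terms = [(2, 0.5), (0, 0.25), (1.0, 0.125)]

top_words = ids_to_words(topic_terms, id2token)
print(top_words)
```

With a real gensim `Dictionary`, `dictionary[word_id]` should serve the same role as `id2token[word_id]` here.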
