topicexplorer.from_config #310

Open
JaimieMurdock opened this issue May 20, 2018 · 2 comments
@JaimieMurdock
Member

Originally raised in #150

Below is a mockup of the interface we're aiming for:

import topicexplorer
te = topicexplorer.from_config('ap.ini')

# access the corpus with .corpus
te.corpus

# access the individual models with dictionary-style access, keyed by number of topics
k = 20
assert isinstance(te[k], LdaCgsViewer)  # LdaCgsViewer comes from the vsm package
te[k].theta
te[k].phi

# comparing two models using the interface
import topicexplorer.analysis
topicexplorer.analysis.model_dist(te[20], te[40])

# integrated past_to_text analysis
ordered_ids = ['some', 'labels', 'by', 'date']
p2t = topicexplorer.analysis.past_to_text(te[20], ordered_ids)
# returns raw numbers

# possible plot library?
import topicexplorer.analysis.plot
topicexplorer.analysis.plot.past_to_text(p2t)

Some other thoughts:

# accessing doc-topic distributions
te[20].doc_topics('some-document') == te[20]['some-document']
# getting specific topic proportion:
te[20]['some-document'][2]

# accessing word-topic distributions
te[20].topics(2) == te[20][2]
# look up a single word's probability within topic 2
topic = te[20].topics(2)
topic[topic['word'] == 'something'] == te[20][2]['something']
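
One way the `te[20]['some-document']` versus `te[20][2]` shorthand could work is to dispatch on the key type: strings are document labels, integers are topic numbers. The sketch below is only an illustration of that idea. `ModelView` is a hypothetical class, not the existing `LdaCgsViewer`, and the matrix orientation (documents by topics for `theta`, topics by words for `phi`) is an assumption made for the example.

```python
import numpy as np


class ModelView:
    """Hypothetical wrapper illustrating key-type dispatch; not the vsm API."""

    def __init__(self, doc_labels, words, theta, phi):
        # Assumed shapes: theta is (n_docs, n_topics), phi is (n_topics, n_words).
        self._doc_index = {label: i for i, label in enumerate(doc_labels)}
        self._words = list(words)
        self.theta = np.asarray(theta, dtype=float)
        self.phi = np.asarray(phi, dtype=float)

    def doc_topics(self, label):
        """Topic proportions for one document, indexable by topic number."""
        return self.theta[self._doc_index[label]]

    def topics(self, topic):
        """Word probabilities for one topic, keyed by word."""
        return dict(zip(self._words, self.phi[topic]))

    def __getitem__(self, key):
        # Strings are treated as document labels, integers as topic numbers.
        if isinstance(key, str):
            return self.doc_topics(key)
        if isinstance(key, (int, np.integer)):
            return self.topics(int(key))
        raise KeyError(key)


# Toy data, invented for the example.
view = ModelView(
    doc_labels=["some-document", "other-document"],
    words=["something", "else"],
    theta=[[0.2, 0.8], [0.5, 0.5]],
    phi=[[0.9, 0.1], [0.3, 0.7]],
)
```

With this shape, `view["some-document"][1]` reads a topic proportion and `view[1]["something"]` reads a word probability, mirroring the two shorthands above. The cost is that a document whose label is an integer could not be addressed by label.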

This is too much for a single ticket, and is closer to what I have in mind for a 2.0, but I want to at least reach the point where the models can be loaded with topicexplorer.from_config() in notebooks.
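
As a starting point for discussion, here is a rough sketch of what a minimal `from_config()` might look like: parse the `.ini` file, then load the corpus and each model lazily on first access. Everything here is an assumption for illustration: the `TopicExplorer` class, the `load_corpus` and `load_model` callables (stand-ins for whatever actually reads the files), and the `[main]` keys `corpus_file`, `model_pattern`, and `topics`.

```python
import ast
import configparser


class TopicExplorer:
    """Hypothetical container returned by from_config(); loads lazily."""

    def __init__(self, corpus_file, model_pattern, topics, load_corpus, load_model):
        self._corpus_file = corpus_file
        self._model_pattern = model_pattern
        self.topics = list(topics)
        self._load_corpus = load_corpus  # assumed callable: path -> corpus
        self._load_model = load_model    # assumed callable: (path, corpus) -> viewer
        self._corpus = None
        self._models = {}

    @property
    def corpus(self):
        # Load the corpus only when it is first needed.
        if self._corpus is None:
            self._corpus = self._load_corpus(self._corpus_file)
        return self._corpus

    def __getitem__(self, k):
        # te[k] returns the model trained with k topics, cached after first load.
        if k not in self.topics:
            raise KeyError(k)
        if k not in self._models:
            path = self._model_pattern.format(k)
            self._models[k] = self._load_model(path, self.corpus)
        return self._models[k]


def from_config(path, load_corpus, load_model):
    """Build a TopicExplorer from an .ini file (key names are assumptions)."""
    # interpolation=None keeps literal '%' characters in paths intact.
    config = configparser.ConfigParser(interpolation=None)
    with open(path) as f:
        config.read_file(f)
    main = config["main"]
    topics = ast.literal_eval(main["topics"])  # e.g. "[20, 40]" -> [20, 40]
    return TopicExplorer(
        main["corpus_file"], main["model_pattern"], topics, load_corpus, load_model
    )
```

Passing the loaders in as arguments keeps the sketch independent of any particular file format; a real implementation would presumably hard-wire them.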

@JaimieMurdock
Member Author

@colinallen Regarding your original comments on what happens when models are incommensurate: the methods I have reduce the vocabulary to the intersection of the two corpora and compare topic distance only on the shared terms, without re-normalizing the distributions. This at least preserves the view of each topic as a probabilistic source yielding tokens; the non-shared portions of each distribution (that is, the parts of the vocabulary in the difference) simply do not contribute to the model distance.
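
To make the idea concrete, here is a small sketch of restricting two topic-word matrices to the vocabulary they share and measuring distance on those columns only, with no re-normalization. This is not the code behind `model_dist`: the function names are hypothetical, the matrices are assumed to be topics by words, and Euclidean distance is used only as a placeholder metric.

```python
import numpy as np


def restrict_to_shared_vocab(phi_a, words_a, phi_b, words_b):
    """Keep only the columns for words present in both vocabularies.

    Rows are topics, columns are words. Rows are NOT re-normalized, so
    each restricted row may sum to less than 1.
    """
    shared = sorted(set(words_a) & set(words_b))
    idx_a = {w: i for i, w in enumerate(words_a)}
    idx_b = {w: i for i, w in enumerate(words_b)}
    cols_a = [idx_a[w] for w in shared]
    cols_b = [idx_b[w] for w in shared]
    a = np.asarray(phi_a, dtype=float)[:, cols_a]
    b = np.asarray(phi_b, dtype=float)[:, cols_b]
    return a, b, shared


def topic_distances(phi_a, words_a, phi_b, words_b):
    """Pairwise distances between topics of two models on shared words.

    Euclidean distance is a placeholder chosen for this sketch.
    """
    a, b, _ = restrict_to_shared_vocab(phi_a, words_a, phi_b, words_b)
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Toy models with one topic each; "z" and "w" are not shared.
phi_a, words_a = [[0.5, 0.3, 0.2]], ["x", "y", "z"]
phi_b, words_b = [[0.4, 0.3, 0.3]], ["y", "x", "w"]

a, b, shared = restrict_to_shared_vocab(phi_a, words_a, phi_b, words_b)
dist = topic_distances(phi_a, words_a, phi_b, words_b)
```

In the toy data the probability mass on `z` and `w` is dropped rather than redistributed, which is the behavior described above: the restricted rows no longer sum to 1.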

@JaimieMurdock
Member Author

JaimieMurdock commented May 23, 2018

Some desired improvements for the Corpus objects:

import topicexplorer
te = topicexplorer.from_config('sep.ini')

# use dictionary access to get the tokens
assert te.corpus['neo-kantianism'] == [25, 37, 141312, 12, ...] 

# assert a document label is in the Corpus object
assert 'neo-kantianism' in te.corpus
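
Both behaviors map onto Python's `__getitem__` and `__contains__` hooks. A minimal sketch, assuming a hypothetical `LabeledCorpus` wrapper that holds per-document token ID lists keyed by label (this is not the current Corpus class, and the token IDs below are made up):

```python
class LabeledCorpus:
    """Hypothetical wrapper giving label-based access to tokenized documents."""

    def __init__(self, labels, token_ids):
        # Map each document label to its list of integer token IDs.
        self._docs = {label: list(ids) for label, ids in zip(labels, token_ids)}

    def __getitem__(self, label):
        # corpus['some-label'] returns a copy of that document's token IDs.
        return list(self._docs[label])

    def __contains__(self, label):
        # 'some-label' in corpus checks for a document label.
        return label in self._docs


corpus = LabeledCorpus(
    labels=["neo-kantianism", "aristotle"],
    token_ids=[[25, 37, 12], [4, 8]],
)
```

Returning a copy from `__getitem__` keeps callers from mutating the stored tokens by accident; a real implementation backed by a single large array might return a view instead.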

JaimieMurdock added this to the 1.0 Release milestone Jul 8, 2018