
[WIP] LDA tutorial, tips and tricks #779

Merged 9 commits into piskvorky:develop on Nov 11, 2016

Conversation

olavurmortensen
Contributor

@tmylk @piskvorky A tutorial on LDA sharing some of my experience, as requested.

@tmylk I'm sure you have some comments on it. Thought it would be easiest with a PR. It's still a work in progress, as reflected by the "TODO" list in the start of the tutorial.

Not exactly sure what you would like the tutorial to be, but I tried to explain what the goal of it was in the introduction.

@tmylk
Contributor

tmylk commented Aug 9, 2016

Could you link to it in tutorials.md file as well?

@olavurmortensen
Contributor Author

Will do @tmylk. Does that mean you think it's OK as it is? In that case I'll just clean it up one of these days to prepare it for merging.

@tmylk
Contributor

tmylk commented Aug 14, 2016

@olavurmortensen When do you think this would be finished?

@tmylk tmylk changed the title from "LDA tutorial, tips and tricks" to "[WIP] LDA tutorial, tips and tricks" on Aug 14, 2016
@olavurmortensen
Contributor Author

@tmylk Well, I thought you would have some comments. If you do not, then I think I can finish it tomorrow.

"\n",
"> **Note:**\n",
">\n",
"> This tutorial uses the scikit-learn and nltk libraries, although you can replace them with others if you want. Python 3 is used, although Python 2.7 can be used as well.\n",
Owner

Where is sklearn used? I cannot find it.

@cscorley
Contributor

cscorley commented Aug 16, 2016

I briefly looked through it this morning. All I can add right now is an explanation of why 5 models trained on exactly the same input have different output (e.g., perplexity), and perhaps an aside on how to achieve 1:1 training with the random state parameter. It seems like a common question.

@olavurmortensen
Contributor Author

@cscorley The 5 models are different because of random initialization (specifically, random initialization of some of the variational parameters, e.g. gamma). But you bring up an important point that this should be explained in the tutorial, and maybe we should even set the random state just to make it more explicit.
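For reference, fixing the random seed in gensim looks roughly like the sketch below. The toy texts and parameter values are illustrative only and are not taken from the tutorial.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-in for the tutorial's corpus.
texts = [["human", "machine", "interface"],
         ["graph", "trees", "minors"],
         ["human", "graph", "interface", "trees"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# With the same random_state, the random initialization (e.g. of gamma)
# is reproducible, so two runs on the same input yield the same model.
lda_a = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)
lda_b = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)

print(lda_a.show_topics())
print(lda_b.show_topics())  # identical to lda_a's topics
```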

@olavurmortensen
Contributor Author

I have updated the tutorial according to the comments, thanks @piskvorky and @cscorley. Also added a link in tutorials.md.

I also changed the name because, bizarrely, someone posted a tutorial with the same name I was using just a week ago.

@tmylk and @piskvorky Will the tutorial appear on the RaRe blog?

@olavurmortensen
Contributor Author

@tmylk @piskvorky Ready for merging. The conflict is because I changed the tutorials.md file, I think.

When it's merged I'll submit a blog post on WordPress for review as well.

Contributor

@tmylk tmylk left a comment

The title should be changed to 'Pre-processing and training LDA'. The value of this tutorial is in explaining the pre-processing steps and the meaning of the LDA parameters. Model selection is not really covered here as deeply as in the Topic Coherence tutorial or the 'America's Next Topic Model' blog post.
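For readers of this thread, a typical gensim LdaModel training call looks roughly like the sketch below; the parameters and values shown are illustrative, not necessarily the ones the tutorial discusses.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Assume `docs` is a list of tokenized, pre-processed documents.
docs = [["human", "machine", "interface"], ["graph", "trees", "minors"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,     # number of topics to extract
    chunksize=2000,    # documents processed per training chunk
    passes=20,         # full passes over the corpus
    iterations=400,    # max inference iterations per document
    alpha='auto',      # learn an asymmetric document-topic prior
    eta='auto',        # learn the topic-word prior
    eval_every=None,   # skip perplexity estimation during training (faster)
)
```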

"source": [
"# LDA: training tips\n",
"\n",
"LDA is a probabilistic hierarchical Bayesian model that is a mixture model as well as a mixed membership model... but we won't be getting into any of that.\n",
Contributor

The first sentence should say what this tutorial is. Here it is done in the Nth sentence; please move it to the very first line. It is OK to discuss what this tutorial is not, but that should come later.

Contributor Author

True. The first sentence is also a tad snarky. I just removed the first sentence; does anything else need to be changed in that regard?

Contributor

"In this tutorial I will show how to pre-process text and train LDA on it"

Contributor Author

Better now?

"cell_type": "markdown",
"metadata": {},
"source": [
"We select the model with the lowest perplexity."
Contributor

Why do you do that? You talk about topic coherence later, so it is confusing.
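For context, evaluating a model by topic coherence rather than perplexity looks roughly like the sketch below; the toy corpus is illustrative, and 'u_mass' is just one of the coherence measures gensim provides.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["human", "machine", "interface"],
         ["graph", "trees", "minors"],
         ["human", "graph", "interface", "trees"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)

# 'u_mass' coherence is computed from document co-occurrence counts;
# values closer to zero (less negative) indicate more coherent topics.
cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print('Coherence:', cm.get_coherence())
```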

Contributor Author

Yes, I agree this is not a good way of selecting a model at all.

Contributor Author

I removed the sections about model selection.

Contributor

Thanks

"cell_type": "markdown",
"metadata": {},
"source": [
"[pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html) can be fun and useful. Include the code below in your notebook to visualize your topics with pyLDAvis.\n",
Contributor

Please add the actual pyLDAvis output to the notebook.

Contributor Author

Rendering pyLDAvis output in the notebook completely messes up the scale of the notebook, so I'd rather not include it.

Contributor

Either include the actual picture, or remove the code and link to a pyLDAvis tutorial. Code alone serves no purpose.

Contributor Author

I removed the text about it. Come to think of it, since it's mentioned in other RaRe blogs, there isn't much need for it in this one.

@tmylk tmylk merged commit 951eebf into piskvorky:develop Nov 11, 2016