zpahuja/EM

EM

We demonstrate the use of expectation-maximization (EM) for topic modeling on the NIPS dataset and for image segmentation on selected images.

The data can be found in the /data folder. The images, including the generated segmented images, can be found in the /images folder.

Python 3.5 is required to run the code. We tested with, and recommend, Anaconda and Jupyter Notebook. The code and its output can be seen in the .ipynb files. The .py files were generated by Jupyter and may require minor modification to run successfully. The code also requires the numpy, scipy and scikit-learn Python modules.

The code was optimized for performance by replacing loops with linear algebra operations.
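As an illustration of the kind of rewrite this refers to, the sketch below computes the log-likelihood of every document under every topic twice: once with nested loops and once as a single matrix product. The function names, array names and toy sizes are assumptions made for this example and are not taken from the notebooks.

```python
import numpy as np

def log_lik_loops(X, log_p):
    """Per-document, per-topic log-likelihood using explicit Python loops."""
    n_docs, n_words = X.shape
    n_topics = log_p.shape[0]
    out = np.zeros((n_docs, n_topics))
    for i in range(n_docs):
        for j in range(n_topics):
            for k in range(n_words):
                out[i, j] += X[i, k] * log_p[j, k]
    return out

def log_lik_matmul(X, log_p):
    """The same quantity computed as one matrix product."""
    return X @ log_p.T

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 8)).astype(float)  # toy word counts (hypothetical)
p = rng.dirichlet(np.ones(8), size=3)              # toy topic-word probabilities

# Both versions agree; the matrix product avoids the Python-level loops.
assert np.allclose(log_lik_loops(X, np.log(p)), log_lik_matmul(X, np.log(p)))
```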

Topic Models

We used a multivariate multinomial distribution model to cluster documents into 30 topics based on their word distributions.

Because the dataset is sparse, we stored it in scipy's CSR matrix format to improve performance. We also initialized EM with starting points taken from the results of scikit-learn's k-means.
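A minimal sketch of this setup, under stated assumptions, might look as follows. The (document, word, count) triples are toy values, and the way the k-means labels are turned into initial mixing weights and topic-word probabilities (including the additive smoothing constant) is one plausible choice rather than necessarily what the notebooks do.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Toy (doc_id, word_id, count) triples standing in for the real data.
doc_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
word_ids = np.array([0, 1, 0, 1, 2, 3, 2, 3])
counts = np.array([3, 1, 2, 2, 4, 1, 3, 2])

# Sparse document-by-word count matrix; only nonzero counts are stored.
X = csr_matrix((counts, (doc_ids, word_ids)), shape=(4, 4), dtype=float)

n_topics = 2  # the README uses 30; 2 keeps the toy example small
km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(X)

# Initial mixing weights: fraction of documents assigned to each cluster.
pi0 = np.bincount(km.labels_, minlength=n_topics) / X.shape[0]

# Initial topic-word probabilities: smoothed word counts per cluster.
smoothing = 1.0  # hypothetical additive smoothing to avoid zero probabilities
p0 = np.vstack([
    np.asarray(X[np.flatnonzero(km.labels_ == j)].sum(axis=0)).ravel() + smoothing
    for j in range(n_topics)
])
p0 /= p0.sum(axis=1, keepdims=True)
```

Smoothing keeps every word probability strictly positive, so taking logarithms later does not produce negative infinity.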

To avoid underflow and overflow when computing w, we used the log-sum-exp trick.
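The sketch below shows one way the log-sum-exp trick can be applied in the E-step, assuming w denotes the per-document topic weights and that the per-topic log-likelihoods are already available. The function and variable names are illustrative only.

```python
import numpy as np

def e_step(log_lik, log_pi):
    """Return weights w (each row sums to 1) and the per-document log evidence.

    log_lik: (n_docs, n_topics) log-likelihood of each document under each topic
    log_pi:  (n_topics,) log mixing weights
    """
    a = log_lik + log_pi                  # unnormalized log-weights
    a_max = a.max(axis=1, keepdims=True)  # shift so the largest exponent is 0
    log_norm = a_max + np.log(np.exp(a - a_max).sum(axis=1, keepdims=True))
    w = np.exp(a - log_norm)
    return w, log_norm.ravel()

# Log-likelihoods this negative underflow to 0 under a naive exp(), giving 0/0.
log_lik = np.array([[-1000.0, -1001.0],
                    [-2000.0, -1995.0]])
log_pi = np.log(np.array([0.5, 0.5]))
w, log_evidence = e_step(log_lik, log_pi)
```

Subtracting the row maximum before exponentiating leaves the normalized weights unchanged mathematically, so this is a numerically stable rearrangement rather than an approximation.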

We then plotted the value of Q over the iterations of EM.
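For reference, a hedged sketch of how Q could be evaluated for the multinomial mixture is given below. It assumes Q is the expected complete-data log-likelihood with the constant multinomial coefficient dropped, and that all entries of p are strictly positive. The names X, w, p and pi are assumptions for this example.

```python
import numpy as np

def q_value(X, w, p, pi):
    """Expected complete-data log-likelihood, up to an additive constant.

    X:  (n_docs, n_words) word counts
    w:  (n_docs, n_topics) weights from the E-step
    p:  (n_topics, n_words) topic-word probabilities, all > 0
    pi: (n_topics,) mixing weights
    """
    log_lik = X @ np.log(p).T  # (n_docs, n_topics)
    return float(np.sum(w * (log_lik + np.log(pi))))

# Tiny worked example with hard assignments (w is the identity).
X = np.array([[1.0, 0.0],
              [0.0, 2.0]])
p = np.array([[0.5, 0.5],
              [0.25, 0.75]])
pi = np.array([0.5, 0.5])
w = np.eye(2)
q = q_value(X, w, p, pi)  # equals 3*log(0.5) + 2*log(0.75)
```

Appending the returned value to a list once per iteration gives a sequence that can be plotted against the iteration number.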

Upon convergence, we graphed the probability with which each topic is selected and tabulated the 10 most probable words for each topic.
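Extracting the most probable words per topic can be sketched as follows. The vocabulary and probabilities are made-up toy values, and `top_words` is a hypothetical helper, not a function from the notebooks.

```python
import numpy as np

def top_words(p, vocab, n_top=10):
    """Return the n_top most probable words for each topic, most probable first."""
    order = np.argsort(-p, axis=1)[:, :n_top]  # sort each row in descending order
    return [[vocab[k] for k in row] for row in order]

vocab = ["network", "neuron", "bayes", "kernel", "spike"]  # toy vocabulary
p = np.array([[0.40, 0.30, 0.06, 0.04, 0.20],
              [0.06, 0.04, 0.50, 0.35, 0.05]])             # toy topic-word probabilities

table = top_words(p, vocab, n_top=3)
# table[0] is ['network', 'neuron', 'spike']
# table[1] is ['bayes', 'kernel', 'network']
```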

The relevant published notebook can be viewed here courtesy of Anaconda cloud.

Image Segmentation

We used a multivariate normal distribution model to segment images into 10, 20 and 50 clusters. We used a different convergence threshold for each image, in proportion to the value of Q.
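The following is a minimal sketch of such an EM loop with a stopping rule relative to Q. It assumes an identity covariance for every cluster and a hypothetical relative tolerance `rel_tol`; neither is confirmed by the notebooks, and the toy pixel values are synthetic.

```python
import numpy as np

def em_segment(pixels, mu, pi, rel_tol=1e-4, max_iter=100):
    """EM for a mixture of normals with identity covariance (an assumption).

    pixels: (n_pixels, n_channels), mu: (n_clusters, n_channels), pi: (n_clusters,)
    Stops when the change in Q is below rel_tol times the magnitude of Q.
    """
    q_history = []
    for _ in range(max_iter):
        # E-step: squared distances via ||x||^2 - 2 x.mu + ||mu||^2, no loops.
        d2 = ((pixels ** 2).sum(axis=1)[:, None]
              - 2.0 * pixels @ mu.T
              + (mu ** 2).sum(axis=1)[None, :])
        a = -0.5 * d2 + np.log(pi)
        a_max = a.max(axis=1, keepdims=True)
        log_norm = a_max + np.log(np.exp(a - a_max).sum(axis=1, keepdims=True))
        w = np.exp(a - log_norm)

        # Q at the current parameters, up to an additive constant.
        q = float(np.sum(w * a))
        q_history.append(q)
        if len(q_history) > 1 and abs(q - q_history[-2]) <= rel_tol * abs(q_history[-2]):
            break

        # M-step: weighted means and mixing weights (epsilon guards empty clusters).
        nk = w.sum(axis=0) + 1e-12
        mu = (w.T @ pixels) / nk[:, None]
        pi = nk / nk.sum()
    return w.argmax(axis=1), mu, q_history

rng = np.random.default_rng(0)
pixels = np.vstack([rng.normal(0.0, 0.5, size=(50, 3)),    # synthetic dark pixels
                    rng.normal(10.0, 0.5, size=(50, 3))])  # synthetic bright pixels
mu0 = np.array([[1.0, 1.0, 1.0], [9.0, 9.0, 9.0]])         # hypothetical starting means
labels, mu, q_history = em_segment(pixels, mu0, np.array([0.5, 0.5]))
```

Scaling the threshold by the magnitude of Q means that images with more pixels, and therefore a larger Q, are not held to an unreasonably tight absolute tolerance.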

As with the topic model, we initialized EM with starting points from k-means, but we also experimented with several different starting points for the sunset image. The results did not vary much across starting points.

The relevant published notebook can be viewed here courtesy of Anaconda cloud.

References

For theory and model construction, please refer to the chapter "Clustering using Probability Models" in David Forsyth's textbook on Applied Machine Learning.
