%%HTML
<link rel="stylesheet" type="text/css" href="custom.css">
<link rel="stylesheet" type="text/css" href="pandas-table.css">

# Visualizing topic models - GitHub pull requests

- __Serializable inputs in the `models` folder__
    - several serialized topic models (rpmdelpr, ldamodelpr, lsimodelpr, hpdmodelpr) built from the pull request dataset and the pull request dictionary
    - the pull request corpus (prcorpus)
    - the dictionary
- __Output__: graphical representation of the topics in each topic model

__Visualization only works with the LDA and HDP models__  

## References 
- [Topic modeling evaluation](https://datascience.blog.wzb.eu/2017/11/09/topic-modeling-evaluation-in-python-with-tmtoolkit/)
    Covers likelihood and perplexity, and evaluating the posterior distributions’ density or divergence.
- [Perplexity To Evaluate Topic Models](http://qpleple.com/perplexity-to-evaluate-topic-models/)
- [Python tutorial topic model](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)
    Useful for choosing a better value of K (the number of topics).
    The tutorial builds a basic topic model with Gensim’s LDA and visualizes the topics with pyLDAvis, then builds Mallet’s LDA implementation. It shows how to find the optimal number of topics using coherence scores, and how to aggregate and present the results to generate more actionable insights.
- [Paper: Latent Dirichlet Allocation](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
- [Wikipedia: Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocatio)

## Index
1. [Pull request dataset](#dataset)
2. [Visualization of the LDA model](#lda)
3. [Visualization of the HDP model](#hpd)
4. [Visualization of the LSI model](#lsi)

<a id='dataset'></a>
## Pull Request dataset
### Definitions
- [What is a pull request in the software development process?](https://github.com/features)
- [An example of a commented pull request](https://github.com/google/WebFundamentals/pull/4136)
### Corpus - Pull requests in GitHub repositories
Our dataset is drawn from several trending GitHub repositories and is current as of July 2017:
- ChartJS
- AngularJS
- CakePHP
- Play Framework
- WebFundamentals
- ElasticSearch
### Sentences - pull request 
We concatenate the following data attributes into a single attribute named `pull_request`:
- repository_owner
- repository_name
- repository_language
- pull_request_title
- pull_request_body 
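
As a rough sketch of this concatenation step, assuming each pull request is available as a dictionary keyed by the attribute names above. The `build_pull_request` helper and the sample record are hypothetical and are not taken from the original pipeline:

```python
# Attributes that are concatenated, in the order listed above.
FIELDS = [
    "repository_owner",
    "repository_name",
    "repository_language",
    "pull_request_title",
    "pull_request_body",
]


def build_pull_request(record):
    """Join the selected attributes into one text, skipping missing or empty ones."""
    parts = [str(record[field]).strip() for field in FIELDS if record.get(field)]
    return " ".join(parts)


# Hypothetical sample record, for illustration only.
sample = {
    "repository_owner": "some-owner",
    "repository_name": "some-repo",
    "repository_language": "Python",
    "pull_request_title": "Fix typo in README",
    "pull_request_body": None,  # an empty body is simply skipped
}
sample["pull_request"] = build_pull_request(sample)
print(sample["pull_request"])
```

Skipping empty attributes is a design choice made for this sketch; the original dataset may handle missing bodies differently.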

## Topic Models
[Gensim tutorial](https://radimrehurek.com/gensim/tut2.html)

### Visualize models

### Random Projections
RP aims to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness. Recommended target dimensionality is in the hundreds/thousands, depending on your dataset.
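
The idea can be illustrated with a minimal, standard-library sketch of a random projection. It uses a random matrix of ±1 entries, which is one common construction and not necessarily what Gensim's RP model does internally. The function names, the toy documents, and the target dimensionality of 50 are assumptions chosen for illustration:

```python
import math
import random


def random_projection_matrix(n_features, n_components, seed=0):
    """Build a dense matrix of +/-1 entries scaled by 1/sqrt(n_components)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [
        [rng.choice((-1.0, 1.0)) * scale for _ in range(n_features)]
        for _ in range(n_components)
    ]


def project(sparse_vector, matrix):
    """Project a sparse vector given as (feature_id, weight) pairs."""
    return [
        sum(row[feature_id] * weight for feature_id, weight in sparse_vector)
        for row in matrix
    ]


# Two toy documents in a 1000-term space (made-up TfIdf weights).
doc_a = [(3, 0.5), (42, 0.8), (999, 0.3)]
doc_b = [(3, 0.4), (42, 0.9), (500, 0.2)]

matrix = random_projection_matrix(n_features=1000, n_components=50)
low_a = project(doc_a, matrix)
low_b = project(doc_b, matrix)
print(len(low_a), len(low_b))
```

Distances between `low_a` and `low_b` only approximate the distances between the original vectors, and the approximation generally improves as the target dimensionality grows.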

In [14]:
## No visualization available

<a id='lda'></a>
### Latent Dirichlet Allocation
LDA is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).
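
To make the "soft mixture" idea concrete, here is a small sketch with made-up numbers. The two topics, their word probabilities, and the document's topic weights are invented for illustration and do not come from the pull request models:

```python
# Two made-up topics, each a probability distribution over a tiny vocabulary.
topics = {
    "bugfix": {"fix": 0.5, "test": 0.3, "docs": 0.2},
    "documentation": {"fix": 0.1, "test": 0.1, "docs": 0.8},
}

# A made-up document expressed as a soft mixture of the two topics.
doc_topic_weights = {"bugfix": 0.75, "documentation": 0.25}


def word_probability(word):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_topic_weights.items()
    )


doc_word_dist = {word: word_probability(word) for word in ("fix", "test", "docs")}
print(doc_word_dist)
```

Because each topic is a probability distribution and the topic weights sum to one, the resulting word probabilities for the document also sum to one.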


In [1]:
# Visualize the LDA model
import gensim.corpora as corpora
import gensim.models.ldamodel as ldamodel
import matplotlib.pyplot as plt
import pyLDAvis
import pyLDAvis.gensim  # required: importing pyLDAvis alone does not load this submodule

%matplotlib inline

# Load the persisted LDA model
ldamodelpr = ldamodel.LdaModel.load("./models/ldamodelpr")
# Load the dictionary
dictionary = corpora.Dictionary.load_from_text("./models/dict_pullrequets.txt")
# Load the corpus
corpus = corpora.MmCorpus("./models/prcorpus.mm")

# Prepare and display the interactive topic visualization
pyLDAvis.enable_notebook()
vislda = pyLDAvis.gensim.prepare(ldamodelpr, corpus, dictionary)
vislda



.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]


<a id='hpd'></a>
### Hierarchical Dirichlet Process
__HDP__ is a non-parametric Bayesian method (note the missing number of requested topics):

In [2]:
# Visualize the HDP model
import gensim.corpora as corpora
import gensim.models.hdpmodel as hdpmodel
import pyLDAvis
import pyLDAvis.gensim  # required: importing pyLDAvis alone does not load this submodule

# Load the persisted HDP model
hpdmodelpr = hdpmodel.HdpModel.load("./models/hpdmodelpr")
# Load the dictionary
dictionary = corpora.Dictionary.load_from_text("./models/dict_pullrequets.txt")
# Load the corpus
corpus = corpora.MmCorpus("./models/prcorpus.mm")

# Prepare and display the interactive topic visualization
pyLDAvis.enable_notebook()
vislda = pyLDAvis.gensim.prepare(hpdmodelpr, corpus, dictionary)
vislda


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]


<a id='lsi'></a>
### Latent Semantic Indexing
LSI (or sometimes LSA) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of a lower dimensionality.
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite: just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meantime.



In [3]:
# The LSI model cannot be visualized with pyLDAvis: prepare() raises an AttributeError
import gensim.corpora as corpora
import gensim.models.lsimodel as lsimodel
import pyLDAvis
import pyLDAvis.gensim  # required: importing pyLDAvis alone does not load this submodule

# Load the persisted LSI model
lsimodelpr = lsimodel.LsiModel.load("./models/lsimodelpr")
# Load the dictionary
dictionary = corpora.Dictionary.load_from_text("./models/dict_pullrequets.txt")
# Load the corpus
corpus = corpora.MmCorpus("./models/prcorpus.mm")

# This call fails because LsiModel has no `inference` method
pyLDAvis.enable_notebook()
vislda = pyLDAvis.gensim.prepare(lsimodelpr, corpus, dictionary)
vislda


AttributeError: 'LsiModel' object has no attribute 'inference'
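
The error above suggests that `pyLDAvis.gensim.prepare` calls an `inference` method on the model, which `LsiModel` lacks. Below is a minimal sketch of a guard that checks for this before attempting a visualization. `can_visualize` and the stand-in classes are hypothetical names used for illustration, and the check is based only on the error message shown here, not on a documented pyLDAvis contract:

```python
def can_visualize(model):
    """Return True if the model exposes a callable `inference` attribute."""
    return callable(getattr(model, "inference", None))


# Stand-in classes for illustration only; they are not the gensim classes.
class WithInference:
    def inference(self, chunk):
        return None


class WithoutInference:
    pass


print(can_visualize(WithInference()))     # True
print(can_visualize(WithoutInference()))  # False
```

In this notebook, `can_visualize(lsimodelpr)` would be expected to return `False`, given the `AttributeError` above.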

### LDA Mallet version

So far you have seen Gensim’s built-in version of the LDA algorithm. Mallet’s version, however, often gives better-quality topics.

Gensim provides a wrapper that runs Mallet’s LDA from within Gensim itself. You only need to download the zip file, unzip it, and provide the path to `mallet` in the unzipped directory to `gensim.models.wrappers.LdaMallet`.
[Download](https://www.machinelearningplus.com/wp-content/uploads/2018/03/mallet-2.0.8.zip)

In [4]:
# The LdaMallet wrapper cannot be visualized directly with pyLDAvis: prepare() raises an AttributeError
import gensim.corpora as corpora
import gensim.models.wrappers.ldamallet as ldamallet
import matplotlib.pyplot as plt
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # required: importing pyLDAvis alone does not load this submodule

%matplotlib inline

# Load the persisted LDA Mallet model
ldamalletmodelpr = ldamallet.LdaMallet.load("./models/ldamalletmodelpr")
# Load the dictionary
dictionary = corpora.Dictionary.load_from_text("./models/dict_pullrequets.txt")
# Load the corpus
corpus = corpora.MmCorpus("./models/prcorpus.mm")

# This call fails because LdaMallet has no `inference` method
pyLDAvis.enable_notebook()
vislda = pyLDAvis.gensim.prepare(ldamalletmodelpr, corpus, dictionary)
vislda

AttributeError: 'LdaMallet' object has no attribute 'inference'
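
One possible workaround, sketched here under the assumption of gensim 3.x (where the `gensim.models.wrappers` package still exists), is to convert the Mallet model into a regular `LdaModel` with `malletmodel2ldamodel` before calling `prepare`. The helper name `mallet_to_lda` is hypothetical, and this has not been run against the models in this notebook:

```python
def mallet_to_lda(mallet_model):
    """Convert an LdaMallet wrapper into a gensim LdaModel (gensim 3.x assumed)."""
    # Imported lazily so that defining the helper does not itself require gensim.
    from gensim.models.wrappers.ldamallet import malletmodel2ldamodel

    return malletmodel2ldamodel(mallet_model)
```

In the notebook, this would be used as `pyLDAvis.gensim.prepare(mallet_to_lda(ldamalletmodelpr), corpus, dictionary)`. The converted model may not reproduce Mallet's topics exactly, so the resulting visualization should be treated as an approximation.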