# Latent Dirichlet Allocation (LDA)

### Basic Concept behind LDA:

<span style="font-family:Papyrus; font-size:1.25em;">

<p>Latent Dirichlet Allocation (LDA) is a probabilistic method of text analysis for topic modeling.  It identifies the topics that exist within a set of documents and maps those documents to their associated topics.  The process typically uses a bag-of-words feature representation for the documents of interest.  In LDA, each document is described by a distribution over topics and each topic is described by a distribution over words.  There are two primary components to LDA.  The observed layer consists of the documents (also called composites) and the words that comprise those documents (the parts).  The hidden (or latent) layer consists of the topics (also called categories) as well as the other variables used by the algorithm.  The output of the algorithm is a list of the topics associated with the entire set of documents and the top words associated with each topic.  The topics themselves are identified only by integer indices; English descriptors are assigned to them afterwards, based on their top words.</p>

</span>

### Plate Notation for the LDA Algorithm (from "Intuitive Guide to Latent Dirichlet Allocation"):

<span style="font-family:Papyrus; font-size:1.25em;">

<p>The plate notation below represents the algorithm in a graphical format. α is the parameter of the Dirichlet prior that influences the topic-document distribution described by θ. η is the parameter of the Dirichlet prior that influences the word-topic distribution described by β.  While shown as single values below, α and η are actually 1-d vectors: α has one entry per topic (length k, which we specify), and η has one entry per vocabulary word (length V).</p><br><br>

<p>The largest plate surrounds all the variables related to a single document in the set of documents (M) that comprise the corpus of interest.  The plate indicates that the variables contained within are repeated M times, once for each document, which corresponds to a for loop in the pseudocode for the algorithm.</p><br><br>

<p>The smaller plate within the largest plate surrounds all the variables related to a single word within a single document.  The plate indicates that the variables contained within are repeated N times, once for each of the N words in each of the M documents.  This smaller plate represents a nested for loop within the outer for loop represented by the largest plate.</p><br><br>

<p>Within the smaller plate, the variable "z" represents a single topic drawn from the document's topic distribution θ; that topic is itself a distribution over words.  The variable "w" represents the actual word, drawn from the word distribution of topic z.</p><br><br>

<p>"w" is shaded because it is an observed variable belonging to the observed layer.  All other variables are unshaded because they belong to the latent (hidden) layer, which cannot be directly observed.</p><br><br>

<p>The directed edges between the circles indicate dependencies between the variables.  The variable at the head of an edge depends on the variable at its tail.</p><br><br>

<p>The topmost plate surrounds the β word-topic distribution and indicates a for loop where we determine the word-topic distribution for each topic in the set of topics (k).  This is similar to the largest plate surrounding the θ topic-document distribution, where a for loop determines the topic-document distribution for each document in the set of documents (M).</p><br><br>

</span>

![lda](lda-presentation-images/lda_model.jpeg)

#### List of Key Terminology and Notation (from "Intuitive Guide to Latent Dirichlet Allocation"):

<span style="font-family:Papyrus; font-size:1.25em;">

k — Number of topics a document belongs to (a fixed number).<br>

V — Size of the vocabulary.<br>

M — Number of documents.<br>

N — Number of words in each document.<br>

w — A word in a document. This is represented as a one-hot encoded vector of size V.<br>

z — A topic from a set of k topics. A topic is a distribution over words. For example, it might be Animal = (0.3 Cats, 0.4 Dogs, 0 AI, 0.2 Loyal, 0.1 Evil).<br>

α — Parameter of the Dirichlet prior that governs what the distribution of topics looks like for all the documents in the corpus.<br>

θ — Random matrix where θ(i,j) represents the probability of the i th document containing the j th topic.<br>

η — Parameter of the Dirichlet prior that governs what the distribution of words in each topic looks like.<br>

β — A random matrix where β(i,j) represents the probability of the i th topic containing the j th word.<br>

</span>

### Statistical formula for the LDA algorithm (from "Intuitive Guide to Latent Dirichlet Allocation"):

<span style="font-family:Papyrus; font-size:1.25em;">

![mathematical_model](lda-presentation-images/lda_equation.png)

##### English Layman's Translation:

Given a set of M documents each containing N words with each word generated by a topic from a set of k topics, find the joint posterior probability of:

θ — A distribution of topics, one for each document,<br>
z — A topic assignment for each of the N words in each document,<br>
β — A distribution of words, one for each topic,<br>

Given:

D — All the data we have (i.e. the corpus),<br>

Using the parameters:

α — A parameter vector for each document (document—topic distribution).<br>
η — A parameter vector for each topic (topic—word distribution).<br>


##### Joint posterior probability: 

The revised or updated probability of an event occurring given new information.<br>
Calculated by updating the prior probability using Bayes' Theorem.<br>
In other words, it is a conditional probability: the probability of event A occurring given that event B has occurred.<br>

##### Prior probability:

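The probability of an event occurring before new information is given.<br>
It is the starting belief that Bayes' Theorem updates into the posterior; in LDA, the Dirichlet distributions parameterized by α and η play this role.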

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

##### Dirichlet Distribution (example):

![dirichlet](lda-presentation-images/dirichlet_distribution.png)

1) Large values of α push the distribution toward the center.<br>
2) Small values of α push the distribution toward the edges.<br>

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

<p>The graphs above visualize Dirichlet distributions using 3 topics (k = 3).  The values for α (alpha) and η (eta) influence the shape of the graphs.  By shape, we mean the shape of the probability density function from which the θ and β distributions are drawn.  In this example, the graph is 3-d because we have k = 3 topics.  As k increases, the graphs would become k-dimensional Dirichlet distribution graphs.</p><br>

</span>
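The effect of α described above can be checked numerically. The following is a minimal sketch, assuming a symmetric α (the same value for every topic) and using only the Python standard library; the helper names `sample_dirichlet` and `mean_largest_share` are made up for this illustration. It relies on the standard fact that normalizing independent Gamma draws yields a Dirichlet sample.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha_value, k=3, n_samples=2000, seed=0):
    """Average size of the largest component across many samples."""
    rng = random.Random(seed)
    samples = (sample_dirichlet([alpha_value] * k, rng) for _ in range(n_samples))
    return sum(max(s) for s in samples) / n_samples

# Large alpha: samples cluster near the center (1/3, 1/3, 1/3).
centered = mean_largest_share(10.0)
# Small alpha: samples sit near an edge or corner, so one topic dominates.
cornered = mean_largest_share(0.1)
print(f"alpha=10.0 -> mean largest share {centered:.2f}")
print(f"alpha=0.1  -> mean largest share {cornered:.2f}")
```

A mean largest share close to 1/3 means the topic mixture is nearly uniform, while a value close to 1 means most of the probability mass sits on a single topic.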

### Pseudocode for the LDA algorithm:


    Assign topic (z) to each word (w) in each document (d) (randomly or based on some probabilistic distribution)

    while (NOT exhausted time constraints)

        for each document (d)
            for each word (w)
                for each topic (z)

                    Compute Probability(topic (z) | document (d))
                    Compute Probability(word (w) | topic (z))
                    Compute weight(z) = Probability(topic (z) | document (d)) * Probability(word (w) | topic (z))

                Assign new topic (z') to word (w) in document (d) (selected at random in proportion to the computed weights).


<span style="font-family:Papyrus; font-size:1.25em;">

<p>The algorithm for Latent Dirichlet Allocation iteratively assigns a topic to each word in each document based on the computed conditional probabilities of a topic belonging to a document and a word belonging to a topic.  This is repeated until the allocated compute time is exhausted.</p><br>

</span>
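The loop above can be sketched in runnable Python. This is one possible reading of the pseudocode, not a reference implementation: it assumes collapsed Gibbs sampling as the inference method, a fixed number of iterations in place of a time budget, and symmetric smoothing values `alpha` and `eta` chosen only for illustration. The function name `gibbs_lda` and all of its parameters are hypothetical.

```python
import random

def gibbs_lda(docs, k, iterations=50, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampling sketch for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Step 1: random initial topic for every word position.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    # Count tables: topic-word, document-topic, and total words per topic.
    topic_word = [{w: 0 for w in vocab} for _ in range(k)]
    doc_topic = [[0] * k for _ in docs]
    topic_total = [0] * k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            topic_word[t][w] += 1
            doc_topic[d][t] += 1
            topic_total[t] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                t = z[d][i]
                topic_word[t][w] -= 1
                doc_topic[d][t] -= 1
                topic_total[t] -= 1
                # weight is proportional to P(topic | document) * P(word | topic)
                weights = [
                    (doc_topic[d][topic] + alpha)
                    * (topic_word[topic][w] + eta)
                    / (topic_total[topic] + vocab_size * eta)
                    for topic in range(k)
                ]
                # Draw the new topic in proportion to the weights.
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                topic_word[t][w] += 1
                doc_topic[d][t] += 1
                topic_total[t] += 1
    return z, topic_word, doc_topic

docs = [
    ["eat", "broccoli", "banana"],
    ["chinchilla", "kitten", "cute"],
    ["hamster", "eat", "broccoli"],
]
z, topic_word, doc_topic = gibbs_lda(docs, k=2, iterations=20)
print(z)
```

Note that this sketch numbers topics from 0 and removes the word being resampled from the counts before computing its weights, which is the usual convention for Gibbs sampling.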

## A Simplified Latent Dirichlet Example:

### Topics for our example:

| Topics (k=2) |
|--------------|
| Topic 1      |
| Topic 2      |

<span style="font-family:Papyrus; font-size:1.25em;">

It should be noted that the topics in an LDA model are just integer indices and are not described by any noun, verb, etc.  We later assign "food" and "animals" as the descriptors for the two topics because the top words for each indexed topic are strongly associated with those descriptors.  The number of topics and the number of top words shown for each topic are hyperparameters set by the user.<br>

</span>

### Initial topic assignment for each word in each document:

|    Documents (M = 5, N = 3)    |   Word 1   |  Word 2  |  Word 3  |   |
|:------------------------------:|:----------:|:--------:|:--------:|:-:|
| Doc 1 Word Topic Assignment--> |      1     |     2    |     1    |   |
|           Document 1           |     eat    | broccoli |  banana  |   |
| Doc 2 Word Topic Assignment--> |      2     |     1    |     2    |   |
|           Document 2           |   banana   |  spinach |   lunch  |   |
| Doc 3 Word Topic Assignment--> |      1     |     2    |     1    |   |
|           Document 3           | chinchilla |  kitten  |   cute   |   |
| Doc 4 Word Topic Assignment--> |      2     |     1    |     2    |   |
|           Document 4           |   sister   |  kitten  |   today  |   |
| Doc 5 Word Topic Assignment--> |      1     |     2    |     1    |   |
|           Document 5           |   hamster  |    eat   | broccoli |   |

<span style="font-family:Papyrus; font-size:1.25em;">

The above is step 1 in the LDA algorithm pseudocode.  For the purposes of this example, we simply randomly assign a topic to each word for each document rather than use a probabilistic distribution.<br>

M = 5 indicates that we have five documents total.<br>
N = 3 indicates that we have 3 words per document.<br>

</span>

### The list of unique words in our vocabulary (V):

| Words (V = 11) |
|----------------|
| eat            |
| broccoli       |
| banana         |
| spinach        |
| lunch          |
| chinchilla     |
| kitten         |
| cute           |
| sister         |
| today          |
| hamster        |

<span style="font-family:Papyrus; font-size:1.25em;">

The above are all the unique words in our vocabulary across all documents.  These are the words to which we will assign topics from our set of topics (k).<br>

</span>
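As a small sketch, the vocabulary can be derived directly from the five example documents. The variable names `documents` and `vocabulary` are assumptions made for this illustration.

```python
documents = [
    ["eat", "broccoli", "banana"],
    ["banana", "spinach", "lunch"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "kitten", "today"],
    ["hamster", "eat", "broccoli"],
]

# dict.fromkeys keeps the first occurrence of each word, in order.
vocabulary = list(dict.fromkeys(word for doc in documents for word in doc))

print(len(vocabulary))  # V
print(vocabulary)
```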

### Computing the β (Beta) Distribution:
<br>
β — A distribution of words, one for each topic.<br>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-kiyi{font-weight:bold;border-color:inherit;text-align:left}
.tg .tg-u0o7{font-weight:bold;text-decoration:underline;border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-fymr{font-weight:bold;border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-xwhs{font-weight:bold;text-decoration:underline;border-color:inherit;text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
  </tr>
  <tr>
    <td class="tg-0pky"></td>
    <td class="tg-u0o7">Words</td>
    <td class="tg-fymr">eat</td>
    <td class="tg-fymr">broccoli</td>
    <td class="tg-fymr">banana</td>
    <td class="tg-fymr">spinach</td>
    <td class="tg-fymr">lunch</td>
    <td class="tg-fymr">chinchilla</td>
    <td class="tg-fymr">kitten</td>
    <td class="tg-fymr">cute</td>
    <td class="tg-fymr">sister</td>
    <td class="tg-fymr">today</td>
    <td class="tg-fymr">hamster</td>
  </tr>
  <tr>
    <td class="tg-xwhs">Topics</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-kiyi">1</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
  </tr>
  <tr>
    <td class="tg-fymr">2</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

To compute the Beta distribution, we look at our initial topic assignment for each word in each document.<br>

We count the number of times each word is assigned to each topic across all documents.<br>

For example, we see here that the word "eat" appears two times in total.  The first time "eat" appears (in Document 1), it is associated with Topic 1.  The second time "eat" appears (in Document 5), it is associated with Topic 2.<br>

Therefore, we put a 1 in the cell corresponding to Topic 1 and the Word "eat" and we also put a 1 in the cell corresponding to Topic 2 and the Word "eat".<br>

We do this for each word (w) in our vocabulary (V) across all documents (d) based on our initial topic assignment for each word in each document.<br>

Note: "placeholder" simply means that we are not inputting an actual value for the sake of simplicity in this example.

</span>
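The counting step can be sketched as follows, using the documents and initial topic assignments from the tables above. The names `documents`, `assignments`, and `topic_word_counts` are assumptions for this illustration.

```python
from collections import Counter

documents = [
    ["eat", "broccoli", "banana"],
    ["banana", "spinach", "lunch"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "kitten", "today"],
    ["hamster", "eat", "broccoli"],
]
# Initial topic assigned to each word position, one row per document.
assignments = [
    [1, 2, 1],
    [2, 1, 2],
    [1, 2, 1],
    [2, 1, 2],
    [1, 2, 1],
]

# topic_word_counts[(topic, word)] = times `word` is assigned to `topic`.
topic_word_counts = Counter(
    (topic, word)
    for doc, topics in zip(documents, assignments)
    for word, topic in zip(doc, topics)
)

for word in ("eat", "broccoli"):
    print(word, topic_word_counts[(1, word)], topic_word_counts[(2, word)])
```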

### Computing the θ (Theta) Distribution:
<br>
θ — A distribution of topics, one for each document.<br>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-kiyi{font-weight:bold;border-color:inherit;text-align:left}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-xwhs{font-weight:bold;text-decoration:underline;border-color:inherit;text-align:left}
.tg .tg-fymr{font-weight:bold;border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj"></th>
    <th class="tg-xwhs">Documents</th>
    <th class="tg-kiyi">1</th>
    <th class="tg-fymr">2</th>
    <th class="tg-fymr">3</th>
    <th class="tg-fymr">4</th>
    <th class="tg-fymr">5</th>
  </tr>
  <tr>
    <td class="tg-xwhs">Topics</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-kiyi">1</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">2</td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
  </tr>
  <tr>
    <td class="tg-fymr">2</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">2</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

To compute the Theta distribution, we look at our initial topic assignment for each word in each document.<br>

We count the number of words in each document that are assigned to each topic in our set of topics.<br>

Since we have three words per document, we see that Document 1 is associated with Topic 1 two times since two words are associated with Topic 1.  We also see that Document 1 is associated with Topic 2 one time since one word is associated with Topic 2.<br>

Therefore, we put a 2 in the cell corresponding to Topic 1 and Document 1 and we also put a 1 in the cell corresponding to Topic 2 and Document 1.<br>

We do this for each topic (z) for each document (d) based on our initial topic assignment for each word in each document.<br>

Note: "placeholder" simply means that we are not inputting an actual value for the sake of simplicity in this example.

</span>
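The document-topic counts follow from the same initial assignments. A minimal sketch, with `assignments` and `doc_topic_counts` as illustrative names:

```python
from collections import Counter

# Initial topic assigned to each word position, one row per document.
assignments = [
    [1, 2, 1],
    [2, 1, 2],
    [1, 2, 1],
    [2, 1, 2],
    [1, 2, 1],
]

# One Counter per document: topic -> number of words assigned to it.
doc_topic_counts = [Counter(topics) for topics in assignments]

for number, counts in enumerate(doc_topic_counts, start=1):
    print(f"Document {number}: Topic 1 = {counts[1]}, Topic 2 = {counts[2]}")
```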

### Updating the initial topic assignment for each word in each document:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268"></th>
    <th class="tg-s268"></th>
    <th class="tg-s268"></th>
    <th class="tg-0lax"></th>
    <th class="tg-0lax"></th>
    <th class="tg-0lax"></th>
  </tr>
  <tr>
    <td class="tg-0lax"></td>
    <td class="tg-0lax">Document 1</td>
    <td class="tg-0lax">Document 2</td>
    <td class="tg-0lax">Document 3</td>
    <td class="tg-0lax">Document 4</td>
    <td class="tg-0lax">Document 5</td>
  </tr>
  <tr>
    <td class="tg-0lax">Broccoli-Topic 1</td>
    <td class="tg-0lax">1 X 2 = 2</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
  </tr>
  <tr>
    <td class="tg-0lax">Broccoli-Topic 2</td>
    <td class="tg-0lax">1 X 1 = 1</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

In order to update our initial topic assignments for each word in each document, we look at the Beta and Theta distributions we calculated previously.<br>

Notice that "broccoli" is associated with Topic 1 one time and Topic 2 one time in the Beta distribution while Document 1 is associated with Topic 1 two times and Topic 2 one time in the Theta distribution.<br>

Now, to calculate the new topic (z) assignment for the word (w) "broccoli", we do some simple arithmetic operations.<br>

We multiply the value in the cell associated with Topic 1 and "broccoli" in the Beta distribution with the value in the cell associated with Topic 1 and Document 1 in the Theta distribution.  This gives us 1 X 2 = 2.<br>

We then multiply the value in the cell associated with Topic 2 and "broccoli" in the Beta distribution with the value in the cell associated with Topic 2 and Document 1 in the Theta distribution.  This gives us 1 X 1 = 1.<br>

##### Important Note:  This process is repeated for each word in each document BEFORE moving on to the next document.

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-88nc{font-weight:bold;border-color:inherit;text-align:center}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-uys7{border-color:inherit;text-align:center}
.tg .tg-y2k2{font-weight:bold;text-decoration:underline;border-color:inherit;text-align:center}
.tg .tg-7btt{font-weight:bold;border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-9353{font-weight:bold;text-decoration:underline;border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-uys7"></th>
    <th class="tg-y2k2">Documents (M = 5)</th>
    <th class="tg-88nc">Document 1</th>
    <th class="tg-7btt">Document 2</th>
    <th class="tg-7btt">Document 3</th>
    <th class="tg-7btt">Document 4</th>
    <th class="tg-7btt">Document 5</th>
  </tr>
  <tr>
    <td class="tg-9353">Words (in Vocabulary)</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-7btt">Eat-Topic 1</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
  </tr>
  <tr>
    <td class="tg-7btt">Eat-Topic 2</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
  </tr>
  <tr>
    <td class="tg-7btt">Broccoli-Topic 1</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow">1 X 2 = 2 --&gt; 2 / (2 + 1) = 2/3</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
  </tr>
  <tr>
    <td class="tg-7btt">Broccoli-Topic 2</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow">1 X 1 = 1 --&gt; 1 / (2 + 1) = 1/3</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
  </tr>
  <tr>
    <td class="tg-7btt">...</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-7btt">...</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-7btt">Hamster-Topic 1</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
  </tr>
  <tr>
    <td class="tg-7btt">Hamster-Topic 2</td>
    <td class="tg-c3ow"></td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
    <td class="tg-c3ow">placeholder</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

Next, we sum the resulting values for the above multiplications to obtain (1 X 2) + (1 X 1) = 3.<br>

Then, we divide the resulting values for each of the above multiplications by the value of the sum, 3.<br>

These normalized values are the probabilities we use to select a new topic to assign to the word "broccoli".<br>

In this case, there is a 2/3 ≈ 0.667 chance that we assign "broccoli" to Topic 1 in Document 1 and a 1/3 ≈ 0.333 chance that we assign "broccoli" to Topic 2 in Document 1.<br>

Notice that we are assigning a new topic to the word "broccoli" in Document 1 according to PROBABILITIES that are calculated using the arithmetic operations above.<br>

We are NOT simply arbitrarily assigning a new topic (z) to the word (w) "broccoli".  Everything is based on the Beta and Theta distributions and the conditional probabilities in the LDA pseudocode described above.<br>

We repeat this for each word (w) in each document (d) in our set of documents (M).<br>

</span>
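The arithmetic for "broccoli" in Document 1 can be reproduced with exact fractions. This sketch follows the simplified example as written (raw counts multiplied and then normalized); the dictionary names are assumptions for this illustration.

```python
from fractions import Fraction

# Counts read from the Beta and Theta tables for "broccoli" in Document 1.
word_topic_count = {1: 1, 2: 1}   # times "broccoli" is assigned to each topic
doc_topic_count = {1: 2, 2: 1}    # words in Document 1 assigned to each topic

# Multiply the two counts for each topic, then divide by the sum.
weights = {t: word_topic_count[t] * doc_topic_count[t] for t in (1, 2)}
total = sum(weights.values())
probabilities = {t: Fraction(w, total) for t, w in weights.items()}

print(weights)        # {1: 2, 2: 1}
print(probabilities)  # {1: Fraction(2, 3), 2: Fraction(1, 3)}
```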

### Updated topic assignment for "broccoli" in Document 1:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-c3ow">Documents (M = 5, N = 3)</th>
    <th class="tg-c3ow">Word 1</th>
    <th class="tg-c3ow">Word 2</th>
    <th class="tg-c3ow">Word 3</th>
    <th class="tg-c3ow"></th>
  </tr>
  <tr>
    <td class="tg-c3ow">Doc 1 Word Topic Assignment--&gt;</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow">2 --&gt; 1</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Document 1</td>
    <td class="tg-c3ow">eat</td>
    <td class="tg-c3ow">broccoli</td>
    <td class="tg-c3ow">banana</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Doc 2 Word Topic Assignment--&gt;</td>
    <td class="tg-c3ow">2</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow">2</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Document 2</td>
    <td class="tg-c3ow">banana</td>
    <td class="tg-c3ow">spinach</td>
    <td class="tg-c3ow">lunch</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Doc 3 Word Topic Assignment--&gt;</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow">2</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Document 3</td>
    <td class="tg-c3ow">chinchilla</td>
    <td class="tg-c3ow">kitten</td>
    <td class="tg-c3ow">cute</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Doc 4 Word Topic Assignment--&gt;</td>
    <td class="tg-c3ow">2</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow">2</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Document 4</td>
    <td class="tg-c3ow">sister</td>
    <td class="tg-c3ow">kitten</td>
    <td class="tg-c3ow">today</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Doc 5 Word Topic Assignment--&gt;</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow">2</td>
    <td class="tg-c3ow">1</td>
    <td class="tg-c3ow"></td>
  </tr>
  <tr>
    <td class="tg-c3ow">Document 5</td>
    <td class="tg-c3ow">hamster</td>
    <td class="tg-c3ow">eat</td>
    <td class="tg-c3ow">broccoli</td>
    <td class="tg-c3ow"></td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">
    
Look to the table above for the new topic (z) assigned to the word (w) "broccoli" ASSUMING that in using the probabilities we just calculated we decide on reassigning "broccoli" to Topic 1 in Document 1.<br>

It is important to know that we could also have assigned "broccoli" to Topic 2 instead.  However, based on the calculated probabilities for each topic (z) in our set of topics (k), a randomized selection is twice as likely to select Topic 1 as Topic 2 (since Topic 1 = 2/3 chance and Topic 2 = 1/3 chance).<br>

In an actual implementation of the LDA model, we would do this reassignment for each word (w) in each document (d) based on the probabilities calculated for each word (w) using the Beta and Theta distributions.<br>

However, we are not done yet with just the first iteration of the LDA algorithm.  We still need to update the values for the Beta and Theta distributions for the next iteration of the LDA algorithm.<br>

</span>
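The randomized selection itself can be sketched with a weighted draw from the standard library. The seed and variable names here are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # arbitrary seed so the sketch is repeatable
topics = [1, 2]
probabilities = [2 / 3, 1 / 3]

# One draw decides the new topic for "broccoli" in Document 1.
new_topic = rng.choices(topics, weights=probabilities)[0]

# Many draws show how often each topic would be selected in the long run.
draws = rng.choices(topics, weights=probabilities, k=9000)
share_topic_1 = draws.count(1) / len(draws)
print(new_topic, round(share_topic_1, 2))
```

Because the choice is random, a single draw can land on Topic 2 even though Topic 1 is more probable; only the long-run share approaches 2/3.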

### Computing the Updated β (Beta) Distribution:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"></th>
  </tr>
  <tr>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">Words</td>
    <td class="tg-0pky">eat</td>
    <td class="tg-0pky">broccoli</td>
    <td class="tg-0pky">banana</td>
    <td class="tg-0pky">spinach</td>
    <td class="tg-0pky">lunch</td>
    <td class="tg-0pky">chinchilla</td>
    <td class="tg-0pky">kitten</td>
    <td class="tg-0pky">cute</td>
    <td class="tg-0pky">sister</td>
    <td class="tg-0pky">today</td>
    <td class="tg-0pky">hamster</td>
  </tr>
  <tr>
    <td class="tg-xldj">Topics</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky"></td>
  </tr>
  <tr>
    <td class="tg-xldj">1</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">1 --&gt; 2</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
  </tr>
  <tr>
    <td class="tg-0pky">2</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">1 --&gt; 0</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
    <td class="tg-0pky">placeholder</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

Note that the cell associated with Topic 1 and "broccoli" has changed from 1 --> 2 and that the cell associated with Topic 2 and "broccoli" has changed from 1 --> 0.<br>

Refer to the updated topic assignment for "broccoli" in Document 1 in the table in the previous section.<br>

In that table, notice that the word (w) "broccoli" is now only associated with Topic 1 across all documents (d) and that the word (w) "broccoli" occurs twice across all documents (d).<br>

Therefore, we update the cell associated with Topic 1 and "broccoli" in the Beta distribution to 2 and we also update the cell associated with Topic 2 and "broccoli" in the Beta distribution to 0.<br>

We would do this for all words (w) in our vocabulary (V) for all topics (z) in our set of topics (k).<br>

</span>
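The recount described above can be sketched in a few lines of Python. The assignments below are a hypothetical toy corpus, not the full table from the previous section; only the two "broccoli" rows mirror the walkthrough, where both occurrences end up in Topic 1.

```python
from collections import Counter

# Hypothetical (document, word, topic) assignments after the reassignment step.
# Only the two "broccoli" rows mirror the walkthrough; the rest are made up.
assignments = [
    (1, "broccoli", 1),
    (1, "banana", 1),
    (1, "spinach", 1),
    (2, "broccoli", 1),
    (2, "hamster", 2),
    (2, "kitten", 2),
]


def topic_word_counts(assignments):
    """Count how many times each word is currently assigned to each topic."""
    counts = Counter()
    for _doc, word, topic in assignments:
        counts[(topic, word)] += 1
    return counts


beta_counts = topic_word_counts(assignments)
print(beta_counts[(1, "broccoli")])  # 2
print(beta_counts[(2, "broccoli")])  # 0
```

A `Counter` returns 0 for a (topic, word) pair that never occurs, which is why the Topic 2 cell for "broccoli" reads 0 without being stored explicitly.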

### Computing the Updated θ (Theta) Distribution:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268"></th>
    <th class="tg-s268">Documents</th>
    <th class="tg-s268">1</th>
    <th class="tg-0lax">2</th>
    <th class="tg-0lax">3</th>
    <th class="tg-0lax">4</th>
    <th class="tg-0lax">5</th>
  </tr>
  <tr>
    <td class="tg-s268">Topics</td>
    <td class="tg-s268"></td>
    <td class="tg-s268"></td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax"></td>
  </tr>
  <tr>
    <td class="tg-s268">1</td>
    <td class="tg-s268"></td>
    <td class="tg-s268">2 --&gt; 3</td>
    <td class="tg-0lax">1</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
  </tr>
  <tr>
    <td class="tg-0lax">2</td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax">1 --&gt; 0</td>
    <td class="tg-0lax">2</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
    <td class="tg-0lax">placeholder</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

Note that the cell associated with Topic 1 and Document 1 has changed from 2 --> 3 and that the cell associated with Topic 2 and Document 1 has changed from 1 --> 0.<br>

Refer to the updated topic assignment for "broccoli" in Document 1 in the table in the previous section.<br>

In that table, notice that Document 1 contains 3 words (N) that are now all associated with Topic 1.  So, there are no words in Document 1 that are associated with Topic 2.<br>

Therefore, we update the cell associated with Topic 1 and Document 1 in the Theta distribution to 3 and we also update the cell associated with Topic 2 and Document 1 in the Theta distribution to 0.<br>

We would do this for all documents (d) for all topics (z) in our set of topics (k).<br>

</span>
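The same kind of recount gives the document-topic table. In the sketch below the words are hypothetical, but the per-document totals for Documents 1 and 2 match the table above. The `smoothed_proportions` helper and the value `alpha=0.1` are assumptions added for illustration: they show one common way to turn a row of counts into proportions using a symmetric Dirichlet prior, not necessarily the exact formula used elsewhere in this notebook.

```python
# Hypothetical (document, word, topic) assignments; the per-document topic
# totals for Documents 1 and 2 match the table above, the words are made up.
assignments = [
    (1, "broccoli", 1),
    (1, "banana", 1),
    (1, "spinach", 1),
    (2, "broccoli", 1),
    (2, "hamster", 2),
    (2, "kitten", 2),
]


def doc_topic_counts(assignments, n_topics):
    """Return {doc: [count for topic 1, ..., count for topic n_topics]}."""
    counts = {}
    for doc, _word, topic in assignments:
        row = counts.setdefault(doc, [0] * n_topics)
        row[topic - 1] += 1
    return counts


def smoothed_proportions(row, alpha):
    """Turn one row of counts into proportions, smoothed by a symmetric alpha."""
    total = sum(row) + alpha * len(row)
    return [(count + alpha) / total for count in row]


theta_counts = doc_topic_counts(assignments, n_topics=2)
print(theta_counts[1])  # [3, 0]
print(theta_counts[2])  # [1, 2]

# alpha=0.1 is an assumed value, chosen only for this illustration.
theta_doc1 = smoothed_proportions(theta_counts[1], alpha=0.1)
print(theta_doc1)
```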

### We are finally finished with the FIRST iteration of the LDA algorithm:

<span style="font-family:Papyrus; font-size:1.25em;">

We would now start at Step 2 and rinse + repeat until we have exhausted our allocated compute time.<br>

</span>
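To show what "repeat" looks like in code, here is a minimal sketch of a collapsed Gibbs sampler in plain Python. It uses the standard collapsed Gibbs update, which may differ in detail from the hand-worked steps above. The toy corpus, the function name `gibbs_lda`, the prior values `alpha` and `eta`, and the fixed iteration count (instead of a compute-time budget) are all assumptions made for illustration. Topics are numbered from 0 here.

```python
import random


def gibbs_lda(docs, n_topics, n_iter, alpha=0.1, eta=0.01, seed=0):
    """Run n_iter collapsed Gibbs sweeps over tokenized docs."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # theta counts
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # beta counts
    topic_total = [0] * n_topics
    z = []  # current topic of every word occurrence

    # Give every word occurrence a random initial topic and count it.
    for d, doc in enumerate(docs):
        z.append([])
        for word in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v, k = word_id[word], z[d][i]
                # Remove the current assignment from all counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight of each topic: (topic in this doc) * (word in topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + eta) / (topic_total[t] + n_vocab * eta)
                    for t in range(n_topics)
                ]
                # Sample a new topic and add it back to the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word, vocab


# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["broccoli", "banana", "spinach"],
    ["broccoli", "hamster", "kitten"],
    ["hamster", "kitten", "kitten"],
]
z, doc_topic, topic_word, vocab = gibbs_lda(docs, n_topics=2, n_iter=50)
print(doc_topic)
```

Each sweep leaves the totals unchanged: the counts in every row of `doc_topic` still add up to the length of that document, because a word is only ever moved from one topic to another.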

# Scikit-Learn Latent Dirichlet Allocation on SLO Twitter Dataset:

### Implementation of LDA:

<span style="font-family:Papyrus; font-size:1.25em;">

<p>Our implementation of LDA uses the Scikit-Learn LatentDirichletAllocation class and the Python "lda" library.  We use Scikit-Learn's GridSearchCV class to perform an exhaustive grid search for the optimal hyperparameters to fit our Twitter dataset.  We preprocess our raw Twitter dataset and then run LDA several times, with the number of topics set to 3, 6, 12, and 20.  We display only the top 10 words that describe each topic.</p>
</span>

<span style="font-family:Papyrus; font-size:1.25em;">
    
Tweet preprocessing is done via a custom library imported as "lda_util" using "slo_lda_topic_extraction_utility_functions.py"<br>

</span>

### Import libraries and set parameters:

<span style="font-family:Papyrus; font-size:1.25em;">

Adjust log verbosity levels as necessary.<br>

Set to "DEBUG" to view all debug output.<br>
Set to "INFO" to view useful information on dataframe shape, etc.<br>

</span>

In [8]:
"""
Resources Used:

https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation
https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730

"""

# Import libraries.
import logging as log
import warnings
import tensorflow as tf
import time
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

#############################################################

# Miscellaneous parameter adjustments for pandas and python.
pd.options.display.max_rows = 10
# pd.options.display.float_format = '{:.1f}'.format
pd.set_option('display.precision', 7)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

"""
Turn log statements for various sections of code on/off.
(adjust log level as necessary)
"""
log.basicConfig(level=log.INFO)
# tf.logging.set_verbosity(tf.logging.INFO)


### Import preprocessing functions and preprocess Tweets:

<span style="font-family:Papyrus; font-size:1.25em;">
    
We preprocess our Twitter dataset as follows:<br>

1) Downcase all text.<br>
2) Remove "RT" tags.<br>
3) Replace URLs with slo_url.<br>
4) Replace Tweet mentions with slo_mention.<br>
5) Replace Tweet hashtags with slo_hashtag.<br>
6) Remove all punctuation in the Tweet.<br>
7) Remove all words we find to be irrelevant for topic extraction from the Tweet.<br>
8) Save the preprocessed Tweets to an external CSV file for use in LDA topic extraction.<br>

</span>
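Steps 1 through 7 can be sketched with the standard library as shown below. This is a minimal illustration, not the code in "lda_util": the function name `preprocess_tweet`, the regular expressions, and the `IRRELEVANT_WORDS` set are all hypothetical, and the sketch assumes that underscores are kept so the placeholder tokens stay intact. Step 8 (saving to CSV) is omitted.

```python
import re
import string

# Hypothetical set of words judged irrelevant for topic extraction.
IRRELEVANT_WORDS = {"amp", "via"}

# Punctuation to strip; the underscore is kept so placeholder tokens survive.
PUNCTUATION = string.punctuation.replace("_", "")


def preprocess_tweet(text, irrelevant_words=IRRELEVANT_WORDS):
    """Apply preprocessing steps 1-7 to a single Tweet."""
    text = text.lower()                                         # 1) downcase
    text = re.sub(r"\brt\b", " ", text)                         # 2) drop "RT" tags
    text = re.sub(r"https?://\S+", " slo_url ", text)           # 3) URLs
    text = re.sub(r"@\w+", " slo_mention ", text)               # 4) mentions
    text = re.sub(r"#\w+", " slo_hashtag ", text)               # 5) hashtags
    text = text.translate(str.maketrans("", "", PUNCTUATION))   # 6) punctuation
    words = [w for w in text.split() if w not in irrelevant_words]  # 7) filter
    return " ".join(words)


sample = "RT @user: Stop Adani! #coal https://t.co/abc via @news"
print(preprocess_tweet(sample))
# slo_mention stop adani slo_hashtag slo_url slo_mention
```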

In [12]:
# Import custom utility functions.
import slo_lda_topic_extraction_utility_functions as lda_util

# lda_util.tweet_dataset_preprocessor("D:/Dropbox/summer-research-2019/datasets/dataset_20100101-20180510_tok_PROCESSED.csv",
#                                     "D:/Dropbox/summer-research-2019/datasets/dataset_20100101-20180510_tok_LDA_PROCESSED2.csv", "tweet_t")

<span style="font-family:Papyrus; font-size:1.25em;">

The first parameter in our function call specifies the file path to the dataset to be preprocessed.  The second parameter specifies the location to save the CSV file to.  The third parameter specifies the name of the column in the dataset that contains the original Tweet text.<br>

The utility functions used above for data preprocessing and LDA topic extraction are available at:

https://github.com/J-Jinn/Summer-Research-2019/blob/master/slo_lda_topic_extraction_utility_functions.py

</span>

### Import and prepare the preprocessed dataset for use in LDA topic extraction:

<span style="font-family:Papyrus; font-size:1.25em;">
    
Refer to the code comments for the specific steps performed.<br>
Note that we have to use absolute file paths in Jupyter notebook as opposed to relative file paths in PyCharm.<br>

</span>

In [13]:
# Import the dataset.
# tweet_dataset_processed = \
#     pd.read_csv("datasets/dataset_20100101-20180510_tok_LDA_PROCESSED.csv", sep=",")

# Import the dataset.
tweet_dataset_processed = \
    pd.read_csv("D:/Dropbox/summer-research-2019/datasets/dataset_20100101-20180510_tok_LDA_PROCESSED.csv", sep=",")

# Reindex and shuffle the data randomly.
tweet_dataset_processed = tweet_dataset_processed.reindex(
    np.random.permutation(tweet_dataset_processed.index))

# Generate a Pandas dataframe.
tweet_dataframe_processed = pd.DataFrame(tweet_dataset_processed)

# Drop any NaN or empty Tweet rows in dataframe (or else CountVectorizer will blow up).
tweet_dataframe_processed = tweet_dataframe_processed.dropna()

# Print shape and column names.
log.info("\n")
log.info("The shape of our preprocessed SLO dataframe with NaN (empty) rows dropped:")
log.info(tweet_dataframe_processed.shape)
log.info("\n")
log.info("The columns of our preprocessed SLO dataframe with NaN (empty) rows dropped:")
log.info(tweet_dataframe_processed.head)
log.info("\n")

# Reindex everything.
tweet_dataframe_processed.index = pd.RangeIndex(len(tweet_dataframe_processed.index))

# Assign column names.
tweet_dataframe_processed_column_names = ['Tweet']

# Rename column in dataframe.
tweet_dataframe_processed.columns = tweet_dataframe_processed_column_names

# Create input feature.
selected_features = tweet_dataframe_processed[tweet_dataframe_processed_column_names]
processed_features = selected_features.copy()

# Check what we are using as inputs.
log.debug("\n")
log.debug("The Tweets in our input feature:")
log.debug(processed_features['Tweet'])
log.debug("\n")

# Create feature set.
slo_feature_set = processed_features['Tweet']

INFO:root:

INFO:root:The shape of our preprocessed SLO dataframe with NaN (empty) rows dropped:
INFO:root:(653094, 1)
INFO:root:

INFO:root:The columns of our preprocessed SLO dataframe with NaN (empty) rows dropped:
INFO:root:<bound method NDFrame.head of                                                   tweet_t
653632  breaking deputy pm barnaby joyce warns that la...
362021  yesterdays biggest risers were ltd up 1456 sla...
480819  last year three companies paid zero tax on 514...
46623   bulga residents take fight against tintos ridi...
439289  annastacia palaszczuk confronted by antiadani ...
...                                                   ...
4456    aicle galillee coal project a can of financial...
85795   those indian influencer accounts are still swa...
263707  oh girl you have no idea ahah and its cause sh...
265613  infographic heres exactly what adanis mine mea...
285820  cuts 700 jobs in queensland coal about 700 job...

[653094 rows x 1 columns]>
INFO:root:



<span style="font-family:Papyrus; font-size:1.25em;">

The log.info messages above show the shape and contents of the preprocessed dataframe after dropping any "NaN" rows.  A row is "NaN" when the Tweet consisted entirely of irrelevant words and is empty once those words are removed.<br>

</span>

### Perform the topic extraction:

<span style="font-family:Papyrus; font-size:1.25em;">

We use the Scikit-Learn CountVectorizer class to vectorize our Tweet text.  We set the max_features parameter to 1000 to limit the vocabulary to the 1000 words with the highest term frequencies across the corpus.  We set the stop_words parameter to 'english' to remove English stop words using Scikit-Learn's built-in stop word list.  The min_df and max_df parameters set document-frequency thresholds: with min_df=2 we ignore words that appear in fewer than 2 Tweets, and with max_df=0.95 we ignore words that appear in more than 95% of Tweets.<br>

We use the Scikit-Learn LatentDirichletAllocation class with the hyperparameters below to fit our Tweet data.  The parameter n_components (called n_topics in Scikit-Learn releases before 0.19) controls the number of topics to extract.  The parameter max_iter controls the maximum number of passes over the training data.  The parameter learning_method selects whether the model is updated with batch or online variational Bayes.<br>

We use a utility function to display Topics 1-20 and the top 10 words associated with each Topic.<br>

</span>
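To make the effect of the min_df, max_df, and max_features settings concrete, the sketch below re-implements the vocabulary filter in plain Python on a hypothetical four-document corpus. It illustrates the idea only and is not CountVectorizer's actual code: the function name `filter_vocabulary` is made up, tokenization is a simple whitespace split, stop word removal is left out, and ties under max_features are broken alphabetically as an arbitrary choice.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=2, max_df=0.95, max_features=None):
    """Keep words whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each word
    term_freq = Counter()  # total occurrences of each word in the corpus
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    kept = [w for w, df in doc_freq.items() if min_df <= df <= max_df * n_docs]
    if max_features is not None:
        # Keep the most frequent words; ties broken alphabetically (arbitrary).
        kept.sort(key=lambda w: (-term_freq[w], w))
        kept = kept[:max_features]
    return sorted(kept)


# Hypothetical mini-corpus of preprocessed Tweets.
docs = [
    "coal mine jobs",
    "coal reef climate",
    "coal jobs tax",
    "coal climate tax",
]
print(filter_vocabulary(docs))  # ['climate', 'jobs', 'tax']
```

Here "coal" is dropped because it appears in all 4 documents (more than 95% of them), while "mine" and "reef" are dropped because each appears in only 1 document.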

In [4]:
from sklearn.decomposition import LatentDirichletAllocation

# LDA requires raw term counts because it is a probabilistic graphical model.
# Note: "tf" below shadows the tensorflow alias imported earlier.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
tf = tf_vectorizer.fit_transform(slo_feature_set)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0.
tf_feature_names = tf_vectorizer.get_feature_names_out()

# Run LDA (n_topics was renamed to n_components in scikit-learn 0.19).
lda = LatentDirichletAllocation(n_components=20, max_iter=5, learning_method='online', learning_offset=50.,
                                random_state=0).fit(tf)

# Display the top words for each topic.
lda_util.display_topics(lda, tf_feature_names, 10)
    

Topic 0:
tax pay energy thanks ceo latest high corporate story office
Topic 1:
labor australian foescue going federal election said giant political hey
Topic 2:
stop coal rail fund line clean seam gas protest global
Topic 3:
news make right lnp wants did thats local shares watch
Topic 4:
climate public money change fight business world national policy doing
Topic 5:
jobs new coal iron ore plans beach create really end
Topic 6:
time year mines paid industry price look australias companies profit
Topic 7:
turnbull funding oil years banks biggest planet cuts despite dead
Topic 8:
queensland india coal power cou minister alp jobs approval environment
Topic 9:
ahead adani council come carbon politicians fossil week production mega
Topic 10:
project gas slocashn narrabri vote getting gov noh pipeline sign
Topic 11:
water say basin galilee way barnaby joyce repo canavan let
Topic 12:
want know coal state group campaign oppose join forest tell
Topic 13:
government environmental labor cut loan 

<span style="font-family:Papyrus; font-size:1.25em;">

We cannot find an association among the 10 words in each Topic strong enough to assign an English descriptor to each topic, such as "economic", "environmental", "social", etc.<br>

</span>

### Results from a different execution of LDA topic extraction on our dataset (using PyCharm):

<span style="font-family:Papyrus; font-size:1.25em;">

These results were obtained using the exact same code-base and hyper parameters, only it was done within PyCharm rather than the Jupyter Notebook.<br>

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

Though the results are different, the top 10 words for each of the 20 different Topics still lack any strong association to each other.  We still cannot easily assign any English descriptors to each topic.<br>

We also timed the LDA model.  Each execution takes roughly 450 seconds to finish LDA topic extraction, so it is not a particularly fast process.<br>

</span>

### Results from LDA topic extraction for 3 topics (using PyCharm):

<span style="font-family:Papyrus; font-size:1.25em;">
  
Again, we can't really discern any noticeable patterns among the top 10 words for each topic.<br>

</span>

### Results from LDA topic extraction for 6 topics (using PyCharm):

<span style="font-family:Papyrus; font-size:1.25em;">

As with 3 topics, we cannot discern any noticeable patterns among the top 10 words for each topic.<br>

</span>

### Results from LDA topic extraction for 12 topics (using PyCharm):

<span style="font-family:Papyrus; font-size:1.25em;">

Again, no noticeable patterns emerge among the top 10 words for each topic.<br>

Interestingly, LDA topic extraction appears to take longer when we specify fewer topics.  We surmise this is because our corpus is large: 650k+ Tweets, each treated as a separate document.  Assigning 650k+ documents to a mere 3 topics, or in general to a number of topics much smaller than the number of documents, may take the algorithm more time than spreading them over many topics.<br>

</span>

## Exhaustive grid search for Scikit-Learn LDA:

<span style="font-family:Papyrus; font-size:1.25em;">

We use Scikit-Learn's Pipeline class to construct a pipeline consisting of the CountVectorizer and LatentDirichletAllocation classes.<br>

The "parameters" dictionary determines all the possible combinations of hyperparameters we will test in order to find the optimal hyperparameters for the Scikit-Learn LDA model.<br>

The grid search is performed by fitting on the Twitter data we wish to use for topic extraction.<br>

The optimal hyperparameters are displayed via "log.info" messages, so the log verbosity level must be set appropriately to view them.<br>

We recommend executing this only on a supercomputer, as otherwise it will take an extremely long time to finish, depending on the number of possible combinations of hyperparameters as defined below.<br>

</span>

In [17]:
# What parameters do we search for?
lda_search_parameters = {
    # 'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
    'clf__n_components': [1, 5, 10, 15],
    'clf__doc_topic_prior': [None],
    'clf__topic_word_prior': [None],
    'clf__learning_method': ['batch', 'online'],
    'clf__learning_decay': [0.5, 0.7, 0.9],
    'clf__learning_offset': [5, 10, 15],
    'clf__max_iter': [1],
    'clf__batch_size': [64, 128, 256],
    'clf__evaluate_every': [0],
    'clf__total_samples': [1e4, 1e6, 1e8],
    'clf__perp_tol': [1e-1, 1e-2, 1e-3],
    'clf__mean_change_tol': [1e-1, 1e-3, 1e-5],
    'clf__max_doc_update_iter': [50, 100, 150],
    'clf__n_jobs': [-1],
    'clf__verbose': [0],
    'clf__random_state': [None],
}
# lda_util.latent_dirichlet_allocation_grid_search(slo_feature_set, lda_search_parameters)

<span style="font-family:Papyrus; font-size:1.25em;">

If running this code snippet on a non-workstation PC, you may want to change "n_jobs=-1" to "n_jobs=1" (a single core) or "n_jobs=-2" (all cores but one) to prevent Python from using every CPU core and bogging down your system for the duration of the search.<br>

The utility functions used above for data preprocessing and LDA topic extraction are available at:

https://github.com/J-Jinn/Summer-Research-2019/blob/master/slo_lda_topic_extraction_utility_functions.py

</span>
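To get a feel for why the search is so expensive, one can count the hyperparameter combinations in the grid. The sketch below copies the dictionary from the cell above and multiplies the number of candidate values per key; the variable name `n_combinations` is introduced here for illustration.

```python
from math import prod

# Copy of the search grid defined in the cell above.
lda_search_parameters = {
    'clf__n_components': [1, 5, 10, 15],
    'clf__doc_topic_prior': [None],
    'clf__topic_word_prior': [None],
    'clf__learning_method': ['batch', 'online'],
    'clf__learning_decay': [0.5, 0.7, 0.9],
    'clf__learning_offset': [5, 10, 15],
    'clf__max_iter': [1],
    'clf__batch_size': [64, 128, 256],
    'clf__evaluate_every': [0],
    'clf__total_samples': [1e4, 1e6, 1e8],
    'clf__perp_tol': [1e-1, 1e-2, 1e-3],
    'clf__mean_change_tol': [1e-1, 1e-3, 1e-5],
    'clf__max_doc_update_iter': [50, 100, 150],
    'clf__n_jobs': [-1],
    'clf__verbose': [0],
    'clf__random_state': [None],
}

# An exhaustive grid tries every combination: multiply the list lengths.
n_combinations = prod(len(values) for values in lda_search_parameters.values())
print(n_combinations)  # 17496
```

GridSearchCV fits each combination once per cross-validation fold, so the total number of model fits is this count multiplied by however many folds the utility function configures.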

### Exhaustive grid search for Scikit-Learn LDA using subset of Twitter dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

Here, we implement an exhaustive grid search using a smaller subset of the entire Twitter dataset.  This cuts down on the computational time required to finish the search.  We have a large dataset of over 650k Tweets, so using the full dataset drastically increases the search time.<br>

The first parameter for the "dataframe_subset" function dictates the full dataset you wish to subset while the second parameter defines the number of rows (examples) desired for the subset of the full dataset.<br>

</span>

In [None]:
data_subset = lda_util.dataframe_subset(tweet_dataset_processed, 10000)
lda_util.latent_dirichlet_allocation_grid_search(data_subset, lda_search_parameters)

<span style="font-family:Papyrus; font-size:1.25em;">

Placeholder.

</span>

# LDA Topic Extraction using the "lda" library and collapsed Gibbs Sampling:

<span style="font-family:Papyrus; font-size:1.25em;">

The code below uses the "lda" Python library package that performs LDA topic extraction using collapsed Gibbs Sampling.<br>
This is different from the Scikit-Learn implementation that uses online variational inference.<br>
Otherwise, the dataset is the same and we are still using Scikit-Learn's CountVectorizer class to vectorize our data.<br>

</span>

In [7]:
import lda

# LDA requires raw term counts because it is a probabilistic graphical model.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
tf = tf_vectorizer.fit_transform(slo_feature_set)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0.
tf_feature_names = tf_vectorizer.get_feature_names_out()

# Train and fit the LDA model.
model = lda.LDA(n_topics=20, n_iter=100, random_state=1)
model.fit(tf)  # model.fit_transform(X) is also available
topic_word = model.topic_word_  # model.components_ also works
n_top_words = 10

# Display each topic and its top words (highest-probability words first).
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(tf_feature_names)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

INFO:lda:n_documents: 653094
INFO:lda:vocab_size: 1000
INFO:lda:n_words: 3267212
INFO:lda:n_topics: 20
INFO:lda:n_iter: 100
INFO:lda:<0> log likelihood: -33566609
INFO:lda:<10> log likelihood: -27759459
INFO:lda:<20> log likelihood: -24357244
INFO:lda:<30> log likelihood: -23217301
INFO:lda:<40> log likelihood: -22887465
INFO:lda:<50> log likelihood: -22760021
INFO:lda:<60> log likelihood: -22691980
INFO:lda:<70> log likelihood: -22648547
INFO:lda:<80> log likelihood: -22621636
INFO:lda:<90> log likelihood: -22601056
INFO:lda:<99> log likelihood: -22580145


Topic 0: labor stop greens vote alp lnp election shoen suppo win
Topic 1: time good going know im really think did dont right
Topic 2: beach day people join watch action tour stop bob sydney
Topic 3: coal adanis cou point approval federal new green giant light
Topic 4: jobs create thousands tourism coal 10000 pm adanis job cou
Topic 5: reef coal barrier stop save turnbull coral indian canavan minister
Topic 6: tax paid energy australian pay ceo companies donations origin chevron
Topic 7: gas project narrabri seam coal forest dam barnaby water joyce
Topic 8: climate change future coal energy clean fossil time planet global
Topic 9: iron ore foescue oil shares production price prices profit year
Topic 10: coal money fund banks funding billion adanis taxpayers project govt
Topic 11: tax pay company corporate workers cut use profits debt cuts
Topic 12: coal power india new solar mines environmental record company renewables
Topic 13: coal australian fuher foescue creek assets stranded asse

<span style="font-family:Papyrus; font-size:1.25em;">

The results seem to be as incoherent as the Scikit-Learn implementation of LDA topic extraction using online variational inference.<br>

It's difficult to see any correlation between the 10 top words for each topic.<br>

Here, we use n_iter=100 (iterations) because fitting to our Twitter data is much faster than with the Scikit-Learn implementation, where max_iter=5 already takes about 450 seconds.<br>

</span>

### A second set of LDA topic extraction results using the "lda" library (in PyCharm) with 1000 iterations:

<span style="font-family:Papyrus; font-size:1.25em;">

These results are from another execution using the same library as the previous results.<br>

Again, there are no discernible patterns in the choice of top words across all Topics.<br>

</span>

### A third set of LDA topic extraction results using the "lda" library (in PyCharm) with 1000 iterations and 3 topics:

<span style="font-family:Papyrus; font-size:1.25em;">

Same situation.  Difficult to discern any patterns among the top words chosen for each topic.<br>

</span>

### A fourth set of LDA topic extraction results using the "lda" library (in PyCharm) with 1000 iterations and 6 topics:

<span style="font-family:Papyrus; font-size:1.25em;">

Ditto.<br>

</span>

### A fifth set of LDA topic extraction results using the "lda" library (in PyCharm) with 1000 iterations and 12 topics:

<span style="font-family:Papyrus; font-size:1.25em;">

And Ditto.<br>

</span>

### Why does it work poorly on Tweets?

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Based on Derek Fisher's senior project presentation:

1) LDA typically works best when the documents are lengthy (large word count) and written in a formal, proper style.

2) Tweet text is generally very short, with a maximum of 280 characters (140 before late 2017).

3) Tweet text is generally written in a very informal style.

    i) emojis.
    ii) spelling errors.
    iii) other grammatical errors.
    iv) etc.

4) The above makes it difficult for the LDA algorithm to discover any prominent underlying hidden structures.

</span>
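The "lda" log output earlier in this notebook gives a rough feel for point 2. Dividing the logged n_words by n_documents shows how few vocabulary words remain per Tweet after vectorization. This is back-of-the-envelope arithmetic on the two logged numbers only; the variable names are introduced here for illustration.

```python
# Values copied from the "INFO:lda:" log lines earlier in this notebook.
n_documents = 653094  # INFO:lda:n_documents
n_words = 3267212     # INFO:lda:n_words

avg_words_per_tweet = n_words / n_documents
print(f"{avg_words_per_tweet:.2f} vocabulary words per Tweet on average")
```

At roughly 5 counted words per document, each Tweet offers very few word co-occurrences for the model to learn from.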

## Some current-ish research on Twitter topic modeling and topic modeling in general:

<span style="font-family:Papyrus; font-size:1.25em;">
    
https://www.aclweb.org/anthology/W17-0210

"Twitter Topic Modeling by Tweet Aggregation"

https://www.media.mit.edu/publications/topic-modeling-in-twitter-aggregating-tweets-by-conversations/

"Topic Modeling in Twitter: Aggregating Tweets by Conversations"

https://arxiv.org/ftp/arxiv/papers/1206/1206.3297.pdf

"Hybrid Variational/Gibbs Collapsed Inference in Topic Models"

https://www.researchgate.net/publication/318726050_A_Hierarchical_Topic_Modelling_Approach_for_Tweet_Clustering

"A Hierarchical Topic Modelling Approach for Tweet Clustering"

https://www.researchgate.net/publication/262244963_A_biterm_topic_model_for_short_texts

"A Biterm Topic Model for Short Texts"

</span>

## Resources Used:

<span style="font-family:Papyrus; font-size:1.25em;">

https://en.wikipedia.org/wiki/Dirichlet_distribution

Wikipedia page on Dirichlet distributions.<br>

https://en.wikipedia.org/wiki/Plate_notation

Wikipedia page on Plate notation.<br>

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Wikipedia page on LDA.<br>

https://www.tablesgenerator.com/markdown_tables

Easy-to-use table generator for markdown, html, latex, etc.<br>

https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

Used two diagrams, the formula, and the explanation of the associated notation for LDA.<br>

https://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

Utilized blog's example as the basis for the explanation of the LDA algorithm pseudocode.<br>

https://www.coursera.org/learn/ml-clustering-and-retrieval

Information on collapsed Gibbs sampling and variational inference in relation to LDA.<br>

https://www.investopedia.com/terms/p/posterior-probability.asp

Explanation of statistical terminology including posterior and prior probability.<br>

https://cs.calvin.edu/courses/cs/x95/videos/2018-2019/

Used Derek Fisher's explanation of why LDA does not work well on Tweets (with Scikit-Learn standard implementation).<br>

</span>

## Notes:

TODO - consider implementing NMF - Non-Negative Matrix Factorization for Topic Modeling.

https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730

https://stackabuse.com/python-for-nlp-topic-modeling/

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html

</span>