# Topic Modeling

[1. Bag-of-Words in More Detail](#1)<br>
[2. Latent Variables](#2)<br>
[3. Matrix Representation of Latent Dirichlet Allocation](#3)<br>

> [3.1: Picking Topics](#3.1)<br>

[4. Beta Distributions](#4)<br>
[5. Dirichlet Distribution](#5)<br>
[6. More on Latent Dirichlet Allocation](#6)<br>
[7. Sample a Topic](#7)<br>
[8. Sample a Word](#8)<br>
[9. Combining the Models](#9)<br>

> [9.1 Summary](#9.1)<br>

[10. Topic Modeling Lab](#10)<br>

## References
In this section, we'll be following this article by David Blei, Andrew Ng, and Michael Jordan.
* [Latent Dirichlet Allocation](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

# <a id='1'>1: Bag-of-Words (BoW) in More Detail</a>

If you think about the BoW model graphically, it represents the relationship between a set of document objects and a set of word objects.

Assume we have the article, *Space exploration. A vote to explore space has been explored*, and that we have done a good job processing the text (case, stemming, lemmatization, etc.)
* There are three main terms: **space**, **vote**, and **explore**
* To find the probability of each term appearing in the article, we divide the count of each term by the total number of terms
* We have three parameters - probabilities for each term ( $p(\text{space|article})$, $p(\text{vote|article})$, $p(\text{explore|article})$ )
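
The term probabilities described above can be sketched in a few lines. This is a minimal illustration, assuming the preprocessing step has already removed stop words and reduced "exploration" and "explored" to "explore"; the `tokens` list is that assumed result, not the output of a real pipeline.

```python
from collections import Counter

# Assumed result of preprocessing the example article (stop words removed,
# "exploration" and "explored" reduced to "explore").
tokens = ["space", "explore", "vote", "explore", "space", "explore"]

counts = Counter(tokens)
total = sum(counts.values())

# P(t | article) = count of term t / total number of terms
term_probs = {term: count / total for term, count in counts.items()}

for term, prob in term_probs.items():
    print(f"p({term}|article) = {prob:.3f}")
```

The three values sum to one, which is what lets us treat them as a probability distribution over terms for this article.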

To add some notation:
* d: documents (units of groups of terms to be analyzed)
* t: terms (elements that compose documents)
* P(t|d): probability of a term appearing in the document ("For any given document, $d$, and observed term, $t$, how likely is it that the document $d$ generated the term $t$")

<img src="assets/images/03/img_01.png" width=700 align='center'>

Now, if we do this for many documents, say 500, and many terms, say 1,000, we can get something of the sort:

<img src="assets/images/03/img_02.png" width=700 align='center'>

If we have 500,000 parameters, that is a lot of parameters to figure out. We can reduce the number of parameters and still keep most of the information by representing the terms in a latent space. This is commonly known as **topic modeling**.

# <a id='2'>2: Latent Variables</a>

Consider adding to the model the notion of a small set of topics or latent variables (or themes) that actually drive the generation of words in each document. So in this model, any document is considered to have an underlying mixture of topics associated with it. Similarly, a topic is considered to be a mixture of terms that it is likely to generate.

If we take our **documents**, our **terms**, and assert there are a number of **topics**, say 3, then we have 2-sets of probability distributions:
1. $p(\text{z|d})$: topic-document probability (probability of topic $z$ given a document $d$)
2. $p(\text{t|z})$: term-topic probability (probability of a term $t$ given a topic $z$)

The probability of a term given a document, $p(\text{t|d})$, can now be expressed as a sum over topics of the product of the two previous probabilities:
$$P\left( t \mid d \right) = \sum_{z} P\left( t \mid z \right) \cdot P\left( z \mid d \right)$$
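
As a small worked example of this sum, here is a sketch for a single document and a single term. All probability values and the topic names `topic_1` to `topic_3` are made up for illustration.

```python
# Hypothetical values for one document d and one term t
p_topic_given_doc = {"topic_1": 0.7, "topic_2": 0.1, "topic_3": 0.2}      # P(z | d)
p_term_given_topic = {"topic_1": 0.05, "topic_2": 0.30, "topic_3": 0.02}  # P(t | z)

# P(t | d) = sum over topics z of P(t | z) * P(z | d)
p_term_given_doc = sum(
    p_term_given_topic[z] * p_topic_given_doc[z] for z in p_topic_given_doc
)
print(round(p_term_given_doc, 3))
```

With these numbers the three products are 0.035, 0.030, and 0.004, so the term has probability 0.069 in this document.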

<img src="assets/images/03/img_03.png" width=700 align='center'>

Now, the number of parameters is: (number of documents * number of topics) + (number of topics * number of terms)
* 500 documents, 10 topics, 1,000 terms: (500 * 10) + (10 * 1,000) = 15,000
> * Note: same number of documents and terms as before, but far fewer parameters than 500,000!

This is called **Latent Dirichlet Allocation** or LDiA for short.

# <a id='3'>3: Matrix Representation of Latent Dirichlet Allocation</a>

LDiA is an example of matrix factorization.

The idea is as follows:<br>
<img src="assets/images/03/img_04.png" width=700 align='center'>

* We go from a BoW model to an LDiA model
> * The BoW on the left basically says "our probability of, say, the word 'tax' being generated by the second document is the label of the white arrow"
> * In the LDiA on the right, that probability is calculated from the white arrows: multiply the probability of a term $t$ (say 'tax') given a topic $z$ (say 'politics') by the probability of that topic $z$ given the document $d$, then add these products over all topics

Then, you can have a BoW matrix, composed of terms as columns and documents as rows, like on the bottom left, equal to, or represented by, the product of two matrices:
1. tall skinny matrix of documents as rows and topics as columns
2. wide flat matrix of topics as rows and terms as columns

In this case, the entry of the second document for the term tax, will be equal to the inner product of the corresponding row and column in the matrices on the right
> * If the matrices are big, say 500 documents and 1,000 terms, the BoW matrix has 500,000 elements (500 by 1,000, or $m \times n$)
> * The two matrices in the topic model combined have only 15,000 elements: with $k = 10$ topics, $(m \times k) \cdot (k \times n) = (500 \times 10) \cdot (10 \times 1{,}000)$, and their product has the size of the original matrix, 500 by 1,000
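
The shapes and element counts above can be checked with a short NumPy sketch. The matrices here are filled with random row-stochastic values purely to illustrate the dimensions; they are not a fitted topic model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_terms = 500, 10, 1_000

# Random matrices whose rows sum to one, used only to illustrate the shapes
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)    # (500, 10), rows are P(z | d)
topic_term = rng.dirichlet(np.ones(n_terms), size=n_topics)  # (10, 1000), rows are P(t | z)

# The product has the shape of the BoW matrix; each entry is P(t | d)
bow_approx = doc_topic @ topic_term

print(bow_approx.shape)
print(doc_topic.size + topic_term.size, "parameters vs", bow_approx.size)
print(np.allclose(bow_approx.sum(axis=1), 1.0))
```

Because both factors have rows that sum to one, every row of the product also sums to one, so each row is still a valid distribution over terms.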

Aside from being much simpler, the LDiA model has the huge advantage that it gives us a set of topics along which we can divide the documents. In this example, we are asserting they are *science, politics, and sports*, but in reality the algorithm just produces unnamed topics, and it is up to us to look at the associated words and decide what the common theme of those words is.

For these examples, we'll keep asserting these 3 topics, but think of them instead as *topic 1, topic 2, and topic 3*

The LDiA model is represented, as before, as:
$$P\left( t \mid d \right) = \sum_{z} P\left( t \mid z \right) \cdot P\left( z \mid d \right)$$

## Matrix Multiplication

The idea for building our LDiA model is to factor our BoW matrix into two matrices, one with documents by topics and the other with topics by terms.

<img src="assets/images/03/img_05.png" width=700 align='center'>

Recall how we built our BoW matrix: identify the terms and the number of times they appear in a specific document and divide by the sum of terms in that document to get the probabilities/frequencies:

<img src="assets/images/03/img_06.png" width=700 align='center'>

For our **document topic matrix**, we have as follows.

Say we have a document, `document 3` (or doc 3), that is mostly about science and a bit about sports and politics. Maybe it's 70% about science, 10% about politics, and 20% about sports. We record these values in the **document-topic matrix**:

<img src="assets/images/03/img_07.png" width=700 align='center'>

For the **topic-term matrix**, we have a similar approach. Start with a topic, say politics, and let's say we can figure out the probabilities that words are generated by this topic. We take all these probabilities to sum to one. We take these probabilities and place them into the **topic-term matrix** as such:

<img src="assets/images/03/img_08.png" width=700 align='center'>

From these two matrices, the product of them together will approximate the BoW matrix!

<img src="assets/images/03/img_09.png" width=700 align='center'>

But we haven't gone into depth about HOW to calculate the entries in these matrices. One way is to use the traditional [*matrix factorization* algorithm](https://developers.google.com/machine-learning/recommendation/collaborative/matrix). However, these matrices are special in that each row sums to one, and there is a lot of meaningful structure coming from the set of documents, topics, and words.

What we'll do is something more elaborate than matrix multiplication. The basic idea is that the entries in the two topic modeling matrices come from special distributions. So, we'll embrace this fact and work with these distributions to find these two matrices!

## <a id='3.1'>3.1: Picking topics</a>

Pretend you are at a party in a triangular room. There are people roaming around the room. In each of the corners, there are different things happening. In one corner there is food, in another corner there is dessert, and in the last there is music.

<img src="assets/images/03/img_10.png" width=700 align='center'>

People naturally get drawn to these corners based on whether they like food, dessert, or music. Or perhaps they are undecided and place themselves at an equal distance from, say, food and dessert. In any case, they mostly walk away from the blue areas and toward the red areas.

<img src="assets/images/03/img_11.png" width=700 align='center'>

Now, imagine the alternative. We are still at a party, but now in the corners, there is a lion, fire, and radioactive material. 

<img src="assets/images/03/img_12.png" width=700 align='center'>

Now, people will do the opposite of what they did when there were desirable things in the corners; they will move away from the corners. They will gravitate toward the center.

<img src="assets/images/03/img_13.png" width=700 align='center'>

So now, we have three scenarios:
1. We place nice things in the corners
2. We put nothing in the corners
3. We place bad things in the corners

<img src="assets/images/03/img_14.png" width=700 align='center'>

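In the above three scenarios, we can think of the parameters at the corners as *repelling factors*: if they are large (greater than $1$), the points are pushed away from the corners; if they are small (less than $1$), the points are drawn toward the corners; and if they are equal to $1$, the points are spread evenly over the room.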

As an example, if we have the following three Dirichlet Distributions, which of the three is most likely to generate the topics in our model?

<img src="assets/images/03/img_15.png" width=700 align='center'>

Answer: **Left**
> * If we randomly select any point in the distribution, it is most likely, of the three distributions, to be associated strongly with one of the three topics<br>
> * Most articles are represented strongly by one topic and perhaps weakly by the others, so the most useful distribution is the one that helps us clearly distinguish the primary topic of an article!

<img src="assets/images/03/img_16.png" width=700 align='center'>

So, for our LDiA model, we will pick a Dirichlet Distribution with small parameters $\alpha$, such as $\overrightarrow{\alpha}=\{0.7,0.7,0.7\}$, and from here we'll sample a few points to be our documents. Each point gives us a mixture of probabilities $\overrightarrow{\theta}$ that will characterize the distribution of topics for that particular document.
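
To see why small parameters are preferred, here is a quick sampling sketch using NumPy. The helper name `share_near_corner`, the threshold of 0.8, and the comparison value of 5.0 are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def share_near_corner(alpha, n_samples=10_000, threshold=0.8):
    """Fraction of sampled topic mixtures whose largest component exceeds the threshold."""
    theta = rng.dirichlet(alpha, size=n_samples)  # each row is one topic mixture
    return float((theta.max(axis=1) > threshold).mean())

small = share_near_corner([0.7, 0.7, 0.7])
large = share_near_corner([5.0, 5.0, 5.0])
print(f"alpha=0.7: {small:.3f}   alpha=5.0: {large:.3f}")
```

With small parameters, a noticeably larger share of sampled documents is dominated by a single topic, which matches the "corners" intuition from the party analogy.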

<img src="assets/images/03/img_17.png" width=700 align='center'>

In 3D, the Dirichlet distributions look as such:

<img src="assets/images/03/img_18.png" width=700 align='center'>

This shows that the probability of picking a point on the triangle depends on the height of the probability distribution at that point. On the left, the distribution is highest at the corners, where a single topic is strongest, so we prefer the one on the left!

# <a id='4'>4: $\beta$ Distributions</a>

Let's think about probability distributions.

Assume we have a coin and we toss it twice. The outcomes are 1 heads and 1 tails. What do we think about this coin? It could be a fair coin, or it could be biased toward heads or tails, but we don't have enough data to be sure. To continue the thought experiment, let's say that we think it's fair, but not with much confidence.

So, the probability distribution could look something like this - higher at $\frac{1}{2}$ but a bit *even* over the entire interval:

<img src="assets/images/03/img_19.png" width=700 align='center'>

Now, let's say we toss the coin 20 times and we get 10 heads and 10 tails. We feel more confident that the coin is fair. The probability distribution may look something more like this:

<img src="assets/images/03/img_20.png" width=700 align='center'>

But what if we toss the coin 4 times and get heads 3 times and tails once? We get an average of $\frac{3}{4}$ on probability of getting heads, but we don't have much confidence. We might have a distribution like such:

<img src="assets/images/03/img_21.png" width=700 align='center'>

But if we toss it 400 times and get 300 heads and 100 tails, we become more confident in the coin being biased toward heads and may get a probability distribution like this:

<img src="assets/images/03/img_22.png" width=700 align='center'>

This is called the **$\beta$ - Distribution** and it works for any values $a$ and $b$:

<img src="assets/images/03/img_23.png" width=700 align='center'>

The **gamma function** can be thought of as a continuous version of the factorial function. The density of the $\beta$-distribution at a point $x$ in $[0, 1]$, with $y = 1 - x$, is:

$$\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}y^{b-1}$$
$$\text{where, for a positive integer } a,$$
$$\Gamma(a)=(a-1)!$$

When $a$ is an integer, we get the usual factorial form. But if $a$ is not an integer and is instead some floating-point value, the gamma function still gives a value, something of the sort:

<img src="assets/images/03/img_24.png" width=700 align='center'>

So, if we have something like $0.1$ for heads and $0.4$ for tails, even though that makes no sense as a count of coin tosses, we can still use these values in the $\beta$-distribution. We just need to use the gamma function in the formula for the probability distribution. The resulting $\beta$-distribution looks as such, and means that $p$ is much more likely to be close to zero or close to one than to be somewhere in the middle.

<img src="assets/images/03/img_25.png" width=700 align='center'>

This makes a bit of sense: if $p$ is close to zero or 1, then we are likely to have zero heads or zero tails, which at least gets us close to one of the values we mentioned of $0.1$ or $0.4$
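
The density formula from this section can be evaluated directly with Python's `math.gamma`. This is a minimal sketch, writing the second factor as $1 - x$; the function name `beta_pdf` and the specific parameter pairs other than $(0.1, 0.4)$ are chosen for illustration.

```python
import math

def beta_pdf(x, a, b):
    """Beta density: Gamma(a+b) / (Gamma(a) * Gamma(b)) * x^(a-1) * (1-x)^(b-1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

# For integers, the gamma function matches the factorial: Gamma(5) = 4!
print(math.gamma(5), math.factorial(4))

# Larger parameters concentrate the density around the middle
print(beta_pdf(0.5, 10, 10), beta_pdf(0.5, 2, 2))

# Parameters below one push the density toward the ends of the interval
print(beta_pdf(0.01, 0.1, 0.4), beta_pdf(0.5, 0.1, 0.4))
```

The last line shows the behavior described above: with parameters $0.1$ and $0.4$, the density near zero is much higher than the density at $\frac{1}{2}$.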

# <a id='5'>5: Dirichlet Distribution</a>

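A multinomial distribution is simply a generalization of the binomial distribution to more than two values. The Dirichlet Distribution plays the same role for the multinomial that the $\beta$ distribution plays for the binomial: it is a distribution over the probability values of a multinomial.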

As an example, let's say we have newspaper articles and three topics (just as our earlier examples): science, politics, and sports. Now, let's say that a topic is assigned randomly to the articles and when we look, we find we have: 3 science articles, 6 politics articles, and 2 sports articles. So, for a new article, what is the probability that the article is science, sports, or politics? $\frac{3}{11}$ probability that it is science, $\frac{6}{11}$ probability that it is politics, and $\frac{2}{11}$ probability that it is sports.

If we think of these articles as points in a probability distribution, we can again picture a triangle. If the probability of a topic is 1, the point sits at that topic's corner. If it is less than one, the point can lie in one of two kinds of places: along an edge, indicating some probability shared between two of the topics and zero for the third, or somewhere in the interior, away from the boundary, indicating some positive probability for all three topics.

So, if we represent the articles on the triangle discussed, we expect the highest density toward the politics corner, a little less toward the science corner, and the least toward the sports corner. This distribution is calculated with a generalization of the formula for the $\beta$ distribution: $\frac{\Gamma(a+b+c)}{\Gamma(a)\Gamma(b)\Gamma(c)}x^{a-1}y^{b-1}z^{c-1}$. This generalization is the **Dirichlet Distribution**.
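
As a quick numerical check of the article example, the sketch below samples from a Dirichlet distribution whose parameters are the article counts (3, 6, 2). Using the raw counts as parameters is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([3.0, 6.0, 2.0])  # science, politics, sports (counts used as parameters)

# Each sample is a point on the triangle: three probabilities that sum to one
samples = rng.dirichlet(alpha, size=20_000)

print(samples.mean(axis=0))   # average position of the sampled points
print(alpha / alpha.sum())    # 3/11, 6/11, 2/11
```

The average sampled point lands close to $(\frac{3}{11}, \frac{6}{11}, \frac{2}{11})$, so the points sit closest to the politics corner and farthest from the sports corner.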

<img src="assets/images/03/img_26.png" width=700 align='center'>

Now, there is no need for these values to be integers (just as we saw before). For example, if we have 0.7 for each of science, politics, and sports, here is the resulting Dirichlet Distribution. It may be difficult to see, but the density function is very high when we get close to the corners of the triangle. This means that any point picked randomly from this distribution is very likely to fall at the science, politics, or sports corner, or at least close to one of the edges. It is also very unlikely to be somewhere in the middle.

<img src="assets/images/03/img_27.png" width=700 align='center'>

Here are some samples of Dirichlet Distributions with different values. Notice that when the values are large, the density function is higher in the middle and if they're small, it's higher in the corners.

<img src="assets/images/03/img_28.png" width=700 align='center'>

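If the values are different from each other, the high part moves toward the corners with the larger values and away from the corners with the smaller values, just as the density in the article example above was highest near politics, the topic with the largest count.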

<img src="assets/images/03/img_29.png" width=700 align='center'>

Again, this is how they would look in 3D. So notice that if we want a good topic model, we need to pick small parameters like the one on the left:

<img src="assets/images/03/img_18.png" width=700 align='center'>

# <a id='6'>6: More on Latent Dirichlet Allocation</a>

So now, let's build our LDiA model.

Let's say we have 3 real documents, and we generate 3 fake documents. The way we create the fake documents is with a **topic model**. After we generate them, we compare the generated documents with the real ones.

Comparing the generated documents with the real ones tells us how far our model is from creating the real documents. As with most machine learning algorithms, we learn from these errors and use them to improve the topic model.

<img src="assets/images/03/img_30.png" width=700 align='center'>

In the paper referenced earlier, [Latent Dirichlet Allocation](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf), the topic model is drawn as such and appears very complicated. But we will continue to analyze this in our following sections:

<img src="assets/images/03/img_31.png" width=700 align='center'>

# <a id='7'>7: Sample a Topic</a> 

Let's start by picking some topics for our documents. We start with some Dirichlet Distribution with parameters $\alpha$. The parameters should be small so that the distribution is *spiky* toward the corners, so that if we pick a point somewhere in the distribution, it will most likely be close to a corner or at least an edge.

Let's pick a point close to the politics corner which generates the following values: 0.1 science, 0.1 sports, 0.8 for politics. These values represent a mixture of topics for this particular document. They also give us a multinomial distribution $\theta$. Now, from this distribution, we'll start picking topics. That means the topics we'll pick are Science with 10% probability, Sports with 10% probability, and Politics with 80% probability.
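
Drawing topics from this $\theta$ can be sketched with NumPy as follows. The number of draws (10,000) is arbitrary and only serves to show that the observed shares approach the stated probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
topics = ["science", "sports", "politics"]
theta = [0.1, 0.1, 0.8]  # topic mixture for this document

# Sample topics according to the multinomial distribution theta
drawn = rng.choice(topics, size=10_000, p=theta)

shares = {t: float((drawn == t).mean()) for t in topics}
print(shares)
```

Roughly 80% of the sampled topics come out as politics, with the remainder split between science and sports.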

<img src="assets/images/03/img_32.png" width=700 align='center'>

Now, we do this for several documents, so each document is a point in this distribution. The coordinates of each point indicate how close the document is to each topic.

<img src="assets/images/03/img_33.png" width=700 align='center'>

Now, we can merge all of these vectors to get the first matrix, the matrix that indexes documents with their corresponding topics, the **document-topic matrix**.

<img src="assets/images/03/img_34.png" width=700 align='center'>

# <a id='8'>8: Sample a Word</a>

Now, we'll do the same thing for topics and words. Let's say, for the sake of visualization, we only have four words: space, climate, vote, and rule. These 4 words give us a different Dirichlet Distribution, $\beta$. This one is similar, but it is 3D: instead of a triangle, it lives on a tetrahedron (a 3-dimensional simplex) with one corner per word. The red and blue parts are high and low probability areas respectively. If we had more words, we would still have a Dirichlet Distribution, except it would be on a much higher dimensional simplex.

So, in this distribution $\beta$ we pick a random point, and it will very likely be close to a corner or an edge. Each point chosen in the distribution has a probability associated with each word. For our example, we can assume we get: 0.4 space, 0.4 climate, 0.1 vote, and 0.1 rule. This multinomial distribution is called $\phi$ (phi), and it represents the *connection between the words and the topic*. Now, from this distribution, we'll sample random words (which are 40% likely to be space, 40% likely to be climate, 10% likely to be vote, and 10% likely to be rule).

<img src="assets/images/03/img_35.png" width=700 align='center'>

So, now we do this for **every topic**. Notice that with this, we have topics by number. We do not know them by any name, we just know them by topic 1, 2, and 3. After some inspection, we can infer that topic 1, being close to space and climate must be science. Similarly, topic 2 being close to vote could be politics, and topic 3 close to rule could be sports. This inference of topic meaning from inspection is something you do at the end of the model.

<img src="assets/images/03/img_36.png" width=700 align='center'>

Now, as we join these together, we get our second matrix of the LDiA model, the **topic-term matrix**

<img src="assets/images/03/img_38.png" width=700 align='center'>

# <a id='9'>9: Combining the Models</a>

Now, let's put it all together and see how to get these two matrices of the LDiA model from their respective Dirichlet distributions.

The rough idea we just saw is that the entries of the first matrix come from picking points in the distribution $\alpha$, the entries of the second matrix come from picking points in the distribution $\beta$, and the goal is to find the best locations of these points to get the best factorization of the matrix. The best locations of these points will give us *precisely* the topics we want!

<img src="assets/images/03/img_39.png" width=700 align='center'>

Let's begin by generating some documents to compare them with the originals. We begin with the Dirichlet distribution for topics, $\alpha$. From here, we draw some points corresponding to all the documents. Each point gives some values for each of the topics, which define a multinomial distribution $\theta$ (the mixture of topics corresponding to a document). Now, let's generate some words for document one as follows.

From $\theta$, we draw some topics. How many topics? We will have a Poisson variable, another parameter in the model, to tell us how many! So, we draw some topics based on the probabilities given by the $\theta$ distribution. For this example, we'll draw science with 0.7 probability, politics with 0.2 probability, and sports with 0.1 probability. Now, we'll associate words to these topics using the words Dirichlet distribution $\beta$.

In the distribution $\beta$, we locate each topic, and from each of the dots we obtain a distribution over the words generated by that topic, with some probability for each word. These distributions are called $\phi$ (phi).

For each of the topics we've chosen, we'll pick a word associated with it using the multinomial distribution $\phi$. For example, the first topic we have is science. We look at the science row in the $\phi$ distribution and pick a word from there, for example, space. So, space is the first word in document one. We now do this for every one of our topics, generating the words of our first generated document; let's call it "fake document one". We then do this again: draw another point from $\alpha$, get another multinomial distribution $\theta$, which generates new topics, which in turn generate new words from $\beta$ (let's call that "fake document two"). We do this many times, generating many documents. Finally, we compare them with the original documents!
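
The generation steps just described can be sketched end to end. This is an illustrative sketch only: the four-word vocabulary comes from the earlier example, while the $\beta$ parameter values, the Poisson mean `mean_length`, and the function name `generate_document` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
topics = ["science", "politics", "sports"]
vocab = ["space", "climate", "vote", "rule"]

alpha = [0.7, 0.7, 0.7]        # Dirichlet parameters over topics
beta = [0.7, 0.7, 0.7, 0.7]    # Dirichlet parameters over words (assumed values)
mean_length = 8                # Poisson mean for the number of words (assumed value)

# One word distribution phi per topic, drawn from the Dirichlet with parameters beta
phi = rng.dirichlet(beta, size=len(topics))  # shape (3, 4)

def generate_document():
    theta = rng.dirichlet(alpha)              # topic mixture for this document
    n_words = rng.poisson(mean_length)        # how many topics/words to draw
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)  # sample a topic from theta
        w = rng.choice(len(vocab), p=phi[z])  # sample a word from that topic's phi
        words.append(vocab[w])
    return theta, words

fake_docs = [generate_document() for _ in range(3)]
for theta, words in fake_docs:
    print(np.round(theta, 2), words)
```

Each call produces one "fake document": a topic mixture $\theta$ and the list of words generated from it.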

To compare the generated documents with the originals, we can use maximum likelihood to figure out the arrangements of points which will give us the real articles with the highest probability.
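
To make the comparison concrete, here is a sketch of the log-likelihood of one real document under two candidate topic mixtures. The document, the matrix `phi`, and both mixtures are hypothetical values chosen for illustration.

```python
import numpy as np

vocab = ["space", "climate", "vote", "rule"]
real_doc = ["space", "climate", "space", "vote"]

def log_likelihood(doc, theta, phi):
    """log P(doc) = sum over words of log( sum over z of theta[z] * phi[z, word] )."""
    word_probs = theta @ phi                 # P(t | d) for every term in the vocabulary
    idx = [vocab.index(w) for w in doc]
    return float(np.log(word_probs[idx]).sum())

# Hypothetical topic-term matrix (rows: science, politics, sports)
phi = np.array([
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.1, 0.1, 0.2, 0.6],
])

science_heavy = np.array([0.8, 0.1, 0.1])
sports_heavy = np.array([0.1, 0.1, 0.8])

ll_science = log_likelihood(real_doc, science_heavy, phi)
ll_sports = log_likelihood(real_doc, sports_heavy, phi)
print(ll_science, ll_sports)
```

The science-heavy mixture gives the higher log-likelihood for this document, so it is the better arrangement of the two for reproducing the real text.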

<img src="assets/images/03/img_40.png" width=700 align='center'>

In summary, here's what we're doing. We have the two Dirichlet distributions, $\alpha$ and $\beta$. From $\alpha$, we pick some documents, and from $\beta$, we pick some topics. We use these two combined to create some fake articles. Then, we compare the fake articles to the real articles. The probability of obtaining the real articles is, of course, really small, but there must be some arrangement of points in the above distributions that maximizes this probability. Our goal is to find that arrangement of points, which will give us the topics. In the same way that we train many algorithms in machine learning, there will be an error that tells us how far we are from generating the real articles. The error will back-propagate all the way to the distributions, giving us a gradient that tells us where to move the points in order to reduce this error.

<img src="assets/images/03/img_41.png" width=700 align='center'>

So, we move the points as indicated and now we have obtained a slightly better model. Doing this repeatedly will give us a good enough arrangement of the points. Naturally, a good arrangement of the points will give us some topics.

<img src="assets/images/03/img_42.png" width=700 align='center'>

The Dirichlet distribution $\alpha$ will tell us which articles are associated with these topics, and the Dirichlet distribution $\beta$ will tell us which words are associated with those topics. We can go a bit further and actually back-propagate the error all the way to $\alpha$ and $\beta$, obtaining not only better point arrangements but actually better distributions, $\alpha^{'}$ and $\beta^{'}$.

<img src="assets/images/03/img_43.png" width=700 align='center'>

And that's it! That's how Latent Dirichlet Allocation works!


If you want to understand the diagram in the paper, here it is:
* $\alpha$ is the topic distribution
* $\beta$ is the word distribution
* $\theta$ is the multinomial distribution drawn from the topics
* $\phi$ is the multinomial distribution drawn from the words
* $z$ is the topics
* $w$ is the words of the document, obtained by combining the two

<img src="assets/images/03/img_44.png" width=700 align='center'>

## <a id='9.1'>9.1: Summary</a>

In this lesson, we covered:
1. Bag of words in more detail
2. Latent variables
3. Matrix representation of Latent Dirichlet Allocation
4. Beta distribution
5. Dirichlet distribution
6. More on Latent Dirichlet Allocation
7. Sample a topic
8. Sample a word
9. Combining the models
10. Topic modeling lab

**Use Cases:**
* Topic modeling, document categorization
* Mixture of topics in a new document
* Generate collections of words with desired mixture

# <a id='10'>10: Topic Modeling Lab</a>

## <a id='10.0'>10.0: Step 0 - Intro - Latent Dirichlet Allocation</a>

LDiA is used to classify text in a document to a particular topic. It builds a topic per document model and words per topic model, modeled as Dirichlet distributions. 

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDiA assumes that every chunk of text we feed into it contains words that are somehow related. Therefore, choosing the right corpus of data is crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.

## <a id='10.1'>10.1: Step 1 - Load the dataset</a>

The dataset we'll use is a list of over one million news headlines published over a period of 15 years. We'll start by loading it from the `02_computing-with-NLP/assets/ldia/abcnews-date-text.csv` file.

In [29]:
from pathlib import Path
import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [30]:
dir_items = ['assets','ldia','abcnews-date-text.csv']
file_path = Path(*dir_items)

In [31]:
data = pd.read_csv(file_path, on_bad_lines='skip')
display(data.head())
print(f"{data.shape[0]:,} rows by {data.shape[1]:,} columns")
print()
data.info()

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


1,103,665 rows by 2 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1103665 entries, 0 to 1103664
Data columns (total 2 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   publish_date   1103665 non-null  int64 
 1   headline_text  1103665 non-null  object
dtypes: int64(1), object(1)
memory usage: 16.8+ MB


In [32]:
# Use only the first 300k documents and keep the row index as its own column
data_text = data[:300000][['headline_text']].copy()
data_text['index'] = data_text.index
# The cells below refer to this DataFrame as `documents`
documents = data_text
documents.head()

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


## <a id='10.2'>10.2: Step 2 - Data Preprocessing</a>

We will perform the following steps:

* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have 3 or fewer characters are removed (the code below keeps only tokens with `len(token) > 3`).
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.
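
The filtering steps above can be sketched with the standard library alone. This is a simplified stand-in, not the lab's pipeline: it skips lemmatization and stemming, and `STOP_WORDS_SKETCH` is a tiny made-up stop-word set (gensim's `STOPWORDS` is much larger). The length rule mirrors the lab code, which keeps tokens with `len(token) > 3`.

```python
import re

# Tiny illustrative stop-word set; the lab uses gensim's much larger STOPWORDS
STOP_WORDS_SKETCH = {"the", "and", "for", "with", "must", "against", "from"}

def filter_tokens(text):
    """Lowercase, split on non-letters, drop stop words and tokens of 3 or fewer characters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS_SKETCH and len(t) > 3]

filtered = filter_tokens("air nz staff in aust strike for pay rise")
print(filtered)
```

Short tokens such as "air", "nz", and "pay" are dropped by the length rule alone, before any stop-word check matters.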

In [33]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/christopherdaigle/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Lemmatizer Example
Before preprocessing our dataset, let's first look at a lemmatizing example. What would be the output if we lemmatized the word 'went'?

In [35]:
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']

In [34]:
print(WordNetLemmatizer().lemmatize('went', pos ='v')) # past tense to present tense

go


In [38]:
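# Without a `pos` argument, lemmatize() treats each word as a noun,
# so verb forms such as 'went' or 'denied' are returned unchanged
for word in original_words + ['went']:
    print(f"{word}: {WordNetLemmatizer().lemmatize(word)}")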

caresses: caress
flies: fly
dies: dy
mules: mule
denied: denied
died: died
agreed: agreed
owned: owned
humbled: humbled
sized: sized
meeting: meeting
stating: stating
siezing: siezing
itemization: itemization
sensational: sensational
traditional: traditional
reference: reference
colonizer: colonizer
plotted: plotted
went: went


### Stemmer Example
Let's also look at a stemming example. Let's throw a number of words at the stemmer and see how it deals with each one:

In [39]:
stemmer = SnowballStemmer("english")
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [45]:
'''
Write a function to perform the pre processing steps on the entire dataset
'''
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
            
    return result

In [47]:
'''
Preview a document after preprocessing
'''
document_num = 4310
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['rain', 'helps', 'dampen', 'bushfires']


Tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


Let's now preprocess all the news headlines we have. To do that, let's use the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas to apply `preprocess()` to the `headline_text` column

**Note**: This may take a few minutes (this took the instructor 6 minutes on his laptop)

In [48]:
documents.head()

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


In [52]:
%%time
# TODO: preprocess all the headlines, saving the list of results as 'processed_docs'
documents['proc_doc'] = documents['headline_text'].map(preprocess)
documents.head()

CPU times: user 20.7 s, sys: 82.9 ms, total: 20.8 s
Wall time: 20.8 s


| | headline_text | index | proc_doc |
|---|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 | [decid, communiti, broadcast, licenc] |
| 1 | act fire witnesses must be aware of defamation | 1 | [wit, awar, defam] |
| 2 | a g calls for infrastructure protection summit | 2 | [call, infrastructur, protect, summit] |
| 3 | air nz staff in aust strike for pay rise | 3 | [staff, aust, strike, rise] |
| 4 | air nz strike to affect australian travellers | 4 | [strike, affect, australian, travel] |


In [53]:
processed_docs = documents['proc_doc'].tolist()

In [54]:
'''
Preview 'processed_docs'
'''
processed_docs[:10]

[['decid', 'communiti', 'broadcast', 'licenc'],
 ['wit', 'awar', 'defam'],
 ['call', 'infrastructur', 'protect', 'summit'],
 ['staff', 'aust', 'strike', 'rise'],
 ['strike', 'affect', 'australian', 'travel'],
 ['ambiti', 'olsson', 'win', 'tripl', 'jump'],
 ['antic', 'delight', 'record', 'break', 'barca'],
 ['aussi', 'qualifi', 'stosur', 'wast', 'memphi', 'match'],
 ['aust', 'address', 'secur', 'council', 'iraq'],
 ['australia', 'lock', 'timet']]

## <a id='10.3'>10.3: Step 3.1 - Bag of words on the dataset</a>

Now let's create a dictionary from 'processed_docs' containing the number of times a word appears in the training set. To do that, let's pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call it '`dictionary`'.

In [59]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [60]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit
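
The ids in this preview follow the order in which tokens are first seen. As a rough pure-Python sketch (not gensim's actual implementation; `build_token2id` is a hypothetical helper, and the alphabetical numbering within a document is an assumption chosen because it matches the preview), the mapping can be imitated like this:

```python
def build_token2id(docs):
    """Assign integer ids to tokens in order of first appearance.

    Assumption for illustration: within one document, previously unseen
    tokens are numbered in alphabetical order, which matches the preview.
    """
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# First three preprocessed headlines from 'processed_docs'
sample_docs = [
    ['decid', 'communiti', 'broadcast', 'licenc'],
    ['wit', 'awar', 'defam'],
    ['call', 'infrastructur', 'protect', 'summit'],
]
token2id = build_token2id(sample_docs)
for token, token_id in sorted(token2id.items(), key=lambda kv: kv[1]):
    print(token_id, token)
```

On these three headlines the sketch prints the same eleven id/token pairs as the dictionary preview.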


**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* fewer than `no_below` documents (absolute number), or
* more than `no_above` documents (fraction of total corpus size, not absolute number).
* After applying both rules, keep only the `keep_n` most frequent tokens (or keep all if `None`).
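
The three rules can be sketched in plain Python. This is only an illustration of the logic, not gensim's code: `filter_extremes_sketch` and `toy_docs` are hypothetical names, and ranking the survivors by document frequency for `keep_n` is an assumption about what "most frequent" means here.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # turn the fraction into a document count

    # Rules 1 and 2: drop tokens that are too rare or too common.
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]

    # Rule 3: keep at most keep_n tokens, highest document frequency first.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)


toy_docs = [
    ['rain', 'help', 'fire'],
    ['rain', 'fire', 'council'],
    ['rain', 'plan'],
    ['rain', 'fire', 'plan'],
]
# 'rain' is in all 4 documents (more than 75%); 'help' and 'council' are in only 1.
print(sorted(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75)))
# ['fire', 'plan']
```

Note that both thresholds count documents, not total occurrences: a word repeated ten times in a single headline still has a document frequency of 1.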

In [61]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
'''
# TODO: apply dictionary.filter_extremes() with the parameters mentioned above
dictionary.filter_extremes(no_below=15, no_above=0.1)

**Gensim doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
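
To make the output format concrete, here is a minimal pure-Python sketch of the same conversion. `doc2bow_sketch` and `toy_token2id` are hypothetical names used only for illustration; the real method lives on the gensim `Dictionary` object.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Convert a token list into sorted (token_id, token_count) pairs.

    Tokens that are missing from token2id are silently dropped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_token2id = {'bushfir': 0, 'help': 1, 'rain': 2}
bow = doc2bow_sketch(['rain', 'help', 'rain', 'dampen', 'bushfir'], toy_token2id)
print(bow)
# [(0, 1), (1, 1), (2, 2)]
```

The token `'dampen'` is not in the toy vocabulary, so it does not show up in the result. The same thing happens to any word removed by `filter_extremes()` or never seen during training.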

In [62]:
'''
Create the bag-of-words representation of each document, i.e. for each document a list of
(token_id, token_count) pairs recording which words appear and how many times. Save this to 'bow_corpus'
'''
# TODO
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [63]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
bow_corpus[document_num]

[(71, 1), (107, 1), (462, 1), (3530, 1)]

In [64]:
'''
Preview BOW for our sample preprocessed document
'''
# Here document_num is document number 4310 which we have checked in Step 2
bow_doc_4310 = bow_corpus[document_num]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                     dictionary[bow_doc_4310[i][0]], 
                                                     bow_doc_4310[i][1]))

Word 71 ("bushfir") appears 1 time.
Word 107 ("help") appears 1 time.
Word 462 ("rain") appears 1 time.
Word 3530 ("dampen") appears 1 time.


## <a id='10.4'>10.4: Step 3.2 - TF-IDF on our document set</a>

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality.

*Please note: The author of Gensim dictates the standard procedure for LDA to be using the Bag of Words model.*

**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
* Therefore, common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log_e(Total number of documents / Number of documents with term w in it)`.

**For example**

* Consider a document containing `100` words wherein the word 'tiger' appears 3 times. 
* The term frequency (i.e., tf) for 'tiger' is then: 
    - `TF = (3 / 100) = 0.03`. 

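* Now, assume we have `10 million` documents and the word 'tiger' appears in `1000` of these. Using a base-10 logarithm to keep the numbers round, the inverse document frequency (i.e., idf) is:
    - `IDF = log_10(10,000,000 / 1,000) = 4`. 

* Thus, the TF-IDF weight is the product of these quantities: 
    - `TF-IDF = 0.03 * 4 = 0.12`.

* With the natural logarithm from the definition above, the same counts give `IDF = log_e(10,000) ≈ 9.21` and `TF-IDF ≈ 0.28`. The base only rescales the scores; it does not change how terms rank against each other.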
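
The arithmetic of the 'tiger' example can be checked in a few lines of plain Python. The helper names `tf` and `idf` are introduced here only for illustration. The sketch evaluates the IDF with both a base-10 and a natural logarithm to show how much the choice of base changes the number.

```python
import math


def tf(term_count, doc_length):
    """Term frequency: share of the document taken up by the term."""
    return term_count / doc_length


def idf(n_docs, n_docs_with_term, base=math.e):
    """Inverse document frequency with a configurable logarithm base."""
    return math.log(n_docs / n_docs_with_term, base)


tf_tiger = tf(3, 100)
idf_log10 = idf(10_000_000, 1_000, base=10)
idf_ln = idf(10_000_000, 1_000)  # natural log, as in the definition

print(f"TF = {tf_tiger:.2f}")
print(f"base-10: IDF = {idf_log10:.2f}, TF-IDF = {tf_tiger * idf_log10:.2f}")
print(f"natural: IDF = {idf_ln:.2f}, TF-IDF = {tf_tiger * idf_ln:.2f}")
```

Whichever base is used, it has to be the same for every term in the corpus, otherwise the scores are not comparable.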

In [65]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
from gensim import corpora, models

# TODO
tfidf = models.TfidfModel(bow_corpus)

# >>> corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format
# >>>
# >>> model = TfidfModel(corpus)  # fit model
# >>> vector = model[corpus[0]]  # apply model to the first corpus document

In [67]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
# TODO
corpus_tfidf = [tfidf[doc] for doc in bow_corpus]

In [68]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5959813347777092),
 (1, 0.39204529549491984),
 (2, 0.48531419274988147),
 (3, 0.5055461098578569)]
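
These scores are not raw TF × IDF products. By default, gensim's `TfidfModel` normalizes each document vector to unit length (its `normalize` parameter), so the squared scores of a document should add up to roughly 1. A quick check on the vector printed above, with the values copied by hand:

```python
import math

# (token_id, tfidf score) pairs copied from the preview above
first_doc_tfidf = [
    (0, 0.5959813347777092),
    (1, 0.39204529549491984),
    (2, 0.48531419274988147),
    (3, 0.5055461098578569),
]

# Euclidean (L2) length of the document vector
l2_norm = math.sqrt(sum(score ** 2 for _, score in first_doc_tfidf))
print(round(l2_norm, 6))
```

A result very close to 1 confirms the normalization. It also explains why scores from headlines of different lengths stay on a comparable scale.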


## <a id='10.5'>10.5: Step 4.1 - Running LDA using Bag of Words</a>

We are going for 10 topics in the document corpus.

**We will be running LDA using all CPU cores to parallelize and speed up model training.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of worker processes to use for parallelization. By default, gensim uses all available cores minus one.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave both at their default values for now (a symmetric prior of `1/num_topics`).
    - Alpha is the prior on the per-document topic distribution.
        * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document has a mixture of very few topics.

    - Eta is the prior on the per-topic word distribution.
        * High eta: Each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic has a mixture of few words.

* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, chunksize is 10,000, and passes is 2, then online training is done in 10 updates: 
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999` 
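
The update schedule listed above is simply two nested loops over passes and chunks. The sketch below reproduces it. `update_schedule` is a hypothetical helper, and it describes the simple single-process view. With `LdaMulticore` the chunks are handed to several workers, so the actual grouping of updates may differ.

```python
def update_schedule(num_docs, chunksize, passes):
    """List the (first_doc, last_doc) range of every online update."""
    schedule = []
    for _ in range(passes):
        for start in range(0, num_docs, chunksize):
            # The last chunk of a pass may be shorter than chunksize.
            end = min(start + chunksize, num_docs) - 1
            schedule.append((start, end))
    return schedule


schedule = update_schedule(num_docs=50_000, chunksize=10_000, passes=2)
for i, (first, last) in enumerate(schedule, start=1):
    print(f"#{i} documents {first:,}-{last:,}")
```

Raising `passes` multiplies the number of updates, while a larger `chunksize` means fewer, larger updates per pass.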

In [74]:
%%time
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
# TODO
lda_model = gensim.models.LdaMulticore(
    corpus=bow_corpus, # input data for model
    num_topics=10, # number of topics for the model to identify
    id2word=dictionary, # Mapping from word IDs to words - from the filtered dictionary
    passes=3, # number of times for the model to go over the data
    workers=31 # number of actual cores of a CPU to train over (I have 32, leaving one open)
)

CPU times: user 6.35 s, sys: 1.02 s, total: 7.37 s
Wall time: 33.6 s


In [75]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.019*"polic" + 0.014*"crash" + 0.009*"murder" + 0.008*"coast" + 0.008*"die" + 0.007*"iraq" + 0.007*"woman" + 0.007*"claim" + 0.006*"investig" + 0.006*"boost"


Topic: 1 
Words: 0.010*"lead" + 0.009*"govt" + 0.007*"death" + 0.007*"polic" + 0.007*"jail" + 0.006*"chang" + 0.006*"charg" + 0.005*"claim" + 0.005*"court" + 0.005*"doubt"


Topic: 2 
Words: 0.021*"plan" + 0.010*"govt" + 0.009*"polic" + 0.008*"back" + 0.008*"council" + 0.007*"protest" + 0.007*"group" + 0.007*"closer" + 0.007*"fund" + 0.006*"support"


Topic: 3 
Words: 0.015*"polic" + 0.009*"mayor" + 0.009*"farmer" + 0.008*"urg" + 0.008*"drought" + 0.008*"test" + 0.008*"group" + 0.007*"govt" + 0.007*"target" + 0.007*"consid"


Topic: 4 
Words: 0.015*"kill" + 0.010*"attack" + 0.010*"plan" + 0.008*"council" + 0.007*"hold" + 0.007*"report" + 0.007*"say" + 0.007*"offer" + 0.006*"iraq" + 0.006*"develop"


Topic: 5 
Words: 0.014*"council" + 0.011*"water" + 0.010*"hospit" + 0.008*"plan" + 0.008*"claim" + 0.008*"govt" +

### Classification of the topics ###

Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: 'political labor'
* 1: 'catastrophe'
* 2: 'more catastrophe'
* 3: 
* 4: 
* 5: 
* 6: 
* 7:  
* 8: 
* 9:

**These are hard to distinguish**

## <a id='10.6'>10.6: Step 4.2 - Running LDA using TF-IDF</a>

In [76]:
%%time
'''
Define lda model using corpus_tfidf
'''
# TODO
lda_model_tfidf = gensim.models.LdaMulticore(
    corpus=corpus_tfidf, # input data for model
    num_topics=10, # number of topics for the model to identify
    id2word=dictionary, # Mapping from word IDs to words - from the filtered dictionary
    passes=3, # number of times for the model to go over the data
    workers=31 # number of actual cores of a CPU to train over (I have 32, leaving one open)
)

CPU times: user 12 s, sys: 1.32 s, total: 13.3 s
Wall time: 28.4 s


In [77]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.006*"polic" + 0.004*"stab" + 0.004*"boost" + 0.004*"injur" + 0.003*"govt" + 0.003*"damag" + 0.003*"fund" + 0.003*"plan" + 0.003*"murder" + 0.003*"sydney"


Topic: 1 Word: 0.018*"polic" + 0.008*"crash" + 0.008*"iraq" + 0.007*"investig" + 0.006*"miss" + 0.006*"die" + 0.006*"council" + 0.006*"driver" + 0.005*"search" + 0.005*"fatal"


Topic: 2 Word: 0.006*"guilti" + 0.005*"charg" + 0.004*"plead" + 0.004*"court" + 0.004*"say" + 0.004*"titl" + 0.004*"world" + 0.004*"telstra" + 0.003*"murder" + 0.003*"face"


Topic: 3 Word: 0.008*"charg" + 0.006*"govt" + 0.006*"face" + 0.005*"plan" + 0.005*"assault" + 0.005*"court" + 0.004*"seek" + 0.004*"polic" + 0.004*"victim" + 0.004*"child"


Topic: 4 Word: 0.008*"water" + 0.005*"plan" + 0.004*"council" + 0.004*"polic" + 0.004*"govt" + 0.003*"terror" + 0.003*"kill" + 0.003*"continu" + 0.003*"reject" + 0.003*"futur"


Topic: 5 Word: 0.005*"govt" + 0.005*"plan" + 0.005*"blaze" + 0.004*"rescu" + 0.004*"polic" + 0.004*"rail" + 0.003*"kill" +

### Classification of the topics ###

As we can see, TF-IDF gives heavier weights to less frequent words, so rarer nouns make it into the top words of each topic. That makes the categories harder to identify, since these nouns can be hard to assign to a theme. This goes to show that the right model depends on the type of text corpus we are dealing with. 

Using the words in each topic and their corresponding weights, what categories could you find? **skipping this step**

* 0: 
* 1:  
* 2: 
* 3: 
* 4:  
* 5: 
* 6: 
* 7: 
* 8: 
* 9: 

## <a id='10.7'>10.7: Step 5.1 - Performance evaluation by classifying sample document using LDA Bag of Words model</a>

We will check to see where our test document would be classified. 

In [78]:
'''
Text of sample document 4310
'''
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [79]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''

# Our test document is document number 4310
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.8199149966239929	 
Topic: 0.015*"polic" + 0.009*"mayor" + 0.009*"farmer" + 0.008*"urg" + 0.008*"drought" + 0.008*"test" + 0.008*"group" + 0.007*"govt" + 0.007*"target" + 0.007*"consid"

Score: 0.02001369558274746	 
Topic: 0.011*"hous" + 0.010*"warn" + 0.010*"govt" + 0.009*"power" + 0.008*"servic" + 0.007*"public" + 0.006*"record" + 0.006*"return" + 0.005*"polic" + 0.005*"emerg"

Score: 0.020012564957141876	 
Topic: 0.015*"kill" + 0.010*"attack" + 0.010*"plan" + 0.008*"council" + 0.007*"hold" + 0.007*"report" + 0.007*"say" + 0.007*"offer" + 0.006*"iraq" + 0.006*"develop"

Score: 0.020010806620121002	 
Topic: 0.019*"polic" + 0.014*"crash" + 0.009*"murder" + 0.008*"coast" + 0.008*"die" + 0.007*"iraq" + 0.007*"woman" + 0.007*"claim" + 0.006*"investig" + 0.006*"boost"

Score: 0.02000867947936058	 
Topic: 0.022*"govt" + 0.015*"court" + 0.015*"charg" + 0.015*"urg" + 0.013*"face" + 0.008*"opposit" + 0.008*"dead" + 0.008*"boost" + 0.007*"water" + 0.007*"council"

Score: 0.020008308812

### It has the highest probability (`82%`) of being part of the topic that we assigned as X, which is the accurate classification. ###

## <a id='10.8'>10.8: Step 5.2 - Performance evaluation by classifying sample document using LDA TF-IDF model</a>

In [80]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.44177496433258057	 
Topic: 0.005*"govt" + 0.005*"say" + 0.005*"elect" + 0.004*"cost" + 0.004*"plan" + 0.004*"council" + 0.004*"urg" + 0.004*"kill" + 0.004*"face" + 0.004*"claim"

Score: 0.39813169836997986	 
Topic: 0.005*"govt" + 0.005*"market" + 0.005*"industri" + 0.005*"council" + 0.005*"boost" + 0.004*"plan" + 0.004*"record" + 0.003*"fund" + 0.003*"offer" + 0.003*"staff"

Score: 0.020013555884361267	 
Topic: 0.008*"charg" + 0.006*"govt" + 0.006*"face" + 0.005*"plan" + 0.005*"assault" + 0.005*"court" + 0.004*"seek" + 0.004*"polic" + 0.004*"victim" + 0.004*"child"

Score: 0.020012762397527695	 
Topic: 0.008*"water" + 0.005*"plan" + 0.004*"council" + 0.004*"polic" + 0.004*"govt" + 0.003*"terror" + 0.003*"kill" + 0.003*"continu" + 0.003*"reject" + 0.003*"futur"

Score: 0.02001274563372135	 
Topic: 0.005*"govt" + 0.005*"plan" + 0.005*"blaze" + 0.004*"rescu" + 0.004*"polic" + 0.004*"rail" + 0.003*"kill" + 0.003*"health" + 0.003*"blast" + 0.003*"highway"

Score: 0.020012095570564

### It has the highest probability (`44%`) of being part of the topic that we assigned as X. ###

## <a id='10.9'>10.9: Step 6 - Testing model on unseen document</a>

In [81]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.8198835849761963	 Topic: 0.010*"lead" + 0.009*"govt" + 0.007*"death" + 0.007*"polic" + 0.007*"jail"
Score: 0.0200209803879261	 Topic: 0.011*"hous" + 0.010*"warn" + 0.010*"govt" + 0.009*"power" + 0.008*"servic"
Score: 0.02001331001520157	 Topic: 0.013*"polic" + 0.010*"bomb" + 0.008*"say" + 0.008*"kill" + 0.007*"labor"
Score: 0.02001316472887993	 Topic: 0.015*"kill" + 0.010*"attack" + 0.010*"plan" + 0.008*"council" + 0.007*"hold"
Score: 0.02001257613301277	 Topic: 0.015*"polic" + 0.009*"mayor" + 0.009*"farmer" + 0.008*"urg" + 0.008*"drought"
Score: 0.020012354478240013	 Topic: 0.022*"govt" + 0.015*"court" + 0.015*"charg" + 0.015*"urg" + 0.013*"face"
Score: 0.02001224085688591	 Topic: 0.019*"polic" + 0.014*"crash" + 0.009*"murder" + 0.008*"coast" + 0.008*"die"
Score: 0.02001161314547062	 Topic: 0.014*"council" + 0.011*"water" + 0.010*"hospit" + 0.008*"plan" + 0.008*"claim"
Score: 0.020010674372315407	 Topic: 0.010*"polic" + 0.010*"say" + 0.010*"union" + 0.008*"worker" + 0.007*"de

The model assigns the unseen document to the X category with `82%` probability.

This tutorial is [also available on the Udacity GitHub](https://github.com/udacity/cd0379-computing-with-natural-language/tree/main/2.2-topic-modeling)