# Topic Modeling

In this lecture, we'll work through an example of *topic modeling*. The idea of topic modeling is to find "topics" in documents that tie together many words. Here are some examples of hypothetical topics that you might find in a newspaper: 

1. **Finance**: "dollar", "stock", "banks"
2. **Politics**: "party", "vote", "election"
3. **Sports**: "team", "win", "game"

We'll see how to use the term-document matrix from last time, in combination with some nice algorithms from `scikit-learn`, to perform topic modeling. Our overall aim is to get a coarse, topic-level summary of the plot of the short book *Alice’s Adventures in Wonderland* by Lewis Carroll. 

In [4]:
#standard imports

Then, we create a nice, tidy data frame.

Then, and this is the complex part, we use the `CountVectorizer` from `sklearn` to construct the term-document matrix. In this example, I've used a few more of the arguments for `CountVectorizer`. In particular, because I'd like to eventually be able to see how topics evolve between chapters, I use the `max_df` argument to specify that I'd like to include only words that appear in at most 50% of the chapters. Words that appear in nearly every chapter can't help us tell chapters apart. 

Next, we can use this `CountVectorizer` to create the term-document matrix and collect it all as a nice, tidy data frame. 

## On To Topic Modeling

Now we are ready to run our model! Topic modeling is an *unsupervised* machine learning framework, which means that there's no set of true labels `y`. So, we just need to create the variables `X`. To do this, we can ignore the `text` and `chapter` columns. 

In [11]:
X = df.drop(['text', 'chapter'], axis=1)

There are many algorithms for topic modeling. We will use *nonnegative matrix factorization* or NMF for now. 

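NMF decomposes the term-document matrix into topics. We start with a matrix $X$ of shape $\text{documents}\times \text{words}$ whose entries are nonnegative. Then we approximately factor $X\approx WH$, where $W$ has shape $\text{documents}\times \text{topics}$, $H$ has shape $\text{topics}\times\text{words}$, and both $W$ and $H$ have nonnegative entries. 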

In other words, before the factorization we look at how often each word appears in each document. After the factorization, we look at how strongly a word is associated with a given topic and how strongly associated a topic is with a given document.
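The shapes are easier to see with a tiny hand-built example. The numbers below are made up for illustration: they are not the output of any model, just a matrix product showing how a documents-by-topics matrix and a topics-by-words matrix combine into a documents-by-words matrix.

```python
import numpy as np

# 3 documents x 2 topics: how strongly each topic is present in each document.
W = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [1.0, 1.0],
])

# 2 topics x 4 words: how strongly each word is associated with each topic.
H = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
])

# 3 documents x 4 words: the word counts these weights would explain.
X = W @ H
print(X.shape)
print(X)
```

The third document is an even mix of both topics, so its row of `X` is the sum of the two rows of `H`.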

This is a bit abstract and mathematical, but putting it into action requires three easy steps: 

1. Import the model we want. 
2. Initialize an instance of the model. 
3. Fit the model on data. 

If you don't get it yet, don't worry. It might feel more concrete after we explore via example.

NMF requires us to specify `n_components`, which is the number of topics to find. Choosing the right number of topics is a bit of an art, but there are also quantitative approaches based on Bayesian statistics that we won't go into here. 
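The three steps might look like the sketch below. The random count matrix stands in for the real term-document matrix, and the choice of 4 topics, along with the `init`, `random_state`, and `max_iter` settings, is an assumption for illustration rather than the lecture's actual configuration.

```python
import numpy as np

# Step 1: import the model we want.
from sklearn.decomposition import NMF

# Hypothetical nonnegative counts: 12 documents x 30 words.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(12, 30)).astype(float)

# Step 2: initialize an instance of the model.
# n_components is the number of topics to find.
model = NMF(n_components=4, init="random", random_state=0, max_iter=500)

# Step 3: fit the model on data.
model.fit(X)

# The fitted topics: one row per topic, one column per word.
print(model.components_.shape)
```

Setting `random_state` makes the result reproducible, since NMF with random initialization can otherwise give different topics on different runs.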

There are two important parts of NMF. First, we have the topics themselves, which are stored in the `components_` attribute of the model. 

### What does this mean?
Each row is a different component: 0, 1, 2, 3, and so on. We can think of each component as a collection of **weights**, one for each word.

We can find the most important words in each component by finding the words where the weights are highest within that component. We can do this with a handy function called `np.argsort()`, which returns the indices that would sort an array in ascending order. Reading those indices from the end tells us which entries are the largest, second largest, etc.

We can then use `numpy` "fancy" indexing to arrange the words in the needed order. 
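Here is a minimal sketch of that idea. The word list, the weights, and the helper `top_words()` are all hypothetical; with a fitted model, the weights would come from `components_` and the words from the vectorizer.

```python
import numpy as np

# Hypothetical vocabulary and topic weights (2 topics x 5 words).
words = np.array(["queen", "hatter", "turtle", "rabbit", "tea"])
components = np.array([
    [0.1, 2.0, 0.0, 0.3, 1.5],
    [1.8, 0.0, 0.2, 0.9, 0.1],
])

def top_words(components, words, topic, n=3):
    """Return the n words with the highest weight in the given topic."""
    # argsort gives indices in ascending order of weight.
    order = np.argsort(components[topic])
    # Reverse to get descending order, then keep the first n.
    top = order[::-1][:n]
    # Fancy indexing: index the word array with an array of positions.
    return words[top]

print(top_words(components, words, 0))  # ['hatter' 'tea' 'rabbit']
print(top_words(components, words, 1))  # ['queen' 'rabbit' 'turtle']
```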

The next important aspect of topic modeling is the assignment of topics to each document. This is done via weights, which we can access by using the `transform()` method of the model. 
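As a rough sketch, again using a random stand-in for the term-document matrix and an arbitrary choice of 3 topics (both assumptions), the document-level weights can be obtained like this:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical nonnegative counts: 12 documents x 30 words.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(12, 30)).astype(float)

model = NMF(n_components=3, init="random", random_state=0, max_iter=500)
model.fit(X)

# One row per document, one column per topic.
weights = model.transform(X)
print(weights.shape)

# Multiplying the document weights by the topics gives an
# approximate reconstruction with the same shape as X.
reconstruction = weights @ model.components_
print(reconstruction.shape)
```

In the notation used earlier, `weights` plays the role of $W$ and `model.components_` plays the role of $H$.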

The weights indicate the relative presence of each topic in each chapter. For example, Topic 2 is highly present in the first six chapters, but then mostly absent for the rest of the book. Topic 3 appears in Chapters 7 and 11, and so on. 

We can also visualize the same information as a line chart. Let's add as labels some of the top words for each topic. 
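A plot along these lines could be produced with `matplotlib`, as in the sketch below. The weights and the label strings are invented for illustration; in practice they would come from `transform()` and from the top words of each topic.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical topic weights: 6 chapters x 3 topics.
weights = np.array([
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
    [0.2, 0.7, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.8],
    [0.1, 0.0, 0.9],
])

# Hypothetical labels built from the top words of each topic.
labels = ["rabbit, mouse, caterpillar", "hatter, tea, dormouse", "queen, turtle, king"]

chapters = np.arange(1, weights.shape[0] + 1)

fig, ax = plt.subplots(figsize=(8, 3))

# One line per topic, showing its weight in each chapter.
for k in range(weights.shape[1]):
    ax.plot(chapters, weights[:, k], label=labels[k])

ax.set(xlabel="Chapter", ylabel="Topic weight")
ax.legend(title="Top words")
```

In a notebook the figure renders automatically at the end of the cell; in a script, a call to `plt.show()` or `fig.savefig(...)` would be needed.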

This plot allows us to easily see several major features of the plot of the novel, including the tea party with the March Hare, the Mad Hatter, and the Dormouse (Chapter 7), the croquet game in the court of the Queen of Hearts (Chapter 8), the appearance of the Mock Turtle and the Lobster (Chapters 9 and 10), and the reappearance of many characters in Chapter 11. 

Lastly, I should add that the list of applications of Nonnegative Matrix Factorization is endless. Here are a couple of cool examples from researchers right here at UCLA. This paper uses NMF to understand the discussion of COVID-19 on Twitter (https://arxiv.org/abs/2010.01600), and this paper discusses using NMF to make better topic-aware chatbots (https://arxiv.org/abs/1912.00315). 