# <font face="times"><font size="6pt"><p style = 'text-align: center;'> The City University of New York, Queens College

<font face="times"><font size="6pt"><p style = 'text-align: center;'><b>Introduction to Computational Social Science</b><br/><br/>

<p style = 'text-align: center;'><font face="times"><b>Lesson 11 | Natural Language Processing III: Topic Modeling with the LDA</b><br/><br/>

<p style = 'text-align: center;'><font face="times"><b>7 Checkpoints</b><br/><br/>



***
***
# Begin Lesson 11
## Topic Model - The LDA Model

Topic modeling is one of the important parts of natural language processing. 

The goal of topic modeling is to capture the topics in a set of documents and cluster the documents by their topics. Here, a topic is a cluster of terms that frequently co-occur across documents in a corpus. With many topic models, documents are seen not in terms of their syntax but instead as a mere "bag of words." In other words, documents are just containers for words, where the order in which the words occur doesn't matter. What does matter is that these words co-occurred in the same document. 
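To make the "bag of words" idea concrete, here is a minimal sketch that uses only Python's standard library. The two example "documents" are made up for illustration; the point is simply that two texts with the same words in a different order end up as the same bag.

```python
from collections import Counter

# Two made-up "documents" containing the same words in a different order.
doc_a = "sanctions hit russia after russia vote"
doc_b = "russia vote after sanctions hit russia"

# A bag of words only records how often each word occurs, not where it occurs.
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a)
print(bag_a == bag_b)  # True: word order is ignored
```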

Here we're going to use one of the most popular topic models, called LDA (Latent Dirichlet Allocation). Since the LDA model doesn't have to process one large matrix all at once, it tends to run relatively quickly.

!["LDA"](Images/12_LDA.png)

LDA (short for Latent Dirichlet Allocation) is an unsupervised machine-learning model that takes documents as input and finds topics as output. The model also says in what percentage each document talks about each topic.

The mechanics of the LDA model are complex, so let's look at it from a high level. Assume you have a corpus made up of documents, and each of these documents is made up of terms. (And like the good computational social scientist, you already stemmed and cleaned up these terms in each document across the corpus.) The LDA model assumes that the terms of every document are actually drawn from a particular mixture of some number of topics, where a topic is characterized by a distribution of terms. 

Both the topics in a document and the words in a topic follow something called a Dirichlet distribution. This distribution is loosely related to a coin flip. When you flip a coin, you have one of two options: heads or tails. Now, replace "heads" and "tails" with the topics that exist in your corpus. It's highly unlikely that you have only two topics, so we use the Dirichlet distribution to handle a coin with multiple sides. (Obviously, a coin only has two sides, but I want to make the point that the Dirichlet distribution essentially describes a coin toss with multiple sides.) 

Thus, a topic is represented as a weighted list of words. An example of a topic is shown below:

!["LDA"](Images/12_LDA_2.png)

The LDA topic model will "train" or "learn" these data (think back to machine learning!), and determine the probability distribution of every term across all of the topics. (We can also determine from this the probability distribution of documents across these topics.)


So, how many topics are there? There is no correct answer, because you decide how many exist **before** you run the LDA. This is where the art and the science of LDA come head-to-head. How we choose the number of topics depends entirely on the purpose of the project. However, there are ways to measure whether the number of topics we picked is a good "fit" for our specific data. 

Thus, there are 3 main parameters of the model:
- the number of topics, given by the parameter `K`
- `Alpha`, which represents document-topic density. With a higher alpha, documents are made up of more topics, and with a lower alpha, documents contain fewer topics. A lower alpha therefore results in a more specific topic distribution per document. 
- `Beta`, which represents topic-word density. With a higher beta, topics are made up of most of the words in the corpus, and with a lower beta they consist of few words. A lower beta therefore results in a more specific word distribution per topic.

In reality, the last two parameters are not exactly designed like this in the algorithm, but I prefer to stick to these simplified versions which are easier to understand.
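To get a feel for what `alpha` does, here is a minimal sketch that draws topic mixtures from a symmetric Dirichlet distribution using only the standard library. The helper names, the number of topics (5), and the two alpha values are illustrative assumptions, not anything `gensim` does internally. With a small alpha, one topic tends to dominate each simulated document; with a large alpha, the topics are spread more evenly.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n_docs=2000, seed=0):
    """Average share of the single biggest topic across simulated documents."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)]
    return sum(shares) / n_docs

sparse = mean_largest_share(alpha=0.1)   # low alpha: one topic dominates
dense = mean_largest_share(alpha=10.0)   # high alpha: topics are mixed evenly

print("low alpha :", round(sparse, 2))
print("high alpha:", round(dense, 2))
```

The same intuition carries over to `beta`, with "words in a topic" taking the place of "topics in a document."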

We're going to be using `gensim` to run our topic models. First, let's import the modules we'll need. 

In [None]:
import gensim
from gensim.matutils import Sparse2Corpus
from gensim import corpora, models, similarities
from gensim.corpora import Dictionary, MmCorpus
from nltk.corpus import stopwords
import nltk 
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import re

import pandas as pd
import numpy as np
import time
from dateutil.parser import parse
import requests
import string

import matplotlib.pyplot as plt
%matplotlib inline

In this example, let's use a dataset of articles taken from BBC’s website. To implement the LDA in Python, let's use the package `gensim`.

In [None]:
data = pd.read_csv('Data/articles_bbc_2018_01_30.csv')

In [None]:
data.head()

In [None]:
data.shape

Let's remove any NAs from our dataset, as many articles may not have any text associated with them. (E.g., the web scraping process is far from perfect, and sometimes we can't extract data.) 

In [None]:
data = data.dropna().reset_index(drop=True)

Let's see how many rows we dropped. 

In [None]:
data.shape

***
***

# Checkpoint 1 of 7
## Now you try!

### For this checkpoint, just read in similar data to the BBC news articles. These data will come from the New York Times. Label the `DataFrame` as `NYTimes_df`. These New York Times articles deal with issues related to Silicon Valley and the tech industry in San Francisco. Since these data are really big, we're just going to subset these data and take the first 1000 rows, so be sure to use `.head(1000)` when reading in these data. 

    NYTimes_df = pd.read_csv('Data/New_York_Times_SF_All.csv').head(1000)
    
### Explore the data just as you did for the BBC data.  Here, you'll focus on the `article` column. 


### Don't forget to remove NAs! 

### One small caveat (and this may be useful to you for future work you do): Here, we'll remove any document that contains fewer than 50 characters. Documents with just short sentences, for instance, are not all that informative. So also run the following code below.  

    NYTimes_df = NYTimes_df[NYTimes_df['article'].map(len) >= 50]

### This code keeps rows (articles) that have a length of at least 50 characters. 

***
***

***
***

## Cleaning the data

First, we'll need to clean the data. Note here that many of these articles are NOT in English. For our example, we're only going to focus on English language articles, but you can just as easily run LDAs on different languages.

Since we don't have data on whether the article was written in English or not, we need to install a package called `langdetect` that will provide a fairly reasonable guess as to what language each article is written in. 

Use `pip` to install `langdetect` and import that module in. Specifically, we'll use a function called `detect` from this module. 

In [None]:
!pip3.6 install --user langdetect

In [None]:
from langdetect import detect

Now, use the `apply` method on the `articles` column in our `DataFrame` to apply the `detect` function from `langdetect`. It will return a two-letter code for the language it detects in every article. 

In [None]:
data['lang'] = data.articles.apply(detect)

Let's take a peek at our new column. 

In [None]:
data.head()

Great! It seems to mostly work! 

Now let's use the `pandas` method `.value_counts()` to see how many articles are written in each language. As you can see, `en` (English) is far in the majority. (Given that this is the BBC, this shouldn't come as a surprise.) Since English data are plentiful, we'll use them. 

In [None]:
data.lang.value_counts()

Since we only want English articles, let's subset on the new column we just made and keep only the rows where `lang` is `en`. 

In [None]:
data = data.loc[data.lang=='en']

***
***

# Checkpoint 2 of 7
## Now you try!

### Using the New York Times `DataFrame`, count the number of articles that are in English, and keep only the articles that are in English. 

### Note, that you'll focus only on the column called `article`. This may take some time! 

### Also recall how we removed any article with fewer than 50 characters. This is because `detect` will only work if there's enough text to use to determine the language. 

***
***

Now, we're going to have to do the dirty work of cleaning each article and preparing it for our LDA. Thankfully, we've actually done this process before, so it should be rather easy. 

If you are a bit shaky on this process, review the previous Lecture Notebook on cleaning data for a good refresher. 

We're going to remove punctuation and stem the words in these articles (using the Snowball stemmer), as well as remove any English stop words. In this process, we'll convert each article into a list of tokens to be processed by the LDA. 

Below, let's define the function that will do the heavy lifting that we explored last time. The stop words, punctuation, and stemmer are all set up inside it. 

In [None]:
def token_process(doc):
    
    ## stop words and updates
    ## Note, you should add more terms to this list to see what may or may not be useful.
    ## Also note, that I also remove punctuation here by adding the string module
    stop_en = stopwords.words('english') + list(string.punctuation) + [u',',u'.',u'?',u'!',u':',u';', u')', u'(',u'[',u']',u'{',u'}',u'%',u'@',u'-',u'`',
                                           u'san',u'francisco',u'san francisco',u'new',u'tr',u'th',u'to',u'on',u'of',u'mr',
                                           u'monday','tuesday',u'wednesday',u'thursday',u'friday',u'saturday',u'sunday','want','befor','becaus',
                                           u'said',u'ms',u'york',u'say',u'could',u'q',u'got',u'found',u'began','|',"''","'s","``","--",
                                           'mr','year','would','one','way','l','ms.','$','mr.','dr.','get','before','like','know','day','because',
                                           '"','see','look','dont','im','&','b','also','de','la','el','en','un','two','al','su','es','lo','se']
    
        
    # Stemmer (the Snowball stemmer also lower-cases each token)
    stemmer = SnowballStemmer("english")
    
    # Tokenize; fall back to an empty list so the filters below never receive None
    tokens = [w.strip() for sent in sent_tokenize(doc) for w in word_tokenize(sent)] if doc else []
    
    # Remove tokens that start with a number (the decimal point is escaped so it only matches ".")
    num_pat = re.compile(r'^(-|\+)?(\d*)\.?(\d+)')
    tokens = filter(lambda x: not num_pat.match(x), tokens)
    
    # Remove dates such as 01/30/2018 or 1-30-18
    date_pat =  re.compile(r'^(\d{1,2})(/|-)(\d{1,2})(/|-)(\d{2,4})$')
    tokens = filter(lambda x: not date_pat.match(x), tokens)
    
    # Stem each token
    stemmed_tokens = map(lambda x: stemmer.stem(x), tokens)
    
    # Filter out empty tokens and stop words
    stemmed_tokens = filter(lambda x: x and x.strip() not in stop_en, stemmed_tokens)

    # Return the cleaned tokens as a list
    return list(stemmed_tokens)

Now, let's apply `token_process`, a modified (and combined) version of the function from last time, to the articles column. 

This function will produce tokens that have been put in lower case and stemmed, and from which stop words and punctuation have been removed. Here, I add more stop words and punctuation marks that may be unique to this corpus. When you actually do this with your own data, you will need to go back and update this list accordingly.  

We'll apply `token_process()` to each row in `articles` by using the `pandas` `.apply()` method with a `lambda` function. The end result is that an article that was once one long string is now a list of tokens that have been cleaned up and are ready for processing. 

In [None]:
data["articles_filtered"] = data["articles"].apply(lambda doc: token_process(doc))

Now, let's take a look at this new column. 

In [None]:
data.articles_filtered.head()
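One subtlety with regular-expression filters like the number pattern used in the cleaning step: an unescaped `.` matches any single character, not just a decimal point. The sketch below compares two illustrative patterns (the names `num_pat` and `loose_pat` and the sample tokens are assumptions made for this example) so you can see which tokens each one would throw away.

```python
import re

# Escaped decimal point: only matches tokens that START with a number.
num_pat = re.compile(r'^[-+]?\d*\.?\d+')
# Unescaped "." matches ANY single character, so "b2b" also looks like a number.
loose_pat = re.compile(r'^[-+]?\d*.?\d+')

tokens = ["sanctions", "2018", "-3.5", "16,000", "b2b", "35-year-old"]

# Keep only the tokens that the pattern does NOT match.
kept_strict = [t for t in tokens if not num_pat.match(t)]
kept_loose = [t for t in tokens if not loose_pat.match(t)]

print(kept_strict)  # ['sanctions', 'b2b']
print(kept_loose)   # ['sanctions']
```

Note that because `.match()` only anchors at the start of the token, anything that begins with a number (such as `35-year-old`) is removed in both cases.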

***
***

# Checkpoint 3 of 7
## Now you try!

### Filter the articles in just the same way as we did here for the BBC articles. 

***
***

***
***

### Let's run the LDA!

We'll need to turn this column into a format that `gensim` can work with. 

First, convert this column to a `list` using the `.tolist()` method. 

In [None]:
initial_corpus = data["articles_filtered"].tolist()

From the module `corpora`, we'll use the `Dictionary` class to convert our list of tokens into a `gensim` `Dictionary` object. We do this because we want to do some "trimming" of the corpus that is more easily done (e.g., doesn't take up too much memory) in this format. 

(Indeed, the biggest challenge with text analysis is memory constraints, so converting objects into other forms may be annoying, but it is a necessity to avoid running out of memory and halting your code.)

In [None]:
dictionary_LDA = corpora.Dictionary(initial_corpus)

Many of the words are going to be superfluous: one-off words without any real meaning that we didn't catch with the stop words. This is rather common, as capturing every meaningless token is nearly impossible. 

Instead, we can create a cutoff value that removes terms if they appear in fewer than `n` documents in the corpus. Here, we can use the `.filter_extremes()` method of our `Dictionary`. 

I set `no_below` to 3, so any term that occurs in fewer than three documents will be removed. (Because we leave the other arguments at their defaults, `.filter_extremes()` also drops terms that appear in more than half of the documents.) 

**Note:** This is a bit confusing, but we don't need to assign the result back to `dictionary_LDA`. The method modifies the dictionary in place. 

In [None]:
dictionary_LDA.filter_extremes(no_below=3)
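If it helps to see the idea behind the `no_below` cutoff spelled out, here is a minimal standard-library sketch. It is not how `gensim` implements `.filter_extremes()`; the function name `filter_rare_tokens` and the toy documents are assumptions for illustration. The key detail is that each token is counted once per document (its document frequency), not once per occurrence.

```python
from collections import Counter

def filter_rare_tokens(docs, no_below=3):
    """Keep only tokens that appear in at least `no_below` documents."""
    # Count each token once per document (document frequency, not raw count).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keep = {tok for tok, n in doc_freq.items() if n >= no_below}
    return [[tok for tok in doc if tok in keep] for doc in docs]

toy_docs = [
    ["russia", "sanction", "bank"],
    ["russia", "bank", "bank", "zzyzx"],
    ["russia", "vote", "bank"],
]

# "russia" and "bank" appear in all three documents; everything else is dropped.
print(filter_rare_tokens(toy_docs, no_below=3))
```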

Now, let's prepare the corpus before we submit it to the LDA model. 

Recall that the LDA model sees documents as a "bag of words." In other words, documents are a big burlap sack of words, and the fact that these words were used in this document means something. Namely, that documents are themselves just collections of topics. We just don't know how many topics and which terms are in these topics. This is the job of the LDA to uncover!

For the package `gensim`, we can't merely pass in the list of tokens in its current form. Instead we need to pass it in a very particular format, namely as a list of tuples. Each tuple describes one token: first its unique identifier (given as a number), followed by the number of times the token appears in this specific document. 

Let's actually see what this means. First, let's convert our list of documents currently stored as a list of tokens (again, this is found in `initial_corpus`) into a list of tuples of token IDs and the number of times the token appears in the document. 

We'll use the method `.doc2bow()` (in other words, document to "bag of words") to convert the string tokens of each "document" in the list `initial_corpus` into these tuples. 

So, we use list comprehension to loop through each article in `initial_corpus` and convert every list to a list of tuples.

In [None]:
corpus = [dictionary_LDA.doc2bow(doc_) for doc_ in initial_corpus]

So, let's see the first document in the corpus, for example. 

In [None]:
corpus[0]

Here, we see a list of tuples, where the first entry in each tuple is the unique ID assigned to a token, followed by the number of times that token appears in the first document. 
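Here is a rough standard-library sketch of what this conversion amounts to. The helper names `build_vocab` and `to_bow` and the toy documents are assumptions for illustration, and `gensim` assigns its token IDs differently, so the actual numbers will not match. The shape of the output, a sorted list of `(token_id, count)` tuples, is the part to focus on.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Turn a list of tokens into sorted (token_id, count) tuples."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [["russia", "bank", "bank"], ["vote", "russia"]]
token2id = build_vocab(toy_docs)

print(token2id)                       # {'russia': 0, 'bank': 1, 'vote': 2}
print(to_bow(toy_docs[0], token2id))  # [(0, 1), (1, 2)]
```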

Let's see how this corresponds to what we have in our `initial_corpus`. 

First, let's look at the first entry.

We can use `dictionary_LDA` to convert from a token ID to the actual token. So let's try that out. 

In [None]:
dictionary_LDA[PUT_THE_FIRST_ENTRY_HERE]

Look at first ID entry and its actual value (i.e., not the ID).  

If we look at the raw data from `initial_corpus`, looking at the same document, the number of times this token occurs is the second value in the tuple. 

(Check for yourself! To do this, use `Control-F` or `Command-F` [depending on your OS] in your browser and search for the term `listed`. It should appear twice in the long list below.)

In [None]:
initial_corpus[0]

We can even do this with the original article. Print out the original raw article and find the number of times `listed` appears. It should be twice. 

In [None]:
print(data.articles.iloc[0])  # .iloc picks the first remaining row, which matches initial_corpus[0]

***
***

# Checkpoint 4 of 7
## Now you try!

### Prepare the `Dictionary`, `corpus`, and the `initial_corpus` for the New York Times data. 

### Label the `Dictionary` as `NYTimes_dictionary_LDA`, the `corpus` as `NYTimes_corpus`, and the `initial_corpus` as `NYTimes_initial_corpus`. 

***
***

***
***

### Running the LDA

Now that we've done the heavy lifting (i.e., formatting our data), we can run the LDA. This is the easy part! 

Recall that we need to pick out the proper hyperparameters. For our purposes, we'll focus on two here: `K` topics and `alpha`. `K` is more of a tuning parameter, in that it's a bit of a guess as to what the right value of `K` is. 

For the sake of argument, let's just use a `K` of 20; that is, we expect there to be 20 distinct topics in our BBC article corpus. Let's set `alpha` to 0.01, a small value that pushes each document toward only a few topics. We need to pass `alpha` in as a list whose length is the same as the number of topics. This is achieved by `[0.01]*num_topics`. 

We also pass in our "dictionary" that we created earlier: `dictionary_LDA` as our token ID to terms dictionary. 

In [None]:
num_topics = 20 # Number of Topics, we set initially as K. 

In [None]:
np.random.seed(123456)

Okay, let's FINALLY run it! 

Pass in `corpus`, the number of topics, the `dictionary_LDA` and `alpha`. LDAs can take some time. To speed this up, we're going to use a special version of the LDA model called `LdaMulticore`. It's an LDA model that uses multiple processor cores (if available) to speed up training! (You can use the regular `models.LdaModel()` as well to similar effect.)

**NOTE:** This may throw a lot of warnings at you and may take a few minutes. Don't freak out at first!

In [None]:
lda_model = models.LdaMulticore(corpus,
                                num_topics=num_topics,
                                id2word=dictionary_LDA,
                                alpha=[0.01]*num_topics)

Hopefully your computer didn't blow up and the model completed successfully! Let's take a look at the top 20 words most associated with each of the 20 topics. 

We'll run a `for` loop and use the method `.show_topics()` to print out the top 20 terms for each of the 20 topics. 

In [None]:
for i,topic in lda_model.show_topics(formatted=True, num_topics=num_topics, num_words=20):
    print(str(i)+": "+ topic)
    print()

We might be able to see a pattern with some of these topics. We'd likely need to go through the corpus one more time and remove more punctuation and stop words. Indeed, the first time you run an LDA, you'll likely need to remove terms that happened to slip through and then re-run the model. 

Recall our "bag of words" assumption. Let's again look at the first article in the corpus and see what topics make up this article. 

In [None]:
print(data.articles.iloc[0])  # .iloc picks the first remaining row, which matches corpus[0]

Clearly, this article has to do with Russia and U.S. sanctions. We can use the LDA model we just ran and see what topics make up this article. 

Let's use `lda_model` and pass in the text from `corpus`, namely the first document, to see. 

In [None]:
lda_model[corpus[0]]

This produces a list of tuples. The first value is the ID related to the topic and the second value is how much this topic dominates the document.

Look at what the top terms for each topic are (as produced by the `for` loop). Does this make sense to you?
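If you want to pull out the dominant topic programmatically rather than by eye, a small sketch like the following works on any list of `(topic_id, weight)` tuples. The values in `doc_topics` are made up to mimic the shape of the output above; they are not real model results.

```python
# Hypothetical output with the same shape as `lda_model[corpus[0]]`.
doc_topics = [(3, 0.12), (7, 0.61), (15, 0.27)]

# Sort the topics from the largest share of the document to the smallest.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, dominant_weight = ranked[0]

print(dominant_topic, dominant_weight)  # 7 0.61
```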

***
***

# Checkpoint 5 of 7
## Now you try!

### Run the LDA model for your NY Times data. Choose some `K` number of topics. I would suggest picking a value of `K` that is less than 20 but greater than 5. (Let's keep `alpha` as 0.01.) Save your LDA model as `NYTimes_lda_model`. 

### Print out the top twenty words for each topic. 

### Which topics make up the first article in your corpus?

***
***

***
***

### Predicting topics on unseen documents

Let's try to predict the topics in a document the LDA has not seen (i.e., one that was not part of the training corpus). Here is a sample document about Twitter. Let's see what topics the LDA finds in this unseen article. 

In [None]:
document = '''Eric Tucker, a 35-year-old co-founder of a marketing company in Austin, Tex., had just about 40 Twitter followers. But his recent tweet about paid protesters being bused to demonstrations against President-elect Donald J. Trump fueled a nationwide conspiracy theory — one that Mr. Trump joined in promoting. 

Mr. Tucker's post was shared at least 16,000 times on Twitter and more than 350,000 times on Facebook. The problem is that Mr. Tucker got it wrong. There were no such buses packed with paid protesters.

But that didn't matter.

While some fake news is produced purposefully by teenagers in the Balkans or entrepreneurs in the United States seeking to make money from advertising, false information can also arise from misinformed social media posts by regular people that are seized on and spread through a hyperpartisan blogosphere.

Here, The New York Times deconstructs how Mr. Tucker’s now-deleted declaration on Twitter the night after the election turned into a fake-news phenomenon. It is an example of how, in an ever-connected world where speed often takes precedence over truth, an observation by a private citizen can quickly become a talking point, even as it is being proved false.'''


In [None]:
tokens = token_process(document) # Clean the document exactly as we cleaned the training articles, so its tokens match the dictionary

In [None]:
tokens

Let's pass in this new article and see what topics comprise it according to the LDA. 

We can use the `dictionary_LDA` object and its `.doc2bow()` method to convert the tokens from `document` into a bag of words, and then pass that to `lda_model` to calculate the topic distribution. 

Here, I've saved the output as a `DataFrame` to make it easier to see. 

In [None]:
pd.DataFrame([(el[0], round(el[1],2)) for el in lda_model[dictionary_LDA.doc2bow(tokens)]], columns=['topic #', 'weight'])

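We see here the distribution of topics within this document. Note that the weights sum to roughly 100%. They may fall slightly short because `gensim` leaves out topics with a very small share and because we round to two decimals.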

Given the article, look back at the top terms for each of these topics and see if you agree with this distribution. 
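One caveat worth keeping in mind with unseen documents: `.doc2bow()` silently ignores any token that is not already in the dictionary. The sketch below imitates that behavior with the standard library (the tiny `token2id` vocabulary and the `to_bow` helper are hypothetical stand-ins, not `gensim` objects). Tokens that were not lower-cased and stemmed the same way as the training data simply fail to match.

```python
from collections import Counter

# Hypothetical vocabulary of cleaned (lower-cased, stemmed) training tokens.
token2id = {"twitter": 0, "protest": 1, "news": 2}

def to_bow(tokens, token2id):
    """Count known tokens; tokens missing from the vocabulary are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

raw_tokens = ["Twitter", "protesters", "News"]   # not cleaned
clean_tokens = ["twitter", "protest", "news"]    # cleaned like the training data

print(to_bow(raw_tokens, token2id))    # []
print(to_bow(clean_tokens, token2id))  # [(0, 1), (1, 1), (2, 1)]
```

An empty or nearly empty bag of words gives the model very little to work with, so the predicted topic weights for such a document should be read with caution.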

***
***

## Visualizing Topics

`pyLDAvis` is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

The visualization is intended to be used within a Jupyter notebook but can also be saved to a stand-alone HTML file for easy sharing.

First, we need to install `pyLDAvis`. We'll also need to install `joblib` to help us run it. 

In [None]:
!pip3.6 install --user pyLDAvis
!pip3.6 install --user joblib

Now, let's import these modules, and set `%matplotlib inline` to visualize the LDA results.

In [None]:
%matplotlib inline
import pyLDAvis
import pyLDAvis.gensim  # in newer pyLDAvis releases this module is named pyLDAvis.gensim_models
from joblib import parallel_backend

Now, let's run the pyLDAvis! Thankfully, we already have all the inputs that we need from our previous work. 

One somewhat confusing thing is that we need to add an extra line of code, `with parallel_backend('threading'):`, in order for it to work. This is because `pyLDAvis` is very CPU and memory intensive, and parallelizing it in the backend will help ensure the visualization works and doesn't crash. 

**NOTE:** This may take a few minutes to run! 

In [None]:
with parallel_backend('threading'):
    vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=corpus, dictionary=dictionary_LDA)

Here's a short vignette that explains more specifically what's happening: 

https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf

In broad strokes, it visualizes the topics that were found in the LDA model. 

First, in the box next to "Selected Topic," put in a topic number to visualize the topic. (Note that `pyLDAvis` numbers topics starting at 1 and, by default, orders them by size, so with 20 topics the numbers run from 1 to 20 and generally do not match the `gensim` topic IDs printed earlier.) 

For each topic, the visualization produces two plots. The plot to the left is the "intertopic distance" map, or how dissimilar or similar topics are to one another. (This procedure actually uses a PCA!) The size of each topic's "bubble" is proportional to that topic's share of all of the tokens in the corpus. 

The bar chart on the right shows the most relevant terms for the selected topic. The red bars estimate the number of times a given term was generated by a given topic. The blue bars capture the overall frequency of each term in the corpus. Finally, the relevance of words is computed with a parameter lambda, where a value of around 0.6 was found to work well (https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf). 

In [None]:
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)
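For the curious, the relevance score from the paper linked above combines how probable a term is within a topic with how much more probable it is in that topic than in the corpus overall. Here is a minimal sketch of that formula; the function name and the probabilities are made up for illustration and are not taken from our model.

```python
import math

def relevance(p_term_given_topic, p_term, lam=0.6):
    """lam * log p(term | topic) + (1 - lam) * log(p(term | topic) / p(term))."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# Made-up probabilities: a common term versus a rarer, topic-specific term.
common = {"p_term_given_topic": 0.05, "p_term": 0.04}
distinctive = {"p_term_given_topic": 0.03, "p_term": 0.003}

for lam in (1.0, 0.6, 0.0):
    print(lam, relevance(lam=lam, **common), relevance(lam=lam, **distinctive))
```

With lambda at 1, terms are ranked purely by their probability within the topic, so the common term wins. As lambda shrinks, terms that are distinctive to the topic move up the ranking.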

***
***

# Checkpoint 6 of 7
## Now you try!

### Use `pyLDAvis` for the LDA model you ran on the New York Times data. Explore a few topics. Remember, everyone will pick a different value of `K`, so the results will vary. 

***
***

***

## Validating our LDA Model | What Value of `K` is Best?

There are many approaches to evaluating topic models, but most of them are rather poor. One way is just human inspection, to see if these topics make sense. Hence, topic visualization is a good way to assess topic models. 

However, can we find some sort of quantifiable metric? One measure that is commonly used is perplexity, a measure of entropy.

Perplexity is an intrinsic evaluation metric that is widely used for language model evaluation. It captures how surprised a model is by new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set. 

Focussing on the log-likelihood part, you can think of the perplexity metric as measuring how probable some new unseen data is given the model that was learned earlier. That is to say, how well does the model represent or reproduce the statistics of the held-out data? However, recent studies have shown that predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated. In other words, optimizing for perplexity may not yield human interpretable topics. 
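As a rough illustration of the "normalized log-likelihood" idea, the sketch below computes perplexity from a list of per-token probabilities. The function and the numbers are assumptions for illustration; this is not how `gensim` reports perplexity for an LDA model. A model that spreads its bets evenly over 50 words has a perplexity of about 50, and a model that assigns higher probability to the held-out tokens scores lower.

```python
import math

def perplexity(token_probs):
    """exp of the negative average log-probability assigned to held-out tokens."""
    avg_log_lik = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_lik)

# A model that guesses uniformly over a 50-word vocabulary.
uniform_model = [1 / 50] * 200
# A model that assigns each held-out token a probability of 0.1.
better_model = [0.1] * 200

print(perplexity(uniform_model))  # approximately 50
print(perplexity(better_model))   # approximately 10
```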

#### Topic Coherence
The concept of topic coherence combines a number of measures into a framework to evaluate the coherence between topics inferred by a model. 

Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference.

A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport,” “the game is played with a ball,” and “the game demands great physical efforts.”

There are several ways to measure coherence. We'll use a method called `c_v`. It's a measure based on a sliding window, one-set segmentation of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity. 

Two other coherence measures are:
1. `c_uci`, which is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words
2. `u_mass` (sometimes written `C_umass`), which is based on document co-occurrence counts, a one-preceding segmentation, and a logarithmic conditional probability as the confirmation measure
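To give a flavor of what these measures compute, here is a simplified sketch of an NPMI-style coherence score. It counts co-occurrence at the level of whole documents rather than a sliding window, so it is not the `c_v` measure `gensim` computes; the function name and toy documents are assumptions for illustration. Word pairs that tend to show up in the same documents push the score up.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs, eps=1e-12):
    """Average normalized PMI over all pairs of top words (document-level)."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1 = sum(w1 in d for d in doc_sets) / n
        p2 = sum(w2 in d for d in doc_sets) / n
        p12 = sum(w1 in d and w2 in d for d in doc_sets) / n
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: lowest possible NPMI
            continue
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / (-math.log(p12) + eps))  # eps avoids dividing by zero
    return sum(scores) / len(scores)

toy_docs = [
    ["game", "ball", "team"],
    ["game", "team", "ball", "score"],
    ["bank", "loan", "rate"],
    ["bank", "rate", "loan", "game"],
]

coherent_topic = ["game", "ball", "team"]
mixed_topic = ["game", "loan", "ball"]

print(npmi_coherence(coherent_topic, toy_docs))
print(npmi_coherence(mixed_topic, toy_docs))
```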

First, let's import `CoherenceModel`. 

In [None]:
from gensim.models import CoherenceModel

Now, let's compute the coherence score. We've run everything that we need already, so it's a matter of putting it into our function. 

In [None]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model,\
                                     texts=initial_corpus,\
                                     dictionary=dictionary_LDA,\
                                     coherence='c_v')

Now, let's run the `.get_coherence()` method from our coherence object. 

In [None]:
coherence_lda = coherence_model_lda.get_coherence()

In [None]:
coherence_lda

We get a rather high coherence value! That's great. However, is it optimal? We want the smallest number of topics `K` that still yields a high coherence value. 

If you recall our `GridSearch` approach, we can perform a similar task here. We can actually test this out here. **However,** it'll require a lot of processing power! 

So, for our purposes here, we won't try it. That said, I provide code below where you can pass in the various parameters of the LDA model and it will calculate coherence scores. 

As with most machine learning models, you want to minimize `K` while yielding the highest value for coherence. 

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=5):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics_K in range(start, limit, step):
        lda_model_K = models.LdaMulticore(corpus,\
                            num_topics=num_topics_K,\
                                  id2word=dictionary,\
                                  alpha=[0.01]*num_topics_K)
        
        model_list.append(lda_model_K)
        coherencemodel = CoherenceModel(model=lda_model_K, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
#model_list, coherence_values = compute_coherence_values(dictionary=dictionary_LDA,\
#                                                        corpus=corpus,\
#                                                        texts=initial_corpus,\
#                                                        start=5,\
#                                                        limit=30,\
#                                                        step=5)

You can then map out these values and plot them to find what the best coherence score might be. 

In [None]:
#import matplotlib.pyplot as plt
#limit=30; start=5; step=5;
#x = range(start, limit, step)
#plt.plot(x, coherence_values)
#plt.xlabel("Num Topics")
#plt.ylabel("Coherence score")
#plt.legend(["coherence_values"], loc='best')
#plt.show()
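Once you have a coherence score for each candidate `K`, one simple way to act on the "smallest `K` with high coherence" idea is to take the smallest `K` whose score is within a small tolerance of the best score. The sketch below is only one possible rule of thumb; the function name, the tolerance, and the coherence values are all made up for illustration.

```python
def pick_num_topics(ks, coherence_values, tolerance=0.01):
    """Return the smallest K whose coherence is within `tolerance` of the best."""
    best = max(coherence_values)
    for k, score in sorted(zip(ks, coherence_values)):
        if score >= best - tolerance:
            return k

# Made-up coherence scores for K = 5, 10, 15, 20, 25.
ks = [5, 10, 15, 20, 25]
scores = [0.41, 0.48, 0.52, 0.515, 0.50]

print(pick_num_topics(ks, scores))  # 15
```

A looser tolerance favors fewer topics, so it is worth checking that the topics you end up with still make sense to a human reader.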

***
***

# Checkpoint 7 of 7
## Now you try!

### Re-run the LDA model with the New York Times data with two new values of `K` (one larger than the value you picked before and one smaller) and calculate the coherence scores for the two new values of `K`. (DO NOT use the function above, unless you have lots of time!) 

### Based on the coherence scores, which of the three values of `K` is optimal?   