# Discovering Thematic Content with Latent Dirichlet Allocation - Working Notebook

## 1. Preparing a Dataset for Analysis

In this notebook, we're going to model the topics in a number of different sets of data by means of latent Dirichlet allocation (LDA). In effect, we'll be extracting the thematic DNA of whatever document or corpus we analyse.

Like tagging, LDA is an automated method of text analysis. It differs from tagging in that we can intervene in how topic modelling works by modifying the parameters on each run. That leaves a great deal more to think about, and it means that your results will change as you change the parameters. Whereas tagging strives for an objective, definitive description of the parts of speech or named entities in a text, the results we derive from topic modelling are much more provisional. This is compounded by the fact that topics are derived from a series of statistical inferences made by your computer, so the results you get might vary even when the parameters for a specific run remain unchanged.

Let's again use our metadata to create a subcorpus. In this case, let's use the results from our search for `principle` tokens, which we produced in the simple counting lesson. To keep this relatively small, we'll use only those texts that contain at least 100 occurrences of the word, which is the threshold the cells below apply. If this set is still too large for your computer to handle, keep raising the threshold until you're working with only one or two volumes.

In [None]:
# In this cell, change your working directory to `/dh2/corpora_and_metadata/`

In [None]:
# Use `glob` to get a list of `.csv` files in the directory

In [None]:
# Read in the big metadata set that contains the results for our `principles` search.
ecco_metadata_w_principles = pd.read_csv("???")

Let's now take a look at the first ten rows.

In [None]:
ecco_metadata_w_principles[0:10]

In [None]:
ecco_metadata_w_principles = pd.read_csv("ecco_data_w_principles.csv")
filenames = ecco_metadata_w_principles.loc[(ecco_metadata_w_principles["principles"] >= 100)]["TCP"].tolist()

In [None]:
len(filenames)

In [None]:
title_names = ecco_metadata_w_principles.loc[(ecco_metadata_w_principles["principles"] >= 100)]["Title"].tolist()

In [None]:
len(title_names)

Check to see whether your `filenames` include the `.txt` extension. If they don't, run the following cell, which builds `filenames_new`. Otherwise, skip it and set `filenames_new = filenames`, because the chunking loop further down reads from `filenames_new`.

In [None]:
filenames_new = []
for file in filenames:
    newstr = file + ".txt"
    filenames_new.append(newstr)
print(filenames_new)
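
If you are unsure whether the names already carry the extension, one option is to test each name before appending, so that the cell is safe to run either way. A minimal sketch, in which `ensure_txt` and the sample identifiers are hypothetical names made up for illustration:

```python
def ensure_txt(names):
    """Return the names with '.txt' appended only where it is missing."""
    return [name if name.endswith(".txt") else name + ".txt" for name in names]

# Hypothetical identifiers, for illustration only.
sample_names = ["K000001.000", "K000002.000.txt"]
print(ensure_txt(sample_names))
```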

Let's now import everything we'll need to perform our analysis.

In [None]:
from string import punctuation
punctuation += "“”‘’↩§†"
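import os
import glob
import nltk
nltk.download("stopwords")
nltk.download("punkt")    # tokenizer models used by word_tokenize; newer NLTK releases may ask for "punkt_tab" instead
nltk.download("wordnet")  # lexical database used by WordNetLemmatizer
from nltk.corpus import stopwords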
more_stopwords = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '100', 'able', 'also', 'although', 'among', 'another', 'away', 'began', 'came', 'could', 'done', 'eight', 'even', 'ever', 'every', 'first', 'five', 'found', 'four', 'gave', 'give', 'go', 'however', 'indeed', 'left', 'like', 'made', 'make', 'many', 'may', 'might', 'much', 'must', 'near', 'never', 'nine', 'nothing', 'often', 'one', 'part', 'put', 'said', 'saw', 'see', 'seven', 'several', 'shall', 'six', 'soon', 'take', 'ten', 'thee', 'therefore', 'thing', 'things', 'thou', 'though', 'three', 'thy', 'till', 'time', 'told', 'took', 'two', 'upon', 'us', 'way', 'well', 'went', 'whether', 'without', 'would', 'yet', '’', '“', '”', ',', 'u']
full_stopwords = (stopwords.words('english')) + more_stopwords
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
from gensim.corpora.dictionary import Dictionary

As we discussed earlier in this lesson, LDA works a lot better when it has many small texts to work with, rather than a few big ones. Let's divide our texts into smaller chunks of 1000 words each.

First, we need to define a function to split a text into chunks. Let's call it `text_splitter`.

In [None]:
def text_splitter(filename, n_words):
    """Split the text in `filename` into chunks of `n_words` tokens each."""
    # Read the file named by the argument, not a variable from an outer loop.
    with open(str(filename), 'r') as file:
        readFile = file.read()
    tokenized_file = nltk.tokenize.word_tokenize(readFile)
    chunks = []
    current_chunk_words = []
    for word in tokenized_file:
        current_chunk_words.append(word)
        if len(current_chunk_words) == n_words:
            chunks.append(' '.join(current_chunk_words))
            current_chunk_words = []
    # Keep the final, shorter chunk, but only if it contains any words.
    if current_chunk_words:
        chunks.append(' '.join(current_chunk_words))
    return chunks
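
To see what the chunking logic does without reading a file, the same idea can be tried on a short list of words. This sketch uses a hypothetical helper, `chunk_tokens`, and plain whitespace splitting instead of `word_tokenize`, so it is an illustration rather than a drop-in replacement for `text_splitter`:

```python
def chunk_tokens(tokens, n_words):
    """Group a list of tokens into space-joined chunks of at most n_words tokens."""
    return [
        " ".join(tokens[start:start + n_words])
        for start in range(0, len(tokens), n_words)
    ]

# A made-up sentence of eight words, split into chunks of three.
tokens = "the first principles of natural philosophy are few".split()
for chunk in chunk_tokens(tokens, 3):
    print(chunk)
```

The last chunk is shorter than the others whenever the number of tokens is not an exact multiple of the chunk length.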

Next, create a directory to hold the chunked files.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/sec5/'

os.chdir(textdirectory)
print(os.getcwd())

In [None]:
# Create the folder that the chunking loop writes to, along with any missing parent folders.
new_directory = home + '/dh2/corpora_and_metadata/unit5_files/chunked_files_principles/'
os.makedirs(new_directory, exist_ok=True)

Now we run `text_splitter` and write each chunk to a directory.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/working_set_cleaned/'

os.chdir(textdirectory)
print(os.getcwd())

output_dir = home + '/dh2/corpora_and_metadata/unit5_files/chunked_files_principles/'

chunk_length = 1000
chunks = []
# Split every text and record each chunk with its source file and position.
for i in filenames_new:
    texts = text_splitter(i, chunk_length)
    for chunk_counter, text in enumerate(texts):
        chunks.append({'text': text, 'number': chunk_counter, 'filename': i})

# Write the chunks once, after all of the texts have been split.
for chunk in chunks:
    basename = os.path.basename(chunk['filename'])
    basename = os.path.splitext(basename)[0]
    fn = os.path.join(output_dir, "{}.{:04d}.txt".format(basename, chunk['number']))
    with open(fn, 'w') as f:
        f.write(chunk['text'])

## 2. Starting LDA in Earnest: Creating a Dictionary and Corpus

Now that we've created a directory of chunked files, we're ready to start our analysis. The first order of business is to read each of the chunks we just created and to process it by tokenizing, removing stopwords, and lemmatizing. Once that's done, we can create a common dictionary and a corpus from the entire set.

In [None]:
textdirectory = home + '/dh2/corpora_and_metadata/unit5_files/chunked_files_principles/'
os.chdir(textdirectory)
filenames = glob.glob("*.txt")

list_files = []
for i in filenames:
    with open(str(i), 'r') as file:
        readFile = file.read()
    # Tokenize, drop stopwords, then lemmatize what remains.
    tokenized_file = nltk.tokenize.word_tokenize(readFile)
    usefulTxt = [word for word in tokenized_file if word not in full_stopwords]
    lemmas = [wordnet_lemmatizer.lemmatize(word) for word in usefulTxt]
    list_files.append(lemmas)
common_dictionary = Dictionary(list_files)
common_corpus = [common_dictionary.doc2bow(file) for file in list_files]

In [None]:
len(common_dictionary)
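
It can help to picture what a dictionary and a bag-of-words corpus hold. The sketch below is plain Python rather than gensim: it imitates the idea behind `Dictionary` and `doc2bow` on two made-up token lists, assigning each unique token an integer id and then representing each document as `(id, count)` pairs. The function names `build_vocab` and `to_bow` are hypothetical, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Represent one document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Made-up token lists standing in for two lemmatized chunks.
docs = [["principle", "nature", "principle"], ["nature", "motion"]]
vocab = build_vocab(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
```

Word order is discarded in this representation; only the identity and frequency of each token survive.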

## 3. Training the LDA Model

At this point, we have two objects: `common_dictionary`, which maps each unique token to an integer id, and `common_corpus`, which stores each chunk as a bag of words. This is where things start to get interesting, as we set the training parameters that determine exactly how LDA will run.


    <b>num_topics</b>: The number of topics to extract;
    <b>chunksize</b>: The number of documents processed together in each training chunk;
    <b>passes</b>: The number of times the model passes through the entire corpus during training;
    <b>iterations</b>: The maximum number of iterations through the corpus when inferring its topic distribution;
    <b>eval_every</b>: How often the model's perplexity is evaluated, counted in model updates. This can really slow down your script, so we'll generally leave it as `None`;
    <b>alpha</b>: The prior on each document's mix of topics. The higher your alpha, the more evenly each document spreads across the topics, so documents look more similar in their topic contents;
    <b>eta</b>: The prior on each topic's mix of words, often called beta in the LDA literature. A higher value here spreads each topic across more words, so topics look more similar in their word contents.
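
One way to build intuition for `alpha` is to draw document-topic mixtures directly from a Dirichlet distribution at different settings. The sketch below uses only the standard library and is not part of gensim; `sample_dirichlet` and `mean_largest_share` are hypothetical helpers, and the values 0.1 and 10.0 are chosen only to make the contrast visible.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) distribution."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=10, n_docs=1000, seed=0):
    """Average share taken by the dominant topic across simulated documents."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(n_docs)]
    return sum(shares) / n_docs

low = mean_largest_share(0.1)
high = mean_largest_share(10.0)
print(f"alpha=0.1  -> dominant topic holds {low:.2f} of a document on average")
print(f"alpha=10.0 -> dominant topic holds {high:.2f} of a document on average")
```

With a small alpha, most simulated documents are dominated by a single topic, while a large alpha spreads each document far more evenly across all ten. `eta` plays the same role for the words within each topic.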



In [None]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10 # standard is 10
chunksize = 200 # standard is 2000
passes = 20 # standard is 20
iterations = 400 # standard is 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.
alpha = 0.1 #Originally set to 0.1
eta = 0.01 #Originally set to 0.1

# Make an index-to-word dictionary.
temp = common_dictionary[0]  # This is only to "load" the dictionary.
id2word = common_dictionary.id2token
model = LdaModel(
    corpus=common_corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha=alpha, # 'auto' - Generally 50/T, where T is the number of Topics anticipated - 0.1 is standard
    eta=eta, # 'auto' - Generally 200/W, where W is the number of words in the vocabulary - 0.1 is standard
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

## 4. Finally, Run Your LDA Script!

In [None]:
top_topics = model.top_topics(common_corpus, topn=20)  # show the 20 highest-weighted words per topic

from pprint import pprint
pprint(top_topics)
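
`top_topics` returns one entry per topic, and each entry pairs a list of `(probability, word)` tuples with that topic's coherence score. A sketch of how you might unpack that structure follows; the two topics in `example_topics` are invented placeholders that only mimic the shape of the output, not results from this corpus, and `summarize_topics` is a hypothetical helper.

```python
# Invented placeholder data in the shape returned by top_topics:
# [([(probability, word), ...], coherence_score), ...]
example_topics = [
    ([(0.031, "law"), (0.027, "nature"), (0.019, "reason")], -1.2),
    ([(0.044, "body"), (0.025, "motion"), (0.018, "force")], -1.8),
]

def summarize_topics(topics, n_words=3):
    """Return (coherence, top words) for each topic, in the order given."""
    return [
        (coherence, [word for _, word in words[:n_words]])
        for words, coherence in topics
    ]

for coherence, words in summarize_topics(example_topics):
    print(f"{coherence:6.2f}  {' '.join(words)}")

# The mean coherence gives one rough number for comparing runs.
average = sum(coherence for _, coherence in example_topics) / len(example_topics)
print(f"Average topic coherence: {average:.2f}")
```

On real output you would pass `top_topics` in place of `example_topics`.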

## 5. Evaluating Your Initial Results

Let's take a close look at these results. As you grapple with them, remember that this is a very small run, and the quality of your results will generally improve as you increase the size of your dataset.

Allowing that these results are provisional, you may already see some obvious opportunities to improve them on a second pass. When you examine your results, ask yourself the following questions:

    - Are the topics sufficiently coherent? That is, can I imagine how the words in this topic would tend to cluster around a specific subject or theme?
    - Are the topics sufficiently distinct? That is, do I see the same words coming up again and again, or am I getting clear differentiation from one group of words to the next?
    - Are there any obvious opportunities to improve the quality of my run by removing additional stopwords? This is an important question to ask, because often the most prevalent topics will be composed, in part, of words that don't have a great deal of semantic specificity. Might some of these be candidates to include in a subsequent list of stop words?
    
If you have concerns about these first two general problem areas, you might consider adjusting your `alpha` and `eta` parameters. If you notice new candidates for stop words, it's a simple matter to add these to your stop-word list and to rerun the analysis. Feel free to modify the other parameters, but let's focus here on stop words that it might be good to remove. Keep in mind that if you see a particular problem, like a stray letter or a numeral, you may as well try to remove a good number of that particular <i>type</i> of token, so that no tokens of the same sort will pollute your subsequent analyses.
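
One way to look for stop-word candidates is to count how many chunks each token appears in, since tokens that turn up almost everywhere rarely help to tell topics apart. The sketch below runs on a made-up list of token lists, and `document_frequency` is a hypothetical helper; on your own data you might pass `list_files` instead.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))  # set() so a token counts once per document
    return counts

# Made-up token lists standing in for lemmatized chunks.
docs = [
    ["great", "principle", "nature"],
    ["great", "law", "reason"],
    ["great", "motion", "principle"],
]
df = document_frequency(docs)

# Flag tokens that appear in at least 90% of the documents.
candidates = [token for token, n in df.items() if n / len(docs) >= 0.9]
print(candidates)
```

The 0.9 cutoff is arbitrary, and the candidates still deserve a look by hand before they go into your stop-word list, because a very frequent word can also be central to the corpus.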

In [None]:
more_stopwords = ["0", "1", "10", "100", "11", "12", "13", "14", "15", "16", "17", "18", "19", "2", "3", "4", "5", "6", "7", "8", "9", "a", "able", "adam", "also", "also", "although", "among", "another", "away", "b", "began", "c", "c", "came", "could", "d", "de", "done", "e", "eight", "et", "even", "even", "ever", "every", "every", "f", "first", "five", "found", "four", "g", "gave", "give", "go", "good", "great", "h", "high", "however", "i", "ii", "iii", "indeed", "j", "john", "k", "know", "l", "la", "le", "left", "let", "life", "like", "little", "long", "m", "made", "made", "make", "make", "man", "many", "may", "may", "men", "might", "mr", "much", "much", "must", "must", "n", "near", "never", "nine", "nothing", "o", "often", "one", "one", "p", "p", "part", "per", "place", "put", "q", "r", "s", "said", "said", "saw", "sect", "see", "self", "seven", "several", "shall", "shall", "sir", "six", "soon", "t", "take", "ten", "th", "thee", "therefore", "thing", "things", "thou", "though", "though", "three", "thus", "thy", "till", "time", "told", "took", "two", "two", "u", "u", "upon", "upon", "us", "v", "v", "vol", "w", "way", "well", "went", "whether", "without", "without", "would", "would", "x", "y", "yet", "z"]

In [None]:
reduced_files = []
for file in list_files:
    file = [word for word in file if word not in (more_stopwords)]
    reduced_files.append(file)

In [None]:
common_dictionary = Dictionary(reduced_files)
common_corpus = [common_dictionary.doc2bow(file) for file in reduced_files]

In [None]:
# Set training parameters.
num_topics = 10 # standard is 10
chunksize = 200 # standard is 2000 - depends very much on the number of documents you have
passes = 20 # standard is 20
iterations = 400 # standard is 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.
alpha = 0.1
eta = 0.1

temp = common_dictionary[0]  # This is only to "load" the dictionary.
id2word = common_dictionary.id2token
model = LdaModel(
    corpus=common_corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha=alpha, # 'auto' - Generally 50/T, where T is the number of Topics anticipated - 0.1 is good
    eta=eta, # 'auto' - Generally 200/W, where W is the number of words in the vocabulary - 0.1 is good
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

In [None]:
top_topics = model.top_topics(common_corpus, topn=20)  # show the 20 highest-weighted words per topic

from pprint import pprint
pprint(top_topics)