# Topic Modeling the Dispatch


(**NB:** this is now the only part!!!)

In [None]:
import re
import pandas

# Gensim
import gensim
import gensim.corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import matplotlib.pyplot as plt
from pprint import pprint

import warnings
warnings.filterwarnings("ignore")

from ast import literal_eval # for loading columns with lists

We need to ensure that data in the columns is in the formats that it must be for further processing: namely, we need to ensure that dates are in the date format (we did that before), but most importantly that our generated columns with *lists* of words are lists and not strings (that is how they will be loaded from a CSV file by default). The latter can be done during loading (`literal_eval`).
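To see why `literal_eval` is needed, here is a minimal sketch that uses only the standard library. The TSV content and the `id` column are made up for illustration; the real files are loaded with `pandas.read_csv` further below.

```python
import csv
import io
from ast import literal_eval

# A tiny, made-up TSV with one list-valued column (hypothetical data).
sample_tsv = "id\ttextDataLists\n1\t['richmond', 'dispatch', 'news']\n"

row = next(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
raw_value = row["textDataLists"]        # a plain string that only looks like a list
parsed_value = literal_eval(raw_value)  # an actual Python list of strings

print(type(raw_value).__name__, type(parsed_value).__name__)
print(parsed_value[0])
```

Passing `literal_eval` as a converter to `pandas.read_csv` applies the same conversion to every cell of the column while the file is being read.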

**Note:** we can actually reduce the size of the `Dispatch_Light_Preprocessed.tsv` by removing columns that we do not really need anymore: `textData` and `textDataLists` (we only need `textDataListsFiltered`).

The following should load preprocessed Dispatch data for 1860-1864. 

In [None]:
# WE CAN EDIT THIS LIST IN ORDER TO REDUCE THE AMOUNT OF DATA THAT WE ARE LOADING
dispatchSubfolder = "./Dispatch_Processed_TSV/"
dispatchFiles = ["Dispatch_1860_tmReady.tsv",  # incomplete
                 "Dispatch_1861_tmReady.tsv",  # The War starts on April 12, 1861
                 "Dispatch_1862_tmReady.tsv",
                 "Dispatch_1863_tmReady.tsv",
                 "Dispatch_1864_tmReady.tsv",
                 #"Dispatch_1865_tmReady.tsv",  # incomplete - The War ends on May 9, 1865
                 ]

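# DataFrame.append was removed in pandas 2.0, so we collect the yearly tables
# in a list and concatenate them once at the end
dfList = []

for f in dispatchFiles:
    dfTemp = pandas.read_csv(dispatchSubfolder + f, sep="\t", header=0, converters={'textDataLists': literal_eval})
    dfList.append(dfTemp)

df = pandas.concat(dfList)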

dispatch_light = df
# drop=True -- use it to avoid creating a new column with the old index values
dispatch_light = dispatch_light.reset_index(drop=True) 

# raw string (r"...") so that \d is read as a regex token, not as a string escape
dispatch_light["month"] = [re.sub(r"-\d\d$", "", str(i)) for i in dispatch_light["date"]]
dispatch_light["month"] = pandas.to_datetime(dispatch_light["month"], format="%Y-%m")
dispatch_light["date"] = pandas.to_datetime(dispatch_light["date"], format="%Y-%m-%d")

In [None]:
#dispatch_light = dispatch_light.dropna() # you may want to use this line in the future in case you want to exclude rows of data that have NaN values (not analyzable)
dispatch_light

# Create the Dictionary and Corpus Objects needed for Topic Modeling

In [None]:
%%time
# Create Dictionary
dictionary = gensim.corpora.Dictionary(dispatch_light["textDataLists"])

# Create Corpus
texts = dispatch_light["textDataLists"]

# Term Document Frequency
corpus = [dictionary.doc2bow(text) for text in texts] # bow == bag of words

# View
print(corpus[:1])
print("-"*50)

This list above is a text: every tuple (like `(0, 1)`) represents a word (first number) and its frequency (second number). We can check which word is hiding behind a number by:

In [None]:
dictionary[45]

So, the word `president` occurs three times in the article. And this is the complete text hiding behind this numeric abstraction:

In [None]:
dispatch_light["text"][0]
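To make the bag-of-words representation described above more concrete, here is a pure-Python sketch of the same idea. It is not gensim's implementation: the two toy documents are made up, and ids are assigned in order of first appearance, which may differ from how `gensim.corpora.Dictionary` numbers words.

```python
from collections import Counter

# Two made-up, already tokenized documents (hypothetical data).
docs = [["president", "war", "president", "army"],
        ["army", "richmond", "war"]]

# Assign an integer id to every distinct word, in order of first appearance.
token2id = {}
for doc in docs:
    for word in doc:
        token2id.setdefault(word, len(token2id))
id2token = {i: w for w, i in token2id.items()}

def to_bow(doc):
    """Return a sorted list of (word_id, frequency) tuples for one document."""
    counts = Counter(token2id[w] for w in doc)
    return sorted(counts.items())

bow = to_bow(docs[0])
print(bow)          # [(0, 2), (1, 1), (2, 1)]
print(id2token[0])  # president
```

The tuple `(0, 2)` reads as "word number 0 occurs twice", and looking up `id2token[0]` recovers the word, just as `dictionary[45]` does above.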

# Optimal Number of Topics

The number of topics is a tricky issue, since we pre-determine this number and the algorithm then splits all data into that number of topics. In other words, if you tell the machine to find 10 topics, it will use 10 buckets to sort all the data into; if you give it 20, it will do that for 20. For this reason topic modeling often takes multiple attempts. Additionally, it is not uncommon to then pull out a specific topic from your data and re-run the algorithm on that subset of texts. 

Nonetheless, there is a mathematical way to identify the optimal number of topics (*k*). The common practice is to generate several models with different values of *k* and calculate the `coherence score` for each of them: the *k* with the highest `coherence score` is considered optimal (ideally, the score should be above 0.55). 

**NB**: Do not run the code below during the class, as it took about 67 minutes to complete (this time may differ significantly depending on the configuration of your machine).

**NB:** You can comment out the entire chunk by selecting it and pressing `Ctrl+/` on Windows or `Cmd+/` on Mac. Pressing these combinations again will make code runnable again.

In the code below I have added one more parameter: `random_state`. This parameter is crucial for reproducibility. The issue with generating statistical models is that they always get generated slightly differently. In order to avoid this, we can use the `random_state` parameter with the same number (`random_state = 2023`); the number itself is not important; what is important is that you reuse the **same** number, which will guarantee the same results.
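The principle behind a fixed seed can be illustrated with the standard library's `random` module. This is only an analogy: gensim uses its own random number machinery, and the helper `draw` is a hypothetical name introduced for this sketch.

```python
import random

def draw(seed, n=5):
    """Draw n pseudo-random integers from a generator created with the given seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

same_a = draw(2023)
same_b = draw(2023)   # same seed, so the same sequence of numbers
other = draw(2024)    # a different seed starts a different sequence

print(same_a == same_b)
print(same_a, other)
```

A model trained with the same data, the same parameters, and the same `random_state` starts from the same pseudo-random initialization in the same way.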

In [None]:
# %%time

# optimalTopicsNumber = ["topics\tscore"]

# for num in range(4, 51, 1):
#     lda_model_temp = gensim.models.LdaModel(corpus=corpus,id2word=dictionary, num_topics=num, random_state=2023)
#     coherence_model_lda = CoherenceModel(model=lda_model_temp,texts=dispatch_light["textDataLists"],
#                                      dictionary=dictionary, coherence='c_v')
#     coherence_lda = coherence_model_lda.get_coherence()
#     optimalTopicsNumber.append("%d\t%f" % (num, coherence_lda))

#     print('Coherence Score for %02d topics: ' % num, coherence_lda)
    
# print("-"*50)

# optimalTopicsNumber = "\n".join(optimalTopicsNumber)
# print(optimalTopicsNumber)

Not really necessary, but we can plot these results in the following manner (score above 0.55 should be ok):

In [None]:
import io

scores = """topics	score
4	0.492556
5	0.520064
6	0.531907
7	0.528860
8	0.539734
9	0.517753
10	0.554835
11	0.537774
12	0.554257
13	0.561778
14	0.553429
15	0.534285
16	0.516632
17	0.528319
18	0.542015
19	0.544126
20	0.510367
21	0.519284
22	0.508858
23	0.522271
24	0.515742
25	0.514178
26	0.501335
27	0.487533
28	0.509954
29	0.509522
30	0.522582
31	0.500205
32	0.501318
33	0.493806
34	0.490917
35	0.493692
36	0.464409
37	0.487567
38	0.489053
39	0.476119
40	0.475229
41	0.462814
42	0.473426
43	0.482906
44	0.475647
45	0.477711
46	0.462434
47	0.468734
48	0.456667
49	0.463903
50	0.454378
"""

scoresData = io.StringIO(scores)
scoresDF = pandas.read_csv(scoresData, sep="\t", header=0)

plt.rcParams["figure.figsize"] = (20, 9)
plt.stem(scoresDF['topics'], scoresDF['score'])
plt.plot([4, 50], [0.55, 0.55], color="red", linestyle="--")

plt.ylabel("coherence score")
plt.xlabel("number of topics")
plt.title("Coherence Score Test for TM of the Dispatch")
plt.gca().yaxis.grid(linestyle=':')

plt.show()

It looks like we have several "sweet spots" where the corpus breaks down into a coherent number of topics: essentially, all those with a score above or very close to 0.55. If you are doing an in-depth study of some corpus, it makes sense to generate models for all meaningful matches and explore the results in the manner that is described below. You are likely to observe that certain meaningful topics will appear at some point and will disappear as you change the number of topics.
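The "sweet spots" can also be picked out programmatically. The sketch below uses a small excerpt of the scores listed above; the variable names are introduced here for illustration only.

```python
# Excerpt of the coherence scores listed above (number of topics -> score).
scores_excerpt = {8: 0.539734, 10: 0.554835, 12: 0.554257, 13: 0.561778,
                  14: 0.553429, 20: 0.510367, 40: 0.475229}
threshold = 0.55

# The number of topics with the single highest score...
best_k = max(scores_excerpt, key=scores_excerpt.get)
# ...and every number of topics whose score reaches the threshold.
candidates = sorted(k for k, s in scores_excerpt.items() if s >= threshold)

print(best_k)      # 13
print(candidates)  # [10, 12, 13, 14]
```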

However, I suggest we go with the number of topics that was used in Robert K. Nelson's “Mining the Dispatch” (<https://dsl.richmond.edu/dispatch/>), which was 40. This way we should get results close to his and we can then use his work as a reference point for our results.

# Building the Topic Model

The following step usually takes quite a lot of time, so do not run it in class. You may want to leave it running overnight. Ideally, the number of passes should be at least a hundred: this increases the stability of topics, but also increases the amount of time required for generating the model. The good news is that you do not have to train the model every time: you can save it and load it whenever you want to use it.

On all the data (and we have a very large corpus), with 100 passes and 20 updates it is likely to take several hours to train a single model (see below).

```
CPU times: user 3h 30min 39s, sys: 1h 32min 40s, total: 5h 3min 20s
Wall time: 41min 20s
```

**NB:** You can comment out the entire chunk by selecting it and pressing `Ctrl+/` on Windows or `Cmd+/` on Mac. Pressing these combinations again will make code runnable again.

We can also train a quick and simple model with the default parameters of the training function (commented out: `#update_every=20, passes=100, alpha='auto',`). On my computer it took only about 2 mins, but it may differ significantly depending on the configuration of your machine. Overall, be prepared to take a long coffee break (just make sure your computer does not go to sleep in the meantime!)

In [None]:
number_of_topics = 40
number_of_topics

In [None]:
%%time
# Build LDA model
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
                                   random_state=2023,
                                   #update_every=20, passes=100, alpha='auto',
                                   num_topics=number_of_topics)

print("-"*50)

This line saves the model that we have just trained, so that it can be reloaded later without retraining.

In [None]:
lda_model.save(dispatchSubfolder + "model_dispatch_1860_1864_%d.lda" % number_of_topics)

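This line will load a previously saved model; it is commented out here because we have just trained one. Uncomment it if you skipped the training step: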

In [None]:
#lda_model = gensim.models.LdaModel.load(dispatchSubfolder + "model_dispatch_1860_1864_%d.lda" % number_of_topics)

# View generated topics

In [None]:
# Print the Keyword in topics
topicsData = lda_model.print_topics(num_topics = number_of_topics, num_words=10)
pprint(topicsData)
doc_lda = lda_model[corpus]

You can ignore the following code for now. It simply converts the topic modeling data into a network format (one of possibilities) for an alternative exploration of topics.

In [None]:
len(topicsData)

In [None]:
topicsDataNW = lda_model.print_topics(num_topics = number_of_topics, num_words=20)

topicsTidy = []
topicsDicQuick = {}

for t in topicsDataNW:
    topicsDicQuick[t[0]] = t[1]
    words = t[1].split(" + ")
    for w in words:
        w = w.replace('"', "").replace("*", "\t")
        topicsTidy.append("%s\tT%02d\t%s" % (t[0], int(t[0])+1, w))

topicsTidy = "\n".join(topicsTidy)

with open(dispatchSubfolder + "tmTidy.tsv", "w", encoding="utf8") as f9:
    f9.write("topic\ttopicName\tscore\tterm\n" + topicsTidy)
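The tidy format is easier to follow on a small example. The sketch below parses two made-up topic descriptions written in the `weight*"term" + weight*"term"` form that `print_topics` returns; the terms and weights are hypothetical, not taken from our model.

```python
# Two made-up topics in the (topic_id, description) form returned by print_topics.
sample_topics = [(0, '0.030*"army" + 0.020*"general"'),
                 (1, '0.050*"sale" + 0.010*"cash"')]

rows = []
for topic_id, description in sample_topics:
    for part in description.split(" + "):
        weight, term = part.split("*")
        # topic id (from 0), topic name (from 1), weight, term without quotes
        rows.append((topic_id, "T%02d" % (topic_id + 1), float(weight), term.strip('"')))

for row in rows:
    print(row)
```

Each tuple corresponds to one line of `tmTidy.tsv`: one topic, one term, and the weight of that term in that topic.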

How can we infer topics?

# Visualize topic-keywords

The `pyLDAvis` library offers a visual tool for exploring topics. The **λ-parameter** is designed to "slice" topic words in such a way that topics are easier to interpret. If you slide this parameter below 1, you will see that the selection of words changes; when you reach 0, the words most specific to this topic (ideally, unique to it) will be shown.

**NB:** Keep in mind that on the visualization topics are numbered from 1, not from 0 as in the data!
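The effect of λ can be sketched with the relevance measure proposed by Sievert and Shirley, on which this visualization is based: λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The two words and their probabilities below are made up for illustration, and `relevance` and `rank` are hypothetical helper names.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a word for a topic: a blend of in-topic probability and lift."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# word -> (probability within the topic, probability in the whole corpus); made-up values
words = {"said": (0.10, 0.10),       # frequent in the topic, but equally frequent everywhere
         "potomac": (0.02, 0.002)}   # rarer, but ten times more likely inside this topic

def rank(lam):
    """Return the words ordered from most to least relevant for a given lambda."""
    return sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)

print(rank(1.0))  # ['said', 'potomac']
print(rank(0.0))  # ['potomac', 'said']
```

With λ = 1 words are ranked by their probability within the topic; with λ = 0 they are ranked by how much more likely they are in this topic than in the corpus as a whole.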

In [None]:
# Visualize the topics

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
modelVis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

# you are likely to see lots of red-ish text --- just ignore it... 

In [None]:
pyLDAvis.save_html(modelVis, dispatchSubfolder + 'tmVis_%02d_Topics_in_Dispatch_1860_1864.html' % number_of_topics)
modelVis

# Let's explore our topics more

In [None]:
dispatch_light.head()

## `get_document_topics` for the entire corpus

You can get `doc_topics`, `word_topics` and `phi_values` (probabilities of words for topics) for all the documents in the corpus in the following manner:

In [None]:
%%time
all_topics = lda_model.get_document_topics(corpus, per_word_topics=True)
print("-"*50)

Let's print out data for the first document:

In [None]:
%%time

for doc_topics, word_topics, phi_values in all_topics[:1]:
    print('New Document')
    print('\nDocument topics:', doc_topics)
    print('\nWord topics:', word_topics)
    print('\nPhi values:', phi_values)
    #print(" ")
    print('-------------- \n')

print("-"*50)

Let's build a dataframe with topic values for every item (article) from the Dispatch. This will allow us to do more interesting things.

In [None]:
%%time

topicTableCols = [] # empty table for topic values (technically, a list still)
topicDic = {} # a dictionary with top words per topic

for i in range(0, number_of_topics, 1):
    tVal = "T%02d" % (i + 1)
    topicTableCols.append(tVal)
    
    topicVals  = lda_model.show_topic(i)
    topicWords = ", ".join([word for word, prob in topicVals])
    topicDic[tVal] = topicWords


In [None]:
pprint(topicDic)

The following step may take a couple minutes.

- Here we are creating an empty row (filled with zeros) for each document in our corpus;
- then we loop through all topic values for every given document;
- and feed existing values into our row with zeros (if we have a value, it will replace a zero);
- you can insert `print`/`input` statements in the code below to see the steps of the process.

The following piece of code takes quite a while, so do not run it in class:

```
--------------------------------------------------
CPU times: user 18min 41s, sys: 25min 56s, total: 44min 38s
Wall time: 6min 58s
```

In [None]:
%%time

topicTableRows = [] # now we are feeding topic values into our empty table

for doc_topics, word_topics, phi_values in all_topics:
    rawRow = [0] * number_of_topics
    for t in doc_topics:
        rawRow[t[0]] = t[1]
    topicTableRows.append(rawRow)

print("-"*50)

In [None]:
# different topics in the 0th document/article
topicTableRows[0]
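The row-filling logic above can be tried on toy data without a trained model. In this sketch the two sparse documents and the number of topics are made up; each document lists only the topics that were actually assigned to it.

```python
number_of_topics_demo = 5   # hypothetical, much smaller than our 40

# Each document is a list of (topic_id, weight) tuples; missing topics are simply absent.
sparse_docs = [[(0, 0.7), (3, 0.3)],
               [(4, 1.0)]]

dense_rows = []
for doc_topics in sparse_docs:
    row = [0.0] * number_of_topics_demo   # start with zeros for every topic
    for topic_id, weight in doc_topics:
        row[topic_id] = weight            # overwrite the zero where a value exists
    dense_rows.append(row)

print(dense_rows)  # [[0.7, 0.0, 0.0, 0.3, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0]]
```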

So, our corpus table length:

In [None]:
len(dispatch_light)

And our topic table length:

In [None]:
len(topicTableRows)

We just need to convert it into a proper dataframe format:

In [None]:
topicTable = pandas.DataFrame(topicTableRows, columns=topicTableCols)
topicTable

While our topic table has the same number of rows as our Dispatch data, the two tables can only be joined correctly if their rows are indexed identically: `pandas.concat` aligns rows by index, and the topic table is indexed from 0. We already reset the index of the Dispatch data after loading, but if you have dropped any rows in the meantime (for example, with `dropna()`), the index will have gaps. To be on the safe side, we reset it once more:

In [None]:
dispatch_light = dispatch_light.reset_index(drop=True)
dispatch_light

Now, we can merge them without any issues:

In [None]:
mergedTable = pandas.concat([dispatch_light, topicTable], axis=1, sort=False)
len(mergedTable)

In [None]:
mergedTable

In [None]:
mergedTable.loc[[3947]]

In [None]:
pprint(mergedTable["text"][3947])

# Graphing topics over time

One of the most interesting things that we can do with data like ours is to check how topics are distributed over time. Our data is already prepared for this; we only need to summarize topic values by days or months before we can graph them (for more [details](https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/)). The `pandas` library has all we need, but there are many specialized graphing libraries that may produce better visual results. (For more details, check this [overview](https://www.shanelynn.ie/plotting-with-python-and-pandas-libraries-for-data-visualisation/))

In [None]:
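# Average each topic column per month; numeric_only=True skips text and date columns
# (without it, recent versions of pandas raise an error on non-numeric columns)
topicSum = mergedTable.groupby("month").mean(numeric_only=True).copy()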
topicSum["month"] = topicSum.index
topicSum.head()
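What `groupby("month").mean()` does can be illustrated in plain Python. The three dated topic weights below are made up; the month is taken from the first seven characters of each date, which mirrors the way the `month` column was derived from `date`.

```python
from collections import defaultdict
from statistics import mean

# (date, weight of one topic in one article); made-up values
records = [("1861-04-12", 0.5), ("1861-04-20", 0.25), ("1861-05-01", 0.75)]

by_month = defaultdict(list)
for date, weight in records:
    by_month[date[:7]].append(weight)   # "1861-04-12" -> "1861-04"

monthly_mean = {month: mean(weights) for month, weights in sorted(by_month.items())}
print(monthly_mean)  # {'1861-04': 0.375, '1861-05': 0.75}
```

Because the values are averaged rather than summed, months with many articles do not automatically score higher than months with few.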

Now we can plot any of our topics and see its progression over time:

In [None]:
topic = "T32"

plt.rcParams["figure.figsize"] = (20, 9)
plt.plot(topicSum['month'], topicSum[topic])

plt.ylabel("topic salience (monthly mean)")
plt.xlabel("dates of issues of the Dispatch")
plt.title(topic + ": " + topicDic[topic])
plt.gca().yaxis.grid(linestyle=':')

plt.savefig(dispatchSubfolder + "graph_dispatch_1860-1864_%s.pdf" % topic, dpi=150)

You can run the cell below to print out items most representative of this topic.

In [None]:
temp = mergedTable.sort_values(by=topic, ascending=False)
temp[["id", "date", "text", topic]].head(15)

And you can check individual items like this (where the number is the **index** of the row):

In [None]:
pprint(mergedTable["text"][91118])

We can also use the `plotly` library to generate dynamic graphs:

In [None]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

In [None]:
# Create the data for the bar chart
bar_data = go.Bar(
    x=topicSum["month"], 
    y=topicSum[topic]
)
# Create the layout information for display
layout = go.Layout(
    title=topic+": "+topicDic[topic],    
    xaxis=dict(title='Dates'),
    yaxis=dict(title='Topic Salience')
)
# These two together create the "figure"
figure = go.Figure(data=[bar_data], layout=layout)
# Use "iplot" to create the figure in the Jupyter notebook
iplot(figure)

`Plotly` also allows one to save a graph into a separate file for sharing and embedding:

In [None]:
plot(figure, filename=dispatchSubfolder+"dispatch_1860_1864_%s.html" % topic)

# Plots of all topics

Below is some simple code for generating graphs for each topic. As you must realize by now, it will be easier to generate these graphs with a loop (something you would definitely do in a regular script). To reduce the amount of code, we can also create a function that will be easier to reuse as a shortcut.

In [None]:
def graphTopic(dataTable, topicDic, topicID, saveOption=False):
    plt.plot(dataTable['month'], dataTable[topicID])
    plt.ylabel("topic salience (monthly mean)")
    plt.xlabel("dates of issues of the Dispatch")
    plt.title(topicID + ": " + topicDic[topicID])
    plt.gca().yaxis.grid(linestyle=':')
    # by default the graph is only displayed; with saveOption=True it is written to a file instead
    if saveOption:
        # the following line saves the figure
        plt.savefig(dispatchSubfolder + "graph_dispatch_1860-1864_%s.pdf" % topicID, dpi=150) 
        # this one tells matplotlib that you are done with the plot and clears it
        plt.clf()

We can still quickly generate all the graphs and save them on our computer:

In [None]:
for i in range(1, number_of_topics + 1):
    topic = "T%02d" % i
    graphTopic(topicSum, topicDic, topic, saveOption=True)

In [None]:
graphTopic(topicSum, topicDic, "T01")

In [None]:
graphTopic(topicSum, topicDic, "T02")

In [None]:
graphTopic(topicSum, topicDic, "T03")

In [None]:
graphTopic(topicSum, topicDic, "T04")

In [None]:
graphTopic(topicSum, topicDic, "T05")

In [None]:
graphTopic(topicSum, topicDic, "T06")

In [None]:
graphTopic(topicSum, topicDic, "T07")

In [None]:
graphTopic(topicSum, topicDic, "T08")

In [None]:
graphTopic(topicSum, topicDic, "T09")

In [None]:
graphTopic(topicSum, topicDic, "T10")

In [None]:
graphTopic(topicSum, topicDic, "T11")

In [None]:
graphTopic(topicSum, topicDic, "T12")

In [None]:
graphTopic(topicSum, topicDic, "T13")

In [None]:
graphTopic(topicSum, topicDic, "T14")

In [None]:
graphTopic(topicSum, topicDic, "T15")

In [None]:
graphTopic(topicSum, topicDic, "T16")

In [None]:
graphTopic(topicSum, topicDic, "T17")

In [None]:
graphTopic(topicSum, topicDic, "T18")

In [None]:
graphTopic(topicSum, topicDic, "T19")

In [None]:
graphTopic(topicSum, topicDic, "T20")

In [None]:
graphTopic(topicSum, topicDic, "T21")

In [None]:
graphTopic(topicSum, topicDic, "T22")

In [None]:
graphTopic(topicSum, topicDic, "T23")

In [None]:
graphTopic(topicSum, topicDic, "T24")

In [None]:
graphTopic(topicSum, topicDic, "T25")

In [None]:
graphTopic(topicSum, topicDic, "T26")

In [None]:
graphTopic(topicSum, topicDic, "T27")

In [None]:
graphTopic(topicSum, topicDic, "T28")

In [None]:
graphTopic(topicSum, topicDic, "T29")

In [None]:
graphTopic(topicSum, topicDic, "T30")

In [None]:
graphTopic(topicSum, topicDic, "T31")

In [None]:
graphTopic(topicSum, topicDic, "T32")

In [None]:
graphTopic(topicSum, topicDic, "T33")

In [None]:
graphTopic(topicSum, topicDic, "T34")

In [None]:
graphTopic(topicSum, topicDic, "T35")

In [None]:
graphTopic(topicSum, topicDic, "T36")

In [None]:
graphTopic(topicSum, topicDic, "T37")

In [None]:
graphTopic(topicSum, topicDic, "T38")

In [None]:
graphTopic(topicSum, topicDic, "T39")

In [None]:
graphTopic(topicSum, topicDic, "T40")

# Re-using our LDA model

The trained LDA model can be applied to new texts, although this, of course, should be done with utmost care. In general, this works as shown below. We'll start with a text already split into a list of words and converted to lower case (*original text*: "Caliph Umar led the conquest of several Iranian provinces."):

In [None]:
new_doc = ['caliph', 'umar', 'led', 'conquest', 'several', 'iranian', 'provinces']
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(lda_model.get_document_topics(new_doc_bow))

The first list contains the words that match words in our model. The second list contains the topics and their proportions. You can check the matching words by typing `dictionary[number]`, where `number` is the first number in a tuple (e.g., `(4429, 1)`). You can also check the topics by typing `lda_model.print_topic(number)`, where `number` is the number of the topic in the tuples of the second list. Try them below:

In [None]:
print(dictionary[4428], dictionary[13911], dictionary[22023])

In [None]:
topicsDicQuick[9]
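One reason for caution is that words missing from the training vocabulary are silently ignored when a new text is converted to a bag of words. The sketch below measures how much of the new text the model can actually "see". The `vocabulary` here is a made-up stand-in for our `dictionary`; which words really match depends on the trained model.

```python
# Hypothetical stand-in for the model's vocabulary (word -> id).
vocabulary = {"led": 0, "conquest": 1, "several": 2}

new_doc = ['caliph', 'umar', 'led', 'conquest', 'several', 'iranian', 'provinces']

known = [w for w in new_doc if w in vocabulary]
unknown = [w for w in new_doc if w not in vocabulary]
coverage = len(known) / len(new_doc)

print(known)
print(unknown)
print("coverage: %.2f" % coverage)
```

When coverage is low, the topic proportions rest on only a handful of words and should be read with corresponding skepticism.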

Results for texts from the same period and region should, however, be much better. Let's take another example (just a few lines from [here](https://www.thoughtco.com/the-battle-of-antietam-1773739)):

> After a summer of defeats in Virginia in the summer of 1862, the Union Army was demoralized in its camps near Washington, D.C. at the beginning of September. On the Confederate side, General Robert E. Lee was hoping to strike a decisive blow by invading the North. Lee's plan was to strike into Pennsylvania, imperiling the city of Washington and forcing an end to the war. The Confederate Army began crossing the Potomac on September 4, and within a few days had entered Frederick, a town in western Maryland. The citizens of the town stared at the Confederates as they passed through, hardly extending the warm welcome Lee had hoped to receive in Maryland. Lee split up his forces, sending part of the Army of Northern Virginia to capture the town of Harpers Ferry and its federal arsenal (which had been the site of John Brown's raid three years earlier).

In [None]:
testDoc = ["after", "a", "summer", "of", "defeats", "in", "virginia", "in", "the", "summer",
           "of", "1862", "the", "union", "army", "was", "demoralized", "in", "its", "camps",
           "near", "washington", "d", "c", "at", "the", "beginning", "of", "september", "on",
           "the", "confederate", "side", "general", "robert", "e", "lee", "was", "hoping",
           "to", "strike", "a", "decisive", "blow", "by", "invading", "the", "north", "lee",
           "s", "plan", "was", "to", "strike", "into", "pennsylvania", "imperiling", "the",
           "city", "of", "washington", "and", "forcing", "an", "end", "to", "the", "war",
           "the", "confederate", "army", "began", "crossing", "the", "potomac", "on", "september",
           "4", "and", "within", "a", "few", "days", "had", "entered", "frederick", "a", "town",
           "in", "western", "maryland", "the", "citizens", "of", "the", "town", "stared", "at",
           "the", "confederates", "as", "they", "passed", "through", "hardly", "extending", "the",
           "warm", "welcome", "lee", "had", "hoped", "to", "receive", "in", "maryland", "lee",
           "split", "up", "his", "forces", "sending", "part", "of", "the", "army", "of",
           "northern", "virginia", "to", "capture", "the", "town", "of", "harpers", "ferry",
           "and", "its", "federal", "arsenal", "which", "had", "been", "the", "site", "of",
           "john", "brown", "s", "raid", "three", "years", "earlier"]

testDoc = dictionary.doc2bow(testDoc)
print(testDoc)
print("=====")
print(lda_model.get_document_topics(testDoc))

What is the dominant topic? Try your code below:

In [None]:
topicsDicQuick[32]
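The dominant topic can also be picked programmatically. The sketch below works on a list of `(topic_id, proportion)` tuples in the shape that `get_document_topics` returns; the values are made up, so your own output will differ.

```python
# Made-up topic proportions for one document: (topic_id, proportion).
doc_topics = [(3, 0.12), (17, 0.20), (32, 0.61)]

# The dominant topic is the tuple with the largest proportion.
dominant_id, dominant_weight = max(doc_topics, key=lambda pair: pair[1])

# Topic ids in the data start from 0, while our T-labels (and pyLDAvis) start from 1.
label = "T%02d" % (dominant_id + 1)

print(dominant_id, dominant_weight)  # 32 0.61
print(label)                         # T33
```

The id can then be used as a key, as in `topicsDicQuick[dominant_id]`, while the label matches the column names of the merged table.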