**Simple Notebook Using ODW Newspaper data**

With kudos to [jcoliver's](https://github.com/jcoliver) [Collections as Data repository](https://github.com/jcoliver/dig-coll-borderlands). This notebook is a slight variation on the [Introduction to text mining notebook](https://github.com/jcoliver/dig-coll-borderlands/blob/main/Text-Mining-Short.ipynb).

In [None]:
# First step is to import all of the necessary Python building blocks
# for data tables
import pandas

# for file navigation
import os

# for text data mining
import nltk

# for stopword corpora for a variety of languages
from nltk.corpus import stopwords

# for splitting data into individual words
from nltk.tokenize import RegexpTokenizer

# download the stopwords for several languages
nltk.download('stopwords')

# for drawing the plot
import plotly.express as px

# custom module from U of Arizona
import digcol

Run the block above before the one below, since the rest of the notebook depends on the libraries it imports. The next block is where the newspaper is specified.

In [None]:
newspaper = 'echo'  # code for the Amherstburg Echo; also the folder the issues are read from
language = 'english'  # must match the name of an NLTK stopwords corpus downloaded above
topic_words = ['influenza']  # simple catch-all for demo
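
The `language` setting decides which stopword list is removed before any counting happens. The block below is a rough, self-contained sketch of that kind of cleaning. It is not the actual `digcol.CleanText` implementation: the `clean_words` helper and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins, and lowercasing is an assumption about how the real module behaves.

```python
import re

# Hypothetical stand-in for an NLTK list such as stopwords.words('english')
DEMO_STOPWORDS = {"the", "and", "of", "in", "a", "to"}

def clean_words(text, stop_words=DEMO_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]

# Made-up sentence, used only for illustration
sample = "The Influenza epidemic and the closing of schools in Amherstburg"
print(clean_words(sample))
```

If the real cleaning step lowercases words in the same way, entries in `topic_words` should be lowercase as well, otherwise they will never match.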

The data can now be accessed in the notebook.
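
The next cell takes the first ten characters of each file name as the issue date, which assumes names that start with an ISO-style `YYYY-MM-DD` prefix. A minimal sketch of how that assumption could be checked, and how stray files could be skipped, is shown here. The file names and the `issue_date` helper are hypothetical, not part of the original notebook.

```python
from datetime import datetime

def issue_date(filename):
    """Return the leading YYYY-MM-DD date of a file name, or None if absent."""
    try:
        return datetime.strptime(filename[0:10], "%Y-%m-%d").date()
    except ValueError:
        return None

# Hypothetical directory listing, including a stray non-issue file
listing = ["1918-10-18.txt", ".DS_Store", "1918-10-04.txt", "1918-10-11.txt"]

# Keep only names with a valid date prefix; plain sorting is then chronological
issues = sorted(name for name in listing if issue_date(name) is not None)
print(issues)
```

Because ISO dates sort alphabetically in the same order as chronologically, a plain `sort()` on such file names is enough to put the issues in time order.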

In [None]:
# only one data path at this point
# (volume_path is not used below; issues are read from the folder named by `newspaper`)
volume_path = 'data_samples'

sample_issues = os.listdir(newspaper)
sample_issues.sort()

# The first ten characters of each file name are taken as the issue date
dates = [issue[0:10] for issue in sample_issues]

# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ['Date'])

# Set all frequencies to zero
results_table['Frequency'] = 0.0

# Cycle over all issues and do relative frequency calculations
for issue in sample_issues:
    issue_text = digcol.CleanText(filename = newspaper + '/' + issue, language = language)
    issue_text = issue_text.clean_list
    
    # Create a table with words
    word_table = pandas.Series(issue_text)

    # Calculate relative frequencies of all words in the issue
    word_freqs = word_table.value_counts(normalize = True)
    
    # Pull out only values that match words of interest
    my_freqs = word_freqs.filter(topic_words)
    
    # Get the total frequency for words of interest
    total_my_freq = my_freqs.sum()
    
    # Format the date from the name of the file so we know where to put
    # the data in our table
    issue_date = str(issue[0:10])
    
    # Add the date & relative frequency to our data table
    results_table.loc[results_table['Date'] == issue_date, 'Frequency'] = total_my_freq
    
# Analyses are all done, plot the figure
my_figure = px.line(results_table, x = 'Date', y = 'Frequency')
my_figure.show()
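
The value plotted for each issue is a relative frequency: the number of times the topic words occur, divided by the total number of cleaned words in that issue. The sketch below works through the same arithmetic as the `value_counts(normalize = True)`, `filter` and `sum` steps, using only the standard library. The `topic_frequency` helper and the word list are hypothetical and made up for illustration.

```python
from collections import Counter

def topic_frequency(words, topic_words):
    """Share of all words in `words` that appear in `topic_words`."""
    if not words:
        return 0.0
    counts = Counter(words)
    # set() avoids counting a topic word twice if it is listed twice
    matches = sum(counts[word] for word in set(topic_words))
    return matches / len(words)

# Made-up cleaned word list standing in for one issue
issue_words = ["influenza", "epidemic", "schools", "closed", "influenza",
               "council", "meeting", "harvest", "church", "notice"]

# "influenza" appears 2 times among 10 words
print(topic_frequency(issue_words, ["influenza"]))
```

Dividing by the issue length makes long and short issues comparable, which a raw count would not.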