Plots various graphs for a series of plaintext files in a directory
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.

PlotSummary is a python program that will scan a specified directory for plaintext files and produce a range of graphs based on the frequency and distribution of words within those files.

A great deal of the groundwork for PlotSummary was achieved by David McClure, who wrote the excellent TextPlot. PlotSummary is derived from this work, although I have substantially modified parts of TextPlot and have focused only on its intermediate generation functions, which are not the ultimate goal of TextPlot's network generation functions.

Basic Usage

"""plotsummary: Plots various graphs for a series of plaintext files in a directory

Usage: single <directory> <term_file> [options] hist <directory> <term_file> [options] group <directory> <term_file> <term_name> <second_term_file> <second_term_name> [options] overlap <directory> <first_term> <second_term> [options] rawcount <directory> <term_file> [options] search <directory> <term> <count> [options] (-h | --help) --version

    -c, --caption <caption>                         Specify the output caption
    -d, --debug                                     Enable debug output
    -h --help                                       Show this screen.
    -n, --nostem <nostem>                           Specify a path containing words that should not be stemmed
    --version                                       Show version.
    -w, --words <words>                             Specify the word frequency to sample (default: 5000)

There are six different modes in which PlotSummary can be run, which should be passed as the first argument to the script: single, hist, group, overlap,rawcount and search.

Single mode will produce a kernel density estimate graph for the provided terms.

Group mode will produce a kernel density estimates for two groups of provided terms.

Hist mode will produce a histogram of term frequencies spread across 5,000 word intervals.

Overlap mode will produce a graph showing the degree to which two terms overlap in a kernel density estimation (using Bray-Curtis dissimilarity).

Rawcount mode will produce a line graph of term frequencies across 5,000 word intervals.

Search mode will take a single term and tell you the top X other terms that occur in the same areas of the text.

The "term_file" (and "second_term_file") argument(s) should be an absolute path to a file that contains a list of terms to plot; one term per line.

The "term_name" (and "second_term_name") argument(s) should be strings given on the command line. These will be used as labels for each set of terms.

"first_term" and second_term" arguments for options that compare two terms should just be the raw terms.

The "count" argument (used with search) will let you limit the number of results.

The --caption option allows you to title the resulting graph.

The --debug option will let you see what's going on. I recommend enabling it.

The --nostem option allows you to specify a file containing a list of words that should be exempt from stemming. PlotSummary uses the Porter2 algorithm for stemming, which has some known false positives. For instance, "university" becomes "univers". The debug option (as above) will show how your terms are being stemmed. You can, therefore, use the nostem list to specify that such terms should be exempted.

The --words option allows you to set the number of words sampled in hist and rawcount modes.

#Example usage: rawcount ./ rawcount ~/Barth/ ~/term_file.txt -d -n ~/data/no_stem.txt -c 'University Terms' > ~/Averages.out

Input files

Inside ~/term_file.txt:


Inside ~/data/no_stem.txt:


In the directory ~/Barth is a plaintext version of John Barth's novel, Giles Goat Boy.

Output files

In ~/Averages.out:

[plotsummary] university will not be stemmed
[plotsummary] professor will be stemmed to professor
[plotsummary] student will be stemmed to student
[plotsummary] lecturer will be stemmed to lectur
[plotsummary] college will be stemmed to colleg
[plotsummary] Loading Barth.txt
[plotsummary] Plotting Barth.txt
[plotsummary] The term university appears on average 2 times every 5000 words
[plotsummary] The term professor appears on average 1 times every 5000 words
[plotsummary] The term student appears on average 5 times every 5000 words
[plotsummary] The term lecturer appears on average 0 times every 5000 words
[plotsummary] The term college appears on average 4 times every 5000 words
[plotsummary] Saving Barth.png

In ~/Barth/Barth.png:


#Example usage: search and overlap

./ search ~/GR/ blicero 20 --debug > out.txt

Input Files In the directory ~/GR is a single plaintext version of Thomas Pynchon's Gravity's Rainbow (Pynchon.txt).

Output Files

In out.txt:

[plotsummary] blicero will be stemmed to blicero
[plotsummary] Loading Pynchon.txt
[plotsummary] Plotting Pynchon.txt
[plotsummary] Top twenty correlated terms (with more than one occurrence) for blicero: 
[plotsummary] gottfri
[plotsummary] lager
[plotsummary] wandervogel
[plotsummary] thanatz
[plotsummary] kalahari
[plotsummary] stadt
[plotsummary] flotsam
[plotsummary] recogniz
[plotsummary] execution
[plotsummary] asham
[plotsummary] tray
[plotsummary] silhouett
[plotsummary] heath
[plotsummary] crotch
[plotsummary] fungu
[plotsummary] crippl
[plotsummary] erd
[plotsummary] exempt
[plotsummary] butt
[plotsummary] wand
[plotsummary] infinit

These are the terms, in Gravity's Rainbow that most closely correlate to Blicero.

We can verify this through two commands and the resulting output graphs:

./ overlap ~/GR/ blicero gottfried --debug

Blicero and Gottfried in Gravity's Rainbow

./ overlap ~/GR/ blicero thanatz --debug

Blicero and Thanatz in Gravity's Rainbow

#Components and Licensing PlotSummary is copyright Martin Paul Eve 2015. It is released under the terms specified in LICENSE.

PlotSummary makes use of several other open-source/free-software projects, including: