Textplot

Textplot is a little program that converts a document into a network of terms, with the goal of teasing out information about the high-level topic structure of the text. For each term:

Get the set of offsets in the document where the term appears.
Using kernel density estimation, compute a probability density function (PDF) that represents the word's distribution across the document. Eg, from War and Peace:

Compute a Bray-Curtis dissimilarity between the term's PDF and the PDFs of all other terms in the document. This measures the extent to which two words appear in the same locations.
Sort this list in descending order to get a custom "topic" for the term. Skim off the top N words (usually 10-20) to get the strongest links. Here's "napoleon":

[('napoleon', 1.0),
('war', 0.65319871313854128),
('military', 0.64782349297012154),
('men', 0.63958189887106576),
('order', 0.63636730075877446),
('general', 0.62621616907584432),
('russia', 0.62233286026418089),
('king', 0.61854160459241103),
('single', 0.61630514751638699),
('killed', 0.61262010905310182),
('peace', 0.60775702746632576),
('contrary', 0.60750138486684579),
('number', 0.59936009740377516),
('accompanied', 0.59748552019874168),
('clear', 0.59661288775164523),
('force', 0.59657370362505935),
('army', 0.59584331507492383),
('authority', 0.59523854206807647),
('troops', 0.59293965397478188),
('russian', 0.59077308177196441)]

Shovel all of these links into a network and export a GML file.

Generating graphs

There are two ways to create graphs - you can use the textplot executable from the command line, or, if you want to tinker around with the underlying NetworkX graph instance, you can fire up a Python shell and use the build_graph() helper directly.

Either way, first install Textplot. With PyPI:

pip install textplot

Or, clone the repo and install the package manually:

pyvenv env
. env/bin/activate
pip install -r requirements.txt
python setup.py install

From the command line

Then, from the command line, generate graphs with:

texplot generate [IN_PATH] [OUT_PATH] [OPTIONS]

Where the input is a regular .txt file, and the output is a .gml file. So, if you're working with War and Peace:

texplot generate war-and-peace.txt war-and-peace.gml

The generate command takes these options:

--term_depth=1000 (int) - The number of terms to include in the network. For now, Textplot takes the top N most frequent terms, after stopwords are removed.
--skim_depth=10 (int) - The number of connections (edges) to skim off the top of the "topics" computed for each word.
--d_weights (flag) - By default, terms that appear in similar locations in the document will be connected by edges with "heavy" weights, the semantic expected by force-directed layout algorithms like Force Atlas 2 in Gephi. If this flag is passed, the weights will be inverted - use this if you want to do any kind of pathfinding analysis on the graph, where it's generally assumed that edge weights represent distance or cost.
--bandwidth=2000 (int) - The bandwidth for the kernel density estimation. This controls how "smoothness" of the curve. 2000 is a sensible default for long novels, but bump it down if you're working with shorter texts.
--samples=1000 (int) - The number of equally-spaced points on the X-axis where the kernel density is sampled. 1000 is almost always enough, unless you're working with a huge document.
--kernel=gaussian (str) - The kernel function. The scikit-learn implementation also supports tophat, epanechnikov, exponential, linear, and cosine.

From a Python shell

Or, fire up a Python shell and import build_graph() directly:

In [1]: from textplot.helpers import build_graph

In [2]: g = build_graph('war-and-peace.txt')

Tokenizing text...
Extracted 573064 tokens

Indexing terms:
[################################] 124750/124750 - 00:00:06

Generating graph:
[################################] 500/500 - 00:00:03

build_graph() returns an instance of textplot.graphs.Skimmer, which gives access to an instance of networkx.Graph. Eg, to get degree centralities:

In [3]: import networkx as nx
In [4]: nx.degree_centrality(g.graph)

Texplot uses numpy, scipy, scikit-learn, matplotlib, networkx, and clint.

Name		Name	Last commit message	Last commit date
Latest commit History 330 Commits
bin		bin
notes/mental-maps		notes/mental-maps
test		test
textplot		textplot
.editorconfig		.editorconfig
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bin

bin

notes/mental-maps

notes/mental-maps

test

test

textplot

textplot

.editorconfig

.editorconfig

.gitignore

.gitignore

LICENSE.txt

LICENSE.txt

README.md

README.md

requirements.txt

requirements.txt

setup.py

setup.py

Repository files navigation

Textplot

Generating graphs

From the command line

From a Python shell

About

Releases

Packages

Languages

License

davidmcclure/textplot

Folders and files

Latest commit

History

Repository files navigation

Textplot

Generating graphs

From the command line

From a Python shell

About

Resources

License

Stars

Watchers

Forks

Languages