Dissociated Parse #62
It's well known that Markov chains don't understand grammar. Any sequences in the output that might look grammatical are only there because grammatical-looking sequences are statistically likely.
Off and on I've been interested in seeing if one can make variations of Markov generators that do retain some of the syntactic structure of the original text.
I had some success with it in 2019 with Anne of Green Garbles, which shows that (roughly speaking) one can combine two Markov models to obtain a third model where the generation can switch between states (like "in narration" and "in dialogue").
This is a second experiment with this goal.
tl;dr -- we can run the Dissociated Press algorithm in a context where we retain the syntactic structure associated with the text, that is, not just on a list of words like usual, but on a forest of parse trees. I call this variation Dissociated Parse.
Obtaining some parse trees from The Wonderful Wizard of Oz using the link-grammar parser, and running Dissociated Parse on them, produces text like:
So, this is clearly generating nonsense, like a Markov chain would, but it is also retaining some of the source's syntactic structure, like a Markov chain wouldn't. (It still doesn't understand anything about tense and aspect, though; and the link-grammar parser doesn't always parse exactly as you'd expect; so for these reasons it still frequently falls short of "syntactically well-formed".)
For background, a description of the Dissociated Press algorithm.
Just as there is more than one algorithm for sorting, there is more than one algorithm for generating a Markov chain.
The usual algorithm involves analyzing the source text and building a weighted transition table, then finding a random (but likely) path
The probably less well-known algorithm called Dissociated Press goes like this:
Even though this works rather differently from the transition table algorithm, it produces the same result. (Exercise for the reader: convince yourself that it does in fact produce the same result.)
One downside of this algorithm is that it requires the entire text be kept in memory, rather than just a transition table. But, this is also an upside, in the sense that variations on the algorithm can exploit structure in the text which would not be retained in the transition table.
Dissociated Parse adapts this to work recursively on parse trees. Consider a parse tree to consist of a word, a part-of-speech tag, and zero or more child trees. Here is a sketch of the algorithm:
That's all it is. One obvious improvement that I haven't tried yet is to have it find all trees that have the same part of speech and word as the current tree. I hope to try that. And to submit a 50,000-word long run too. But in case I do not get around to those, I wanted to at least get this writeup out.
The text was updated successfully, but these errors were encountered:
There is a repository now, here: https://github.com/catseye/Dissociated-Parse
I have also adapted the algorithm as I mentioned, to pick each new tree based on the part of speech plus the first word of current tree. (The link-grammar parser sometimes bundles a number of words into each tree.) The output is similar -- perhaps a little more structured, or perhaps that is just my imagination:
A postscript of sorts, for anyone who might still be reading this issue.
When I was in the middle of working on this entry, the output was not bad, lexically speaking, but it was really lacking in terms of correct (or at least plausible) punctuation, spacing, and capitalization.
This is a situation I've encountered before, in a previous NaNoGenMo, and to deal with it back then I wrote a tool called T-Rext that tries to clean up those things in a mechanical way.
So I wanted to use T-Rext in this project too. But when I tried it again, I was disappointed on two fronts:
I ended up using it regardless -- I ran it from a Python 2.7-based Docker image and wrote an ad-hoc cleanup script to run after it, to compensate for its shortcomings.
But after NaNoGenMo ended I had some time to go back and fix it properly. So, there is now a version 0.2 of T-Rext which runs under Python 3 and does more thorough cleanup, including capitalization.
If I had made these improvement before or during NaNoGenMo I wouldn't have had to write that ad-hoc cleanup script.
T-Rext is in the public domain so please feel free to use it in your own efforts if you feel it would be useful to you.
That's all I wanted to say. Have a nice day, Happy Holidays, and see you again next year.