The Average Novel #22
Some progress! I present: *The Average Novel*.

I'm still working with Project Gutenberg files from the April 2010 DVD ISO (downloadable here: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project) and Leonard Richardson's 47000_metadata.json (https://twitter.com/leonardr/status/667049187918356480). Steps:

(1) Fetched every text in PG that was labelled as fiction, then parsed them into sentences and used gensim's Word2Vec module to calculate 100-dimensional word embeddings from the resulting sentences.
(2) Created an array of word embeddings for every text (by looking up each word in the embedding) and normalized the length of every novel to exactly 50,000.
(3) Summed the arrays for every length-normalized text and divided by the number of texts (~11k).
(4) For each vector in the array, found the word with the closest embedding.

You can see the results here: https://gist.github.com/aparrish/86daccdfa4f338b1d33e98d1624029d7

I guess I secretly hoped that this technique would reveal, average-face-like (https://pmsol3.wordpress.com/), the Narrative Ur-text underlying all storytelling. But the result is pretty much what I actually expected: all of the structural variation gets lost in the wash. (The "Produced" and "Proofreaders" tokens at the top are obviously remnants of PG credits and boilerplate that weren't caught by the filtering tools I'm using; the "," token just happens to have been the vector most central to the average, which I guess kinda makes sense given how Word2Vec works. Not sure what all those pachyderms are doing in there, though.)

I'm planning to continue experimenting with this technique, but wanted to share this progress in case further experiments extend past the deadline.
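A minimal sketch of what steps (1)–(4) could look like in Python with gensim and numpy; this is not the source code posted later in this thread. It assumes the plain-text fiction files sit in a hypothetical `fiction/` directory, uses a rough regex segmentation in place of real sentence parsing, and length-normalizes each novel by nearest-neighbor resampling of its embedding array to 50,000 rows.

```python
import glob
import re

import numpy as np
from gensim.models import Word2Vec

TARGET_LEN = 50_000   # every novel is length-normalized to this many positions
DIMS = 100            # 100-dimensional embeddings, as described above


def sentences_from(path):
    """Very rough sentence/token segmentation; a stand-in for real parsing.
    Punctuation marks are kept as tokens, as in the first version above."""
    text = open(path, encoding="utf-8", errors="ignore").read().lower()
    for sent in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[a-z']+|[^\w\s]", sent)
        if tokens:
            yield tokens


paths = sorted(glob.glob("fiction/*.txt"))   # hypothetical corpus location

# (1) Train Word2Vec on every sentence from every fiction text.
#     (vector_size is the gensim 4.x name; older releases call it `size`.)
all_sentences = [s for p in paths for s in sentences_from(p)]
model = Word2Vec(all_sentences, vector_size=DIMS, window=5, min_count=5, workers=4)
wv = model.wv

# (2) + (3) Embed each novel token by token, stretch or shrink it to exactly
#     50,000 rows (nearest-neighbor resampling -- an assumption; the original
#     may normalize length differently), and keep a running sum across novels.
total = np.zeros((TARGET_LEN, DIMS))
count = 0
for path in paths:
    tokens = [t for sent in sentences_from(path) for t in sent if t in wv]
    if len(tokens) < 2:
        continue
    emb = np.array([wv[t] for t in tokens])                    # (n_tokens, 100)
    idx = np.linspace(0, len(tokens) - 1, TARGET_LEN).astype(int)
    total += emb[idx]                                          # (50000, 100)
    count += 1
average = total / count

# (4) Read the "average novel" back out: for each of the 50,000 positions,
#     find the vocabulary word whose embedding is closest to the averaged vector.
average_novel = [wv.similar_by_vector(vec, topn=1)[0][0] for vec in average]
print(" ".join(average_novel[:200]))
```

Because punctuation is tokenized and embedded like any other word here, a very "central" token such as "," can easily win the nearest-word lookup at many positions, which matches the behavior described above.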
If nothing else, pachyderms.camp would make for a great Mastodon instance domain.
Going to post the source code for this soon, stay tuned!
I'm a day late, but I posted the source code and a new version of the output. For the new version, I decided to ignore punctuation tokens when calculating the vectors for each novel. The result has fewer commas and more interesting variation, IMO!
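A minimal sketch of that tweak, separate from the posted source code: punctuation-only tokens are dropped from each novel's token list before the embedding lookup, so the "," vector can no longer pull the position-wise averages toward itself. The character set and the example token list are illustrative assumptions.

```python
import string

# Assumed set of punctuation characters; curly quotes added for old PG texts.
PUNCT = set(string.punctuation) | {"“", "”", "‘", "’"}


def is_punct(token):
    """True for tokens made up entirely of punctuation characters."""
    return bool(token) and all(ch in PUNCT for ch in token)


# Hypothetical token list like the ones built in the earlier sketch:
tokens = ["call", "me", "ishmael", ".", "some", "years", "ago", ",", "never", "mind"]
tokens = [t for t in tokens if not is_punct(t)]
print(tokens)  # punctuation-only tokens are gone before any vectors are summed
```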
Hi everyone, I'm going to try to make something this year! I haven't planned anything out yet, but it will probably have to do with word embeddings somehow. Also the WikiPlots corpus.