
The Average Novel #22

Open
aparrish opened this issue Oct 23, 2017 · 5 comments

Comments

@aparrish commented Oct 23, 2017

Hi everyone, I'm going to try to make something this year! I haven't planned anything out yet, but it will probably have to do with word embeddings somehow, and possibly the WikiPlots corpus.

@aparrish (Author) commented Nov 28, 2017

Some progress! I present: The Average Novel.

I'm still working with Project Gutenberg files from the April 2010 DVD ISO (downloadable here) and Leonard Richardson's 47000_metadata.json. Steps:

(1) Fetched every text in PG that was labelled as fiction, parsed each one into sentences, and used gensim's Word2Vec module to train 100-dimensional word embeddings on the resulting sentences.
(2) Built an array of word embeddings for every text (by looking up each word in the embedding) and normalized the length of these arrays to 50,000 rows (leaving ~11k arrays of shape (50000, 100)).
(3) Summed the arrays for every length-normalized text and divided by the number of texts.
(4) For each vector in the resulting array, found the word with the closest embedding. (A rough sketch of the whole pipeline is below.)
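
In rough form, the whole thing looks something like the sketch below. This is simplified and not the exact code I used: the tokenizer, the index-based length normalization, and `fiction_paths` are all stand-ins, and it assumes gensim 4.x and numpy.

```python
# Rough sketch of the pipeline above -- simplified, not the exact code used.
# Assumes gensim 4.x and numpy; `fiction_paths` stands in for the list of
# Project Gutenberg fiction files pulled from the DVD ISO.
import re
import numpy as np
from gensim.models import Word2Vec

TARGET_LEN = 50_000   # every text gets stretched/squeezed to this many rows
DIMS = 100

def sentences_of(path):
    """Crude sentence/word splitting; real PG texts need header stripping etc."""
    text = open(path, encoding="utf-8", errors="ignore").read().lower()
    return [re.findall(r"[a-z']+", s) for s in re.split(r"[.!?]+", text)]

fiction_paths = ["pg/0001.txt", "pg/0002.txt"]  # hypothetical locations
all_sentences = [s for p in fiction_paths for s in sentences_of(p) if s]

# (1) train 100-dimensional embeddings on every sentence in the corpus
model = Word2Vec(all_sentences, vector_size=DIMS, min_count=5, workers=4)

def text_matrix(path):
    """(2) one embedding per word, resampled (here: by index) to TARGET_LEN."""
    words = [w for s in sentences_of(path) for w in s if w in model.wv]
    arr = np.array([model.wv[w] for w in words])          # (n_words, 100)
    idx = np.linspace(0, len(arr) - 1, TARGET_LEN).astype(int)
    return arr[idx]                                       # (50000, 100)

# (3) sum the length-normalized arrays and divide by the number of texts
total = np.zeros((TARGET_LEN, DIMS), dtype=np.float64)
for p in fiction_paths:
    total += text_matrix(p)
average = total / len(fiction_paths)

# (4) read the "average novel" out as the nearest word to each average vector
novel = [model.wv.similar_by_vector(v, topn=1)[0][0] for v in average]
print(" ".join(novel[:100]))
```

Summing into a single running array keeps memory flat; holding all ~11k of the (50000, 100) arrays at once would take hundreds of gigabytes.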

You can see the results here.

I guess I secretly hoped that this technique would reveal, average-face-like, the Narrative Ur-text underlying all storytelling. But the result is pretty much what I actually expected: all of the structural variation gets lost in the wash. (The Produced and Proofreaders tokens at the top are obviously remnants of PG credits and boilerplate that weren't caught by the filtering tools I'm using; the "," token just happens to have been the vector most central to the average, which I guess kinda makes sense given how Word2Vec works. Not sure what all those pachyderms are doing in there, though.)

I'm planning to continue experimenting with this technique, but wanted to share this progress in case further experiments extend past the deadline.

@swizzard commented Nov 28, 2017 via email

@aparrish (Author) commented Dec 1, 2017

going to post the source code for this soon, stay tuned!

@moonmilk commented Dec 1, 2017

This part is so beautiful.
[image: screenshot of an excerpt]

@aparrish changed the title from "something with word embeddings probably" to "The Average Novel" on Dec 2, 2017
@aparrish (Author) commented Dec 2, 2017

I'm a day late, but I posted the source code and a new version of the output. For the new version, I decided to ignore punctuation tokens when calculating the vectors for each novel. The result has fewer commas, and the variation is a bit more interesting, IMO! (A sketch of the filter is below.)
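
The change is basically a token filter ahead of the embedding lookup. Schematically (this is illustrative, not the code as posted; `is_punct` is a stand-in for however tokens actually get classified):

```python
import string

PUNCT = set(string.punctuation)

def is_punct(token):
    """True for tokens made up entirely of punctuation characters."""
    return token != "" and all(ch in PUNCT for ch in token)

# toy example; in the pipeline this filter runs before the embedding lookup,
# so rows for tokens like "," never enter a novel's (50000, 100) array
tokens = ["call", "me", "ishmael", ",", "some", "years", "ago", "--"]
words = [t for t in tokens if not is_punct(t)]
print(words)  # ['call', 'me', 'ishmael', 'some', 'years', 'ago']
```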

@hugovk added the "completed" (For completed novels!) and "preview" (There is an excerpt somewhere in the thread!) labels on Dec 2, 2017