
The Average Novel #22

Open
aparrish opened this issue Oct 23, 2017 · 5 comments

Comments

@aparrish commented Oct 23, 2017

Hi everyone, I'm going to try to make something this year! I haven't planned anything out yet, but it will probably have to do with word embeddings somehow, and possibly the WikiPlots corpus.

@aparrish (Author) commented Nov 28, 2017

Some progress! I present: The Average Novel.

I'm still working with Project Gutenberg files from the April 2010 DVD ISO (downloadable here) and Leonard Richardson's 47000_metadata.json. Steps:

(1) Fetched every text in PG that was labelled as fiction, parsed each one into sentences, and used gensim's Word2Vec module to train 100-dimensional word embeddings on the resulting sentences.
(2) Built an array of word embeddings for every text (by looking up each word in the embedding) and normalized the length of these arrays to 50,000 rows (leaving ~11k arrays of shape (50000, 100)).
(3) Summed the arrays for every length-normalized text and divided by the number of texts.
(4) For each vector in the resulting array, found the word with the closest embedding. (A rough sketch of the whole pipeline is below.)
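
In rough form, the whole thing looks something like the sketch below. This is simplified and not the exact code I used: the tokenizer, the index-based length normalization, and `fiction_paths` are all stand-ins, and it assumes gensim 4.x and numpy.

```python
# Rough sketch of the pipeline above -- simplified, not the exact code used.
# Assumes gensim 4.x and numpy; `fiction_paths` stands in for the list of
# Project Gutenberg fiction files pulled from the DVD ISO.
import re
import numpy as np
from gensim.models import Word2Vec

TARGET_LEN = 50_000   # every text gets stretched/squeezed to this many rows
DIMS = 100

def sentences_of(path):
    """Crude sentence/word splitting; real PG texts need header stripping etc."""
    text = open(path, encoding="utf-8", errors="ignore").read().lower()
    return [re.findall(r"[a-z']+", s) for s in re.split(r"[.!?]+", text)]

fiction_paths = ["pg/0001.txt", "pg/0002.txt"]  # hypothetical locations
all_sentences = [s for p in fiction_paths for s in sentences_of(p) if s]

# (1) train 100-dimensional embeddings on every sentence in the corpus
model = Word2Vec(all_sentences, vector_size=DIMS, min_count=5, workers=4)

def text_matrix(path):
    """(2) one embedding per word, resampled (here: by index) to TARGET_LEN."""
    words = [w for s in sentences_of(path) for w in s if w in model.wv]
    arr = np.array([model.wv[w] for w in words])          # (n_words, 100)
    idx = np.linspace(0, len(arr) - 1, TARGET_LEN).astype(int)
    return arr[idx]                                       # (50000, 100)

# (3) sum the length-normalized arrays and divide by the number of texts
total = np.zeros((TARGET_LEN, DIMS), dtype=np.float64)
for p in fiction_paths:
    total += text_matrix(p)
average = total / len(fiction_paths)

# (4) read the "average novel" out as the nearest word to each average vector
novel = [model.wv.similar_by_vector(v, topn=1)[0][0] for v in average]
print(" ".join(novel[:100]))
```

Summing into a single running array keeps memory flat; holding all ~11k of the (50000, 100) arrays at once would take hundreds of gigabytes.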

You can see the results here.

I guess I secretly hoped that this technique would reveal, average-face-like, the Narrative Ur-text underlying all storytelling. But the result is pretty much what I actually expected: all of the structural variation gets lost in the wash. (The Produced and Proofreaders tokens at the top are obviously remnants of PG credits and boilerplate that weren't caught by the filtering tools I'm using; the "," token just happens to have been the vector most central to the average, which I guess kinda makes sense given how Word2Vec works. Not sure what all those pachyderms are doing in there, though.)

I'm planning to continue experimenting with this technique, but wanted to share this progress in case further experiments extend past the deadline.

@swizzard commented Nov 28, 2017 via email

@aparrish (Author) commented Dec 1, 2017

going to post the source code for this soon, stay tuned!

@moonmilk commented Dec 1, 2017

This part is so beautiful.
[image: screenshot of an excerpt]

@aparrish changed the title from "something with word embeddings probably" to "The Average Novel" on Dec 2, 2017
@aparrish (Author) commented Dec 2, 2017

I'm a day late, but I posted the source code and a new version of the output. For the new version, I decided to ignore punctuation tokens when calculating the vectors for each novel. The result has fewer commas, and the variation is a bit more interesting, IMO! (A sketch of the filter is below.)
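
The change is basically a token filter ahead of the embedding lookup. Schematically (this is illustrative, not the code as posted; `is_punct` is a stand-in for however tokens actually get classified):

```python
import string

PUNCT = set(string.punctuation)

def is_punct(token):
    """True for tokens made up entirely of punctuation characters."""
    return token != "" and all(ch in PUNCT for ch in token)

# toy example; in the pipeline this filter runs before the embedding lookup,
# so rows for tokens like "," never enter a novel's (50000, 100) array
tokens = ["call", "me", "ishmael", ",", "some", "years", "ago", "--"]
words = [t for t in tokens if not is_punct(t)]
print(words)  # ['call', 'me', 'ishmael', 'some', 'years', 'ago']
```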

@hugovk added the "completed" (For completed novels!) and "preview" (There is an excerpt somewhere in the thread!) labels on Dec 2, 2017