## Getting the book list
My first goal was to obtain a list of books from prosecraft.io. To see what I had to work with, I chose the 50th percentile and checked what the HTML looked like.

In [1]:
import requests
import json
from bs4 import BeautifulSoup

In [2]:
URL = "http://prosecraft.io/analysis/word-count/percentile/"
page = requests.get(URL)

In [3]:
page.text[:1000]

'<html>\n<head>\n  <meta charset="UTF-8">\n  <title>Prosecraft: Percentiles of Word Count</title>\n  <script>\n    var directory = [{"a":"Simon Jimenez","t":"The Vanished Birds","w":197,"e":"jpg","v":124205.0},{"a":"Jonathan P. Brazee","t":"The Price of Honor","w":200,"e":"jpg","v":77253.0},{"a":"Michael Carter","t":"The Mathematical Murder of Innocence","w":190,"e":"jpg","v":37688.0},{"a":"Anthony Boucher","t":"The Case of the Baker Street Irregulars","w":197,"e":"jpg","v":80557.0},{"a":"Charlie Dalton","t":"Zombie Nation","w":188,"e":"jpg","v":64396.0},{"a":"C. C. Harrington","t":"Wildoak","w":197,"e":"jpg","v":55602.0},{"a":"T. M. Logan","t":"The Holiday","w":195,"e":"jpg","v":101767.0},{"a":"Howard Sounes","t":"Fab: An Intimate Life of Paul McCartney","w":225,"e":"jpg","v":225785.0},{"a":"Nora Roberts","t":"Honest illusions","w":197,"e":"jpg","v":163279.0},{"a":"P. D. Cacek","t":"Second Chances","w":194,"e":"jpg","v":91571.0},{"a":"Robertson Davies","t":"Fifth Business","w":196,"e"

In [4]:
soup = BeautifulSoup(page.content, "html.parser")

I noticed that the entire list of books was contained in a single JavaScript variable called directory, inside the first script tag. Next, I needed to access it.

Examining the script text showed where the list of dictionaries starts and ends, so I extracted the relevant string and parsed it as JSON.

In [5]:
script = soup.find_all("script")[0]
script.text[:1000]

'\n    var directory = [{"a":"Simon Jimenez","t":"The Vanished Birds","w":197,"e":"jpg","v":124205.0},{"a":"Jonathan P. Brazee","t":"The Price of Honor","w":200,"e":"jpg","v":77253.0},{"a":"Michael Carter","t":"The Mathematical Murder of Innocence","w":190,"e":"jpg","v":37688.0},{"a":"Anthony Boucher","t":"The Case of the Baker Street Irregulars","w":197,"e":"jpg","v":80557.0},{"a":"Charlie Dalton","t":"Zombie Nation","w":188,"e":"jpg","v":64396.0},{"a":"C. C. Harrington","t":"Wildoak","w":197,"e":"jpg","v":55602.0},{"a":"T. M. Logan","t":"The Holiday","w":195,"e":"jpg","v":101767.0},{"a":"Howard Sounes","t":"Fab: An Intimate Life of Paul McCartney","w":225,"e":"jpg","v":225785.0},{"a":"Nora Roberts","t":"Honest illusions","w":197,"e":"jpg","v":163279.0},{"a":"P. D. Cacek","t":"Second Chances","w":194,"e":"jpg","v":91571.0},{"a":"Robertson Davies","t":"Fifth Business","w":196,"e":"jpg","v":103691.0},{"a":"Ciji Ware","t":"Spy Above the Clouds","w":196,"e":"jpg","v":161061.0},{"a":"Jill 

In [6]:
# Split on the first '=' only, so an '=' inside a title or author name cannot truncate the data
books = script.text.split('=', 1)[1].split('];')[0] + ']'
books[:1000]

' [{"a":"Simon Jimenez","t":"The Vanished Birds","w":197,"e":"jpg","v":124205.0},{"a":"Jonathan P. Brazee","t":"The Price of Honor","w":200,"e":"jpg","v":77253.0},{"a":"Michael Carter","t":"The Mathematical Murder of Innocence","w":190,"e":"jpg","v":37688.0},{"a":"Anthony Boucher","t":"The Case of the Baker Street Irregulars","w":197,"e":"jpg","v":80557.0},{"a":"Charlie Dalton","t":"Zombie Nation","w":188,"e":"jpg","v":64396.0},{"a":"C. C. Harrington","t":"Wildoak","w":197,"e":"jpg","v":55602.0},{"a":"T. M. Logan","t":"The Holiday","w":195,"e":"jpg","v":101767.0},{"a":"Howard Sounes","t":"Fab: An Intimate Life of Paul McCartney","w":225,"e":"jpg","v":225785.0},{"a":"Nora Roberts","t":"Honest illusions","w":197,"e":"jpg","v":163279.0},{"a":"P. D. Cacek","t":"Second Chances","w":194,"e":"jpg","v":91571.0},{"a":"Robertson Davies","t":"Fifth Business","w":196,"e":"jpg","v":103691.0},{"a":"Ciji Ware","t":"Spy Above the Clouds","w":196,"e":"jpg","v":161061.0},{"a":"Jill Lepore","t":"If Then"

In [7]:
book_data = json.loads(books)
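
String splitting depends on the delimiters not appearing inside the data itself, for example a title containing `];`. A possible alternative, sketched here as an assumption rather than what I actually ran, is to locate the opening bracket and let `json.JSONDecoder.raw_decode` parse exactly one JSON value from that position. The helper name `extract_directory` is hypothetical, the second sample record is made up to exercise the awkward characters, and the sketch assumes the JavaScript array literal is also valid JSON (which it was on this page).

```python
import json


def extract_directory(script_text, var_name="directory"):
    """Parse the JSON array assigned to `var <var_name> = [...]` in a script.

    Hypothetical helper; assumes the array literal is valid JSON.
    """
    start = script_text.find(f"var {var_name}")
    if start == -1:
        raise ValueError(f"no assignment to {var_name!r} found")
    # Move to the opening bracket of the array literal.
    array_start = script_text.find("[", start)
    if array_start == -1:
        raise ValueError("no array literal found after the assignment")
    # raw_decode parses one JSON value starting at array_start and ignores
    # whatever follows it, so the trailing ';' and later code do not matter.
    records, _end = json.JSONDecoder().raw_decode(script_text, array_start)
    return records


# First record copied from the page; the second is made up for illustration.
sample_script = (
    '\n    var directory = ['
    '{"a":"Simon Jimenez","t":"The Vanished Birds","w":197,"e":"jpg","v":124205.0},'
    '{"a":"Made Up = Author","t":"Odd]; Title","w":1,"e":"jpg","v":2.0}'
    '];\n    var other = 1;\n'
)
sample_records = extract_directory(sample_script)
print(len(sample_records), sample_records[1]["t"])
```

Because the decoder tracks string boundaries itself, neither the `=` nor the `];` inside the made-up record affects the result.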

In [8]:
book_data[:5]

[{'a': 'Simon Jimenez',
  't': 'The Vanished Birds',
  'w': 197,
  'e': 'jpg',
  'v': 124205.0},
 {'a': 'Jonathan P. Brazee',
  't': 'The Price of Honor',
  'w': 200,
  'e': 'jpg',
  'v': 77253.0},
 {'a': 'Michael Carter',
  't': 'The Mathematical Murder of Innocence',
  'w': 190,
  'e': 'jpg',
  'v': 37688.0},
 {'a': 'Anthony Boucher',
  't': 'The Case of the Baker Street Irregulars',
  'w': 197,
  'e': 'jpg',
  'v': 80557.0},
 {'a': 'Charlie Dalton',
  't': 'Zombie Nation',
  'w': 188,
  'e': 'jpg',
  'v': 64396.0}]

To test whether this had worked, I selected an arbitrary book. Success!

In [9]:
book_data[50]

{'a': 'Stephen King',
 't': 'Just After Sunset',
 'w': 199,
 'e': 'jpg',
 'v': 132036.0}

I also confirmed that all 24,997 books contained in Prosecraft's database were represented.

Update 3/25: More books have since been added! 

In [10]:
len(book_data)

25090
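
Counting the records confirms how many were parsed, but not that each one is complete. A minimal sketch of a structural check follows, assuming every record should carry the same five keys seen in the output above (`a`, `t`, `w`, `e`, `v`). The function name `find_malformed` is hypothetical, and the third sample record is made up and deliberately incomplete.

```python
EXPECTED_KEYS = {"a", "t", "w", "e", "v"}


def find_malformed(records, expected_keys=EXPECTED_KEYS):
    """Return the indices of records whose key set differs from expected_keys."""
    expected = set(expected_keys)
    return [i for i, rec in enumerate(records) if set(rec) != expected]


sample = [
    {"a": "Simon Jimenez", "t": "The Vanished Birds", "w": 197, "e": "jpg", "v": 124205.0},
    {"a": "Stephen King", "t": "Just After Sunset", "w": 199, "e": "jpg", "v": 132036.0},
    {"a": "Hypothetical Author", "t": "Missing Fields"},  # made up, deliberately incomplete
]
malformed = find_malformed(sample)
print(malformed)
```

On the real data this would be called as `find_malformed(book_data)`; an empty list would mean every record has the expected shape.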

Finally, I saved the list for later use.

In [11]:
with open("book_list.json", "w") as outfile:
    json.dump(book_data, outfile)
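
To make sure the saved file can be used later, it can be read back and compared with the in-memory list. The sketch below is an optional extra step, not part of the original notebook. The helper name `save_and_verify` is hypothetical, and it writes to a temporary directory so that running it does not overwrite `book_list.json`.

```python
import json
import os
import tempfile


def save_and_verify(records, path):
    """Write records as JSON, read them back, and report whether they match."""
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(records, outfile)
    with open(path, encoding="utf-8") as infile:
        reloaded = json.load(infile)
    return reloaded == records


sample = [
    {"a": "Stephen King", "t": "Just After Sunset", "w": 199, "e": "jpg", "v": 132036.0},
]
with tempfile.TemporaryDirectory() as tmpdir:
    round_trip_ok = save_and_verify(sample, os.path.join(tmpdir, "book_list.json"))
print(round_trip_ok)
```

The comparison works here because the records contain only strings, integers, and floats, all of which survive a JSON round trip unchanged.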