## Cleaning the Prosecraft data
Next, I needed to get the Prosecraft books from a JSON file listing every book into a DataFrame containing a sample.

In [1]:
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd
import random

I created a function to take in the title and author of a book and output a dictionary containing each piece of information given by Prosecraft. 

In [2]:
def get_info(title, author):
    '''Given the title and author of a book in the list,
    return a dictionary of Prosecraft's analysis of the book.'''

    # Build the URL slug: drop the special characters that Prosecraft omits
    chars_to_remove = [':', '’', '.', '“', '”']
    info = {'title': title, 'author': author}
    URL = f"{author}/{title}/"
    for char in chars_to_remove:
        URL = URL.replace(char, '')
    URL = URL.replace('&', 'and').replace(' ', '-').lower()
    URL = "http://prosecraft.io/library/" + URL

    # Fetch the page and collect each metric heading with its value
    page = requests.get(URL, timeout=10)  # timeout so a stalled request cannot hang
    soup = BeautifulSoup(page.content, "html.parser")
    headings = soup.find_all("div", {"class": "book-info-metric-heading"})
    values = soup.find_all("div", {"class": "book-info-metric-value"})
    for heading, value in zip(headings, values):
        # Strip the percent sign and thousands separators before converting
        info[heading.text] = float(value.text.strip('%').replace(',', ''))
    return info
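
The conversion at the end of get_info can be pulled out and checked without touching the network. The sketch below is a hypothetical helper, not part of the original notebook: the name `parse_metric` and the sample strings are assumptions, inferred from the fact that get_info strips '%' and ',' before calling float. Unlike the original one-liner, it returns None rather than raising when the text is not numeric.

```python
def parse_metric(raw):
    """Convert a metric string such as '81,228' or '46.07%' to a float.

    Hypothetical helper: returns None when the cleaned text is not numeric.
    """
    # Trim whitespace first, then the percent sign and thousands separators
    cleaned = raw.strip().strip('%').replace(',', '')
    try:
        return float(cleaned)
    except ValueError:
        return None


# Assumed examples of the raw text found in the metric divs
print(parse_metric('81,228'))
print(parse_metric('46.07%'))
print(parse_metric('n/a'))
```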

I tested the function with a book from my favorite series. 

In [3]:
get_info("Thick as Thieves", "Megan Whalen Turner")

{'title': 'Thick as Thieves',
 'author': 'Megan Whalen Turner',
 'total words': 81228.0,
 'vividness': 46.07,
 'passive voice': 8.24,
 'all adverbs': 2.9,
 'ly-adverbs': 1.0,
 'non-ly-adverbs': 1.91}

Next, I tried creating a random sample of ten books and turning that sample into a pandas DataFrame. 

In [4]:
with open('book_list.json', 'r') as lst:
    book_list = json.load(lst)


In [5]:
# Draw 10 distinct books at random; sampling from the list itself
# avoids hard-coding its length
books = random.sample(book_list, 10)

In [6]:
books

[{'a': 'Howard Lauther',
  't': 'Creating Characters',
  'w': 200,
  'e': 'jpg',
  'v': 73732.0},
 {'a': 'Robert Coover',
  't': 'Huck Out West',
  'w': 197,
  'e': 'jpg',
  'v': 95282.0},
 {'a': 'Graham Masterton',
  't': 'Ghost Virus',
  'w': 192,
  'e': 'jpg',
  'v': 114497.0},
 {'a': 'Conner Habib',
  't': 'Hawk Mountain',
  'w': 197,
  'e': 'jpg',
  'v': 77111.0},
 {'a': 'Jon Cohen', 't': 'Harry’s Trees', 'w': 199, 'e': 'jpg', 'v': 123948.0},
 {'a': 'Andrew Maclure',
  't': 'First Of The First',
  'w': 187,
  'e': 'jpg',
  'v': 168709.0},
 {'a': 'Gail Bowen',
  't': 'The Unlocking Season',
  'w': 206,
  'e': 'jpg',
  'v': 86649.0},
 {'a': 'Joyce Carol Oates',
  't': 'The Corn Maiden & Other Nightmares',
  'w': 201,
  'e': 'jpg',
  'v': 89549.0},
 {'a': 'James D. Mortain',
  't': 'Dead Ringer',
  'w': 187,
  'e': 'jpg',
  'v': 58803.0},
 {'a': 'Jesse Watters',
  't': 'How I Saved the World',
  'w': 198,
  'e': 'jpg',
  'v': 85594.0}]

In [7]:
# Look up each sampled book on Prosecraft ('t' is the title, 'a' the author)
sample = [get_info(book['t'], book['a']) for book in books]

In [8]:
df = pd.DataFrame(sample)

In [9]:
df

| | title | author | total words | vividness | passive voice | all adverbs | ly-adverbs | non-ly-adverbs |
|---|---|---|---|---|---|---|---|---|
| 0 | Creating Characters | Howard Lauther | 73732.0 | 32.56 | 8.5 | 2.87 | 1.24 | 1.63 |
| 1 | Huck Out West | Robert Coover | 95282.0 | 53.7 | 7.44 | 2.65 | 0.51 | 2.14 |
| 2 | Ghost Virus | Graham Masterton | 114497.0 | 55.44 | 9.49 | 3.16 | 0.87 | 2.29 |
| 3 | Hawk Mountain | Conner Habib | 77111.0 | 46.65 | 8.72 | 3.23 | 0.75 | 2.48 |
| 4 | Harry’s Trees | Jon Cohen | 123948.0 | 70.12 | 6.92 | 2.55 | 0.9 | 1.65 |
| 5 | First Of The First | Andrew Maclure | 168709.0 | 29.64 | 10.11 | 2.73 | 0.82 | 1.92 |
| 6 | The Unlocking Season | Gail Bowen | 86649.0 | 40.38 | 9.32 | 2.47 | 0.76 | 1.71 |
| 7 | The Corn Maiden & Other Nightmares | Joyce Carol Oates | 89549.0 | 55.88 | 9.08 | 3.58 | 1.42 | 2.16 |
| 8 | Dead Ringer | James D. Mortain | 58803.0 | 38.09 | 7.71 | 2.77 | 1.28 | 1.48 |
| 9 | How I Saved the World | Jesse Watters | 85594.0 | 38.27 | 7.89 | 2.86 | 1.24 | 1.62 |


On my first try, one title produced a row of null values. It turned out that the ':' in the title did not appear in the Prosecraft URL, so I updated the get_info function to strip it. I ran the sampling a few more times to see whether the problem recurred, and it did, with several different special characters.
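
One way to spot these failures before building the DataFrame is to look for records that carry no metrics at all. This is a minimal sketch, assuming that when the URL does not match a Prosecraft page no metric divs are found, so get_info returns only the 'title' and 'author' keys. The helper name `find_failed_lookups` and the second record are made up for illustration.

```python
def find_failed_lookups(records):
    """Return (title, author) pairs for records that contain no metrics."""
    base_keys = {'title', 'author'}
    # A record with nothing beyond title and author means the lookup found no data
    return [(r['title'], r['author']) for r in records if set(r) <= base_keys]


sample = [
    {'title': 'Thick as Thieves', 'author': 'Megan Whalen Turner',
     'total words': 81228.0, 'vividness': 46.07},
    # Made-up record standing in for a title whose URL did not resolve
    {'title': 'Example: A Subtitle', 'author': 'Jane Doe'},
]

print(find_failed_lookups(sample))
```

Printing the failed titles makes it easier to see which special character needs to be added to the removal list.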

At this point I reworked the get_info function to loop over a list of characters to remove, rather than chaining a separate .replace() call on the author and title for each character.
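
As an alternative to the loop, the same removal can be done in one pass with str.translate. The sketch below is not the notebook's method: it swaps in a translation table and isolates the URL-building step in a hypothetical `build_url` function so it can be checked offline. The character list, base URL, and replacement rules are copied from get_info; whether Prosecraft drops any other characters is not known here.

```python
# Characters that get_info removes from the URL, as a deletion table
REMOVE_TABLE = str.maketrans('', '', ':’.“”')
BASE_URL = "http://prosecraft.io/library/"


def build_url(title, author):
    """Build the library URL for a book using the same rules as get_info."""
    # Delete every listed character in a single pass
    slug = f"{author}/{title}/".translate(REMOVE_TABLE)
    # Spell out ampersands, hyphenate spaces, and lowercase the result
    slug = slug.replace('&', 'and').replace(' ', '-').lower()
    return BASE_URL + slug


print(build_url("The Corn Maiden & Other Nightmares", "Joyce Carol Oates"))
print(build_url("Dead Ringer", "James D. Mortain"))
```

Keeping the URL logic separate from the request means a new special character can be added to the table and verified without fetching any pages.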