## Cleaning the Prosecraft data
Next, I needed to turn the Prosecraft library, stored as a JSON file listing every book, into a DataFrame containing a sample of books.

In [1]:
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd
import random

I wrote a function that takes the title and author of a book and returns a dictionary containing each metric Prosecraft reports for it.

In [10]:
def get_info(title, author):
    '''Given a title and author of a book in the list,
    returns a dictionary of prosecraft's analysis about the book.'''
    
    # Build the URL slug: drop punctuation that Prosecraft omits,
    # spell out '&', and join lowercase words with hyphens
    chars_to_remove = [':', '’', '.', ',', '“', '”']
    info = {'title': title, 'author': author}
    URL = f"{author}/{title}/"
    for char in chars_to_remove:
        URL = URL.replace(char, '')
    URL = URL.replace('&', 'and').replace(' ', '-').lower()
    URL = "http://prosecraft.io/library/" + URL

    # Fetch the page and pair each metric heading with its value;
    # a page without metric divs leaves info with only title and author
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
    headings = soup.find_all("div", {"class": "book-info-metric-heading"})
    values = soup.find_all("div", {"class": "book-info-metric-value"})
    for heading, value in zip(headings, values):
        # Strip '%' and thousands separators before converting to float
        info[heading.text] = float(value.text.strip('%').replace(',', ''))
    return info
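
The conversion on the last line of the loop can be tried on its own. This is a minimal sketch, assuming the raw strings on the page look like '81,228' and '46.07%' (inferred from the stripping logic, not checked against the site). The name parse_metric is a hypothetical helper, not part of the notebook.

```python
def parse_metric(text):
    """Convert a metric string such as '81,228' or '46.07%' to a float."""
    # Remove percent signs at either end, then thousands separators
    return float(text.strip('%').replace(',', ''))


word_count = parse_metric('81,228')
vividness = parse_metric('46.07%')
print(word_count, vividness)
```

Keeping the conversion in one place would make it easier to adjust if Prosecraft ever formats a value differently.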

I tested the function with a book from my favorite series. 

In [3]:
get_info("Thick as Thieves", "Megan Whalen Turner")

{'title': 'Thick as Thieves',
 'author': 'Megan Whalen Turner',
 'total words': 81228.0,
 'vividness': 46.07,
 'passive voice': 8.24,
 'all adverbs': 2.9,
 'ly-adverbs': 1.0,
 'non-ly-adverbs': 1.91}

Next, I tried creating a random sample of ten books and turning that sample into a pandas DataFrame. 

In [4]:
with open('book_list.json', 'r') as lst:
    book_list = json.load(lst)


In [5]:
# Draw ten distinct books at random from the full list
books = random.sample(book_list, 10)

In [6]:
books

[{'a': 'Katy Milkman',
  't': 'How to Change',
  'w': 199,
  'e': 'jpg',
  'v': 53409.0},
 {'a': 'Zadie Smith',
  't': 'Changing My Mind',
  'w': 199,
  'e': 'jpg',
  'v': 102881.0},
 {'a': 'Lauren Belfer',
  't': 'And After the Fire',
  'w': 198,
  'e': 'jpg',
  'v': 121275.0},
 {'a': 'Kelly Powell',
  't': 'Magic Dark and Strange',
  'w': 197,
  'e': 'jpg',
  'v': 51333.0},
 {'a': 'James P. Hogan',
  't': 'Endgame Enigma',
  'w': 186,
  'e': 'jpg',
  'v': 152071.0},
 {'a': 'Toni Jordan', 't': 'Nine Days', 'w': 194, 'e': 'jpg', 'v': 67063.0},
 {'a': 'Kerrelyn Sparks',
  't': 'Vamps and the City',
  'w': 185,
  'e': 'jpg',
  'v': 95146.0},
 {'a': 'Jenny McCarthy',
  't': 'Love, Lust & Faking It',
  'w': 203,
  'e': 'jpg',
  'v': 51744.0},
 {'a': 'Ilona Andrews',
  't': 'Sweep of the Blade',
  'w': 187,
  'e': 'jpg',
  'v': 84146.0},
 {'a': 'Eric T. Knight',
  't': 'Power Forged',
  'w': 200,
  'e': 'jpg',
  'v': 129144.0}]

In [11]:
# Look up each sampled book on Prosecraft
sample = [get_info(book['t'], book['a']) for book in books]

In [12]:
df = pd.DataFrame(sample)

In [13]:
df

| | title | author | total words | vividness | passive voice | all adverbs | ly-adverbs | non-ly-adverbs |
|---|---|---|---|---|---|---|---|---|
| 0 | How to Change | Katy Milkman | 53409.0 | 26.54 | 6.85 | 3.27 | 1.48 | 1.8 |
| 1 | Changing My Mind | Zadie Smith | 102881.0 | 34.56 | 7.25 | 3.23 | 1.37 | 1.85 |
| 2 | And After the Fire | Lauren Belfer | 121275.0 | 44.64 | 8.07 | 2.66 | 0.82 | 1.84 |
| 3 | Magic Dark and Strange | Kelly Powell | 51333.0 | 60.47 | 6.86 | 2.28 | 0.65 | 1.63 |
| 4 | Endgame Enigma | James P. Hogan | 152071.0 | 37.22 | 8.26 | 3.22 | 1.22 | 2.0 |
| 5 | Nine Days | Toni Jordan | 67063.0 | 59.03 | 9.16 | 2.44 | 0.42 | 2.02 |
| 6 | Vamps and the City | Kerrelyn Sparks | 95146.0 | 51.39 | 8.58 | 2.87 | 0.89 | 1.98 |
| 7 | Love, Lust & Faking It | Jenny McCarthy | 51744.0 | 36.98 | 9.15 | 3.49 | 1.13 | 2.36 |
| 8 | Sweep of the Blade | Ilona Andrews | 84146.0 | 56.42 | 7.28 | 2.3 | 0.75 | 1.55 |
| 9 | Power Forged | Eric T. Knight | 129144.0 | 46.85 | 9.51 | 3.16 | 1.04 | 2.12 |


On my first try, one title produced a row of null values. It turned out that the ':' in the title did not appear in the URL, so I accounted for that in the get_info function. I ran it a few more times to see whether similar failures occurred. They did, with several different special characters.
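
Failed lookups like this can be spotted before building the DataFrame. This sketch assumes that a page which fails to resolve contains no metric divs, so get_info returns only the title and author. The helper failed_lookups is hypothetical, and the second entry is a placeholder made up for illustration, not a real book from the list.

```python
def failed_lookups(rows):
    """Return the rows that contain no metrics beyond title and author."""
    return [row for row in rows if set(row) <= {'title', 'author'}]


rows = [
    # A successful lookup (values from the test call earlier)
    {'title': 'Thick as Thieves', 'author': 'Megan Whalen Turner',
     'total words': 81228.0, 'vividness': 46.07},
    # A placeholder entry imitating a lookup whose URL did not resolve
    {'title': 'Placeholder: A Title', 'author': 'Placeholder Author'},
]

missing = failed_lookups(rows)
for row in missing:
    print(row['title'], '-', row['author'])
```

Printing the titles that came back empty makes it quicker to see which character in a title or author name is breaking the URL.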

At this point I reworked the get_info function to loop over a list of characters to remove, rather than chaining a separate .replace() call on author and title for each character.
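
The difference between the two styles can be sketched without any network requests. Here build_url mirrors the loop in get_info, while build_url_chained is an illustrative reconstruction of the chained style, not the original code. Both function names and the constants are hypothetical.

```python
BASE_URL = "http://prosecraft.io/library/"
CHARS_TO_REMOVE = [':', '’', '.', ',', '“', '”']


def build_url(title, author):
    """Build a Prosecraft library URL by looping over characters to drop."""
    slug = f"{author}/{title}/"
    for char in CHARS_TO_REMOVE:
        slug = slug.replace(char, '')
    slug = slug.replace('&', 'and').replace(' ', '-').lower()
    return BASE_URL + slug


def build_url_chained(title, author):
    """Same result, written as one .replace() call per character."""
    slug = f"{author}/{title}/"
    slug = (slug.replace(':', '').replace('’', '').replace('.', '')
                .replace(',', '').replace('“', '').replace('”', ''))
    slug = slug.replace('&', 'and').replace(' ', '-').lower()
    return BASE_URL + slug


url = build_url("Love, Lust & Faking It", "Jenny McCarthy")
print(url)
```

With the loop, handling a newly discovered special character means adding one entry to the list instead of extending the chain.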