## Data Set Shares: Reviews of Popular Books

This notebook includes the code that scrapes two data sets related to the most popular books, as determined by the Goodreads community.

The first dataset covers the top 100 best books of all time, as ranked by the Goodreads community. For each of those books, all visible text on the book's Goodreads page is saved as text data. That text includes the community reviews, a synopsis, and extraneous information from the Goodreads website.

The second dataset includes the NYT's reviews for any of those 100 books that have been reviewed by the NYT.


In [1]:
# import libraries
from collections import Counter
from collections import defaultdict
import requests
from bs4 import BeautifulSoup 
from bs4.element import Comment
import re
from nyt_api import api_key
import pandas as pd
from time import sleep

In [2]:
# function to extract visible text (will need this later)

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
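
To make it concrete which text a filter like this keeps, here is a small sketch of the same idea. It uses only the standard library's `html.parser` rather than BeautifulSoup, so it is a swapped-in technique and not the approach used in this notebook. The names `VisibleTextParser` and `visible_text` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

# Tags whose contents a reader never sees on the rendered page
# (assumption: same spirit as the list in tag_visible above).
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside hidden tags; comments are ignored."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # > 0 while inside any hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments go to handle_comment, so they never reach this method
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>Page title</title><style>p {color: red}</style></head>"
    "<body><!-- hidden note --><p>Great book</p>"
    "<script>var x = 1;</script><p>Loved it</p></body></html>"
)
print(visible_text(sample))
```

Only the two paragraph texts survive; the title, style rules, script, and comment are dropped.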


In [3]:
# function to turn a URL into a nice file name (will need this later)

def generate_filename_from_url(url) :
    
    if not url :
        return None
    
    # drop the http or https
    name = url.replace("https","").replace("http","")

    # Replace useless characters with an underscore
    name = name.replace("://","").replace(".","_").replace("/","_")
    
    # remove last underscore
    last_underscore_spot = name.rfind("_")
    
    name = name[:last_underscore_spot] + name[(last_underscore_spot+1):]

    # tack on .txt
    name = name + ".txt"
    
    return(name)
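
One thing to be aware of: the function above removes the final underscore wherever it occurs, even when it is not at the end of the name. As a possible alternative, here is a minimal sketch that only trims underscores at the ends. It is not the function used in the rest of this notebook, and the name `filename_from_url_sketch` is hypothetical.

```python
import re
from urllib.parse import urlparse


def filename_from_url_sketch(url):
    """Turn a URL into a flat file name (illustrative alternative)."""
    if not url:
        return None

    parsed = urlparse(url)
    # keep host and path, drop the scheme (http/https)
    raw = parsed.netloc + parsed.path

    # collapse every run of characters other than letters, digits, or dashes
    # into one underscore, then trim underscores from both ends
    name = re.sub(r"[^A-Za-z0-9-]+", "_", raw).strip("_")

    return name + ".txt"


print(filename_from_url_sketch(
    "https://www.goodreads.com//book/show/2767052-the-hunger-games"
))
```

Because runs of separators collapse into one underscore, the double slash in the URL does not produce a double underscore in the file name.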

### Part 1: Goodreads Best Books List - Review Scrape

In [4]:
# collect information on the first 100 books, which are stored on the first page of the best books list
site = ["https://www.goodreads.com/list/show/1.Best_Books_Ever"]

In [5]:
# check the link
print(site)
r = requests.get(site[0])
r.status_code

['https://www.goodreads.com/list/show/1.Best_Books_Ever']


200

In [6]:
soup = BeautifulSoup(r.text, 'html.parser')

In [7]:
# grabs all of the book titles

book_titles = []

for title in soup.find_all('a', class_ = 'bookTitle'):
    book_titles.append(title.get('href'))

In [8]:
# note that these are relative paths: they are missing the prefix https://www.goodreads.com/
# so I need to add that to them so the crawl below works
book_titles[:10]

['/book/show/2767052-the-hunger-games',
 '/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix',
 '/book/show/2657.To_Kill_a_Mockingbird',
 '/book/show/1885.Pride_and_Prejudice',
 '/book/show/41865.Twilight',
 '/book/show/19063.The_Book_Thief',
 '/book/show/170448.Animal_Farm',
 '/book/show/11127.The_Chronicles_of_Narnia',
 '/book/show/30.J_R_R_Tolkien_4_Book_Boxed_Set',
 '/book/show/11870085-the-fault-in-our-stars']

In [9]:
full_book_titles = []

for item in book_titles :
    full_book_titles.append('https://www.goodreads.com/' + item)

In [10]:
full_book_titles[:10]

['https://www.goodreads.com//book/show/2767052-the-hunger-games',
 'https://www.goodreads.com//book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix',
 'https://www.goodreads.com//book/show/2657.To_Kill_a_Mockingbird',
 'https://www.goodreads.com//book/show/1885.Pride_and_Prejudice',
 'https://www.goodreads.com//book/show/41865.Twilight',
 'https://www.goodreads.com//book/show/19063.The_Book_Thief',
 'https://www.goodreads.com//book/show/170448.Animal_Farm',
 'https://www.goodreads.com//book/show/11127.The_Chronicles_of_Narnia',
 'https://www.goodreads.com//book/show/30.J_R_R_Tolkien_4_Book_Boxed_Set',
 'https://www.goodreads.com//book/show/11870085-the-fault-in-our-stars']
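
The URLs above contain a double slash after the domain because both the prefix and the relative path contribute a `/`. The site evidently still served the pages, but if a cleaner form is wanted, a minimal sketch with the standard library's `urljoin` looks like this. The variable names are illustrative; note that the dictionary keys used later in this notebook rely on the double-slash form, so switching would change those keys.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.goodreads.com/"

# two of the relative paths scraped above
relative_paths = [
    "/book/show/2767052-the-hunger-games",
    "/book/show/2657.To_Kill_a_Mockingbird",
]

# urljoin resolves the path against the base, so only one slash remains
absolute_urls = [urljoin(BASE_URL, path) for path in relative_paths]

for url in absolute_urls:
    print(url)
```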

Now we can crawl through each of the book title pages to extract the text and store it in a dictionary.

URL will be the key and text will be the value.

In [None]:
# dictionary to hold results
titles_text = dict()

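for link in full_book_titles :
    try :
        r = requests.get(link)
    except requests.RequestException as err :
        # skip this link; otherwise r would still hold the previous response
        print(f"Request failed for this link: {link} ({err})")
        continue
    
    if r.status_code == 200 :
        soup = BeautifulSoup(r.text, 'html.parser')
        # string=True is the current name for the older text=True argument
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        titles_text[link] = " ".join(t.strip() for t in visible_texts)
    else :
        print(f"We got code {r.status_code} for this link: {link}")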
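
The crawl pattern above (request, check the status, store the text, move on) can also be separated from the network call, which makes it easy to try out without hitting Goodreads. The sketch below is one way to do that under stated assumptions: `crawl_pages` and its `fetch` parameter are hypothetical names, and `fetch` is assumed to return a `(status_code, text)` pair.

```python
from time import sleep


def crawl_pages(links, fetch, delay=1.0):
    """Fetch each link once; return (pages, failures) dictionaries.

    fetch is any callable taking a URL and returning (status_code, text).
    """
    pages = {}
    failures = {}

    for link in links:
        try:
            status, text = fetch(link)
        except Exception as err:
            # record the problem and keep going with the next link
            failures[link] = f"error: {err}"
            continue

        if status == 200:
            pages[link] = text
        else:
            failures[link] = f"HTTP {status}"

        # pause between requests to be polite to the server
        sleep(delay)

    return pages, failures


# Stand-in for a real request, so the example runs offline
def fake_fetch(link):
    if link == "bad":
        raise ConnectionError("no route")
    if link == "missing":
        return 404, ""
    return 200, "text for " + link


pages, failures = crawl_pages(["a", "bad", "missing"], fake_fetch, delay=0)
print(pages)
print(failures)
```

With `requests`, the `fetch` argument could be a small wrapper that calls `requests.get(url, timeout=10)` and returns `(r.status_code, r.text)`.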
        

In [None]:
# the text value for each book key is massive
# check a portion of the results for one book to make sure it worked

titles_text['https://www.goodreads.com//book/show/2767052-the-hunger-games'][50000:51000]

We now have a dictionary where the keys are the URLs for each of the first 100 books and the values are all of the visible text on the synopsis/review pages of those books.

In [13]:
# write all text to files 

for item in titles_text :
    output_file = generate_filename_from_url(item)

    with open(output_file,'w',encoding = "UTF-8") as outfile :
        outfile.write("\t".join(["link","text"]) + "\n")
        
        the_text = titles_text[item]

        # get rid of some of our more annoying output chars
        the_text = the_text.replace("\t"," ").replace("\n"," ").replace("\r"," ") 

        if the_text : # test to see if it is non-empty
            outfile.write("\t".join([item,the_text]) + "\n")


### Part 2: NYT's Book Reviews 

This scraping exercise pulls the NYT's book review for any book in the Goodreads top 100 that the NYT has reviewed.

In [9]:
# we have the list of top 100 titles from above
book_titles[:10]

['/book/show/2767052-the-hunger-games',
 '/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix',
 '/book/show/2657.To_Kill_a_Mockingbird',
 '/book/show/1885.Pride_and_Prejudice',
 '/book/show/41865.Twilight',
 '/book/show/19063.The_Book_Thief',
 '/book/show/170448.Animal_Farm',
 '/book/show/11127.The_Chronicles_of_Narnia',
 '/book/show/30.J_R_R_Tolkien_4_Book_Boxed_Set',
 '/book/show/11870085-the-fault-in-our-stars']

In [10]:
# to pull from the NYT's API, I need the titles in a specific format -- 
# spaces: %20
# apostrophes: %27
# it looks like the titles start after either a "." or a "-". 
# I can take off 'book/show' for all the titles as a start

titles = []

for item in book_titles :
    if "/" in item :
        titles.append(item.split("/"))
        

In [11]:
# pull off all but the title names (and some junk that I'll remove later)

clean_titles = []

for item in titles:
    clean_titles.append(item[-1]) 

In [12]:
# then remove all the numbers, except for 1984 because that's a book title

clean_titles_2 = []

for item in clean_titles :
    if '1984' not in item:
        clean_titles_2.append(''.join([i for i in item if not i.isdigit()]))
    else :
        clean_titles_2.append(item)

In [13]:
# an "_s_" in the URL stands for an apostrophe-s, so the "_" before the s becomes "%27"
# only the "_s_" pattern is replaced, so words that merely start with "s" are left alone

titles_apostrophes = []

for item in clean_titles_2 :
    titles_apostrophes.append(item.replace('_s_','%27s_'))
    


In [14]:
# all of the spaces between words need to be coded as "%20"
# some spaces are currently underscores

titles_spaces = []

for item in titles_apostrophes :
    if '_' in item:
        titles_spaces.append(item.replace('_','%20'))
    else :
        titles_spaces.append(item)

In [15]:
# other spaces are currently dashes

titles_spaces2 = []

for item in titles_spaces :
    if '-' in item:
        titles_spaces2.append(item.replace('-','%20'))
    else:
        titles_spaces2.append(item)

In [16]:
# I fix 1984 manually 

odd_title_clean = []

for item in titles_spaces2 :
    if item == '40961427%201984' :
        odd_title_clean.append(item.replace("40961427%201984",".1984"))
    else:
        odd_title_clean.append(item)

In [17]:
# remove the first character

remove_first = []

for item in odd_title_clean:
    remove_first.append(item[1:])

In [18]:
# finally, titles that began with a dash now start with '20' (what is left of '%20'),
# so if the title starts with '2', delete those two characters

nyt_titles = []

for item in remove_first :
    if item[0] == '2' :
        nyt_titles.append(item[2:])
    else :
        nyt_titles.append(item)

In [19]:
nyt_titles[:10]

['the%20hunger%20games',
 'Harry%20Potter%20and%20the%20Order%20of%20the%20Phoenix',
 'To%20Kill%20a%20Mockingbird',
 'Pride%20and%20Prejudice',
 'Twilight',
 'The%20Book%20Thief',
 'Animal%20Farm',
 'The%20Chronicles%20of%20Narnia',
 'J%20R%20R%20Tolkien%20%20Book%20Boxed%20Set',
 'the%20fault%20in%20our%20stars']
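
The cleaning steps above can also be collapsed into a single function. The following is a sketch under stated assumptions, not a drop-in replacement: `slug_to_query_title` is a hypothetical name, and it behaves slightly differently from the cells above because it keeps digits inside a title (only the leading numeric ID is removed). The last path in the example is made up to show the apostrophe case.

```python
import re
from urllib.parse import quote


def slug_to_query_title(path):
    """Convert a Goodreads book path into a percent-encoded title."""
    # keep only the final path component, e.g. "2767052-the-hunger-games"
    slug = path.rsplit("/", 1)[-1]

    # drop the leading numeric ID and the "." or "-" that follows it
    title = re.sub(r"^\d+[.-]", "", slug)

    # a lone "s" between separators stands for an apostrophe-s
    title = re.sub(r"[_-]s(?=[_-]|$)", "'s", title)

    # remaining underscores and dashes are word separators
    title = re.sub(r"[_-]+", " ", title).strip()

    # quote() encodes spaces as %20 and apostrophes as %27
    return quote(title)


examples = [
    "/book/show/2767052-the-hunger-games",
    "/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix",
    "/book/show/40961427-1984",
    "/book/show/123.The_Handmaid_s_Tale",  # illustrative ID
]

for path in examples:
    print(slug_to_query_title(path))
```

Anchoring the ID pattern to the start of the slug is what lets a numeric title such as 1984 pass through without a special case.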

I now have 100 book titles in a form that the NYT's book review API can understand. I can iterate through every one of those titles and request its reviews. If a review exists, the results will include a URL that links to it. If no NYT review exists for the book, the results come back empty and no URL is stored for that title.

In [20]:
# pull results from NYT's reviews, wait 6 seconds between requests to avoid hitting the rate limit

review_urls = defaultdict(list) 

for idx, title in enumerate(nyt_titles) :
    
    requestURL = ('https://api.nytimes.com/svc/books/v3/reviews.json?title=' +
                  title + '&api-key=' + api_key)

    requestHeaders = {
        "Accept": "application/json"
    }

    request = requests.get(requestURL, headers=requestHeaders)
    
    for review_obj in request.json()['results'] :
        this_url = review_obj['url']
        
        if r'/movies/' in this_url : # make sure movie reviews aren't included
            pass
        else :
            review_urls[title].append(this_url)  
    
    sleep(6)
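
The two pieces of logic in the loop above, building the request URL and filtering the results, can be tried out without calling the API. The sketch below assumes plain (not yet percent-encoded) titles as input, since `urlencode` does the encoding itself. The function names, the `DEMO_KEY` value, and the sample payload URLs are all made up for illustration; only the endpoint and the `results`/`url` fields come from the code above.

```python
from urllib.parse import quote, urlencode

REVIEWS_ENDPOINT = "https://api.nytimes.com/svc/books/v3/reviews.json"


def build_review_request_url(title, key):
    """Build the reviews URL from a plain title and an API key."""
    # quote_via=quote encodes spaces as %20 instead of the default "+"
    query = urlencode({"title": title, "api-key": key}, quote_via=quote)
    return f"{REVIEWS_ENDPOINT}?{query}"


def extract_review_urls(payload):
    """Return review URLs from a parsed response, skipping movie reviews."""
    urls = []
    # .get() avoids a KeyError when the response has no "results" field
    for review in payload.get("results", []):
        url = review.get("url", "")
        if url and "/movies/" not in url:
            urls.append(url)
    return urls


# made-up payload in the same shape the loop above expects
sample_payload = {
    "results": [
        {"url": "https://www.nytimes.com/books/example-review.html"},
        {"url": "https://www.nytimes.com/movies/example-film-review.html"},
    ]
}

print(build_review_request_url("The Handmaid's Tale", "DEMO_KEY"))
print(extract_review_urls(sample_payload))
```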
    


In [31]:
# crawl each review page and extract the text into a dictionary
# the key is the book title and the value is a list holding
# the text of the review

site_text = dict()

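for key, values in review_urls.items() :
    for link in values :
    
        try :
            r = requests.get(link)
        except requests.RequestException as err :
            # skip this link; otherwise r would still hold the previous response
            print(f"Request failed for this link: {link} ({err})")
            continue
    
        if r.status_code == 200 :
            soup = BeautifulSoup(r.text, 'html.parser')
            texts = soup.find_all(string=True)
            visible_texts = filter(tag_visible, texts) 
            # append to the title's list so earlier reviews are not overwritten
            site_text.setdefault(key, []).append(" ".join(t.strip() for t in visible_texts))
        else :
            print(f"We got code {r.status_code} for this link: {link}")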


In [34]:
# write text to a file 

for key, values in site_text.items() :
    output_file = key
    # open the file once per title so every review is kept, one per line
    with open(output_file,'w',encoding = "UTF-8") as outfile :
        for value in values :

            # get rid of some of our more annoying output chars
            the_text = value.replace("\t"," ").replace("\n"," ").replace("\r"," ") 

            if the_text : # test to see if it is non-empty
                outfile.write(the_text + "\n")
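
The files written above are named after the percent-encoded title and have no extension. If friendlier names are wanted, one option is sketched below. It is an alternative under stated assumptions, not what this notebook does: `write_reviews`, the `_review_N.txt` naming scheme, and the output directory are all hypothetical.

```python
import re
from pathlib import Path
from urllib.parse import unquote


def write_reviews(site_text, out_dir):
    """Write each review to its own file; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []

    for title, reviews in site_text.items():
        # decode %20/%27, then keep only letters and digits in the file stem
        stem = re.sub(r"[^A-Za-z0-9]+", "_", unquote(title)).strip("_")

        for number, text in enumerate(reviews, start=1):
            # collapse tabs, newlines, and repeated spaces into single spaces
            cleaned = " ".join(text.split())
            if not cleaned:
                continue  # skip empty reviews

            path = out_dir / f"{stem}_review_{number}.txt"
            path.write_text(cleaned + "\n", encoding="utf-8")
            written.append(path)

    return written
```

A call such as `write_reviews(site_text, "nyt_reviews")` would then produce one file per review inside an `nyt_reviews` folder.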

## Appendix

In [35]:
# FOR USE IN ACQUIRE AND ANALYZE NOTEBOOK
# Grabbing the titles of the books from Goodreads that have been reviewed by the NYT

# the dictionary keys are the titles that had at least one NYT review
nyt_goodreads_books = list(review_urls)

In [42]:
nyt_goodreads_books

['To%20Kill%20a%20Mockingbird',
 'Twilight',
 'The%20Book%20Thief',
 'the%20fault%20in%20our%20stars',
 'The%20Da%20Vinci%20Code',
 'Memoirs%20of%20a%20Geisha',
 'divergent',
 'Crime%20and%20Punishment',
 'The%20Little%20Prince',
 'City%20of%20Bones',
 'the%20help',
 'Brave%20New%20World',
 'A%20Thousand%20Splendid%20Suns',
 'the%20lovely%20bones',
 'The%20Odyssey',
 'Life%20of%20Pi',
 'Water%20for%20Elephants',
 'The%20Handmaid%27s%20Tale',
 'dune',
 'Little%20Women',
 'Harry%20Potter%20and%20the%20Deathly%20Hallows',
 'The%20Stand',
 'anna%20karenina',
 'The%20Girl%20with%20the%20Dragon%20Tattoo',
 'My%20Sister%27s%20Keeper',
 'the%20color%20purple',
 'The%20Road',
 'Angela%27s%20Ashes',
 'Don%20Quixote',
 'the%20notebook',
 'A%20Prayer%20for%20Owen%20Meany',
 31,
 30]