## Acquire and Analyze: Goodreads vs. NYT's Book Reviews

In this notebook, I use three datasets - Wikipedia's list of books that have sold at least 10M copies, the Goodreads community's top 100 rated books, and New York Times book reviews - to develop a list of books that have the most popular appeal while also being "worthy" of critical review.

Then I pull some basic statistics on the text from the New York Times reviews of the books that were published after 2000 and meet all three criteria above. 

### Part 1: Analysis of Books on Top Sellers List, Goodreads, and NYT Reviews

In [1]:
# import libraries
from collections import Counter
from collections import defaultdict
import requests
from bs4 import BeautifulSoup 
from bs4.element import Comment
import re
from time import sleep
import nltk
nltk.data.path.append("/Users/austinsmith/documents/Fall21/TextMining/NLTK")
from nltk.corpus import stopwords
sw = stopwords.words('english')
import numpy as np

In [2]:
# read in Wikipedia list of top selling books of all time 

top_sellers = open('Wikipedia_Best_Sellers','r').read()

In [3]:
# garner a list of the top 100 books from the Goodreads Best Books list

site = "https://www.goodreads.com/list/show/1.Best_Books_Ever"
r = requests.get(site)
r.raise_for_status()  # stop here if the request did not succeed

soup = BeautifulSoup(r.text, 'html.parser')

book_titles = []

for title in soup.find_all('a', class_ = 'bookTitle'):
    book_titles.append(title.get('href'))

In [4]:
book_titles[:5]

['/book/show/2767052-the-hunger-games',
 '/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix',
 '/book/show/2657.To_Kill_a_Mockingbird',
 '/book/show/1885.Pride_and_Prejudice',
 '/book/show/41865.Twilight']

In [5]:
# clean up the book titles (using same method as enumerated in the data set shares notebook)
titles = []

for item in book_titles :
    if "/" in item :
        titles.append(item.split("/"))


clean_titles = []

for item in titles:
    clean_titles.append(item[-1]) 
    

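# strip the numeric Goodreads ID from each slug; '1984' is skipped here
# because removing its digits would erase the title itself
clean_titles_2 = []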

for item in clean_titles :
    if '1984' not in item:
        clean_titles_2.append(''.join([i for i in item if not i.isdigit()]))
    else :
        clean_titles_2.append(item)
        

titles_spaces = []

for item in clean_titles_2 :
    if '_' in item:
        titles_spaces.append(item.replace('_',' '))
    else :
        titles_spaces.append(item)
        

titles_hyphens = []

for item in titles_spaces :
    if '-' in item:
        titles_hyphens.append(item.replace('-',' '))
    else :
        titles_hyphens.append(item)
        
        

# handle the '1984' slug separately by dropping its numeric ID prefix
fix_1984 = []

for item in titles_hyphens :
    if item == '40961427 1984' :
        fix_1984.append(item.replace('40961427 1984',' 1984'))
    else :
        fix_1984.append(item)
        
        
        
# drop the leading separator character left behind by the removed ID
goodreads_titles = []

for item in fix_1984:
    goodreads_titles.append(item[1:])

In [6]:
goodreads_titles[:10]

['the hunger games',
 'Harry Potter and the Order of the Phoenix',
 'To Kill a Mockingbird',
 'Pride and Prejudice',
 'Twilight',
 'The Book Thief',
 'Animal Farm',
 'The Chronicles of Narnia',
 'J R R Tolkien  Book Boxed Set',
 'the fault in our stars']
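
The chain of loops above could be condensed into one helper. The following is only a sketch under stated assumptions: `slug_to_title` is a hypothetical name, and the third href is inferred from the `'40961427 1984'` comparison in the cleaning cell rather than copied from the scraped output. Unlike the original, it removes only the leading numeric ID, so digits that belong to a title would be preserved.

```python
import re

def slug_to_title(href):
    """Turn a Goodreads href into a readable title (hypothetical helper)."""
    slug = href.rsplit("/", 1)[-1]            # keep only the last path segment
    slug = re.sub(r"^\d+[.\-]", "", slug)     # drop the leading numeric ID and its separator
    return re.sub(r"[_\-]", " ", slug).strip()  # separators become spaces

hrefs = [
    "/book/show/2767052-the-hunger-games",
    "/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix",
    "/book/show/40961427-1984",  # inferred form of the 1984 entry
]
titles = [slug_to_title(h) for h in hrefs]
print(titles)
```

Because the behavior differs slightly from the original cell, titles that contain digits would come out differently, and any downstream counts would need to be rechecked.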

In [7]:
# figure out how many titles in the Goodreads list are also
# on the list of top-selling books from Wikipedia

best_sellers_goodreads = []

# `top_sellers` is one long string, so `in` is a case-sensitive substring test
for item in goodreads_titles :
    if item in top_sellers :
        best_sellers_goodreads.append(item)

In [8]:
len(best_sellers_goodreads)

26

In total, 26 of the 100 books on the Goodreads Best Books list sold at least 10 million copies worldwide.
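
One caveat: the membership test is a case-sensitive substring match, so a lowercase title such as `'the hunger games'` may be missed even if the Wikipedia text contains it with capitals. A minimal sketch of a case-insensitive variant follows; `titles_in_text` and the sample text are hypothetical, and I have not rerun the count with it.

```python
def titles_in_text(titles, text):
    """Return the titles that occur in `text`, ignoring letter case."""
    haystack = text.casefold()
    return [t for t in titles if t.casefold() in haystack]

# Hypothetical stand-in for the contents of the Wikipedia file
sample_text = "The Hunger Games | Suzanne Collins | English | 2008"
sample_titles = ["the hunger games", "Twilight"]

case_sensitive = [t for t in sample_titles if t in sample_text]
case_insensitive = titles_in_text(sample_titles, sample_text)
print(case_sensitive, case_insensitive)
```

Substring matching can also over-match short titles (for example, a one-word title appearing inside a longer one), so the results of either variant deserve a manual check.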

Now I will see how many of those books were also reviewed by the New York Times. I know from my data set shares notebook that 31 books on the Goodreads 100 Best Books list were also reviewed by the NYT.

In [9]:
# insert list of NYT/Goodreads overlap from the data shares jupyter notebook
# (note: I just copied and pasted here)

nyt_goodreads_overlap = (
'To%20Kill%20a%20Mockingbird',
 'Twilight',
 'The%20Book%20Thief',
 'the%20fault%20in%20our%20stars',
 'The%20Da%20Vinci%20Code',
 'Memoirs%20of%20a%20Geisha',
 'divergent',
 'Crime%20and%20Punishment',
 'The%20Little%20Prince',
 'City%20of%20Bones',
 'the%20help',
 'Brave%20New%20World',
 'A%20Thousand%20Splendid%20Suns',
 'the%20lovely%20bones',
 'The%20Odyssey',
 'Life%20of%20Pi',
 'Water%20for%20Elephants',
 'The%20Handmaid%27s%20Tale',
 'dune',
 'Little%20Women',
 'Harry%20Potter%20and%20the%20Deathly%20Hallows',
 'The%20Stand',
 'anna%20karenina',
 'The%20Girl%20with%20the%20Dragon%20Tattoo',
 'My%20Sister%27s%20Keeper',
 'the%20color%20purple',
 'The%20Road',
 'Angela%27s%20Ashes',
 'Don%20Quixote',
 'the%20notebook',
 'A%20Prayer%20for%20Owen%20Meany')

In [10]:
# clean up the overlap list 

clean_overlap = []

for item in nyt_goodreads_overlap :
    if '%20' in item :
        clean_overlap.append(item.replace('%20',' '))
    else :
        clean_overlap.append(item)
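
Note that replacing only `%20` leaves other percent-escapes in place, such as the `%27` (apostrophe) in "The%20Handmaid%27s%20Tale". A small sketch using the standard library's `urllib.parse.unquote`, which decodes every escape; the three sample strings are taken from the list above.

```python
from urllib.parse import unquote

encoded = ("The%20Handmaid%27s%20Tale", "the%20help", "dune")

replace_only = [t.replace("%20", " ") for t in encoded]  # approach used above
decoded = [unquote(t) for t in encoded]                   # decodes %27 as well

print(replace_only[0])
print(decoded[0])
```

Whether this changes the overlap depends on how apostrophes appear in the cleaned Goodreads titles, which I have not checked here.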

In [11]:
# compare the Goodreads books that were reviewed by the NYT to the Goodreads books 
# that sold at least 10M copies

nyt_reviewed_bs = []

for item in clean_overlap : 
    if item in best_sellers_goodreads :
        nyt_reviewed_bs.append(item)
    else :
        pass

In [12]:
nyt_reviewed_bs

['To Kill a Mockingbird',
 'The Book Thief',
 'The Da Vinci Code',
 'The Little Prince',
 'Life of Pi',
 'Harry Potter and the Deathly Hallows',
 'The Girl with the Dragon Tattoo']

In total, only 7 books were 1) voted one of the top 100 best books by the Goodreads community, 2) reviewed by the NYT, and 3) sold at least 10M copies. Those books - which both appeal to the masses and are worthy of critical review - include the following: 
- To Kill a Mockingbird
- The Book Thief
- The Da Vinci Code
- The Little Prince
- Life of Pi
- Harry Potter and the Deathly Hallows
- The Girl with the Dragon Tattoo
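
The three-way filter in Part 1 amounts to a set intersection. Here is a minimal sketch with small hypothetical sample lists (not the full lists used above); `normalize` is an assumed helper that makes titles comparable regardless of case or spacing.

```python
def normalize(title):
    """Casefold and collapse whitespace so equal titles compare equal."""
    return " ".join(title.casefold().split())

# Small hypothetical samples, not the full lists used in the notebook
goodreads = ["To Kill a Mockingbird", "Twilight", "Life of Pi"]
nyt_reviewed = ["life of pi", "Twilight", "The Road"]
ten_million_sellers = ["Life of Pi", "To Kill a Mockingbird"]

common = (
    {normalize(t) for t in goodreads}
    & {normalize(t) for t in nyt_reviewed}
    & {normalize(t) for t in ten_million_sellers}
)
print(sorted(common))
```

Sets require exact equality after normalization, which is stricter than the substring test used against the Wikipedia text.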

### Part 2: Text Analysis of NYT Review of Most Popular Books

I use the information from the Wikipedia dataset to garner a bit more information on each of the top books identified in part 1 (please see the Wikipedia notebook in the Data Set Shares repo for this data source). 

The Little Prince sold the most copies (100M copies), followed by The Da Vinci Code, Harry Potter and the Deathly Hallows, To Kill a Mockingbird, The Girl with the Dragon Tattoo, The Book Thief, and Life of Pi (10M copies).

All books but two were originally printed in English; The Little Prince was printed in French and The Girl with the Dragon Tattoo was published in Swedish. 

Five books were published between 2001 and 2007. The Little Prince was published in 1943 and To Kill a Mockingbird was published in 1960. 

For my text analysis, I will analyze the five books that were published after 2000, as that will give us more of a "modern-day" interpretation of the literature by the critics. Those books are (in order of most recently published):
- Harry Potter and the Deathly Hallows
- The Girl with the Dragon Tattoo
- The Book Thief
- The Da Vinci Code
- Life of Pi

I will clean each text and report the following:
- total number of tokens 
- unique tokens
- average token length
- lexical diversity
- top ten words used in each review
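
To make these statistics concrete, here is a toy worked example on a six-word sentence. It is only an illustration: it skips the stopword removal that the notebook applies, and the sentence is made up.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()

total_tokens = len(tokens)                   # 6
unique_tokens = len(set(tokens))             # 5, because 'the' repeats
avg_token_length = sum(len(t) for t in tokens) / total_tokens
lexical_diversity = unique_tokens / total_tokens
top_words = Counter(tokens).most_common(1)

print(total_tokens, unique_tokens, avg_token_length, lexical_diversity, top_words)
```

Lexical diversity here is simply unique tokens divided by total tokens, so a value of 1.0 would mean no token is ever repeated.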

In [13]:
# first, read in all five bodies of text 
# The Harry Potter book has three reviews to read in; the Book Thief has two
# The other books only have one review each

# read in the easy files first 

life_of_pi = open('Life of Pi.txt','r').read()
da_vinci_code = open('The Da Vinci Code.txt','r').read()
dragon_tattoo = open('The Girl with the Dragon Tattoo.txt','r').read()

In [14]:
# read in the Harry Potter Reviews

filenames = ['Harry Potter 1.txt', 'Harry Potter 2.txt', 'Harry Potter 3.txt']

with open('Harry Potter.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

harry_potter = open('Harry Potter.txt','r').read()
                

In [15]:
# read in The Book Thief Reviews

filenames = ['The Book Thief 1.txt', 'The Book Thief 2.txt']

with open('The Book Thief.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

book_thief = open('The Book Thief.txt','r').read()
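
The two cells above write a combined file and then read it back. A possible alternative, sketched below, joins the reviews in memory. `read_reviews` is a hypothetical helper, the UTF-8 encoding is an assumption about the review files, and the demo uses temporary files so that it runs without the notebook's data.

```python
import tempfile
from pathlib import Path

def read_reviews(filenames, encoding="utf-8"):
    """Read several review files and return their text joined by newlines."""
    return "\n".join(Path(name).read_text(encoding=encoding) for name in filenames)

# Demo with temporary stand-in files
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, body in enumerate(["first review", "second review"], start=1):
        path = Path(tmp, f"review_{i}.txt")
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    combined = read_reviews(paths)

print(combined)
```

Joining with an explicit newline also guards against the last word of one file fusing with the first word of the next when a file lacks a trailing newline.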

In [16]:
# use the patterns function to pull each of these statistics

def get_patterns(text,num_words):
       
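    # clean the text: split on whitespace, change to lowercase, and drop stopwords
    # NOTE: `w.isalpha` is referenced without being called, so it is always truthy;
    # tokens with punctuation (e.g. '--') are therefore kept, as the outputs below show
    clean_text_split = [w.lower() for w in text.split()]
    clean_text = [w for w in clean_text_split if w.isalpha and w not in sw]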
    
    # report the total number of tokens
    total_tokens = len(clean_text)
    
    # report the total number of unique tokens
    unique_tokens = len(set(clean_text))
    
    # report the average token length
    clean_text_token_len = [len(w) for w in clean_text]
    avg_token_len = np.mean(clean_text_token_len)
    
    # report the lexical diversity
    lex_diversity = len(set(clean_text))/len(clean_text)
    
    # report the `num_words` most used tokens (returned under the key 'top_10')
    clean_text_counter = Counter(clean_text)
    top_10 = clean_text_counter.most_common(num_words)
    
    results = {'tokens':total_tokens,
               'unique_tokens':unique_tokens,
               'avg_token_length':avg_token_len,
               'lexical_diversity':lex_diversity,
               'top_10':top_10}
    
    return(results)  
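
If punctuation-only tokens such as `'--'` or quote-wrapped tokens such as `"''holy"` are unwanted, the cleaning step could strip punctuation and actually call `isalpha()`. The sketch below is an assumption-laden illustration, not the function used for the results in this notebook: `tokenize` and the tiny `STOPWORDS` set are hypothetical (the notebook uses NLTK's English list, `sw`), and the sample sentence is made up.

```python
import string
from collections import Counter

# Tiny hypothetical stopword list; the notebook uses NLTK's English list (`sw`)
STOPWORDS = {"the", "a", "of", "and", "is"}

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase, strip surrounding punctuation, keep alphabetic non-stopwords."""
    # ASCII punctuation plus curly quotes and the em dash
    strip_chars = string.punctuation + "\u201c\u201d\u2018\u2019\u2014"
    stripped = (w.lower().strip(strip_chars) for w in text.split())
    return [w for w in stripped if w.isalpha() and w not in stopwords]

sample = "''The Da Vinci Code'' is a book -- and \"Holy Blood, Holy Grail\" is a book."
tokens = tokenize(sample)
counts = Counter(tokens)
print(tokens)
```

One trade-off: words with internal apostrophes or hyphens fail `isalpha()` and are dropped entirely. Applying this to the reviews would change the token counts and top words reported below.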

In [17]:
get_patterns(life_of_pi, 10)

{'tokens': 858,
 'unique_tokens': 684,
 'avg_token_length': 6.462703962703963,
 'lexical_diversity': 0.7972027972027972,
 'top_10': [('story', 9),
  ('pi', 9),
  ('new', 9),
  ('--', 9),
  ('tiger', 6),
  ('one', 6),
  ('book', 6),
  ('reading', 5),
  ('books', 4),
  ('animal', 4)]}

In [18]:
get_patterns(da_vinci_code,10)

{'tokens': 874,
 'unique_tokens': 690,
 'avg_token_length': 6.821510297482837,
 'lexical_diversity': 0.7894736842105263,
 'top_10': [('da', 11),
  ('vinci', 11),
  ('holy', 9),
  ('new', 9),
  ('books', 7),
  ("''holy", 7),
  ('blood,', 7),
  ("''the", 6),
  ('one', 6),
  ('book', 6)]}

In [19]:
get_patterns(dragon_tattoo,10)

{'tokens': 529,
 'unique_tokens': 431,
 'avg_token_length': 6.546313799621928,
 'lexical_diversity': 0.8147448015122873,
 'top_10': [('book', 5),
  ('swedish', 5),
  ('blomkvist', 5),
  ('“girl”', 5),
  ('reading', 4),
  ('story', 4),
  ('henrik', 4),
  ('.', 4),
  ('site', 3),
  ('vanished', 3)]}

In [20]:
get_patterns(harry_potter,10)

{'tokens': 2349,
 'unique_tokens': 1525,
 'avg_token_length': 6.558535547041294,
 'lexical_diversity': 0.6492124308216263,
 'top_10': [('—', 30),
  ('story', 17),
  ('reading', 16),
  ('harry', 16),
  ('book', 13),
  ('.', 13),
  ('new', 12),
  ('main', 11),
  ('us', 11),
  ('continue', 10)]}

In [21]:
get_patterns(book_thief,10)

{'tokens': 1573,
 'unique_tokens': 1017,
 'avg_token_length': 6.341385886840432,
 'lexical_diversity': 0.6465352828989193,
 'top_10': [('book', 36),
  ('liesel', 20),
  ('"the', 15),
  ('new', 13),
  ('books', 13),
  ('death', 12),
  ('reading', 10),
  ('thief"', 10),
  ('story', 9),
  ('max', 9)]}

The single-review texts run roughly 500-900 tokens after stopword removal, which makes sense seeing as they're all written for the same newspaper (keep in mind that The Book Thief text contains two reviews and the Harry Potter text contains three; on average, each of those reviews is about 800 tokens).

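The average token length falls between about 6.3 and 6.8 for all books, and the lexical diversity ranges from about 0.65 to 0.81, which suggests that all of the reviews are written with roughly the same level of complexity, regardless of what audience the book is intended to entertain (e.g., teens vs. adults). Note that the two lowest diversity values (about 0.65) belong to the concatenated Harry Potter and Book Thief texts; lexical diversity tends to fall as a text gets longer, so those values are not directly comparable with the single-review figures.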
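
Because the texts differ in length (529 to 2,349 tokens), one way to put their lexical diversity on equal footing is to compute it over the same number of leading tokens in each text. This is a sketch of that idea only; `ttr_at` is a hypothetical helper, the toy token lists are made up, and the notebook's reported values were not computed this way.

```python
def ttr_at(tokens, n):
    """Type-token ratio over only the first `n` tokens (n must be at least 1)."""
    window = tokens[:n]
    return len(set(window)) / len(window)

short_text = "a b c d".split()
long_text = "a b c d a b c d".split()  # same vocabulary, twice as long

full_ttr_short = len(set(short_text)) / len(short_text)
full_ttr_long = len(set(long_text)) / len(long_text)
equal_length_ttr_long = ttr_at(long_text, len(short_text))

print(full_ttr_short, full_ttr_long, equal_length_ttr_long)
```

In the toy data, the longer text looks half as diverse on the raw measure even though it uses exactly the same words; truncating to a common length removes that effect.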

The most used words are predictable for several of the books. For example, it makes sense that "pi" and "tiger" would be among the top 10 words for Life of Pi. It also makes sense that "holy" and "blood" would show up in The Da Vinci Code and that Harry Potter reviews would mention "Harry" frequently.

There are some unexpected words, though. I am surprised to see that the review of The Girl with the Dragon Tattoo mentions Henrik and Blomkvist, male characters in the story, multiple times, but Lisbeth, the name of the female protagonist, does not appear in the top 10 words. Since "Swedish" is also in the top 10 words, the review may have been focused more on the role of men in Swedish society, or how they are portrayed in the book, rather than on the development of the main character. 
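
A top-10 list can hide a name that appears under several forms (the protagonist's full name is Lisbeth Salander) or attached to punctuation. One way to check is to count specific names directly. The sketch below uses a hypothetical `count_mentions` helper and a made-up sentence, not the actual review text.

```python
import re
from collections import Counter

def count_mentions(text, names):
    """Count how often each name occurs as a standalone word, ignoring case."""
    words = re.findall(r"[a-z]+", text.lower())  # ASCII letters only
    counts = Counter(words)
    return {name: counts[name.lower()] for name in names}

sample = "Blomkvist meets Salander. Later, Blomkvist's editor calls Henrik."
mentions = count_mentions(sample, ["Blomkvist", "Salander", "Lisbeth", "Henrik"])
print(mentions)
```

Running something like this on `dragon_tattoo` with both "Lisbeth" and "Salander" would show whether the protagonist is truly mentioned less often than the male characters.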

This was not the case for The Book Thief, where the main protagonist - Liesel - is mentioned often. Her mentions far outnumber those of the second main character, Max, which indicates that the reviews may have focused on Liesel as their primary subject.