# <font color=blue> SCRAPE GOODREADS BOOKS AND REVIEWS </font>

### 1.0 INTRODUCTION

##### This notebook contains the code for ```Part I``` of an end-to-end project on sentiment analysis of reviews of Data Science books.

##### The motivation for this project came from my need to answer the question "Which Data Science book(s) should I read?". While thinking about it I came across similar projects, which inspired me to broaden the scope of the project. In this project I will be doing the following:

* Scrape basic data on 'Data Science' books from Goodreads.
* Do some EDA around cost per book, comparing other book features against cost as the reference property.
* Explore the variety of topics available on the subject and its related subjects.
* Scrape reviews and stars ✨ in a second scraping run.
* Compare the results of sentiment analysis from different models against the true ratings, both per review comment and for the overall book rating.

#### 1.1 IMPORT PACKAGES & DEPENDENCIES

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep

from datetime import datetime
import time


from tqdm import tqdm

# import image module
from IPython.display import Image

In [2]:
#Define headers property needed by the requests library for scraping
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
    'Accept-Language': 'en-US, en;q=0.5'
}

### 2.0 STEP-WISE DATA COLLECTION

#### 2.1 Collect Basic Information

In [3]:
#Declare variables describing the URL/page to be scraped

search_query = 'data science'.replace(' ', '+')
page_num=1
base_url = f'https://www.goodreads.com/search?page={page_num}&qid=lolPVmy4Sz&query={search_query}&tab=books&utf8=%E2%9C%93'#.format(search_query)

base_url

'https://www.goodreads.com/search?page=1&qid=lolPVmy4Sz&query=data+science&tab=books&utf8=%E2%9C%93'
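The same search URL can also be assembled with the standard library instead of manual string replacement. The sketch below is one possible approach, not the method used in this notebook; the helper name `build_search_url` is hypothetical, and the `qid` value is simply copied from the URL above on the assumption that it stays valid.

```python
from urllib.parse import urlencode

def build_search_url(query, page_num, qid="lolPVmy4Sz"):
    """Build a Goodreads search URL; urlencode handles spaces and the check mark."""
    params = {
        "page": page_num,
        "qid": qid,      # copied from the URL above, assumed to stay valid
        "query": query,  # spaces are encoded as '+'
        "tab": "books",
        "utf8": "✓",     # percent-encoded as %E2%9C%93
    }
    return "https://www.goodreads.com/search?" + urlencode(params)

print(build_search_url("data science", 1))
```

Building the URL from a dictionary keeps each parameter visible and avoids hand-written escaping when the search term changes.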

***
THE BOOK SHELF
***

* All books are housed within the table object, as seen in the image below.
* Each book object (with its properties) is stored in a ```tr``` tag.

<img src="images\books_shelve1.png" width="900" height="500">

In [4]:
#collect content of desired page
firstpage_content = requests.get(base_url, headers=headers)

#parse the html content
soup = BeautifulSoup(firstpage_content.text, 'lxml')

#book shelf: every 'tr' in the results table is one book
books_shelve = soup.find('table').find_all('tr')

#print properties of the first book on the shelf
book = books_shelve[0]

book.text

'\n\n\n\n\n \n\n\nData Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking\n \nby\n\n\nFoster Provost, \n\n\nTom Fawcett\n\n\n\n\n\n 4.13 avg rating — 2,337 ratings\n              —\n                published\n               2013\n              —\n              23 editions\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWant to Read\nsaving…\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWant to Read\n\n\n\n\nCurrently Reading\n\n\n\n\nRead\n\n\n\n\n\n\n\n\nError rating book. Refresh and try again.\n\nRate this book\nClear rating\n1 of 5 stars2 of 5 stars3 of 5 stars4 of 5 stars5 of 5 stars\n\n\n\n\nGet a copy\n\n\n\n\n'

***
From the raw content above we see that each book object on the shelf provides the following basic info about the book:
1. Book title
2. Book author & co-author(s)
3. Publication information
4. Overall book rating

The next cell shows how to neatly collect this information from the book objects.
***

<img src="images\basic_book_details.png" width="900" height="900">

#### 2.2 Extract the Basic Info from ```Soup``` of Book Overview

In [5]:
#book title and author
book_title = book.a['title']

author = book.find_all('span')[2].text

#collects all authors and inserts in a list
authors = book.find_all('div', {'class': 'authorName__container'})#.text
author_names = [a.text.strip(', \n') for a in authors]

#rating info
rating = book.find('span', {'class': 'minirating'}).text[0:5]
rating_count = book.find('span', {'class': 'minirating'}).text.split(' — ')[1]

#publication info
editions = book.find('a', {'class': 'greyText'}).text

#view some of the extracted details
print(book_title, '\n', author_names, '\n', rating, '\n', editions)

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking 
 ['Foster Provost', 'Tom Fawcett'] 
  4.13 
 23 editions
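Slicing the `minirating` text by position leaves a leading space in the rating and keeps the count as a string such as `2,337 ratings`. As a hedged alternative, the sketch below parses both values with a regular expression. The helper name `parse_minirating` is hypothetical, and the pattern assumes the text keeps the `avg rating — N ratings` layout seen in the raw output above.

```python
import re

# Assumes the layout "<number> avg rating — <count> ratings" seen in the raw text
MINIRATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+avg rating\s+—\s+([\d,]+)\s+rating")

def parse_minirating(text):
    """Return (average_rating, ratings_count), or (None, None) if the text does not match."""
    match = MINIRATING_PATTERN.search(text)
    if match is None:
        return None, None
    return float(match.group(1)), int(match.group(2).replace(",", ""))

print(parse_minirating(" 4.13 avg rating — 2,337 ratings"))
```

Returning numbers instead of strings saves a cleaning step later, at the cost of depending on the exact wording of the page.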


#### 2.3 Proceed to Getting In-Depth Book Details

##### 2.3.1 Extract ```href``` Property and Get the Details Page URL per Book

***
An important property that accompanies the basic information, but is not shown in the content viewed above, is the ```href``` property.

The ```href``` property is a relative link that leads to in-depth details for each book, namely:

1. Book price (not all prices are available; some refer to Amazon)
2. Foreword (if available)
3. Book genre
4. Reviews & stars for each review, etc.

In the next cells the ```href``` property is combined with the ```home-page url``` to extract the deeper book details outlined above.
***

In [6]:
#get href property
href = book.a['href']
print(href)

/book/show/17912916-data-science-for-business?from_search=true&from_srp=true&qid=lolPVmy4Sz&rank=1


In [7]:
#combine with the home-page url to get the complete book address
book_url='https://www.goodreads.com' + href
print(book_url)

https://www.goodreads.com/book/show/17912916-data-science-for-business?from_search=true&from_srp=true&qid=lolPVmy4Sz&rank=1
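String concatenation works here because the `href` starts with a slash. A slightly more defensive option, shown as a sketch only, is `urllib.parse.urljoin`, optionally followed by dropping the search-tracking query string. The helper name `full_book_url` and the `keep_query` flag are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

HOME_URL = "https://www.goodreads.com"

def full_book_url(href, keep_query=True):
    """Join a relative href with the home-page URL, optionally dropping the query string."""
    absolute = urljoin(HOME_URL, href)
    if keep_query:
        return absolute
    parts = urlsplit(absolute)
    # Rebuild the URL with an empty query and fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

href = "/book/show/17912916-data-science-for-business?from_search=true&from_srp=true&qid=lolPVmy4Sz&rank=1"
print(full_book_url(href))
print(full_book_url(href, keep_query=False))
```

Dropping the query string gives one canonical URL per book, which can help when the URL is later used as a key for de-duplication.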


##### 2.3.2 Preview the ```html Soup``` for the Main Book Page

In [8]:
#fetch the html content for the book's extra details
bookdetails = requests.get(book_url, headers=headers)

#parse the html content
soup1 = BeautifulSoup(bookdetails.text, 'lxml')

#soup1.text

###### The bookdetails soup above contains all 8 features outlined above, including the first 30 reviews. Being a large soup, the result will be sliced so that we have a brief view.

***
As will be seen in a slice of the text/string output, we have more data (price, foreword, reviews, ratings distribution, etc.).
***

In [9]:
soup1.text[:1200]

'Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking by Foster Provost | GoodreadsHomeMy BooksBrowse ▾RecommendationsChoice AwardsGiveawaysNew ReleasesListsExploreNews & InterviewsLoading...Community ▾GroupsQuotesAsk the AuthorPeopleSign inJoinJump to ratings and reviewsWant to readKindle $25.49Rate this bookData Science for Business: What You Need to Know about Data Mining and Data-Analytic ThinkingFoster Provost, Tom Fawcett4.132,337\xa0ratings152\xa0reviewsWant to readKindle $25.49Rate this bookWritten by renowned data science experts Foster Provost and Tom Fawcett, Data Science for Business introduces the fundamental principles of data science, and walks you through the "data-analytic thinking" necessary for extracting useful knowledge and business value from the data you collect. This guide also helps you understand the many data-mining techniques in use today. Based on an MBA course Provost has taught at New York University over the past 

#### Important Notice⚠️
***
This page does not always return data on every call/request.

For this reason, book properties that require indexing or slicing operations will throw an error when the page fails to respond with data.

To deal with this, the code is wrapped in _```try-except```_ blocks, especially during iterations, so that a failed lookup does not break the run.
***
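Since the same `try-except` shape repeats for every property, it can be folded into a small helper. This is only a sketch of that idea; `safe_extract` is a hypothetical name, and the stand-in values below imitate what `soup.find(...)` (returns `None`) and `soup.find_all(...)` (returns an empty list-like result) give back when nothing matches.

```python
def safe_extract(extractor, default="Not Found"):
    """Call extractor() and return default if the expected element is missing."""
    try:
        return extractor()
    except (AttributeError, IndexError, TypeError):
        return default

# Stand-ins for a page that came back without the expected elements
missing_tag = None   # what soup.find(...) returns when nothing matches
empty_results = []   # what soup.find_all(...) returns when nothing matches

foreword = safe_extract(lambda: missing_tag.text)
price = safe_extract(lambda: empty_results[-1].text.split(" ")[1], default="Buys on Amazon")
print(foreword, "|", price)
```

Each property still fails independently, as in the separate blocks used in this notebook, but the fallback logic lives in one place.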

#### 2.4 Collect All Basic Book Data

In [10]:
#separate try-except blocks are used for different properties, as the data returned by each call varies randomly per property
#Price
try:
    price = soup1.find_all('div' , {'class':'BookActions__button'})[-1].text.split(' ')[1]
except IndexError:
    price = 'Buys on Amazon'

#Foreword
try:
    foreward = soup1.find('span' , {'class': 'Formatted', }).text
except AttributeError:
    foreward = 'Not Found'

#Book genres
try:
    genres = soup1.find('ul' , {'class': 'CollapsableList'}).text
except AttributeError:
    genres = 'Not Found'

try:
    num_pages=soup1.find('p' , {'data-testid':'pagesFormat'}).text.split(', ')[0]
except AttributeError:
    num_pages = 'Not Available'

try:
    publication_info_firstedition = soup1.find('p' , {'data-testid':'publicationInfo'}).text #may not always be available, hence the try-except block
except AttributeError:
    publication_info_firstedition = 'Not Available'

try:
    publication_info_firstedition1 = [i.text for i in soup1.find('div' , {'class':'FeaturedDetails'})]
except (AttributeError, TypeError):
    publication_info_firstedition1 = 'Not Available'

first30_reviews=soup1.find_all('div' , {'class':'TruncatedContent__text'})
reviews = [(rev.text, "\n") for rev in first30_reviews]



In [11]:
#view some of the extracted details
print(book_title)
print('\n')
print(f'Published {publication_info_firstedition1}')
print('\n')
print(f'Written by {author_names} with a rating of {rating} from {rating_count}')
print('\n')
print(f'Costs {price}')

Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking


Published ['413 pages, Paperback', 'First published January 1, 2013']


Written by ['Foster Provost', 'Tom Fawcett'] with a rating of  4.13 from 2,337 ratings


Costs $25.49


### 3.0 PUTTING IT ALL TOGETHER: SCRAPE ALL BOOKS ON ALL SHELVES (PAGES)

***
The next cell shows how to extract the overview data and the main book data for each book on every existing shelf/web page at the time of this scrape.

At the time of this scrape there were 101 pages of 'data science' related books on the Goodreads website.
***

In [12]:
#Iterate through the 101 webpages of Goodreads 'data science' books
items = []
search_query = 'data science'.replace(' ', '+')


#range() excludes its upper bound, so 102 is needed to reach page 101
for i in tqdm(range(1, 102)):
    start = time.time()
    
    #create a URL for each page with the search query
    page_url = f'https://www.goodreads.com/search?page={i}&qid=lolPVmy4Sz&query={search_query}&tab=books&utf8=%E2%9C%93'

    response = requests.get(page_url, headers=headers)
    sleep(5)
    soup = BeautifulSoup(response.text, 'lxml')
    
    #each page renders all books in a 'table' container where all details are captured in the 'tr' tags
    #wrap the search through the soup in a try-except block to capture exceptions and keep loop running
    try:
        results = soup.find('table').find_all('tr')
    except AttributeError:
        continue
    
    #iterate through each book in the 'tr' tag containing all books to fetch overview information
    for result in results:
            
        href = result.a['href']
        title = result.a['title']
        title1 = result.find_all('span')[0].text
        author = result.find_all('span')[2].text
        authors = result.find_all('div', {'class': 'authorName__container'})#.text
        author_names = [a.text.strip(', \n') for a in authors]
        rating = result.find('span', {'class': 'minirating'}).text[0:5]
        rating_count = result.find('span', {'class': 'minirating'}).text.split(' — ')[1]
        editions = result.find('a', {'class': 'greyText'}).text
        
        #load the html for each book to fetch drill down info, that is not available on the overview
        product_url='https://www.goodreads.com' + href
        sleep(2)
        response1 = requests.get(product_url, headers=headers)
        sleep(5)
        soup1 = BeautifulSoup(response1.text, 'lxml')
        
        #Foreward
        try:
            foreward = soup1.find('span' , {'class': 'Formatted', }).text
        except AttributeError:
            foreward = 'Not Found'
            
        try:
            foreward1 = soup1.find_all('div' , {'class':'TruncatedContent__text'})[0].text
        except IndexError:
            foreward1 = 'Not Found'
        
        #Book genres
        try:
            genres = soup1.find('ul' , {'class': 'CollapsableList'}).text
        except AttributeError:
            genres = 'Not Found'
        
        try:
            num_pages=soup1.find('p' , {'data-testid':'pagesFormat'}).text.split(', ')[0]
        except AttributeError:
            num_pages = 'Not Available'
            
        try:
            paperback = soup1.find('p' , {'data-testid':'pagesFormat'}).text
        except (AttributeError,IndexError):
            paperback = 'Not Available'
            
        #paperback=soup1.find('p' , {'data-testid':'pagesFormat'}).text.split(', ')[1]
        
        try:
            publication_info_firstedition = soup1.find('p' , {'data-testid':'publicationInfo'}).text #may not always be available, hence the try-except block
        except AttributeError:
            publication_info_firstedition = 'Not Available'
            
        try:
            publication_info_firstedition1 = [i.text for i in soup1.find('div' , {'class':'FeaturedDetails'})]
        except (AttributeError, TypeError):
            publication_info_firstedition1 = 'Not Available'
            
        first30_reviews=soup1.find_all('div' , {'class':'TruncatedContent__text'})
        reviews = [(rev.text, "\n") for rev in first30_reviews]
        #soup1.find('div' , {'class':'TruncatedContent__text', 'class':'TruncatedContent__text--small'})
        
        try:
            price = soup1.find_all('div' , {'class':'BookActions__button'})[-1].text.split(' ')[1]
        except IndexError:
            price = 'Buys on Amazon'
        
        try:
            price1 = [i.text for i in soup1.find_all('span' , {'class':'Button__labelItem'}) if '$' in i.text]
        except AttributeError:
            price1 = 'Not Found'
            
        try:
            rating1 = soup1.find('div' , {'class':'RatingStatistics__rating'}).text
        except AttributeError:
            rating1 = 'Not Found'
            
        try:
            rating_count1 = soup1.find('span' , {'data-testid':'ratingsCount'}).text
        except AttributeError:
            rating_count1 = 'Not Found'
        
        items.append([title, title1, author, author_names, genres, num_pages, paperback, rating, rating1, rating_count, rating_count1, 
                       editions, publication_info_firstedition, publication_info_firstedition1, foreward, foreward1, price, price1,
                       reviews, product_url])
    #duration = time.time() - start
    #print(f'Processed Page {i}... in {duration}')
            
    sleep(5)


100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [07:51<00:00, 235.96s/it]


In [13]:
#define column titles
cols = ['title', 'title1', 'author', 'author_names', 'genres', 'num_pages', 'paperback', 'rating', 'rating1', 'rating_count', 'rating_count1', 
                       'editions', 'publication_info_firstedition', 'publication_info_firstedition1', 'foreward', 'foreward1', 'price', 'price1',
                       'reviews', 'product_url']


#preview the scraped data as a dataframe
df = pd.DataFrame(items, columns=cols)
df.head()

Unnamed: 0,title,title1,author,author_names,genres,num_pages,paperback,rating,rating1,rating_count,rating_count1,editions,publication_info_firstedition,publication_info_firstedition1,foreward,foreward1,price,price1,reviews,product_url
0,Data Science for Business: What You Need to Kn...,Data Science for Business: What You Need to Kn...,"\n\nFoster Provost, \n\n\nTom Fawcett\n\n","[Foster Provost, Tom Fawcett]",Not Found,Not Available,Not Available,4.13,Not Found,"2,337 ratings",Not Found,23 editions,Not Available,Not Available,Not Found,Not Found,Buys on Amazon,[],[],https://www.goodreads.com/book/show/17912916-d...
1,Data Smart: Using Data Science to Transform In...,Data Smart: Using Data Science to Transform In...,\n\nJohn W. Foreman\n\n,[John W. Foreman],GenresBusinessNonfictionTechnologyProgrammingS...,409 pages,"409 pages, Paperback",4.11,4.11,988 ratings,988 ratings,5 editions,"First published October 31, 2013","[409 pages, Paperback, First published October...",Data Science gets thrown around in the press l...,Data Science gets thrown around in the press l...,$27.00,"[Kindle $27.00, Kindle $27.00]",[(Data Science gets thrown around in the press...,https://www.goodreads.com/book/show/17682206-d...
2,Data Science from Scratch: First Principles wi...,Data Science from Scratch: First Principles wi...,\n\nJoel Grus\n\n,[Joel Grus],GenresProgrammingComputer ScienceTechnologyNon...,330 pages,"330 pages, Kindle Edition",3.91,3.91,"1,018 ratings","1,018 ratings",25 editions,"First published April 14, 2015","[330 pages, Kindle Edition, First published Ap...","\nData science libraries, frameworks, modules,...","\nData science libraries, frameworks, modules,...",on,[],"[(\nData science libraries, frameworks, module...",https://www.goodreads.com/book/show/25407018-d...
3,"R for Data Science: Import, Tidy, Transform, V...","R for Data Science: Import, Tidy, Transform, V...","\n\nHadley Wickham (Goodreads Author), \n\n\nG...","[Hadley Wickham (Goodreads Author), Garrett Gr...",Not Found,Not Available,Not Available,4.56,Not Found,"1,062 ratings",Not Found,17 editions,Not Available,Not Available,Not Found,Not Found,Buys on Amazon,[],[],https://www.goodreads.com/book/show/33399049-r...
4,Machine Learning For Absolute Beginners: A Pla...,Machine Learning For Absolute Beginners: A Pla...,\n\nOliver Theobald\n\n,[Oliver Theobald],GenresTechnologyNonfictionArtificial Intellige...,168 pages,"168 pages, Kindle Edition",4.13,4.13,361 ratings,361 ratings,1 edition,"Published June 21, 2017","[168 pages, Kindle Edition, Published June 21,...",To buy the newest edition of this book (2021)...,To buy the newest edition of this book (2021)...,Unlimited,"[Kindle Unlimited $0.00, Kindle Unlimited $0.00]",[( To buy the newest edition of this book (202...,https://www.goodreads.com/book/show/35518108-m...
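In the preview above, the `price` column holds `on` and `Unlimited` for some rows, because splitting the button text on spaces and taking the second word only works for labels like `Kindle $27.00`. A hedged alternative is to search the label for a dollar amount. The helper name `extract_price` is hypothetical, and the `Buy on Amazon` label is an assumed example of a button without a price.

```python
import re

# Matches a dollar sign followed by digits and optional cents
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d{1,2})?)")

def extract_price(label):
    """Return the first dollar amount in label as a float, or None if there is none."""
    match = PRICE_PATTERN.search(label)
    return float(match.group(1)) if match else None

# The first two labels appear in the price1 column above; the third is assumed
for label in ["Kindle $27.00", "Kindle Unlimited $0.00", "Buy on Amazon"]:
    print(label, "->", extract_price(label))
```

Returning `None` for labels without a price keeps missing values distinguishable from genuine prices such as `$0.00`.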


In [14]:
#create a date_time signature to save each scraped data with
current_dateTime = datetime.now()
month = current_dateTime.month
day = current_dateTime.day
hour = current_dateTime.hour
minute = current_dateTime.minute
seconds = current_dateTime.second
file_name = f'{search_query}_{month}_{day}_{hour}_{minute}'

#save dataframe to csv
df.to_csv('{0}.csv'.format(file_name), index=False)
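The file name above omits the year and does not zero-pad its parts, so saved files do not sort chronologically. A possible variation, given here as a sketch with the hypothetical helper `timestamped_name`, uses `strftime`; the date in the example is illustrative only.

```python
from datetime import datetime

def timestamped_name(prefix, moment=None):
    """Return '<prefix>_<YYYYMMDD>_<HHMMSS>' for the given time, or for now if none is given."""
    moment = moment or datetime.now()
    return f"{prefix}_{moment.strftime('%Y%m%d_%H%M%S')}"

# Illustrative fixed date so the result is reproducible
print(timestamped_name("data+science", datetime(2023, 9, 18, 13, 11, 0)))
```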

In [15]:
#Have a view of the initial scrape summary info.

first_batch = pd.read_excel('data_files/data+science_9_18_13_11_page70.xlsx') #the iteration was interrupted by a network response delay
second_batch = pd.read_excel('data_files/data+science_9_18_15_9_page71+.xlsx') #this is the second batch

initial_scrape = pd.concat([first_batch, second_batch], ignore_index=True, axis=0)

initial_scrape.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1886 entries, 0 to 1885
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   title                           1886 non-null   object
 1   title1                          1886 non-null   object
 2   author                          1886 non-null   object
 3   author_names                    1886 non-null   object
 4   genres                          1860 non-null   object
 5   num_pages                       1886 non-null   object
 6   paperback                       1886 non-null   object
 7   rating                          1886 non-null   object
 8   rating1                         1886 non-null   object
 9   rating_count                    1886 non-null   object
 10  rating_count1                   1886 non-null   object
 11  editions                        1886 non-null   object
 12  publication_info_firstedition   1886 non-null   

#### 3.1 OBSERVATIONS FROM SCRAPING ALL PAGES 📝

1. Some books appear multiple times. This may be due to multiple editions that are not properly differentiated.
2. The random nature of the response from each call on the shelves/pages or book URL makes it necessary to individually re-scrape some properties with a lot of missing data. Only the _Book Title_, _Author Name(s)_ & _product_URL_ columns had zero missing/unreported data, while other attributes had only partial responses.
3. Some books have unknown authors and random data.
4. Adding a few sleep calls within the loop improved the response from each page call.
5. It takes an average of __225 seconds__ (excluding sleep duration) to scrape each page for all the data collected in the previous code cell.
6. Another iteration needs to be run for these columns to fill the gaps in the initial scrape.
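Observations 2 and 4 suggest that simply asking again after a pause often returns the missing data. One way to express that, sketched below under stated assumptions, is a small retry helper. The names `fetch_with_retries`, `fetch`, and `is_complete` are hypothetical, and the demonstration uses a fake fetcher so that it runs offline.

```python
import time

def fetch_with_retries(fetch, url, is_complete, max_attempts=3, wait_seconds=5, sleep=time.sleep):
    """Call fetch(url) until is_complete(result) is true or the attempts run out.

    Returns the last result, which may still be incomplete.
    """
    result = None
    for attempt in range(1, max_attempts + 1):
        result = fetch(url)
        if is_complete(result):
            break
        if attempt < max_attempts:
            sleep(wait_seconds * attempt)  # wait a little longer after each failure
    return result

# Offline demonstration with a fake fetcher that only succeeds on the third call
calls = {"count": 0}

def fake_fetch(url):
    calls["count"] += 1
    return "<table></table>" if calls["count"] >= 3 else ""

page = fetch_with_retries(fake_fetch, "https://example.com", is_complete=bool, sleep=lambda s: None)
print(calls["count"], repr(page))
```

In this notebook, `fetch` could wrap `requests.get(url, headers=headers).text` and `is_complete` could check that the expected element is present; both choices are assumptions rather than part of the original scrape.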

### 4.0 IMPROVING THE INITIAL SCRAPE 🛠️

<a id="4.1"></a>
#### 4.1 RERUN REQUESTS TO RESCRAPE INCOMPLETE COLUMNS

To improve the initial scrape the dataframe will be filtered, preferably one column at a time, for rows with missing data. The rows will be updated by making another request using the unique _product_URL_. With the updated row appended to the dataframe, the old row with the missing feature will be dropped.

The function in the next cells accomplishes this.

In [3]:
#Filter for rows with incomplete data in the chosen column
df_url_all = pd.read_excel('auto_updated.xlsx')
df_url_all = df_url_all[df_url_all['five_star']=='Not Found']# | (df_url['reviews']=='[]') | (df_url['foreward']=='Not Found') |(df_url['rating']=='')]
df_url_all = df_url_all['product_url']
len(df_url_all)#.head(3)

6

In [None]:
#Re-run for incompletely scraped columns
retry_result = []
for i in tqdm(df_url_all):
    response2 = requests.get(i, headers=headers)
    sleep(4)
    soup2 = BeautifulSoup(response2.text, 'lxml')
    
    # try:
    #     rating1 = soup2.find('div' , {'class':'RatingStatistics__rating'}).text
    # except AttributeError:
    #     rating1 = 'Not Found'
    
    # try:
    #     foreward = soup2.find('span' , {'class': 'Formatted', }).text
    # except AttributeError:
    #     foreward = 'Not Found'
    
#     first30_reviews=soup2.find_all('div' , {'class':'TruncatedContent__text'})
#     reviews = [(rev.text, "\n") for rev in first30_reviews]
    
    # try:
    #     publication_info_firstedition1 = [i.text for i in soup2.find('div' , {'class':'FeaturedDetails'})]
    # except (AttributeError, TypeError):
    #     publication_info_firstedition1 = 'Not Found'
    
    try:
        num_pages=soup2.find('p' , {'data-testid':'pagesFormat'}).text.split(', ')[0]
    except AttributeError:
        num_pages = 'Not Found'
    
    # try:
    #     price = soup2.find_all('div' , {'class':'BookActions__button'})[-1].text.split(' ')[1]
    # except IndexError:
    #     price = 'Amazon'

    # try:
    #     price1 = [i.text for i in soup2.find_all('span' , {'class':'Button__labelItem'}) if '$' in i.text]
    # except AttributeError:
    #     price1 = 'Amazon'
    
#     try:
#         rating1 = soup2.find('div' , {'class':'RatingStatistics__rating'}).text
#     except AttributeError:
#         rating1 = 'Not Found'
    
#     try:
#         genres = soup2.find('ul' , {'class': 'CollapsableList'}).text
#     except AttributeError:
#         genres = 'Not Found'


    #data = [price, price1, rating1, num_pages, genres, foreward, publication_info_firstedition1, reviews, i]
    #retry_result.append([five_stars, four_stars,three_stars,two_stars,one_stars,price,i])
    retry_result.append([num_pages,i])

In [None]:
#preview result
pd.DataFrame(retry_result)#.head()

In [None]:
def _make_changes(new_list, old_file_path, col_name, marker):
    """
    This function is used to update the original file with the new values
    from the re-scrapes process done for each column...
    
    Parameters
    ----------
    new_list : List object
        The new list to be used to update specified columns on desired file.
    old_file_path : str
        The file path to the file whose column/columns needs to be updated.
    col_name : str
        The specific column to be updated at the instance.
    marker : str
        The content in the column which indicates rows/cells to be updated.
        
    Returns
    -------
    dataframe
        The modified dataframe.
    """
    #make the new values into a dataframe and rename the columns appropriately
    new_col_names = [f'new_{col_name}', 'product_url']
    new_val_df = pd.DataFrame(new_list, columns=new_col_names)
    
    #read in the file to be updated/edited as 'tbe'
    try:
        df_tbe = pd.read_excel(old_file_path)
    except (ValueError,TypeError):
        df_tbe = pd.read_csv(old_file_path, encoding = 'ISO-8859-1')
    
    #slice out the rows that require update from the specified column.
    df_be = df_tbe[df_tbe[col_name]==marker].copy()  #copy so the assignment below does not act on a view of df_tbe
    
    #convert the new values to be inserted into a list
    updated_column_vals = new_val_df.copy()[f'new_{col_name}'].to_list()
    #make the changes on the sliced out dataframe
    df_be[col_name] = updated_column_vals
    
    #concatenate the slice and the main dataframe which now contains duplicates of rows that have been updated
    both = pd.concat([df_tbe, df_be], axis=0)
    
    #drop the duplicates, keeping the newly appended row
    cleaned = both.drop_duplicates(subset='product_url',
                     keep='last')
    
    return cleaned
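The function above assigns the new values by position, so it relies on the re-scraped list being in the same order, and of the same length, as the filtered rows. A hedged alternative is to match on `product_url` instead. The sketch below uses a hypothetical helper `fill_by_url` and a tiny made-up dataframe, with placeholder URLs `u1` to `u3`, purely for illustration.

```python
import pandas as pd

def fill_by_url(df, new_values, col_name, marker="Not Found"):
    """Return a copy of df where marker cells in col_name are replaced by
    re-scraped values matched on product_url."""
    out = df.copy()
    # new_values holds [value, product_url] pairs, like retry_result above
    lookup = {url: value for value, url in new_values}
    needs_update = (out[col_name] == marker) & out["product_url"].isin(lookup)
    out.loc[needs_update, col_name] = out.loc[needs_update, "product_url"].map(lookup)
    return out

# Made-up data for illustration only
books = pd.DataFrame({
    "product_url": ["u1", "u2", "u3"],
    "num_pages": ["409 pages", "Not Found", "Not Found"],
})
retry_result = [["330 pages", "u2"]]
updated = fill_by_url(books, retry_result, "num_pages")
print(updated["num_pages"].tolist())
```

Rows whose URL is absent from the retry list keep their marker, so the function can be called repeatedly as more retries succeed.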

In [None]:
#declare variables for the function, or update per column being worked on
new_list = retry_result
old_file_path = 'auto_updated.xlsx'
col_name = 'num_pages'
marker = 'Not Found'

#execute update and check if some/all rows of the interested columns have been updated
df_updated = _make_changes(new_list=new_list, old_file_path=old_file_path, col_name=col_name, marker=marker)

diff = len(df_url_all) - len(df_updated[df_updated[col_name]==marker])
remnant = len(df_updated[df_updated[col_name]==marker])

print(f'{diff} rows have been updated, {remnant} more need updating')

In [None]:
#save the newly updated file to Excel, overwriting the previous version
df_updated.to_excel('auto_updated.xlsx', index=False)

[Return to section 4.1 here and iterate until the field/column is completely filled](#4.1)

### 5.0 SCRAPING REVIEWS

It is necessary to scrape reviews separately from other data due to the structure required for the analysis to be carried out; each review is independent regardless of the book being reviewed. The interest here is the review text and the _star_ rating that comes with each.

The book URLs from the previous scraping step will be used here.

***
First the absolute book address is extracted from the full URL.

For instance, _URL_1_ below is the default address for the book called '100 Puzzles and Case Studies To Crack Data Science Interview' while _URL_2_ is the URL for the reviews page of the same book.

1. https://www.goodreads.com/book/show/35296800-100-puzzles-and-case-studies-to-crack-data-science-interview
2. https://www.goodreads.com/book/show/35296800/reviews
***
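The mapping from _URL_1_ to _URL_2_ amounts to keeping the numeric book ID and appending `/reviews`. As a sketch of that rule, the hypothetical helper `reviews_url` below extracts the ID with a regular expression, so it does not depend on which character follows the ID.

```python
import re

# Captures everything up to and including the numeric book ID
BOOK_ID_PATTERN = re.compile(r"(https://www\.goodreads\.com/book/show/\d+)")

def reviews_url(product_url):
    """Return the reviews-page URL for a book URL, or None if no numeric ID is found."""
    match = BOOK_ID_PATTERN.match(product_url)
    return f"{match.group(1)}/reviews" if match else None

print(reviews_url("https://www.goodreads.com/book/show/35296800-100-puzzles-and-case-studies-to-crack-data-science-interview"))
```

URLs that do not match return `None`, which makes them easy to filter out before the scraping loop.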

In [None]:
#Read the data from the previous scrape, split the URL for each book
#and obtain these as a series to be used for iteration
url_for_revs = pd.read_excel('Goodread_InitialFullScrape_deldup.xlsx')

#filter for books/rows with at least one rating
url_for_revs = url_for_revs[url_for_revs['rating_count']!='0 ratings']

#engineer the absolute reviews page URL for each book
url_for_revs['abs_url'] = url_for_revs['product_url'].str.split('-',expand=True)[0] + '/reviews'
url_for_revs = url_for_revs['abs_url']

url_for_revs.head()

In [None]:
reviews_ = []
for i in tqdm(url_for_revs[69:-1]):
    
    response3 = requests.get(i, headers=headers)
    sleep(2)
    soup3 = BeautifulSoup(response3.text, 'lxml')

    rev_sec = soup3.find_all('article', {'class':'ReviewCard'})
    revs = [d.text for d in soup3.find_all('span', {'class':'Formatted'})]
    
    #There are usually 30 reviews available at once on each book review page,
    #if the book has 30 or more reviews in total
    
    
    for r in range(len(revs)):
        #book urls not separated by '-' separator will return an Index error
        try:
            user_review = revs[r]
            start = str(rev_sec[r]).find('aria-label="Rating ')
            user_rating = str(rev_sec[r])[start+18:start+29]
            reviews_.append([user_review, user_rating,i])
        except IndexError:
            continue
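The fixed-width slice above returns text such as ` 4 out of 5`, and it returns unrelated characters when `find` yields `-1`. A hedged alternative is to read the star value with a regular expression. The helper name `extract_star_rating` is hypothetical, the sample HTML is made up, and the pattern assumes the label reads `Rating N out of 5`, which is what the 11-character slice implies.

```python
import re

# Assumes the review card contains aria-label="Rating N out of 5"
RATING_PATTERN = re.compile(r'aria-label="Rating (\d) out of 5"')

def extract_star_rating(review_html):
    """Return the star rating as an int, or None if the card has no rating label."""
    match = RATING_PATTERN.search(review_html)
    return int(match.group(1)) if match else None

# Made-up review cards for illustration only
rated_card = '<article class="ReviewCard"><span aria-label="Rating 4 out of 5"></span></article>'
unrated_card = '<article class="ReviewCard"></article>'
print(extract_star_rating(rated_card), extract_star_rating(unrated_card))
```

In the loop above this could be applied to `str(rev_sec[r])`, giving an integer rating or `None` for reviews posted without stars.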
            

In [None]:
#print length of reviews
len(reviews_)

#save to csv
pd.DataFrame(reviews_).to_csv('reviews2.csv')

In [None]:
response3 = requests.get(url_for_revs.to_list()[12], headers=headers)
soup3 = BeautifulSoup(response3.text, 'lxml')

rev_sec = soup3.find_all('article', {'class':'ReviewCard'})#[1].get_text()#find_all('span', {'class':'Formatted'})
revs = [i.text for i in rev_sec]

In [None]:
dfb = pd.read_csv('Goodread_InitialFullScrape_180923.csv', encoding = 'ISO-8859-1')
dfb.query("rating == ''")  #rows where the rating is an empty string

In [22]:
df_url_all = pd.read_excel('auto_updated.xlsx')
# #df_url = df_url[(df_url['price'].str.contains('Amazon')) | (df_url['reviews']=='[]') | (df_url['foreward']=='Not Found') |(df_url['rating']=='')]
df_url_all = df_url_all[df_url_all['five_star']=='Not Found']
df_url_all = df_url_all['product_url']
len(df_url_all)#.head(3)

1

In [27]:
stars_list = []
for i in tqdm(df_url_all):
    response2 = requests.get(i, headers=headers)
    sleep(4)
    soup2 = BeautifulSoup(response2.text, 'lxml')
    
    try:
        
        five_stars=soup2.find('div' , {'class':'RatingsHistogram__labelTotal', 'data-testid':'labelTotal-5'}).text
        four_stars=soup2.find('div' , {'class':'RatingsHistogram__labelTotal', 'data-testid':'labelTotal-4'}).text
        three_stars=soup2.find('div' , {'class':'RatingsHistogram__labelTotal', 'data-testid':'labelTotal-3'}).text
        two_stars=soup2.find('div' , {'class':'RatingsHistogram__labelTotal', 'data-testid':'labelTotal-2'}).text
        one_stars=soup2.find('div' , {'class':'RatingsHistogram__labelTotal', 'data-testid':'labelTotal-1'}).text
    except AttributeError:
        five_stars,four_stars,three_stars,two_stars,one_stars = 'Not Found','Not Found','Not Found','Not Found','Not Found'
    
    stars = [i,five_stars,four_stars,three_stars,two_stars,one_stars]
    stars_list.append(stars)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.16s/it]


In [28]:
#res = pd.DataFrame({'price': prices_list})
res = pd.DataFrame(stars_list)
res.head(40)

Unnamed: 0,0,1,2,3,4,5
0,https://www.goodreads.com/book/show/58437650-t...,Not Found,Not Found,Not Found,Not Found,Not Found


In [None]:
file_name = 'retry_stars8'
res.to_csv('{0}.csv'.format(file_name), index=False)

### REFERENCES

* [Scraping Dynamic HTML content in a Flex Container](https://discourse.mcneel.com/t/extract-specific-html-with-a-flex-container-using-ironpython-in-gh/141082/5)