# Web Scraping Goodreads

This notebook scrapes the required book data from the goodreads.com list "Books That Everyone Should Read At Least Once", using Beautiful Soup.
The list contains 24,529 books.

## Import Libraries

In [37]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import math
import re
import time
import random
import glob

## Request URL

In [2]:
pagenr = 1  # page of the list to request; the URL needs a concrete page number
books = requests.get(f"https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once?page={pagenr}")
print("books:", books.status_code)

books: 200
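The list is spread over several pages, selected with the `page` query parameter. Below is a minimal sketch of how the page URLs could be generated up front. It assumes 100 books per list page, which is not stated in this notebook; `BOOKS_PER_PAGE` and `list_page_urls` are illustrative names.

```python
import math

LIST_URL = "https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once"
TOTAL_BOOKS = 24_529     # size of the list as stated in the introduction
BOOKS_PER_PAGE = 100     # assumption: number of books shown per list page


def list_page_urls(total_books=TOTAL_BOOKS, per_page=BOOKS_PER_PAGE):
    """Return one URL per list page, with pages numbered from 1."""
    n_pages = math.ceil(total_books / per_page)
    return [f"{LIST_URL}?page={pagenr}" for pagenr in range(1, n_pages + 1)]


urls = list_page_urls()
print(len(urls))  # number of pages under the assumption above
print(urls[0])
```

Generating the URLs first keeps the page arithmetic separate from the requests themselves.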


## Fetch book information

In [None]:
# book title - Done

#book series - Done 

#book genres - Done

#book author - Done

#book description - Done

#book year published - Done

#number of pages - Done 

#language - Done 

#isbn10 - Done

#av rating - Done 

#1 star ratings, 2 star ratings .. - will not do 

#number of total ratings - Done

#book cover link - Done 

In [64]:
books = requests.get("https://www.goodreads.com/book/show/9648068-the-first-days")
print("books:", books.status_code)
soup = BeautifulSoup(books.content, 'html.parser')

books: 200


In [575]:
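# Note: without parentheses this prints the repr of the bound method, which embeds the raw HTML.
# Call soup.prettify() to get the indented markup instead.
print(soup.prettify)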

<bound method Tag.prettify of <!DOCTYPE html>

<html class="desktop">
<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# good_reads: http://ogp.me/ns/fb/good_reads#">
<title>The First Days (As the World Dies, #1) by Rhiannon Frater</title>
<meta content="The First Days book. Read 1,141 reviews from the world's largest community for readers. Katie is driving to work one beautiful day when a dead man jumps ..." name="description"/>
<meta content="telephone=no" name="format-detection"/>
<link href="https://www.goodreads.com/book/show/9648068-the-first-days" rel="canonical"/>
<meta content="2415071772" property="fb:app_id"/>
<meta content="books.book" property="og:type"/>
<meta content="The First Days (As the World Dies, #1)" property="og:title"/>
<meta content="Katie is driving to work one beautiful day when a dead man jumps into her car and tries to eat her.  That same morning, Jenni opens a bed..." property="og:description"/>
<meta content="https://i.gr-assets.com/images/S/com

### Book Title

In [4]:
def get_book_title(soup):
    if soup.find(id='bookTitle'):
        btitle = soup.find(id="bookTitle").get_text(strip=True)
        return btitle
    return ''

In [5]:
get_book_title(soup)

'Nudge: Improving Decisions About Health, Wealth, and Happiness'

### Book isbn10 

In [6]:
# Returns the ISBN-10 by searching the page source (the soup as a string) for "nisbn: ".
# The first match in the HTML is the ISBN of the book being viewed.
def get_book_isbn10(soup):
    matches = re.findall(r'nisbn: \d{10}', str(soup))
    if matches:
        return matches[0].split()[1]  # drop the "nisbn:" prefix, keep only the number
    return "isbn not found"


In [7]:
get_book_isbn10(soup)

'0300122233'
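The pattern `\d{10}` only matches ISBN-10s made of ten digits, but the last character of an ISBN-10 is a check character that can also be `X`. The following is a sketch of a stricter lookup that accepts `X` and verifies the checksum. `find_isbn10` and `is_valid_isbn10` are illustrative names, and the sample strings stand in for the real page source.

```python
import re

# Same "nisbn: " marker as in get_book_isbn10, but the last character may be X.
ISBN10_PATTERN = re.compile(r"nisbn: (\d{9}[\dXx])")


def is_valid_isbn10(isbn):
    """Check the ISBN-10 checksum: weights 10..1, total divisible by 11."""
    if not re.fullmatch(r"\d{9}[\dXx]", isbn):
        return False
    values = [10 if ch in "Xx" else int(ch) for ch in isbn]
    total = sum(weight * value for weight, value in zip(range(10, 0, -1), values))
    return total % 11 == 0


def find_isbn10(page_source):
    """Return the first ISBN-10 after 'nisbn: ' if its checksum is valid, else None."""
    match = ISBN10_PATTERN.search(page_source)
    if match and is_valid_isbn10(match.group(1)):
        return match.group(1).upper()
    return None


print(find_isbn10("... nisbn: 0300122233 ..."))
print(find_isbn10("... nisbn: 080442957X ..."))
```

Returning `None` instead of the string "isbn not found" makes missing values easier to filter later; that is a design choice, not what the function above does.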

### Book Series

In [61]:
def get_book_series(soup):
    series = soup.find(id="bookSeries")
    link = series.find('a') if series else None  # guard: not every page has a series block
    if link:
        # The link text is wrapped in parentheses, e.g. "(As the World Dies #1)"; strip them.
        return link.get_text(strip=True).strip('()')
    return ''

# Not a very generic solution, but it works. An alternative would be re.search(...).


In [62]:
get_book_series(soup)

''
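The series text combines the series name and the position of the book in it. As a sketch of the regular-expression alternative, the helper below splits a string such as `(As the World Dies #1)` into both parts. `parse_series` is a hypothetical name and the accepted formats are assumptions based on the outputs shown in this notebook.

```python
import re

# "(Series Name #1)" or "(Series Name, #1)"; the number may contain dots or a dash.
SERIES_PATTERN = re.compile(r"^\((?P<name>.+?),?\s*#(?P<number>[\d.\-]+)\)$")


def parse_series(text):
    """Split series text into (name, number); number is None if no '#' part is found."""
    text = text.strip()
    match = SERIES_PATTERN.match(text)
    if match:
        return match.group("name"), match.group("number")
    return text.strip("() "), None


print(parse_series("(As the World Dies #1)"))
print(parse_series("(As the World Dies, #1)"))
```

Keeping the number as a string avoids guessing how entries that are not plain integers should be handled.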

### Book Description

In [10]:
def get_book_description(soup):
    description = soup.find(id="description")
    # Guard against pages without a description block before looking for the hidden span.
    full_text = description.find(style="display:none") if description else None
    if full_text:
        return full_text.get_text(strip=True)
    return ''

In [11]:
get_book_description(soup)

'From the winner of the 2017 Nobel Prize in Economics, Richard H. Thaler, and Cass R. Sunstein: a revelatory look at how we make decisionsNew York Times bestsellerNamed a Best Book of the Year byThe Economistand theFinancial TimesEvery day we make choices—about what to buy or eat, about financial investments or our children’s health and education, even about the causes we champion or the planet itself. Unfortunately, we often choose poorly. Nudge is about how we make these choices and how we can make better ones. Using dozens of eye-opening examples and drawing on decades of behavioral science research, Nobel Prize winner Richard H. Thaler and Harvard Law School professor Cass R. Sunstein show that no choice is ever presented to us in a neutral way, and that we are all susceptible to biases that can lead us to make bad decisions. But by knowing how people think, we can use sensible “choice architecture” to nudge people toward the best decisions for ourselves, our families, and our soci

### Book Author

In [12]:
# Returns a list with all authors of the book
def get_book_authors(soup):
    book_authors = soup.find(id="bookAuthors")
    names = book_authors.find_all(itemprop='name') if book_authors else []
    if names:
        return [name.get_text(strip=True) for name in names]
    return ''

In [13]:
get_book_authors(soup)

['Richard H. Thaler', 'Cass R. Sunstein']

### Number of pages

In [14]:
def get_book_pages(soup):
    details = soup.find(id="details")
    pages = details.find(itemprop='numberOfPages') if details else None
    if pages:
        return int(pages.get_text(strip=True).split()[0])  # keep only the leading number
    return ''

In [15]:
get_book_pages(soup)

260

### Date first published

In [16]:
def get_book_year(soup):
    first_published = soup.find('nobr', attrs={'class': 'greyText'})
    if first_published:
        # search for the year: a number with 4 digits
        match = re.search(r'(\d{4})', first_published.get_text(strip=True))
        if match:
            return match.group(1)
    return ''

In [17]:
get_book_year(soup)

'2008'

### Language

In [18]:
def get_book_language(soup):
    if soup.find(itemprop="inLanguage"):
        language=soup.find(itemprop="inLanguage").get_text(strip=True)
        return language
    return ''

In [19]:
get_book_language(soup)

'English'

### Book Cover Link

In [20]:
def get_book_cover(soup):
    cover = soup.find(id="coverImage")
    if cover:
        return cover.get('src')
    return ''

In [21]:
get_book_cover(soup)

'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1348322381l/3450744.jpg'

### Number of total ratings

In [22]:
def get_total_ratings(soup):
    rating_count = soup.find(itemprop='ratingCount')
    if rating_count:
        return rating_count.get_text(strip=True).split()[0]  # returned as a string, e.g. '74,284'
    return ''

In [23]:
get_total_ratings(soup)

'74,284'
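`get_total_ratings` returns the count as a string with thousands separators. A small sketch of how it could be converted to an integer for later analysis; `parse_count` is an illustrative helper, not part of the scraper above.

```python
def parse_count(text):
    """Convert a count such as '74,284' to an int; return None if it is not a number."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


print(parse_count("74,284"))
print(parse_count(""))
```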

### Average rating of book

In [24]:
def get_bookavrating(soup):
    rating_value = soup.find("span", itemprop="ratingValue")
    if rating_value:
        return float(rating_value.get_text(strip=True))
    return ''

In [25]:
get_bookavrating(soup)

3.82

### Number of different star ratings - skipped due to time constraints

### Book Genres

In [26]:
# Returns all genres listed for the book, including those under "more"; the number of genres varies.
def get_all_book_genres(soup):
    genres=[]
    for i in soup.find_all('div', class_="left"):
        genre=i.find_all('a', class_="actionLinkLite bookPageGenreLink")
        for j in genre:
            texts=j.get_text(strip=True)
            if texts not in genres:
                genres.append(texts)
    return genres
    

In [27]:
get_all_book_genres(soup)

['Nonfiction',
 'Psychology',
 'Economics',
 'Business',
 'Self Help',
 'Science',
 'Politics',
 'Personal Development',
 'Sociology',
 'Social Science']
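`get_all_book_genres` drops repeated genres with an `if texts not in genres` check. An equivalent way to remove duplicates while keeping the original order is `dict.fromkeys`, sketched here on a plain list (`unique_in_order` is an illustrative name):

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item in place."""
    return list(dict.fromkeys(items))


raw_genres = ["Fiction", "Historical", "Fiction", "Historical Fiction", "Historical"]
print(unique_in_order(raw_genres))
```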

In [28]:
# This function pulls the genre list from the site header menu instead of the genres of the book, so its result is not correct.
def gs(soup):
    if soup.find('div', class_='siteHeader').find('div', attrs={'data-react-class':'ReactComponents.HeaderStoreConnector'}).find_all('li',class_='genreList__genre'):
        #div data-react-class="ReactComponents.HeaderStoreConnector"
        gs=soup.find('div', class_='siteHeader').find('div', attrs={'data-react-class':'ReactComponents.HeaderStoreConnector'}).find_all('li',class_='genreList__genre')
        #print(gs)
        gslist=[]
        for t in gs:
            gslist.append(t.get_text(strip=True))
        return gslist
    return ''

In [819]:
gs(soup)

[<li class="genreList__genre" data-reactid=".1bnbnfbifre.1.0.2.0.2.0.1.0.1.0.1:$genreList0.0:$Art" role="menuitem"><a class="genreList__genreLink gr-hyperlink gr-hyperlink--naked" data-reactid=".1bnbnfbifre.1.0.2.0.2.0.1.0.1.0.1:$genreList0.0:$Art.0" href="/genres/art">Art</a></li>, <li class="genreList__genre" data-reactid=".1bnbnfbifre.1.0.2.0.2.0.1.0.1.0.1:$genreList0.0:$Biography" role="menuitem"><a class="genreList__genreLink gr-hyperlink gr-hyperlink--naked" data-reactid=".1bnbnfbifre.1.0.2.0.2.0.1.0.1.0.1:$genreList0.0:$Biography.0" href="/genres/biography">Biography</a></li>, <li class="genreList__genre" data-reactid=".1bnbnfbifre.1.0.2.0.2.0.1.0.1.0.1:$genreList0.0:$Business" role="menuitem"><a class="genreList__genreLink gr-hyperlink gr-hyperlink--naked" data-reactid=".1bnbnfbifre.1.0.2.0.2.0.1.0.1.0.1:$genreList0.0:$Business.0" href="/genres/business">Business</a></li>, <li class="genreList__genre" data-reactid=".1bnbnfbifre.1.0.2.0.2.0.1.0.1.0.1:$genreList0.0:$Children's" r

['Art',
 'Biography',
 'Business',
 "Children's",
 'Christian',
 'Classics',
 'Comics',
 'Cookbooks',
 'Ebooks',
 'Fantasy',
 'Fiction',
 'Graphic Novels',
 'Historical Fiction',
 'History',
 'Horror',
 'Memoir',
 'Music',
 'Mystery',
 'Nonfiction',
 'Poetry',
 'Psychology',
 'Romance',
 'Science',
 'Science Fiction',
 'Self Help',
 'Sports',
 'Thriller',
 'Travel',
 'Young Adult',
 'More Genres',
 'Art',
 'Biography',
 'Business',
 "Children's",
 'Christian',
 'Classics',
 'Comics',
 'Cookbooks',
 'Ebooks',
 'Fantasy',
 'Fiction',
 'Graphic Novels',
 'Historical Fiction',
 'History',
 'Horror',
 'Memoir',
 'Music',
 'Mystery',
 'Nonfiction',
 'Poetry',
 'Psychology',
 'Romance',
 'Science',
 'Science Fiction',
 'Self Help',
 'Sports',
 'Thriller',
 'Travel',
 'Young Adult',
 'More Genres']

## Scrape Data

### From one book page

In [65]:
#Done - trying to scrape from one book
title=[]
series=[]
authors=[]
isbn=[]
genres=[]
descriptions=[]
years=[]
pages=[]
languages=[]
setting=[]
av_ratings=[]
total_ratings=[]
cover_image_link=[]



title.append(get_book_title(soup))
series.append(get_book_series(soup))
authors.append(get_book_authors(soup))
isbn.append(get_book_isbn10(soup))
genres.append(get_all_book_genres(soup))
descriptions.append(get_book_description(soup))
pages.append(get_book_pages(soup))
years.append(get_book_year(soup))
languages.append(get_book_language(soup))
cover_image_link.append(get_book_cover(soup))
total_ratings.append(get_total_ratings(soup))
av_ratings.append(get_bookavrating(soup))





print((title,series, authors,isbn,descriptions,pages,genres,years,languages,cover_image_link, total_ratings,av_ratings)) 


(['The First Days'], ['As the World Dies #1'], [['Rhiannon Frater']], ['1438250800'], ['Katie is driving to work one beautiful day when a dead man jumps into her car and tries to eat her.\xa0 That same morning, Jenni opens a bedroom door to find her husband devouring their toddler son.Fate puts Jenni and Katie—total strangers—together in a pickup, fleeing the suddenly zombie-filled streets of the Texas city in which they live. Before the sun has set, they have become more than just friends and allies—they are bonded as tightly as any two people who have been to war together.During their cross-Texas odyssey to find and rescue Jenni’s oldest son, Jenni discovers the joy of watching a zombie’s head explode when she shoots its brains out. Katie learns that she’s a terrific tactician—and a pretty good shot.A chance encounter puts them on the road to an isolated, fortified town, besieged by zombies, where fewer than one hundred people cling to the shreds of civilization.It looks like the end

### Extract all book URLs from List

In [33]:
#Extract all urls of books
def geturls(startpage,endpage):
    urls_all=[]
    for pagenr in range(startpage,endpage):    
        req = requests.get(f'https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once?page={pagenr}')
        soup = BeautifulSoup(req.content, 'html.parser')
        #print(soup)
        urls=soup.find_all('a', class_='bookTitle')#.get('href')
        for i in urls:
            urls_all.append('https://www.goodreads.com'+i.get('href'))
        time.sleep(random.randint(1,4))
    #print(urls_all)
    df = pd.DataFrame({"url": urls_all})
    df.to_csv(rf'c:\Users\anton\Desktop\df_urls{pagenr}.csv',index=None, header=True)  # pagenr: last page scraped
    
    

In [36]:
geturls(100,101)

In [41]:
#concat url lists
main_df= pd.DataFrame([])

for file_name in glob.glob(r'C:\Users\anton\Ironhack\Final Project\URLsFinal\*.csv'):
    df = pd.read_csv(file_name)
    main_df = pd.concat([main_df,df])
main_df

Unnamed: 0,url
0,https://www.goodreads.com/book/show/2657.To_Ki...
1,https://www.goodreads.com/book/show/3.Harry_Po...
2,https://www.goodreads.com/book/show/1885.Pride...
3,https://www.goodreads.com/book/show/48855.The_...
4,https://www.goodreads.com/book/show/170448.Ani...
...,...
4995,https://www.goodreads.com/book/show/1949296.Sa...
4996,https://www.goodreads.com/book/show/20735474-t...
4997,https://www.goodreads.com/book/show/29268319-t...
4998,https://www.goodreads.com/book/show/32620489-d...


## Scrape all book sites

In [None]:
#scrape all links on link list

In [52]:
def getfinaldata(url_list):
    title=[]
    series=[]
    authors=[]
    isbn=[]
    genres=[]
    descriptions=[]
    years=[]
    pages=[]
    languages=[]
    setting=[]
    av_ratings=[]
    total_ratings=[]
    cover_image_link=[]
    link_list=pd.read_csv(url_list)
    link_list=link_list['url'].tolist()
    
    for url in link_list:
        req = requests.get(url)
        soup = BeautifulSoup(req.content, 'html.parser')
        #FETCH DATA, append lists:
        title.append(get_book_title(soup))
        
        try:
            series.append(get_book_series(soup))
        except Exception:
            series.append('None')
        try:
            authors.append(get_book_authors(soup))
        except Exception:
            authors.append('None')

        isbn.append(get_book_isbn10(soup))
        genres.append(get_all_book_genres(soup))
        try:
            descriptions.append(get_book_description(soup))
        except Exception:
            descriptions.append('None')
        try:
            pages.append(get_book_pages(soup))
        except Exception:
            pages.append('None')
        try:
            years.append(get_book_year(soup))
        except Exception:
            years.append('None')
            
        languages.append(get_book_language(soup))
        cover_image_link.append(get_book_cover(soup))
        total_ratings.append(get_total_ratings(soup))
        av_ratings.append(get_bookavrating(soup))
        print(url)
        time.sleep(random.randint(1,4))
    #CREATE DATAFRAME FROM LISTS
    df = pd.DataFrame(
    {
     "title": title, 
     "series": series,
     "authors": authors,
     "isbn": isbn,
     "genres": genres,
     "description": descriptions,
     "pages": pages,
     "year": years, 
     "language": languages, 
     "cover_image": cover_image_link,
     "total_number_ratings": total_ratings, 
     "average_rating": av_ratings,
     }
    )
    df.to_csv(r'c:\Users\anton\Desktop\df_alldetails.csv',index=None, header=True)
    return df    

In [53]:
getfinaldata(r'URLsFinal\df_urls{100}.csv')

https://www.goodreads.com/book/show/31210107-dust
https://www.goodreads.com/book/show/32181246-recreant
https://www.goodreads.com/book/show/30944078-scarlet-crosses
https://www.goodreads.com/book/show/32470947-how-to-suck-cock-deep
https://www.goodreads.com/book/show/32000545-the-dragon-and-the-princess
https://www.goodreads.com/book/show/32314640-erica-s-house
https://www.goodreads.com/book/show/32493170-black-sunrise
https://www.goodreads.com/book/show/32798314-the-six-foot-bonsai
https://www.goodreads.com/book/show/6701087-barbelo-s-blood
https://www.goodreads.com/book/show/32921704-counting-blessings
https://www.goodreads.com/book/show/32765853-the-bad-canadian
https://www.goodreads.com/book/show/1520213.True_Stories
https://www.goodreads.com/book/show/30754700-the-game-changers
https://www.goodreads.com/book/show/32791571-endless-darkness-vol-2
https://www.goodreads.com/book/show/31922087-the-greenfather
https://www.goodreads.com/book/show/18897672-tales-from-shakespeare
https://w

Unnamed: 0,title,series,authors,isbn,genres,description,pages,year,language,cover_image,total_number_ratings,average_rating
0,Dust,,[Mark Thompson],isbn not found,"[Fiction, Historical, Historical Fiction]",,197,,English,https://i.gr-assets.com/images/S/compressed.ph...,124,4.01
1,Recreant,,[Geffrey Kane],1537033654,[],It has been three days since Father Jack McKen...,236,,,https://i.gr-assets.com/images/S/compressed.ph...,8,4.38
2,Scarlet Crosses: The Truth Lies Within,,[J. Beckham Steele],0997522003,[],…fears swarm inside Harris’ mind. He listens a...,,,English,https://i.gr-assets.com/images/S/compressed.ph...,5,4.6
3,How To Suck Cock Deep,,[Jock Camp],isbn not found,[],,30,,,https://i.gr-assets.com/images/S/compressed.ph...,4,4.5
4,,,,isbn not found,[],,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
95,,,,isbn not found,[],,,,,,,
96,Hobo Stew,,[Naomi Jackson],1630731978,[],,200,,English,https://i.gr-assets.com/images/S/compressed.ph...,7,4.29
97,hunger,,[Tanzeela K. Hassan],1544673191,[],"Adam a successful New Yorker, who knew nothing...",,,,https://i.gr-assets.com/images/S/compressed.ph...,49,4.35
98,Hatching Charlie: A Psychotherapist's Tale,,[Charles C. McCormack],1654589918,"[Autobiography, Memoir, Health, Mental Health]",McCormack was asked two questions by his child...,426,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,47,4.49
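Some rows in the result (for example rows 4 and 95) are empty in every scraped column. One possible explanation, which is an assumption here, is that the response for that URL did not contain the expected book markup. The sketch below retries a fetch until the result looks complete. `fetch_with_retry` and its parameters are illustrative; the fetch and the completeness check are passed in as functions so that the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, is_complete, attempts=3, wait_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) until is_complete(result) is true or the attempts run out."""
    result = None
    for attempt in range(1, attempts + 1):
        result = fetch(url)
        if is_complete(result):
            return result
        if attempt < attempts:
            sleep(wait_seconds * attempt)  # wait a little longer after each failed attempt
    return result  # may still be incomplete; the caller decides how to handle it


# Stand-in for a real fetch: the first two calls return an empty record.
responses = iter([{"title": ""}, {"title": ""}, {"title": "Dust"}])
record = fetch_with_retry(
    fetch=lambda url: next(responses),
    url="https://www.goodreads.com/book/show/31210107-dust",
    is_complete=lambda rec: bool(rec["title"]),
    sleep=lambda seconds: None,  # skip the real waiting in this demo
)
print(record)
```

In the scraper, `fetch` could wrap `requests.get` plus the `get_book_*` calls, and `is_complete` could check that the title is not empty.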


In [66]:
dfgh =pd.read_csv(r'C:\Users\anton\Desktop\df_alldetails.csv')
dfgh

Unnamed: 0,title,series,authors,isbn,genres,description,pages,year,language,cover_image,total_number_ratings,average_rating
0,Dust,,['Mark Thompson'],isbn not found,"['Fiction', 'Historical', 'Historical Fiction']",,197,,English,https://i.gr-assets.com/images/S/compressed.ph...,124,4.01
1,Recreant,,['Geffrey Kane'],1537033654,[],It has been three days since Father Jack McKen...,236,,,https://i.gr-assets.com/images/S/compressed.ph...,8,4.38
2,Scarlet Crosses: The Truth Lies Within,,['J. Beckham Steele'],0997522003,[],…fears swarm inside Harris’ mind. He listens a...,,,English,https://i.gr-assets.com/images/S/compressed.ph...,5,4.60
3,How To Suck Cock Deep,,['Jock Camp'],isbn not found,[],,30,,,https://i.gr-assets.com/images/S/compressed.ph...,4,4.50
4,,,,isbn not found,[],,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
95,,,,isbn not found,[],,,,,,,
96,Hobo Stew,,['Naomi Jackson'],1630731978,[],,200,,English,https://i.gr-assets.com/images/S/compressed.ph...,7,4.29
97,hunger,,['Tanzeela K. Hassan'],1544673191,[],"Adam a successful New Yorker, who knew nothing...",,,,https://i.gr-assets.com/images/S/compressed.ph...,49,4.35
98,Hatching Charlie: A Psychotherapist's Tale,,['Charles C. McCormack'],1654589918,"['Autobiography', 'Memoir', 'Health', 'Mental ...",McCormack was asked two questions by his child...,426,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,47,4.49
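After the round trip through CSV, the `authors` and `genres` columns contain strings such as `"['Mark Thompson']"` rather than lists. A sketch of how such cells could be turned back into lists with `ast.literal_eval`; `parse_list_cell` is an illustrative helper.

```python
import ast


def parse_list_cell(cell):
    """Turn a cell like "['Mark Thompson']" back into a list; empty or invalid cells give []."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


print(parse_list_cell("['Fiction', 'Historical', 'Historical Fiction']"))
print(parse_list_cell(""))
```

The helper could be applied while loading the file, for example through the `converters` argument of `pd.read_csv` (`converters={"authors": parse_list_cell, "genres": parse_list_cell}`).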
