# Webscarping Goodreads

This file is used to webscrape the required book data from the goodread.com List of "Books that everyone should have read at least once", using beautiful soup.
The list contains 24,529 books.

## Import Libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import math
import re
import time
import random
import glob

## Fetch book information

In [68]:
books = requests.get("https://www.goodreads.com/book/show/32000545-the-dragon-and-the-princess")
print("books:", books.status_code)
soup = BeautifulSoup(books.content, 'html.parser')

books: 200


In [None]:
print(soup.prettify)

### Book Title

In [53]:
def get_book_title(soup):
    if soup.find(id='bookTitle'):
        btitle = soup.find(id="bookTitle").get_text(strip=True)
        return btitle
    return ''

In [54]:
get_book_title(soup)

'Dust'

### Book isbn13 

In [71]:
# This function gives back the isbn13 number by browsing the soup as string for "nisbn". 
#The first "nisbn" in the html code is the ISBN number of the chosen book[0].
def get_book_isbn13(soup):
    if soup.find('span', attrs={'itemprop':"isbn"}):
        isbn1 = soup.find('span', attrs={'itemprop':"isbn"}).get_text(strip=True)
        #print(1)
        return isbn1
    if not soup.find('span', attrs={'itemprop':"isbn"}):
        try:
            try:
                isbn10 = re.findall(r'nisbn: \d{10}' , str(soup))[0] #get first nisbn number in str(soup)
                #print(2)
                return isbn10.split()[1] # only show isbn number not "nisbn
            except:
                try:
                    isbn2 =soup.find(id="bookDataBox").find('div', class_="infoBoxRowItem").get_text(strip=True)
                    #print(3)
                    return re.search(r'\d{10}',isbn2).group(0)
                except:
                    return ""
        except:
            return ""
                
        #except:
            #continue      
    #else:
       # try:
           # isbn10 = re.findall(r'nisbn: \d{10}' , str(soup))[0] #get first nisbn number in str(soup)
            #print(2)
           # return isbn10.split()[1] # only show isbn number not "nisbn
       # except:
         #   return ""


In [70]:
get_book_isbn13(soup)

3


''

In [38]:
print(soup.prettify)

<bound method Tag.prettify of <!DOCTYPE html>

<html class="desktop">
<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# good_reads: http://ogp.me/ns/fb/good_reads#">
<title>Barbelo's Blood by Joseph Ferri</title>
<meta content="Barbelo's Blood book. Read 4 reviews from the world's largest community for readers. Forty years of war Twenty years of peace Five seconds of bloody murd..." name="description"/>
<meta content="telephone=no" name="format-detection"/>
<link href="https://www.goodreads.com/book/show/6701087-barbelo-s-blood" rel="canonical"/>
<meta content="2415071772" property="fb:app_id"/>
<meta content="books.book" property="og:type"/>
<meta content="Barbelo's Blood" property="og:title"/>
<meta content="Forty years of war Twenty years of peace Five seconds of bloody murder! Captain Barbelo hunts the rubbish-strewn streets of 1980's Brixto..." property="og:description"/>
<meta content="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1327920570i/67

### Book Series

In [7]:
def get_book_series(soup):
    if soup.find(id="bookSeries").find('a'):
        book_series=soup.find(id="bookSeries").find('a').get_text(strip=True)
        book_series=book_series.replace('(',')')
        return (book_series.split(')')[1]) 
    return ''

#not a really generic solution, but works ;) Alternative would have been re.search (...)


In [8]:
get_book_series(soup)

''

### Book Description

In [9]:
def get_book_description(soup):
    bdescription= ''
    if soup.find(id="description").find(style="display:none"):
        bdescription +=soup.find(id="description").find(style="display:none").get_text(strip=True)
        return bdescription
    return ''

In [10]:
get_book_description(soup)

"Forty years of war Twenty years of peace Five seconds of bloody murder! Captain Barbelo hunts the rubbish-strewn streets of 1980's Brixton, exterminating the thugs that prey on the helpless. Encountering the nefarious Illuminati, Barbelo embarks on Lawful Rebellion, declaring war on the real criminals - the British Government, the World Banks and the Vatican. Joseph Ferri - a powerful new force in fiction has created an original anti-hero that you will never forget!"

### Book Author

In [11]:
#DONE - This function gives a list with all the authors
def get_book_authors(soup):
    if soup.find(id="bookAuthors").find_all(itemprop='name'):  #find('a', attrs={'class': 'authorName'}).find(itemprop='name'):
        authors=soup.find(id="bookAuthors").find_all(itemprop='name')#.find(itemprop='name')
        #print(authors)
        authors_all=[]
        for i in authors:
            authors_all.append(i.get_text(strip=True))
        return(authors_all)
    return ''

In [12]:
get_book_authors(soup)

['Joseph Ferri']

### Number of pages

In [13]:
def get_book_pages(soup):
    if soup.find(id="details").find(itemprop='numberOfPages'):  #find('a', attrs={'class': 'authorName'}).find(itemprop='name'):
        pages=soup.find(id="details").find(itemprop='numberOfPages').get_text(strip=True)#.find(itemprop='name')
        return int(pages.split()[0])
    return '0'

In [14]:
get_book_pages(soup)

446

### Date first published

In [15]:
def get_book_year(soup):
    if soup.find('nobr', attrs={'class':'greyText'}):
        year_published1=soup.find('nobr', attrs={'class':'greyText'}).get_text(strip=True)
        #print(1)
        return re.search('(\d{4})', year_published1).group(1) #search for year: number with 4 digits
    elif soup.find('p',attrs={'data-testid':"publicationInfo"}):
        year_published2=soup.find('p',attrs={'data-testid':'publicationInfo'}).get_text(strip=True)
        #print(2)
        return re.search('(\d{4})', year_published2).group(1)
    elif soup.find(id="details"):#.findall('div', attrs={'class':'row'}):
        year_published3=soup.find(id="details").get_text(strip=True)#.split()#findall('div', attrs={'class':'row'})
        #print(3)
        return re.search('\d{4}', year_published3).group(0)   ##.split()[3]       
    return ''      

In [16]:
get_book_year(soup)

'2009'

### Language

In [17]:
def get_book_language(soup):
    if soup.find(itemprop="inLanguage"):
        language=soup.find(itemprop="inLanguage").get_text(strip=True)
        return language
    return ''

In [18]:
get_book_language(soup)

'English'

### Book Cover Link

In [19]:
def get_book_cover(soup):
    if soup.find(id="coverImage"):
        cover=soup.find(id="coverImage")
       # print(cover)
        return cover.get('src') #img.get
    return''

In [20]:
get_book_cover(soup)

'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1327920570l/6701087.jpg'

### Number of total ratings

In [21]:
def get_total_ratings(soup):
    if soup.find(itemprop='ratingCount'):  #find('a', attrs={'class': 'authorName'}).find(itemprop='name'):
        reviews=soup.find(itemprop='ratingCount').get_text(strip=True)#.find(itemprop='name')
        return reviews.split()[0]#float(pages.split()[0])

In [22]:
get_total_ratings(soup)

'6'

### Average rating of book

In [23]:
def get_bookavrating(soup):
    if soup.find("span", itemprop="ratingValue"):#soup.find("div", class_="RatingStatistics__rating"):
        av_rating = soup.find("span", itemprop="ratingValue").get_text(strip=True)#soup.find("div", class_="RatingStatistics__rating")
        return float(av_rating)
    return ''

In [24]:
get_bookavrating(soup)

4.0

### Book Genres

In [25]:
#This function gets all genres listed under more, can be various amounts
def get_all_book_genres(soup):
    genres=[]
    for i in soup.find_all('div', class_="left"):
        genre=i.find_all('a', class_="actionLinkLite bookPageGenreLink")
        for j in genre:
            texts=j.get_text(strip=True)
            if texts not in genres:
                genres.append(texts)
    return genres
    

In [26]:
get_all_book_genres(soup)

[]

gs1(soup)

## Scrape Data

### From one book page

In [29]:
#Checking functions for one book
title=[]
series=[]
authors=[]
isbn=[]
genres=[]
descriptions=[]
years=[]
pages=[]
languages=[]
setting=[]
av_ratings=[]
total_ratings=[]
cover_image_link=[]



title.append(get_book_title(soup))
series.append(get_book_series(soup))
authors.append(get_book_authors(soup))
isbn.append(get_book_isbn13(soup))
genres.append(get_all_book_genres(soup))
descriptions.append(get_book_description(soup))
pages.append(get_book_isbn13(soup))
years.append(get_book_year(soup))
languages.append(get_book_language(soup))
cover_image_link.append(get_book_cover(soup))
total_ratings.append(get_total_ratings(soup))
av_ratings.append(get_bookavrating(soup))





print((title,series, authors,isbn,descriptions,pages,genres,years,languages,cover_image_link, total_ratings,av_ratings)) 


(["Barbelo's Blood"], [''], [['Joseph Ferri']], ['0955912814'], ["Forty years of war Twenty years of peace Five seconds of bloody murder! Captain Barbelo hunts the rubbish-strewn streets of 1980's Brixton, exterminating the thugs that prey on the helpless. Encountering the nefarious Illuminati, Barbelo embarks on Lawful Rebellion, declaring war on the real criminals - the British Government, the World Banks and the Vatican. Joseph Ferri - a powerful new force in fiction has created an original anti-hero that you will never forget!"], ['0955912814'], [[]], ['2009'], ['English'], ['https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1327920570l/6701087.jpg'], ['6'], [4.0])


### Extract all book URLs from List

In [None]:
#Extract all urls of books
def geturls(startpage,endpage):
    urls_all=[]
    for pagenr in range(startpage,endpage):    
        req = requests.get(f'https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once?page={pagenr}')
        soup = BeautifulSoup(req.content, 'html.parser')
        #print(soup)
        urls=soup.find_all('a', class_='bookTitle')#.get('href')
        for i in urls:
            urls_all.append('https://www.goodreads.com'+i.get('href'))
        time.sleep(random.randint(1,4))
    #print(urls_all)
    df = pd.DataFrame({"url": urls_all})
    df.to_csv(r'c:\Users\anton\Desktop\df_urls{pagenr}.csv',index=None, header=True)
    
    

In [None]:
geturls(100,101)

### Optional - Concat all URL dataframes

In [None]:
url_list= pd.DataFrame([])

for file_name in glob.glob(r'C:\Users\anton\Ironhack\Final Project\URLsFinal\*.csv'):
    df = pd.read_csv(file_name)
    url_list= pd.concat([url_list],df)
url_list

## Scrape all book sites

In [None]:
#scrape all links on link list

In [30]:
def getfinaldata(url_list):
    title=[]
    series=[]
    authors=[]
    isbn=[]
    genres=[]
    descriptions=[]
    years=[]
    pages=[]
    languages=[]
    setting=[]
    av_ratings=[]
    total_ratings=[]
    cover_image_link=[]
    link_list=pd.read_csv(url_list)
    link_list=link_list['url'].tolist()
        
    for url in link_list:
        req = requests.get(url)
        soup = BeautifulSoup(req.content, 'html.parser')
        #FETCH DATA, append lists:
        title.append(get_book_title(soup))
        
        try:
            series.append(get_book_series(soup))
        except:
            series.append('None')
        try:  
            authors.append(get_book_authors(soup))
        except:
            authors.append('None')
            
        isbn.append(get_book_isbn13(soup))
        genres.append(get_all_book_genres(soup))
        try:
            descriptions.append(get_book_description(soup))
        except:
            descriptions.append('None')
        try:    
            pages.append(get_book_pages(soup))
        except:
            pages.append('None')
        try:
            years.append(get_book_year(soup))
        except:
            years.append('None')
            
        languages.append(get_book_language(soup))
        cover_image_link.append(get_book_cover(soup))
        total_ratings.append(get_total_ratings(soup))
        av_ratings.append(get_bookavrating(soup))
        print(url)
        time.sleep(random.randint(1,4))
        
    #CREATE DATAFRAME FROM LISTS
    df = pd.DataFrame(
    {
     "title": title, 
     "series": series,
     "authors": authors,
     "isbn": isbn,
     "genres": genres,
     "description": descriptions,
     "pages": pages,
     "year": years, 
     "language": languages, 
     "cover_image": cover_image_link,
     "total_number_ratings": total_ratings, 
     "average_rating": av_ratings,
     "book_url": url
     }
    )
    df.head()
    df.to_csv(r'c:\Users\anton\Desktop\df_alldetails.csv',index=None, header=True)
    return df    

In [31]:
getfinaldata('URLsFinal\df_urls{100}.csv')

https://www.goodreads.com/book/show/31210107-dust
https://www.goodreads.com/book/show/32181246-recreant
https://www.goodreads.com/book/show/30944078-scarlet-crosses
https://www.goodreads.com/book/show/32470947-how-to-suck-cock-deep
https://www.goodreads.com/book/show/32000545-the-dragon-and-the-princess
https://www.goodreads.com/book/show/32314640-erica-s-house
https://www.goodreads.com/book/show/32493170-black-sunrise
https://www.goodreads.com/book/show/32798314-the-six-foot-bonsai
https://www.goodreads.com/book/show/6701087-barbelo-s-blood
https://www.goodreads.com/book/show/32921704-counting-blessings
https://www.goodreads.com/book/show/32765853-the-bad-canadian
https://www.goodreads.com/book/show/1520213.True_Stories
https://www.goodreads.com/book/show/30754700-the-game-changers
https://www.goodreads.com/book/show/32791571-endless-darkness-vol-2
https://www.goodreads.com/book/show/31922087-the-greenfather
https://www.goodreads.com/book/show/18897672-tales-from-shakespeare
https://w

Unnamed: 0,title,series,authors,isbn,genres,description,pages,year,language,cover_image,total_number_ratings,average_rating,book_url
0,Dust,,[Mark Thompson],9781910453223,"[Fiction, Historical, Historical Fiction]",,197,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,124,4.01,https://www.goodreads.com/book/show/33785153-i...
1,Recreant,,[Geffrey Kane],1537033654,[],It has been three days since Father Jack McKen...,236,2016,,https://i.gr-assets.com/images/S/compressed.ph...,8,4.38,https://www.goodreads.com/book/show/33785153-i...
2,Scarlet Crosses: The Truth Lies Within,,[J. Beckham Steele],0997522003,[],…fears swarm inside Harris’ mind. He listens a...,0,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,5,4.6,https://www.goodreads.com/book/show/33785153-i...
3,How To Suck Cock Deep,,[Jock Camp],How To Suck Cock Deep,[],,30,2017,,https://i.gr-assets.com/images/S/compressed.ph...,4,4.5,https://www.goodreads.com/book/show/33785153-i...
4,The Dragon and the Princess,,[Andrew P.M. Yiallouros],B01LLVARVI,[],"A tale of magic, love, spirituality and advent...",94,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,4,5.0,https://www.goodreads.com/book/show/33785153-i...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Love and Other Perishable Items,,[Laura Buzo],9780375970009,"[Young Adult, Romance, Contemporary, Realistic...","A wonderful, coming-of-age love story from a f...",243,2010,English,https://i.gr-assets.com/images/S/compressed.ph...,8165,3.51,https://www.goodreads.com/book/show/33785153-i...
96,Hobo Stew,,[Naomi Jackson],1630731978,[],,200,2017,English,https://i.gr-assets.com/images/S/compressed.ph...,7,4.29,https://www.goodreads.com/book/show/33785153-i...
97,hunger,,[Tanzeela K. Hassan],9781544673196,[],"Adam a successful New Yorker, who knew nothing...",0,2017,,https://i.gr-assets.com/images/S/compressed.ph...,49,4.35,https://www.goodreads.com/book/show/33785153-i...
98,Hatching Charlie: A Psychotherapist's Tale,,[Charles C. McCormack],9780692813430,"[Autobiography, Memoir, Health, Mental Health]",McCormack was asked two questions by his child...,426,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,47,4.49,https://www.goodreads.com/book/show/33785153-i...


In [32]:
dfgh =pd.read_csv(r'C:\Users\anton\Desktop\df_alldetails.csv')
dfgh

Unnamed: 0,title,series,authors,isbn,genres,description,pages,year,language,cover_image,total_number_ratings,average_rating,book_url
0,Dust,,['Mark Thompson'],9781910453223,"['Fiction', 'Historical', 'Historical Fiction']",,197,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,124,4.01,https://www.goodreads.com/book/show/33785153-i...
1,Recreant,,['Geffrey Kane'],1537033654,[],It has been three days since Father Jack McKen...,236,2016,,https://i.gr-assets.com/images/S/compressed.ph...,8,4.38,https://www.goodreads.com/book/show/33785153-i...
2,Scarlet Crosses: The Truth Lies Within,,['J. Beckham Steele'],0997522003,[],…fears swarm inside Harris’ mind. He listens a...,0,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,5,4.60,https://www.goodreads.com/book/show/33785153-i...
3,How To Suck Cock Deep,,['Jock Camp'],How To Suck Cock Deep,[],,30,2017,,https://i.gr-assets.com/images/S/compressed.ph...,4,4.50,https://www.goodreads.com/book/show/33785153-i...
4,The Dragon and the Princess,,['Andrew P.M. Yiallouros'],B01LLVARVI,[],"A tale of magic, love, spirituality and advent...",94,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,4,5.00,https://www.goodreads.com/book/show/33785153-i...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Love and Other Perishable Items,,['Laura Buzo'],9780375970009,"['Young Adult', 'Romance', 'Contemporary', 'Re...","A wonderful, coming-of-age love story from a f...",243,2010,English,https://i.gr-assets.com/images/S/compressed.ph...,8165,3.51,https://www.goodreads.com/book/show/33785153-i...
96,Hobo Stew,,['Naomi Jackson'],1630731978,[],,200,2017,English,https://i.gr-assets.com/images/S/compressed.ph...,7,4.29,https://www.goodreads.com/book/show/33785153-i...
97,hunger,,['Tanzeela K. Hassan'],9781544673196,[],"Adam a successful New Yorker, who knew nothing...",0,2017,,https://i.gr-assets.com/images/S/compressed.ph...,49,4.35,https://www.goodreads.com/book/show/33785153-i...
98,Hatching Charlie: A Psychotherapist's Tale,,['Charles C. McCormack'],9780692813430,"['Autobiography', 'Memoir', 'Health', 'Mental ...",McCormack was asked two questions by his child...,426,2016,English,https://i.gr-assets.com/images/S/compressed.ph...,47,4.49,https://www.goodreads.com/book/show/33785153-i...
