<h3>Step 1: Install and load the packages.</h3>

In [24]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

<h3>Step 2: Request the web page to extract information from.</h3>

In [25]:
url = "https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=1"
response = requests.get(url)  # `requests` was already imported in Step 1
print(response.text[:500])    # preview the first 500 characters of the HTML

<!DOCTYPE html>
<html class="desktop withSiteHeaderTopFullImage
">
<head>
  <title>All Time Favorite Romance Novels (5133 books)</title>

<meta content='5,112 books based on 12318 votes: Pride and Prejudice by Jane Austen, Fifty Shades of Grey by E.L. James, Beautiful Disaster by Jamie McGuire, Twilight b...' name='description'>
<meta content='telephone=no' name='format-detection'>
<link href='https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels' rel='canonical'>



    <sc


<h3>Step 3: Data Extraction</h3>
<ul>
  <li>We find every table row (<code>&lt;tr></code>) whose <code>itemtype</code> attribute is <code>http://schema.org/Book</code> and store the matches in <code>book_containers</code>. <code>book_containers</code> is a <code>ResultSet</code>, which is a subclass of <code>list</code>.
    <ul>
      <li><code>response.content</code> returns the body of the response as raw bytes.</li>
      <li>The <code>find_all()</code> method looks through a tag’s descendants and retrieves all descendants that match your filters.</li>
      <li>In the HTML of the page, each book item is a table row element (<code>&lt;tr></code>) with <code>itemtype="http://schema.org/Book"</code>.</li>
    </ul>
  </li>
</ul>
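To make the <code>find_all()</code> filter concrete, here is a small sketch that runs on a hand-written HTML fragment instead of the live page. The fragment and its contents are made up for illustration; only the <code>tr</code> tag and the <code>itemtype</code> attribute mirror the markup described above.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written fragment that imitates the list page's structure.
sample_html = """
<table>
  <tr itemscope itemtype="http://schema.org/Book"><td>Book A</td></tr>
  <tr><td>Not a book row</td></tr>
  <tr itemscope itemtype="http://schema.org/Book"><td>Book B</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keyword arguments other than the tag name act as attribute filters.
rows = soup.find_all("tr", itemtype="http://schema.org/Book")

# A ResultSet behaves like a list: it supports len(), indexing, and iteration.
titles = [row.td.text for row in rows]
print(len(rows), titles)
```

The row without an <code>itemtype</code> attribute is skipped, which is the same filtering that keeps non-book rows out of <code>book_containers</code>.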

In [26]:
response = requests.get(url)
html = response.content
html_soup = bs(html, "html.parser")
book_containers = html_soup.find_all('tr', itemtype="http://schema.org/Book")
print(type(book_containers))
print(len(book_containers))

<class 'bs4.element.ResultSet'>
100


Display the container for the first book

In [27]:
first_book = book_containers[0]
first_book

<tr itemscope="" itemtype="http://schema.org/Book">
<td class="number" valign="top">1</td>
<td valign="top" width="5%">
<div class="u-anchorTarget" id="1885"></div>
<div class="js-tooltipTrigger tooltipTrigger" data-resource-id="1885" data-resource-type="Book">
<a href="/book/show/1885.Pride_and_Prejudice" title="Pride and Prejudice">
<img alt="Pride and Prejudice" class="bookCover" itemprop="image" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1320399351i/1885._SY75_.jpg"/>
</a> </div>
</td>
<td valign="top" width="100%">
<a class="bookTitle" href="/book/show/1885.Pride_and_Prejudice" itemprop="url">
<span aria-level="4" itemprop="name" role="heading">Pride and Prejudice</span>
</a> <br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/1265.Jane_Austen" itemprop="url"><span itemprop="name">Jane Austen</s

Extracting book title text

In [28]:
name = first_book.find('a', class_="bookTitle")
name

<a class="bookTitle" href="/book/show/1885.Pride_and_Prejudice" itemprop="url">
<span aria-level="4" itemprop="name" role="heading">Pride and Prejudice</span>
</a>

In [29]:
name = first_book.find('a', class_="bookTitle").text.strip()
name

'Pride and Prejudice'

Extracting author information.

In [30]:
authors = first_book.find('a', class_="authorName").text.strip()
authors

'Jane Austen'

Extracting average rating and total # of ratings.

In [31]:
scoring = first_book.find('span', class_="greyText smallText uitext").text.strip().split()
scoring

['4.28', 'avg', 'rating', '—', '3,904,923', 'ratings']

In [32]:
avg_scores=scoring[0]
rates = scoring[4]
print("average scores:", avg_scores)
print("ratings", rates)

average scores: 4.28
ratings 3,904,923


Extracting scores and votes.

In [33]:
voted= first_book.find('span', class_="smallText uitext").text.strip().split()
voted

['score:', '237,502,', 'and', '2,403', 'people', 'voted']

In [34]:
scores = voted[1].rstrip(',')  # drop the trailing comma left over from the sentence punctuation
print("scores:", scores)
vote = voted[3]
print("votes:", vote)

scores: 237,502
votes: 2,403
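The values extracted so far are strings with thousands separators, which makes them awkward to sort or sum later. One possible cleanup, sketched below with a hypothetical helper named <code>parse_stats</code>, pulls the numbers out with a regular expression and converts them. The helper name and the regex are assumptions for illustration, not part of the original notebook; the two input strings are the texts shown in the outputs above.

```python
import re

# Matches numbers such as "4.28", "3,904,923" or "2,403".
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_stats(text):
    """Return every number found in `text`, as int or float.

    Hypothetical helper: thousands separators and trailing commas are dropped.
    """
    values = []
    for match in NUMBER.findall(text):
        cleaned = match.replace(",", "")
        values.append(float(cleaned) if "." in cleaned else int(cleaned))
    return values

avg_rating, num_ratings = parse_stats("4.28 avg rating — 3,904,923 ratings")
score, votes = parse_stats("score: 237,502, and 2,403 people voted")
print(avg_rating, num_ratings, score, votes)
```

Working on the whole sentence rather than on fixed list positions also avoids depending on the exact word count of the text.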


Extracting book image.

In [35]:
# Big Image
imgParams = first_book.find('a', class_="bookTitle")['href']
imgLink = "https://www.goodreads.com" + imgParams  # imgParams already starts with "/"
# print(imgLink)
imgResponse = requests.get(imgLink)
img_html = imgResponse.content
img_html_soup = bs(img_html, "html.parser")
img = img_html_soup.find('img', class_="ResponsiveImage")['src']
img

'https://images-na.ssl-images-amazon.com/images/S/compressed.photo.goodreads.com/books/1320399351i/1885.jpg'
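The <code>href</code> on the title link is a site-relative path that already begins with <code>/</code>, so joining it to a base URL by hand requires care about duplicate or missing slashes. As a minimal sketch, <code>urljoin</code> from the standard library takes care of that detail. The variable names below are illustrative only.

```python
from urllib.parse import urljoin

base = "https://www.goodreads.com/"
relative_link = "/book/show/1885.Pride_and_Prejudice"  # value of the bookTitle href

# Plain concatenation keeps both slashes.
naive = base + relative_link

# urljoin resolves the relative path against the base URL.
joined = urljoin(base, relative_link)

print(naive)
print(joined)
```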

In [36]:
#Small Image
img = first_book.find('img', class_="bookCover")['src']
img

'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1320399351i/1885._SY75_.jpg'

<h3>Step 4: Extract data from all the books across multiple web pages</h3>

Find the total number of pages so that a URL can be generated for each one. The anchor tag (<code>&lt;a></code>) just before the "next page" anchor holds the last page number, which is the number of pages (and URLs) to extract.

In [37]:
nextPageLink = html_soup.find('a', class_="next_page")
nextPageLink

<a class="next_page" href="/list/show/12362.All_Time_Favorite_Romance_Novels?page=2" rel="next">Next →</a>

In [38]:
nextPageLink.previous_sibling.previous_sibling

<a href="/list/show/12362.All_Time_Favorite_Romance_Novels?page=52">52</a>

In [39]:
numPages = int(nextPageLink.previous_sibling.previous_sibling.text)
numPages

52
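Calling <code>previous_sibling</code> twice is presumably needed because the whitespace between the two anchors is itself a text node in the parse tree. As an alternative sketch, <code>find_previous_sibling('a')</code> skips text nodes and returns the nearest preceding <code>&lt;a></code> tag directly. The pagination fragment below is hand-written and simplified; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified pagination markup (not copied from the live page).
pagination_html = (
    '<div class="pagination">'
    '<a href="?page=1">1</a> '
    '<a href="?page=2">2</a> '
    '<a href="?page=52">52</a> '
    '<a class="next_page" href="?page=2">Next</a>'
    '</div>'
)

soup = BeautifulSoup(pagination_html, "html.parser")
next_link = soup.find("a", class_="next_page")

# The immediate previous sibling is the whitespace text node, not a tag.
print(repr(next_link.previous_sibling))

# find_previous_sibling("a") skips text nodes and returns the nearest <a>.
last_page_link = next_link.find_previous_sibling("a")
num_pages = int(last_page_link.text)
print(num_pages)
```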

In [40]:
page = 1
while page <= numPages:  # <= so that the last page is included
    url = f"https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page={page}"
    print(url)
    page = page + 1

https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=1
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=2
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=3
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=4
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=5
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=6
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=7
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=8
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=9
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=10
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=11
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=12
https://www.g

<h3>Step 5: Repeat the <code>first_book</code> extraction for every book. Instead of printing the values, we store them in lists.</h3>

In [41]:
page = 1
names = []
ratings = []
avgscores = []
author=[]
score=[]
votes=[]
imgs=[]
# Note: this bound covers pages 1-50 only; use `page <= numPages` to scrape all 52 pages.
while page != 51:
    url = f"https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page={page}"
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    book_containers = soup.find_all('tr', itemtype="http://schema.org/Book")
    for container in book_containers:
        if container.find('td', width= '100%') is not None:
            name = container.find('a',class_="bookTitle").text.strip()
            names.append(name)
            authors = container.find('a',class_="authorName").text.strip()
            author.append(authors)
            scoring = container.find('span',class_="greyText smallText uitext").text.strip().split()
            ascores=scoring[0]
            avgscores.append(ascores)
            rates = scoring[4]
            ratings.append(rates)
            voted= container.find('span',class_="smallText uitext").text.strip().split()        
            scores=voted[1].rstrip(',')  # drop the trailing comma after the score
            score.append(scores)
            vote=voted[3]
            votes.append(vote)
            img = container.find('img', class_="bookCover")['src']
            imgs.append(img)
    page = page + 1
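One way to make a loop like this easier to test, and gentler on the server, is to separate URL generation, fetching, and parsing. The outline below is only a sketch under assumptions: <code>page_urls</code>, <code>scrape_pages</code>, the <code>fetch</code> and <code>parse</code> callables, and the delay value are hypothetical names introduced here. In this notebook, <code>fetch</code> could wrap <code>requests.get(url).content</code> and <code>parse</code> could wrap the per-container extraction.

```python
import time

BASE_URL = "https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels"

def page_urls(num_pages):
    """Return the list-page URLs for pages 1..num_pages inclusive."""
    return [f"{BASE_URL}?page={page}" for page in range(1, num_pages + 1)]

def scrape_pages(urls, fetch, parse, delay_seconds=1.0):
    """Fetch each URL, parse it into records, and collect all records.

    `fetch(url)` returns HTML and `parse(html)` returns a list of records;
    both are supplied by the caller (hypothetical interface).
    """
    records = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        records.extend(parse(fetch(url)))
    return records

# Stand-in functions so the example runs without network access.
def fake_fetch(url):
    return url.rsplit("=", 1)[1]  # pretend the "HTML" is just the page number

def fake_parse(html):
    return [f"book from page {html}"]  # pretend each page lists one book

records = scrape_pages(page_urls(3), fake_fetch, fake_parse, delay_seconds=0)
print(records)
```

Passing the fetch and parse steps in as functions keeps the loop itself free of network calls, so it can be exercised with stand-ins like the ones above.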

In [42]:
names

['Pride and Prejudice',
 'Fifty Shades of Grey (Fifty Shades, #1)',
 'Beautiful Disaster (Beautiful, #1)',
 'Twilight (The Twilight Saga, #1)',
 'The Notebook (The Notebook, #1)',
 'Perfect Chemistry (Perfect Chemistry, #1)',
 'Outlander (Outlander, #1)',
 'Jane Eyre',
 'Thoughtless (Thoughtless, #1)',
 'Bared to You (Crossfire, #1)',
 'Easy (Contours of the Heart, #1)',
 'Gone with the Wind',
 "Gabriel's Inferno (Gabriel's Inferno, #1)",
 "The Time Traveler's Wife",
 'Slammed (Slammed, #1)',
 'Anna and the French Kiss (Anna and the French Kiss, #1)',
 'Vampire Academy (Vampire Academy, #1)',
 'A Walk to Remember',
 'Dark Lover (Black Dagger Brotherhood, #1)',
 'Wuthering Heights',
 'Hush, Hush (Hush, Hush, #1)',
 'The Fault in Our Stars',
 'Sense and Sensibility',
 'Persuasion',
 'The Host (The Host, #1)',
 'Divergent (Divergent, #1)',
 'City of Bones (The Mortal Instruments, #1)',
 'Obsidian (Lux, #1)',
 'Love Unscripted (Love, #1)',
 'On the Island (On the Island, #1)',
 'Hopeless (

<h3>Step 6: Convert the lists above into a DataFrame.</h3>

In [43]:
df = pd.DataFrame({
  'book title': names,
  'ratings': ratings,
  'avg_score': avgscores,
  'author': author,
  'score': score,
  'votes': votes,
  'imgs': imgs                   
})
df

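<table>
  <tr><th></th><th>book title</th><th>ratings</th><th>avg_score</th><th>author</th><th>score</th><th>votes</th><th>imgs</th></tr>
  <tr><td>0</td><td>Pride and Prejudice</td><td>3904923</td><td>4.28</td><td>Jane Austen</td><td>237502</td><td>2403</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
  <tr><td>1</td><td>Fifty Shades of Grey (Fifty Shades, #1)</td><td>2415343</td><td>3.66</td><td>E.L. James</td><td>231036</td><td>2343</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
  <tr><td>2</td><td>Beautiful Disaster (Beautiful, #1)</td><td>644651</td><td>4.04</td><td>Jamie McGuire</td><td>216772</td><td>2199</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
  <tr><td>3</td><td>Twilight (The Twilight Saga, #1)</td><td>6114258</td><td>3.64</td><td>Stephenie Meyer</td><td>138962</td><td>1426</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
  <tr><td>4</td><td>The Notebook (The Notebook, #1)</td><td>1586104</td><td>4.14</td><td>Nicholas Sparks</td><td>98597</td><td>1013</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
  <tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
  <tr><td>4994</td><td>Afflicted (Battlescars, #2)</td><td>4217</td><td>4.17</td><td>Sophie Monroe</td><td>20</td><td>1</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
  <tr><td>4995</td><td>Stoned (Wrecked, #1)</td><td>3651</td><td>4.07</td><td>Mandi Beck</td><td>20</td><td>1</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
  <tr><td>4996</td><td>Local Custom (Liaden Universe, #5)</td><td>2584</td><td>4.27</td><td>Sharon Lee</td><td>20</td><td>1</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
  <tr><td>4997</td><td>The Endearment</td><td>4284</td><td>4.04</td><td>LaVyrle Spencer</td><td>20</td><td>1</td><td>https://i.gr-assets.com/images/S/compressed.ph...</td></tr>
</table>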


<h3>Step 7: Save the DataFrame as a CSV file</h3>

In [44]:
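import os
cwd = os.getcwd()
path = os.path.join(cwd, "romance.csv")  # add the .csv extension so the file is recognized as CSV
df.to_csv(path, index=False)             # index=False avoids writing the row index as an extra column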
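When a DataFrame is written with <code>to_csv</code> and read back, the row index is saved as an extra column by default, and that column shows up as <code>Unnamed: 0</code>. Here is a minimal sketch of the difference <code>index=False</code> makes, using an in-memory buffer and a one-row stand-in DataFrame (both are assumptions for illustration, not the scraped data):

```python
import io
import pandas as pd

# One-row stand-in for the scraped DataFrame.
sample = pd.DataFrame({"book title": ["Pride and Prejudice"], "votes": ["2,403"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```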