# Extracting the Top 100 e-books from Gutenberg

Import the necessary libraries, including regex and BeautifulSoup

In [1]:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re

Read the HTML from the URL

In [2]:
top100url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(top100url)

Write a small function to check the status of the web request

In [3]:
def status_check(r):
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

In [4]:
status_check(response)

Success!


1
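
The check above treats only status 200 as success. As a minimal sketch, assuming any 2xx code should count as success, the same idea can be written against the bare status code. The name is_success is a hypothetical helper, not part of requests.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code (assumed definition of success)."""
    return 200 <= status_code < 300

# Try the helper on a few representative codes
for code in (200, 204, 404, 500):
    print(code, "Success!" if is_success(code) else "Failed!")
```

With a real requests response, response.ok and response.raise_for_status() offer built-in checks along the same lines.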

Decode the response and pass this on to BeautifulSoup for HTML parsing

In [5]:
contents = response.content.decode(response.encoding)
soup = BeautifulSoup(contents, 'html.parser')
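
response.encoding can be None when the server declares no character set, and passing None to decode raises a TypeError. The sketch below shows one possible fallback, assuming UTF-8 is a reasonable default for the page. The name decode_body is hypothetical and the sample bytes are made up for illustration.

```python
def decode_body(raw, encoding=None, fallback="utf-8"):
    """Decode raw bytes, using `fallback` when no encoding was declared."""
    # errors="replace" substitutes undecodable bytes instead of raising
    return raw.decode(encoding or fallback, errors="replace")

sample = "Top 100 | Project Gutenberg".encode("utf-8")
print(decode_body(sample, None))      # falls back to utf-8
print(decode_body(sample, "utf-8"))   # uses the declared encoding
```

In the notebook this would be called as decode_body(response.content, response.encoding). Alternatively, response.text lets requests handle the decoding itself.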

Find all the anchor (a) tags and store their href attribute values in a list of links. Check what the list looks like by printing the first 30 elements

In [6]:
# Empty list to hold all the links (href values) found in the HTML page
lst_links = []
# Find all the anchor tags and store their href values in the list of links
for link in soup.find_all('a'):
    lst_links.append(link.get('href'))

In [7]:
lst_links[:30]

['/',
 '/about/',
 '/about/',
 '/policy/collection_development.html',
 '/about/contact_information.html',
 '/about/background/',
 '/policy/permission.html',
 '/policy/privacy_policy.html',
 '/policy/terms_of_use.html',
 '/ebooks/',
 '/ebooks/',
 '/ebooks/bookshelf/',
 '/browse/scores/top',
 '/ebooks/offline_catalogs.html',
 '/help/',
 '/help/',
 '/help/copyright.html',
 '/help/errata.html',
 '/help/file_formats.html',
 '/help/faq.html',
 '/policy/',
 '/help/public_domain_ebook_submission.html',
 '/help/submitting_your_own_work.html',
 '/help/mobile.html',
 '/attic/',
 '/donate/',
 '/donate/',
 '#books-last1',
 '#authors-last1',
 '#books-last7']

Use a regular expression to find the numeric digits in these links. These are the file numbers for the top 100 eBooks

In [8]:
booknum = []
for i in range(19,119):
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the runs of digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast to an integer
        booknum.append(int(n[0]))

In [9]:
print("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70)
print(booknum)


The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1, 1, 7, 7, 30, 30, 1342, 11, 84, 1661, 2701, 98, 64317, 174, 65790, 2600, 1952, 996, 345, 1232, 5200, 2591, 4300, 1260, 43, 46, 63256, 6133, 45, 16, 2542, 2554, 1400, 65789, 205, 58585, 19924, 1080, 844, 65786, 120, 55, 5740, 74, 6130, 1184, 30254, 514, 219, 5739, 27827, 65791, 2852, 1497, 829, 26184, 135, 203, 244, 20228, 76, 160, 65787, 768, 158, 1998, 902, 833, 113, 16328, 863, 1727, 521, 1399, 28054, 236, 408, 2500, 35, 42108, 3600, 766, 132, 2680, 730, 36, 25344, 33283, 20203, 38658, 2814, 105]
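
The slice range(19,119) depends on where the links happen to sit in the page, and fragment links such as '#books-last1' also contain digits, which appears to explain the repeated 1, 7, and 30 at the start of the list. The sketch below filters by pattern instead of position. It assumes that book links have the form /ebooks/ followed by a number, which is an assumption about the page markup. The name extract_book_ids is hypothetical and the sample list is made up for illustration.

```python
import re

def extract_book_ids(hrefs):
    """Return the integer IDs from links shaped like /ebooks/<digits>."""
    pattern = re.compile(r"^/ebooks/(\d+)$")
    ids = []
    for href in hrefs:
        if not href:              # skip anchors that have no href attribute
            continue
        m = pattern.match(href.strip())
        if m:
            ids.append(int(m.group(1)))
    return ids

sample_links = ['/help/faq.html', '#books-last1', '#authors-last7',
                '/ebooks/1342', '/ebooks/11', None, '/ebooks/bookshelf/']
print(extract_book_ids(sample_links))  # [1342, 11]
```

Because the page lists several rankings, the same ID may appear more than once. Slicing the result to the first 100 entries is one way to keep only the first ranking.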


What does the soup object's text look like? Use the .text attribute and print only the first 2,000 characters (do not print the whole thing, as it is too long)

In [10]:
print(soup.text[:2000])





Top 100 | Project Gutenberg



























Menu▾



About
          ▾

▾


About Project Gutenberg
Collection Development
Contact Us
History & Philosophy
Permissions & License
Privacy Policy
Terms of Use



Search and Browse
      	  ▾

▾


Book Search
Bookshelves
Frequently Downloaded
Offline Catalogs



Help
          ▾

▾


All help topics →
Copyright Procedures
Errata, Fixes and Bug Reports
File Formats
Frequently Asked Questions
Policies →
Public Domain eBook Submission
Submitting Your Own Work
Tablets, Phones and eReaders
The Attic →


Donate










Donation







Frequently Viewed or Downloaded
These listings are based on the number of times each eBook gets downloaded.
      Multiple downloads from the same Internet address on the same day count as one download, and addresses that download more than 100 eBooks in a day are considered robots and are not counted.

Downloaded Books
2021-07-08133427
last 7 days942771
last 30 days4078055



Top 100 EBooks yesterda

Search the text extracted from the soup object (using a regular expression) to find the names of the top 100 eBooks (yesterday's ranking)

In [11]:
lst_titles_temp = []

Create a starting index. It should point at the line Top 100 EBooks yesterday. Use the splitlines method of soup.text, which splits the text of the soup object into a list of lines

In [14]:
start_idx = soup.text.splitlines().index('Top 100 EBooks yesterday')

Loop 100 times to add the next 100 lines of text, starting two lines below the starting index, to this temporary list

In [15]:
# Split the text once rather than on every iteration
lines = soup.text.splitlines()
for i in range(100):
    lst_titles_temp.append(lines[start_idx+2+i])
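
The heading-then-offset lookup described above can be sketched on a small made-up string. The name lines_after is hypothetical, and the default offset of 2 mirrors the start_idx+2 used in the loop.

```python
def lines_after(text, marker, count, offset=2):
    """Return `count` lines starting `offset` lines below the first line equal to `marker`."""
    lines = text.splitlines()
    start = lines.index(marker) + offset   # raises ValueError if marker is missing
    return lines[start:start + count]

# Made-up text: a heading, one blank line, then the lines of interest
sample_text = ("Header\nTop 100 EBooks yesterday\n\n"
               "First Title (10)\nSecond Title (9)\nFooter")
print(lines_after(sample_text, 'Top 100 EBooks yesterday', 2))
```

Because list.index requires an exact match, the marker must equal the whole line, including any surrounding whitespace.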

Use a regular expression to extract only the leading letters and spaces from each name string and append the result to a new list. Use match and span to get the start and end indices of the match, then slice the string with them

In [16]:
lst_titles=[]
for i in range(100):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])
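
The pattern ^[a-zA-Z ]* stops at the first character that is not an ASCII letter or a space, so a title containing an apostrophe, comma, or accented letter is cut short. The sketch below contrasts that behaviour with one possible alternative. It assumes each line ends with a download count in parentheses, which is an assumption about the page text. Both function names are hypothetical, and the count in the sample string is made up.

```python
import re

def leading_letters(s):
    """Return the leading run of ASCII letters and spaces, as the loop above does."""
    # The * quantifier allows an empty match, so re.match never returns None here
    start, end = re.match(r'^[a-zA-Z ]*', s).span()
    return s[start:end]

def strip_trailing_count(s):
    """Drop a trailing '(digits)' group, keeping punctuation inside the title."""
    return re.sub(r'\s*\(\d+\)\s*$', '', s)

sample = "Alice's Adventures in Wonderland by Lewis Carroll (100)"
print(leading_letters(sample))        # Alice
print(strip_trailing_count(sample))   # Alice's Adventures in Wonderland by Lewis Carroll
```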

Print the list of titles

In [17]:
for l in lst_titles:
    print(l)