### Author: KR
### Date: 02/12/2024

### URL Scraping

1 - Import the necessary libraries, including re (regular expressions), BeautifulSoup, and requests.

In [1]:
# Import the libraries
import re
from bs4 import BeautifulSoup
import requests

2 - Check the SSL certificate.
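No cell is shown for this step. As a minimal sketch, assuming the goal is only to confirm that certificate verification is active, the standard-library ssl module can build a default context and report its settings. Note that requests.get verifies certificates by default, so this check is informational rather than required.

```python
import ssl

# Build the default client-side context; it enables certificate
# verification and hostname checking.
ctx = ssl.create_default_context()

cert_required = ctx.verify_mode == ssl.CERT_REQUIRED
print("Certificate verification required:", cert_required)
print("Hostname checking enabled:", ctx.check_hostname)
```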

3 - Read the HTML from the URL.

In [3]:
url = 'https://www.gutenberg.org/browse/scores/top'
response = requests.get(url)

4 - Write a small function to check the status of the web request.

In [4]:
def check_status(r):
    if r.status_code == 200:
        print("Success! Status code: 200")
        return True
    else:
        print(f"Failed! Status code: {r.status_code}")
        return False

In [5]:
if check_status(response):
    print("Proceed with parsing the response.")
else:
    print("Check the URL or your network connection and try again.")

Success! Status code: 200
Proceed with parsing the response.
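The check above accepts only status code 200. A slightly more general variant is sketched below, assuming any 2xx code should count as success. The helper name is_success and the stand-in response objects are hypothetical; they are used so the example runs without a network call.

```python
from types import SimpleNamespace

def is_success(r):
    """Return True for any 2xx status code (hypothetical helper)."""
    return 200 <= r.status_code < 300

# Stand-in objects that only mimic the status_code attribute of a response
ok_response = SimpleNamespace(status_code=200)
missing_response = SimpleNamespace(status_code=404)

print(is_success(ok_response))
print(is_success(missing_response))
```

With requests, the Response object also exposes response.ok, which is True for status codes below 400 and may be enough for a quick check.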


5 - Decode the response and pass this on to BeautifulSoup for HTML parsing.

In [6]:
if response.status_code == 200:
    html_content = response.content.decode('utf-8')
    soup = BeautifulSoup(html_content, 'html.parser')
    print("HTML parsing successful!")
else:
    print("Failed to retrieve or decode the HTML.")

HTML parsing successful!


6 - Find all the anchor tags that have an href attribute and store their href values in a list of links. Check what the list looks like by printing the first 30 elements.

In [7]:
links = [a.get('href') for a in soup.find_all('a', href=True)]

In [8]:
print("First 30 links found:")
for link in links[:30]:
    print(link)

First 30 links found:
/
/about/
/about/
/policy/collection_development.html
/about/contact_information.html
/about/background/
/policy/permission.html
/policy/privacy_policy.html
/policy/terms_of_use.html
/ebooks/
/ebooks/
/ebooks/bookshelf/
/browse/scores/top
/ebooks/offline_catalogs.html
/help/
/help/
/help/copyright.html
/help/errata.html
/help/file_formats.html
/help/faq.html
/policy/
/help/public_domain_ebook_submission.html
/help/submitting_your_own_work.html
/help/mobile.html
/attic/
/donate/
/donate/
#books-last1
#authors-last1
#books-last7
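The links printed above are relative paths and page fragments, not full URLs. If absolute URLs are needed later, one option (a sketch, not part of the original steps) is urljoin from the standard library. The sample list below reuses paths from the output, plus /ebooks/84, which is assumed from the first file number found in step 8.

```python
from urllib.parse import urljoin

base_url = 'https://www.gutenberg.org/browse/scores/top'

# A few hrefs in the same shape as the scraped links
sample_links = ['/about/', '/ebooks/84', '#books-last1']

# Resolve each relative href against the page it was found on
absolute_links = [urljoin(base_url, href) for href in sample_links]

for link in absolute_links:
    print(link)
```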


7 - Use a regular expression to find the numeric digits in these links. These are the file numbers for the top 100 eBooks.

In [9]:
pattern = r'/ebooks/(\d+)'

8 - Initialize an empty list to hold the file numbers, then loop over the links and use the regex findall method to extract the numeric digits from each href string.

In [10]:
# Initialize an empty list to hold the file numbers
file_numbers = []

for link in links:
    found = re.findall(pattern, link)
    if found:
        file_numbers.extend(found)

In [11]:
print("File numbers found:", file_numbers[:100])

File numbers found: ['84', '1342', '2701', '1513', '145', '2641', '100', '37106', '16389', '67979', '394', '6761', '2160', '6593', '4085', '5197', '1259', '64317', '11', '72907', '31591', '844', '174', '1952', '2542', '5200', '98', '1080', '2554', '345', '72909', '76', '25344', '1400', '43', '1260', '72906', '72910', '58585', '28054', '72908', '46', '1661', '6130', '408', '4300', '219', '55387', '2591', '72911', '2000', '1727', '3207', '2600', '5740', '1232', '72904', '36034', '768', '2814', '74', '996', '1998', '23', '15399', '1497', '205', '514', '30254', '1184', '4363', '45', '8800', '16', '41445', '42324', '730', '27827', '72913', '2852', '7370', '55', '2680', '16328', '72914', '158', '67098', '72902', '120', '8492', '72901', '600', '829', '244', '161', '35', '10', '5827', '72916', '3296']
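To see how the pattern behaves on individual links, here is a small self-contained sketch. The sample links are illustrative: the first two and the last come from the link list in step 6, and the /ebooks/84 and /ebooks/1342 paths are assumed from the file numbers printed above. Because findall returns strings, the sketch also converts the results to integers.

```python
import re

pattern = r'/ebooks/(\d+)'

sample_links = [
    '/ebooks/',            # no digits, so no match
    '/ebooks/bookshelf/',  # no digits, so no match
    '/ebooks/84',
    '/ebooks/1342',
    '#books-last1',        # digit present, but not after /ebooks/
]

sample_numbers = []
for link in sample_links:
    # findall returns the captured group (the digits) for each match
    sample_numbers.extend(re.findall(pattern, link))

# The captured values are strings; convert them if integers are needed
sample_numbers_int = [int(n) for n in sample_numbers]

print(sample_numbers)
print(sample_numbers_int)
```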


9 - What does the soup object's text look like? Use the .text attribute and print only the first 2,000 characters (the full text is too long to print).

In [12]:
soup_text = soup.text
print(soup_text[:2000])





Top 100 | Project Gutenberg



























Menu▾



About
          ▾

▾


About Project Gutenberg
Collection Development
Contact Us
History & Philosophy
Permissions & License
Privacy Policy
Terms of Use



Search and Browse
      	  ▾

▾


Book Search
Bookshelves
Frequently Downloaded
Offline Catalogs



Help
          ▾

▾


All help topics →
Copyright How-To
Errata, Fixes and Bug Reports
File Formats
Frequently Asked Questions
Policies →
Public Domain eBook Submission
Submitting Your Own Work
Tablets, Phones and eReaders
The Attic →


Donate










Donation







Frequently Viewed or Downloaded
These listings are based on the number of times each eBook gets downloaded.
      Multiple downloads from the same Internet address on the same day count as one download, and addresses that download more than 100 eBooks in a day are considered robots and are not counted.

Downloaded Books
2024-02-09287760
last 7 days2236551
last 30 days7154714



Top 100 EBooks yesterday
T

10 - Search the text extracted from the soup object (using a regular expression) to find the names of the top 100 eBooks (yesterday's ranking).

11 - Create a starting index. It should point at the line Top 100 EBooks yesterday. Use the splitlines method of soup.text, which splits the text of the soup object into a list of lines.

In [15]:
ebook_list_temp = []
start_index = soup.text.splitlines().index('Top 100 EBooks yesterday')

12 - Loop 1-100 to add the strings of the next 100 lines to this temporary list. Hint: use the splitlines method.

In [16]:
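# Split the page text into lines once, then copy the 100 lines after the heading
lines = soup.text.splitlines()
for i in range(start_index + 1, start_index + 101):
    ebook_list_temp.append(lines[i])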

In [18]:
print("First 5 ebook entries in temporary list:")
for ebook_entry in ebook_list_temp[:5]:
    print(ebook_entry)

First 5 ebook entries in temporary list:
Top 100 Authors yesterday
Top 100 EBooks last 7 days
Top 100 Authors last 7 days
Top 100 EBooks last 30 days
Top 100 Authors last 30 days
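The first entries printed above are section headings rather than book names. A likely reason is that list.index returns the first occurrence of the heading text, so the lines that follow it can include other headings and blank lines before the book entries start. The sketch below reproduces the idea on a small made-up text; the numbers in parentheses are placeholders, not real values from the page.

```python
# Made-up text in the same shape as the page text
sample_text = """Downloaded Books
Top 100 EBooks yesterday
Top 100 Authors yesterday

Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (1111)
Pride and Prejudice by Jane Austen (2222)
"""

sample_lines = sample_text.splitlines()

# index() returns the position of the first matching line
sample_start = sample_lines.index('Top 100 EBooks yesterday')

# A slice collects the following lines and does not raise an error
# if fewer lines are available than requested
next_lines = sample_lines[sample_start + 1:sample_start + 4]

for line in next_lines:
    print(repr(line))
```

The heading and blank lines are harmless here because the regular expression in the next step keeps only entries that match the expected format.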


13 - Use a regular expression to extract only the text (title and author) from each name string, dropping the trailing number in parentheses, and append it to an empty list. Use match with a capturing group (or span) to locate the text.

In [26]:
clean_ebook_titles = []
pattern = re.compile(r'^(.+?)\s+\(\d+\)$')

for entry in ebook_list_temp:
    match = pattern.match(entry)
    if match:
        title_and_author = match.group(1)
        clean_ebook_titles.append(title_and_author)

In [27]:
print("Extracted eBook titles and authors:")
for title in clean_ebook_titles[:10]:
    print(title)

Extracted eBook titles and authors:
Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley
Pride and Prejudice by Jane Austen
Moby Dick; Or, The Whale by Herman Melville
Romeo and Juliet by William Shakespeare
Middlemarch by George Eliot
A Room with a View by E. M.  Forster
The Complete Works of William Shakespeare by William Shakespeare
Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott
The Enchanted April by Elizabeth Von Arnim
The Blue Castle: a novel by L. M.  Montgomery
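The cell above reads the text with match.group(1). An equivalent sketch using match.span, as the step suggests, is shown below on a few sample entries. The numbers in parentheses are placeholders. Splitting each result into title and author with rpartition is an extra illustration that assumes the last " by " in the string separates the two.

```python
import re

pattern = re.compile(r'^(.+?)\s+\(\d+\)$')

# Sample entries; the numbers in parentheses are made up
sample_entries = [
    'Top 100 Authors yesterday',  # no trailing number, so it is skipped
    'Pride and Prejudice by Jane Austen (2222)',
    'Middlemarch by George Eliot (3333)',
]

titles = []
for entry in sample_entries:
    match = pattern.match(entry)
    if match:
        # span(1) gives the start and end indices of the captured text
        start, end = match.span(1)
        titles.append(entry[start:end])

# Split "Title by Author" at the last " by " (assumption stated above)
pairs = []
for text in titles:
    title, _, author = text.rpartition(' by ')
    pairs.append((title, author))

print(titles)
print(pairs)
```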
