# Note about the source: 
This script was downloaded from [this repository](https://github.com/SatriaImawan12/Top-Ranked-Gutenberg-Ebooks-Download). 
Minimal modifications and corrections were made, but most of the script remains as is. 

## Top Gutenberg Ebooks (yesterday's ranking) download

### What is Project Gutenberg?
Project Gutenberg is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". It was founded in 1971 by American writer Michael S. Hart and is the **oldest digital library.** This longest-established ebook project releases books that have entered the public domain, which can be freely read or downloaded in various electronic formats.

* **This starter code scrapes Project Gutenberg's Top 100 ebooks page (yesterday's ranking) to identify the ebook links.**
* **It uses BeautifulSoup4 to parse the HTML and a regular expression to identify the Top 100 ebook file numbers.**
* **It includes a function that takes a user-specified number of books to download and then crawls the server to download them into a dictionary object.**
* **Finally, it includes a function to save the downloaded ebooks as text files in a local directory.**

In [1]:
import urllib.request, urllib.parse, urllib.error
from tqdm import tqdm
from bs4 import BeautifulSoup
import ssl
import re
import os

#### Ignore SSL certificate errors

In [2]:
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
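
Disabling the checks as above means the server's identity is not verified. Where the local certificate store is in order, the default context can be used unchanged instead. A minimal sketch; the name `verified_ctx` is hypothetical and not part of the original script:

```python
import ssl

# Hypothetical alternative to `ctx`: keep certificate and hostname checks enabled
verified_ctx = ssl.create_default_context()

# The default context requires a valid certificate and checks the hostname
print(verified_ctx.verify_mode == ssl.CERT_REQUIRED)
print(verified_ctx.check_hostname)
```

Passing `context=verified_ctx` to `urllib.request.urlopen` would then fail on an invalid certificate rather than silently accepting it.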

#### Read the HTML from the URL and pass on to BeautifulSoup

In [3]:
# Read the HTML from the URL and pass on to BeautifulSoup
top100url = 'https://www.gutenberg.org/browse/scores/top'
url = top100url
print(f"Opening the file connection to {url}")
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
print("Connection established and HTML parsed...")

Opening the file connection to https://www.gutenberg.org/browse/scores/top
Connection established and HTML parsed...


#### Find all the `<a>` tags and store their _'href'_ values in the list of links

In [4]:
# Empty list to hold all the http links in the HTML page
lst_links=[]

In [5]:
# Find all the <a> tags and store their href values in the list of links
for link in soup.find_all('a'):
    #print(link.get('href'))
    lst_links.append(link.get('href'))

#### Use a regular expression to find the numeric digits in these links. These are the file numbers for the Top 100 books.

In [6]:
# Use a regular expression to find the numeric digits in these links. These are the file numbers for the Top 100 books.
# Initialize empty list to hold the file numbers
booknum=[]

In [7]:
# Links at indices 19 to 118 in the list of links hold the Top 100 books' numbers.
# These indices are hard-coded and depend on the current layout of the page.
for i in range(19, 119):
    link=lst_links[i]
    link=link.strip()
    # Regular expression to find the numeric digits in the link (href) string
    n=re.findall('[0-9]+',link)
    if len(n)==1:
        # Append the file number cast as an integer
        booknum.append(int(n[0]))

print ("\nThe file numbers for the top 100 ebooks on Gutenberg are shown below\n"+"-"*70)
print(booknum)


The file numbers for the top 100 ebooks on Gutenberg are shown below
----------------------------------------------------------------------
[1, 1, 7, 7, 30, 30, 25558, 84, 2701, 1513, 100, 46, 2641, 145, 37106, 11, 67979, 16389, 1342, 6761, 394, 6593, 2160, 4085, 1259, 5197, 27104, 2542, 174, 25344, 43, 5200, 844, 2554, 64317, 76, 1080, 7700, 1260, 345, 24162, 55, 74822, 98, 28054, 1952, 1232, 1400, 1661, 2650, 16119, 1998, 3207, 2600, 74818, 4300, 5740, 7370, 1184, 4363, 2591, 50150, 23, 408, 2000, 6130, 3296, 74, 24022, 36034, 219, 34901, 205, 768, 135, 35899, 1727, 30508, 1497, 45, 514, 2814, 30254, 2680, 996, 244, 8800, 67098, 10615, 10676, 2852]
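
Slicing the link list by position is fragile: the first entries in the output above (1, 1, 7, 7, 30, 30) look as if they come from links that merely contain digits rather than from book pages. A possible alternative is to filter on the shape of the href itself. The sketch below assumes that links to individual books look like `/ebooks/<number>`; that pattern, the helper `extract_book_ids`, and the sample hrefs are assumptions for illustration, not part of the original script.

```python
import re

# Assumption: links to individual books look like "/ebooks/<number>"
EBOOK_HREF = re.compile(r"^/ebooks/(\d+)$")

def extract_book_ids(hrefs):
    """Return the book ids found in a list of href strings, in page order."""
    ids = []
    for href in hrefs:
        if href is None:  # <a> tags without an href attribute
            continue
        match = EBOOK_HREF.match(href.strip())
        if match:
            ids.append(int(match.group(1)))
    return ids

# Made-up sample hrefs, for illustration only
sample_hrefs = ["#books-last1", "/ebooks/84", None, "/about/", " /ebooks/2701 "]
print(extract_book_ids(sample_hrefs))  # prints [84, 2701]
```

If the page carries more than one ranking, the caller would still need to keep only the entries that belong to yesterday's list, for example `extract_book_ids(lst_links)[:100]` if that list comes first.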


#### Search the text extracted from the soup object (using a regular expression) to find the names of the Top 100 ebooks (yesterday's ranking)

In [8]:
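# Split the page text into lines once, locate the heading line, and take the 100 lines that start two lines after it
lines=soup.text.splitlines()
start_idx=lines.index('Top 100 EBooks yesterday')
lst_titles_temp=[] # Empty list of Ebook names
for i in range(100):
    lst_titles_temp.append(lines[start_idx+2+i])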

In [9]:
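# Use a regular expression to keep only the leading letters and spaces of each name string and append to an empty list
# (names are therefore cut off at the first punctuation mark or digit, as the output below shows)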
lst_titles=[]
for i in range(100):
    id1,id2=re.match('^[a-zA-Z ]*',lst_titles_temp[i]).span()
    lst_titles.append(lst_titles_temp[i][id1:id2])
for l in lst_titles:
    print(l)


Frankenstein
Moby Dick
Romeo and Juliet by William Shakespeare 
The Complete Works of William Shakespeare by William Shakespeare 
A Christmas Carol in Prose
A Room with a View by E
Middlemarch by George Eliot 
Little Women
Alice
The Blue Castle
The Enchanted April by Elizabeth Von Arnim 
Pride and Prejudice by Jane Austen 
The Adventures of Ferdinand Count Fathom 
Cranford by Elizabeth Cleghorn Gaskell 
History of Tom Jones
The Expedition of Humphry Clinker by T
The Adventures of Roderick Random by T
Twenty years after by Alexandre Dumas and Auguste Maquet 
My Life 

A Doll
The Picture of Dorian Gray by Oscar Wilde 
The Scarlet Letter by Nathaniel Hawthorne 
The Strange Case of Dr
Metamorphosis by Franz Kafka 
The Importance of Being Earnest
Crime and Punishment by Fyodor Dostoyevsky 
The Great Gatsby by F
Adventures of Huckleberry Finn by Mark Twain 
A Modest Proposal by Jonathan Swift 
Lysistrata by Aristophanes 
Jane Eyre
Dracula by Bram Stoker 

The Wonderful Wizard of Oz by L
Our
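
Several names above are cut short ("A Room with a View by E", "Alice", "Our") because the pattern `^[a-zA-Z ]*` stops at the first character that is not a letter or a space. One way to keep punctuation is to strip only the trailing count. The sketch below assumes each entry has the form `Title by Author (download count)`; that format, the helper `clean_title`, and the sample line are assumptions for illustration.

```python
import re

# Assumption: each entry looks like "Title by Author (download count)"
TRAILING_COUNT = re.compile(r"\s*\(\d+\)\s*$")

def clean_title(raw_line):
    """Drop a trailing "(count)" and surrounding whitespace, keeping punctuation."""
    return TRAILING_COUNT.sub("", raw_line).strip()

# Made-up sample line in the assumed format
print(clean_title("A Room with a View by E. M. Forster (12345)"))
# prints: A Room with a View by E. M. Forster
```

A line without a trailing count is returned unchanged apart from whitespace trimming.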

### Define a function that takes a user-specified number of top books and crawls the server to download them

In [10]:
def download_top_books(num_download=10, verbosity=0):
    """
    Function: Downloads the top N books from Gutenberg.org, where N is specified by the user.
    Verbosity: If verbosity is turned on (set to 1), prints the download status for every book; otherwise shows a tqdm progress bar.
    Returns: A dictionary where keys are the names of the books and values are the raw text.
    Exception Handling: If a book is not found on the server (broken link or any other URL error), inserts "NOT FOUND" as the text.
    """
    topEBooks = {}

    if num_download<=0:
        print("I guess no download is necessary")
        return topEBooks

    if num_download>100:
        print("You asked for more than 100 downloads.\nUnfortunately, Gutenberg ranks only top 100 books.\nProceeding to download top 100 books.")
        num_download=100

    # Base URL for files repository
    baseurl= 'http://www.gutenberg.org/files/'

    if verbosity==1:
        count_done=0
        for i in range(num_download):
            print ("Working on book:", lst_titles[i])

            # Create the proper download link (url) from the book id
            # You have to examine the Gutenberg.org file structure carefully to come up with the proper url
            bookid=booknum[i]
            bookurl= baseurl+str(bookid)+'/'+str(bookid)+'-0.txt'
            # Create a file handler object
            try:
                fhand = urllib.request.urlopen(bookurl)
                txt_dump = ''
                # Iterate over the lines in the file handler object and dump the data into the text string
                for line in fhand:
                    # Use decode method to convert the UTF-8 to Unicode string
                    txt_dump+=line.decode()
                # Add downloaded text to the dictionary with keys matching the list of book titles.
                # This puts the raw text as the value of the key of the dictionary bearing the name of the Ebook
                topEBooks[lst_titles[i]]=txt_dump
                count_done+=1
                print (f"Finished downloading {round(100*count_done/num_download,2)}%")
            except urllib.error.URLError as e:
                topEBooks[lst_titles[i]]="NOT FOUND"
                count_done+=1
                print(f"**ERROR: {lst_titles[i]} {e.reason}**")
    else:
        count_done=0
        for i in tqdm(range(num_download),desc='Download % completed',dynamic_ncols=True):
            # Create the proper download link (url) from the book id
            # You have to examine the Gutenberg.org file structure carefully to come up with the proper url
            bookid=booknum[i]
            bookurl= baseurl+str(bookid)+'/'+str(bookid)+'-0.txt'
            # Create a file handler object
            try:
                fhand = urllib.request.urlopen(bookurl)
                txt_dump = ''
                # Iterate over the lines in the file handler object and dump the data into the text string
                for line in fhand:
                    # Use decode method to convert the UTF-8 to Unicode string
                    txt_dump+=line.decode()
                # Add downloaded text to the dictionary with keys matching the list of book titles.
                # This puts the raw text as the value of the key of the dictionary bearing the name of the Ebook
                topEBooks[lst_titles[i]]=txt_dump
                count_done+=1
            except urllib.error.URLError as e:
                topEBooks[lst_titles[i]]="NOT FOUND"
                count_done+=1
                print(f"**ERROR: {lst_titles[i]} {e.reason}**")

    print ("-"*40+"\nFinished downloading all books!\n"+"-"*40)

    return (topEBooks)
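
The two branches of `download_top_books` differ only in how progress is reported, so the URL construction and the download loop could be factored out. The sketch below is one possible arrangement, not the original author's code: `build_book_url`, `fetch_text`, `download_books`, and the `fetch` parameter are hypothetical names, and the URL layout is simply the one used by `baseurl` above.

```python
import urllib.error
import urllib.request

BASE_URL = "http://www.gutenberg.org/files/"  # same value as baseurl above

def build_book_url(bookid):
    """Build the plain-text URL for a book id, e.g. .../files/84/84-0.txt"""
    return f"{BASE_URL}{bookid}/{bookid}-0.txt"

def fetch_text(url):
    """Download a URL and decode the body as UTF-8 (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def download_books(titles, ids, fetch=fetch_text):
    """Return {title: text}, storing "NOT FOUND" when a download fails."""
    books = {}
    for title, bookid in zip(titles, ids):
        try:
            books[title] = fetch(build_book_url(bookid))
        except urllib.error.URLError:
            books[title] = "NOT FOUND"
    return books
```

Accepting the fetch function as a parameter lets the loop be exercised without network access, for instance by passing a stub that returns canned text. A call such as `download_books(lst_titles[:5], booknum[:5])` would use the real downloader.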

#### Test the function with verbosity=0 (default)

In [11]:
dict_books=download_top_books(1)

Download % completed: 100%|██████████| 1/1 [00:00<00:00,  2.23it/s]

----------------------------------------
Finished downloading all books!
----------------------------------------





#### Show the final dictionary and an example of the downloaded text

In [12]:
print(dict_books[lst_titles[0]][:1500])




     NOTE:  This file combines the first two Project Gutenberg
     files, both of which were given the filenumber #1. There are
     several duplicate files here. There were many updates over
     the years.  All of the original files are included in the
     "old" subdirectory which may be accessed under the "More
     Files" listing in the PG Catalog of this file. No changes
     have been made in these original etexts.



**Welcome To The World of Free Plain Vanilla Electronic Texts**

**Etexts Readable By Both Humans and By Computers, Since 1971**

*These Etexts Prepared By Hundreds of Volunteers and Donations*

Below you will find the first nine Project Gutenberg Etexts, in
one file, with one header for the entire file.  This is to keep
the overhead down, and in response to requests from Gopher site
keeper to eliminate as much of the headers as possible.

However, for legal and financial reasons, we must request these
headers be left at the beginning o

### Write a function to download the texts and save them to disk

In [15]:
def save_text_files(num_download=10, verbosity=1):
    """
    Downloads top N books from Gutenberg.org where N is specified by user.
    If verbosity is turned on (set to 1) then prints the downloading status for every book.
    Saves the downloaded ebooks to a hard-coded directory (./data/raw/gutenberg_data/).
    Prints a status message indicating how many ebooks were successfully downloaded and saved.
    """

    # Download the Ebooks and save in a dictionary object (in-memory)
    dict_books=download_top_books(num_download=num_download,verbosity=verbosity)

    if dict_books=={}:
        return None

    # Save location (directory path); hard-coded here rather than asked from the user
    savelocation="./data/raw/gutenberg_data/"

    count_successful_download=0

    # Fall back to an 'Ebooks' folder in the current working directory if the location is blank.
    # The trailing empty component makes os.path.join end the path with a separator.
    if (len(savelocation)<1):
        savelocation=os.path.join(os.getcwd(),'Ebooks','')
    # Create the directory (and any missing parents) if it does not exist. Otherwise, just use the existing path.
    os.makedirs(savelocation,exist_ok=True)
    #print("Saving files at:",savelocation)
    for k,v in dict_books.items():
        if (v!="NOT FOUND"):
            filename=savelocation+str(k)+'.txt'
            file=open(filename,'wb')
            file.write(v.encode("UTF-8",'ignore'))
            file.close()
            count_successful_download+=1

    # Status message
    print (f"{count_successful_download} book(s) was/were successfully downloaded and saved to the location {savelocation}")
    if (num_download!=count_successful_download):
        print(f"{num_download-count_successful_download} books were not found on the server!")

In [None]:
save_text_files(90, verbosity=1)
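
Because the book names are used directly as file names, a name that is empty or contains characters such as `/` or `:` could produce an invalid or surprising path. The sketch below shows one way to guard against that with `pathlib`; `safe_filename`, `save_books`, and the `"untitled"` fallback are hypothetical and not part of the original script.

```python
import re
from pathlib import Path

def safe_filename(title, fallback="untitled"):
    """Replace characters that are awkward in file names and trim whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    return cleaned or fallback

def save_books(books, directory):
    """Write every found book to <directory>/<title>.txt and return the count."""
    target = Path(directory)
    # Create the directory and any missing parents; do nothing if it exists
    target.mkdir(parents=True, exist_ok=True)
    saved = 0
    for title, text in books.items():
        if text == "NOT FOUND":
            continue
        (target / f"{safe_filename(title)}.txt").write_text(text, encoding="utf-8")
        saved += 1
    return saved
```

With the dictionary from earlier, `save_books(dict_books, "./data/raw/gutenberg_data/")` would write one file per downloaded book. Note that several empty names would all map to the same fallback file, so a real implementation might append the book number.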