# Lesson 7 Activity 1: Extracting the Top 100 ebook names from Gutenberg.org

## What is Project Gutenberg?
Project Gutenberg is a volunteer effort to digitize and archive cultural works and to "encourage the creation and distribution of eBooks". It was founded in 1971 by American writer Michael S. Hart and is the **oldest digital library**. This longest-established ebook project releases books that have entered the public domain, which can be freely read or downloaded in various electronic formats.

## What is this activity all about?
* **This activity scrapes Project Gutenberg's Top 100 ebooks page (yesterday's ranking) to identify the ebook links.**
* **It uses BeautifulSoup4 to parse the HTML and regular expressions to identify the Top 100 ebook file numbers.**
* **You can use those book ID numbers to download the books to your local drive if you want.**
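
As a small illustration of the last point, a book ID can be turned back into a catalog link. This is a minimal sketch that assumes the `/ebooks/<id>` path pattern seen in the links scraped later in this activity. The helper name `ebook_page_url` is hypothetical, and nothing is downloaded.

```python
from urllib.parse import urljoin

# Site root, taken from the URL used in this activity
BASE_URL = "https://www.gutenberg.org"

def ebook_page_url(book_id):
    """Build the catalog page URL for a book ID (hypothetical helper)."""
    return urljoin(BASE_URL, f"/ebooks/{book_id}")

# 1342 is one of the IDs that appears in the scraped links
print(ebook_page_url(1342))
```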

### Import the necessary libraries, including regex and BeautifulSoup

In [1]:
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
import ssl
import re

### Ignore SSL errors (this code is given)

In [2]:
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
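
As far as the code in this activity shows, `ctx` is never handed to `requests`. A context like this is normally passed to `urllib.request.urlopen` through its `context` parameter. The sketch below shows that pattern. The helper names `make_unverified_context` and `fetch_html` are hypothetical, and `fetch_html` is only defined here, not called, so the block runs without network access.

```python
import ssl
import urllib.request

def make_unverified_context():
    """Create an SSL context that skips certificate checks (hypothetical helper)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode is relaxed
    ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch_html(url, ctx):
    """Read the raw bytes of a page, passing the SSL context to urlopen."""
    with urllib.request.urlopen(url, context=ctx) as resp:
        return resp.read()

ctx = make_unverified_context()
print(ctx.check_hostname, ctx.verify_mode == ssl.CERT_NONE)
```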

### Read the HTML from the URL

In [3]:
url = "https://www.gutenberg.org/browse/scores/top"
response = requests.get(url)

### Write a small function to check the status of the web request

In [6]:
def status(response):
    if response.status_code == 200:
        print("Loaded")
    else:
        print("There was a problem")

In [7]:
status(response)

Loaded
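
A variant that returns a boolean instead of printing can be easier to reuse in later steps. The sketch below is one possible shape. The name `is_ok` is hypothetical, and the stand-in objects mimic only the `status_code` attribute of a real response so the example runs offline.

```python
from types import SimpleNamespace

def is_ok(response):
    """Return True when the HTTP status code is 200 (hypothetical helper)."""
    return response.status_code == 200

# Stand-in objects that carry only a `status_code`, like a real response does
fake_ok = SimpleNamespace(status_code=200)
fake_missing = SimpleNamespace(status_code=404)

for r in (fake_ok, fake_missing):
    if is_ok(r):
        print("Loaded")
    else:
        print(f"There was a problem (status {r.status_code})")
```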


### Decode the response and pass on to `BeautifulSoup` for HTML parsing

In [8]:
# Fall back to UTF-8 if the server did not report an encoding
decode = response.content.decode(response.encoding or "utf-8")

In [9]:
soup = BeautifulSoup(decode, "html.parser")

### Find the _href_ attribute of every anchor tag and store the values in a list of links. Check what the list looks like by printing the first 30 elements

In [15]:
tags = []
for tag in soup.find_all("a"):
    tags.append(tag.get("href"))

In [16]:
tags[:30]

['/wiki/Main_Page',
 '/catalog/',
 '/ebooks/',
 '/browse/recent/last1',
 '/browse/scores/top',
 '/wiki/Gutenberg:Offline_Catalogs',
 '/catalog/world/mybookmarks',
 '/wiki/Main_Page',
 'https://www.paypal.com/xclick/business=donate%40gutenberg.org&item_name=Donation+to+Project+Gutenberg',
 '/wiki/Gutenberg:Project_Gutenberg_Needs_Your_Donation',
 'http://www.ibiblio.org',
 'http://www.pgdp.net/',
 'pretty-pictures',
 '#books-last1',
 '#authors-last1',
 '#books-last7',
 '#authors-last7',
 '#books-last30',
 '#authors-last30',
 '/ebooks/1342',
 '/ebooks/11',
 '/ebooks/2701',
 '/ebooks/1661',
 '/ebooks/1635',
 '/ebooks/25525',
 '/ebooks/1952',
 '/ebooks/1080',
 '/ebooks/2542',
 '/ebooks/74',
 '/ebooks/98']
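
`tag.get("href")` returns `None` for an anchor that has no `href` attribute, and a later call such as `.strip()` would fail on that value. The sketch below shows one way to skip such anchors by passing `href=True` to `find_all`. The HTML snippet is a hypothetical miniature page, not the real Top 100 markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the real HTML
sample_html = """
<a href="/ebooks/1342">Book A</a>
<a name="books-last1">Anchor without an href</a>
<a href="#books-last7">Jump link</a>
<a href="/ebooks/11">Book B</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(hrefs)
```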

### Use a regular expression to find the numeric digits in these links. <br>These are the file numbers for the Top 100 books.

#### Initialize empty list to hold the file numbers

In [19]:
files = []

* Elements 19 to 118 (zero-based indices) of the original list of links hold the Top 100 ebooks' numbers.
* Loop over the appropriate range and use regex to find the numeric digits in the link (href) string.
* Hint: Use the `findall()` method

In [20]:
for i in range(19,119):
    tag = tags[i]
    tag = tag.strip()
    # Look for runs of digits (0-9) in the link
    file = re.findall("[0-9]+", tag)
    if len(file) == 1:
        files.append(int(file[0]))
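
The hard-coded range 19 to 118 depends on the page layout staying the same. An alternative sketch, under the assumption that book links always have the form `/ebooks/<digits>`, is to match that pattern directly. The sample list holds a few hrefs copied from the output printed earlier, and the names `EBOOK_LINK` and `book_ids` are hypothetical.

```python
import re

# A few hrefs copied from the list printed earlier
sample_tags = [
    "/wiki/Main_Page",
    "/ebooks/",
    "#books-last1",
    "/ebooks/1342",
    "/ebooks/11",
    "http://www.pgdp.net/",
    "/ebooks/2701",
]

# Match only links that consist of /ebooks/ followed by digits
EBOOK_LINK = re.compile(r"^/ebooks/(\d+)$")

book_ids = []
for href in sample_tags:
    m = EBOOK_LINK.match(href)
    if m:
        book_ids.append(int(m.group(1)))

print(book_ids)
```

On the real page this pattern may also match book links from other ranking sections, so keeping only the first 100 matches would likely still be needed.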

#### Print the file numbers

In [21]:
files

[1342,
 11,
 2701,
 1661,
 1635,
 25525,
 1952,
 1080,
 2542,
 74,
 98,
 46,
 84,
 205,
 2600,
 2591,
 5200,
 76,
 43,
 844,
 514,
 1184,
 14975,
 120,
 4300,
 345,
 174,
 1727,
 1232,
 58975,
 158,
 6130,
 62678,
 408,
 42108,
 203,
 1400,
 1260,
 1497,
 16328,
 16,
 2554,
 376,
 28054,
 45,
 58585,
 5740,
 996,
 4363,
 27827,
 1998,
 19033,
 219,
 244,
 135,
 730,
 3600,
 2852,
 55,
 62677,
 36,
 62682,
 33283,
 2680,
 4507,
 20738,
 62668,
 113,
 2814,
 160,
 863,
 768,
 766,
 215,
 2097,
 62680,
 147,
 8800,
 1250,
 25344,
 62669,
 1399,
 3207,
 23,
 2500,
 3296,
 3090,
 521,
 43453,
 20203,
 148,
 3825,
 30254,
 829,
 100,
 236,
 25717,
 1228,
 161,
 61]

### What does the `soup` object's text look like? Use the `.text` attribute and print only the first 2000 characters (i.e. do not print the whole thing, it is long).

You will notice a lot of empty spaces/blanks here and there. Ignore them. They are part of the HTML page markup and its whimsical nature!

In [22]:
soup.text[:2000]

"\n\n\n\n\n\n\n\n\n      if (top != self) {\n        top.location.replace ('http://www.gutenberg.org');\n        alert ('Project Gutenberg is a FREE service with NO membership required. If you paid somebody else to get here, make them give you your money back!');\n      }\n    \n \nTop 100 - Project Gutenberg\n\n\n\n\n\n\n\n\nOnline Book Catalog\n=> \n\n\n\n Book  Search\n-- Recent  Books\n-- Top  100\n-- Offline Catalogs\n-- My Bookmarks\n\n\nMain Page\n\n\n\n\nProject Gutenberg needs your donation! \n        More Info\n\n\n\n\n\n\n\n\nDid you know that you can help us produce ebooks\nby proof-reading just one page a day?\nGo to: Distributed Proofreaders\n\n\n\nTop 100\n\n\nTo determine the ranking we count the times each file gets downloaded.\nBoth HTTP and FTP transfers are counted.\nOnly transfers from ibiblio.org are counted as we have no access to our mirrors log files.\nMultiple downloads from the same IP address on the same day count as one download.\nIP addresses that download

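If the blank lines get in the way while exploring, they can be filtered out. The sketch below works on an abbreviated excerpt of the text shown above rather than on the live page. The names `sample_text` and `non_empty_lines` are hypothetical.

```python
# Abbreviated excerpt of the page text shown above
sample_text = (
    "\n\n\nTop 100 - Project Gutenberg\n\n\n\n"
    "Online Book Catalog\n=> \n\n Book  Search\n-- Recent  Books\n"
)

# Strip each line and keep only the ones that still contain something
non_empty_lines = [line.strip() for line in sample_text.splitlines() if line.strip()]
print(non_empty_lines)
```
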
### Search the text extracted from the `soup` object (using regular expressions) to find the names of the Top 100 ebooks (yesterday's ranking)

In [23]:
enames = []

#### Create a starting index. It should point at the text _"Top 100 EBooks yesterday"_. Hint: Use the `splitlines()` method of `soup.text`. It splits the text of the `soup` object into lines.

In [24]:
index = soup.text.splitlines().index("Top 100 EBooks yesterday")

#### Loop 100 times to add the strings of the next 100 lines to this temporary list. Hint: `splitlines()` method

In [27]:
lines = soup.text.splitlines()  # split once instead of on every iteration
for i in range(100):
    enames.append(lines[index+2+i])
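
The index-and-offset logic is easier to see on a small input. The sketch below uses a hypothetical text laid out the way the loop above assumes: the section heading, one blank line, then one entry per line. A slice replaces the explicit loop.

```python
# Hypothetical text: heading, a blank line, then one entry per line
sample_text = (
    "Intro\n"
    "Top 100 EBooks yesterday\n"
    "\n"
    "First Title (10)\n"
    "Second Title (9)\n"
    "Third Title (8)\n"
    "Next section\n"
)

lines = sample_text.splitlines()
start = lines.index("Top 100 EBooks yesterday")

# Skip the heading and the blank line after it, then take n entries
n = 3
names = lines[start + 2 : start + 2 + n]
print(names)
```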

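#### Use a regular expression to extract only the text from the name strings and append it to an empty list
* Hint: Use `match` and `span` to find the indices and use them

In [30]:
enames2 = []
for i in range(100):
    # "^[a-zA-Z]*" matches letters only, so the match stops at the first space
    # and only the first word of each title is kept; adding a space to the
    # character class ("^[a-zA-Z ]*") would keep the following words as well
    id1, id2 = re.match("^[a-zA-Z]*", enames[i]).span()
    enames2.append(enames[i][id1:id2])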

#### Print the list of titles

In [31]:
for i in enames2:
    print(i)

Pride
Alice
Moby
The
Ion
The
The
A
Et
The
A
A
Frankenstein
Walden
War
Grimms
Metamorphosis
Adventures
The
The
Little
The
Southern
Treasure
Ulysses
Dracula
The
The
Il
Index
Emma
The
Some
The
The
Uncle
Great
Jane
The
Beowulf
Peter
Prestuplenie
A
The
Anne
The
Tractatus
Don
Beyond
The
Also
Alice
Heart
A
Les
Oliver
Essays
The
The
Street
The
The
Calculus
Meditations
As
Diccionario
La
The
Dubliners
The
The
Wuthering
David
The
The
The
Common
An
Anthem
The
The
Anna
Leviathan
Narrative
Siddhartha
The
Complete
The
A
Autobiography
The
Pygmalion
The
Gulliver
The
The
The
On
Sense
Manifest
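
The output above contains only the first word of each title because the character class `[a-zA-Z]` does not include a space. The sketch below compares that pattern with one that also allows spaces. The sample strings are hypothetical and only imitate a "title, then a count in parentheses" shape; they are not copied from the page.

```python
import re

# Hypothetical lines, not copied from the page
sample_lines = ["Some Book Title by An Author (123)", "Another Title (45)"]

# Letters only: the match ends at the first space
first_word = [re.match(r"^[a-zA-Z]*", s).group() for s in sample_lines]

# Letters and spaces: the match runs up to the first other character
with_spaces = [re.match(r"^[a-zA-Z ]*", s).group().strip() for s in sample_lines]

print(first_word)
print(with_spaces)
```

Titles that contain punctuation, digits, or accented letters would still be cut short by the second pattern, so it is a partial improvement at best.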
