[Project Gutenberg](https://www.gutenberg.org/) is a great resource for free eBooks, and has lots of great classic texts for NLP.
While there exist some libraries for accessing Project Gutenberg from Python such as [py-gutenberg](https://github.com/peterrauscher/py-gutenberg) and [GutenbergPy](https://github.com/raduangelescu/gutenbergpy) these require implicitly or explicitly building a database which makes them complex to use.
The R package [gutenberr](https://github.com/ropensci/gutenbergr) is much easier to use because it distributes a snapshot of the catalog and loads it into memory, but I can't find an equivalent in Python.
So instead we're going to directly search for books from Project Gutenberg's CSV exports, and use them to download all the books of [P. G. Wodehouse](https://en.wikipedia.org/wiki/P._G._Wodehouse)

In [1]:
import csv
from collections import Counter
from io import BytesIO
from pathlib import Path

import requests

# Reading the Catalog

Project Gutenberg doesn't have an API but has [documentation on offline catalogs](https://www.gutenberg.org/ebooks/offline_catalogs.html).
There exists a large RDF catalog (around 100MB compressed) with detailed metadata, and a smaller CSV catalog (14MB uncompressed) that contains limited metadata.

The CSV catalog is small enough we can download it quickly into memory (note that requests automatically decompresses):

In [2]:
import requests
GUTENBERG_CSV_URL = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv.gz"

r = requests.get(GUTENBERG_CSV_URL)
csv_text = r.content.decode("utf-8")

f"Total size: {len(r.content) / 1024**2:0.2f}MB"

'Total size: 14.04MB'

The text is a standard CSV file:

In [3]:
print(csv_text[:400])

Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
1,Text,1971-12-01,The Declaration of Independence of the United States of America,en,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence",E201; JK,Politics; American Revolutionary War; United States Law
2,Text,1972-12-01,"The United States Bill o


An easy way to process it is with a [`DictReader`](https://docs.python.org/3/library/csv.html#csv.DictReader), wrapping the text in StringIO to make it look like a file

In [4]:
import csv
from io import StringIO

next(csv.DictReader(StringIO(csv_text)))

{'Text#': '1',
 'Type': 'Text',
 'Issued': '1971-12-01',
 'Title': 'The Declaration of Independence of the United States of America',
 'Language': 'en',
 'Authors': 'Jefferson, Thomas, 1743-1826',
 'Subjects': 'United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence',
 'LoCC': 'E201; JK',
 'Bookshelves': 'Politics; American Revolutionary War; United States Law'}

We can then search for all P. G. Wodehouse books by looking for authors containing "Wodehouse":

In [5]:
wodehouse_books = [book for book in csv.DictReader(StringIO(csv_text)) 
                   if 'Wodehouse' in book['Authors']]

len(wodehouse_books)

56

Let's show our results in a HTML table (it's a bit long - feel free to skim past it):

In [6]:
from IPython.display import display, HTML

def dicts_to_html_table(dicts):
    html = []
    header = None
    for d in dicts:
        if header is None:
            header = d.keys()
            html.append("<table><tr>" +
                        "".join([f"<th>{h}</th>" for h in header]) +
                        "</tr>")
        html.append("<tr>" +
                    "".join([f"<td>{d[h]}</td>" for h in header]) +
                    "</tr>")
    html.append("</table>")

    return "".join(html)

display(HTML(dicts_to_html_table(wodehouse_books)))

Text#,Type,Issued,Title,Language,Authors,Subjects,LoCC,Bookshelves
2005,Text,1999-12-01,Piccadilly Jim,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975","Humorous stories; Piccadilly (London, England) -- Fiction",PR,Best Books Ever Listings; Humor
2042,Text,2000-01-01,Something New,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975",Humorous stories; Nobility -- Fiction; Blandings Castle (England : Imaginary place) -- Fiction; Shropshire (England) -- Fiction,PR,Best Books Ever Listings; Humor
2233,Text,2000-06-01,A Damsel in Distress,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975",Humorous stories; Nobility -- Fiction; Blandings Castle (England : Imaginary place) -- Fiction; Shropshire (England) -- Fiction,PR,Humor
2607,Text,2001-04-01,"Psmith, Journalist",en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975",Humorous stories,PR,Humor
3756,Text,2008-06-25,Indiscretions of Archie,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975","New York (N.Y.) -- Fiction; Humorous stories; British -- United States -- Fiction; World War, 1914-1918 -- Veterans -- Fiction; Hotels -- Fiction; Married men -- Fiction; Fathers-in-law -- Fiction",PR,Humor
3829,Text,2003-03-01,Love Among the Chickens,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975","Humorous stories; England -- Fiction; Farm life -- Fiction; Chicken breeders -- Fiction; Ukridge, Stanley Featherstonehaugh (Fictitious character) -- Fiction",PR,Humor
4075,Text,2003-05-01,The Intrusion of Jimmy,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975",New York (N.Y.) -- Fiction; Humorous stories; Love stories; Burglary -- Fiction; British -- United States -- Fiction; Police -- Family relationships -- Fiction,PR,Humor
6683,Text,2004-10-01,The Little Nugget,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975",Humorous stories; Kidnapping -- Fiction,PR,Humor
6684,Text,2004-10-01,Uneasy Money,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975",New York (N.Y.) -- Fiction; Humorous stories; Inheritance and succession -- Fiction; Love stories; Aristocracy (Social class) -- Fiction; British -- United States -- Fiction,PR,Humor
6753,Text,2004-10-01,Psmith in the City,en,"Wodehouse, P. G. (Pelham Grenville), 1881-1975",Humorous stories,PR; PZ,Humor


There are a couple of results above that aren't what I am looking for:

* Other authors with the name Wodehouse: Wodehouse, C. N. and Thomas Wodehouse Legh
* The "Index of the Project Gutenberg Works of Pelham Grenville Wodehouse"
* Some of them are "Sound" not text

We can filter these out to get just the books we need.

In [7]:
wodehouse_books = [b for b in wodehouse_books
                   if "Wodehouse, P. G." in b["Authors"]
                   and "Indexes" not in b["Subjects"]
                   and b["Type"] == "Text"]
len(wodehouse_books)

48

# Downloading the text

Once we have the id of the book (Text#), it can be downloaded from a standard URL.
For human access we can get them from `https://www.gutenberg.org/ebooks/{id}.txt.utf-8`:

In [8]:
GUTENBERG_TEXT_URL = "https://www.gutenberg.org/ebooks/{id}.txt.utf-8"

book_id = wodehouse_books[0]["Text#"]

#r = requests.get(GUTENBERG_TEXT_URL.format(id=book_id))
#text = r.text

*but* their [robots access policy](https://www.gutenberg.org/policy/robot_access.html) suggests using a special URL to get links (we set the filetypes to `txt` here to get text).

In [9]:
GUTENBERG_ROBOT_URL = "http://www.gutenberg.org/robot/harvest?filetypes[]=txt"
r = requests.get(GUTENBERG_ROBOT_URL)

print(r.text[:750])

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>All Files (offset: 0, filetypes: txt) - Project Gutenberg</title>
  </head>
  <body>
    <h1>All Files (offset: 0, filetypes: txt)</h1>    <p><a href="http://aleph.gutenberg.org/etext02/comed10.zip">http://aleph.gutenberg.org/etext02/comed10.zip</a></p>

    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12370/12370-8.zip">http://aleph.gutenberg.org/1/2/3/7/12370/12370-8.zip</a></p>

    <p><a href="http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip">http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip</a></p>

    <p><a href="http://aleph.guten


The mirror can be extracted from the URLs:

In [10]:
import re

GUTENBERG_MIRROR = re.search('(https?://[^/]+)[^"]*.zip', r.text).group(1)
GUTENBERG_MIRROR

'http://aleph.gutenberg.org'

Then we can construct the URL using the same logic as [gutenbergr](https://github.com/ropensci/gutenbergr/blob/f5ab38bea2f91871f32922ef920e31ee1a46ac89/R/gutenberg_download.R#L79-L89).
Note that sometimes we need to add a suffix (e.g. look at http://aleph.gutenberg.org/0/1/ which only has a `-0`)

In [11]:
def gutenberg_text_urls(id: str, mirror=GUTENBERG_MIRROR, suffixes=("", "-8", "-0")) -> list[str]:
    path = "/".join(id[:-1]) or "0"
    return [f"{mirror}/{path}/{id}/{id}{suffix}.zip" for suffix in suffixes]

gutenberg_text_urls(book_id)

['http://aleph.gutenberg.org/2/0/0/2005/2005.zip',
 'http://aleph.gutenberg.org/2/0/0/2005/2005-8.zip',
 'http://aleph.gutenberg.org/2/0/0/2005/2005-0.zip']

We can then try each URL in turn until we find the file, and then unzip it:

In [12]:
import logging
import zipfile

def download_gutenberg(id: str) -> str:
    for url in gutenberg_text_urls(id):
        r = requests.get(url)
        if r.status_code == 404:
            logging.warning(f"404 for {url}")
            continue
        r.raise_for_status()
        break
    
    z = zipfile.ZipFile(BytesIO(r.content))
    
    if len(z.namelist()) != 1:
        raise Exception(f"Expected 1 file in {z.namelist()}")
        
    return z.read(z.namelist()[0]).decode('utf-8')

In [13]:
text = download_gutenberg(book_id)

print(text[:1500])

The Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Piccadilly Jim

Author: P. G. Wodehouse

Release Date: September 12, 2012 [EBook #2005]
Last Updated: August 16, 2016

Language: English

Character set encoding: ASCII

*** START OF THIS PROJECT GUTENBERG EBOOK PICCADILLY JIM ***




Produced by Jim Tinsley









Piccadilly Jim


by

Pelham Grenville Wodehouse





CHAPTER I

A RED-HAIRED GIRL

The residence of Mr. Peter Pett, the well-known financier, on
Riverside Drive is one of the leading eyesores of that breezy and
expensive boulevard. As you pass by in your limousine, or while
enjoying ten cents worth of fresh air on top of a green omnibus,
it jumps out and bites at you. Architects, confronted with it,
reel 

Searching for this text we can see it also appears near the end of the text (actually this book has some transcriber's notes after the end of the text, but we'll leave them in)

In [14]:
GUTENBERG_TEXT = "PROJECT GUTENBERG EBOOK "

lines = text.splitlines()

first = True
for idx, line in enumerate(lines):
    if GUTENBERG_TEXT in line:
        if first:
            first = False
            continue
        print('=' * 80)
        print('\n'.join(lines[idx-20:idx+20]))
        print('=' * 80)
        print()


This is a somewhat clumsy construction, and quite un-Wodehousian.
The original passage in the serialization read:

 "Before his stony eye the immaculate Bartling wilted. All that
 he had ever heard and read about doubles came to him."

--------------------------------










End of the Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse

*** END OF THIS PROJECT GUTENBERG EBOOK PICCADILLY JIM ***

***** This file should be named 2005.txt or 2005.zip *****
This and all associated files of various formats will be found in:
        http://www.gutenberg.org/2/0/0/2005/

Produced by Jim Tinsley

Updated editions will replace the previous one--the old editions
will be renamed.

Creating the works from public domain print editions means that no
one owns a United States copyright in these works, so the Foundation
(and you!) can copy and distribute it in the United States without
permission and without paying copyright royalties.  Special rules,
set forth in the General Terms of Us

We can read everything between the first and last header with a simple state machine:

In [15]:
def strip_headers(text):
    in_text = False
    output = []
    
    for line in text.splitlines():        
        if GUTENBERG_TEXT in line:
            if not in_text:
                in_text = True
            else:
                break
        else:
            if in_text:
                output.append(line)

    return "\n".join(output).strip()

stripped_text = strip_headers(text)

And check that they have worked:

In [16]:
print(stripped_text[:200])
print("*" * 80)
print(stripped_text[-500:])

Produced by Jim Tinsley









Piccadilly Jim


by

Pelham Grenville Wodehouse





CHAPTER I

A RED-HAIRED GIRL

The residence of Mr. Peter Pett, the well-known financier, on
Riverside Drive is one
********************************************************************************
ling wilted.
 It was a perfectly astounding likeness, but it was
 apparent to him when what he had ever heard and read
 about doubles came to him."

This is a somewhat clumsy construction, and quite un-Wodehousian.
The original passage in the serialization read:

 "Before his stony eye the immaculate Bartling wilted. All that
 he had ever heard and read about doubles came to him."

--------------------------------










End of the Project Gutenberg EBook of Piccadilly Jim, by P. G. Wodehouse


# Downloading all the files

Now we can download all the files in a simple loop; let's create a simple function that gets and cleans the text:

In [17]:
def book_text(book_id):
    r = requests.get(GUTENBERG_TEXT_URL.format(id=book_id))
    text = r.text
    clean_text = strip_headers(text)
    return clean_text

We'll save each book into the "data" folder

In [18]:
data_path = Path("data")
data_path.mkdir(exist_ok=True)

And finally save all the books (one at a time to not overload the server):

In [19]:
for book in wodehouse_books:
    id = book["Text#"]
    text = book_text(id)
    print(f"Saving {book['Title']} by {book['Authors']} containing {len(text):_} characters")
    with open(data_path / (id + ".txt"), "wt") as f:
        f.write(text)

Saving Piccadilly Jim by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 449_842 characters
Saving Something New by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 419_221 characters
Saving A Damsel in Distress by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 429_025 characters
Saving Psmith, Journalist by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 322_135 characters
Saving Indiscretions of Archie by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 413_041 characters
Saving Love Among the Chickens by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 278_160 characters
Saving The Intrusion of Jimmy by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 381_406 characters
Saving The Little Nugget by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 388_673 characters
Saving Uneasy Money by Wodehouse, P. G. (Pelham Grenville), 1881-1975 containing 364_858 characters
Saving Psmith in the City by Wodehouse, P. G. 

We can also save our metadata for future reference:

In [20]:
with open(data_path / 'metadata.csv', 'wt') as f:
    csv_writer = csv.DictWriter(f, fieldnames=wodehouse_books[0].keys())
    csv_writer.writeheader()
    for book in wodehouse_books:
        csv_writer.writerow(book)

# Conclusion

It's really simple to search for books using the Project Gutenberg CSV catalog, and to download the books in a way that complies with their robots and crawlers guidelines (thanks to [gutenbergr](https://github.com/ropensci/gutenbergr) for showing the way).
You can easily get books from Project Gutenberg for further data analysis or machine learning; I'm going to train a language model on P. G. Wodehouse.