<a href="https://colab.research.google.com/github/Hearlvein/colab/blob/main/guten_tag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# install commands
!pip install gutenbergpy beautifulsoup4 requests

Collecting gutenbergpy
  Downloading gutenbergpy-0.3.5-py3-none-any.whl.metadata (7.7 kB)
Collecting future>=0.15.2 (from gutenbergpy)
  Using cached future-1.0.0-py3-none-any.whl.metadata (4.0 kB)
Collecting httpsproxy-urllib2 (from gutenbergpy)
  Downloading httpsproxy_urllib2-1.0.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting lxml>=3.2.0 (from gutenbergpy)
  Downloading lxml-5.4.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.5 kB)
Collecting pymongo (from gutenbergpy)
  Downloading pymongo-4.13.0-cp312-

In [5]:
import os
from gutenbergpy.textget import get_text_by_id
from gutenbergpy.gutenbergcache import GutenbergCache
from bs4 import BeautifulSoup
import requests

# Step 1: Scrape the bookshelf for book IDs
def get_book_ids_from_bookshelf(url, limit=10):
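    # Fail fast on network problems or HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')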
    book_links = soup.select('li.booklink a.link')
    book_ids = []

    for link in book_links:
        href = link.get('href')
        # Skip anchors without an href; link.get() returns None for them
        if href and href.startswith('/ebooks/'):
            book_id = href.split('/')[-1]
            if book_id.isdigit():
                book_ids.append(int(book_id))
                if len(book_ids) == limit:
                    break
    return book_ids

# Step 2: Download and save books
def download_books(book_ids, output_folder):
    os.makedirs(output_folder, exist_ok=True)
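    print("Loading Gutenberg metadata cache...")
    # Note: `cache` is not referenced again in this function;
    # get_text_by_id() below is called without it.
    cache = GutenbergCache.get_cache()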
    for book_id in book_ids:
        print(f"Downloading book ID {book_id}...")
        try:
            text_bytes = get_text_by_id(book_id)
            text_str = text_bytes.decode('utf-8', errors='ignore')
            output_path = os.path.join(output_folder, f"{book_id}.txt")
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(text_str)
            print(f"Saved book {book_id} to {output_path}")
        except Exception as e:
            print(f"Error downloading book {book_id}: {e}")

# Utility: Download books by genre into a coherent folder structure
def download_books_to_dataset(bookshelf_url, genre, limit=10, base_folder="gutenberg_dataset"):
    output_folder = os.path.join(base_folder, genre)
    book_ids = get_book_ids_from_bookshelf(bookshelf_url, limit=limit)
    download_books(book_ids, output_folder=output_folder)

# Example genres and bookshelf URLs
bookshelves = {
    'fiction': 'https://www.gutenberg.org/ebooks/bookshelf/480',
    'poetry': 'https://www.gutenberg.org/ebooks/bookshelf/60',
    # Add more genres/bookshelves as needed
}

# Download for each genre into a clean structure
for genre, url in bookshelves.items():
    download_books_to_dataset(url, genre=genre, limit=10)


Loading Gutenberg metadata cache...
Downloading book ID 84...
Saved book 84 to gutenberg_dataset/fiction/84.txt
Downloading book ID 43...
Saved book 43 to gutenberg_dataset/fiction/43.txt
Downloading book ID 345...
Saved book 345 to gutenberg_dataset/fiction/345.txt
Downloading book ID 41445...
Saved book 41445 to gutenberg_dataset/fiction/41445.txt
Downloading book ID 55...
Saved book 55 to gutenberg_dataset/fiction/55.txt
Downloading book ID 2148...
Saved book 2148 to gutenberg_dataset/fiction/2148.txt
Downloading book ID 829...
Downloading boo

## Building a Structured Gutenberg Dataset

All books are now organized by genre in subfolders under `gutenberg_dataset/`.

- `gutenberg_dataset/fiction/` contains fiction books (bookshelf 480).
- `gutenberg_dataset/poetry/` contains poetry books (bookshelf 60).
- Each book is saved as a `.txt` file named by its Gutenberg ID.
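
As a quick way to check the layout described above, the following minimal sketch walks the dataset folder and lists what it finds. The helper name `index_dataset` is hypothetical (it is not part of the notebook or of `gutenbergpy`), and it assumes every book file is named `<Gutenberg ID>.txt` inside a genre subfolder.

```python
from pathlib import Path

def index_dataset(base_folder="gutenberg_dataset"):
    """Return (genre, book_id, path) tuples for every <id>.txt under base_folder.

    Hypothetical helper: assumes the one-subfolder-per-genre layout above.
    """
    entries = []
    base = Path(base_folder)
    if not base.is_dir():
        # Nothing downloaded yet (or wrong path): return an empty index
        return entries
    for genre_dir in base.iterdir():
        if not genre_dir.is_dir():
            continue
        for txt in genre_dir.glob("*.txt"):
            # Only keep files whose name is a numeric Gutenberg ID
            if txt.stem.isdigit():
                entries.append((genre_dir.name, int(txt.stem), txt))
    # Sort by genre, then by numeric ID, for a stable listing
    return sorted(entries, key=lambda e: (e[0], e[1]))

for genre, book_id, path in index_dataset():
    print(genre, book_id, path)
```

Files that do not follow the `<id>.txt` naming are skipped rather than reported, which keeps the index limited to downloaded books.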

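Because each book's genre can be read directly from its folder name, this layout is convenient for LLM dataset preparation. To add more genres, add entries to the `bookshelves` dictionary and rerun the download cell.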
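
One possible next step, sketched here under assumptions, is to flatten the folder tree into a single JSON Lines file with one record per book. The function name `export_jsonl`, the output filename, and the record fields (`genre`, `book_id`, `text`) are illustrative choices, not something the notebook defines.

```python
import json
from pathlib import Path

def export_jsonl(base_folder="gutenberg_dataset", out_path="gutenberg_dataset.jsonl"):
    """Write one JSON object per book and return the number of records written.

    Hypothetical helper: the genre is taken from the parent folder name and
    the book ID from the file name, matching the layout described above.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Match <base_folder>/<genre>/<id>.txt; sorted for a reproducible order
        for txt in sorted(Path(base_folder).glob("*/*.txt")):
            record = {
                "genre": txt.parent.name,
                "book_id": txt.stem,
                "text": txt.read_text(encoding="utf-8"),
            }
            # ensure_ascii=False keeps accented characters readable in the file
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count

print(export_jsonl(), "records written")
```

Each record holds a whole book as one string. Splitting into chunks or stripping licence text would be separate steps that this sketch does not attempt.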