<a href="https://colab.research.google.com/github/Hearlvein/colab/blob/main/guten_tag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

jupytext --to notebook guten_tag.py

In [1]:
# install commands
%pip install gutenbergpy beautifulsoup4 requests
%pip install datasets
%pip install transformers
%pip install accelerate

Collecting gutenbergpy
  Downloading gutenbergpy-0.3.5-py3-none-any.whl.metadata (7.7 kB)
Collecting httpsproxy-urllib2 (from gutenbergpy)
  Downloading httpsproxy_urllib2-1.0.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting pymongo (from gutenbergpy)
  Downloading pymongo-4.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo->gutenbergpy)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading gutenbergpy-0.3.5-py3-none-any.whl (22 kB)
Downloading pymongo-4.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
[?25hBuilding wheels for collected 

In [2]:
import os
from gutenbergpy.textget import get_text_by_id
from gutenbergpy.gutenbergcache import GutenbergCache
from bs4 import BeautifulSoup
import requests

# Step 1: Scrape the bookshelf for book IDs
def get_book_ids_from_bookshelf(url, limit=10):
    """Return up to `limit` numeric ebook IDs linked from a bookshelf page."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    book_links = soup.select('li.booklink a.link')
    book_ids = []

    for link in book_links:
        href = link.get('href')
        # Skip anchors without an href; keep only /ebooks/<id> links
        if href and href.startswith('/ebooks/'):
            book_id = href.split('/')[-1]
            if book_id.isdigit():
                book_ids.append(int(book_id))
                if len(book_ids) == limit:
                    break
    return book_ids

# Step 2: Download and save books (skip if file exists)
def download_books(book_ids, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    print("Loading Gutenberg metadata cache...")
    cache = GutenbergCache.get_cache()
    for book_id in book_ids:
        output_path = os.path.join(output_folder, f"{book_id}.txt")
        if os.path.exists(output_path) and os.path.getsize(output_path) > 0:
            print(f"Book {book_id} already exists at {output_path}, skipping download.")
            continue
        print(f"Downloading book ID {book_id}...")
        try:
            text_bytes = get_text_by_id(book_id)
            text_str = text_bytes.decode('utf-8', errors='ignore')
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(text_str)
            print(f"Saved book {book_id} to {output_path}")
        except Exception as e:
            print(f"Error downloading book {book_id}: {e}")

# Utility: Download books by genre into a coherent folder structure
def download_books_to_dataset(bookshelf_url, genre, limit=10, base_folder="gutenberg_dataset"):
    output_folder = os.path.join(base_folder, genre)
    book_ids = get_book_ids_from_bookshelf(bookshelf_url, limit=limit)
    download_books(book_ids, output_folder=output_folder)

# Example genres and bookshelf URLs
bookshelves = {
    'fiction': 'https://www.gutenberg.org/ebooks/bookshelf/480',
    'poetry': 'https://www.gutenberg.org/ebooks/bookshelf/60',
    # Add more genres/bookshelves as needed
}

# Download for each genre into a clean structure
for genre, url in bookshelves.items():
    download_books_to_dataset(url, genre=genre, limit=10)


Loading Gutenberg metadata cache...
Downloading book ID 84...
Saved book 84 to gutenberg_dataset/fiction/84.txt
Downloading book ID 43...
Saved book 43 to gutenberg_dataset/fiction/43.txt
Downloading book ID 345...
Saved book 345 to gutenberg_dataset/fiction/345.txt
Downloading book ID 41445...
Saved book 41445 to gutenberg_dataset/fiction/41445.txt
Downloading book ID 55...
Saved book 55 to gutenberg_dataset/fiction/55.txt
Downloading book ID 2148...
Saved book 2148 to gutenberg_dataset/fiction/2148.txt
Downloading book ID 829...
Saved book 829 to gutenberg_dataset/fiction/829.txt
Downloading book ID 1251...
Saved book 1251 to gutenberg_dataset/fiction/1251.txt
Downloading book ID 16...
Saved book 16 to gutenberg_dataset/fiction/16.txt
Downloading book ID 36...
Saved book 36 to gutenberg_dataset/fiction/36.txt
Loading Gutenberg metadata cache...
Downloading book ID 16328...
Saved book 16328 to gutenberg_dataset/poetry/16328.txt
Downloading book ID 1322...
Saved book 1322 to gutenberg_

## Building a Structured Gutenberg Dataset

All books are now organized by genre in subfolders under `gutenberg_dataset/`.

- `gutenberg_dataset/fiction/` contains fiction books (bookshelf 480).
- `gutenberg_dataset/poetry/` contains poetry books (bookshelf 60).
- Each book is saved as a `.txt` file named by its Gutenberg ID.

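Because each genre has its own folder, the folder name can serve as a label during dataset preparation. To add a genre, add an entry to `bookshelves` above and a matching entry to `INPUT_DIRS` in the next cell.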
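
A quick way to confirm the layout described above is to count the `.txt` files in each genre folder. This is a minimal sketch: `count_books_per_genre` is a hypothetical helper that is not part of the notebook, and it assumes the default `gutenberg_dataset` base folder.

```python
from pathlib import Path


def count_books_per_genre(base_folder="gutenberg_dataset"):
    """Return {genre: number of .txt files} for each subfolder of base_folder."""
    base = Path(base_folder)
    if not base.is_dir():
        # Nothing downloaded yet (or a different base folder was used)
        return {}
    return {
        genre_dir.name: sum(1 for _ in genre_dir.glob("*.txt"))
        for genre_dir in sorted(base.iterdir())
        if genre_dir.is_dir()
    }


if __name__ == "__main__":
    for genre, n_books in count_books_per_genre().items():
        print(f"{genre}: {n_books} book(s)")
```

If a genre shows fewer files than the `limit` passed to the download step, some downloads probably failed and can be retried, since existing non-empty files are skipped.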

In [3]:
import os
import re
import json
from pathlib import Path
from tqdm import tqdm

# Configuration
INPUT_DIRS = {
    "fiction": Path("gutenberg_dataset/fiction"),
    "poetry": Path("gutenberg_dataset/poetry"),
}
OUTPUT_FILE = Path("gutenberg_dataset.jsonl")

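# Regex patterns to strip Gutenberg headers/footers.
# The header pattern is anchored at the start of the file so the license
# preamble before the START marker is removed too; "THIS|THE" covers both
# marker wordings.
HEADER_PATTERN = re.compile(
    r"\A.*?\*{3}\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*{3}",
    re.IGNORECASE | re.DOTALL,
)
FOOTER_PATTERN = re.compile(
    r"\*{3}\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*",
    re.IGNORECASE | re.DOTALL,
)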


def clean_gutenberg_text(text: str) -> str:
    """
    Remove the Project Gutenberg header/footer and surrounding whitespace.
    """
    # Remove header
    text = HEADER_PATTERN.sub("", text)
    # Remove footer
    text = FOOTER_PATTERN.sub("", text)
    # Trim leading and trailing whitespace (internal whitespace is kept)
    text = text.strip()
    return text


def process_and_write_jsonl(input_dirs: dict, output_path: Path):
    """
    Walk through input_dirs, clean each .txt file, and write to a single JSONL output.
    Each JSONL record has fields: source, filename, text.
    """
    if output_path.exists() and output_path.stat().st_size > 0:
        print(f"{output_path} already exists and is non-empty, skipping cleaning and writing.")
        return
    with output_path.open("w", encoding="utf-8") as out_file:
        for source_label, folder in input_dirs.items():
            txt_files = list(folder.rglob("*.txt"))
            for txt_path in tqdm(txt_files, desc=f"Processing {source_label}"):
                try:
                    raw = txt_path.read_text(encoding="utf-8", errors="ignore")
                    clean = clean_gutenberg_text(raw)
                    if not clean:
                        continue
                    record = {
                        "source": source_label,
                        "filename": txt_path.name,
                        "text": clean,
                    }
                    out_file.write(json.dumps(record, ensure_ascii=False) + "\n")
                except Exception as e:
                    print(f"Error processing {txt_path}: {e}")


if __name__ == "__main__":
    os.makedirs(OUTPUT_FILE.parent, exist_ok=True)
    process_and_write_jsonl(INPUT_DIRS, OUTPUT_FILE)
    print(f"Dataset written to {OUTPUT_FILE}")

Processing fiction: 100%|██████████| 10/10 [00:00<00:00, 51.62it/s]
Processing poetry: 100%|██████████| 10/10 [00:00<00:00, 77.12it/s]

Dataset written to gutenberg_dataset.jsonl
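
The cleaning step is easy to check on a small synthetic sample before running it on real books. The sketch below is an illustration only: `strip_gutenberg_boilerplate` and the sample text are hypothetical, and it assumes single-line markers of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` (or the `THIS` wording).

```python
import re

# Assumed marker wording: some files say "THIS", others "THE".
START_MARKER = re.compile(
    r"\*{3}\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK[^\n]*?\*{3}",
    re.IGNORECASE,
)
END_MARKER = re.compile(
    r"\*{3}\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END markers, if present."""
    start = START_MARKER.search(text)
    if start:
        text = text[start.end():]  # drop the preamble and the marker itself
    end = END_MARKER.search(text)
    if end:
        text = text[:end.start()]  # drop the marker and everything after it
    return text.strip()


sample = (
    "The Project Gutenberg eBook of Example\n"
    "License preamble...\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "Chapter 1\n"
    "It was a dark night.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer...\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Slicing at the match positions, rather than substituting the marker with an empty string, makes it explicit that the preamble before the START marker is discarded as well.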





In [4]:
import json
import os

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Paths
DATASET_PATH = "gutenberg_dataset.jsonl"
MODEL_NAME = (
    "distilgpt2"  # or try "gpt2" / "tiiuae/falcon-rw-1b" if you want larger models
)

# Step 1: Load dataset
print("Loading dataset...")
print(os.path.exists(DATASET_PATH))
print(os.path.getsize(DATASET_PATH))

# Manually read the JSONL and convert it to a list of dicts
with open(DATASET_PATH, 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f if line.strip()]

# Create HuggingFace Dataset
dataset = Dataset.from_list(data)


# Optional: filter very short or very long texts
# dataset = dataset.filter(lambda x: 100 < len(x["text"]) < 5000)

# Print dataset statistics
print(f"Dataset loaded with {len(dataset)} records.")

Loading dataset...
True
8114735
Dataset loaded with 20 records.
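
With only 20 records in a file of roughly 8 MB, each record is a whole book. A per-genre summary makes that visible. The following is a minimal sketch; `summarize_jsonl` is a hypothetical helper that relies only on the `source` and `text` fields written by the cleaning cell.

```python
import json
import os
from collections import Counter


def summarize_jsonl(path):
    """Count records and total characters per `source` label in a JSONL file."""
    records = Counter()
    chars = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            records[row["source"]] += 1
            chars[row["source"]] += len(row["text"])
    return {src: {"records": records[src], "chars": chars[src]} for src in records}


if __name__ == "__main__":
    path = "gutenberg_dataset.jsonl"
    if os.path.exists(path):
        for source, stats in summarize_jsonl(path).items():
            print(source, stats)
```

Given record sizes of this order, the commented-out filter (`100 < len(x["text"]) < 5000`) would likely discard most whole-book records, so it only makes sense after the text has been split into smaller passages.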


In [5]:
# Step 2: Load tokenizer and model
print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = (
    tokenizer.eos_token
)  # GPT-style models don't have pad_token by default

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


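# Step 3: Tokenize dataset
# Note: truncation=True keeps only the first 512 tokens of each record, and
# each record here is a whole book, so most of the text is never seen in training.
def tokenize_function(examples):
    encodings = tokenizer(
        examples["text"], truncation=True, padding="max_length", max_length=512
    )
    # The collator below (mlm=False) rebuilds labels from input_ids,
    # so this copy is redundant but harmless.
    encodings["labels"] = encodings["input_ids"].copy()
    return encodings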


print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "filename", "source"]
)

# Step 4: Data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Step 5: Training arguments
training_args = TrainingArguments(
    output_dir="./poetic-sci-fi-model",
    run_name="poetic-sci-fi",  # Optional: just for logs
    report_to="none",          # <<< disables W&B
    overwrite_output_dir=True,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    save_strategy="epoch",
    logging_steps=100,  # loss is logged every 100 steps; shorter runs log nothing
    fp16=True,  # mixed precision; requires a CUDA GPU
    remove_unused_columns=False,
)

# Step 6: Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    processing_class=tokenizer,  # `tokenizer=` is deprecated in recent transformers releases
    data_collator=data_collator,
)

Loading tokenizer and model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



Tokenizing dataset...



  trainer = Trainer(
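
The tokenization step above truncates each record to 512 tokens, so a whole book contributes a single training example. A common alternative is to split the full token sequence into fixed-size blocks. The sketch below shows only the splitting logic in plain Python; `chunk_token_ids` is a hypothetical helper and the integers stand in for real token ids.

```python
def chunk_token_ids(token_ids, block_size=512):
    """Split one long list of token ids into consecutive blocks of block_size.

    The final partial block is dropped so every block has the same length
    and no padding is needed.
    """
    n_full = len(token_ids) // block_size
    return [
        token_ids[i * block_size:(i + 1) * block_size]
        for i in range(n_full)
    ]


# Toy example with integers standing in for token ids.
blocks = chunk_token_ids(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With the notebook's tokenizer, one option would be to call `tokenizer(text)["input_ids"]` without truncation, pass the result to `chunk_token_ids`, and build the training dataset from the blocks. The tokenizer may warn that the sequence exceeds the model's maximum length; that is expected in this setup because the ids are split before they reach the model.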


In [6]:
# Step 7: Train
print("Starting training...")
trainer.train()

# Step 8: Save locally
model_path = "./poetic-sci-fi-model"
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

print(f"Model saved to {model_path}")

Starting training...


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


Model saved to ./poetic-sci-fi-model
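
The empty `Step,Training Loss` table above is consistent with the run ending before the first logging interval. The arithmetic can be sketched as follows, assuming a single device and no gradient accumulation (neither is set in the notebook); `total_optimizer_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def total_optimizer_steps(n_examples, batch_size, epochs):
    """Estimate the number of optimizer steps for a simple training run."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(n_examples=20, batch_size=4, epochs=3)
print(steps)         # 15
print(steps >= 100)  # False: a logging interval of 100 is never reached
```

Under these assumptions the run takes 15 steps, so lowering `logging_steps` (for example to 1 or 5) or training on more examples would be needed to see any loss values.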


In [7]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "./poetic-sci-fi-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Sci-fi poetic prompt
prompt = "Beneath the rusted moons of Elarion, the last poet of Earth recited verses to the wind."

# Generate a continuation of up to 300 new tokens
output = generator(
    prompt,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.95,
    top_k=50,
    top_p=0.92,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,  # optional, helps cut off
)

print("\nGenerated Poetic Sci-Fi Story:\n")
print(output[0]["generated_text"])

Device set to use cuda:0



Generated Poetic Sci-Fi Story:

Beneath the rusted moons of Elarion, the last poet of Earth recited verses to the wind. His wife Alena had been a bit nervous when her husband's visit ended with him and he was alone for most of his life while trying to recover from them.[citation needed]
The tale recounted one day in particular which Arianon came across Eren’s words as:
And I heard that—you know how they are so! And this is what you do about their friendship; who can remember those two friends? You hear such laughter together,—they never left an end now,[d] but no longer fear." [1] In The Epicurean Comedy Book (Esteemus), 1/7-8 by Henry Devenet, 2/12-13 by Arthur H. Sturgis. As we go on our way toward Dorne’s story of Aritholdius being sent abroad en route to Tiberium, it turns out that Anaxosian philosopher Cephalasides has found himself caught up in conversation between several ancient philosophers at some point during much of Middle Ages. This exchange took place after Diogenes arri