# Automated Book Publication Workflow

This project demonstrates a semi-automated publishing workflow using AI and human collaboration. It features:
- Web scraping and screenshots from Wikisource
- AI-based chapter spinning using Hugging Face
- Human-in-the-loop editing and review
- Content versioning with ChromaDB
- Intelligent search with reinforcement learning-inspired scoring

Developed in Python, fully runnable in Google Colab.


In [None]:
!pip install playwright
!playwright install


In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
!pip install nest_asyncio
!pip install beautifulsoup4 chromadb


In [None]:
import nest_asyncio
import asyncio

nest_asyncio.apply()


In [None]:
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from IPython.display import Image, display

URL = "https://en.wikisource.org/wiki/The_Gates_of_Morning/Book_1/Chapter_1"
screenshot_path = "/content/chapter1.png"

async def fetch_and_screenshot():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(URL)
        await page.screenshot(path=screenshot_path, full_page=True)
        content = await page.content()
        await browser.close()
        return content


In [None]:
html_content = await fetch_and_screenshot()
display(Image(filename=screenshot_path))


In [None]:
from bs4 import BeautifulSoup

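soup = BeautifulSoup(html_content, "html.parser")

# Wikisource main content is typically inside <div id="mw-content-text">
content_div = soup.find("div", id="mw-content-text")
if content_div is None:
    raise ValueError("Could not find <div id='mw-content-text'>; the page layout may have changed.")

# Extract each paragraph once, drop empty ones, and join with blank lines.
# get_text().strip() keeps the spaces around inline tags (e.g. <i>), which
# get_text(strip=True) would remove, gluing neighbouring words together.
paragraph_texts = (para.get_text().strip() for para in content_div.find_all("p"))
chapter_text = "\n\n".join(text for text in paragraph_texts if text)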

print("📖 Extracted Chapter Text:\n")
print(chapter_text[:1000])  # Preview the first 1000 characters


## AI Writer – Chapter Spinning

The original chapter is rephrased ("spun") using a Hugging Face T5 paraphrasing model (`Vamsi/T5_Paraphrase_Paws`) to simulate an AI writing assistant. This first draft later goes through human and AI review.


In [None]:
!pip install transformers sentencepiece

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Vamsi/T5_Paraphrase_Paws"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
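
The next cell calls `spin_chapter`, which is not defined anywhere in this notebook. Below is a minimal sketch of what it could look like, not the original implementation. It assumes the `tokenizer` and `model` loaded above, a `"paraphrase: "` input prefix for this checkpoint, and chunking by sentence groups so inputs stay short; the helper names `split_into_chunks` and `t5_paraphrase` and the `max_chars` limit are hypothetical.

```python
import re
from typing import Callable, List, Optional


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Group whole sentences into chunks of roughly max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def t5_paraphrase(chunk: str) -> str:
    """Paraphrase one chunk with the globally loaded T5 `tokenizer` and `model`."""
    # Assumption: this checkpoint expects a "paraphrase: " task prefix
    inputs = tokenizer("paraphrase: " + chunk, return_tensors="pt",
                       truncation=True, max_length=256)
    output_ids = model.generate(**inputs, max_length=256, num_beams=4,
                                early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def spin_chapter(text: str,
                 paraphrase: Optional[Callable[[str], str]] = None,
                 max_chars: int = 400) -> str:
    """Paraphrase a chapter chunk by chunk, preserving paragraph breaks."""
    paraphrase = paraphrase or t5_paraphrase
    spun_paragraphs = []
    for paragraph in text.split("\n\n"):
        chunks = split_into_chunks(paragraph, max_chars)
        spun_paragraphs.append(" ".join(paraphrase(chunk) for chunk in chunks))
    return "\n\n".join(p for p in spun_paragraphs if p)
```

Passing a different `paraphrase` callable makes it possible to swap the model, or to try the chunking logic without loading one.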


In [None]:
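# spin_chapter must be defined before this cell runs; it is expected to
# paraphrase chapter_text with the T5 tokenizer and model loaded above.
spun_version = spin_chapter(chapter_text)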
print("🌀 Spun Chapter:\n")
print(spun_version)


In [None]:
!pip install chromadb


## Versioning via ChromaDB

Each chapter version (original, spun, reviewed) is stored in ChromaDB along with metadata. This allows:
- Easy version tracking
- Metadata-based search and retrieval
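
Version tracking over the stored metadata can be sketched with a small helper. This is an illustration under assumptions, not part of the original workflow: it assumes version labels follow the `vN` pattern used in this notebook and takes a dict shaped like the output of `collection.get()` (keys `ids`, `metadatas`, `documents`). The name `latest_version` is hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def latest_version(results: Dict[str, Any],
                   role: Optional[str] = None) -> Optional[Tuple[str, dict, str]]:
    """Return (id, metadata, document) for the highest "vN" version.

    If `role` is given, only entries whose metadata role matches are considered.
    Returns None when nothing matches.
    """
    best, best_num = None, -1
    for doc_id, meta, doc in zip(results["ids"], results["metadatas"],
                                 results["documents"]):
        if role is not None and meta.get("role") != role:
            continue
        # Assumption: versions are labelled "v1", "v2", ...
        num = int(str(meta.get("version", "v0")).lstrip("v"))
        if num > best_num:
            best, best_num = (doc_id, meta, doc), num
    return best
```

Once the collection is populated, `latest_version(collection.get())` would return the newest stored version. To filter on the database side instead, `collection.get(where={"role": "Human Editor"})` narrows the results by metadata before they are returned.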


In [None]:
import chromadb

# PersistentClient stores the collection on disk at the given path (no Settings object needed)
client = chromadb.PersistentClient(path="/content/chromadb")

collection = client.get_or_create_collection("gates_of_morning")


In [None]:
# upsert (rather than add) keeps this cell safe to re-run with the same ids
collection.upsert(
    documents=[chapter_text, spun_version],
    metadatas=[
        {"version": "v1", "role": "original", "notes": "Raw HTML scrape"},
        {"version": "v2", "role": "AI Writer", "notes": "Spun with HF T5 model"},
    ],
    ids=["chapter1_v1", "chapter1_v2"]
)


In [None]:
import re

# Dummy edit: replace the standalone word "He" only (not substrings such as "Here")
human_revision = re.sub(r"\bHe\b", "The man", spun_version)

collection.add(
    documents=[human_revision],
    metadatas=[{"version": "v3", "role": "Human Editor", "notes": "Reviewed and adjusted"}],
    ids=["chapter1_v3"]
)


In [None]:
results = collection.get(ids=["chapter1_v1", "chapter1_v2", "chapter1_v3"])
for doc, meta in zip(results["documents"], results["metadatas"]):
    print(f"📘 {meta['version']} ({meta['role']}):\n{doc[:1500]}...\n")


## Intelligent Retrieval via RL-style Scoring

Simulated reinforcement-learning rewards are used to rank the chapter versions. Each version's final score is an equally weighted average of:
- Embedding similarity to the query
- Reward score (e.g., human feedback or editor preference)

This helps find the "best" version dynamically.
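
The blend of similarity and reward can be written as a small ranking function. This is a sketch under assumptions: `alpha` is the similarity weight (0.5 matches the equal weighting used in this notebook), versions without a reward fall back to `default_reward`, and the names `rank_versions` and `l2_squared_to_cosine` are hypothetical. The conversion helper applies only if the collection reports squared L2 distances over unit-length embeddings, in which case distance = 2 - 2 * cosine similarity.

```python
from typing import Dict, List, Tuple


def l2_squared_to_cosine(distance: float) -> float:
    """Cosine similarity of two unit vectors given their squared L2 distance.

    For unit vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b).
    """
    return 1.0 - distance / 2.0


def rank_versions(similarities: Dict[str, float],
                  rewards: Dict[str, float],
                  alpha: float = 0.5,
                  default_reward: float = 0.0) -> List[Tuple[str, float]]:
    """Rank document ids by alpha * similarity + (1 - alpha) * reward, best first."""
    scored = [
        (doc_id, alpha * sim + (1.0 - alpha) * rewards.get(doc_id, default_reward))
        for doc_id, sim in similarities.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative similarity values only; the rewards mirror the simulated scores
ranked = rank_versions(
    {"chapter1_v1": 0.2, "chapter1_v2": 0.6},
    {"chapter1_v1": 0.4, "chapter1_v2": 0.85},
)
print(ranked[0][0])  # id of the top-ranked version
```

Raising `alpha` makes the ranking follow the query more closely; lowering it gives more weight to editor feedback.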


In [None]:
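!pip install sentence-transformers

from sentence_transformers import SentenceTransformer

# Note: this rebinds `model`, replacing the T5 paraphraser loaded earlier
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the original and spun versions
embeddings = model.encode([chapter_text, spun_version])

# Upsert so existing entries with these ids are updated rather than duplicated
collection.upsert(
    ids=["chapter1_v1", "chapter1_v2"],
    embeddings=embeddings.tolist(),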
    documents=[chapter_text, spun_version],
    metadatas=[
        {"version": "v1", "role": "original"},
        {"version": "v2", "role": "AI Writer"},
    ]
)


In [None]:
# Simulated feedback scores
reward_scores = {
    "chapter1_v1": 0.4,  # Original
    "chapter1_v2": 0.85, # AI Writer output
}


In [None]:
query = "human-like, literary rewrite of the chapter"

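query_vec = model.encode([query])[0]

results = collection.query(
    query_embeddings=[query_vec.tolist()],
    n_results=2,
    include=["metadatas", "documents", "distances"]
)

# Combine similarity + reward with equal weights.
# 1 - distance treats the distance as a cosine distance; with another metric
# (Chroma collections default to L2) it is only a rough similarity proxy.
# Versions without a simulated reward (e.g. chapter1_v3) fall back to 0.0
# instead of raising a KeyError.
for i, doc_id in enumerate(results["ids"][0]):
    similarity = 1 - results["distances"][0][i]
    score = similarity * 0.5 + reward_scores.get(doc_id, 0.0) * 0.5
    print(f"📄 Doc ID: {doc_id} | Final Score: {score:.4f}")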


In [None]:
!ls *.ipynb
