<a href="https://colab.research.google.com/github/Arvind6446/RNNMachineLearning/blob/main/maximus_rag_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Maximus.com Content Downloader + RAG Content Generator (Colab, No API Keys)

This notebook:
1. Crawls **public** pages from `https://www.maximus.com/` (respecting `robots.txt`).
2. Extracts clean text and builds a small local corpus.
3. Splits into chunks and builds a **FAISS** vector index using **sentence-transformers** embeddings.
4. Runs a local **small instruction LLM** (Transformers) and answers/generates content using **RAG** (retrieval-augmented generation).

## Important notes
- Website content is copyrighted and governed by the site’s Terms of Use. Ensure you have permission to reuse content for ML.
- This notebook is designed for **RAG** (recommended) rather than fine-tuning (which is heavier and riskier).
- This runs **without Hugging Face tokens**. Some models may still require license acceptance; if so, pick a different open model in the model list.

---


In [1]:
# Colab setup
!pip -q install -U trafilatura beautifulsoup4 lxml requests tqdm   sentence-transformers faiss-cpu transformers accelerate

import os, re, json, time, hashlib
from urllib.parse import urljoin, urlparse
import requests
import trafilatura
from bs4 import BeautifulSoup
from tqdm import tqdm

import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline





## 1) Crawl maximus.com (polite + robots.txt aware)

You can adjust:
- `MAX_PAGES`: how many pages to download
- `SLEEP_SEC`: delay between requests
- `ALLOWED_PATH_PREFIXES`: optional allow-list for site sections (recommended)
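
To make the effect of `ALLOWED_PATH_PREFIXES` concrete, here is a minimal sketch of prefix filtering on made-up URLs. The helper `path_allowed` and the example prefixes are hypothetical and not part of the notebook. Unlike a plain `startswith` check, this variant matches whole path segments, so `/news` does not also admit `/newsletter`.

```python
from urllib.parse import urlparse

def path_allowed(url: str, prefixes: list[str]) -> bool:
    """Return True if the URL path falls under one of the allowed prefixes.

    Hypothetical helper: matches whole path segments, so "/news" admits
    "/news" and "/news/2024" but not "/newsletter".
    """
    path = urlparse(url).path or "/"
    for prefix in prefixes:
        if prefix == "/":
            return True  # the root prefix allows everything
        prefix = prefix.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            return True
    return False

# Illustrative URLs only; they are not taken from the crawled site.
examples = [
    "https://www.example.com/news/some-article",
    "https://www.example.com/newsletter",
    "https://www.example.com/about",
]
for url in examples:
    print(url, path_allowed(url, ["/news", "/about"]))
```

Whether segment-level matching is worth the extra lines depends on how the target site names its sections; for a single broad prefix such as `"/"` the two approaches behave the same.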


In [2]:
BASE = "https://www.maximus.com/"
MAX_PAGES = 150
SLEEP_SEC = 1.0
TIMEOUT = 30

# Optional: restrict crawl to certain sections (recommended).
# Example: only crawl /news, /about, /services
ALLOWED_PATH_PREFIXES = [
    "/",  # keep "/" to allow everything; replace with specific prefixes for tighter scope
    # "/about",
    # "/services",
    # "/news",
]

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; educational-research-bot/1.0)"
}

def same_domain(url: str) -> bool:
    return urlparse(url).netloc == urlparse(BASE).netloc

def allowed_path(url: str) -> bool:
    path = urlparse(url).path or "/"
    return any(path.startswith(p) for p in ALLOWED_PATH_PREFIXES)

def normalize_url(url: str) -> str:
    url = url.split("#")[0]
    # remove trailing slash except domain root
    if url.endswith("/") and url != BASE:
        url = url[:-1]
    return url

def fetch(url: str) -> str:
    r = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    r.raise_for_status()
    return r.text

def extract_text(html: str, url: str) -> str:
    text = trafilatura.extract(html, url=url, include_comments=False, include_tables=False)
    if text:
        text = re.sub(r"\s+", " ", text).strip()
        return text
    # fallback
    soup = BeautifulSoup(html, "lxml")
    text = soup.get_text(" ", strip=True)
    return re.sub(r"\s+", " ", text).strip()

def get_links(html: str, url: str) -> set[str]:
    soup = BeautifulSoup(html, "lxml")
    links = set()
    for a in soup.select("a[href]"):
        href = a.get("href")
        if not href:
            continue
        if href.startswith("mailto:") or href.startswith("javascript:"):
            continue
        full = normalize_url(urljoin(url, href))
        if full.startswith("http") and same_domain(full) and allowed_path(full):
            links.add(full)
    return links


In [3]:
# Robots.txt handling
import urllib.robotparser as robotparser

rp = robotparser.RobotFileParser()
robots_url = urljoin(BASE, "/robots.txt")
try:
    rp.set_url(robots_url)
    rp.read()
    print("Loaded robots.txt:", robots_url)
except Exception as e:
    print("Could not load robots.txt. Proceeding cautiously:", e)

def can_fetch(url: str) -> bool:
    try:
        return rp.can_fetch(HEADERS["User-Agent"], url)
    except Exception:
        return True


Loaded robots.txt: https://www.maximus.com/robots.txt
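
The behaviour of `can_fetch` can be checked offline by parsing rules from memory instead of fetching them. The sketch below uses an invented rule set (not the real maximus.com file) and the hypothetical names `sample_rules`, `rp_demo`, and `unread`. It also illustrates that, in current CPython versions, a parser that has never read or parsed a robots.txt file answers `False` for every URL, so a failed `rp.read()` would cause the crawl to skip pages rather than fetch them.

```python
import urllib.robotparser as robotparser

# Illustrative robots.txt content (an assumption, not the site's actual rules).
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp_demo = robotparser.RobotFileParser()
rp_demo.parse(sample_rules)  # parse in-memory lines; no network request

blocked = rp_demo.can_fetch("educational-research-bot", "https://www.example.com/private/report")
allowed = rp_demo.can_fetch("educational-research-bot", "https://www.example.com/about")
print("private allowed?", blocked)
print("about allowed?", allowed)

# A parser that was never loaded treats every URL as disallowed.
unread = robotparser.RobotFileParser()
never_loaded = unread.can_fetch("educational-research-bot", "https://www.example.com/about")
print("unloaded parser allows about?", never_loaded)
```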


In [4]:
# Crawl BFS
visited = set()
queue = [normalize_url(BASE)]
docs = []

for _ in tqdm(range(MAX_PAGES)):
    if not queue:
        break
    url = queue.pop(0)
    if url in visited:
        continue
    visited.add(url)

    if not can_fetch(url):
        continue

    try:
        html = fetch(url)
        text = extract_text(html, url)
        # keep only meaningful pages
        if len(text) >= 400:
            docs.append({"url": url, "text": text})
        # expand links
        for link in get_links(html, url):
            if link not in visited and link not in queue:
                queue.append(link)
        time.sleep(SLEEP_SEC)
    except Exception as e:
        # report and skip failures (rate limits, 404s, etc.) so they stay visible
        tqdm.write(f"Skipped {url}: {e}")
        time.sleep(SLEEP_SEC)
        continue

print("Visited:", len(visited))
print("Saved docs:", len(docs))
docs[:2]


  1%|          | 1/150 [00:01<03:56,  1.59s/it]

Visited: 1
Saved docs: 1





[{'url': 'https://www.maximus.com/',
  'text': "Frictionless government starts here Where experience meets innovation Technology services We are driving innovation and delivering impactful mission outcomes through emerging technologies, advanced architectures, and modern methodologies that accelerate IT transformation. Customer experience Our seamless integration of technology and service delivery ensures every interaction creates a holistic and exceptional journey. Why Maximus? Our varied demographic team communicates in more than 120 languages across 6 countries, helping to eliminate barriers to matching the right services with the right people at the right time. We're passionate about what we do because we care. Learn more about life and careers at MaximusPeople-first culture, inside and out We’re in the business of building connections, with our team, with our partners, and with the customers we ultimately help serve. Learn more about life at MaximusWork with us Do business with us
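
The crawl above is a breadth-first traversal: pages are taken from the front of the queue and newly discovered links are appended to the back. As a sketch of the same idea without any network access, the example below walks a small in-memory link graph. The names `bfs_order` and `toy_site` are hypothetical, and it uses `collections.deque` plus a set of already-queued URLs, which avoids the linear-time `queue.pop(0)` and `link not in queue` checks on a list.

```python
from collections import deque

def bfs_order(start: str, links: dict[str, list[str]], max_pages: int) -> list[str]:
    """Return pages in breadth-first visiting order over an in-memory link graph."""
    queued = {start}            # every URL that has ever been enqueued
    frontier = deque([start])   # FIFO queue of pages still to visit
    order: list[str] = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in queued:
                queued.add(nxt)
                frontier.append(nxt)
    return order

# Toy link graph standing in for fetched pages (illustrative paths only).
toy_site = {
    "/": ["/about", "/news"],
    "/about": ["/", "/careers"],
    "/news": ["/news/1", "/about"],
}
print(bfs_order("/", toy_site, max_pages=4))
```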

### Save corpus (JSONL)


In [5]:
CORPUS_PATH = "maximus_corpus.jsonl"
with open(CORPUS_PATH, "w", encoding="utf-8") as f:
    for d in docs:
        f.write(json.dumps(d, ensure_ascii=False) + "\n")
print("Saved:", CORPUS_PATH)


Saved: maximus_corpus.jsonl


## 2) Clean + Chunk the text

We split each page into chunks of about 900 characters with a 150-character overlap (the `chunk_size` and `overlap` arguments in the code). You can also chunk by words or tokens.
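
As a sketch of the word-based alternative, the function below slides a window over whitespace-separated words instead of characters. The name `chunk_by_words` and its default sizes are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
def chunk_by_words(text: str, size: int = 150, overlap: int = 25) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny demonstration with ten placeholder words.
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_by_words(sample, size=4, overlap=1))
```

Word-based windows never cut a word in half, which character slicing can do; token-based chunking would follow the same pattern with a tokenizer in place of `str.split`.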


In [6]:
def clean_text(t: str) -> str:
    t = re.sub(r"\s+", " ", t).strip()
    return t

def chunk_text(text: str, chunk_size=900, overlap=150) -> list[str]:
    text = clean_text(text)
    chunks = []
    i = 0
    while i < len(text):
        chunk = text[i:i+chunk_size]
        if len(chunk) >= 200:
            chunks.append(chunk)
        i += max(1, chunk_size - overlap)
    return chunks

chunks = []
metas = []

for d in docs:
    c = chunk_text(d["text"], chunk_size=900, overlap=150)
    for idx, ch in enumerate(c):
        chunks.append(ch)
        metas.append({"url": d["url"], "chunk_id": idx})

print("Total chunks:", len(chunks))
chunks[0][:300]


Total chunks: 2


'Frictionless government starts here Where experience meets innovation Technology services We are driving innovation and delivering impactful mission outcomes through emerging technologies, advanced architectures, and modern methodologies that accelerate IT transformation. Customer experience Our sea'

## 3) Build embeddings + FAISS index (no API keys)

We use `sentence-transformers/all-MiniLM-L6-v2` (fast, light).  
If it fails due to download limits, try another open embedding model.


In [7]:
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

embedder = SentenceTransformer(EMBED_MODEL)
emb = embedder.encode(chunks, batch_size=64, show_progress_bar=True, normalize_embeddings=True).astype("float32")

dim = emb.shape[1]
index = faiss.IndexFlatIP(dim)  # cosine similarity when embeddings are normalized
index.add(emb)

print("FAISS index size:", index.ntotal, "dim:", dim)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



FAISS index size: 2 dim: 384
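
The index uses inner product (`IndexFlatIP`) on embeddings produced with `normalize_embeddings=True`. The reason this amounts to cosine similarity is that cosine is the dot product divided by both vector lengths, and unit-length vectors make that divisor 1. A small plain-Python check with made-up two-dimensional vectors (the helper names are hypothetical):

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine(a, b)
ip_ab = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, ip_ab)  # the two values agree up to floating-point error
```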


## 4) Load a local small LLM (Transformers)

Pick a small instruct model that runs on Colab. Good choices:
- `Qwen/Qwen2.5-1.5B-Instruct` (recommended)
- `google/gemma-2-2b-it` (also strong, but gated on Hugging Face, so it needs license acceptance and a token)

If one fails, change `GEN_MODEL` below.


In [8]:
GEN_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # change if needed

tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(GEN_MODEL, device_map="auto")

gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=350,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

print("Loaded generation model:", GEN_MODEL)



Device set to use cpu


Loaded generation model: Qwen/Qwen2.5-1.5B-Instruct


## 5) RAG: Retrieve relevant chunks and generate grounded output

We retrieve top-k chunks and ask the model to write content **using only the retrieved context**.


In [9]:
def retrieve(query: str, k=5):
    q_emb = embedder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q_emb, k)
    results = []
    for score, i in zip(scores[0], ids[0]):
        if i == -1:
            continue
        results.append({
            "score": float(score),
            "text": chunks[i],
            "meta": metas[i]
        })
    return results

def make_prompt(query: str, retrieved: list[dict]) -> str:
    context_blocks = []
    for r in retrieved:
        context_blocks.append(f"Source: {r['meta']['url']}\n{r['text']}")
    context = "\n\n---\n\n".join(context_blocks)

    return f"""You are a helpful assistant. You must follow these rules:
- Use ONLY the provided context to answer or generate content.
- If the context is insufficient, say what is missing and do not invent facts.
- Keep the writing professional and clear.

USER REQUEST:
{query}

CONTEXT:
{context}

OUTPUT:
"""

def rag_generate(query: str, k=5):
    retrieved = retrieve(query, k=k)
    prompt = make_prompt(query, retrieved)
    out = gen(prompt)[0]["generated_text"]
    return out, retrieved

# Example
query = "Write a short company overview paragraph about Maximus, and list key service areas."
answer, sources = rag_generate(query, k=5)
print(answer[:2000])


You are a helpful assistant. You must follow these rules:
- Use ONLY the provided context to answer or generate content.
- If the context is insufficient, say what is missing and do not invent facts.
- Keep the writing professional and clear.

USER REQUEST:
Write a short company overview paragraph about Maximus, and list key service areas.

CONTEXT:
Source: https://www.maximus.com/
Frictionless government starts here Where experience meets innovation Technology services We are driving innovation and delivering impactful mission outcomes through emerging technologies, advanced architectures, and modern methodologies that accelerate IT transformation. Customer experience Our seamless integration of technology and service delivery ensures every interaction creates a holistic and exceptional journey. Why Maximus? Our varied demographic team communicates in more than 120 languages across 6 countries, helping to eliminate barriers to matching the right services with the right people at the r
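
The printed output above begins with the prompt itself because the `text-generation` pipeline returns the prompt followed by the completion by default. Passing `return_full_text=False` when calling the pipeline returns only the completion. Alternatively, the prompt can be removed afterwards; a minimal sketch of that approach, where `strip_prompt` is a hypothetical helper rather than part of the notebook:

```python
def strip_prompt(full_text: str, prompt: str) -> str:
    """Drop the echoed prompt from generated text, if it is present."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].lstrip()
    return full_text  # nothing to strip; return the text unchanged

# Stand-in strings; in the notebook these would come from make_prompt and gen.
demo_prompt = "USER REQUEST:\nSummarize.\n\nOUTPUT:\n"
demo_output = demo_prompt + "Maximus provides technology services."
print(strip_prompt(demo_output, demo_prompt))
```

In `rag_generate` this would be applied as `strip_prompt(gen(prompt)[0]["generated_text"], prompt)`.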

### Show retrieved sources (for transparency)


In [10]:
for s in sources:
    print(f"- score={s['score']:.3f} url={s['meta']['url']} chunk={s['meta']['chunk_id']}")


- score=0.519 url=https://www.maximus.com/ chunk=0
- score=0.481 url=https://www.maximus.com/ chunk=1


## 6) “Content generation” templates

These prompts generate useful marketing / product / summary content grounded in the retrieved pages.


In [11]:
templates = {
    "Blog post outline": "Create a blog post outline about: {topic}. Include headings and bullet points.",
    "Press-style summary": "Write a press-style summary about: {topic}. Keep it factual, no hype.",
    "FAQ": "Create an FAQ (8 questions) about: {topic}. Answer using only context.",
}

topic = "Maximus services in government and healthcare"

for name, t in templates.items():
    q = t.format(topic=topic)
    ans, src = rag_generate(q, k=6)
    print("\n" + "="*80)
    print(name)
    print("="*80)
    print(ans[-1800:])  # show tail where answer usually ends



Blog post outline
are

#### Introduction
- Brief overview of Maximus as a leading provider of technology solutions for government and healthcare sectors.
- Importance of integrating Maxiust’s services in these critical areas.

#### Headings & Bullet Points

##### 1. Maximus Services Overview
- What does Maximus offer?
- Focus on innovation and transformative projects
- How technology drives efficiency and effectiveness

##### 2. Maximus in Government Services
- Case studies showcasing successful implementation
- Benefits to governments (cost savings, improved services, citizen satisfaction)
- Scalability and flexibility in handling large-scale projects

##### 3. Maximus in Healthcare Sector
- Key challenges faced by healthcare systems (data interoperability, patient engagement, administrative burden)
- Maximus’ approach to solving these issues
- Success stories from various healthcare organizations

##### 4. Maximus' Commitment to Diversity and Inclusion
- Maximus' global reach and co

## 7) Save the index + chunks for reuse


In [12]:
ART_DIR = "rag_artifacts"
os.makedirs(ART_DIR, exist_ok=True)

# Save chunks + metadata
with open(os.path.join(ART_DIR, "chunks.jsonl"), "w", encoding="utf-8") as f:
    for ch, m in zip(chunks, metas):
        f.write(json.dumps({"text": ch, "meta": m}, ensure_ascii=False) + "\n")

# Save FAISS index
faiss.write_index(index, os.path.join(ART_DIR, "faiss.index"))

print("Saved artifacts to:", ART_DIR)


Saved artifacts to: rag_artifacts
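
To reuse the artifacts in a later session, the chunk file can be read back line by line and the index restored with `faiss.read_index(os.path.join(ART_DIR, "faiss.index"))`. The sketch below covers only the JSONL part so that it runs without FAISS; `load_chunks` is a hypothetical helper, and the round trip uses a temporary file with one invented record.

```python
import json
import os
import tempfile

def load_chunks(path: str) -> tuple[list[str], list[dict]]:
    """Read chunks.jsonl back into parallel lists of texts and metadata."""
    texts, metas = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            texts.append(row["text"])
            metas.append(row["meta"])
    return texts, metas

# Round-trip demonstration with a single made-up record.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "chunks.jsonl")
    record = {"text": "hello", "meta": {"url": "https://www.example.com/", "chunk_id": 0}}
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    demo_texts, demo_metas = load_chunks(demo_path)

print(demo_texts, demo_metas)
```

The order of lines in `chunks.jsonl` matches the order in which vectors were added to the index, so position `i` in the loaded lists corresponds to FAISS id `i`.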


## Troubleshooting

### If the crawl returns very few pages
- The site may block bots or restrict via `robots.txt`.
- Reduce `MAX_PAGES`, increase `SLEEP_SEC`, and narrow `ALLOWED_PATH_PREFIXES`.
- Try seeding with specific URLs you know are allowed.

### If the model download fails
- Switch `GEN_MODEL` to a different open model:
  - `Qwen/Qwen2.5-0.5B-Instruct` (very small)
  - `google/gemma-2-2b-it` (larger than the default and gated, so it requires license acceptance and a Hugging Face token)

### If you want fine-tuning
RAG is recommended. Fine-tuning requires extra steps (dataset construction, LoRA/QLoRA, GPU time) and may not be permitted for scraped content. If you have explicit rights and still want it, treat it as a separate project with its own notebook.
