<a href="https://colab.research.google.com/github/Arvind6446/RNNMachineLearning/blob/main/website_rag_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Website Content → Clean → Deduplicate → Embeddings → FAISS (Colab)
This notebook uploads a `website_content.txt` file (your crawled website text), cleans it, removes duplicates/boilerplate, chunks it, creates embeddings using a free local Sentence-Transformers model, and builds a FAISS index for similarity search.

**No OpenAI key required.**

In [3]:
!pip -q uninstall -y langchain-classic langgraph-prebuilt langgraph || true
!pip -q install -U "langchain==0.3.27" "langchain-core==0.3.81" "langchain-community==0.3.27" "langchain-text-splitters==0.3.11"
!pip -q install -U faiss-cpu sentence-transformers requests==2.32.4


## Upload your `website_content.txt`

In [4]:
from google.colab import files
uploaded = files.upload()  # choose your website_content.txt


Saving website_content.txt to website_content.txt


## Load the file

In [5]:
from pathlib import Path

fname = next(iter(uploaded.keys()))
path = Path(fname)

raw_text = path.read_text(encoding="utf-8", errors="ignore")
print("File:", path)
print("Chars:", len(raw_text))
print(raw_text[:500])


File: website_content.txt
Chars: 968970
URL: https://www.pixelsoftwares.com/

0

Pixel

  * Home
  * Services
    * Website Design
    * Design Services
    * UI/UX Design
    * Small Business Starter Kit
  * Development
    * Custom Software Development
    * Mobile App Development
    * PHP Development
    * NodeJS Development
    * ReactJS Development
  * Industries
    * eCommerce App
    * Recharge & Bill Payments
    * Travel Booking App
    * Fantasy Sports App
    * Cab Booking App
    * Video Sharing App
    * Beauty Services


## Parse into URL blocks (keeps per-page metadata)
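Expected format inside the file:

`URL: https://...` then page text, then a separator line. Blocks are split wherever a line starts with `URL:` followed by an http(s) address, so any separator line simply stays at the end of that page's text.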
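
As a minimal sketch of that layout, the toy string below is split with the same lookahead pattern the next cell uses. The `example.com` URLs and the `-----` separator are made up for illustration; the real file may use a different separator.

```python
import re

# Hypothetical two-page sample; the URLs and the "-----" separator are assumptions.
sample = (
    "URL: https://example.com/\n"
    "Home page text\n"
    "-----\n"
    "URL: https://example.com/about\n"
    "About page text\n"
    "-----\n"
)

# Split at every newline that is immediately followed by a "URL: http..." line.
toy_blocks = re.split(r"\n(?=URL:\s*https?://)", sample.strip())

toy_pairs = []
for block in toy_blocks:
    # First line carries the URL, everything after it is the page body.
    first_line, _, body = block.partition("\n")
    url = first_line.split(":", 1)[1].strip()
    toy_pairs.append((url, body.strip()))

for url, body in toy_pairs:
    print(url, "->", repr(body))
```

Note that the separator ends up inside each body, which is why a later cleaning pass is still needed.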

In [6]:
import re
from collections import Counter
from langchain_core.documents import Document

def split_into_url_blocks(text: str):
    text = text.replace("\r\n", "\n")
    blocks = re.split(r"\n(?=URL:\s*https?://)", text.strip())
    return [b.strip() for b in blocks if b.strip()]

def parse_block(block: str):
    m = re.match(r"URL:\s*(https?://\S+)\s*\n+(.*)$", block, flags=re.DOTALL)
    if not m:
        return ("unknown", block.strip())
    return (m.group(1).strip(), m.group(2).strip())

blocks = split_into_url_blocks(raw_text)
pairs = [parse_block(b) for b in blocks]

print("Blocks:", len(pairs))
if pairs:
    print("Example URL:", pairs[0][0])
    print(pairs[0][1][:300])


Blocks: 103
Example URL: https://www.pixelsoftwares.com/
0

Pixel

  * Home
  * Services
    * Website Design
    * Design Services
    * UI/UX Design
    * Small Business Starter Kit
  * Development
    * Custom Software Development
    * Mobile App Development
    * PHP Development
    * NodeJS Development
    * ReactJS Development
  * Industries
    * 


## Clean text: remove boilerplate + deduplicate pages
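This removes lines that appear on many pages (menus/footers/nav), skips pages with fewer than 200 characters left after cleaning, and drops pages whose cleaned text is an exact duplicate of an earlier page.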

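Tune `BOILERPLATE_THRESHOLD` (a line counts as boilerplate when it appears on at least this fraction of pages):
- Decrease (0.20–0.35): removes more repeated UI text, at the risk of dropping real content that recurs across pages
- Increase (0.45–0.60): keeps more content, but lets more menu/footer text through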
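
To see how the cutoff behaves, here is a toy document-frequency calculation. The four pages and their lines are invented, and `boilerplate_at` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Hypothetical pages, each reduced to its set of distinct lines.
toy_pages = [
    {"Home", "Contact Us", "We build apps"},
    {"Home", "Contact Us", "Our pricing"},
    {"Home", "Careers", "Open roles"},
    {"Home", "Careers", "Meet the team"},
]

# Document frequency: number of pages on which each line appears.
toy_df = Counter()
for lines in toy_pages:
    toy_df.update(lines)

def boilerplate_at(threshold: float) -> set:
    """Lines whose share of pages is at or above the threshold."""
    n = len(toy_pages)
    return {line for line, count in toy_df.items() if count / n >= threshold}

for t in (0.25, 0.50, 0.75):
    print(t, sorted(boilerplate_at(t)))
```

On this toy data the flagged set shrinks as the threshold rises: at 0.25 every line is flagged, at 0.75 only `Home` is.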

In [7]:
import re

def normalize_lines(text: str):
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln]
    lines = [re.sub(r"\s+", " ", ln) for ln in lines]
    return lines

# Document-frequency: count how many pages contain each line
page_lines = []
for url, body in pairs:
    page_lines.append(set(normalize_lines(body)))

df = Counter()
for s in page_lines:
    df.update(s)

num_pages = max(1, len(page_lines))
BOILERPLATE_THRESHOLD = 0.35
boilerplate = {line for line, c in df.items() if (c / num_pages) >= BOILERPLATE_THRESHOLD}

print("Pages:", num_pages)
print("Boilerplate lines detected:", len(boilerplate))
print("Sample boilerplate:", list(sorted(boilerplate))[:10])

def clean_page(body: str):
    lines = normalize_lines(body)
    lines = [ln for ln in lines if ln not in boilerplate]
    # remove consecutive duplicates
    out, prev = [], None
    for ln in lines:
        if ln != prev:
            out.append(ln)
        prev = ln
    return "\n".join(out).strip()

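import hashlib

docs = []
seen = set()

for url, body in pairs:
    cleaned = clean_page(body)
    if len(cleaned) < 200:
        continue  # skip pages with almost no unique text left
    # Content digest: deterministic across sessions, unlike the built-in hash() for str
    h = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()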
    if h in seen:
        continue
    seen.add(h)
    docs.append(Document(page_content=cleaned, metadata={"source": url}))

print("Docs after clean+dedupe:", len(docs))
if docs:
    print("Sample doc source:", docs[0].metadata.get("source"))
    print(docs[0].page_content[:300])


Pages: 103
Boilerplate lines detected: 83
Sample boilerplate: ['## Let’s Create Something', '### About', '### Contact Us', '### Quick Links', '#### About .', '#### Blockchain .', '#### Design .', '#### Development .', '#### Industries .', '##### Main Menu']
Docs after clean+dedupe: 82
Sample doc source: https://www.pixelsoftwares.com/
# Transforming Industries Through
Expertise!
Accelerate your business growth by creating unique products with us that
deliver exceptional experiences.
Website Design
UI/UX Design
Software Development
Web3 Development
Mobile App Development
Wallet Development
Website Design
UI/UX Design
Software Deve


## Chunk the documents
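Chunking splits each page into passages small enough to embed and retrieve individually. `chunk_size` and `chunk_overlap` are measured in characters; adjust them if the retrieved passages are too fragmented or too broad.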
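
The effect of a chunk size and an overlap can be illustrated with plain fixed-size character windows. This is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries to break on the listed separators first). `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Fixed-size character windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(toy_text, chunk_size=10, chunk_overlap=3)
for c in toy_chunks:
    print(repr(c))
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.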

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_documents(docs)

for i, d in enumerate(chunks):
    d.metadata["chunk_index"] = i

print("Total chunks:", len(chunks))
if chunks:
    print("Sample chunk metadata:", chunks[0].metadata)
    print(chunks[0].page_content[:300])


Total chunks: 737
Sample chunk metadata: {'source': 'https://www.pixelsoftwares.com/', 'chunk_index': 0}
# Transforming Industries Through
Expertise!
Accelerate your business growth by creating unique products with us that
deliver exceptional experiences.
Website Design
UI/UX Design
Software Development
Web3 Development
Mobile App Development
Wallet Development
Website Design
UI/UX Design
Software Deve


## Create embeddings (free) and build FAISS vector store
Uses `sentence-transformers/all-MiniLM-L6-v2` (fast, good baseline).
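
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below does this by brute force on hand-written 3-d vectors, which stand in for real embeddings. Euclidean distance is used here as an assumption about the metric, not as a statement of what `FAISS.from_documents` does internally. `l2_distance` and `nearest` are hypothetical helpers.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, vectors, k=2):
    """Return (index, distance) for the k vectors closest to query_vec."""
    scored = sorted((l2_distance(query_vec, v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in scored[:k]]

# Toy "embeddings": vectors 0 and 1 point in nearly the same direction.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query = [1.0, 0.0, 0.0]

toy_hits = nearest(toy_query, toy_vectors, k=2)
print(toy_hits)
```

A dedicated index library does the same job with optimized data structures, which matters once the collection grows well beyond a few hundred chunks.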

In [9]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_documents(chunks, embeddings)

print("FAISS index built.")


FAISS index built.


## Save / Load the FAISS index (optional)

In [10]:
# Save
db.save_local("faiss_index")
print("Saved index to ./faiss_index")

# Load (later)
# db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)


Saved index to ./faiss_index


## Similarity search test

In [15]:
query = "who is hr"
results = db.similarity_search(query, k=5)

for r in results:
    print("\n" + "="*80)
    print("SOURCE:", r.metadata.get("source"))
    print(r.page_content[:500])



SOURCE: https://www.pixelsoftwares.com/company-profile
and aptitude in software development with a strong character, steadfast
commitment, and 11 years of professional experience. He... has successfully
developed and delivered over 100 blockchain and financial projects since 2012.
He also demonstrates his leadership skills when leading teams to create
centralized and decentralized bitcoin exchanges. He is progressive and
supremely energetic, accoutred with meticulous knowledge & skills to handle
Blockchain Endeavors. Simply put, he possesses the int

SOURCE: https://www.pixelsoftwares.com/company-profile
of the wholesome Recruitment... & Management works from the year 2016. She
carries an overall experience of over 8 years which lead her to establish
herself as a person of sheer consummation & resolution. She effectively
liaises between management works, compliance with the company’s directives,
regulatory concerns, events, payrolls, etc. She challenges the status quo &
extends succes
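
To judge how close each hit really is, the scores can be printed next to the sources. A minimal sketch, assuming the `db` object built above and that lower scores mean closer matches (the case for a Euclidean-distance index). `print_hits_with_scores` is a hypothetical helper; `similarity_search_with_score` is the LangChain vector store method it relies on.

```python
def print_hits_with_scores(db, query: str, k: int = 5, preview: int = 200):
    """Print each hit's score, source URL and a short one-line preview.

    `db` is any object exposing similarity_search_with_score(query, k=...),
    which returns (document, score) pairs.
    """
    pairs = db.similarity_search_with_score(query, k=k)
    for doc, score in pairs:
        print(f"{score:.3f}  {doc.metadata.get('source')}")
        print("   ", doc.page_content[:preview].replace("\n", " "))
    return pairs

# Usage in the notebook (requires the FAISS store from the earlier cell):
# hits = print_hits_with_scores(db, "who is hr", k=5)
```

If all five scores are similar and large, the query is probably too vague for the index, and a more specific phrasing is worth trying.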

## Next step (optional): RAG Q&A
For full question answering, pass the retrieved chunks to an LLM as context. Two Colab-compatible options are a local Hugging Face model (slow) or a hosted endpoint.
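
Whichever model is chosen, the retrieval side of RAG comes down to placing the retrieved chunks into a prompt. A minimal sketch of that step follows. `build_prompt` and its template wording are hypothetical, and the hits are assumed to be objects with `.page_content` and `.metadata`, as returned by `db.similarity_search`.

```python
def build_prompt(question: str, hits, max_chars_per_hit: int = 800) -> str:
    """Assemble a grounded prompt from retrieved chunks."""
    context_parts = []
    for n, doc in enumerate(hits, start=1):
        source = doc.metadata.get("source", "unknown")
        # Truncate long chunks so the prompt stays within a small model's context.
        snippet = doc.page_content[:max_chars_per_hit].strip()
        context_parts.append(f"[{n}] {source}\n{snippet}")
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Usage in the notebook (requires the FAISS store from the earlier cells):
# prompt = build_prompt(query, db.similarity_search(query, k=5))
```

Numbering the sources in the context lets the model's answer be traced back to specific pages.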