# Generative Wikipedia QA System

This Colab notebook implements a small-scale generative QA system that answers user questions about machine learning topics using crawled Wikipedia content. The system combines semantic retrieval with a generative language model (Flan-T5) to synthesize coherent, context-rich answers. It includes modules for web crawling, text preprocessing, semantic indexing with SentenceTransformers and FAISS, and answer generation using Hugging Face Transformers.

# Imports & Helper Functions

In [None]:
!pip install --upgrade pip
!pip install "transformers<5.0.0" sentence-transformers requests beautifulsoup4 scikit-learn accelerate wordninja faiss-cpu nltk

import os, json, time, unicodedata
from datetime import datetime
from urllib.parse import urlparse, urlunparse, urljoin

import requests
import nltk
import numpy as np
import torch
import faiss
import wordninja
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Tokenizer data required by sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
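def demash_text(text):
    # Re-insert spaces into run-together words in generated text.
    # Tokens that wordninja returns as a single piece are kept unchanged.
    out = []
    for tok in text.split():
        pieces = wordninja.split(tok)
        out.append(" ".join(pieces) if len(pieces) > 1 else tok)
    return " ".join(out)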

In [None]:
def safe_sent_tokenize(text):
    # Retry once after downloading the tokenizer data if it is missing.
    try:
        return sent_tokenize(text)
    except LookupError:
        nltk.download('punkt')
        nltk.download('punkt_tab')
        return sent_tokenize(text)

In [None]:
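def clean_text(text):
    # Normalize Unicode, flatten newlines and non-breaking spaces, then drop
    # very short sentences and any sentence containing "[" (e.g. citation marks).
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\n", " ").replace("\xa0", " ").strip()
    sentences = safe_sent_tokenize(text)
    # Sentences keep their own end punctuation, so join with a space only.
    return " ".join(s for s in sentences if len(s) > 10 and "[" not in s)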

In [None]:
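def chunk_paragraphs(text, max_len=500):
    # One chunk per newline-separated paragraph longer than 50 characters,
    # truncated to max_len characters before cleaning.
    paras = [p.strip() for p in text.split("\n") if len(p.strip()) > 50]
    return [clean_text(p[:max_len]) for p in paras]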
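
Slicing a paragraph at `max_len` characters can cut it in the middle of a sentence. One possible alternative, shown here only as a sketch, packs whole sentences into chunks instead. The function name `chunk_by_sentences` and the regex-based sentence splitter are assumptions for illustration; the notebook itself uses NLTK's `sent_tokenize`.

```python
import re

def chunk_by_sentences(paragraph, max_len=500):
    """Pack whole sentences into chunks of at most max_len characters.

    Hypothetical alternative to slicing with p[:max_len]. A single sentence
    longer than max_len is kept whole in its own chunk rather than cut.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if current and len(candidate) > max_len:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = (
    "Machine learning is a field of study. "
    "It builds models from data. "
    "Models generalize to unseen inputs."
)
for chunk in chunk_by_sentences(sample, max_len=70):
    print(repr(chunk))
```

The trade-off is that chunk lengths become uneven, and an unusually long sentence can exceed `max_len`.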

In [None]:
def normalize_url(url):
    p = urlparse(url)
    clean = p._replace(params="", query="", fragment="")
    norm = urlunparse(clean)
    return norm[:-1] if norm.endswith("/") else norm

In [None]:
def is_valid_en_wiki(url):
    p = urlparse(url)
    return p.scheme in ("http", "https") and p.netloc == "en.wikipedia.org" and p.path.startswith("/wiki/")
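
A quick way to see what the two URL helpers do is to run them on a few hand-picked links. The snippet below repeats the helper definitions so it runs on its own; the sample URLs are illustrative and were chosen only to exercise each branch.

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    # Same logic as the notebook helper: drop params, query, and fragment,
    # then strip one trailing slash.
    p = urlparse(url)
    norm = urlunparse(p._replace(params="", query="", fragment=""))
    return norm[:-1] if norm.endswith("/") else norm

def is_valid_en_wiki(url):
    p = urlparse(url)
    return (p.scheme in ("http", "https")
            and p.netloc == "en.wikipedia.org"
            and p.path.startswith("/wiki/"))

examples = [
    "https://en.wikipedia.org/wiki/Machine_learning#History",       # fragment removed
    "https://en.wikipedia.org/wiki/Deep_learning?action=edit",      # query removed
    "https://de.wikipedia.org/wiki/Maschinelles_Lernen",            # wrong host
    "https://en.wikipedia.org/w/index.php?title=Machine_learning",  # not under /wiki/
]
for raw in examples:
    norm = normalize_url(raw)
    print(f"{norm}  valid={is_valid_en_wiki(norm)}")
```

Because fragments and queries are removed before deduplication, section links such as `#History` collapse onto the same page URL and are fetched only once.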

# Crawler

In [None]:
# English-only crawler with metadata
def crawl(start_url, max_depth=1, max_pages=2500, delay=1.0, visited=None):
    if visited is None:
        visited = set()
    seed = normalize_url(start_url)
    if not is_valid_en_wiki(seed):
        raise ValueError("Seed must be en.wikipedia.org/wiki/...")

    to_visit = [(seed, 0)]
    data = []
    while to_visit and len(visited) < max_pages:
        url, depth = to_visit.pop(0)
        if url in visited or depth > max_depth:
            continue
        print(f"Crawling: {url} (depth {depth})")
        try:
            resp = requests.get(url, timeout=10)
            if 'text/html' not in resp.headers.get('Content-Type', ''):
                continue
            soup = BeautifulSoup(resp.text, 'html.parser')
            title = soup.title.string if soup.title else ""
            paras = soup.find_all("p")
            # Join with newlines so chunk_paragraphs can split on paragraph boundaries.
            page_paragraphs = chunk_paragraphs("\n".join(p.get_text() for p in paras))

            for i, para in enumerate(page_paragraphs):
                data.append({
                    "url": url,
                    "title": title,
                    "paragraph_id": i,
                    "depth": depth,
                    "fetched_at": datetime.utcnow().isoformat() + "Z",
                    "text": para
                })

            visited.add(url)
            for link in soup.find_all('a', href=True):
                raw = urljoin(url, link['href'])
                norm = normalize_url(raw)

                # Skip non-content pages (meta, file, help, etc.)
                if any(part in norm for part in [
                    "Special:", "Wikipedia:", "Help:", "Talk:", "File:",
                    "Template:", "Portal:", "Category:"
                ]):
                    continue

                if is_valid_en_wiki(norm) and norm not in visited:
                    to_visit.append((norm, depth + 1))

            time.sleep(delay)
        except Exception as e:
            print(f"[Error crawling {url}]: {e}")
    return data, visited
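
The namespace filter in the crawler matches substrings anywhere in the URL. A stricter variant, sketched below under stated assumptions, parses the page title and compares only its namespace prefix. The helper name `is_article_url` is hypothetical, and excluding `Main_Page` is an extra assumption motivated by its appearance in the crawl log.

```python
from urllib.parse import urlparse, unquote

# Namespaces the crawler skips; matched here as an exact title prefix.
NON_CONTENT_NAMESPACES = {
    "Special", "Wikipedia", "Help", "Talk", "File",
    "Template", "Portal", "Category",
}

def is_article_url(url):
    """Return True for /wiki/<Title> URLs outside the listed namespaces."""
    path = unquote(urlparse(url).path)
    if not path.startswith("/wiki/"):
        return False
    title = path[len("/wiki/"):]
    # Assumption: the main page carries no article text worth indexing.
    if not title or title == "Main_Page":
        return False
    namespace, sep, _ = title.partition(":")
    # A colon only marks a namespace if the prefix is one we know about.
    return not (sep and namespace in NON_CONTENT_NAMESPACES)

for url in [
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Category:Machine_learning",
    "https://en.wikipedia.org/wiki/Main_Page",
]:
    print(url, is_article_url(url))
```

A check like this could be applied before a link is queued, alongside `is_valid_en_wiki`.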

In [None]:
# Crawl seeds (cap=2500 pages) & save JSON
seeds = [
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Deep_learning",
    "https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)",
    "https://en.wikipedia.org/wiki/Supervised_learning",
    "https://en.wikipedia.org/wiki/Unsupervised_learning",
    "https://en.wikipedia.org/wiki/Reinforcement_learning",
    "https://en.wikipedia.org/wiki/Support_vector_machine",
    "https://en.wikipedia.org/wiki/Artificial_neural_network",
    "https://en.wikipedia.org/wiki/Random_forest",
    "https://en.wikipedia.org/wiki/K-means_clustering",
    "https://en.wikipedia.org/wiki/DBSCAN",
    "https://en.wikipedia.org/wiki/Backpropagation",
    "https://en.wikipedia.org/wiki/Cross-validation_(statistics)",
    "https://en.wikipedia.org/wiki/Gradient_descent",
    "https://en.wikipedia.org/wiki/Long_short-term_memory",
]

In [None]:
visited = set()
crawled_data = []
max_total_pages = 2500

In [None]:
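for url in seeds:
    if len(visited) >= max_total_pages:
        break
    # crawl() compares len(visited) against max_pages, and visited is shared
    # across seeds, so pass the overall cap rather than the remaining budget.
    pages, visited = crawl(url, max_depth=1, max_pages=max_total_pages, delay=1.0, visited=visited)
    crawled_data.extend(pages)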

Crawling: https://en.wikipedia.org/wiki/Machine_learning (depth 0)
Crawling: https://en.wikipedia.org/wiki/Main_Page (depth 1)
Crawling: https://en.wikipedia.org/wiki/Machine_Learning_(journal) (depth 1)
Crawling: https://en.wikipedia.org/wiki/Statistical_learning_in_language_acquisition (depth 1)
Crawling: https://en.wikipedia.org/wiki/Data_mining (depth 1)
Crawling: https://en.wikipedia.org/wiki/Supervised_learning (depth 1)
Crawling: https://en.wikipedia.org/wiki/Unsupervised_learning (depth 1)
Crawling: https://en.wikipedia.org/wiki/Semi-supervised_learning (depth 1)
Crawling: https://en.wikipedia.org/wiki/Self-supervised_learning (depth 1)
Crawling: https://en.wikipedia.org/wiki/Reinforcement_learning (depth 1)
Crawling: https://en.wikipedia.org/wiki/Meta-learning_(computer_science) (depth 1)
Crawling: https://en.wikipedia.org/wiki/Online_machine_learning (depth 1)
Crawling: https://en.wikipedia.org/wiki/Batch_learning (depth 1)
Crawling: https://en.wikipedia.org/wiki/Curriculum_l

In [None]:
with open("crawled_data.json", "w", encoding="utf-8") as f:
    json.dump(crawled_data, f, indent=2, ensure_ascii=False)

# Preprocess and Embed

In [None]:
with open("crawled_data.json", "r", encoding="utf-8") as f:
    crawled_data = json.load(f)

In [None]:
# Parallel lists: texts[i], urls[i], and metadata[i] describe the same chunk.
texts = [entry["text"] for entry in crawled_data]
urls = [entry["url"] for entry in crawled_data]
metadata = list(crawled_data)

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode(texts, convert_to_tensor=True).cpu().numpy()

# Exact (brute-force) index; search returns squared L2 distances.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
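
For intuition, a search on an `IndexFlatL2` is an exhaustive comparison of the query against every stored vector by squared Euclidean distance. The sketch below reproduces that idea in plain Python on toy 2-D vectors. `flat_l2_search` and the sample data are illustrative names and values, not part of FAISS or this notebook.

```python
import heapq

def flat_l2_search(corpus_vecs, query_vec, top_k=3):
    """Exhaustive top-k search by squared L2 distance.

    Returns (distances, indices), nearest first, mirroring one row of the
    result of a flat index search.
    """
    scored = (
        (sum((c - q) ** 2 for c, q in zip(vec, query_vec)), i)
        for i, vec in enumerate(corpus_vecs)
    )
    best = heapq.nsmallest(top_k, scored)  # smallest distances first
    return [d for d, _ in best], [i for _, i in best]

corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
dists, idxs = flat_l2_search(corpus, [0.9, 0.1], top_k=2)
print("nearest indices:", idxs)
print("squared distances:", dists)
```

Exhaustive search is exact but scales linearly with the number of chunks, which is acceptable for a corpus of this size.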

In [None]:
# Flan-T5 model. A GPU runtime (e.g. a Colab T4) is recommended; the code falls back to CPU otherwise.
device = 0 if torch.cuda.is_available() else -1  # pipeline convention: -1 means CPU
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
# Do not call .to(device) here: -1 is not a valid torch device, and the pipeline moves the model itself.
gen_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
tokenizer.truncation_side = "left"
generator = pipeline("text2text-generation", model=gen_model, tokenizer=tokenizer, device=device, max_length=300, do_sample=False)

Device set to use cuda:0


In [None]:
# Query Expansion
def expand_query(query, top_k=3, max_terms=200):
    q_vec = embedder.encode([query], convert_to_tensor=True).cpu().numpy()
    _, idxs = index.search(q_vec, top_k)
    terms = []
    for i in idxs[0]:
        terms.extend(texts[i].split()[:max_terms // top_k])
    return f"{query} {' '.join(terms)}"

In [None]:
# Question Answering
def answer_question(query, top_k=3, max_context_chars=4000, max_avg_dist=1.0):
    expanded = expand_query(query, top_k=top_k)
    q_vec = embedder.encode([expanded], convert_to_tensor=True).cpu().numpy()
    dists, idxs = index.search(q_vec, top_k)
    avg_dist = float(np.mean(dists[0]))
    # Decline to answer when the retrieved chunks are too far from the expanded query.
    if avg_dist >= max_avg_dist:
        return f"(Low similarity - avg distance {avg_dist:.4f}) I'm not confident I can answer that based on the available information."

    per = max_context_chars // top_k  # character budget per retrieved chunk
    snippets = []
    for i in idxs[0]:
        chunk = texts[i][:per]
        if len(texts[i]) > per:
            # Only when the text was actually cut: drop the partial trailing word.
            chunk = chunk.rsplit(" ", 1)[0]
        snippets.append(chunk.strip())
    context = "\n\n".join(snippets)
    raw = generator(f"You are a helpful assistant.\n\n{context}\n\nQuestion: {query}\nAnswer:\n")[0]["generated_text"].strip()
    # Second pass: ask the model to tidy spacing and punctuation of its own answer.
    clean = generator(f"Rewrite with perfect spacing and punctuation:\n\n{raw}\n")[0]["generated_text"].strip()
    return demash_text(clean)

In [None]:
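# Relevance scoring utility
def get_relevance_scores(query, top_k=5):
    q_vec = embedder.encode([query], convert_to_tensor=True).cpu().numpy()
    dists, idxs = index.search(q_vec, top_k)
    # Score = 1 - squared L2 distance. Higher is more relevant; the value is
    # not a cosine similarity and becomes negative once the distance exceeds 1.
    return [(urls[i], float(1 - dists[0][j])) for j, i in enumerate(idxs[0])]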
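
`get_relevance_scores` reports one minus the value returned by `index.search`, which for an L2 index is a squared distance. Assuming the embeddings are unit-normalized, squared L2 distance and cosine similarity are related by `d^2 = 2 - 2*cos`, so the reported score equals `2*cos - 1` and can drop below zero. The plain-Python sketch below checks that identity on toy vectors; all function names and sample values are illustrative.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # Dot product; equals cosine similarity because both inputs are unit vectors.
    return sum(x * y for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    # For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
    return 1.0 - d2 / 2.0

a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([3.0, 1.0, 0.5])
d2 = squared_l2(a, b)
print(f"cosine (direct)        : {cosine(a, b):.4f}")
print(f"cosine (from d^2)      : {cosine_from_squared_l2(d2):.4f}")
print(f"1 - d^2 (score as used): {1.0 - d2:.4f}")
```

If a score on the usual cosine scale is preferred, `1 - d^2 / 2` is the conversion to use under the same normalization assumption.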

# Question Answering

In [None]:
questions = [
    "What is machine learning?",
    "What is deep learning?",
    "What is supervised learning?",
    "What is unsupervised learning?",
    "What is reinforcement learning?",
    "What is a support vector machine?",
    "What is an artificial neural network?",
    "What is the difference between classification and regression?",
    "What is overfitting in machine learning?",
    "What is underfitting?",
    "What is a training set and a test set?",
    "What does it mean to train a model?",
    "What is a feature in machine learning?",
    "What is feature engineering?",
    "What is feature selection?",
    "What is dimensionality reduction?",
    "What is principal component analysis?",
    "What is clustering?",
    "How does k-means clustering work?",
    "What is DBSCAN?",
    "What is a decision tree?",
    "What is an ensemble method?",
    "What is random forest?",
    "What is boosting in machine learning?",
    "What is bagging?",
    "What is logistic regression used for?",
    "What is a confusion matrix?",
    "What is precision and recall?",
    "What is a learning curve?",
    "What is cross-validation?",
]


In [None]:
for q in questions:
    print("Q:", q)
    print("A:", answer_question(q))
    print("\nTop 3 relevant documents and scores:")
    for url, score in get_relevance_scores(q, top_k=3):
        print(f"- {url} (score: {score:.4f})")
    print("-" * 60)

Q: What is machine learning?
A: ability of a machine to learn and then mimic human behavior that requires intelligence.

Top 3 relevant documents and scores:
- https://en.wikipedia.org/wiki/Theoretical_computer_science (score: 0.7113)
- https://en.wikipedia.org/wiki/Predictive_analytics (score: 0.6256)
- https://en.wikipedia.org/wiki/Machine_learning (score: 0.6151)
------------------------------------------------------------
Q: What is deep learning?
A: a class of machine learning algorithms in which a hierarchy of layers is used to transform input data into a progressively more abstract and composite representation.

Top 3 relevant documents and scores:
- https://en.wikipedia.org/wiki/Artificial_intelligence_in_mental_health (score: 0.6380)
- https://en.wikipedia.org/wiki/Deep_learning (score: 0.6205)
- https://en.wikipedia.org/wiki/Deep_neural_network (score: 0.6205)
------------------------------------------------------------
Q: What is supervised learning?
A: a type of algorithm t

In [None]:
questions = [
    # Hard questions
    "How does the bias–variance tradeoff affect model generalization?",
    "What is the Vapnik–Chervonenkis dimension and why is it important?",
    "How does t-SNE work and when should it be used?",
    "What are the differences between generative and discriminative models?",
    "How do attention mechanisms improve transformer performance compared to RNNs?"
]

In [None]:
for q in questions:
    print("Q:", q)
    print("A:", answer_question(q))
    print("\nTop 3 relevant documents and scores:")
    for url, score in get_relevance_scores(q, top_k=3):
        print(f"- {url} (score: {score:.4f})")
    print("-" * 60)

Q: How does the bias–variance tradeoff affect model generalization?
A: Even though the bias variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization.

Top 3 relevant documents and scores:
- https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff (score: 0.3270)
- https://en.wikipedia.org/wiki/Bias%E2%80%93variance_decomposition (score: 0.3270)
- https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff (score: 0.2095)
------------------------------------------------------------
Q: What is the Vapnik–Chervonenkis dimension and why is it important?
A: Geometric anomalies in higher dimensions lead to the well known curse of dimensionality.

Top 3 relevant documents and scores:
- https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_theory (score: 0.0673)
- https://en.wikipedia.org/wiki/Interaction_design (score: 0.0511)
- https://en.wikipedia.org/wiki/Linear_discriminant_analysis (score: 0.0085)
---------

In [None]:
questions = [
    # Nonsense Question
    "How does bababooie babado dabo da booo booie?"
]

In [None]:
for q in questions:
    print("Q:", q)
    print("A:", answer_question(q))
    print("\nTop 3 relevant documents and scores:")
    for url, score in get_relevance_scores(q, top_k=3):
        print(f"- {url} (score: {score:.4f})")
    print("-" * 60)

Q: How does bababooie babado dabo da booo booie?
A: a karaoke mini game .

Top 3 relevant documents and scores:
- https://en.wikipedia.org/wiki/Deepfake (score: -0.3134)
- https://en.wikipedia.org/wiki/MiniMax_(company) (score: -0.4156)
- https://en.wikipedia.org/wiki/Product_placement (score: -0.4277)
------------------------------------------------------------
