
# KuBertNetes - NLP Project

**Name: Kayky Fidelis Serafim**

**Email: kayky.fidelis.serafim@ccc.ufcg.edu.br**

**Student ID: 122110481**


## 1) Environment setup

## Installing the required libraries

In [None]:
!pip -q install scikit-learn==1.5.2 transformers==4.44.2 pypdf==4.3.1 unidecode==1.3.8

from pathlib import Path

import re, unicodedata

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch



## Scraping the K8s documentation

The code below recursively crawls the Kubernetes documentation and saves the collected text to a file that the rest of this project depends on. The crawl is slow: in the run logged below, 575 pages took about 829 seconds.

In [None]:
!pip -q install beautifulsoup4

OUT_DIR = "/content/"
SEEDS = [
    "https://kubernetes.io/docs/home/",
    "https://kubernetes.io/docs/concepts/",
    "https://kubernetes.io/docs/reference/glossary/",
]
MAX_PAGES = 600
SLEEP_SEC = 0.8
MAX_CHUNK_WORDS = 220
CHUNK_OVERLAP = 80

import time, re, hashlib
from urllib.parse import urljoin, urlparse, urldefrag
import requests
from bs4 import BeautifulSoup
from urllib import robotparser
from collections import deque
from pathlib import Path

ALLOWED_DOMAINS = {
    "kubernetes.io",
    "kubernetes.io:443",
}

ALLOWED_PATH_PREFIXES = (
    "/docs/",
    "/blog/",
    "/reference/",
    "/es/docs/",
    "/en/docs/",
)

HEADERS = {"User-Agent": "k8s-corpus-crawler/0.1 (+research/edu)"}

def can_fetch(rp, url):
    try:
        return rp.can_fetch(HEADERS["User-Agent"], url)
    except Exception:
        return True

def normalize_url(base, href):
    if not href:
        return None
    href = urljoin(base, href)
    href, _frag = urldefrag(href)
    parsed = urlparse(href)
    if parsed.scheme not in ("http","https"):
        return None
    host = parsed.netloc.lower()
    if host not in ALLOWED_DOMAINS:
        return None
    if not any(parsed.path.startswith(p) for p in ALLOWED_PATH_PREFIXES):
        return None
    return href

def fetch(url, session, rp, timeout=15):
    if not can_fetch(rp, url):
        return None, None
    try:
        r = session.get(url, headers=HEADERS, timeout=timeout)
        ctype = r.headers.get("Content-Type","").lower()
        if r.status_code != 200 or ("text/html" not in ctype):
            return None, None
        return r.text, r.url
    except requests.RequestException:
        return None, None

def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script","style","noscript","header","footer","nav","aside"]):
        tag.decompose()

    for pre in soup.find_all("pre"):
        pre.insert_before("\n```code\n")
        pre.insert_after("\n```\n")

    for i in range(1,7):
        for h in soup.find_all(f"h{i}"):
            h.insert_before("\n" + "#"*i + " " + (h.get_text(" ", strip=True) or "") + "\n")
    text = soup.get_text("\n", strip=True)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

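def chunk_text(txt, max_words=220, overlap=80):
    # Sliding window over words: each chunk starts (max_words - overlap) words
    # after the previous one, so consecutive chunks share `overlap` words.
    # The last window of a page can be a pure suffix of the one before it.
    words = txt.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = words[i:i+max_words]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
        i += max_words - overlap
    return chunks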

out = Path(OUT_DIR); out.mkdir(parents=True, exist_ok=True)
pages_dir = out / "pages"; pages_dir.mkdir(exist_ok=True)
chunks_dir = out / "chunks"; chunks_dir.mkdir(exist_ok=True)

session = requests.Session()

parsed_seed = urlparse(SEEDS[0])
robots_url = f"{parsed_seed.scheme}://{parsed_seed.netloc}/robots.txt"
rp = robotparser.RobotFileParser()
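try:
    rp.set_url(robots_url); rp.read()
except Exception as exc:
    # If robots.txt was never read, rp.can_fetch() rejects every URL and the
    # crawl collects nothing, so report the failure instead of hiding it.
    print(f"Could not read {robots_url}: {exc}")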

seen = set()
q = deque(SEEDS)
visited = 0
all_chunks = []

print(f"Começando crawl… (MAX_PAGES={MAX_PAGES})")
start_time = time.time()

while q and visited < MAX_PAGES:
    url = q.popleft()
    if url in seen:
        continue
    seen.add(url)

    html, final_url = fetch(url, session, rp)
    time.sleep(SLEEP_SEC)
    if not html:
        continue

    visited += 1
    text = html_to_text(html)

    uid = hashlib.md5((final_url or url).encode("utf-8")).hexdigest()[:10]
    page_path = pages_dir / f"{uid}.txt"
    page_path.write_text(f"URL: {final_url or url}\n\n{text}\n", encoding="utf-8")

    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        nurl = normalize_url(final_url or url, a.get("href"))
        if nurl and nurl not in seen:
            q.append(nurl)

    for j, ch in enumerate(chunk_text(text, max_words=MAX_CHUNK_WORDS, overlap=CHUNK_OVERLAP)):
        (chunks_dir / f"{uid}_{j:03d}.txt").write_text(ch, encoding="utf-8")
        all_chunks.append(ch)

    if visited % 25 == 0:
        elapsed = time.time() - start_time
        print(f"[{visited} páginas]  fila={len(q)}  tempo={elapsed:.1f}s")

corpus = out / "k8s_corpus_scraped.txt"
corpus.write_text("\n\n=== DOC CHUNK ===\n\n".join(all_chunks), encoding="utf-8")

elapsed = time.time() - start_time
print(f"\nConcluído ✅  Visited: {visited} páginas  |  Chunks: {len(all_chunks)}")
print(f"Corpus: {corpus}")
print(f"Tempo total: {elapsed:.1f}s")

print("\nExemplo de arquivos:")
!ls -lah "$OUT_DIR" | head -n 20
!ls -lah "$OUT_DIR/chunks" | head -n 20


Começando crawl… (MAX_PAGES=600)
[25 páginas]  fila=20411  tempo=40.6s
[50 páginas]  fila=39851  tempo=75.3s
[75 páginas]  fila=58744  tempo=111.8s
[100 páginas]  fila=76953  tempo=148.7s
[125 páginas]  fila=94456  tempo=183.5s
[150 páginas]  fila=111366  tempo=219.3s
[175 páginas]  fila=127542  tempo=254.5s
[200 páginas]  fila=143180  tempo=289.4s
[225 páginas]  fila=158038  tempo=324.5s
[250 páginas]  fila=172252  tempo=359.5s
[275 páginas]  fila=185837  tempo=395.4s
[300 páginas]  fila=198822  tempo=431.1s
[325 páginas]  fila=211172  tempo=467.0s
[350 páginas]  fila=222894  tempo=502.7s
[375 páginas]  fila=233970  tempo=539.1s
[400 páginas]  fila=244430  tempo=574.3s
[425 páginas]  fila=254244  tempo=610.2s
[450 páginas]  fila=263534  tempo=647.5s
[475 páginas]  fila=273061  tempo=685.5s
[500 páginas]  fila=282524  tempo=721.7s
[525 páginas]  fila=290846  tempo=757.3s
[550 páginas]  fila=298546  tempo=793.1s
[575 páginas]  fila=304598  tempo=828.9s
[600 páginas]  fila=310045  tempo=
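
The log above shows the queue reaching 310045 entries while only 600 pages are fetched. The crawl loop appends a link whenever it is not yet in `seen`, and a URL only enters `seen` when it is popped, so the same URL can be queued many times. A minimal sketch of one way to bound the queue, assuming a hypothetical in-memory link graph (`LINKS`) in place of real HTTP fetches; `crawl` and `enqueued` are illustrative names, not part of the notebook:

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages (not real crawl data).
LINKS = {
    "/docs/home/": ["/docs/concepts/", "/docs/tasks/", "/docs/concepts/"],
    "/docs/concepts/": ["/docs/home/", "/docs/tasks/"],
    "/docs/tasks/": ["/docs/home/", "/docs/concepts/"],
}

def crawl(seeds, links, max_pages=600):
    """Breadth-first crawl that places each URL in the queue at most once."""
    enqueued = set(seeds)      # every URL ever placed in the queue
    queue = deque(seeds)
    order = []                 # URLs in the order they were visited
    max_queue = len(queue)     # largest queue size observed
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in enqueued:   # skip URLs already queued or visited
                enqueued.add(nxt)
                queue.append(nxt)
        max_queue = max(max_queue, len(queue))
    return order, max_queue

order, max_queue = crawl(["/docs/home/"], LINKS)
print(order)
print("largest queue size:", max_queue)
```

With this check the queue can never hold more entries than there are distinct in-scope URLs. The visiting order stays breadth-first, so the set of pages collected should be the same as before.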


## 2) Verifying the crawl output

In [None]:

CORPUS_PATH = "/content/k8s_corpus_scraped.txt"
p = Path(CORPUS_PATH)
print("Exists:", p.exists(), "\nPath:", p)


Exists: True 
Path: /content/k8s_corpus_scraped.txt


## 3) Normalizing the corpus with regex

In [None]:
def normalize_text(t: str) -> str:
    t = unicodedata.normalize("NFKC", t)
    t = t.replace("\r\n", "\n").replace("\r", "\n")
    t = re.sub(r"[ \t\u00A0]+", " ", t)
    t = re.sub(r"\n{3,}", "\n\n", t).strip()
    return t

raw = Path(CORPUS_PATH).read_text(encoding="utf-8", errors="ignore")
text = normalize_text(raw)
print("Characters:", len(text))
print(text[:1200])


Characters: 10299347
Kubernetes Documentation | Kubernetes Kubernetes is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. The open source project is hosted by the Cloud Native Computing Foundation ( CNCF ). ## Understand Kubernetes Understand Kubernetes Learn about Kubernetes and its fundamental concepts. Why Kubernetes? Components of a cluster The Kubernetes API Objects In Kubernetes Containers Workloads and Pods View Concepts ## Try Kubernetes Try Kubernetes Follow tutorials to learn how to deploy applications in Kubernetes. Hello Minikube Walkthrough the basics Stateless Example: PHP Guestbook with Redis Stateful Example: Wordpress with Persistent Volumes View Tutorials ## Set up a K8s cluster Set up a K8s cluster Get Kubernetes running based on your resources and needs. Learning environment Production environment Install the kubeadm setup tool Securing a cluster kubeadm command reference Set up Kubernete
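
To make the effect of `normalize_text` concrete, the sketch below repeats the same four steps and applies them to a made-up string (not taken from the corpus) containing fullwidth letters, a non-breaking space, a tab, and Windows line endings:

```python
import re
import unicodedata

def normalize_text(t: str) -> str:
    # Same steps as the notebook cell above.
    t = unicodedata.normalize("NFKC", t)             # fold compatibility characters
    t = t.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    t = re.sub(r"[ \t\u00A0]+", " ", t)              # collapse spaces, tabs, NBSP
    t = re.sub(r"\n{3,}", "\n\n", t).strip()         # keep at most one blank line
    return t

# Made-up input: fullwidth "kubectl", a non-breaking space, a tab, CRLF endings.
sample = (
    "\uff4b\uff55\uff42\uff45\uff43\uff54\uff4c\u00a0get \t pods"
    "\r\n\r\n\r\n\r\nNext  line  "
)
cleaned = normalize_text(sample)
print(repr(cleaned))
```

NFKC maps the fullwidth letters to plain ASCII and the non-breaking space to a regular space, so the later regex steps only have to deal with ordinary whitespace.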

## 4) Creating overlapping chunks

In [None]:
def chunk_text(text: str, max_words=320, overlap=64):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        j = min(i + max_words, len(words))
        chunk = " ".join(words[i:j])
        chunks.append(chunk)
        i = j - overlap if j - overlap > i else j
    return chunks

chunks = chunk_text(text, max_words=320, overlap=64)
print("Total chunks:", len(chunks))
print("Sample chunk:\n", chunks[0][:600])


Total chunks: 5965
Sample chunk:
 Kubernetes Documentation | Kubernetes Kubernetes is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. The open source project is hosted by the Cloud Native Computing Foundation ( CNCF ). ## Understand Kubernetes Understand Kubernetes Learn about Kubernetes and its fundamental concepts. Why Kubernetes? Components of a cluster The Kubernetes API Objects In Kubernetes Containers Workloads and Pods View Concepts ## Try Kubernetes Try Kubernetes Follow tutorials to learn how to deploy applications in Kubernetes. Hello Min
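
With `max_words=320` and `overlap=64`, each window starts 256 words after the previous one. The `chunk_text` loop above also emits one extra window once the end of the text is reached, holding only the final `overlap` words, which are already part of the previous chunk. The sketch below is a variant that stops as soon as a window covers the tail; `chunk_words` is a hypothetical name, and the counts reported above were produced by the original function, not by this one:

```python
def chunk_words(text, max_words=320, overlap=64):
    """Sliding word windows; stops once a window reaches the end of the text."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap            # distance between window starts
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):   # this window already covers the tail
            break
        i += step
    return chunks

# Tiny illustration with 10 made-up words, windows of 4 and overlap of 2.
words = " ".join(f"w{k}" for k in range(10))
small = chunk_words(words, max_words=4, overlap=2)
print(small)
```

Validating `overlap < max_words` up front also rules out a zero or negative step, which would otherwise keep the loop from advancing.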


## 5) Building the TF-IDF index (character n-grams)

In [None]:
vectorizer = TfidfVectorizer(
    lowercase=False,
    analyzer="char_wb",
    ngram_range=(3,5),
    min_df=1
)
X = vectorizer.fit_transform(chunks)

def tfidf_search(query, top_k=8):
    qv = vectorizer.transform([query])
    sims = cosine_similarity(qv, X)[0]
    idx = sims.argsort()[::-1][:top_k]
    return [(chunks[i], float(sims[i]), i) for i in idx]

print("Matrix shape:", X.shape)


Matrix shape: (5965, 268219)
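
The `char_wb` analyzer builds features from character n-grams taken inside space-padded words, and `lowercase=False` keeps the matching case-sensitive. The sketch below approximates that idea in plain Python to show which strings end up looking similar. It uses raw n-gram counts with no IDF weighting, so the numbers are illustrative only and will not match the scikit-learn scores; `char_wb_ngrams` and `cosine` are hypothetical helpers:

```python
from collections import Counter
from math import sqrt

def char_wb_ngrams(text, min_n=3, max_n=5):
    """Count character n-grams inside each word, padded with one space per side."""
    grams = Counter()
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # range() is empty when the padded word is shorter than n
            for i in range(len(padded) - n + 1):
                grams[padded[i:i + n]] += 1
    return grams

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = char_wb_ngrams("Cluster")
sim_clusterip = cosine(query, char_wb_ngrams("ClusterIP"))
sim_lowercase = cosine(query, char_wb_ngrams("cluster"))
print(f"Cluster vs ClusterIP: {sim_clusterip:.2f}")
print(f"Cluster vs cluster:   {sim_lowercase:.2f}")
```

Because "Cluster" shares most of its n-grams with "ClusterIP", a page that repeats "ClusterIP" can rank highly for a question about clusters. Case sensitivity also means "cluster" and "Cluster" are only a partial match.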


## 6) Loading BERT for question answering (QA), no fine-tuning

In [None]:
qa_name = "deepset/bert-base-cased-squad2"
qa_tok = AutoTokenizer.from_pretrained(qa_name)
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_name)

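def extract_answer(question: str, context: str):
    # Encode the (question, context) pair; input beyond 512 tokens is truncated.
    inputs = qa_tok(question, context, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = qa_model(**inputs)
    # Greedy span selection: most likely start and end tokens, chosen independently.
    # Position 0 is [CLS]; a SQuAD2 model points there when it predicts "no answer",
    # so the decoded answer can be the literal string "[CLS]".
    start = int(torch.argmax(out.start_logits))
    end = int(torch.argmax(out.end_logits))
    if end < start:
        return "", -1e9  # inconsistent span: treat as no answer
    score = float(out.start_logits[0, start] + out.end_logits[0, end])
    ans = qa_tok.decode(inputs["input_ids"][0][start:end+1]).replace(" ##", "")
    return ans.strip(), score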

def answer_question(question: str, retrieve_k=6):
    cands = tfidf_search(question, top_k=retrieve_k)
    best = {"answer":"", "score":-1e9, "context":"", "rank":None}
    for rank, (ctx, sim, idx) in enumerate(cands, start=1):
        ans, s = extract_answer(question, ctx)
        if s > best["score"] and ans:
            best.update({"answer": ans, "score": s, "context": ctx, "rank": rank})
    return best

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## 7) Tests to check that the model works

In [None]:
res = answer_question("What is a Pod?")
res


{'answer': 'a group of one or more containers, with shared storage and network resources, and a specification for how to run the containers',
 'score': 10.744248390197754,
 'context': 'CHUNK === suggest an improvement . Last modified July 12, 2023 at 1:25 AM PST: Revise docs home page (9520b96a61) Edit this page Create child page Create an issue Print entire section === DOC CHUNK === Pods | Kubernetes # Pods Pods Pods are the smallest deployable units of computing that you can create and manage in Kubernetes. A Pod (as in a pod of whales or pea pod) is a group of one or more containers , with shared storage and network resources, and a specification for how to run the containers. A Pod\'s contents are always co-located and co-scheduled, and run in a shared context. A Pod models an application-specific "logical host": it contains one or more application containers which are relatively tightly coupled. In non-cloud contexts, applications executed on the same physical or virtual machine a

In [None]:
res = answer_question("What is a Cluster?")
res


{'answer': '[CLS]',
 'score': 12.867141723632812,
 'context': "ClusterIP allocation Service ClusterIP allocation In Kubernetes, Services are an abstract way to expose an application running on a set of Pods. Services can have a cluster-scoped virtual IP address (using a Service of type: ClusterIP ). Clients can connect using that virtual IP address, and Kubernetes then load-balances traffic to that Service across the different backing Pods. ## How Service ClusterIPs are allocated? How Service ClusterIPs are allocated? When Kubernetes needs to assign a virtual IP address for a Service, that assignment happens one of two ways: dynamically the cluster's control plane automatically picks a free IP address from within the configured IP range for type: ClusterIP Services. statically you specify an IP address of your choice, from within the configured IP range for Services. Across your whole cluster, every Service ClusterIP must be unique. Trying to create a Service with a specific ClusterIP 
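
In the output above the answer is the literal token `[CLS]`: both argmax positions landed on token 0. Models trained on SQuAD 2.0 conventionally point at the `[CLS]` position to signal "no answer", so this result is better read as an abstention than as an answer. The sketch below shows one common way to handle it, working on plain lists of made-up logits rather than real model output; `best_span` and `max_answer_len` are hypothetical names, and a fuller version would also restrict the search to context tokens:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end, score), null_score).

    The span search skips position 0 ([CLS]) and requires start <= end.
    null_score is the score of the "no answer" prediction at position 0.
    """
    null_score = start_logits[0] + end_logits[0]
    best = (None, None, float("-inf"))
    for s in range(1, len(start_logits)):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = start_logits[s] + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best, null_score

# Made-up logits where position 0 dominates, as in the "[CLS]" result above.
start_logits = [5.0, 0.1, 2.0, 0.3]
end_logits = [5.0, 0.2, 0.1, 1.5]
(span_start, span_end, span_score), null_score = best_span(start_logits, end_logits)
has_answer = span_score > null_score
print(span_start, span_end, span_score, null_score, has_answer)
```

Returning an empty answer when `has_answer` is false would let `answer_question` fall through to the next retrieved context instead of reporting `[CLS]`.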

In [None]:
res = answer_question("What is Kubernetes?")
res


{'answer': 'open source container orchestration engine',
 'score': 15.701431274414062,
 'context': 'run the database in one StatefulSet and the web server in a Deployment . ## Feedback Feedback Was this page helpful? Yes No Thanks for the feedback. If you have a specific, answerable question about how to use Kubernetes, ask it on Stack Overflow . Open an issue in the GitHub Repository if you want to report a problem or suggest an improvement . Last modified April 20, 2024 at 7:09 PM PST: Ready glossary page for vanilla Docsy (2f3602cef0) Edit this page Create child page Create an issue Print entire section === DOC CHUNK === a problem or suggest an improvement . Last modified April 20, 2024 at 7:09 PM PST: Ready glossary page for vanilla Docsy (2f3602cef0) Edit this page Create child page Create an issue Print entire section === DOC CHUNK === Kubernetes Documentation | Kubernetes Kubernetes is an open source container orchestration engine for automating deployment, scaling, and manageme
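
The contexts printed above contain literal `=== DOC CHUNK ===` markers and repeated page footers. The scraped corpus joins its chunks with that marker, and step 4 re-chunks the joined text without removing it. A minimal sketch of recovering the scraped chunks directly, assuming the separator string written by the scraping cell; `split_scraped_chunks` is a hypothetical helper and the demo text is made up:

```python
SEPARATOR = "=== DOC CHUNK ==="  # marker the scraping cell writes between chunks

def split_scraped_chunks(corpus_text):
    """Split on the marker, dropping empty pieces and pieces that only
    repeat the tail of the chunk kept just before them."""
    kept = []
    for piece in corpus_text.split(SEPARATOR):
        piece = " ".join(piece.split())          # collapse whitespace
        if not piece:
            continue
        if kept and kept[-1].endswith(piece):    # redundant tail window
            continue
        kept.append(piece)
    return kept

demo = (
    "Pods are the smallest deployable units. Edit this page"
    "\n\n=== DOC CHUNK ===\n\n"
    "Edit this page"
    "\n\n=== DOC CHUNK ===\n\n"
    "Kubernetes is an open source container orchestration engine."
)
pieces = split_scraped_chunks(demo)
print(pieces)
```

Indexing these pieces instead of the re-chunked text would keep the marker out of the contexts passed to the reader, at the cost of the shorter 220-word windows chosen during scraping.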

# Results
*   **Retrieval with TF-IDF: I used TF-IDF (a classic concept covered in the course) to find the most relevant passages in the Kubernetes corpus.**

*   **Reader with BERT: I applied a pre-trained BERT (SQuAD2) as an extractive QA reader to pull the answer out of the retrieved passage.**

*   **Working pipeline: of the three tests shown above, "What is a Pod?" and "What is Kubernetes?" returned short, correct answers, while "What is a Cluster?" returned `[CLS]`, the model's no-answer prediction.**

*   **Practical application of the TF-IDF and BERT techniques taught in class to build a question answering system for Kubernetes.**