## Complete notebook: PDF → Chunks → FAISS Embeddings → Offline RAG

### 📌 1. Install Dependencies
(Paste into a separate cell)

In [None]:
# Install the required dependencies
%pip install pdfplumber faiss-cpu sentence-transformers transformers torch
# %pip install --upgrade ipywidgets jupyter

Note: you may need to restart the kernel to use updated packages.


### 📌 2. Imports

In [1]:
# Import the required libraries
import pdfplumber
import json
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

### 📌 3. Function to Read a PDF

In [2]:
# Extract the text of a PDF with pdfplumber
def ler_pdf(caminho_pdf):
    texto = ""
    with pdfplumber.open(caminho_pdf) as pdf:
        for pagina in pdf.pages:
            # extract_text() may return None (depending on the pdfplumber
            # version) for pages without extractable text
            texto += (pagina.extract_text() or "") + "\n"
    return texto

print("Functions loaded.")

Functions loaded.


### 📌 4. Split the text into chunks
The chunk size (counted in words) is adjustable so that the retrieved text does not overflow the model's context window.

In [3]:
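# Split the text into overlapping chunks (sizes are counted in words)
def criar_chunks(texto, tamanho=400, sobreposicao=50):
    if not 0 <= sobreposicao < tamanho:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("sobreposicao must satisfy 0 <= sobreposicao < tamanho")
    palavras = texto.split()
    chunks = []

    i = 0
    while i < len(palavras):
        chunk = palavras[i:i + tamanho]
        chunks.append(" ".join(chunk))
        i += tamanho - sobreposicao

    return chunks

print("Chunk functions ready.")

Chunk functions ready.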
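
To make the sliding window concrete, here is a small worked example on a ten-word toy text. It restates the chunking logic in compact form so the block runs on its own; `texto_demo` and the tiny `tamanho`/`sobreposicao` values are illustrative assumptions, not settings used elsewhere in the notebook.

```python
def criar_chunks(texto, tamanho=400, sobreposicao=50):
    # Compact restatement of the sliding window: advance by tamanho - sobreposicao words
    palavras = texto.split()
    passo = tamanho - sobreposicao
    return [" ".join(palavras[i:i + tamanho]) for i in range(0, len(palavras), passo)]

# Toy text: w0 w1 ... w9 (hypothetical data, for illustration only)
texto_demo = " ".join(f"w{n}" for n in range(10))

for c in criar_chunks(texto_demo, tamanho=4, sobreposicao=1):
    print(c)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

Each chunk starts with the last word of the previous one. Note the final chunk: it contains only `w9`, which the third chunk already covers, so a short redundant tail chunk can appear whenever the window start lands inside the last full chunk.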


### 📌 5. Generate Embeddings with an offline model

In [4]:
# Load the embedding model (downloaded on first use, then served from the local cache)
modelo_emb = SentenceTransformer("all-MiniLM-L6-v2")

def gerar_embeddings(lista_textos):
    # Returns a NumPy array with one embedding per input text
    return modelo_emb.encode(lista_textos)
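
The first call to `SentenceTransformer("all-MiniLM-L6-v2")` fetches the weights from the Hugging Face Hub and stores them in the local cache. One way to force later runs to stay offline is to set the Hub's offline environment variables before the libraries are imported. This is a minimal sketch and assumes the model files are already in the cache from an earlier run.

```python
import os

# Set these BEFORE importing sentence_transformers / transformers,
# e.g. in the very first cell of the notebook.
os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: do not make network requests
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use locally cached files only
```

With these set, loading a model that is not in the cache fails with an error instead of attempting a download, which makes a missing file easy to spot.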

### 📌 6. Create and save the FAISS index

In [6]:
# Build a flat FAISS index that ranks vectors by L2 (Euclidean) distance
def criar_faiss(embeddings, caminho_index="faiss_index.bin"):
    dim = embeddings.shape[1]  # embedding dimensionality
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    faiss.write_index(index, caminho_index)
    print("Index saved to:", caminho_index)
    return index

### 📌 7. Full pipeline: PDF → Chunks → Embeddings → FAISS

In [12]:
# Path to the PDF
caminho_pdf = "Little-Red-Riding-Hood.pdf"  # file in the SAME folder as the .ipynb  # <-- change this

texto = ler_pdf(caminho_pdf)
chunks = criar_chunks(texto)

# Save the chunks for offline use
with open("chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks, f, ensure_ascii=False, indent=4)

print(f"{len(chunks)} chunks created and saved.")

# Generate embeddings (FAISS expects float32)
embeddings = gerar_embeddings(chunks)
embeddings = np.array(embeddings).astype("float32")

# Build the FAISS index
index = criar_faiss(embeddings)

1 chunks created and saved.
Index saved to: faiss_index.bin
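
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The NumPy sketch below reproduces that computation on three hand-made 2-D vectors, to show what the index returns for a query. The helper `buscar_l2` and the toy data are hypothetical, and this is not how FAISS is implemented internally.

```python
import numpy as np

def buscar_l2(base, consulta, k):
    """Return (distances, indices) of the k rows of `base` closest to `consulta`."""
    dist = ((base - consulta) ** 2).sum(axis=1)  # squared L2 distance to every row
    ordem = np.argsort(dist)[:k]                 # indices of the k smallest distances
    return dist[ordem], ordem

# Toy "embeddings" (hypothetical data)
base = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype="float32")
consulta = np.array([0.9, 0.1], dtype="float32")

dist, idx = buscar_l2(base, consulta, k=2)
print(idx)  # [1 0]
```

Row 1 is nearest (squared distance about 0.02), followed by row 0 (about 0.82). With real embeddings the only difference is the dimensionality.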


### 📌 8. Load everything offline later

In [13]:
# Reload the chunks and the FAISS index without redoing the whole pipeline
def carregar_tudo():
    with open("chunks.json", "r", encoding="utf-8") as f:
        chunks = json.load(f)
    index = faiss.read_index("faiss_index.bin")
    return chunks, index

chunks, index = carregar_tudo()
print("Chunks and index loaded!")

Chunks and index loaded!


### 📌 9. Retrieve relevant context from the PDF with FAISS

In [14]:
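# Vector search
def buscar(query, k=3):
    query_emb = modelo_emb.encode([query]).astype("float32")
    k = min(k, index.ntotal)  # never ask for more neighbours than the index holds
    dist, idx = index.search(query_emb, k)
    # FAISS pads missing results with -1, which would otherwise select chunks[-1]
    resultados = [chunks[i] for i in idx[0] if i != -1]
    return resultados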

### 📌 10. Load an offline LLM to answer (FLAN-T5)

In [10]:
# Load the language model (downloaded on first use, then read from the local cache)
modelo = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def gerar_resposta(prompt):
    entrada = tokenizer(prompt, return_tensors="pt")
    # max_length caps the length of the generated answer, in tokens
    saida = modelo.generate(**entrada, max_length=300)
    return tokenizer.decode(saida[0], skip_special_tokens=True)
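
T5-family tokenizers are normally configured for inputs of up to 512 tokens, while three 400-word chunks are considerably longer than that. The sketch below shows one possible way to trim retrieved chunks to a token budget before building the prompt, so that the question at the end of the prompt is not what gets cut. The helper `montar_contexto`, the 512-token limit, and the 64-token reserve for the prompt template are all assumptions for illustration.

```python
def montar_contexto(trechos, pergunta, tokenizer, limite_tokens=512, reserva=64):
    """Join retrieved chunks, truncating so the context fits a token budget.

    `reserva` leaves room for the prompt template; both limits are assumptions.
    """
    orcamento = limite_tokens - reserva - len(tokenizer(pergunta)["input_ids"])
    partes = []
    for trecho in trechos:
        if orcamento <= 0:
            break  # budget exhausted: drop the remaining chunks
        ids = tokenizer(trecho, add_special_tokens=False)["input_ids"]
        # Keep only as many tokens of this chunk as the budget allows
        partes.append(tokenizer.decode(ids[:orcamento], skip_special_tokens=True))
        orcamento -= len(ids)
    return "\n\n".join(partes)
```

A call such as `montar_contexto(buscar(pergunta), pergunta, tokenizer)` could then stand in for the plain `"\n\n".join(...)` when assembling the prompt.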

### 📌 11. Ask questions about the PDF (RAG)

In [17]:
# Final interface of the offline RAG system
def perguntar(pergunta):
    contexto = "\n\n".join(buscar(pergunta))
    prompt = f"""
You are an assistant. Use only the context below to answer:

Context:
{contexto}

Question: {pergunta}
Answer:
"""
    return gerar_resposta(prompt)

# Test
# pergunta = "What is the summary of the PDF?"
pergunta = "Who are the characters?"
print(perguntar(pergunta))

a wolf, a hunter, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf, a wolf
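
The answer above degenerates into repeating "a wolf". `generate` accepts decoding options such as `no_repeat_ngram_size` and `repetition_penalty` that discourage this kind of loop. The sketch below passes them through a hypothetical helper, `gerar_resposta_sem_repeticao`. The values are illustrative and untuned, and a model as small as `flan-t5-small` may still give weak answers.

```python
# Decoding options passed to `generate`; the values are illustrative, not tuned
OPCOES_GERACAO = {
    "max_new_tokens": 64,       # cap on the number of generated tokens
    "num_beams": 4,             # beam search instead of greedy decoding
    "no_repeat_ngram_size": 3,  # never emit the same 3-gram twice
    "repetition_penalty": 1.2,  # down-weight tokens that were already generated
    "early_stopping": True,     # stop once enough finished beams exist
}

def gerar_resposta_sem_repeticao(prompt, modelo, tokenizer):
    # `modelo` and `tokenizer` are assumed to be the FLAN-T5 objects loaded in section 10
    entrada = tokenizer(prompt, return_tensors="pt")
    saida = modelo.generate(**entrada, **OPCOES_GERACAO)
    return tokenizer.decode(saida[0], skip_special_tokens=True)
```

Blocking repeated 3-grams directly targets the "a wolf, a wolf, ..." pattern, since that sequence reuses the same short token run over and over.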
