
# Querying a Financial PDF with OpenAI

1) *Prompt-only* with **fitz** (PyMuPDF)
2) *Lightweight RAG* (pypdf + embeddings)



## 0) Prerequisites
Install the dependencies (once):
```bash
pip install openai pymupdf pypdf numpy pandas
```
Set your API key:
```bash
export OPENAI_API_KEY="sk-..."
```
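
Before calling the API it can help to confirm that the variable is visible to the Python process. The sketch below is one way to do that without echoing any part of the key; `key_status` is a hypothetical helper name, not part of the OpenAI SDK.

```python
import os

def key_status(env=os.environ, name="OPENAI_API_KEY"):
    """Return a short status string that never includes the key itself."""
    value = env.get(name, "")
    if not value:
        return f"{name} is not set"
    return f"{name} is set ({len(value)} characters)"

print(key_status())
```

Reporting only the length keeps the secret out of notebook outputs that may later be shared or committed.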


In [None]:
%pip install openai pymupdf pypdf numpy pandas

In [None]:

# Shared imports
import os, re, json
import numpy as np
import pandas as pd
import fitz            # PyMuPDF
from pypdf import PdfReader
from openai import OpenAI

# Check that the API key is set, without printing any part of it
api_key = os.getenv("OPENAI_API_KEY", "")
print("OPENAI_API_KEY set:", bool(api_key))



In [None]:

client = OpenAI(api_key=api_key)
pd.set_option("display.max_colwidth", 160)





## A) Prompt-only with **fitz**
> Best suited to **a short document** or **a specific section**: extract a page range and inject it **as-is** into the prompt.



### A.1: Choose the PDF and the page range
- Provide your PDF file (e.g. `rapport.pdf`).
- Default here: the example `teslafinancialreport.pdf`.
- Set `PAGE_START` and `PAGE_END` (1-based, inclusive) and the character limit `MAX_CHARS`.
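
Because the page range is 1-based and inclusive, an end page beyond the last page of the document needs to be handled before the lookup. A minimal sketch of that clamping, assuming the page count is already known; `clamp_page_range` is a hypothetical helper introduced only for illustration.

```python
def clamp_page_range(start, end, n_pages):
    """Clamp a 1-based inclusive page range to [1, n_pages]."""
    if n_pages < 1:
        raise ValueError("the document has no pages")
    start = max(1, int(start))
    end = min(int(end), n_pages)
    if start > end:
        raise ValueError(f"empty page range: {start}-{end}")
    return start, end

# A 4-page document with a requested range of 1-6 is reduced to 1-4.
print(clamp_page_range(1, 6, 4))
```

In the notebook, `n_pages` would be `doc.page_count` once the PDF has been opened with fitz.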


In [None]:

PDF_PATH = "data/teslafinancialreport.pdf"   # ⬅️ replace with your PDF
PAGE_START = 1
PAGE_END   = 6
MAX_CHARS  = 12000

doc = fitz.open(PDF_PATH)
pages = []
for i in range(doc.page_count):
    page = doc.load_page(i)
    txt = page.get_text("text")
    txt = re.sub(r"[ \t]+", " ", txt)
    txt = re.sub(r"\s*\n\s*", "\n", txt).strip()
    pages.append({"page": i+1, "text": txt})

df_fitx = pd.DataFrame(pages)
print("Pages read (fitz):", len(df_fitx))
df_fitx.head(3)



### A.2: Build the context and ask the question
The model is instructed to answer **only** from the supplied context and to **cite pages** as `(p.X)`.
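
The instruction to cite pages is only a prompt, so it is worth checking the answer afterwards. The sketch below extracts `(p.X)` citations with a regular expression and flags any page outside the injected range. It assumes single-page citations in exactly that form; `cited_pages` and `pages_outside_context` are hypothetical helper names.

```python
import re

# Matches "(p.3)" and "(p. 3)"; ranges such as "(p.2-3)" are not handled here.
CITATION_RE = re.compile(r"\(p\.\s*(\d+)\)")

def cited_pages(answer):
    """Return the sorted, de-duplicated page numbers cited in the answer."""
    return sorted({int(n) for n in CITATION_RE.findall(answer)})

def pages_outside_context(answer, page_start, page_end):
    """Return cited pages that were not part of the injected page range."""
    return [p for p in cited_pages(answer) if not page_start <= p <= page_end]

sample = "Revenue grew (p.2). Margins fell (p. 5). Capex rose (p.9)."
print(cited_pages(sample))
print(pages_outside_context(sample, 1, 6))
```

A non-empty result from `pages_outside_context` suggests the model cited a page it was never shown, which is a useful signal to review the answer by hand.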


In [None]:

blocks = []
# Clamp the range so that a PAGE_END beyond the last page does not raise an IndexError
for p in range(max(1, PAGE_START), min(PAGE_END, len(df_fitx)) + 1):
    row = df_fitx[df_fitx["page"] == p].iloc[0]
    blocks.append(f"[PAGE {p}]\n{row['text']}")
# The slice is a hard cut: the last page kept may be truncated mid-sentence
CONTEXT_PO = "\n\n---\n".join(blocks)[:MAX_CHARS]

QUESTION_PO = "Give a clear summary of the financial and operational highlights, citing pages (p.X)."

SYSTEM_PO = (
    "You are a financial analyst. Answer only from the CONTEXT provided. "
    "Always cite pages (p.X). "
    "If the information is missing, answer: 'Not found in the provided context'."
)

USER_PO = f"""QUESTION:
{QUESTION_PO}

CONTEXT (pages {PAGE_START}–{PAGE_END}):
{CONTEXT_PO}
"""

resp_po = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role":"system","content":SYSTEM_PO},
              {"role":"user","content":USER_PO}],
    temperature=0.2,
)
answer_po = resp_po.choices[0].message.content
print(answer_po)



### A.3: Question variants
- *What major risks are mentioned, and what is their financial impact?*
- *How are revenue and margins trending, and what explanations are given?*
- *What is the outlook (guidance, capex, drivers)?*
- *Which governance/compliance items are mentioned?*



## B) Lightweight RAG (pypdf + OpenAI embeddings)
> For **long documents**: split the text into *chunks*, embed them, retrieve the best-matching excerpts, and produce a **sourced** answer.
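
The retrieval step can be tried offline before spending any API calls. The toy sketch below uses word-count vectors as a stand-in for real embeddings (an assumption made purely for illustration) and ranks three invented chunks against a question by cosine similarity.

```python
import numpy as np

def bow_vectors(texts, vocab):
    """Word-count vectors over a fixed vocabulary (a stand-in for embeddings)."""
    return np.array(
        [[t.lower().split().count(w) for w in vocab] for t in texts],
        dtype=np.float32,
    )

# Invented example chunks, not taken from any real report
chunks = [
    "supply chain risks may reduce margins",
    "revenue increased on higher deliveries",
    "currency risks affect reported revenue",
]
question = "which risks are mentioned"
vocab = sorted({w for t in chunks + [question] for w in t.lower().split()})

mat = bow_vectors(chunks, vocab)
q = bow_vectors([question], vocab)[0]

# Cosine similarity between the question and every chunk
scores = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-8)
top = np.argsort(-scores)[:2]   # indices of the two best-matching chunks
for i in top:
    print(round(float(scores[i]), 3), chunks[i])
```

Real embeddings capture meaning rather than exact word overlap, but the ranking mechanics are the same as in the cells that follow.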



### B.1: Load and normalize the PDF (pypdf)


In [None]:

reader = PdfReader(PDF_PATH)
pages = []
for i, page in enumerate(reader.pages, start=1):
    txt = page.extract_text() or ""
    txt = re.sub(r"[ \t]+", " ", txt)
    txt = re.sub(r"\s*\n\s*", "\n", txt).strip()
    pages.append({"page": i, "text": txt})

df_pages = pd.DataFrame(pages)
print("Pages read (pypdf):", len(df_pages))
df_pages.head(3)



### B.2: Split into *chunks* (window + overlap)
- Target size: **1500** characters; overlap: **400** characters, so each window starts 1100 characters after the previous one. Chunks never span two pages.
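
To see what these two numbers produce, the sketch below computes only the `(start, end)` offsets of the sliding window for a page of a given length. `chunk_spans` is a hypothetical helper that mirrors the loop used in this notebook, with an added check that the overlap is smaller than the chunk size.

```python
def chunk_spans(n_chars, size=1500, overlap=400):
    """Return (start, end) character offsets for a sliding window over n_chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap          # distance between two window starts
    spans = []
    start = 0
    while start < n_chars:
        spans.append((start, min(start + size, n_chars)))
        start += step
    return spans

print(chunk_spans(3000))
# → [(0, 1500), (1100, 2600), (2200, 3000)]
```

A 3000-character page therefore yields three chunks, and each pair of neighbours shares 400 characters.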


In [None]:

CHUNK_SIZE = 1500
CHUNK_OVERLAP = 400

chunks = []
for _, row in df_pages.iterrows():
    p = int(row["page"]); t = row["text"]
    if not t:
        continue
    s = 0
    while s < len(t):
        e = s + CHUNK_SIZE
        chunks.append({"page": p, "start": s, "end": min(e, len(t)), "text": t[s:e]})
        s += CHUNK_SIZE - CHUNK_OVERLAP

df_chunks = pd.DataFrame(chunks)
print("Chunks created:", len(df_chunks))
df_chunks.sample(min(2, len(df_chunks)))   # sample() fails if asked for more rows than exist



### B.3: Embeddings and in-memory index
Recommended model: `text-embedding-3-small`.
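
A long report can produce more chunks than is comfortable to send in a single request, and the API is understood to cap the inputs per call. The sketch below sends the texts in batches. `embed_in_batches`, its `embed_fn` parameter and the batch size of 256 are assumptions made for illustration; a fake embedding function stands in for the API so that the example runs offline.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=256):
    """Embed texts batch by batch and stack the vectors into one matrix.

    embed_fn takes a list of strings and returns one vector per string.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rows = []
    for i in range(0, len(texts), batch_size):
        rows.extend(embed_fn(texts[i:i + batch_size]))
    return np.array(rows, dtype=np.float32)

# Fake embedding function, used only to make the example self-contained
def fake_embed(batch):
    return [[float(len(t)), 1.0] for t in batch]

demo = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo.shape)
```

With the real client, `embed_fn` could be `lambda batch: [e.embedding for e in client.embeddings.create(model=EMBED_MODEL, input=batch).data]`.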


In [None]:

EMBED_MODEL = "text-embedding-3-small"

texts = df_chunks["text"].tolist()
emb = client.embeddings.create(model=EMBED_MODEL, input=texts)
MAT = np.array([e.embedding for e in emb.data], dtype=np.float32)
print("Embedding matrix:", MAT.shape)



### B.4: Question → cosine similarity → top-k excerpts


In [None]:

def cosine_scores(mat: np.ndarray, q: np.ndarray) -> np.ndarray:
    denom = (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-8)
    return (mat @ q) / denom

QUESTION = "What are the main risks mentioned, and what is their financial impact?"

q_vec = np.array(client.embeddings.create(model=EMBED_MODEL, input=[QUESTION]).data[0].embedding, dtype=np.float32)
scores = cosine_scores(MAT, q_vec)

TOP_K = 6
idx = np.argsort(-scores)[:TOP_K]
df_top = df_chunks.iloc[idx].copy()
df_top["score"] = scores[idx]
df_top = df_top.sort_values("score", ascending=False).reset_index(drop=True)
df_top[["page","score","text"]].head(TOP_K)



### B.5: Build the **sourced** answer (cite the pages)
- Answer **only** from the **CONTEXT**.
- Always cite pages as `(p.X)`.
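
When several retrieved chunks come from the same page, grouping them under one `[PAGE n]` header may make page citations easier for the model to get right. The sketch below is one possible variant and not the approach used in the next cell; `build_grouped_context` is a hypothetical helper that takes `(page, text)` pairs.

```python
def build_grouped_context(rows):
    """Build a context string with one [PAGE n] block per page, in page order."""
    by_page = {}
    for page, text in rows:
        by_page.setdefault(int(page), []).append(text)
    blocks = []
    for page in sorted(by_page):
        # "[...]" marks a gap between two excerpts taken from the same page
        body = "\n[...]\n".join(by_page[page])
        blocks.append(f"[PAGE {page}]\n{body}")
    return "\n\n---\n".join(blocks)

print(build_grouped_context([(7, "b"), (3, "a"), (7, "c")]))
```

In the notebook this could be called as `build_grouped_context(zip(df_top["page"], df_top["text"]))`. The trade-off is that excerpts are no longer ordered by similarity score.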


In [None]:

blocks = []
for _, r in df_top.iterrows():
    blocks.append(f"[PAGE {int(r['page'])}]\n{r['text']}")
CONTEXT = "\n\n---\n".join(blocks)

SYSTEM = (
    "You are a financial analyst. Answer in English, concisely and with sources. "
    "Use ONLY the CONTEXT provided. Always cite pages (p.X). "
    "If the information is not in the context, answer: 'Not found in the provided context'."
)

USER = f"""QUESTION:
{QUESTION}

CONTEXT (excerpts from the PDF):
{CONTEXT}
"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role":"system","content":SYSTEM},
              {"role":"user","content":USER}],
    temperature=0.2,
)
answer = resp.choices[0].message.content
print(answer)



## C) Question bank and tips
**Generic questions:** margins, FCF, liquidity, risks, debt, capex, guidance, governance.  
**Tips:** long documents → RAG; targeted sections → prompt-only. Always **cite pages**.
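
The rule of thumb above can be turned into a simple check on the extracted text. The sketch below is a rough heuristic: it assumes the 12000-character budget used for `MAX_CHARS` earlier, and `choose_approach` is a hypothetical helper name.

```python
def choose_approach(page_texts, max_chars=12000):
    """Suggest 'prompt-only' if all pages fit within the budget, else 'rag'."""
    total = sum(len(t) for t in page_texts)
    return "prompt-only" if total <= max_chars else "rag"

print(choose_approach(["x" * 5000, "y" * 5000]))   # fits within the budget
print(choose_approach(["x" * 9000, "y" * 9000]))   # exceeds the budget
```

A character count is only a proxy for the model's token limit, so the threshold should be treated as adjustable.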


In [None]:
# ============================================================
# Gradio: 3 approaches (OpenAI) using relative "data/..." paths
# 1) Prompt-only (fitz): page range injected as-is
# 2) Lightweight RAG (PDF in data/...): chunks + embeddings + top-k
# 3) Lightweight RAG (uploaded PDF)
#
# Prerequisites:
#   pip install gradio openai pymupdf pypdf numpy pandas
#   export OPENAI_API_KEY="sk-..."
# ============================================================

import os, re, glob
import numpy as np
import pandas as pd
import fitz                      # PyMuPDF
from pypdf import PdfReader
import gradio as gr
from openai import OpenAI

# === Relative-path configuration ===
DATA_DIR    = "data"   # local root (e.g. data/tesla/...)
DEFAULT_PDF = "data/teslafinancialreport.pdf"  # ⬅️ default PDF
EMBED_MODEL = "text-embedding-3-small"
CHAT_MODEL  = "gpt-4o-mini"

# === OpenAI client (requires OPENAI_API_KEY) ===
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ========== Shared helpers (no try/except) ====================================
def extract_pdf_pages_fitz(path: str) -> pd.DataFrame:
    doc = fitz.open(path)
    rows = []
    for i in range(doc.page_count):
        page = doc.load_page(i)
        txt  = page.get_text("text")
        txt  = re.sub(r"[ \t]+", " ", txt)
        txt  = re.sub(r"\s*\n\s*", "\n", txt).strip()
        rows.append({"page": i+1, "text": txt})
    return pd.DataFrame(rows)

def extract_pdf_pages_pypdf(path: str) -> pd.DataFrame:
    reader = PdfReader(path)
    rows = []
    for i, page in enumerate(reader.pages, start=1):
        txt = page.extract_text() or ""
        txt = re.sub(r"[ \t]+", " ", txt)
        txt = re.sub(r"\s*\n\s*", "\n", txt).strip()
        rows.append({"page": i, "text": txt})
    return pd.DataFrame(rows)

def build_chunks(df_pages: pd.DataFrame, chunk_size: int = 1500, overlap: int = 400) -> pd.DataFrame:
    rows = []
    for _, r in df_pages.iterrows():
        p = int(r["page"]); t = r["text"]
        if not t: 
            continue
        s = 0; step = max(1, chunk_size - overlap)
        while s < len(t):
            e = s + chunk_size
            rows.append({"page": p, "start": s, "end": min(e, len(t)), "text": t[s:e]})
            s += step
    return pd.DataFrame(rows)

def embed_texts(texts: list[str]) -> np.ndarray:
    out = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([e.embedding for e in out.data], dtype=np.float32)

def embed_query(q: str) -> np.ndarray:
    v = client.embeddings.create(model=EMBED_MODEL, input=[q]).data[0].embedding
    return np.array(v, dtype=np.float32)

def cosine_scores(mat: np.ndarray, qvec: np.ndarray) -> np.ndarray:
    denom = (np.linalg.norm(mat, axis=1) * np.linalg.norm(qvec) + 1e-8)
    return (mat @ qvec) / denom

def build_context_from_rows(df_rows: pd.DataFrame, max_chars: int | None = None) -> str:
    ctx = "\n\n---\n".join([f"[PAGE {int(r['page'])}]\n{r['text']}" for _, r in df_rows.iterrows()])
    # Cast to int: a UI slider may deliver the limit as a float, which cannot be used in a slice
    return ctx[:int(max_chars)] if max_chars else ctx

def answer_from_context(question: str, context: str, model: str = CHAT_MODEL, temperature: float = 0.2) -> str:
    SYSTEM = (
        "You are a financial analyst. Answer in English, concisely and with sources. "
        "Use ONLY the CONTEXT provided. Always cite pages (p.X). "
        "If the information is not in the context, answer: 'Not found in the provided context'."
    )
    USER = f"QUESTION:\n{question}\n\nCONTEXT:\n{context}"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role":"system","content":SYSTEM},
                  {"role":"user",  "content":USER}],
        temperature=float(temperature),
    )
    return resp.choices[0].message.content

def list_pdfs_recursive(root: str = DATA_DIR) -> list[str]:
    # Returns RELATIVE paths: "data/xxx/rapport.pdf" or "data/rapport.pdf"
    if not os.path.isdir(root):
        return []
    found = sorted(glob.glob(os.path.join(root, "**", "*.pdf"), recursive=True))
    return [os.path.relpath(p, start=".") for p in found]

# ========== Handlers ==========================================================
# 1) PROMPT-ONLY (fitz)
def handle_prompt_only(pdf_rel_path: str, page_start: int, page_end: int, max_chars: int,
                       question: str, model: str, temperature: float):
    df = extract_pdf_pages_fitz(pdf_rel_path)
    page_start = max(1, int(page_start))
    page_end   = min(int(page_end), int(df["page"].max()))
    rows = df[(df["page"] >= page_start) & (df["page"] <= page_end)].copy()
    context = build_context_from_rows(rows, max_chars=max_chars)
    answer  = answer_from_context(question, context, model=model, temperature=temperature)
    pages_used = sorted(rows["page"].unique().tolist())
    return answer, f"Pages in the context: {pages_used}", context

# 2) LIGHTWEIGHT RAG: PDF in data/
def handle_rag_from_data(pdf_rel_path: str, question: str,
                         chunk_size: int, overlap: int, top_k: int,
                         embed_model: str, chat_model: str, temperature: float):
    global EMBED_MODEL, CHAT_MODEL
    EMBED_MODEL = embed_model
    CHAT_MODEL  = chat_model

    df_pages  = extract_pdf_pages_pypdf(pdf_rel_path)
    df_chunks = build_chunks(df_pages, chunk_size=int(chunk_size), overlap=int(overlap))
    if df_chunks.empty:
        # No extractable text (e.g. a scanned PDF): nothing to embed
        return "No extractable text found in this PDF.", "Pages retained: []", ""

    M = embed_texts(df_chunks["text"].tolist())
    q = embed_query(question)
    scores = cosine_scores(M, q)
    idx    = np.argsort(-scores)[:int(top_k)]
    df_top = df_chunks.iloc[idx].copy()
    df_top["score"] = scores[idx]
    df_top = df_top.sort_values("score", ascending=False).reset_index(drop=True)

    context = build_context_from_rows(df_top, max_chars=None)
    answer  = answer_from_context(question, context, model=CHAT_MODEL, temperature=temperature)
    pages_used = sorted(df_top["page"].unique().tolist())
    return answer, f"Pages retained (top-{top_k}): {pages_used}", context

# 3) LIGHTWEIGHT RAG: uploaded PDF
def handle_rag_upload(file_obj, question: str,
                      chunk_size: int, overlap: int, top_k: int,
                      embed_model: str, chat_model: str, temperature: float):
    global EMBED_MODEL, CHAT_MODEL
    EMBED_MODEL = embed_model
    CHAT_MODEL  = chat_model

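    if file_obj is None:
        return "Please upload a PDF first.", "Pages retained: []", ""
    # Depending on the Gradio version, gr.File passes either a file path (str)
    # or a temp-file object exposing the path as .name; handle both
    path = getattr(file_obj, "name", file_obj)
    df_pages  = extract_pdf_pages_pypdf(path)
    df_chunks = build_chunks(df_pages, chunk_size=int(chunk_size), overlap=int(overlap))
    if df_chunks.empty:
        # No extractable text (e.g. a scanned PDF): nothing to embed
        return "No extractable text found in this PDF.", "Pages retained: []", ""

    M = embed_texts(df_chunks["text"].tolist())
    q = embed_query(question)
    scores = cosine_scores(M, q)
    idx    = np.argsort(-scores)[:int(top_k)]
    df_top = df_chunks.iloc[idx].copy()
    df_top["score"] = scores[idx]
    df_top = df_top.sort_values("score", ascending=False).reset_index(drop=True)

    context = build_context_from_rows(df_top, max_chars=None)
    answer  = answer_from_context(question, context, model=CHAT_MODEL, temperature=temperature)
    pages_used = sorted(df_top["page"].unique().tolist())
    return answer, f"Pages retained (top-{top_k}): {pages_used}", context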

# ========== Build the UI ======================================================
pdf_options = list_pdfs_recursive(DATA_DIR)
# Fall back to None (no selection) when data/ contains no PDF, so the value is never outside the choices
default_choice = DEFAULT_PDF if DEFAULT_PDF in pdf_options else (pdf_options[0] if pdf_options else None)

with gr.Blocks(fill_height=True) as demo:
    gr.Markdown("# Financial PDF chat: 3 approaches (OpenAI)")

    # --- 1) Prompt-only (fitz) ---
    with gr.Tab("1) Prompt-only (fitz)"):
        gr.Markdown("**Best suited to a specific section** (no embeddings).")
        with gr.Row():
            po_pdf   = gr.Textbox(label="PDF path (relative)", value=DEFAULT_PDF)
            po_pbeg  = gr.Slider(label="Start page", minimum=1, maximum=1000, step=1, value=1)
            po_pend  = gr.Slider(label="End page",   minimum=1, maximum=1000, step=1, value=6)
        with gr.Row():
            po_maxc  = gr.Slider(label="Max characters injected", minimum=2000, maximum=40000, step=1000, value=12000)
            po_model = gr.Dropdown(label="Chat model", choices=["gpt-4o-mini","gpt-4o"], value=CHAT_MODEL)
            po_temp  = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.2)
        po_q = gr.Textbox(label="Question", value="Give a clear summary of the financial and operational highlights, citing pages (p.X).", lines=3)
        po_btn = gr.Button("Analyze (prompt-only)")
        with gr.Row():
            po_answer = gr.Textbox(label="Answer", lines=10)
            po_meta   = gr.Textbox(label="Info (pages used)")
        po_ctx = gr.Code(label="Injected context", language="markdown")

        po_btn.click(
            handle_prompt_only,
            inputs=[po_pdf, po_pbeg, po_pend, po_maxc, po_q, po_model, po_temp],
            outputs=[po_answer, po_meta, po_ctx]
        )

    # --- 2) Lightweight RAG: PDF in data/ ---
    with gr.Tab("2) Lightweight RAG (PDF in data/)"):
        gr.Markdown("Select a PDF located in **data/** (or its subfolders).")
        with gr.Row():
            rag_pdf   = gr.Dropdown(label="PDF file", choices=pdf_options, value=default_choice)
            rag_q     = gr.Textbox(label="Question", value="What are the main risks mentioned, and what is their financial impact?", lines=3)
        with gr.Row():
            rag_chunk = gr.Slider(label="Chunk size", minimum=600, maximum=3000, step=100, value=1500)
            rag_ovlp  = gr.Slider(label="Overlap", minimum=100, maximum=800, step=50, value=400)
            rag_topk  = gr.Slider(label="top-k", minimum=2, maximum=10, step=1, value=6)
        with gr.Row():
            rag_embed = gr.Dropdown(label="Embedding model", choices=["text-embedding-3-small","text-embedding-3-large"], value=EMBED_MODEL)
            rag_chat  = gr.Dropdown(label="Chat model", choices=["gpt-4o-mini","gpt-4o"], value=CHAT_MODEL)
            rag_temp  = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.2)

        def refresh_list():
            # Re-scan data/ and keep the default PDF selected when it is present
            files = list_pdfs_recursive(DATA_DIR)
            return gr.update(choices=files, value=(DEFAULT_PDF if DEFAULT_PDF in files else (files[0] if files else None)))

        refresh_btn = gr.Button("🔄 Refresh the list")
        refresh_btn.click(fn=refresh_list, inputs=None, outputs=rag_pdf)

        rag_btn = gr.Button("Analyze (lightweight RAG, data/)")
        with gr.Row():
            rag_answer = gr.Textbox(label="Answer", lines=10)
            rag_meta   = gr.Textbox(label="Info (top-k pages)")
        rag_ctx = gr.Code(label="Context (top-k excerpts)", language="markdown")

        rag_btn.click(
            handle_rag_from_data,
            inputs=[rag_pdf, rag_q, rag_chunk, rag_ovlp, rag_topk, rag_embed, rag_chat, rag_temp],
            outputs=[rag_answer, rag_meta, rag_ctx]
        )

    # --- 3) Lightweight RAG: uploaded PDF ---
    with gr.Tab("3) Lightweight RAG (uploaded PDF)"):
        gr.Markdown("Upload your PDF and ask your question.")
        with gr.Row():
            up_file = gr.File(label="PDF")
            up_q    = gr.Textbox(label="Question", value="What are the main risks mentioned, and what is their financial impact?", lines=3)
        with gr.Row():
            up_chunk = gr.Slider(label="Chunk size", minimum=600, maximum=3000, step=100, value=1500)
            up_ovlp  = gr.Slider(label="Overlap", minimum=100, maximum=800, step=50, value=400)
            up_topk  = gr.Slider(label="top-k", minimum=2, maximum=10, step=1, value=6)
        with gr.Row():
            up_embed = gr.Dropdown(label="Embedding model", choices=["text-embedding-3-small","text-embedding-3-large"], value=EMBED_MODEL)
            up_chat  = gr.Dropdown(label="Chat model", choices=["gpt-4o-mini","gpt-4o"], value=CHAT_MODEL)
            up_temp  = gr.Slider(label="Temperature", minimum=0.0, maximum=1.0, step=0.1, value=0.2)

        up_btn = gr.Button("Analyze (lightweight RAG, upload)")
        with gr.Row():
            up_answer = gr.Textbox(label="Answer", lines=10)
            up_meta   = gr.Textbox(label="Info (top-k pages)")
        up_ctx = gr.Code(label="Context (top-k excerpts)", language="markdown")

        up_btn.click(
            handle_rag_upload,
            inputs=[up_file, up_q, up_chunk, up_ovlp, up_topk, up_embed, up_chat, up_temp],
            outputs=[up_answer, up_meta, up_ctx]
        )

# Launch the local interface
demo.launch()
