Purpose: usage examples for my RAG pipeline on ESRS reports

In [1]:
import pandas as pd
skipped_reports = pd.read_csv("skipped_reports.csv")

In [40]:
skipped_reports['link'][20]

'https://www.fluidra.com/pdf-viewer.php?file=https://www.fluidra.com/wp-content/uploads/2025/03/3.-Integrated-Report_ENG_vFinal-ENG.pdf'

In [41]:
skipped_reports['report_id'][20]

'Fluidra_2024'

In [1]:
# Load metadata of the reports to be analyzed
import pandas as pd
esrs_reports = pd.read_excel("./data_preparation/esrs_reports.xlsx")

In [2]:
keep_ids = [
    "Acerinox_2024", "Commerzbank_2024", "Iberdrola_2024", "Philips_2024", "Vivendi_2024",
    "AeroportsdeParis_2024", "CréditAgricole_2024", "Merck_2024", "Thales_2024",
    "Cenergy_2024", "EssilorLuxottica_2024", "Mowi_2024", "UniCredit_2024"
]

filtered_reports = esrs_reports[esrs_reports["report_id"].isin(keep_ids)]
print(len(filtered_reports))
filtered_reports.head()

13


Unnamed: 0,company,isin,country,publication_date,auditor,link,SASB_industry,report_id
44,Philips,NL0000009538,Netherlands,2025-02-21,EY,https://www.results.philips.com/publications/a...,Electric Utilities & Power Generators,Philips_2024
50,UniCredit,IT0005239360,Italy,2025-02-26,KPMG,https://www.unicreditgroup.eu/content/dam/unic...,Commercial Banks,UniCredit_2024
73,Acerinox,ES0132105018,Spain,2025-02-28,PwC,https://www.acerinox.com/export/sites/acerinox...,Iron & Steel Producers,Acerinox_2024
89,Merck,DE0006599905,Germany,2025-03-06,Deloitte,https://www.merckgroup.com/en/annualreport/202...,Biotechnology & Pharmaceuticals,Merck_2024
110,Iberdrola,ES0144580Y14,Spain,2025-02-28,KPMG,https://www.iberdrola.com/documents/20125/4238...,Electric Utilities & Power Generators,Iberdrola_2024
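
Because `isin` silently drops IDs that have no match, the count of 13 is worth confirming against `keep_ids` explicitly. The following is a minimal sketch, assuming a hypothetical helper `check_ids` that is not part of the pipeline. It also normalizes Unicode, since an accented ID such as "CréditAgricole_2024" may be stored in composed or decomposed form:

```python
import unicodedata
from collections import Counter


def _nfc(value):
    """Normalize a value to an NFC string so accented IDs compare equal."""
    return unicodedata.normalize("NFC", str(value))


def check_ids(expected, found):
    """Return (missing, duplicated) IDs after NFC-normalizing both inputs.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    found_counts = Counter(_nfc(rid) for rid in found)
    missing = [rid for rid in expected if _nfc(rid) not in found_counts]
    duplicated = sorted(rid for rid, n in found_counts.items() if n > 1)
    return missing, duplicated


# Stand-in data; in the notebook the call would presumably be
# check_ids(keep_ids, filtered_reports["report_id"])
missing, duplicated = check_ids(["A_2024", "B_2024"], ["A_2024", "A_2024"])
print("missing:", missing, "duplicated:", duplicated)
```

Two empty lists would mean that every requested report was found exactly once.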


In [3]:
import os
import re
import json
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv
from huggingface_hub import login
from rag_system import RAGSystem



# ----------------------------
# 0) Environment & Auth
# ----------------------------
os.environ["TOKENIZERS_PARALLELISM"] = "false"
dotenv_path = os.path.expanduser("~/thesis/esg_extraction/.env")
load_dotenv(dotenv_path=dotenv_path)
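HF_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
# Fail early with a clear message instead of falling back to an interactive login prompt
if not HF_TOKEN:
    raise RuntimeError(f"HUGGINGFACE_TOKEN not found; check {dotenv_path}")
login(token=HF_TOKEN)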


# ----------------------------
# 1) Config
# ----------------------------
DB_ROOT = Path("./faiss_dbs")
DB_ROOT.mkdir(parents=True, exist_ok=True)
ESRS_METADATA_PATH = 'EsrsMetadata.xlsx'
REPORTS_CSV_PATH = './data_preparation/esrs_reports.csv'  # not used below; the metadata was loaded from esrs_reports.xlsx above

# JSON Lines, one record per report: {report_id, company, row_index, result},
# where result is { query_id: {verdict, analysis, sources} }
RESULTS_PATH = "all_results.jsonl"
SKIPPED_PATH = "skipped_reports.csv"  # keeps track of reports that failed

# Ensure skipped_reports.csv has a header if file doesn't exist
if not os.path.exists(SKIPPED_PATH):
    pd.DataFrame(columns=esrs_reports.columns).to_csv(SKIPPED_PATH, index=False)

# ----------------------------
# 2) Instantiate your system
# ----------------------------
rag = RAGSystem(ESRS_METADATA_PATH)




Initializing RAG System...
Loading embedding model: Qwen/Qwen3-Embedding-0.6B...
Loading generation model: meta-llama/Llama-3.1-8B-Instruct...



Device set to use cuda:0


RAG System initialized successfully.
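
The loop below opens `RESULTS_PATH` in append mode, so rerunning it would write a second record for every report that already succeeded. The following is a minimal sketch of a resume check, assuming a hypothetical helper `processed_ids` that is not part of `RAGSystem`. It relies only on the one-JSON-object-per-line layout of the results file:

```python
import json
import os


def processed_ids(results_path):
    """Collect the report_ids already present in a JSON Lines results file.

    Hypothetical helper for illustration; returns an empty set if the
    file does not exist yet.
    """
    done = set()
    if not os.path.exists(results_path):
        return done
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["report_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # Skip a partially written or malformed line
                continue
    return done
```

With `done = processed_ids(RESULTS_PATH)` computed before the loop, a check such as `if report_id in done: continue` would skip the reports that are already finished.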


In [4]:
for idx, row in filtered_reports.iterrows():
    url = row.get("link", None)  # not used in this run: the PDFs are read from local files via pdf_path
    company_name = row['company']
    report_id = row['report_id']
    pdf_path = f"./skipped_reports/{report_id}.pdf"
    db_path = str(DB_ROOT / report_id)
    
    print(f"\n=== Running pipeline for: {idx} - {report_id} ===")
    try:
        # process_and_analyze_report expects report_id, db_path,
        # and either pdf_url or pdf_path; a local PDF is passed here
        result = rag.process_and_analyze_report(
            report_id=report_id,
            db_path=db_path,
            pdf_path=pdf_path 
        )

        # augment result with metadata before saving
        record = {
            "report_id": report_id,
            "company": company_name,
            "row_index": int(idx),
            "result": result.get(report_id, {}),
        }

        # append result as one JSON line
        with open(RESULTS_PATH, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    except Exception as e:
        print(f"Failed on {report_id}: {e}")
        # also log failure in skipped_reports.csv
        row.to_frame().T.to_csv(SKIPPED_PATH, mode="a", header=False, index=False)


=== Running pipeline for: 44 - Philips_2024 ===
Creating new vector store at faiss_dbs/Philips_2024...
Create vectorestore at faiss_dbs/Philips_2024
Starting augmented generation for 65 Prompts...
Clearing GPU cache...
--- Finished Pipeline for: Philips_2024 ---

=== Running pipeline for: 50 - UniCredit_2024 ===
Creating new vector store at faiss_dbs/UniCredit_2024...
Create vectorestore at faiss_dbs/UniCredit_2024
Starting augmented generation for 65 Prompts...
Clearing GPU cache...
--- Finished Pipeline for: UniCredit_2024 ---

=== Running pipeline for: 73 - Acerinox_2024 ===
Creating new vector store at faiss_dbs/Acerinox_2024...
Create vectorestore at faiss_dbs/Acerinox_2024
Starting augmented generation for 65 Prompts...
Clearing GPU cache...
--- Finished Pipeline for: Acerinox_2024 ---

=== Running pipeline for: 89 - Merck_2024 ===
Creating new vector store at faiss_dbs/Merck_2024...
Create vectorestore at faiss_dbs/Merck_2024
Starting augmented generation for 65 Prompts...
Clea

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Starting augmented generation for 65 Prompts...
Clearing GPU cache...
--- Finished Pipeline for: AeroportsdeParis_2024 ---

=== Running pipeline for: 370 - Mowi_2024 ===
Creating new vector store at faiss_dbs/Mowi_2024...
Create vectorestore at faiss_dbs/Mowi_2024
Starting augmented generation for 65 Prompts...
Clearing GPU cache...
--- Finished Pipeline for: Mowi_2024 ---

=== Running pipeline for: 410 - Thales_2024 ===
Creating new vector store at faiss_dbs/Thales_2024...
Create vectorestore at faiss_dbs/Thales_2024
Starting augmented generation for 65 Prompts...
Clearing GPU cache...
--- Finished Pipeline for: Thales_2024 ---
