# SGX Annual Report NER Pipeline (Notebook Format)

> **Goal**: Extract **people**, **organizations**, and **industry sectors** from SGX annual‑report PDFs, then infer relationships (person ↔ org, org ↔ industry) and export everything to CSV — without using LLMs.
>
> This notebook reorganises the original `ner.py` script into clearly separated, runnable sections.

---

## Table of Contents

1. [Environment Setup](#1-environment-setup)
2. [Imports & Global Config](#2-imports--global-config)
3. [PDF Utilities](#3-pdf-utilities)
4. [Entity Extraction](#4-entity-extraction)
5. [Relationship Inference](#5-relationship-inference)
6. [Batch Processing Helpers](#6-batch-processing-helpers)
7. [Run the Pipeline](#7-run-the-pipeline)
8. [Combine Outputs](#8-combine-outputs)
9. [Next Steps / TODOs](#9-next-steps--todos)

---

## 1  Environment Setup

In [None]:
# 📦 One‑time installs (comment out after first run)
!pip install pdfminer.six spacy pandas -q
!python -m spacy download en_core_web_lg -q
# Recommended on Windows:
!pip install -U pip setuptools wheel
!pip install "blis==0.7.11" --only-binary :all:
!pip install "spacy==3.7.2" --prefer-binary
!python -m spacy download en_core_web_lg


  error: subprocess-exited-with-error
  
  × Building wheel for blis (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [36 lines of output]
      BLIS_COMPILER? None
      !!
      
              ********************************************************************************
              Please consider removing the following classifiers in favor of a SPDX license expression:
      
              License :: OSI Approved :: BSD License
      
              See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
              ********************************************************************************
      
      !!
        self._finalize_license_expression()
      running bdist_wheel
      running build
      running build_py
      creating build\lib.win-amd64-cpython-39\blis
      copying blis\about.py -> build\lib.win-amd64-cpython-39\blis
      copying blis\benchmark.py -> build\lib.win-amd64-cpython-39\blis
      copying 

> *We use **pdfminer.six** for text extraction and **spaCy** (`en_core_web_lg`) for classic, non‑LLM statistical NER.*

---

## 2  Imports & Global Config

In [3]:
import os, io, sys, ast
from typing import List, Dict
import pandas as pd
import spacy
from pdfminer.high_level import extract_text_to_fp
from pdfminer.pdfdocument import PDFSyntaxError

# Load spaCy model once ↓
nlp = spacy.load("en_core_web_lg")

---

## 3  PDF Utilities



### 3.1 `extract_text_from_pdf`

Extract all text while silencing pdfminer warnings.

In [5]:
from typing import Optional


def extract_text_from_pdf(pdf_path: str) -> Optional[str]:
    """Return the full text of a PDF, or None on failure."""
    out = io.StringIO()
    old_stderr = sys.stderr
    devnull = open(os.devnull, "w")
    sys.stderr = devnull          # silence anything pdfminer writes to stderr
    try:
        with open(pdf_path, 'rb') as f:
            extract_text_to_fp(f, out)
        return out.getvalue()
    except PDFSyntaxError as e:
        print(f"[PDFSyntaxError] {pdf_path}: {e}")
    except Exception as e:
        print(f"[Error] {pdf_path}: {e}")
    finally:
        # Always restore stderr first, then release the devnull handle
        sys.stderr = old_stderr
        devnull.close()
    return None
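
The manual swap of `sys.stderr` in `extract_text_from_pdf` can also be expressed with the standard library's `contextlib.redirect_stderr`, which restores the stream automatically. The sketch below is only an illustration: `run_quietly` and `noisy` are hypothetical names that do not appear in the original script. It assumes the noise is written through `sys.stderr` at call time; messages emitted by a `logging` handler configured earlier may still appear.

```python
import contextlib
import io
import sys


def run_quietly(func, *args, **kwargs):
    """Call func while capturing anything written to sys.stderr.

    Returns (result, captured_stderr_text). Hypothetical helper for illustration.
    """
    captured = io.StringIO()
    with contextlib.redirect_stderr(captured):
        result = func(*args, **kwargs)
    return result, captured.getvalue()


def noisy(x):
    """Stand-in for a library call that prints warnings to stderr."""
    print("warning: something odd", file=sys.stderr)
    return x * 2


value, stderr_text = run_quietly(noisy, 21)
```

Returning the captured text, rather than discarding it, makes it possible to log the warnings for files that later turn out to be problematic.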

---

## 4  Entity Extraction



### 4.1 `extract_entities`

Captures **PERSON**, **ORG**, quick‑n‑dirty **INDUSTRY** terms, and sentence‑level context.

In [6]:
def extract_entities(text: str) -> Dict[str, list]:
    if not text:
        return {k: [] for k in ("PERSON","ORG","INDUSTRY","ORG_CONTEXT")}

    doc = nlp(text)
    ents = {"PERSON": [], "ORG": [], "INDUSTRY": [], "ORG_CONTEXT": []}

    # Named entities
    org_spans = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            ents["PERSON"].append(ent.text)
        elif ent.label_ == "ORG":
            org_spans.append(ent); ents["ORG"].append(ent.text)

    # Simple dictionary lookup for industries (extend as needed)
    industry_terms = [
        "banking","finance","technology","real estate",
        "telecommunications","manufacturing","healthcare"
    ]
    ents["INDUSTRY"] = [t for t in industry_terms if t in text.lower()]

    # Sentence context for each organisation
    for org in org_spans:
        sent = org.sent  # spaCy's Span.sent: the sentence that contains this entity
        ents["ORG_CONTEXT"].append({'organization': org.text, 'context': sent.text})

    # Deduplicate
    for k in ("PERSON","ORG","INDUSTRY"):
        ents[k] = list(set(ents[k]))
    ents["ORG_CONTEXT"] = [dict(t) for t in {tuple(d.items()) for d in ents["ORG_CONTEXT"]}]
    return ents
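
The industry lookup in `extract_entities` uses plain substring matching, so a term such as "finance" also matches inside "refinanced". A minimal sketch of whole-word matching with the standard `re` module follows; `find_industry_terms` and the sample sentence are hypothetical, and this is one possible refinement rather than the original script's behaviour.

```python
import re
from typing import Iterable, List


def find_industry_terms(text: str, terms: Iterable[str]) -> List[str]:
    """Return the terms that occur in text as whole words (case-insensitive)."""
    found = []
    for term in terms:
        # \b anchors keep "finance" from matching inside "refinanced"
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


sample = "The Group refinanced its loans and expanded its real estate portfolio."
matches = find_industry_terms(sample, ["finance", "real estate", "banking"])
```

Here a substring test would report "finance" for the sample sentence, while the whole-word version reports only "real estate".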

---

## 5  Relationship Inference

In [7]:
def infer_org_industry(df_context: pd.DataFrame, industries: List[str]) -> List[Dict]:
    rels = []
    if df_context.empty or not industries:
        return rels
    for _, row in df_context.iterrows():
        ctx = str(row['Context']).lower()
        for ind in industries:
            if ind.lower() in ctx:
                rels.append({'Filename': row['Filename'],
                             'Organization': row['Organization'],
                             'Industry': ind})
    return [dict(t) for t in {tuple(d.items()) for d in rels}]


def infer_person_org(doc: spacy.tokens.Doc, persons: List[str], orgs: List[str]) -> List[Dict]:
    rels = []
    p_set, o_set = set(persons), set(orgs)
    for sent in doc.sents:
        found_p = [p for p in p_set if p in sent.text]
        found_o = [o for o in o_set if o in sent.text]
        for p in found_p:
            for o in found_o:
                rels.append({'Person': p, 'Organization': o})
    return [dict(t) for t in {tuple(d.items()) for d in rels}]
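
Both functions treat co-occurrence within the same sentence (or context string) as evidence of a relationship. The sketch below shows that idea on plain strings, without spaCy. `cooccurrence_pairs` and the sample sentences are hypothetical, and sentence splitting is assumed to have happened upstream.

```python
from typing import Iterable, List, Set, Tuple


def cooccurrence_pairs(
    sentences: Iterable[str], persons: Iterable[str], orgs: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return sorted unique (person, org) pairs that share a sentence."""
    persons, orgs = list(persons), list(orgs)
    pairs: Set[Tuple[str, str]] = set()
    for sent in sentences:
        people_here = [p for p in persons if p in sent]
        orgs_here = [o for o in orgs if o in sent]
        # Every person/org combination in the sentence becomes a candidate edge
        pairs.update((p, o) for p in people_here for o in orgs_here)
    return sorted(pairs)


sentences = [
    "Jane Lim is a director at Keppel Ltd.",
    "Tan Ah Kow joined SGX Group in 2023.",
    "Revenue grew during the year.",
]
pairs = cooccurrence_pairs(
    sentences, ["Jane Lim", "Tan Ah Kow"], ["Keppel Ltd", "SGX Group"]
)
```

Because every person/organisation combination in a sentence becomes an edge, long sentences that list many names can produce a large number of weak relationships.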

---

## 6  Batch Processing Helpers

In [15]:
def process_reports(pdf_dir: str, base: str = 'sgx') -> None:
    """Loop through PDFs, save four CSVs: *_entities, *_org_context,
    *_person_org_relationships, *_org_industry_relationships."""

    data_ents, data_ctx, data_p2o = [], [], []
    pdfs = [f for f in os.listdir(pdf_dir) if f.lower().endswith('.pdf')]
    if not pdfs:
        print('[!] No PDF found'); return

    for f in pdfs:
        print('→', f)
        txt = extract_text_from_pdf(os.path.join(pdf_dir, f))
        if not txt:
            continue
        doc = nlp(txt)
        ent = extract_entities(txt)
        # ---------------- Entities -----------------
        data_ents.append({
            'Filename': f,
            'Persons': ent['PERSON'],
            'Organizations': ent['ORG'],
            'Industries': ent['INDUSTRY']
        })
        # ---------------- Org‑level context (capitalised keys!) -----------------
        for c in ent['ORG_CONTEXT']:
            data_ctx.append({
                'Filename': f,
                'Organization': c['organization'],
                'Context': c['context']
            })
        # ---------------- Person ↔ Org relationships -----------------
        data_p2o.extend([
            {'Filename': f, **r} for r in infer_person_org(doc, ent['PERSON'], ent['ORG'])
        ])

    # --- Save intermediate CSVs ---
    df_ent = pd.DataFrame(data_ents)
    df_ent.to_csv(f'{base}_entities.csv', index=False)

    df_ctx = pd.DataFrame(data_ctx)
    df_ctx.to_csv(f'{base}_org_context.csv', index=False)

    df_p2o = pd.DataFrame(data_p2o)
    df_p2o.to_csv(f'{base}_person_org_relationships.csv', index=False)

    # --- Org ↔ Industry ---
    # Flatten unique industry terms
    all_inds = sorted({i for sub in df_ent['Industries'] for i in (sub if isinstance(sub, list) else [sub])})
    rel_oi = infer_org_industry(df_ctx, all_inds)
    pd.DataFrame(rel_oi).to_csv(f'{base}_org_industry_relationships.csv', index=False)

    print('✅  Processing complete.')

---

## 7  Run the Pipeline

In [16]:
PDF_DIR = 'C:\\Users\\22601\\Downloads\\finer\\data'   # ← change as needed
os.makedirs(PDF_DIR, exist_ok=True)
process_reports(PDF_DIR, base='sgx')

→ 842974_E01238.pdf
✅  Processing complete.


After execution you will find these files in the working directory:

- `sgx_entities.csv`
- `sgx_org_context.csv`
- `sgx_person_org_relationships.csv`
- `sgx_org_industry_relationships.csv`
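
In `sgx_entities.csv` the `Persons`, `Organizations` and `Industries` columns hold Python lists, which are written to CSV as their string representation. A minimal sketch of turning such a cell back into a list with `ast.literal_eval` follows. It uses the standard `csv` module and an in-memory file so that it runs on its own; `parse_list_cell` and the sample row are hypothetical.

```python
import ast
import csv
import io
from typing import List


def parse_list_cell(cell: str) -> List[str]:
    """Convert a cell like "['A', 'B']" back into a list; return [] if it is not a list."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


# Simulate one row of an entities CSV in memory
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Filename", "Persons"])
writer.writeheader()
writer.writerow({"Filename": "report.pdf", "Persons": str(["Jane Lim", "Tan Ah Kow"])})

buffer.seek(0)
rows = list(csv.DictReader(buffer))
persons = parse_list_cell(rows[0]["Persons"])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for cells read from a file.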

---

## 8  Combine Outputs

If you prefer a single edge‑list style CSV:

In [17]:
def combine_relationships(base='sgx'):
    try:
        df_oi = pd.read_csv(f'{base}_org_industry_relationships.csv')
        df_p2o = pd.read_csv(f'{base}_person_org_relationships.csv')
    except FileNotFoundError:
        print('[!] Run `process_reports` first'); return

    rows = []
    rows += [{'Filename': r.Filename, 'Entity1': r.Organization, 'Relation': 'ASSOCIATED_INDUSTRY', 'Entity2': r.Industry}
              for r in df_oi.itertuples(index=False)]
    rows += [{'Filename': r.Filename, 'Entity1': r.Person, 'Relation': 'ASSOCIATED_WITH',    'Entity2': r.Organization}
              for r in df_p2o.itertuples(index=False)]
    pd.DataFrame(rows).to_csv(f'{base}_combined_relationships.csv', index=False)
    print('🔗  Saved', f'{base}_combined_relationships.csv')

---

## 9  Next Steps / TODOs



- **Improve industry detection**: replace the keyword list with a trained classifier or a gazetteer.
- **Add Chinese support** (`zh_core_web_lg`) for bilingual reports.
- **Speed up text extraction** with **PyMuPDF**. Neither PyMuPDF nor **pdfplumber** is GPU‑accelerated, and pdfplumber builds on pdfminer.six, so it is unlikely to be faster.
- **Unit tests** for all helper functions.
- **Packaging**: turn this notebook into a CLI (`python -m sgx_ner path/to/pdfs`).
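
For the packaging item, a command-line entry point could be sketched with `argparse` as below. The argument names (`pdf_dir`, `--base`) mirror the parameters of `process_reports`; `build_parser` and `main` are hypothetical, and the call into the pipeline is left as a comment so the sketch runs without the NER dependencies.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Define the command-line interface (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="sgx_ner",
        description="Extract entities and relationships from annual-report PDFs.",
    )
    parser.add_argument("pdf_dir", help="directory containing the PDF files")
    parser.add_argument("--base", default="sgx", help="prefix for the output CSV files")
    return parser


def main(argv: Optional[List[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # A real entry point would call process_reports(args.pdf_dir, base=args.base) here.
    return args


args = main(["path/to/pdfs", "--base", "demo"])
```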

---

> *Notebook prepared from the original `ner.py` script — organised, deduplicated, and documented for clarity.*

In [23]:
from neo4j import GraphDatabase

# URI examples: "neo4j://localhost", "neo4j+s://xxx.databases.neo4j.io"
# Read credentials from the environment rather than hard-coding them in the notebook.
URI = os.environ["NEO4J_URI"]
AUTH = (os.getenv("NEO4J_USERNAME", "neo4j"), os.environ["NEO4J_PASSWORD"])

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()

In [27]:
"""
Utility: push an (entities, relations) payload into Neo4j/Aura.

* Reads NEO4J_URI / NEO4J_USERNAME / NEO4J_PASSWORD from the process
  (or a .env file dropped by Aura’s “Download Credentials” button).
* Works with any secured Aura instance because it uses the `neo4j+s://`
  scheme and leverages py2neo's built-in routing/TLS support (>=2021.2).
"""

from __future__ import annotations

import os
from pathlib import Path
from typing import Dict, List, Tuple

from dotenv import load_dotenv           # pip install python-dotenv
from py2neo import Graph, Node, Relationship

# --------------------------------------------------------------------------- #
# Public API
# --------------------------------------------------------------------------- #
def store_entities_relations_in_neo4j(
    entities: Dict[str, Dict[str, str]],
    relations: List[Tuple[str, str, str]],
    *,
    uri: str | None = None,
    user: str | None = None,
    password: str | None = None,
    clear: bool = False,
) -> Graph:
    """
    Push entities & relations to Neo4j/Aura.

    Parameters
    ----------
    entities   : {"entity_id": {"type": "Label", **props}}
    relations  : [(source_id, "REL_TYPE", target_id), ...]
    uri        : Override for Neo4j URI.  Defaults to $NEO4J_URI.
    user       : Override for username.    Defaults to $NEO4J_USERNAME.
    password   : Override for password.    Defaults to $NEO4J_PASSWORD.
    clear      : If True, wipes the DB with `MATCH (n) DETACH DELETE n`.
    """
    _bootstrap_dotenv()

    uri = uri or os.getenv("NEO4J_URI")
    user = user or os.getenv("NEO4J_USERNAME", "neo4j")
    password = password or os.getenv("NEO4J_PASSWORD")

    # Credentials must come from the environment or the keyword arguments;
    # hard-coded fallbacks would make this check unreachable and leak secrets.
    if not uri or not password:
        raise ValueError(
            "Neo4j URI or password not provided.  "
            "Set NEO4J_URI / NEO4J_PASSWORD in your environment "
            "or pass `uri=` / `password=`."
        )

    graph = Graph(uri, auth=(user, password))

    if clear:
        graph.run("MATCH (n) DETACH DELETE n")   # beware of large TXNs!

    # --- create entity nodes ------------------------------------------------ #
    entity_nodes: Dict[str, Node] = {}
    for eid, meta in entities.items():
        label = meta.get("type", "Entity")
        props = {k: v for k, v in meta.items() if k != "type"}
        props.setdefault("name", eid)
        node = Node(label, **props)
        graph.merge(node, label, "name")         # idempotent insert/update
        entity_nodes[eid] = node

    # --- create relationships ---------------------------------------------- #
    for src, rel_type, tgt in relations:
        if src in entity_nodes and tgt in entity_nodes:
            rel = Relationship(entity_nodes[src], rel_type, entity_nodes[tgt])
            graph.merge(rel)

    return graph


# --------------------------------------------------------------------------- #
# Helpers
# --------------------------------------------------------------------------- #
def _bootstrap_dotenv() -> None:
    """
    Load a `.env` file from the current working directory if it exists.
    This is where Aura Free saves the URI / user / password bundle.
    In a Jupyter notebook, this is typically the directory of the .ipynb file.
    """
    # load_dotenv() will automatically search for a .env file in the current
    # directory and its parents. This is compatible with Jupyter notebooks.
    # It will not override existing environment variables.
    load_dotenv()

In [28]:
import pandas as pd

# 1. Define the path to your data file
#    Make sure this path is correct for your environment.
csv_file_path = 'g:/My Drive/NUS MSBA SEM2/UOB/SGX Annual Reports/sgx_person_org_relationships.csv'

# 2. Load the relations data from your CSV
try:
    df_relations = pd.read_csv(csv_file_path)
    # Ensure we don't have rows with missing Person or Organization
    df_relations.dropna(subset=['Person', 'Organization'], inplace=True)
    print(f"Successfully loaded {len(df_relations)} relationships from CSV.")
except FileNotFoundError:
    print(f"Error: The file was not found at {csv_file_path}")
    # Create an empty DataFrame to prevent further errors if file is not found
    df_relations = pd.DataFrame(columns=['Person', 'Organization'])

# 3. Prepare the 'entities' and 'relations' data structures for Neo4j
entities_to_store = {}
relations_to_store = []

if not df_relations.empty:
    # Create entity entries for each unique person and organization
    for person in df_relations['Person'].unique():
        entities_to_store[str(person)] = {'type': 'Person'}
    
    for org in df_relations['Organization'].unique():
        entities_to_store[str(org)] = {'type': 'Organization'}

    # Create relation entries from each row in the DataFrame
    for _, row in df_relations.iterrows():
        source_person = str(row['Person'])
        target_org = str(row['Organization'])
        # You can customize the relationship type if needed
        relation_type = "ASSOCIATED_WITH" 
        relations_to_store.append((source_person, relation_type, target_org))

    print(f"Prepared {len(entities_to_store)} unique entities and {len(relations_to_store)} relations to be stored.")

    # 4. Call the function to store the data in your Neo4j database
    # This will use the credentials from your .env file or environment variables.
    # The `clear=True` flag will wipe the database before adding new data.
    # Set clear=False if you want to add to existing data without deleting it first.
    try:
        graph = store_entities_relations_in_neo4j(
            entities=entities_to_store,
            relations=relations_to_store,
            clear=True
        )
        print("\nSuccessfully stored data in Neo4j.")
        print(f"Graph details: {graph}")
    except Exception as e:
        print(f"\nAn error occurred while connecting to or writing to Neo4j: {e}")
else:
    print("No data to store. Please check the CSV file path and its content.")


Successfully loaded 2731 relationships from CSV.
Prepared 717 unique entities and 2731 relations to be stored.

Successfully stored data in Neo4j.
Graph details: Graph('neo4j+s://a0183311.databases.neo4j.io:7687')


### Result: unusable, because the extracted entities include many meaningless ones

# Using Gemini 2.5 Pro to perform NER again on the same document


In [18]:

"""
sgx_ner_to_neo4j.py
-------------------

End‑to‑end pipeline to extract People‑↔︎Organisation role relations from an SGX
annual‑report PDF and push them into Neo4j Aura using Google Gemini 2.5 Pro for
NER.

⚙️  Requirements
    pip install google-generativeai pdfplumber python-dotenv langchain py2neo

The script expects these **environment variables** (e.g. in a `.env` file):

    GOOGLE_API_KEY      # your Google AI Developer key
    NEO4J_URI           # e.g. neo4j+s://a0183311.databases.neo4j.io
    NEO4J_USERNAME      # neo4j
    NEO4J_PASSWORD      # 40‑char secret from Aura
    PDF_PATH            # path to local annual‑report PDF
"""

from __future__ import annotations

import json
import os
import re
from collections import defaultdict
from typing import Dict, List

import pdfplumber                         # PDF text extraction
from dotenv import load_dotenv            # env helper
import google.generativeai as genai       # provides genai.configure / GenerativeModel used below
from langchain.text_splitter import RecursiveCharacterTextSplitter
from py2neo import Graph, Node, Relationship


In [19]:

# --------------------------------------------------------------------------- #
# ----------------------------  CONFIGURATION  ------------------------------ #
# --------------------------------------------------------------------------- #

load_dotenv()                             # loads .env if present

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")   # required; set it in the environment or .env
PDF_PATH       = os.getenv("PDF_PATH", "C:\\Users\\22601\\Downloads\\finer\\data\\842974_E01238.pdf")
CHUNK_SIZE     = int(os.getenv("CHUNK_SIZE", 3000))
CHUNK_OVERLAP  = int(os.getenv("CHUNK_OVERLAP", 250))
MODEL_NAME     = os.getenv("GEMINI_MODEL", "gemini-2.5-pro")


In [20]:

# new instance from my edu account
NEO4J_URI      = os.getenv("NEO4J_URI", "neo4j+s://d8d4e86b.databases.neo4j.io")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")   # required; never hard-code the Aura secret
NEO4J_CLEAR    = os.getenv("NEO4J_CLEAR", "false").lower() == "true"
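
The configuration cells read each setting from the environment one by one. A minimal sketch of collecting and validating them in one place follows; the `Settings` dataclass and `from_env` are hypothetical names, the variable names are those listed in the script's docstring, and the values in the demo call are dummies.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    chunk_size: int = 3000
    chunk_overlap: int = 250

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Build settings from a mapping, failing early when a secret is missing."""
        required = ("GOOGLE_API_KEY", "NEO4J_URI", "NEO4J_PASSWORD")
        missing = [name for name in required if not env.get(name)]
        if missing:
            raise EnvironmentError("Missing required settings: " + ", ".join(missing))
        return cls(
            google_api_key=env["GOOGLE_API_KEY"],
            neo4j_uri=env["NEO4J_URI"],
            neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
            neo4j_password=env["NEO4J_PASSWORD"],
            chunk_size=int(env.get("CHUNK_SIZE", 3000)),
            chunk_overlap=int(env.get("CHUNK_OVERLAP", 250)),
        )


# Dummy values for demonstration only
demo = Settings.from_env({
    "GOOGLE_API_KEY": "dummy-key",
    "NEO4J_URI": "neo4j://localhost",
    "NEO4J_PASSWORD": "dummy-password",
})
```

Passing the mapping explicitly keeps the function testable without touching the real environment.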


In [None]:
# Check for required environment variables
for var_name in ("GOOGLE_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"):
    if not locals()[var_name]:
        raise EnvironmentError(f"Missing required environment variable: {var_name}")

# Model and chunking configuration
# NOTE: these assignments override the environment-derived values set in the configuration cells above.
MODEL_NAME = "gemini-2.5-pro"
CHUNK_SIZE = 8000
CHUNK_OVERLAP = 400
NEO4J_CLEAR = False  # False appends to existing graph data; True wipes the database first


# Old code (commented out, kept for reference)

In [None]:

# # --------------------------------------------------------------------------- #
# # ---------------------- GEMINI CLIENT INITIALIZATION ----------------------- #
# # --------------------------------------------------------------------------- #

# # Configure the GenAI client with the API key
# genai.configure(api_key=GOOGLE_API_KEY)

# # Create the Generative Model instance with the system prompt
# gemini_ner_model = genai.GenerativeModel(
#     model_name=MODEL_NAME,
#     system_instruction=SYSTEM_PROMPT
# )


# # --------------------------------------------------------------------------- #
# # ---------------------------  PDF HELPERS  --------------------------------- #
# # --------------------------------------------------------------------------- #

# def extract_text_from_pdf(path: str) -> str:
#     """Return concatenated text from every page of a PDF."""
#     text_parts = []
#     with pdfplumber.open(path) as pdf:
#         for page in pdf.pages:
#             page_text = page.extract_text() or ""
#             text_parts.append(page_text)
#     return "\n".join(text_parts)


# def chunk_text(text: str,
#                chunk_size: int = CHUNK_SIZE,
#                overlap: int = CHUNK_OVERLAP) -> List[str]:
#     """Splits text into manageable chunks for the model."""
#     splitter = RecursiveCharacterTextSplitter(
#         chunk_size=chunk_size,
#         chunk_overlap=overlap,
#         separators=["\n\n", "\n", " ", ""]
#     )
#     return splitter.split_text(text)

# # --------------------------------------------------------------------------- #
# # --------------------------  GEMINI NER  ----------------------------------- #
# # --------------------------------------------------------------------------- #

# def ner_chunk(chunk: str) -> List[Dict[str, str]]:
#     """
#     Sends a text chunk to the Gemini model for NER and parses the response.
#     This function is now fixed to use the current API.
#     """
#     # Call the modern API on the initialized model object
#     response = gemini_ner_model.generate_content(chunk)

#     # Model returns a blob of lines; filter and parse JSON
#     relations = []
#     for line in response.text.strip().splitlines():
#         line = line.strip()
#         if not line:
#             continue
#         try:
#             # First attempt: load the line as a clean JSON object
#             obj = json.loads(line)
#             if obj:  # Ensure it's not an empty object {}
#                 relations.append(obj)
#         except json.JSONDecodeError:
#             # Second attempt: salvage with a greedy regex if model adds extra text
#             match = re.search(r"{.*}", line)
#             if match:
#                 try:
#                     relations.append(json.loads(match.group(0)))
#                 except json.JSONDecodeError:
#                     # Ignore lines that are truly malformed
#                     pass
#     return relations

# # --------------------------------------------------------------------------- #
# # ------------------------  NEO4J LOADER  ----------------------------------- #
# # --------------------------------------------------------------------------- #

# def push_to_neo4j(relations: List[Dict[str, str]]) -> None:
#     """Pushes the extracted person-role-company relations into a Neo4j graph."""
#     graph = Graph(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

#     if NEO4J_CLEAR:
#         print("Clearing existing Neo4j database...")
#         graph.run("MATCH (n) DETACH DELETE n")

#     # Use a cache to avoid merging the same node multiple times
#     node_cache = {}  # (Label, name) -> Node

#     print(f"Pushing {len(relations)} relations to Neo4j...")
#     for rel in relations:
#         person = rel.get("person")
#         role = rel.get("role")
#         company = rel.get("company")

#         # Skip if any of the core components are missing
#         if not all((person, role, company)):
#             continue

#         # Create or merge Person node
#         key_person = ("Person", person)
#         if key_person not in node_cache:
#             node = Node("Person", name=person)
#             graph.merge(node, "Person", "name")
#             node_cache[key_person] = node

#         # Create or merge Company node
#         key_comp = ("Company", company)
#         if key_comp not in node_cache:
#             node = Node("Company", name=company)
#             graph.merge(node, "Company", "name")
#             node_cache[key_comp] = node

#         # Create the relationship between the two nodes
#         rel_obj = Relationship(
#             node_cache[key_person],
#             "HAS_ROLE_AT",
#             node_cache[key_comp],
#             role=role.lower() # Standardize role to lowercase
#         )
#         graph.merge(rel_obj)


# New code

In [22]:
TOC_SYSTEM_PROMPT = """You are an expert document analysis AI. Your task is to analyze the provided text, which represents the Table of Contents from a corporate annual report, and find the starting page number for the section detailing the company's directors.

**Analysis Rules:**
1.  Look for section titles like "Board of Directors", "Directors' Profile", "Information on Directors", or "Corporate Governance".
2.  Extract the exact title and its corresponding starting page number.
3.  The page number is typically the last number on the line associated with the title.

**Output Format:**
You MUST return ONLY a single JSON object with two keys:
- `section_title`: The exact string of the section title you found.
- `start_page`: The integer page number for that section.

If you cannot confidently identify the section, return `null` for both values.

---
**Example:**

**Input Text:**
"Message to Shareholders ..................... 2
Financial Highlights ......................... 4
Board of Directors ........................... 8
Statement on Corporate Governance .......... 20
Report of the Audit Committee .............. 35"

**Output JSON:**
```json
{
  "section_title": "Board of Directors",
  "start_page": 8
}
```
"""

In [23]:
NER_SYSTEM_PROMPT = """You are an expert-level financial-domain Named Entity Recognition (NER) and Relationship Extraction engine. Your task is to analyze text chunks extracted exclusively from the 'Board of Directors' or 'Directors' Profile' section of a company's annual report. Your goal is to create a structured JSON object containing all people, companies, and their relationships.

**Extraction Rules:**
1.  **Be Specific:** Do NOT extract generic, non-specific entities. Ignore terms like 'the company', 'the group', 'the university', 'our auditors', 'the school', 'a bank' unless they are part of a full proper noun.
2.  **Differentiate Entities:** Carefully distinguish between a person's name and an organization's name. A person's name usually consists of a first and last name. A company name often includes suffixes like 'Ltd', 'Group', 'Holdings', or 'Corporation'.
3.  **Canonicalize Names:** For each unique entity, determine its most complete, official name from the text to use as its `canonicalName`. All other references (e.g., acronyms, shorter names) should be listed in the `mentions` array.
4.  **Extract Timestamps:** Only populate the `effectiveDate` field if a specific date or year is mentioned in direct connection to a role or appointment (e.g., "appointed on 1 Jan 2024", "since 2022"). If no date is present, the value must be `null`.

**Output Format:**
You MUST return ONLY a single JSON object with two keys: "entities" and "relationships".

---
**Example:**

**Input Text:**
"Mr. Tan Ah Kow joined the board of directors of SGX Group in 2023. He has been the Chief Executive Officer of DBS Group Holdings Ltd. (also known as DBS) since his appointment on Feb 1, 2022. His colleague, Ms. Jane Lim, is a director at Keppel Ltd."

**Output JSON:**
```json
{
  "entities": [
    {
      "entityId": "PERSON_1", "type": "Person", "canonicalName": "Tan Ah Kow",
      "mentions": ["Mr. Tan Ah Kow"]
    },
    {
      "entityId": "COMPANY_1", "type": "Company", "canonicalName": "SGX Group",
      "mentions": ["SGX Group"]
    },
    {
      "entityId": "COMPANY_2", "type": "Company", "canonicalName": "DBS Group Holdings Ltd.",
      "mentions": ["DBS Group Holdings Ltd.", "DBS"]
    },
    {
      "entityId": "PERSON_2", "type": "Person", "canonicalName": "Jane Lim",
      "mentions": ["Ms. Jane Lim"]
    },
    {
      "entityId": "COMPANY_3", "type": "Company", "canonicalName": "Keppel Ltd",
      "mentions": ["Keppel Ltd"]
    }
  ],
  "relationships": [
    {
      "sourceEntityId": "PERSON_1", "targetEntityId": "COMPANY_1",
      "role": "director", "effectiveDate": "2023"
    },
    {
      "sourceEntityId": "PERSON_1", "targetEntityId": "COMPANY_2",
      "role": "chief executive officer", "effectiveDate": "2022-02-01"
    },
    {
      "sourceEntityId": "PERSON_2", "targetEntityId": "COMPANY_3",
      "role": "director", "effectiveDate": null
    }
  ]
}
```
"""

In [24]:
import os
import json
import re
from typing import List, Dict, Any, Optional

# Dependency Imports
import google.generativeai as genai
import pdfplumber
from langchain.text_splitter import RecursiveCharacterTextSplitter
from py2neo import Graph, Node, Relationship

# --- Assume environment variables and constants are defined above ---
# --- This includes the two new prompts: TOC_SYSTEM_PROMPT and NER_SYSTEM_PROMPT ---

TOC_PAGE_LIMIT = 10 # How many pages to scan for the Table of Contents

DIRECTOR_SECTION_END_KEYWORDS = [
    "directors' statement", "statement by directors", "independent auditor's report",
    "financial statements", "remuneration report"
]


# --------------------------------------------------------------------------- #
# ---------------------- GEMINI CLIENT INITIALIZATION ----------------------- #
# --------------------------------------------------------------------------- #

genai.configure(api_key=GOOGLE_API_KEY)

# Model for finding the section in the Table of Contents
gemini_toc_model = genai.GenerativeModel(
    model_name=MODEL_NAME,
    system_instruction=TOC_SYSTEM_PROMPT
)

# Model for performing the actual NER on the section text
gemini_ner_model = genai.GenerativeModel(
    model_name=MODEL_NAME,
    system_instruction=NER_SYSTEM_PROMPT
)


In [25]:


# --------------------------------------------------------------------------- #
# ---------------------- PDF HELPERS (MODIFIED) ----------------------------- #
# --------------------------------------------------------------------------- #

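def find_directors_section_page(toc_text: str) -> Optional[Dict[str, Any]]:
    """Uses an LLM to find the Board of Directors section page from ToC text."""
    try:
        response = gemini_toc_model.generate_content(toc_text)
        raw = response.text.strip()
        # The prompt's example wraps the JSON in a ```json fence, so unwrap it if present.
        match = re.search(r"```(?:json)?\s*({.*})\s*```", raw, re.DOTALL)
        return json.loads(match.group(1) if match else raw)
    except (json.JSONDecodeError, AttributeError, ValueError):
        return None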

def extract_director_section_text(path: str) -> str:
    """
    Implements the two-step process:
    1. Use LLM to find the director section page from the Table of Contents.
    2. Extract text starting from that page until an end keyword is found.
    """
    text_parts = []
    
    with pdfplumber.open(path) as pdf:
        # Step 1: Analyze the Table of Contents using the LLM
        print("   → 1a. Analyzing Table of Contents with LLM...")
        toc_pages_text = "\n".join([page.extract_text() or "" for page in pdf.pages[:TOC_PAGE_LIMIT]])
        section_info = find_directors_section_page(toc_pages_text)
        
        start_page = section_info.get("start_page") if section_info else None
        
        # Step 2: Extract text based on the found page number
        if start_page and isinstance(start_page, int):
            print(f"   → 1b. Section '{section_info.get('section_title')}' found. Starting extraction from page {start_page}.")
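            # pdfplumber pages are 0-indexed, while page numbers in reports are 1-indexed.
            # NOTE: printed page numbers can differ from PDF page positions (covers,
            # unnumbered front matter), so this offset is an approximation.
            first_idx = start_page - 1
            for page_num in range(first_idx, len(pdf.pages)):
                page = pdf.pages[page_num]
                page_text = page.extract_text() or ""
                lower_text = page_text.lower()

                # Stop at a section that commonly follows the directors' pages, but never
                # on the first page, so the section itself is always captured.
                if page_num > first_idx and any(keyword in lower_text for keyword in DIRECTOR_SECTION_END_KEYWORDS):
                    break
                text_parts.append(page_text)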
        else:
            print("[Warning] Could not determine section from ToC. Falling back to keyword search across document.")
            # Fallback logic from previous version
            in_section = False
            start_keywords = ["board of directors", "directors' profile"]
            for page in pdf.pages:
                page_text = page.extract_text() or ""
                lower_text = page_text.lower()
                if in_section and any(keyword in lower_text for keyword in DIRECTOR_SECTION_END_KEYWORDS):
                    break
                if not in_section and any(keyword in lower_text for keyword in start_keywords):
                    in_section = True
                if in_section:
                    text_parts.append(page_text)

    if not text_parts:
        raise ValueError("Could not find the directors' section or extract any text from the PDF.")

    return "\n".join(text_parts)


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Splits text into manageable chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap, separators=["\n\n", "\n", " ", ""]
    )
    return splitter.split_text(text)


# --------------------------------------------------------------------------- #
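# ------------- HELPER FUNCTIONS (ner_chunk, push_to_neo4j) ----------------- #
# --------------------------------------------------------------------------- #
def ner_chunk(chunk: str) -> Dict[str, List[Dict[str, Any]]]:
    """Run Gemini NER on one chunk; return an empty result if the call or parsing fails."""
    empty_response = {"entities": [], "relationships": []}
    try:
        response = gemini_ner_model.generate_content(chunk)
        response_text = response.text.strip()
        # The model may wrap its JSON in a Markdown json fence; unwrap it if present.
        match = re.search(r"```json\s*({.*})\s*```", response_text, re.DOTALL)
        json_str = match.group(1) if match else response_text
        data = json.loads(json_str)
    except Exception as exc:
        # Best effort: report the failure so the caller can skip this chunk.
        print(f"[Warning] NER failed for a chunk: {exc}")
        return empty_response
    # Guard against valid JSON that is not an object (for example, a bare list).
    return data if isinstance(data, dict) else empty_response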

def push_to_neo4j(relations: List[Dict[str, Any]]) -> None:
    """Merge Person and Company nodes and their HAS_ROLE_AT relationships into Neo4j."""
    graph = Graph(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
    if NEO4J_CLEAR:
        # Destructive: deletes every node and relationship in the database.
        graph.run("MATCH (n) DETACH DELETE n")
    node_cache = {}  # (label, name) -> Node, so each node is merged only once
    for rel in relations:
        p_name, c_name = rel.get("person"), rel.get("company")
        role, date = rel.get("role"), rel.get("effectiveDate")
        if not all((p_name, c_name, role)):
            continue  # skip incomplete records
        k_person, k_comp = ("Person", p_name), ("Company", c_name)
        if k_person not in node_cache:
            node = Node("Person", name=p_name)
            graph.merge(node, "Person", "name")
            node_cache[k_person] = node
        if k_comp not in node_cache:
            node = Node("Company", name=c_name)
            graph.merge(node, "Company", "name")
            node_cache[k_comp] = node
        rel_obj = Relationship(
            node_cache[k_person], "HAS_ROLE_AT", node_cache[k_comp],
            role=role.lower(), effective_date=date,
        )
        graph.merge(rel_obj)

# --------------------------------------------------------------------------- #
# --------------------------- MAIN EXECUTION -------------------------------- #
# --------------------------------------------------------------------------- #

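The page-selection rule inside `extract_director_section_text` can be exercised without opening a PDF. The following is a minimal sketch, assuming the pages are already available as plain strings. `select_section_pages` and its parameters are hypothetical names that do not exist in the notebook, and the sample pages and end keyword are made up for illustration rather than taken from `DIRECTOR_SECTION_END_KEYWORDS`.

```python
from typing import List, Optional, Sequence


def select_section_pages(
    pages: Sequence[str],
    start_page: Optional[int],
    end_keywords: Sequence[str],
) -> List[str]:
    """Return page texts from a 1-indexed start page up to, but not including,
    the first page that contains an end keyword."""
    # Reject missing, non-integer, or out-of-range page numbers.
    if not isinstance(start_page, int) or isinstance(start_page, bool):
        return []
    if not 1 <= start_page <= len(pages):
        return []

    selected: List[str] = []
    for text in pages[start_page - 1:]:  # convert to a 0-indexed slice
        lower_text = text.lower()
        if any(keyword in lower_text for keyword in end_keywords):
            break
        selected.append(text)
    return selected


# Made-up sample pages standing in for page.extract_text() results.
pages = [
    "Contents",
    "Board of Directors\nJane Tan",
    "John Lim",
    "Corporate Governance Report",
]
section = select_section_pages(pages, 2, ["corporate governance"])
```

An empty return value here plays the role of the "could not determine section" case, where the notebook falls back to a keyword scan over the whole document.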

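`ner_chunk` accepts a reply that is either bare JSON or JSON wrapped in a Markdown `json` fence. That parsing step can be checked offline, with no Gemini call. The sketch below is one possible way to isolate it; `parse_ner_reply`, `FENCE`, and `FENCED_JSON` are hypothetical names, and the sample reply is made up.

```python
import json
import re
from typing import Any, Dict

# The fence marker is built from single backticks so that this example
# does not itself contain a literal triple backtick.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"json\s*(\{.*\})\s*" + FENCE, re.DOTALL)


def parse_ner_reply(reply_text: str) -> Dict[str, Any]:
    """Parse a model reply into a dict, or return the empty result on failure."""
    empty = {"entities": [], "relationships": []}
    text = reply_text.strip()

    # Prefer the fenced payload if present; otherwise try the whole reply.
    match = FENCED_JSON.search(text)
    candidate = match.group(1) if match else text

    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return empty

    # Valid JSON that is not an object (for example, a list) is also rejected.
    return parsed if isinstance(parsed, dict) else empty


# Made-up reply in the fenced form.
fenced = FENCE + 'json\n{"entities": [{"entityId": "e1"}], "relationships": []}\n' + FENCE
result = parse_ner_reply(fenced)
```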
In [27]:
import os
import json
import re
from typing import List, Dict, Any

# --- Assume all previous code, including prompts, functions, and Gemini client initialization, is defined above ---
# --- No changes are needed to the prompts or the functions themselves. ---
print("📖 Starting PDF processing for:", PDF_PATH)
# The extract_director_section_text function handles finding the correct text
raw_text = extract_director_section_text(PDF_PATH)

print("✂️  Splitting extracted text into chunks...")
chunks = chunk_text(raw_text)
print(f"   → {len(chunks)} chunks to process")

# Master dictionary to hold de-duplicated entities, keyed by their NORMALIZED name
master_entities = {}  # Key: normalized_name, Value: full entity object
resolved_relations = []

print("🧠 Processing chunks with Gemini NER...")
for i, chunk in enumerate(chunks, 1):
    print(f"   Processing chunk {i}/{len(chunks)}...", end="\r")
    
    data = ner_chunk(chunk)
    if not data.get("entities"):
        continue

    # A map to resolve a chunk's temporary ID to a permanent, normalized name
    chunk_id_to_normalized_name_map = {}
    
    # === MODIFICATION START: De-duplicate entities using a normalized key ===
    for entity in data.get("entities", []):
        canonical_name = entity.get("canonicalName")
        entity_id = entity.get("entityId")
        if not canonical_name or entity_id is None:
            continue  # skip malformed entities instead of raising KeyError

        # Lowercase and strip so case and whitespace variants share one key
        normalized_name = canonical_name.lower().strip()

        # Map the temporary ID of this chunk to the permanent normalized name
        chunk_id_to_normalized_name_map[entity_id] = normalized_name

        # If we have not seen this entity before (based on its normalized name), add it
        if normalized_name not in master_entities:
            master_entities[normalized_name] = entity
        else:
            # If we have seen this entity, merge the mentions to enrich the data
            # This handles cases where "DBS" and "DBS Group" are found in different chunks
            # but have the same canonical name ("DBS Group Holdings Ltd.")
            existing_mentions = set(master_entities[normalized_name].get("mentions", []))
            new_mentions = set(entity.get("mentions", []))
            
            # Also add the new canonicalName variation itself to the mentions list
            existing_mentions.add(master_entities[normalized_name]["canonicalName"])
            new_mentions.add(canonical_name)
            
            master_entities[normalized_name]["mentions"] = sorted(existing_mentions | new_mentions)

    # Resolve relationships using the map of normalized names
    for rel in data.get("relationships", []):
        person_id = rel.get("sourceEntityId")
        company_id = rel.get("targetEntityId")
        
        person_normalized_name = chunk_id_to_normalized_name_map.get(person_id)
        company_normalized_name = chunk_id_to_normalized_name_map.get(company_id)
        
        if person_normalized_name and company_normalized_name:
            # Crucially, retrieve the original CANONICAL name from the master list
            # to use in the final relationship record.
            person_canonical_name = master_entities[person_normalized_name]['canonicalName']
            company_canonical_name = master_entities[company_normalized_name]['canonicalName']
            
            resolved_relations.append({
                "person": person_canonical_name,
                "company": company_canonical_name,
                "role": rel.get("role"),
                "effectiveDate": rel.get("effectiveDate")
            })
    # === MODIFICATION END ===

print(f"\n🔗 Extracted {len(resolved_relations)} total relations from {len(master_entities)} unique entities after merging.")

if resolved_relations:
    print("💾 Pushing data to Neo4j...")
    push_to_neo4j(resolved_relations)
    print("\nPipeline finished successfully!")
else:
    print("\nNo relations were extracted to push to the database.")


📖 Starting PDF processing for: C:\Users\22601\Downloads\finer\data\842974_E01238.pdf
   → 1a. Analyzing Table of Contents with LLM...
✂️  Splitting extracted text into chunks...
   → 6 chunks to process
🧠 Processing chunks with Gemini NER...
   Processing chunk 6/6...
🔗 Extracted 197 total relations from 127 unique entities after merging.
💾 Pushing data to Neo4j...

Pipeline finished successfully!

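The entity de-duplication step in the cell above can be factored into a function and checked without calling Gemini. This is a minimal sketch, assuming entities carry the `entityId`, `canonicalName`, and `mentions` keys used in that cell. `merge_chunk_entities` is a hypothetical name, and the two sample records are made up around the "DBS Group Holdings Ltd." example from the code comments.

```python
from typing import Any, Dict, List


def merge_chunk_entities(
    master: Dict[str, Dict[str, Any]],
    chunk_entities: List[Dict[str, Any]],
) -> Dict[str, str]:
    """Merge one chunk's entities into `master` (mutated in place).

    Returns a map from the chunk's temporary entityId to the normalized name.
    """
    id_to_key: Dict[str, str] = {}
    for entity in chunk_entities:
        name = entity.get("canonicalName")
        entity_id = entity.get("entityId")
        if not name or entity_id is None:
            continue  # skip malformed records

        key = name.lower().strip()
        id_to_key[entity_id] = key

        if key not in master:
            master[key] = entity  # first sighting: keep the record as-is
            continue

        # Repeat sighting: union the mentions and keep both name spellings.
        known = master[key]
        mentions = set(known.get("mentions", [])) | set(entity.get("mentions", []))
        mentions.update({known["canonicalName"], name})
        known["mentions"] = sorted(mentions)
    return id_to_key


master: Dict[str, Dict[str, Any]] = {}
first = merge_chunk_entities(
    master,
    [{"entityId": "e1", "canonicalName": "DBS Group Holdings Ltd.", "mentions": ["DBS"]}],
)
second = merge_chunk_entities(
    master,
    [{"entityId": "e7", "canonicalName": "dbs group holdings ltd. ", "mentions": ["DBS Group"]}],
)
```

Both chunk IDs resolve to the same key, so relationships from either chunk end up pointing at a single entity whose `canonicalName` is the first spelling seen.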