# 🏗️ Build a Knowledge Graph with Gemini & BigQuery

This notebook demonstrates how to construct a Knowledge Graph from unstructured PDF manuals using:
*   **Document AI**: For parsing PDF documents.
*   **Gemini**: For extracting entities and relationships.
*   **BigQuery**: For storing graph data and executing graph queries.


## 🏁 Phase 1: Environment Setup & Data Loading

### 🛠️ Set Up Environment
Install the necessary Python libraries to interact with Google Cloud services.


In [105]:
!pip install google-cloud-documentai --quiet

Define project constants such as Project ID, location and table names. Ensure `PROJECT_ID` matches your environment.


In [None]:
PROJECT_ID = "your-project-id" # @param {type:"string"}
LOCATION = 'us' # @param {type:"string"}
DATASET_ID = 'kg_demo' # @param {type:"string"}
GCS_FILE_LOCATION = "gs://sample-data-and-media/knowledge_graph_demo"

GEMINI_MODEL_VERSION = f"https://aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/global/publishers/google/models/gemini-3.1-pro-preview"
DOCUMENT_MODEL_NAME = "layout_parser"

OBJECT_TABLE_NAME = "manufacturing_manuals"
GRAPH_TABLE_NAME = "manufacturing_kg"
EXTRACTED_KG_TABLE_NAME = "extracted_knowledge_graph"
PROCESSOR_RESULTS_TABLE = "processed_documents_raw"
PROCESSED_DOCUMENTS_TABLE = "processed_documents"

Enable the Document AI API for your project.


In [107]:
!gcloud services enable documentai.googleapis.com --project={PROJECT_ID}

Check for an existing Document AI Processor or create a new one if it does not exist.

In [None]:
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# 1. Setup Document AI Client
processor_display_name = "pdf_parser_processor"
opts = ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
client = documentai.DocumentProcessorServiceClient(client_options=opts)
parent = client.common_location_path(PROJECT_ID, LOCATION)

# 2. Check for existing processor (to avoid duplicates)
existing_processors = client.list_processors(parent=parent)
processor = next(
    (p for p in existing_processors if p.display_name == processor_display_name),
    None
)

# 3. Create new LAYOUT_PARSER_PROCESSOR if none exists
if processor:
    print(f"Found existing processor: {processor.name}")
else:
    print("Creating new processor...")
    processor = client.create_processor(
        parent=parent,
        processor=documentai.Processor(
            display_name=processor_display_name, 
            type_="LAYOUT_PARSER_PROCESSOR" # Uses the Layout Parser (v1)
        ),
    )

print(f"Processor Name: {processor.name}")
print(f"Processor Type: {processor.type_}")

# Extract the Processor ID, which will be used to configure the remote model in BigQuery.
PROCESSOR_ID = processor.name.split('/')[-1]
print(f"Processor ID: {PROCESSOR_ID}")
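
The cell above takes the last path segment of `processor.name` as the processor ID. As a minimal sketch, assuming the resource name follows the `projects/.../locations/.../processors/...` layout, the same extraction can be done with validation so that an unexpected name fails loudly. The helper name `parse_processor_name` and the sample resource name are hypothetical and used only for illustration.

```python
import re

# Expected layout of a Document AI processor resource name.
_PROCESSOR_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/processors/(?P<processor_id>[^/]+)$"
)


def parse_processor_name(name: str) -> dict:
    """Split a processor resource name into its project, location and ID."""
    match = _PROCESSOR_NAME.match(name)
    if match is None:
        raise ValueError(f"Unexpected processor resource name: {name!r}")
    return match.groupdict()


# Hypothetical resource name, for illustration only.
parsed = parse_processor_name(
    "projects/123456789/locations/us/processors/abc123def456"
)
print(parsed["processor_id"])
```

Compared with `split('/')[-1]`, this version rejects strings that are not processor names instead of silently returning their last segment.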

### 🛠️ Prepare BigQuery Environment


Create a BigQuery dataset to store the tables and models.


In [None]:
!bq --location={LOCATION} mk -d {DATASET_ID}

Initialize the Python client for BigQuery to execute setup queries.


In [110]:
from google.cloud import bigquery, storage
client = bigquery.Client(project=PROJECT_ID)

Load existing structured data (for example, customers and products) from the JSON files in GCS into BigQuery tables. Each `.json` file becomes a table named after the file.


In [None]:
bucket_name, prefix = GCS_FILE_LOCATION[5:].split('/', 1)

storage_client = storage.Client()

# Iterate and Load
for blob in storage_client.list_blobs(bucket_name, prefix=prefix):
    if blob.name.endswith(".json"):
        # Extract filename only (e.g., 'parts')
        table_name = blob.name.split('/')[-1].replace(".json", "")
        table_id = f"{PROJECT_ID}.{DATASET_ID}.{table_name}"
        
        # Configure Load
        config = bigquery.LoadJobConfig(
            source_format="NEWLINE_DELIMITED_JSON", 
            autodetect=True, 
            write_disposition="WRITE_TRUNCATE"
        )
        print(f"Loading {table_name}...")
        client.load_table_from_uri(f"gs://{bucket_name}/{blob.name}", table_id, job_config=config).result()

print("Done.")
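
The loading cell derives the bucket, the prefix, and each table name with string slicing. The sketch below isolates that logic in two small helpers so it can be checked without touching GCS. The helper names `split_gcs_uri` and `table_name_for` are hypothetical, and the blob names in the usage lines are made up for illustration.

```python
from typing import Optional, Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # partition() also copes with URIs that have no prefix at all.
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    return bucket, prefix


def table_name_for(blob_name: str) -> Optional[str]:
    """Return the table name for a .json blob, or None for other files."""
    filename = blob_name.rsplit("/", 1)[-1]
    if not filename.endswith(".json"):
        return None
    # Strip only the trailing extension, not every '.json' in the name.
    return filename[: -len(".json")]


bucket_name, prefix = split_gcs_uri(
    "gs://sample-data-and-media/knowledge_graph_demo"
)
print(bucket_name, prefix)
print(table_name_for("knowledge_graph_demo/products.json"))
print(table_name_for("knowledge_graph_demo/manual.pdf"))
```

Slicing off the suffix differs slightly from `.replace(".json", "")`, which would also remove a `.json` that appears in the middle of a file name.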

## Phase 2: Document Ingestion and Parsing


### 📄 Create Object Table
Create an Object Table to reference the unstructured PDF files stored in GCS.


In [None]:
query = f"""
CREATE OR REPLACE EXTERNAL TABLE `{PROJECT_ID}.{DATASET_ID}.{OBJECT_TABLE_NAME}`
WITH CONNECTION DEFAULT
OPTIONS(
 object_metadata = 'SIMPLE',
 uris = ['{GCS_FILE_LOCATION}/*.pdf']);
"""

job = client.query(query)

result = job.result()
print(f"Object table: {OBJECT_TABLE_NAME} created successfully in dataset {DATASET_ID}.")

### 📝 Create Document AI Model
Create a Remote Model in BigQuery that interfaces with the Document AI processor.


In [None]:
query = f"""
CREATE OR REPLACE MODEL
`{DATASET_ID}.{DOCUMENT_MODEL_NAME}`
REMOTE WITH CONNECTION DEFAULT
OPTIONS (
  REMOTE_SERVICE_TYPE = 'CLOUD_AI_DOCUMENT_V1',
  DOCUMENT_PROCESSOR = '{PROCESSOR_ID}'
);
"""

job = client.query(query)

result = job.result()
print(f"Model {DOCUMENT_MODEL_NAME} created successfully in dataset {DATASET_ID}.")

### ⚙️ Process Documents
We run `ML.PROCESS_DOCUMENT` to parse the PDF files using the created Document AI model. 

This function analyzes the layout and visual structure of the documents, splitting them into logical **chunks** (e.g., paragraphs, list items). 

The result is a table containing the raw JSON output for each document.

In [97]:
query = f"""
 CREATE OR REPLACE TABLE `{DATASET_ID}.{PROCESSOR_RESULTS_TABLE}` AS (
 SELECT * FROM ML.PROCESS_DOCUMENT(
   MODEL `{DATASET_ID}.{DOCUMENT_MODEL_NAME}`,
   TABLE `{DATASET_ID}.{OBJECT_TABLE_NAME}`,
   PROCESS_OPTIONS => (JSON '{{"layout_config": {{"chunking_config": {{"chunk_size": 250, "include_ancestor_headings": true}}}}}}')
 )
)
"""

job = client.query(query)

result = job.result()
print(f"Table: {PROCESSOR_RESULTS_TABLE} created successfully.")

print(f"Previewing {PROCESSOR_RESULTS_TABLE}:")
client.query(f"SELECT * FROM `{DATASET_ID}.{PROCESSOR_RESULTS_TABLE}` LIMIT 5").to_dataframe()

Table: processed_documents_raw created successfully.
Previewing processed_documents_raw:


Unnamed: 0,ml_process_document_result,ml_process_document_status,uri,generation,content_type,size,md5_hash,updated,metadata,ref
0,"{""chunkedDocument"":{""chunks"":[{""chunkId"":""c1"",...",,gs://sample-data-and-media/knowledge_graph_dem...,1772008213929247,application/pdf,129832,d49d758bd6e07796da63c69bdce7cbca,2026-02-25 08:30:13.991000+00:00,[],{'uri': 'gs://sample-data-and-media/knowledge_...
1,"{""chunkedDocument"":{""chunks"":[{""chunkId"":""c1"",...",,gs://sample-data-and-media/knowledge_graph_dem...,1772008213942756,application/pdf,158969,9fa7eed00a2b1830c39ae8a205d2be3b,2026-02-25 08:30:13.991000+00:00,[],{'uri': 'gs://sample-data-and-media/knowledge_...
2,"{""chunkedDocument"":{""chunks"":[{""chunkId"":""c1"",...",,gs://sample-data-and-media/knowledge_graph_dem...,1772008213947479,application/pdf,178448,808a181e7a4b3c2f1b826db17054a6fa,2026-02-25 08:30:13.994000+00:00,[],{'uri': 'gs://sample-data-and-media/knowledge_...
3,"{""chunkedDocument"":{""chunks"":[{""chunkId"":""c1"",...",,gs://sample-data-and-media/knowledge_graph_dem...,1772008213981671,application/pdf,138083,68ac2930ba78080e338c56e5667f2a1d,2026-02-25 08:30:14.028000+00:00,[],{'uri': 'gs://sample-data-and-media/knowledge_...
4,"{""chunkedDocument"":{""chunks"":[{""chunkId"":""c1"",...",,gs://sample-data-and-media/knowledge_graph_dem...,1772008213999297,application/pdf,237744,6d224874f6b06a87f4810a2c6d110e5e,2026-02-25 08:30:14.046000+00:00,[],{'uri': 'gs://sample-data-and-media/knowledge_...
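
The `PROCESS_OPTIONS` literal in the query above doubles every brace because it sits inside a Python f-string. As a minimal sketch of an alternative, the options can be kept as a Python dictionary and serialized with `json.dumps`, which avoids the manual escaping. The variable names here are illustrative, and the approach assumes the serialized JSON contains no single quotes, since it is embedded in a single-quoted SQL literal.

```python
import json

# Same chunking options as in the query, expressed as a Python dict.
process_options = {
    "layout_config": {
        "chunking_config": {
            "chunk_size": 250,
            "include_ancestor_headings": True,
        }
    }
}

# json.dumps emits lowercase `true`, as JSON requires.
options_json = json.dumps(process_options)

# Interpolating an already-built string needs no brace doubling.
sql_fragment = f"PROCESS_OPTIONS => (JSON '{options_json}')"
print(sql_fragment)
```

Keeping the options in a dictionary also makes it easier to vary `chunk_size` between runs without editing the SQL text.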


### 🧹 Flatten Results
The output from Document AI is deeply nested. Here, we transform it into a flat, usable format:
1.  **`UNNEST`**: Expands the `chunks` array of each document so that **one row = one text chunk**.
2.  **`JSON_EXTRACT_SCALAR`**: Pulls out specific fields such as `content` (text), `pageSpan` (page range), and `chunkId`.

This gives us a clean table of text snippets ready for the LLM.

In [98]:
query = f"""
CREATE OR REPLACE TABLE `{DATASET_ID}.{PROCESSED_DOCUMENTS_TABLE}` AS
SELECT
 uri,
 JSON_EXTRACT_SCALAR(json , '$.chunkId') AS id,
 JSON_EXTRACT_SCALAR(json , '$.content') AS content,
 JSON_EXTRACT_SCALAR(json , '$.pageFooters[0].text') AS page_footers_text,
 JSON_EXTRACT_SCALAR(json , '$.pageSpan.pageStart') AS page_span_start,
 JSON_EXTRACT_SCALAR(json , '$.pageSpan.pageEnd') AS page_span_end
FROM `{DATASET_ID}.{PROCESSOR_RESULTS_TABLE}`,
UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result.chunkedDocument.chunks, '$')) json
"""

job = client.query(query)

result = job.result()
print(f"Table: {PROCESSED_DOCUMENTS_TABLE} created successfully.")

print(f"Previewing {PROCESSED_DOCUMENTS_TABLE}:")
client.query(f"SELECT * FROM `{DATASET_ID}.{PROCESSED_DOCUMENTS_TABLE}` LIMIT 5").to_dataframe()

Table: processed_documents created successfully.
Previewing processed_documents:


Unnamed: 0,uri,id,content,page_footers_text,page_span_start,page_span_end
0,gs://sample-data-and-media/knowledge_graph_dem...,c14,# 4.0 Troubleshooting\n\n## 4.4 Symptom-Based ...,Page 9,9,10
1,gs://sample-data-and-media/knowledge_graph_dem...,c3,# FROZONE ZENITH\n\n## 2.0 Electronics\n\n### ...,Page 4,4,6
2,gs://sample-data-and-media/knowledge_graph_dem...,c1,# FROZONE 3000\n\n## OFFICIAL SERVICE MANUAL\n...,Page 2,1,6
3,gs://sample-data-and-media/knowledge_graph_dem...,c2,# FROZONE 3000\n\n## 1.0 System Assembly\n\n##...,Page 2,2,4
4,gs://sample-data-and-media/knowledge_graph_dem...,c5,# FROZONE 3000\n\n## 2.0 Electronics\n\n### 2....,Page 5,5,6
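
To make the flattening step concrete, here is a minimal pure-Python sketch of the same idea: one output row per entry in `chunkedDocument.chunks`, with the same six columns as the SQL. The sample payload and the URI are hypothetical and much smaller than a real Document AI result. Unlike `JSON_EXTRACT_SCALAR`, which returns strings, this sketch keeps the page numbers as integers.

```python
import json

# Hypothetical, heavily reduced stand-in for ml_process_document_result.
raw_result = json.dumps({
    "chunkedDocument": {
        "chunks": [
            {
                "chunkId": "c1",
                "content": "# Example Manual",
                "pageFooters": [{"text": "Page 1"}],
                "pageSpan": {"pageStart": 1, "pageEnd": 2},
            },
            {
                "chunkId": "c2",
                "content": "## 1.0 Assembly",
                "pageSpan": {"pageStart": 2, "pageEnd": 2},
            },
        ]
    }
})


def flatten_chunks(uri, result_json):
    """Return one flat dict per chunk, mirroring the SQL column names."""
    document = json.loads(result_json)
    chunks = document.get("chunkedDocument", {}).get("chunks", [])
    rows = []
    for chunk in chunks:
        # A missing footer list yields None, like a NULL in SQL.
        footers = chunk.get("pageFooters") or [{}]
        span = chunk.get("pageSpan", {})
        rows.append({
            "uri": uri,
            "id": chunk.get("chunkId"),
            "content": chunk.get("content"),
            "page_footers_text": footers[0].get("text"),
            "page_span_start": span.get("pageStart"),
            "page_span_end": span.get("pageEnd"),
        })
    return rows


rows = flatten_chunks("gs://example-bucket/manual.pdf", raw_result)
for row in rows:
    print(row["id"], row["page_span_start"], row["page_footers_text"])
```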


## Phase 3: Knowledge Extraction with Gemini


### 💡 Extract Knowledge
We use **Gemini 3.1 Pro** (`AI.GENERATE`) to extract structured data from the text chunks. The query does three key things:
1.  **Prompts** the model to identify specific entities (e.g., Products, Parts, Materials) and relationships (e.g., `MADE_OF`, `CONTAINS_PART`).
2.  **Enforces a Schema**: The `output_schema` argument ensures Gemini returns a strict JSON array of relationships, not free text.
3.  **Flattens**: We `UNNEST` the resulting array to create a standard BigQuery table where each row is a single relationship.

In [99]:
query = f'''
CREATE OR REPLACE TABLE `{DATASET_ID}.{EXTRACTED_KG_TABLE_NAME}` AS
SELECT
  uri,
  r.subject,
  r.subject_entity_type,
  r.relationship,
  r.object,
  r.object_entity_type,
  r.domain,
  r.source_snippet
FROM (
  SELECT
    uri,
    AI.GENERATE(
      -- ARGUMENT 1: The Prompt concatenated with the Content (POSITIONAL)
      """
      You are a technical knowledge graph extractor.
      Your task is to extract a comprehensive list of ALL component relationships from the text.
      You must also identify the ENTITY TYPE for both the subject and object.
      Valid Entity Types: Product, Part, Material, Other.

      ### CRITICAL: HANDLE LISTS EXHAUSTIVELY
      The text often lists multiple items for a single subject.
      You must create a separate entry in the 'relationships' array for EACH item.
      * Example: "Pump is made of Steel" -> 
          {{subject: 'Pump', subject_entity_type: 'Part', relationship: 'MADE_OF', object: 'Steel', object_entity_type: 'Material'}}

      ### RELATIONSHIP TYPES
      * **CONTAINS_PART**: Product -> Part ID
      * **MADE_OF**: Part ID -> Material Name
      * **REQUIRES_FIRMWARE**: Part ID -> Version
      * **CONNECTS_TO**: Part A -> Part B
      * **REQUIRES_PART**: Maintenance Task -> Part ID
      """ || content,

      -- ARGUMENT 2: The Output Schema (NAMED ARGUMENT)
      output_schema => """
        relationships ARRAY<STRUCT<
          subject STRING,
          subject_entity_type STRING,
          relationship STRING,
          object STRING,
          object_entity_type STRING,
          domain STRING,
          source_snippet STRING
        >>
      """,
      endpoint => '{GEMINI_MODEL_VERSION}'
    ) AS extracted_data
  FROM `{DATASET_ID}.{PROCESSED_DOCUMENTS_TABLE}`
),
-- Flatten the array: Turn 1 Document into N Relationship Rows
UNNEST(extracted_data.relationships) AS r
;
'''

job = client.query(query)

result = job.result()
print(f"table: {EXTRACTED_KG_TABLE_NAME} created successfully.")

print(f"Previewing {EXTRACTED_KG_TABLE_NAME}:")
client.query(f"SELECT * FROM `{DATASET_ID}.{EXTRACTED_KG_TABLE_NAME}` LIMIT 5").to_dataframe()

table: extracted_knowledge_graph created successfully.
Previewing extracted_knowledge_graph:


Unnamed: 0,uri,subject,subject_entity_type,relationship,object,object_entity_type,domain,source_snippet
0,gs://sample-data-and-media/knowledge_graph_dem...,HMI-CHILL-10,Part,REQUIRES_FIRMWARE,CHILL-OS-5,Other,Firmware Management,HMI-CHILL-10 must run FIRMWARE CHILL-OS-5.
1,gs://sample-data-and-media/knowledge_graph_dem...,recalibration,Other,REQUIRES_PART,SENS-TEMP-X,Part,Equipment Troubleshooting,point to a fault with or the requirement for r...
2,gs://sample-data-and-media/knowledge_graph_dem...,Arctic Swirl 5000,Product,CONTAINS_PART,DISP-VALVE-PRO,Part,System Assembly,"At the product output point, the Arctic Swirl ..."
3,gs://sample-data-and-media/knowledge_graph_dem...,FroZone Zenith 5000,Product,CONTAINS_PART,GEAR-DRIVE-LITE,Part,Equipment,Safely power down the FroZone Zenith 5000 and ...
4,gs://sample-data-and-media/knowledge_graph_dem...,Replace affected components,Other,REQUIRES_PART,gears,Part,Maintenance and Troubleshooting,"If inspection reveals worn gears, bearings, or..."
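
The `output_schema` fixes the shape of each relationship, but every field is a plain `STRING`, so the list of valid entity types lives only in the prompt. The sketch below shows one possible downstream check, assuming the response is available as a JSON string with a `relationships` array. The function name `parse_relationships` and the sample response are hypothetical, and dropping rows with unknown entity types is a design choice of this sketch, not something the notebook does.

```python
import json

# Entity types listed in the prompt.
VALID_ENTITY_TYPES = {"Product", "Part", "Material", "Other"}

# Field names from the output schema.
FIELDS = (
    "subject", "subject_entity_type", "relationship",
    "object", "object_entity_type", "domain", "source_snippet",
)


def parse_relationships(response_json):
    """Flatten a response into rows, skipping unknown entity types."""
    payload = json.loads(response_json)
    rows = []
    for item in payload.get("relationships", []):
        row = {name: item.get(name) for name in FIELDS}
        if row["subject_entity_type"] not in VALID_ENTITY_TYPES:
            continue
        if row["object_entity_type"] not in VALID_ENTITY_TYPES:
            continue
        rows.append(row)
    return rows


# Hypothetical model response, for illustration only.
sample_response = json.dumps({"relationships": [
    {"subject": "Pump", "subject_entity_type": "Part",
     "relationship": "MADE_OF", "object": "Steel",
     "object_entity_type": "Material"},
    {"subject": "HMI-CHILL-10", "subject_entity_type": "Part",
     "relationship": "REQUIRES_FIRMWARE", "object": "CHILL-OS-5",
     "object_entity_type": "Other"},
    {"subject": "Pump", "subject_entity_type": "Widget",
     "relationship": "CONNECTS_TO", "object": "Valve",
     "object_entity_type": "Part"},
]})

relationship_rows = parse_relationships(sample_response)
for row in relationship_rows:
    print(row["subject"], row["relationship"], row["object"])
```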


## Phase 4: Knowledge Graph Construction


### A. Create Node Tables
Nodes are the **entities** in our graph. We create two specific node tables:
*   **`part_nodes`**: Derived by finding all unique subjects/objects labeled as 'Part'.
*   **`material_nodes`**: Derived by finding all unique objects labeled as 'Material'.
We use `UNION DISTINCT` for the part nodes because a part can appear as either a subject or an object; the union yields one row per unique part across the entire extracted dataset.

In [12]:
query_create_nodes = f'''
CREATE OR REPLACE TABLE `{DATASET_ID}.part_nodes`
AS
SELECT DISTINCT part_id AS part_name, 'Part' AS type
FROM
  (
    SELECT subject AS part_id
    FROM `{DATASET_ID}.extracted_knowledge_graph`
    WHERE subject_entity_type = 'Part'
    UNION DISTINCT
    SELECT object AS part_id
    FROM `{DATASET_ID}.extracted_knowledge_graph`
    WHERE object_entity_type = 'Part'
  )
WHERE part_id IS NOT NULL;

CREATE OR REPLACE TABLE `{DATASET_ID}.material_nodes`
AS
SELECT DISTINCT object AS material_name, 'Material' AS type
FROM `{DATASET_ID}.extracted_knowledge_graph`
WHERE object_entity_type = 'Material' AND object IS NOT NULL;
'''

client.query(query_create_nodes).result()
print("Node tables created successfully.")

print("Part Nodes:")
display(client.query(f"SELECT * FROM `{DATASET_ID}.part_nodes` LIMIT 5").to_dataframe())
print("\nMaterial Nodes:")
display(client.query(f"SELECT * FROM `{DATASET_ID}.material_nodes` LIMIT 5").to_dataframe())

Node tables created successfully.
Part Nodes:


Unnamed: 0,part_name,type
0,mixing blades,Part
1,power cables,Part
2,cooling system,Part
3,Main power switch,Part
4,cables,Part



Material Nodes:


Unnamed: 0,material_name,type
0,Nylon 66,Material
1,Hardened Steel,Material
2,Gold-Plated Contacts,Material
3,Food-Grade Silicone,Material
4,Epoxy Resin,Material
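
The node derivation can be sketched in plain Python with set operations: a set union plays the role of `UNION DISTINCT`, and a set comprehension plays the role of `SELECT DISTINCT`. The sample rows below reuse a few names from the outputs above, but the list itself is a hypothetical stand-in for the extracted table.

```python
# Hypothetical sample of extracted relationship rows.
extracted_rows = [
    {"subject": "Arctic Swirl 5000", "subject_entity_type": "Product",
     "object": "CRYO-PUMP-9", "object_entity_type": "Part"},
    {"subject": "CRYO-PUMP-9", "subject_entity_type": "Part",
     "object": "Ceramic", "object_entity_type": "Material"},
    {"subject": "AUGER-SS-3", "subject_entity_type": "Part",
     "object": "Stainless Steel 316", "object_entity_type": "Material"},
    {"subject": "HMI-CHILL-10", "subject_entity_type": "Part",
     "object": "CHILL-OS-5", "object_entity_type": "Other"},
]


def part_nodes(rows):
    """Unique parts seen as a subject or as an object."""
    as_subject = {r["subject"] for r in rows
                  if r["subject_entity_type"] == "Part"}
    as_object = {r["object"] for r in rows
                 if r["object_entity_type"] == "Part"}
    # Set union removes duplicates, like UNION DISTINCT.
    return sorted(name for name in as_subject | as_object
                  if name is not None)


def material_nodes(rows):
    """Unique materials, which only ever appear as objects."""
    return sorted({r["object"] for r in rows
                   if r["object_entity_type"] == "Material"
                   and r["object"] is not None})


print(part_nodes(extracted_rows))
print(material_nodes(extracted_rows))
```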


### B. Create Edge Tables
Edges represent the **connections** between our nodes. We create a separate table for each relationship type:
*   **`edges_product_contains`**: Links `Products` to `Parts` (e.g., 'Toaster' contains 'Heating Element').
*   **`edges_part_material`**: Links `Parts` to `Materials` (e.g., 'Heating Element' is made of 'Nichrome').
*   **`edges_purchase_orders`**: Links `Customers` to `Products` by joining `purchase_orders` with `products` on `product_id`.

These tables act as the 'join tables' for our Property Graph.

In [13]:
query_create_edges = f'''
CREATE OR REPLACE TABLE `{DATASET_ID}.edges_product_contains`
AS
SELECT DISTINCT
  p.product_name AS product_name,
  ekg.object AS part_name
FROM `{DATASET_ID}.extracted_knowledge_graph` ekg
JOIN `{DATASET_ID}.products` p
  ON ekg.subject = p.product_name
WHERE ekg.relationship = 'CONTAINS_PART';

CREATE OR REPLACE TABLE `{DATASET_ID}.edges_part_material`
AS
SELECT DISTINCT
  subject AS part_name,
  object AS material_name
FROM `{DATASET_ID}.extracted_knowledge_graph`
WHERE relationship = 'MADE_OF';

CREATE OR REPLACE TABLE `{DATASET_ID}.edges_purchase_orders`
AS
SELECT DISTINCT
  po.customer_id,
  p.product_name
FROM `{DATASET_ID}.purchase_orders` po
JOIN `{DATASET_ID}.products` p
  ON po.product_id = p.product_id;
'''

client.query(query_create_edges).result()
print("Edge tables created successfully.")

print("Product -> Part Edges:")
display(client.query(f"SELECT * FROM `{DATASET_ID}.edges_product_contains` LIMIT 5").to_dataframe())
print("\nPart -> Material Edges:")
display(client.query(f"SELECT * FROM `{DATASET_ID}.edges_part_material` LIMIT 5").to_dataframe())
print("\nProduct -> Customer Edges:")
display(client.query(f"SELECT * FROM `{DATASET_ID}.edges_purchase_orders` LIMIT 5").to_dataframe())

Edge tables created successfully.
Product -> Part Edges:


Unnamed: 0,product_name,part_name
0,AUGER-SS-3,auger structure
1,AUGER-SS-3,mixing blades
2,AUGER-SS-3,blade
3,Arctic Swirl 5000,CRYO-PUMP-9
4,Arctic Swirl 5000,SHROUD-EXT-V2



Part -> Material Edges:


Unnamed: 0,part_name,material_name
0,AUGER-SS-3,Food-Grade Silicone
1,AUGER-SS-3,Stainless Steel 316
2,CRYO-PUMP-9,High-Density Polyethylene
3,CRYO-PUMP-9,Fiberglass Reinforced Plastic
4,CRYO-PUMP-9,Ceramic



Product -> Customer Edges:


Unnamed: 0,customer_id,product_name
0,C007,AUGER-SS-3
1,C003,ArcticSwirl 9000
2,C006,ArcticSwirl 9000
3,C007,FROZONE ELITE 7000
4,C002,FROZONE ELITE 7000
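
The three edge tables can be sketched in plain Python as filtered, de-duplicated pairs. The sample rows and the product IDs below are hypothetical. The sketch assumes that `product_id` is unique in `products`, and it makes one property of the SQL visible: the join on `product_name` is an exact string match, so an extracted subject that differs only in casing or spacing produces no edge.

```python
# Hypothetical sample data standing in for the BigQuery tables.
extracted = [
    {"subject": "Arctic Swirl 5000", "relationship": "CONTAINS_PART",
     "object": "CRYO-PUMP-9"},
    {"subject": "Arctic Swirl 5000", "relationship": "CONTAINS_PART",
     "object": "CRYO-PUMP-9"},  # same fact extracted twice
    {"subject": "arctic swirl 5000", "relationship": "CONTAINS_PART",
     "object": "SHROUD-EXT-V2"},  # casing differs from the product table
    {"subject": "CRYO-PUMP-9", "relationship": "MADE_OF",
     "object": "Ceramic"},
]
products = [{"product_id": "P-001", "product_name": "Arctic Swirl 5000"}]
purchase_orders = [
    {"customer_id": "C003", "product_id": "P-001"},
    {"customer_id": "C003", "product_id": "P-999"},  # unknown product
]


def product_contains_edges(extracted, products):
    """(product, part) pairs whose subject exactly matches a product."""
    known = {p["product_name"] for p in products}
    return sorted({(r["subject"], r["object"]) for r in extracted
                   if r["relationship"] == "CONTAINS_PART"
                   and r["subject"] in known})


def part_material_edges(extracted):
    """(part, material) pairs from MADE_OF relationships."""
    return sorted({(r["subject"], r["object"]) for r in extracted
                   if r["relationship"] == "MADE_OF"})


def purchase_edges(purchase_orders, products):
    """(customer, product name) pairs via a lookup on product_id."""
    name_by_id = {p["product_id"]: p["product_name"] for p in products}
    return sorted({(po["customer_id"], name_by_id[po["product_id"]])
                   for po in purchase_orders
                   if po["product_id"] in name_by_id})


print(product_contains_edges(extracted, products))
print(part_material_edges(extracted))
print(purchase_edges(purchase_orders, products))
```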


### Define Property Graph
Here we officially define the **Property Graph** schema (`CREATE PROPERTY GRAPH`). This statement maps our relational BigQuery tables to graph concepts:
*   **NODE TABLES**: We define `Customer`, `Product`, `Part`, and `Material` nodes, each identified by a key column (e.g., `customer_id` for customers, `product_name` for products).


*   **EDGE TABLES**: We define the connections. For each edge (e.g., `product_parts`), we specify:
    *   **SOURCE KEY**: Where the edge starts (e.g., `product_node`).
    *   **DESTINATION KEY**: Where the edge ends (e.g., `part_node`).
    *   **LABEL**: A semantic name for the relationship (e.g., `CONTAINS_PART`).

In [None]:
# Define the Property Graph
query_create_graph = f'''
CREATE OR REPLACE PROPERTY GRAPH `{DATASET_ID}.{GRAPH_TABLE_NAME}`
  NODE TABLES (
    `{DATASET_ID}.customers` AS customer_node KEY (customer_id) LABEL Customer PROPERTIES (company),
    `{DATASET_ID}.products` AS product_node KEY (product_name) LABEL Product,
    `{DATASET_ID}.part_nodes` AS part_node KEY (part_name) LABEL Part,
    `{DATASET_ID}.material_nodes` AS material_node KEY (material_name) LABEL Material
  )
  EDGE TABLES (
    `{DATASET_ID}.edges_purchase_orders` AS customer_purchases KEY (customer_id, product_name)
      SOURCE KEY (customer_id) REFERENCES customer_node
      DESTINATION KEY (product_name) REFERENCES product_node
      LABEL PURCHASED,

    `{DATASET_ID}.edges_product_contains` AS product_parts KEY (product_name, part_name)
      SOURCE KEY (product_name) REFERENCES product_node
      DESTINATION KEY (part_name) REFERENCES part_node
      LABEL CONTAINS_PART,

    `{DATASET_ID}.edges_part_material` AS part_materials KEY (part_name, material_name)
      SOURCE KEY (part_name) REFERENCES part_node
      DESTINATION KEY (material_name) REFERENCES material_node
      LABEL IS_MADE_OF
  );
'''

client.query(query_create_graph).result()
print("Property Graph created successfully.")

Property Graph created successfully.
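
A graph element key is meant to identify each row of its table, and every edge endpoint should refer to an existing node. As a hedged sketch, the two helpers below show how one might check both properties on in-memory rows before defining the graph. The function names are hypothetical, the edge rows reuse names from the outputs above, and the set of part keys is made up so that one edge is deliberately dangling.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return key tuples that occur in more than one row."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def dangling_edges(edges, column, node_keys):
    """Return edges whose endpoint in `column` has no matching node."""
    return [edge for edge in edges if edge[column] not in node_keys]


# Sample edge rows (names taken from the preview output above).
contains_edges = [
    {"product_name": "AUGER-SS-3", "part_name": "auger structure"},
    {"product_name": "AUGER-SS-3", "part_name": "mixing blades"},
    {"product_name": "Arctic Swirl 5000", "part_name": "CRYO-PUMP-9"},
]
# Hypothetical node keys; "auger structure" is left out on purpose.
part_keys = {"mixing blades", "CRYO-PUMP-9"}

print(duplicate_keys(contains_edges, ["product_name"]))
print(duplicate_keys(contains_edges, ["product_name", "part_name"]))
print(dangling_edges(contains_edges, "part_name", part_keys))
```

In this sample, `product_name` alone repeats for `AUGER-SS-3`, while the pair of `product_name` and `part_name` is unique for every row.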


## 🎨 Phase 5: Visualize your BigQuery Graph!
Now we can query the graph using **GQL (Graph Query Language)**. The query below matches a specific pattern:

*   **`MATCH (source)-[r]->(target)`**: This finds *any* two nodes connected by *any* relationship.

We return the `Source Node`, the `Edge` properties, and the `Target Node` as JSON objects. BigQuery Notebooks will automatically render this JSON result as an interactive graph visualization.

In [None]:
%%bigquery --graph display_only
GRAPH `kg_demo.manufacturing_kg`
MATCH (source)-[r]->(target)
RETURN
  TO_JSON(source) AS Source_Node,
  TO_JSON(r) AS Edge,
  TO_JSON(target) AS Target_Node
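
To make the pattern concrete, here is a minimal in-memory sketch of what `MATCH (source)-[r]->(target)` enumerates: every directed edge, together with its two endpoints. The edge triples are hypothetical sample data, the function name `match_edges` is made up, and the optional label filter is an illustration of a labeled pattern such as `-[r:PURCHASED]->`. This is not how BigQuery evaluates the query.

```python
# Hypothetical (source, label, target) triples.
graph_edges = [
    ("C003", "PURCHASED", "Arctic Swirl 5000"),
    ("Arctic Swirl 5000", "CONTAINS_PART", "CRYO-PUMP-9"),
    ("CRYO-PUMP-9", "IS_MADE_OF", "Ceramic"),
]


def match_edges(edges, label=None):
    """Return one result row per edge, optionally filtered by label."""
    return [
        {"Source_Node": source, "Edge": edge_label, "Target_Node": target}
        for source, edge_label, target in edges
        if label is None or edge_label == label
    ]


all_matches = match_edges(graph_edges)
purchases = match_edges(graph_edges, label="PURCHASED")
for match in all_matches:
    print(match)
```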


Let's see how products, parts and materials are related:

In [None]:
%%bigquery --graph display_only

GRAPH `kg_demo.manufacturing_kg`
MATCH  (p:Product)-[e]->(pt:Part)-[c]->(m:Material)
RETURN
  TO_JSON(p) AS product,
  TO_JSON(e) AS contains_part,
  TO_JSON(pt) AS part,
  TO_JSON(c) AS made_of,
  TO_JSON(m) AS material

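Now let's run a query to find all customers that have purchased products with parts made of fiberglass. Note that the property filter `{material_name:"Fiberglass"}` is an exact match, so the value must equal the stored `material_name` exactly.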

In [None]:
%%bigquery --graph display_only
GRAPH `kg_demo.manufacturing_kg`
MATCH  (c:Customer)-[r]->(p:Product)-[e]->(pt:Part)-[f]->(m:Material {material_name:"Fiberglass"})
RETURN
TO_JSON(c) AS customer,
TO_JSON(r) AS purchased,
TO_JSON(p) AS product,
TO_JSON(e) AS contains_part,
TO_JSON(pt) AS part,
TO_JSON(f) AS made_of,
TO_JSON(m) AS material
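
The four-node pattern above can be read as three chained lookups, walked backwards from the material. The sketch below does this over plain Python pairs. The purchase pair is hypothetical, the function name `customers_for_material` is made up, and the traversal order is a choice of this sketch rather than a description of BigQuery's query plan. It also illustrates that an exact-match filter on `"Fiberglass"` does not match a material stored as `"Fiberglass Reinforced Plastic"`.

```python
# Hypothetical edge pairs standing in for the three edge tables.
purchased = [("C003", "Arctic Swirl 5000")]
contains = [("Arctic Swirl 5000", "CRYO-PUMP-9")]
made_of = [
    ("CRYO-PUMP-9", "Fiberglass Reinforced Plastic"),
    ("AUGER-SS-3", "Stainless Steel 316"),
]


def customers_for_material(purchased, contains, made_of, material_name):
    """Customers who bought a product containing a part of the material."""
    # Step 1: parts made of exactly this material.
    parts = {part for part, material in made_of
             if material == material_name}
    # Step 2: products that contain any of those parts.
    products = {product for product, part in contains if part in parts}
    # Step 3: customers who purchased any of those products.
    return sorted({customer for customer, product in purchased
                   if product in products})


print(customers_for_material(purchased, contains, made_of, "Fiberglass"))
print(customers_for_material(
    purchased, contains, made_of, "Fiberglass Reinforced Plastic"))
```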
