# Causal Search Demo - Query and Context Analysis

This notebook demonstrates how to use the Causal Search method in GraphRAG and inspect the context used to generate responses. Causal Search performs causal analysis on knowledge graphs through a two-stage process:

1. **Stage 1**: Extract extended graph information (k + s nodes) and generate a causal analysis report
2. **Stage 2**: Use the causal report to generate the final response to the user query
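
The two stages above can be outlined in a few lines. This is only a sketch of the control flow, not GraphRAG's implementation: the function `two_stage_causal_search` and its three callables are hypothetical stand-ins for the retriever and the two LLM calls.

```python
from typing import Callable, List


def two_stage_causal_search(
    query: str,
    extract_nodes: Callable[[str, int], List[str]],
    write_report: Callable[[str, List[str]], str],
    answer: Callable[[str, str], str],
    k: int = 3,
    s: int = 3,
) -> str:
    """Hypothetical outline of the two stages (all names are illustrative)."""
    # Stage 1: retrieve the extended neighbourhood (k + s nodes) and summarise it.
    nodes = extract_nodes(query, k + s)
    report = write_report(query, nodes)
    # Stage 2: answer the user's question from the causal report.
    return answer(query, report)


# Stand-in callables; a real run would call the graph retriever and the chat model.
demo_answer = two_stage_causal_search(
    "Why did revenue fall?",
    extract_nodes=lambda q, n: [f"node{i}" for i in range(n)],
    write_report=lambda q, nodes: f"report over {len(nodes)} nodes",
    answer=lambda q, report: f"answer based on: {report}",
)
print(demo_answer)  # prints: answer based on: report over 6 nodes
```

Passing the stages in as callables keeps the outline testable without a model or an index.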

## Key Features

- Extended node extraction beyond local search limits
- Two-stage processing for comprehensive causal analysis
- Automatic output saving to data folders
- Configurable parameters for retrieval breadth and context proportions
- Integration with existing GraphRAG pipeline
- **Context inspection**: See exactly what data was used to generate responses

## Prerequisites

Before running this notebook, ensure you have:

1. Run the GraphRAG indexing pipeline to generate entities, relationships, and community reports
2. Set up your configuration in `settings.yaml` with causal search parameters
3. Configured your language models and API keys
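
A quick way to confirm the first prerequisite is to check that the indexing outputs exist before loading anything. The helper below is a minimal sketch: `missing_index_tables` is a hypothetical name, and the table list simply mirrors the parquet files this notebook reads later.

```python
from pathlib import Path
from typing import Iterable, List

# Table names as read later in this notebook; adjust if your index differs.
REQUIRED_TABLES = (
    "entities",
    "communities",
    "relationships",
    "community_reports",
    "text_units",
)


def missing_index_tables(output_dir, tables: Iterable[str] = REQUIRED_TABLES) -> List[str]:
    """Return the expected parquet tables that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in tables if not (out / f"{name}.parquet").is_file()]
```

For example, `missing_index_tables(ROOT_DIR / "output")` returns an empty list when every table is present. The `covariates` table is left out because the notebook treats it as optional.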

In [1]:
import os
import asyncio
import json
import pandas as pd
import tiktoken
from pathlib import Path
from typing import Any, Dict, List

# GraphRAG imports
from graphrag.config.enums import ModelType
from graphrag.config.load_config import load_config
from graphrag.config.models.language_model_config import LanguageModelConfig
from graphrag.language_model.manager import ModelManager
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.factory import get_causal_search_engine
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.structured_search.causal_search.search import (
    CausalSearch,
    CausalSearchError,
)
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.vector_stores.lancedb import LanceDBVectorStore

# IPython display utilities
from IPython.display import Markdown, display

## Configuration Setup

First, let's load the GraphRAG configuration and set up the environment.

In [2]:
# Configuration setup
ROOT_DIR = Path("/home/chuaxu/projects/graphrag/ragsas")  # Adjust this path to your project root
CONFIG_FILE = None  # Use default settings.yaml

# Load configuration
try:
    config = load_config(ROOT_DIR, CONFIG_FILE)
    print("✅ Configuration loaded successfully")
    print(f"📁 Root directory: {ROOT_DIR}")
    print(f"🔧 Causal search s_parameter: {config.causal_search.s_parameter}")
    print(f"🔧 Causal search top_k_entities: {config.causal_search.top_k_mapped_entities}")
    print(f"🔧 Causal search max_context_tokens: {config.causal_search.max_context_tokens}")
except Exception as e:
    print(f"❌ Failed to load configuration: {e}")
    raise

✅ Configuration loaded successfully
📁 Root directory: /home/chuaxu/projects/graphrag/ragsas
🔧 Causal search s_parameter: 3
🔧 Causal search top_k_entities: 10
🔧 Causal search max_context_tokens: 100000


## Data Loading

Load the required data from your GraphRAG pipeline outputs using the same functions as the visualization notebook.

In [3]:
# Data loading setup
INPUT_DIR = f"{ROOT_DIR}/output"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"

COMMUNITY_REPORT_TABLE = "community_reports"
COMMUNITY_TABLE = "communities"
ENTITY_TABLE = "entities"
RELATIONSHIP_TABLE = "relationships"
COVARIATE_TABLE = "covariates"
TEXT_UNIT_TABLE = "text_units"
COMMUNITY_LEVEL = 2

### Load tables to dataframes

#### Read entities

In [4]:
# Read the entities and communities tables
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
community_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet")

print(f"✅ Loaded {len(entity_df)} entities")
print(f"✅ Loaded {len(community_df)} communities")

✅ Loaded 3738 entities
✅ Loaded 507 communities


#### Read relationships

In [5]:
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

print(f"✅ Loaded {len(relationship_df)} relationships")

✅ Loaded 3917 relationships


#### Read other data tables

In [6]:
# Load text units
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)

# Load community reports
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, community_df, COMMUNITY_LEVEL)

# Load covariates if they exist
try:
    covariate_df = pd.read_parquet(f"{INPUT_DIR}/{COVARIATE_TABLE}.parquet")
    claims = read_indexer_covariates(covariate_df)
    covariates = {"claims": claims}
    print(f"✅ Loaded {len(claims)} covariates")
except FileNotFoundError:
    print("ℹ️  No covariates found, proceeding without covariates")
    covariates = {}

print(f"✅ Loaded {len(text_units)} text units")
print(f"✅ Loaded {len(reports)} community reports")

ℹ️  No covariates found, proceeding without covariates
✅ Loaded 995 text units
✅ Loaded 484 community reports


## Model Setup

Set up the language models and context builder using the same approach as the visualization notebook.

In [7]:
# Get model configurations from the loaded config
chat_model_config = config.get_language_model_config("default_chat_model")
embedding_model_config = config.get_language_model_config("default_embedding_model")

# Create chat model
chat_model = ModelManager().get_or_create_chat_model(
    name="causal_search",
    model_type=chat_model_config.type,
    config=chat_model_config,
)

# Create token encoder
token_encoder = tiktoken.encoding_for_model(chat_model_config.model)

# Create embedding model
text_embedder = ModelManager().get_or_create_embedding_model(
    name="causal_search_embedding",
    model_type=embedding_model_config.type,
    config=embedding_model_config,
)

# Create vector store
description_embedding_store = LanceDBVectorStore(
    collection_name="default-entity-description",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)

print("✅ Models and vector store setup complete")

✅ Models and vector store setup complete


## Context Builder Setup

Create the context builder using the same parameters as the visualization notebook.

In [8]:
entities = read_indexer_entities(entity_df, community_df, COMMUNITY_LEVEL)

# Context builder parameters (same as visualization notebook)
context_builder_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.25,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 3,  # k in the search log; settings.yaml reported 10 above
    "top_k_relationships": 3,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,
    "max_tokens": 80_000,
    # Output control parameters
    "save_network_data": True,   # Whether to save extracted network data to files
    "save_causal_report": True,  # Whether to save causal analysis report to files
    "output_folder": "causal_search",  # Subfolder under output_base_dir for causal search outputs
    "output_base_dir": str(ROOT_DIR / "output"),  # Base output directory
}

# Create context builder
context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    covariates=covariates,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

print("✅ Context builder setup complete")

✅ Context builder setup complete
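
To make the proportions concrete, here is how an 80,000-token budget would divide under `text_unit_prop=0.5` and `community_prop=0.25`. This is a sketch under one assumption: that the remainder goes to the entity and relationship tables, as in GraphRAG's local search. The function name is illustrative and the library's exact rounding may differ.

```python
def split_token_budget(max_tokens: int, text_unit_prop: float, community_prop: float) -> dict:
    """Split a context budget into text-unit, community-report and local shares."""
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must not exceed 1")
    text_unit_tokens = int(max_tokens * text_unit_prop)
    community_tokens = int(max_tokens * community_prop)
    # Assumption: whatever is left is available for entities and relationships.
    local_tokens = max_tokens - text_unit_tokens - community_tokens
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        "local": local_tokens,
    }


budget = split_token_budget(80_000, text_unit_prop=0.5, community_prop=0.25)
print(budget)  # prints: {'text_units': 40000, 'community_reports': 20000, 'local': 20000}
```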


## Causal Search Engine Setup

Create the causal search engine with the same model parameters as the visualization notebook.

In [9]:
# Model parameters (same as visualization notebook)
model_params = {
    "max_tokens": 16_384,
    "temperature": 0.0,
}

# Create causal search engine directly (not using factory function)
causal_search_engine = CausalSearch(
    model=chat_model,
    context_builder=context_builder,
    token_encoder=token_encoder,
    model_params=model_params,
    context_builder_params=context_builder_params,
    s_parameter=3,  # Additional nodes for causal analysis
    max_context_tokens=80_000,
)

print("✅ Causal search engine setup complete")

2025-08-18 11:07:32.0623 - INFO - graphrag.query.structured_search.causal_search.search - Loaded causal discovery prompt from GraphRAG prompts
2025-08-18 11:07:32.0626 - INFO - graphrag.query.structured_search.causal_search.search - Loaded causal summary prompt from GraphRAG prompts
✅ Causal search engine setup complete
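
Note that the cell above hard-codes `max_context_tokens=80_000`, while the loaded configuration reported `100000`. If the intent is to follow `settings.yaml`, the values can be read from the config object instead. The sketch below assumes only the two attributes printed in the configuration cell; `causal_engine_kwargs` is a hypothetical helper, and `demo_config` is a stand-in for the real config.

```python
from types import SimpleNamespace


def causal_engine_kwargs(config) -> dict:
    """Collect causal-search settings from a loaded config (attribute names as printed above)."""
    causal = config.causal_search
    return {
        "s_parameter": causal.s_parameter,
        "max_context_tokens": causal.max_context_tokens,
    }


# Stand-in config mirroring the values printed by the configuration cell.
demo_config = SimpleNamespace(
    causal_search=SimpleNamespace(s_parameter=3, max_context_tokens=100_000)
)
print(causal_engine_kwargs(demo_config))  # prints: {'s_parameter': 3, 'max_context_tokens': 100000}
```

With the real `config`, the result could be unpacked into the constructor, for example `CausalSearch(..., **causal_engine_kwargs(config))`, in place of the literal values.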


## Causal Search Example

Now let's run a causal search query and inspect the context used to generate the response.

### Run causal search on a sample query

In [10]:
# Sample query for causal analysis
question = "How can we use SAS Econometric products to help analyze the impact of different pricing strategies on business revenue?"
print(f"🔍 Query: {question}")

# Execute causal search
try:
    result = await causal_search_engine.search(question)
    print("✅ Causal search completed successfully!")
except Exception as e:
    print(f"❌ Causal search failed: {e}")
    raise

🔍 Query: How can we use SAS Econometric products to help analyze the impact of different pricing strategies on business revenue?
2025-08-18 11:07:32.0637 - INFO - graphrag.query.structured_search.causal_search.search - 🚀 Starting causal search for query: 'How can we use SAS Econometric products to help analyze the impact of different pricing strategies on business revenue?'
2025-08-18 11:07:32.0640 - INFO - graphrag.query.structured_search.causal_search.search - 📊 Parameters: s_parameter=3, max_context_tokens=80000
2025-08-18 11:07:32.0641 - INFO - graphrag.query.structured_search.causal_search.search - 🔍 Step 1: Extracting extended nodes with k=3, s=3
2025-08-18 11:07:32.0642 - INFO - graphrag.query.structured_search.causal_search.search - Requesting 12 nodes: (k=3 + s=3) * 2
2025-08-18 11:07:33.0082 - INFO - graphrag.query.structured_search.causal_search.search - ✅ Extracted 7 extended nodes (requested: 12)
2025-08-18 11:07:33.0083 - INFO - graphrag.query.structured_search.causal_sea
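
The log above reports how many nodes were requested and how many were actually extracted. To track that ratio across queries, the line can be parsed with a regular expression. This is a sketch that assumes the log wording stays exactly as shown; `node_yield` is an illustrative name.

```python
import re

# Pattern matches the "Extracted N extended nodes (requested: M)" log line shown above.
EXTRACTED_RE = re.compile(r"Extracted (\d+) extended nodes \(requested: (\d+)\)")


def node_yield(log_text: str):
    """Return (extracted, requested, ratio) from the log text, or None if no line matches."""
    match = EXTRACTED_RE.search(log_text)
    if match is None:
        return None
    extracted, requested = int(match.group(1)), int(match.group(2))
    ratio = extracted / requested if requested else 0.0
    return extracted, requested, ratio


sample_log = (
    "2025-08-18 11:07:33.0082 - INFO - "
    "graphrag.query.structured_search.causal_search.search - "
    "✅ Extracted 7 extended nodes (requested: 12)"
)
extracted, requested, ratio = node_yield(sample_log)
print(f"{extracted}/{requested} nodes extracted ({ratio:.0%})")  # prints: 7/12 nodes extracted (58%)
```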

### Display the response

In [11]:
# Display as formatted Markdown
print("\n📝 Causal Search Response:")
print("=" * 50)
display(Markdown(result.response))


📝 Causal Search Response:


# Analyzing the Impact of Pricing Strategies on Business Revenue Using SAS Econometric Products

## Introduction

SAS Econometric products offer a comprehensive suite of tools designed to enhance data analysis capabilities, particularly in econometric modeling and statistical analysis. These tools can be instrumental in analyzing the impact of different pricing strategies on business revenue. By leveraging procedures such as PROC SEVSELECT and PROC CCDM, businesses can gain insights into how pricing adjustments may affect their financial outcomes.

## Key Procedures and Their Applications

### PROC SEVSELECT

**Role and Functionality**: PROC SEVSELECT is pivotal in fitting and selecting severity distribution models. It is particularly adept at handling data with censoring and truncation effects, which are common in revenue data where certain transactions may be incomplete or partially observed.

**Application in Pricing Strategy Analysis**: By using PROC SEVSELECT, businesses can model the distribution of revenue outcomes under different pricing scenarios. This procedure allows for the optimization of regression effects, which can help in understanding how changes in pricing influence revenue distribution. The strong causal link between SAS and PROC SEVSELECT, with a weight of 34.0, underscores its importance in econometric analyses.

### PROC CCDM

**Role and Functionality**: PROC CCDM is designed for simulating aggregate loss distributions using Monte Carlo methods. It is essential for risk management and insurance modeling, providing tools for parameter perturbation analysis and severity adjustment.

**Application in Pricing Strategy Analysis**: In the context of pricing strategies, PROC CCDM can simulate the potential revenue outcomes under various pricing models. This simulation capability is crucial for assessing the risk and variability associated with different pricing strategies. The causal link between SAS and PROC CCDM, with a weight of 8.0, highlights its role in comprehensive statistical modeling.

## Implications for Business Revenue Analysis

The use of SAS Econometric products in analyzing pricing strategies can significantly enhance decision-making processes. By modeling and simulating revenue outcomes, businesses can:

- **Optimize Pricing Models**: Use PROC SEVSELECT to identify the most effective pricing strategies that maximize revenue while accounting for data complexities such as censoring and truncation.

- **Assess Risk and Variability**: Leverage PROC CCDM to simulate revenue outcomes and assess the risk associated with different pricing strategies, enabling more informed decision-making.

## Recommendations

To effectively analyze the impact of pricing strategies on business revenue, organizations should:

1. **Integrate PROC SEVSELECT and PROC CCDM** into their analytical workflows to optimize data management and enhance decision-making processes.
2. **Leverage the robust capabilities** of these procedures to model and simulate revenue outcomes, providing insights into the potential impacts of pricing adjustments.
3. **Utilize the insights gained** from these analyses to refine pricing strategies, ultimately driving business growth and revenue optimization.

## Conclusion

SAS Econometric products, with their robust procedures like PROC SEVSELECT and PROC CCDM, offer powerful tools for analyzing the impact of pricing strategies on business revenue. By providing the flexibility and precision needed to address complex analytical challenges, these tools enable businesses to make data-driven decisions that enhance their financial performance.
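
To see what data backed this response, one option is to count the records in each context table. The sketch below assumes the result object exposes a `context_data` mapping, as other GraphRAG search results do. Whether `CausalSearch` populates it is not shown in this notebook, so the example runs on stand-in data.

```python
from typing import Any, Dict, Mapping


def summarize_context(context_data: Mapping[str, Any]) -> Dict[str, int]:
    """Count records per context table; strings and unsized objects count as one record."""
    summary: Dict[str, int] = {}
    for name, table in context_data.items():
        if isinstance(table, str):
            summary[name] = 1
            continue
        try:
            summary[name] = len(table)  # works for lists and DataFrames (row count)
        except TypeError:
            summary[name] = 1
    return summary


# Stand-in data using entity names from the response above.
demo_context = {
    "entities": ["SAS", "PROC SEVSELECT", "PROC CCDM"],
    "relationships": [("SAS", "PROC SEVSELECT"), ("SAS", "PROC CCDM")],
    "report": "Causal Analysis Report ...",
}
print(summarize_context(demo_context))  # prints: {'entities': 3, 'relationships': 2, 'report': 1}
```

If `result.context_data` is available in your build, `summarize_context(result.context_data)` would give the same kind of overview for the real run.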

### Display the generated causal search report

In [12]:
# Find and display the most recent causal search report
import glob
import os

def find_latest_causal_report():
    """Find the most recent causal search report file."""
    # Look for causal search reports in the output directory
    report_pattern = f"{ROOT_DIR}/output/causal_search/causal_search_report_*.md"
    report_files = glob.glob(report_pattern)
    
    if not report_files:
        print("❌ No causal search reports found")
        return None
    
    # Sort by modification time to get the most recent
    latest_report = max(report_files, key=os.path.getmtime)
    return latest_report

def display_causal_report(report_path):
    """Display the causal search report as formatted markdown."""
    if not report_path or not os.path.exists(report_path):
        print("❌ Report file not found")
        return
    
    print(f"📄 Displaying report: {os.path.basename(report_path)}")
    print(f"📁 Full path: {report_path}")
    print("=" * 50)
    
    try:
        with open(report_path, 'r', encoding='utf-8') as f:
            report_content = f.read()
        
        # Display as formatted Markdown
        display(Markdown(report_content))
        
        # Also show some metadata
        file_size = os.path.getsize(report_path)
        mod_time = os.path.getmtime(report_path)
        # pd.to_datetime(..., unit='s') treats the epoch value as UTC,
        # so the time printed below is UTC rather than local time
        mod_time_str = pd.to_datetime(mod_time, unit='s').strftime('%Y-%m-%d %H:%M:%S')
        
        print("\n📊 Report metadata:")
        print(f"   - File size: {file_size:,} bytes")
        print(f"   - Last modified: {mod_time_str}")
        
    except Exception as e:
        print(f"❌ Error reading report: {e}")

# Find and display the latest report
latest_report_path = find_latest_causal_report()
if latest_report_path:
    display_causal_report(latest_report_path)
else:
    print("ℹ️  No causal search reports found. Run a causal search query first.")

📄 Displaying report: causal_search_report_a11a03db_1755529664.md
📁 Full path: /home/chuaxu/projects/graphrag/ragsas/output/causal_search/causal_search_report_a11a03db_1755529664.md


# Causal Analysis Report

**Query:** How can we use SAS Econometric products to help analyze the impact of different pricing strategies on business revenue?

**Generated:** 2025-08-18 11:07:44

**1. Introduction**

This report aims to analyze the causal relationships within the SAS software suite, focusing on the procedures PROC SEVSELECT and PROC CCDM. These procedures are integral components of SAS Econometrics, designed to enhance data analysis capabilities, particularly in severity distribution modeling and aggregate loss distribution simulation. The analysis will identify key entities, explore major causal pathways, assess the strength of causal claims, and discuss the implications of these relationships.

**2. Key Entities and Their Roles**

- **SAS**: As the central entity, SAS provides a comprehensive analytics platform that supports various procedures and tools for statistical analysis. It serves as the backbone for data management and econometric modeling, facilitating complex analytical tasks across academia, industry, and government sectors.

- **PROC SEVSELECT**: This procedure is pivotal in fitting and selecting severity distribution models, handling data with censoring and truncation effects, and optimizing regression effects. It contributes significantly to the SAS suite by enhancing model selection and evaluation capabilities, particularly in severity distributions.

- **PROC CCDM**: A sophisticated procedure within SAS Econometrics, PROC CCDM is designed for simulating aggregate loss distributions using Monte Carlo methods. It is essential for risk management and insurance modeling, providing tools for parameter perturbation analysis and severity adjustment.

**3. Major Causal Pathways**

- **SAS to PROC SEVSELECT**: SAS supports PROC SEVSELECT by providing a robust environment for severity distribution modeling. The procedure writes warnings to the SAS log, enhancing user experience and data management capabilities. This relationship is characterized by a strong causal link, with a weight of 34.0, indicating PROC SEVSELECT's integral role within the SAS suite.

- **SAS to PROC CCDM**: PROC CCDM is another procedure within the SAS suite, focusing on aggregate loss distribution simulation. The causal link between SAS and PROC CCDM, with a weight of 8.0, highlights its role in comprehensive statistical modeling and risk management.

**4. Confidence and Evidence Strength**

The causal claims presented in this report are supported by the structured relationships within the SAS software suite. The weights assigned to each relationship (34.0 for PROC SEVSELECT and 8.0 for PROC CCDM) reflect the strength and importance of these procedures within the SAS ecosystem. The detailed descriptions of each procedure further substantiate their roles and contributions to data analysis and econometric modeling.

**5. Implications and Recommendations**

The causal relationships within the SAS suite have significant implications for data analysis and econometric modeling. PROC SEVSELECT's ability to handle censored and truncated data enhances model selection and evaluation, making it a valuable tool for researchers and analysts. Similarly, PROC CCDM's capabilities in simulating aggregate loss distributions are crucial for risk management and insurance modeling.

Recommendations include leveraging PROC SEVSELECT for severity distribution modeling in econometric analyses and utilizing PROC CCDM for comprehensive risk assessment and simulation tasks. Organizations should consider integrating these procedures into their analytical workflows to optimize data management and enhance decision-making processes.

In conclusion, the SAS software suite, with its robust procedures, offers powerful tools for statistical analysis and econometric modeling, providing users with the flexibility and precision needed to address complex analytical challenges.


📊 Report metadata:
   - File size: 3,907 bytes
   - Last modified: 2025-08-18 15:07:44
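
The report's own "Generated" line says 11:07:44, while the metadata above says 15:07:44. The trailing number in the file name appears to be a Unix timestamp (an assumption based on its value). Decoding it as UTC gives 15:07:44, which is consistent with the "Generated" time coming from a local clock four hours behind UTC. In the sketch below, `parse_report_name` is an illustrative helper, and the meaning of the middle hex token is not documented here.

```python
import re
from datetime import datetime, timezone

# Assumed naming scheme: causal_search_report_<hex token>_<unix seconds>.md
REPORT_NAME_RE = re.compile(r"causal_search_report_([0-9a-f]+)_(\d+)\.md$")


def parse_report_name(filename: str):
    """Return (token, creation time in UTC) parsed from a report file name, or None."""
    match = REPORT_NAME_RE.search(filename)
    if match is None:
        return None
    token, epoch_seconds = match.group(1), int(match.group(2))
    return token, datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


token, created_utc = parse_report_name("causal_search_report_a11a03db_1755529664.md")
print(token, created_utc.isoformat())  # prints: a11a03db 2025-08-18T15:07:44+00:00
```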


## Summary

This notebook demonstrates:

1. **Data Loading**: Using the same functions as the visualization notebook
2. **Model Setup**: Consistent with the visualization notebook approach
3. **Context Building**: Same parameters and structure
4. **Causal Search**: Extended node extraction and two-stage processing
5. **Context Inspection**: Displaying the saved causal analysis report that feeds the final response
6. **Filtering Analysis**: Comparing requested and extracted node counts in the search logs

The key point is that causal search filters the extracted graph before it reaches the model, with three goals:
- **LLM Compatibility**: Data fits within model context limits
- **Relevance**: Most important entities/relationships are preserved
- **Performance**: Efficient processing without context length errors

The apparent "loss" of data (e.g., 12 nodes requested → 7 extracted in the run above) is a result of this filtering, which aims to keep the most relevant information while staying within the context budget.