# Visualizing the knowledge graph with `yfiles-jupyter-graphs`

This notebook is a partial copy of [local_search.ipynb](../../local_search.ipynb) that shows how to use `yfiles-jupyter-graphs` to add interactive graph visualizations of the parquet files and how to visualize the result context of `graphrag` queries (see the end of this notebook).

In [1]:
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.

In [1]:
import os

import pandas as pd
import tiktoken

from graphrag.config.enums import ModelType
from graphrag.config.models.language_model_config import LanguageModelConfig
from graphrag.language_model.manager import ModelManager
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.lancedb import LanceDBVectorStore

## Local Search Example

The local search method generates answers by combining relevant data from the AI-extracted knowledge graph with text chunks of the raw documents. This method is suitable for questions that require an understanding of specific entities mentioned in the documents (e.g., "What are the healing properties of chamomile?").

### Load text units and graph data tables as context for local search

- In this test we first load indexing outputs from parquet files to dataframes, then convert these dataframes into collections of data objects aligning with the knowledge model.

### Load tables to dataframes

In [2]:
INPUT_DIR = "/home/chuaxu/projects/graphrag/ragsas/output"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"

COMMUNITY_REPORT_TABLE = "community_reports"
COMMUNITY_TABLE = "communities"
ENTITY_TABLE = "entities"
RELATIONSHIP_TABLE = "relationships"
COVARIATE_TABLE = "covariates"
TEXT_UNIT_TABLE = "text_units"
COMMUNITY_LEVEL = 2

#### Read entities

In [3]:
# read nodes table to get community and degree data
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
community_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet")

#### Read relationships

In [4]:
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

# Visualizing nodes and relationships with `yfiles-jupyter-graphs`

`yfiles-jupyter-graphs` is a graph visualization extension that provides interactive and customizable visualizations for structured node and relationship data.

In this case, we use it to provide an interactive visualization for the knowledge graph of the [local_search.ipynb](../../local_search.ipynb) sample by passing node and relationship lists converted from the given parquet files. The input data must provide an `id` attribute for each node and `start`/`end` properties for each relationship that correspond to node ids. Additional attributes can be added in the `properties` of each node/relationship dict:

In [5]:
%pip install yfiles_jupyter_graphs --quiet
from yfiles_jupyter_graphs import GraphWidget
import numpy as np


def clean_value(value):
    """Clean a value to make it JSON serializable."""
    # Handle arrays first: pd.isna() returns an array for array input, not a bool
    if isinstance(value, (np.ndarray, list)):
        # Convert arrays to strings, or take the first element if there is only one
        if len(value) == 0:
            return None
        if len(value) == 1:
            return str(value[0])
        return str(value)
    # Missing scalar values (None, NaN, NaT, pd.NA)
    if pd.isna(value):
        return None
    if isinstance(value, (np.integer, np.floating)):
        # Convert numpy numbers to Python numbers; NaN was already handled above
        return None if np.isinf(value) else value.item()
    if isinstance(value, float) and np.isinf(value):
        # Infinite Python floats are not valid JSON
        return None
    return value


# converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs
def convert_entities_to_dicts(df):
    """Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs."""
    nodes_dict = {}
    for _, row in df.iterrows():
        # Create a dictionary for each row and collect unique nodes
        node_id = row["title"]
        if node_id not in nodes_dict:
            # Clean all properties to make them JSON serializable
            cleaned_properties = {k: clean_value(v) for k, v in row.to_dict().items()}
            nodes_dict[node_id] = {
                "id": node_id,
                "properties": cleaned_properties,
            }
    return list(nodes_dict.values())


# converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs
def convert_relationships_to_dicts(df):
    """Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs."""
    relationships = []
    for _, row in df.iterrows():
        # Create a dictionary for each row
        cleaned_properties = {k: clean_value(v) for k, v in row.to_dict().items()}
        relationships.append({
            "start": row["source"],
            "end": row["target"],
            "properties": cleaned_properties,
        })
    return relationships


w = GraphWidget()
w.directed = True
w.nodes = convert_entities_to_dicts(entity_df)
w.edges = convert_relationships_to_dicts(relationship_df)

Note: you may need to restart the kernel to use updated packages.
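
The input format described above can be checked before the data is handed to the widget. The following is a minimal sketch, not part of the `yfiles-jupyter-graphs` API: `find_dangling_edges` is a hypothetical helper, and the sample nodes and edges are made up for illustration.

```python
# Hypothetical helper (not part of yfiles-jupyter-graphs): report edges whose
# "start" or "end" does not match the "id" of any node.
def find_dangling_edges(nodes, edges):
    node_ids = {node["id"] for node in nodes}
    return [
        edge
        for edge in edges
        if edge["start"] not in node_ids or edge["end"] not in node_ids
    ]


# Made-up sample data in the shape described above
sample_nodes = [
    {"id": "A", "properties": {"title": "A"}},
    {"id": "B", "properties": {"title": "B"}},
]
sample_edges = [
    {"start": "A", "end": "B", "properties": {"weight": 2.0}},
    {"start": "A", "end": "C", "properties": {"weight": 1.0}},  # "C" is not a node
]

dangling = find_dangling_edges(sample_nodes, sample_edges)
print([(edge["start"], edge["end"]) for edge in dangling])
```

Running the same check on the output of `convert_entities_to_dicts` and `convert_relationships_to_dicts` would show whether any relationship refers to an entity title that is missing from the entities table.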


## Configure data-driven visualization

The additional properties can be used to configure the visualization for different use cases.

In [6]:
# show title on the node
w.node_label_mapping = "title"


# map community to a color
def community_to_color(community):
    """Map a community to a color."""
    colors = [
        "crimson",
        "darkorange",
        "indigo",
        "cornflowerblue",
        "cyan",
        "teal",
        "green",
    ]
    return (
        colors[int(community) % len(colors)] if community is not None else "lightgray"
    )


def edge_to_source_community(edge):
    """Get the community of the source node of an edge."""
    source_node = next(
        (entry for entry in w.nodes if entry["properties"]["title"] == edge["start"]),
        None,
    )
    if source_node is None:
        return None
    # A missing community property yields None, which maps to the fallback color
    return source_node["properties"].get("community")


w.node_color_mapping = lambda node: community_to_color(node["properties"].get("community", None))
w.edge_color_mapping = lambda edge: community_to_color(edge_to_source_community(edge))
# map size data to a reasonable factor (a missing or None size counts as 0)
w.node_scale_factor_mapping = lambda node: 0.5 + (node["properties"].get("size") or 0) * 1.5 / 20
# use weight for edge thickness
w.edge_thickness_factor_mapping = "weight"
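
`edge_to_source_community` scans `w.nodes` once per edge, which can become slow on large graphs. One possible alternative, sketched here with made-up data, is to build a dictionary from node id to community once and look up each edge's source in it. The names `build_community_index` and `source_community` are hypothetical and not part of any library.

```python
def build_community_index(nodes):
    """Map each node id to its community (None if the property is missing)."""
    return {node["id"]: node["properties"].get("community") for node in nodes}


def source_community(edge, community_index):
    """Return the community of the edge's source node, or None if unknown."""
    return community_index.get(edge["start"])


# Made-up sample data in the same shape as w.nodes
sample_nodes = [
    {"id": "A", "properties": {"title": "A", "community": 3}},
    {"id": "B", "properties": {"title": "B"}},  # no community assigned
]
community_index = build_community_index(sample_nodes)

print(source_community({"start": "A", "end": "B"}, community_index))
print(source_community({"start": "B", "end": "A"}, community_index))
print(source_community({"start": "X", "end": "A"}, community_index))
```

With the widget from this notebook, the index could be built once from `w.nodes` and used inside the lambda assigned to `w.edge_color_mapping`. This relies on the node `id` being the entity title, as in `convert_entities_to_dicts`.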

## Automatic layouts

The widget provides different automatic layouts that serve different purposes: `Circular`, `Hierarchic`, `Organic (interactive or static)`, `Orthogonal`, `Radial`, `Tree`, `Geo-spatial`.

For the knowledge graph, this sample uses the `Circular` layout, though `Hierarchic` or `Organic` are also suitable choices.

In [7]:
# Use the circular layout for this visualization. For larger graphs, the default organic layout is often preferable.
w.circular_layout()
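
The layout could also be picked based on graph size. The sketch below rests on several assumptions: the threshold of 50 nodes is arbitrary, `choose_layout` and `apply_layout` are hypothetical helpers, and `organic_layout` is assumed to be available on the widget alongside the `circular_layout` method used above. A stub class stands in for `GraphWidget` so the example runs without the extension installed.

```python
def choose_layout(node_count, threshold=50):
    """Pick a layout method name; the threshold is an arbitrary assumption."""
    return "circular_layout" if node_count <= threshold else "organic_layout"


def apply_layout(widget, node_count):
    """Call the chosen layout method on the widget and return its name."""
    name = choose_layout(node_count)
    getattr(widget, name)()
    return name


class StubWidget:
    """Stand-in for GraphWidget that only records which layout was requested."""

    def __init__(self):
        self.calls = []

    def circular_layout(self):
        self.calls.append("circular_layout")

    def organic_layout(self):
        self.calls.append("organic_layout")


stub = StubWidget()
print(apply_layout(stub, node_count=20))
print(apply_layout(stub, node_count=2000))
```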

## Display the graph

In [8]:
display(w)

GraphWidget(layout=Layout(height='800px', width='100%'))

# Visualizing the result context of `graphrag` queries

The result context of a `graphrag` query makes it possible to inspect the context graph of the request. This data can similarly be visualized as a graph with `yfiles-jupyter-graphs`.

## Making the request

The following cell recreates the sample queries from [local_search.ipynb](../../local_search.ipynb).

In [17]:
# setup (see also ../../local_search.ipynb)
entities = read_indexer_entities(entity_df, community_df, COMMUNITY_LEVEL)

description_embedding_store = LanceDBVectorStore(
    collection_name="default-entity-description",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)

# Covariates are optional: proceed without them if the parquet file doesn't exist
try:
    covariate_df = pd.read_parquet(f"{INPUT_DIR}/{COVARIATE_TABLE}.parquet")
    claims = read_indexer_covariates(covariate_df)
    covariates = {"claims": claims}
except FileNotFoundError:
    print("Covariate file not found, proceeding without covariates")
    covariates = {}

report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, community_df, COMMUNITY_LEVEL)
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)

# Load configuration from settings.yaml
from graphrag.config.load_config import load_config
from pathlib import Path

config_path = Path("/home/chuaxu/projects/graphrag/ragsas")
config = load_config(config_path)

# Get model configurations from the loaded config
chat_model_config = config.get_language_model_config("default_chat_model")
embedding_model_config = config.get_language_model_config("default_embedding_model")

chat_model = ModelManager().get_or_create_chat_model(
    name="local_search",
    model_type=chat_model_config.type,
    config=chat_model_config,
)

token_encoder = tiktoken.encoding_for_model(chat_model_config.model)

text_embedder = ModelManager().get_or_create_embedding_model(
    name="local_search_embedding",
    model_type=embedding_model_config.type,
    config=embedding_model_config,
)

context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    covariates=covariates,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 3,
    "top_k_relationships": 3,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 80_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}

model_params = {
    "max_tokens": 16_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500; the model supports at most 16384 completion tokens)
    "temperature": 0.0,
}

search_engine = LocalSearch(
    model=chat_model,
    context_builder=context_builder,
    token_encoder=token_encoder,
    model_params=model_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

Covariate file not found, proceeding without covariates
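
To get a feel for `text_unit_prop` and `community_prop` in `local_context_params`, the following sketch works through the arithmetic. It assumes the context builder splits `max_tokens` linearly by these proportions and gives the remainder to entities and relationships. `split_context_budget` is a hypothetical helper, not a `graphrag` function, and the exact rounding inside the library may differ.

```python
def split_context_budget(max_tokens, text_unit_prop, community_prop):
    """Split a token budget by proportion; the remainder goes to entities/relationships."""
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must not exceed 1")
    text_unit_tokens = round(max_tokens * text_unit_prop)
    community_tokens = round(max_tokens * community_prop)
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        "entities_and_relationships": max_tokens - text_unit_tokens - community_tokens,
    }


# Values taken from local_context_params above
budget = split_context_budget(max_tokens=80_000, text_unit_prop=0.5, community_prop=0.1)
print(budget)
```

Under this assumption, the settings above reserve 40,000 tokens for text units, 8,000 for community reports, and 32,000 for entities and relationships.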


## Run local search on sample queries

In [10]:
result = await search_engine.search("Tell me about Agent Mercer")
print(result.response)

I'm sorry, but there is no information available about Agent Mercer in the provided data tables. If you have any other questions or need information on a different topic, feel free to ask!


In [19]:
from IPython.display import Markdown, display

question = "How can we use SAS Econometric products to help analyze the impact of different pricing strategies on business revenue?"
# question = "Which published studies in our knowledge base used both panel data methods and cointegration analysis on emerging market economies?"
result = await search_engine.search(question)

# Display as formatted Markdown instead of plain text
display(Markdown(result.response))

### Introduction to SAS Econometrics for Pricing Strategy Analysis

SAS Econometrics is a comprehensive software suite developed by SAS Institute Inc., designed to facilitate advanced econometric analysis. It provides a wide array of tools and procedures tailored for statistical and econometric modeling, making it an invaluable resource for analyzing the impact of pricing strategies on business revenue [Data: Entities (2); Reports (163)].

### Key Procedures for Pricing Strategy Analysis

One of the standout features of SAS Econometrics is its support for Bayesian analysis and advanced econometric methods, which are crucial for understanding complex pricing dynamics. The suite includes specialized procedures such as PROC DEEPPRICE, which is particularly designed for deep learning and policy evaluation in the context of analyzing price and demand data. This procedure is instrumental in estimating demand curves and optimal revenue per user by accounting for heterogeneous price effects based on user characteristics [Data: Entities (1762); Relationships (1940)].

PROC DEEPPRICE allows businesses to specify the correct functional form to accurately capture the relationship between price and demand. It also provides options for handling missing values and ensuring reproducibility, which are essential for maintaining the integrity and accuracy of the data analysis. By integrating these capabilities, PROC DEEPPRICE serves as a powerful tool for those seeking to leverage data-driven insights to optimize pricing and maximize revenue [Data: Entities (1762)].

### Additional Tools and Techniques

In addition to PROC DEEPPRICE, SAS Econometrics offers other procedures such as PROC CNTSELECT and PROC SEVSELECT, which are used for count data modeling and severity modeling, respectively. These procedures are essential for accurately capturing the underlying patterns in data and addressing issues such as overdispersion and excess zeros, which are common in real-world datasets [Data: Entities (290); Relationships (2131, 2137)].

SAS Econometrics operates in a cloud environment, providing users with the flexibility and scalability needed for large-scale data analysis. This cloud-based approach ensures that users can access the software's powerful features from anywhere, facilitating collaboration and efficiency in data-driven decision-making [Data: Entities (2)].

### Conclusion

Overall, SAS Econometrics provides a robust and versatile toolset that empowers users to conduct sophisticated econometric analyses, develop predictive models, and perform Bayesian inference with ease and precision. Its comprehensive suite of procedures and cloud-based accessibility make it an invaluable resource for professionals in the field of econometrics, particularly when analyzing the impact of different pricing strategies on business revenue [Data: Entities (2); Reports (163)].

## Inspecting the context data used to generate the response

In [20]:
result.context_data["entities"].head()

| | id | entity | description | number of relationships | in_context |
|---|---|---|---|---|---|
| 0 | 290 | SAS ECONOMETRICS PROCEDURES | SAS Econometrics Procedures are a comprehensiv... | 1 | True |
| 1 | 2 | SAS ECONOMETRICS | SAS Econometrics is a comprehensive software s... | 31 | True |
| 2 | 1102 | SAS/ETS SOFTWARE | | 2 | True |
| 3 | 4 | SAS/ETS | SAS/ETS is a comprehensive software suite with... | 14 | True |
| 4 | 1762 | PROC DEEPPRICE | PROC DEEPPRICE is a versatile procedure design... | 24 | True |


In [21]:
result.context_data["relationships"].head()

| | id | source | target | description | weight | links | in_context |
|---|---|---|---|---|---|---|---|
| 0 | 4 | SAS ECONOMETRICS | SAS/ETS | SAS Econometrics and SAS/ETS are software suit... | 16.0 | 1 | True |
| 1 | 0 | SAS INSTITUTE INC. | SAS ECONOMETRICS | SAS Institute Inc. developed and provides the ... | 9.0 | 1 | True |
| 2 | 3484 | SAS | SAS ECONOMETRICS | SAS Econometrics is a product within the SAS s... | 9.0 | 2 | True |
| 3 | 247 | SAS | SAS ECONOMETRICS PROCEDURES | SAS Econometrics Procedures are part of the SA... | 1.0 | 2 | True |
| 4 | 2143 | SAS ECONOMETRICS | PROC CCDM | PROC CCDM is a procedure within SAS Econometri... | 8.0 | 2 | True |
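
The response cites its sources with markers such as `[Data: Entities (2); Reports (163)]`, and the cited numbers appear to correspond to the `id` column of the context tables. The sketch below extracts these ids from a response string so they can be looked up in `result.context_data`. `parse_citations` is a hypothetical helper, and the marker format is inferred from the response shown above rather than from a documented specification.

```python
import re


def parse_citations(response_text):
    """Collect cited record ids per dataset from '[Data: ...]' markers."""
    citations = {}
    for block in re.findall(r"\[Data:\s*([^\]]+)\]", response_text):
        for part in block.split(";"):
            match = re.match(r"\s*([A-Za-z ]+?)\s*\(([^)]*)\)", part)
            if match is None:
                continue
            dataset, raw_ids = match.groups()
            ids = citations.setdefault(dataset, set())
            # Keep only numeric ids; any non-numeric text in the parentheses is skipped
            ids.update(int(token) for token in re.findall(r"\d+", raw_ids))
    return {dataset: sorted(ids) for dataset, ids in citations.items()}


# Shortened sample text that reuses citation markers from the response above
sample = (
    "PROC DEEPPRICE estimates demand curves [Data: Entities (1762); Relationships (1940)]. "
    "Other procedures cover count data [Data: Entities (290); Relationships (2131, 2137)]."
)
print(parse_citations(sample))
```

The parsed ids are integers. The `id` column of the context dataframes may hold strings, so a type conversion may be needed before filtering.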


## Visualizing the result context as graph

In [22]:
"""
Helper function to visualize the result context with `yfiles-jupyter-graphs`.

The dataframes are converted into supported nodes and relationships lists and then passed to yfiles-jupyter-graphs.
Additionally, some values are mapped to visualization properties.
"""


def show_graph(result):
    """Visualize the result context with yfiles-jupyter-graphs."""
    from yfiles_jupyter_graphs import GraphWidget

    if (
        "entities" not in result.context_data
        or "relationships" not in result.context_data
    ):
        msg = "The passed results do not contain 'entities' or 'relationships'"
        raise ValueError(msg)

    # converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs
    def convert_entities_to_dicts(df):
        """Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs."""
        nodes_dict = {}
        for _, row in df.iterrows():
            # Create a dictionary for each row and collect unique nodes
            node_id = row["entity"]
            if node_id not in nodes_dict:
                nodes_dict[node_id] = {
                    "id": node_id,
                    "properties": row.to_dict(),
                }
        return list(nodes_dict.values())

    # converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs
    def convert_relationships_to_dicts(df):
        """Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs."""
        relationships = []
        for _, row in df.iterrows():
            # Create a dictionary for each row
            relationships.append({
                "start": row["source"],
                "end": row["target"],
                "properties": row.to_dict(),
            })
        return relationships

    w = GraphWidget()
    # use the converted data to visualize the graph
    w.nodes = convert_entities_to_dicts(result.context_data["entities"])
    w.edges = convert_relationships_to_dicts(result.context_data["relationships"])
    w.directed = True
    # show title on the node
    w.node_label_mapping = "entity"
    # use weight for edge thickness
    w.edge_thickness_factor_mapping = "weight"
    display(w)


show_graph(result)

GraphWidget(layout=Layout(height='500px', width='100%'))
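
`show_graph` uses the raw `weight` column as the edge thickness factor. In the relationships table above the weights range from 1.0 to 16.0, so the heaviest edges may be drawn very thick. One option is to rescale the weights into a narrower range first. `scale_weight` is a hypothetical helper, and the target range of 1 to 5 is an arbitrary choice for illustration.

```python
def scale_weight(weight, min_weight, max_weight, low=1.0, high=5.0):
    """Linearly rescale a weight from [min_weight, max_weight] to [low, high]."""
    if max_weight == min_weight:
        return low  # all weights are equal, so use the minimum thickness
    fraction = (weight - min_weight) / (max_weight - min_weight)
    return low + fraction * (high - low)


# Weights of the first five relationships in the table above
weights = [16.0, 9.0, 9.0, 1.0, 8.0]
min_w, max_w = min(weights), max(weights)
scaled = [round(scale_weight(weight, min_w, max_w), 2) for weight in weights]
print(scaled)
```

Assuming `edge_thickness_factor_mapping` accepts a callable in the same way as the node mappings used earlier in this notebook, the helper could be applied with a lambda that reads `edge["properties"]["weight"]`.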

In [23]:
# Analyze the context data and token usage that the LLM receives
import tiktoken
from IPython.display import Markdown, display

def analyze_context_and_tokens(result, search_engine):
    """Analyze the context data and count tokens for different message components."""
    
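    # Get the token encoder. It is hard-coded to the gpt-4 encoding, so the counts are
    # approximate if the chat model uses a different tokenizer (e.g. gpt-4o).
    encoding = tiktoken.encoding_for_model("gpt-4")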
    
    print("=== CONTEXT DATA ANALYSIS ===\n")
    
    # Show context data statistics
    if "entities" in result.context_data:
        entities_df = result.context_data["entities"]
        print(f"📊 Entities in context: {len(entities_df)} entities")
        print(f"   - Columns: {list(entities_df.columns)}")
        print(f"   - Sample entity: {entities_df.iloc[0]['entity'] if len(entities_df) > 0 else 'None'}")
    
    if "relationships" in result.context_data:
        relationships_df = result.context_data["relationships"]
        print(f"📊 Relationships in context: {len(relationships_df)} relationships")
        print(f"   - Columns: {list(relationships_df.columns)}")
        # Guard the whole sample lookup so an empty dataframe does not raise an IndexError
        if len(relationships_df) > 0:
            first = relationships_df.iloc[0]
            sample_relationship = f"{first['source']} -> {first['target']}"
        else:
            sample_relationship = "None"
        print(f"   - Sample relationship: {sample_relationship}")
    
    if "reports" in result.context_data:
        reports_df = result.context_data["reports"]
        print(f"📊 Community reports in context: {len(reports_df)} reports")
    
    if "sources" in result.context_data:
        sources_df = result.context_data["sources"]
        print(f"📊 Text sources in context: {len(sources_df)} text units")
    
    print("\n=== TOKEN ANALYSIS ===\n")
    
    # Reconstruct the context that was sent to the LLM
    # This is an approximation of what the LocalSearch builds
    
    # 1. System prompt
    with open("/home/chuaxu/projects/graphrag/ragsas/prompts/local_search_system_prompt.txt", "r") as f:
        system_prompt = f.read()

    # 2. Context data formatting (full version of what LocalSearch does)
    context_parts = []
    
    # Add entities context (FULL - no truncation)
    if "entities" in result.context_data and len(result.context_data["entities"]) > 0:
        entities_context = "## Relevant Entities:\n\n"
        for _, entity in result.context_data["entities"].iterrows():  # Show ALL entities
            desc = entity.get('description', 'No description')
            rank = entity.get('rank', 'N/A')
            entities_context += f"**{entity['entity']}** (Rank: {rank})\n"
            entities_context += f"Description: {desc}\n\n"
        context_parts.append(entities_context)
    
    # Add relationships context (FULL - no truncation)
    if "relationships" in result.context_data and len(result.context_data["relationships"]) > 0:
        relationships_context = "## Relevant Relationships:\n\n"
        for _, rel in result.context_data["relationships"].iterrows():  # Show ALL relationships
            desc = rel.get('description', 'No description')
            weight = rel.get('weight', 'N/A')
            relationships_context += f"**{rel['source']} → {rel['target']}** (Weight: {weight})\n"
            relationships_context += f"Description: {desc}\n\n"
        context_parts.append(relationships_context)
    
    # Add community reports context (FULL - no truncation)
    if "reports" in result.context_data and len(result.context_data["reports"]) > 0:
        reports_context = "## Relevant Community Reports:\n\n"
        for _, report in result.context_data["reports"].iterrows():  # Show ALL reports
            title = report.get('title', 'Untitled Report')
            content = report.get('content', report.get('summary', 'No content'))
            rank = report.get('rank', 'N/A')
            reports_context += f"**{title}** (Rank: {rank})\n"
            reports_context += f"{content}\n\n"
        context_parts.append(reports_context)
    
    # Add sources context (FULL - no truncation)
    if "sources" in result.context_data and len(result.context_data["sources"]) > 0:
        sources_context = "## Relevant Text Sources:\n\n"
        for _, source in result.context_data["sources"].iterrows():  # Show ALL sources
            text_content = source.get('text', source.get('content', 'No content'))
            source_id = source.get('id', 'Unknown')
            rank = source.get('rank', 'N/A')
            sources_context += f"**Source {source_id}** (Rank: {rank})\n"
            sources_context += f"{text_content}\n\n"
        context_parts.append(sources_context)
    
    # 3. User question
    user_question = question  # From the previous cell
    
    # Combine all context
    full_context = "\n".join(context_parts)
    
    # 4. Calculate tokens for each component
    system_prompt_tokens = len(encoding.encode(system_prompt))
    context_tokens = len(encoding.encode(full_context))
    user_question_tokens = len(encoding.encode(user_question))
    response_tokens = len(encoding.encode(result.response))
    
    total_input_tokens = system_prompt_tokens + context_tokens + user_question_tokens
    total_tokens = total_input_tokens + response_tokens
    
    print(f"🔢 Token Breakdown:")
    print(f"   - System prompt: {system_prompt_tokens:,} tokens")
    print(f"   - Context data: {context_tokens:,} tokens")
    print(f"   - User question: {user_question_tokens:,} tokens")
    print(f"   - Total INPUT: {total_input_tokens:,} tokens")
    print(f"   - Response: {response_tokens:,} tokens")
    print(f"   - TOTAL MESSAGE: {total_tokens:,} tokens")
    
    # Show model limits
    model_limit = 128_000  # GPT-4o limit
    print(f"\n📏 Model Capacity:")
    print(f"   - Model limit: {model_limit:,} tokens")
    print(f"   - Used: {total_tokens:,} tokens ({total_tokens/model_limit*100:.1f}%)")
    print(f"   - Remaining: {model_limit - total_tokens:,} tokens")
    
    if total_tokens > model_limit:
        print("   ⚠️  WARNING: Token count exceeds model limit!")
    elif total_tokens > model_limit * 0.9:
        print("   ⚠️  WARNING: Token count is near model limit!")
    else:
        print("   ✅ Token count is within safe limits")
    
    print("\n" + "="*80)
    print("=== FULL CONTEXT SENT TO LLM ===")
    print("="*80)
    
    # Display the complete context in a nice formatted way
    print(f"\n🔵 SYSTEM PROMPT (Skipped)")
    print("-" * 40)
    
    print(f"\n🔵 USER QUESTION:")
    print("-" * 40)
    print(user_question)
    
    print(f"\n🔵 CONTEXT DATA ({context_tokens:,} tokens):")
    print("-" * 40)
    
    # Display full context as Markdown for better formatting
    display(Markdown(full_context))
    
    print("="*80)
    print("=== END OF CONTEXT ===")
    print("="*80)
    
    return {
        "system_prompt_tokens": system_prompt_tokens,
        "context_tokens": context_tokens,
        "user_question_tokens": user_question_tokens,
        "response_tokens": response_tokens,
        "total_tokens": total_tokens,
        "context_data": result.context_data,
        "full_context": full_context
    }

# Run the analysis
token_analysis = analyze_context_and_tokens(result, search_engine)

=== CONTEXT DATA ANALYSIS ===

📊 Entities in context: 6 entities
   - Columns: ['id', 'entity', 'description', 'number of relationships', 'in_context']
   - Sample entity: SAS ECONOMETRICS PROCEDURES
📊 Relationships in context: 20 relationships
   - Columns: ['id', 'source', 'target', 'description', 'weight', 'links', 'in_context']
   - Sample relationship: SAS ECONOMETRICS -> SAS/ETS
📊 Community reports in context: 1 reports
📊 Text sources in context: 3 text units

=== TOKEN ANALYSIS ===

🔢 Token Breakdown:
   - System prompt: 604 tokens
   - Context data: 7,568 tokens
   - User question: 21 tokens
   - Total INPUT: 8,193 tokens
   - Response: 516 tokens
   - TOTAL MESSAGE: 8,709 tokens

📏 Model Capacity:
   - Model limit: 128,000 tokens
   - Used: 8,709 tokens (6.8%)
   - Remaining: 119,291 tokens
   ✅ Token count is within safe limits

=== FULL CONTEXT SENT TO LLM ===

🔵 SYSTEM PROMPT (Skipped)
----------------------------------------

🔵 USER QUESTION:
------------------------------

## Relevant Entities:

**SAS ECONOMETRICS PROCEDURES** (Rank: N/A)
Description: SAS Econometrics Procedures are a comprehensive suite of statistical analysis tools designed for conducting a wide range of econometric analyses. These procedures are particularly useful for the levelization of classification variables, which is a critical step in preparing data for further econometric modeling. The suite includes a variety of statistical methods tailored to address different aspects of econometric analysis, ensuring that users have the flexibility and precision needed for their specific research or business needs.

Among the key procedures included in SAS Econometrics are CNTSELECT, CPANEL, CQLIM, CSPATIALREG, and SEVSELECT. Each of these methods serves a distinct purpose within the econometric analysis framework. CNTSELECT is typically used for selecting the best model from a set of candidate models, ensuring that the chosen model is the most appropriate for the data at hand. CPANEL is designed for panel data analysis, allowing users to handle data that involves observations over multiple time periods for the same entities, which is common in economic and financial studies.

CQLIM is another critical procedure, used for limited dependent variable models, which are essential when dealing with outcomes that are categorical or otherwise constrained. CSPATIALREG is employed for spatial regression analysis, enabling users to account for spatial dependencies in their data, which is particularly important in fields such as regional economics and real estate. Lastly, SEVSELECT is used for selection models, helping to correct for selection bias in econometric models, which can otherwise lead to inaccurate estimates and conclusions.

Overall, SAS Econometrics Procedures provide a robust and versatile toolkit for econometricians and analysts, facilitating the execution of complex statistical analyses with precision and efficiency. These procedures are integral to the SAS software suite, known for its powerful data analysis capabilities, and are widely used in academia, government, and industry for their reliability and comprehensive approach to econometric challenges.

**SAS ECONOMETRICS** (Rank: N/A)
Description: SAS Econometrics is a comprehensive software suite developed by SAS Institute Inc., designed to facilitate advanced econometric analysis. This suite is part of the broader SAS software ecosystem and provides a wide array of tools and procedures tailored for statistical and econometric modeling. It is particularly noted for its capabilities in time series analysis, econometric modeling, and Bayesian inference.

The suite includes a variety of specialized procedures such as PROC UCM for time series analysis, PROC SEVSELECT and PROC CNTSELECT for econometric analysis, and the ECM procedure for developing economic capital models. These procedures are instrumental in implementing modeling and simulation steps in economic capital modeling. SAS Econometrics also offers the CCDM and CCOPULA procedures, which are used for modeling compound distributions and copulas, respectively.

One of the standout features of SAS Econometrics is its support for Bayesian analysis, which is facilitated through procedures like CQLIM, CNTSELECT, and SMC. These procedures employ advanced algorithms such as the random walk Metropolis (RWM) algorithm, which is enhanced with self-tuning capabilities to optimize Bayesian inference processes. The suite also includes tools for predictive modeling and various econometric methods, making it a versatile choice for econometricians and data analysts.

SAS Econometrics operates in a cloud environment, providing users with the flexibility and scalability needed for large-scale data analysis. This cloud-based approach ensures that users can access the software's powerful features from anywhere, facilitating collaboration and efficiency in data-driven decision-making.

Overall, SAS Econometrics is a robust and versatile toolset that empowers users to conduct sophisticated econometric analyses, develop predictive models, and perform Bayesian inference with ease and precision. Its comprehensive suite of procedures and cloud-based accessibility make it an invaluable resource for professionals in the field of econometrics.

**SAS/ETS SOFTWARE** (Rank: N/A)
Description: 

**SAS/ETS** (Rank: N/A)
Description: SAS/ETS is a comprehensive software suite within the SAS software ecosystem, specifically designed to provide robust tools for econometric and time series analysis. This suite includes a variety of procedures that cater to different analytical needs, making it a versatile choice for professionals in the field of econometrics and time series data analysis.

One of the key features of SAS/ETS is its inclusion of the UCM (Unobserved Components Model) procedure, which is instrumental in time series analysis. This procedure allows users to decompose time series data into components such as trend, seasonal, and irregular components, facilitating a deeper understanding of the underlying patterns in the data.

In addition to UCM, SAS/ETS offers a wide array of other procedures that enhance its functionality. These include ARIMA (AutoRegressive Integrated Moving Average), which is widely used for forecasting and understanding time series data; ESM (Exponential Smoothing Models), which provides a framework for smoothing time series data; VARMAX (Vector Autoregressive Moving-Average with Exogenous Inputs), which is useful for multivariate time series analysis; STATESPACE, which is used for state space modeling; and PANEL, which is designed for panel data analysis.

SAS/ETS also includes procedures specifically for count-data modeling, such as COUNTREG and HPCOUNTREG. These procedures are essential for estimating count regression models, which are used when the data being analyzed are counts or non-negative integers. This capability is particularly useful in fields such as biostatistics, epidemiology, and social sciences, where count data is prevalent.

Overall, SAS/ETS stands out as a powerful and flexible toolset for econometric and time series analysis, offering a wide range of procedures that cater to both basic and advanced analytical needs. Its integration within the broader SAS software suite ensures that users have access to a comprehensive set of tools for data analysis, making it a valuable resource for analysts and researchers alike.

**PROC DEEPPRICE** (Rank: N/A)
Description: PROC DEEPPRICE is a versatile procedure designed for deep learning and policy evaluation, particularly in the context of analyzing price and demand data. It is equipped to handle continuous outcome variables by utilizing the identity function for G in the outcome model, which simplifies the process of modeling these types of variables. This procedure is instrumental in estimating demand curves, allowing users to specify the correct functional form to accurately capture the relationship between price and demand. Additionally, PROC DEEPPRICE is adept at estimating optimal revenue per user by accounting for heterogeneous price effects based on user characteristics, thereby enabling more personalized and effective pricing strategies.

The procedure offers several options to enhance its functionality and adaptability. Users can specify the minibatch size, which is crucial for managing computational resources and optimizing the learning process in deep learning applications. Furthermore, PROC DEEPPRICE includes options for random seed generation, ensuring reproducibility and consistency in results across different runs. It also provides mechanisms for handling missing values, which is essential for maintaining the integrity and accuracy of the data analysis.

In the context of price effect estimation, PROC DEEPPRICE saves the estimation details, facilitating a comprehensive analysis of how price changes impact demand. This feature is particularly useful for businesses and researchers aiming to understand and predict consumer behavior in response to pricing strategies. By integrating these capabilities, PROC DEEPPRICE serves as a powerful tool for those seeking to leverage data-driven insights to optimize pricing and maximize revenue.

Overall, PROC DEEPPRICE stands out as a robust procedure that combines deep learning techniques with advanced policy evaluation methods to deliver precise and actionable insights into price and demand dynamics. Its ability to handle continuous outcome variables, estimate demand curves, and account for heterogeneous price effects makes it an invaluable asset for analysts and decision-makers in various industries.

**SAS INSTITUTE INC.** (Rank: N/A)
Description: SAS Institute Inc. is a prominent software company headquartered in Cary, North Carolina, USA. Renowned for its expertise in analytics software, SAS Institute Inc. has established itself as a leader in the development of advanced statistical and econometric tools. The company is particularly well-known for its comprehensive statistical software suite, which is widely used across various industries for data analysis and decision-making processes.

In addition to its statistical software, SAS Institute Inc. offers a range of analytics services that cater to diverse business needs. Among its notable offerings is SAS Econometrics, a sophisticated toolset designed to provide predictive modeling capabilities and econometrics procedures. This suite of tools is instrumental for businesses and researchers who require robust analytical solutions to forecast trends, analyze economic data, and make informed decisions based on quantitative insights.

SAS Institute Inc.'s commitment to innovation and excellence in analytics software has made it a trusted partner for organizations seeking to leverage data for strategic advantage. By continuously enhancing its software capabilities and expanding its service offerings, SAS Institute Inc. remains at the forefront of the analytics industry, empowering users to harness the power of data for improved outcomes.


## Relevant Relationships:

**SAS ECONOMETRICS → SAS/ETS** (Weight: 16.0)
Description: SAS Econometrics and SAS/ETS are software suites designed to facilitate econometric and time series analysis. Both suites include the UCM Procedure, which is a powerful tool for analyzing unobserved components in time series data. SAS Econometrics offers users access to the comprehensive capabilities of SAS/ETS software, enabling them to perform advanced econometric analysis. These suites are integral for professionals and researchers who require robust tools for modeling, forecasting, and analyzing economic and financial data. By leveraging the features of SAS/ETS, SAS Econometrics enhances the user's ability to conduct sophisticated analyses, making it a valuable resource in the field of econometrics.

**SAS INSTITUTE INC. → SAS ECONOMETRICS** (Weight: 9.0)
Description: SAS Institute Inc. developed and provides the SAS Econometrics suite

**SAS → SAS ECONOMETRICS** (Weight: 9.0)
Description: SAS Econometrics is a product within the SAS software suite

**SAS → SAS ECONOMETRICS PROCEDURES** (Weight: 1.0)
Description: SAS Econometrics Procedures are part of the SAS software suite for econometric analysis

**SAS ECONOMETRICS → PROC CCDM** (Weight: 8.0)
Description: PROC CCDM is a procedure within SAS Econometrics used for estimating compound distribution models

**PROC CCDM → SAS ECONOMETRICS** (Weight: 23.0)
Description: PROC CCDM is a specialized procedure integrated within the SAS Econometrics software suite, designed to facilitate advanced statistical analysis. As a component of SAS Econometrics, PROC CCDM plays a crucial role in the simulation of counts and severity, making it an essential tool for econometricians and data analysts who require precise modeling capabilities. The procedure is utilized in conjunction with other features of SAS Econometrics to enhance the accuracy and reliability of statistical simulations, particularly in scenarios where understanding the distribution and impact of various economic factors is critical. By leveraging PROC CCDM, users can effectively analyze complex datasets, enabling them to derive meaningful insights and make informed decisions based on robust statistical evidence. Overall, PROC CCDM, as part of SAS Econometrics, provides a comprehensive framework for conducting sophisticated econometric analyses, ensuring that users have access to the tools necessary for tackling intricate statistical challenges.

**PROC CCDM → SAS/ETS** (Weight: 5.0)
Description: PROC CCDM is used in conjunction with SAS/ETS for simulating counts and severity

**SAS ECONOMETRICS → UCM PROCEDURE** (Weight: 9.0)
Description: The UCM Procedure is a part of the SAS Econometrics software suite

**UCM PROCEDURE → SAS ECONOMETRICS** (Weight: 8.0)
Description: The UCM procedure is part of SAS Econometrics for time series analysis

**SAS ECONOMETRICS → PROC UCM** (Weight: 9.0)
Description: PROC UCM is a procedure available in the SAS Econometrics software suite

**SAS/ETS → UCM PROCEDURE** (Weight: 1.0)
Description: The UCM Procedure is a part of the SAS/ETS software suite

**UCM PROCEDURE → SAS/ETS** (Weight: 8.0)
Description: The UCM procedure is also part of SAS/ETS for time series analysis

**PROC DEEPPRICE → MODEL** (Weight: 1.0)
Description: PROC DEEPPRICE includes a MODEL statement for specifying outcome models and DNN fitting

**SAS/ETS → PROC UCM** (Weight: 1.0)
Description: PROC UCM is a procedure available in the SAS/ETS software suite

**MODEL → SAS/ETS** (Weight: 7.0)
Description: The MODEL procedure is part of the SAS/ETS software suite

**SAS ECONOMETRICS → PROC SEVSELECT** (Weight: 8.0)
Description: PROC SEVSELECT is a procedure within SAS Econometrics for severity modeling

**PROC SEVSELECT → SAS ECONOMETRICS** (Weight: 1.0)
Description: PROC SEVSELECT is a procedure within SAS Econometrics

**PROC DEEPPRICE → MYLIB** (Weight: 7.0)
Description: PROC DEEPPRICE uses data from MYLIB to estimate price effects and save results

**PROC CSSM → SAS/ETS** (Weight: 7.0)
Description: PROC CSSM complements several SAS/ETS procedures by offering solutions for more general problems or detailed analysis

**PROC CNTSELECT → SAS ECONOMETRICS** (Weight: 9.0)
Description: PROC CNTSELECT is a specialized procedure within SAS Econometrics designed for count data modeling and statistical analysis. As part of the SAS Econometrics suite, PROC CNTSELECT provides users with robust tools for handling count data, which is data that represents the number of occurrences of an event. This procedure is particularly useful in fields such as economics, healthcare, and social sciences, where count data is prevalent and requires precise modeling techniques.

SAS Econometrics, the broader framework within which PROC CNTSELECT operates, is a comprehensive software package that offers a wide range of econometric and statistical tools. It is widely used by researchers and analysts to perform complex data analysis, model economic phenomena, and make informed decisions based on empirical data. The inclusion of PROC CNTSELECT in SAS Econometrics enhances the suite's capabilities by allowing users to effectively model count data, which can often be challenging due to its discrete nature and potential for overdispersion.

PROC CNTSELECT is equipped with advanced statistical methods that enable users to select the most appropriate model for their count data. It supports various modeling techniques, including Poisson regression, negative binomial regression, and zero-inflated models, among others. These techniques are essential for accurately capturing the underlying patterns in count data and addressing issues such as overdispersion and excess zeros, which are common in real-world datasets.

The procedure is designed to be user-friendly, with a syntax that is consistent with other SAS procedures, making it accessible to both novice and experienced users. It provides comprehensive output that includes parameter estimates, goodness-of-fit statistics, and diagnostic measures, allowing users to thoroughly evaluate the performance of their models. Additionally, PROC CNTSELECT offers options for model selection and validation, ensuring that users can identify the best-fitting model for their data.

Overall, PROC CNTSELECT is a valuable tool within SAS Econometrics for anyone dealing with count data. Its integration into the SAS Econometrics suite underscores its importance in the realm of statistical modeling, providing users with the necessary resources to tackle complex count data challenges and derive meaningful insights from their analyses.


## Relevant Community Reports:

**SAS Econometrics and SAS Institute Inc.** (Rank: N/A)
# SAS Econometrics and SAS Institute Inc.

The community is centered around SAS Econometrics, a comprehensive software suite developed by SAS Institute Inc., which is headquartered in Cary, NC. This suite is integral to advanced econometric analysis and is supported by various data tables and technical support. Key figures such as Schervish and Gilks contribute to the theoretical underpinnings of the software's capabilities, particularly in Bayesian statistics and MCMC methods.

## SAS Econometrics as a central tool for econometric analysis

SAS Econometrics is a pivotal software suite developed by SAS Institute Inc., designed to facilitate advanced econometric analysis. It offers a wide array of tools tailored for statistical and econometric modeling, including time series analysis, econometric modeling, and Bayesian inference. The suite's cloud-based environment provides flexibility and scalability, making it a valuable resource for professionals in the field of econometrics [Data: Entities (2, 0); Relationships (0, 2)].

## Role of SAS Institute Inc. in analytics software development

SAS Institute Inc., headquartered in Cary, NC, is a leader in analytics software development, renowned for its expertise in creating advanced statistical and econometric tools. The company's commitment to innovation and excellence has established it as a trusted partner for organizations seeking to leverage data for strategic advantage. SAS Econometrics is one of its notable offerings, providing predictive modeling capabilities and econometrics procedures essential for data-driven decision-making [Data: Entities (0, 1); Relationships (0)].

## Schervish's contributions to statistical methods

Schervish is a prominent figure in the field of statistics, known for his contributions to probability and statistical methods. His work on large-sample theory, particularly concerning the posterior distribution of parameters, has been influential in shaping contemporary statistical thought. Schervish's expertise is utilized in SAS Econometrics for tuning random walk Metropolis algorithms, enhancing the software's capabilities in Bayesian analysis [Data: Entities (23); Relationships (55)].

## Operational risk management with SAS Econometrics

SAS Econometrics plays a crucial role in operational risk management through its integration with data tables such as OpRiskLossCounts, OpRiskLosses, and OpRiskMatchedLosses. These tables provide structured formats for capturing and analyzing loss data, enabling organizations to develop effective risk mitigation strategies. The comprehensive view of operational risk exposure offered by these tables facilitates informed decision-making and strategic planning [Data: Entities (1881, 1882, 1883); Relationships (2132, 2133, 2134)].

## Technical support for SAS Econometrics users

SAS Technical Support provides assistance to users of SAS Econometrics, addressing technical questions and ensuring the effective use of the software's procedures. This support is crucial for users to fully leverage the suite's capabilities in econometric analysis and operational risk management, enhancing the overall user experience and efficiency in data-driven decision-making [Data: Entities (7); Relationships (7)].

## Gilks's impact on Bayesian statistics and MCMC methods

Gilks is an esteemed author recognized for his contributions to statistical methods, particularly in MCMC methods and Bayesian statistics. His work has advanced the understanding and application of these techniques, which are pivotal in various scientific domains. Gilks's research on acceptance rates and scale parameters in Metropolis algorithms has improved the efficiency and accuracy of MCMC simulations, benefiting users of SAS Econometrics [Data: Entities (53); Relationships (43, 44)].


## Relevant Text Sources:

**Source 30** (Rank: N/A)
 and associated cloud services in SAS Viya. This section describes how to create a CAS session and set up a CAS engine libref that you can use to connect to the CAS session. It assumes that you have a CAS server already available; contact your system administrator if you need help starting and terminating a server. This CAS server is identified by specifying the host on which it runs and the port on which it listens for communications. To simplify your interactions with this CAS server, the host information and port information for the server are stored as SAS option values that are retrieved automatically whenever this CAS server needs to be accessed. You can examine the host and port values for the server at your site by using the following statements: 
proc options option=(CASHOST CASPORT); run; 
In addition to starting a CAS server, your system administrator might also have created a CAS session and a CAS engine libref for your use. You can define your own sessions and CAS engine librefs that connect to the CAS server as shown in the following statements: 
cas mysess; libname mylib cas sessref=mysess; 
The CAS statement creates the CAS session named mysess, and the LIBNAME statement creates the mylib CAS engine libref that you use to connect to this session. It is not necessary to explicitly name the CASHOST and CASPORT of the CAS server in the CAS statement, because these values are retrieved from the corresponding SAS option values. 
If you have created the mysess session, you can terminate it by using the TERMINATE option in the CAS statement as follows: 
cas mysess terminate; 
For more information about the CAS statement and the LIBNAME statement, see SAS Cloud Analytic Services: User's Guide. For general information about CAS and CAS sessions, see SAS Cloud Analytic Services: Fundamentals. 

Loading a SAS Data Set onto a CAS Server 
Procedures in this book require the input data to reside on a CAS server. To work with a SAS data set, you must first load the data set onto the CAS server. Data loaded on the CAS server are called data tables. This section lists three methods of loading a SAS data set onto a CAS server. In this section, mylib is the name of the caslib that is connected to the mysess CAS session. 
 You can use a single DATA step to create a data table on the CAS server as follows: 
data mylib.Sample; input y x @@; datalines; 
.461 .472 .573.61 4.62 5.68 6.69 7 ; 
Note that DATA step operations might not work as intended when you perform them on the CAS server instead of the SAS client. 
 You can create a SAS data set first, and when it contains exactly what you want, you can use another DATA step to load it onto the CAS server as follows: 
data Sample; input y x @@; datalines; 
.461 .472 .573.61 4.62 5.68 6.69 7.788 ; data mylib.Sample; 
set Sample; run; 
 You can use the CASUTIL procedure as follows: 
proc casutil sessref=mysess; load data=Sample casout="Sample"; quit; 
The CASUTIL procedure can load data onto a CAS server more efficiently than the DATA step. For more information about the CASUTIL procedure, see SAS Cloud Analytic Services: User's Guide. 
The mylib caslib stores the Sample data table, which can be distributed across many machine nodes. You must use a caslib reference in procedures in this book to enable the SAS client machine to communicate with the CAS session. For example, the following SEVSELECT procedure statements use a data table that resides in the mylib caslib: 
proc sevselect data = mylib.Sample; ...statements...; run; 
You can delete your data table by using the DELETE procedure as follows: 
proc delete data = mylib.Sample; run; 
The Sample data table is accessible only in the mysess session. When you terminate the mysess session, the Sample data table is no longer accessible from the CAS server. If you want your Sample data table to be available to other CAS sessions, then you must promote your data table. For more information about data tables, see SAS Cloud Analytic Services: User's Guide. 


Syntax Common to SAS Econometrics Procedures 
CLASS Statement 
CLASS variable < (options) > ... < variable < (options) > > < / global-options > ; 
This section applies to the following procedures: CNTSELECT, CPANEL, CQLIM, CSPATIALREG, and SEVSELECT. 
The CLASS statement names the classification variables to be used as explanatory variables in the analysis. These variables enter the analysis not through their values, but through levels to which the unique values are mapped. For more information about these mappings, see the section Levelization of Classification Variables on page 89. 
If the procedure permits a classification variable as a response (dependent variable or target), the response does not need to be specified in the CLASS statement. 
You can specify options either as individual variable options, by enclosing the options in parentheses after the variable name, or as global-options, by placing them after a slash (/). Global-options are applied to all variables that are specified in the CLASS statement. If you specify more than one CLASS statement, the global-options that are specified in any one CLASS statement apply to all CLASS statements. However, individual CLASS variable options override the global-options. 
Table 4.1 summarizes the values you can use for either an option or a global-option. The options are described in detail in the list that follows Table 4.1.

**Source 45** (Rank: N/A)
 the vertex that has the 
.k.1/
lowest function value and .j is defined as the vertex that has the highest function value in the simplex. The default value is r = 1E8 for the NMSIMP technique and r = 0 otherwise. 
XSIZE=r 
specifies the XSIZE parameter of the relative parameter termination criterion. The value of r must be greater than or equal to 0; the default is r = 0. For more information, see the XCONV= option. 


Details for SAS Econometrics Procedures 
Levelization of Classification Variables 
This section applies to the following procedures: CNTSELECT, CPANEL, CQLIM, CSPATIALREG, and SEVSELECT. 
A classification variable enters the statistical analysis or model not through its values but through its levels. The process of associating values of a variable with levels is called levelization. 
During the process of levelization, observations that share the same value are assigned to the same level. The manner in which values are grouped can be affected by the inclusion of formats. The sort order of the levels can be determined by specifying the ORDER= option in the procedure statement. In procedures in this book, you can also control the sorting order separately for each variable in the CLASS statement. 
Consider the data on nine observations in Table 4.7. The variable A is integer-valued, and the variable X is a continuous variable that has a missing value for the fourth observation. The fourth and fifth columns of Table 4.7 apply two different formats to the variable X. 
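Table 4.7 Example Data for Levelization 

| Obs | A | x | FORMAT x 3.0 | FORMAT x 3.1 |
| --- | --- | --- | --- | --- |
| 1 | 2 | 1.09 | 1 | 1.1 |
| 2 | 2 | 1.13 | 1 | 1.1 |
| 3 | 2 | 1.27 | 1 | 1.3 |
| 4 | 3 | . | . | . |
| 5 | 3 | 2.26 | 2 | 2.3 |
| 6 | 3 | 2.48 | 2 | 2.5 |
| 7 | 4 | 3.34 | 3 | 3.3 |
| 8 | 4 | 3.34 | 3 | 3.3 |
| 9 | 4 | 3.14 | 3 | 3.1 |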

By default, levelization of the variables groups the observations by the formatted value of the variable, except for numerical variables for which no explicit format is provided. Numerical variables for which no explicit format is provided are sorted by their internal value. The levelization of the four columns in Table 4.7 leads to the level assignment in Table 4.8. 
Table 4.8 Values and Levels 

| Obs | A Value | A Level | X Value | X Level | FORMAT x 3.0 Value | FORMAT x 3.0 Level | FORMAT x 3.1 Value | FORMAT x 3.1 Level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 1 | 1.09 | 1 | 1 | 1 | 1.1 | 1 |
| 2 | 2 | 1 | 1.13 | 2 | 1 | 1 | 1.1 | 1 |
| 3 | 2 | 1 | 1.27 | 3 | 1 | 1 | 1.3 | 2 |
| 4 | 3 | 2 | . | . | . | . | . | . |
| 5 | 3 | 2 | 2.26 | 4 | 2 | 2 | 2.3 | 3 |
| 6 | 3 | 2 | 2.48 | 5 | 2 | 2 | 2.5 | 4 |
| 7 | 4 | 3 | 3.34 | 7 | 3 | 3 | 3.3 | 6 |
| 8 | 4 | 3 | 3.34 | 7 | 3 | 3 | 3.3 | 6 |
| 9 | 4 | 3 | 3.14 | 6 | 3 | 3 | 3.1 | 5 |

The sort order for the levels of CLASS variables can be specified in the ORDER= option in the CLASS statement. 
When ORDER=FORMATTED (which is the default) is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values. To order numeric classification levels that have no explicit format by their BEST12. formatted values, you can specify the BEST12. format explicitly for the CLASS variables. 
Table 4.9 shows how values of the ORDER= option are interpreted. 
Table 4.9 Interpretation of Values of ORDER= Option 

| Value of ORDER= | Levels Sorted By |
| --- | --- |
| FORMATTED | External formatted value, except for numeric variables that have no explicit format, which are sorted by their unformatted (internal) value. The sort order is machine-dependent. |
| FREQ | Descending frequency count (levels that have the most observations come first in the order) |
| INTERNAL | Unformatted value. The sort order is machine-dependent. |

For more information about sort order, see the chapter about the SORT procedure in the Base SAS Procedures Guide and the discussion of BY-group processing in the Grouping Data section of SAS Programmer's Guide: Essentials. 
When the MISSING option is specified in

**Source 10** (Rank: N/A)
 of the log posterior at the posterior mode (Gelman et al. 2004, Appendix B; Schervish 1995, Section 7.4). That is why a normal proposal distribution often works well in practice. The proposal distribution for random walk Metropolis in SAS Econometrics procedures is always the normal distribution. It is generated in the following fashion: 
$$\theta_{\text{prop}} \sim N\left(\theta_{\text{cur}},\; c_{\text{cur}}\,\mathrm{COV}_{\text{cur}}\right)$$
where $\theta_{\text{prop}}$ is the proposed parameter vector, $\theta_{\text{cur}}$ is the previous value of the parameters in the Markov chain, $c_{\text{cur}}$ is a scale parameter that is tuned in the tuning phase of the algorithm, and $\mathrm{COV}_{\text{cur}}$ is an approximation of the posterior covariance that is also tuned in the tuning phase of the algorithm. When there are multiple parameter blocks (that is, in a Metropolis-within-Gibbs scheme), each block is tuned separately, each with its own $c_{\text{cur}}$ and $\mathrm{COV}_{\text{cur}}$. In particular, for a scalar block, $\mathrm{COV}_{\text{cur}}$ is also a scalar: an approximation to the posterior variance of the only parameter in that block. 
The tuning phase consists of nstage tuning stages, each ntune iterations long. Between tuning stages, two sorts of adjustments to the proposal distribution are made: adjustments to the scale parameter ccur and adjustments to the proposal covariance matrix COVcur. 
Scale Tuning 
The acceptance rate of a Metropolis chain is closely related to its sampling efficiency. For a random walk Metropolis algorithm, a high acceptance rate means that most new samples occur right around the current parameter value. Their frequent acceptance means that the Markov chain is moving rather slowly and not fully exploring the parameter space. A low acceptance rate means that the proposed samples are often rejected; hence the chain is not moving much. An efficient random walk Metropolis sampler has an acceptance rate that is neither too high nor too low. The scale parameter ccur in the proposal distribution effectively controls this acceptance probability. Roberts, Gelman, and Gilks (1997) show that if both the target density and the proposal density are normal, the optimal acceptance probability for the Markov chain should be around 0.45 in a one-dimensional problem and should asymptotically approach 0.234 in higher-dimensional problems. The corresponding optimal scale is 2.38. 
Because of the nature of stochastic simulations, it is impossible to fine-tune a set of variables so that the Metropolis chain has exactly the target acceptance rate that you want. In addition, Roberts and Rosenthal (2001) empirically demonstrate that an acceptance rate between 0.15 and 0.5 is at least 80% efficient, so there is really no need to fine-tune the algorithms to reach an acceptance probability that is within a small tolerance of the optimal values. SAS Econometrics procedures tune the scale parameter so that the observed acceptance rate falls within some tolerance of the target acceptance rate. Let $p_{\text{obs}}$ denote the observed acceptance rate in a given tuning stage, let $p_{\text{target}}$ denote the target acceptance rate, and let $\delta$ denote the tolerance. 
Then if $p_{\text{obs}} < p_{\text{target}} - \delta$, the value of $c_{\text{cur}}$ is decreased. On the other hand, if $p_{\text{obs}} > p_{\text{target}} + \delta$, then the value of $c_{\text{cur}}$ is increased. Typically $\delta = 0.075$ is a good choice, $p_{\text{target}} = 0.45$ is a good choice for blocks of parameters larger than three or four dimensions, and $p_{\text{target}} = 0.234$ is a good choice for scalar blocks. 
When $c_{\text{cur}}$ is adjusted between tuning stages, it is done using the following scheme (see footnote 1): 
$$c_{\text{new}} = \frac{c_{\text{cur}}\,\Phi^{-1}\left(p_{\text{target}}/2\right)}{\Phi^{-1}\left(p_{\text{obs}}/2\right)}$$
where $c_{\text{cur}}$ is the scale from the previous tuning stage; $c_{\text{new}}$ is the new scale for the next stage or any further sampling; $p_{\text{obs}}$ is the observed acceptance rate in the previous tuning stage; and $p_{\text{target}}$ is the target acceptance rate. A good choice for the initial value of $c_{\text{cur}}$ is 2.38 because it is optimal for a normal model with a normal proposal. 

Footnote 1: Roberts, Gelman, and Gilks (1997) and Roberts and Rosenthal (2001) demonstrate that the relationship between acceptance probability and scale in a random walk Metropolis scheme is $p = 2\Phi\left(-\sqrt{I}\,c/2\right)$, where $c$ is the scale; $p$ is the acceptance rate; $\Phi$ is the cumulative distribution function of a standard normal distribution; and $I \equiv E_f\left[\left(f'(x)/f(x)\right)^2\right]$, where $f(x)$ is the density function of samples. This relationship determines the updating scheme, with $I$ replaced by the identity matrix to simplify calculation. 

Covariance Tuning 
When the proposal scale parameter is adjusted between stages, the proposal covariance matrix is also adjusted. To do this, SAS Econometrics procedures take a weighted average of the previously used covariance matrix and the observed covariance matrix in the last tuning stage. The formula to update the covariance matrix is 
$$\mathrm{COV}_{\text{new}} = w\,\mathrm{COV}_{\text{obs}} + (1 - w)\,\mathrm{COV}_{\text{cur}}$$
where $w$ is the tuning weight, between 0 and 1; $\mathrm{COV}_{\text{cur}}$ is the covariance matrix that was used in the previous tuning stage; $\mathrm{COV}_{\text{obs}}$ is the covariance matrix that was observed in the previous tuning stage; and $\mathrm{COV}_{\text{new}}$ is the new covariance matrix that will be used for the next tuning stage or any further sampling. Larger weights cause the sampler to adapt the proposal covariance matrix faster during the tuning process, but if this value is too large, the sampler might never settle into a good choice. Typically, a good tuning weight is $w = 0.75$. If the initial value of $\mathrm{COV}_{\text{cur}}$ is very different from the posterior covariance matrix, many tuning stages might be



=== END OF CONTEXT ===


# Understanding Relationship Weights in GraphRAG

In GraphRAG, the **weight** property on relationships represents the **strength** or **importance** of the connection between two entities. Here's what it means:

## What is Weight?

**Weight** is a numerical value that indicates:
- **Frequency**: How often two entities appear together in the source documents
- **Co-occurrence strength**: The statistical significance of their relationship
- **Semantic closeness**: How tightly connected the entities are in the knowledge graph

## How is Weight Calculated?

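The exact calculation depends on the indexing method and the GraphRAG version. The weight is typically derived from:
1. **Relationship strength**: In LLM-based extraction, the model is prompted to assign a numeric strength score to each relationship it extracts
2. **Repeated extraction**: When the same source/target pair is extracted from several text units, the records are merged and their scores are aggregated into a single weight
3. **Text co-occurrence**: In NLP-based (non-LLM) extraction, how many times the entities appear together in the same text units/chunks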
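
To make the idea concrete, here is a minimal sketch of how repeated extractions of the same entity pair could be folded into a single weight. The table, the column names (`strength`, `text_unit_id`), and the choice of summing the scores are assumptions made for illustration. They are not read from the parquet files in this notebook, and your GraphRAG version may aggregate differently.

```python
import pandas as pd

# Hypothetical per-text-unit extractions: one row each time a relationship was found.
raw_extractions = pd.DataFrame(
    {
        "source": ["A", "A", "B", "A"],
        "target": ["B", "B", "C", "B"],
        "strength": [8.0, 7.0, 2.0, 8.0],  # assumed per-extraction strength score
        "text_unit_id": ["t1", "t2", "t2", "t3"],
    }
)

# Merge duplicate (source, target) pairs. Summing the scores is an assumption:
# it makes a pair that is extracted often AND strongly end up with a high weight.
merged = (
    raw_extractions.groupby(["source", "target"], sort=False)
    .agg(
        weight=("strength", "sum"),
        occurrences=("text_unit_id", "nunique"),
    )
    .reset_index()
)

print(merged)
```

Under this assumption, a pair that is extracted three times with moderate scores ends up with a higher weight than a pair that is extracted once, which is why a weight reads better as a relative ranking signal than as an absolute score.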

## Weight Values

- **Higher weights** (e.g., 8.0, 10.0): Strong, frequently occurring relationships
- **Lower weights** (e.g., 1.0, 2.0): Weaker or less frequent relationships
- **Weight = 1.0**: Often the default/minimum weight for detected relationships

## Usage in GraphRAG

Weights can be used for:
- **Ranking relationships**: When relationships are ranked by weight, stronger relationships get higher priority in search results
- **Graph visualization**: Thicker edges represent stronger relationships (as seen in the yfiles visualization)
- **Context selection**: Higher-ranked relationships are more likely to fit into the limited LLM context
- **Graph algorithms**: Weighted centrality and community detection algorithms can use the weights to identify key entities
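
As a rough illustration of two of these uses, the sketch below maps weights to an edge thickness and computes a weighted degree per entity. The edge table and the helper `weight_to_thickness` are hypothetical and only use pandas. They do not reproduce how `yfiles-jupyter-graphs` or GraphRAG handle weights internally.

```python
import pandas as pd

# Hypothetical edge list with weights (not taken from the notebook's data).
edges = pd.DataFrame(
    {
        "source": ["A", "A", "B", "C"],
        "target": ["B", "C", "C", "D"],
        "weight": [23.0, 9.0, 1.0, 16.0],
    }
)


def weight_to_thickness(weights: pd.Series, min_px: float = 1.0, max_px: float = 8.0) -> pd.Series:
    """Linearly rescale weights to a pixel range (hypothetical helper)."""
    lo, hi = weights.min(), weights.max()
    if hi == lo:
        # All weights are equal: use a uniform mid-range thickness.
        return pd.Series((min_px + max_px) / 2, index=weights.index)
    return min_px + (weights - lo) / (hi - lo) * (max_px - min_px)


edges["thickness"] = weight_to_thickness(edges["weight"])

# Weighted degree: sum of the weights of all edges touching each entity,
# counting an edge once for its source and once for its target.
weighted_degree = (
    pd.concat(
        [
            edges.set_index("source")["weight"],
            edges.set_index("target")["weight"],
        ]
    )
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)

print(edges)
print(weighted_degree)
```

A linear rescale keeps the visual ordering of edges identical to the ordering of their weights. With a heavily skewed weight distribution, a logarithmic scale may be easier to read.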

Let's examine the weights in our current result:

In [24]:
# Analyze relationship weights in our current search result
print("=== RELATIONSHIP WEIGHT ANALYSIS ===\n")

if "relationships" in result.context_data:
    # Work on a copy so the helper column added below does not leak into result.context_data
    relationships_df = result.context_data["relationships"].copy()
    
    if 'weight' in relationships_df.columns:
        # Convert weights to numeric, handling any string values
        relationships_df['weight_numeric'] = pd.to_numeric(relationships_df['weight'], errors='coerce')
        weights = relationships_df['weight_numeric'].dropna()
        
        if len(weights) > 0:
            print("📊 Weight Statistics:")
            print(f"   - Total relationships: {len(relationships_df)}")
            print(f"   - Relationships with weights: {len(weights)}")
            print(f"   - Weight range: {weights.min():.2f} to {weights.max():.2f}")
            print(f"   - Average weight: {weights.mean():.2f}")
            print(f"   - Median weight: {weights.median():.2f}")
            
            # Report the strongest and weakest relationships with one shared loop
            rankings = [
                ("\n🔝 Top 5 Strongest Relationships (by weight):", relationships_df.nlargest),
                ("🔻 Bottom 5 Weakest Relationships (by weight):", relationships_df.nsmallest),
            ]
            columns = ['source', 'target', 'weight_numeric', 'description']
            for title, select in rankings:
                print(title)
                for _, row in select(5, 'weight_numeric')[columns].iterrows():
                    weight_val = row['weight_numeric']
                    # Rows whose weight could not be parsed are shown as N/A
                    weight_text = f"{weight_val:.2f}" if pd.notna(weight_val) else "N/A"
                    print(f"   {row['source']} → {row['target']} (Weight: {weight_text})")
                    desc = str(row['description'])[:100] if pd.notna(row['description']) else "No description"
                    print(f"      Description: {desc}...")
                    print()
            
            # Weight distribution
            print("📈 Weight Distribution:")
            weight_bins = [0, 1, 2, 5, 10, float('inf')]
            weight_labels = ['0-1', '1-2', '2-5', '5-10', '10+']
            
            for label, low, high in zip(weight_labels, weight_bins[:-1], weight_bins[1:]):
                # Bins are half-open [low, high); the last bin is open-ended
                if high == float('inf'):
                    count = len(weights[weights >= low])
                else:
                    count = len(weights[(weights >= low) & (weights < high)])
                
                percentage = count / len(weights) * 100 if len(weights) > 0 else 0
                print(f"   - Weight {label}: {count} relationships ({percentage:.1f}%)")
        else:
            print("❌ No valid numeric weights found in relationships data")
            print(f"Sample weight values: {relationships_df['weight'].head().tolist()}")
    else:
        print("❌ No 'weight' column found in relationships data")
        print(f"Available columns: {list(relationships_df.columns)}")
else:
    print("❌ No relationships found in context data")

=== RELATIONSHIP WEIGHT ANALYSIS ===

📊 Weight Statistics:
   - Total relationships: 20
   - Relationships with weights: 20
   - Weight range: 1.00 to 23.00
   - Average weight: 7.35
   - Median weight: 8.00

🔝 Top 5 Strongest Relationships (by weight):
   PROC CCDM → SAS ECONOMETRICS (Weight: 23.00)
      Description: PROC CCDM is a specialized procedure integrated within the SAS Econometrics software suite, designed...

   SAS ECONOMETRICS → SAS/ETS (Weight: 16.00)
      Description: SAS Econometrics and SAS/ETS are software suites designed to facilitate econometric and time series ...

   SAS INSTITUTE INC. → SAS ECONOMETRICS (Weight: 9.00)
      Description: SAS Institute Inc. developed and provides the SAS Econometrics suite...

   SAS → SAS ECONOMETRICS (Weight: 9.00)
      Description: SAS Econometrics is a product within the SAS software suite...

   SAS ECONOMETRICS → UCM PROCEDURE (Weight: 9.00)
      Description: The UCM Procedure is a part of the SAS Econometrics software s