# Visualizing the knowledge graph with `yfiles-jupyter-graphs`

This notebook is a partial copy of [local_search.ipynb](../../local_search.ipynb). It shows how to use `yfiles-jupyter-graphs` to add interactive graph visualizations of the parquet files, and how to visualize the result context of `graphrag` queries (see the end of this notebook).

In [13]:
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.

In [14]:
import os

import pandas as pd
import tiktoken

from graphrag.config.enums import ModelType
from graphrag.config.models.language_model_config import LanguageModelConfig
from graphrag.language_model.manager import ModelManager
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.lancedb import LanceDBVectorStore

## Local Search Example

The local search method generates answers by combining relevant data from the AI-extracted knowledge graph with text chunks of the raw documents. This method is suitable for questions that require an understanding of specific entities mentioned in the documents (e.g., "What are the healing properties of chamomile?").

### Load text units and graph data tables as context for local search

- In this example, we first load the indexing outputs from parquet files into dataframes, then convert these dataframes into collections of data objects that align with the knowledge model.

### Load tables to dataframes

In [15]:
INPUT_DIR = "/home/chuaxu/projects/graphrag/ragsas/output"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"

COMMUNITY_REPORT_TABLE = "community_reports"
COMMUNITY_TABLE = "communities"
ENTITY_TABLE = "entities"
RELATIONSHIP_TABLE = "relationships"
COVARIATE_TABLE = "covariates"
TEXT_UNIT_TABLE = "text_units"
COMMUNITY_LEVEL = 2
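
Before reading the tables, it can help to confirm that the expected parquet files are actually present in `INPUT_DIR`. The following is a minimal sketch using only the standard library. The helper name `check_tables` is hypothetical, and treating `covariates` as the only optional table is an assumption based on how this notebook handles it later.

```python
from pathlib import Path
import tempfile

# Table names copied from the cell above. Which tables are optional is an assumption.
REQUIRED_TABLES = ["community_reports", "communities", "entities", "relationships", "text_units"]
OPTIONAL_TABLES = ["covariates"]


def check_tables(input_dir, required=REQUIRED_TABLES, optional=OPTIONAL_TABLES):
    """Return (missing_required, missing_optional) lists of table names (hypothetical helper)."""
    base = Path(input_dir)
    missing_required = [t for t in required if not (base / f"{t}.parquet").is_file()]
    missing_optional = [t for t in optional if not (base / f"{t}.parquet").is_file()]
    return missing_required, missing_optional


# Demo against a throwaway directory that holds only two (empty) table files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("entities", "relationships"):
        (Path(tmp) / f"{name}.parquet").touch()
    demo_missing_required, demo_missing_optional = check_tables(tmp)

print(demo_missing_required)
print(demo_missing_optional)
```

In the notebook itself, one would call `check_tables(INPUT_DIR)` and stop early if the first list is non-empty.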

#### Read entities

In [16]:
# read the entities and communities tables
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
community_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet")

#### Read relationships

In [17]:
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

# Visualizing nodes and relationships with `yfiles-jupyter-graphs`

`yfiles-jupyter-graphs` is a graph visualization extension that provides interactive and customizable visualizations for structured node and relationship data.

In this case, we use it to provide an interactive visualization for the knowledge graph of the [local_search.ipynb](../../local_search.ipynb) sample by passing node and relationship lists converted from the given parquet files. The input data requires an `id` attribute for each node and `start`/`end` properties for each relationship that correspond to node ids. Additional attributes can be added in the `properties` of each node/relationship dict:

In [18]:
%pip install yfiles_jupyter_graphs --quiet
from yfiles_jupyter_graphs import GraphWidget
import numpy as np
import pandas as pd


def clean_value(value):
    """Clean a value to make it JSON serializable."""
    # Handle arrays first: pd.isna() on an array returns an array, not a bool
    if isinstance(value, (np.ndarray, list)):
        # Convert arrays to strings, or take the first element if there is only one
        if len(value) == 0:
            return None
        if len(value) == 1:
            return str(value[0])
        return str(value)
    # Scalars: map missing values (None, NaN, NaT) to None
    if pd.isna(value):
        return None
    # Convert numpy numbers to Python numbers, dropping infinities
    if isinstance(value, (np.integer, np.floating)):
        return None if np.isinf(value) else value.item()
    # Plain Python floats can still be infinite
    if isinstance(value, float) and np.isinf(value):
        return None
    return value


def convert_entities_to_dicts(df):
    """Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs."""
    nodes_dict = {}
    for _, row in df.iterrows():
        # Use the title as node id and keep only the first row per title
        node_id = row["title"]
        if node_id not in nodes_dict:
            nodes_dict[node_id] = {
                "id": node_id,
                # Clean all properties to make them JSON serializable
                "properties": {k: clean_value(v) for k, v in row.to_dict().items()},
            }
    return list(nodes_dict.values())


def convert_relationships_to_dicts(df):
    """Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs."""
    relationships = []
    for _, row in df.iterrows():
        # Create one edge dict per row, with JSON-serializable properties
        relationships.append({
            "start": row["source"],
            "end": row["target"],
            "properties": {k: clean_value(v) for k, v in row.to_dict().items()},
        })
    return relationships


w = GraphWidget()
w.directed = True
w.nodes = convert_entities_to_dicts(entity_df)
w.edges = convert_relationships_to_dicts(relationship_df)

Note: you may need to restart the kernel to use updated packages.
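
To make the expected input shape concrete, here is a minimal sketch of hand-built node and edge lists in the `id` / `start` / `end` / `properties` format described above. It also checks that every edge endpoint refers to an existing node id. The sample data and the helper name `find_dangling_edges` are hypothetical and not part of `yfiles-jupyter-graphs`.

```python
def find_dangling_edges(nodes, edges):
    """Return the edges whose start or end id does not match any node id."""
    node_ids = {node["id"] for node in nodes}
    return [
        edge
        for edge in edges
        if edge["start"] not in node_ids or edge["end"] not in node_ids
    ]


# Hypothetical sample data in the same shape the converters above produce.
sample_nodes = [
    {"id": "A", "properties": {"title": "A", "degree": 2}},
    {"id": "B", "properties": {"title": "B", "degree": 1}},
]
sample_edges = [
    {"start": "A", "end": "B", "properties": {"weight": 3.0}},
    {"start": "A", "end": "C", "properties": {"weight": 1.0}},  # "C" is not a node
]

dangling = find_dangling_edges(sample_nodes, sample_edges)
print(dangling)
```

The same check could be run on the converted `entity_df` and `relationship_df` lists before assigning them to the widget.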


## Configure data-driven visualization

The additional properties can be used to configure the visualization for different use cases.

In [19]:
# show title on the node
w.node_label_mapping = "title"


# map community to a color
def community_to_color(community):
    """Map a community to a color."""
    colors = [
        "crimson",
        "darkorange",
        "indigo",
        "cornflowerblue",
        "cyan",
        "teal",
        "green",
    ]
    return (
        colors[int(community) % len(colors)] if community is not None else "lightgray"
    )


def edge_to_source_community(edge):
    """Get the community of the source node of an edge."""
    source_node = next(
        (entry for entry in w.nodes if entry["properties"]["title"] == edge["start"]),
        None,
    )
    if source_node is None:
        return None
    # Handle a missing community property gracefully
    return source_node["properties"].get("community")


w.node_color_mapping = lambda node: community_to_color(node["properties"].get("community", None))
w.edge_color_mapping = lambda edge: community_to_color(edge_to_source_community(edge))
# map size data to a reasonable factor; treat a missing or None size as 0
w.node_scale_factor_mapping = lambda node: 0.5 + (node["properties"].get("size") or 0) * 1.5 / 20
# use weight for edge thickness
w.edge_thickness_factor_mapping = "weight"
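
`edge_to_source_community` scans the whole node list for every edge, which can become slow on larger graphs. One possible alternative is to build a lookup table once and reuse it. This is a sketch, not the notebook's method: the helper names are hypothetical, and it assumes node ids equal the node titles, as they do in `convert_entities_to_dicts`.

```python
def build_community_index(nodes):
    """Map each node id to its community, or None when the property is absent."""
    return {node["id"]: node["properties"].get("community") for node in nodes}


def make_edge_to_source_community(nodes):
    """Return a function that looks up an edge's source community in O(1)."""
    index = build_community_index(nodes)  # snapshot: rebuild it if the nodes change

    def edge_to_source_community(edge):
        return index.get(edge["start"])

    return edge_to_source_community


# Hypothetical sample data for illustration.
demo_nodes = [
    {"id": "A", "properties": {"title": "A", "community": 3}},
    {"id": "B", "properties": {"title": "B"}},  # no community assigned
]
lookup = make_edge_to_source_community(demo_nodes)
print(lookup({"start": "A", "end": "B"}))
print(lookup({"start": "B", "end": "A"}))
print(lookup({"start": "Z", "end": "A"}))
```

With the widget from this notebook, the function would be created once via `make_edge_to_source_community(w.nodes)` and then used inside the edge color mapping.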

## Automatic layouts

The widget provides different automatic layouts that serve different purposes: `Circular`, `Hierarchic`, `Organic (interactive or static)`, `Orthogonal`, `Radial`, `Tree`, `Geo-spatial`.

For the knowledge graph, this sample uses the `Circular` layout, though `Hierarchic` or `Organic` are also suitable choices.

In [20]:
# Use the circular layout for this visualization. For larger graphs, the default organic layout is often preferable.
w.circular_layout()
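
The comment above suggests switching layouts depending on graph size. A minimal sketch of that idea follows. The threshold of 200 nodes is an arbitrary, hypothetical value, and `organic_layout` is assumed to be available on the widget alongside the `circular_layout` method used above.

```python
def pick_layout_method(node_count, threshold=200):
    """Return the name of a layout method based on the number of nodes.

    The threshold is a hypothetical cut-off, not a recommendation from the library.
    """
    return "circular_layout" if node_count <= threshold else "organic_layout"


def apply_layout(widget, threshold=200):
    """Call the chosen layout method on the widget and return its name."""
    method_name = pick_layout_method(len(widget.nodes), threshold)
    getattr(widget, method_name)()  # assumes the widget exposes this method
    return method_name


print(pick_layout_method(50))
print(pick_layout_method(5000))
```

In this notebook, `apply_layout(w)` would replace the direct call to `w.circular_layout()`.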

## Display the graph

In [21]:
display(w)

GraphWidget(layout=Layout(height='800px', width='100%'))

# Visualizing the result context of `graphrag` queries

The result context of a `graphrag` query lets you inspect the context graph of the request. This data can similarly be visualized as a graph with `yfiles-jupyter-graphs`.

## Making the request

The following cell recreates the sample queries from [local_search.ipynb](../../local_search.ipynb).

In [22]:
# setup (see also ../../local_search.ipynb)
entities = read_indexer_entities(entity_df, community_df, COMMUNITY_LEVEL)

description_embedding_store = LanceDBVectorStore(
    collection_name="default-entity-description",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)

# Covariates are optional: fall back to an empty dict if the parquet file doesn't exist
try:
    covariate_df = pd.read_parquet(f"{INPUT_DIR}/{COVARIATE_TABLE}.parquet")
    claims = read_indexer_covariates(covariate_df)
    covariates = {"claims": claims}
except FileNotFoundError:
    print("Covariate file not found, proceeding without covariates")
    covariates = {}

report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, community_df, COMMUNITY_LEVEL)
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)

# Load configuration from settings.yaml
from graphrag.config.load_config import load_config
from pathlib import Path

config_path = Path("/home/chuaxu/projects/graphrag/ragsas")
config = load_config(config_path)

# Get model configurations from the loaded config
chat_model_config = config.get_language_model_config("default_chat_model")
embedding_model_config = config.get_language_model_config("default_embedding_model")

chat_model = ModelManager().get_or_create_chat_model(
    name="local_search",
    model_type=chat_model_config.type,
    config=chat_model_config,
)

token_encoder = tiktoken.encoding_for_model(chat_model_config.model)

text_embedder = ModelManager().get_or_create_embedding_model(
    name="local_search_embedding",
    model_type=embedding_model_config.type,
    config=embedding_model_config,
)

context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    covariates=covariates,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 80_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}

model_params = {
    "max_tokens": 3_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}

search_engine = LocalSearch(
    model=chat_model,
    context_builder=context_builder,
    token_encoder=token_encoder,
    model_params=model_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

Covariate file not found, proceeding without covariates
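
To get a feel for what `text_unit_prop` and `community_prop` mean in `local_context_params`, the arithmetic can be sketched as a simple budget split. This assumes the proportions partition `max_tokens` linearly, with the remainder going to entities and relationships. The actual context builder may round differently or reserve tokens for conversation history, so treat the numbers as an approximation. The helper name `split_token_budget` is hypothetical.

```python
def split_token_budget(max_tokens, text_unit_prop, community_prop):
    """Split a context token budget by proportion (illustrative assumption only)."""
    if text_unit_prop < 0 or community_prop < 0 or text_unit_prop + community_prop > 1:
        msg = "proportions must be non-negative and sum to at most 1"
        raise ValueError(msg)
    text_unit_tokens = round(max_tokens * text_unit_prop)
    community_tokens = round(max_tokens * community_prop)
    # Whatever is left is assumed to go to entities and relationships
    local_tokens = max_tokens - text_unit_tokens - community_tokens
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        "entities_and_relationships": local_tokens,
    }


# Values taken from local_context_params above.
budget = split_token_budget(80_000, text_unit_prop=0.5, community_prop=0.1)
print(budget)
```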


## Run local search on sample queries

In [23]:
result = await search_engine.search("Tell me about Agent Mercer")
print(result.response)

I'm sorry, but there is no information available about Agent Mercer in the provided data tables. If you have any other questions or need information on a different topic, feel free to ask!


In [24]:
from IPython.display import Markdown, display

# question = "How do different pricing strategies impact the business revenue?"
question = "Which published studies in our knowledge base used both panel data methods and cointegration analysis on emerging market economies?"
result = await search_engine.search(question)

# Display as formatted Markdown instead of plain text
display(Markdown(result.response))



Based on the data provided, there is no specific mention of published studies that used both panel data methods and cointegration analysis on emerging market economies. The available data primarily focuses on dynamic panel estimation methods and various econometric techniques, such as those implemented in the PROC CPANEL procedure within SAS software. This procedure is designed for dynamic panel estimation and econometric analysis, incorporating methods like the Swamy-Arora Method, Wansbeek-Kapteyn Method, and Windmeijer's bias correction techniques [Data: Reports (17); Entities (900, 901, 906, 891, +more); Relationships (934, 935, 1035, +more)].

While the PROC CPANEL procedure is a central tool for panel data analysis, including dynamic panel models and instrumental variables regression, there is no explicit reference to cointegration analysis or its application to emerging market economies within the provided datasets [Data: Reports (17); Sources (199)].

If you have access to additional datasets or specific studies, I recommend reviewing them for any mention of cointegration analysis in the context of emerging market economies. Alternatively, you may need to consult external academic databases or publications for more comprehensive information on this topic.

## Inspecting the context data used to generate the response

In [25]:
result.context_data["entities"].head()

| | id | entity | description | number of relationships | in_context |
|---|---|---|---|---|---|
| 0 | 900 | ARELLANO AND BOND | Arellano and Bond are researchers renowned for... | 3 | True |
| 1 | 901 | BLUNDELL AND BOND | Blundell and Bond are renowned researchers in ... | 2 | True |
| 2 | 992 | SONG | An author who discussed the Fuller-Battese met... | 2 | True |
| 3 | 899 | BALTAGI | BALTAGI is a distinguished economist and resea... | 13 | True |
| 4 | 906 | AMEMIYA AND MACURDY | Amemiya and MaCurdy are researchers whose mode... | 1 | True |


In [26]:
result.context_data["relationships"].head()

| | id | source | target | description | weight | links | in_context |
|---|---|---|---|---|---|---|---|
| 0 | 934 | CPANEL PROCEDURE | ARELLANO AND BOND | The CPANEL procedure is a statistical method d... | 14.0 | 2 | True |
| 1 | 935 | CPANEL PROCEDURE | BLUNDELL AND BOND | The CPANEL procedure is a sophisticated statis... | 8.0 | 2 | True |
| 2 | 1035 | PROC CPANEL | ARELLANO AND BOND | PROC CPANEL implements the Arellano and Bond t... | 1.0 | 1 | True |
| 3 | 1010 | SONG | BALTAGI | Baltagi and Song co-authored a discussion on t... | 7.0 | 1 | True |
| 4 | 1026 | ARELLANO AND BOND | WINDMEIJER | Windmeijer built upon the work of Arellano and... | 5.0 | 1 | True |
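
The bracketed `[Data: ...]` citations in the response appear to refer to the `id` columns of these context tables. The following is a minimal sketch for extracting the cited ids from the response text so they can be cross-checked against `result.context_data`. It assumes the citation format seen in the response above. The helper name `extract_citations` is hypothetical.

```python
import re

# Matches one "[Data: ...]" block and captures its contents.
CITATION_BLOCK = re.compile(r"\[Data:\s*([^\]]+)\]")
# Matches one "Kind (id, id, ...)" item inside a block.
CITATION_ITEM = re.compile(r"(\w+)\s*\(([^)]*)\)")


def extract_citations(response_text):
    """Return {kind: [ids]} for all citations, keeping first-seen order without duplicates."""
    citations = {}
    for block in CITATION_BLOCK.findall(response_text):
        for kind, id_list in CITATION_ITEM.findall(block):
            bucket = citations.setdefault(kind, [])
            # "+more" markers contain no digits and are skipped automatically
            for token in re.findall(r"\d+", id_list):
                if int(token) not in bucket:
                    bucket.append(int(token))
    return citations


# Citation strings copied from the response shown earlier.
sample_text = (
    "... [Data: Reports (17); Entities (900, 901, 906, 891, +more); "
    "Relationships (934, 935, 1035, +more)]. ... [Data: Reports (17); Sources (199)]."
)
cited = extract_citations(sample_text)
print(cited)
```

In the notebook, `extract_citations(result.response)` would give the ids to look up in the `entities` and `relationships` context dataframes.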


## Visualizing the result context as graph

In [27]:
"""
Helper function to visualize the result context with `yfiles-jupyter-graphs`.

The dataframes are converted into supported nodes and relationships lists and then passed to yfiles-jupyter-graphs.
Additionally, some values are mapped to visualization properties.
"""


def show_graph(result):
    """Visualize the result context with yfiles-jupyter-graphs."""
    from yfiles_jupyter_graphs import GraphWidget

    if (
        "entities" not in result.context_data
        or "relationships" not in result.context_data
    ):
        msg = "The passed results do not contain 'entities' or 'relationships'"
        raise ValueError(msg)

    # converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs
    def convert_entities_to_dicts(df):
        """Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs."""
        nodes_dict = {}
        for _, row in df.iterrows():
            # Create a dictionary for each row and collect unique nodes
            node_id = row["entity"]
            if node_id not in nodes_dict:
                nodes_dict[node_id] = {
                    "id": node_id,
                    "properties": row.to_dict(),
                }
        return list(nodes_dict.values())

    # converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs
    def convert_relationships_to_dicts(df):
        """Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs."""
        relationships = []
        for _, row in df.iterrows():
            # Create a dictionary for each row
            relationships.append({
                "start": row["source"],
                "end": row["target"],
                "properties": row.to_dict(),
            })
        return relationships

    w = GraphWidget()
    # use the converted data to visualize the graph
    w.nodes = convert_entities_to_dicts(result.context_data["entities"])
    w.edges = convert_relationships_to_dicts(result.context_data["relationships"])
    w.directed = True
    # show the entity name as the node label
    w.node_label_mapping = "entity"
    # use weight for edge thickness
    w.edge_thickness_factor_mapping = "weight"
    display(w)


show_graph(result)

GraphWidget(layout=Layout(height='620px', width='100%'))
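
Unlike the converters earlier in this notebook, `show_graph` passes `row.to_dict()` through without cleaning the values. If a context dataframe contains NaN or infinite floats, those are not valid JSON. A standard-library sketch of a cleaning step that could be applied to the properties is shown below. The helper names are hypothetical, the NaN description in the sample row is made up for illustration, and numpy types other than `np.float64` would need extra handling.

```python
import json
import math


def to_json_safe(value):
    """Replace NaN and infinite floats with None; return other values unchanged."""
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return None
    return value


def clean_properties(properties):
    """Apply to_json_safe to every value of a properties dict."""
    return {key: to_json_safe(val) for key, val in properties.items()}


# Hypothetical row: the NaN description is made up for illustration.
raw = {"entity": "BALTAGI", "description": float("nan"), "number of relationships": 13}
cleaned = clean_properties(raw)

# allow_nan=False makes json.dumps raise on NaN/inf instead of emitting invalid JSON
encoded = json.dumps(cleaned, allow_nan=False)
print(encoded)
```

Inside `show_graph`, this would mean using `clean_properties(row.to_dict())` wherever `row.to_dict()` is assigned to `"properties"`.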