# Tokenflow: Running the RAI GRS Pipeline and Performing Data Analysis on Agents Data

### Pipeline Execution Methods
After installing the native app in your Snowflake account, you have three options to run the pipeline for input documents:

1. **SQL Worksheet** - Execute pipeline steps directly from a new SQL worksheet in Snowsight
2. **Streamlit User Interface** - Run the pipeline through the application's UI
3. **Python SDK** - Use the Python SDK (currently under development)

#### Project Configuration
Every new project requires a configuration that customizes the pipeline's execution. This configuration is stored in YAML format in the `CONFIG` column of the `PROJECTS` table and includes the prompts for the tasks that require LLM calls, algorithm parameters (e.g., retrieval settings), and other pipeline execution settings.

For the purposes of this demo, we have already adapted the prompts to our specific use case (Knowledge Graph (KG) extraction for the Tokenflow agents data), but you can customize the prompts and the rest of the configuration by opening the YAML file in any text editor, editing the desired parameters, and saving the changes.
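
As a rough illustration of the configuration's shape, the sketch below checks that a loaded configuration dictionary contains some of the keys this notebook reads later. The key paths are taken from the cells below; the helper name `missing_config_keys` and the sample values are hypothetical, and the real file may contain many more settings.

```python
# Key paths read later in this notebook (the real config may hold more settings).
REQUIRED_PATHS = [
    ("config", "project", "id"),
    ("config", "project", "name"),
    ("config", "models", "completion"),
    ("config", "models", "embeddings"),
    ("config", "operations", "question_answering", "retrieval", "similarity_top_k"),
    ("config", "operations", "question_answering", "retrieval", "type"),
]


def missing_config_keys(configuration: dict) -> list:
    """Return the dotted key paths that are absent from the configuration."""
    missing = []
    for path in REQUIRED_PATHS:
        node = configuration
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Hypothetical sample values, for illustration only.
sample = {
    "config": {
        "project": {"id": "tokenflow", "name": "Tokenflow agents"},
        "models": {"completion": "example-llm", "embeddings": "example-embedder"},
        "operations": {"question_answering": {"retrieval": {"similarity_top_k": 10}}},
    }
}
print(missing_config_keys(sample))
# ['config.operations.question_answering.retrieval.type']
```

Running such a check right after loading the YAML file can surface a mistyped key before any pipeline step is called.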

#### Notebook-Based Pipeline Execution
This notebook demonstrates running the pipeline by wrapping SQL statements into Python Snowflake Connector calls. The notebook involves:
- Configuration Loading: Read the configuration YAML file from disk
- Pipeline Invocation: Execute the corresponding pipeline steps using direct SQL calls through the Python Snowflake Connector
- Post-Processing and Analysis: After pipeline execution, the workflow includes:
    - Export of the GenAI generated structured data: Export agent data, including agent properties, to a separate Snowflake table
    - Visualization: Generate and display the Knowledge Graph visualization


**Note:** In this notebook, we demonstrate only the basic pipeline execution. However, the application also supports fine-tuning one of the available Cortex LLMs on your own documents. For the purposes of this demo, we use the default LLMs for both the extraction and question answering steps.

**Alternative Workflow:** If you have already run the pipeline through the UI, you can use this notebook by skipping the initial sections related to pipeline execution. Instead, go directly to the *'Load extracted graph data from Snowflake'* section. This allows you to proceed straight to the post-processing and visualization phases.

In [None]:
import pandas as pd
import yaml
import json
from tqdm import tqdm
from datetime import datetime, timezone
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

In [None]:
# pip install "snowflake-connector-python[pandas]"

In [None]:
# Connect to Snowflake.
# Note: MFA must be temporarily disabled on the Snowflake account before running code with the Snowflake Python connector.
# You can do this by running the following query in a new SQL Worksheet inside Snowsight: ALTER USER <your_username> SET MINS_TO_BYPASS_MFA = 900;

# You can find the account_identifier by running the following query in Snowflake: SELECT CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME();
account_identifier = "NDSOEBE-RAI_PROD_GEN_AI_AWS_US_WEST_2_CONSUMER"  
# Your credentials for logging into the same Snowflake account.
username = ""
password = ""
# Use consumer role for the usage of the app.
role = "RAI_GRS_CONSUMER_ADMIN_ROLE"
# The database name is the native app's installation name.
database = "RAI_GRS_ILIAS"
# The APP schema contains the functions and procedures, while the DATA schema holds the related data from pipeline execution.
schema = "DATA"  
# Your warehouse name.
warehouse = "RAI_GRS_WAREHOUSE"

conn = snowflake.connector.connect(
                user=username,
                password=password,
                account=account_identifier,
                role=role, 
                database=database,
                schema=schema,
                warehouse=warehouse
            )
            
            
# Create a cursor for this connection.
cursor = conn.cursor()

## Pipeline's configuration

In [None]:
# Read the configuration for this experiment run.
with open('config.yml', 'r') as file:
    configuration_yaml = yaml.safe_load(file)

In [None]:
# View the config: if you want to make any changes, just edit the file using a text editor, save your changes and rerun the above cell.
# configuration_yaml

In [None]:
# TODO: search for TODO throughout the notebook.

In [None]:
# The configuration has been loaded as a Python dictionary, so we can access its individual parts directly.
project_id = configuration_yaml['config']['project']['id']
project_name = configuration_yaml['config']['project']['name']
project_comments = configuration_yaml['config']['project']['comments']

print(f'Project with id "{project_id}" has comments "{project_comments}".')

In [None]:
# app_suffix = ""  # Empty if you want to run it on RAI_GRS
app_suffix = "_ILIAS"  # TODO: delete this and replace "{app_suffix}" with an empty string

In [None]:
# If this is a new project, we need to insert it as a new record into the PROJECTS table.
# All changes made using the cursor are automatically saved to the SF cloud database.
# Important: Run this only once!


extraction_call = f"""
    INSERT INTO RAI_GRS{app_suffix}.DATA.PROJECTS (ID, NAME, CONFIG, COMMENTS)
    VALUES (%s, %s, %s, %s)
    """

# Execute the SQL call.
cursor.execute(extraction_call, (
    project_id,
    project_name,
    None,  # The configuration is uploaded in the next cell.
    project_comments
))
    

# Fetch and print all rows returned by the Snowflake query executed via the cursor
results = cursor.fetchall() 
for row in results:
    print(row)

In [None]:
# Now we will upload the configuration.
# Prepare config for uploading.
json_str = json.dumps(configuration_yaml)

# Use binding – the connector will do the proper quoting/escaping
sql = f"""
CALL RAI_GRS{app_suffix}.app.save_config(
   %s,
   PARSE_JSON(%s)
);
"""

cursor.execute(sql, (project_id, json_str))

results = cursor.fetchall() 
for row in results:
    print(row)

In [None]:
# Examples of parameters from the config.
# Now that the configuration is saved in Snowflake, the calls do not need these parameters;
# we read them here only to showcase the service-specific settings.

In [None]:
config = configuration_yaml['config']

In [None]:
openai_api_key = config['auth']['openai_api_key']
llm_family = config['models']['family']
completion_model = config['models']['completion']
is_fine_tuned_completion_model = config['models']['is_fine_tuned']
embeddings_model = config['models']['embeddings']
summarization_context = config['operations']['get_embeddings']['summarization_context']

In [None]:
similarity_top_k = config['operations']['question_answering']['retrieval']['similarity_top_k']
similarity_threshold = config['operations']['question_answering']['retrieval']['similarity_threshold']
retriever_type = config['operations']['question_answering']['retrieval']['type']

In [None]:
print("OpenAI API key configured:", bool(openai_api_key))  # Avoid printing the secret itself.
print(llm_family)
print(completion_model)
print(is_fine_tuned_completion_model)
print(embeddings_model)
print(summarization_context)

In [None]:
print(similarity_top_k)
print(similarity_threshold)
print(retriever_type)

In [None]:
# TODO: Add the OpenAI API key, and remove it from the file before committing. Also remove your account name and credentials.
# TODO: Use the Snowflake account name that Alex sent.

## Run the pipeline

Here, we outline each step of the pipeline one by one. Since the configuration file has already been defined and saved in Snowflake, specifying input parameters for each call is not necessary. We only need to provide the `project_id`; the service will then retrieve the required parameters for each step (for example, `similarity_top_k` for retrieval) from the configuration stored in the `PROJECTS` table.
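
Since every step below follows the same pattern (one `CALL` with the `project_id`), the pattern can be sketched as a small helper. This is a minimal sketch, not part of the app: the function name `run_pipeline_step` and the allowlist are hypothetical, while the procedure names are the ones called in the cells below.

```python
# Procedure names used in this notebook; identifiers cannot be bound as
# parameters, so they are checked against this allowlist instead.
PIPELINE_PROCEDURES = (
    "execute_get_entities_relations",
    "execute_get_communities",
    "execute_get_embeddings",
)


def run_pipeline_step(cursor, app_name: str, procedure: str, project_id: str) -> list:
    """Call one pipeline procedure for a project and return the result rows."""
    if procedure not in PIPELINE_PROCEDURES:
        raise ValueError(f"Unknown pipeline procedure: {procedure}")
    if not app_name.replace("_", "").isalnum():
        raise ValueError(f"Unexpected application name: {app_name}")
    # The project id is passed as a bound parameter (the connector's default pyformat style).
    cursor.execute(f"CALL {app_name}.app.{procedure}(%s);", (project_id,))
    return cursor.fetchall()
```

With the connection opened earlier, a step could then be run as `run_pipeline_step(cursor, f"RAI_GRS{app_suffix}", "execute_get_communities", project_id)`.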

#### 0. Corpus conversion
The very first step is to upload your documents to the `FILES` stage (under the schema `RAI_GRS.DATA`) in a new folder named after the `project_id`. Once uploaded, you can run the corpus conversion, which extracts text content from your documents.

There are two options available: standard conversion and visual parsing. Visual parsing leverages an LLM with vision capabilities to interpret and extract meaning from visual content in your documents, such as images and diagrams.

For this example, we already have a table with textual data, so there are no PDF documents to process through the corpus conversion endpoint. Instead, we take the relevant textual column from the input file, and format it to match the structure of the `CORPUS` table, as it would appear after running the conversion on PDF documents (i.e., with the same columns and structure). 

Once formatted, we upload this data to Snowflake so it is ready for the first step of our pipeline.
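
The formatting described above can be sketched for a single agent without pandas. The column names and string templates mirror the cells below; the function name `build_corpus_record` and the sample agent are hypothetical.

```python
from datetime import datetime, timezone


def build_corpus_record(project_id: str, name: str, symbol: str,
                        description: str, source: str) -> dict:
    """Build one row with the PROJECT_ID, CHUNK_ID, CONTENT and METADATA columns."""
    metadata = {
        "creationDate": datetime.now(timezone.utc).strftime("D:%Y%m%d%H%M%SZ"),
        "subject": "Description of Tokenflow agents.",
        "source": source,
    }
    return {
        "PROJECT_ID": project_id,
        # One chunk per agent, identified by the project id and the agent name.
        "CHUNK_ID": f"{project_id}/{name}",
        "CONTENT": f"Agent with name {name} has symbol: {symbol} and description: {description}.",
        "METADATA": metadata,
    }


# Hypothetical agent, for illustration only.
record = build_corpus_record("tokenflow", "ExampleAgent", "EXA",
                             "A sample agent", "virtuals-agents.csv")
print(record["CHUNK_ID"])  # tokenflow/ExampleAgent
```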

In [None]:
# # Step 0
# # -- Corpus conversion (e.g. PDF to MarkDown).
# # Not needed here, as we already have the raw text in a CSV file, so we provide the corpus table directly.
# # But if we had uploaded some PDF documents to a folder, we would run the following statement:

# # Form the SQL call.
# extraction_call = f"""CALL RAI_GRS{app_suffix}.app.execute_convert_corpus('{project_id}');"""

# # Or if the documents contain visual content, we can run this process (note that it may be costly due to extensive LLM calls): 
# # extraction_call = f"""CALL RAI_GRS{app_suffix}.app.execute_llm_convert_corpus('{project_id}');"""

# # Execute the SQL call.
# cursor.execute(extraction_call)

# # See the results of the call.
# results = cursor.fetchall() 
# for row in results:
#     print(row)

In [None]:
# We read the table of interest that has the textual data. This is the csv with the raw input data.
file_path = "../../data/virtuals-agents.csv"
virtuals_agents_raw = pd.read_csv(file_path)
virtuals_agents_raw.sample(2)

In [None]:
# In this experiment, we extract the name, symbol, and description of the agents, and combine them into a single string.
# This aggregated text will serve as the input document (one for each agent) for our algorithm.
corpus = virtuals_agents_raw[["NAME", "SYMBOL", "DESCRIPTION"]].copy()

# Create the necessary columns to match the schema of the Snowflake corpus table.
corpus["PROJECT_ID"] = project_id
corpus["CHUNK_ID"] = corpus["NAME"].apply(lambda x: f"{project_id}/{x}")

# We could have custom metadata for the different documents, or let the LLM generate some (e.g. short title) but for now we use the same 
# metadata for all the records.
now_utc = datetime.now(timezone.utc)
formatted_time = now_utc.strftime("D:%Y%m%d%H%M%SZ")
metadata_for_all_entries = {
  "creationDate": formatted_time,
  "subject": "Description of Tokenflow agents.",
  "source": file_path
}
corpus["METADATA"] = corpus.apply(lambda _: metadata_for_all_entries, axis=1)
corpus["CONTENT"] = corpus.apply(lambda row: f"Agent with name {row['NAME']} has symbol: {row['SYMBOL']} and description: {row['DESCRIPTION']}.", axis=1)

# The final table format matches the CORPUS table schema on Snowflake.
final_corpus_df = corpus[["PROJECT_ID", "CHUNK_ID", "CONTENT", "METADATA"]]
final_corpus_df.sample(2)

In [None]:
# Now that the textual data is ready in the corpus table format, we upload it to SF instead of running the corpus conversion step.
# Then we can run the GraphRAG Native App pipeline.

# Upload the corpus table to Snowflake using write_pandas from the Snowflake Python connector.
success, nchunks, nrows, _ = write_pandas(conn=conn,
                                          df=final_corpus_df,
                                          database=f'RAI_GRS{app_suffix}',
                                          schema='DATA',
                                          table_name='CORPUS')

#### 1. Entities and relations extraction (with customization of the relevant prompt)

After preparing the text documents, we can begin our pipeline with the first actual step of the knowledge graph (KG) construction. 

In this example, we demonstrate how to customize the prompt to better fit specific needs. But how can we customize the prompt, and what does that mean for our task?

Let’s explore this through the following example:

Suppose, based on prior analysis of our agent description data—or from specific business requirements—we want to extract the following five properties for each agent:

1. Purpose/Function  
2. Character and Personality  
3. Collaborations with other agents  
4. Skills/Abilities  
5. Key Elements / Expertise / Specialty / Target

We recognize that some agents may have incomplete descriptions, so these values may not be available for every entry. Nevertheless, we can instruct the LLM to extract these specific fields as node properties during the KG extraction process.

We’ve already configured this behavior by modifying the prompt in the `config.yml` file, under the `get_entities_relations` section. The prompt also includes context about the domain of the documents to improve extraction accuracy. 

If you don't have specific customization requirements, you can keep the default prompt that comes with the installation of the app—it has been written to perform well across a variety of document types and domains.

Once the prompt is set and the desired LLM for extraction is selected, we're ready to run the extraction step!

In [None]:
# Step 1
# -- Entities / relations extraction.

# Form the SQL call.
extraction_call = f"""CALL RAI_GRS{app_suffix}.app.execute_get_entities_relations('{project_id}');"""

# Execute the SQL call.
cursor.execute(extraction_call)

# See the results of the call.
results = cursor.fetchall() 
for row in results:
    print(row)

#### 2. Community detection

After extracting entities and relations, the next step is community detection. Several algorithms are available for this task, with configurable parameters. For example, you can set a maximum community size to prevent the formation of overly large communities with too many nodes. With the configuration saved, we can proceed directly to executing the `execute_get_communities` procedure.
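
To see whether a maximum community size had the intended effect, the community sizes can be checked once the `COMMUNITIES` table has been downloaded (see the loading section later in this notebook). The sketch below is only a sanity check, not the app's detection algorithm; the function name is hypothetical, and it assumes the table exposes `NODE_ID` and `COMMUNITY_ID` columns, so the pairs could come from `zip(communities["NODE_ID"], communities["COMMUNITY_ID"])`.

```python
from collections import Counter


def oversized_communities(memberships, max_size):
    """Return {community_id: size} for communities larger than max_size.

    `memberships` is an iterable of (node_id, community_id) pairs.
    """
    sizes = Counter(community_id for _, community_id in memberships)
    return {cid: size for cid, size in sizes.items() if size > max_size}


# Hypothetical memberships, for illustration only.
pairs = [("A", 0), ("B", 0), ("C", 0), ("D", 1)]
print(oversized_communities(pairs, max_size=2))  # {0: 3}
```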

In [None]:
# Step 2
# -- Community detection.

# Form the SQL call.
extraction_call = f"""CALL RAI_GRS{app_suffix}.app.execute_get_communities('{project_id}');"""

# Execute the SQL call.
cursor.execute(extraction_call)

# See the results of the call.
results = cursor.fetchall() 
for row in results:
    print(row)

#### 3. Graph indexing: summarization and embeddings

This step performs the following operations:

- Summarization of each community with an LLM, capturing the context of the nodes it contains  
- Embedding generation for the `CORPUS` table, node and edge properties, and the community summaries

As with any LLM task in your pipeline, you can adjust the summarization prompt to guide the LLM on the desired level of abstraction and on which details to include in the summaries. In this case, we use the default prompt.

In [None]:
# Step 3
# -- Graph indexing: summarization and embeddings.

# Form the SQL call.
extraction_call = f"""CALL RAI_GRS{app_suffix}.app.execute_get_embeddings('{project_id}');"""

# Execute the SQL call.
cursor.execute(extraction_call)

# See the results of the call.
results = cursor.fetchall() 
for row in results:
    print(row)

#### 4. Question Answering (QA)

After indexing completes, we are ready to use our app for question answering!

Note that indexing happens in two places: during embedding generation with your selected embedder, and within the Cortex Search service.

This provides two retrieval options:
- **Vector search**, which uses the embeddings generated in the previous step (step #3)  
- **Cortex Search**, the managed service provided by Snowflake Cortex, which performs hybrid retrieval and reranking behind the scenes

As shown in the `config.yml`, we have selected Cortex Search for this demo and set `similarity_top_k=10`. 

Whenever you need to change retrieval settings, simply edit the YAML file and rerun the cell containing the `save_config` call.

You can also select which sources to include in the retrieved results, since there are multiple content types to search: corpus items, community summaries, and verbalized properties of both nodes and edges. For now, we use the default behavior, which retrieves the most relevant items across all available sources.
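
To make the roles of `similarity_top_k` and `similarity_threshold` concrete, here is a toy vector search over in-memory embeddings. It is an illustration under simplifying assumptions (cosine similarity, plain Python lists), not the app's retriever and not how Cortex Search works internally; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, items, similarity_top_k, similarity_threshold):
    """Return up to similarity_top_k (item_id, score) pairs scoring at least the threshold, best first."""
    scored = [(item_id, cosine_similarity(query_vec, vec)) for item_id, vec in items.items()]
    kept = [pair for pair in scored if pair[1] >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:similarity_top_k]


# Toy two-dimensional "embeddings", for illustration only.
toy_items = {"chunk_a": [1.0, 0.0], "chunk_b": [0.0, 1.0], "chunk_c": [1.0, 1.0]}
hits = retrieve([1.0, 0.0], toy_items, similarity_top_k=2, similarity_threshold=0.5)
print([item_id for item_id, _ in hits])  # ['chunk_a', 'chunk_c']
```

The threshold drops weak matches (`chunk_b` here), and the top-k cap then limits how many of the remaining items are passed on as context.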

In [None]:
print(f'Selected retriever for this demo uses "{retriever_type}" with "similarity_top_k" set to {similarity_top_k} and "similarity_threshold" to {similarity_threshold}.')

In [None]:
# Step 4

# Form the SQL call. Both arguments are passed as bound parameters, so quotes in the question cannot break the statement.
# question = "What is the meaning of the context?"
question = "What do you know about WAI Combinator?"
extraction_call = f"""CALL RAI_GRS{app_suffix}.app.execute_get_answer(%s, %s);"""

# Execute the SQL call.
cursor.execute(extraction_call, (project_id, question))

# See the results of the call.
result = cursor.fetchone() 
answer = result[1]
context = json.loads(result[2])
print(f"Question: {question}")
print()
print(f"Answer: {answer}")
print()
print("-----------------------------------")
print(f"Context has {len(context)} items.")

<hr>

## Load extracted graph data from Snowflake

After running the pipeline, we can download the graph data and use it for further processing, such as the post-processing and visualization below.

We use the provided method from Snowflake Python connector to download the graph data:
https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-api#fetch_pandas_all
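
The download cells below all follow one pattern, which can be sketched as a helper that binds the project id instead of interpolating it into the SQL string. The helper name `fetch_project_table` and the allowlist are hypothetical; the table names are the ones queried below, and `fetch_pandas_all` is the connector method linked above.

```python
# Tables queried in this notebook; identifiers cannot be bound as parameters,
# so the table name is checked against this allowlist.
GRAPH_TABLES = ("CORPUS", "NODES", "NODE_PROPERTIES", "EDGES", "EDGE_PROPERTIES", "COMMUNITIES")


def fetch_project_table(cursor, app_name: str, table: str, project_id: str):
    """Download one graph table for a project as a pandas DataFrame."""
    if table not in GRAPH_TABLES:
        raise ValueError(f"Unknown table: {table}")
    if not app_name.replace("_", "").isalnum():
        raise ValueError(f"Unexpected application name: {app_name}")
    cursor.execute(f"SELECT * FROM {app_name}.DATA.{table} WHERE PROJECT_ID = %s", (project_id,))
    return cursor.fetch_pandas_all()
```

For example, `nodes = fetch_project_table(cursor, f"RAI_GRS{app_suffix}", "NODES", project_id)`.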

In [None]:
# cursor.execute(f"SELECT * FROM RAI_GRS{app_suffix}.DATA.CORPUS WHERE PROJECT_ID='{project_id}'")
# # Fetch the result set from the cursor and deliver it as the pandas DataFrame.
# corpus = cursor.fetch_pandas_all()
# corpus.shape

In [None]:
cursor.execute(f"SELECT * FROM RAI_GRS{app_suffix}.DATA.NODES WHERE PROJECT_ID='{project_id}'")
# Fetch the result set from the cursor and deliver it as the pandas DataFrame.
nodes = cursor.fetch_pandas_all()
nodes.shape

In [None]:
cursor.execute(f"SELECT * FROM RAI_GRS{app_suffix}.DATA.NODE_PROPERTIES WHERE PROJECT_ID='{project_id}'")
node_properties = cursor.fetch_pandas_all()
node_properties.shape

In [None]:
cursor.execute(f"SELECT * FROM RAI_GRS{app_suffix}.DATA.EDGES WHERE PROJECT_ID='{project_id}'")
edges = cursor.fetch_pandas_all()
edges.shape

In [None]:
cursor.execute(f"SELECT * FROM RAI_GRS{app_suffix}.DATA.EDGE_PROPERTIES WHERE PROJECT_ID='{project_id}'")
edge_properties = cursor.fetch_pandas_all()
edge_properties.shape

In [None]:
cursor.execute(f"SELECT * FROM RAI_GRS{app_suffix}.DATA.COMMUNITIES WHERE PROJECT_ID='{project_id}'")
communities = cursor.fetch_pandas_all()
communities.shape

In [None]:
cursor.close()
conn.close()

In [None]:
# # In case we want to save our graph data locally and make them available for next time without needing to connect to SF 
# again and use the cursor to download the data.
# nodes.to_csv(r"data/output/output from native app run/nodes.csv", index=False)
# node_properties.to_csv(r"data/output/output from native app run/node_properties.csv", index=False)
# edges.to_csv(r"data/output/output from native app run/edges.csv", index=False)
# communities.to_csv(r"data/output/output from native app run/communities.csv", index=False)

In [None]:
# # If we wanted to use pandas read_csv after downloading the tables as csv outputs.

# nodes = pd.read_csv(r"data/output/output from native app run/nodes.csv")
# node_properties = pd.read_csv("data/output/output from native app run/node_properties.csv")
# edges = pd.read_csv("data/output/output from native app run/edges.csv")
# communities = pd.read_csv("data/output/output from native app run/communities.csv")

## Postprocessing

### Extracted nodes overview
Let's look at the nodes and count how many of them are agents (nodes of type "ai_agent").

In [None]:
nodes.head()

In [None]:
nodes['TYPE'].unique()

In [None]:
# How many agents have been extracted?
nodes[nodes['TYPE']=='ai_agent'].shape

More than 100 agent nodes were extracted, although the corpus contains 100 descriptions. Let's inspect the extra agent nodes: should there be exactly one agent node per corpus item?

In [None]:
# Lowercase to count duplicated agents that might appear
# nodes['ID_lowercase'] = nodes['ID'].str.lower()
only_agent_nodes = nodes[nodes['TYPE']=='ai_agent']
# only_agent_nodes.loc[:, 'ID_lowercase'] = only_agent_nodes['ID'].str.lower()
print(f"Duplicated count: {only_agent_nodes.duplicated(subset=['ID']).sum()}")

In [None]:
# Remove agents with the same name.
condition = (nodes.duplicated(subset='ID')) & (nodes['TYPE'] == 'ai_agent')
nodes = nodes[~condition]
nodes[nodes['TYPE']=='ai_agent'].shape

In [None]:
# Let's find the agents whose names differ from the original filenames, to see whether some descriptions contain more
# than one agent.
# assign() returns a new dataframe (avoiding pandas' SettingWithCopyWarning); the prefix is removed as a literal string.
only_agent_nodes = only_agent_nodes.assign(file_name=only_agent_nodes['CHUNK_ID'].str.replace(f"{project_id}/", "", regex=False))
only_agent_nodes[only_agent_nodes['ID'] != only_agent_nodes['file_name']][['CHUNK_ID', 'ID', 'CONTEXT']]

As we can see, some descriptions mention more than one agent, so it is appropriate to extract multiple agents from the same text.
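
One way to list such descriptions systematically is to group the agent nodes by chunk. The sketch below works on plain `(agent_id, chunk_id)` pairs, which could be built with `zip(only_agent_nodes["ID"], only_agent_nodes["CHUNK_ID"])`; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def chunks_with_multiple_agents(agent_rows):
    """Map each chunk id to its agent ids, keeping only chunks with more than one agent.

    `agent_rows` is an iterable of (agent_id, chunk_id) pairs.
    """
    agents_per_chunk = defaultdict(set)
    for agent_id, chunk_id in agent_rows:
        agents_per_chunk[chunk_id].add(agent_id)
    return {chunk: sorted(ids) for chunk, ids in agents_per_chunk.items() if len(ids) > 1}


# Hypothetical rows, for illustration only.
rows = [("AgentA", "demo/AgentA"), ("AgentB", "demo/AgentA"), ("AgentC", "demo/AgentC")]
print(chunks_with_multiple_agents(rows))  # {'demo/AgentA': ['AgentA', 'AgentB']}
```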

Examples where the text mentions more than one agent:

In [None]:
only_agent_nodes[only_agent_nodes['CHUNK_ID']=='tokenflow/Zenith'].CONTEXT.iloc[0]

In [None]:
only_agent_nodes[only_agent_nodes['CHUNK_ID']=='tokenflow/DXAI.app'].CONTEXT.iloc[0]

#### Add the properties and communities to the nodes dataframe for convenience

In [None]:
def get_properties_for_node(node_id, chunk_id, node_properties_df) -> dict:  
    """
    Retrieve all properties for a given node.

    This function searches in `node_properties_df` to find all properties 
    associated with the specified `node_id` and returns them as a dictionary.

    Parameters:
        node_id (str): The unique identifier of the node.
        chunk_id (str): The unique identifier of the chunk from which the node has been extracted.
        node_properties_df (pd.DataFrame): The properties of all nodes, one row per property.

    Returns:
        dict: None if no properties found. If there are properties, a dictionary where each key is a 
              property name and the corresponding value is the property value for the given node.
    """
    # Search in the properties df with the properties from all nodes to find the properties of this node.
    properties_of_this_node = node_properties_df[
        (node_properties_df['NODE_ID'] == node_id) &
        (node_properties_df['CHUNK_ID'] == chunk_id)
    ]

    if not properties_of_this_node.empty:
        # Remove duplicates based on 'PROPERTY_NAME' and 'PROPERTY_VALUE'
        unique_properties = properties_of_this_node.drop_duplicates(subset=['PROPERTY_NAME', 'PROPERTY_VALUE'])
        # Convert to a dictionary: PROPERTY_NAME -> PROPERTY_VALUE
        property_dict = dict(zip(unique_properties['PROPERTY_NAME'], unique_properties['PROPERTY_VALUE']))
        return property_dict
    else:
        return None

In [None]:
# In the prompt we asked that the five important properties be set to 'null' if they are not available. So, here we replace them with None.
def clean_property_value(val):
    if val == []:
        return None
    if val == '[]':
        return None
    if val == "null":
        return None
    return val

node_properties['PROPERTY_VALUE'] = node_properties['PROPERTY_VALUE'].apply(clean_property_value)

In [None]:
nodes.sample()

In [None]:
nodes.shape

In [None]:
# Create a new column to store properties as dict in this dataframe.
nodes['PROPERTIES'] = nodes.apply(
    lambda row: get_properties_for_node(row['ID'], row['CHUNK_ID'], node_properties),
    axis=1
)

In [None]:
nodes.shape

In [None]:
# Now we will add the communities too.
communities.shape

In [None]:
nodes.head()

In [None]:
# Merge the nodes and communities dataframes
nodes = pd.merge(
    left=nodes,
    right=communities,
    how='left',
    left_on=['ID'],
    right_on=['NODE_ID']
)

In [None]:
nodes = nodes.drop(columns=['PROJECT_ID_y', 'NODE_ID'])
nodes = nodes.rename(columns={'PROJECT_ID_x': 'PROJECT_ID'})
nodes.head()

## Graph byproduct: agents data analysis

### Take the agents data and store them in a new Snowflake table
Here, we export only the agent nodes along with the five properties we requested in the extraction prompt. We filter the agent nodes and create separate columns for each of these five properties.

In [None]:
# Work on a copy so that later in-place changes do not act on a slice of `nodes`.
agents = nodes[nodes['TYPE']=='ai_agent'].copy()
agents.head()

In [None]:
agents.isna().sum()

In [None]:
# agents = agents.dropna()

In [None]:
agents.reset_index(drop=True, inplace=True)

In [None]:
# Check that all agents have the five properties as keys.
required_keys = ['purpose', 'character', 'collaborators', 'key_elements', 'skills']
count = 0
completed_properties = []
for property_set in agents['PROPERTIES'].to_list():
    # Agents without any extracted properties are stored as None (or NaN), so start from an empty dict.
    property_set = dict(property_set) if isinstance(property_set, dict) else {}
    if not all(key in property_set for key in required_keys):
        count += 1
        print("Found an agent without all five main properties; assigning None to the missing ones.")
        for key in required_keys:
            property_set.setdefault(key, None)
    completed_properties.append(property_set)
agents = agents.assign(PROPERTIES=completed_properties)

print(f"There were {count} agents with missing properties.")

In [None]:
# Take the five important properties and place them as separate columns.

# Define keys to extract
key_properties_to_extract = ['purpose', 'character', 'collaborators', 'key_elements', 'skills']

# Function to extract keys
def extract_properties(prop_dict):
    extracted = {key: prop_dict.get(key) for key in key_properties_to_extract}
    other = {k: v for k, v in prop_dict.items() if k not in key_properties_to_extract}
    extracted['other properties'] = other
    return pd.Series(extracted)

# Apply extraction
df_extracted = agents['PROPERTIES'].apply(extract_properties)

# Combine with original dataframe
agents = pd.concat([agents.drop(columns=['PROPERTIES']), df_extracted], axis=1)

In [None]:
# Rename some columns and keep only those we need.
agents = agents.rename(columns={'ID': 'Name',
                                'CHUNK_ID': 'Original filename', 
                                'CONTEXT': 'Description',
                                'other properties': 'Other properties',
                                'skills': 'Skills',
                                'key_elements': 'Key elements',
                                'purpose': 'Purpose',
                                'character': 'Character',
                                'collaborators': 'Collaborators'
                               })
agents = agents.drop(columns=['PROJECT_ID', 'TYPE', 'COMMUNITY_ID'])
agents.sample(5)

#### Store the agents in a new Snowflake table

In [None]:
conn = snowflake.connector.connect(
                user=username,
                password=password,
                account=account_identifier,
                role=role, 
                database=database,
                schema=schema,
                warehouse=warehouse
            )
            
            
# Create a cursor for this connection (again).
cursor = conn.cursor()

In [None]:
# Drop the table if it already exists (the cursor from the previous cell is reused).

cursor.execute(f"""
DROP TABLE IF EXISTS RAI_GRS{app_suffix}.DATA.TOKENFLOW_AGENTS;
""")

In [None]:
agents.sample()

In [None]:
cursor.execute(f"""
CREATE OR REPLACE TABLE RAI_GRS{app_suffix}.DATA.TOKENFLOW_AGENTS (
    "Original filename" VARCHAR,
    "Name" VARCHAR,
    "Description" VARCHAR,
    "Purpose" VARCHAR,
    "Character" VARCHAR,
    "Collaborators" VARCHAR,
    "Key elements" VARCHAR,
    "Skills" VARCHAR,
    "Other properties" VARCHAR
)
""")

In [None]:
# Upload data to a new SF table.
success, nchunks, nrows, _ = write_pandas(conn=conn,
                                          df=agents,
                                          database=f'RAI_GRS{app_suffix}',
                                          schema='DATA',
                                          table_name='TOKENFLOW_AGENTS')

In [None]:
# agents.to_csv("data/output/other outputs/agents.csv")

In [None]:
# Close this cursor.
cursor.close()
conn.close()

## Visualization

In [None]:
nodes['TYPE'].value_counts()

In [None]:
def get_node_icon(node_type):
    """Get appropriate icon based on node type"""
    icon_map = {
        '🤖': ['ai_agent', 'ai', 'ai_technology', 'ai_framework'],
        '🧑': ['person', 'user', 'family_member'],
        '🖥️': ['platform', 'software', 'technology', 'feature'],
        '₿': ['blockchain', 'cryptocurrency', 'trading_platform', 'token', 'blockchain_paradise', 'meme_coin'],
        '🦾': ['ai_agent_role'],
        '💰': ['financial_product', 'currency'],
        '🧪': ['product'],
        '🏢': ['company', 'organization'],
        '📄': ['document'],
        '🌍': ['country', 'place', 'ecosystem'],
        '🎖️': ['certification'],
        '📜': ['regulation', 'legal', 'protocol', 'algorithm'],
        '📌': ['default']  # Default case
    }

    # Flatten dictionary for quick lookup
    node_to_icon = {key: icon for icon, keys in icon_map.items() for key in keys}

    return node_to_icon.get(node_type, '📌')  # Return default icon if not found

# # Example usage
# print(get_node_icon('company'))  # 🏢
# print(get_node_icon('chemical'))  # 📌 (default, since 'chemical' is not in the map)
# print(get_node_icon('unknown'))  # 📌 (default)


In [None]:
def get_node_color(node_type):
    """Get appropriate color based on node type"""
    color_map = {
        '#FFB6C1': ['ai_agent', 'ai', 'ai_technology', 'ai_framework'],  # Light pink
        '#DAA06D': ['person', 'user', 'family_member'], # Brown
        '#98FB98': ['platform', 'software', 'technology', 'feature'],  # Pale green
        '#4682B4': ['blockchain', 'cryptocurrency', 'trading_platform', 'token', 'blockchain_paradise', 'meme_coin'],  # Steel blue
        '#FFD700': ['ai_agent_role'],  # Gold
        '#FFA500': ['financial_product', 'currency'],  # Orange
        '#90EE90': ['company', 'organization'],  # Light green
        '#ADD8E6': ['document'],  # Light blue
        '#DDA0DD': ['country', 'place'],  # Plum
        '#DC143C': ['certification'],  # Crimson
        '#8FBC8F': ['regulation', 'legal', 'protocol', 'algorithm'],  # Dark sea green
        '#F0F0F0': ['default']  # Light gray
    }

    # Flatten dictionary for quick lookup
    node_to_color = {key: color for color, keys in color_map.items() for key in keys}

    return node_to_color.get(node_type, '#F0F0F0')  # Return default color if not found

# # Example usage
# print(get_node_color('company'))  # #90EE90 (Light green)
# print(get_node_color('chemical'))  # #F0F0F0 (Default, since 'chemical' is not in the map)
# print(get_node_color('unknown'))  # #F0F0F0 (Default)

### Using the yFiles library
The `yfiles_jupyter_graphs` library cannot be used directly in Snowflake [notebooks](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks-import-packages) (you would have to upload the library to a stage and try to import it from there), so we use it here, outside Snowflake.

In [None]:
# pip install yfiles_jupyter_graphs

In [None]:
from yfiles_jupyter_graphs import GraphWidget

In [None]:
nodes.head()

In [None]:
# # # Store the postprocessed data, so you can read them later for visualization. 
# # But keep in mind that in that case you will need to pay attention to converting the 'PROPERTIES' column of nodes back to list of dicts
# # using the safe_eval function in the next cell.
# nodes.to_csv('data/output from postprocessing on notebook/nodes_with_properties.csv', index=False)
# edges.to_csv('data/output from postprocessing on notebook/edges.csv', index=False)  # Here this df is the same as the input
# agents.to_csv('data/output from postprocessing on notebook/agents.csv', index=False)

In [None]:
import ast

# Safely parse the 'PROPERTIES' column, ignoring NaN
def safe_eval(val):
    if pd.isna(val):
        return None
    try:
        return ast.literal_eval(val)
    except Exception as e:
        print(f"Error parsing: {val}\n{e}")
        return val  # Or return np.nan if you prefer to drop bad values

In [None]:
# nodes = pd.read_csv('data/nodes_with_properties.csv')
# nodes['PROPERTIES'] = nodes['PROPERTIES'].apply(safe_eval)

# edges = pd.read_csv('data/edges.csv')

In [None]:
nodes.head(2)

In [None]:
edges.head(2)

In [None]:
nodes_for_yfiles = []
seen_node_ids = set()

for index, row in nodes.iterrows():
    # Skip the node if one with the same ID has already been added to the graph.
    if row['ID'] in seen_node_ids:
        continue
    seen_node_ids.add(row['ID'])
    entity_emoji = get_node_icon(node_type=row['TYPE'])
    entity_color = get_node_color(node_type=row['TYPE'])
    entity_label = f"{entity_emoji} {row['ID']}"
    # Copy the properties so that the dicts stored in the nodes dataframe are not modified;
    # nodes without properties (None or NaN) start from an empty dict.
    entity_properties = dict(row['PROPERTIES']) if isinstance(row['PROPERTIES'], dict) else {}

    # Add the node type and ID as metadata.
    entity_properties["node_type"] = row['TYPE']
    entity_properties["node_id"] = row['ID']

    node_for_yfiles = {"id": row['ID'],
                       "properties":
                          {"label": entity_label,
                           "properties": entity_properties,
                           "color": entity_color,
                           "type": row['TYPE'],
                           "community": row['COMMUNITY_ID']
                          }
                     }
    # Add the node.
    nodes_for_yfiles.append(node_for_yfiles)

In [None]:
edges_for_yfiles = []

for index, row in edges.iterrows():        
    edge_for_yfiles = {
        "id": index,
        "start": row['SRC_NODE_ID'],
        "end": row['DST_NODE_ID'],
        "properties":
            {
             "label": row['TYPE'],
            }
       }
    edges_for_yfiles.append(edge_for_yfiles)

In [None]:
w = GraphWidget()
w.nodes = nodes_for_yfiles
w.edges = edges_for_yfiles
w.directed = True

In [None]:
# Show with color mapping

w.node_color_mapping = 'color'
w.show()

In [None]:
# Show with color and community mapping

w.node_color_mapping = 'color'
w.node_parent_group_mapping = 'community'
w.show()

In [None]:
# Some nodes do not seem to have any edges.
# A quick check:
check_node = "Yugo"
print(check_node in edges['SRC_NODE_ID'].tolist())
print(check_node in edges['DST_NODE_ID'].tolist())
print()
# full_context = nodes[nodes['ID'] == check_node]['CONTEXT'].values[0]
# print(full_context)