# Graph Triplet Extraction for Financial Documents

SEC (Securities and Exchange Commission) filings, such as 10-K reports, contain vast amounts of structured and unstructured data about a company's financials, risks, strategies, and operations. Extracting graph triplets from these documents provides several benefits:
* Structured Representation: Converts unstructured text into structured knowledge in the form of (subject, relation, object) triplets, making it easier to analyze relationships between entities.
* Enhanced Financial Analysis: Enables analysts to identify connections between companies, risks, financial metrics, and market conditions.
* Scalability: Automates the extraction process for large volumes of filings across multiple companies and years.
* Risk Assessment: Helps uncover hidden risks by linking entities (e.g., "TAUTACHROME INC") to specific risk factors (e.g., "Market Risk").
* Compliance and Strategy Insights: Identifies regulatory or operational dependencies that can impact business strategies.

By extracting graph triplets from SEC documents, we can transform raw text into actionable insights that are easier to query and visualize.

This notebook demonstrates how to extract graph triples from SEC 10-K filings using NVIDIA's AI endpoints. The extracted triples are useful for building a GraphRAG (Graph-based Retrieval-Augmented Generation) system, enhancing the knowledge graph with detailed financial information.

## How Does a Knowledge Graph Help with Multiple SEC Documents?
A knowledge graph organizes extracted triplets into a connected network of entities and relationships. This structure is particularly valuable for analyzing multiple SEC filings:
* Unified View Across Companies: Combines data from multiple filings into a single graph, enabling cross-company comparisons. Example: Identify common risk factors faced by companies in the same industry.
* Queryable Relationships: Allows users to query specific relationships (e.g., "What market risks does a company face?") without manually sifting through documents.
* Interconnected Insights: Links related entities across documents (e.g., subsidiaries, competitors, or shared risks). Example: Connect "TAUTACHROME INC" to its financial performance metrics and regulatory obligations.
* Scalability for Large Datasets: Handles thousands of filings efficiently by representing them as nodes and edges in a graph. Example: Visualize how different companies are affected by the same regulation or market condition.
* Improved Decision-Making: Provides a holistic view of a company's ecosystem, enabling better risk assessment, compliance tracking, and strategic planning.

A knowledge graph built from SEC filings transforms disparate data into an interconnected web of insights that can be queried and analyzed at scale.
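
As a rough illustration of the cross-company comparison described above, the sketch below groups a few toy triplets by object and keeps the objects shared by more than one company. The company names and triplets are made up for the example and do not come from the dataset; `shared_objects` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triplets from two filings.
triplets = [
    ("Company A", "Has", "Market Risk"),
    ("Company A", "Has", "Credit Risk"),
    ("Company B", "Has", "Market Risk"),
    ("Company B", "Operate_In", "Software Industry"),
]


def shared_objects(triplets, relation):
    """Map each object to the subjects linked to it by `relation`,
    keeping only objects that have two or more distinct subjects."""
    subjects_by_object = defaultdict(set)
    for subject, rel, obj in triplets:
        if rel == relation:
            subjects_by_object[obj].add(subject)
    return {
        obj: sorted(subjects)
        for obj, subjects in subjects_by_object.items()
        if len(subjects) > 1
    }


shared_risks = shared_objects(triplets, "Has")
print(shared_risks)
```

The same grouping works on a real graph by iterating over its edges instead of a list of tuples.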

## What Will You Learn in This Notebook?
This notebook demonstrates how to extract graph triplets from SEC filings and build a knowledge graph. By the end of this notebook, you will learn:
* Triplet Extraction: How to extract (subject, relation, object) triplets from SEC filings using natural language processing techniques. Example: ("TAUTACHROME INC", "Faces", "Market Risk").
* Building a Knowledge Graph: How to construct a knowledge graph from extracted triplets using tools like NetworkX, relabeling nodes with meaningful entity names (e.g., "TAUTACHROME INC") and edges with relation names (e.g., "Faces").
* Querying the Knowledge Graph: How to query the graph for insights using LangChain's GraphQAChain. Example Query: "What risk factors does TAUTACHROME INC face?"
* Applications of Knowledge Graphs: How knowledge graphs can be used for financial analysis, risk assessment, compliance tracking, and strategic decision-making.


By following this notebook, you will gain hands-on experience in transforming raw text from SEC filings into actionable insights through graph-based representations.

## Import Necessary Libraries

In [None]:
import os
import json
import ast
import re
import argparse
import getpass
import unicodedata
import shutil
import concurrent.futures
from tqdm import tqdm
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import pandas as pd

Ensure that your NVIDIA API key is set. This key is required to access NVIDIA's AI endpoints, which are used for processing the SEC filings. You can obtain your API key from [NVIDIA's AI portal](https://developer.nvidia.com/).

In [None]:
# Ensure NVIDIA API key is set
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key


In [None]:
#Instantiate model
llm = ChatNVIDIA(model="mistralai/mixtral-8x22b-instruct-v0.1")

## Sample Dataset
For this example, we'll use 2021 SEC documents hosted on [Kaggle](https://www.kaggle.com/datasets/pranjalverma08/sec-edgar-annual-financial-filings-2021).
We'll store this data in our data directory.

## Define Functions for Preprocessing and Triple Extraction

### Preprocess JSON Content

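This function preprocesses JSON content by decoding Unicode escape sequences, restoring escaped newlines, and normalizing characters to plain ASCII. Note that the final step drops any character that has no ASCII equivalent, so the cleaned text is easier to process but no longer identical to the filing.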

In [None]:
def preprocess_json_content(json_content):
    # Decode Unicode escape sequences
    json_content = json_content.encode('utf-8').decode('unicode_escape')
    
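    # Replace escaped newline characters with actual newlines
    json_content = json_content.replace('\\n', '\n')
    # '"\"' is the two-character string '""', so this strips doubled double quotes
    json_content = json_content.replace('"\"', '')
    # No '""' pairs remain after the previous step, so this replacement is effectively a no-op
    json_content = json_content.replace('""', '"""')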
    
    # Ensure \u sequences are treated as string literals
    json_content = json_content.replace('\\u', '\\\\u')
    
    # Normalize Unicode characters
    json_content = unicodedata.normalize('NFKD', json_content).encode('ascii', 'ignore').decode('ascii')
    
    return json_content
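
To make the effect of the last normalization step concrete, here is a minimal sketch that applies the same NFKD-then-ASCII conversion to a made-up string. The helper name `to_ascii` and the sample text are assumptions for illustration only.

```python
import unicodedata


def to_ascii(text):
    """Decompose characters with NFKD, then drop anything that is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Accented letters, a curly apostrophe, curly quotes, and a non-breaking space.
sample = "Soci\u00e9t\u00e9 G\u00e9n\u00e9rale\u2019s \u201ccredit\u00a0risk\u201d"
cleaned = to_ascii(sample)
print(cleaned)
```

Accented letters keep their base letter and the non-breaking space becomes a regular space, but the curly quotes and apostrophe disappear entirely, which is worth keeping in mind when entity names contain such characters.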


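### Preprocess Text
This function preprocesses text by replacing first-person references to the filer ("we", "us", "our", "the Company") with the company name. This step matters for entity recognition and disambiguation, because the model otherwise has no explicit subject to attach these references to.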


In [None]:
def preprocess_text(text, company_name):
    replacements = {
        " we ": f" {company_name} ",
        " us ": f" {company_name} ",
        " our ": f" {company_name}'s ",
        " the Company ": f" {company_name} ",
        "We ": f"{company_name} ",
        "Us ": f"{company_name} ",
        "Our ": f"{company_name}'s ",
        "The Company ": f"{company_name} "
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text
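
Because the replacement keys above rely on surrounding spaces, a reference followed by punctuation (for example "us." or "our,") is left unchanged. The sketch below shows one possible alternative using regular-expression word boundaries in a single pass. `replace_company_references` is a hypothetical helper, not part of the notebook, and it has its own edge cases (contractions such as "we're" would also be rewritten).

```python
import re

# Case variants are listed explicitly so that "US" (United States) is not matched.
_REFERENCE_PATTERN = re.compile(r"\b(?:[Tt]he Company|[Oo]ur|[Ww]e|[Uu]s)\b")


def replace_company_references(text, company_name):
    """Replace first-person references with the company name in one pass."""

    def substitute(match):
        # "our" becomes a possessive; every other reference becomes the plain name.
        if match.group(0).lower() == "our":
            return f"{company_name}'s"
        return company_name

    return _REFERENCE_PATTERN.sub(substitute, text)


example = "We believe our customers trust us. The Company operates in the US."
rewritten = replace_company_references(example, "ACME Corp")
print(rewritten)
```

Doing the substitution in one pass avoids rescanning text that has already been replaced, which matters if the company name itself contains a word such as "Us".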


### Process Response
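This function processes the response from the language model. It wraps the text in list brackets if they are missing, parses it as a Python literal, and keeps only the well-formed five-element entries (subject, subject type, relation, object, object type). It returns `None` when the response cannot be parsed at all.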

In [None]:
def process_response(triplets_str):
    try:
        # Ensure the string is properly formatted
        triplets_str = triplets_str.strip()
        
        if not triplets_str.startswith("["):
            triplets_str = "[" + triplets_str
        if not triplets_str.endswith("]"):
            triplets_str = triplets_str + "]"
        
        triplets_list = ast.literal_eval(triplets_str)
        json_triplets = []
        
        for triplet in triplets_list:
            try:
                # Use object_ to avoid shadowing the built-in object
                subject, subject_type, relation, object_, object_type = triplet
                json_triplet = [subject, subject_type, relation, object_, object_type]
                json_triplets.append(json_triplet)
            except ValueError:
                # Skip the malformed triplet and continue with the next one
                continue
        
        return json_triplets
    except (SyntaxError, ValueError) as e:
        print(f"Error processing response: {e}")
        return None
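
Model responses sometimes include extra text around the list, such as an "OUTPUT:" label. The sketch below is one possible, more defensive variant that first isolates the bracketed span and then keeps only five-string entries. `parse_triplet_list` is a hypothetical helper and the sample response is invented for the example.

```python
import ast
import re


def parse_triplet_list(response_text):
    """Extract a list of 5-string triplets from a possibly noisy model response."""
    # Take everything from the first "[" to the last "]".
    match = re.search(r"\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        candidates = ast.literal_eval(match.group(0))
    except (SyntaxError, ValueError):
        return []
    return [
        list(item)
        for item in candidates
        if isinstance(item, (list, tuple))
        and len(item) == 5
        and all(isinstance(part, str) for part in item)
    ]


noisy_response = (
    "OUTPUT: [('Acme Corp', 'COMP', 'Has', 'Market Risk', 'RISK'), "
    "('Acme Corp', 'COMP')]"
)
parsed = parse_triplet_list(noisy_response)
print(parsed)
```

Returning an empty list instead of `None` is a design choice of this sketch; it lets callers extend a results list without a separate check.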

### Extract Triples for Section
This function extracts graph triples for a given section of text. It uses LangChain's text splitter to break the section into overlapping chunks and a prompt template to ask the model for triples that depict relationships between entities in each chunk.

In [None]:
def extract_triples_for_section(section_text, company_name, section_name, max_length=16384):
    section_text = preprocess_json_content(section_text)
    section_text = preprocess_text(section_text, company_name)
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_length,
        chunk_overlap=500,
        length_function=len
    )
    
    chunks = text_splitter.create_documents([section_text])
    results = []
    
    for chunk in chunks:
        prompt = ChatPromptTemplate.from_messages(
            [("system", f"""You are an investor analyst. Read the SEC 10-K context and generate graph triples that depict the relationships between entities and objects in the context to build a knowledge graph of the 10-K context. 
            Note that the entities should not be generic, numerical, or temporal (like dates or percentages). Entities must be classified into the following categories:
            - ORG: Organizations other than government or regulatory bodies
            - ORG/GOV: Government bodies (e.g., "United States Government")
            - ORG/REG: Regulatory bodies (e.g., "Securities and Exchange Commission")
            - PERSON: Individuals (e.g., "John Doe")
            - GPE: Geopolitical entities such as countries, cities, etc. (e.g., "United States")
            - COMP: Companies (e.g., "{company_name}")
            - PRODUCT: Products or services (e.g., "Windows OS")
            - EVENT: Specific and Material Events (e.g., "Annual Shareholders Meeting", "Product Launch")
            - SECTOR: Company sectors or industries (e.g., "Software Industry")
            - ECON_INDICATOR: Economic indicators (e.g., "Gross Domestic Product"), numerical value like "10%" is not an ECON_INDICATOR;
            - FIN_INSTRUMENT: Financial and market instruments (e.g., "Bonds", "Equity")
            - CONCEPT: Abstract ideas or notions or themes (e.g., "Market Risk", "Innovation", "Sustainability")
            - RISK: Specific risks that could impact the company (e.g., "Market Risk", "Credit Risk", "Operational Risk")

            The relationships 'r' between these entities must be represented by one of the following relation verbs set: Has, Announce, Operate_In, Introduce, Produce, Control, Participates_In, Impact, Positive_Impact_On, Negative_Impact_On, Relate_To, Is_Member_Of, Invests_In, Raise, Decrease.

            Remember to conduct entity disambiguation, consolidating different phrases or acronyms that refer to the same entity (for instance, "SEC", "Securities and Exchange Commission" should be unified as "Securities and Exchange Commission"). Simplify each entity of the triplet to be less than six words.
            When we refer to “we,” “us,” “our,” or the “Company,” it should use the entity's name "{company_name}".
            Do not use dates as entities.

            From this text, your output MUST be in python list of tuples with each tuple made up of ['h', 'type', 'r', 'o', 'type'], each element of the tuple is the string, where the relationship 'r' must be in the given relation verbs set above. Only output the list.

            As an Example, consider the following SEC 10-K excerpt:
                Input: '{company_name} reported a revenue increase of 15% in the software industry. The company announced the launch of Windows 11, which is expected to positively impact its market share.'
                OUTPUT: [
                    ('{company_name}', 'COMP', 'Has', 'Revenue Increase', 'ECON_INDICATOR'),
                    ('{company_name}', 'COMP', 'Operate_In', 'Software Industry', 'SECTOR'),
                    ('{company_name}', 'COMP', 'Announce', 'Windows 11', 'PRODUCT'),
                    ('Windows 11', 'PRODUCT', 'Positive_Impact_On', 'Market Share', 'FIN_INSTRUMENT')
                ]

            Another Example, consider the following SEC 10-K excerpt:
                Input: 'The company faces significant market risk due to fluctuations in commodity prices. Additionally, there is a credit risk associated with the potential default of key customers.'
                OUTPUT: [
                    ('{company_name}', 'COMP', 'Has', 'Market Risk', 'RISK'),
                    ('{company_name}', 'COMP', 'Has', 'Credit Risk', 'RISK'),
                    ('Market Risk', 'RISK', 'Relate_To', 'Commodity Prices', 'ECON_INDICATOR'),
                    ('Credit Risk', 'RISK', 'Relate_To', 'Default of Key Customers', 'EVENT')
                ]

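            The text to analyze is supplied in the user message as INPUT_TEXT."""),
             # "{input}" is a template variable filled by chain.invoke(); a bare {input} inside the
             # f-string above would interpolate Python's built-in input function instead.
             ("user", "INPUT_TEXT: {input}")]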
        )
        
        chain = prompt | llm | StrOutputParser()
        response = chain.invoke({"input": chunk.page_content})
        print(response)
        
        processed_response = process_response(response)
        if processed_response:
            results.extend(processed_response)
    
    return results
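
The prompt restricts relations to a fixed verb set, but a model can still return a verb outside it. The sketch below shows one way to separate conforming triples from the rest after extraction. The verb set is copied from the prompt above; `split_by_relation` and the sample triples are assumptions made for illustration.

```python
# Relation verbs listed in the extraction prompt.
ALLOWED_RELATIONS = {
    "Has", "Announce", "Operate_In", "Introduce", "Produce", "Control",
    "Participates_In", "Impact", "Positive_Impact_On", "Negative_Impact_On",
    "Relate_To", "Is_Member_Of", "Invests_In", "Raise", "Decrease",
}


def split_by_relation(triples, allowed=ALLOWED_RELATIONS):
    """Split 5-element triples into (kept, rejected) by whether the relation is allowed."""
    kept, rejected = [], []
    for triple in triples:
        relation = triple[2]
        if relation in allowed:
            kept.append(triple)
        else:
            rejected.append(triple)
    return kept, rejected


sample_triples = [
    ["Acme Corp", "COMP", "Operate_In", "Software Industry", "SECTOR"],
    ["Acme Corp", "COMP", "Face", "Market Risk", "RISK"],
]
kept, rejected = split_by_relation(sample_triples)
print(len(kept), "kept;", len(rejected), "rejected")
```

Inspecting the rejected list is a cheap way to decide whether to widen the verb set or tighten the prompt.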


### Extract Triples from File
This function extracts triples from a given JSON file containing SEC filing data. It reads the company name, then generates triples for two sections of the 10-K: Item 1A (Risk Factors) and Item 7 (Management's Discussion and Analysis).

In [None]:
def extract_triples_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    
    company_name = data.get("company", "")  # Ensure the company name is available
    
    # Extract triples for Item 1A
    item_1A_text = data.get("item_1A", "")
    item_1A_triples = extract_triples_for_section(item_1A_text, company_name, "Item 1A")
    
    # Extract triples for Item 7
    item_7_text = data.get("item_7", "")
    item_7_triples = extract_triples_for_section(item_7_text, company_name, "Item 7")
    
    return {
        "filename": os.path.basename(file_path),
        "company_name": company_name,
        "Item 1A": item_1A_triples,
        "Item 7": item_7_triples
    }


## Extract graph triples from SEC 10-K documents

We're also going to save the results to file in JSON format, organizing the data for further analysis or integration into a knowledge graph.

In [None]:
def save_results_to_file(result, output_dir):
    # Create a new filename for the output
    output_filename = f"{result['company_name']}_{result['filename'].replace('.json', '')}_triples.txt"
    output_filename = output_filename.replace("'", "")  # Remove single quotes from the filename
    output_filepath = os.path.join(output_dir, output_filename)
    
    # Prepare the data to be saved in JSON format
    data_to_save = {
        "filename": result['filename'],
        "item_1a": result['Item 1A'],
        "item_7": result['Item 7']
    }
    
    # Save the results to a new file in JSON format
    with open(output_filepath, 'w') as outfile:
        json.dump(data_to_save, outfile, indent=4)

In [None]:
def extract_triples_from_directory(directory):
    # Create the triples_10k_dir output directory if it doesn't exist
    output_dir = os.path.join(directory, "triples_10k_dir")
    os.makedirs(output_dir, exist_ok=True)
    
    # Get the list of JSON files to process
    json_files = [os.path.join(directory, filename) for filename in os.listdir(directory) if filename.endswith(".json")]
    
    # Use concurrent.futures to process files in parallel
    results = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = {executor.submit(extract_triples_from_file, file_path): file_path for file_path in json_files}
        
        # Use tqdm to display a progress bar
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Processing files"):
            try:
                result = future.result()
                results.append(result)
                save_results_to_file(result, output_dir)
            except Exception as e:
                # futures maps each future back to the file it was created for
                print(f"Error processing file {futures[future]}: {e}")
    
    return results
# triples = extract_triples_from_directory('/workspace/data/')
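
Most of the per-file time is probably spent waiting on the model endpoint rather than computing, so a thread pool is a plausible alternative to the process pool used above. The sketch below shows the pattern with a stand-in worker. `fake_extract` and `run_in_threads` are hypothetical names, and whether the model client can safely be shared across threads is an assumption to verify before using this with real calls.

```python
import concurrent.futures
import time


def fake_extract(file_path):
    """Stand-in for extract_triples_from_file: sleeps to mimic waiting on an API call."""
    time.sleep(0.05)
    return {"filename": file_path, "n_chars": len(file_path)}


def run_in_threads(file_paths, worker, max_workers=4):
    """Run `worker` over the paths in a thread pool, collecting results and errors."""
    results, errors = [], {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, path): path for path in file_paths}
        for future in concurrent.futures.as_completed(futures):
            path = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # Keep the failing path so it can be retried later.
                errors[path] = exc
    return results, errors


results, errors = run_in_threads(["a.json", "bb.json", "ccc.json"], fake_extract)
print(sorted(r["filename"] for r in results), errors)
```

Results arrive in completion order, not submission order, so sort or key them by file name if order matters.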

## Skip the next cell during the lab because of the time limit. Feel free to run it afterward to extract new graph triplets.


In [None]:
# # SKIP THIS STEP DURING THE DLI LAB. TO RUN IT AFTERWARD, CONVERT THIS CELL FROM MARKDOWN TO CODE.
# # Create the triples_10k directory if it doesn't exist
# directory = '/workspace/data'
# output_dir = os.path.join(directory, "triples_10k_dir")
# if not os.path.exists(output_dir):
#     os.makedirs(output_dir)

# # Get the list of JSON files to process
# json_files = [os.path.join(directory, filename) for filename in os.listdir(directory) if filename.endswith(".json")]

# # Use concurrent.futures to process files in parallel
# all_triples = []
# with concurrent.futures.ProcessPoolExecutor() as executor:
#     futures = {executor.submit(extract_triples_from_file, file_path): file_path for file_path in json_files}

#     # Use tqdm to display a progress bar
#     for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Processing files"):
#         try:
#             result = future.result()
#             all_triples.append(result)
#             save_results_to_file(result, output_dir)
#         except Exception as e:
#             print(f"Error processing file: {e}")

# print(f"All triples saved to {output_dir}")


In [None]:
import os
import json
directory = '/workspace/data'
output_dir = os.path.join(directory, "triples_10k")
print(output_dir)
# Initialize all_triples
all_triples = []

# Read all the pre-created txt files
for filename in os.listdir(output_dir):
    file_path = os.path.join(output_dir, filename)
    if filename.endswith(".txt"):  # Assuming the extracted triples are stored in txt files
        try:
            with open(file_path, "r", encoding="utf-8") as file:
                data = json.load(file)  # Load JSON data

                # Extract triples from "item_1a" and "item_7"
                item_1a_triples = data.get("item_1a", [])
                item_7_triples = data.get("item_7", [])

                # Merge extracted triples into all_triples
                all_triples.extend(item_1a_triples)
                all_triples.extend(item_7_triples)
                
        except Exception as e:
            print(f"Error reading {filename}: {e}")

print(f"Loaded {len(all_triples)} triples from {output_dir}")

In [None]:
# Extract unique entities (from subject and object) and relationships
unique_entities = set()
unique_relationships = set()

for triplet in all_triples:
    if len(triplet) == 5:  # Ensure the triplet is well-formed
        subject, _, relation, object_, _ = triplet
        unique_entities.add(subject)
        unique_entities.add(object_)
        unique_relationships.add(relation)

# Create mappings for entities and relationships
entities = list(unique_entities)
relations = list(unique_relationships)

entity_to_id = {entity: idx for idx, entity in enumerate(entities)}
relation_to_id = {relation: idx for idx, relation in enumerate(relations)}
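
One thing to be aware of: the iteration order of a set of strings can change between Python processes, so IDs assigned from `list(set)` may differ from run to run. If stable IDs matter (for example, when comparing CSV outputs across runs), sorting before enumerating is a simple option. The sketch below assumes the same five-element triple layout; `build_id_maps` and the sample triples are hypothetical.

```python
def build_id_maps(triples):
    """Assign IDs to entities and relations in sorted order so they are reproducible."""
    entities, relations = set(), set()
    for triple in triples:
        if len(triple) == 5:  # Skip malformed entries
            subject, _, relation, obj, _ = triple
            entities.update((subject, obj))
            relations.add(relation)
    entity_to_id = {name: idx for idx, name in enumerate(sorted(entities))}
    relation_to_id = {name: idx for idx, name in enumerate(sorted(relations))}
    return entity_to_id, relation_to_id


sample_triples = [
    ["Acme Corp", "COMP", "Has", "Market Risk", "RISK"],
    ["Acme Corp", "COMP", "Operate_In", "Software Industry", "SECTOR"],
    ["malformed"],
]
entity_ids, relation_ids = build_id_maps(sample_triples)
print(entity_ids)
print(relation_ids)
```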

In [None]:
# Map triples to IDs
triples_with_ids = []

for triplet in all_triples:
    if len(triplet) == 5:
        subject, _, relation, object_, _ = triplet
        subject_id = entity_to_id.get(subject, -1)  # Use -1 for unknown entities
        object_id = entity_to_id.get(object_, -1)  # Use -1 for unknown objects
        relation_id = relation_to_id.get(relation, -1)  # Use -1 for unknown relations
        triples_with_ids.append((subject_id, relation_id, object_id))

In [None]:
# Save entities DataFrame
entities_df = pd.DataFrame(list(entity_to_id.items()), columns=['entity_name', 'entity_id'])
entities_df.to_csv('entities_v1.csv', index=False)

# Save relations DataFrame
relations_df = pd.DataFrame(list(relation_to_id.items()), columns=['relation_name', 'relation_id'])
relations_df.to_csv('relations_v1.csv', index=False)

# Save triples DataFrame
triples_df = pd.DataFrame(triples_with_ids, columns=['subject_id', 'relation_id', 'object_id'])
triples_df.to_csv('triples_v1.csv', index=False)


## Accelerated Graph Construction with cuGraph and NetworkX
Now that we have our graph triples, we can construct our full knowledge graph for the corpus of 10-K documents.

In the next section, we demonstrate how to construct a graph using NetworkX and optionally accelerate it with cuGraph (GPU-accelerated graph analytics library). The graph is built from triples extracted from SEC 10-K filings, where each triple represents a relationship between two entities. This process is useful for creating knowledge graphs that can be queried for insights or used in downstream machine learning tasks.

In [None]:
import pandas as pd
import csv
def load_entities(filename):
    all_entities = {}
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file, delimiter=',')
        next(reader)  # Skip header row
        for row in reader:
            entity, id = row
            all_entities[int(id)] = entity
    return all_entities

def load_relations(filename):
    all_relations = {}
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file, delimiter=',')
        next(reader)  # Skip header row
        for row in reader:
            relation, id = row
            all_relations[int(id)] = relation
    
    return all_relations

def get_relation_tuples(all_entities, all_relations, dataset):
    # Expects a headerless, tab-separated file with subject_id, relation_id, object_id per row
    all_tuples = []
    with open(dataset, 'r', encoding='utf-8') as file:
        for line in file:
            subject, relation, obj = line.strip().split("\t")
            all_tuples.append((all_entities[int(subject)], all_relations[int(relation)], all_entities[int(obj)]))
    return all_tuples

In [None]:
import pandas as pd
import networkx as nx

# Load the triples from the CSV file
# Optional line to use NetworkX with cuGraph backend
# triples_df = cudf.DataFrame(all_triples, columns=['subject', 'subject_category', 'relationship', 'object', 'object_category'])

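# The CSV files were written with a header row, so header=0 tells pandas to replace it with
# the given names instead of reading it as a data row
triples_df = pd.read_csv("triples_v1.csv", header=0, names=['subject_id', 'relation_id', 'object_id'])

# Load the entities and relations DataFrames
entity_df = pd.read_csv("entities_v1.csv", header=0, names=['entity_name', 'entity_id'])
relations_df = pd.read_csv("relations_v1.csv", header=0, names=['relation_name', 'relation_id'])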

# Create a mapping from IDs to entity names and relation names
entity_name_map = entity_df.set_index("entity_id")["entity_name"].to_dict()
relation_name_map = relations_df.set_index("relation_id")["relation_name"].to_dict()

# Create the graph from the triples DataFrame
G = nx.from_pandas_edgelist(
    triples_df,
    source='subject_id',
    target='object_id',
    edge_attr='relation_id',
    create_using=nx.DiGraph(),
)

# Relabel the nodes with actual entity names
G = nx.relabel_nodes(G, entity_name_map)

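# Map relation IDs to relation names for edges
edge_attributes = nx.get_edge_attributes(G, "relation_id")
updated_edge_attributes = {
    (u, v): relation_name_map[edge_attributes[(u, v)]]
    for u, v in G.edges()
}
nx.set_edge_attributes(G, updated_edge_attributes, "relation_name")
# The query code later in this notebook reads the "relation" edge attribute, so set it as well
nx.set_edge_attributes(G, updated_edge_attributes, "relation")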

print("Graph constructed successfully!")
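
Note that `nx.DiGraph` keeps a single edge per ordered pair of nodes, so when two triples link the same subject and object with different relations, only one relation survives in the graph built above. `nx.MultiDiGraph` is the NetworkX class that retains parallel edges. Before choosing, it can help to check how often this happens. The sketch below does that with plain Python; `find_parallel_relations` and the sample triples are hypothetical.

```python
from collections import defaultdict


def find_parallel_relations(triples):
    """Return {(subject, object): sorted relations} for pairs linked by 2+ relations."""
    relations_by_pair = defaultdict(set)
    for subject, relation, obj in triples:
        relations_by_pair[(subject, obj)].add(relation)
    return {
        pair: sorted(relations)
        for pair, relations in relations_by_pair.items()
        if len(relations) > 1
    }


named_triples = [
    ("Acme Corp", "Produce", "Widget"),
    ("Acme Corp", "Announce", "Widget"),
    ("Acme Corp", "Operate_In", "Software Industry"),
]
collisions = find_parallel_relations(named_triples)
print(collisions)
```

If the result is empty for your data, the simpler `DiGraph` loses nothing.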


## Save the knowledge graph object

In [None]:
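# Save the graph to a GraphML file so it can be visualized in Gephi Lite
# (.graphml is the conventional GraphML extension; .gml usually denotes the separate GML format)
nx.write_graphml(G, "sec_knowledge_graph.graphml")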



## Sample queries against the graph

In [None]:
# print(graph.get_triples())

In [None]:
# Query the graph using LangChain
from langchain.chains import GraphQAChain
from langchain.indexes.graph import NetworkxEntityGraph
graph = NetworkxEntityGraph(G)


# llm.invoke("hello")
chain = GraphQAChain.from_llm(llm = llm, graph=graph, verbose=True)
res = chain.run("What risk factors did ZoomInfo Technologies Inc mention? Only use the knowledge graph, do not use your own context.")
print(res)

In [None]:
res = chain.run("What led to revenue growth for BOX Inc? Please use the knowledge graph triples, not your own learning")
print(res)


In [None]:
res = chain.run("What econ indicators were mentioned by Apple Inc?")
print(res)


In [None]:
res = chain.run("What sectors does Amazon operate in?")
print(res)


In [None]:
res = chain.run("Were there any risk factors for DATADOG, Inc?")
print(res)


In [None]:
res = chain.run("What are the most common risk factors across all companies?")
print(res)
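
GraphQAChain appears to build its context from the entities it detects in the question, so an aggregate question like the last one, which names no specific company, may receive little or no graph context. Counting directly over the extracted triples is a more dependable way to answer it. The sketch below assumes the five-element triple layout used in this notebook; `most_common_risks` and the sample data are hypothetical.

```python
from collections import Counter


def most_common_risks(triples, top_n=3):
    """Count how many distinct companies are linked to each RISK entity."""
    company_risk_pairs = set()
    for triple in triples:
        if len(triple) != 5:
            continue  # Skip malformed entries
        subject, subject_type, _, obj, object_type = triple
        if subject_type == "COMP" and object_type == "RISK":
            company_risk_pairs.add((subject, obj))
    return Counter(risk for _, risk in company_risk_pairs).most_common(top_n)


sample_triples = [
    ["Company A", "COMP", "Has", "Market Risk", "RISK"],
    ["Company B", "COMP", "Has", "Market Risk", "RISK"],
    ["Company A", "COMP", "Has", "Credit Risk", "RISK"],
    ["Company A", "COMP", "Has", "Market Risk", "RISK"],  # duplicate, counted once
    ["Company B", "COMP", "Operate_In", "Software Industry", "SECTOR"],
]
top_risks = most_common_risks(sample_triples)
print(top_risks)
```

Deduplicating (company, risk) pairs first keeps a company that repeats the same risk many times from dominating the count.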


## Adjusting the PromptTemplate via LangChain

In [None]:
from langchain.prompts import PromptTemplate

# Define a custom prompt template
# GraphQAChain fills {context} with the triples it retrieves and {question} with the query
custom_prompt_template = """
You are an expert in financial analysis and knowledge graphs. Use the following knowledge graph triples to answer the user's question.

Knowledge Graph Triples:
{context}

Note:
- Entities in the graph include both subject names (e.g., "Microsoft") and their categories (e.g., "COMP" for companies).
- Always include subject categories in your reasoning.
- If you don't know the answer, say "I don't know."

Question: {question}
Answer:
"""
# Create a PromptTemplate object
custom_prompt = PromptTemplate(
    template=custom_prompt_template,
    input_variables=["context", "question"]
)


In [None]:
from langchain.chains.graph_qa.base import GraphQAChain

# Initialize GraphQAChain with the custom prompt
chain = GraphQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    qa_prompt=custom_prompt  # from_llm takes the answer prompt as qa_prompt
)


In [None]:
# Run a query using the modified chain
res = chain.run("What risk factors did Tyler Technologies Inc mention in their 10-K report?")
print(res)


## Add custom context retrieval and chat template

In [None]:
from langchain.chains.llm import LLMChain

def extract_triples_for_entity(graph, entity):
    """Extract triples for the given entity in a simple string format."""
    if entity not in graph.nodes():
        return ""
    triples = []
    for neighbor in graph.neighbors(entity):
        edge_attr = graph.get_edge_data(entity, neighbor) or {}
        # Fall back to "relation_name" if the graph stores the relation under that key
        relation = edge_attr.get('relation') or edge_attr.get('relation_name', '')
        triples.append(f"{entity} -- {relation} --> {neighbor}")
    return "\n".join(triples)

# Pre-extract the triples for "Box Inc." (the name must match a graph node exactly)
entity = "Box Inc."
triples_str = extract_triples_for_entity(G, entity)
print("Extracted Triples:")
print(triples_str)

# Define your custom prompt template
custom_prompt_template = """
You are an expert in financial analysis and knowledge graphs. Use the following knowledge graph triples to answer the user's question.

Knowledge Graph Triples:
{graph_triples}

Note:
- Entities in the graph include both subject names (e.g., "Microsoft") and their categories (e.g., "COMP" for companies).
- Always include subject categories in your reasoning.
- If you don't know the answer, say "I don't know."

Question: {question}
Answer:
"""

prompt = PromptTemplate(
    template=custom_prompt_template,
    input_variables=["graph_triples", "question"]
)

llm = ChatNVIDIA(model="mistralai/mixtral-8x22b-instruct-v0.1")

# Compose the prompt and the model into a runnable chain
chain = prompt | llm

# Use your question as before
question = "What risk factors did Box Inc mention in their 10-K report?"

result = chain.invoke({"graph_triples": triples_str, "question": question})
answer_text = result.content
print("Answer:", answer_text)
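
On a directed graph, `graph.neighbors(entity)` yields only the nodes that the entity points to, so triples in which the entity is the object are left out of the context. The sketch below shows, on a plain list of triples, how both directions could be collected and formatted in the same arrow notation. `describe_entity` and the sample triples are hypothetical.

```python
def describe_entity(triples, entity):
    """Format triples where `entity` is the subject, followed by those where it is the object."""
    outgoing = [f"{s} -- {r} --> {o}" for s, r, o in triples if s == entity]
    incoming = [
        f"{s} -- {r} --> {o}"
        for s, r, o in triples
        if o == entity and s != entity  # Self-loops are already listed as outgoing
    ]
    return "\n".join(outgoing + incoming)


named_triples = [
    ("Acme Corp", "Has", "Market Risk"),
    ("Inflation", "Negative_Impact_On", "Acme Corp"),
    ("Other Co", "Has", "Credit Risk"),
]
context_text = describe_entity(named_triples, "Acme Corp")
print(context_text)
```

With a NetworkX `DiGraph`, the incoming side corresponds to `graph.predecessors(entity)`.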