# Building a Dynamic Knowledge Graph from OpenAI Posts

This notebook demonstrates how to build a dynamic knowledge graph from OpenAI social media posts using the iText2KG_Star framework. The process involves extracting facts from the posts, building relationships between entities, and incrementally creating a knowledge graph that can be visualized and analyzed.

## Overview

The workflow consists of several key steps:
1. **Setup**: Configure language models and embedding models
2. **Data Preparation**: Prepare OpenAI posts data
3. **Information Extraction**: Extract facts from posts
4. **Knowledge Graph Construction**: Build relationships and create the graph
5. **Visualization**: Visualize the final knowledge graph


In [None]:
%load_ext autoreload
%autoreload 2


import sys
sys.path.append("..")

## Model Setup

iText2KG is compatible with all language models supported by LangChain. To use iText2KG, you will need both a chat model and an embeddings model.

For available chat models, refer to the options listed at: https://python.langchain.com/v0.2/docs/integrations/chat/.
For embedding models, explore the choices at: https://python.langchain.com/v0.2/docs/integrations/text_embedding/.

This notebook shows how to run iText2KG using Mistral and OpenAI models.

**Please ensure that you install the necessary package for each chat model before use.**


# Mistral

For Mistral, set up the chat model using the tutorial here: https://python.langchain.com/v0.2/docs/integrations/chat/mistralai/. For the embedding model, follow the setup guide here: https://python.langchain.com/v0.2/docs/integrations/text_embedding/mistralai/.

In [None]:
from langchain_mistralai import ChatMistralAI
from langchain_mistralai import MistralAIEmbeddings

mistral_api_key = "##"
mistral_llm_model = ChatMistralAI(
    api_key = mistral_api_key,
    model="mistral-large-latest",
    temperature=0,
    max_retries=2,
)


mistral_embeddings_model = MistralAIEmbeddings(
    model="mistral-embed",
    api_key = mistral_api_key
)

# OpenAI

The same applies to OpenAI.

Set up the chat model using this tutorial: https://python.langchain.com/v0.2/docs/integrations/chat/openai/
For the embedding model, see: https://python.langchain.com/v0.2/docs/integrations/text_embedding/openai/

In [None]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

openai_api_key = "##"

openai_llm_model = ChatOpenAI(
    api_key = openai_api_key,
    model="o3-mini",
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

openai_embeddings_model = OpenAIEmbeddings(
    api_key = openai_api_key ,
    model="text-embedding-3-large",
)

# Understanding the OpenAI Posts Dataset

The (small) dataset contains a collection of OpenAI's social media posts from June-July 2025, covering various topics including:
- **Product Announcements**: New features and capabilities of ChatGPT
- **Research Updates**: Latest developments in AI research
- **Company News**: Partnerships, events, and organizational updates
- **Technical Insights**: Information about model capabilities and safety measures

Each post includes:
- **Observation Date**: When the post was made
- **Post Content**: The actual text content of the social media post

This diverse dataset provides a good foundation for building a dynamic knowledge graph that captures the relationships between the entities and concepts mentioned in OpenAI's communications.
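Because `observation_date` is a plain string such as "Jul 17 2025", it can help to parse it before doing any date arithmetic or ordering. The following is a minimal sketch, assuming the "Mon D YYYY" format seen in this dataset and an English locale for month abbreviations. The helper name `parse_observation_date` and the two-item `sample_posts` list are illustrative and not part of iText2KG.

```python
from datetime import datetime

def parse_observation_date(text: str) -> datetime:
    """Parse strings like 'Jul 9 2025' into a naive datetime at midnight."""
    return datetime.strptime(text, "%b %d %Y")

# Two posts in the same shape as the dataset (illustrative subset).
sample_posts = [
    {"observation_date": "Jul 17 2025", "post": "ChatGPT agent is ready to introduce itself."},
    {"observation_date": "Jun 18 2025", "post": "Record mode is rolling out today in ChatGPT to Pro, Enterprise, and Edu users. Available on macOS desktop app."},
]

# Oldest first, in case chronological ingestion is preferred.
chronological = sorted(
    sample_posts,
    key=lambda p: parse_observation_date(p["observation_date"]),
)
print([p["observation_date"] for p in chronological])
```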


## Information Extraction Process

The **Document Distiller** module is responsible for extracting structured information from the raw text posts. This process involves:

1. **Fact Extraction**: Using LLM-based extraction to identify facts from the post contents.
2. **Structured Output**: Converting unstructured text into structured data format

The extraction uses a simple query that instructs the LLM to act as an information extractor, with the `Facts` model as the output data structure.
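The shape of that structured output, and the concurrent extraction pattern, can be pictured as follows. This is only a sketch: `FactsSketch` is a standard-library stand-in for the real `Facts` model in `itext2kg.models` (whose exact definition is not shown here), and `fake_distill` is a hypothetical placeholder for the LLM call. The one detail taken from this notebook is that the result exposes a `facts` list of strings.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List

@dataclass
class FactsSketch:
    """Stand-in for the structured output: a list of short factual sentences."""
    facts: List[str] = field(default_factory=list)

async def fake_distill(post: str) -> FactsSketch:
    """Hypothetical extractor: splits on periods instead of calling an LLM."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    sentences = [s.strip() for s in post.split(".") if s.strip()]
    return FactsSketch(facts=[s + "." for s in sentences])

async def distill_all(posts: List[str]) -> List[FactsSketch]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_distill(p) for p in posts))
```

Inside a notebook, `await distill_all(posts)` can be used directly; in a plain script, wrap the call in `asyncio.run(...)`.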


## Document Distiller

### OpenAI posts

In [3]:
openai_posts = [
  {
    "observation_date": "Jul 17 2025",
    "post": "By developing heuristic algorithms to tackle challenging NP-hard optimization problems, our LLMs demonstrate their ability to reason, strategically explore solutions, and progressively refine their approach. This highlights our models' capacity for sustained problem-solving,"
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "Let’s show you what ChatGPT agent can do. Need Monday metrics? ChatGPT agent can fetch the data, generate the spreadsheet, and schedule it to run again—automatically."
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "Context-aware, action-ready. ChatGPT agent understands the task, chooses the right tool, and gets to work. It uses connectors, context, and custom instructions to take smarter actions on your behalf."
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "Slideshows, no stress. ChatGPT agent is your one-stop shop for fully editable presentations."
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "OpenAI reposted Epoch AI @EpochAIResearch: We have graded the results of @OpenAI's evaluation on FrontierMath Tier 1–3 questions, and found a 27% (± 3%) performance. ChatGPT agent is a new model fine-tuned for agentic tasks, equipped with text/GUI browser tools and native terminal access."
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "ChatGPT can now do work for you using its own computer. Introducing ChatGPT agent—a unified agentic system combining Operator’s action-taking remote browser, deep research’s web synthesis, and ChatGPT’s conversational strengths."
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "ChatGPT agent has new capabilities that introduce new risks. We’ve implemented extensive safety mitigations across multiple categories of risks accordingly. One particular emphasis has also been placed on safeguarding ChatGPT agent against adversarial manipulation through prompt"
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "We’ve decided to treat this launch as High Capability in the Biological and Chemical domain under our Preparedness Framework, and activated the associated safeguards. This is a precautionary approach, and we detail our safeguards in the system card. We outlined our approach on"
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "ChatGPT agent starts rolling out today to Pro, Plus, and Team users. Pro users will get access by the end of day, while Plus and Team users will get access over the next few days. Enterprise and Edu users will get access in the coming weeks."
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "ChatGPT agent uses a full suite of tools, including a visual browser, text browser, a terminal, and direct APIs."
  },
  {
    "observation_date": "Jul 17 2025",
    "post": "ChatGPT agent is ready to introduce itself."
  },

  {
    "observation_date": "Jul 16 2025",
    "post": "Plus users, the mic is yours. Record mode is now available to ChatGPT Plus users globally in the macOS desktop app."
  },
  {
    "observation_date": "Jul 15 2025",
    "post": "Chain of Thought (CoT) monitoring could be a powerful tool for overseeing future AI systems—especially as they become more agentic. That’s why we’re backing a new research paper from a cross-institutional team of researchers pushing this work forward."
  },
  {
    "observation_date": "Jul 15 2025",
    "post": "A Chief Economist and a COO walk into a podcast...@ronniechatterji and @bradlightcap talk about the future of jobs and the economy in the age of AI on Episode 3 of the OpenAI podcast, now live."
  },
  {
    "observation_date": "Jul 15 2025",
    "post": "Listen to the OpenAI Podcast on— Spotify Apple YouTube Brad Lightcap and Ronnie Chatterji on jobs, growth, and the AI... The future of work is arriving faster than expected. In this episode, OpenAI COO Brad Lightcap and Chief Economist Ronnie Chatterji join Andrew Mayne to disc..."
  },
  {
    "observation_date": "Jul 9 2025",
    "post": "The io Products, Inc. deal has officially closed and we’re thrilled to welcome the team to OpenAI. Jony Ive & LoveFrom remain independent. They’ll have deep design & creative responsibilities across OpenAI."
  },
  {
    "observation_date": "Jul 1 2025",
    "post": "OpenAI Podcast Episode 2 is now live!@markchen90 and @nickaturley join @andrewmayne to pull back the curtain on the making of ChatGPT. They also get into how products are developed and what’s next for agentic coding and multimodal assistants."
  },
  {
    "observation_date": "Jun 26 2025",
    "post": "OpenAI DevDay Oct 6, 2025 in San Francisco. Our biggest one yet: - 1500+ developers - Livestreamed opening keynote - Hands-on building with our latest models & tools - More stages & more demos"
  },
  {
    "observation_date": "Jun 25 2025",
    "post": "ChatGPT connectors for Google Drive, Dropbox, SharePoint, and Box are now available to Pro users (excluding EEA, CH, UK) in ChatGPT outside of deep research. Perfect for bringing in your unique context for everyday work."
  },
  {
    "observation_date": "Jun 18 2025",
    "post": "OpenAI reposted Johannes Heidecke @JoHeidecke: 1/ Our models are becoming more capable in biology and we expect upcoming models to reach ‘High’ capability levels as defined by our Preparedness Framework."
  },
  {
    "observation_date": "Jun 18 2025",
    "post": "Record mode is rolling out today in ChatGPT to Pro, Enterprise, and Edu users. Available on macOS desktop app."
  },
  {
    "observation_date": "Jun 18 2025",
    "post": "OpenAI reposted Miles Wang @MilesKWang: We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated:"
  },
  {
    "observation_date": "Jun 18 2025",
    "post": "Understanding and preventing misalignment generalization. Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this"
  }
]



In [None]:
from itext2kg.documents_distiller import DocumentsDistiller


document_distiller = DocumentsDistiller(llm_model = mistral_llm_model)

In [None]:
import asyncio
from itext2kg.models import Facts

IE_query = '''
# DIRECTIVES : 
- Act like an experienced information extractor. 
'''
# Distill every post concurrently; results keep the order of openai_posts.
facts_about_posts = await asyncio.gather(*[
    document_distiller.distill(documents=["OpenAI posted: " + post["post"]],
                               IE_query=IE_query,
                               output_data_structure=Facts)
    for post in openai_posts
])

## Knowledge Graph Construction

The **iText2KG_Star** module is the core component that builds the knowledge graph. This process involves:

1. **Entity Extraction**: Automatically derives entities from relationships
2. **Relationship Mapping**: Identifies connections between entities
3. **Graph Building**: Constructs the knowledge graph structure
4. **Dynamic Updates**: Incrementally adds new information to existing graphs

The process uses similarity thresholds to control entity matching (0.8) and relationship merging (0.7). Higher thresholds create more distinct entities and relationships, while lower thresholds encourage more merging.
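To make the effect of a threshold concrete, here is a minimal sketch of threshold-based matching on embedding vectors using cosine similarity. It is an illustration only: the function names and the toy 3-dimensional vectors are hypothetical, and iText2KG's internal matching logic may differ in its details.

```python
import math
from typing import Dict, Optional, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(candidate: Sequence[float],
               existing: Dict[str, Sequence[float]],
               threshold: float) -> Optional[str]:
    """Return the most similar existing name if it reaches the threshold, else None."""
    best_name, best_score = None, threshold
    for name, vector in existing.items():
        score = cosine_similarity(candidate, vector)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Toy embeddings (made-up values, not produced by a real model).
existing = {"ChatGPT agent": [0.9, 0.1, 0.0], "OpenAI": [0.0, 0.2, 0.9]}
candidate = [0.8, 0.3, 0.0]

print(best_match(candidate, existing, threshold=0.8))   # merged into an existing entity
print(best_match(candidate, existing, threshold=0.99))  # kept as a distinct entity
```

With the lower threshold the candidate is merged into "ChatGPT agent"; with the stricter one no match is returned, so it would stay a separate entity.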


In [7]:
# Attach the extracted facts to their source post.
for post, distilled in zip(openai_posts, facts_about_posts):
    post["facts"] = distilled.facts

In [8]:
openai_posts

[{'observation_date': 'Jul 17 2025',
  'post': "By developing heuristic algorithms to tackle challenging NP-hard optimization problems, our LLMs demonstrate their ability to reason, strategically explore solutions, and progressively refine their approach. This highlights our models' capacity for sustained problem-solving,",
  'facts': ['OpenAI develops heuristic algorithms.',
   'Heuristic algorithms tackle challenging NP-hard optimization problems.',
   "OpenAI's LLMs demonstrate the ability to reason.",
   "OpenAI's LLMs strategically explore solutions.",
   "OpenAI's LLMs progressively refine their approach.",
   "This highlights OpenAI's models' capacity for sustained problem-solving."]},
 {'observation_date': 'Jul 17 2025',
  'post': 'Let’s show you what ChatGPT agent can do. Need Monday metrics? ChatGPT agent can fetch the data, generate the spreadsheet, and schedule it to run again—automatically.',
  'facts': ['ChatGPT agent can fetch the data.',
   'ChatGPT agent can generate t

## iText2KG_Star for graph construction


In [None]:
from itext2kg import iText2KG_Star

# Initialize the graph builder with a chat model and an embeddings model.
simple_kg = iText2KG_Star(llm_model=mistral_llm_model, embeddings_model=openai_embeddings_model)

# Build the initial graph from the first post; entities are derived from relationships.
kg = await simple_kg.build_graph(sections=openai_posts[0]["facts"],
                                 observation_date=openai_posts[0]["observation_date"],
                                 ent_threshold=0.8,
                                 rel_threshold=0.7)

# Fold each remaining post into the existing graph.
for post in openai_posts[1:]:
    kg = await simple_kg.build_graph(sections=post["facts"],
                                     observation_date=post["observation_date"],
                                     ent_threshold=0.8,
                                     rel_threshold=0.7,
                                     existing_knowledge_graph=kg.model_copy())

# Draw the graph
---

This section visualizes the constructed knowledge graph using `Neo4jStorage`. The Neo4j graph database is accessed with the specified credentials, and the resulting graph shows the entities and relationships extracted from the posts.

In [None]:
from itext2kg.graph_integration import Neo4jStorage


URI = "bolt://localhost:7687"
USERNAME = "neo4j"
PASSWORD = "##"

Neo4jStorage(uri=URI, username=USERNAME, password=PASSWORD).visualize_graph(knowledge_graph=kg)

# Retrieve the snapshots
---

The following modeling approach is designed to show the dynamism of a graph as new information is added, while ensuring consistency by grouping observations of the same fact.

In this version, each relationship has a property, `observation_dates`, which is a list of Unix timestamps representing when the relationship was observed. If an incoming piece of information describes an identical relationship, its observation date is simply appended to the existing list. A snapshot for a specific date is therefore defined as the subgraph formed by all relationships that include a corresponding timestamp in their `observation_dates` list.
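The bookkeeping described above can be sketched in plain Python. This is a toy model, not the iText2KG data structure: relationships are keyed by a hypothetical `(source, predicate, target)` tuple, and the timestamps are arbitrary example values.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def observe(graph: Dict[Triple, List[float]], triple: Triple, timestamp: float) -> None:
    """Record an observation; an identical relationship only gains a new timestamp."""
    graph.setdefault(triple, []).append(timestamp)

def snapshot(graph: Dict[Triple, List[float]], lower: float, upper: float) -> List[Triple]:
    """Relationships with at least one observation inside [lower, upper]."""
    return [
        triple
        for triple, dates in graph.items()
        if any(lower <= ts <= upper for ts in dates)
    ]

graph: Dict[Triple, List[float]] = {}
observe(graph, ("Record mode", "available_in", "ChatGPT"), 100.0)
observe(graph, ("Record mode", "available_in", "ChatGPT"), 200.0)  # same fact seen again
observe(graph, ("ChatGPT agent", "uses", "terminal"), 200.0)

print(snapshot(graph, 50.0, 150.0))
print(snapshot(graph, 150.0, 250.0))
```

The repeated relationship is stored once with two timestamps, so it appears in both snapshots, while the second relationship appears only in the later one.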

* **Retrieving a Single-Day Snapshot:** To retrieve the snapshot for a specific day, for example July 17, 2025, query for relationships with an observation date that falls within that day's time range. Note that **this version does not handle conflict resolution**, such as start and end times for a fact's validity, and that the **Cypher query used for this model is not the most optimized one**. The bounds below were computed on a machine in the UTC+2 timezone; other timezones yield different values.

    * **Lower Bound** (July 17, 2025 00:00:00): `1752703200`
    * **Upper Bound** (July 17, 2025 23:59:59): `1752789599`

* **Analyzing Graph Evolution:** To observe how the knowledge graph aggregates and evolves over a period, such as from June 18 to July 17, 2025, you can fix the lower bound and progressively increase the upper bound. This allows you to retrieve a series of cumulative snapshots that show how information changes over time.
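The progressive upper bounds can also be generated programmatically. A minimal sketch, assuming day-level granularity and UTC day boundaries (the cell below uses the machine's local timezone instead, so its numbers can differ); `cumulative_bounds` is a hypothetical helper, not part of iText2KG.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple

def cumulative_bounds(start: datetime, end: datetime) -> Iterator[Tuple[int, int]]:
    """Yield (lower, upper) Unix-second pairs.

    `lower` stays fixed at `start`; `upper` advances one day at a time
    and always points at 23:59:59 of the current day.
    """
    lower = int(start.timestamp())
    day = start
    while day <= end:
        upper = int((day + timedelta(days=1)).timestamp()) - 1
        yield lower, upper
        day += timedelta(days=1)

# Both datetimes are timezone-aware, so the result does not depend on the machine.
start = datetime(2025, 6, 18, tzinfo=timezone.utc)
end = datetime(2025, 7, 17, tzinfo=timezone.utc)

bounds = list(cumulative_bounds(start, end))
print(len(bounds), bounds[0], bounds[-1])
```

Each pair can then be used as the lower and upper bound of one cumulative snapshot query.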

In [15]:
from dateutil import parser

lower_bound = "17 july 2025 00:00:00"
upper_bound = "17 july 2025 23:59:59"

lower_bound_obj = parser.parse(lower_bound).timestamp()
upper_bound_obj = parser.parse(upper_bound).timestamp()

print("lower_bound_obj: ", lower_bound_obj)
print("upper_bound_obj: ", upper_bound_obj)

lower_bound_obj:  1752703200.0
upper_bound_obj:  1752789599.0


// Replace the two placeholders with Unix timestamps. Requires the APOC plugin.
WITH <put_here_the_lower_bound(unix)> AS lowerBound, <put_here_the_upper_bound(unix)> AS upperBound
MATCH (n)-[r]->(m)
WHERE r.observation_dates IS NOT NULL
WITH n, r, m,
     r.observation_dates AS tsList,
     lowerBound, upperBound
WHERE size(tsList) > 0
// Keep relationships whose earliest and latest observations both fall within the bounds.
WITH n, r, m,
     apoc.coll.min(tsList) AS minTs,
     apoc.coll.max(tsList) AS maxTs,
     lowerBound, upperBound
WHERE minTs >= lowerBound
  AND maxTs <= upperBound
RETURN n, r, m
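
Rather than pasting the bounds into the query by hand, they can be supplied as Cypher parameters. The sketch below only builds the query text and the parameter dictionary; the helper name `snapshot_query` is hypothetical, and how the pair is sent to Neo4j depends on the client in use. It keeps the same min/max filtering as the query above and likewise assumes the APOC plugin is installed.

```python
from typing import Any, Dict, Tuple

# Same filter as the query above, with $lowerBound / $upperBound as parameters.
SNAPSHOT_QUERY = """
MATCH (n)-[r]->(m)
WHERE r.observation_dates IS NOT NULL AND size(r.observation_dates) > 0
WITH n, r, m,
     apoc.coll.min(r.observation_dates) AS minTs,
     apoc.coll.max(r.observation_dates) AS maxTs
WHERE minTs >= $lowerBound AND maxTs <= $upperBound
RETURN n, r, m
""".strip()

def snapshot_query(lower_bound: float, upper_bound: float) -> Tuple[str, Dict[str, Any]]:
    """Return the query text and its parameters for one snapshot window."""
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    return SNAPSHOT_QUERY, {"lowerBound": lower_bound, "upperBound": upper_bound}

query, params = snapshot_query(1752703200.0, 1752789599.0)
print(params)
```

Passing values as parameters avoids editing the query string for every window and lets the database reuse the query plan.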