**Microsoft GraphRAG Implementation**  

**Overview**  
This notebook is a basic GraphRAG implementation built with Microsoft's GraphRAG library. I use OpenAI's GPT-4o-mini as the language model and text-embedding-3-large for the embeddings.


**Preprocessing**  
I split my grandfather's memoir, "My Life Story", into 10 .txt files (one per chapter) and stored them in the "data/graphrag/input" folder.
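
The notebook does not include the splitting step itself, so the following is only a minimal sketch of how it could be done. It assumes each chapter begins with a heading line such as "Chapter 3"; the `split_into_chapters` and `write_chapters` helpers and the `chapter_01.txt` naming scheme are hypothetical and were not used for the actual memoir.

```python
import re
from pathlib import Path

# Assumption: every chapter starts on its own line with "Chapter <number>".
CHAPTER_HEADING = re.compile(r"^chapter\s+\d+\b.*$", re.IGNORECASE | re.MULTILINE)


def split_into_chapters(text: str) -> list[str]:
    """Return one string per chapter, each beginning at its heading line.

    Any text before the first heading (title page, preface) is dropped.
    If no heading is found, the whole text is returned as a single chapter.
    """
    starts = [match.start() for match in CHAPTER_HEADING.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


def write_chapters(chapters: list[str], out_dir: str) -> list[Path]:
    """Write each chapter to <out_dir>/chapter_NN.txt (hypothetical naming)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chapter in enumerate(chapters, start=1):
        path = out / f"chapter_{number:02d}.txt"
        path.write_text(chapter, encoding="utf-8")
        paths.append(path)
    return paths
```

With the memoir loaded into a string `memoir_text`, calling `write_chapters(split_into_chapters(memoir_text), "data/graphrag/input")` would place the chapter files where the indexer expects them.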


In [1]:
import os
import yaml
from dotenv import load_dotenv
from openai import OpenAI

In [2]:
# Load from .env file that contains the OpenAI API key
load_dotenv() 

# Get OpenAI API key from .env file
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
oai = OpenAI(api_key=OPENAI_API_KEY)
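
`os.getenv` returns `None` when the variable is missing, and that would only surface later as an authentication error. A small guard, sketched here with a hypothetical `require_env` helper that is not part of the original notebook, can fail fast instead:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the .env file or the environment."
        )
    return value
```

One way to use it would be `OPENAI_API_KEY = require_env("OPENAI_API_KEY")` right after `load_dotenv()`.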

In [18]:
!python -m graphrag init --root ./data/graphrag

# Load the settings from the 'data/graphrag/settings.yaml' file
# (safe_load is sufficient for plain config data and does not construct arbitrary Python objects)
with open('data/graphrag/settings.yaml', 'r') as f:
    settings_yaml = yaml.safe_load(f)

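# Update the settings for the language model
# Note: this writes the API key in plain text into settings.yaml, so keep that file out of version control
settings_yaml['llm']['model'] = "gpt-4o-mini"
settings_yaml['llm']['api_key'] = OPENAI_API_KEY
settings_yaml['llm']['type'] = 'openai_chat'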

# Update the settings for the embeddings
settings_yaml['embeddings']['llm']['api_key'] = OPENAI_API_KEY
settings_yaml['embeddings']['llm']['type'] = 'openai_embedding'
settings_yaml['embeddings']['llm']['model'] = 'text-embedding-3-large'

# Save the updated settings to the 'data/graphrag/settings.yaml' file
with open('data/graphrag/settings.yaml', 'w') as f:
    yaml.dump(settings_yaml, f)
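
Before indexing, it can be worth confirming that the values above actually landed in the file. The following is a small sketch rather than part of the original notebook: `find_mismatches` is a hypothetical helper, and the key layout (`llm.*` and `embeddings.llm.*`) is taken from the cell above, so it may not match the layout of other GraphRAG versions.

```python
# Expected values, keyed by the nested path used in the settings cell above.
EXPECTED = {
    ("llm", "model"): "gpt-4o-mini",
    ("llm", "type"): "openai_chat",
    ("embeddings", "llm", "model"): "text-embedding-3-large",
    ("embeddings", "llm", "type"): "openai_embedding",
}


def find_mismatches(settings: dict, expected: dict = EXPECTED) -> list[str]:
    """Return a description of every expected setting that is missing or different."""
    problems = []
    for path, want in expected.items():
        node = settings
        for key in path:
            # Walk down one level; a missing or non-dict level resolves to None.
            node = node.get(key) if isinstance(node, dict) else None
        if node != want:
            problems.append(f"{'.'.join(path)}: expected {want!r}, found {node!r}")
    return problems


# Usage (assumes settings.yaml has already been written by the cell above):
#   with open("data/graphrag/settings.yaml") as f:
#       print(find_mismatches(yaml.safe_load(f)) or "settings look as expected")
```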

Ensure that the .txt files are in the "data/graphrag/input" folder before running the following code.
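
A quick way to verify this from the notebook is sketched below. The `list_input_files` helper is hypothetical (not part of GraphRAG); it only checks that the folder exists and contains at least one .txt file.

```python
from pathlib import Path


def list_input_files(input_dir: str = "data/graphrag/input") -> list[Path]:
    """Return the sorted .txt files in the input folder, raising if there are none."""
    folder = Path(input_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Input folder not found: {folder}")
    files = sorted(folder.glob("*.txt"))
    if not files:
        raise FileNotFoundError(f"No .txt files found in: {folder}")
    return files
```

For this memoir, `len(list_input_files())` should come out to 10, matching the number of chapter files.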


In [19]:

# Index the memoir
!python -m graphrag index --root ./data/graphrag


⠋ GraphRAG Indexer 
Logging enabled at C:\Users\mattg\Desktop\Programming\Portfolio\data\graphrag\logs\indexing-engine.log
├── Loading Input (text) - 10 files loaded (0 filtered) ----- 100% 0:00… 0:00:…
└── create_base_text_units
🚀 create_base_text_units

[2024-12-14T00:07:55Z WARN  lance::dataset::write::insert] No existing dataset at C:\Users\mattg\Desktop\Programming\Portfolio\data\graphrag\output\lancedb\default-entity-description.lance, it will be created
[2024-12-14T00:07:58Z WARN  lance::dataset::write::insert] No existing dataset at C:\Users\mattg\Desktop\Programming\Portfolio\data\graphrag\output\lancedb\default-text_unit-text.lance, it will be created
[2024-12-14T00:08:02Z WARN  lance::dataset::write::insert] No existing dataset at C:\Users\mattg\Desktop\Programming\Portfolio\data\graphrag\output\lancedb\default-community-full_content.lance, it will be created


⠸ GraphRAG Indexer 
├── Loading Input (text) - 10 files loaded (0 filtered) ----- 100% 0:00… 0:00:…
├── create_base_text_units
├── create_final_documents
├── extract_graph
├── compute_communities
├── create_final_entities
├── create_final_relationships
├── create_final_communities
├── create_final_nodes
├── create_final_text_units
├── create_final_community_reports
└── generate_text_embeddings
    └── Verb generate_text_embeddings -------------------   97% 0:00:01 0:00:02

In [1]:
# Query the memoir using local search
!python -m graphrag query --root ./data/graphrag --method local --query "What is an example of one of his mom's sayings, and what did he refer to these sayings as?"




INFO: Vector Store Args: {
    "container_name": "==== REDACTED ====",
    "db_uri": "C:\\Users\\mattg\\Desktop\\Programming\\Portfolio\\data\\graphrag\\output\\lancedb",
    "overwrite": true,
    "type": "lancedb"
}
creating llm client with {'api_key': 'REDACTED,len=164', 'type': "openai_chat", 'encoding_model': 'cl100k_base', 'model': 'gpt-4o-mini', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'frequency_penalty': 0.0, 'presence_penalty': 0.0, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'audience': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25, 'responses': None}
creating embedding llm client with {'api_key': 'REDACTED,len=164', 'type': "openai_embedding", 'encoding_model': 'cl100k_base', 'model': 'text-embedding-3-large', 'max_toke