- Dgraph - Open-source, AI-ready graph database with horizontal scaling and GraphQL support
  - Provides ACID transactions, consistent replication, and linearizable reads
  - Native GraphQL integration optimizes on-disk data layout for query performance
  - Reduces disk seeks and network calls in clustered environments
- Python - High-level programming language for application development
- Docker - Container platform for consistent deployment environments
- Bash - Unix shell for automation and setup scripts
- Docker Engine 24.0+
- Clone the repository:

```shell
git clone https://gitlab.kis.agh.edu.pl/databases-2-2025/19-shevchenko-denys-karabanov-yehor.git
cd 19-shevchenko-denys-karabanov-yehor
```

- Start the database and load data (the `-y` flag skips any confirmation dialogs):

```shell
./dbcli.sh run
./dbcli.sh run -y
```

- Show all available commands:

```shell
./dbcli.sh --help
```

- Stop the database:

```shell
./dbcli.sh stop
```

- Clean up resources:

```shell
./dbcli.sh cleanup
```

- User Interface - Command-line interface for interacting with the database
- Python CLI - Application logic and database client implementation
- Dgraph Client - gRPC communication layer between application and database
- Dgraph Database - Underlying graph storage and query processing engine
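The gRPC layer between the Python CLI and Dgraph is provided by the official pydgraph client. A minimal sketch of how a query could be built and dispatched; the function name, port, and exact DQL below are illustrative assumptions, not the project's actual code:

```python
def build_count_predecessors_query(node_id: str) -> str:
    """Build a DQL query counting predecessors of a node via the
    reverse `to` edge. The id is interpolated as a quoted literal
    here for simplicity; a real client would use query variables."""
    return f"""{{
  predecessors(func: eq(id, "{node_id}")) {{
    id
    label
    count(~to)
  }}
}}"""

# Sending the query over gRPC with pydgraph (requires a running
# Dgraph Alpha, so it is left commented out here):
#
# import json
# import pydgraph
# stub = pydgraph.DgraphClientStub("localhost:9080")
# client = pydgraph.DgraphClient(stub)
# res = client.txn(read_only=True).query(build_count_predecessors_query("/c/en/happy"))
# data = json.loads(res.json)

if __name__ == "__main__":
    print(build_count_predecessors_query("/c/en/happy"))
```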
```
id: string @unique @index(hash) .
label: string @index(term) .
to: [uid] @reverse @count @facet(id, label) .
synonym: [uid] @reverse .
antonym: [uid] @reverse .
unique_neighbors_count: int @index(int) .

type Node {
  id
  label
  to
  synonym
  antonym
  unique_neighbors_count
}
```
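For reference, a schema of this shape can also be applied by hand through Dgraph's HTTP `/alter` endpoint on the Alpha node (the `dbcli.sh` script presumably automates this step; the file name and default port 8080 are assumptions):

```shell
# Apply the schema to a locally running Dgraph Alpha.
# Assumes schema.dql contains the predicate and type definitions above.
curl -s -X POST localhost:8080/alter --data-binary '@schema.dql'
```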
| Field | Type | Constraints | Description |
|---|---|---|---|
| `id` | string | `@unique @index(hash)` | Unique identifier for each node. The hash index enables fast lookups by ID. |
| `label` | string | `@index(term)` | Descriptive label for the node. Term indexing supports text-search functionality. |
| `to` | [uid] | `@reverse @count @facet(id, label)` | Array of references to connected nodes. The `@reverse` directive enables bidirectional navigation between nodes. The `@count` directive stores the number of connected nodes. Each connection includes facets (edge properties) storing `id` and `label` metadata. |
| `synonym` | [uid] | `@reverse` | Direct links to synonym nodes, enabling efficient synonym search and traversal in both directions. |
| `antonym` | [uid] | `@reverse` | Direct links to antonym nodes, enabling efficient antonym search and traversal in both directions. |
| `unique_neighbors_count` | int | `@index(int)` | Stores the count of unique nodes connected to this node. The integer index enables efficient filtering and sorting by neighbor count, which is essential for queries such as finding the nodes with the most connections. |
Each edge in the `to` field carries additional facet properties:
- `id`: Unique identifier for the edge relationship (e.g., "/r/DefinedAs")
- `label`: Descriptive label for the relationship (e.g., "defined as")
```
# Node: Zero (0)
_:_c_en_0 <id> "/c/en/0" .
_:_c_en_0 <label> "0" .

# Node: Empty Set
_:_c_en_empty_set <id> "/c/en/empty_set" .
_:_c_en_empty_set <label> "empty set" .

# Edge: 0 defined as Empty Set (with facets)
_:_c_en_0 <to> _:_c_en_empty_set (id="/r/DefinedAs<;>/r/Synonym", label="defined as<;>synonym") .
```
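The `<;>` separator above packs multiple relation ids into a single facet string. A minimal sketch of how such packed facets could be split back into lists on the client side; the separator and field names follow the example above, but the helper itself is illustrative:

```python
def unpack_facets(facets: dict, sep: str = "<;>") -> dict:
    """Split packed facet strings (e.g. id="/r/DefinedAs<;>/r/Synonym")
    into lists; single values become one-element lists."""
    return {key: value.split(sep) for key, value in facets.items()}

packed = {"id": "/r/DefinedAs<;>/r/Synonym", "label": "defined as<;>synonym"}
print(unpack_facets(packed))
# {'id': ['/r/DefinedAs', '/r/Synonym'], 'label': ['defined as', 'synonym']}
```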
- Directed graph structure with rich relationship metadata
- Bidirectional navigation via the `@reverse` directive
- Strategic indexing for optimized query performance
- Facets to store additional edge properties
The system can count all predecessors or successors of a given word.
```shell
./dbcli.sh count-predecessors /c/en/happy
```

```
{
  "predecessors": [
    {
      "id": "/c/en/happy",
      "label": "123",
      "count": 8550
    }
  ]
}
```

The system can find synonyms and antonyms that are multiple relationship hops away from a given word.
```shell
./dbcli.sh find-distant-antonyms /c/en/math 2
```

```
{
  "distant_antonyms": [
    {
      "id": "/c/en/impugnable",
      "label": "impugnable"
    },
    {
      "id": "/c/en/supposed",
      "label": "supposed"
    },
    {
      "id": "/c/en/suspicious",
      "label": "suspicious"
    },
    {
      "id": "/c/en/loose",
      "label": "loose"
    },
    {
      "id": "/c/en/alleged",
      "label": "alleged"
    },
    {
      "id": "/c/en/approximative",
      "label": "approximative"
    },
    {
      "id": "/c/en/hopeless",
      "label": "hopeless"
    }
  ]
}
```
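A distant antonym at depth 2 can be reached, for example, as an antonym of a synonym or a synonym of an antonym. A toy in-memory sketch of this hop expansion follows; the real implementation runs inside Dgraph, and the graph data below is invented purely for illustration:

```python
def find_distant(graph: dict, start: str, depth: int, edge: str = "antonym") -> set:
    """Breadth-first expansion over 'synonym' and 'antonym' edges.
    A path counts as an antonym path if it crosses an odd number of
    antonym edges; nodes exactly `depth` hops away are collected."""
    frontier = {(start, 0)}  # (node, antonym-edge parity)
    for _ in range(depth):
        nxt = set()
        for node, parity in frontier:
            for n in graph.get(node, {}).get("synonym", []):
                nxt.add((n, parity))
            for n in graph.get(node, {}).get("antonym", []):
                nxt.add((n, parity ^ 1))
        frontier = nxt
    want = 1 if edge == "antonym" else 0
    return {n for n, p in frontier if p == want and n != start}

# Tiny made-up graph: exact <-> approximate are antonyms,
# approximate ~ rough are synonyms.
g = {
    "exact": {"antonym": ["approximate"]},
    "approximate": {"synonym": ["rough"], "antonym": ["exact"]},
    "rough": {"synonym": ["approximate"]},
}
print(sorted(find_distant(g, "exact", 2)))  # ['rough']
```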
The system can find all similar nodes that share the same parent or child through an edge with the same id.

```shell
./dbcli.sh find-similar-nodes /c/en/0
```

```
{
  "similar_nodes": [
    {
      "id": "/c/en/zero",
      "label": "zero",
      "shared_connections": [
        {
          "via_node": "/c/en/empty_set",
          "edge_type": [
            "/r/RelatedTo",
            "/r/UsedFor"
          ]
        },
        {
          "via_node": "/c/en/set_containing_one_element",
          "edge_type": "/r/IsA"
        }
      ]
    },
    {
      "id": "/c/en/addy/n",
      "label": "addy",
      "shared_connections": [
        {
          "via_node": "/c/en/internet_slang",
          "edge_type": "/r/HasContext"
        }
      ]
    },
    {
      "id": "/c/en/afaic",
      "label": "afaic",
      "shared_connections": [
        {
          "via_node": "/c/en/internet_slang",
          "edge_type": "/r/HasContext"
        }
      ]
    },
    {
      "id": "/c/en/baww/v",
      "label": "baww",
      "shared_connections": [
        {
          "via_node": "/c/en/internet_slang",
          "edge_type": "/r/HasContext"
        }
      ]
    }
  ]
}
```

```
query countNodesWithSingleNeighbor {
  single_neighbor_count(func: eq(unique_neighbors_count, 1)) {
    count(uid)
  }
}
```
```
query findNodesWithMostNeighbors {
  var(func: has(id), orderdesc: unique_neighbors_count, first: 1) {
    max_val as unique_neighbors_count
  }
  nodes_with_most_neighbors(func: eq(unique_neighbors_count, val(max_val))) {
    id
    label
    unique_neighbors_count
  }
}
```
Add the `--verbose` or `-v` flag to any command to display detailed execution-time measurements and, where applicable, the number of elements in the result:

```shell
./dbcli.sh --verbose count-predecessors /c/en/happy
```

```
{query_name} has {amount} elements
`query output`
Query executed in 0.34 seconds
```
Add the `--raw` or `-r` flag to any command to display raw query results without formatting:

```
label1, label2, label3
```
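As an illustration of what this flattening could look like, given the JSON result shapes shown earlier; the helper function is an assumption, not the project's actual code:

```python
import json

def to_raw(result_json: str) -> str:
    """Flatten a JSON query result into a comma-separated label list."""
    data = json.loads(result_json)
    labels = [item["label"] for block in data.values() for item in block]
    return ", ".join(labels)

print(to_raw('{"distant_antonyms": [{"id": "/c/en/loose", "label": "loose"}, '
             '{"id": "/c/en/alleged", "label": "alleged"}]}'))
# loose, alleged
```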
- Initial Research & Setup
  - Analysis of graph database options and selection of Dgraph
  - Docker environment configuration for consistent development
  - Basic project structure setup with Python and Bash scripts
- Data Processing Pipeline
  - Development of data conversion scripts
  - Implementation of RDF format transformation
  - Optimization of the bulk data loading process
- Database Architecture
  - Graph schema design with node and edge properties
  - Index optimization for query performance
  - Implementation of bidirectional relationships
- Query System Development
  - Core GraphQL± (DQL) query implementation for all given tasks
  - Query optimization and testing
- CLI Tools Creation
  - Development of dbcli.sh for database management
  - Implementation of the Python CLI interface
  - Integration of health checks and retry mechanisms
- Testing & Optimization
  - Query performance testing
  - Load testing and optimization
- Documentation & Deployment
  - Project documentation and usage examples
  - Architecture diagrams
  - Deployment instructions
Note:
All testing was performed on a PC with the following specifications:
- CPU: Intel Core i5-14400F
- GPU: NVIDIA RTX 3060 12GB
- RAM: 16GB DDR4 @ 3600MHz
- SSD: Samsung 980 M.2 NVMe
The Python RDF converter underwent several optimization iterations, resulting in a 54.2% improvement in execution time from the initial implementation. Below is a breakdown of each optimization step and its impact:
| Optimization Step | Execution Time | Improvement | Description |
|---|---|---|---|
| Basic Implementation | 72 sec | Baseline | Original implementation using standard Python libraries |
| ID Sanitizer without Regex | 70 sec | 2.8% | Replaced regex-based ID sanitization with direct string operations |
| Escape String Caching | 69 sec | 4.1% | Implemented caching mechanism for repeated string escape operations |
| Precompiled Regex Patterns | 62 sec | 13.9% | Used precompiled regex patterns for escape_string function |
| Batch Processing Approach | 48 sec | 33.3% | Redesigned batch processing algorithm for more efficient memory usage |
| Label Handling | 33 sec | 54.2% | Fixed handling of labels containing pipes, improved null-label processing, dynamic batch sizing, reduced compression level, removed defaultdict usage, and fewer function calls |
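Two of the string-level techniques from the table, precompiled regex patterns and caching of repeated escape operations, can be sketched as follows; the escape rules here are illustrative, not the converter's exact ones:

```python
import re
from functools import lru_cache

# Compiled once at import time instead of inside the hot loop.
_ESCAPE_RE = re.compile(r'["\\\n]')
_ESCAPES = {'"': '\\"', "\\": "\\\\", "\n": "\\n"}

@lru_cache(maxsize=None)
def escape_string(value: str) -> str:
    """Escape quotes, backslashes, and newlines for RDF N-Quads output.
    lru_cache pays off because labels repeat heavily in the dataset."""
    return _ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0)], value)

print(escape_string('say "hi"'))  # say \"hi\"
```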
We achieved dramatic improvements in database loading times through several optimization techniques:
| Optimization Step | Loading Time | Improvement | Description |
|---|---|---|---|
| Basic Loading | 43 mins | Baseline | Initial implementation using default settings and raw RDF files |
| Reduced Properties | 36 mins | 16.3% | Eliminated redundant properties from the dataset |
| RDF.GZ Compression | 28 mins | 34.9% | Used compressed RDF.GZ format instead of raw RDF |
| Dgraph Bulk Loader | 4.46 mins | 89.6% | Switched from Dgraph Live Loader to Bulk Loader |
- Data Preprocessing: Analyzing and removing redundant properties significantly reduced the data volume while preserving all necessary information. In particular, storing node type information turned out to be unnecessary for the current use case, since the data is already structured as a graph.
- Compression Benefits: Using the .rdf.gz format not only reduced storage requirements but also decreased I/O overhead during loading, leading to faster processing times.
- Bulk vs. Live Loading: The most dramatic improvement came from switching from Dgraph's Live Loader to the Bulk Loader:
  - Live Loader: processes data transactionally, with built-in consistency checks
  - Bulk Loader: bypasses transaction processing, generating and loading SSTable files directly

While the Bulk Loader requires a database restart, the 89.6% total reduction in loading time justified this trade-off for the initial data-loading scenario.
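For reference, a Bulk Loader invocation of the shape used here might look as follows; the file names are placeholders, and `dbcli.sh run` presumably wraps this step. The `--files` flag accepts the compressed .rdf.gz input directly, and a Zero node must already be running:

```shell
# Offline bulk load: runs before any Alpha node is started.
dgraph bulk \
  --files data.rdf.gz \
  --schema schema.dql \
  --zero localhost:5080
# The generated ./out/0/p directory is then used as the Alpha's
# posting directory when the cluster starts.
```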
We conducted performance testing of all implemented queries to ensure efficient data retrieval. The testing revealed the following performance categories:
- Fast (< 0.1s): Simple lookups and count operations (count-successors, count-predecessors, count-neighbors, find-neighbors, find-shortest-path)
- Medium (0.1s - 1s): Intermediate operations (count-nodes-single-neighbor, find-nodes-most-neighbors, find-successors, find-predecessors, find-grandchildren, find-grandparents)
- Slow (1s - 5s): Complex relationship analysis (find-similar-nodes, find-distant-synonyms, find-distant-antonyms, count-nodes)
- Very Slow (> 5s): Full graph operations requiring large-scale traversal (count-nodes-no-successors, count-nodes-no-predecessors)
- `@index(hash)`: Enables near-instant node lookups by ID (0.01s for indexed operations)
  - Creates specialized search indexes for string properties
  - Reduces query execution time by ~95% for ID-based lookups compared to non-indexed queries
  - Critical for operations like `find-successors` and `find-predecessors`
- `@count`: Pre-calculates edge counts, enabling fast neighbor counting (0.01s)
  - Eliminates expensive counting operations at query time
  - Stores count metadata directly with relationships
  - Enables sub-0.1s performance for `count-successors`, `count-predecessors`, and `count-neighbors`
  - Without `@count`, these operations would require a full traversal (>1s)
- `@index(int)`: Enables efficient filtering and sorting on integer fields
  - Improves performance for queries based on neighbor counts
  - Essential for queries like `count-nodes-single-neighbor` (0.16s) and `find-nodes-most-neighbors` (0.66s)
These directives significantly improved our most frequent operations, with 5 of 17 queries achieving <0.1s execution time.
- Infrastructure & DevOps:
- Docker configuration and optimization
- Shell script automation and dbcli.sh
- Performance Testing & Optimization:
- Bulk data import
- Query execution optimization
- Data conversion optimization
- Additional Queries (Tasks 16-18):
- Distant synonyms/antonyms analysis
- Pathfinding query
- Database Schema Architecture:
- Graph data model design
- Index optimization
- Relationship structure
- Python Applications:
- CLI interface implementation
- Data conversion utilities
- Query system architecture
- Core Query Development (Tasks 1-15):
- Node counting operations
- Relationship path analysis
- Neighbor pattern detection
10/10 - Project meets all requirements and includes additional features beyond the basic specifications.
MIT