
CSKG. Compact, production-ready Python CLI and tooling to load, query and analyze a ConceptNet-style graph in Dgraph. Includes data conversion utilities, optimized bulk loading, and a command-line interface for common tasks.

Team 19: CSKG Database Implementation

Authors: Shevchenko Denys & Karabanov Yehor

πŸ“š Technology Stack

  • Dgraph - Open source, AI-ready graph database with horizontal scaling and GraphQL support

    • Provides ACID transactions, consistent replication, and linearizable reads
    • Native GraphQL integration optimizes disk arrangement for query performance
    • Reduces disk seeks and network calls in clustered environments
  • Python - High-level programming language for application development

  • Docker - Container platform for consistent deployment environments

  • Bash - Unix shell for automation and setup scripts

πŸ”§ Prerequisites

  • Docker Engine 24.0+

πŸš€ Quick Start

  1. Clone the repository:
git clone https://gitlab.kis.agh.edu.pl/databases-2-2025/19-shevchenko-denys-karabanov-yehor.git &&
cd 19-shevchenko-denys-karabanov-yehor
  2. Start the database and load the data (the -y flag skips all confirmation dialogs):
./dbcli.sh run
./dbcli.sh run -y
  3. Show all available commands:
./dbcli.sh --help
  4. Stop the database:
./dbcli.sh stop
  5. Clean up resources:
./dbcli.sh cleanup

πŸ—οΈ Architecture

Architecture Diagram

Components

  1. User Interface - Command-line interface for interacting with the database
  2. Python CLI - Application logic and database client implementation
  3. Dgraph Client - gRPC communication layer between application and database
  4. Dgraph Database - Underlying graph storage and query processing engine
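
To make the Dgraph Client layer concrete, here is a minimal sketch of how a Python application can talk to Dgraph over gRPC. The pydgraph library, the default Alpha port 9080, and the sample query are assumptions for illustration, not necessarily the exact code used by this project's CLI.

import json
import pydgraph

# Open a gRPC connection to the Dgraph Alpha (default port assumed).
stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

# Run a trivial read-only query to confirm the connection works.
query = "{ sample(func: has(id), first: 1) { id label } }"
response = client.txn(read_only=True).query(query)
print(json.loads(response.json))

stub.close()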

πŸ“ Database Design

Dgraph Schema

id: string @unique @index(hash) .
label: string @index(term) .
to: [uid] @reverse @facet(id, label) .
synonym: [uid] @reverse .
antonym: [uid] @reverse .
unique_neighbors_count: int @index(int) .

type Node {
    id
    label
    to
    synonym
    antonym
    unique_neighbors_count
}

Schema Visualization

Example Data Diagram

Field Descriptions

Field | Type | Constraints | Description
--- | --- | --- | ---
id | string | @unique @index(hash) | Unique identifier for each node. The hash index enables fast lookups by ID.
label | string | @index(term) | Descriptive label for the node. Term indexing supports text search.
to | [uid] | @reverse @count @facet(id, label) | References to connected nodes. The @reverse directive enables bidirectional navigation, @count stores the number of connected nodes, and each connection carries facets (edge properties) with id and label metadata.
synonym | [uid] | @reverse | Direct links to synonym nodes, enabling efficient synonym search and traversal in both directions.
antonym | [uid] | @reverse | Direct links to antonym nodes, enabling efficient antonym search and traversal in both directions.
unique_neighbors_count | int | @index(int) | Count of unique nodes connected to this node. The integer index enables efficient filtering and sorting by neighbor count, which is essential for queries like finding nodes with the most connections.

Edge Properties (Facets)

Each edge in the to field contains additional properties:

  • id: Unique identifier for the edge relationship (e.g., "/r/DefinedAs")
  • label: Descriptive label for the relationship (e.g., "defined as")
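
For illustration, these edge facets can be read back with the @facets directive in a DQL query. Below is a minimal sketch using the Python client; pydgraph, the port, and the concrete node id are assumptions rather than the project's actual query code.

import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")  # assumed default Alpha port
client = pydgraph.DgraphClient(stub)

# Fetch a node together with the id/label facets stored on its outgoing `to` edges.
query = """
query edges($id: string) {
  node(func: eq(id, $id)) {
    id
    label
    to @facets(id, label) {
      id
      label
    }
  }
}
"""
res = client.txn(read_only=True).query(query, variables={"$id": "/c/en/0"})
print(json.loads(res.json))
stub.close()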

Example Data (RDF Format)

# Node: Zero (0)
_:_c_en_0 <id> "/c/en/0" .
_:_c_en_0 <label> "0" .

# Node: Empty Set
_:_c_en_empty_set <id> "/c/en/empty_set" .
_:_c_en_empty_set <label> "empty set" .

# Edge: 0 defined as Empty Set (with facets)
_:_c_en_0 <to> _:_c_en_empty_set (id="/r/DefinedAs<;>/r/Synonym", label="defined as<;>synonym") .

Data Example Visualization

Example Data Diagram

Key Design Decisions

  • Directed graph structure with rich relationship metadata
  • Bidirectional navigation via @reverse directive
  • Strategic indexing for optimized query performance
  • Facets to store additional edge properties

πŸ’» Usage Examples

Counting Predecessors or Successors

The system can count all predecessors or successors of a given word.

./dbcli.sh count-predecessors /c/en/happy
{
  "predecessors": [
    {
      "id": "/c/en/happy",
      "label": "123",
      "count": 8550
    }
  ]
}
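
Conceptually, a predecessor count can be expressed through the reverse edge ~to that the @reverse directive provides. The sketch below shows one way to phrase it via the Python client; the actual query issued by dbcli.sh may differ.

import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")  # assumed default Alpha port
client = pydgraph.DgraphClient(stub)

# Count incoming edges by traversing the reverse predicate ~to.
query = """
query predecessors($id: string) {
  predecessors(func: eq(id, $id)) {
    id
    label
    predecessor_count: count(~to)
  }
}
"""
res = client.txn(read_only=True).query(query, variables={"$id": "/c/en/happy"})
print(json.loads(res.json))
stub.close()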

Finding Distant Synonyms or Antonyms

The system can find synonyms and antonyms that are multiple relationship hops away from a given word.

./dbcli.sh find-distant-antonyms /c/en/math 2
{
  "distant_antonyms": [
    {
      "id": "/c/en/impugnable",
      "label": "impugnable"
    },
    {
      "id": "/c/en/supposed",
      "label": "supposed"
    },
    {
      "id": "/c/en/suspicious",
      "label": "suspicious"
    },
    {
      "id": "/c/en/loose",
      "label": "loose"
    },
    {
      "id": "/c/en/alleged",
      "label": "alleged"
    },
    {
      "id": "/c/en/approximative",
      "label": "approximative"
    },
    {
      "id": "/c/en/hopeless",
      "label": "hopeless"
    }
  ]
}
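
One way to express a two-hop lookup like this in DQL is to chain the synonym and antonym predicates; the sketch below illustrates the idea via the Python client. The project's actual traversal, deduplication, and depth handling may differ.

import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")  # assumed default Alpha port
client = pydgraph.DgraphClient(stub)

query = """
query distantAntonyms($id: string) {
  node(func: eq(id, $id)) {
    synonym { antonym { id label } }  # antonyms reached through one synonym hop
    antonym { synonym { id label } }  # synonyms of direct antonyms
  }
}
"""
res = client.txn(read_only=True).query(query, variables={"$id": "/c/en/math"})
print(json.loads(res.json))
stub.close()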

Searching for Similar Nodes

The system can find all similar nodes, i.e. nodes that share the same parent or child connected by an edge with the same id.

./dbcli.sh find-similar-nodes /c/en/0
{
  "similar_nodes": [
    {
      "id": "/c/en/zero",
      "label": "zero",
      "shared_connections": [
        {
          "via_node": "/c/en/empty_set",
          "edge_type": [
            "/r/RelatedTo",
            "/r/UsedFor"
          ]
        },
        {
          "via_node": "/c/en/set_containing_one_element",
          "edge_type": "/r/IsA"
        }
      ]
    },
    {
      "id": "/c/en/addy/n",
      "label": "addy",
      "shared_connections": [
        {
          "via_node": "/c/en/internet_slang",
          "edge_type": "/r/HasContext"
        }
      ]
    },
    {
      "id": "/c/en/afaic",
      "label": "afaic",
      "shared_connections": [
        {
          "via_node": "/c/en/internet_slang",
          "edge_type": "/r/HasContext"
        }
      ]
    },
    {
      "id": "/c/en/baww/v",
      "label": "baww",
      "shared_connections": [
        {
          "via_node": "/c/en/internet_slang",
          "edge_type": "/r/HasContext"
        }
      ]
    }
  ]
}
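
A simplified sketch of the "shared child" half of this search: collect the children of the starting node, then walk back up through the reverse edge ~to to find their other parents. The facet-based matching of edge ids shown in the output above is omitted here, and the node id is illustrative.

import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")  # assumed default Alpha port
client = pydgraph.DgraphClient(stub)

query = """
query similar($id: string) {
  var(func: eq(id, $id)) {
    to {
      siblings as ~to   # other nodes pointing at the same child
    }
  }
  similar_nodes(func: uid(siblings)) @filter(NOT eq(id, $id)) {
    id
    label
  }
}
"""
res = client.txn(read_only=True).query(query, variables={"$id": "/c/en/0"})
print(json.loads(res.json))
stub.close()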

πŸ” Query Examples

Find Nodes With Single Neighbor

query countNodesWithSingleNeighbor {
  single_neighbor_count(func: eq(unique_neighbors_count, 1)) {
    count(uid)
  }
}

Find Nodes With Most Neighbors

query findNodesWithMostNeighbors {
  var(func: has(id), orderdesc: unique_neighbors_count, first: 1) {
    max_val as unique_neighbors_count
  }

  nodes_with_most_neighbors(func: eq(unique_neighbors_count, val(max_val))) {
    id
    label
    unique_neighbors_count
  }
}

Helper flags

Verbose

Add the --verbose or -v flag to any command to display detailed execution-time measurements and, where applicable, the number of elements in the result:

./dbcli.sh --verbose count-predecessors /c/en/happy 
{query_name} has {amount} of elements
`query output`
Query executed in 0.34 seconds

Raw

Add the --raw or -r flag to any command to display raw query results without formatting:

label1, label2, label3

πŸ”„ Implementation Process

  1. Initial Research & Setup

    • Analysis of graph database options and selection of Dgraph
    • Docker environment configuration for consistent development
    • Basic project structure setup with Python and Bash scripts
  2. Data Processing Pipeline

    • Development of data conversion scripts
    • Implementation of RDF format transformation
    • Optimization of bulk data loading process
  3. Database Architecture

    • Graph schema design with node and edge properties
    • Index optimization for query performance
    • Implementation of bidirectional relationships
  4. Query System Development

    • Core GraphQL± query implementation for all given tasks
    • Query optimization and testing
  5. CLI Tools Creation

    • Development of dbcli.sh for database management
    • Implementation of Python CLI interface
    • Integration of health checks and retry mechanisms
  6. Testing & Optimization

    • Query performance testing
    • Load testing and optimization
  7. Documentation & Deployment

    • Project documentation and usage examples
    • Architecture diagrams
    • Deployment instructions

πŸ§ͺ Testing

Note:
All testing was performed on a PC with the following specifications:

  • CPU: Intel Core i5-14400F
  • GPU: NVIDIA RTX 3060 12GB
  • RAM: 16GB DDR4 @ 3600MHz
  • SSD: Samsung 980 M.2 NVMe

RDF Converter Optimization

The Python RDF converter underwent several optimization iterations, resulting in a 54.2% improvement in execution time from the initial implementation. Below is a breakdown of each optimization step and its impact:

Converter Optimization Diagram

Optimization Step | Execution Time | Improvement | Description
--- | --- | --- | ---
Basic Implementation | 72 sec | Baseline | Original implementation using standard Python libraries
ID Sanitizer without Regex | 70 sec | 2.8% | Replaced regex-based ID sanitization with direct string operations
Escape String Caching | 69 sec | 4.1% | Cached the results of repeated string-escape operations
Precompiled Regex Patterns | 62 sec | 13.9% | Precompiled the regex patterns used by the escape_string function
Batch Processing Approach | 48 sec | 33.3% | Redesigned the batch-processing algorithm for more efficient memory usage
Label Handling | 33 sec | 54.2% | Fixed handling of labels containing pipes, improved null-label processing, dynamic batch sizing, reduced compression level, removed defaultdict usage, and fewer function calls
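
To illustrate two of the steps above, escape-string caching and precompiled regex patterns, here is a small self-contained sketch. The function name, escape rules, and cache size are illustrative and are not taken from the project's converter.

import re
from functools import lru_cache

# Compile the pattern once at import time instead of on every call.
_ESCAPE_RE = re.compile(r'[\\"\n\r\t]')
_ESCAPE_MAP = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}

@lru_cache(maxsize=100_000)
def escape_string(value: str) -> str:
    # Cache results, since the same labels repeat many times across the dataset.
    return _ESCAPE_RE.sub(lambda m: _ESCAPE_MAP[m.group(0)], value)

print(escape_string('a "quoted" label\twith a tab'))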

Database Loading Optimization

We achieved dramatic improvements in database loading times through several optimization techniques:

Loading Optimization Diagram

Optimization Step | Loading Time | Improvement | Description
--- | --- | --- | ---
Basic Loading | 43 min | Baseline | Initial implementation using default settings and raw RDF files
Reduced Properties | 36 min | 16.3% | Eliminated redundant properties from the dataset
RDF.GZ Compression | 28 min | 34.9% | Used compressed .rdf.gz files instead of raw RDF
Dgraph Bulk Loader | 4.46 min | 89.6% | Switched from the Dgraph Live Loader to the Bulk Loader

Loading Optimization Strategies

  1. Data Preprocessing: Analyzing and removing redundant properties significantly reduced the data volume while preserving all necessary information. We identified that storing node-type information is unnecessary for the current use case, since the data is already structured as a graph.

  2. Compression Benefits: Using the .rdf.gz format not only reduced storage requirements but also decreased I/O overhead during loading, leading to faster processing times.

  3. Bulk vs. Live Loading: The most dramatic improvement came from switching from Dgraph's Live Loader to the Bulk Loader:

    • Live Loader: Processes data in a transactional manner, with built-in consistency checks
    • Bulk Loader: Bypasses transaction processing, generating and loading SSTable files directly

    While the Bulk Loader requires a database restart, the 89.6% total reduction in loading time justified this trade-off for the initial data-loading scenario.

Query Performance Analysis

We conducted performance testing of all implemented queries to ensure efficient data retrieval. The testing revealed the following performance categories:

Query Execution Time Diagram

Performance Categories:

  • Fast (< 0.1s): Simple lookups and count operations (count-successors, count-predecessors, count-neighbors, find-neighbors, find-shortest-path)
  • Medium (0.1s - 1s): Intermediate operations (count-nodes-single-neighbor, find-nodes-most-neighbors, find-successors, find-predecessors, find-grandchildren, find-grandparents)
  • Slow (1s - 5s): Complex relationship analysis (find-similar-nodes, find-distant-synonyms, find-distant-antonyms, count-nodes)
  • Very Slow (> 5s): Full graph operations requiring large-scale traversal (count-nodes-no-successors, count-nodes-no-predecessors)

Schema Optimization Impact:

  • @index(hash): Enables near-instant node lookups by ID (0.01s for operations using the hash index)

    • Creates specialized search indexes for string properties
    • Reduces query execution time by ~95% for ID-based lookups compared to non-indexed queries
    • Critical for operations like find-successors and find-predecessors
  • @count: Pre-calculates edge counts, enabling fast neighbor counting (0.01s)

    • Eliminates expensive counting operations at query time
    • Stores count metadata directly with relationships
    • Enables sub-0.1s performance for count-successors, count-predecessors, and count-neighbors
    • Without @count, these operations would require full traversal (>1s)
  • @index(int): Enables efficient filtering and sorting for integer fields

    • Improves performance for queries based on neighbor counts
    • Essential for queries like count-nodes-single-neighbor (0.16s) and find-nodes-most-neighbors (0.66s)
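
For example, the integer index on unique_neighbors_count is what allows inequality filters such as ge() to run without scanning the whole graph. Below is a sketch of such a query via the Python client; the threshold and result limit are illustrative.

import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")  # assumed default Alpha port
client = pydgraph.DgraphClient(stub)

# ge() on an integer predicate relies on @index(int).
query = """
{
  hubs(func: ge(unique_neighbors_count, 1000), orderdesc: unique_neighbors_count, first: 5) {
    id
    label
    unique_neighbors_count
  }
}
"""
res = client.txn(read_only=True).query(query)
print(json.loads(res.json))
stub.close()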

These directives significantly improved our most frequent operations, with 5 of 17 queries achieving <0.1s execution time.

πŸ‘₯ Team Contributions

Shevchenko Denys:

  • Infrastructure & DevOps:
    • Docker configuration and optimization
    • Shell script automation and dbcli.sh
  • Performance Testing & Optimization:
    • Bulk data import
    • Query execution optimization
    • Data conversion optimization
  • Additional Queries (Tasks 16-18):
    • Distant synonyms/antonyms analysis
    • Pathfinding query

Karabanov Yehor:

  • Database Schema Architecture:
    • Graph data model design
    • Index optimization
    • Relationship structure
  • Python Applications:
    • CLI interface implementation
    • Data conversion utilities
    • Query system architecture
  • Core Query Development (Tasks 1-15):
    • Node counting operations
    • Relationship path analysis
    • Neighbor pattern detection

⭐ Self-evaluation

10/10 - Project meets all requirements and includes additional features beyond the basic specifications.

πŸ“„ License

MIT
