# Deriving Protobuf message schema from GraphQL schema

This notebook demonstrates how to construct Protobuf messages by deriving them from a given reference schema written in GraphQL SDL.

## Assumptions & Requirements:
- Input: **GraphQL schema** (i.e., conceptual and logical layers)
    - GraphQL SDL is used as a way to formalize the semantics of a domain, and is not necessarily meant for runtime API calls.
    - The reference schema can be written in a single file or split across multiple files.
- Output: **protobuf messages** (i.e., physical layer)
    - The desired output format is a Protobuf schema.
    - The Protobuf message schema can contain an arbitrary subset of the fields in the GraphQL schema.

## Approach
```mermaid
flowchart LR
    subgraph Conceptual_Logical["Conceptual / Logical"]
        DomainModel["MyDomainModel.graphql"]
    end

    subgraph Manipulation["Desired manipulation"]
        Tools["Public tools"]
        Selection["Selection query"]
    end

    subgraph Physical["Physical"]
        UseCase["MyUseCaseSchema.proto"]
    end

    DomainModel --> Tools --> UseCase
    Selection --> Tools
```
1. [Model the domain in GraphQL SDL](#1-model-the-domain-in-graphql-sdl)
2. [Select the fields of interest with a GraphQL query](#2-select-the-fields-of-interest-with-a-graphql-query)
    - [2.1. Example: a valid selection](#21-example-a-valid-selection)
    - [2.2. Example: an invalid selection](#22-example-an-invalid-selection)
3. [Export the selected elements to Protobuf](#3-export-the-selected-elements-to-protobuf)
    - [3.1. Extracting metadata of selected fields](#31-extracting-metadata-of-selected-fields)
    - [3.2. Transform selection to the desired output](#32-transform-selection-to-the-desired-output)

### 1. Model the domain in GraphQL SDL

GraphQL SDL (Schema Definition Language) is used here as a **governed specification language** to model data of a domain.  
It is not being used to query instance data.  

For example, suppose we have the following model that connects some entities like `Vehicle`, `Person`, `ChargingStation`, etc.:

In [1]:
from graphql import build_schema

schema_str = """
type Vehicle {
  id: ID!
  fuelType: FuelType!
  location: GeographicLocation
  drivingHistory: [DrivingJourney]
}

type GeographicLocation {
    latitude: Float!
    longitude: Float!
}

enum FuelType {
  PETROL
  DIESEL
  ELECTRIC
  HYBRID
}

type DrivingJourney {
    id: ID!
    vehicle: Vehicle!
    driver: Person!
}

type Person {
    id: ID!
    name: String!
    address: String
    location: GeographicLocation
}

type ChargingSession {
    id: ID!
    name: String!
    vehicle: Vehicle!
    customer: Person!
}

type ChargingStation {
    id: ID!
    name: String
    services: [String]
    location: GeographicLocation
}

type Query {
  Person: Person
  Vehicle: Vehicle
  ChargingStation: ChargingStation
}
"""

schema = build_schema(schema_str)
schema

<graphql.type.schema.GraphQLSchema at 0x1072a0350>

### 2. Select the fields of interest with a GraphQL query

The query mechanism lets us validate that the fields we want to select actually exist in the reference model.

Instead of creating extra syntax for subsets, we reuse the **GraphQL query language**.  
Here, a query does not mean "fetching data"; it means **declaring a projection** of fields from the governed schema.  

The selection is checked against the root fields declared on the `Query` type:
```graphql
type Query {
  Person: Person
  Vehicle: Vehicle
  ChargingStation: ChargingStation
}
```

And the validation can be as simple as:

In [2]:
from graphql import parse, validate

def validate_query(query_str):
    # Parse + validate against the schema
    query_ast = parse(query_str)
    errors = validate(schema, query_ast)

    if errors:
        for e in errors:
            print("❌", e.message)
    else:
        print("✅ Query is valid against schema")

#### 2.1. Example: a valid selection
A valid query means that the fields we are trying to select do exist in the declared specification, so the query can act as our selection mechanism.
For example, to declare a correct projection, we can write:

In [3]:
from graphql import parse, validate

valid_query_str = """
query MyCorrectSelection {
  Vehicle {
    location {
      latitude
      longitude
    }
  }
  Person {
    location {
      latitude
      longitude
    }
  }
  ChargingStation {
    location {
      latitude
      longitude
    }
  }
}
"""
validate_query(valid_query_str)


✅ Query is valid against schema


#### 2.2. Example: an invalid selection
Likewise, if the fields we are asking for do not exist or differ from the reference schema, it will be an invalid selection. For instance:

In [4]:
invalid_query_str = """
query MyWrongSelection {
  Vehicle {
    location {
      latitude
      longitude
      altitude  # Invalid field
    }
  }
  Person {
    lastName  # Invalid field
  }
  ChargingStation {
    location {
      latitude
      longitude
    }
  }
  Intersection { # Invalid type
    address
  }
}
"""
validate_query(invalid_query_str)

❌ Cannot query field 'altitude' on type 'GeographicLocation'. Did you mean 'latitude' or 'longitude'?
❌ Cannot query field 'lastName' on type 'Person'. Did you mean 'name'?
❌ Cannot query field 'Intersection' on type 'Query'.


### 3. Export the selected elements to Protobuf

#### 3.1. Extracting metadata of selected fields
Once the query is valid, we can traverse its AST (Abstract Syntax Tree), which is a structured representation of the query, and retrieve the **metadata of each selected field** (type, scalar vs enum vs object, list vs single).

##### How Field Metadata Extraction Works

The process requires **both** the query AST and the schema working together:

1. **Query AST provides structure** (what fields were selected and their nesting)
   - Field names: `"Vehicle"`, `"location"`, `"latitude"`
   - Selection hierarchy: `Vehicle → location → latitude`
   - **No type information**

2. **Schema provides type definitions** (what each field's type is)
   - Type mappings: `Vehicle.location → GeographicLocation`, `GeographicLocation.latitude → Float!`
   - **No selection information**

3. **Combined processing**:
   - Walk through the query AST structure (`for field in selections`)
   - For each selected field, look up its type definition in the schema (`parent_type.fields[field_name]`)
   - Extract and analyze the type metadata (`get_named_type(field_type)`)

##### Available Field Metadata

For each selected field, you can extract:

**Basic Information:**
- `field_name` - The field name as selected in the query
- `full_name` - Fully qualified field path (e.g., `"Vehicle.location.latitude"`)

**Type Information:**
- `base_type.name` - The underlying GraphQL type name (`String`, `GeographicLocation`, `FuelType`)
- `field_type` - The complete type wrapper (includes nullability and list info)
- `category` - Type category: `Scalar`, `Enum`, `Object`, or `Unknown`

**Type Modifiers:**
- `is_list` - Whether the field is a list/array type (`[String]` or `[Vehicle]`)
- `is_non_null` - Whether the field is required (`String!` vs `String`)
- `is_list_of_non_null` - For list items nullability (`[String!]` vs `[String]`)

**Additional Metadata (for specific types):**
- **For Enums**: `base_type.values` - Available enum values
- **For Objects**: `base_type.fields` - Available fields in the object type
- **For Scalars**: Built-in vs custom scalar information
- **Selection Context**: `field.selection_set` - Whether the field has nested selections (it is `None` for leaf fields)

**Schema Context:**
- `field_def.description` - Field description from schema
- `field_def.deprecation_reason` - If field is deprecated
- `field_def.args` - Field arguments (if any)

This metadata is essential for translating GraphQL selections to other schema formats like Protobuf, where you need to map GraphQL types to protobuf types, handle lists as `repeated` fields, and generate appropriate message structures.

In [5]:
from graphql import GraphQLEnumType, GraphQLList, GraphQLNonNull, GraphQLObjectType, GraphQLScalarType
from graphql.type.definition import get_named_type

# Parse the MyCorrectSelection query 
query_ast = parse(valid_query_str)

def print_field_metadata(selections, parent_type, prefix=""):
    """Print metadata for each selected field"""
    for field in selections:
        field_name = field.name.value
        full_name = f"{prefix}{field_name}" if prefix else field_name
        
        # Get field definition from parent type
        field_def = parent_type.fields[field_name]
        field_type = field_def.type
        
        # Get base type using get_named_type
        base_type = get_named_type(field_type)
        
        # Determine if it's a list
        is_list = isinstance(field_type, GraphQLList) or (
            isinstance(field_type, GraphQLNonNull) and isinstance(field_type.of_type, GraphQLList)
        )
        
        # Determine type category
        if isinstance(base_type, GraphQLScalarType):
            category = "Scalar"
        elif isinstance(base_type, GraphQLEnumType):
            category = "Enum"
        elif isinstance(base_type, GraphQLObjectType):
            category = "Object"
        else:
            category = "Unknown"
        
        print(f"Field: {full_name}")
        print(f"  Type: {base_type.name}")
        print(f"  Category: {category}")
        print(f"  Is List: {is_list}")
        print()
        
        # Recursively process nested selections
        if hasattr(field, 'selection_set') and field.selection_set:
            print_field_metadata(
                field.selection_set.selections,
                base_type,
                prefix=f"{full_name}."
            )

# Extract metadata from the query
operation = query_ast.definitions[0]
query_type = schema.get_type('Query')

print("=== Selected Fields Metadata ===")
for selection in operation.selection_set.selections:
    root_field_name = selection.name.value
    root_field_def = query_type.fields[root_field_name]
    root_type = get_named_type(root_field_def.type)
    
    print(f"\nRoot Selection: {root_field_name}")
    print(f"Root Type: {root_type.name}")
    print("-" * 30)
    
    if hasattr(selection, 'selection_set') and selection.selection_set:
        print_field_metadata(selection.selection_set.selections, root_type)

=== Selected Fields Metadata ===

Root Selection: Vehicle
Root Type: Vehicle
------------------------------
Field: location
  Type: GeographicLocation
  Category: Object
  Is List: False

Field: location.latitude
  Type: Float
  Category: Scalar
  Is List: False

Field: location.longitude
  Type: Float
  Category: Scalar
  Is List: False


Root Selection: Person
Root Type: Person
------------------------------
Field: location
  Type: GeographicLocation
  Category: Object
  Is List: False

Field: location.latitude
  Type: Float
  Category: Scalar
  Is List: False

Field: location.longitude
  Type: Float
  Category: Scalar
  Is List: False


Root Selection: ChargingStation
Root Type: ChargingStation
------------------------------
Field: location
  Type: GeographicLocation
  Category: Object
  Is List: False

Field: location.latitude
  Type: Float
  Category: Scalar
  Is List: False

Field: location.longitude
  Type: Float
  Category: Scalar
  Is List: False



#### 3.2. Transform selection to the desired output

With the metadata, we can now generate Protobuf definitions.  
Rules:
- GraphQL scalars → Protobuf scalars (`String → string`, `Int → int32`, etc.)  
- GraphQL enums → Protobuf enums (with zero value `*_UNSPECIFIED`)  
- GraphQL objects → Protobuf nested messages  
- GraphQL lists → Protobuf `repeated` fields  

This example exporter takes the query name (`MyCorrectSelection`) as the Protobuf message name,  
and also generates any needed nested messages and enums.

In [6]:
from graphql import GraphQLEnumType, GraphQLObjectType
from graphql.type.definition import get_named_type, is_list_type, is_non_null_type

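# Scalar type mapping.
# Note: GraphQL `Float` is double precision, so Protobuf `double` would be the
# lossless choice; `float` is used in this example.
SCALAR_MAP = {"String": "string", "Int": "int32", "Float": "float", "Boolean": "bool", "ID": "string"}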

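def to_proto_type(gql_type):
    """Convert GraphQL type to protobuf type string"""
    base_type = get_named_type(gql_type)
    proto_type = SCALAR_MAP.get(base_type.name, base_type.name)
    # Unwrap an outer NonNull first so that `[T]!` is also recognised as a list
    nullable_type = gql_type.of_type if is_non_null_type(gql_type) else gql_type
    return f"repeated {proto_type}" if is_list_type(nullable_type) else proto_type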

def generate_proto_messages(query_ast, schema):
    """Generate protobuf messages from GraphQL query selections"""
    operation = query_ast.definitions[0]
    query_name = operation.name.value if operation.name else "QueryMessage"
    query_type = schema.get_type('Query')
    messages = {}
    enums = {}
    
    def process_selections(selections, parent_type, msg_name):
        fields = []
        for i, field in enumerate(selections, 1):
            field_name = field.name.value
            field_def = parent_type.fields[field_name]
            base_type = get_named_type(field_def.type)
            
            # Handle enums
            if isinstance(base_type, GraphQLEnumType) and base_type.name not in enums:
                enums[base_type.name] = f"enum {base_type.name} {{\n  {base_type.name.upper()}_UNSPECIFIED = 0;\n" + \
                    "\n".join(f"  {val} = {j};" for j, val in enumerate(base_type.values, 1)) + "\n}"
            
            # Handle nested objects
            if isinstance(base_type, GraphQLObjectType) and field.selection_set:
                process_selections(field.selection_set.selections, base_type, base_type.name)
            
            proto_type = to_proto_type(field_def.type)
            fields.append(f"  {proto_type} {field_name} = {i};")
        
        messages[msg_name] = f"message {msg_name} {{\n" + "\n".join(fields) + "\n}"
    
    # Create one main message with all root selections
    main_fields = []
    for i, selection in enumerate(operation.selection_set.selections, 1):
        root_name = selection.name.value
        root_type = get_named_type(query_type.fields[root_name].type)
        
        # Process nested selections for this root type
        if selection.selection_set:
            process_selections(selection.selection_set.selections, root_type, root_type.name)
        
        # Add this root selection to main message
        main_fields.append(f"  {root_type.name} {root_name.lower()} = {i};")
    
    # Create the main query message
    messages[f"{query_name}Message"] = f"message {query_name}Message {{\n" + "\n".join(main_fields) + "\n}"
    
    # Output combined protobuf
    result = "\n\n".join(enums.values()) + "\n\n" + "\n\n".join(messages.values())
    return result.strip()

# Generate protobuf from MyCorrectSelection
proto_output = generate_proto_messages(query_ast, schema)
print(proto_output)

message GeographicLocation {
  float latitude = 1;
  float longitude = 2;
}

message Vehicle {
  GeographicLocation location = 1;
}

message Person {
  GeographicLocation location = 1;
}

message ChargingStation {
  GeographicLocation location = 1;
}

message MyCorrectSelectionMessage {
  Vehicle vehicle = 1;
  Person person = 2;
  ChargingStation chargingstation = 3;
}


##### Alternative: Flattened Message Structure

Alternatively, one can adapt the conversion to produce a single flattened message instead of hierarchical nested messages:

```protobuf
message MyCorrectSelectionMessage {
  float vehicle_location_latitude = 1;
  float vehicle_location_longitude = 2;
  float person_location_latitude = 3;
  float person_location_longitude = 4;
  float chargingstation_location_latitude = 5;
  float chargingstation_location_longitude = 6;
}
```

The flattened approach combines all selected fields into a single message, with underscore-separated field names that reflect the selection path (e.g., `vehicle_location_latitude`). This can be useful in performance-critical scenarios where you want to minimize object allocations and nesting depth, or when integrating with systems that prefer flat data structures.

However, the hierarchical approach is generally preferred: it represents the domain model more faithfully and enables type reuse (for example, a single `GeographicLocation` message shared by `Vehicle`, `Person`, and `ChargingStation`).