# KiCAD data file exploration

## What format is this in?

KiCAD stores its data as s-expressions ([Wikipedia: S-expression](https://en.wikipedia.org/wiki/S-expression)). The format was developed for, and popularized by, the programming language Lisp. The classical assumption is that s-expressions encode trees, and many tools for handling them build in that assumption, but more generally they can express arbitrarily nested sequences. They lend themselves to being handled recursively.
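
As a small illustration of that recursive style, here is a sketch that measures how deeply a nested list is nested. The `max_depth` name and the sample data are made up for illustration, and plain Python lists stand in for a parsed s-expression.

```python
def max_depth(node):
    """Return the nesting depth of a list-of-lists structure."""
    if not isinstance(node, list):
        return 0  # atoms (strings, numbers, ...) add no depth
    # An empty list still counts as one level.
    return 1 + max((max_depth(child) for child in node), default=0)


# Hypothetical sample, shaped like a parsed s-expression.
sample = ["kicad_sch", ["version", 20231120], ["paper", "A4"]]
print(max_depth(sample))
```

The function calls itself once per sublist, which is the same pattern the search functions later in this notebook follow.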

In [827]:
from collections import Counter

import sexpdata

The **sexpdata** package is for parsing s-expressions in Python.  It's not KiCAD specific and doesn't impose any meaning on the data.

In [828]:

with open("example.sch") as sch_file:
    s_data = sexpdata.load(sch_file)

The cell above loads the data file provided by the person who requested this exploration, @CarlFK.

Let's explore the data types involved for a bit.

In [829]:
print(type(s_data))
print(len(s_data))

<class 'list'>
219


So the top level of the parsed s-expression is a list with 219 elements.  What types are those?

In [830]:
print({type(elem) for elem in s_data})

{<class 'sexpdata.Symbol'>, <class 'list'>}


So, there are only sexpdata.Symbols and lists.  Let's look at the symbols.

In [831]:
symbol_list = [elem for elem in s_data if isinstance(elem, sexpdata.Symbol)]
print(f"There are {len(symbol_list)} symbols")
print(symbol_list)

There are 1 symbols
[Symbol('kicad_sch')]


Of the 219 items in the top-level list, only one is a symbol and all the rest are lists.  Let's see how long they are.

In [832]:
res_list = []
for contained_elem in s_data:
    if isinstance(contained_elem, sexpdata.Symbol):
        res_list.append(str(contained_elem))
    elif isinstance(contained_elem, list):
        res_list.append(len(contained_elem))
    else:
        raise TypeError(f"I wasn't expecting {contained_elem} of type {type(contained_elem)}")

freq_table = Counter([res for res in res_list if isinstance(res, int)])
print(freq_table.most_common())

[(4, 118), (5, 34), (6, 27), (17, 22), (2, 6), (18, 6), (10, 1), (32, 1), (64, 1), (19, 1), (55, 1)]


The results displayed above are sorted by frequency, most common first.  In each pair, the first number is the length of a sublist and the second number is how many sublists of that length were found.
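
To make the shape of that output concrete, here is a tiny sketch of `Counter.most_common()` on made-up lengths (these numbers are not taken from the KiCAD file).

```python
from collections import Counter

# Made-up sublist lengths, purely for illustration.
lengths = [4, 4, 4, 5, 5, 2]
freq = Counter(lengths)

# most_common() returns (value, count) pairs, highest count first.
for length, count in freq.most_common():
    print(f"length {length}: {count} sublists")
```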

So, most but not all of the sublists at the top level of the data structure are quite small.  Let's get a feel for how many of each type there are in the whole structure.

Let's be lazy at first and assume that the only iterable we're going to want to descend into is a list, since that's the only one at the top level.

In [833]:
def count_types_deep(search_elem):
    for contained_elem in search_elem:
        yield type(contained_elem)
        if isinstance(contained_elem, list):
            yield from count_types_deep(contained_elem)

In [834]:
deep_type_counts = Counter(count_types_deep(s_data))
print(deep_type_counts)

Counter({<class 'sexpdata.Symbol'>: 6734, <class 'list'>: 5468, <class 'float'>: 2618, <class 'str'>: 1513, <class 'int'>: 844})


Let's take a look at some of the early sublists.

In [835]:
s_data[1]

[Symbol('version'), 20231120]

In [836]:
s_data[2]

[Symbol('generator'), 'eeschema']

In [837]:
s_data[3]

[Symbol('generator_version'), '8.0']

It looks like the sublists start with symbols.  The outermost list also had only 1 symbol in it.  Let's take a look at this class.

In [838]:
sample_symbol = s_data[3][0]
help(type(sample_symbol))

Help on class Symbol in module sexpdata:

class Symbol(String)
 |  Method resolution order:
 |      Symbol
 |      String
 |      builtins.str
 |      builtins.object
 |
 |  Methods inherited from String:
 |
 |  __eq__(self, other)
 |      >>> from itertools import permutations
 |      >>> S = 'a', String('a'), Symbol('a')
 |      >>> all(x == x for x in S)
 |      True
 |      >>> any(x != x for x in S)
 |      False
 |      >>> any(x == y for x, y in permutations(S, 2))
 |      False
 |      >>> all(x != y for x, y in permutations(S, 2))
 |      True
 |
 |  __hash__(self)
 |      >>> D = {'a': 1, String('a'): 2, Symbol('a'): 3}
 |      >>> len(D)
 |      3
 |
 |  __ne__(self, other)
 |      Return self!=value.
 |
 |  __repr__(self)
 |      Return repr(self).
 |
 |  value(self)
 |
 |  ----------------------------------------------------------------------
 |  Class methods inherited from String:
 |
 |  quote(string)
 |
 |  unquote(string)
 |
 |  ----------------------------------------

Symbol is a subclass of str (by way of sexpdata's String class), and the only regular instance method in the listing is value.  Let's see if that has any help.

In [839]:
help(type(sample_symbol).value)

Help on function value in module sexpdata:

value(self)



So, no then.  Let's see what happens when we call it. 

In [840]:
sample_symbol.value()

'generator_version'

It looks like unquoted strings in the original data file get converted into the "Symbol" class to differentiate them from quoted strings.

I have a suspicion that the first value in each list is going to turn out to be a symbol that acts as the name for the rest of the list.  Let's check.

In [841]:
def get_first_element_types(s_expression):
    for elem in s_expression:
        if isinstance(elem, list) and elem:
            yield type(elem[0])
            yield from get_first_element_types(elem[1:])

In [842]:
first_element_type_freqs = Counter(get_first_element_types(s_data))
first_element_type_freqs

Counter({sexpdata.Symbol: 5468})

The number of Symbols that are the first elements of their list corresponds to the number of lists we found earlier.  So it looks like the answer is yes.  Each list starts with a symbol.

In [843]:
def get_list_names(s_expression):
    for elem in s_expression:
        if isinstance(elem, list) and elem:
            yield elem[0]
            yield from get_list_names(elem[1:])

In [844]:
list_names = Counter(get_list_names(s_data))
list_names

Counter({Symbol('size'): 501,
         Symbol('effects'): 481,
         Symbol('font'): 481,
         Symbol('at'): 430,
         Symbol('uuid'): 360,
         Symbol('xy'): 319,
         Symbol('pin'): 265,
         Symbol('type'): 252,
         Symbol('property'): 220,
         Symbol('stroke'): 195,
         Symbol('width'): 195,
         Symbol('hide'): 168,
         Symbol('alternate'): 149,
         Symbol('pts'): 147,
         Symbol('length'): 117,
         Symbol('name'): 117,
         Symbol('number'): 117,
         Symbol('wire'): 92,
         Symbol('unit'): 64,
         Symbol('justify'): 62,
         Symbol('fill'): 57,
         Symbol('symbol'): 54,
         Symbol('fields_autoplaced'): 46,
         Symbol('exclude_from_sim'): 42,
         Symbol('in_bom'): 41,
         Symbol('on_board'): 41,
         Symbol('path'): 33,
         Symbol('lib_id'): 32,
         Symbol('dnp'): 32,
         Symbol('instances'): 32,
         Symbol('project'): 32,
         Symbol('reference

Since each list starts with a symbol that is its name, we could potentially convert this to a nested dictionary.  But this is tricky.  Each level is a list, but the key names for that level are the first elements of each list it contains.
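
Here is a rough sketch of what the straightforward conversion could look like. It is only an illustration: `to_nested_dict` is a made-up name, plain strings stand in for sexpdata Symbols, and it deliberately refuses lists that mix plain values with sublists.

```python
def to_nested_dict(node):
    """Convert [name, *rest] into {name: converted_rest}.

    Assumes plain strings stand in for sexpdata Symbols.
    """
    name, *rest = node
    sublists = [item for item in rest if isinstance(item, list)]
    if not sublists:
        # Leaf: only plain values follow the name.
        return {name: rest}
    if len(sublists) != len(rest):
        # Plain values and sublists together have no obvious dict form.
        raise ValueError(f"{name!r} mixes plain values and sublists")
    merged = {}
    for sub in sublists:
        # Limitation: a repeated name overwrites the earlier entry.
        merged.update(to_nested_dict(sub))
    return {name: merged}


print(to_nested_dict(["font", ["size", 1.27, 1.27]]))
```

Even in this simple form there are two weak spots: repeated names at the same level collide, and mixed lists have to be rejected or handled case by case.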

First, let's check if there are any lists that contain a mixture of lists and other values.  Those would throw a wrench into the conversion plan.

In [845]:
def get_list_types(s_expression):
    for elem in s_expression:
        if isinstance(elem, list) and elem and isinstance(elem[0], sexpdata.Symbol):
            yield (elem[0].value(), frozenset({type(x) for x in elem[1:]}))
            yield from get_list_types(elem[1:])

In [846]:
types_of_list = list(get_list_types([s_data]))
types_of_list

[('kicad_sch', frozenset({list})),
 ('version', frozenset({int})),
 ('generator', frozenset({str})),
 ('generator_version', frozenset({str})),
 ('uuid', frozenset({str})),
 ('paper', frozenset({str})),
 ('lib_symbols', frozenset({list})),
 ('symbol', frozenset({list, str})),
 ('pin_names', frozenset({list, sexpdata.Symbol})),
 ('offset', frozenset({float})),
 ('exclude_from_sim', frozenset({sexpdata.Symbol})),
 ('in_bom', frozenset({sexpdata.Symbol})),
 ('on_board', frozenset({sexpdata.Symbol})),
 ('property', frozenset({list, str})),
 ('at', frozenset({float, int})),
 ('effects', frozenset({list})),
 ('font', frozenset({list})),
 ('size', frozenset({float})),
 ('property', frozenset({list, str})),
 ('at', frozenset({float, int})),
 ('effects', frozenset({list})),
 ('font', frozenset({list})),
 ('size', frozenset({float})),
 ('property', frozenset({list, str})),
 ('at', frozenset({int})),
 ('effects', frozenset({list})),
 ('font', frozenset({list})),
 ('size', frozenset({float})),
 ('h

In [847]:
challenging_lists = {x for x in types_of_list if list in x[1] and len(x[1]) > 1}
challenging_lists

{('label', frozenset({list, str})),
 ('name', frozenset({list, str})),
 ('number', frozenset({list, str})),
 ('path', frozenset({list, str})),
 ('pin', frozenset({list, str})),
 ('pin', frozenset({list, sexpdata.Symbol})),
 ('pin_names', frozenset({list, sexpdata.Symbol})),
 ('project', frozenset({list, str})),
 ('property', frozenset({list, str})),
 ('symbol', frozenset({list, str})),
 ('text', frozenset({list, str}))}

Above is the set of list names I found in the data whose contents mix sublists with other values, so they can't be converted to dicts in a straightforward way.

## Navigating the data

Since above I found some data structures that can't be converted to a dict in a straightforward way, how can the structure be navigated?

I could decide, case by case, how to convert each token that can't be converted automatically.  If I were trying to parse the entire file, that's what I'd have to do.  But that's probably overkill if I'm just searching the file.

If I only want to find all occurrences of a particular symbol, and each list starts with a symbol, I'm probably going to have to navigate the structure recursively.  The human-readable path is the sequence of Symbols at the start of each list that leads to the one I was looking for.  To navigate the data structure in Python, though, I would want the index in each list.
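
As a sketch of the index-based idea, the following records the position of each sublist on the way down and then follows such a path back to the element. The names `find_index_paths` and `follow` and the sample tree are hypothetical, and plain strings stand in for Symbols. With real sexpdata objects the name check would need to compare `node[0].value()`, since the help output above shows that a Symbol does not compare equal to a plain string.

```python
def find_index_paths(node, target, path=()):
    """Yield the tuple of list indexes leading to each list named target."""
    if not isinstance(node, list) or not node:
        return
    if node[0] == target:
        yield path
    for index, child in enumerate(node):
        if isinstance(child, list):
            yield from find_index_paths(child, target, path + (index,))


def follow(node, path):
    """Walk down the nested lists using the indexes in path."""
    for index in path:
        node = node[index]
    return node


# Hypothetical sample, not the real schematic.
tree = [
    "kicad_sch",
    ["version", 1],
    ["label", "fun", ["uuid", "abc"]],
    ["sheet", ["label", "x1"]],
]
for path in find_index_paths(tree, "label"):
    print(path, follow(tree, path))
```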

@CarlFK put a list containing `(label "fun")` into the sample data to facilitate searching, so let's look for that.

### Finding every label

In [848]:
def recursive_search(s_data, search_symbol_name, key_chain=tuple()):
    if not s_data or not isinstance((current_list_symbol := s_data[0]), sexpdata.Symbol):
        print(f"Found list {s_data} that I don't understand")
        return
    key_chain += ((current_list_symbol,))
    if current_list_symbol.value() == search_symbol_name:
        yield key_chain, s_data[1]
    for contained_elem in s_data:
        if isinstance(contained_elem, list):
            yield from recursive_search(contained_elem, search_symbol_name, key_chain)

In [849]:
found_labels = tuple(recursive_search(s_data, "label"))
print(found_labels)
print(len(found_labels))

(((Symbol('kicad_sch'), Symbol('label')), 'x1'), ((Symbol('kicad_sch'), Symbol('label')), 'fun'), ((Symbol('kicad_sch'), Symbol('label')), 'TDI'), ((Symbol('kicad_sch'), Symbol('label')), 'TRST'), ((Symbol('kicad_sch'), Symbol('label')), 'TDI'), ((Symbol('kicad_sch'), Symbol('label')), 'pico TX'), ((Symbol('kicad_sch'), Symbol('label')), 'JTAG'), ((Symbol('kicad_sch'), Symbol('label')), 'x1'), ((Symbol('kicad_sch'), Symbol('label')), 'RTCK'), ((Symbol('kicad_sch'), Symbol('label')), 'pico RX'), ((Symbol('kicad_sch'), Symbol('label')), 'TCK'), ((Symbol('kicad_sch'), Symbol('label')), 'TDO'), ((Symbol('kicad_sch'), Symbol('label')), 'RTCK'), ((Symbol('kicad_sch'), Symbol('label')), 'TDO'), ((Symbol('kicad_sch'), Symbol('label')), 'TMS'), ((Symbol('kicad_sch'), Symbol('label')), 'TCK'), ((Symbol('kicad_sch'), Symbol('label')), 'reset pico'), ((Symbol('kicad_sch'), Symbol('label')), 'pi power'), ((Symbol('kicad_sch'), Symbol('label')), 'pico TX'), ((Symbol('kicad_sch'), Symbol('label')), '

### Does this make sense?

It turns out all of the label elements are at the top level.  The recursive search function finds 26, but searching for "label" in the sample data shows 42 instances.  It turns out many of them are actually inside quoted strings belonging to other types, as in the following example.

`"Power symbol creates a global label with name \"GND\" , ground"`
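
The gap between the two counts is easy to reproduce on a small scale. The fragment below is hand-written for illustration (only the quoted description is borrowed from the example above), and it compares a plain substring count with a count of the opening token `(label `.

```python
# Hand-written fragment, not the actual schematic file.
sample = r'''(kicad_sch
  (label "fun")
  (property "Description" "Power symbol creates a global label with name \"GND\" , ground"))'''

naive_hits = sample.count("label")    # also counts the word inside the quoted string
token_hits = sample.count("(label ")  # only counts lists that start with label

print(naive_hits, token_hits)
```

Counting the opening token is still only a heuristic, since the same characters could appear inside a quoted string, so searching the parsed structure remains the more reliable check.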

Let's see if the labels match.

In [850]:
[elem[1] for elem in found_labels]

['x1',
 'fun',
 'TDI',
 'TRST',
 'TDI',
 'pico TX',
 'JTAG',
 'x1',
 'RTCK',
 'pico RX',
 'TCK',
 'TDO',
 'RTCK',
 'TDO',
 'TMS',
 'TCK',
 'reset pico',
 'pi power',
 'pico TX',
 'x2',
 'reset pi',
 'pico RX',
 'serial',
 'TRST',
 'TMS',
 'x2']

Doing a quick visual inspection, it appears that the search function finds all the labels.  Also, their names aren't unique, so I should probably update the function to return their UUIDs as well.
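
To back up the point about names not being unique, here is a quick sketch that tallies the label texts. The list is copied by hand from the output above rather than read from the file.

```python
from collections import Counter

# Label texts copied from the output above.
label_names = [
    'x1', 'fun', 'TDI', 'TRST', 'TDI', 'pico TX', 'JTAG', 'x1', 'RTCK',
    'pico RX', 'TCK', 'TDO', 'RTCK', 'TDO', 'TMS', 'TCK', 'reset pico',
    'pi power', 'pico TX', 'x2', 'reset pi', 'pico RX', 'serial', 'TRST',
    'TMS', 'x2',
]

name_counts = Counter(label_names)
# Names that appear on more than one label.
repeated = sorted(name for name, count in name_counts.items() if count > 1)

print(f"{len(label_names)} labels, {len(name_counts)} distinct names")
print("Repeated:", repeated)
```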

In [851]:
def find_labels_with_text(s_data, label_text, key_chain=tuple()):
    # NOTE: label_text is passed down the recursion but is not used as a filter yet,
    # so every label is yielded regardless of its text.
    if not s_data or not isinstance((current_list_symbol := s_data[0]), sexpdata.Symbol):
        print(f"Found list {s_data} that I don't understand")
        return
    key_chain += (current_list_symbol,)
    if current_list_symbol.value() == 'label':
        # We have found a label, it should have a UUID.
        # This means a sublist that starts with the symbol "uuid"
        uuid = None
        for possible_uuid in s_data:
            if isinstance(possible_uuid, list) and possible_uuid and possible_uuid[0].value().casefold() == "uuid":
                uuid = possible_uuid[1]
                # We could either break here, or check to make sure that the UUID is unique.
        yield key_chain, s_data[1], uuid
    for contained_elem in s_data:
        # This probably feels like it duplicates work from searching for the UUID, and it does.
        # But they're for different purposes and combining them would be easy to mess up, so I left them separate.
        if isinstance(contained_elem, list):
            yield from find_labels_with_text(contained_elem, label_text, key_chain)
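
The function above accepts `label_text` but never compares anything against it, so it yields every label. A hedged sketch of a variant that does filter on the text follows. The name `find_labels`, the sample tree, and its UUID values are made up, and plain strings stand in for Symbols (with real sexpdata objects the checks would compare `node[0].value()` instead).

```python
def find_labels(node, label_text):
    """Yield (text, uuid) for every label whose text equals label_text."""
    if not isinstance(node, list) or not node:
        return
    if node[0] == "label" and len(node) > 1 and node[1] == label_text:
        # Look for a direct ["uuid", value] child; None if there isn't one.
        uuid = next(
            (child[1] for child in node[1:]
             if isinstance(child, list) and len(child) > 1 and child[0] == "uuid"),
            None,
        )
        yield node[1], uuid
    for child in node:
        yield from find_labels(child, label_text)


# Hypothetical sample with made-up UUID values.
tree = [
    "kicad_sch",
    ["label", "x1", ["uuid", "aaa"]],
    ["label", "fun", ["uuid", "bbb"]],
    ["label", "x1", ["uuid", "ccc"]],
]
print(list(find_labels(tree, "x1")))
```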

In [852]:
tuple(find_labels_with_text(s_data, "x1"))

(((Symbol('kicad_sch'), Symbol('label')),
  'x1',
  '0b22b970-b9bb-4fe6-bf71-1ab837269f15'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'fun',
  '0de11d38-fc02-4941-b52a-794454b9288f'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'TDI',
  '12cf4f88-9ec9-41ab-8893-b3950909b14e'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'TRST',
  '210a62ab-5e09-437e-8452-cf7e739f589a'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'TDI',
  '24a7f594-59d9-4f0e-8feb-3877f35f9ace'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'pico TX',
  '2df899f1-160f-49e5-813a-1ad4eba2d6ed'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'JTAG',
  '3c9d6eb8-a1d0-4f24-a993-e0819bc10a15'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'x1',
  '430b167f-223a-4046-9cf1-c23b8785dff3'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'RTCK',
  '4ee86edb-e937-4f89-87b9-d64f323abdc0'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'pico RX',
  '6140d8d2-0719-441b-b632-b49c10bd9265'),
 ((Symbol('kicad_sch'), Symbol('label')),
  'TCK',
