## Peeves

The reporting metadata drive I'm currently working on aims to address several pain points.

- Researching any run-time issue may involve
  - examining the Datadog logs
  - interpreting logged exception messages
  - verifying that schema matches in multiple environments
  - verifying that code matches in multiple environments
  - verifying the state of services and machines
  - examining the underlying metadata
  - examining the underlying data
  - evaluating the state of intermediary files
  - evaluating configurations
  - validating secrets (or their accessibility)
- Outside the staging and dev environments, only the first two of these ten are directly accessible to engineers.

### 🚧 Roadblocks
- This produces roadblocks, including
  - having to involve DBAs or devops to get or examine info pertinent to the problem
  - being unable to test that data against our processes to determine the point of failure
  - being unable to use real data to test a solution to an understood problem
- Also, it is hard for anyone coming into our reporting environment to quickly understand how all the pieces work.
  
- This is not unique to reporting; it applies across the organization. It is an inherent problem in database systems, largely because the tools for investigating a system from a data point of view tend to be expensive and are therefore not freely available to all users.



## Two different data issues

#### 🗺️ Understanding
The complexities of the subsystem are hard to grasp at first without looking at
- how the data is interrelated (cross-table references)
- whether/how the data is populated/unique/optional
- the cardinality of the relationships (1:1, 1:N, 0:1 ...)
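
As a rough illustration of the last point, cardinality can be estimated from a sample of key values. The helper below is a hypothetical sketch (the function name and the sample data are assumptions, not part of the reporting schema): it counts child rows per parent key and reports whether the relationship looks like 1:1, 1:N, 0:1 or 0:N. A sample can only suggest the cardinality; it cannot prove it.

```python
from collections import Counter


def classify_cardinality(parent_keys, child_foreign_keys):
    """Classify a parent -> child relationship from sampled key values.

    Returns "1:1", "1:N", "0:1" or "0:N", seen from the parent side.
    NULL references (None) in the child column are ignored.
    """
    counts = Counter(k for k in child_foreign_keys if k is not None)
    per_parent = [counts.get(key, 0) for key in parent_keys]
    if not per_parent:
        raise ValueError("no parent keys sampled")
    lower = "0" if min(per_parent) == 0 else "1"   # is the child optional?
    upper = "N" if max(per_parent) > 1 else "1"    # can there be several?
    return f"{lower}:{upper}"


# Made-up sample: three parents, one of them without children.
print(classify_cardinality([1, 2, 3], [1, 1, None]))
```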

#### ⚠️ Diagnosing

Following the trail of an actual incident involves following a chain of data from the ReportId, or perhaps the ReportId and OrganizationID, through to the underlying data, some of which is buried quite deep.
Having to ask a DBA to fetch data requires either that they patiently run subsequent queries as each layer is discovered, or that we already have efficient diagnostic routines built in.
This is also hard on the DBAs because the VMs in production environments are designed to impede the flow of random snippets of information.


## 🗝️ Security and Privacy

A further complication arises in that some of the data, even though buried, may contain restricted information that should not be distributed - the need-to-know rule applies.

If we are to build a new first-class reporting system that will stand the test of time, I believe we need to 
- make it easier to understand the way our metadata works
- migrate much of the "tribal knowledge" to transparent automated or configurable processes
- make it easy to understand run-time data related to a specific incident
- make it easy to develop queries that clearly present relational data outside of 1:1 relationships. 
- make tooling available to generate metrics, measure health and execute diagnostic probes as an engineer without requiring DBA-level privileges
- protect the sensitive portions of the metadata while exposing the relevant data and the structure of that data 

In developing a replacement for the existing system, a clear understanding of the CRUD lifecycle as it maps across the data structures is essential: addressing the above issues makes this possible.
 


# Reporting Pain Points - from a Data perspective

### tl;dr

Data-related challenges for maintaining and growing Reporting include:

- Lack of system diagrams/documentation
- Dependency on tribal knowledge of schema and configuration
- Difficulty diagnosing errors because of data protection
- High impedance pathway for diagnostic and health data
- Inadequate tools for viewing hierarchical data
- Nested, zipped or embedded data needs extra processing
- Lack of redacted or emulated data representative of the real world

## 🛠️ My process:

1. ***Diagrams*** I started by drawing simplified diagrams, showing only tables and their inbound or outbound keys of some sort, which reveal how all the tables in the Reporting schema are connected to one another: [Overview](../.media/Overview.pdf)  [Group A](../.media/GroupA.pdf)   [Group B](../.media/GroupB.pdf)
2. I added diagrams revealing the full schema of each of those tables
3. In examining the links between the tables I determined the longest possible pathways through the data, using both explicit Primary key - foreign key relationships and inferred relationships to define the links.
4. ***Hierarchical*** I realized that the painful process of unraveling hierarchical (1-to-many) data using flat tables as a medium could be greatly simplified by executing queries that yield their results in JSON form, which encapsulates hierarchical data without much impact on processing speed.
5. Developing the SQL queries to produce the JSON is not too complex for a pair of tables defined by a single link, and that was my initial approach.
   1. If we express a link between two tables as the fully qualified name of Primary key, and the fully qualified path of the reference to it, we have all the information necessary to get all the rest of the information from the database that we need to generate the sql query.
   2. Querying the InformationSchema yields all the column data we need to build the query.
6. ***Query Generation*** I successfully automated the just-in-time generation of the two-table SQL based on the link field names.
7. Following the links between the tables, I recognized that our metadata (not surprisingly) consists of two main subtrees:
   1. One defines report design
   2. One describes run-time instances of reports
8. The JSON data is inherently serializable, so the query results can be easily saved and loaded.
9. ***Jitting*** Since the links naturally organized themselves into two tree structures, it seemed feasible that, with a little recursion, we could auto-generate more complex SQL statements that would produce JSON for an entire tree of information! All that was necessary was:
   1. A simple class to manage the recursive tree structure (a parent with potentially multiple children)
   2. A starting sequence that could be simply evaluated in the correct order to allow the structure to be built in a single pass. Fortunately I already had that mapped.
10. The result was a sql generator that, in the context of a known sql database, would produce a query from an arbitrary (though sequenced) list of table links.
11. A quick side-note on the need to sequence the links:
    1.  Although links are navigable (i.e. we can get to them without performing a database seek) from a foreign key to a primary key, the semantic relationships in the database can go both ways.
    2.  Given a link already in place, if the next link is in sequence and part of the pathway, it shares an endpoint with either the beginning of the prior link or the end of it.
    3.  If it shares the end, it is a nested reference, presented a level deeper than the current link. Otherwise it is a peer, and its properties behave as a join's would, i.e. they appear at the same level as the prior link.
    4.  These are the only two possibilities, as long as the links have been arranged by adjacency. Parsing this with the recursive nested data structure is relatively trivial because there are only two choices at each step.
12. ***Two Paths*** The first two extended queries I generated fully covered both trees of information - the definitions and the run-time instances.
13. As my focus is usually either to discover the information structure and cardinality or to follow the trail of a particular case, I limited the SQL to return a small set of data, and was also able to limit the number of rows brought back at each level, as typically I only needed "a few" samples.
14. In practical terms, the model scope need not be limited when the query carries injected filters that naturally limit the scope, for example when querying a particular RecordLogId or Instance. In all other cases we would expect to see only the top few records.
15. Because I wanted to get the latest and greatest samples, I ordered each sub-query by the identity field descending.
16. Now I had JSON with both data and structure. I could easily save and retrieve it, which also allowed me to work without re-querying the database and simplified testing.
17. Working with JSON data requires dealing with nested data structures. Because the SQL query corresponds in form and relations to the table structure it queries, it is likewise isomorphic with the data that is returned, so the same classes, being fairly agnostic of the actual content, can once again be used to process the data.
18. Using the recursive class allows quick, efficient parsing of the data and reduces the handling to a few decision points, which correctly convert the data between its JSON form and any other form that supports a tree-like hierarchical structure.
19. ***Display*** One format that shares this structure is HTML, so using the same class allows us to efficiently convert the data into an easily rendered form.
20. A problem with the way large data joins are presented in a table is that once you have more than a few columns you are forced to scroll horizontally. Transposing columns and rows instead allows more columns (now rows, of course) to be seen at a time and fewer rows (now columns), which suits this purpose, where we want to see a lot of fields but only "a few" of each.
21. We now have more vertical scrolling but way less horizontal scrolling, which most people, especially those with wide screens, prefer.
22. When we encounter a property which is a reference to another table, we can indent and present the contents of that table.
23. We can also choose to limit the horizontal width:
    1.  We can show the titles "sticky" and let the records (now columns) scroll
    2.  We can just show a single record, along with an indication of which one we are showing (and how many there are), and provide a means to step through or expand the records (now columns).
24. Naturally there are an infinite number of ways to deal with the presentation. However, the underlying HTML for much of this is the same, as it derives from the original topology of the data: what changes is the CSS used to render the display. All I needed was something that would work until a front-end expert could make it pretty, and for me the most useful has been the version which steps through one record at a time.
25. ***Persistence*** The stepping version of the HTML uses JavaScript to do the stepping. Usefully, this combination of JavaScript, HTML and CSS can be written to a single file, then retrieved and viewed in your browser of choice without requiring any server. This allows us to make snapshots of data which can be replayed offline, or even on different machines or operating systems, which is great for sharing structured information.
26. This work is what I demonstrated somewhat superficially to the team a while back.
27. ***Unit testing*** Since then the class libraries that support this were refactored to clarify purpose, and equipped with comprehensive unit tests.
28. ***Integration testing*** The JsonSql portion of the subsystem does not lend itself to unit testing, so I incorporated integration tests that exercise about 98% of the code against the industry-standard "AdventureWorks2022" database running in a Docker container. This fits the desire for portability and, because the test database has more than thirty times the number of tables, allows more rigorous testing.
29. ***Pre/post processing*** Clearly this can meet many of the requirements for a diagnostic system, but two more are needed:
    1.  We need to be able to customize the queries to deal with zipped data, redaction and nested json or xml data, and special formatting;
    2.  We need a mechanism for customizing the scope and behavior of the queries, for example excluding some columns, unpacking data into subtrees, pre and post processing.
30. This implies that we have some kind of model, which begins as an automatically generated piece, but has customizations added over time.
31. Thus our original Sql Generator is replaced with a Modeler which can
    1.  Initialize a model automatically from a data source
    2.  Be progressively refined
    3.  Be saved and loaded in its refined state
    4.  Be pruned of unwanted content
    5.  And believe it or not, be reinitialized without losing the applied customizations
32. ***Queryable Model*** The Model so generated, and all the variants that can now be easily made, is not a SQL query: when data is needed, we pass the model to a SqlModelQuery.
33. The SqlModelQuery uses the model to JIT a SQL query and execute it against the data source. The result, in JSON form, combines the raw data with encoded column names that index back into the model.
34. This query result is now submitted to a second process which models the data into a nested data structure: in the process, it
    1.  Applies any extraction of nested properties - which may be expanded into sub-properties
    2.  Applies any special processing defined in the model
    3.  Injects any new fields and blocks any removed fields, yielding results that can differ in shape, format and content from what the database returned
35. The resulting Nested data object is indistinguishable in form from the object that would have resulted from a direct Sql query, except that
    1.  Zipped or otherwise packed data can be extracted into subfields
    2.  Formatting can be applied
    3.  Data elements can be removed, redacted, or replaced with mocked data
    4.  Encryption, decryption and hashing (which allows comparison for equality without revealing content) can be applied
36. The processing required by the model is "soft-defined" in that the DataModeler is agnostic of the operations being injected into the data stream. The handlers defined for this delegation of control are implemented using a simple class wrapper and invoked declaratively by symbols in the model. This allows the handlers to be refined and tested independently, and means that all models derived from a larger model will inherit the refinements and pass them to their own derivatives.
37. The first two portions of the process (The Schema Modeler and Model Query) are implemented as minimal interfaces, allowing transparent insertion of other processes to deal with other data sources, for example Snowflake.
38. The last two processes (Nested Data Modeler and Presenter) deal with formatted data (the Model, Json and the Nested Data) and so are agnostic of the original data source.
39. When complete, because we are able to redact or hide private data, and efficiently generate displayable html, this provides a safe tool for developers and engineers to investigate reporting issues. 
40. ***Redacted exports*** Beyond this immediate need, we can consider:
    1.  Could models redacted in this manner be used for "live" test data from broader scopes for our reporting development?
    2.  Could the sql-to-json conversion be round trip, i.e. would it be possible to insert relational data in a database from json? 
41. ***Virtualization*** These are obviously related: if they are feasible, we could inject limited data into Docker images for testing and diagnostic purposes with complete privacy.
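
Steps 5 to 11 above (links expressed as key/reference pairs, sequenced so that the tree can be built in a single pass, then turned into SQL that returns JSON) can be sketched roughly as follows. This is not the actual generator. The `Node`, `build_tree` and `to_sql` names and the `Report`, `Section`, `Field` and `Schedule` tables are made up for illustration, and the `FOR JSON PATH` syntax assumes SQL Server, which the AdventureWorks2022 test database suggests but which is not stated above. Ordering and injected filters are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One table in the tree: a parent with potentially multiple children."""
    table: str
    key: str = ""   # column on the parent table that this table references
    ref: str = ""   # column on this table that holds the reference
    children: list = field(default_factory=list)


def build_tree(links):
    """Build the tree in one pass from sequenced (primary_key, reference) pairs.

    Each link is ("ParentTable.KeyColumn", "ChildTable.RefColumn").
    A link whose parent table has not been seen yet is out of sequence.
    """
    nodes = {}
    root = None
    for primary_key, reference in links:
        parent_table, key = primary_key.split(".")
        child_table, ref = reference.split(".")
        if root is None:
            root = nodes[parent_table] = Node(parent_table)
        if parent_table not in nodes:
            raise ValueError(f"link {primary_key} -> {reference} is out of sequence")
        child = nodes[child_table] = Node(child_table, key, ref)
        nodes[parent_table].children.append(child)
    return root


def to_sql(node, top=3, alias="t0", parent_alias=None):
    """Recursively emit a nested SELECT ... FOR JSON PATH statement."""
    columns = [f"{alias}.*"]
    for i, child in enumerate(node.children):
        sub = to_sql(child, top, f"{alias}_{i}", alias)
        columns.append(f"({sub}) AS [{child.table}]")
    sql = f"SELECT TOP ({top}) {', '.join(columns)} FROM [{node.table}] AS {alias}"
    if parent_alias is not None:
        # Correlate the child rows with the enclosing parent row.
        sql += f" WHERE {alias}.{node.ref} = {parent_alias}.{node.key}"
    return sql + " FOR JSON PATH"


# Hypothetical links: the second shares the END of the first (nested deeper),
# the third shares the BEGINNING of the first (a peer of Section).
links = [
    ("Report.ReportId", "Section.ReportId"),
    ("Section.SectionId", "Field.SectionId"),
    ("Report.ReportId", "Schedule.ReportId"),
]
print(to_sql(build_tree(links)))
```

The sketch finds each link's parent through a dictionary lookup rather than through the two-way choice described in step 11; with adjacency-ordered links both approaches should place each table at the same level.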

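The "soft-defined" processing in steps 34 to 36 could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `HANDLERS` registry, the `apply_model` function, the handler symbols (`redact`, `hash`, `remove`) and the sample column names are hypothetical, and the real DataModeler may be organized quite differently. The point it shows is that the walker knows nothing about the operations themselves; it only looks up the symbol that the model attaches to a column.

```python
import hashlib

# Hypothetical handler registry: symbol -> function applied to a single value.
HANDLERS = {
    "redact": lambda value: "***",
    # Hashing allows equality comparison without revealing the content.
    "hash": lambda value: hashlib.sha256(str(value).encode("utf-8")).hexdigest(),
}


def apply_model(data, model):
    """Walk nested rows (lists of dicts) and apply the handlers the model names.

    `model` maps a column name to a handler symbol. Columns the model does
    not mention pass through unchanged; "remove" drops the column entirely.
    The input is not modified; a new structure is returned.
    """
    if isinstance(data, list):
        return [apply_model(item, model) for item in data]
    if isinstance(data, dict):
        result = {}
        for name, value in data.items():
            symbol = model.get(name)
            if isinstance(value, (list, dict)):
                result[name] = apply_model(value, model)  # nested table
            elif symbol == "remove":
                continue
            elif symbol is not None:
                result[name] = HANDLERS[symbol](value)
            else:
                result[name] = value
        return result
    return data


# Made-up query result with one nested level.
rows = [{
    "ReportId": 7,
    "OwnerEmail": "a@example.com",
    "Runs": [{"RunId": 1, "OwnerEmail": "a@example.com", "ApiKey": "s3cret"}],
}]
model = {"OwnerEmail": "hash", "ApiKey": "redact"}
safe = apply_model(rows, model)
print(safe)
```

An unsalted hash of a low-entropy value can be guessed by brute force, so a real handler would probably need a keyed hash; that choice is outside the scope of this sketch.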