Skip to content

First Tutorial Part 7: Realization And Refills

Laurent Hasson edited this page Sep 16, 2019 · 2 revisions
Previous Main Next
<-- Part 6 Main Part 8 -->

Flat Datamarts and Performance

One issue with Pivot tables, or any complex view for that matter, is that they tend to be slow if you are dealing with more complex joins and millions of records. So how do we deal with this? View materialization is an answer most databases have implemented. In analytical environments where data may not change all the time, view materialization is a very good answer. If you are not familiar, materializing a view is as if the database dumped the contents of the view into a table. Fully denormalized, and without any joins, accessing that table becomes MUCH faster in MOST cases. At a high level, it's functionally equivalent to doing

create table X as select * from X_view

Of course, databases provide some automation, and maintenance etc... so that view materialization is often better than the quick and dirty SQL above. However, in many analytical cases, it might not be the best answer:

  • A view may be rather complex and refreshing the materialized table under the covers may not be trivial, sometimes requiring the entire view to be recomputed.
  • The resulting underlying table loses many of the semantics from the tables that made up the view such as foreign keys, indices and so on.
  • Sometimes, you know the data better and could do a materialization better.

With Tilda, we encourage the creation of views as they are ideal to encapsulate concepts and/or transformations on your data. This is important because Tilda can be used in environments where some ETL (Extract Transform Load) processes are implemented. Tilda IS NOT an ETL tool, but some simpler view-level transformations, formulas and features like Realization, can often lighten the load on the amount of ETL work you may have to do.

Making liberal use of views is a key aspect of good SQL database design. Views allow you to encapsulate the details of the structure of your tables, which might change as your application evolves, behind consistent interfaces.

– PostgreSQL Documentation - Advanced Features - Views

Tilda has a feature called "Realization" (to make it close to, but different from the term "Materialization") that helps make the process more efficient, especially at scale. It allows to dump the contents of a view into a target table but do this so that:

  • The semantics of your choice from the underlying tables are "ported" to the target table.
  • The process can be customized (in limited ways today) to, for example, implement incremental updates which are critical for scalability.
  • The target table exists implicitly in the Tilda model which makes further views or other constructs possible on top of a performing version of the data.

The process of realizing a view is called a "refill". So let's take our Test_XYZ_PivotView definition and add realization to it:

  • We have one row per Form, so formRefnum is the grain for this table and can be made its PK
  • We can create indices over users as well
  • We can re-declare all the relevant FK
,{ "name": "Test_XYZ_PivotView"
  ,"description": "A pivoted view of XYZ forms"
  ,"columns":[
      { "sameas": "Form.refnum"  , "name":"formRefnum"     }
     ,{ "sameas": "Form.type"    , "name":"formType"       }
     ,{ "sameas": "Form.fillDate", "name":"formFillDate"   }
     ,{ "sameas": "User.refnum"  , "name":"formUserRefnum" }
     ,{ "sameas": "User.email"   , "name":"formUserEmail"  }
     ,{ "sameas": "TestAnswer.correct"   , "name":"countCorrect"        , "aggregate":"COUNT", "filter":"correct = true"   }
     ,{ "sameas": "TestAnswer.correct"   , "name":"countIncorrect"      , "aggregate":"COUNT", "filter":"correct = false"  }
     ,{ "sameas": "TestAnswer.timeMillis", "name":"timeMillisTotal", "aggregate":"SUM"  }
     ,{ "sameas": "TestAnswer.timeMillis", "name":"timeMillisAvg"  , "aggregate":"AVG"  }
     ,{ "sameas": "TestAnswer.timeMillis", "name":"timeMillisMin"  , "aggregate":"MIN"  }
     ,{ "sameas": "TestAnswer.timeMillis", "name":"timeMillisMax"  , "aggregate":"MAX"  }
     ,{ "sameas": "TestAnswer.correct"                             , "aggregate":"MIN", "joinOnly":true  }
    ]
  ,"subWhereX":{
      "clause":["    Form.type = 'XYZ'" 
               ,"and Form.deleted is null"
               ,"and TestAnswer.deleted is null"
               ]
     ,"description":["Only look at XYZ forms and answers that have not been deleted."
                    ]
    }
  ,"pivots":[
      { "on": "TestAnswer.questionId" 
       ,"aggregates":[
           { "name":"correct"      , "suffix":"_correct"    }
          ,{ "name":"timeMillisMax", "suffix":"_timeMillis" }
         ]
       ,"values":[ { "value":"QX", "description":"Question X" }
                  ,{ "value":"QY", "description":"Question Y" }
                  ,{ "value":"QZ", "description":"Question Z" }
                ]
       ,"interleave":true
      }
    ]
  ,"realize":{
       "primary":{ "columns":["formRefnum"] }
      ,"foreign":[
          { "name":"Form" ,  "srcColumns":["formRefnum"    ], "destObject": "Form" }
         ,{ "name":"User"  , "srcColumns":["formUserRefnum"], "destObject": "User" }
        ]
      ,"indices":[
          { "name":"User", "columns":["formUserRefnum"], "orderBy":["formFillDate desc"] }
         ,{ "name":"Type", "columns":["formType"      ], "orderBy":["formFillDate desc"] }
        ]
    }
 }

The fields are:

  • name: An optional name for the target table. If not provided, the name of the view will be used where the "View" or "PivotView" suffixes will be replaced by "Realized".
  • primary: Defines a primary key, if applicable.
  • foreign: Defines one or more foreign keys, if applicable.
  • indices: Defines one or more indices, if applicable.
  • upsert: Define the columns used for tracking the last time data was touched or deleted in order to support incremental realization.
    • upsertTS: One or more timestamp columns (LEAST will be used) that would indicate data has been updated.
    • deleteTS: A timestamp columns that would indicate data has been marked for deletion.

🎈 NOTE: There were a lot of heated conversations around whether the realized table should be allowed a custom name. On the one hand, enforcing "Realized" on the target table name proved to be a good thing as people looking at the schema know exactly how they work, could look up their corresponding view without having to look at docs. On the other hand, we faced a few scenarios where the target table was intended for users outside of our group who dictated the names of the tables they wanted to look at. One way would have been to simple create a .* view over the realized table, but that seemed like a weird thing to do (i.e., a view generates a table which we then re-wrap into a view). In the end, too many customers of the data we were preparing requested that feature.

🎈 NOTE: The "upsert" feature is still being worked on with no ETA as of Sept 2019. The information above is therefore tentative and will be updated when the feature lands. In the meantime, Realization processes, i.e. Refills, go over the entire data from the View and is therefore not incremental.

  • THIS IS RECOGNIZED AS AN ISSUE AND A KEY FEATURE TO MAKE REALIZATION TRULY SCALABLE.
  • In the project mentioned above, we pivotted 590 columns over 500K assessments and 140M question/answers for the OASIS spec.
    • That pivotting view took about 15mn to realize nightly on a system with 32GB memory, 16 cores and SSDs.
    • Postgres is also quite efficient with views so we are able to realize multiple pivot tables concurrently: The HIS, OASIS, ICC and Physical Assessment example mentioned above concurrently take under 20mn altogether, which was very acceptable for this project where the ETL is run nightly.
    • A prototype implementation of the upsert version of realization takes < 1mn assuming about 2.5K assessments are created every day, 7.5K are modified, and low 100's are deleted.

The realization will create and maintain 2 assets:

  • A definition for a new table both for the database, as well as in-memory for the application (as if you had manually created such a table in the Tilda JSON). In this case, we didn't specify a name, so the table is named automatically to Test_XYZ_Realized.
  • A database function to do the "refill" (i.e., realize the view): Refill_Test_XYZ_Realized().
create table if not exists TILDATEST.Test_XYZ_Realized -- Realized table for view TILDATEST.Test_XYZ_PivotView: A pivoted view of XYZ forms
 (  "formRefnum"       bigint            not null   -- The primary key for this record (from TILDATEST.Test_XYZ_PivotView.formRefnum)
  , "formType"         varchar(40)                  -- Form template type (from TILDATEST.Test_XYZ_PivotView.formType)
  , "formFillDateTZ"   varchar(10)                  -- Generated helper column to hold the time zone ID for 'formFillDate'. (from TILDATEST.Test_XYZ_PivotView.formFillDateTZ)
  , "formFillDate"     timestamptz                  -- The date the form was filled (from TILDATEST.Test_XYZ_PivotView.formFillDate)
  , "formUserRefnum"   bigint                       -- The primary key for this record (from TILDATEST.Test_XYZ_PivotView.formUserRefnum)
  , "formUserEmail"    varchar(255)                 -- The user's email (from TILDATEST.Test_XYZ_PivotView.formUserEmail)
  , "countCorrect"     bigint                       -- Whether the answer is correct or not (from TILDATEST.Test_XYZ_PivotView.countCorrect)
  , "countIncorrect"   bigint                       -- Whether the answer is correct or not (from TILDATEST.Test_XYZ_PivotView.countIncorrect)
  , "timeMillisTotal"  bigint                       -- Time in milliseconds for the time spent answering the question (from TILDATEST.Test_XYZ_PivotView.timeMillisTotal)
  , "timeMillisAvg"    double precision             -- Time in milliseconds for the time spent answering the question (from TILDATEST.Test_XYZ_PivotView.timeMillisAvg)
  , "timeMillisMin"    integer                      -- Time in milliseconds for the time spent answering the question (from TILDATEST.Test_XYZ_PivotView.timeMillisMin)
  , "QX_correct"       boolean                      -- Question X (pivot of MIN(TILDATEST.TestAnswer.correct) on TILDATEST.TestAnswer.questionId='QX') (from TILDATEST.Test_XYZ_PivotView.QX_correct)
  , "QX_timeMillis"    integer                      -- Question X (pivot of MAX(TILDATEST.TestAnswer.timeMillis) on TILDATEST.TestAnswer.questionId='QX') (from TILDATEST.Test_XYZ_PivotView.QX_timeMillis)
  , "QY_correct"       boolean                      -- Question Y (pivot of MIN(TILDATEST.TestAnswer.correct) on TILDATEST.TestAnswer.questionId='QY') (from TILDATEST.Test_XYZ_PivotView.QY_correct)
  , "QY_timeMillis"    integer                      -- Question Y (pivot of MAX(TILDATEST.TestAnswer.timeMillis) on TILDATEST.TestAnswer.questionId='QY') (from TILDATEST.Test_XYZ_PivotView.QY_timeMillis)
  , "QZ_correct"       boolean                      -- Question Z (pivot of MIN(TILDATEST.TestAnswer.correct) on TILDATEST.TestAnswer.questionId='QZ') (from TILDATEST.Test_XYZ_PivotView.QZ_correct)
  , "QZ_timeMillis"    integer                      -- Question Z (pivot of MAX(TILDATEST.TestAnswer.timeMillis) on TILDATEST.TestAnswer.questionId='QZ') (from TILDATEST.Test_XYZ_PivotView.QZ_timeMillis)
  , PRIMARY KEY("formRefnum")
  , CONSTRAINT fk_Test_XYZ_Realized_Form FOREIGN KEY ("formRefnum") REFERENCES TILDATEST.Form ON DELETE restrict ON UPDATE cascade
  , CONSTRAINT fk_Test_XYZ_Realized_User FOREIGN KEY ("formUserRefnum") REFERENCES TILDATEST.User ON DELETE restrict ON UPDATE cascade
 );
COMMENT ON TABLE TILDATEST.Test_XYZ_Realized IS E'Realized table for view TILDATEST.Test_XYZ_PivotView: A pivoted view of XYZ forms';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."formRefnum" IS E'The primary key for this record (from TILDATEST.Test_XYZ_PivotView.formRefnum)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."formType" IS E'Form template type (from TILDATEST.Test_XYZ_PivotView.formType)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."formFillDateTZ" IS E'Generated helper column to hold the time zone ID for ''formFillDate''. (from TILDATEST.Test_XYZ_PivotView.formFillDateTZ)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."formFillDate" IS E'The date the form was filled (from TILDATEST.Test_XYZ_PivotView.formFillDate)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."formUserRefnum" IS E'The primary key for this record (from TILDATEST.Test_XYZ_PivotView.formUserRefnum)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."formUserEmail" IS E'The user''s email (from TILDATEST.Test_XYZ_PivotView.formUserEmail)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."countCorrect" IS E'Whether the answer is correct or not (from TILDATEST.Test_XYZ_PivotView.countCorrect)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."countIncorrect" IS E'Whether the answer is correct or not (from TILDATEST.Test_XYZ_PivotView.countIncorrect)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."timeMillisTotal" IS E'Time in milliseconds for the time spent answering the question (from TILDATEST.Test_XYZ_PivotView.timeMillisTotal)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."timeMillisAvg" IS E'Time in milliseconds for the time spent answering the question (from TILDATEST.Test_XYZ_PivotView.timeMillisAvg)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."timeMillisMin" IS E'Time in milliseconds for the time spent answering the question (from TILDATEST.Test_XYZ_PivotView.timeMillisMin)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."QX_correct" IS E'Question X (pivot of MIN(TILDATEST.TestAnswer.correct) on TILDATEST.TestAnswer.questionId=''QX'') (from TILDATEST.Test_XYZ_PivotView.QX_correct)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."QX_timeMillis" IS E'Question X (pivot of MAX(TILDATEST.TestAnswer.timeMillis) on TILDATEST.TestAnswer.questionId=''QX'') (from TILDATEST.Test_XYZ_PivotView.QX_timeMillis)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."QY_correct" IS E'Question Y (pivot of MIN(TILDATEST.TestAnswer.correct) on TILDATEST.TestAnswer.questionId=''QY'') (from TILDATEST.Test_XYZ_PivotView.QY_correct)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."QY_timeMillis" IS E'Question Y (pivot of MAX(TILDATEST.TestAnswer.timeMillis) on TILDATEST.TestAnswer.questionId=''QY'') (from TILDATEST.Test_XYZ_PivotView.QY_timeMillis)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."QZ_correct" IS E'Question Z (pivot of MIN(TILDATEST.TestAnswer.correct) on TILDATEST.TestAnswer.questionId=''QZ'') (from TILDATEST.Test_XYZ_PivotView.QZ_correct)';
COMMENT ON COLUMN TILDATEST.Test_XYZ_Realized."QZ_timeMillis" IS E'Question Z (pivot of MAX(TILDATEST.TestAnswer.timeMillis) on TILDATEST.TestAnswer.questionId=''QZ'') (from TILDATEST.Test_XYZ_PivotView.QZ_timeMillis)';
CREATE INDEX IF NOT EXISTS Test_XYZ_Realized_User ON TILDATEST.Test_XYZ_Realized ("formUserRefnum", "formFillDate" DESC);
CREATE INDEX IF NOT EXISTS Test_XYZ_Realized_Type ON TILDATEST.Test_XYZ_Realized ("formType", "formFillDate" DESC);


DROP FUNCTION IF EXISTS TILDATEST.Refill_Test_XYZ_Realized();
CREATE OR REPLACE FUNCTION TILDATEST.Refill_Test_XYZ_Realized()
 RETURNS boolean AS $$
BEGIN
  TRUNCATE TILDATEST.Test_XYZ_Realized;
  INSERT INTO TILDATEST.Test_XYZ_Realized ("formRefnum", "formType", "formFillDateTZ", "formFillDate", "formUserRefnum", "formUserEmail", "countCorrect", "countIncorrect", "timeMillisTotal", "timeMillisAvg", "timeMillisMin", "QX_correct", "QX_timeMillis", "QY_correct", "QY_timeMillis", "QZ_correct", "QZ_timeMillis")
     SELECT /*genRealizedColumnList*/"formRefnum" -- COLUMN
          ,"formType" -- COLUMN
          ,"formFillDateTZ" -- COLUMN
          ,"formFillDate" -- COLUMN
          ,"formUserRefnum" -- COLUMN
          ,"formUserEmail" -- COLUMN
          ,"countCorrect" -- COLUMN
          ,"countIncorrect" -- COLUMN
          ,"timeMillisTotal" -- COLUMN
          ,"timeMillisAvg" -- COLUMN
          ,"timeMillisMin" -- COLUMN
          , "QX_correct" -- PIVOT COLUMN
          , "QX_timeMillis" -- PIVOT COLUMN
          , "QY_correct" -- PIVOT COLUMN
          , "QY_timeMillis" -- PIVOT COLUMN
          , "QZ_correct" -- PIVOT COLUMN
          , "QZ_timeMillis" -- PIVOT COLUMN
     FROM TILDATEST.Test_XYZ_PivotView;
  ANALYZE TILDATEST.Test_XYZ_Realized;
  return true;
END; $$
LANGUAGE PLPGSQL;

As usual, the SQL should be pretty self-explanatory, including the Refill function which simply truncates the table, inserts into the table select * from the view, and then runs analyze. In the near future, if incremental processing is specified with the "upsert" element, more complex and different code will be generated.

🎈 NOTE: An interesting problem to solve is about what happens if a view is created over another view that is realized. For example, let's say you have V1 as a simple view that is realized. Then you define V2, use V1, and realize V2. It would be silly to have V2's refill not take advantage of V1's refill. To achieve this, Tilda maintains a separate hierarchy of views which substitute the appropriate realized table when applicable. Those views exist in a separate schema, are used by the refill functions and are maintained/migrated automatically by Tilda. Those views shouldn't be used ever directly and there are discussions about eliminating them altogether for a simpler implementation.

X_View
   TableA
   TableB
   TableC
--> X_Realized
--> Refill_X_Realizes uses X_View

Y_View
   X_View
   TableY
   TableZ
--> Y_Realized
--> Y_R
     X_Realized
     TableY
     TableZ
---> Refill_Y_Realized uses Y_R
Previous Main Next
<-- Part 6 Main Part 8 -->
Clone this wiki locally