
Internals of ShannonBase


1 Overview

2 Architecture

2.1 Before the start

ShannonBase is an AI/ML-empowered, open-source MySQL HTAP database that uses AI/ML techniques to enhance its own capabilities.

With these features, ShannonBase will offer autonomous tuning, workload prediction, automatic index choice, index recommendation, selectivity estimation, etc. (refs: a PhD dissertation from UC Berkeley by Zongheng Yang, and MySQL HeatWave).

In order to support analytical processing in MySQL, ShannonBase incorporates a secondary engine: an in-memory column engine that processes analytical workloads. The secondary engine is a new feature introduced in MySQL 8.x; it provides an interface for building a multi-model, heterogeneous database and synchronizes data from the primary engine (InnoDB) to the secondary engine (Rapid).

Our in-memory column store is called Rapid.

It will be based on MySQL 8.1. It aims to achieve at least x times higher query performance on TPC-H than that of xxx, with a more concise architecture and a query optimizer that can intelligently offload query workloads to corresponding storage engines.

The design philosophy of ShannonBase Rapid is modularity and performance-cost balance. The following outlines the new features that will be implemented in ShannonBase. To learn details about each feature, see the relevant chapter.

ShannonBase Rapid will remain an open-source project, an open counterpart to the closed-source service MySQL HeatWave.

First, an in-memory column store (IMCS) will be used. Second, a cost-based query engine will be developed to automatically offload transactional and analytical workloads. Third, ShannonBase Rapid will provide a vectorized execution engine and support massively parallel processing. In this way, the execution performance of ShannonBase Rapid will be at least xxx times that of xxx.

ShannonBase loads data from InnoDB into memory in Rapid, just as MySQL HeatWave does.

"MySQL Analytics is an in-memory processing engine; data is only persisted in the MySQL InnoDB storage engine."

This sentence serves as the basic rule and guideline for implementing ShannonBase Rapid. This design document introduces the main changes that will be made and gives an overview of the architecture of ShannonBase.

The main design goals of ShannonBase will include:

  • Large Scale.
  • Real Time.
  • Highly Fresh Data Changes.
  • Strong Data Consistency.
  • Query Performance Capability.
  • Single System Interface.
  • Workload Isolation.

2.2 Overview of ShannonBase Rapid

ShannonBase is an integrated HTAP database that adopts a hybrid row-column store and in-memory computing. It is fully compatible with MySQL version 8.1.

(Figure: the architecture overview of ShannonBase.)

MySQL 8.0 provides the secondary engine framework, which can intelligently route TP workloads to the primary engine (InnoDB) and AP workloads to the secondary engine (Rapid), based on the workload type. (Figure: workload routing between the primary and secondary engines.)

2.3 Query Engine

Once all the new SQL syntaxes are enabled, the server understands all the SQL statements. When the server receives an SQL statement, lexical and grammatical processing turns the SQL string into parse-tree classes such as PT_create_table_stmt. We will not discuss how distributed query plans are generated in MPP in this document; instead, we focus on ONE node and explain what happens in that node when processing an SQL statement.

In MySQL 8.0, when the cost of a query plan on the primary engine is greater than the threshold defined by the new system variable (secondary_engine_cost_threshold), the query optimization engine will offload this workload to the secondary engine, ensuring optimal processing efficiency.

In the last phase of query optimization, the query engine adds optimize_secondary_engine to determine which engine the workload is routed to for execution, performing the following three steps:

Use the original processing way: unit->optimize().

Estimate the cost spent by each engine to process the query: current_query_cost and accumulate_current_query_cost.

If current_query_cost is greater than secondary_engine_cost_threshold, forward the workload to optimize_secondary_engine.

// Sketch of the decision: below the threshold, stay on the primary engine;
// otherwise hand the statement over to the secondary engine.
if (current_query_cost < variables.secondary_engine_cost_threshold)
  return false;
return optimize_secondary_engine(...);

In the future, after ShannonBase implements MPP, the way ShannonBase processes SQL statements will differ from centralized systems: a distributed query plan will be generated after query optimization is completed.

2.4 Execution Engine

As for the execution engine, a vectorized execution engine will be incorporated into ShannonBase, which will support parallel query and vectorized execution. A column-based AP system naturally lends itself to a vectorized execution engine, and vectorization has become a standard technique for improving the performance of AP workloads. RDBMS systems such as ClickHouse also use vectorized execution engines.

Two ways are available to achieve vectorized execution:

Use SIMD (single instruction, multiple data) to rewrite the execution plan, so that multiple tuples are fetched per iteration rather than one tuple per iteration.

Use GCC to generate vectorized code (auto-vectorization).

In addition, some aggregation functions such as count(), sum(), and avg() can be executed in parallel mode: after a query plan is dispatched to a data node through the management node, the execution engine executes the plan in parallel, dividing the job into sub-jobs that are executed simultaneously by threads. The framework of parallel execution is discussed in issue #xxxx; you can also refer to MySQL NDB Cluster.
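
As a rough illustration of the batch-at-a-time idea (a minimal, self-contained sketch, not ShannonBase code), summing a contiguous column in fixed-size batches lets GCC/Clang auto-vectorize the inner loop with SIMD instructions instead of paying per-row iterator overhead:

#include <cstddef>

// Minimal sketch of batch-at-a-time aggregation: the plain inner loop over a
// contiguous column buffer is auto-vectorizable, unlike a tuple-at-a-time iterator.
double sum_column_batch(const double *column, std::size_t rows) {
  constexpr std::size_t kBatch = 1024;             // values processed per iteration
  double sum = 0.0;
  std::size_t i = 0;
  for (; i + kBatch <= rows; i += kBatch)
    for (std::size_t j = 0; j < kBatch; ++j) sum += column[i + j];
  for (; i < rows; ++i) sum += column[i];          // tail smaller than one batch
  return sum;
}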

MySQL Cluster has a unique parallel query engine. It gives a consistent, consolidated transactional view of the entire distributed, partitioned dataset. This makes designing and programming scalable distributed applications straightforward: the developer can focus entirely on the application logic and does not need to think about data distribution.

2.5 Rapid, An In-Memory Column Store Secondary Engine

3 Internals

1 Getting TRX_ID from InnoDB

In ShannonBase, we have introduced a hidden column, db_trx_id, designed to store the trx_id of InnoDB records. When the secondary_load command is executed, a full table scan is performed on the table being loaded, and all visible records are loaded into the ShannonBase Rapid engine; the trx_id of each record is likewise loaded into the Rapid engine in columnar format. (Figure: the internal layout of a row in the InnoDB engine.)

When a row is loaded into the ShannonBase Rapid engine, it is divided into parts, and each field is stored independently. (Figure: per-field storage in the Rapid engine.)

In order to determine the visibility of a column value, a TRX_ID obtained from the row information is attached to it. An invisible field, Field_sys_trx_id, is introduced into ShannonBase to store the TRX_ID during the full table scan performed while executing alter table xxx secondary_load.

1.1 The ghost column, Field_sys_trx_id

/**
 Field_sys_trx_id represents the system column DB_TRX_ID; it is used to pass the
 trx_id value from InnoDB up to the SQL layer.
 * */
class Field_sys_trx_id : public Field_longlong {
 public:
  using Field_longlong::store;
  static const int PACK_LENGTH_TRX_ID = MAX_DB_TRX_ID_WIDTH;

  Field_sys_trx_id(uchar *ptr_arg, uint32 len_arg)
      : Field_longlong(ptr_arg, len_arg, nullptr, 0, 0, "DB_TRX_ID", 0, false)
  {
    stored_in_db = false;
    set_hidden(dd::Column::enum_hidden_type::HT_HIDDEN_SE);
    set_column_format(COLUMN_FORMAT_TYPE_DEFAULT);
    set_flag(NO_DEFAULT_VALUE_FLAG);
    stored_in_db = true;
  }
  Field_sys_trx_id(uint32 len_arg, bool is_nullable_arg,
                   const char *field_name_arg, bool unsigned_arg)
      : Field_longlong(nullptr, len_arg, is_nullable_arg ? &dummy_null_buffer : nullptr,
		      0, 0, field_name_arg, 0, unsigned_arg)
  {
    stored_in_db = false;
    set_hidden(dd::Column::enum_hidden_type::HT_HIDDEN_SE);
    set_column_format(COLUMN_FORMAT_TYPE_DEFAULT);
    set_flag(NO_DEFAULT_VALUE_FLAG);
    stored_in_db = true;
  }
  type_conversion_status store(longlong nr, bool unsigned_val) final;
  enum_field_types type() const final { return MYSQL_TYPE_DB_TRX_ID; }
  uint32 pack_length() const final { return PACK_LENGTH_TRX_ID; }
  void sql_type(String &str) const final;
  Field_sys_trx_id *clone(MEM_ROOT *mem_root) const final {
    assert(type() == MYSQL_TYPE_DB_TRX_ID);
    return new (mem_root) Field_sys_trx_id(*this);
  }
  longlong val_int() const final;
};

1.2 Adding and Filling the ghost column

In open table stage, MySQL will filling the TABLE_SHARE from dd. Therefore, in order to keep TRX_ID, we need add a ghost column, 'Field_sys_trx_id' to the end of all use defined field. In open_table_from_share, we allocate an extra Field to keep Field_sys_trx_id.

int  open_table_from_share(THD *thd, TABLE_SHARE *share, const char *alias,
                          uint db_stat, uint prgflag, uint ha_open_flags,
                          TABLE *outparam, bool is_create_table,
                          const dd::Table *table_def_param) {
   ...
  //Here we need an extra space to store 'ghost' column from table_share.
  if (!(field_ptr = root->ArrayAlloc<Field *>(share->fields + 1 + 1)))
    goto err; /* purecov: inspected */
  ...

After that, MySQL will fill all the allocated Fields information by fill_columns_from_dd. When all user defined columns are filled, we will add and fill the ghost column.

  bool is_in_upgrade = dd::upgrade_57::in_progress();
  bool is_system_objs = is_system_object(share->db.str, share->table_name.str);
  /* We don't add the extra field for system tables or during the upgrade phase. */
  if (!is_in_upgrade && !is_system_objs){
    Create_field db_trx_id_field;
    db_trx_id_field.sql_type = MYSQL_TYPE_DB_TRX_ID;
    db_trx_id_field.is_nullable = db_trx_id_field.is_zerofill = false;
    db_trx_id_field.is_unsigned = true;
    Field *sys_trx_id_field = make_field(db_trx_id_field, share, "MYSQL_TYPE_DB_TRX_ID",
                                                    MAX_DB_TRX_ID_WIDTH, rec_pos, null_pos, 0);
    sys_trx_id_field->set_field_index(field_nr);
    share->field[field_nr] = sys_trx_id_field;
    assert (sys_trx_id_field->pack_length_in_rec() == MAX_DB_TRX_ID_WIDTH);
    //rec_pos += share->field[field_nr]->pack_length_in_rec();
    field_nr++;
    assert(share->fields + 1 == field_nr);
  }

In open_table_from_share, the same extra slot shown above is allocated for the ghost column, and field_ptr is also set up there.


Another extra allocation is needed for the row buffer record[0], which is used to store row data loaded from InnoDB.

int  open_table_from_share(THD *thd, TABLE_SHARE *share, const char *alias,
                          uint db_stat, uint prgflag, uint ha_open_flags,
                          TABLE *outparam, bool is_create_table,
                          const dd::Table *table_def_param) {
  ...
  //in find_record_length(), MAX_DB_TRX_ID_WIDTH is already added.
  record = root->ArrayAlloc<uchar>(share->rec_buff_length * records +
                                   share->null_bytes);
  ...
static bool find_record_length(const dd::Table &table, size_t min_length,
                               TABLE_SHARE *share) {
  ...
    // Loop over columns, count nullable and bit fields and find record length.
  for (const dd::Column *col_obj : table.columns()) {
    // Skip hidden columns
    if (col_obj->is_se_hidden()) continue;

    // Check if the field may be NULL.
    if (col_obj->is_nullable()) share->null_fields++;

    // Check if this is a BIT field with leftover bits in the preamble, and
    // adjust record length accordingly.
    if (col_obj->type() == dd::enum_column_types::BIT) {
      bool treat_bit_as_char;
      if (col_obj->options().get("treat_bit_as_char", &treat_bit_as_char))
        return true;

      if (!treat_bit_as_char && (col_obj->char_length() & 7))
        leftover_bits += col_obj->char_length() & 7;
    }

    // Increment record length.
    share->reclength += column_pack_length(*col_obj);
    share->fields++;
  }

  // Find preamble length and add it to the total record length.
  share->null_bytes = (share->null_fields + leftover_bits + 7) / 8;
  share->last_null_bit_pos = (share->null_fields + leftover_bits) & 7;
  share->reclength += share->null_bytes;

  // Hack to avoid bugs with small static rows in MySQL.
  share->reclength = std::max<size_t>(min_length, share->reclength);
  // Because we need extra space to store the ghost column (db_trx_id),
  // its length, MAX_DB_TRX_ID_WIDTH, is added to the record length.
  share->reclength += calc_pack_length(MYSQL_TYPE_DB_TRX_ID, 0);
  share->stored_rec_length = share->reclength;
  ...

At this point, all (non-system) tables have an extra invisible column, Field_sys_trx_id.
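
As a rough worked example (assuming MAX_DB_TRX_ID_WIDTH is 8 bytes, which matches the 8-byte memcpy of an ib_id_t shown in section 1.4; the real constant may differ), consider a hypothetical table t1 (a INT NOT NULL, b INT):

// Hypothetical walk-through of find_record_length() for t1 (a INT NOT NULL, b INT),
// assuming MAX_DB_TRX_ID_WIDTH == 8:
//   user columns   : 4 (a) + 4 (b)                          =  8 bytes
//   null preamble  : 1 nullable field -> (1 + 7) / 8        =  1 byte
//   subtotal       : 8 + 1                                  =  9 bytes (then clamped by min_length)
//   ghost column   : calc_pack_length(MYSQL_TYPE_DB_TRX_ID) =  8 bytes
//   reclength      : 9 + 8                                  = 17 bytes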

1.3 Building the template for InnoDB

The template is used to quickly retrieve just those column values that MySQL needs in its processing. It is used by m_prebuilt when we start to scan data in InnoDB.

void ha_innobase::build_template(bool whole_row) {
  ...
  /** There are two places that use this template for acceleration: one is SELECT,
  the other is DML, where 'row_mysql_convert_row_to_innobase' uses it to build up the
  innobase row format. This field is needed in every query, so we don't use
  'build_template_needs_field()' to check it. The only thing to note is the difference
  between a secondary index and the primary key (clustered index). Append the ghost
  field template at the end.
  */
  Field* db_trx_id_field = table->field[n_fields];
  if (db_trx_id_field) {
    assert(db_trx_id_field->type() == MYSQL_TYPE_DB_TRX_ID);
    mysql_row_templ_t *templ [[maybe_unused]] = build_template_field(
      m_prebuilt, clust_index, index, table, db_trx_id_field, 1, 0);
  }
  ...

1.4 Filling TRX_ID in InnoDB

The InnoDB function row_sel_field_store_in_mysql_format transforms a row from the InnoDB format into the MySQL format.

/** Convert a non-SQL-NULL field from Innobase format to MySQL format. */
static inline void row_sel_field_store_in_mysql_format(
    byte *dest, const mysql_row_templ_t *templ, const dict_index_t *idx,
    ulint field, const byte *src, ulint len, ulint sec) {
  row_sel_field_store_in_mysql_format_func(
      dest, templ, idx, field,  src, len , sec);
}
void row_sel_field_store_in_mysql_format_func(
    byte *dest, const mysql_row_templ_t *templ, const dict_index_t *index,
    ulint field_no,  const byte *data,
    ulint len , ulint sec_field) {
   ...
    case DATA_SYS_CHILD:
    case DATA_SYS:
      /* These column types should never be shipped to MySQL. But, in Shannon,
         we will retrieve trx id to MySQL. */
      switch (prtype & DATA_SYS_PRTYPE_MASK) {
         case DATA_TRX_ID:
             id = mach_read_from_6(data);
             memcpy(dest, &id, sizeof(ib_id_t));
             break;
         case DATA_ROW_ID:
         case DATA_ROLL_PTR:
           assert(0);
           break;
      }
      break;
   ...
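
For reference, the DB_TRX_ID that InnoDB stores is a 6-byte, big-endian value; a stand-alone sketch of what mach_read_from_6 does (not the InnoDB implementation itself) is:

#include <cstdint>

// Stand-alone sketch: assemble a 6-byte, big-endian InnoDB transaction id into a
// 64-bit integer, which is what mach_read_from_6() returns before the memcpy above.
inline std::uint64_t read_trx_id_6(const unsigned char *b) {
  std::uint64_t id = 0;
  for (int i = 0; i < 6; ++i) id = (id << 8) | b[i];  // most significant byte first
  return id;
}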

1.5 SQL Layer

The ghost column is invisible to users; therefore, it is not exposed in the SQL layer.

1.6 Miscellaneous

The ghost column is only used with InnoDB, not with MyISAM.

2 ShannonBase In-Memory Column Store - Rapid

2.1 Overview

Taking the Oracle In-Memory (IM) column store as an example, we will implement our IM column store in a similar way to Oracle's.

(Figure: Oracle In-Memory column store.)

In order to provide HTAP services, just like MySQL HeatWave, an in-memory column store is incorporated to handle analytical workloads. Any AP workload is offloaded to the Rapid engine (the in-memory column store).

The architecture of MySQL HeatWave (copyright belongs to MySQL) is shown in the figure below.

(Figure: MySQL HeatWave architecture.)

2.2 Secondary engine

2.2.1 SQL syntaxes

First, the SQL syntaxes must be defined; these syntaxes are the basis of all future work. This chapter introduces the syntaxes for the following operations:

Create a table with secondary_engine=rapid.

Load data.

Process a query.

Monitor the system status.

To determine which SQL syntaxes to define, we must first consider why users would want to use ShannonBase: we want users to be able to port their workloads seamlessly from MySQL HeatWave to ShannonBase. Therefore, we adopt the SQL syntax that MySQL HeatWave uses (which is also used in MySQL version 8.1).

In addition, relevant changes will be implemented in the MySQL server layer. The following examples show some SQL syntaxes supported by ShannonBase.

Certain SQL grammars must be added in sql/sql_yacc.yy. The following uses the SELECT statement as an example:

select_stmt:
          query_expression
          {
            $$ = NEW_PTN PT_select_stmt($1);
          }
        | query_expression locking_clause_list
          {
            $$ = NEW_PTN PT_select_stmt(NEW_PTN PT_locking($1, $2),
                                        nullptr, true);
          }
        | query_expression_parens
          {
            $$ = NEW_PTN PT_select_stmt($1);
          }
        | select_stmt_with_into
        ;

After SQL syntaxes are added, new SQL items are created in yacc. These items will be processed in the MySQL server layer during query optimization.

Create a table with secondary_engine=rapid:

CREATE TABLE orders (id INT) SECONDARY_ENGINE = rapid;
ALTER TABLE orders SECONDARY_ENGINE = rapid;

Compared with an ordinary CREATE TABLE statement, the table definition adds the SECONDARY_ENGINE keyword that was introduced in MySQL 8.0. The original CREATE statement syntax used in MySQL is:

create_table_stmt:
          CREATE opt_temporary TABLE_SYM opt_if_not_exists table_ident
          '(' table_element_list ')' opt_create_table_options_etc
          {
            $$= NEW_PTN PT_create_table_stmt(YYMEM_ROOT, $2, $4, $5,
                                             $7,
                                             $9.opt_create_table_options,
                                             $9.opt_partitioning,
                                             $9.on_duplicate,
                                             $9.opt_query_expression);
          }

opt_create_table_options_etc:
          create_table_options
          opt_create_partitioning_etc
          {
            $$= $2;
            $$.opt_create_table_options= $1;
          }
        | opt_create_partitioning_etc
        ;

create_table_option:
          ENGINE_SYM opt_equal ident_or_text
          {
            $$= NEW_PTN PT_create_table_engine_option(to_lex_cstring($3));
          }
        | SECONDARY_ENGINE_SYM opt_equal NULL_SYM
          {
            $$= NEW_PTN PT_create_table_secondary_engine_option();
          }
        | SECONDARY_ENGINE_SYM opt_equal ident_or_text
          {
            $$= NEW_PTN PT_create_table_secondary_engine_option(to_lex_cstring($3));
          }

As the definition above shows, SECONDARY_ENGINE_SYM is already handled in create_table_option, and it must also be handled in class PT_create_table_stmt. For more information, refer to the design of SQL syntax support.

2.2.2 Load Data/Unload Data

This part focuses mainly on how data is loaded from InnoDB into the in-memory column store; the corresponding issue tracks all the details.

After a table is created with a secondary engine, the next step is to load data into that secondary engine. Once all the required data has been loaded, query processing can begin. The load operation is performed via an ALTER TABLE statement with the SECONDARY_LOAD option.

ALTER TABLE tb_name SECONDARY_LOAD;
/**
  Represents ALTER TABLE SECONDARY_LOAD/SECONDARY_UNLOAD statements.
*/
class Sql_cmd_secondary_load_unload final : public Sql_cmd_common_alter_table {
};

/**
 * Loads a table from its primary engine into its secondary engine.
 *
 * This call assumes that MDL_SHARED_NO_WRITE/SECLOAD_SCAN_START_MDL lock
 * on the table have been acquired by caller. During its execution it may
 * downgrade this lock to MDL_SHARED_UPGRADEABLE/SECLOAD_PAR_SCAN_MDL.
 *
 * @param thd              Thread handler.
 * @param table            Table in primary storage engine.
 *
 * @return True if error, false otherwise.
 */
static bool secondary_engine_load_table(THD *thd, const TABLE &table) {
};

class ha_tianmu_secondary : public handler {
 public:
  ha_tianmu_secondary(handlerton *hton, TABLE_SHARE *table_share);

 private:
  int create(const char *, TABLE *, HA_CREATE_INFO *, dd::Table *) override;

  int open(const char *name, int mode, unsigned int test_if_locked,
           const dd::Table *table_def) override;

  int close() override { return 0; }

  int rnd_init(bool) override { return 0; }

  int rnd_next(unsigned char *) override { return HA_ERR_END_OF_FILE; }

  int rnd_pos(unsigned char *, unsigned char *) override {
    return HA_ERR_WRONG_COMMAND;
  }

  int info(unsigned int) override;

  ha_rows records_in_range(unsigned int index, key_range *min_key,
                           key_range *max_key) override;

  void position(const unsigned char *) override {}

  unsigned long index_flags(unsigned int, unsigned int, bool) const override;

  THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to,
                             thr_lock_type lock_type) override;

  Table_flags table_flags() const override;

  const char *table_type() const override { return "TIANMU_RAPID"; }

  int load_table(const TABLE &table) override;

  int unload_table(const char *db_name, const char *table_name,
                   bool error_if_not_loaded) override;

  THR_LOCK_DATA m_lock;
};

In sql_table.cc, the function Sql_cmd_secondary_load_unload::mysql_secondary_load_or_unload is used to load/unload data to/from the secondary engine. When the operation is done, the meta information of the loaded table is stored in the performance_schema.rpd_xxx tables, which are used to monitor the status of the secondary engine.

static bool secondary_engine_load_table(THD *thd, const TABLE &table) {
  ...
  // Load table from primary into secondary engine and add to change
  // propagation if that is enabled.
  if (handler->ha_load_table(table)){
    my_error(ER_SECONDARY_ENGINE, MYF(0),
             "secondary storage engine load table failed");
    return true;
  }

  // Add the meta info into 'rpd_column_id', 'rpd_columns', etc., so that we can
  // check whether a table has been loaded or not. Here we don't iterate until
  // field_ptr == nullptr, because of the ghost column at the end.
  uint32 field_count = table.s->fields;
  Field *field_ptr = nullptr;
  for (uint32 index = 0; index < field_count; index++) {
    field_ptr = *(table.field + index);
    // Skip columns marked as NOT SECONDARY.
    if ((field_ptr)->is_flag_set(NOT_SECONDARY_FLAG)) continue;

    ShannonBase::rpd_columns_info row_rpd_columns;
    strncpy(row_rpd_columns.schema_name, table.s->db.str, table.s->db.length);
    row_rpd_columns.table_id = static_cast<uint>(table.s->table_map_id.id());
    row_rpd_columns.column_id = field_ptr->field_index();
    strncpy(row_rpd_columns.column_name, field_ptr->field_name,
            strlen(field_ptr->field_name));
    strncpy(row_rpd_columns.table_name, table.s->table_name.str,
            strlen(table.s->table_name.str));
    std::string key_name (table.s->db.str);
    key_name += table.s->table_name.str;
    key_name += field_ptr->field_name;
    ShannonBase::Compress::Dictionary* dict =
      ShannonBase::Imcs::Imcs::get_instance()->get_cu(key_name)->get_header()->m_local_dict.get();
    if (dict)
      row_rpd_columns.data_dict_bytes = dict->content_size();
    row_rpd_columns.data_placement_index = 0;
  ...

2.3 Rapid, In-memory Column Store.

2.3.1 Overview

The first step in processing AP workloads is to load the full base data into the Rapid engine; Rapid then starts the propagation operation automatically. When a table is loaded from InnoDB into Rapid, some meta information is also loaded into catalog tables such as performance_schema.rpd_column, performance_schema.rpd_column_id, etc. A background thread is launched at system start; it monitors the redo log, and when a new DML operation completes, it parses the incoming redo log and applies the changes to the IMCS.

When the load statement is executed, it performs the load operation. Overall, much like an insert into xxx select xxx statement, the system first scans the table, either via an index or a full table scan.

1: It scans the target table, which is usually an InnoDB table. One problem must be clarified first: which data is visible to the operation and which is not. We define that only committed data is visible to the scan operation; in other words, the table scan runs in an auto-committed transaction at the READ COMMITTED isolation level.

Rows inserted while the table scan is running are not seen by the operation; in fact this cannot happen, because an exclusive MDL lock is taken to prevent new rows from being inserted into the table while the load operation is running.

2: Besides the core functions, some system parameters are needed to monitor the load operation, for example how much data has been loaded, how much remains, and so on. Some parallelism-related parameters will also be introduced, such as the degree of parallelism (DOP). Therefore, a set of system parameters will be added.

2.3.2 Column Data Format

Each column is organized as a file when it is flushed to disk. The in-memory format of columns is called an IMCU (In-Memory Column Unit). An IMCU consists of CUs (Column Units); a CU has two parts: (1) a header with meta information, and (2) the data, which is further divided into a number of chunks. (Figures: IMCU, CU, and chunk layout.)

All chunks are linked. The address of the first chunk can be found in the CU's header, and each chunk also contains the address of the next chunk. A chunk consists of a header and data: the header contains the meta information of the chunk, and the data part is where the real data is located. The first CU is obtained from the IMCS; an IMCS instance has a header which holds a pointer to the address of the IMCU. A traversal sketch is shown below.
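
A minimal traversal sketch (using the Cu and Chunk interfaces defined later in this chapter; process_rows is a hypothetical callback standing in for the real read path):

// Sketch only: walk every chunk of a Cu by following the m_next_chunk link kept in
// each Chunk_header, starting from the first chunk. process_rows() is hypothetical;
// [get_base(), get_data()) is the range of records already written into a chunk.
void scan_cu(Cu *cu) {
  Chunk *chunk = cu->get_first_chunk();
  while (chunk != nullptr) {
    process_rows(chunk->get_base(), chunk->get_data());
    chunk = chunk->get_header()->m_next_chunk;  // nullptr on the last chunk
  }
}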

When new data comes in, it is stored in order; insertion sort can be used to keep it ordered, and binary search is used to find data. But if the data is stored in a compressed format, a different algorithm is needed to locate data within the compressed representation.

Now we go deeper and give more specific details of the data layout. Note that every value written into a CU has a transaction id attached to it, marking which transaction it belongs to. (Figure: record layout with the attached transaction id.)

All variable-length data, such as text and string values, is encoded into a double-typed id in the CU's local dictionary. Each value gets a double-typed id when it is loaded into Rapid.

The length of a record in Rapid is aligned to 32 bytes, as the constants below show.

constexpr uint8 SHANNON_INFO_BYTE_OFFSET = 0;
constexpr uint8 SHANNON_INFO_BYTE_LEN = 1;
constexpr uint8 SHANNON_TRX_ID_BYTE_OFFSET = 1;
constexpr uint8 SHANNON_TRX_ID_BYTE_LEN = 8;
constexpr uint8 SHANNON_ROW_ID_BYTE_OFFSET = 9;
constexpr uint8 SHANNON_ROWID_BYTE_LEN = 8;
constexpr uint8 SHANNON_SUMPTR_BYTE_OFFSET = 17;
constexpr uint8 SHANNON_SUMPTR_BYTE_LEN = 4;
constexpr uint8 SHANNON_DATA_BYTE_OFFSET = 21;
constexpr uint8 SHANNON_DATA_BYTE_LEN = 8;
constexpr uint8 SHANNON_ROW_TOTAL_LEN_UNALIGN =
    SHANNON_INFO_BYTE_LEN + SHANNON_TRX_ID_BYTE_LEN + SHANNON_ROWID_BYTE_LEN +
    SHANNON_SUMPTR_BYTE_LEN + SHANNON_DATA_BYTE_LEN;

constexpr uint8 SHANNON_ROW_TOTAL_LEN =
    ALIGN_WORD(SHANNON_ROW_TOTAL_LEN_UNALIGN, 8);
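
Putting the constants above together (assuming ALIGN_WORD rounds its first argument up to a multiple of the second):

// 1 (info) + 8 (trx id) + 8 (row id) + 4 (sum ptr) + 8 (data) = 29 bytes per record,
// and ALIGN_WORD(29, 8) rounds that up to 32, which is why a Rapid record is 32 bytes.
static_assert(SHANNON_ROW_TOTAL_LEN_UNALIGN == 29, "info + trx id + row id + sum ptr + data");
static_assert(SHANNON_ROW_TOTAL_LEN == 32, "29 rounded up to an 8-byte boundary");
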
  • Imcs

A singleton: there is only ONE instance in the Rapid engine. It represents an in-memory store instance; we can use it to perform full table scans or index scans to get data. It contains several CUs; a CU corresponds to a field of a loaded table, and a CU consists of a number of chunks.

class Imcs : public MemoryObject {
 public:
  using Cu_map_t = std::unordered_map<std::string, std::unique_ptr<Cu>>;
  using Imcu_map_t = std::multimap<std::string, std::unique_ptr<Imcu>>;
  inline static Imcs *get_instance() {
    std::call_once(one, [&] { m_instance = new Imcs(); });
    return m_instance;
  }
  // initialize the imcs.
  uint initialize();
  // deinitialize the imcs.
  uint deinitialize();
  // gets initialized flag.
  inline bool initialized() {
    return (m_inited == handler::NONE) ? false : true;
  }
  // scan oper initialization.
  uint rnd_init(bool scan);
  // end of scanning
  uint rnd_end();
  // writes a row of a column in.
  uint write_direct(ShannonBase::RapidContext *context, Field *fields);
  // reads the data by a rowid into a field.
  uint read_direct(ShannonBase::RapidContext *context, Field *field);
  // reads the data by a rowid into buffer.
  uint read_direct(ShannonBase::RapidContext *context, uchar *buffer);
  uint read_batch_direct(ShannonBase::RapidContext *context, uchar *buffer);
  // deletes the data by a rowid
  uint delete_direct(ShannonBase::RapidContext *context, Field *field);
  // deletes all the data.
  uint delete_all_direct(ShannonBase::RapidContext *context);
  Cu *get_cu(std::string &key);
  void add_cu(std::string key, std::unique_ptr<Cu> &cu);
  ha_rows get_rows(TABLE *source_table);

 private:
  // make ctor and dctor private.
  Imcs();
  virtual ~Imcs();

  Imcs(Imcs &&) = delete;
  Imcs(Imcs &) = delete;
  Imcs &operator=(const Imcs &) = delete;
  Imcs &operator=(const Imcs &&) = delete;

 private:
  // imcs instance
  static Imcs *m_instance;
  // initialization flag, only once.
  static std::once_flag one;
  // cus in this imcs. <db+table+col, cu*>
  Cu_map_t m_cus;
  // imcu in this imcs. <db name + table name, imcu*>
  Imcu_map_t m_imcus;
  // used to keep all allocated imcus. key string: db_name + table_name.
  // initialization flag.
  std::atomic<uint8> m_inited{handler::NONE};
};

The class member m_cus is a map, defined as follows:

std::unordered_map<std::string, std::unique_ptr<Cu>>

Its key is constructed from the database name, the table name, and the column name, while m_imcus is a multimap keyed by the database name plus the table name. When writing a row into Rapid, every field writes its own data independently, as the sketch below shows.
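
A small sketch of how the per-column key is built and used, mirroring the key_name construction in secondary_engine_load_table shown earlier (table and field are assumed to be the TABLE and Field being loaded):

// The Cu map key is "db_name + table_name + column_name", the same string that
// secondary_engine_load_table() builds as key_name before calling get_cu().
std::string key(table->s->db.str);
key += table->s->table_name.str;
key += field->field_name;
auto *cu = ShannonBase::Imcs::Imcs::get_instance()->get_cu(key);  // nullptr if not loaded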

  • Cu

Column Unit: it represents a field of a loaded table. All the data of that field is stored in its CU.

class Cu : public MemoryObject {
 public:
  using Cu_header = struct alignas(CACHE_LINE_SIZE) Cu_header_t {
   public:
    // whether the field is nullable or not.
    bool m_nullable{false};
    // encoding type. pls ref to:
    // https://dev.mysql.com/doc/heatwave/en/mys-hw-varlen-encoding.html
    // https://dev.mysql.com/doc/heatwave/en/mys-hw-dictionary-encoding.html
    Compress::Encoding_type m_encoding_type{Compress::Encoding_type::NONE};
    // the index of field.
    uint16 m_field_no{0};
    // field type of this cu.
    enum_field_types m_cu_type{MYSQL_TYPE_TINY};
    // local dictionary.
    std::unique_ptr<Compress::Dictionary> m_local_dict;
    // statistics info.
    std::atomic<double> m_max{0}, m_min{0}, m_middle{0}, m_median{0}, m_avg{0},
        m_sum{0};
    std::atomic<uint64> m_rows{0};
  };

  explicit Cu(Field *field);
  virtual ~Cu();
  Cu(Cu &&) = delete;
  Cu &operator=(Cu &&) = delete;

  // initialization; these are for internal use.
  uint rnd_init(bool scan);
  // End of Rnd scan
  uint rnd_end();
  // writes the data into this chunk; an unspecified length means it is calculated by the chunk.
  uchar *write_data_direct(ShannonBase::RapidContext *context, uchar *data,
                           uint length = 0);
  // reads the data by from address.
  uchar *read_data_direct(ShannonBase::RapidContext *context, uchar *buffer);
  // reads the data by rowid to buffer.
  uchar *read_data_direct(ShannonBase::RapidContext *context, uchar *rowid,
                          uchar *buffer);
  // deletes the data by rowid
  uchar *delete_data_direct(ShannonBase::RapidContext *context, uchar *rowid);
  // deletes all
  uchar *delete_all_direct();
  // updates the data with rowid with the new data.
  uchar *update_data_direct(ShannonBase::RapidContext *context, uchar *rowid,
                            uchar *data, uint length = 0);
  // flush the data to disk. by now, we cannot impl this part.
  uint flush_direct(ShannonBase::RapidContext *context, uchar *from = nullptr,
                    uchar *to = nullptr);
  inline Compress::Dictionary *local_dictionary() const {
    return m_header->m_local_dict.get();
  }
  Cu_header *get_header() { return m_header.get(); }
  // gets the base address of chunks.
  uchar *get_base();
  void add_chunk(std::unique_ptr<Chunk> &chunk);
  inline Chunk *get_chunk(uint chunkid) {
    return (chunkid < m_chunks.size()) ? m_chunks[chunkid].get() : nullptr;
  }
  inline Chunk *get_first_chunk() { return get_chunk(0); }
  inline Chunk *get_last_chunk() { return get_chunk(m_chunks.size() - 1); }
  inline size_t get_chunk_nums() { return m_chunks.size(); }

  uchar *seek(size_t offset);
  inline Index *get_index() { return m_index.get(); }

 private:
  uint m_magic{SHANNON_MAGIC_CU};
  // protects the header.
  std::mutex m_header_mutex;
  // header info of this Cu.
  std::unique_ptr<Cu_header> m_header{nullptr};
  // chunks in this cu.
  std::vector<std::unique_ptr<Chunk>> m_chunks;
  // current chunk read.
  std::atomic<uint32> m_current_chunk_id{0};
  // index of Cu
  std::unique_ptr<Index> m_index{nullptr};
};
  • Chunk

The basic unit to store the data.

class Chunk : public MemoryObject {
 public:
  using Chunk_header = struct alignas(CACHE_LINE_SIZE) Chunk_header_t {
   public:
    // is null or not.
    bool m_null{false};
    // whether it is var type or not
    bool m_varlen{false};
    // data type in mysql.
    enum_field_types m_chunk_type{MYSQL_TYPE_TINY};
    // field no.
    uint16 m_field_no{0};
    // pointer to the next or prev.
    Chunk *m_next_chunk{nullptr}, *m_prev_chunk{nullptr};
    // statistics data.
    std::atomic<double> m_max{0}, m_min{0}, m_median{0}, m_middle{0}, m_avg{0},
        m_sum{0};
    std::atomic<uint64> m_rows{0};
  };
  explicit Chunk(Field *field);
  virtual ~Chunk();
  Chunk(Chunk &&) = delete;
  Chunk &operator=(Chunk &&) = delete;

  Chunk_header *get_header() {
    std::scoped_lock lk(m_header_mutex);
    return m_header.get();
  }
  // initial the read opers.
  uint rnd_init(bool scan);
  // End of Rnd scan.
  uint rnd_end();
  // writes the data into this chunk; an unspecified length means it is calculated by the chunk.
  uchar *write_data_direct(ShannonBase::RapidContext *context, uchar *data,
                           uint length = 0);
  // reads the data by from address .
  uchar *read_data_direct(ShannonBase::RapidContext *context, uchar *buffer);
  // reads the data by rowid.
  uchar *read_data_direct(ShannonBase::RapidContext *context, uchar *rowid,
                          uchar *buffer);
  // deletes the data by rowid.
  uchar *delete_data_direct(ShannonBase::RapidContext *context, uchar *rowid);
  // deletes all.
  uchar *delete_all_direct();
  // updates the data.
  uchar *update_date_direct(ShannonBase::RapidContext *context, uchar *rowid,
                            uchar *data, uint length = 0);
  // flush the data to disk. by now, we cannot impl this part.
  uint flush_direct(RapidContext *context, uchar *from = nullptr,
                    uchar *to = nullptr);
  // the start location of the chunk, where the data is written from.
  inline uchar *get_base() const { return m_data_base; }
  // the end loc of chunk. is base + chunk_size
  inline uchar *get_end() const { return m_data_end; }
  // gets the max valid loc of current the data has written to.
  inline uchar *get_data() const { return m_data; }
  bool is_full() { return (m_data == m_data_end) ? true : false; }
  ha_rows records_in_range(ShannonBase::RapidContext *context, double &min_key,
                           double &max_key);

  uchar *where(uint offset);
  uchar *seek(uint offset);

 private:
  std::mutex m_header_mutex;
  std::unique_ptr<Chunk_header> m_header{nullptr};
  // started or not
  std::atomic<uint8> m_inited;
  std::mutex m_data_mutex;
  /** the base pointer of chunk, and the current pos of data. whether data
   * should be in order or not */
  uchar *m_data_base{nullptr};
  // current write position, where the next data goes.
  std::atomic<uchar *> m_data{nullptr};
  // pointer of cursor, which used for reading.
  std::atomic<uchar *> m_data_cursor{nullptr};
  // end address of memory, to determine whether the memory is full or not.
  uchar *m_data_end{nullptr};
  // the check sum of this chunk. it used to do check when the data flush to
  // disk.
  uint m_check_sum{0};
  // magic number of the chunk.
  uint m_magic = SHANNON_MAGIC_CHUNK;
};

The constructor allocates memory and sets up the pointers.

Chunk::Chunk(Field *field) {
  ut_ad(field);
  ut_ad(ShannonBase::SHANNON_CHUNK_SIZE < rapid_memory_size);
  m_inited = handler::NONE;

  m_header = std::make_unique<Chunk_header> ();
  if (!m_header.get()) {
    assert(false);
    return ;
  }

  /** For m_data_base we use the same PSI key as the buffer pool used for InnoDB
   * page allocation. We use ut::xxx to manage memory allocation and freeing, as
   * innobase does. In the SQL layer, MEM_ROOT is used for memory management; in
   * the IMCS, all modules use ut:: for memory operations. It is an efficient
   * memory utility and has been initialized in ha_innodb.cc: ut_new_boot(); */
  if (likely(rapid_allocated_mem_size + ShannonBase::SHANNON_CHUNK_SIZE <=
      rapid_memory_size)) {
    m_data_base = static_cast<uchar *>(ut::aligned_alloc(ShannonBase::SHANNON_CHUNK_SIZE,
        ALIGN_WORD(ShannonBase::SHANNON_CHUNK_SIZE, SHANNON_ROW_TOTAL_LEN)));

    if (unlikely(!m_data_base)) {
      my_error(ER_SECONDARY_ENGINE_PLUGIN, MYF(0), "Chunk allocation failed");
      return;
    }
    m_data = m_data_base;
    m_data_cursor = m_data_base;
    m_data_end =
        m_data_base + static_cast<ptrdiff_t>(ShannonBase::SHANNON_CHUNK_SIZE);
    rapid_allocated_mem_size += ShannonBase::SHANNON_CHUNK_SIZE;

    m_header->m_avg = 0;
    m_header->m_sum = 0;
    m_header->m_rows = 0;

    m_header->m_max = std::numeric_limits<long long>::lowest();
    m_header->m_min = std::numeric_limits<long long>::max();
    m_header->m_median = std::numeric_limits<long long>::lowest();
    m_header->m_middle = std::numeric_limits<long long>::lowest();

    m_header->m_field_no = field->field_index();
    m_header->m_chunk_type = field->type();
    m_header->m_null = field->is_nullable();
    switch (m_header->m_chunk_type) {
      case MYSQL_TYPE_VARCHAR:
      case MYSQL_TYPE_BIT:
      case MYSQL_TYPE_JSON:
      case MYSQL_TYPE_TINY_BLOB:
      case MYSQL_TYPE_BLOB:
      case MYSQL_TYPE_MEDIUM_BLOB:
      case MYSQL_TYPE_VAR_STRING:
      case MYSQL_TYPE_STRING:
      case MYSQL_TYPE_GEOMETRY:
        m_header->m_varlen = true;
        break;
      default:
        m_header->m_varlen = false;
        break;
    }
  } else {
    my_error(ER_SECONDARY_ENGINE_PLUGIN, MYF(0),
             "Rapid allocated memory exceeds over the maximum");
    return;
  }
}

2.3.3 ART index

The Adaptive Radix Tree (ART) from https://github.com/armon/libart is introduced into Rapid. For more information about ART, please refer to the corresponding papers. A minimal usage sketch of the underlying library follows.
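
A minimal usage sketch of the libart C API (art_tree_init / art_insert / art_search, as documented in the libart repository; the actual integration inside Rapid may differ):

extern "C" {
#include "art.h"  // from https://github.com/armon/libart
}

// Sketch: index a key (e.g. the encoded primary key of a loaded row) and point it at
// the row's location inside a chunk; a later lookup returns that location.
void art_example(const unsigned char *key, int key_len, void *row_addr) {
  art_tree tree;
  art_tree_init(&tree);
  art_insert(&tree, key, key_len, row_addr);      // value is an opaque pointer
  void *found = art_search(&tree, key, key_len);  // returns row_addr, or NULL
  (void)found;
  art_tree_destroy(&tree);
}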

2.3.4 Full Table Scan

The basic table operation of the Rapid engine is the full table scan, which has been implemented.

class ha_rapid : public handler {
  ...
  int rnd_init(bool) override;
  int rnd_next(unsigned char *) override;
  int rnd_end() override;
  ...

In order to support performing a table scan over every CU, ImcsReader and CuView are introduced into Rapid.

A CuView is a view of a Cu; it provides index and full-table-scan operations that traverse the Cu, check the data, and return the rows that satisfy the conditions. Taking a sequential full table scan as an example, it checks every chunk, tests whether the visibility condition is met, and, if so, returns the row.

int CuView::read(ShannonBaseContext *context, uchar *buffer, size_t length) {
  DBUG_TRACE;
  ut_a(context && buffer);
  RapidContext *rpd_context = dynamic_cast<RapidContext *>(context);
  if (!m_source_cu) return HA_ERR_END_OF_FILE;

  // gets the chunks belongs to this cu.
  auto chunk = m_source_cu->get_chunk(m_rnd_chunk_rid);
  while (chunk) {
    ptrdiff_t diff = m_rnd_rpos - chunk->get_data();
    if (unlikely(diff >= 0)) {  // to the next
      m_rnd_chunk_rid.fetch_add(1,
          std::memory_order::memory_order_acq_rel);
      chunk = m_source_cu->get_chunk(m_rnd_chunk_rid);
      if (!chunk) return HA_ERR_END_OF_FILE;
      m_rnd_rpos.store(chunk->get_base(), std::memory_order_acq_rel);
      continue;
    }

    uint8 info =
        *((uint8 *)(m_rnd_rpos + SHANNON_INFO_BYTE_OFFSET));  // info byte
    uint64 trxid =
        *((uint64 *)(m_rnd_rpos + SHANNON_TRX_ID_BYTE_OFFSET));  // trxid bytes
    // visibility check first.
    table_name_t name{const_cast<char *>(m_source_table->s->table_name.str)};
    ReadView *read_view = trx_get_read_view(rpd_context->m_trx);
    ut_ad(read_view);
    if (!read_view->changes_visible(trxid, name) ||
        (info & DATA_DELETE_FLAG_MASK)) {  // invisible and deleted
      // TODO: traverse the change link to get the visible version of the data.
      m_rnd_rpos.fetch_add(SHANNON_ROW_TOTAL_LEN,
                           std::memory_order_acq_rel);  // to the next value.
      diff = m_rnd_rpos - chunk->get_data();
      if (diff >= 0) {
        m_rnd_chunk_rid.fetch_add(1, std::memory_order::memory_order_seq_cst);
        chunk = m_source_cu->get_chunk(m_rnd_chunk_rid);
        if (!chunk) return HA_ERR_END_OF_FILE;
        m_rnd_rpos.store(chunk->get_base(), std::memory_order_acq_rel);
        continue;
      }  // no data here; stay within this chunk.
      continue;  // re-check the visibility of the record we advanced to.
    }

    memcpy(buffer, m_rnd_rpos, SHANNON_ROW_TOTAL_LEN);
    m_rnd_rpos.fetch_add(SHANNON_ROW_TOTAL_LEN,
                         std::memory_order_acq_rel);  // go to the next.
    return 0;
  }
  return HA_ERR_END_OF_FILE;
}

2.3.5 Index Table Scan

Like the full table scan, Rapid also supports index scans by implementing the following interfaces.

class ha_rapid : public handler {
 public:
  int index_init(uint keynr, bool sorted) override;

  int index_end() override;

  int index_read(uchar *buf, const uchar *key, uint key_len,
                 ha_rkey_function find_flag) override;

  int index_read_last(uchar *buf, const uchar *key, uint key_len) override;

  int index_next(uchar *buf) override;

  int index_next_same(uchar *buf, const uchar *key, uint keylen) override;

  int index_prev(uchar *buf) override;

  int index_first(uchar *buf) override;

  int index_last(uchar *buf) override;

For details, please refer to the corresponding code.

2.3.6 Index Condition pushdown to Rapid

ShannonBase Rapid supports index condition pushdown (ICP), since indexes are enabled in Rapid.

handler::Table_flags ha_rapid::table_flags() const {
  ulong flags = HA_READ_NEXT | HA_READ_PREV | HA_READ_ORDER | HA_READ_RANGE |
                HA_KEYREAD_ONLY | HA_DO_INDEX_COND_PUSHDOWN;
  return flags;
}

int ha_rapid::key_cmp(KEY_PART_INFO *key_part, const uchar *key, uint key_length) {
  uint store_length;

  for (const uchar *end = key + key_length; key < end;
       key += store_length, key_part++) {
    int cmp;
    const int res = (key_part->key_part_flag & HA_REVERSE_SORT) ? -1 : 1;
    store_length = key_part->store_length;
    if (key_part->null_bit) {
      /* This key part allows null values; NULL is lower than everything */
      const bool field_is_null = key_part->field->is_null();
      if (*key)  // If range key is null
      {
        /* the range is expecting a null value */
        if (!field_is_null) return res;  // Found key is > range
        /* null -- exact match, go to next key part */
        continue;
      } else if (field_is_null)
        return -res;  // NULL is less than any value
      key++;          // Skip null byte
      store_length--;
    }
    if ((cmp = key_part->field->key_cmp(key, key_part->length)) < 0)
      return -res;
    if (cmp > 0) return res;
  }
  return 0;  // Keys are equal
}

int ha_rapid::compare_key_icp(const key_range *range) {
  int cmp;
  if (!range) return 0;  // no max range
  cmp = key_cmp(range_key_part, range->key, range->length);
  if (!cmp) cmp = get_key_comp_result();
  if (get_range_scan_direction() == RANGE_SCAN_DESC) cmp = -cmp;
  return cmp;
}

unsigned long ha_rapid::index_flags(unsigned int idx, unsigned int part,
                                   bool all_parts) const {
  //here, we support the same index flag as primary engine.
  const handler *primary = ha_get_primary_handler();
  const unsigned long primary_flags =
      primary == nullptr ? 0 : primary->index_flags(idx, part, all_parts);

  if(pushed_idx_cond) {}
  // Inherit the following index flags from the primary handler, if they are
  // set:
  //
  // HA_READ_RANGE - to signal that ranges can be read from the index, so that
  // the optimizer can use the index to estimate the number of rows in a range.
  //
  // HA_KEY_SCAN_NOT_ROR - to signal if the index returns records in rowid
  // order. Used to disable use of the index in the range optimizer if it is not
  // in rowid order.

  return ((HA_READ_NEXT | HA_READ_PREV | HA_READ_ORDER |
           HA_KEYREAD_ONLY | HA_DO_INDEX_COND_PUSHDOWN |
           HA_READ_RANGE | HA_KEY_SCAN_NOT_ROR) & primary_flags);
}

Item *ha_rapid::idx_cond_push(uint keyno, Item *idx_cond)
{
  DBUG_TRACE;
  ut_ad(keyno != MAX_KEY);
  ut_ad(idx_cond != nullptr);

  pushed_idx_cond = idx_cond;
  pushed_idx_cond_keyno = keyno;
  in_range_check_pushed_down = true;

  /* We will evaluate the condition entirely */
  return nullptr;
}

/**
Index Condition Pushdown interface implementation */

/** Shannon Rapid index push-down condition check
 @return ICP_NO_MATCH, ICP_MATCH, or ICP_OUT_OF_RANGE */
ICP_RESULT
shannon_rapid_index_cond(ha_rapid *h) /*!< in/out: pointer to ha_rapid */
{
  DBUG_TRACE;

  assert(h->pushed_idx_cond);
  assert(h->pushed_idx_cond_keyno != MAX_KEY);

  if (h->end_range && h->compare_key_icp(h->end_range) > 0) {
    /* caller should return HA_ERR_END_OF_FILE already */
    return ICP_OUT_OF_RANGE;
  }

  return h->pushed_idx_cond->val_int() ? ICP_MATCH : ICP_NO_MATCH;
}

2.4 Storage Interface Of Rapid

The Rapid engine handler provides the interface to the SQL layer. It implements the full table scan (rnd_next) and the index scan (index_read, index_next), and ICP is also supported in the Rapid engine.

class ha_rapid : public handler {
  ...
}

2.5 Load/Unload Table in Secondary Engine.

The first step in using Rapid is to load InnoDB tables into Rapid by executing the alter table xxx secondary_load command. When this command is executed, it invokes the following code in ha_shannon_rapid.cc.

int ha_rapid::load_table(const TABLE &table_arg) {

In ha_rapid::load_table, the first step is to check whether the table has already been loaded. If it has, return; otherwise, proceed with the load.

  if (shannon_loaded_tables->get(table_arg.s->db.str, table_arg.s->table_name.str) != nullptr) {
    std::ostringstream err;
    err << table_arg.s->db.str << "." <<table_arg.s->table_name.str << " already loaded";
    my_error(ER_SECONDARY_ENGINE_LOAD, MYF(0), err.str().c_str());
    return HA_ERR_GENERIC;
  }

After that, the type of every field is checked to see whether it is supported.

  for (uint idx =0; idx < table_arg.s->fields; idx ++) {
    Field* key_field = *(table_arg.field + idx);
    if (!Utils::Util::is_support_type(key_field->type())) {
      std::ostringstream err;
      err << key_field->field_name << " type not allowed";
      my_error(ER_SECONDARY_ENGINE_LOAD, MYF(0), err.str().c_str());
      return HA_ERR_GENERIC;
    }
  }

In the third step, it constructs the primary key, which is used as the key for the ART index. Before this, it checks whether the table being loaded has a primary key, since Rapid needs a primary key to be built.

It checks that the primary key is not missing, and the primary key MUST NOT be marked as NOT SECONDARY.

  context.m_extra_info.m_keynr = 0;
  auto key = (table_arg.key_info + 0);
  for (uint keyid =0; keyid < key->user_defined_key_parts; keyid++) {
    if (key->key_part[keyid].field->is_flag_set(NOT_SECONDARY_FLAG)) {
      my_error(ER_RAPID_DA_PRIMARY_KEY_CAN_NOT_HAVE_NOT_SECONDARY_FLAG, MYF(0),
               table_arg.s->db.str, table_arg.s->table_name.str);
      return HA_ERR_GENERIC;
    }
  }

In the final step, an InnoDB table scan fetches each row and writes it into the corresponding CUs.

  while ((tmp = table_arg.file->ha_rnd_next(table_arg.record[0])) != HA_ERR_END_OF_FILE) {
   /*** ha_rnd_next can return HA_ERR_RECORD_DELETED for MyISAM when one thread is reading and another
    is deleting without locks. For now we do a single-threaded full scan; a multi-threaded scan will be implemented in the future. */
    if (tmp == HA_ERR_KEY_NOT_FOUND) break;

    auto offset {0};
    memset(context.m_extra_info.m_key_buff.get(), 0x0, key->key_length);
    for (uint key_partid = 0; key_partid < key->user_defined_key_parts; key_partid++) {
      memcpy(context.m_extra_info.m_key_buff.get() + offset,
             key->key_part[key_partid].field->field_ptr(),
             key->key_part[key_partid].store_length);
      offset += key->key_part[key_partid].store_length;
    }
    context.m_extra_info.m_key_len = offset;

    uint32 field_count = table_arg.s->fields;
    Field *field_ptr = nullptr;
    uint32 primary_key_idx [[maybe_unused]] = field_count;

    context.m_trx = thd_to_trx(m_rpd_thd);
    field_ptr = *(table_arg.field + field_count); //ghost field.
    if (field_ptr && field_ptr->type() == MYSQL_TYPE_DB_TRX_ID) {
      context.m_extra_info.m_trxid = field_ptr->val_int();
    }

    if (context.m_trx->state == TRX_STATE_NOT_STARTED) {
      assert (false);
    }
    //will used rowid as rapid pk.
    //if (imcs_instance->write_direct(&context, field_ptr)) {
    if (imcs_reader->write(&context, const_cast<TABLE*>(&table_arg)->record[0])) {
      table_arg.file->ha_rnd_end();
      imcs_instance->delete_all_direct(&context);
      my_error(ER_SECONDARY_ENGINE_LOAD, MYF(0), table_arg.s->db.str,
               table_arg.s->table_name.str);
      return HA_ERR_GENERIC;
    }
    ha_statistic_increment(&System_status_var::ha_read_rnd_count);
    m_rpd_thd->inc_sent_row_count(1);
    if (tmp == HA_ERR_RECORD_DELETED && !thd->killed) continue;
  }

If alter table xxx secondary_unload is executed, all the loaded data is erased from the Rapid engine.

int ha_rapid::unload_table(const char *db_name, const char *table_name,
                          bool error_if_not_loaded) {
  DBUG_TRACE;
  if (error_if_not_loaded &&
      shannon_loaded_tables->get(db_name, table_name) == nullptr) {
    my_error(ER_SECONDARY_ENGINE_PLUGIN, MYF(0),
             "Table is not loaded on a secondary engine");
    return HA_ERR_GENERIC;
  }
  
  ShannonBase::Imcs::Imcs* imcs_instance = ShannonBase::Imcs::Imcs::get_instance();
  assert(imcs_instance);
  RapidContext context;
  context.m_current_db = std::string(db_name);
  context.m_current_table = std::string(table_name);

  if (auto ret = imcs_instance->delete_all_direct(&context)) {
    return ret;
  }
  shannon_loaded_tables->erase(db_name, table_name);
  return 0;
}

2.6 Local dictionary

Each CU has a local dictionary, which keeps a dictionary of the compressed strings. In Rapid we do not store string or text values natively; a double-typed id is stored instead.

During the table-load stage, a text or string value of a field is processed by the local dictionary instance by invoking Dictionary::store, which returns a double value as the string value's id; this id is what is stored in Rapid.

During the table-read stage, we get the string text back by its id via the local dictionary. Before the local dictionary processes them, all string texts are compressed via zstd, lz4, etc., according to the field's encoding type, as the snippet below shows; a rough round-trip sketch follows it.

  std::string comment(field->comment.str);
  std::transform(comment.begin(), comment.end(), comment.begin(), ::toupper);
  if (std::regex_search(comment.c_str(), column_encoding_patt)) {
    if (comment.find("SORTED") != std::string::npos)
      m_header->m_encoding_type = Compress::Encoding_type::SORTED;
    else if (comment.find("VARLEN") != std::string::npos)
      m_header->m_encoding_type = Compress::Encoding_type::VARLEN;
  } else
    m_header->m_encoding_type = Compress::Encoding_type::NONE;
  m_header->m_local_dict =
      std::make_unique<Compress::Dictionary>(m_header->m_encoding_type);
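
A rough sketch of the round trip described above. Only the fact that store() returns a double-typed id is taken from this document; the method names, arguments, and the reverse lookup shown here are assumptions for illustration:

// Hypothetical illustration of the local-dictionary round trip; the exact
// Compress::Dictionary API may differ from what is shown here.
Compress::Dictionary *dict = cu->local_dictionary();
double id = dict->store(text_value);   // encode the string -> double-typed id (assumed signature)
// ... later, while reading ...
std::string text = dict->get(id);      // decode the id back to the string (assumed API)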

4 AutoML

4.1 System procedures, sys.ml_xxx

In ShannonBase, we support several ML routines. In the ${source_code_dir}/scripts/sys_schema/procedures folder, we added some system procedures, which are installed into the sys schema (sys.xxx) after MySQL initialization.

Taking sys.ML_TRAIN as an example:

DROP PROCEDURE IF EXISTS sys.ml_train;
DELIMITER $$

CREATE PROCEDURE sys.ml_train (
        IN in_table_name VARCHAR(64), IN in_target_name VARCHAR(64), IN in_option JSON, IN in_model_handle VARCHAR(64)
    )
    COMMENT '
Description
-----------

Run the ML_TRAIN routine on a labeled training dataset to produce a trained machine learning model.

Parameters
-----------
in_table_name (VARCHAR(64)):
  fully qualified name of the table containing the training dataset.
in_target_name (VARCHAR(64)):
  name of the column in \'table_name\' representing the target, i.e. ground truth values (required for some tasks)
in_option (JSON)
  optional training parameters as key-value pairs in JSON format.
    1: The most important parameter is \'task\', which specifies the ML task to be performed (if not specified, \'classification\' is assumed)
    2: Other parameters allow finer-grained control on the training task
in_model_handle (VARCHAR(64))
   user-defined session variable storing the ML model handle for the duration of the connection
Example
-----------
mysql> SET @iris_model = \'iris_manual\';
mysql> CALL sys.ML_TRAIN(\'ml_data.iris_train\', \'class\', 
          JSON_OBJECT(\'task\', \'classification\'), 
          @iris_model);
...    
'
    SQL SECURITY INVOKER
    NOT DETERMINISTIC
    CONTAINS SQL
BEGIN
    DECLARE v_error BOOLEAN DEFAULT FALSE;
    DECLARE v_user_name VARCHAR(64);
    DECLARE v_db_name_check VARCHAR(64);
    DECLARE v_sys_schema_name VARCHAR(64);
    DECLARE v_db_err_msg TEXT;

    DECLARE v_train_obj_check INT;
    DECLARE v_train_schema_name VARCHAR(64);
    DECLARE v_train_table_name VARCHAR(64);

    SELECT SUBSTRING_INDEX(CURRENT_USER(), '@', 1) INTO v_user_name;  
    SET v_sys_schema_name = CONCAT('ML_SCHEMA_', v_user_name);
  
    SELECT SCHEMA_NAME INTO v_db_name_check
      FROM INFORMATION_SCHEMA.SCHEMATA
    WHERE SCHEMA_NAME = v_sys_schema_name;

    IF v_db_name_check IS NULL THEN
        SET @create_db_stmt = CONCAT('CREATE DATABASE ', v_sys_schema_name, ';');
        PREPARE create_db_stmt FROM @create_db_stmt;
        EXECUTE create_db_stmt;
        DEALLOCATE PREPARE create_db_stmt;

        SET @create_tb_stmt = CONCAT(' CREATE TABLE ', v_sys_schema_name, '.MODEL_CATALOG(
                                        MODEL_ID INT NOT NULL AUTO_INCREMENT,
                                        MODEL_HANDLE VARCHAR(255),
                                        MODEL_OBJECT JSON,
                                        MODEL_OWNER VARCHAR(64),
                                        BUILD_TIMESTAMP TIMESTAMP,
                                        TARGET_COLUMN_NAME VARCHAR(64),
                                        TRAIN_TABLE_NAME VARCHAR(255),
                                        MODEL_OBJECT_SIZE INT,
                                        MODEL_TYPE  VARCHAR(64),
                                        TASK  VARCHAR(64),
                                        COLUMN_NAMES VARCHAR(1024),
                                        MODEL_EXPLANATION NUMERIC,
                                        LAST_ACCESSED TIMESTAMP,
                                        MODEL_METADATA JSON,
                                        NOTES VARCHAR(1024),
                                        PRIMARY KEY (MODEL_ID));');
        PREPARE create_tb_stmt FROM @create_tb_stmt;
        EXECUTE create_tb_stmt;
        DEALLOCATE PREPARE create_tb_stmt;
    END IF;

    SELECT SUBSTRING_INDEX(in_table_name, '.', 1) INTO v_train_schema_name;
    SELECT SUBSTRING_INDEX(in_table_name, '.', -1) INTO v_train_table_name;
    
    SELECT COUNT(*) INTO v_train_obj_check
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = v_train_schema_name AND TABLE_NAME = v_train_table_name;
    IF v_train_obj_check = 0 THEN
        SET v_db_err_msg = CONCAT(in_table_name, ' does not exist.');
        SIGNAL SQLSTATE 'HY000'
            SET MESSAGE_TEXT = v_db_err_msg;
    END IF;
  
    SELECT COUNT(COLUMN_NAME) INTO v_train_obj_check
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = v_train_schema_name AND TABLE_NAME = v_train_table_name AND COLUMN_NAME = in_target_name;
    IF v_train_obj_check = 0 THEN
        SET v_db_err_msg = CONCAT(in_target_name, ' does not exist.');
        SIGNAL SQLSTATE 'HY000'
            SET MESSAGE_TEXT = v_db_err_msg;
    END IF;
    SELECT ml_train(in_table_name, in_target_name, in_option, in_model_handle) INTO v_train_obj_check;
    IF v_train_obj_check != 0 THEN
        SET v_db_err_msg = CONCAT('ML_TRAIN failed.');
        SIGNAL SQLSTATE 'HY000'
            SET MESSAGE_TEXT = v_db_err_msg;
    END IF; 
END$$
DELIMITER ;

Besides the system stored procedures, several system SQL functions are also added; they are used to invoke the ML routines that perform the ML tasks.

4.2 System SQL functions

ShannonBase adds a number of internal system functions, such as ml_train. They are registered in the native function table in item_func.cc:

static const std::pair<const char *, Create_func *> func_array[] = {
    {"ABS", SQL_FN(Item_func_abs, 1)},
    {"ACOS", SQL_FN(Item_func_acos, 1)},
    ...
    {"MD5", SQL_FN(Item_func_md5, 1)},
    {"ML_TRAIN", SQL_FN_V_LIST(Item_func_ml_train, 3, 4)},
    {"ML_MODEL_LOAD", SQL_FN_LIST(Item_func_ml_model_load, 3)},
    {"ML_MODEL_UNLOAD", SQL_FN_LIST(Item_func_ml_model_unload, 2)},
    {"ML_MODEL_IMPORT", SQL_FN_LIST(Item_func_ml_model_import, 4)},
    {"ML_SCORE", SQL_FN_V_LIST(Item_func_ml_score, 5, 6)},
    {"ML_PREDICT_ROW", SQL_FN_LIST(Item_func_ml_predicte_row, 2)},
    {"ML_PREDICT_TABLE", SQL_FN_V_LIST(Item_func_ml_predicte_table, 3, 4)},
    {"ML_EXPLAIN", SQL_FN_V_LIST(Item_func_ml_explain, 3, 4)},
    {"ML_EXPLAIN_ROW", SQL_FN_V_LIST(Item_func_ml_explain_row, 2, 3)},
    {"ML_EXPLAIN_TABLE", SQL_FN_V_LIST(Item_func_ml_explain_table, 3, 4)},
    {"MONTHNAME", SQL_FN(Item_func_monthname, 1)},
    {"NAME_CONST", SQL_FN(Item_name_const, 2)},
    ...

The declarations and implementations of these functions live in the item_func.h / item_func.cc pair.
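To illustrate the pattern, the following is a minimal sketch, not the actual ShannonBase source, of how such a native function can be declared as an Item_int_func subclass; the member body and details are illustrative assumptions only.

// A minimal sketch of a native ML function item, following the usual
// Item_int_func pattern from item_func.h; the body is illustrative only.
class Item_func_ml_train final : public Item_int_func {
 public:
  Item_func_ml_train(const POS &pos, PT_item_list *args)
      : Item_int_func(pos, args) {}

  const char *func_name() const override { return "ml_train"; }

  longlong val_int() override {
    // Resolve the arguments (table name, target column, options JSON,
    // model handle), hand them to the ML runtime, and return 0 on success.
    return 0;
  }
};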

5 Populating the changes from InnoDB to Rapid

As an HTAP database, ShannonBase is able to propagate changes from InnoDB to Rapid in real time, which enables it to run real-time analytical workloads.

ShannonBase uses the redo log to achieve this propagation. When a transaction commits, it writes redo log records; we keep a copy of these records, parse them, and apply the changes to the Rapid engine.

Based on this idea, we use a circular ring buffer to store the copied redo log records. Only the redo logs of insert, delete, and update operations are put into the ring buffer.

A background thread is launched after an ALTER TABLE xxx SECONDARY_LOAD statement is executed, and it stops when all loaded tables have been unloaded from the Rapid engine.

If you want to know the status of the Rapid engine, you can use SHOW ENGINE INNODB STATUS to list all information, including that of the Rapid engine. We do not split the Rapid status information from InnoDB's because Rapid acts as a sub-engine of InnoDB: InnoDB is the primary engine and Rapid is the secondary one.

5.1 Circular buffer

[Figure: circular (ring) buffer, illustration from Wikipedia]

We incorporate a ring buffer to store the incoming InnoDB redo log records. It is a lock-free circular buffer; for more background on ring buffers, please refer to the related materials.
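As an illustration of the idea, here is a simplified, self-contained single-producer/single-consumer sketch whose method names mirror how sys_population_buffer is used later in this document (writeBuff, readAvailable, peek, remove); the actual implementation in ShannonBase may differ.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// Simplified SPSC byte ring buffer; head_/tail_ grow monotonically and are
// reduced modulo the capacity only when indexing into the storage.
class RingBuffer {
 public:
  explicit RingBuffer(size_t capacity) : buf_(capacity) {}

  // Producer side: copy len bytes in; returns false if there is no room.
  bool writeBuff(const unsigned char *src, size_t len) {
    size_t head = head_.load(std::memory_order_relaxed);
    size_t tail = tail_.load(std::memory_order_acquire);
    if (buf_.size() - (head - tail) < len) return false;  // buffer full
    for (size_t i = 0; i < len; i++) buf_[(head + i) % buf_.size()] = src[i];
    head_.store(head + len, std::memory_order_release);
    return true;
  }

  // Consumer side: number of bytes that are ready to be read.
  size_t readAvailable() const {
    return head_.load(std::memory_order_acquire) -
           tail_.load(std::memory_order_relaxed);
  }

  // Consumer side: copy up to len bytes out without consuming them.
  size_t peek(unsigned char *dst, size_t len) const {
    size_t n = std::min(len, readAvailable());
    size_t tail = tail_.load(std::memory_order_relaxed);
    for (size_t i = 0; i < n; i++) dst[i] = buf_[(tail + i) % buf_.size()];
    return n;
  }

  // Consumer side: discard len bytes that have been processed.
  void remove(size_t len) { tail_.fetch_add(len, std::memory_order_release); }

 private:
  std::vector<unsigned char> buf_;
  std::atomic<size_t> head_{0};  // advanced by the producer
  std::atomic<size_t> tail_{0};  // advanced by the consumer
};

Note that the real sys_population_buffer->peek() returns a pointer into the buffer rather than copying, as can be seen in parse_log_func below.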

5.2 Background Thread

A new background thread is employed to monitor the ring buffer. When a new redo log record arrives, the thread is woken up and starts processing it; afterwards it sleeps until another record comes in.

The implementation of this thread is given in populate.cpp; parse_log_func is the thread function.

static void parse_log_func (log_t *log_ptr) {
  std::unique_ptr<THD> log_pop_thread_thd {nullptr};
  if (current_thd == nullptr) {
    log_pop_thread_thd.reset(create_internal_thd());
    ut_ad(current_thd == log_pop_thread_thd.get());
  }
  
  os_event_reset(log_ptr->rapid_events[0]);
  // here we have a notifier: population starts when checkpoint_lsn/flushed_lsn > rapid_lsn
  while (sys_pop_started.load(std::memory_order_seq_cst)) {
    auto stop_condition = [&](bool wait) {
      if (sys_population_buffer->readAvailable()) {
        return true;
      }
      if (wait) {  // do something while waiting
      }
      return false;
    };

    os_event_wait_for(log_ptr->rapid_events[0], MAX_LOG_POP_SPIN_COUNT,
                      std::chrono::microseconds{100}, stop_condition);

    sys_rapid_loop_count++;
    MONITOR_INC(MONITOR_LOG_RAPID_MAIN_LOOPS);

    auto size = sys_population_buffer->readAvailable();
    byte* from_ptr = sys_population_buffer->peek();
    LogParser parse_log;
    uint parsed_bytes = parse_log.parse_redo(from_ptr, from_ptr + size);
    sys_population_buffer->remove(parsed_bytes);
  }  // while (sys_pop_started)

  destroy_internal_thd(current_thd);
  log_pop_thread_thd.reset(nullptr);
  sys_pop_started.store(false, std::memory_order_seq_cst);
}

The background thread is started in Populator::start_change_populate_threads:

void Populator::start_change_populate_threads() {
  if (!Populator::log_pop_thread_is_active()) {
    sys_log_rapid_thread =
      os_thread_create(rapid_populate_thread_key, 0, parse_log_func, log_sys);
    ShannonBase::Populate::sys_pop_started = true;
    sys_log_rapid_thread.start();
  }
}

end_change_populate_threads is used to stop this background thread:

void Populator::end_change_populate_threads() {
  sys_pop_started.store(false, std::memory_order_seq_cst);
}

5.3 Adding the redo logs

As we know, redo log records are written when a transaction commits. If a record is of type MLOG_REC_INSERT, MLOG_REC_DELETE, or MLOG_REC_UPDATE_IN_PLACE, we add it to the ring buffer and then process it.

In the log_buffer_write function, the redo log records are first inserted into the redo log buffer. At this point we add our own logic, which also makes a copy of the records for Rapid.

lsn_t log_buffer_write(log_t &log, const byte *str, size_t str_len,
                       lsn_t start_lsn) {
  ut_ad(rw_lock_own(log.sn_lock_inst, RW_LOCK_S));
  ...  
    log_sync_point("log_buffer_write_before_memcpy");

    /* This is the critical memcpy operation, which copies data
    from internal mtr's buffer to the shared log buffer. */
    std::memcpy(ptr, str, len);
    auto type = mlog_id_t(*ptr & ~MLOG_SINGLE_REC_FLAG);
    if (ShannonBase::Populate::Populator::log_pop_thread_is_active() &&
        !recv_recovery_is_on()) {
        ShannonBase::Populate::sys_population_buffer->writeBuff(str, len);
    }
  ... 

Here, sys_population_buffer is the ring buffer we defined. After the log records have been written, a notification is sent to the change-population thread in log_buffer_write_completed(...):

os_event_set(log.rapid_events[0]);
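Conceptually, this writer-side notification and the consumer-side os_event_wait_for in parse_log_func form a classic event hand-off. Expressed with standard C++ primitives instead of InnoDB's os_event API, the pattern looks roughly like the following sketch (illustrative only, not the actual code path).

#include <chrono>
#include <condition_variable>
#include <mutex>

// Conceptual sketch of the hand-off that os_event_set / os_event_wait_for
// provide in the real code; all names here are illustrative.
struct PopulateEvent {
  std::mutex m;
  std::condition_variable cv;
  bool signalled = false;

  // Producer side (log_buffer_write_completed): wake the populate thread.
  void set() {
    std::lock_guard<std::mutex> lk(m);
    signalled = true;
    cv.notify_one();
  }

  // Consumer side (parse_log_func): sleep until new redo arrives or the
  // timeout expires, then return to checking the ring buffer.
  void wait_for(std::chrono::microseconds timeout) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait_for(lk, timeout, [this] { return signalled; });
    signalled = false;
  }
};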

5.4 Parsing the redo log

LogParser is used to parse the redo log records and apply the changes to Rapid. The entry point is uint LogParser::parse_redo(byte* ptr, byte* end_ptr). The layout of the records is described in the redo log format documentation.

uint LogParser::parse_multi_rec(byte *ptr, byte *end_ptr) {
  ut_a(end_ptr >= ptr);
  return (end_ptr - ptr);
}
// handle single mtr
uint LogParser::parse_redo(byte* ptr, byte* end_ptr) {
/**
 * After the secondary_load command is executed, all existing data is read from
 * the data files; any record with an LSN smaller than the last checkpoint LSN
 * has already been persisted there, so only records with a larger LSN need to
 * be populated.
 */
    if (ptr == end_ptr) {
      return 0;
    }

    bool single_rec;
    switch (*ptr) {
#ifdef UNIV_LOG_LSN_DEBUG
      case MLOG_LSN:
#endif /* UNIV_LOG_LSN_DEBUG */
      case MLOG_DUMMY_RECORD:
        single_rec = true;
        break;
      default:
        single_rec = !!(*ptr & MLOG_SINGLE_REC_FLAG);
    }

    return (single_rec) ?  parse_single_rec(ptr, end_ptr) :
                           parse_multi_rec(ptr, end_ptr);
}

parse_cur_and_apply_insert_rec, parse_cur_and_apply_delete_rec, and parse_cur_update_in_place_and_apply are used to process the insert, delete, and update redo log record types, respectively.
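The routing itself boils down to a switch on the record type. The sketch below is simplified and its signature is an assumption (the real functions operate on a cursor/page context), with the three appliers represented only by comments.

// Simplified sketch of routing a parsed record to the three appliers named
// above; the signature and body are illustrative, not the real code.
static void apply_record(mlog_id_t type, [[maybe_unused]] const byte *body,
                         [[maybe_unused]] size_t len) {
  switch (type) {
    case MLOG_REC_INSERT:
      // parse_cur_and_apply_insert_rec(...): decode the inserted record and
      // append the new row to the in-memory column store.
      break;
    case MLOG_REC_DELETE:
      // parse_cur_and_apply_delete_rec(...): locate the row in Rapid and
      // remove (or delete-mark) it.
      break;
    case MLOG_REC_UPDATE_IN_PLACE:
      // parse_cur_update_in_place_and_apply(...): apply the in-place update
      // to the affected columns.
      break;
    default:
      // Other record types are not relevant to Rapid and are skipped.
      break;
  }
}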

For more information about these three functions, please refer to the source code (implementation).

6 Performance_schema

In ShannonBase, several performance_schema tables are added to store the information that Rapid maintains, such as details about the loaded tables and columns. The tables rpd_column_id, rpd_columns, rpd_preload_stats, rpd_table_id, and rpd_tables are added.

Taking rpd_column_id as an example: in the storage/perfschema directory, a file named table_rpd_column_id.cc is added and the corresponding CMake file is modified. After that, a table share is added to all_shares in pfs_engine_table.cc.

mysql> show tables like "%rpd%";
+--------------------------------------+
| Tables_in_performance_schema (%rpd%) |
+--------------------------------------+
| rpd_column_id                        |
| rpd_columns                          |
| rpd_preload_stats                    |
| rpd_table_id                         |
| rpd_tables                           |
+--------------------------------------+
5 rows in set (0.04 sec)


mysql> show create table rpd_column_id \G
*************************** 1. row ***************************
       Table: rpd_column_id
Create Table: CREATE TABLE `rpd_column_id` (
  `ID` bigint unsigned NOT NULL,
  `TABLE_ID` bigint unsigned NOT NULL,
  `COLUMN_NAME` char(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL
) ENGINE=PERFORMANCE_SCHEMA DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.02 sec)

static PFS_engine_table_share *all_shares[] = {
    &table_cond_instances::m_share,
    &table_error_log::m_share,
    &table_events_waits_current::m_share,
    &table_events_waits_history::m_share,
    &table_events_waits_history_long::m_share,
    ...
    &table_rpl_async_connection_failover_managed::m_share,
    &table_rpd_column_id::m_share,
    &table_rpd_columns::m_share,
    &table_rpd_table_id::m_share,
    &table_rpd_tables::m_share,
    &table_rpd_preload_stats::m_share,

    &table_log_status::m_share,

    &table_prepared_stmt_instances::m_share,
    ...

6.1 Table definition

First of all, the table definition is declared:

Plugin_table table_rpd_column_id::m_table_def(
    /* Schema name */
    "performance_schema",
    /* Name */
    "rpd_column_id",
    /* Definition */
    "  ID BIGINT unsigned not null,\n"
    "  TABLE_ID BIGINT unsigned not null,\n"
    "  COLUMN_NAME CHAR(128) collate utf8mb4_bin not null\n",
    /* Options */
    " ENGINE=PERFORMANCE_SCHEMA",
    /* Tablespace */
    nullptr);

and the table share is defined as:

PFS_engine_table_share table_rpd_column_id::m_share = {
    &pfs_readonly_acl,
    &table_rpd_column_id::create,
    nullptr, /* write_row */
    nullptr, /* delete_all_rows */
    table_rpd_column_id::get_row_count,
    sizeof(pos_t), /* ref length */
    &m_table_lock,
    &m_table_def,
    true, /* perpetual */
    PFS_engine_table_proxy(),
    {0},
    false /* m_in_purgatory */
};

These table definitions are created at the bootstrap stage.

6.2 Implementing adding and querying data

int table_rpd_column_id::make_row(uint index[[maybe_unused]]) {
  DBUG_TRACE;
  // Fill the row from the Rapid column metadata.
  if (index >= ShannonBase::meta_rpd_columns_infos.size()) {
    return HA_ERR_END_OF_FILE;
  } else {
    m_row.column_id = ShannonBase::meta_rpd_columns_infos[index].column_id;
    m_row.table_id = ShannonBase::meta_rpd_columns_infos[index].table_id;

    strncpy(m_row.column_name, ShannonBase::meta_rpd_columns_infos[index].column_name,
            sizeof(m_row.column_name));
    m_row.column_name_length = strlen(ShannonBase::meta_rpd_columns_infos[index].column_name);
  }
  return 0;
}

int table_rpd_column_id::read_row_values(TABLE *table,
                                         unsigned char *buf,
                                         Field **fields,
                                         bool read_all) {
  Field *f;

  //assert(table->s->null_bytes == 0);
  buf[0] = 0;

  for (; (f = *fields); fields++) {
    if (read_all || bitmap_is_set(table->read_set, f->field_index())) {
      switch (f->field_index()) {
        case 0: /** column_id */
          set_field_ulonglong(f, m_row.column_id);
          break;
        case 1: /** table_id */
          set_field_ulonglong(f, m_row.table_id);
          break;
        case 2: /** column name */
          set_field_char_utf8mb4(f, m_row.column_name, m_row.column_name_length);
          break;
        default:
          assert(false);
      }
    }
  }
  return 0;
}
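The rows above are filled from an in-memory metadata container, ShannonBase::meta_rpd_columns_infos. Its exact definition is not shown here; judging from the accesses in make_row, its entries are shaped roughly like the following hypothetical sketch (not the real header).

#include <vector>

// Hypothetical shape of a per-column metadata entry read by make_row;
// the real definition lives in the Rapid engine headers and may differ.
struct rpd_columns_info {
  unsigned long long column_id;  // exposed as the ID column
  unsigned long long table_id;   // exposed as the TABLE_ID column
  char column_name[128];         // exposed as the COLUMN_NAME column
};

// One entry is expected per column of every table loaded into Rapid.
std::vector<rpd_columns_info> meta_rpd_columns_infos;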

7 Run MTRs

MySQL test cases (MTR) are used to make sure your new features do not break the correctness of existing ones.

./mtr --suite=xxx --nowarnings --force --nocheck-testcases --retry=0 [--sanitize] --parallel=5