Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to specify a list of columns for insert #217

Closed
cjroebuck opened this issue Dec 23, 2023 · 2 comments · Fixed by #218
Closed

Allow to specify a list of columns for insert #217

cjroebuck opened this issue Dec 23, 2023 · 2 comments · Fixed by #218
Assignees
Labels
enhancement New feature or request

Comments

@cjroebuck
Copy link

Describe the bug

Not sure if this is a clickhouse-js specific bug or just clickhouse in general.

Whenever I try to insert JSON as a raw String with EPHEMERAL column and then have derived columns using various JSONExtract* functions, I can't seem to get it to work. I'm using the example from the JSON docs here.

Steps to reproduce

Run the following code to setup tables and insert test rows.

import { createClient } from "@clickhouse/client"; 

void (async () => {
  const client = createClient({
    clickhouse_settings: {
      date_time_input_format: "best_effort",
      allow_experimental_object_type: 1,
    },
  });

  await client.command({
    query: `DROP TABLE IF EXISTS github_json_ephemeral`,
  });

  await client.command({
    query: `CREATE table github_json_ephemeral
  (
     event_type LowCardinality(String) DEFAULT JSONExtractString(message_raw, 'type'),
     repo_name LowCardinality(String) DEFAULT JSONExtractString(message_raw, 'repo.name'),
     message JSON DEFAULT message_raw,
     message_raw String EPHEMERAL
  ) ENGINE = MergeTree ORDER BY (event_type, repo_name)`,
  });

  await client.insert({
    table: "github_json_ephemeral",
    format: "JSON",
    values: [
      {
        message_raw: {
          type: "PushEvent",
          created_at: "2022-01-04 07:00:00",
          actor: {
            avatar_url: "https://avatars.githubusercontent.com/u/41898282?",
            display_login: "github-actions",
            gravatar_id: "",
            id: 41898282,
            login: "github-actions[bot]",
            url: "https://api.github.com/users/github-actions[bot]",
          },
          repo: {
            id: 410071248,
            name: "pioug/yield-data",
            url: "https://api.github.com/repos/pioug/yield-data",
          },
        },
      },
    ],
  });
})

Expected behaviour

Output:

rows [
  {
    event_type: 'PushEvent',
    repo_name: 'pioug/yield-data',
    message: {
      actor: [Object],
      created_at: '2022-01-04 07:00:00',
      repo: [Object],
      type: 'PushEvent'
    },
  }

Actual behaviour

The derived columns are blank and the JSON column is empty:

rows [
  { event_type: '', repo_name: '', message: { _dummy: 0 } }
]

Removing the EPHEMERAL directive from the column, I get the following output:

rows [
  {
    event_type: 'PushEvent',
    repo_name: '',
    message: {
      actor: [Object],
      created_at: '2022-01-04 07:00:00',
      repo: [Object],
      type: 'PushEvent'
    },
    message_raw: '{"type":"PushEvent","created_at":"2022-01-04 07:00:00","actor":{"avatar_url":"https://avatars.githubusercontent.com/u/41898282?","display_login":"github-actions","gravatar_id":"","id":41898282,"login":"github-actions[bot]","url":"https://api.github.com/users/github-actions[bot]"},"repo":{"id":410071248,"name":"pioug/yield-data","url":"https://api.github.com/repos/pioug/yield-data"}}'
  }

Notice that repo_name field is empty but event_type has been successfully extracted from the JSON. So it seems the extract from repo.name is also not working. This is a secondary issue.

Configuration

Environment

  • Client version: npm package: 0.2.7.
  • Language version: Node v20.9.0.
  • OS: Mac OS Sonoma

ClickHouse server

  • ClickHouse Server version: 23.11.2
@cjroebuck cjroebuck added the bug Something isn't working label Dec 23, 2023
@slvrtrn
Copy link
Contributor

slvrtrn commented Dec 23, 2023

I think this is because the right syntax for an insert operation here will be:

insert into github_json_ephemeral (message_raw) values ('...')

^ see only message_raw in the insert clause, which triggers default values for the rest of the fields (https://clickhouse.com/docs/en/sql-reference/statements/create/table#ephemeral).

We can add an option to restrict the insert columns, as it is useful.

The only option I see as a workaround before we add this feature is to use command with VALUES instead. It's not the optimal solution, as you might encounter request size limits (since the values are not streamed):

import { createClient } from '@clickhouse/client'

void (async () => {
  const table = 'github_json_ephemeral'
  const client = createClient({
    clickhouse_settings: {
      date_time_input_format: 'best_effort',
      allow_experimental_object_type: 1,
    },
  })

  await client.command({
    query: `
      CREATE OR REPLACE TABLE ${table}
      (
        event_type  LowCardinality(String) DEFAULT JSONExtractString(message_raw, 'type'),
        repo_name   LowCardinality(String) DEFAULT JSONExtractString(message_raw, 'repo', 'name'),
        message     JSON                   DEFAULT message_raw,
        message_raw String                 EPHEMERAL
      ) ENGINE = MergeTree ORDER BY (event_type, repo_name)`,
  })

  const messagesRaw = [
    {
      type: 'MyEventType',
      repo: {
        name: 'foo',
      },
    },
    {
      type: 'SomeOtherType',
      repo: {
        name: 'bar',
      },
    },
  ]

  // Generates ('val1'),('val2'), ...
  const messagesRawValues = messagesRaw
    .map((msg) => {
      return `('${JSON.stringify(msg)}')`
    })
    .join(',')

  const insertQuery = `INSERT INTO ${table} (message_raw) VALUES ${messagesRawValues}`
  await client.command({
    query: insertQuery,
  })

  const result = await client.query({
    query: `SELECT * FROM ${table}`,
    format: 'JSONCompactEachRow',
  })
  console.log('Result:', JSON.stringify(await result.json()))
})()

Prints:

Result: [["MyEventType","foo",{"repo":{"name":"foo"},"type":"MyEventType"}],["SomeOtherType","bar",{"repo":{"name":"bar"},"type":"SomeOtherType"}]]

NB: see JSONExtractString(message_raw, 'repo', 'name') - it should correctly extract the repo name.

@slvrtrn slvrtrn changed the title JSONExtract and Ephemeral Column Allow to specify a list of columns for insert Dec 23, 2023
@slvrtrn slvrtrn added enhancement New feature or request and removed bug Something isn't working labels Dec 23, 2023
@slvrtrn slvrtrn self-assigned this Dec 23, 2023
@slvrtrn
Copy link
Contributor

slvrtrn commented Jan 12, 2024

@cjroebuck, added in 0.2.8. See the release notes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants