feat: support `text/plain` format for log ingestion #4300

shuiyisong · 2024-07-05T09:21:12Z

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

This PR:

- adds support for `text/plain` and `text/plain; charset=utf-8` on log ingestion
- refactors the pipeline find and delete APIs to use the DataFrame API

You can now ingest plain-text logs, for example:

url="http://localhost:4000/v1/events/logs?table=$table_name&pipeline_name=$pipeline_name"
curl -X POST "$url" -H "Content-Type: text/plain" \
 -d '
2024-05-25 20:16:37.217 hello
2024-05-25 20:16:37.218 hello world
'
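
As a rough illustration of what the request above implies on the server side, the sketch below splits a `text/plain` body into one string per log line. The helper name `split_plain_text` is hypothetical, and skipping blank lines is an assumption made here because the example body starts and ends with a newline; this is not the code from this PR.

```rust
/// Hypothetical helper (not taken from this PR): split a `text/plain`
/// request body into one entry per non-blank line.
/// Assumption: blank lines carry no log data and are skipped.
fn split_plain_text(body: &str) -> Vec<String> {
    body.lines()
        // `lines()` splits on both "\n" and "\r\n".
        .filter(|line| !line.trim().is_empty())
        .map(str::to_owned)
        .collect()
}
```

With the body from the `curl` example, this yields two entries, one per timestamped line, which a pipeline could then parse individually.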

Checklist

I have written the necessary rustdoc comments.
I have added the necessary unit tests and integration tests.
This PR requires documentation updates.

Summary by CodeRabbit

New Features
- Introduced a new error type, DataFrame, for more detailed error reporting related to data frames.
- Added TimestampNanosecondValue support to handle more precise timestamp data in tests.
- Added plaintext ingestion test for HTTP pipelines.
Refactor
- Improved data frame preparation and filtering logic in the PipelineTable module.
- Updated content type processing methods in HTTP event handling.
Bug Fixes
- Enhanced error handling to include new data frame-related errors.

coderabbitai · 2024-07-05T09:21:21Z

Walkthrough

The update introduces a new error variant, DataFrame, in the manager module's error.rs file, along with its implementation in ErrorExt. Various files in the pipeline project, including table.rs and util.rs, have been refactored for improved dataframe handling. Additional changes include updates to enums, test functions, content type handling in http/event.rs, and the integration of a new HTTP test function for plain text ingestion.

Changes

| File(s) | Change Summary |
| --- | --- |
| `src/pipeline/src/manager/error.rs` | Added `DataFrame` error variant and updated `ErrorExt` implementation. |
| `src/pipeline/src/manager/table.rs` | Refactored dataframe preparation and execution logic, and added `prepare_dataframe_conditions` function. |
| `src/pipeline/src/manager/util.rs` | Refactored `build_plan_filter` to `prepare_dataframe_conditions`. |
| `src/pipeline/tests/pipeline.rs` | Added `TimestampNanosecondValue` variant to `ValueData` enum and updated `test_simple_data` function. |
| `src/servers/src/http/event.rs` | Updated content type handling and processing methods for payload. |
| `src/pipeline/src/etl/processor.rs` | Added logic branch to `Processor` trait for handling `Value::String`. |
| `tests-integration/tests/http.rs` | Added `test_plain_text_ingestion` function to the HTTP test suite. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTPServer
    participant PipelineManager
    participant DataFrameHandler

    Client->>HTTPServer: Send plain text data ingestion request
    HTTPServer->>PipelineManager: Forward request data
    PipelineManager->>DataFrameHandler: Prepare and execute dataframe
    DataFrameHandler-->>PipelineManager: Return dataframe result
    PipelineManager-->>HTTPServer: Return processed data response
    HTTPServer-->>Client: Send response with ingestion result

Poem

In the land of data, changes sprout,
A new variant to clear the doubt,
DataFrames now handle errors bright,
Tests and logic take new flight.
HTTP sings a fresher tune,
As pipelines hum a smoother croon.


coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between da0c840 and f87472a.

Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

Files selected for processing (6)

src/pipeline/src/manager/error.rs (2 hunks)
src/pipeline/src/manager/table.rs (5 hunks)
src/pipeline/src/manager/util.rs (2 hunks)
src/pipeline/tests/pipeline.rs (2 hunks)
src/servers/src/http/event.rs (2 hunks)
tests-integration/tests/http.rs (2 hunks)

Additional comments not posted (11)

src/pipeline/src/manager/util.rs (1)
37-52: LGTM! But verify the function usage in the codebase.

The logic for constructing filter expressions and combining them using Expr::and is correct and efficient.

However, ensure that all invocations of prepare_dataframe_conditions match the new signature and expected parameters.
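
The idea of building one filter per column and folding them together with `and` can be sketched with a toy expression type. Everything below is hypothetical: the `Expr` enum stands in for the real query-engine expression type, the column names are invented, and treating `version` as optional is an assumption. It only shows the shape of the combination step.

```rust
/// Toy stand-in for a query expression type (hypothetical, not the real API).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    ColumnEq { column: String, value: String },
    And(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Combine two expressions into a conjunction.
    fn and(self, other: Expr) -> Expr {
        Expr::And(Box::new(self), Box::new(other))
    }
}

/// Build a `column = value` condition.
fn col_eq(column: &str, value: &str) -> Expr {
    Expr::ColumnEq {
        column: column.to_owned(),
        value: value.to_owned(),
    }
}

/// Hypothetical analogue of `prepare_dataframe_conditions`: always filter on
/// schema and name, and add a version filter only when one is given.
fn prepare_conditions(schema: &str, name: &str, version: Option<&str>) -> Expr {
    let mut conditions = vec![col_eq("schema", schema), col_eq("name", name)];
    if let Some(v) = version {
        conditions.push(col_eq("version", v));
    }
    // Left fold: ((schema AND name) AND version).
    conditions
        .into_iter()
        .reduce(Expr::and)
        .expect("the vector always holds at least two conditions")
}
```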

Verification successful

The function usage looks correct, but ensure the parameters are defined and passed correctly.

The function prepare_dataframe_conditions is invoked in src/pipeline/src/manager/table.rs with the parameters schema, name, and version. Verify these parameters are correctly defined and used.

Verified: All invocations of prepare_dataframe_conditions are correct and consistent with the new signature.

The parameters schema, name, and version are defined and used appropriately in src/pipeline/src/manager/table.rs:

- schema: Verified definition and usage.
- name: Verified definition and usage.
- version: Verified definition and usage.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify all function calls to `prepare_dataframe_conditions` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type rust -A 5 $'prepare_dataframe_conditions'
Length of output: 1920

Script:
#!/bin/bash
# Description: Verify definitions and usages of `schema`, `name`, and `version` in `table.rs`.

# Search for the definitions and usages of `schema`.
rg --type rust -A 5 $'schema' src/pipeline/src/manager/table.rs

# Search for the definitions and usages of `name`.
rg --type rust -A 5 $'name' src/pipeline/src/manager/table.rs

# Search for the definitions and usages of `version`.
rg --type rust -A 5 $'version' src/pipeline/src/manager/table.rs
Length of output: 18412
src/pipeline/src/manager/error.rs (2)

84-89: Addition of the DataFrame error variant looks good.

The new DataFrame error variant is correctly defined with appropriate fields.

136-136: Update in ErrorExt implementation looks good.

The new DataFrame error variant is correctly handled in the ErrorExt implementation.
src/servers/src/http/event.rs (1)
252-264: LGTM! But verify the content type handling logic.

The content type handling logic for text/plain and text/plain; charset=utf-8 is correct and efficient.

However, ensure that all content type handling scenarios are covered and tested.
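
One way to treat `text/plain` and `text/plain; charset=utf-8` uniformly is to compare only the media type and ignore any parameters. The sketch below is a simplified illustration under that assumption; `PayloadKind` and `classify_content_type` are hypothetical names, and the real handler in `event.rs` may be structured differently.

```rust
/// Hypothetical classification of supported request bodies.
#[derive(Debug, PartialEq)]
enum PayloadKind {
    Json,
    Text,
}

/// Hypothetical helper: classify a raw `Content-Type` header value.
/// Parameters after the first ';' (such as `charset=utf-8`) are ignored,
/// and the media type is compared case-insensitively.
fn classify_content_type(header: &str) -> Result<PayloadKind, String> {
    let media_type = header
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match media_type.as_str() {
        "application/json" => Ok(PayloadKind::Json),
        "text/plain" => Ok(PayloadKind::Text),
        other => Err(format!("unsupported content type: {}", other)),
    }
}
```

Cases worth covering in tests for a function of this shape include the bare media type, the media type with a charset parameter, and an unsupported type.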

Verification successful

LGTM! But verify the content type handling logic.

The content type handling logic for text/plain and text/plain; charset=utf-8 is correct and efficient. Additionally, the current implementation covers all necessary content types as per the codebase.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify all content type handling scenarios are covered and tested.

# Test: Search for the content type handling logic. Expect: Only occurrences of the new content type handling logic.
rg --type rust -A 5 $'ContentType::text'
Length of output: 495

Script:
#!/bin/bash
# Description: Verify that all content type handling scenarios are covered and tested.

# Test: Search for all occurrences of ContentType handling logic in the codebase.
rg --type rust -A 5 $'ContentType::'
Length of output: 783
src/pipeline/tests/pipeline.rs (2)

17-18: Addition of the TimestampNanosecondValue variant looks good.

The new TimestampNanosecondValue variant is correctly defined and integrated into the ValueData enum.

460-510: Update in test_simple_data function looks good.

The function correctly tests the new TimestampNanosecondValue variant.

src/pipeline/src/manager/table.rs (4)

29-29: LGTM! Import added for dataframe handling.

The use query::dataframe::DataFrame import is necessary for the refactored dataframe handling.

36-52: LGTM! Import added for refactored function.

The prepare_dataframe_conditions function is imported from the util module, indicating the usage of a refactored function for preparing dataframe conditions.

339-350: LGTM! Refactored dataframe preparation.

The dataframe preparation logic has been refactored to use the query_engine to read the table and apply a filter using prepare_dataframe_conditions. This improves clarity and modularity.

408-431: LGTM! Refactored dataframe preparation.

The dataframe preparation logic has been refactored to use the query_engine to read the table and apply a filter using prepare_dataframe_conditions. This improves clarity and modularity.

tests-integration/tests/http.rs (1)

1133-1231: LGTM! Comprehensive test for plain text ingestion.

The new public async function test_plain_text_ingestion is well-structured and comprehensive, covering the entire workflow of creating a pipeline, ingesting data, and verifying the results.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f87472a and cdf0bee.

Files selected for processing (1)

src/servers/src/http/event.rs (2 hunks)

Files skipped from review as they are similar to previous changes (1)

src/servers/src/http/event.rs

codecov · 2024-07-11T05:28:43Z

Codecov Report

Attention: Patch coverage is 93.54839% with 6 lines in your changes missing coverage. Please review.

Project coverage is 84.92%. Comparing base (da0c840) to head (db0e1ec).
Report is 8 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4300      +/-   ##
==========================================
- Coverage   85.18%   84.92%   -0.26%     
==========================================
  Files        1060     1061       +1     
  Lines      189069   189497     +428     
==========================================
- Hits       161057   160929     -128     
- Misses      28012    28568     +556
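
The percentages in the report follow from hits divided by lines. The short sketch below reproduces them from the numbers in the diff. The patch figure assumes 93 changed lines with 87 covered, which is inferred from "93.54839% with 6 lines missing" rather than stated in the report.

```rust
/// Coverage as a percentage of covered lines.
/// Note: returns NaN when `lines` is zero; callers should guard for that.
fn coverage_percent(hits: u64, lines: u64) -> f64 {
    hits as f64 / lines as f64 * 100.0
}

/// Format the three figures discussed in the report.
fn report() -> (String, String, String) {
    let base = format!("{:.2}", coverage_percent(161_057, 189_069));
    let head = format!("{:.2}", coverage_percent(160_929, 189_497));
    // Assumption: 93 patch lines, 6 of them uncovered.
    let patch = format!("{:.5}", coverage_percent(93 - 6, 93));
    (base, head, patch)
}
```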

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between cdf0bee and 37d0179.

Files selected for processing (1)

src/servers/src/http/event.rs (4 hunks)

Files skipped from review as they are similar to previous changes (1)

src/servers/src/http/event.rs

coderabbitai

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (2)

src/pipeline/src/etl/processor.rs (1)
96-108: Unit tests missing for Value::String handling in exec_array function

The search did not find any direct unit tests for the exec_array function or the Processor trait handling Value::String. It is essential to add specific unit tests to cover this new branch.

Add unit tests for the exec_array function to ensure Value::String handling is correctly implemented.

Verify the correctness of this implementation with the new unit tests.

Analysis chain

LGTM! Verify the correctness with unit tests.

The changes to handle Value::String in the exec_array function are correct. Ensure that unit tests cover this new branch.
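
To make the kind of branch being discussed concrete, here is a toy model of an array executor that accepts both map and string elements. The `Value` enum, the `exec_array` signature, and the choice to wrap each string under a single field name are all assumptions for illustration; they do not reproduce the actual `Processor` trait.

```rust
/// Toy value type (hypothetical, much smaller than the real one).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    String(String),
    Map(Vec<(String, Value)>),
    Array(Vec<Value>),
}

/// Hypothetical array executor: maps pass through unchanged, strings are
/// wrapped into a single-field map, anything else is rejected.
fn exec_array(items: Vec<Value>, field: &str) -> Result<Value, String> {
    let mut out = Vec::with_capacity(items.len());
    for item in items {
        match item {
            Value::Map(map) => out.push(Value::Map(map)),
            // The branch of interest: a plain string element.
            Value::String(s) => {
                out.push(Value::Map(vec![(field.to_owned(), Value::String(s))]))
            }
            other => return Err(format!("unsupported array element: {:?}", other)),
        }
    }
    Ok(Value::Array(out))
}
```

A unit test for the new branch would then pass an array of strings and check that each element comes back as a map, plus one negative case for an unsupported element type.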
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the correctness of the `Value::String` handling in the `exec_array` function.

# Test: Search for unit tests for the exec_array function. Expect: Tests cover the new branch.
rg --type rust -A 5 'fn exec_array' --glob '*_test.rs'
Length of output: 56

Script:
#!/bin/bash
# Description: Find all test files related to the Processor trait and check for tests handling Value::String in the exec_array function.

# Find all test files related to the Processor trait.
rg --type rust 'Processor' --glob '*_test.rs'

# Check for tests handling Value::String in the exec_array function.
rg --type rust 'Value::String' --glob '*_test.rs'
Length of output: 228
src/servers/src/http/event.rs (1)
265-291: No unit tests found for extract_pipeline_value_by_content_type function.

There are no unit tests covering the extract_pipeline_value_by_content_type function. It's important to add unit tests to ensure all branches of this function are properly tested.

Add unit tests to cover JSON content type handling.

Add unit tests to cover text content type handling.

Add unit tests for unsupported content types.

Analysis chain

LGTM! Verify the correctness with unit tests.

The extract_pipeline_value_by_content_type function correctly handles JSON and text content types and returns errors for unsupported content types. Ensure unit tests cover all branches.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the correctness of the `extract_pipeline_value_by_content_type` function.

# Test: Search for unit tests for the extract_pipeline_value_by_content_type function. Expect: Tests cover all branches.
rg --type rust -A 5 'fn extract_pipeline_value_by_content_type' --glob '*_test.rs'
Length of output: 84

Script:
#!/bin/bash
# Description: Verify the presence of tests for the `extract_pipeline_value_by_content_type` function.

# Test: Search for any tests related to the `extract_pipeline_value_by_content_type` function across all Rust files.
rg --type rust 'extract_pipeline_value_by_content_type' -A 10
Length of output: 1363

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 37d0179 and db0e1ec.

Files selected for processing (2)

src/pipeline/src/etl/processor.rs (1 hunks)
src/servers/src/http/event.rs (4 hunks)

Additional comments not posted (1)

src/servers/src/http/event.rs (1)
252-252: LGTM! Ensure integration tests cover this new functionality.

The changes to the log_ingester function to handle text/plain content type are correct.

Verification successful

Integration tests cover the new functionality for text/plain content type.

The test_plain_text_ingestion function in tests-integration/tests/http.rs indirectly tests the log_ingester function for text/plain content type, confirming that the new functionality is covered by integration tests.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Ensure integration tests cover the new functionality for text/plain content type.

# Test: Search for integration tests for the log_ingester function. Expect: Tests cover the new functionality.
rg --type rust -A 5 'fn log_ingester' --glob '*_test.rs'
Length of output: 58

Script:
#!/bin/bash
# Description: Ensure integration tests cover the new functionality for text/plain content type.

# Test: Search for any references to log_ingester in all Rust test files.
rg --type rust 'log_ingester' --glob '*_test.rs'
Length of output: 48

Script:
#!/bin/bash
# Description: Ensure integration tests cover the new functionality for text/plain content type.

# Test: Search for any references to 'text/plain' content type in all Rust test files.
rg --type rust 'text/plain' --glob '*_test.rs'
Length of output: 46

Script:
#!/bin/bash
# Description: Ensure integration tests cover the new functionality for text/plain content type.

# Test: Search for any references to 'log_ingester' throughout the entire codebase.
rg --type rust 'log_ingester'

# Test: Search for any references to 'text/plain' throughout the entire codebase.
rg --type rust 'text/plain'
Length of output: 365

Script:
#!/bin/bash
# Description: Inspect the tests-integration/tests/http.rs file for any tests that might indirectly cover the log_ingester functionality.

# Print the contents of tests-integration/tests/http.rs
cat tests-integration/tests/http.rs
Length of output: 38662

src/pipeline/src/etl/processor.rs

paomian · 2024-07-12T07:36:49Z

I'm in the process of refactoring the pipeline processors.
I don't think the incoming data needs to be the same type as the processors' intermediate state data,
so I'm going to change the signature of the pipeline processor functions.
I think this PR can be merged first if @sunng87 agrees. I'll make the changes later.

sunng87 · 2024-07-12T08:42:24Z

I agree. We can merge this first and discuss the next step afterwards.

github-actions bot added the docs-not-required label ("This change does not impact docs.") Jul 5, 2024

shuiyisong force-pushed the feat/support_plain_text_log branch from e08759b to 9fc3645 Compare July 8, 2024 04:28

shuiyisong added 2 commits July 11, 2024 11:55

feat: support text/plain format of log input

e4e1526

refactor: pipeline query and delete using dataframe api

f87472a

shuiyisong force-pushed the feat/support_plain_text_log branch from 9fc3645 to f87472a Compare July 11, 2024 04:54

shuiyisong marked this pull request as ready for review July 11, 2024 05:00

shuiyisong requested a review from a team as a code owner July 11, 2024 05:00

shuiyisong requested review from sunng87 and paomian July 11, 2024 05:00

coderabbitai bot reviewed Jul 11, 2024

shuiyisong changed the title ~~feat: support text/plain format of log input~~ feat: support text/plain format for log ingestion Jul 11, 2024

chore: minor refactor

cdf0bee

coderabbitai bot reviewed Jul 11, 2024

sunng87 reviewed Jul 11, 2024

refactor: skip jsonify when processing plan/text

37d0179

coderabbitai bot reviewed Jul 11, 2024

sunng87 reviewed Jul 11, 2024

refactor: support array(string) as pipeline engine input

db0e1ec

coderabbitai bot reviewed Jul 12, 2024

paomian reviewed Jul 12, 2024

sunng87 approved these changes Jul 12, 2024

shuiyisong mentioned this pull request Jul 12, 2024

chore: add pipeline and log doc GreptimeTeam/docs#1026

Merged

2 tasks

github-actions bot added the docs-required label ("This change requires docs update.") and removed the docs-not-required label ("This change does not impact docs.") Jul 12, 2024

sunng87 mentioned this pull request Jul 12, 2024

Update docs for feat: support text/plain format for log ingestion GreptimeTeam/docs#1054

Closed

paomian approved these changes Jul 12, 2024

shuiyisong added this pull request to the merge queue Jul 12, 2024

Merged via the queue into GreptimeTeam:main with commit 67dfdd6 Jul 12, 2024
57 checks passed

shuiyisong deleted the feat/support_plain_text_log branch July 12, 2024 09:35
