feat: support text/plain format for log ingestion #4300

Merged: 5 commits into GreptimeTeam:main from feat/support_plain_text_log on Jul 12, 2024

Conversation

shuiyisong
Contributor

@shuiyisong shuiyisong commented Jul 5, 2024

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

This PR:

  • adds support for the text/plain and text/plain; charset=utf-8 content types in log ingestion
  • refactors the pipeline find and delete APIs to use the DataFrame API

You can now ingest log data as plain text, for example:

url="http://localhost:4000/v1/events/logs?table=$table_name&pipeline_name=$pipeline_name"
curl -X POST "$url" -H "Content-Type: text/plain" \
 -d '
2024-05-25 20:16:37.217 hello
2024-05-25 20:16:37.218 hello world
'
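
To illustrate how a body like the one in the curl example might be broken into individual log entries, here is a minimal Rust sketch. The helper name `split_log_lines` is hypothetical, and skipping blank lines (such as the leading newline in the example body) is an assumption, not a confirmed detail of the server's behavior.

```rust
/// Split a `text/plain` request body into individual log lines.
///
/// Hypothetical helper, not taken from the GreptimeDB source. It assumes
/// one log entry per line and that blank lines carry no data.
fn split_log_lines(body: &str) -> Vec<String> {
    body.lines() // handles both "\n" and "\r\n" line endings
        .filter(|line| !line.trim().is_empty()) // assumption: skip blank lines
        .map(String::from)
        .collect()
}
```

Under these assumptions the example body yields two entries, each of which would then be handed to the pipeline named by `pipeline_name`.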

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.

Summary by CodeRabbit

  • New Features

    • Introduced a new error type, DataFrame, for more detailed error reporting related to data frames.
    • Added TimestampNanosecondValue support to handle more precise timestamp data in tests.
    • Added plaintext ingestion test for HTTP pipelines.
  • Refactor

    • Improved data frame preparation and filtering logic in the PipelineTable module.
    • Updated content type processing methods in HTTP event handling.
  • Bug Fixes

    • Enhanced error handling to include new data frame-related errors.

Contributor

coderabbitai bot commented Jul 5, 2024

Walkthrough

The update introduces a new error variant, DataFrame, in the manager module's error.rs file, along with its implementation in ErrorExt. Various files in the pipeline project, including table.rs and util.rs, have been refactored for improved dataframe handling. Additional changes include updates to enums, test functions, content type handling in http/event.rs, and the integration of a new HTTP test function for plain text ingestion.

Changes

| File(s) | Change Summary |
|---|---|
| src/pipeline/src/manager/error.rs | Added DataFrame error variant and updated ErrorExt implementation. |
| src/pipeline/src/manager/table.rs | Refactored dataframe preparation and execution logic, and added prepare_dataframe_conditions function. |
| src/pipeline/src/manager/util.rs | Refactored build_plan_filter to prepare_dataframe_conditions. |
| src/pipeline/tests/pipeline.rs | Added TimestampNanosecondValue variant to ValueData enum and updated test_simple_data function. |
| src/servers/src/http/event.rs | Updated content type handling and processing methods for payload. |
| src/pipeline/src/etl/processor.rs | Added logic branch to Processor trait for handling Value::String. |
| tests-integration/tests/http.rs | Added test_plain_text_ingestion function to the HTTP test suite. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTPServer
    participant PipelineManager
    participant DataFrameHandler

    Client->>HTTPServer: Send plain text data ingestion request
    HTTPServer->>PipelineManager: Forward request data
    PipelineManager->>DataFrameHandler: Prepare and execute dataframe
    DataFrameHandler-->>PipelineManager: Return dataframe result
    PipelineManager-->>HTTPServer: Return processed data response
    HTTPServer-->>Client: Send response with ingestion result

Poem

In the land of data, changes sprout,
A new variant to clear the doubt,
DataFrames now handle errors bright,
Tests and logic take new flight.
HTTP sings a fresher tune,
As pipelines hum a smoother croon.



@github-actions github-actions bot added the docs-not-required ("This change does not impact docs.") label Jul 5, 2024
@shuiyisong shuiyisong force-pushed the feat/support_plain_text_log branch from e08759b to 9fc3645 July 8, 2024 04:28
@shuiyisong shuiyisong marked this pull request as ready for review July 11, 2024 05:00
@shuiyisong shuiyisong requested a review from a team as a code owner July 11, 2024 05:00
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between da0c840 and f87472a.

Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
Files selected for processing (6)
  • src/pipeline/src/manager/error.rs (2 hunks)
  • src/pipeline/src/manager/table.rs (5 hunks)
  • src/pipeline/src/manager/util.rs (2 hunks)
  • src/pipeline/tests/pipeline.rs (2 hunks)
  • src/servers/src/http/event.rs (2 hunks)
  • tests-integration/tests/http.rs (2 hunks)
Additional comments not posted (11)
src/pipeline/src/manager/util.rs (1)

37-52: LGTM! But verify the function usage in the codebase.

The logic for constructing filter expressions and combining them using Expr::and is correct and efficient.

However, ensure that all invocations of prepare_dataframe_conditions match the new signature and expected parameters.
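
The idea of building one equality filter per key and folding them together with `and` can be sketched as follows. This is a toy model only: the `Expr` enum, the `prepare_conditions` function, the column names, and the optional `version` parameter are all hypothetical stand-ins, not the query engine's real types or the actual signature of prepare_dataframe_conditions.

```rust
/// Toy filter expression; a stand-in for a real query-engine expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// `column = literal`
    Eq(String, String),
    /// Logical conjunction of two expressions.
    And(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Combine two expressions with a logical AND.
    fn and(self, other: Expr) -> Expr {
        Expr::And(Box::new(self), Box::new(other))
    }
}

/// Build one equality condition per key and fold them into a single filter.
/// Column names and the optional `version` are assumptions for illustration.
fn prepare_conditions(schema: &str, name: &str, version: Option<&str>) -> Expr {
    let mut conditions = vec![
        Expr::Eq("schema".to_string(), schema.to_string()),
        Expr::Eq("name".to_string(), name.to_string()),
    ];
    if let Some(v) = version {
        conditions.push(Expr::Eq("version".to_string(), v.to_string()));
    }
    // Left fold: ((schema AND name) AND version).
    conditions
        .into_iter()
        .reduce(Expr::and)
        .expect("the vector always holds at least two conditions")
}
```

Folding with `reduce` keeps the function open to additional conditions without changing how they are combined.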

Verification successful

The function usage looks correct, but ensure the parameters are defined and passed correctly.

The function prepare_dataframe_conditions is invoked in src/pipeline/src/manager/table.rs with the parameters schema, name, and version. Verify these parameters are correctly defined and used.


Verified: All invocations of prepare_dataframe_conditions are correct and consistent with the new signature.

The parameters schema, name, and version are defined and used appropriately in src/pipeline/src/manager/table.rs.

  • schema: Verified definition and usage.
  • name: Verified definition and usage.
  • version: Verified definition and usage.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `prepare_dataframe_conditions` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type rust -A 5 $'prepare_dataframe_conditions'

Length of output: 1920


Script:

#!/bin/bash
# Description: Verify definitions and usages of `schema`, `name`, and `version` in `table.rs`.

# Search for the definitions and usages of `schema`.
rg --type rust -A 5 $'schema' src/pipeline/src/manager/table.rs

# Search for the definitions and usages of `name`.
rg --type rust -A 5 $'name' src/pipeline/src/manager/table.rs

# Search for the definitions and usages of `version`.
rg --type rust -A 5 $'version' src/pipeline/src/manager/table.rs

Length of output: 18412

src/pipeline/src/manager/error.rs (2)

84-89: Addition of the DataFrame error variant looks good.

The new DataFrame error variant is correctly defined with appropriate fields.


136-136: Update in ErrorExt implementation looks good.

The new DataFrame error variant is correctly handled in the ErrorExt implementation.

src/servers/src/http/event.rs (1)

252-264: LGTM! But verify the content type handling logic.

The content type handling logic for text/plain and text/plain; charset=utf-8 is correct and efficient.
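
As an illustration of accepting both text/plain and text/plain; charset=utf-8, the check below compares the media type without its parameters and then inspects any charset parameter. The function `is_utf8_plain_text` is a hypothetical helper written for this sketch; the handler in event.rs may well rely on typed header values instead of string parsing.

```rust
/// Return true for `text/plain` with no charset or with `charset=utf-8`.
///
/// Hypothetical helper for illustration; matching is case-insensitive and
/// parameters other than `charset` are ignored (an assumption).
fn is_utf8_plain_text(content_type: &str) -> bool {
    let mut parts = content_type.split(';').map(str::trim);
    // The first segment is the media type itself, e.g. "text/plain".
    let essence = parts.next().unwrap_or("");
    if !essence.eq_ignore_ascii_case("text/plain") {
        return false;
    }
    // Every remaining segment is a `key=value` parameter.
    parts.all(|param| match param.split_once('=') {
        Some((key, value)) if key.trim().eq_ignore_ascii_case("charset") => value
            .trim()
            .trim_matches('"')
            .eq_ignore_ascii_case("utf-8"),
        _ => true,
    })
}
```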

However, ensure that all content type handling scenarios are covered and tested.

Verification successful

LGTM! But verify the content type handling logic.

The content type handling logic for text/plain and text/plain; charset=utf-8 is correct and efficient. Additionally, the current implementation covers all necessary content types as per the codebase.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all content type handling scenarios are covered and tested.

# Test: Search for the content type handling logic. Expect: Only occurrences of the new content type handling logic.
rg --type rust -A 5 $'ContentType::text'

Length of output: 495


Script:

#!/bin/bash
# Description: Verify that all content type handling scenarios are covered and tested.

# Test: Search for all occurrences of ContentType handling logic in the codebase.
rg --type rust -A 5 $'ContentType::'

Length of output: 783

src/pipeline/tests/pipeline.rs (2)

17-18: Addition of the TimestampNanosecondValue variant looks good.

The new TimestampNanosecondValue variant is correctly defined and integrated into the ValueData enum.


460-510: Update in test_simple_data function looks good.

The function correctly tests the new TimestampNanosecondValue variant.

src/pipeline/src/manager/table.rs (4)

29-29: LGTM! Import added for dataframe handling.

The use query::dataframe::DataFrame import is necessary for the refactored dataframe handling.


36-52: LGTM! Import added for refactored function.

The prepare_dataframe_conditions function is imported from the util module, indicating the usage of a refactored function for preparing dataframe conditions.


339-350: LGTM! Refactored dataframe preparation.

The dataframe preparation logic has been refactored to use the query_engine to read the table and apply a filter using prepare_dataframe_conditions. This improves clarity and modularity.


408-431: LGTM! Refactored dataframe preparation.

The dataframe preparation logic has been refactored to use the query_engine to read the table and apply a filter using prepare_dataframe_conditions. This improves clarity and modularity.

tests-integration/tests/http.rs (1)

1133-1231: LGTM! Comprehensive test for plain text ingestion.

The new public async function test_plain_text_ingestion is well-structured and comprehensive, covering the entire workflow of creating a pipeline, ingesting data, and verifying the results.

@shuiyisong shuiyisong changed the title from "feat: support text/plain format of log input" to "feat: support text/plain format for log ingestion" Jul 11, 2024
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f87472a and cdf0bee.

Files selected for processing (1)
  • src/servers/src/http/event.rs (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/servers/src/http/event.rs


codecov bot commented Jul 11, 2024

Codecov Report

Attention: Patch coverage is 93.54839% with 6 lines in your changes missing coverage. Please review.

Project coverage is 84.92%. Comparing base (da0c840) to head (db0e1ec).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4300      +/-   ##
==========================================
- Coverage   85.18%   84.92%   -0.26%     
==========================================
  Files        1060     1061       +1     
  Lines      189069   189497     +428     
==========================================
- Hits       161057   160929     -128     
- Misses      28012    28568     +556     

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between cdf0bee and 37d0179.

Files selected for processing (1)
  • src/servers/src/http/event.rs (4 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/servers/src/http/event.rs

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (2)
src/pipeline/src/etl/processor.rs (1)

96-108: Unit tests missing for Value::String handling in exec_array function

The search did not find any direct unit tests for the exec_array function or the Processor trait handling Value::String. It is essential to add specific unit tests to cover this new branch.

  • Add unit tests for the exec_array function to ensure Value::String handling is correctly implemented.
  • Verify the correctness of this implementation with the new unit tests.
Analysis chain

LGTM! Verify the correctness with unit tests.

The changes to handle Value::String in the exec_array function are correct. Ensure that unit tests cover this new branch.
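
One way to picture the new branch, and the kind of unit test the review asks for, is the toy model below. Everything in it is an assumption made for illustration: the `Value` enum, the `Processor` trait shape, the `Upper` processor, and in particular the choice to wrap a bare string under the processor's input field. The real trait in processor.rs may treat Value::String differently.

```rust
use std::collections::BTreeMap;

/// Toy value type with only the two variants this sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    String(String),
    Map(BTreeMap<String, Value>),
}

/// Simplified, hypothetical processor trait.
trait Processor {
    /// Name of the single field this processor reads.
    fn field(&self) -> &str;

    /// Process one map-shaped record.
    fn exec_map(&self, map: BTreeMap<String, Value>) -> Result<Value, String>;

    /// Process a batch; maps are passed through, and bare strings are first
    /// wrapped under the processor's input field (assumed behavior).
    fn exec_array(&self, items: Vec<Value>) -> Result<Vec<Value>, String> {
        items
            .into_iter()
            .map(|item| match item {
                Value::Map(map) => self.exec_map(map),
                Value::String(line) => {
                    let mut map = BTreeMap::new();
                    map.insert(self.field().to_string(), Value::String(line));
                    self.exec_map(map)
                }
            })
            .collect()
    }
}

/// Example processor that uppercases the string stored in its field.
struct Upper {
    field: String,
}

impl Processor for Upper {
    fn field(&self) -> &str {
        &self.field
    }

    fn exec_map(&self, mut map: BTreeMap<String, Value>) -> Result<Value, String> {
        match map.get(&self.field) {
            Some(Value::String(s)) => {
                let upper = s.to_uppercase();
                map.insert(self.field.clone(), Value::String(upper));
                Ok(Value::Map(map))
            }
            _ => Err(format!("field `{}` is missing or not a string", self.field)),
        }
    }
}
```

A test for the string branch would pass `vec![Value::String(...)]` to `exec_array` and assert on the resulting map, alongside a test that a map lacking the field produces an error.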

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness of the `Value::String` handling in the `exec_array` function.

# Test: Search for unit tests for the exec_array function. Expect: Tests cover the new branch.
rg --type rust -A 5 'fn exec_array' --glob '*_test.rs'

Length of output: 56


Script:

#!/bin/bash
# Description: Find all test files related to the Processor trait and check for tests handling Value::String in the exec_array function.

# Find all test files related to the Processor trait.
rg --type rust 'Processor' --glob '*_test.rs'

# Check for tests handling Value::String in the exec_array function.
rg --type rust 'Value::String' --glob '*_test.rs'

Length of output: 228

src/servers/src/http/event.rs (1)

265-291: No unit tests found for extract_pipeline_value_by_content_type function.

There are no unit tests covering the extract_pipeline_value_by_content_type function. It's important to add unit tests to ensure all branches of this function are properly tested.

  • Add unit tests to cover JSON content type handling.
  • Add unit tests to cover text content type handling.
  • Add unit tests for unsupported content types.
Analysis chain

LGTM! Verify the correctness with unit tests.

The extract_pipeline_value_by_content_type function correctly handles JSON and text content types and returns errors for unsupported content types. Ensure unit tests cover all branches.
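
The three branches the review wants covered (JSON, text, unsupported) can be modeled with a small dispatch function. This is a sketch under stated assumptions: `Payload`, `IngestError`, and `extract_payload` are hypothetical names, the JSON branch keeps the raw text instead of parsing it, and the real extract_pipeline_value_by_content_type may have a different signature and return type.

```rust
/// Hypothetical result of decoding a request body.
#[derive(Debug, PartialEq)]
enum Payload {
    /// Raw JSON text, left unparsed in this sketch.
    Json(String),
    /// One entry per non-blank line of a plain-text body.
    Lines(Vec<String>),
}

/// Hypothetical error for content types the handler does not accept.
#[derive(Debug, PartialEq)]
enum IngestError {
    UnsupportedContentType(String),
}

/// Dispatch on the media type, ignoring parameters such as `charset`.
fn extract_payload(content_type: &str, body: &str) -> Result<Payload, IngestError> {
    let essence = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match essence.as_str() {
        "application/json" => Ok(Payload::Json(body.to_string())),
        "text/plain" => Ok(Payload::Lines(
            body.lines()
                .filter(|line| !line.trim().is_empty())
                .map(String::from)
                .collect(),
        )),
        other => Err(IngestError::UnsupportedContentType(other.to_string())),
    }
}
```

With a function of this shape, each of the three requested unit tests reduces to one call and one assertion on the returned variant.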

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness of the `extract_pipeline_value_by_content_type` function.

# Test: Search for unit tests for the extract_pipeline_value_by_content_type function. Expect: Tests cover all branches.
rg --type rust -A 5 'fn extract_pipeline_value_by_content_type' --glob '*_test.rs'

Length of output: 84


Script:

#!/bin/bash
# Description: Verify the presence of tests for the `extract_pipeline_value_by_content_type` function.

# Test: Search for any tests related to the `extract_pipeline_value_by_content_type` function across all Rust files.
rg --type rust 'extract_pipeline_value_by_content_type' -A 10

Length of output: 1363

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 37d0179 and db0e1ec.

Files selected for processing (2)
  • src/pipeline/src/etl/processor.rs (1 hunks)
  • src/servers/src/http/event.rs (4 hunks)
Additional comments not posted (1)
src/servers/src/http/event.rs (1)

252-252: LGTM! Ensure integration tests cover this new functionality.

The changes to the log_ingester function to handle text/plain content type are correct.

Verification successful

Integration tests cover the new functionality for text/plain content type.

The test_plain_text_ingestion function in tests-integration/tests/http.rs indirectly tests the log_ingester function for text/plain content type, confirming that the new functionality is covered by integration tests.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Ensure integration tests cover the new functionality for text/plain content type.

# Test: Search for integration tests for the log_ingester function. Expect: Tests cover the new functionality.
rg --type rust -A 5 'fn log_ingester' --glob '*_test.rs'

Length of output: 58


Script:

#!/bin/bash
# Description: Ensure integration tests cover the new functionality for text/plain content type.

# Test: Search for any references to log_ingester in all Rust test files.
rg --type rust 'log_ingester' --glob '*_test.rs'

Length of output: 48


Script:

#!/bin/bash
# Description: Ensure integration tests cover the new functionality for text/plain content type.

# Test: Search for any references to 'text/plain' content type in all Rust test files.
rg --type rust 'text/plain' --glob '*_test.rs'

Length of output: 46


Script:

#!/bin/bash
# Description: Ensure integration tests cover the new functionality for text/plain content type.

# Test: Search for any references to 'log_ingester' throughout the entire codebase.
rg --type rust 'log_ingester'

# Test: Search for any references to 'text/plain' throughout the entire codebase.
rg --type rust 'text/plain'

Length of output: 365


Script:

#!/bin/bash
# Description: Inspect the tests-integration/tests/http.rs file for any tests that might indirectly cover the log_ingester functionality.

# Print the contents of tests-integration/tests/http.rs
cat tests-integration/tests/http.rs

Length of output: 38662

@paomian
Contributor

paomian commented Jul 12, 2024

I'm in the process of refactoring the pipeline processors.
I don't think the incoming data needs to be the same type as the processors' intermediate state data,
so I'm going to change the signature of the pipeline processors' function.
I think this PR can be merged first if @sunng87 agrees. I'll make the changes later.

@sunng87
Member

sunng87 commented Jul 12, 2024

I agree. We can merge this first and discuss the next step.

@github-actions github-actions bot added the docs-required ("This change requires docs update.") label and removed the docs-not-required ("This change does not impact docs.") label Jul 12, 2024
@shuiyisong shuiyisong added this pull request to the merge queue Jul 12, 2024
Merged via the queue into GreptimeTeam:main with commit 67dfdd6 Jul 12, 2024
57 checks passed
@shuiyisong shuiyisong deleted the feat/support_plain_text_log branch July 12, 2024 09:35
Labels
docs-required This change requires docs update.