refactor(data-import): refactor source credential handlers #30260

Closed

EDsCODE wants to merge 3 commits into master from pipeline-source-logic-refactor

Conversation

@EDsCODE (Collaborator) commented Mar 21, 2025

Problem

  • Source handlers are growing out of control
  • There is no diff check API

Changes

  • Refactor sources to a handler pattern
  • Add a diff check API and tests

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

How did you test this code?

@greptile-apps bot (Contributor) left a comment

PR Summary

This PR introduces a significant refactoring of data source handlers in PostHog, implementing a standardized pattern for managing external data source integrations.

Key changes:

  • Introduces base SourceHandler class in /pipelines/source/handlers.py to standardize credential validation and schema retrieval
  • Adds specialized handlers for multiple data sources (Stripe, Hubspot, BigQuery, etc.) with consistent validation patterns
  • Implements SSH tunnel support and improved error handling in SQL database handlers
  • Adds schema diff checking capabilities to detect changes between source and database
  • Consolidates common validation patterns and removes redundant code across handlers

The changes appear well-structured and improve maintainability, though there are some potential concerns around error handling consistency and edge cases in schema validation.

12 file(s) reviewed, 18 comment(s)
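
The schema diff check described in the summary can be sketched with set operations: compare the tables currently reported by the source against the tables already synced into the warehouse. The function and field names here are assumptions for illustration, not the actual PR code.

```python
# Illustrative sketch of a schema diff check between a source and the
# warehouse. "source_tables" would come from handler.get_schema_options(),
# "db_tables" from the existing synced schemas.
def diff_schemas(source_tables: list[str], db_tables: list[str]) -> dict[str, list[str]]:
    source, db = set(source_tables), set(db_tables)
    return {
        "new": sorted(source - db),      # present on the source, not yet synced
        "removed": sorted(db - source),  # synced before, now gone from the source
    }
```

A call like `diff_schemas(["a", "b", "c"], ["b", "c", "d"])` would report `"a"` as new and `"d"` as removed.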

Comment on lines +10 to +11
api_key = self.request_data.get("api_key", "")
site_name = self.request_data.get("site_name", "")

style: consider trimming whitespace from api_key and site_name inputs, as seen in test_trimming_payload

Suggested change
- api_key = self.request_data.get("api_key", "")
- site_name = self.request_data.get("site_name", "")
+ api_key = self.request_data.get("api_key", "").strip()
+ site_name = self.request_data.get("site_name", "").strip()
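
The suggestion matters because keys pasted from a dashboard often carry stray whitespace or a trailing newline, which would make an otherwise valid key fail validation. A minimal illustration:

```python
# Why trimming matters: the raw payload may contain copy-paste whitespace.
raw = {"api_key": "  sk_live_abc \n", "site_name": " acme "}

api_key = raw.get("api_key", "").strip()
site_name = raw.get("site_name", "").strip()
# api_key == "sk_live_abc", site_name == "acme"
```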

Comment on lines +8 to +9
if not salesforce_integration_id:
return False, "Missing required parameters: Salesforce integration ID"

style: Error message inconsistent with other handlers - should be 'Invalid credentials: Salesforce integration ID is missing' to match pattern

for column_name, column_type in columns
],
"incremental_available": True,
"incremental_field": columns[0][0] if len(columns) > 0 and len(columns[0]) > 0 else None,

style: Assumes first column is always suitable for incremental field. Should validate column type is appropriate for incremental updates.

Comment on lines +17 to +19
auth_type_obj = self.request_data.get("auth_type", {})
auth_type = auth_type_obj.get("selection", None)
auth_type_username = auth_type_obj.get("username", None)

logic: auth_type is allowed to be None here, but there's no validation to ensure a valid auth_type is provided. This could lead to authentication failures.

Suggested change
- auth_type_obj = self.request_data.get("auth_type", {})
- auth_type = auth_type_obj.get("selection", None)
- auth_type_username = auth_type_obj.get("username", None)
+ auth_type_obj = self.request_data.get("auth_type", {})
+ auth_type = auth_type_obj.get("selection", None)
+ if auth_type not in ["password", "keypair"]:
+     return False, "Invalid auth_type: must be 'password' or 'keypair'"
+ auth_type_username = auth_type_obj.get("username", None)


class StripeSourceHandler(SourceHandler):
def validate_credentials(self) -> tuple[bool, str | None]:
key = self.request_data.get("stripe_secret_key", "")

logic: Missing key existence check before validation. Empty string will be used if key doesn't exist.

Comment on lines +14 to +15
subdomain_regex = re.compile("^[a-zA-Z-]+$")
if region == "US" and not subdomain_regex.match(subdomain):

logic: Subdomain regex will reject valid subdomains containing numbers. Consider using ^[a-zA-Z0-9-]+$ instead

Suggested change
- subdomain_regex = re.compile("^[a-zA-Z-]+$")
+ subdomain_regex = re.compile("^[a-zA-Z0-9-]+$")
  if region == "US" and not subdomain_regex.match(subdomain):
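
The difference is easy to demonstrate: the original pattern rejects any subdomain containing a digit, even though such subdomains are valid.

```python
import re

old = re.compile(r"^[a-zA-Z-]+$")     # letters and hyphens only
new = re.compile(r"^[a-zA-Z0-9-]+$")  # also allows digits

assert old.match("acme") and new.match("acme")
assert not old.match("acme2")  # a valid subdomain, wrongly rejected by old
assert new.match("acme2")
```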

Comment on lines +15 to +16
if region == "US" and not subdomain_regex.match(subdomain):
return False, "Invalid credentials: Vitally subdomain is incorrect"

logic: Subdomain validation only happens for US region. Should validate for all regions or document why US-only

api_key = self.request_data.get("api_key", "")
email_address = self.request_data.get("email_address", "")

subdomain_regex = re.compile("^[a-zA-Z-]+$")

logic: regex pattern could be too restrictive - some Zendesk subdomains may contain numbers

Suggested change
- subdomain_regex = re.compile("^[a-zA-Z-]+$")
+ subdomain_regex = re.compile("^[a-zA-Z0-9-]+$")

Comment on lines +10 to +12
subdomain = self.request_data.get("subdomain", "")
api_key = self.request_data.get("api_key", "")
email_address = self.request_data.get("email_address", "")

style: consider validating that none of these required fields are empty before proceeding with validation
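
A common way to implement this suggestion is a small helper that reports every missing or blank field at once, so the user fixes the whole payload in one round trip. The helper name is hypothetical, not from the PR.

```python
# Hypothetical helper: collect all required fields that are absent or blank,
# instead of failing on the first one.
def missing_fields(data: dict, required: list[str]) -> list[str]:
    return [f for f in required if not str(data.get(f, "")).strip()]

errors = missing_fields(
    {"subdomain": "acme", "api_key": "", "email_address": "a@b.com"},
    ["subdomain", "api_key", "email_address"],
)
# errors == ["api_key"]
```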

Comment on lines +1089 to +1090
current_schemas = handler.get_schema_options()
current_schemas = [schema["table"] for schema in current_schemas]

logic: schema["table"] access could raise KeyError if schema options format changes

Suggested change
  current_schemas = handler.get_schema_options()
- current_schemas = [schema["table"] for schema in current_schemas]
+ current_schemas = [schema.get("table") for schema in current_schemas if isinstance(schema, dict) and schema.get("table")]
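
The defensive version can be illustrated in isolation: with malformed entries mixed in, direct `schema["table"]` indexing raises, while the filtered `.get` form simply skips them.

```python
# Demonstration of defensive access on a heterogeneous list of schema options.
schemas = [{"table": "users"}, {"name": "orphan"}, "garbage"]

# [s["table"] for s in schemas] would raise KeyError on the second entry
# (and TypeError on the third). The guarded form skips both:
safe = [s.get("table") for s in schemas if isinstance(s, dict) and s.get("table")]
# safe == ["users"]
```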

return Response(
    status=status.HTTP_400_BAD_REQUEST,
    data={"message": str(e)},
)

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix

AI 11 months ago

To fix the problem, we need to ensure that detailed exception messages are not exposed to the end user. Instead, we should log the detailed error message on the server and return a generic error message to the user. This can be achieved by modifying the exception handling block to log the exception and return a generic error message.

Specifically, we will:

  1. Log the detailed exception message using logger.exception.
  2. Return a generic error message in the HTTP response.
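
The two steps above boil down to a pattern like the following. The logger name, validator, and return shape here are illustrative assumptions, not the actual PostHog view code.

```python
import logging

logger = logging.getLogger(__name__)

def check_credentials(payload: dict) -> None:
    # Hypothetical validator; the message may contain internal detail.
    if not payload.get("api_key"):
        raise ValueError(f"secret internals: {payload!r}")

def validate_source(source_type: str, payload: dict) -> tuple[int, dict]:
    try:
        check_credentials(payload)
    except ValueError:
        # Full detail (including the stack trace) goes to server logs only...
        logger.exception("Validation error for source type %s", source_type)
        # ...while the client sees a generic message.
        return 400, {"message": "Validation error occurred"}
    return 200, {"message": "ok"}
```

The key property is that `str(e)` never reaches the HTTP response, so internal details cannot leak to an external user.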
Suggested changeset 1
posthog/warehouse/api/external_data_source.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/posthog/warehouse/api/external_data_source.py b/posthog/warehouse/api/external_data_source.py
--- a/posthog/warehouse/api/external_data_source.py
+++ b/posthog/warehouse/api/external_data_source.py
@@ -978,5 +978,6 @@
         except ValidationError as e:
+            logger.exception(f"Validation error for source type {source_type}", exc_info=e)
             return Response(
                 status=status.HTTP_400_BAD_REQUEST,
-                data={"message": str(e)},
+                data={"message": "Validation error occurred"},
             )
@@ -987,3 +988,3 @@
                 status=status.HTTP_400_BAD_REQUEST,
-                data={"message": f"Error handling source type {source_type}"},
+                data={"message": "An internal error has occurred"},
             )
EOF
Unable to commit as this autofix suggestion is now outdated
logger.exception(f"Error checking schema changes for source type {source_type}", exc_info=e)
return Response(
    status=status.HTTP_400_BAD_REQUEST,
    data={"message": f"Error checking schema changes: {str(e)}"},
)

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix

AI 11 months ago

To fix the problem, we need to replace the detailed error message returned to the user with a more generic message. This can be done by modifying the response data in the exception handling block. The detailed error message should be logged using logger.exception, which already captures the stack trace.

Steps to fix:

  1. Modify the response data in the exception handling block to return a generic error message.
  2. Ensure that the detailed error message is logged for debugging purposes.
Suggested changeset 1
posthog/warehouse/api/external_data_source.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/posthog/warehouse/api/external_data_source.py b/posthog/warehouse/api/external_data_source.py
--- a/posthog/warehouse/api/external_data_source.py
+++ b/posthog/warehouse/api/external_data_source.py
@@ -1124,3 +1124,3 @@
                 status=status.HTTP_400_BAD_REQUEST,
-                data={"message": f"Error checking schema changes: {str(e)}"},
+                data={"message": "An internal error has occurred while checking schema changes."},
             )
EOF
Unable to commit as this autofix suggestion is now outdated
@EDsCODE marked this pull request as draft on March 21, 2025 at 13:16
@posthog-bot (Contributor)

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week. If you want to permanently keep it open, use the waiting label.

@posthog-bot (Contributor)

This PR was closed due to lack of activity. Feel free to reopen if it's still relevant.

@posthog-bot posthog-bot closed this Apr 7, 2025