Import Plugins Discussion #6847
I think this is amazing work and makes a ton of sense! 💪 🎉 I'm wondering now:
Aside: The plugin config screenshot threw me off a bit because it says "the plugin will create a table with this name". Probably left over from the Redshift export plugin?
To answer this question effectively we should speak with our customers and understand which approach solves their needs better. My gut feeling is that the needs of self-hosted vs cloud-hosted users will be quite different: I would expect cloud users to want something quick to configure, while self-hosted users are comfortable creating a bespoke integration that meets their specific requirements. Given that our focus today is more aligned with self-hosting users, the bespoke integration approach is likely the area we should invest more in. But it would be great to validate; I have a few customers in mind we should ask.
This is cool =) There could be privacy concerns about revealing their underlying data schema (which writing a transformation to an open source repo could do). Just a bit more context from our Platform meeting yesterday: if I understood Jams correctly, it might be tricky to pull data (which an import plugin would do) because of permissions (requiring security review etc.), but sending data out is usually easy, and folks usually know their own data source and how to do that (i.e. the guides alternative: we'd provide an example Python script, Airflow job, ...). So if I've understood things correctly they could take our
+1 and for migrations between cloud and self hosted too in the future.
Why would one want to do that? I would assume the path is 1x Postgres -> ClickHouse and then just continuing to use ClickHouse only.
I did have some metrics from when I did PostHog/plugin-server#504. Guess I don't have them anymore, will do this.
Yup, definitely. PostHog/plugin-server#406 is coming soon so I've already been brainstorming on this, which falls under PostHog/plugin-server#414.
Agreed, like you mentioned before, a good call here is probably to build another one of these and see what has the most overlap.
Don't see why not. A few things like cursors would work differently but should have quite a bit of overlap.
Currently this is actually what it does. Should add a config option to import continuously or only up to the last event as per a "snapshot" (count) taken at the beginning of the import. @marcushyett-ph could you let me know who they are in private? Would like to move on this so can determine my sprint priorities and move fast with this.
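The snapshot idea above can be sketched roughly as follows. This is only an illustration in Python (the actual plugin is JavaScript), and `fetch_rows`, `transform`, and `capture` are hypothetical stand-ins for the plugin's Redshift query, its transformation, and the event capture call:

```python
def import_snapshot(fetch_rows, transform, capture, snapshot_count, batch_size=500):
    """Import rows only up to a count taken at the start of the import.

    fetch_rows(offset, limit) -> list of row dicts (hypothetical data source)
    transform(row)            -> a PostHog-style event dict
    capture(events)           -> sends one batch of events to PostHog
    """
    offset = 0
    imported = 0
    while imported < snapshot_count:
        # Never fetch past the snapshot boundary.
        limit = min(batch_size, snapshot_count - imported)
        rows = fetch_rows(offset, limit)
        if not rows:
            break  # source has fewer rows than the snapshot count
        capture([transform(row) for row in rows])
        offset += len(rows)
        imported += len(rows)
    return imported
```

A "continuous" mode would simply drop the `snapshot_count` bound and keep polling for new rows past the initial offset.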
This is super easy to mitigate. You could: a) Fork if self-hosting
When writing the data ingestion docs I thought of two high-level data ingestion scenarios. I can't remember who, but somebody correctly flagged that there's a third scenario. Altogether we have:
So, for importing data we need to address 2 and 3. Scenario 2 could be very process-intensive, as large amounts of data are pulled and transformed from a third-party system. Depending on the volume of data created in the third party and the frequency of the checks, scenario 3 could also require a lot of resources. @tiina303's comment raises a few questions/thoughts in my head:
These questions aside, I think the transformation approach is 👍
Imports make sense for pulling from other SaaS services
I think import plugins make a ton of sense for ingesting any data from a 3rd party service where the schema is set and there are no arbitrary or custom transformations. We should manage ingesting data where we know the schema. Things like ingesting from:
Imports do not make sense for pulling from data warehouses
I don't think import plugins make sense for importing data from warehouses. I don't think our plugin server is the right place to be doing the heavy lifting on what is effectively ETL.
Because of that, if I were evaluating PostHog I would absolutely decline to use import plugins, mainly because of the complexity and because it moves something that is squarely ETL out of my normal set of tools for building and managing ETL pipelines.
Pushing from your warehouse is the best way to get data out of your warehouse
Use something more push-based, relying on our existing libraries, and add examples of how to push to PostHog using tools like Airflow, Azkaban, or Luigi. This is beneficial because:
Next steps
The first integration I would pursue would be building out an operator for Airflow, or just having an example of how to build a DAG with PostHog on Airflow. Other examples that would be a priority:
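The core of such a push job can be a plain function that Airflow (or Luigi, Azkaban) schedules, e.g. via a PythonOperator. Everything below is a hedged sketch: `query_warehouse` and `post_batch` are hypothetical stand-ins for a real warehouse client and an HTTP helper posting to PostHog's batch endpoint, and the row/column names are assumptions:

```python
def push_to_posthog(query_warehouse, post_batch, api_key, since):
    """One scheduled run: pull new warehouse rows and push them as one batch.

    query_warehouse(since) -> list of row dicts (hypothetical warehouse client)
    post_batch(payload)    -> POSTs the payload to PostHog (hypothetical helper)
    """
    rows = query_warehouse(since)
    events = [
        {
            # Fall back to a generic event name if the row doesn't carry one.
            "event": row.get("event_name", "warehouse_row"),
            "distinct_id": str(row["user_id"]),
            # Everything else on the row becomes event properties.
            "properties": {
                k: v for k, v in row.items() if k not in ("user_id", "event_name")
            },
        }
        for row in rows
    ]
    if events:
        post_batch({"api_key": api_key, "batch": events})
    return len(events)
```

An Airflow DAG would then just call this function on a schedule, passing the last successful run's timestamp as `since`.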
Edit:
Thanks a lot @leggetter and @fuziontech for the thoughts. Following our meeting and some discussions with clients, I'm also convinced that most warehouse users prefer the push model instead of the pull model. As such, here are some practical next steps:
How do the above steps sound? @fuziontech @tiina303 @marcushyett-ph
My prioritisation process is based on:
If we feel that enabling existing, in-progress, and future customers to import from data warehouses will help us achieve one or more of the above, then I'll prioritise it. I believe it will. The other considerations for me are:
Interestingly, it just came to mind that down the line this might actually be one way to build the foundation of the "pluggable backends" we talked about in our call @fuziontech @tiina303 @marcushyett-ph. Instead of sitting on top of your DB Metabase-style and doing all the work to translate queries (plus all the maintenance work), we can just keep an open bridge, constantly bring data into PostHog in a format we understand, and query it "in our way". In fact, if you use the plugin as is, you could today effectively plug in Redshift as your backend for PostHog. 👀 Plus, all of the security issues that are relevant when talking about historical imports are irrelevant when it comes to pluggable backends, since you have to hand over the credentials anyway.
I'd love some context on pluggable backends @yakkomajuri if you concluded anything for / against. It would help with the story for customers / investors etc.
@jamesefhawkins in our call we talked about these being something we may or may not do in the longer-term future. I believe @marcushyett-ph mentioned that if we wanted to do this we'd probably need a team supporting each available database, which would be a lot of work. However, I'd love to get people's thoughts on this. I didn't think of it yesterday, but essentially plugins like the Redshift import one could be a way to get closer to something like "pluggable backends" much faster, although of course with its own limitations.
Regarding priorities, I would love it if we started with something for people to move from Postgres -> ClickHouse. A Python script for that sounds great (so I agree with Yakko's proposed priority). Perhaps we can create a generic Python script template and then a specific example (i.e. the Postgres -> ClickHouse migration).
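A generic template of that shape might look like the sketch below. This is only an illustration under stated assumptions: `read_rows`, `write_batch`, and `map_row` are hypothetical stand-ins for a real Postgres cursor, a ClickHouse insert, and the schema-mapping step:

```python
from itertools import islice

def migrate(read_rows, write_batch, map_row, batch_size=1000):
    """Generic migration template: stream rows from a source (e.g. Postgres)
    into a sink (e.g. ClickHouse) in fixed-size batches.

    read_rows   -> any iterator/iterable of source rows
    write_batch -> receives a list of mapped rows (hypothetical sink client)
    map_row     -> adapts one source row to the target schema
    """
    it = iter(read_rows)
    total = 0
    while True:
        # Pull at most batch_size rows at a time so memory stays bounded.
        batch = [map_row(row) for row in islice(it, batch_size)]
        if not batch:
            break
        write_batch(batch)
        total += len(batch)
    return total
```

The Postgres -> ClickHouse example would then just supply a concrete `read_rows` (a server-side cursor over the events table) and a concrete `write_batch` (a ClickHouse batch insert).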
I like this idea in general for the longer-term future. Jams mentioned in the call that we could potentially get more users with data warehouses this way, as one of the problems with running self-hosted PostHog is managing it, specifically managing both the compute and the storage. If we can use a pluggable backend, then they have no storage maintenance overhead (their data warehouse might get a bit more data, but that's way easier than having two separate data stores).
Just to make sure I understand correctly how it would work: we don't store any data (maybe caching some), and for every query we do the transformation and pull data?
This is a good one and potentially very important for the existing community who may feel left behind otherwise. Thanks for flagging, @tiina303 🙌
Very insightful discussion above! I'm completely on board with the notion that it's very often easier to push data into PostHog than to go through the potential security/compliance nightmare described above. It's also rather likely that our current focus customers would prefer to push their own data, and based on that we should deprioritize this task (and work on guides on how to
However, two cases discussed above still make me believe import plugins (and especially abstractions around them with corresponding UI features) will be very useful to a lot of users:
In short, yes. The plugin server is built for handling huge amounts of data/events flowing through it, and long-running processes that both import and export data. Even if it's often better to push data, just saying "we have a plugin for Redshift/BigQuery/Snowflake imports (though you might want to push data out yourself)" sounds enabling.
Request to migrate from self-hosted (most likely Postgres) to Cloud in the community Slack https://posthogusers.slack.com/archives/CT7HXDEG3/p1629628505141700
FWIW I don't think we should be thinking about "pluggable backends" right now. I think that is a distraction, and it may not even live in the plugin server at all. As things are set up currently, it seems like something that would be built into the backend of the app. But again, I don't think we should be designing for that currently. If anything, we should discuss the merits of the product outside of implementation in posthog/product-internal#150. Maybe this (export plugins) is something we should jam on at the offsite?
A few updates:
Just some things to consider
📖 Context
I recently moved teams, with one of the goals of the move being to validate approaches to interacting with data warehouses. Working with @fuziontech and @tiina303, we've been able to dig more deeply into how best to build solutions to import and export data from PostHog.
For a while we've been talking about import plugins, and I remember @jamesefhawkins even mentioning somewhere that import plugins would be a prerequisite for making an external plugins release (on HN etc.).
However, in our first meeting, one of my first questions to James and Tiina was: do import plugins actually make sense?
It's possible that the generalization and simplicity we're seeking with these plugins might actually introduce a degree of complexity. As such, I'm opening this issue so we can discuss the validity of import plugins from various perspectives: Sales/Growth (@jamesefhawkins), Product (@marcushyett-ph), and Engineering (@fuziontech / @tiina303 / @timgl).
To give us some context, I decided to build a prototype of what an import plugin would look like, so we can truly see if it makes sense.
📥 Redshift Import Plugin
Here's how this plugin works: set the config below and you're done; events will start coming in:
This will handle for you:
And to transform your data, you can either use an already-available transformation (like JSON Map), or contribute a transformation to the repo.
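To make the idea of a shared transformation concrete, here is a JSON-Map-style sketch. The real plugin transformations are JavaScript; this Python version is purely illustrative, and the mapping format (`event_name` plus a `columns` rename map) is my assumption, not the plugin's actual schema:

```python
def json_map_transform(row, mapping):
    """Rename warehouse columns into PostHog event fields per a user mapping.

    row     -> one warehouse row as a dict
    mapping -> {"event_name": str, "columns": {source_column: target_property}}
               (hypothetical format, for illustration only)
    """
    event = {
        "event": mapping.get("event_name", "imported_row"),
        "properties": {},
    }
    # Copy only the mapped columns, under their target property names.
    for column, prop in mapping.get("columns", {}).items():
        if column in row:
            event["properties"][prop] = row[column]
    return event
```

The appeal of the shared-repo approach is that a mapping like Amplitude -> PostHog only has to be written once and then reused by everyone.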
🎯 Click here if you care to know why this was built this way
The discussion here convinced me that a "plugin templates" concept was the way to go for this. But the more I thought about how this would work today, the more I was discouraged by the idea.
From a PostHog Cloud perspective, providing users with a template means:
So instead of telling users to spin up a whole new repo which we'd have to fully review anyway, why not just let them contribute transformations to one specific repo? Users add their transformation function, we check that it's safe, and the only place these crowd is the select dropdown in the plugin config, which #5578 mitigates too.
Plus, this allows for someone's transformations to help others, and for "standard" transformations to emerge, like Amplitude -> PostHog for example.
Non-Cloud users on the other hand can indeed fork the repo and do as they wish.
I recognize this isn't an amazing solution, but we're limited by https://github.com/PostHog/plugin-server/issues/454. So this is indeed better than the alternative, which is one plugin per import. It allows us to get started quickly with import plugins and iterate as we make progress towards https://github.com/PostHog/plugin-server/issues/454.
🤔 Does this make sense?
So then the question: does this make sense?
Do we want to have an import plugin for each warehouse where users just need to write a transformation OR would we rather provide people with guides (much like what @leggetter is doing) on how to import data and let people fend for themselves?
At the end of the day, a lot of how I've approached plugins is seeking to provide an experience where even developers just starting out are able to make a difference. If you know a bit of JS, you can do a ton. So I'm constantly looking for approaches to simplify the life of PostHog users. However, it could well be that this actually makes things more complicated.
So, thoughts?