Questions about Herd - consolidated from 397-411 #413

Open
nateiam opened this issue Nov 27, 2018 · 1 comment


nateiam commented Nov 27, 2018

Hello @ChoyChoy9834

That's a great set of questions! I've consolidated some brief answers here for the sake of efficiency.

Herd does okay in some of these areas, could be easily enhanced in others, and yet others are great ideas for future development.

Since you have so many in-depth questions it would be nice to know where you are in the process of considering a cloud-based data catalog for big data. Are you shopping around various commercial and open-source products? What are your use cases?

If you are interested in talking more, we could arrange a call - you could share a bit more about your use cases and we can provide more information than the quick answers below.

Thanks, hope to continue the dialogue

  • Nate Weisz (Herd Product Owner @ FINRA)

Okay, here goes on the items:

Can HERD's lineage diagrams provide a summarized view and the ability to expand nodes to drill down to additional details?
The Herd metadata model includes summary- and detail-level lineage metadata. While we currently only visualize the summary level, it's probably fairly straightforward to build a simple visualization that drills down to the detail level.

Also, does HERD suggest missing lineage for incomplete data lineage chains?
We don't currently have plans to do this, but it sounds pretty cool.

Does HERD provide a data profiling tool that provides usage stats and profile, detect potential anomalies, rule-based profiling and ability to view sample data?
We do some work with usage stats, but not in the Herd product; we mostly do that reporting in other tools.
Many teams that manage data in our ecosystem profile it for anomalies. In our organization we consider that an analytical function that we don't yet plan to roll into Herd.

Will I be able to view sample data on HERD's data catalog?
Can Herd Metadata Catalog store sample data, and if yes, can it store unstructured or semi-structured sample data?
Yes, we have a feature that allows teams that publish data to provide sample data for consumers to view. This is free-form: whatever they can place as a file in S3 can be sample data. We provide APIs for them to manage this sample data, and it gets surfaced in the catalog UI.

Is HERD data catalog able to specify and manage DQ rules and measure data against DQ rules?
Herd doesn't have features around data quality. Our current approach encourages teams publishing data to take responsibility for its quality. We do have features that teams use to label data as raw vs. validated, and generic features where teams can store name/value pairs, which can include object-level quality descriptors. We've had teams express an interest in adding descriptors at the column level as well, but we have no immediate plans to develop features in this area unless teams ask for them more vocally.

Does the HERD Data Catalog support data masking in order to protect sensitive data?
We are actively planning to add column-level metadata that will identify sensitive data. After that, we plan to work with teams to integrate this metadata into their processing/analytics tools so that those tools can perform the masking.

Does the Herd Metadata Catalog UI allow for parametric or semantic searches?
Not yet - our underlying text indexing and search technology is Elasticsearch, so it's probably feasible. We've had some teams approach us about pulling known synonyms into queries, and it's possible we'll use Elasticsearch to accomplish this - but nothing more advanced than that until we have more demand that takes us there.
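
As a rough illustration of the synonym idea only (not Herd's actual configuration), the sketch below builds Elasticsearch index settings that declare a `synonym` token filter and an analyzer that uses it. The names `build_synonym_index_settings`, `catalog_synonyms`, and `catalog_search`, and the example terms, are all hypothetical.

```python
def build_synonym_index_settings(synonym_groups):
    """Return index settings containing a synonym-aware analyzer.

    synonym_groups is a list of lists of equivalent terms (hypothetical input).
    """
    # The synonym token filter accepts rules written as comma-separated equivalents.
    rules = [", ".join(group) for group in synonym_groups if len(group) > 1]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; "synonym" is the filter type.
                    "catalog_synonyms": {"type": "synonym", "synonyms": rules}
                },
                "analyzer": {
                    # Hypothetical analyzer name that applies the filter above.
                    "catalog_search": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "catalog_synonyms"],
                    }
                },
            }
        }
    }


settings = build_synonym_index_settings(
    [["trade", "transaction"], ["security", "instrument"]]
)
```

A dictionary like this would be supplied as the request body when creating an index; whether synonyms are better applied at index time or at search time is a separate design choice that this sketch does not settle.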

Does the Herd Metadata Catalog UI provide for ability to download search results?
No, but this would likely be a straightforward enhancement.

Does the Herd Metadata Catalog UI allow for download of the entire catalog contents to my local drive?
No, but people have used the APIs to scrape significant portions of the catalog. It is probably possible to build an external tool - or even a feature in Herd - that does selective or full updates. But we see the metadata changing frequently and expanding rapidly, so any export would become stale fast. We are more likely to serve such use cases with APIs and notifications.
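
For readers wondering what an API-based export might look like in outline, here is a minimal sketch. It assumes a paginated listing call that returns an empty page when there is nothing left; that assumption, the `fetch_page` callable, and the record shape are hypothetical and are not taken from the Herd API.

```python
from typing import Callable, Iterator, List


def export_catalog(fetch_page: Callable[[int], List[dict]]) -> Iterator[dict]:
    """Yield every record by walking pages until an empty page is returned."""
    page = 0
    while True:
        records = fetch_page(page)
        if not records:
            return
        yield from records
        page += 1


def fake_fetch_page(page: int) -> List[dict]:
    """Stand-in for a real API call; a real client would issue a request here."""
    pages = [[{"name": "trades"}, {"name": "orders"}], [{"name": "quotes"}]]
    return pages[page] if page < len(pages) else []


snapshot = list(export_catalog(fake_fetch_page))
```

Because the metadata changes frequently, a snapshot produced this way is only accurate as of the moment it was taken.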

Does Herd Metadata Catalog provide a way to notify owners/users about changes in metadata or particular areas of interest?
Yes, we do have notification hooks in several areas (creation of new objects, format change) with templatized messaging that can be used for this purpose. But we don't yet have higher-level features for registering interest or viewing recent changes in the UI.

Does Herd Metadata Catalog have a workflow function which provides a way for stakeholders to be assigned to work, approvals, etc.?
We have a workflow function that is used extensively to orchestrate data processing pipelines. We embed an open-source third-party engine called Activiti that uses standards-based (BPMN) workflows, and we have several Herd tasks that can be used from workflows. It's likely possible to use this engine for the purpose you describe.

Where/how do I find or access dashboards or reports that have already been developed for the Herd Metadata Catalog by the community?
We don't have any examples of this so far beyond Herd's use at FINRA, but we can share more if we end up talking.

Does Herd Metadata Catalog allow annotations or comments in the lineage diagrams?
No, but lineage is an area where we are likely to continue investing effort. This is likely a straightforward enhancement.

Which of the following can Herd Metadata Catalog NOT do?
A. Regular, scheduled scans to update metadata
B. Notify data owner of new metadata
C. Pre-built scanners to collect from various databases
We recently introduced an initial feature that retrieves format information from relational databases, so we're starting to explore 'C' and 'A', but our features are really in their infancy. As mentioned above, we do have 'B'.

Are there any metadata integration templates available from the community that are helpful for capturing metadata for, say, big data? If yes, how do we get a better understanding of these?
Not sure I understand this one; we can discuss if we end up talking.


ChoyChoy9834 commented Nov 27, 2018 via email
