Questions about Herd - consolidated from 397-411 #413

nateiam · 2018-11-27T18:37:32Z

That's a great set of questions! I've consolidated some brief answers here for the sake of efficiency.

Herd does okay in some of these areas, could be easily enhanced in others, and yet others are great ideas for future development.

Since you have so many in-depth questions, it would be nice to know where you are in the process of considering a cloud-based data catalog for big data. Are you shopping around various commercial and open-source products? What are your use cases?

If you are interested in talking more, we could arrange a call - you could share a bit more about your use cases and we can provide more information than the quick answers below.

Thanks, hope to continue the dialogue.

Nate Weisz (Herd Product Owner @ FINRA)

Okay, here goes on the items:

Can HERD's lineage diagrams provide a summarized view and the ability to expand nodes to drill down to additional details?
The Herd metadata model includes summary- and detail-level lineage metadata. While we currently only visualize the summary level, it's probably fairly straightforward to build a simple visualization that drills down to the detail level.

Also, does HERD suggest missing lineage for incomplete data lineage chains?
We don't currently have plans to do this, but it sounds pretty cool.

Does HERD provide a data profiling tool that provides usage stats and profiles, detects potential anomalies, and supports rule-based profiling and the ability to view sample data?
We do some work with usage stats, but not in the Herd product - we mostly do that reporting in other tools.
Many teams that manage data in our ecosystem do profiling for anomalies. In our organization we consider that an analytical function that we don't yet plan to roll into Herd.

Will I be able to view sample data on HERD's data catalog?
Can Herd Metadata Catalog store sample data, and if yes, can it store unstructured or semi-structured sample data?
Yes, we have a feature that allows teams that publish data to provide sample data for consumers to view. This is free-form: whatever they can place as a file in S3 can be sample data. We provide APIs for them to manage this sample data, and it gets surfaced in the catalog UI.

Is HERD data catalog able to specify and manage DQ rules and measure data against DQ rules?
Herd doesn't have features around data quality. Our current approach encourages teams publishing data to take responsibility for its quality. We do have features that teams use to label data as raw vs. validated, and generic features where teams can store name/value pairs, which can include object-level quality descriptors. We've had teams express an interest in adding descriptors at the column level as well, but we have no immediate plans to develop features in this area unless teams demand it more vocally.

Does the HERD Data Catalog support data masking in order to protect sensitive data?
We are actively planning to add column-level metadata that will identify sensitive data. After this, we plan to work with teams to use this metadata in their processing/analytics tools to perform masking.
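To illustrate how a processing tool might consume such column-level metadata once it exists, here is a minimal sketch. The `Column` class, its `sensitive` flag, and `mask_row` are hypothetical names invented for this example; they are not part of Herd's current model or API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    sensitive: bool = False  # hypothetical flag, not an existing Herd field


def mask_row(row: dict, columns: list[Column], mask: str = "****") -> dict:
    """Return a copy of the row with values of sensitive columns replaced."""
    sensitive = {c.name for c in columns if c.sensitive}
    return {key: (mask if key in sensitive else value) for key, value in row.items()}


# Made-up schema and record, for illustration only.
schema = [Column("account_id"), Column("ssn", sensitive=True)]
masked = mask_row({"account_id": 42, "ssn": "123-45-6789"}, schema)
```

The masking itself stays in the consuming tool; the catalog would only say which columns need it.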

Does the Herd Metadata Catalog UI allow for parametric or semantic searches?
Not yet - our underlying text indexing and search technology is Elasticsearch, so it's probably feasible. We've had some teams approach us about pulling known synonyms into queries, and it's possible we'll use ES to accomplish this - but nothing more advanced than that until we have more demand pushing us in that direction.
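As a rough idea of what the synonym approach could look like, the sketch below builds Elasticsearch index settings that use the built-in `synonym` token filter, plus a small client-side equivalent. The synonym list and the names `catalog_synonyms` and `catalog_search` are assumptions made up for this example, not Herd's actual index configuration.

```python
# Hypothetical synonym groups; a real list would come from the teams.
SYNONYMS = ["trade, transaction", "security, instrument"]

# Index settings using Elasticsearch's "synonym" token filter.
# Filter and analyzer names are illustrative only.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "catalog_synonyms": {"type": "synonym", "synonyms": SYNONYMS}
            },
            "analyzer": {
                "catalog_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "catalog_synonyms"],
                }
            },
        }
    }
}


def expand_terms(query, synonyms=SYNONYMS):
    """Client-side equivalent: expand each query term to its synonym group."""
    groups = [{t.strip().lower() for t in line.split(",")} for line in synonyms]
    expanded = []
    for term in query.lower().split():
        group = next((g for g in groups if term in g), {term})
        expanded.append(sorted(group))
    return expanded
```

For example, `expand_terms("Trade data")` yields one group per query term, with `trade` expanded to include `transaction`.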

Does the Herd Metadata Catalog UI provide for ability to download search results?
No, but this would likely be a straightforward enhancement.

Does the Herd Metadata Catalog UI allow for download of the entire catalog contents to my local drive?
No, but people have used the APIs to scrape significant portions of the catalog. It is probably possible to build an external tool - or even a feature in Herd - that does selective or full updates of a local copy. But we see the metadata changing frequently and expanding rapidly, so any export would become stale fast. We are more likely to serve these use cases with APIs and notifications.
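A minimal sketch of the kind of external export tool described above, assuming a paged listing call. `fetch_page` is a placeholder for whatever API request a real tool would make; no actual Herd endpoint is named or implied. The snapshot records when it was taken, since staleness is the main concern.

```python
from datetime import datetime, timezone


def export_catalog(fetch_page, page_size=100):
    """Collect all entries by paging until a short (or empty) page is returned."""
    entries, page = [], 1
    while True:
        batch = fetch_page(page, page_size)
        entries.extend(batch)
        if len(batch) < page_size:
            break
        page += 1
    # Timestamp the snapshot so consumers can judge how stale it is.
    return {"exported_at": datetime.now(timezone.utc).isoformat(), "entries": entries}


# Stand-in data source for illustration; a real tool would call an API here.
FAKE_CATALOG = [f"object-{i}" for i in range(250)]


def fake_fetch(page, size):
    start = (page - 1) * size
    return FAKE_CATALOG[start:start + size]


snapshot = export_catalog(fake_fetch)
```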

Does Herd Metadata Catalog provide a way to notify owners/users about changes in metadata or particular areas of interest?
Yes, we do have notification hooks in several areas (creation of new objects, format change) with templatized messaging that can be used for this purpose. But we don't yet have higher-level features for registering interest or viewing recent changes in the UI.
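To make "templatized messaging" concrete, here is a small sketch in which an event's fields are substituted into a message template. Python's `string.Template` stands in for whatever templating Herd uses; the template text and the event field names are assumptions for illustration.

```python
from string import Template

# Hypothetical message template; placeholders are filled from the event.
TEMPLATE = Template("New $object_type registered in $namespace: $name (format: $file_format)")


def render(event: dict) -> str:
    """Fill the template from an event; raises KeyError if a field is missing."""
    return TEMPLATE.substitute(event)


# Made-up event, for illustration only.
message = render({
    "object_type": "data object",
    "namespace": "DEMO",
    "name": "trades",
    "file_format": "PARQUET",
})
```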

Does Herd Metadata Catalog have a workflow function which provides a way for stakeholders to be assigned to work, approvals, etc. ?
We have a workflow function that is used extensively to orchestrate data processing pipelines. We embed an open-source third-party engine called Activiti that uses standards-based (BPMN) workflows, and we have several Herd tasks that can be used from workflows. It's likely possible to use this engine for the purpose you describe.
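For a sense of what an approval step looks like in BPMN 2.0, the sketch below embeds a tiny process with one `userTask` assigned through Activiti's `candidateGroups` extension attribute, then lists its user tasks. The process, its ids, and the `data-stewards` group are made up, and whether Herd's embedded engine exposes user tasks this way is an assumption, not something confirmed in this thread.

```python
import xml.etree.ElementTree as ET

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
ACTIVITI_NS = "http://activiti.org/bpmn"

# Hypothetical approval process; ids, names and the group are illustrative.
PROCESS_XML = f"""<definitions xmlns="{BPMN_NS}" xmlns:activiti="{ACTIVITI_NS}"
             targetNamespace="http://example.com/approvals">
  <process id="metadataApproval" name="Metadata approval">
    <startEvent id="start"/>
    <sequenceFlow id="f1" sourceRef="start" targetRef="review"/>
    <userTask id="review" name="Review description change"
              activiti:candidateGroups="data-stewards"/>
    <sequenceFlow id="f2" sourceRef="review" targetRef="end"/>
    <endEvent id="end"/>
  </process>
</definitions>"""


def user_tasks(xml_text):
    """Return (task id, candidate groups) for every userTask in the process."""
    root = ET.fromstring(xml_text)
    tasks = []
    for task in root.iter(f"{{{BPMN_NS}}}userTask"):
        groups = task.get(f"{{{ACTIVITI_NS}}}candidateGroups", "")
        tasks.append((task.get("id"), groups.split(",") if groups else []))
    return tasks
```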

Where/how do I find or access Dashboards or reports that have already been developed for the Herd Metadata Catalog , by the Community?
We don't have any examples of this so far beyond Herd's use at FINRA but we can share more if we end up talking.

Does Herd Metadata Catalog allow annotations or comments in the lineage diagrams?
No, but lineage is an area where we are likely to continue investing effort. This is likely a straightforward enhancement.

Which of the following can Herd Metadata Catalog NOT do?
A. Regular, scheduled scans to update metadata
B. Notify data owner of new metadata
C. Pre-built scanners to collect from various databases
We recently introduced an initial feature that retrieves format information from relational databases, so we're starting to explore 'C' and 'A', but our features are really in their infancy. As mentioned above, we do have 'B'.

Are there any metadata integration templates available from the Community that are helpful for capturing metadata for, say, big data? If yes, how do we get a better understanding of these?
Not sure I understand this one; we can discuss if we end up talking.

ChoyChoy9834 · 2018-11-27T19:18:22Z

Nate,

Thanks so much for responding so quickly! And my apologies for hitting you all with so many questions right after the holiday!

Yes, we are definitely shopping around for a data catalog that can handle both on-prem and cloud-based data. The catalog itself does not necessarily have to be in the cloud, but that is preferable as this is the future state. As you can tell from my questions, I have a definitive set of use cases which I'll be glad to share.

And yes, I would love to have a conference call with you and your team about this. I've included my other team members whom I'll want to participate in the call.

Again, thanks so much!

Choy Choy Cahill, PMP | Systems Analyst-Tech Lead, Multifamily

nateiam changed the title from "Questions about Herd - consolidated" to "Questions about Herd - consolidated from 397-411" on Nov 27, 2018
