Home

Umbrella SI Journey

Fingerprinting Personal Data from Unstructured Text: Leverage IBM Watson Natural Language Understanding and Watson Knowledge Studio to identify personal data from unstructured text

Short Name

Fingerprinting Personal Data from Unstructured Text

Short Description

Leverage IBM Watson Natural Language Understanding and Watson Knowledge Studio to identify personal data from unstructured text

Offering Type

Cognitive

Introduction

Identifying personal data in unstructured text can be a daunting task. Watson Natural Language Understanding, combined with a Watson Knowledge Studio custom model, provides an effective way of identifying the necessary information in unstructured documents. The results can be augmented with regular expressions. The personal data identified is assigned a score, which can then drive further processing or consumption.

Author

Muralidhar Chavan

Code

https://github.com/IBM/gdpr-fingerprint-pii

Demo

N/A

Video

https://youtu.be/NiBCa3EtCr0

Overview

In this pattern, we show you how to build a custom model using Watson Knowledge Studio and use it to identify personal data in unstructured documents. The pattern also augments the results with regular expression parsers.

When the reader has completed this pattern, they will understand how to:

Build a custom model using Watson Knowledge Studio (WKS) and have Natural Language Understanding (NLU) use that model to identify personal data.
Use regular expressions to augment NLU for metadata identification.
Configure what personal data needs to be identified, and assign a weight to each kind of personal data so that a score can be computed.
View the score and the personal data identified in a tree structure for better visualization.
Consume the output of this pattern from other applications.
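
The weighting idea in the list above can be illustrated with a minimal sketch. It assumes that each category of personal data carries an integer weight read from configuration and that the document score is the sum of the weights of the distinct categories found, capped at 100. The class name, category names, weights, and the cap are all hypothetical; the pattern's repository may compute its score differently.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names, weights, and the capping rule are assumptions
// made for illustration, not taken from the pattern's code.
class PersonalDataScorer {

    // Assumed upper bound for a document score.
    static final int MAX_SCORE = 100;

    // Sums the configured weight of each distinct category found in a document.
    // Categories without a configured weight contribute 0.
    static int score(Map<String, Integer> weights, Collection<String> categoriesFound) {
        int total = 0;
        for (String category : new LinkedHashSet<>(categoriesFound)) {
            total += weights.getOrDefault(category, 0);
        }
        return Math.min(total, MAX_SCORE);
    }

    public static void main(String[] args) {
        // Illustrative configuration: weight per personal-data category.
        Map<String, Integer> weights =
                Map.of("Name", 30, "EmailAddress", 40, "MobileNumber", 40);

        // Categories that NLU and the regex step are assumed to have found.
        System.out.println(score(weights, List.of("Name", "EmailAddress")));
    }
}
```

Counting each category once keeps a document that repeats the same name many times from scoring higher than one that mentions it once; whether that is desirable depends on how the score is consumed downstream.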

Flow

1 – Viewer passes input text to Personal Data Extractor.
2 – Personal Data Extractor passes the text to NLU.
3 – NLU extracts personal data from the input text, using the custom model to produce its response.
4 – Personal Data Extractor passes NLU Output to Regex component.
5 – Regex component uses the regular expressions provided in configuration to extract personal data, which is then added to the NLU Output.
6 – The augmented personal data is passed to scorer component.
7 – Scorer component uses the configuration to come up with an overall document score, and the result is passed back to the Personal Data Extractor component.
8 – This data is then passed to viewer component.
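
Steps 4 and 5 of the flow can be sketched as follows. This is a minimal illustration, assuming the NLU output has already been reduced to a map from category name to the values found; the NLU call itself is not shown. The class name, the `EmailAddress` category, and the email pattern are hypothetical and are not taken from the pattern's configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the regex augmentation step; all names are assumptions.
class RegexAugmenter {

    // Returns a copy of the NLU entities with regex matches from the text added.
    // Both maps are keyed by personal-data category.
    static Map<String, List<String>> augment(Map<String, List<String>> nluEntities,
                                             Map<String, Pattern> patterns,
                                             String text) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        // Start from what NLU (with the custom model) is assumed to have found.
        nluEntities.forEach((type, values) -> result.put(type, new ArrayList<>(values)));

        // Add every regex match under its configured category, skipping duplicates.
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            while (matcher.find()) {
                List<String> values =
                        result.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
                if (!values.contains(matcher.group())) {
                    values.add(matcher.group());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Please contact Alex Smith at alex.smith@example.com for details";

        // Stand-in for the NLU response; a real response carries more fields.
        Map<String, List<String>> nluEntities = Map.of("Name", List.of("Alex Smith"));

        // Illustrative configuration: one deliberately simple email pattern.
        Map<String, Pattern> patterns = Map.of(
                "EmailAddress",
                Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"));

        System.out.println(augment(nluEntities, patterns, text));
    }
}
```

The augmented map is what a scorer component would then receive in step 6.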

Included components

Watson Knowledge Studio: A tool to create a machine-learning model that understands the linguistic nuances, meaning, and relationships specific to your industry or to create a rule-based model that finds entities in documents based on rules that you define.
Watson Natural Language Understanding: A Bluemix service that uses natural language understanding to analyze text and extract metadata from content, such as concepts, entities, keywords, categories, sentiment, emotion, relations, and semantic roles.
Liberty for Java: Develop, deploy, and scale Java web apps with ease. IBM WebSphere Liberty Profile is a highly composable, ultra-fast, ultra-light profile of IBM WebSphere Application Server designed for the cloud.

Featured technologies

Blog

The General Data Protection Regulation (GDPR) is a regulation by which the European Parliament, the Council of the European Union and the European Commission intend to strengthen and unify data protection for all individuals within the European Union (EU). It also addresses the export of personal data outside the EU.

Under the EU's new General Data Protection Regulation, enterprises around the world must not only keep personal data private, but they will also be required to "forget" any personal data related to an individual on request -- and the GDPR right to be forgotten will be a significant part of compliance with the new rule.

An individual can request that his or her personal data be erased in circumstances such as:

Where the personal data is no longer necessary in relation to the purpose for which it was originally collected or processed
When an individual withdraws consent
The personal data was unlawfully processed
The personal data has to be erased in order to comply with a legal obligation

When a customer requests that all his or her personal data be deleted, the organisation needs to identify all the documents where the customer's personal data resides.

To cater to this requirement, we have come up with an IBM code pattern, "Fingerprinting personal data from unstructured documents". This pattern identifies whether a document has enough personal data to identify an individual. Organizations can use this pattern to comply with policies like:

A security layer that exposes only those documents that do not contain personal data
Identifying documents that contain personal data for erasure

IBM Cognitive services such as Watson Natural Language Understanding and Watson Knowledge Studio, which are used in this code pattern, provide features such as:

Identify information
Train Watson without coding
Custom model for unique industries
Accelerated training process
Continuous performance improvement

If your organization needs to identify personal data in its records, then you can consider the IBM code pattern "Fingerprinting personal data from unstructured documents". You’ll find the code, documentation and video here.

Links

GDPR Overview: Describes the main elements of GDPR
IBM Code pattern provides other ways of identifying metadata from unstructured text
