
Discover the archetypes in your system of records

Systems of record are ubiquitous in the world around us: music playlists, job listings, medical records, customer service calls, GitHub issues, and so on. An archetype is formally defined as a pattern, or a model, from which all things of the same type are copied. More informally, we can think of archetypes as categories, classes, or topics.

When we read through a set of these records, our mind naturally groups them into some collection of archetypes. For example, we may sort a song collection into easy listening, classical, rock, and so on. This manual process is practical for a small number of records (say, a few dozen), but large systems can have millions of records, so we need an automated way to process them. In addition, we may not know beforehand which archetypes exist in the records, so we also need a way to discover meaningful archetypes. Since records are often unstructured text, such automated processing needs to understand natural language. Watson Natural Language Understanding, coupled with statistical techniques, can help you to:

  1. discover meaningful archetypes in your records and then
  2. classify new records against this set of archetypes.

In this example, we will use a medical dictation data set to illustrate the process. The data is provided by ezDI and includes 249 actual medical dictations that have been anonymized.
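
The statistical side of the discovery can be sketched independently of the notebook. The following is a minimal illustration, not necessarily the exact formulation used in this repository: build a document-by-keyword matrix from the NLU relevance scores, factor it with non-negative matrix factorization (NMF), and treat each factor as an archetype. The nlu_results structure below is a hypothetical stand-in for the stored NLU output.

# Illustrative sketch: archetype discovery via NMF over NLU keyword scores.
# `nlu_results` is hypothetical: it maps a dictation id to the keywords that
# Watson NLU returned, e.g. {"doc1": [{"text": "chest pain", "relevance": 0.98}, ...]}.
import pandas as pd
from sklearn.decomposition import NMF

def build_matrix(nlu_results):
    # Documents x keywords matrix of NLU relevance scores (0 where absent).
    rows = {doc_id: {kw["text"]: kw["relevance"] for kw in keywords}
            for doc_id, keywords in nlu_results.items()}
    return pd.DataFrame.from_dict(rows, orient="index").fillna(0.0)

def discover_archetypes(matrix, n_archetypes=6, top_terms=10):
    # Each NMF component is treated as one archetype.
    model = NMF(n_components=n_archetypes, init="nndsvd", random_state=0)
    doc_weights = model.fit_transform(matrix)        # documents x archetypes
    archetypes = pd.DataFrame(model.components_, columns=matrix.columns)
    for i, row in archetypes.iterrows():
        print(f"Archetype {i}: {', '.join(row.nlargest(top_terms).index)}")
    return model, doc_weights, archetypes

A new record can then be classified by building its keyword vector over the same columns and projecting it onto the archetypes, for example with model.transform or a similarity measure.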

When the reader has completed this code pattern, they will understand how to:

  • Work with the Watson Natural Language Understanding service (NLU) through API calls.
  • Work with the IBM Cloud Object Store service (COS) through the SDK to hold the data and results.
  • Perform statistical analysis on the results from Watson Natural Language Understanding.
  • Explore the archetypes through graphical interpretation of the data in a Jupyter Notebook or a web interface.

(Architecture diagram)

Flow

  1. The user downloads the custom medical dictation data set from ezDI and prepares the text data for processing.
  2. The user interacts with the Watson Natural Language Understanding service via the provided application UI or the Jupyter Notebook.
  3. The user runs a series of statistical analyses on the results from Watson Natural Language Understanding.
  4. The user uses the graphical display to explore the archetypes that the analysis discovers.
  5. The user classifies a new dictation by providing it as input and sees which archetype it is mapped to.

Included components

  • Watson Natural Language Understanding: a service that analyzes text to extract meta-data such as concepts, entities, and keywords.
  • IBM Cloud Object Store: a service used here to store the dictation documents and the NLU results.
  • Watson Studio: a cloud environment for running the Jupyter notebook.

Featured technologies

  • Jupyter Notebook: an open-source web application for combining live code, visualizations, and narrative text.
  • Python: the language used by the notebook and the data preparation scripts.

Watch the Video

Steps

  1. Clone the repo
  2. Create IBM Cloud services
  3. Download and prepare the data
  4. Run the Jupyter notebook
  5. Run the Web UI

1. Clone the repo

git clone https://github.com/IBM/discover-archetype

2. Create IBM Cloud services

You will use three IBM Cloud services.

a. Watson Natural Language Understanding

On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Watson -> Watson Services -> Browse Services -> Natural Language Understanding. Select the Lite plan and click Create. When the service becomes available, copy the endpoint and credentials for use later.
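
Those credentials can later be used from Python through the Watson SDK (ibm-watson). A minimal sketch for analyzing a single dictation; the API key, service URL, and file path are placeholders:

# Sketch: analyze one dictation with Watson NLU (ibm-watson SDK).
# The API key, service URL, and file path below are placeholders.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, ConceptsOptions, EntitiesOptions, KeywordsOptions)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

nlu = NaturalLanguageUnderstandingV1(
    version="2019-07-12", authenticator=IAMAuthenticator("YOUR_NLU_APIKEY"))
nlu.set_service_url("YOUR_NLU_ENDPOINT")

with open("notebook/Documents/sample-dictation.txt") as f:
    text = f.read()

response = nlu.analyze(
    text=text,
    features=Features(concepts=ConceptsOptions(limit=20),
                      entities=EntitiesOptions(limit=20),
                      keywords=KeywordsOptions(limit=20))).get_result()
print([kw["text"] for kw in response["keywords"]])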

b. IBM Cloud Object Store

On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Classic Infrastructure -> Storage -> Object Storage. Select the Lite plan and click Create. When the service becomes available, click on Create bucket and create two buckets: one for the medical dictation and one for the NLU result. Copy the bucket instance CRN, endpoints and credentials for use later.
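
If you prefer to create the buckets from code rather than the console, the COS SDK for Python (ibm-cos-sdk, which exposes an ibm_boto3 interface modeled on boto3) can do the same thing. A sketch with placeholder credentials, endpoint, and bucket names; note that COS bucket names must be globally unique:

# Sketch: connect to IBM Cloud Object Store and create the two buckets.
# The API key, instance CRN, endpoint, and bucket names are placeholders.
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="YOUR_COS_APIKEY",
    ibm_service_instance_id="YOUR_COS_INSTANCE_CRN",
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud")

for bucket in ("my-dictation-text", "my-nlu-results"):  # one bucket per purpose
    cos.create_bucket(Bucket=bucket)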

c. Watson Studio

On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Watson -> Watson Services -> Browse Services -> Watson Studio. Select the Lite plan and click Create. Click on New project and create an empty project. Navigate into your new empty project and click on New notebook. Select From file and upload the Jupyter notebook from the local git repo:

discover-archetype/notebook/WATSON_Document_Archetypes_Analysis_Showcase.ipynb

3. Download and prepare the data

Go to the ezDI web site and download the medical dictation text files. The downloaded files are packaged as zip files.

Create a notebook/Documents subdirectory and then extract the downloaded zip files into their respective locations.

The dictation files stored in the notebook/Documents directory are in RTF format and need to be converted to plain text. Use the following commands to convert them all to .txt files.

Note: Run the conversion script with Python 3.

pip install striprtf
cd notebook/Documents
python ../python/convert_rtf.py
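
The repository provides convert_rtf.py for this step. For reference, a minimal version of such a conversion, using the same striprtf package, could look like the following (run from the notebook/Documents directory); it is a sketch, not the exact script shipped with the repository:

# Sketch of an RTF-to-text conversion, similar in spirit to the provided
# convert_rtf.py but not the exact script.
from pathlib import Path
from striprtf.striprtf import rtf_to_text

for rtf_path in Path(".").glob("*.rtf"):
    plain = rtf_to_text(rtf_path.read_text(errors="ignore"))
    rtf_path.with_suffix(".txt").write_text(plain)
    print(f"Converted {rtf_path.name}")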

Upload the converted text files to the IBM Cloud Object Store bucket you created for the dictations.
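
The upload can be done from the COS console or scripted. A sketch using the same ibm_boto3 client configuration as in step 2b; credentials, endpoint, and the bucket name are placeholders:

# Sketch: upload the converted .txt dictations to the dictation bucket.
# Credentials, endpoint, and bucket name are placeholders.
from pathlib import Path
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="YOUR_COS_APIKEY",
    ibm_service_instance_id="YOUR_COS_INSTANCE_CRN",
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud")

for txt_path in Path("notebook/Documents").glob("*.txt"):
    cos.upload_file(Filename=str(txt_path), Bucket="my-dictation-text",
                    Key=txt_path.name)
    print(f"Uploaded {txt_path.name}")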

4. Run the Jupyter notebook

a. Configure credentials and endpoints

In the Jupyter notebook, the second cell contains a number of parameters that need to be filled in with the credentials, endpoints, and resource IDs for the three IBM Cloud services. Use the values obtained in step 2, then execute each cell in the notebook.
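
The notebook's actual variable names may differ, but the configuration cell collects information along these lines (all values are placeholders taken from step 2):

# Hypothetical configuration cell; the notebook's own parameter names may differ.
NLU_APIKEY = "YOUR_NLU_APIKEY"               # step 2a
NLU_ENDPOINT = "YOUR_NLU_ENDPOINT"
COS_APIKEY = "YOUR_COS_APIKEY"               # step 2b
COS_INSTANCE_CRN = "YOUR_COS_INSTANCE_CRN"
COS_ENDPOINT = "https://s3.us-south.cloud-object-storage.appdomain.cloud"
DICTATION_BUCKET = "my-dictation-text"
RESULTS_BUCKET = "my-nlu-results"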

5. Run the Web UI

This web app showcases the archetype discovery process. Users can:

  1. Upload a corpus (zip file containing txt files) which will be processed by Watson NLU.
  2. Compute the archetypes of a corpus and analyze them.
  3. Match a new document with the archetypes and see the relevant terms.

Follow the instructions in the Web UI's README to install and run the app.

If the web service is deployed on a server with public IP, the UI can be accessed on a mobile device.
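
The matching step (item 3 above) amounts to scoring a new dictation against the archetypes discovered earlier. Continuing the NMF-style sketch from the introduction (the web app's own implementation may differ), a simple cosine similarity works; archetypes is the archetypes-by-keywords table from that sketch and new_keywords is hypothetical NLU output for the new document:

# Sketch: match a new document to the closest archetype by cosine similarity
# between its NLU keyword vector and each archetype's keyword weights.
# `archetypes` comes from the discovery sketch; `new_keywords` is a placeholder.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

new_keywords = {"chest pain": 0.97, "ekg": 0.84, "aspirin": 0.61}
new_vector = (pd.Series(new_keywords)
              .reindex(archetypes.columns, fill_value=0.0)
              .to_frame().T)
scores = cosine_similarity(new_vector, archetypes)[0]
print("Best matching archetype:", scores.argmax())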

(Screenshot of the archetypes web UI)
