
Discover the archetypes in your system of records

Systems of record are ubiquitous in the world around us: music playlists, job listings, medical records, customer service calls, GitHub issues, and so on. An archetype is formally defined as a pattern, or a model, from which all things of the same type are copied. More informally, we can think of archetypes as categories, classes, or topics.

When we read through a set of these records, our mind naturally groups them into some collection of archetypes. For example, we may sort a song collection into easy listening, classical, rock, and so on. This manual process is practical for a small number of records (say, a few dozen), but large systems can have millions of records, so we need an automated way to process them. In addition, we may not know beforehand which archetypes exist in the records, so we also need a way to discover meaningful archetypes. Since records are often unstructured text, such automated processing needs to understand natural language. Watson Natural Language Understanding, coupled with statistical techniques, can help you to:

  1. discover meaningful archetypes in your records and then
  2. classify new records against this set of archetypes.

In this example, we will use a medical dictation data set to illustrate the process. The data is provided by ezDI and includes 249 actual medical dictations that have been anonymized.
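
The statistical side of the discovery can be sketched independently of the notebook. The following is a minimal illustration, not necessarily the exact formulation used in this repository: build a document-by-keyword matrix from the NLU relevance scores, factor it with non-negative matrix factorization (NMF), and treat each factor as an archetype. The nlu_results structure below is a hypothetical stand-in for the stored NLU output.

# Illustrative sketch: archetype discovery via NMF over NLU keyword scores.
# `nlu_results` is hypothetical: it maps a dictation id to the keywords that
# Watson NLU returned, e.g. {"doc1": [{"text": "chest pain", "relevance": 0.98}, ...]}.
import pandas as pd
from sklearn.decomposition import NMF

def build_matrix(nlu_results):
    # Documents x keywords matrix of NLU relevance scores (0 where absent).
    rows = {doc_id: {kw["text"]: kw["relevance"] for kw in keywords}
            for doc_id, keywords in nlu_results.items()}
    return pd.DataFrame.from_dict(rows, orient="index").fillna(0.0)

def discover_archetypes(matrix, n_archetypes=6, top_terms=10):
    # Each NMF component is treated as one archetype.
    model = NMF(n_components=n_archetypes, init="nndsvd", random_state=0)
    doc_weights = model.fit_transform(matrix)        # documents x archetypes
    archetypes = pd.DataFrame(model.components_, columns=matrix.columns)
    for i, row in archetypes.iterrows():
        print(f"Archetype {i}: {', '.join(row.nlargest(top_terms).index)}")
    return model, doc_weights, archetypes

A new record can then be classified by building its keyword vector over the same columns and projecting it onto the archetypes, for example with model.transform or a similarity measure.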

When the reader has completed this code pattern, they will understand how to:

  • Work with the Watson Natural Language Understanding service (NLU) through API calls.
  • Work with the IBM Cloud Object Store service (COS) through the SDK to hold the data and results.
  • Perform statistical analysis on the results from Watson Natural Language Understanding.
  • Explore the archetypes through graphical interpretation of the data in a Jupyter Notebook or a web interface.

(Architecture diagram)

Flow

  1. The user downloads the custom medical dictation data set from ezDI and prepares the text data for processing.
  2. The user interacts with the Watson Natural Language Understanding service via the provided application UI or the Jupyter Notebook.
  3. The user runs a series of statistical analyses on the results from Watson Natural Language Understanding.
  4. The user uses the graphical display to explore the archetypes that the analysis discovers.
  5. The user classifies a new dictation by providing it as input and sees which archetype it is mapped to.

Included components

  • Watson Natural Language Understanding: a service that analyzes text to extract meta-data such as concepts, entities, and keywords.
  • IBM Cloud Object Store: a service used here to store the dictation documents and the NLU results.
  • Watson Studio: a cloud environment for running the Jupyter notebook.

Featured technologies

  • Jupyter Notebook: an open-source web application for combining live code, visualizations, and narrative text.
  • Python: the language used by the notebook and the data preparation scripts.

Watch the Video

Steps

  1. Clone the repo
  2. Create IBM Cloud services
  3. Download and prepare the data
  4. Run the Jupyter notebook
  5. Run the Web UI

1. Clone the repo

git clone https://github.com/IBM/discover-archetype

2. Create IBM Cloud services

You will use three IBM Cloud services.

a. Watson Natural Language Understanding

On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Watson -> Watson Services -> Browse Services -> Natural Language Understanding. Select the Lite plan and click Create. When the service becomes available, copy the endpoint and credentials for use later.
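
Those credentials can later be used from Python through the Watson SDK (ibm-watson). A minimal sketch for analyzing a single dictation; the API key, service URL, and file path are placeholders:

# Sketch: analyze one dictation with Watson NLU (ibm-watson SDK).
# The API key, service URL, and file path below are placeholders.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, ConceptsOptions, EntitiesOptions, KeywordsOptions)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

nlu = NaturalLanguageUnderstandingV1(
    version="2019-07-12", authenticator=IAMAuthenticator("YOUR_NLU_APIKEY"))
nlu.set_service_url("YOUR_NLU_ENDPOINT")

with open("notebook/Documents/sample-dictation.txt") as f:
    text = f.read()

response = nlu.analyze(
    text=text,
    features=Features(concepts=ConceptsOptions(limit=20),
                      entities=EntitiesOptions(limit=20),
                      keywords=KeywordsOptions(limit=20))).get_result()
print([kw["text"] for kw in response["keywords"]])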

b. IBM Cloud Object Store

On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Classic Infrastructure -> Storage -> Object Storage. Select the Lite plan and click Create. When the service becomes available, click on Create bucket and create two buckets: one for the medical dictation and one for the NLU result. Copy the bucket instance CRN, endpoints and credentials for use later.
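
If you prefer to create the buckets from code rather than the console, the COS SDK for Python (ibm-cos-sdk, which exposes an ibm_boto3 interface modeled on boto3) can do the same thing. A sketch with placeholder credentials, endpoint, and bucket names; note that COS bucket names must be globally unique:

# Sketch: connect to IBM Cloud Object Store and create the two buckets.
# The API key, instance CRN, endpoint, and bucket names are placeholders.
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="YOUR_COS_APIKEY",
    ibm_service_instance_id="YOUR_COS_INSTANCE_CRN",
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud")

for bucket in ("my-dictation-text", "my-nlu-results"):  # one bucket per purpose
    cos.create_bucket(Bucket=bucket)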

c. Watson Studio

On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Watson -> Watson Services -> Browse Services -> Watson Studio. Select the Lite plan and click Create. Click on New project and create an empty project. Navigate into your new empty project and click on New notebook. Select From file and upload the Jupyter notebook from the local git repo:

discover-archetype/notebook/WATSON_Document_Archetypes_Analysis_Showcase.ipynb

3. Download and prepare the data

Go to the ezDI web site and download the medical dictation text files. The downloaded files are packaged as zip files.

Create a notebook/Documents subdirectory and then extract the downloaded zip files into their respective locations.

The dictation files stored in the notebook/Documents directory are in RTF format and need to be converted to plain text. Use the following commands to convert them all to .txt files.

Note: Run the conversion script with Python 3.

pip install striprtf
cd notebook/Documents
python ../python/convert_rtf.py
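
The repository provides convert_rtf.py for this step. For reference, a minimal version of such a conversion, using the same striprtf package, could look like the following (run from the notebook/Documents directory); it is a sketch, not the exact script shipped with the repository:

# Sketch of an RTF-to-text conversion, similar in spirit to the provided
# convert_rtf.py but not the exact script.
from pathlib import Path
from striprtf.striprtf import rtf_to_text

for rtf_path in Path(".").glob("*.rtf"):
    plain = rtf_to_text(rtf_path.read_text(errors="ignore"))
    rtf_path.with_suffix(".txt").write_text(plain)
    print(f"Converted {rtf_path.name}")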

Upload the converted text files to the IBM Cloud Object Store bucket you created for the dictations.
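
The upload can be done from the COS console or scripted. A sketch using the same ibm_boto3 client configuration as in step 2b; credentials, endpoint, and the bucket name are placeholders:

# Sketch: upload the converted .txt dictations to the dictation bucket.
# Credentials, endpoint, and bucket name are placeholders.
from pathlib import Path
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="YOUR_COS_APIKEY",
    ibm_service_instance_id="YOUR_COS_INSTANCE_CRN",
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud")

for txt_path in Path("notebook/Documents").glob("*.txt"):
    cos.upload_file(Filename=str(txt_path), Bucket="my-dictation-text",
                    Key=txt_path.name)
    print(f"Uploaded {txt_path.name}")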

4. Run the Jupyter notebook

a. Configure credentials and endpoints

In the Jupyter notebook, the second cell contains a number of parameters that need to be filled in with the credentials, endpoints, and resource IDs for the three IBM Cloud services. Use the values obtained in step 2, then execute each cell in the notebook.
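
The notebook's actual variable names may differ, but the configuration cell collects information along these lines (all values are placeholders taken from step 2):

# Hypothetical configuration cell; the notebook's own parameter names may differ.
NLU_APIKEY = "YOUR_NLU_APIKEY"               # step 2a
NLU_ENDPOINT = "YOUR_NLU_ENDPOINT"
COS_APIKEY = "YOUR_COS_APIKEY"               # step 2b
COS_INSTANCE_CRN = "YOUR_COS_INSTANCE_CRN"
COS_ENDPOINT = "https://s3.us-south.cloud-object-storage.appdomain.cloud"
DICTATION_BUCKET = "my-dictation-text"
RESULTS_BUCKET = "my-nlu-results"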

5. Run the Web UI

This web app showcases the archetype discovery process. Users can:

  1. Upload a corpus (zip file containing txt files) which will be processed by Watson NLU.
  2. Compute the archetypes of a corpus and analyze them.
  3. Match a new document with the archetypes and see the relevant terms.

Follow the instructions in the Web UI's README to install and run the app.

If the web service is deployed on a server with public IP, the UI can be accessed on a mobile device.
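
The matching step (item 3 above) amounts to scoring a new dictation against the archetypes discovered earlier. Continuing the NMF-style sketch from the introduction (the web app's own implementation may differ), a simple cosine similarity works; archetypes is the archetypes-by-keywords table from that sketch and new_keywords is hypothetical NLU output for the new document:

# Sketch: match a new document to the closest archetype by cosine similarity
# between its NLU keyword vector and each archetype's keyword weights.
# `archetypes` comes from the discovery sketch; `new_keywords` is a placeholder.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

new_keywords = {"chest pain": 0.97, "ekg": 0.84, "aspirin": 0.61}
new_vector = (pd.Series(new_keywords)
              .reindex(archetypes.columns, fill_value=0.0)
              .to_frame().T)
scores = cosine_similarity(new_vector, archetypes)[0]
print("Best matching archetype:", scores.argmax())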

(Screenshot of the archetypes web UI)
