Systems of records are ubiquitous in the world around us: music playlists, job listings, medical records, customer service calls, GitHub issues, and so on. An archetype is formally defined as a pattern, or model, from which all things of the same type are copied. More informally, we can think of archetypes as categories, classes, or topics.
When we read through a set of these records, our mind naturally groups the records into some collection of archetypes. For example, we may sort a song collection into easy listening, classical, rock, etc. This manual process is practical for a small number of records (e.g., a few dozen). Large systems can have millions of records, so we need an automated way to process them. In addition, without prior knowledge of these records, we may not know beforehand the archetypes that exist in the records, so we also need a way to discover meaningful archetypes that can be adopted. Since records are often in the form of unstructured text, such automated processing needs to be able to understand natural language. Watson Natural Language Understanding, coupled with statistical techniques, can help you to:
- discover meaningful archetypes in your records and then
- classify new records against this set of archetypes.
In this example, we will use a medical dictation data set to illustrate the process. The data is provided by ezDI and includes 249 actual medical dictations that have been anonymized.
When the reader has completed this code pattern, they will understand how to:
- Work with the Watson Natural Language Understanding service (NLU) through API calls.
- Work with the IBM Cloud Object Store service (COS) through the SDK to hold data and results.
- Perform statistical analysis on the results from Watson Natural Language Understanding.
- Explore the archetypes through graphical interpretation of the data in a Jupyter Notebook or a web interface.
- The user downloads the custom medical dictation data set from ezDI and prepares the text data for processing.
- The user interacts with the Watson Natural Language Understanding service via the provided application UI or the Jupyter Notebook.
- The user runs a series of statistical analyses on the results from Watson Natural Language Understanding.
- The user uses the graphical display to explore the archetypes that the analysis discovers.
- The user classifies a new dictation by providing it as input and sees which archetype it is mapped to.
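The analysis, discovery, and classification steps of this flow can be sketched in plain Python. The text above does not name the statistical technique, so this sketch assumes non-negative matrix factorization (NMF) with multiplicative updates for discovery and cosine similarity for classification. Both are stand-ins chosen for illustration and are not necessarily what the notebook does. The function names (`to_matrix`, `nmf`, `classify`) and the sample keyword scores are hypothetical.

```python
import math
import random


def to_matrix(docs):
    """Turn per-document {keyword: relevance} dicts into a dense docs x terms matrix."""
    vocab = sorted({kw for doc in docs for kw in doc})
    return vocab, [[doc.get(kw, 0.0) for kw in vocab] for doc in docs]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def error(v, w, h):
    """Frobenius norm of v - w*h (lower means a better reconstruction)."""
    wh = matmul(w, h)
    return math.sqrt(sum((x - y) ** 2 for r1, r2 in zip(v, wh) for x, y in zip(r1, r2)))


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor v (docs x terms) into w (docs x k) and h (k x terms), all non-negative.

    Each row of h is treated as one archetype (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(doc, vocab, h):
    """Return (index of the most similar archetype, all similarity scores)."""
    vec = [doc.get(kw, 0.0) for kw in vocab]
    scores = [cosine(vec, row) for row in h]
    return max(range(len(h)), key=scores.__getitem__), scores


# Hypothetical keyword relevances, shaped like what a keyword extractor might return.
docs = [
    {"chest pain": 0.9, "ecg": 0.8},
    {"chest pain": 0.7, "ecg": 0.9, "aspirin": 0.3},
    {"knee": 0.9, "fracture": 0.8},
    {"knee": 0.8, "fracture": 0.9, "x-ray": 0.4},
]
vocab, v = to_matrix(docs)
w, h = nmf(v, k=2)
best, scores = classify({"knee": 0.9, "x-ray": 0.5}, vocab, h)
print(f"new dictation maps to archetype {best}")
```

The number of archetypes `k` is an input here; in practice it would be chosen by inspecting the results, which is what the graphical exploration step is for.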
- IBM Watson Natural Language Understanding: processes a passage of natural-language text and returns attributes such as sentiment, keywords, entities, relations, concepts, and categories.
- IBM Watson Natural Language Understanding: advanced models that process text in natural language and produce relevant information that can be used directly or in further processing downstream.
- IBM Watson Studio: a comprehensive environment and tools to work with your data.
- IBM Cloud Object Store: easily store and manage your data without limit.
- AI in medical services: save time for medical care providers by automating tasks such as entering data into Electronic Medical Records.
- Clone the repo
- Create IBM Cloud services
- Download and prepare the data
- Run the Jupyter notebook
- Run the Web UI
git clone https://github.com/IBM/discover-archetype
You will use 3 IBM Cloud services.
On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Watson -> Watson Services -> Browse Services -> Natural Language Understanding.
Select the Lite plan and click Create.
When the service becomes available, copy the endpoint and credentials for use later.
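As a rough illustration of how the copied endpoint and API key are used, the sketch below builds a request for the NLU `/v1/analyze` REST endpoint using only the Python standard library. The helper names, the `version` date, and the choice of features are assumptions for illustration; the notebook may use the Watson SDK instead.

```python
import base64
import json
import urllib.request


def build_analyze_request(url, apikey, text, version="2022-04-07", keyword_limit=20):
    """Build (but do not send) a POST request for the NLU /v1/analyze endpoint.

    `url` and `apikey` are the values copied from the service credentials page.
    The version date is a placeholder; use the one your service instance documents.
    """
    body = {
        "text": text,
        "features": {
            "keywords": {"limit": keyword_limit},
            "entities": {},
            "concepts": {},
        },
    }
    # IBM Cloud API keys can be sent as HTTP basic auth with the user name "apikey".
    token = base64.b64encode(f"apikey:{apikey}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{url.rstrip('/')}/v1/analyze?version={version}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


def analyze(request):
    """Send the request and return the parsed JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `analyze(build_analyze_request(url, apikey, dictation_text))` should return a JSON document whose `keywords` entries carry a `text` and a `relevance` score, which is the kind of input the statistical analysis works on.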
On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Classic Infrastructure -> Storage -> Object Storage.
Select the Lite plan and click Create.
When the service becomes available, click on Create bucket and create two buckets: one for the medical dictations and one for the NLU results.
Copy the bucket instance CRN, endpoints and credentials for use later.
On your IBM Cloud dashboard, using the left-side navigation menu, navigate to Watson -> Watson Services -> Browse Services -> Watson Studio.
Select the Lite plan and click Create.
Click on New project and create an empty project. Navigate into your new empty project and click on New notebook. Select From file and upload the Jupyter notebook from the local git repo:
discover-archetype/notebook/WATSON_Document_Archetypes_Analysis_Showcase.ipynb
Go to the ezDI web site and download both medical dictation text files. The downloads are delivered as zip files.
Create a notebook/Documents subdirectory and then extract the downloaded zip files into their respective locations.
The dictation files stored in the notebook/Documents directory will be in rtf format and need to be converted to plain text. Use the following commands to convert them all to txt files.
Note: Run the following script with Python 3.
pip install striprtf
cd notebook/Documents
python ../python/convert_rtf.py
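The repository's convert_rtf.py is not reproduced here. As a sketch of what such a conversion step might look like, the hypothetical helper below writes a `.txt` file next to every `.rtf` file in a directory. The converter is passed in as a function so the loop itself has no third-party dependency; with the `striprtf` package installed, `rtf_to_text` from `striprtf.striprtf` is a suitable converter.

```python
from pathlib import Path


def convert_rtf_dir(directory, rtf_to_text):
    """Write a .txt file next to every .rtf file in `directory`.

    `rtf_to_text` is any function mapping an RTF string to plain text.
    Returns the list of .txt paths that were written.
    """
    written = []
    for rtf_path in sorted(Path(directory).glob("*.rtf")):
        raw = rtf_path.read_text(encoding="utf-8", errors="ignore")
        txt_path = rtf_path.with_suffix(".txt")
        txt_path.write_text(rtf_to_text(raw), encoding="utf-8")
        written.append(txt_path)
    return written


# Example use from inside notebook/Documents (requires `pip install striprtf`):
#   from striprtf.striprtf import rtf_to_text
#   convert_rtf_dir(".", rtf_to_text)
```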
Upload the dictation files in text format to the IBM Cloud Object Store bucket for dictation.
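The upload can be done through the IBM Cloud console, or scripted. The sketch below assumes a client from the `ibm-cos-sdk` package (`ibm_boto3`), whose S3-style client is expected to offer `upload_file(Filename, Bucket, Key)`. The helper name `upload_text_files` and the upper-case variable names in the comment are hypothetical.

```python
from pathlib import Path


def upload_text_files(client, bucket, directory):
    """Upload every .txt file in `directory` to `bucket`, keyed by file name.

    `client` is any object with an S3-style upload_file(Filename, Bucket, Key) method.
    Returns the object keys that were uploaded, in sorted order.
    """
    keys = []
    for path in sorted(Path(directory).glob("*.txt")):
        client.upload_file(Filename=str(path), Bucket=bucket, Key=path.name)
        keys.append(path.name)
    return keys


# Assumed client construction with ibm-cos-sdk (not executed here):
#   import ibm_boto3
#   from ibm_botocore.client import Config
#   client = ibm_boto3.client(
#       "s3",
#       ibm_api_key_id=API_KEY,                # from the COS service credentials
#       ibm_service_instance_id=INSTANCE_CRN,  # the instance CRN copied earlier
#       config=Config(signature_version="oauth"),
#       endpoint_url=ENDPOINT,                 # the bucket's endpoint
#   )
#   upload_text_files(client, "my-dictation-bucket", "notebook/Documents")
```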
In the Jupyter console, the second cell contains a number of parameters that need to be filled out with the necessary credentials, endpoints, and resource IDs for the 3 IBM Cloud services. Use the values obtained from step 2. Then use the console to execute each cell in the notebook.
This web app showcases the archetype discovery process. Users can:
- Upload a corpus (zip file containing txt files) which will be processed by Watson NLU.
- Compute the archetypes of a corpus and analyze them.
- Match a new document with the archetypes and see the relevant terms.
- Match a new document with the archetypes and see the relevant terms.
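For reference, a corpus in the expected shape (a zip archive of txt files) can be read with the Python standard library alone. This is a minimal sketch of that step, not the web app's actual code, and `read_corpus` is a hypothetical name.

```python
import io
import zipfile


def read_corpus(zip_bytes):
    """Return {member name: text} for every .txt file inside a zip archive."""
    corpus = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not a text file.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            corpus[info.filename] = archive.read(info).decode("utf-8", errors="replace")
    return corpus
```

Each text in the returned dictionary could then be sent to NLU one document at a time.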
Follow the instructions in the README.
If the web service is deployed on a server with a public IP address, the UI can also be accessed from a mobile device.