# Set up environment in notebook

## Install packages for Watson Studio

In [None]:
!pip install --upgrade "ibm-watson>=4.0.1"
!pip install --upgrade "dash>=1.2.0"
!pip install --upgrade "plotly>=4.1.1"
!rm -rf discover-archetype
!git clone -b refactor https://github.com/tonanhngo/discover-archetype.git

## Configure parameters

In [None]:
import sys
sys.path.append("discover-archetype/python")
import find_archetype
from ibm_watson.natural_language_understanding_v1 import (
    Features, CategoriesOptions, ConceptsOptions, EntitiesOptions,
    KeywordsOptions, RelationsOptions, SyntaxOptions,
)

print('Credentials needed for https://cloud.ibm.com/catalog/services/natural-language-understanding')

# Pass the prompt positionally so the call also works outside Jupyter
NLU_APIKEY = input('Please enter the API-Key for your Watson NLU service:')
NLU_ENDPOINT = "UPDATE-ME"

# If True, use IBM Cloud Object Storage for input and output
# If False, use the local file system
USE_CLOUD_STORE = True

# Configure the following if using IBM Cloud Object Storage
# This is needed for running your notebook on Watson Studio, but can also be used when running your notebook locally
PATH = {}
if USE_CLOUD_STORE:
    PATH['dictation_bucket'] = "dictations"
    PATH['nlu_bucket'] = "output-nlu"
    PATH['cos_dictation_apikey'] = input('Please enter the API-Key for your dictation bucket in IBM Cloud Object Storage:')
    PATH['cos_dictation_crn'] = "UPDATE-ME"
    PATH['cos_dictation_endpoint'] = "UPDATE-ME"
    PATH['cos_dictation_auth_endpoint'] = "UPDATE-ME"
    
    PATH['cos_nlu_apikey'] = input('Please enter the API-Key for your NLU bucket in IBM Cloud Object Storage:')
    PATH['cos_nlu_crn'] = "UPDATE-ME"
    PATH['cos_nlu_endpoint'] = "UPDATE-ME"

else:
    # Where to load data from and save data to
    PATH['data']    = '../data/Documents/'
    PATH['results'] = './Watson-nlu-results/'

NLU = {}
NLU['apikey']         = NLU_APIKEY
NLU['apiurl']         = NLU_ENDPOINT
NLU['version']        = '2019-07-12'
NLU['features']       = Features(
                        categories= CategoriesOptions(),
                        concepts  = ConceptsOptions(),
                        entities  = EntitiesOptions(),
                        keywords  = KeywordsOptions(),
                        relations = RelationsOptions(),
                        syntax    = SyntaxOptions()
                        )


# EXPLORING ARCHETYPES

When exploring the dimensionality of the problem we used SVD, the 'singular value decomposition' of a matrix of data. Now we move on to Archetypal analysis, a type of 'soft clustering'.

The relationship between SVD and Archetype/Cluster representations is not unlike the relationship between waves and particles. SVD is more like an overlay of multiple waves, much as JPEG uses the Fourier/cosine transform to decompose pictures, while Archetypes/Clusters are more like representing the picture as a sum of objects. The dimensionality distribution in SVD is not unlike the frequency distribution in a JPEG, although much less restricted: the Fourier/cosine transform has predefined, delocalized basis functions, while SVD is more flexible in its choice of basis functions. They don't have to be delocalized, even if they often are.

Clustering/Archetypal analysis trades information quality for intuitive interpretability: it is easier to understand a sum of objects than an overlay of waves. The key difference is that waves have phase, so they have both negative and positive amplitudes. Objects, on the other hand, never have negative presence. The presence of objects always adds up; objects can't cancel each other out like waves do. This explains the name of the method we use for computing the Archetypes/Clusters: "Non-Negative Matrix Factorization" (NMF). SVD factorizes a matrix into orthogonal components whose matrix elements can be either negative or positive. NMF requires that all matrix elements are non-negative.

NMF does, however, still allow delocalization. In straightforward clustering models, an element belongs to exactly one cluster. NMF is an example of "soft clustering", where an element can belong to several clusters, just like a word can belong to several overlapping categories.
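
A minimal sketch of the contrast between SVD and NMF, assuming a made-up toy matrix and a plain multiplicative-update NMF (the Lee and Seung update rules). This is not necessarily the solver that `find_archetype` uses, and the helper name `nmf_sketch` is hypothetical.

```python
import numpy as np

# Toy non-negative "document x feature" matrix (values invented for illustration).
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
    [1.0, 1.0, 1.0, 1.0],
])

# SVD: orthogonal components whose entries may be positive or negative.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_has_negative = bool((Vt[:2] < 0).any())


def nmf_sketch(X, n_components, n_iter=500, seed=0, eps=1e-9):
    """Approximate X by W @ H with W >= 0 and H >= 0 (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_components)) + eps
    H = rng.random((n_components, X.shape[1])) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so signs never flip.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


W, H = nmf_sketch(X, n_components=2)
nmf_is_nonnegative = bool((W >= 0).all() and (H >= 0).all())

# Reconstruction errors (Frobenius norm) using two components each.
svd_error = np.linalg.norm(X - (U[:, :2] * s[:2]) @ Vt[:2])
nmf_error = np.linalg.norm(X - W @ H)
print(svd_has_negative, nmf_is_nonnegative, svd_error <= nmf_error + 1e-9)
```

On this toy matrix the first two SVD components contain negative entries, while `W` and `H` stay non-negative. The two-component NMF error cannot fall below the two-component SVD error, because the truncated SVD is the best rank-2 approximation in the Frobenius norm.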

Since NMF is more restricted than SVD, we assume that the dimensionality found by SVD is a lower limit for the dimensionality reduction that can be achieved through NMF. By the same line of reasoning, the overlap between two different Archetypes/Soft Clusters can't be smaller than zero, whereas the overlap between two different 'wave' modes in SVD is always zero.
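
The two overlap statements can be checked numerically. The sketch below assumes a random non-negative matrix and two made-up archetype vectors; "overlap" is taken to mean the dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # random non-negative data, for illustration only

# Overlap between distinct SVD modes: zero up to floating-point rounding.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
overlaps = Vt @ Vt.T
off_diagonal = overlaps - np.diag(np.diag(overlaps))
max_svd_overlap = np.abs(off_diagonal).max()

# Overlap between two hypothetical non-negative archetype vectors:
# every term of the dot product is >= 0, so the sum is >= 0.
a1 = np.array([0.8, 0.6, 0.0, 0.1])
a2 = np.array([0.1, 0.0, 0.7, 0.7])
archetype_overlap = float(a1 @ a2)

print(max_svd_overlap < 1e-10, archetype_overlap >= 0.0)
```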

With this in mind we now identify the Archetypes of our corpus of dictations by computing the NMF-clusters. 

Below, we choose to partition our corpus into six archetypes.

In [None]:
## INSTANTIATE THE WatsonDocumentArchetypes OBJECT as 'wda' 
# Split dictations into train/test-set with 5% set aside as test-dictations
wda    = find_archetype.WatsonDocumentArchetypes(PATH,NLU,train_test = 0.05, random_state = 42, use_cloud_store = USE_CLOUD_STORE)

# BUILDING THE ARCHETYPES


In the plots below, the different archetypes are shown and compared. Each plot is organized around one (key) archetype, whose components are plotted in descending order, starting from the largest. The other archetypes' values for the same components are shown for comparison.

The list is truncated where the key archetype's component values go below 10% of the strongest component. 
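
The ordering and the 10% cut-off can be sketched as follows. The weights, feature names, and the helper `truncated_view` are invented for illustration; this is not the implementation of `show_archetype.plot_archetypes`.

```python
import numpy as np

# Hypothetical archetype-by-feature weights (rows: archetypes, columns: features).
features = ["fever", "cough", "fracture", "rash"]
weights = np.array([
    [0.90, 0.40, 0.05, 0.20],  # key archetype
    [0.10, 0.70, 0.02, 0.30],
    [0.05, 0.02, 0.80, 0.10],
])


def truncated_view(weights, features, key=0, threshold=0.10):
    """Order features by the key archetype (descending) and drop those
    whose key value is below `threshold` times the strongest component."""
    order = np.argsort(-weights[key])
    cutoff = threshold * weights[key].max()
    keep = [int(i) for i in order if weights[key, i] >= cutoff]
    return [features[i] for i in keep], weights[:, keep]


names, shown = truncated_view(weights, features)
print(names)
print(shown)
```

Here "fracture" is dropped because its key-archetype value (0.05) is below 10% of the strongest component (0.90); the other archetypes' values are kept for the remaining features only.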

## 1. FROM ENTITIES

In [None]:
import show_archetype
fig = show_archetype.plot_archetypes(wda, 'entities')
fig.show()

## 2. FROM CONCEPTS

In [None]:
fig = show_archetype.plot_archetypes(wda, 'concepts')
fig.show()

## 3. FROM KEYWORDS


In [None]:
fig = show_archetype.plot_archetypes(wda, 'keywords')
fig.show()

# USING THE ARCHETYPES AS A COORDINATE SYSTEM FOR DOCUMENTS

We apply hierarchical clustering (dendrograms) to order the dictations so that similar ones are placed next to each other. We see that the resulting groups are quite distinct.

The columns represent the six archetypes, the rows are the dictations. 

The dictations are normalized so that the coefficients over the archetypes sum to exactly one for each dictation. A row with a completely white segment will therefore be completely black otherwise, indicating that 100% of the dictation belongs to the 'white' archetype.
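
A small sketch of this row normalization, assuming a made-up matrix of dictation-by-archetype coefficients (three archetypes instead of six, to keep it short):

```python
import numpy as np

# Hypothetical non-negative coefficients: rows are dictations, columns archetypes.
coeffs = np.array([
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 2.0],
    [0.0, 3.0, 1.0],
])

# Divide each row by its own sum so that every dictation sums to one.
normalized = coeffs / coeffs.sum(axis=1, keepdims=True)
print(normalized)
```

The first row becomes `[1, 0, 0]`: a dictation that belongs entirely to one archetype has a value of one in that column and zeros everywhere else.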

## ENTITY-ARCHETYPES

In [None]:
## ARCHETYPES based on ENTITIES in corpus texts
show_archetype.plot_coordinate(wda, "entities")

## CONCEPT-ARCHETYPES

In [None]:
## ARCHETYPES based on CONCEPTS in corpus texts
show_archetype.plot_coordinate(wda,'concepts')

## KEYWORD-ARCHETYPES 

In [None]:
## ARCHETYPES based on KEYWORDS in corpus texts
show_archetype.plot_coordinate(wda,'keywords')

# ANALYZING NEW DOCUMENTS 

**SCENARIO**: 
1. A physician dictates notes after examining a patient. 
2. The dictation is automatically transcribed. 
3. The dictation (transcript) is analyzed by Watson NLU, returning entities/concepts/keywords as shown above.
4. The analysis is mapped onto the archetypes shown above and returned to the physician. 

Note that we do not include the new document in the corpus.

Here we will go through steps 3-4, assuming that 1-2 have already been performed. 

## 2. Analyzing a New Transcript

We emulate a new transcript by picking one from our test set, which is not included in the corpus.



In [None]:
## Emulate a new transcript by picking one from the test set (not included in the corpus).

test_name = wda.names_test[1]

test_text = wda.dictation_df.loc[test_name]
test_text



## 3. RUN WATSON NLU ON TEST DOCUMENT

In [None]:
## DAVID TO TEAM: See self.watson and self.watson_nlu, which do the Watson analysis
## for all documents in the corpus. Reuse code?



# ## Call Watson API
# def watson_nlu(text, 
#                typ_list = ['entities','concepts','keywords']):
#     module = wda.nlu_model.analyze(text = text, features=NLU['features'])
#     result = {}
#     for typ in typ_list:
#          result[typ] = pd.DataFrame(module.result[typ])
#     return result

# test_watson = watson_nlu(test_text)

test_watson = wda.watson_nlu[test_name]
test_watson
    

In [None]:
test_watson['concepts']

In [None]:
## Construct the 'concepts'-word vector

test_vec = test_watson['concepts'].set_index('text')[['relevance']].apply(find_archetype.norm_dot)
test_vec

## 4. MAP TEST DOCUMENT ON ARCHETYPES 

### 1:  Similarities to Archetypes

We project the test document onto the Archetypes by using **cosine similarities** showing 'how similar' the document is to an archetype. The similarity between an archetype vector **a** and the test document vector **d** is 

$$\text{similarity} = {\mathbf{a} \cdot \mathbf{d} \over \|\mathbf{a}\| \|\mathbf{d}\|}= \mathbf{\widehat{a}} \cdot \mathbf{\widehat{d}} $$

where the 'hat' represents 'dot-normalized' vectors, such that $\mathbf{\widehat{a}} \cdot \mathbf{\widehat{a}} = 1$.
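
As a minimal numerical sketch of this formula, assuming two made-up vectors: the helper names `unit` and `cosine_similarity` are hypothetical, and `find_archetype.norm_dot` may or may not be implemented the same way.

```python
import numpy as np


def unit(v):
    """Return v scaled so that unit(v) @ unit(v) == 1 ('dot-normalized')."""
    return v / np.sqrt(v @ v)


def cosine_similarity(a, d):
    """Dot product of the two dot-normalized vectors."""
    return float(unit(a) @ unit(d))


a = np.array([3.0, 4.0, 0.0])  # stand-in for an archetype vector
d = np.array([4.0, 3.0, 0.0])  # stand-in for a document vector

sim = cosine_similarity(a, d)  # (12 + 12) / (5 * 5) = 0.96
print(sim)
```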

**NOTE** that, since the Archetypes are NOT an orthogonal set, projecting the test document onto the Archetypes, i.e. saying 'the test document is this much similar to the first archetype and that much similar to the second archetype', is *NOT* the same as saying 'the test vector can be described as a sum of this much of the first archetype and that much of the second archetype'. Because the archetypes overlap, summing the similarities counts the overlapping parts more than once, so they are erroneously amplified.

Consider: A mule is half horse and half donkey. A mule on a hill is half a horse on a hill and half a donkey on a hill, but don't sum that up to half a horse and half a donkey on *two* hills. When using Archetypes as a basis set, this will be taken into account. We do this in "Archetypes as a basis set".
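
The mule example can be made concrete with a toy calculation. The three features (horse, donkey, hill) and the vectors are invented, and unconstrained least squares stands in for a basis-set expansion; it is only an illustration, not the method used in "Archetypes as a basis set".

```python
import numpy as np

# Features: [horse, donkey, hill]
A = np.array([
    [1.0, 0.0, 1.0],  # archetype 1: a horse on a hill
    [0.0, 1.0, 1.0],  # archetype 2: a donkey on a hill
])
d = np.array([0.5, 0.5, 1.0])  # document: a mule on a hill


def unit(v):
    return v / np.linalg.norm(v)


# Projections: cosine similarity of the document with each archetype.
similarities = np.array([unit(a) @ unit(d) for a in A])

# Naively summing the archetypes weighted by similarity overcounts the hill.
naive_sum = similarities @ A

# Basis-set view: coefficients c such that c @ A reproduces d.
coeffs, *_ = np.linalg.lstsq(A.T, d, rcond=None)
reconstruction = coeffs @ A

print(similarities, naive_sum, coeffs, reconstruction)
```

Both similarities are about 0.87, and weighting the archetypes by them gives a 'hill' component of about 1.73 instead of 1. The basis-set coefficients are 0.5 and 0.5, which reproduce the document exactly, with a single hill.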

Here we will only look at the **projections / similarities** between a document and the archetypes. 

In [None]:
archetypes = wda.archetypes(typ='concepts',n_archs=6)

In [None]:
archetypes.f

In [None]:
## Select the subset of features in corpus that cover the test vector.
in_common     = list(set(test_vec.index).intersection(set(archetypes.fn.columns)))

## Check if the test vector contains new features that are not in corpus
beyond_corpus = list(set(test_vec.index) - set(archetypes.fn.columns))

## Display
in_common, beyond_corpus

In [None]:
# Measure the similarities between the test vector and the archetypes
show_archetype.plot_similarity(((archetypes.fn[in_common] @ test_vec.loc[in_common]) * 100).applymap(int), 
                               'MY DOC match with ALL Archetypes-features (%)')

In [None]:
import numpy as np
scale_segment = np.sqrt(archetypes.fn.shape[1]/len(test_vec))
show_archetype.plot_similarity(((archetypes.fn[in_common]* scale_segment @ test_vec.loc[in_common]) * 100).applymap(int), 
                               'Archetypes match with MY DOC-feature subset (%)')

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(font_scale = 2)
compare = archetypes.fn[in_common].T
# mx = np.sqrt(archetypes.fn.shape[1]/len(test_vec))
mx = 1
compare = compare * mx
compare['MY DOC'] = test_vec.loc[in_common].apply(find_archetype.scale)
compare = compare.sort_values(by='MY DOC', ascending = False)[['MY DOC']+list(compare.columns)[:-1]]
plt.figure(figsize = (6,6))
sns.heatmap((compare*100).applymap(np.sqrt),linewidths = 1,cbar=False)
plt.xlabel('Archetypes')
plt.title('Match with Archetypes')

In [None]:
import pandas as pd
test_vec_expanded = pd.DataFrame(test_vec, index = archetypes.f.columns).apply(find_archetype.scale).fillna(-0.1)
test_vec_expanded.min()

In [None]:
test_vec_expanded = pd.DataFrame(test_vec, index = archetypes.f.columns).apply(find_archetype.scale).fillna(0)

sns.set(font_scale = 1.5)
compare = archetypes.f.T.apply(find_archetype.scale)
compare['MY DOC'] = test_vec_expanded.apply(find_archetype.scale)
for ix in archetypes.f.index:
    cmp = compare.sort_values(by=ix,ascending=False)[[ix,'MY DOC']]
    cmp = cmp[cmp[ix] >0.1]
    plt.figure(figsize = (2,6))
    sns.heatmap(cmp.applymap(np.sqrt),linewidth = 1)
    plt.show()

# Dimensionality: How diverse is the corpus?
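
The 'volume' curve for this section is the cumulative sum of singular values divided by their total. A minimal sketch with plain numpy, assuming a random stand-in matrix; `find_archetype.Svd` is not used here, and the 90% level is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 20))  # stand-in for a dictation-by-feature matrix

# Singular values, sorted in descending order by numpy.
s = np.linalg.svd(X, compute_uv=False)

volume = s.sum()
volume_distribution = s.cumsum() / volume

# Number of dimensions needed to capture at least 90% of the volume.
dims_for_90 = int(np.searchsorted(volume_distribution, 0.90) + 1)
print(volume_distribution.round(3), dims_for_90)
```

A curve that rises slowly towards one means that many dimensions are needed, i.e. the content is spread over many directions.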

In [None]:
## DIMENSIONALITY OF THE CORPUS
# Establish with Singular Value Decomposition (principal component analysis) :

types = ['keywords','concepts','entities']
svde  = {}
volume = {}
volume_distribution = {}

for typ in types:
    svde[typ]     = find_archetype.Svd(wda.X_matrix(typ))
    volume[typ] = svde[typ].s.sum()
    volume_distribution[typ] = svde[typ].s.cumsum()/volume[typ]
    plt.plot(volume_distribution[typ],label = typ)
plt.title('DIMENSIONALITY OF WATSON DICTATION DATA ANALYSIS OUTPUT')
plt.xlabel('Dimensions / max = number of dictations in corpus')
plt.ylabel('Volume')
plt.legend()
plt.grid()
plt.show()


## CONCLUSIONS: 
## DICTATION WORD/ENTITY/CONCEPTUAL CONTENT IS DIVERSE AND SPREAD OVER MANY DIMENSIONS
## ACCESS TO A LARGER CORPUS SHOULD BE VERY HELPFUL!