# Using Primer Engines and NLP to get a big-picture view across documents

Making sense of the contents of a large set of unknown documents is relevant to many industry applications. This tutorial will take you through the HierarchicalTagger, a pipeline created to address this broad use case. 

Combining the power of [Primer Engines](https://developers.primer.ai/docs) with a custom prototype built on top of deep NLP models, the pipeline is designed to ingest an arbitrary set of documents, produce a hierarchical visualization of their contents (like the one below), and finally make the corpus searchable by tagging each document with both specific and broad keywords.

You can follow the full tutorial [here](ADDLINK)

![](./example-sunburst-chart.png)

## Set-up

In [1]:
# Import packages
import os
import sys
import numpy as np
import pandas as pd
import json
from more_itertools import chunked
from IPython.display import clear_output

In [2]:
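# Find the path to the repository root. The quoted string "__file__" is resolved against the
# current working directory, so the first os.path.dirname drops that placeholder name and the
# second steps up to the parent of the notebook's directory.
# All paths will be expressed relative to the repository root.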
ROOT_DIR = os.path.dirname(os.path.dirname(os.path.realpath("__file__")))

In [3]:
# Make sure you launched the Jupyter notebook from the repository root with `PYTHONPATH=$(pwd) jupyter notebook`
# This allows importing python modules found at the root of the repository like this:
from engines_utils.engines import infer_model_on_docs

## Generate Abstractive Topics via Engines

At this point we would send the documents to the Primer APIs in batches and receive the generated topic labels back. So that you can proceed directly to the next steps, we have done this for you and included the processed results for a random sample of 3000 products at the `PRECOMPUTED_ITEM_TOPICS` path. Feel free to save your Engines credits and skip to the next section.

Running the pipeline on your own data is straightforward, and the following cells show how to do it.

In [4]:
# We start by loading the Amazon Product Dataset into a pandas DataFrame
PATH_TO_DATASET = os.path.join(ROOT_DIR, "./examples/data/amazon_products_2020.csv")
df = pd.read_csv(PATH_TO_DATASET)

In [None]:
# Let's see what the dataset looks like
df.head()

In [6]:
# We limit our analysis to 3000 documents to reduce the computational burden.
# Random sampling helps keep the documents representative of the whole dataset.
# We fix the random_state parameter so that the same 3000 documents are chosen each time this notebook runs.
sampled_items = df[df['About Product'].notnull()].sample(3000, random_state=1363)

In [7]:
# Transform documents into the standard format to send to Engines: a list of dictionaries with an `id` and a `text` key.
documents = [{"id": r["Uniq Id"], "text": r['About Product']} for i, r in sampled_items.iterrows()]

In [8]:
# Confirm we have 3000 documents
len(documents)

3000
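
Before spending Engines credits, it can be worth checking that the list really has the shape described above. The following is a minimal sketch rather than part of the repository: `validate_documents` is a hypothetical helper, and the sample records are made up. It flags missing keys, empty text, and repeated ids, since results are returned keyed by document id and a repeated id would overwrite an earlier entry.

```python
from collections import Counter


def validate_documents(documents):
    """Return a list of human-readable problems found in `documents`.

    Hypothetical helper: it only checks the `id`/`text` shape used in this
    tutorial and is not part of the repository.
    """
    problems = []
    for position, doc in enumerate(documents):
        # Every item must be a dictionary carrying both keys.
        if not isinstance(doc, dict) or not {"id", "text"} <= set(doc):
            problems.append(f"item {position}: missing 'id' or 'text' key")
            continue
        # Text must be a non-empty string once whitespace is stripped.
        if not isinstance(doc["text"], str) or not doc["text"].strip():
            problems.append(f"item {position}: empty or non-string text")

    # Ids are used as dictionary keys later on, so they must be unique.
    id_counts = Counter(
        doc["id"] for doc in documents if isinstance(doc, dict) and "id" in doc
    )
    for doc_id, count in id_counts.items():
        if count > 1:
            problems.append(f"id {doc_id!r} appears {count} times")
    return problems


# Made-up records for illustration only.
sample_documents = [
    {"id": "a1", "text": "Wooden puzzle for toddlers"},
    {"id": "a2", "text": "   "},
    {"id": "a1", "text": "Remote control car"},
]
problems = validate_documents(sample_documents)
for problem in problems:
    print(problem)
```

In the notebook you would call `validate_documents(documents)` and expect an empty list back.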

In [9]:
# Extract a dictionary storing additional attributes about the documents. This will be necessary
# to display the original document data in the web app we will use later.
document_attributes = {r["Uniq Id"]: {"title": r['Product Name'], "text": r['About Product']} for i, r in sampled_items.iterrows()}

In [23]:
# We stored our API key in credentials.py in the root of the repository and outside of version control.
# It contains a single line:
# ENGINES_API_KEY="YOUR_ENGINES_API_KEY"
# Please do the same if you wish to try it out. Otherwise, skip to the next section.
# We can then import the key without exposing it in the notebook
from credentials import ENGINES_API_KEY

In [None]:
# Let's try pinging the Engines API to check everything is in good shape
# We'll just try it out with 2 test documents
test_documents = documents[:2]
topics_results = await infer_model_on_docs(test_documents,
                                           model_name="abstractive_topics",
                                           api_key=ENGINES_API_KEY,
                                           batch_size=10,
                                           **{"segmented": False})
# The results will be a dictionary mapping document id to the response from the API
# Let's check we are happy with what that looks like.
topics_results 

In [None]:
# We are now ready to send the full 3000 documents to the Engines API and save the results.
ITEM_TOPICS = os.path.join(ROOT_DIR, "./examples/data/amazon_products.json")

topics = {}

# Infer topics from Engines
for doc_chunk in chunked(documents, 100):
    topics_results = await infer_model_on_docs(doc_chunk, 
                                               model_name="abstractive_topics", 
                                               api_key=ENGINES_API_KEY, 
                                               batch_size=10,
                                               **{"segmented": False})
    topics.update(topics_results)
    clear_output()
    print(f"Collected topics for {len(topics)} documents")
    # Save after each chunk so that partial results survive an interruption
    with open(ITEM_TOPICS, "w") as f:
        json.dump(topics, f)
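
Because the loop above writes the accumulated results after every chunk, an interrupted run can in principle be resumed without paying for the same documents twice. The sketch below illustrates that idea under stated assumptions: `fake_infer` is a hypothetical stand-in for the real `await infer_model_on_docs(...)` call, `chunks` is a standard-library substitute for `more_itertools.chunked`, and the documents are made up.

```python
from itertools import islice


def chunks(items, size):
    """Yield successive lists of at most `size` items.

    Standard-library stand-in for more_itertools.chunked.
    """
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_infer(batch):
    """Hypothetical stand-in for the Engines call; returns {id: response}."""
    return {doc["id"]: {"topics": [doc["text"].split()[0].lower()]} for doc in batch}


def collect_topics(documents, already_done, chunk_size):
    """Process only documents whose id is not in `already_done`, chunk by chunk."""
    results = dict(already_done)
    # Skip everything that a previous run already collected.
    pending = [doc for doc in documents if doc["id"] not in results]
    calls = 0
    for batch in chunks(pending, chunk_size):
        results.update(fake_infer(batch))
        calls += 1
        # In the notebook, this is the point where results are written to disk.
    return results, calls


# Made-up documents; two of them were "saved" by an earlier, interrupted run.
docs = [{"id": f"doc-{i}", "text": f"Item number {i}"} for i in range(7)]
saved = {"doc-0": {"topics": ["item"]}, "doc-1": {"topics": ["item"]}}
results, calls = collect_topics(docs, saved, chunk_size=2)
print(f"Collected topics for {len(results)} documents in {calls} calls")
```

To resume a real run, you would load the partially written JSON file into `already_done` before starting the loop.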

## Ingest the processed docs into the HierarchicalTagger pipeline

In [10]:
# Load topic labels from the precomputed dataset
# If you ran Engines on your own data, change the path to where you stored the output.
PRECOMPUTED_ITEM_TOPICS = os.path.join(ROOT_DIR, "./examples/data/amazon_products_precomputed.json")
with open(PRECOMPUTED_ITEM_TOPICS, "r") as f:
    topics = json.load(f)

In [11]:
# Confirm we have topics for all 3000 documents
len(topics)

3000

In [None]:
# Inspect results for an item. We are interested in the `topics` key
topics["96d96237978ba26bbc6baa437372527a"]

In [13]:
# Import module and create a HierarchicalTagger instance
from hierarchical_tagger.hierarchical_tagger import HierarchicalTagger
hierarchical_tagger = HierarchicalTagger()

In [14]:
# Rework the document topic representations into a dictionary mapping document_id: List[topic labels as str]
document_topics = {document_id: topics_entry['topics'] for document_id, topics_entry in topics.items()}
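
A quick look at the label distribution can help you judge the topic labels before the more expensive ingest step. This is a minimal sketch on made-up entries: it assumes only what the cell above relies on, namely that each entry has a `topics` key holding a list of strings.

```python
from collections import Counter

# Made-up entries shaped like the loaded file: each value has a `topics` list.
sample_topics = {
    "doc-1": {"topics": ["toys", "board games"]},
    "doc-2": {"topics": ["toys", "puzzles"]},
    "doc-3": {"topics": []},
}

# Same reshaping as in the notebook cell above.
document_topics = {doc_id: entry["topics"] for doc_id, entry in sample_topics.items()}

# Count how many documents carry each label.
label_counts = Counter(
    label for labels in document_topics.values() for label in labels
)

# Documents with no labels at all cannot be placed in the hierarchy by topic.
documents_without_topics = [
    doc_id for doc_id, labels in document_topics.items() if not labels
]

print(label_counts.most_common(1))
print(documents_without_topics)
```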

In [15]:
# Send the document ids and their corresponding topic labels for ingest
hierarchical_tagger.ingest(document_terms=document_topics, document_attributes=document_attributes)

In [16]:
# The previous step is the most computationally demanding.
# To avoid repeating it, we can save our HierarchicalTagger instance to a JSON file using the .to_json() helper method.
# This file will also be the input data to our web app, so let's save it in `webapp/data/`
SERIALIZED_INSTANCE_PATH = os.path.join(ROOT_DIR, "./webapp/data/amazon_products.json")
with open(SERIALIZED_INSTANCE_PATH, "w") as f:
    f.write(hierarchical_tagger.to_json())

In [17]:
# This is how we can reload a HierarchicalTagger instance from file
with open(SERIALIZED_INSTANCE_PATH, "r") as f:
    reloaded_serialized = json.load(f)
hierarchical_tagger = HierarchicalTagger.from_dict(reloaded_serialized)

## Build the topic tree and tag the documents

#### Hierarchical topic tree

The `.fit_tag_tree()` method populates the `.tree` attribute with a [treelib](https://treelib.readthedocs.io/en/latest/) object representing the extracted term tree. This can be manipulated and explored with all the treelib methods, for example `.show()` to print out a text representation of the tree.

In [None]:
# Fit the tree
hierarchical_tagger.fit_tag_tree()
hierarchical_tagger.tree.show()

In [None]:
# The final step is tagging the original documents based on the hierarchy we found in the tree
hierarchical_tagger.tag_documents()

#### Document tags

Inspect the `document_tags` attribute: a dictionary mapping document `id` to a list of tuples of the form `(term, score, node_id)` sorted by descending score. `score` measures how close in meaning the term is to the document. We would expect higher level abstractions to have lower scores.
`node_id` loosely indicates how high the node is in the tree: it's not a perfect measure, but more abstract terms will generally have higher `node_id`s.
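
One way to make the corpus searchable with these tags is to invert the mapping, so that each term points to the documents it was assigned to. The sketch below is an illustration, not the repository's implementation: `build_tag_index` is a hypothetical helper, and the terms, scores, and node ids are invented. It assumes only the `(term, score, node_id)` structure described above.

```python
from collections import defaultdict

# Invented tags in the (term, score, node_id) form described above.
sample_document_tags = {
    "doc-1": [("hoverboard", 0.92, 3), ("electric scooters", 0.71, 41), ("toys", 0.40, 87)],
    "doc-2": [("jigsaw puzzle", 0.88, 5), ("toys", 0.45, 87)],
}


def build_tag_index(document_tags, min_score=0.0):
    """Map each term to [(document_id, score), ...], best match first.

    Hypothetical helper; tags scoring below `min_score` are left out.
    """
    index = defaultdict(list)
    for doc_id, tags in document_tags.items():
        for term, score, _node_id in tags:
            if score >= min_score:
                index[term].append((doc_id, score))
    # Within each term, list the closest documents first.
    for term in index:
        index[term].sort(key=lambda pair: pair[1], reverse=True)
    return dict(index)


index = build_tag_index(sample_document_tags)
strict_index = build_tag_index(sample_document_tags, min_score=0.5)

print(index["toys"])
print("toys" in strict_index)
```

Raising `min_score` trades recall for precision: broad terms, which tend to score lower, drop out of the index first.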

In [None]:
# Let's see the tags assigned to the Hover Board example we saw above
hierarchical_tagger.document_tags["96d96237978ba26bbc6baa437372527a"]

## Tuning parameters and using the web app

We expose several tuning parameters that the investigator can tweak to guide the extraction of the term tree and the logic applied when tagging the documents. You can find detailed documentation [here](https://github.com/PrimerAI/hierarchical-tagger/blob/main/hierarchical_tagger/hierarchical_tagger.py#L33). 

This repository also includes a simple web app to facilitate this iterative exploration. In a terminal, activate this tutorial's virtual environment and run the app like this:

```
$ workon ht-repo # Or alternative command to activate your virtual environment
$ streamlit run webapp/app.py
```

You can find more detailed instructions in our [companion tutorial](TODO_ADDLINK).


## Congratulations!

We hope you enjoyed this tutorial! Can you think of any other data you could run this on?
