# Load Content (Active Data)

> **Active Data** represents the stream of information content coming into the system, typically on a daily basis. For example, a series of news articles published each day, and the things mentioned in those articles, are all Active Data.
 
This notebook reads Content from a DataRepo and ingests it as nodes into your KnowledgeGraph.

* Iterate through the content in the repository

* For each Content:
  - Create (or merge) a Content node
  - Link the Content node to the Source node (create if necessary)
  - Load known Keywords (in Ref Data) and create Entities for any that exist in the Content node
  - Run Named Entity Recognition on the Content node text and create Entities for any found

The result of this step includes adding:
- Source nodes (existing or newly created)
- Content nodes, connected to Source nodes with an IS_FROM relationship
- Entity nodes, connected to the Content node by a MENTIONED_IN relationship
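
The resulting graph shape can be pictured with a small in-memory sketch. Everything here is hypothetical and for illustration only: `ToyGraph`, `ingest`, and the sample articles are not part of OpenTLDR, and the relationship directions (Content to Source, Entity to Content) are an assumption based on the wording of the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    """Hypothetical in-memory stand-in for the KG; not the OpenTLDR API."""
    nodes: dict = field(default_factory=dict)  # (label, key) -> properties
    edges: set = field(default_factory=set)    # (start, relationship, end)

    def merge_node(self, label, key, **props):
        # Create the node if it is missing, otherwise update the existing one
        node_id = (label, key)
        self.nodes.setdefault(node_id, {}).update(props)
        return node_id

    def link(self, start, relationship, end):
        self.edges.add((start, relationship, end))


def ingest(graph, article, keywords):
    # Merging means two articles from the same source share one Source node
    source = graph.merge_node("Source", article["source"])
    content = graph.merge_node("Content", article["uid"], title=article["title"])
    graph.link(content, "IS_FROM", source)
    for keyword in keywords:
        if keyword.lower() in article["text"].lower():
            entity = graph.merge_node("Entity", (article["uid"], keyword), type="KEYWORD")
            graph.link(entity, "MENTIONED_IN", content)
    return content


graph = ToyGraph()
articles = [
    {"uid": "a1", "source": "Example Wire", "title": "Flood warning", "text": "Heavy rain and flood risk."},
    {"uid": "a2", "source": "Example Wire", "title": "Dry spell", "text": "No rain expected."},
]
for article in articles:
    ingest(graph, article, keywords=["flood", "rain"])
```

Both sample articles share a single Source node, while each Content node gets its own Entity nodes, which mirrors the "create or merge" behavior described above.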


In [None]:
import os
import logging
import re

## Parameters can be passed into the Notebook from an OpenTLDR Workflow
OpenTLDR workflows use the notebook block tagged as "parameters" to inject variables (for example to redirect the source of content).

> **Changing Variable Names in the Parameters Block:** You are welcome to change the values of these parameter variables. If you change their names, be aware that they are used elsewhere in this notebook and in other workflow stages.

In [None]:
# Notebook Specific Parameters

data_repo_config = {'repo_type': 'files', 'path': '../Data/Sample/content'}
spacy_model = "en_core_web_sm"

# Logging levels, from least to most verbose: ERROR, WARNING, INFO, DEBUG
logging_level = logging.INFO

# List of Content UIDs to process (replaced by the import step when a DataRepo is configured)
list_of_uids = None

# Print progress details while the notebook runs
verbose = True
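
How a workflow applies its values is not shown in this notebook. A minimal sketch of the idea, assuming injected values simply replace the defaults by name, is given below. The helper `apply_overrides` and the path `/tmp/content` are hypothetical and not part of OpenTLDR.

```python
import logging

# Defaults, mirroring the parameters block above
defaults = {
    "data_repo_config": {"repo_type": "files", "path": "../Data/Sample/content"},
    "spacy_model": "en_core_web_sm",
    "logging_level": logging.INFO,
    "list_of_uids": None,
    "verbose": True,
}


def apply_overrides(defaults, injected):
    """Return the effective parameters (hypothetical helper, not OpenTLDR code)."""
    unknown = set(injected) - set(defaults)
    if unknown:
        # A misspelled or renamed variable would otherwise be silently ignored
        raise KeyError("Unknown parameter name(s): {}".format(sorted(unknown)))
    return {**defaults, **injected}


# Example: a workflow redirects the content source and silences output
params = apply_overrides(
    defaults,
    {"verbose": False, "data_repo_config": {"repo_type": "files", "path": "/tmp/content"}},
)
```

Rejecting unknown names makes the risk described in the note explicit: a renamed variable leaves the rest of the notebook reading the old default.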

## Setup

In [None]:
logging.getLogger("OpenTLDR").setLevel(logging_level)

from opentldr.Domain import Source, Content, Entity
from opentldr import KnowledgeGraph, DataRepo

kg = KnowledgeGraph()

## Ingest Content nodes from data repo into the KG

In [None]:
if data_repo_config is not None:

    repo = DataRepo(kg, data_repo_config)

    if verbose:
        print("Loading Content from: {}".format(repo.describe()))

    list_of_uids = repo.importData()
    print("Loaded {count} articles from the repository.".format(count=len(list_of_uids)))

else:
    print("No DataRepo specified for Content.")

## Process the newly created Content nodes


### Named Entity Recognition (NER)

Named Entity Recognition detects and labels the named things mentioned in the text, such as people, organizations, and places.

In [None]:
from NerWithSpacy import NerWithSpacy
ner = NerWithSpacy(verbose=verbose, model_name=spacy_model)
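
The internals of `NerWithSpacy` are not shown in this notebook. The sketch below illustrates the contract the later cells rely on: a `skip` list and a `process` method that returns `(type, text)` pairs. The class `SkipListNer`, the canned recognizer, and the case-insensitive comparison are all assumptions made for illustration; the real class may behave differently.

```python
class SkipListNer:
    """Hypothetical stand-in for NerWithSpacy; the real class may differ."""

    def __init__(self, recognizer):
        # recognizer: callable taking text and returning (label, text) pairs
        self.recognizer = recognizer
        self.skip = []

    def process(self, text):
        # Compare case-insensitively (an assumption, not confirmed by the source)
        seen = {s.lower() for s in self.skip}
        results = []
        for label, entity_text in self.recognizer(text):
            key = entity_text.lower()
            if key in seen:
                continue  # already recorded for this Content
            seen.add(key)
            results.append((label, entity_text))
        return results


# With spaCy installed, a recognizer could be built from a loaded pipeline:
#   nlp = spacy.load("en_core_web_sm")
#   recognizer = lambda text: [(ent.label_, ent.text) for ent in nlp(text).ents]
def fake_recognizer(text):
    # Fixed output so the sketch runs without spaCy or a model download
    return [("ORG", "Acme Corp"), ("GPE", "Paris"), ("ORG", "Acme Corp")]


ner_sketch = SkipListNer(fake_recognizer)
ner_sketch.skip = ["paris"]
found = ner_sketch.process("Acme Corp opened an office in Paris.")
```

Here `found` holds one `("ORG", "Acme Corp")` pair: "Paris" is filtered by the skip list and the repeated organization is reported once.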

### Keywords
Keywords defined in the Reference Data are one kind of entity to look for.

In [None]:
keywords = []

for node in kg.get_all_reference_nodes():
    if node.type == "KEYWORD":
        keywords.append(node.text)
if verbose:
    print("Will look for {} known keywords: {}".format(len(keywords),keywords))

### Discover Entities in the Content node's text
Iterate over the imported Content, recognize entities of interest, and add those nodes to the KG.


In [None]:
# list_of_uids stays None when no DataRepo is configured and no UIDs were passed in
for uid in list_of_uids or []:
    node = kg.get_content_by_uid(uid)
    if verbose:
        print("\nProcessing Content {title}:".format(title=node.title))
    
    # avoid adding duplicate entities for the same text value
    ner.skip = [ e.text for e in kg.get_entities_by_content(node) ]

    # Check the known keywords first
    for keyword in keywords:
        if keyword not in ner.skip:
            # re.escape treats the keyword as literal text rather than a regex pattern
            if re.search(re.escape(keyword), node.text, re.IGNORECASE):
                entity_node = kg.add_entity(node, text=keyword, type="KEYWORD")
                ner.skip.append(entity_node.text)
                if verbose:
                    print(" - Added entity '{text}' of type {type}".format(text=entity_node.text, type=entity_node.type))
    

    # Then search for Named Entities (ner.skip keeps a text from being added as both)
    for entity_type, text in ner.process(node.text):
        kg.add_entity(node=node, text=text, type=entity_type)
        ner.skip.append(text)
        if verbose:
            print(" - Added entity '{text}' of type {type}".format(text=text, type=entity_type))
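
The keyword pass in the loop above relies on a case-insensitive regular-expression search. The standalone sketch below shows that matching step on its own, assuming keywords are meant as literal text rather than patterns. The function `find_keywords` and the sample strings are hypothetical.

```python
import re


def find_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case."""
    found = []
    for keyword in keywords:
        # re.escape makes characters such as '+' or '.' match literally
        if re.search(re.escape(keyword), text, re.IGNORECASE):
            found.append(keyword)
    return found


sample = "The team ships C++ tooling and AI models."
matches = find_keywords(sample, ["c++", "ai", "rust"])

# Without escaping, '.' matches any character, so "3.5" would match "375"
literal_hits = find_keywords("Version 375 released", ["3.5"])
pattern_hit = re.search("3.5", "Version 375 released") is not None
```

This is plain substring matching, so a short keyword such as "ai" would also be found inside a longer word like "train". Wrapping the escaped keyword in `\b` word boundaries is one way to tighten that, at the cost of missing keywords that start or end with punctuation.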

## Close the KG

In [None]:
kg.close()