# Step 1: Ingest Active Data
> **Active Data** represents the stream of information coming into the system, typically on a daily basis. For example, the news articles published each day, and the things mentioned in those articles, are all Active Data.
 
![step1](resources/Step1_Ingest.png)

This notebook reads news articles (Content) and ingests them as nodes into your Knowledge Graph.
* Import the OpenTLDR Knowledge Graph, which automatically uses your .env to connect to Neo4j.
* Iterate through the content text in the repository.
* For each article:
  - Create (or merge) a Source node that represents where the article was published, and a Content node for the article itself.
  - Run NLP for "Named Entity Recognition" to identify the entities mentioned.

The result of this step includes:
- Source nodes
- Content nodes, connected to Source nodes with an IS_FROM relationship
- Entity nodes, connected to Content nodes with a MENTIONED_IN relationship
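
The shape of the resulting graph can be pictured with a small in-memory sketch. Everything here is illustrative: the node names are made up, and the `follow` helper is hypothetical rather than part of OpenTLDR. Only the relationship names and their directions come from the list above.

```python
from typing import List

# Illustrative only: a tiny in-memory picture of the graph this step builds.
# Node names are invented; relationship names and directions follow the list above.
edges = [
    ("Entity:Example Corp", "MENTIONED_IN", "Content:article-1"),
    ("Entity:Springfield", "MENTIONED_IN", "Content:article-1"),
    ("Content:article-1", "IS_FROM", "Source:Example News"),
]

def follow(node: str, relationship: str) -> List[str]:
    """Return the nodes reached from `node` over outgoing `relationship` edges."""
    return [dst for src, rel, dst in edges if src == node and rel == relationship]

# Walk from an entity to the article that mentions it, then to that article's source.
articles = follow("Entity:Example Corp", "MENTIONED_IN")
sources = [source for article in articles for source in follow(article, "IS_FROM")]
print(articles, sources)
```

In the real Knowledge Graph the same walk would be a query against Neo4j; the sketch only shows which way the relationships point.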

## Setup

In [None]:
import logging

#logging.getLogger("OpenTLDR").setLevel(logging.ERROR)  # Less output
#logging.getLogger("OpenTLDR").setLevel(logging.WARNING) # Default
logging.getLogger("OpenTLDR").setLevel(logging.INFO)   # More output
#logging.getLogger("OpenTLDR").setLevel(logging.DEBUG)  # So much output
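
To see how the level affects what gets through, here is a minimal sketch using only the standard `logging` module. It uses a hypothetical logger name (`demo_parent`) so that running it does not change the OpenTLDR logger configured above. The assumption is that OpenTLDR's own loggers inherit their level from the `OpenTLDR` logger in the same way.

```python
import logging

def enabled_levels(parent_level: int) -> list:
    """Return the names of the standard levels a child logger emits for a given parent level."""
    parent = logging.getLogger("demo_parent")        # hypothetical name, not an OpenTLDR logger
    child = logging.getLogger("demo_parent.child")   # no level of its own, so it inherits the parent's
    parent.setLevel(parent_level)
    candidates = [logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR]
    return [logging.getLevelName(level) for level in candidates if child.isEnabledFor(level)]

print(enabled_levels(logging.ERROR))
print(enabled_levels(logging.INFO))
```

Each level lets through its own messages and everything more severe, which is why the comments above run from "Less output" to "So much output".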

In [None]:
import os
from datetime import datetime

from opentldr.Domain import Source, Content, Entity
from opentldr import DataRepo
from opentldr import KnowledgeGraph

kg=KnowledgeGraph()

## Parameters
OpenTLDR workflows use the notebook block tagged as "parameters" to inject variables (for example, to redirect the source of content).

> **Do Not Change Variable Names in the Parameters Block.** You are welcome to change the values of these parameter variables, but please do not change their names. They are used elsewhere in the notebook and in other workflow processes.

In [None]:
#Parameters

active_date_repo_config = {'repo_type': 'files','path': './sample_data/content'}

spacy_model="en_core_web_sm"
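
The reason the names matter can be shown with a small sketch of parameter injection. This is not OpenTLDR's mechanism, which is not described here; `apply_overrides` is a hypothetical helper, and the override value is only an example. The point is that an injected value can only replace a default if it uses exactly the same name.

```python
# Defaults, mirroring the parameters cell above.
defaults = {
    "active_date_repo_config": {"repo_type": "files", "path": "./sample_data/content"},
    "spacy_model": "en_core_web_sm",
}

def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults updated with overrides, rejecting unknown names."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # A renamed variable would end up here: the workflow would inject a value
        # under a name that the rest of the notebook never reads.
        raise KeyError("Unknown parameter name(s): {names}".format(names=sorted(unknown)))
    return {**defaults, **overrides}

params = apply_overrides(defaults, {"spacy_model": "en_core_web_md"})
print(params["spacy_model"])
```

A workflow that injects `spacy_model` keeps working as long as the notebook still reads a variable with that name.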


## Ingest text files from the sources directory into the KG

In [None]:
# If you plan to re-run only this notebook, you can clear out the Content nodes each time:
# kg.delete_all_content()

if active_date_repo_config is not None:
    repo = DataRepo(kg,active_date_repo_config)
    list_of_uids =  repo.importData()
    print("Loaded {count} articles from the repository.".format(count=len(list_of_uids)))

## Named Entity Recognition (NER)
This is the process of detecting the entities named in the text and assigning each one a type (for example PERSON, ORG, or GPE). The types that the model supports are listed below.

### Setup spaCy for NLP/NER

In [None]:
import spacy
import sys
import subprocess

# If you have a GPU and you installed the spacy[cuda] package, spaCy will use the GPU
spacy.prefer_gpu()

# spaCy uses a language model that needs to be downloaded. This checks whether that has been done
# and, if not, downloads the model (and some dependencies), which can take a while.
if not spacy.util.is_package(spacy_model):
    print("Downloading spaCy NLP Model...")
    # equivalent to running -> !{sys.executable} -m spacy download {spacy_model}
    subprocess.check_call([sys.executable, "-m", "spacy", "download", spacy_model])
else:
    print("spaCy model ({model}) is already downloaded.".format(model=spacy_model))

nlp = spacy.load(spacy_model)   
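
The check-then-download pattern above can be sketched without spaCy using the standard library. This is a rough analogue, not spaCy's implementation: the `is_installed` helper is hypothetical and tests whether a module can be found on the import path, relying on the fact that a downloaded spaCy model is installed as an ordinary Python package.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

# "json" ships with Python; the model name is the default from the parameters cell.
for name in ("json", "en_core_web_sm"):
    status = "available" if is_installed(name) else "missing (would need a download)"
    print("{name}: {status}".format(name=name, status=status))
```

Checking first keeps repeated runs of the notebook fast, since the download only happens when the model is actually absent.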



### List NER entity "types" that spaCy will look for

In [None]:
for label in nlp.get_pipe('ner').labels:
    print(label, '\t\t', spacy.explain(label))

### Function for running NER on a text string
The call to "spacy.displacy.render" prints out the text with the recognized entities annotated.

In [None]:
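def named_entity_recognition(text: str):
    """Run the spaCy pipeline on the text, display the annotated text, and return the entity spans."""
    doc = nlp(text)
    spacy.displacy.render(doc, style='ent')
    return doc.ents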

### Iterate over the content, recognize entities of interest, and add those nodes to the KG
If you only ever run this workflow directly, this loop may re-process existing nodes, which is inefficient.
However, content can be added without running this notebook, so those entries need to be processed here as well.


In [None]:
for content_node in kg.get_all_content():
    print("\nProcessing {title}:".format(title=content_node.title))
    
    # avoid adding duplicate entities for the same text value
    existing_entities=kg.get_entities_by_content(content_node)
    unique=[ e.text for e in existing_entities ]
    
    for entity in named_entity_recognition(content_node.text):
        if entity.label_ not in ['DATE','TIME','MONEY', 'CARDINAL', 'ORDINAL','PERCENT','QUANTITY','WORK_OF_ART']:
            if entity.text not in unique:
                entity_node=kg.add_entity(content_node,text=entity.text, type=entity.label_)
                print(" - Added entity '{text}' of type {type}".format(text=entity_node.text, type=entity_node.type))
                unique.append(entity_node.text)
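
The filtering and de-duplication rules in the loop above can be separated from the graph calls and checked on their own. The sketch below is one way to do that, not part of OpenTLDR: `Span` is a hypothetical stand-in for a spaCy entity span (only `text` and `label_` are used), and `select_new_entities` is a hypothetical helper. The excluded labels are the ones listed in the loop.

```python
from typing import Iterable, List, NamedTuple

class Span(NamedTuple):
    """Hypothetical stand-in for a spaCy entity span."""
    text: str
    label_: str

# Labels skipped by the loop above (numeric, temporal, and similar entities).
EXCLUDED_LABELS = frozenset(
    ["DATE", "TIME", "MONEY", "CARDINAL", "ORDINAL", "PERCENT", "QUANTITY", "WORK_OF_ART"]
)

def select_new_entities(spans: Iterable[Span], existing_texts: Iterable[str]) -> List[Span]:
    """Keep spans whose label is allowed and whose text has not been seen yet."""
    seen = set(existing_texts)   # set membership is constant time, unlike a list
    selected = []
    for span in spans:
        if span.label_ in EXCLUDED_LABELS or span.text in seen:
            continue
        seen.add(span.text)
        selected.append(span)
    return selected

spans = [Span("Paris", "GPE"), Span("Monday", "DATE"), Span("Paris", "GPE"), Span("NASA", "ORG")]
new_spans = select_new_entities(spans, existing_texts={"NASA"})
print(new_spans)
```

Only the first "Paris" survives: "Monday" is a DATE, the second "Paris" is a repeat, and "NASA" already exists for this content.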

## Verify which entities were discovered
At this point in the notebook, the KG should contain the set of Entities that have been discovered.
Each entity should be linked to the Content (i.e., the news article) that it was mentioned in.

In [None]:
# Makes a Cypher query to the KG
all_entities = kg.get_all_entities()

print("Found {count} entity nodes in the knowledge graph:".format(count=len(all_entities)))

# Iterate through the Entity nodes and print info for each
for entity in all_entities:
    citable_node=entity.get_mentioned_in()
    
    if hasattr(citable_node,"url"): # a request wouldn't have a url...
        print(" - {type}({uid}):\t'{text}' was mentioned in {url}".format(
            type=entity.type, uid=entity.uid, text=entity.text, url=citable_node.url))
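
For a larger graph, a per-type summary can be easier to read than the full listing. A minimal sketch, assuming only that each entity object exposes a `type` attribute as in the loop above. The `count_by_type` helper and the sample entities are hypothetical stand-ins, not OpenTLDR objects.

```python
from collections import Counter
from types import SimpleNamespace

def count_by_type(entities) -> Counter:
    """Count entities per type, reading only the `type` attribute of each one."""
    return Counter(entity.type for entity in entities)

# Stand-in data; in the notebook, `all_entities` would take the place of this list.
sample_entities = [
    SimpleNamespace(type="ORG", text="Example Corp"),
    SimpleNamespace(type="GPE", text="Springfield"),
    SimpleNamespace(type="ORG", text="Example News"),
]

for entity_type, count in count_by_type(sample_entities).most_common():
    print("{type}: {count}".format(type=entity_type, count=count))
```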


## Close down any remote connections

In [None]:
kg.close()