# Lesson 7 - Schema Proposal for Unstructured Data

In this lesson, you will design agents that propose how to extract information from unstructured data.

You'll learn:
- how "named entity recognition" can be used to identify the people, places and things
- how facts can be extracted as a "triple" of subject, predicate and object

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>To access the helper.py, neo4j_for_adk.py and tools.py files:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>.</p>

</div>

## 7.1. Agents

<img src="images/entire_solution.png" width="500">

Two agents propose how to extract data from the unstructured files: a "named entity recognition" agent and a "fact extraction" agent.

- Input: `approved_user_goal`, `approved_files`, `approved_construction_plan`
- Output: 
    - `approved_entity_types` describing the type of entities that could be extracted from the unstructured data
    - `approved_fact_types` describing how those entities can be related in a fact triple
- Tools: `get_approved_user_goal`, `get_approved_files`, `get_well_known_types`, `sample_file`, `set_proposed_entities`, `get_proposed_entities`, `approve_proposed_entities`, `add_proposed_fact`, `get_proposed_facts`, `approve_proposed_facts`
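To make the two outputs concrete, here is a minimal sketch of what they might look like once approved. The specific entity types, fact types, and the dictionary layout are illustrative assumptions, not values taken from the notebook or from `tools.py`.

```python
# Hypothetical example values; real proposals depend on the agent run.
approved_entity_types = ["Product", "Issue", "Feature", "Reviewer"]

# Each fact type relates two entity types through a predicate (assumed layout).
approved_fact_types = {
    "has_issue": {
        "subject_label": "Product",
        "predicate_label": "has_issue",
        "object_label": "Issue",
    },
    "mentions_feature": {
        "subject_label": "Reviewer",
        "predicate_label": "mentions_feature",
        "object_label": "Feature",
    },
}


def as_triple(fact_type):
    """Return a fact type as a (subject, predicate, object) tuple."""
    return (
        fact_type["subject_label"],
        fact_type["predicate_label"],
        fact_type["object_label"],
    )


triples = [as_triple(fact) for fact in approved_fact_types.values()]
for subject, predicate, obj in triples:
    print(f"({subject})-[{predicate}]->({obj})")
```

Note that every subject and object in a fact type is one of the approved entity types, which is why entity types are settled first.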

**Workflow**

<img src="images/workflow.png" width="500">

Named entity recognition:

1. The context is initialized with an `approved_user_goal`, `approved_files` and an `approved_construction_plan`
2. The agent analyzes unstructured data files, looking for relevant entity types 
3. The agent proposes a list of entity types, seeking user approval

Fact extraction:

1. The context now includes `approved_entity_types`
2. The agent analyzes unstructured data files, looking for how those entities can be saved as fact triples
3. The agent proposes fact types, seeking user approval
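The ordering of the two phases matters: fact types can only be proposed over entity types that have already been approved. The sketch below illustrates that dependency with a plain dictionary standing in for the agent's state. The function body, the `proposed_fact_types` key, and the return shape are assumptions for illustration, not the notebook's actual `add_proposed_fact` implementation.

```python
def add_proposed_fact(state, subject_label, predicate_label, object_label):
    """Record a proposed fact type if both ends are approved entity types."""
    approved = state.get("approved_entity_types", [])
    if not approved:
        return {
            "status": "error",
            "error_message": "Entity types must be approved before proposing facts.",
        }
    unknown = [label for label in (subject_label, object_label) if label not in approved]
    if unknown:
        return {
            "status": "error",
            "error_message": f"Not approved entity types: {unknown}",
        }
    # "proposed_fact_types" is an assumed state key.
    proposed = state.setdefault("proposed_fact_types", {})
    proposed[predicate_label] = {
        "subject_label": subject_label,
        "predicate_label": predicate_label,
        "object_label": object_label,
    }
    return {"status": "success", "proposed_fact_types": proposed}


state = {}
# Before NER approval, the proposal is rejected.
early = add_proposed_fact(state, "Product", "has_issue", "Issue")
# After approval (hypothetical entity types), the same proposal is accepted.
state["approved_entity_types"] = ["Product", "Issue"]
ok = add_proposed_fact(state, "Product", "has_issue", "Issue")
print(early["status"], ok["status"])
```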

## 7.2. Setup

The usual import of needed libraries, loading of environment variables, and connection to Neo4j.

## 7.3. Named Entity Recognition (NER) Sub-agent

The NER agent is responsible for proposing entities that could be extracted from the unstructured data files.

An entity is a person, place or thing that is relevant to the user's goal.

There are two general kinds of entities:

1. Well-known entities: these closely correlate with nodes in the existing structured data
   - in our example, this would be things like Products, Parts and Suppliers
2. Discovered entities: these are entities that are not pre-defined, but are mentioned in the markdown text
   - in the product reviews, this may be Reviewers, product complaints, or product features
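The distinction between the two kinds can be sketched as a simple partition of a proposed list against the known node labels. The well-known labels come from the example above; the proposed list and the function name `split_entity_types` are hypothetical.

```python
# Labels already present in the structured-data graph (from the example above).
well_known_types = {"Product", "Part", "Supplier"}

# A hypothetical proposal an agent might produce from the review text.
proposed_types = ["Product", "Reviewer", "Complaint", "Feature"]


def split_entity_types(proposed, well_known):
    """Partition proposed types into well-known and discovered, keeping order."""
    known = [t for t in proposed if t in well_known]
    discovered = [t for t in proposed if t not in well_known]
    return known, discovered


known, discovered = split_entity_types(proposed_types, well_known_types)
print("well-known:", known)
print("discovered:", discovered)
```

Well-known types give the extracted entities a place to connect to the existing graph, while discovered types add new kinds of nodes.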

The general goal of the NER agent is to analyze the markdown files and propose entities that are 
relevant to the user goal of root-cause analysis.


### 7.3.1. Agent Instructions (NER)


### 7.3.2. Tool Definitions (NER)

As in previous lessons, you'll define some tools that explicitly follow
a propose-then-approve pattern.
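As a rough illustration of the pattern, the sketch below separates a "proposed" slot from an "approved" slot in shared state. It uses a plain dictionary in place of the agent framework's tool context, and the `proposed_entity_types` key and return shapes are assumptions. The notebook's own tool definitions may differ.

```python
PROPOSED_ENTITIES = "proposed_entity_types"  # assumed state key
APPROVED_ENTITIES = "approved_entity_types"  # key named in this lesson


def set_proposed_entities(proposed_entity_types, state):
    """Store a proposal; this does not approve anything."""
    state[PROPOSED_ENTITIES] = list(proposed_entity_types)
    return {"status": "success", PROPOSED_ENTITIES: state[PROPOSED_ENTITIES]}


def get_proposed_entities(state):
    """Return the current proposal, or an empty list if none exists."""
    return {"status": "success", PROPOSED_ENTITIES: state.get(PROPOSED_ENTITIES, [])}


def approve_proposed_entities(state):
    """Copy the proposal into the approved slot; meant to run only after user sign-off."""
    if PROPOSED_ENTITIES not in state:
        return {"status": "error", "error_message": "No proposed entity types to approve."}
    state[APPROVED_ENTITIES] = list(state[PROPOSED_ENTITIES])
    return {"status": "success", APPROVED_ENTITIES: state[APPROVED_ENTITIES]}


demo_state = {}
set_proposed_entities(["Product", "Issue"], demo_state)
print(APPROVED_ENTITIES in demo_state)  # the proposal alone approves nothing
approve_proposed_entities(demo_state)
print(demo_state[APPROVED_ENTITIES])
```

Keeping the two slots separate means a downstream agent can rely on the approved key only ever containing something a user has accepted.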

The "well-known entities" are based on existing node labels used during graph construction.

This helper tool will get the existing node labels from the approved construction plan.
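A minimal sketch of such a helper follows. It assumes the construction plan is a dictionary of rules in which node rules carry `"construction_type": "node"` and a `"label"`; that layout, the example plan, and the file names are assumptions for illustration, and the real helper reads the plan from the agent's state rather than taking it as an argument.

```python
def get_well_known_types(construction_plan):
    """Collect the node labels defined by a construction plan (assumed layout)."""
    return {
        rule["label"]
        for rule in construction_plan.values()
        if rule.get("construction_type") == "node"
    }


# Hypothetical plan; file names and relationship details are placeholders.
example_plan = {
    "Product": {"construction_type": "node", "label": "Product", "source_file": "products.csv"},
    "Part": {"construction_type": "node", "label": "Part", "source_file": "parts.csv"},
    "Supplier": {"construction_type": "node", "label": "Supplier", "source_file": "suppliers.csv"},
    "SUPPLIED_BY": {
        "construction_type": "relationship",
        "relationship_type": "SUPPLIED_BY",
        "from_node_label": "Part",
        "to_node_label": "Supplier",
    },
}

print(sorted(get_well_known_types(example_plan)))
```

Relationship rules are skipped because only node labels are candidates for well-known entity types.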

The full toolset includes some existing tools that you'll import
plus the extra tools you just defined.

To see what the NER agent is working with, use the `sample_file` tool to look
at one of the markdown files.

The markdown has product reviews from multiple users that include a rating, the review and their username.

For root-cause analysis, you'll be interested in reviews that are negative and report product issues
like quality, challenges in assembly, or reliability.

We won't give the agent explicit instructions about product reviews. Instead, we rely
on the stated user goal combined with an instruction to find entity types
that would support that goal.

### 7.3.3. Construct the Sub-agent (NER)

The initial state is important in this phase, as the agent is designed to act
within a particular phase of an overall workflow.

The NER agent will need:

- the user goal, extended to mention product reviews and what to look for there
- a list of markdown files that have been pre-approved
- the approved construction plan from the structured data design phase
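The three items above might be assembled into an initial state like the following sketch. The three top-level keys come from this lesson; every value, including the shape of the user goal, the file path, and the plan contents, is a hypothetical placeholder. The `missing_keys` helper is also hypothetical, added only to show a quick sanity check before running the agent.

```python
REQUIRED_KEYS = ["approved_user_goal", "approved_files", "approved_construction_plan"]

# Placeholder values; substitute the outputs of the earlier lessons.
ner_initial_state = {
    "approved_user_goal": {
        "kind_of_graph": "supply chain analysis",
        "description": (
            "Root-cause analysis of product issues, extended with product "
            "reviews to find complaints about quality, assembly, or reliability."
        ),
    },
    "approved_files": ["product_reviews/example_reviews.md"],
    "approved_construction_plan": {
        "Product": {"construction_type": "node", "label": "Product"},
    },
}


def missing_keys(state):
    """List any required state keys that are absent."""
    return [key for key in REQUIRED_KEYS if key not in state]


print(missing_keys(ner_initial_state))
```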

OK, you're ready to run the agent. 

- use `make_agent_caller` to create an execution environment
- prompt the agent with a single message that should kick off the analysis
- expect the result to be a proposed list of entity types
- but *not* a list of approved entity types

**The entity types here may vary quite a bit. If you're not happy with the proposal,
you can run the cell again to get a new list.**

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 🚨
&nbsp; <b>Different Run Results:</b> The output generated by LLMs can vary with each execution due to their stochastic nature. Your results might differ from those shown in the video.</p>

**Note**

Often, the agent will confuse the process of "Assembly" with the resulting thing that is an "Assembly". Why? The agent sees the term "Assembly" in the list of well-known entity types, and it also notices complaints about assembling furniture, but it has no way to know that those are two different uses of the word. One way to fix this would be for the schema proposal from the previous lesson to save a description for each node label.

Once you're happy with the proposal, you can tell the agent that you approve.

## 7.4. Fact Type Extraction Sub-agent


### 7.4.1. Agent Instructions (fact type extraction)

### 7.4.2. Tool Definitions (fact type extraction)

### 7.4.3. Construct the Sub-agent (fact type extraction)

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 🚨
&nbsp; <b>Different Run Results:</b> The output generated by LLMs can vary with each execution due to their stochastic nature. Your results might differ from those shown in the video.</p>

If you're happy with the fact type proposal, approve it.