<a href="https://colab.research.google.com/github/LoniQin/lifelong-ml/blob/main/Entity_Validator_using_HayStack_and_DeepSeek_Chat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Entity Validator using Haystack and DeepSeek Chat

This notebook demonstrates how to build a self-reflecting entity extraction pipeline using Haystack and DeepSeek Chat.

## Steps:

1. **Environment Setup:** Install Haystack and configure the DeepSeek Chat API key.
2. **EntitiesValidator:** Create a custom component to validate extracted entities.
3. **Prompt Template:** Define a prompt template to guide the language model.
4. **Self-Reflecting Agent:** Build a Haystack pipeline that iteratively refines entity extraction using the language model and validator.
5. **Tests:** Evaluate the pipeline on sample texts.


## Step 1: Environment Setup
Install Haystack and configure the DeepSeek Chat API key.

In [9]:
!pip install -q haystack-ai colorama

In [47]:
import os
import warnings
from typing import List

from colorama import Fore
from google.colab import userdata
from haystack import Pipeline, component
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

# DeepSeek exposes an OpenAI-compatible API, so OpenAIGenerator can be reused;
# it reads its key from the OPENAI_API_KEY environment variable by default.
DEEPSEEK_API_KEY = userdata.get('DEEPSEEK_API_KEY')
os.environ["OPENAI_API_KEY"] = DEEPSEEK_API_KEY
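
`google.colab.userdata` is only importable inside Colab. A minimal sketch of a fallback for other environments, assuming the key is also exported as an environment variable named `DEEPSEEK_API_KEY`; the helper name `load_api_key` is hypothetical and not part of Haystack or Colab:

```python
import os


def load_api_key(name: str = "DEEPSEEK_API_KEY") -> str:
    """Return the key from Colab secrets when available, else from the environment."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        key = os.environ.get(name)
    else:
        key = userdata.get(name)
    if not key:
        # Fail early with a clear message instead of passing an empty key on.
        raise RuntimeError(f"{name} is not set")
    return key
```

With this helper, the last two lines of the setup cell could be written as `os.environ["OPENAI_API_KEY"] = load_api_key()`.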

## Simple usage of deepseek-chat

In [64]:
llm = OpenAIGenerator(model="deepseek-chat", api_base_url="https://api.deepseek.com")
llm.run(prompt="Tell me a joke")

{'replies': ["Sure, here's a classic one:\n\nWhy don't scientists trust atoms?\n\nBecause they make up everything!"],
 'meta': [{'model': 'deepseek-chat',
   'index': 0,
   'finish_reason': 'stop',
   'usage': {'completion_tokens': 27,
    'prompt_tokens': 7,
    'total_tokens': 34,
    'prompt_cache_hit_tokens': 0,
    'prompt_cache_miss_tokens': 7}}]}
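
The returned dictionary pairs each entry in `replies` with an entry in `meta`. As a small sketch of how to read it, the snippet below uses a hand-copied version of the output above (reply text shortened, cache fields omitted) rather than a live API call:

```python
# Result shape copied from the output above; not a live call.
result = {
    "replies": ["Why don't scientists trust atoms? Because they make up everything!"],
    "meta": [{
        "model": "deepseek-chat",
        "finish_reason": "stop",
        "usage": {"completion_tokens": 27, "prompt_tokens": 7, "total_tokens": 34},
    }],
}

reply = result["replies"][0]        # first (and here only) generated text
usage = result["meta"][0]["usage"]  # token accounting for that reply

print(reply)
print(f"{usage['prompt_tokens']} prompt + {usage['completion_tokens']} completion tokens")
```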

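## Step 2: EntitiesValidator

Create a custom Haystack component, `EntitiesValidator`, that decides whether the extraction loop should continue.
* The component does not inspect the entities itself; it only checks whether the model's reply contains the marker `DONE`.
* If the marker is present, it strips the marker and emits the reply on the `entities` output, which ends the loop.
* Otherwise, it emits the reply on `entities_to_validate`, which is routed back to the prompt builder so the model can check its answer against the criteria in the prompt (e.g., correct categories, no duplicates).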
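
The criteria named above (correct categories, no duplicates) could also be checked in code instead of being left to the model. The following is a sketch under assumptions, not what this notebook does: it presumes the reply is a bare JSON object, and the names `EXPECTED_CATEGORIES` and `find_problems` are hypothetical.

```python
import json
from typing import List

EXPECTED_CATEGORIES = {"Person", "Location", "Date"}


def find_problems(raw: str) -> List[str]:
    """Return a list of rule violations; an empty list means the entities pass."""
    try:
        entities = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(entities, dict):
        return ["top-level value is not an object"]

    problems = []
    keys = set(entities)
    if keys != EXPECTED_CATEGORIES:
        # Symmetric difference lists both missing and extra category names.
        problems.append(f"missing or extra categories: {sorted(keys ^ EXPECTED_CATEGORIES)}")
    for category, values in entities.items():
        if not isinstance(values, list):
            problems.append(f"{category} is not a list")
        elif not all(isinstance(v, str) for v in values):
            problems.append(f"{category} contains non-string values")
        elif len(values) != len(set(values)):
            problems.append(f"{category} contains duplicates")
    return problems
```

A component could call such a function inside `run` and send the list of problems back to the prompt, so that the model receives concrete feedback rather than a generic request to re-check.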

In [51]:
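@component
class EntitiesValidator:
    """Route the LLM reply: finish when it contains 'DONE', otherwise loop back."""

    @component.output_types(entities_to_validate=str, entities=str)
    def run(self, replies: List[str]):
        if 'DONE' in replies[0]:
            # The model confirmed its answer: strip the marker and emit the final entities.
            return {"entities": replies[0].replace('DONE', '')}
        else:
            # No confirmation yet: send the reply back for another reflection round.
            print(Fore.RED + "Reflecting on entities\n", replies[0])
            return {"entities_to_validate": replies[0]}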

In [52]:
entities_validator = EntitiesValidator()
entities_validator.run(replies=["{'name': 'Tuana'}"])

Reflecting on entities
 {'name': 'Tuana'}

{'entities_to_validate': "{'name': 'Tuana'}"}

In [53]:
entities_validator.run(replies=["DONE {'name': 'Tuana'}"])

{'entities': " {'name': 'Tuana'}"}

## Step 3: Prompt Template

Define a prompt template, written in Jinja2 syntax, to guide the language model.
* The template tells the language model how to extract entities and, on later passes, asks it to review its previous answer as returned by the `EntitiesValidator`.
* Its conditional logic handles both cases: if `entities_to_validate` is set, the model checks its earlier entities against the listed criteria; otherwise, it performs the initial extraction.

In [54]:
template = """
{% if entities_to_validate %}
    Here was the text you were provided:
    {{ text }}
    Here are the entities you previously extracted:
    {{ entities_to_validate }}
    Are these the correct entities?
    Things to check for:
    - Entity categories should exactly be "Person", "Location" and "Date"
    - There should be no extra categories
    - There should be no duplicate entities
    - If there are no appropriate entities for a category, the category should have an empty list
    If you are done say 'DONE' and return your new entities in the next line
    If not, simply return the best entities you can come up with.
    Entities:
{% else %}
    Extract entities from the following text
    Text: {{ text }}
    The entities should be presented as key-value pairs in a JSON object.
    Example:
    {
        "Person": ["value1", "value2"],
        "Location": ["value3", "value4"],
        "Date": ["value5", "value6"]
    }
    If there are no possibilities for a particular category, return an empty list for this
    category
    Entities:
{% endif %}
"""

## Step 4: Self-Reflecting Agent

* Build a Haystack pipeline that orchestrates the entity extraction process.
* The pipeline connects the prompt builder, entities validator, and language model (DeepSeek Chat).
* The connection from `entities_validator.entities_to_validate` back to `prompt_builder` forms a loop: until the model replies with `DONE`, its previous answer is fed back for another reflection pass, up to the `max_loops_allowed=10` limit.
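
The control flow of this loop can be sketched in plain Python, independent of Haystack. This is an illustration under assumptions, not the pipeline's actual implementation: `reflect_until_done` and `fake_llm` are hypothetical names, the model is replaced by a stub, and raising an error when the loop budget runs out is a choice made for this sketch.

```python
from typing import Callable, Optional


def reflect_until_done(
    generate: Callable[[str, Optional[str]], str],
    text: str,
    max_loops: int = 10,
) -> str:
    """Call `generate` until its reply contains 'DONE' or the loop budget runs out."""
    previous = None  # no earlier extraction on the first pass
    for _ in range(max_loops):
        reply = generate(text, previous)
        if "DONE" in reply:
            # Same rule as the validator: strip the marker and finish.
            return reply.replace("DONE", "")
        previous = reply  # feed the reply back for another reflection round
    raise RuntimeError(f"no 'DONE' reply after {max_loops} loops")


def fake_llm(text: str, previous: Optional[str]) -> str:
    """Stand-in for the model: extracts on the first call, confirms on the second."""
    if previous is None:
        return '{"Person": ["Tuana"], "Location": [], "Date": []}'
    return "DONE" + previous


print(reflect_until_done(fake_llm, "Tuana: Thanks all"))
```

The stub makes the two phases visible: the first call has no previous answer and produces an extraction, and the second call sees that answer and confirms it.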

In [59]:
prompt_template = PromptBuilder(template=template)
llm = OpenAIGenerator(model="deepseek-chat", api_base_url="https://api.deepseek.com")
entities_validator = EntitiesValidator()

# Cap the number of reflection loops so the pipeline cannot cycle forever.
agent = Pipeline(max_loops_allowed=10)

agent.add_component("prompt_builder", prompt_template)
agent.add_component("entities_validator", entities_validator)
agent.add_component("llm", llm)

# prompt -> LLM -> validator, with unconfirmed entities fed back into the prompt.
agent.connect("prompt_builder.prompt", "llm.prompt")
agent.connect("llm.replies", "entities_validator.replies")
agent.connect("entities_validator.entities_to_validate", "prompt_builder.entities_to_validate")


<haystack.core.pipeline.pipeline.Pipeline object at 0x7d437c8d99f0>
🚅 Components
  - prompt_builder: PromptBuilder
  - entities_validator: EntitiesValidator
  - llm: OpenAIGenerator
🛤️ Connections
  - prompt_builder.prompt -> llm.prompt (str)
  - entities_validator.entities_to_validate -> prompt_builder.entities_to_validate (str)
  - llm.replies -> entities_validator.replies (List[str])

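## Step 5: Tests

Evaluate the pipeline on two sample texts: an encyclopedia-style paragraph and a meeting transcript.

* Replies that still need reflection are printed in red by the validator, under the line "Reflecting on entities".
* The final extracted entities are printed in green to indicate successful validation.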

In [61]:
text = """
Istanbul is the largest city in Turkey, straddling the Bosporus Strait,
the boundary between Europe and Asia. It is considered the country's economic,
cultural and historic capital. The city has a population of over 15 million residents,
comprising 19% of the population of Turkey,[4] and is the most populous city in Europe
and the world's fifteenth-largest city."""

result = agent.run({"prompt_builder": {"text": text}})
print(Fore.GREEN + result['entities_validator']['entities'])

Reflecting on entities
 ```json
{
    "Person": [],
    "Location": ["Istanbul", "Turkey", "Bosporus Strait", "Europe", "Asia"],
    "Date": []
}
```
Based on the provided text, the correct entities categorized as "Person", "Location", and "Date" are as follows:

Entities:
- Person: []
- Location: ["Istanbul", "Turkey", "Europe", "Bosporus Strait"]
- Date: []

In [63]:
text = """
Stefano: Hey all, let's start the all hands for June 6th 2024
Geoff: Thanks, I'll kick it off with a request. Could we please add persistent memory to the Chroma document store.
Stefano: Easy enough, I can add that to the feature requests. What else?
Julain: There's a bug, some BM25 algorithms return negative scores and we filter them out from the results by default.
Instead, we should probably check which algorithm is being used and keep results with negative scores accordingly.
Esmail: Before we end this call, we should add a new Generator component for LlamaCpp in the next release.
Tuana: Thanks all, I think we're done here, we can create some issues in GitHub about these."""

result = agent.run({"prompt_builder": {"text": text}})
print(Fore.GREEN + result['entities_validator']['entities'])

Reflecting on entities
 ```json
{
    "Person": ["Stefano", "Geoff", "Julain", "Esmail", "Tuana"], 
    "Location": [], 
    "Date": ["June 6th 2024"]
}
```

Entities:
{
  "Person": ["Stefano", "Geoff", "Julain", "Esmail", "Tuana"],
  "Location": [],
  "Date": ["June 6th 2024"]
}
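
The final replies above are free text: the JSON object may be preceded by a label such as "Entities:" or wrapped in a fenced block. To use the result programmatically, the object has to be pulled out of the reply first. A minimal sketch, assuming the reply contains at most one JSON object; `extract_json_object` is a hypothetical helper, not a Haystack function:

```python
import json
import re
from typing import Optional


def extract_json_object(reply: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' in `reply`, or return None."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# The fence marker is built at runtime to avoid nesting a literal fence in this example.
fence = "`" * 3
reply = fence + 'json\n{"Person": [], "Location": [], "Date": ["June 6th 2024"]}\n' + fence
print(extract_json_object(reply))
```

For a reply that lists the entities as bullet points instead of JSON, as in the first test, this helper returns `None`, so the caller has to handle that case.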



## Conclusion

This notebook demonstrated how to build a self-reflecting entity extraction pipeline using Haystack and DeepSeek Chat.
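The pipeline iteratively refines entity extraction by feeding the model's own output back to it until it confirms the result with `DONE`.
Because the checks are carried out by the language model rather than enforced in code, the final content and format can still vary: in the first test, the final reply is a bulleted list rather than JSON and omits "Asia".
The same loop structure can be adapted to other entity extraction tasks by changing the categories and criteria in the prompt.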