# Entity Extraction from text using Granite

Large language models (LLMs) can extract entities from unstructured text with high accuracy. This cookbook shows how to use Granite to extract key entities, such as title, author, price, and rating, from descriptions of books.

## Install dependencies

In [1]:
! pip install "git+https://github.com/ibm-granite-community/utils" \
    transformers \
    langchain_community \
    langchain_ollama \
    replicate \
    pydantic


## Instantiate the Granite model

In [2]:
import json
import os
import requests
from langchain_ollama.llms import OllamaLLM
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.3-8b-instruct"
try: # Look for a locally accessible Ollama server for the model
    response = requests.get(os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434"))
    model = OllamaLLM(
        model="granite3.3:8b",
    )
    model = model.bind(raw=True) # Client side controls prompt
except Exception: # Use Replicate for the model
    model = Replicate(
        model=model_path,
        replicate_api_token=get_env_var('REPLICATE_API_TOKEN'),
        model_kwargs={
            "temperature": 0.0, #greedy
        },
    )

tokenizer = AutoTokenizer.from_pretrained(model_path)


## 1 - Entity Extraction by defining entities in the prompt

The first approach defines the entities to extract, along with their descriptions, directly in the prompt. This includes:  

<u>**Entity Definitions:**</u> Each entity, such as title, author, price, and rating, is clearly outlined with a concise description of what it represents.  

<u>**Prompt Structure:**</u> The prompt is structured to guide the LLM in understanding exactly what information is needed. By providing detailed instructions, we aim to ensure that the model focuses on extracting only the relevant data.  

<u>**Output Format:**</u> The output is required to be in JSON format, which enforces a consistent structure for the extracted data. If any entity is not found, the model is instructed to return "Data not available," preventing ambiguity.  
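One way to keep the entity definitions in a single place is to hold them in a dictionary and render the numbered list from it. The following is a minimal sketch, not part of the original notebook: `ENTITIES` and `render_entity_list` are hypothetical names introduced here, and the descriptions mirror the ones used in this cookbook's prompt.

```python
# Hypothetical helper: render entity definitions as the numbered list used in the prompt.
ENTITIES = {
    "title": "This is the title of the book.",
    "author": "This is the author of the book. The one who wrote this book.",
    "price": "This is the price of the book.",
    "rating": "This is the rating given for this particular book.",
}


def render_entity_list(entities: dict[str, str]) -> str:
    """Return one numbered line per entity, in insertion order."""
    return "\n".join(
        f"{i}) `{name}`: {description}"
        for i, (name, description) in enumerate(entities.items(), start=1)
    )


entity_list = render_entity_list(ENTITIES)
print(entity_list)
```

The rendered string could then be placed into the prompt text in place of a hand-written list.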

Provide some text with information about a book. In this case, we use generated commentary on 'The Hunger Games' by Suzanne Collins.

In [3]:
books = [
    """The name of our next book is The Hunger Games. Now, some of you might have read this book earlier, but in
my personal opinion, reading it again won't hurt, right? It gets even more interesting when we read it a second time.
So, the author of this book is Suzanne Collins, who wrote many books in the past, but this particular book for her
has got a rating of 4.33/5, which I think is a pretty good number considering what we just saw for The Book Thief.
Anyways, this book is a work of fiction, written in English and falls in the Post Apocalyptic genre. This book has 374 pages.
It was published on the 10th of October, 2005. This book is priced at 5 dollars and 9 cents. If anyone is interested in this book,
you can approach Mr Hofstadter after this workshop, he will be glad to sell you this book for only 3 dollars.
Dont miss out this chance to grab such a memorable book.
""",
]

Each entity to extract is defined in the prompt itself, along with its description.

In [4]:
entity_prompt = tokenizer.apply_chat_template(
    conversation=[
        {
            "role": "user",
            "content": """\
- You are an AI Entity Extractor. You help extract entities from the given documents about books.
- Extract the following entities:

1) `title` : This is the title of the book.
2) `author`: This is author of the book. The one who wrote this book.
3) `price`: This is the price of the book.
4) `rating`: This is the rating given for this particular book.

- Your output should strictly be in a json format, which only contains the key and value. The key here is the entity to be extracted and the value is the entity which you extracted.
- Do not generate random entities on your own. If it is not present or you are unable to find any specified entity, you strictly have to output it as `Data not available`.
- Only do what is asked to you. Do not give any explanations in your output and do not hallucinate.
""",
        },
    ],
    documents=[
        {
            "doc_id": i+1,
            "text": book,
        }
        for i, book in enumerate(books)
    ],
    add_generation_prompt=True,
    tokenize=False,
)

Invoke the model to get the results.

In [5]:
response = model.invoke(entity_prompt)
print(response)

{
  "title": "The Hunger Games",
  "author": "Suzanne Collins",
  "price": "5 dollars and 9 cents",
  "rating": "4.33/5"
}


In [6]:
book_info_json = json.loads(response)
book_info_json

{'title': 'The Hunger Games',
 'author': 'Suzanne Collins',
 'price': '5 dollars and 9 cents',
 'rating': '4.33/5'}
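Because the prompt asks the model to emit `Data not available` for missing entities, downstream code may want to turn that placeholder into a proper null value. A minimal sketch under that assumption: `normalize_entities` and `SENTINEL` are hypothetical names, and the sample string is illustrative rather than real model output.

```python
import json

SENTINEL = "Data not available"  # the placeholder the prompt asks the model to emit


def normalize_entities(raw_response: str) -> dict:
    """Parse the model's JSON output and replace the sentinel string with None."""
    parsed = json.loads(raw_response)
    return {key: (None if value == SENTINEL else value) for key, value in parsed.items()}


# Illustrative response, not real model output.
sample = (
    '{"title": "The Hunger Games", "author": "Suzanne Collins", '
    '"price": "Data not available", "rating": "4.33/5"}'
)
print(normalize_entities(sample))
```

Note that `json.loads` raises `json.JSONDecodeError` if the response is not valid JSON, so a caller would likely wrap this in a `try`/`except`.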

---

## 2 - Pydantic Class-Based Entity Definition

The second approach takes advantage of object-oriented programming principles by defining entities within a class structure. This method involves several key steps:  

<u>**Class Definition:**</u> We create a class that encapsulates all the relevant entities as members. Each member corresponds to an entity such as title, author, etc., and can include type annotations for better validation and clarity.  

<u>**Pydantic Integration:**</u> The class subclasses `BaseModel` from Pydantic, a data validation library. The resulting model defines the structure of our data and also provides built-in validation, ensuring that any extracted data adheres to the specified formats and types.  

<u>**Dynamic Prompting:**</u> A JSON description of the Pydantic model is generated from the class and embedded in the prompt sent to the LLM. If new entities are added or existing ones are modified, the change is made at the class level and the prompt text does not need to be rewritten.  

<u>**Enhanced Validation:**</u> By leveraging Pydantic's capabilities, we can ensure that any data extracted by the LLM meets our predefined criteria, enhancing data integrity and reliability.  

This class-based approach offers greater flexibility and scalability compared to the first method. It allows for easier modifications and expansions as new requirements arise, making it particularly suitable for larger projects or those requiring frequent updates.
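To make the validation step concrete, here is a small sketch using Pydantic v2. The `Review` model and the `check` helper are hypothetical and exist only for this illustration; they are not part of the cookbook's book schema.

```python
from pydantic import BaseModel, Field, ValidationError


class Review(BaseModel):  # hypothetical model, used only for this illustration
    """A minimal review with a typed, range-limited score."""
    reviewer: str = Field(description="Name of the reviewer")
    score: float = Field(description="Score between 0 and 5", ge=0, le=5)


def check(raw_json: str) -> str:
    """Return 'ok' if the JSON satisfies the model, else a short error summary."""
    try:
        Review.model_validate_json(raw_json)
        return "ok"
    except ValidationError as e:
        return f"{len(e.errors())} validation error(s)"


print(check('{"reviewer": "A. Reader", "score": 4.5}'))  # within range
print(check('{"reviewer": "A. Reader", "score": 9}'))    # violates le=5
```

The same mechanism applies to the models defined below: a response that is missing a field or has the wrong type raises `ValidationError` instead of silently producing bad data.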

In [7]:
from pydantic import BaseModel, Field, ValidationError
from langchain_core.utils.function_calling import convert_to_openai_function

Here we add a commentary for a second book.

In [8]:
books.append("""
Our next book is titled Magic of Lands. Even if some of you have read it before, I believe giving it another read would be worthwhile --
it actually gets more captivating the second time around. The author, John Williams, who has several other books to his name,
received a 3 out of 5 rating for this particular one. Considering the ratings we've seen for other books like Endurance, that's a fair score.
This French drama is 330 pages long and was published on September 11, 2010. It's currently priced at $3.22.
However, if you're interested, you can contact Mr. Shakespeare after the session -- he's offering it at a discounted price of $2.
Don't miss the opportunity to grab such an intriguing read!
""")

We define all of the entities in a Python class along with the description.

In [9]:
class Book(BaseModel):
    "This contains information about a book including its title, author, price, rating, and so on."
    title: str = Field(description="The title of the book")
    price: str = Field(description="Total cost of this book")
    author: str = Field(description="The person who wrote this book")
    rating: str = Field(description="Total rating for this book")

In [10]:
class BooksInformation(BaseModel):
    "This contains information about multiple books."
    books: list[Book] = Field(description = "Information on multiple books. ")
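Since the entities live in the class, adding one is a one-line change. The sketch below assumes a hypothetical `pages` entity and a hypothetical `BookWithPages` subclass (neither is used elsewhere in this cookbook), and uses Pydantic's own `model_json_schema()` to show that the new field appears in the generated schema.

```python
from pydantic import BaseModel, Field


class Book(BaseModel):
    "This contains information about a book including its title, author, price, rating, and so on."
    title: str = Field(description="The title of the book")
    price: str = Field(description="Total cost of this book")
    author: str = Field(description="The person who wrote this book")
    rating: str = Field(description="Total rating for this book")


class BookWithPages(Book):  # hypothetical extension, not part of the cookbook
    "A book that also records its page count."
    pages: str = Field(description="Number of pages in this book")


# The inherited fields and the new one all appear under "properties".
schema = BookWithPages.model_json_schema()
print(list(schema["properties"]))
```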

We then create a JSON object describing the classes, for use in the prompt.

In [11]:
book_function = convert_to_openai_function(BooksInformation)
print(json.dumps(book_function, indent=2))

{
  "name": "BooksInformation",
  "description": "This contains information about multiple books.",
  "parameters": {
    "properties": {
      "books": {
        "description": "Information on multiple books. ",
        "items": {
          "description": "This contains information about a book including its title, author, price, rating, and so on.",
          "properties": {
            "title": {
              "description": "The title of the book",
              "type": "string"
            },
            "price": {
              "description": "Total cost of this book",
              "type": "string"
            },
            "author": {
              "description": "The person who wrote this book",
              "type": "string"
            },
            "rating": {
              "description": "Total rating for this book",
              "type": "string"
            }
          },
          "required": [
            "title",
            "price",
            "author",
          

The prompt is similar to the previous prompt, but here, the JSON describing the Pydantic classes is used instead of defining each entity in the prompt text.

In [12]:
entity_prompt_with_pydantic = tokenizer.apply_chat_template(
    conversation=[
        {
            "role": "user",
            "content": f"""\
- You are an AI Entity Extractor. You help extract entities from the given documents about books.
- Analyze this information and extract the following entities as per this function definition:

{json.dumps(book_function)}

- Generate output as a json representation of a BooksInformation object. Include only the json.
- Your output should strictly be in a json format, which only contains the key and value. The key here is the entity to be extracted and the value is the entity which you extracted.
- Do not generate random entities on your own. If it is not present or you are unable to find any specified entity, you strictly have to output it as `Data not available`.
- Only do what is asked to you. Do not give any explanations in your output and do not hallucinate.
""",
        },
    ],
    documents=[
        {
            "doc_id": i+1,
            "text": book,
        }
        for i, book in enumerate(books)
    ],
    add_generation_prompt=True,
    tokenize=False,
)
entity_prompt_with_pydantic

'<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday\'s Date: June 19, 2025.\nYou are Granite, developed by IBM. Write the response to the user\'s input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>\n<|start_of_role|>document {"document_id": "1"}<|end_of_role|>\nThe name of our next book is The Hunger Games. Now, some of you might have read this book earlier, but in\nmy personal opinion, reading it again won\'t hurt, right? It gets even more interesting when we read it a second time.\nSo, the author of this book is Suzanne Collins, who wrote many books in the past, but this particular book for her\nhas got a rating of 4.33/5, which I think is a pretty good number considering what we just saw for The Book Thief.\nAnyways, this book is a work of fiction, written in 

Invoke the model to get the results as a JSON string.

In [13]:
response = model.invoke(entity_prompt_with_pydantic)
print(response)

{
  "books": [
    {
      "title": "The Hunger Games",
      "price": "5 dollars and 9 cents",
      "author": "Suzanne Collins",
      "rating": "4.33/5"
    },
    {
      "title": "Magic of Lands",
      "price": "$3.22",
      "author": "John Williams",
      "rating": "3/5"
    }
  ]
}


We can now instantiate the `Book` and `BooksInformation` classes with the extracted information. We'll need error handling in case we get an improperly formatted response.

In [14]:
# Parse the json response.
try:
    books_information = BooksInformation.model_validate_json(response)
    print(books_information.model_dump_json(indent=2))
except ValidationError as e:
    print(f"Error while parsing response: {e}")

{
  "books": [
    {
      "title": "The Hunger Games",
      "price": "5 dollars and 9 cents",
      "author": "Suzanne Collins",
      "rating": "4.33/5"
    },
    {
      "title": "Magic of Lands",
      "price": "$3.22",
      "author": "John Williams",
      "rating": "3/5"
    }
  ]
}
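One failure mode worth anticipating is a response that wraps the JSON in a Markdown code fence or adds surrounding text, which would make validation fail. This did not happen in the run above; the following is a minimal sketch under that assumption, and `extract_json_object` is a hypothetical helper, not part of any library.

```python
def extract_json_object(text: str) -> str:
    """Return the substring from the first '{' to the last '}' in text."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return text[start : end + 1]


# Build an illustrative fenced response (not real model output).
fence = "`" * 3
wrapped = f'{fence}json\n{{"books": []}}\n{fence}'
print(extract_json_object(wrapped))
```

The cleaned string could then be passed to `BooksInformation.model_validate_json` inside the same `try`/`except` block.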
