<a href="https://colab.research.google.com/github/CherpanovNazim/learn-llm/blob/main/notebooks/02_Information_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


> [GitHub Repo](https://github.com/CherpanovNazim/learn-llm)

In [10]:
# wait ~3 min for installations
%%time

!pip install -qU thefuzz==0.22.1 openai==1.40.3 vllm==0.5.4 transformers==4.44.0 langchain==0.2.13

CPU times: user 64 ms, sys: 5.97 ms, total: 70 ms
Wall time: 8.84 s


In [2]:
# wait ~2-3 min for the model to download and load
%%time

# Load the default model
DEFAULT_MODEL = {"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", "api_base": "http://localhost:8000/v1", "api_key": "EMPTY"}

# run the vLLM server in the background and wait until it is ready
!nohup vllm serve {DEFAULT_MODEL['model']} --quantization awq --max-model-len=4096 > vllm.log &
!tail -f vllm.log | grep -q "Uvicorn running" && echo "Now you can start using the model"

nohup: redirecting stderr to stdout
Now you can start using the model
CPU times: user 684 ms, sys: 127 ms, total: 810 ms
Wall time: 2min 7s


In [3]:
!wget -q 'https://raw.githubusercontent.com/CherpanovNazim/learn-llm/main/notebooks/utils/explainer.py' -O explainer.py
!python3 explainer.py

from explainer import Explainer

explain = Explainer()
# use this class if you want to get some explanations
# It uses LLAMA 3 to generate explanations :)
explain("Explain what is NER in NLP in short")

**Named Entity Recognition (NER)**

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that identifies and categorizes named entities in unstructured text into predefined categories.

**Categories:**

1. **Person**: Names of individuals (e.g., John Smith)
2. **Organization**: Names of companies, institutions, or organizations (e.g., Google)
3. **Location**: Names of cities, countries, or geographic locations (e.g., New York)
4. **Date**: Specific dates or time periods (e.g., 2022-07-26)
5. **Event**: Names of events, conferences, or festivals (e.g., Olympics)

**Goals:**

1. Identify and extract named entities from text
2. Categorize entities into predefined categories
3. Improve text analysis, information extraction, and understanding

NER is a crucial step in many NLP applications, including:

* Information retrieval
* Question answering
* Sentiment analysis
* Text classification
* Machine translation

By recognizing and categorizing named entities, NER enables computers to better understand the meaning and context of text, leading to more accurate and informative results.

In [4]:
import json
import openai
import pandas as pd
from pprint import pprint
from tqdm import tqdm

# Set the base URL and API key
# For production apps, use a secret management system and don't store the key in the git repo :)
client = openai.OpenAI(
    base_url = DEFAULT_MODEL['api_base'],
    api_key = DEFAULT_MODEL['api_key']
    )

# Named Entity Recognition (NER)
* We use a small sample from the [Few-NERD](https://github.com/thunlp/Few-NERD) dataset with 4 entity types

In [5]:
# Download dataset from GitHub
!wget -q 'https://github.com/CherpanovNazim/learn-llm/blob/main/data/fewnerd_NER.json?raw=true' -O fewnerd_NER.json

dataset_df = pd.read_json('fewnerd_NER.json', lines=True)

unique_entities = ['product_food', 'building_restaurant', 'organization_company', 'location_gpe']

dataset_df.head()

| | text | labels |
|---|---|---|
| 0 | Texas Roadhouse is an American chain restauran... | {'building_restaurant': ['Texas Roadhouse'], '... |
| 1 | The 2008 Chinese milk scandal affected Cadbury... | {'location_gpe': ['China', 'Chinese'], 'organi... |
| 2 | In early 2002 , Boston Market entered the Aust... | {'organization_company': ['Boston Market'], 'l... |
| 3 | Pizza Hut has offered a seafood pizza in Malay... | {'organization_company': ['Pizza Hut'], 'produ... |
| 4 | Stuart C. Davidson ( September 9 , 1922 – Augu... | {'location_gpe': ['Georgetown', 'Washington , ... |


In [6]:
print(dataset_df.iloc[0]['text'])
print('\nLabels:')
pprint(dataset_df.iloc[0]['labels'])

# note that this example has no "product_food" entity :)

Texas Roadhouse is an American chain restaurant that specializes in steaks around a Western theme and is a subsidiary of Texas Roadhouse Inc , which is headquartered in Louisville , Kentucky .

Labels:
{'building_restaurant': ['Texas Roadhouse'],
 'location_gpe': ['Kentucky', 'Louisville', 'American'],
 'organization_company': ['Texas Roadhouse Inc']}


# Evaluation
* First, let's define the evaluation metric

In [7]:
!wget -q 'https://raw.githubusercontent.com/CherpanovNazim/learn-llm/main/notebooks/utils/ner_fuzzy_score.py' -O ner_fuzzy_score.py
!python3 ner_fuzzy_score.py

from sklearn.model_selection import train_test_split
from ner_fuzzy_score import calculate_mean_score

train_df, test_df = train_test_split(dataset_df, test_size=0.5, random_state=42)

# example of metric calculation
calculate_mean_score(y_true = [{'product_food': ['chicken nuggets', 'rice'], 'place_restaurant': ['McDonalds']}],
                     y_pred = [{'product_food': ['rice', 'chicken', 'FALSE']}])

(0.3333333333333333,
 Counter({'product_food': 0.6666666666666666, 'place_restaurant': 0.0}))
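
The source of `ner_fuzzy_score.py` is not shown here, so the following is only a rough sketch of how a fuzzy NER metric could be built, not the notebook's implementation. It assumes a per-type F1 where a prediction counts as a hit when its string similarity to an unmatched gold entity reaches a threshold, and it uses the standard library's `difflib` in place of `thefuzz`. All function names and the `threshold` value are hypothetical, and the numbers it produces will differ from the `calculate_mean_score` output above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_f1(true_entities, pred_entities, threshold=0.8):
    """F1 for one entity type; a prediction is a hit if it is similar
    enough to a gold entity that has not been matched yet."""
    if not true_entities and not pred_entities:
        return 1.0
    remaining = list(true_entities)
    hits = 0
    for pred in pred_entities:
        best = max(remaining, key=lambda t: similarity(t, pred), default=None)
        if best is not None and similarity(best, pred) >= threshold:
            hits += 1
            remaining.remove(best)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_entities)
    recall = hits / len(true_entities)
    return 2 * precision * recall / (precision + recall)


def sketch_mean_score(y_true, y_pred, threshold=0.8):
    """Average the fuzzy F1 per entity type, then average over types."""
    per_type = {}
    for gold, pred in zip(y_true, y_pred):
        for key in gold:
            score = fuzzy_f1(gold[key], pred.get(key) or [], threshold)
            per_type.setdefault(key, []).append(score)
    type_means = {k: sum(v) / len(v) for k, v in per_type.items()}
    return sum(type_means.values()) / len(type_means), type_means


score, per_type = sketch_mean_score(
    y_true=[{'product_food': ['chicken nuggets', 'rice'], 'place_restaurant': ['McDonalds']}],
    y_pred=[{'product_food': ['rice', 'chicken', 'FALSE']}],
)
print(score, per_type)
```

With this stricter matching, "chicken" is not accepted as a match for "chicken nuggets", which is one reason the sketch scores lower than the notebook's metric on the same example.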

In [8]:
def text_completion(prompt, temperature=0, max_tokens=2, return_completion_only=True, **kwargs):
    completion_response = client.completions.create(
                            model=DEFAULT_MODEL["model"],
                            temperature=temperature,
                            max_tokens=max_tokens,
                            prompt=prompt,
                            **kwargs)
    if return_completion_only:
        return completion_response.choices[0].text.strip()
    else:
        return completion_response

# JSON Parser

LLMs do not always return pure JSON, so we need to extract and parse it correctly. A few options are available:
* [JSON mode](https://platform.openai.com/docs/guides/json-mode) provided by OpenAI forces the response to be valid JSON.
* In this notebook we use vLLM functionality to receive outputs in JSON format. This requires passing a sample JSON schema so that the output has the correct structure.
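
When neither option is available, a common fallback is to pull the first JSON object out of a free-form reply. The sketch below is one possible approach using only the standard library; `extract_first_json` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import json


def extract_first_json(text: str):
    """Return the first JSON object embedded in `text`, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None


reply = 'Sure! Here is the result:\n{"location_gpe": ["Kentucky"], "product_food": []}\nHope this helps.'
print(extract_first_json(reply))
```

Returning `None` instead of raising keeps a batch prediction loop running when a single reply is malformed.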

The JSON schema shapes the LLM's output, so it is crucial to define it correctly.

In some use cases, prior information about the presence or exact number of entities is available, so we should include it in our JSON schema:
- **min_items = 1** : the model must include at least one entry
- **Optional** : the field may be *None*

In our case these parameters help the LLM shape its final output correctly, which resulted in better overall performance. <br />

Depending on the use case, additional fields (e.g. **unique_items**, **max_digits**) can be added for a more specific structure. <br />
All supported parameters are listed in the [Pydantic documentation](https://docs.pydantic.dev/1.10/usage/schema/#field-customization).
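
To make the effect of **min_items** and **Optional** concrete, here is a hand-written approximation of the kind of JSON Schema such a field maps to, plus a tiny checker written only for this shape. The exact dictionary Pydantic generates depends on its version, so treat the schema below as an assumption; `entity_field_schema` and `conforms` are hypothetical helpers.

```python
ENTITY_KEYS = ["product_food", "building_restaurant", "organization_company", "location_gpe"]


def entity_field_schema(min_items: int = 1) -> dict:
    """Approximate schema for Optional[List[str]] with a minimum length."""
    return {
        "anyOf": [
            {"type": "array", "items": {"type": "string"}, "minItems": min_items},
            {"type": "null"},
        ]
    }


schema = {
    "type": "object",
    "properties": {key: entity_field_schema() for key in ENTITY_KEYS},
    "required": ENTITY_KEYS,
}


def conforms(output: dict, schema: dict) -> bool:
    """Minimal check for this schema only: every required key is present and
    each value is None or a list of strings with at least `minItems` entries."""
    for key in schema["required"]:
        if key not in output:
            return False
        value = output[key]
        if value is None:
            continue
        min_items = schema["properties"][key]["anyOf"][0]["minItems"]
        if not (isinstance(value, list) and len(value) >= min_items
                and all(isinstance(item, str) for item in value)):
            return False
    return True


good = {"product_food": None, "building_restaurant": ["Texas Roadhouse"],
        "organization_company": ["Texas Roadhouse Inc"], "location_gpe": ["Kentucky"]}
print(conforms(good, schema))
```

Under this schema an empty list does not satisfy `minItems: 1`, so a field with no entities has to be null rather than `[]`.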

In [11]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, conlist
from typing import List, Optional
import json
from functools import partial
from tqdm import tqdm

class Entities(BaseModel):
    product_food: Optional[List[str]] = Field(..., min_items = 1)
    building_restaurant: Optional[List[str]] = Field(..., min_items = 1)
    organization_company: Optional[List[str]] = Field(..., min_items = 1)
    location_gpe: Optional[List[str]] = Field(..., min_items = 1)

sample_json_schema = Entities.model_json_schema()

In [12]:
def chat_completion(prompts, temperature=0, max_tokens=2, system_prompt: str = None, json_schema = sample_json_schema, **kwargs):
    if system_prompt is None:
        system_prompt = "Just follow user instructions and don't communicate like \"Sure!\" or \"I hope this helps\""

    completion = client.chat.completions.create(
        model=DEFAULT_MODEL["model"],
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": prompts}],
        extra_body=dict(guided_json = json_schema),
        **kwargs
        )
    return completion.choices[0].message.content.strip()

In [13]:
def llm_predict(text, template):
    prompt = template.format(text=text)
    prediction = chat_completion(prompt, max_tokens=200).lower().strip()
    out = json.loads(prediction)
    return out
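
Because the schema marks every field as `Optional`, a parsed prediction may contain `None` values. Whether `calculate_mean_score` tolerates them is not shown here, so the following is only a defensive sketch; `normalize_prediction` is a hypothetical helper that maps missing or null fields to empty lists before scoring.

```python
def normalize_prediction(prediction: dict, keys) -> dict:
    """Map missing or null fields to empty lists and drop unexpected keys."""
    return {key: list(prediction.get(key) or []) for key in keys}


keys = ['product_food', 'building_restaurant', 'organization_company', 'location_gpe']
raw = {'product_food': None, 'location_gpe': ['kentucky'], 'extra': ['ignored']}
print(normalize_prediction(raw, keys))
```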

# Zero-shot NER

In [14]:
zero_shot_template = """\
You will get a text and you should extract entities from it.
Format your answer in a JSON format with keys: product_food, building_restaurant, organization_company, location_gpe.
Each key should have a list of entities.

Text: "{text}"
"""

zero_shot_predictions = list(map(partial(llm_predict, template=zero_shot_template), tqdm(test_df.text)))

print('Score:')
pprint(calculate_mean_score(test_df.labels, zero_shot_predictions))

100%|██████████| 48/48 [02:52<00:00,  3.59s/it]

Score:
(0.7582079102828368,
 Counter({'product_food': 0.8477564102564101,
          'organization_company': 0.7837301587301587,
          'location_gpe': 0.7404255319148936,
          'building_restaurant': 0.660919540229885}))





# Few-shot NER

In [15]:
def get_few_shot_example(df, n=1, seed=None):
    examples_str = ''
    for _,row in df.sample(n, random_state=seed, ).iterrows():
        json_str = ''
        if row.labels != {}:
            # keep the same order of entities (keys) as in the dataset
            # if there is no entity in the text - it will be empty list
            # the additional { and } are needed for escaping - only one pair remains after the second .format

            json_str = '{'+json.dumps({k:row.labels.get(k, []) for k in unique_entities})+'}'
            #json_str = '{'+json.dumps({k:row.labels[k] for k in unique_entities if k in row.labels})+'}'

        examples_str += 'Text: "{}"\nJSON: {}\n\n'.format(row.text, json_str)
    return examples_str.strip()

print(get_few_shot_example(train_df, n=1, seed=42))

Text: "It is well known for its Chinese restaurants , such as the China Court Restaurant , for the parade which is held there each year to celebrate the Chinese New Year , for the Birmingham Hippodrome and for being the location of the headquarters of Wing Yip ."
JSON: {{"product_food": [], "building_restaurant": ["China Court Restaurant"], "organization_company": ["Wing Yip"], "location_gpe": ["Chinese"]}}
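
The doubled braces in the output above exist because the prompt goes through `str.format` twice. The following standalone sketch, with made-up strings, shows how each pass consumes one level of escaping.

```python
# Template with one real placeholder and one escaped placeholder.
template = 'Examples:\n{few_shot_examples}\n\nText: "{{text}}"\nJSON:'

# The example already carries doubled braces around its JSON.
example = 'Text: "I like pizza ."\nJSON: {{"product_food": ["pizza"]}}'

# First pass: fills {few_shot_examples}; {{text}} becomes {text}.
# The inserted example is not re-parsed, so its {{ }} survive.
stage_one = template.format(few_shot_examples=example)

# Second pass: fills {text}; the example's {{ }} collapse to { }.
stage_two = stage_one.format(text="Sushi is popular in Japan .")
print(stage_two)
```

One caveat of this approach: if an example text itself contains `{` or `}`, the second `.format` call will fail or misinterpret it, so such characters would need to be escaped as well.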


In [16]:
few_shot_template = """\
You will get a text and you should extract entities from it.
Format your answer in a JSON format with keys: product_food, building_restaurant, organization_company, location_gpe.
Each key should have a list of entities.

{few_shot_examples}

Text: "{{text}}"
JSON:
"""

# we randomly select 5 few-shot examples from the train dataset
few_shot_examples = get_few_shot_example(train_df, n=5, seed=42)

# fill-in the template with few-shot examples
filled_few_shot_template = few_shot_template.format(few_shot_examples=few_shot_examples)

# make predictions
few_shot_predictions = list(map(partial(llm_predict, template=filled_few_shot_template), tqdm(test_df.text)))

print('Score:')
pprint(calculate_mean_score(test_df.labels, few_shot_predictions))

100%|██████████| 48/48 [02:36<00:00,  3.26s/it]

Score:
(0.8222236367978579,
 Counter({'product_food': 0.9212278106508875,
          'building_restaurant': 0.8620689655172413,
          'organization_company': 0.7559523809523809,
          'location_gpe': 0.749645390070922}))
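
To see where few-shot prompting helped, the per-entity scores reported above can be compared directly. This snippet only copies the numbers from the two outputs and computes the differences.

```python
zero_shot = {
    'product_food': 0.8477564102564101,
    'organization_company': 0.7837301587301587,
    'location_gpe': 0.7404255319148936,
    'building_restaurant': 0.660919540229885,
}
few_shot = {
    'product_food': 0.9212278106508875,
    'building_restaurant': 0.8620689655172413,
    'organization_company': 0.7559523809523809,
    'location_gpe': 0.749645390070922,
}

# Positive delta means few-shot scored higher for that entity type.
deltas = {key: few_shot[key] - zero_shot[key] for key in zero_shot}
for key, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{key:22s} {delta:+.3f}")
```

The largest gain is for `building_restaurant`, while `organization_company` is the only type whose score went down.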





# Relation extraction

* Unfortunately, it doesn't work well with Llama 3
* _TODO_: test with other models like GPT or Mistral/Zephyr

In [17]:
explain('What is relation extraction in NLP and how it connected to NER ? make a short summary')

**Relation Extraction in NLP**
================================

Relation extraction is a subtask of Natural Language Processing (NLP) that involves identifying and extracting specific relationships between entities in a given text. It is a crucial step in various applications such as question answering, text summarization, and information retrieval.

**What is Relation Extraction?**
------------------------------

Relation extraction aims to identify the relationships between entities, such as:

*   Person A is married to Person B
*   Company X is headquartered in City Y
*   Product A is manufactured by Company B

**Connection to Named Entity Recognition (NER)**
---------------------------------------------

Named Entity Recognition (NER) is a related task in NLP that involves identifying and categorizing named entities in a text, such as:

*   Person (e.g., John Smith)
*   Organization (e.g., Google)
*   Location (e.g., New York)

Relation extraction builds upon the output of NER, as it requires the identification of entities to establish relationships between them. In other words, NER provides the foundation for relation extraction by identifying the entities, and relation extraction then determines the relationships between these entities.

**Example**
-----------

Suppose we have a sentence: "John Smith, the CEO of Google, was born in New York."

*   NER would identify the entities: John Smith (Person), Google (Organization), and New York (Location)
*   Relation extraction would then identify the relationships: John Smith is the CEO of Google, and John Smith was born in New York.

In summary, relation extraction is a crucial task in NLP that builds upon the output of NER to identify specific relationships between entities in a given text.
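
The example sentence above can be written down as data: NER yields typed entities, and relation extraction yields triples that refer to them. This is only an illustrative sketch; the class names and the relation labels `ceo_of` and `born_in` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str


@dataclass(frozen=True)
class Relation:
    subject: Entity
    predicate: str
    obj: Entity


# NER output for: "John Smith, the CEO of Google, was born in New York."
entities = {
    "John Smith": Entity("John Smith", "Person"),
    "Google": Entity("Google", "Organization"),
    "New York": Entity("New York", "Location"),
}

# Relation extraction output, expressed over the NER entities.
relations = [
    Relation(entities["John Smith"], "ceo_of", entities["Google"]),
    Relation(entities["John Smith"], "born_in", entities["New York"]),
]


def relations_are_grounded(relations, entities) -> bool:
    """Check that every relation only connects entities found by NER."""
    known = set(entities.values())
    return all(r.subject in known and r.obj in known for r in relations)


print(relations_are_grounded(relations, entities))
```

A check like this is a cheap way to catch relations that mention entities the NER step never produced.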

# Summary
* Correct parsing of the response is very important
* Zero-shot learning is very powerful and can be used for many tasks
* There are special tools for NER and REL tasks
* Few-shot learning might NOT boost performance for every entity type: here the overall score rose from 0.76 to 0.82, but `organization_company` dropped from 0.78 to 0.76 (_TODO_: find out why)
* Relation extraction can help to identify relations between entities in text

# Extra
* There are pre-trained LLMs specifically for Information Extraction
  * [GoLLIE](https://github.com/hitz-zentroa/GoLLIE) - pre-trained LLM for zero-shot Information Extraction with annotation schemas defined on the fly (Apache 2)
  * [UniversalNER](https://universal-ner.github.io/) - instruction-tuned LLM for NER for zero-shot IE (non-commercial?)


# Homework
* Try to play with prompts in order to improve accuracy
* Try to extract PRICE from reviews in the amazon_food_reviews_sample.csv dataset and identify the overall price for each of the 6 products

In [18]:
import pandas as pd

#Download the Amazon Food Reviews dataset from GitHub
!wget -q 'https://github.com/CherpanovNazim/learn-llm/blob/main/data/amazon_food_reviews_sample.csv?raw=true' -O amazon_food_reviews_sample.csv

reviews_df = pd.read_csv('amazon_food_reviews_sample.csv')
reviews_df.ProductId.value_counts()

| ProductId | count |
|---|---|
| B000KV61FC | 556 |
| B000NMJWZO | 542 |
| B006MONQMC | 491 |
| B0041NYV8E | 452 |
| B004EAGP74 | 196 |
| B000GAT6NG | 193 |
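
Before prompting an LLM for the homework, it can help to have a simple baseline to compare against. The sketch below is one hypothetical baseline, not a solution: it pulls dollar amounts out of review text with a regular expression and takes the median as the "overall" price. The sample reviews are invented for illustration and are not taken from the dataset.

```python
import re
from statistics import median

# Matches amounts such as "$12.99" or "$ 15"; other currencies, thousands
# separators, and prices written in words are not handled.
PRICE_PATTERN = re.compile(r"\$\s?(\d+(?:\.\d{1,2})?)")


def extract_prices(review: str):
    """Return all dollar amounts mentioned in one review as floats."""
    return [float(match) for match in PRICE_PATTERN.findall(review)]


def overall_price(reviews):
    """Median of all extracted prices, or None if no price was found."""
    prices = [price for review in reviews for price in extract_prices(review)]
    return median(prices) if prices else None


# Hypothetical reviews for a single product.
sample_reviews = [
    "Paid $12.99 for a box, worth it.",
    "Great taste but $ 15 is too much.",
    "No price mentioned here.",
]
print(overall_price(sample_reviews))
```

The median is used rather than the mean so that a single mis-extracted number, such as a shipping cost or a bulk price, has less influence on the result.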
