<a href="https://colab.research.google.com/github/CherpanovNazim/learn-llm/blob/main/notebooks/02_Information_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


> [GitHub Repo](https://github.com/CherpanovNazim/learn-llm)

In [None]:
# clone the git repository
!git clone -q https://github.com/CherpanovNazim/learn-llm.git

In [None]:
%%time
# wait ~3 min for installations (a cell magic such as %%time must be the first line of the cell)

!pip install -qU thefuzz==0.22.1 openai==1.40.3 vllm==0.5.4 transformers==4.44.0 langchain==0.2.13

In [None]:
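%%time
# wait ~3 min for the vLLM server to start (a cell magic such as %%time must be the first line of the cell)

import json
import sys

# Load the default model config
with open('learn-llm/configs/llama_3_8B_instruct_awq.json', 'r') as f:
    DEFAULT_MODEL = json.load(f)

# run vLLM in the background and wait until the server is ready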
!nohup vllm serve {DEFAULT_MODEL['model']} --quantization awq --max-model-len=4096 > vllm.log &
!tail -f vllm.log | grep -q "Uvicorn running" && echo "Now you can start using the model"

In [None]:
!python3 learn-llm/notebooks/utils/explainer.py
sys.path.append('learn-llm/notebooks/utils')

from explainer import Explainer

explain = Explainer(DEFAULT_MODEL)
# use this class if you want to get some explanations
explain("Explain what is NER in NLP in short")

In [None]:
import json
import openai
import pandas as pd
from pprint import pprint
from tqdm import tqdm

# Set the base URL and API key
# For production apps, prefer a secret management system and never store the key in the git repo :)
client = openai.OpenAI(
    base_url = DEFAULT_MODEL['api_base'],
    api_key = DEFAULT_MODEL['api_key']
    )

# Named Entity Recognition (NER)
* We use a small sample from [Few-NERD](https://github.com/thunlp/Few-NERD) dataset with 4 entity types

In [None]:
dataset_df = pd.read_json('learn-llm/data/fewnerd_NER.json', lines=True)

unique_entities = ['product_food', 'building_restaurant', 'organization_company', 'location_gpe']

dataset_df.head()

In [None]:
print(dataset_df.iloc[0]['text'])
print('\nLabels:')
pprint(dataset_df.iloc[0]['labels'])

# note that there is no "product_food" entity in this example :)

# Evaluation
* First, let's define the evaluation metric
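
The scoring helper used in this notebook, `calculate_mean_score`, lives in the repository and its implementation is not shown here. As a rough illustration of the idea behind a fuzzy NER metric, the sketch below uses the standard library's `difflib` in place of `thefuzz`. The function names `fuzzy_match` and `fuzzy_f1` and the 0.8 threshold are assumptions made for this example, not the repository's code.

```python
from difflib import SequenceMatcher


def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True when two entity strings are similar enough (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def fuzzy_f1(true_entities: list, pred_entities: list, threshold: float = 0.8) -> float:
    """F1 for one entity type; a prediction counts if it fuzzily matches an unused gold entity."""
    if not true_entities and not pred_entities:
        return 1.0  # nothing to find and nothing predicted
    unmatched = list(true_entities)
    true_positives = 0
    for pred in pred_entities:
        for gold in unmatched:
            if fuzzy_match(pred, gold, threshold):
                unmatched.remove(gold)  # each gold entity can be matched only once
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_entities)
    recall = true_positives / len(true_entities)
    return 2 * precision * recall / (precision + recall)


# Only "rice" matches: precision 1/3, recall 1/2
score = fuzzy_f1(["chicken nuggets", "rice"], ["rice", "chicken", "FALSE"])
print(round(score, 2))
```

A threshold below 1.0 tolerates small spelling or casing differences, while partial spans such as "chicken" for "chicken nuggets" still count as misses at this setting.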

In [None]:
!python3 learn-llm\notebooks\utils\ner_fuzzy_score.py

from sklearn.model_selection import train_test_split
from ner_fuzzy_score import calculate_mean_score

train_df, test_df = train_test_split(dataset_df, test_size=0.5, random_state=42)

# example of metric calculation
calculate_mean_score(y_true = [{'product_food': ['chicken nuggets', 'rice'], 'building_restaurant': ['McDonalds']}],
                     y_pred = [{'product_food': ['rice', 'chicken', 'FALSE']}])

In [None]:
def text_completion(prompt, temperature=0, max_tokens=2, return_completion_only=True, **kwargs):
    completion_response = client.completions.create(
                            model=DEFAULT_MODEL["model"],
                            temperature=temperature,
                            max_tokens=max_tokens,
                            prompt=prompt,
                            **kwargs)
    if return_completion_only:
        return completion_response.choices[0].text.strip()
    else:
        return completion_response

# JSON Parser

LLMs do not always return pure JSON, so the response has to be extracted and parsed correctly. There are a few options:
* [JSON mode](https://platform.openai.com/docs/guides/json-mode) provided by OpenAI forces the response to be valid JSON.
* In this notebook we use vLLM's guided decoding to receive output in JSON format. This requires passing a JSON schema that describes the expected output.
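
When neither option is available, a response may wrap the JSON in extra text. Here is a minimal sketch of one way to recover it using only the standard library: scan for the first `{` that starts a valid JSON object and decode from there. The helper name `extract_first_json` is hypothetical and not part of this notebook's utilities.

```python
import json
from typing import Any, Optional


def extract_first_json(response: str) -> Optional[Any]:
    """Return the first JSON object found in an LLM response, or None if there is none."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value at the start of the string and ignores what follows it
            obj, _end = decoder.raw_decode(response[start:])
            return obj
        except json.JSONDecodeError:
            # this "{" did not start valid JSON, so try the next one
            start = response.find("{", start + 1)
    return None


raw = 'Sure! Here is the result: {"location_gpe": ["Paris"], "product_food": []} Hope this helps.'
parsed = extract_first_json(raw)
print(parsed)
```

Returning `None` instead of raising lets the caller decide whether to retry the request or count the sample as a failed prediction.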

The JSON schema shapes the LLM's output, so it is crucial to define it correctly.

In some use cases, prior information about the presence or exact number of entities is available, and it can be encoded in the schema:
- **min_items = 1**: the list must contain at least one entry
- **Optional**: the entry may be *None*

In our case these constraints helped the LLM shape its final output, which resulted in better overall performance. <br />

Depending on the use case, additional constraints (e.g. **unique_items**, **max_digits**) can be added for a more specific structure. <br />
All supported parameters are listed in the [Pydantic documentation](https://docs.pydantic.dev/1.10/usage/schema/#field-customization).
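
To make the effect of these two constraints concrete, here is a hand-written sketch of the kind of JSON Schema fragment they correspond to, plus a small standard-library check of a parsed prediction. The exact schema that Pydantic emits may differ in detail; `ENTITY_FIELD_SCHEMA`, `SKETCH_SCHEMA` and `is_valid_prediction` are illustrative names, not part of Pydantic or vLLM.

```python
from typing import Any

ENTITY_KEYS = ["product_food", "building_restaurant", "organization_company", "location_gpe"]

# Hand-written approximation of one field: either null, or a list of strings with at least one item
ENTITY_FIELD_SCHEMA = {
    "anyOf": [
        {"type": "array", "items": {"type": "string"}, "minItems": 1},
        {"type": "null"},
    ]
}

SKETCH_SCHEMA = {
    "type": "object",
    "properties": {key: ENTITY_FIELD_SCHEMA for key in ENTITY_KEYS},
    "required": ENTITY_KEYS,
}


def is_valid_prediction(prediction: Any) -> bool:
    """Check a parsed prediction against the constraints described above."""
    if not isinstance(prediction, dict):
        return False
    for key in ENTITY_KEYS:
        if key not in prediction:
            return False  # every key is required, even when its value is None
        value = prediction[key]
        if value is None:
            continue  # Optional: the entry may be None
        if not isinstance(value, list) or len(value) < 1:
            return False  # min_items = 1: a list must not be empty
        if not all(isinstance(item, str) for item in value):
            return False
    return True


good = {"product_food": None, "building_restaurant": ["Noma"],
        "organization_company": None, "location_gpe": ["Copenhagen"]}
bad = dict(good, product_food=[])  # an empty list violates the minimum-length constraint
print(is_valid_prediction(good), is_valid_prediction(bad))
```

Under these constraints "no entities of this type" is expressed as `None` rather than an empty list, which is worth keeping in mind when comparing predictions with the gold labels.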

In [None]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, conlist
from typing import List, Optional
import json
from functools import partial
from tqdm import tqdm

class Entities(BaseModel):
    product_food: Optional[List[str]] = Field(..., min_items = 1)
    building_restaurant: Optional[List[str]] = Field(..., min_items = 1)
    organization_company: Optional[List[str]] = Field(..., min_items = 1)
    location_gpe: Optional[List[str]] = Field(..., min_items = 1)

sample_json_schema = Entities.model_json_schema()

In [None]:
def chat_completion(prompts, temperature=0, max_tokens=2, system_prompt: str = None, json_schema = sample_json_schema, **kwargs):
    if system_prompt is None:
        system_prompt = "Just follow user instructions and don't communicate like \"Sure!\" or \"I hope this helps\""

    completion = client.chat.completions.create(
        model=DEFAULT_MODEL["model"],
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": prompts}],
        extra_body=dict(guided_json = json_schema),
        **kwargs
        )
    return completion.choices[0].message.content.strip()

In [None]:
def llm_predict(text, template):
    prompt = template.format(text=text)
    prediction = chat_completion(prompt, max_tokens=200).lower().strip()
    out = json.loads(prediction)
    return out

# Zero-shot NER

In [None]:
zero_shot_template = """\
You will get a text and you should extract entities from it.
Format your answer in a JSON format with keys: product_food, building_restaurant, organization_company, location_gpe.
Each key should have a list of entities.

Text: "{text}"
"""

zero_shot_predictions = list(map(partial(llm_predict, template=zero_shot_template), tqdm(test_df.text)))

print('Score:')
pprint(calculate_mean_score(test_df.labels, zero_shot_predictions))

# Few-shot NER
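
The few-shot prompt in this section is assembled in two `str.format` passes: the first inserts the examples, the second inserts the text to analyse. Literal braces that must survive a pass have to be doubled. Below is a minimal standalone sketch of that escaping, with a made-up example sentence and labels that are not taken from the dataset.

```python
import json

template = 'Examples:\n{few_shot_examples}\n\nText: "{{text}}"\nJSON:'

# Made-up example; wrapping the JSON in an extra pair of braces escapes it for the second pass
labels = {"location_gpe": ["Kyiv"]}
example = 'Text: "{}"\nJSON: {}'.format("I moved to Kyiv.", "{" + json.dumps(labels) + "}")

# First pass: fills in the examples and turns {{text}} into {text}
filled = template.format(few_shot_examples=example)

# Second pass: fills in the text and turns the doubled braces around the JSON into single ones
prompt = filled.format(text="Lunch in Lviv was great.")
print(prompt)
```

This only works as long as the example texts themselves contain no stray `{` or `}` characters; those would need escaping before the second pass.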

In [None]:
def get_few_shot_example(df, n=1, seed=None):
    examples_str = ''
    for _, row in df.sample(n, random_state=seed).iterrows():
        json_str = ''
        if row.labels != {}:
            # keep the same order of entities (keys) as in the dataset
            # if there is no entity in the text - it will be empty list
            # the additional { and } are needed for escaping - only one pair remains after the second .format

            json_str = '{'+json.dumps({k:row.labels.get(k, []) for k in unique_entities})+'}'
            #json_str = '{'+json.dumps({k:row.labels[k] for k in unique_entities if k in row.labels})+'}'

        examples_str += 'Text: "{}"\nJSON: {}\n\n'.format(row.text, json_str)
    return examples_str.strip()

print(get_few_shot_example(train_df, n=1, seed=42))

In [None]:
few_shot_template = """\
You will get a text and you should extract entities from it.
Format your answer in a JSON format with keys: product_food, building_restaurant, organization_company, location_gpe.
Each key should have a list of entities.

{few_shot_examples}

Text: "{{text}}"
JSON:
"""

# we randomly select 5 few-shot examples from the train dataset
few_shot_examples = get_few_shot_example(train_df, n=5, seed=42)

# fill-in the template with few-shot examples
filled_few_shot_template = few_shot_template.format(few_shot_examples=few_shot_examples)

# make predictions
few_shot_predictions = list(map(partial(llm_predict, template=filled_few_shot_template), tqdm(test_df.text)))

print('Score:')
pprint(calculate_mean_score(test_df.labels, few_shot_predictions))

# Relation extraction

* Unfortunately, it doesn't work well with Llama 3
* _TODO_: test with other models like GPT or Mistral/Zephyr

In [None]:
explain('What is relation extraction in NLP and how is it connected to NER? Make a short summary')

# Summary
* Parsing the response correctly is very important
* Zero-shot learning is very powerful and can be used for many tasks
* There are special tools for NER and relation extraction tasks
* Few-shot learning might NOT boost your performance (for some reason, _TODO_: find out why)
* Relation extraction can help identify relations between entities in a text

# Extra
* There are pre-trained LLMs specifically for Information Extraction
  * [GoLLIE](https://github.com/hitz-zentroa/GoLLIE) - pre-trained LLM for zero-shot Information Extraction with annotation schemas defined on the fly (Apache 2)
  * [UniversalNER](https://universal-ner.github.io/) - instruction-tuned LLM for zero-shot NER (non-commercial?)


# Homework
* Experiment with the prompts to improve accuracy
* Try to extract PRICE from the reviews in the `amazon_food_reviews_sample.csv` dataset and identify the overall price for each of the 6 products

In [None]:
import pandas as pd

reviews_df = pd.read_csv('learn-llm/data/amazon_food_reviews_sample.csv')
reviews_df.ProductId.value_counts()