In this example, you will learn how to define a `Chain` object and use it to extract information from text.

In [1]:
# import dependencies
import logging

from langchain.pydantic_v1 import BaseModel, Field

from sisyphus.index import create_vectordb_in_memory
from sisyphus.chain import Filter, Extractor, Validator, Writer
from sisyphus.utils.helper_functions import get_chat_model, get_create_resultdb, create_example_messages

In [2]:
# configure logging
logging.basicConfig(level=logging.INFO)  # INFO level (numeric value 20)

In [3]:
# In this example, we extract the SHG (second harmonic generation) property of NLO materials from text.
# For simplicity, we use an in-memory database to store the vectors.
# In normal use, I recommend storing the vectors with the method described in the documentation.

# The target context for this property looks like this:
target_sent = """Besides, their second-order NLO intensity versus different particle sizes under 2090 nm laser radiation was also investigated. Among them, 
requisite phase-matching behavior and considerable NLO responses (0.7 and 1.2 times that of benchmark AgGaS2 at 200-250 nm) were discovered in La3LiGeS7 and La3LiSnS7 (Figure 2e), respectively"""
# The expected results should include the name of the NLO material, the SHG value, and its unit, e.g., name: La3LiGeS7, shg: 0.7, shg unit: AgGaS2.

In [4]:
# Here we use a pydantic model to represent the extracted results
class ExtractSHG(BaseModel):
    """Extract the SHG (second harmonic generation) coefficients of NLO materials from text"""  # general description of the extraction task
    nlo_name: str = Field(description='the chemical name of the nlo material')  # define every expected field like this
    shg: float = Field(description='the value of the shg coefficient, usually represented as 0.4 pm/V or times of KDP/AgGaS2')
    shg_unit: str = Field(description='the unit of the shg, pm/V or standard material like KDP/AgGaS2')
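
In the target sentence above, "respectively" pairs the two values with the two materials in order, so a correct extraction should yield two records. A minimal sketch of that expected output follows, written as plain dictionaries keyed by the `ExtractSHG` field names so that it runs without any dependencies (the variable names are illustrative only):

```python
# Materials and values in the order they appear in the target sentence.
names = ["La3LiGeS7", "La3LiSnS7"]
values = [0.7, 1.2]  # "0.7 and 1.2 times that of benchmark AgGaS2 ... respectively"

# "respectively" means the i-th value belongs to the i-th material.
expected = [
    {"nlo_name": name, "shg": value, "shg_unit": "AgGaS2"}
    for name, value in zip(names, values)
]

for record in expected:
    print(record)
```

Each dictionary corresponds to one `ExtractSHG` instance that the extractor is expected to return for this paragraph.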

In [5]:
# index the document first
vector_db = create_vectordb_in_memory(target_file='test.html', collection_name='nlo')  # see test.html for the source document

INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


Define the `Chain` object. A chain has four basic elements: Filter, Extractor, Validator, and Writer. Here we omit the Writer so that the results can be shown directly.
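
To make the idea of joining chain elements with `+` concrete, here is a toy sketch of operator-based pipeline composition. It is not the sisyphus implementation: every class here (`Stage`, `Pipeline`, `KeywordFilter`, `RegexExtractor`, `PositiveValidator`) and the `run` method are hypothetical stand-ins, and a regular expression takes the place of the language model.

```python
import re


class Stage:
    """Hypothetical base class: one step of a pipeline."""

    def run(self, data):
        raise NotImplementedError

    def __add__(self, other):
        # stage + stage -> a pipeline holding both, in order
        return Pipeline([self, other])


class Pipeline(Stage):
    """Runs its stages in order, feeding each output to the next stage."""

    def __init__(self, stages):
        self.stages = list(stages)

    def __add__(self, other):
        # pipeline + stage -> a longer pipeline
        return Pipeline(self.stages + [other])

    def run(self, data):
        for stage in self.stages:
            data = stage.run(data)
        return data


class KeywordFilter(Stage):
    """Stand-in for a filter: keeps paragraphs that mention a keyword."""

    def __init__(self, keyword):
        self.keyword = keyword.lower()

    def run(self, paragraphs):
        return [p for p in paragraphs if self.keyword in p.lower()]


class RegexExtractor(Stage):
    """Stand-in for an extractor: finds '<number> times that of <standard>'."""

    pattern = re.compile(r"(\d+(?:\.\d+)?) times that of (\w+)")

    def run(self, paragraphs):
        return [
            {"shg": float(value), "shg_unit": unit}
            for p in paragraphs
            for value, unit in self.pattern.findall(p)
        ]


class PositiveValidator(Stage):
    """Stand-in for a validator: drops records with non-positive values."""

    def run(self, records):
        return [r for r in records if r["shg"] > 0]


pipeline = KeywordFilter("SHG") + RegexExtractor() + PositiveValidator()
paragraphs = [
    "The SHG response is 1.2 times that of AgGaS2.",
    "The band gap is 3.2 eV.",
]
records = pipeline.run(paragraphs)
print(records)
```

The point of the design is that each element only needs to accept the previous element's output, so elements can be swapped or omitted (as the Writer is here) without touching the others.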

In [6]:
# first define the chat model (GPT) and the result database used to save the results
chat_model = get_chat_model('gpt-4o')  # the default is gpt-3.5-turbo
# result_db = get_create_resultdb('example', ExtractSHG)  # the result database needs the pydantic model to create its schema; left commented out because the Writer is omitted here

In [7]:
chain = (Filter(vector_db, query='second harmonic generation pm/V') +
         Extractor(chat_model, ExtractSHG) +
         Validator()
)

In [8]:
# run this chain without examples
res = chain.compose('test.html')
# (when a Writer is included in the chain, check your results at db/example.db)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [9]:
res

[DocInfo(doc=Document(page_content='Besides, their second-order NLO intensity versus different particle sizes under 2090 nm laser radiation was also investigated. Among them, requisite phase-matching behavior and considerable NLO responses (0.7 and 1.2 times that of benchmark AgGaS2 at 200-250 nm) were discovered in La3LiGeS7 and La3LiSnS7 (Figure 2e), respectively, which are comparable to those of previously reported Ln3-M′-M-Q7 analogues, such as Y3Zn0.5SiS7 (2.0 × KTiOPO4@2.05 μm), (45) La3Sb0.33SiS7 (0.5 × AgGaS2@2.05 μm), (46) La3InGe0.5S7 (1.8 × AgGaS2@2.05 μm). (39) Except La3LiMS7, other title compounds exhibit the weak signals that may be affected by their absorption peaks, poor crystallinity, or/and dark crystal color. For example, seen from the absorption spectra of Pr3LiGeS7 and Pr3LiSnS7 (Figure S2), they exhibit the obvious optical absorption range from 1000 to 1100 nm spectral region, which seriously affected their second harmonic generation (SHG) responses. The similar 

In [10]:
# create some examples
# I did not spend much time on these examples, so they are not as good as they could be; you can always find better ones.
# Remember: examples can greatly improve your results if they are representative of your task, that is, general enough to cover it.
tool_examples = [
    (
        'Example: After much effort, we obtained a new Pb-containing fluorooxoborate PbB5O7F3 with a strong SHG response (approximately 6 × KDP), a large birefringence (cal. 0.12@1064 nm), and a short UV cutoff edge (∼225 nm).',
        [
            ExtractSHG(
                nlo_name='PbB5O7F3',
                shg=6,
                shg_unit='KDP'
            ),
        ],
    ),
    (
        """Example: BBF exhibits a theoretically larger Eg (cal. ~8.88 eV), SHG effect (cal. d12 ~1.6×KDP) and Δn (cal. ~0.09) than KBBF, so that its shortest PM SHG output λPM (cal.) can reach 149 nm. Its deff at 177.3 nm is greater than KBBF, which meets the theoretical standard of DUV NLO crystals. ABF has a superior SHG effect (~3×KDP), Δn (~0.1) and λPM (~158 nm) than KBBF. Its effective SHG coefficient deff is twice that of KBBF, despite its Eg (~8 eV) is smaller than KBBF.""",
        [
            ExtractSHG(
                nlo_name='BBF',
                shg=1.6,
                shg_unit='KDP'
            ),
            ExtractSHG(
                nlo_name='ABF',
                shg=3,
                shg_unit='KDP'
            )
        ]
    )
]
examples = create_example_messages(tool_examples)
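
Because an example only helps when its expected records are actually supported by its text, a quick sanity check can catch mistakes before the examples reach the model. The sketch below uses a hypothetical helper, `check_example`, that is not part of sisyphus; it works on plain dictionaries with the `ExtractSHG` field names, and its substring matching is deliberately crude.

```python
def check_example(text, records):
    """Return a list of problems: expected values that do not occur in the text."""
    problems = []
    for record in records:
        if record["nlo_name"] not in text:
            problems.append(f"name {record['nlo_name']!r} not found")
        # ':g' prints 6 as '6' and 1.6 as '1.6', matching how values appear in prose
        if f"{record['shg']:g}" not in text:
            problems.append(f"value {record['shg']!r} not found")
        if record["shg_unit"] not in text:
            problems.append(f"unit {record['shg_unit']!r} not found")
    return problems


good = (
    "ABF has a superior SHG effect (~3 x KDP) compared with KBBF.",
    [{"nlo_name": "ABF", "shg": 3, "shg_unit": "KDP"}],
)
bad = (
    "ABF has a superior SHG effect (~3 x KDP) compared with KBBF.",
    [{"nlo_name": "BBF", "shg": 1.6, "shg_unit": "KDP"}],
)

print(check_example(*good))
print(check_example(*bad))
```

A substring test can pass by accident (for instance, a digit that also occurs inside a formula), so this only flags obvious mismatches and does not replace reading the examples.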

In [11]:
# with examples
extractor_with_examples = Extractor(chat_model, ExtractSHG, examples=examples)
chain_with_examples = (Filter(vector_db, query='second harmonic generation pm/V') +
         extractor_with_examples +
         Validator()
)

In [12]:
# run the chain that uses the examples
res = chain_with_examples.compose('test.html')

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [13]:
res

[DocInfo(doc=Document(page_content='Besides, their second-order NLO intensity versus different particle sizes under 2090 nm laser radiation was also investigated. Among them, requisite phase-matching behavior and considerable NLO responses (0.7 and 1.2 times that of benchmark AgGaS2 at 200-250 nm) were discovered in La3LiGeS7 and La3LiSnS7 (Figure 2e), respectively, which are comparable to those of previously reported Ln3-M′-M-Q7 analogues, such as Y3Zn0.5SiS7 (2.0 × KTiOPO4@2.05 μm), (45) La3Sb0.33SiS7 (0.5 × AgGaS2@2.05 μm), (46) La3InGe0.5S7 (1.8 × AgGaS2@2.05 μm). (39) Except La3LiMS7, other title compounds exhibit the weak signals that may be affected by their absorption peaks, poor crystallinity, or/and dark crystal color. For example, seen from the absorption spectra of Pr3LiGeS7 and Pr3LiSnS7 (Figure S2), they exhibit the obvious optical absorption range from 1000 to 1100 nm spectral region, which seriously affected their second harmonic generation (SHG) responses. The similar 