# Structured data extraction

LLMs can extract structured data from raw, unstructured text. As an example, consider this excerpt from the history of [Baker-Berry Library on Wikipedia](https://en.wikipedia.org/wiki/Baker-Berry_Library):

>The original, historic library building is the Fisher Ames Baker Memorial Library; it opened in 1928 with a collection of 240,000 volumes. The building was designed by Jens Fredrick Larson, modeled after Independence Hall in Philadelphia, and funded by a gift to Dartmouth College by George Fisher Baker in memory of his uncle, Fisher Ames Baker, Dartmouth class of 1859. The facility was expanded in 1941 and 1957–1958 and received its one millionth volume in 1970.
>
>In 1992, John Berry and the Baker family donated US $30 million for the construction of a new facility, the Berry Library designed by architect Robert Venturi, adjoining the Baker Library. The new complex, the Baker-Berry Library, opened in 2000 and was completed in 2002.[6] The Dartmouth College libraries presently hold over 2 million volumes in their collections.

The text describes a sequence of events, but its narrative style makes those events hard to extract programmatically. We can use an LLM to transform the text into a structured data format such as JSON.

In [1]:
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())

True

In [4]:
from langchain_dartmouth.llms import ChatDartmouth

llm = ChatDartmouth(model_name="llama-3-1-8b-instruct")


unstructured_text = """The original, historic library building is the Fisher Ames Baker Memorial Library; it opened in 1928 with a collection of 240,000 volumes. The building was designed by Jens Fredrick Larson, modeled after Independence Hall in Philadelphia, and funded by a gift to Dartmouth College by George Fisher Baker in memory of his uncle, Fisher Ames Baker, Dartmouth class of 1859. The facility was expanded in 1941 and 1957–1958 and received its one millionth volume in 1970.

In 1992, John Berry and the Baker family donated US $30 million for the construction of a new facility, the Berry Library designed by architect Robert Venturi, adjoining the Baker Library. The new complex, the Baker-Berry Library, opened in 2000 and was completed in 2002.[6] The Dartmouth College libraries presently hold over 2 million volumes in their collections."""

prompt = (
    "Extract a succinct timeline of events directly related to the Library from the following text. Return the timeline as a list of dictionaries, where each dictionary has two keys: 'year' and 'event'. Format your output as JSON. The text:\n\n"
    + unstructured_text
)

response = llm.invoke(prompt)

response.pretty_print()


[
  {"year": 1928, "event": "Fisher Ames Baker Memorial Library opened with a collection of 240,000 volumes"},
  {"year": 1941, "event": "The library building was expanded"},
  {"year": 1957, "event": "Library building expansion (1957-1958)"},
  {"year": 1970, "event": "Received its one millionth volume"},
  {"year": 1992, "event": "John Berry and the Baker family donated $30 million for a new library"},
  {"year": 2000, "event": "Baker-Berry Library opened"},
  {"year": 2002, "event": "Baker-Berry Library was completed"}
]


The content of the LLM's response is still just text. We can take this one step further and parse it into actual Python objects. The LangChain framework offers a useful output parser for this:

In [6]:
from langchain_core.output_parsers import JsonOutputParser

parser = JsonOutputParser()

timeline = parser.invoke(response)

for event in timeline:
    print(event)
    print("-" * 10)

{'year': 1928, 'event': 'Fisher Ames Baker Memorial Library opened with a collection of 240,000 volumes'}
----------
{'year': 1941, 'event': 'The library building was expanded'}
----------
{'year': 1957, 'event': 'Library building expansion (1957-1958)'}
----------
{'year': 1970, 'event': 'Received its one millionth volume'}
----------
{'year': 1992, 'event': 'John Berry and the Baker family donated $30 million for a new library'}
----------
{'year': 2000, 'event': 'Baker-Berry Library opened'}
----------
{'year': 2002, 'event': 'Baker-Berry Library was completed'}
----------
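
Parsing only guarantees that the output is valid JSON, not that every entry has the shape we asked for in the prompt. The following is a minimal sketch of a validation step, using only the standard library. The names `TimelineEvent` and `validate_timeline` are hypothetical and not part of LangChain or `langchain_dartmouth`. The sample string is a hand-written stand-in, abbreviated from the response above, with one year deliberately changed to a string to illustrate a deviation a model might produce.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEvent:
    """One validated timeline entry (hypothetical helper type)."""

    year: int
    event: str


def validate_timeline(items):
    """Convert parsed JSON items into TimelineEvent records sorted by year.

    Raises ValueError if an entry does not match the requested shape.
    """
    if not isinstance(items, list):
        raise ValueError("expected a list of timeline entries")

    events = []
    for position, item in enumerate(items):
        if not isinstance(item, dict) or set(item) != {"year", "event"}:
            raise ValueError(
                f"entry {position} must have exactly the keys 'year' and 'event'"
            )
        # A model may return the year as a string such as "1928";
        # int() accepts both integers and numeric strings.
        try:
            year = int(item["year"])
        except (TypeError, ValueError) as err:
            raise ValueError(
                f"entry {position} has a non-numeric year: {item['year']!r}"
            ) from err
        events.append(TimelineEvent(year=year, event=str(item["event"]).strip()))

    return sorted(events, key=lambda entry: entry.year)


# Hand-written stand-in for the model output (assumption: out of order,
# and one year given as a string, to exercise the validation).
raw = (
    '[{"year": "1970", "event": "Received its one millionth volume"},'
    ' {"year": 1928, "event": "Fisher Ames Baker Memorial Library opened'
    ' with a collection of 240,000 volumes"}]'
)

checked = validate_timeline(json.loads(raw))
for entry in checked:
    print(entry.year, entry.event)
```

In the notebook, the same function could be applied to the `timeline` list returned by the parser. Raising on malformed entries, rather than silently dropping them, makes it easier to notice when the model drifts from the requested format.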
