In [18]:
%load_ext autoreload
%autoreload 2


# LLaMaBot's PydanticBot in under 5 minutes

We want to be able to pull structured data out of unstructured text. Once the data is structured, we can then use it programmatically in later steps.

In this example, we'll look at a small dataset of SciPy videos uploaded to YouTube. Each video has a title and a description. We want to extract the name of the speaker giving the talk and the topics the talk is about.
We also want to validate that the extracted data not only matches the structured format we expect, but also meets some custom requirements.

In [19]:
# load in unstructured text data
import pandas as pd

df = pd.read_json("../scipy_videos.json", orient="index")
df

Unnamed: 0,name,description,view_count
0ALKGR0I5MA,Basic Sound Processing in Python | SciPy 2015 ...,,261832
ZB7BZMhfPgk,Introduction to Numerical Computing with NumPy...,NumPy provides Python with a powerful array pr...,208823
v5ijNXvlC5A,Modern Time Series Analysis | SciPy 2019 Tutor...,This tutorial will cover the newest and most s...,199372
tYYVSEHq-io,Getting Started with TensorFlow and Deep Learn...,"A friendly introduction to Deep Learning, taug...",161483
xAoljeRJ3lU,A Better Default Colormap for Matplotlib | Sci...,Complete SciPy 2015 Talk & Tutorial Playlist h...,160912
5rNu16O3YNE,Introduction to Data Processing in Python with...,This is a tutorial for beginners on using the ...,117260
JNfxr4BQrLk,Time Series Analysis with Python Intermediate ...,Tutorial materials for the Time Series Analysi...,113018
KhAUfqhLakw,Frequentism and Bayesianism: What's the Big De...,,106990
gtejJ3RCddE,NumPy Beginner | SciPy 2016 Tutorial | Alexand...,Materials for this tutorial may be found here:...,106043
nq6iPZVUxZU,UMAP Uniform Manifold Approximation and Projec...,This talk will present a new approach to dimen...,96781


Using Pydantic, we can define a class along with some validation rules.

In [20]:
from typing import List, Optional
from pydantic import BaseModel, Field, field_validator

class TopicExtract(BaseModel):
    """The name of the speaker presenting this conference talk, and a list of topics that best describe what this talk is about."""

    name: Optional[str] = Field(
        default=None,
        description="The name of the speaker giving this talk. If no speaker is named, leave empty.",
    )
    topics: List[str] = Field(
        description="A list of up to 5 topics that this text is about. Each topic should be at most 2 words."
    )

    @field_validator("name")
    @classmethod
    def validate_name(cls, name):
        # No custom rule for the speaker name yet; return it unchanged.
        return name

    @field_validator("topics")
    @classmethod
    def validate_topics(cls, topics):
        # Validate that the list contains at least 1 and no more than 5 topics.
        if len(topics) == 0 or len(topics) > 5:
            raise ValueError("The list of topics must contain at least 1 and no more than 5 items")

        # For each topic the model generated, ensure it contains no more than 2 words.
        for topic in topics:
            if len(topic.split()) > 2:
                # Make the validation message helpful to the LLM: repeat which topic
                # is failing validation, and remind it what it must do to pass.
                raise ValueError(f'The topic "{topic}" has too many words. A topic can contain AT MOST 2 words')
        return topics
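
To get a feel for the two topic rules without calling an LLM, here is a minimal sketch that restates them as a plain Python function. The name `check_topics` and the sample lists are made up for illustration and are not part of LlamaBot or Pydantic; the sketch only mirrors the checks described in the validator above.

```python
def check_topics(topics):
    """Hypothetical stand-alone restatement of the topic rules (illustration only)."""
    # Rule 1: the list must hold between 1 and 5 topics.
    if not 1 <= len(topics) <= 5:
        raise ValueError("The list of topics must contain at least 1 and no more than 5 items")
    # Rule 2: each topic may contain at most 2 whitespace-separated words.
    for topic in topics:
        if len(topic.split()) > 2:
            raise ValueError(f'The topic "{topic}" has too many words. A topic can contain AT MOST 2 words')
    return topics


candidates = [
    ["Time Series", "Python"],           # satisfies both rules
    ["Time Series Analysis", "Python"],  # first topic has 3 words
    [],                                  # no topics at all
]

results = []
for topics in candidates:
    try:
        check_topics(topics)
        results.append("ok")
    except ValueError as err:
        # Keep the message so we can see what feedback the rule produces.
        results.append(str(err))

for topics, result in zip(candidates, results):
    print(topics, "->", result)
```

The error messages name the offending topic and restate the rule, which is the kind of feedback the comments in the validator aim to give the model.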

Now we can initialize the PydanticBot and assign this model to it.

In [21]:
from llamabot.bot.pydanticbot import PydanticBot


system_prompt = "You are an expert topic labeller. You read text and extract the topics the text is about."

bot = PydanticBot(
    system_prompt=system_prompt,
    session_name="session_name",
    model_name="ollama/llama3:latest",
    temperature=0,
    stream_target="stdout",
    pydantic_model=TopicExtract,
)

Now we can pass in our text and extract the topics.

In [22]:
video_extracts = []
for index, video_row in df.iterrows():
    video_text = f"name: {video_row['name']}\ndescription: {video_row['description']}"

    extract = bot(video_text)
    
    video_extracts.append(extract)

{
"description": "Basic Sound Processing in Python | SciPy 2015 | Allen Downey",
"name": "Allen Downey",
"topics": [
"Sound Processing",
"Python",
"SciPy"
]
}{
"name": "Alex Chabot-Leclerc",
"topics": [
"Numerical Computing",
"NumPy",
"SciPy",
"Tutorial"
]
}{"name": "Aileen Nielsen", "topics": ["Time Series Analysis", "Machine Learning", "Deep Learning", "Bayesian Methods", "Python"]}
{"name": "Aileen Nielsen", "topics": ["Time Series", "Machine Learning", "Deep Learning", "Bayesian", "Python"]}{
"name": "Josh Gordon",
"topics": [
"TensorFlow",
"Deep Learning",
"Machine Learning",
"Computer Vision",
"Natural Language Processing"
]
}{"name": "Josh Gordon", "topics": ["TensorFlow", "Deep Learning", "Machine Learning", "Computer Vision", "Natural Lang"]}
{
"name": "Nathaniel Smith and Stéfan van der Walt",
"topics": [
    "Matpl

ValidationError: 1 validation error for TopicExtract
topics
  Field required [type=missing, input_value={'description': {'name': ...rning', 'Bayesianism']}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing
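
The run above stops partway through the dataset because the `ValidationError` propagates out of the loop. Judging by the `input_value` in the message, the model appears to have nested its answer under a `description` key, so `topics` was missing at the top level. One way to keep going is to catch the error per row and record it. The sketch below uses a hypothetical `fake_bot` stand-in that fails on one arbitrary input, so it runs without LlamaBot or a local model; in the notebook, the call would be `bot(video_text)` and the exception to catch would be `pydantic.ValidationError`.

```python
def fake_bot(text):
    """Hypothetical stand-in for the bot: fails on one input to imitate a validation error."""
    if "Colormap" in text:
        raise ValueError("Field required: topics")
    return {"name": None, "topics": ["Python"]}


# A few shortened titles keyed by the row labels from the table above.
videos = {
    "0ALKGR0I5MA": "Basic Sound Processing in Python",
    "xAoljeRJ3lU": "A Better Default Colormap for Matplotlib",
    "KhAUfqhLakw": "Frequentism and Bayesianism",
}

extracts = {}
failures = {}
for video_id, title in videos.items():
    try:
        extracts[video_id] = fake_bot(f"name: {title}")
    except ValueError as err:
        # Record the failure and move on instead of aborting the whole loop.
        failures[video_id] = str(err)

print(f"{len(extracts)} succeeded, {len(failures)} failed")
for video_id, message in failures.items():
    print(video_id, "->", message)
```

Keeping the failures keyed by row label makes it easy to retry only those rows later, or to inspect the text that caused trouble.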

In [None]:
for video in video_extracts:
    print(video)
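
Once the extracts are structured, they can be used programmatically, for example to count how often each topic appears. The sketch below assumes each extract has been converted to a plain dict (for a Pydantic v2 model, `model_dump()` does this) and copies the `name` and `topics` values of three successful extracts from the streamed output above; the variable names are illustrative.

```python
from collections import Counter

# Speaker names and topics copied from the streamed output, as plain dicts.
extracts = [
    {"name": "Allen Downey", "topics": ["Sound Processing", "Python", "SciPy"]},
    {"name": "Alex Chabot-Leclerc", "topics": ["Numerical Computing", "NumPy", "SciPy", "Tutorial"]},
    {"name": "Aileen Nielsen", "topics": ["Time Series", "Machine Learning", "Deep Learning", "Bayesian", "Python"]},
]

# Count how many talks mention each topic.
topic_counts = Counter(topic for extract in extracts for topic in extract["topics"])

# Build a reverse index from topic to the speakers who covered it.
speakers_by_topic = {}
for extract in extracts:
    for topic in extract["topics"]:
        speakers_by_topic.setdefault(topic, []).append(extract["name"])

print(topic_counts.most_common(3))
print(speakers_by_topic["SciPy"])
```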