# Basic Tasks using spannerlib


In [None]:
#| default_exp tutorials.basic

In [None]:
#| hide
from nbdev.showdoc import show_doc
from IPython.display import display, HTML
%load_ext autoreload
%autoreload 2

In [None]:
#| export
# importing dependencies
import re
import pandas as pd
from pandas import DataFrame
from pathlib import Path
from spannerlib import get_magic_session,Session,Span

This tutorial aims to show how to use the spannerlib framework for simple use cases.
To illustrate how simple spannerlib is to use, all required IE functions will be implemented from scratch.

## Finding identical sentences in a corpus of documents

In this example, we are given a collection of documents and would like to find identical sentences among them.
For example, given the following documents:

In [None]:
input_documents = pd.DataFrame([
    ('doc1', 'The quick brown fox jumps over the lazy dog. Im walking on Sunshine.'),
    ('doc2', 'Im walking on Sunshine. Lorem ipsum. Im walking on Sunshine.'),
    ('doc3', 'All you need is love. The quick brown fox jumps over the lazy dog.'),
])
input_documents

Unnamed: 0,0,1
0,doc1,The quick brown fox jumps over the lazy dog. I...
1,doc2,Im walking on Sunshine. Lorem ipsum. Im walkin...
2,doc3,All you need is love. The quick brown fox jump...


We would like to compute, for example, that the first sentence of `doc1` is equal to the second sentence of `doc3`.

### Building IE functions

To do so, we need 2 IE functions:
* One that extracts the Span of each sentence from a document, called `split`
* Another that tells us when two sentences have identical content but are not actually the same sentence.
  * This requires them to have the same content when ignoring leading and trailing whitespace
  * but not be equal spans. Let's call this function `eq_content_spans`

Let's implement them and register them with our session object. We will use the `Span` class to store indexed substrings.



In [None]:
#| export
# this implementation is naive
# the standard library has a rgx_split ie function that does this in a more efficient way
def split(text):
    # yield one Span per sentence, ending just before each period
    start = 0
    for pos,char in enumerate(text):
        if char == '.':
            yield Span(text, start, pos)
            start = pos+1


def eq_content_spans(span1, span2):
    # notice that we are yielding a boolean value
    yield span1 != span2 and str(span1).strip() == str(span2).strip()

In [None]:
print(list(split('The quick brown fox jumps over the lazy dog. Im walking on Sunshine.')))


[[@3ba775,0,43) "The quick ...", [@3ba775,44,67) " Im walkin..."]
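
The comment in `split` mentions a regex-based alternative. As a rough sketch of that idea, using only Python's `re` module rather than spannerlib's `rgx_split` (whose exact interface is not shown here), the hypothetical helper `sentence_offsets` below computes the same `(start, end)` offsets that the spans above carry.

```python
import re

def sentence_offsets(text):
    """Yield (start, end) offsets of each period-terminated sentence.

    Hypothetical helper for illustration only; `end` excludes the period,
    mirroring the naive `split` above.
    """
    for match in re.finditer(r'[^.]*\.', text):
        yield match.start(), match.end() - 1

text = 'The quick brown fox jumps over the lazy dog. Im walking on Sunshine.'
offsets = list(sentence_offsets(text))
sentences = [text[start:end] for start, end in offsets]
print(offsets)
print(sentences)
```

Note that the second sentence keeps its leading space, which is why the content comparison in `eq_content_spans` strips whitespace before comparing.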


### Building spannerlog rules

In [None]:
# we register the functions and their input and output schema
sess = get_magic_session()
sess.register('split', split, [str],[Span])
sess.register('eq_content_spans', eq_content_spans,[Span,Span],[bool])

Now let us import our data

In [None]:
sess.import_rel('Docs', input_documents)

Let's make sure we can see our data using Spannerlog:

In [None]:
%%spannerlog
?Docs(doc_id,text)

'?Docs(doc_id,text)'

Unnamed: 0,doc_id,text
0,doc1,The quick brown fox jumps over the lazy dog. Im walking on Sunshine.
1,doc2,Im walking on Sunshine. Lorem ipsum. Im walking on Sunshine.
2,doc3,All you need is love. The quick brown fox jumps over the lazy dog.


Now let us build rules that allow us to find identical sentences:

In [None]:
%%spannerlog
# this rule gives us the sentences of each document
Sents(doc_id,sent)<-\
    Docs(doc_id,text),split(text)->(sent)
?Sents(doc_id,sent)

# this rule finds equal pairs of sentences
EqualSents(doc_id1,sent1,doc_id2,sent2)<-\
    Sents(doc_id1,sent1),\
    Sents(doc_id2,sent2),\
    eq_content_spans(sent1,sent2)->(True)
?EqualSents(doc_id1,sent1,doc_id2,sent2)

'?Sents(doc_id,sent)'

Unnamed: 0,doc_id,sent
0,doc1,"[@3ba775,0,43) ""The quick ..."""
1,doc1,"[@3ba775,44,67) "" Im walkin..."""
2,doc2,"[@06bc2d,0,22) ""Im walking..."""
3,doc2,"[@06bc2d,23,35) "" Lorem ips..."""
4,doc2,"[@06bc2d,36,59) "" Im walkin..."""
5,doc3,"[@9c32df,0,20) ""All you ne..."""
6,doc3,"[@9c32df,21,65) "" The quick..."""


'?EqualSents(doc_id1,sent1,doc_id2,sent2)'

Unnamed: 0,doc_id1,sent1,doc_id2,sent2
0,doc1,"[@3ba775,0,43) ""The quick ...""",doc3,"[@9c32df,21,65) "" The quick..."""
1,doc1,"[@3ba775,44,67) "" Im walkin...""",doc2,"[@06bc2d,0,22) ""Im walking..."""
2,doc1,"[@3ba775,44,67) "" Im walkin...""",doc2,"[@06bc2d,36,59) "" Im walkin..."""
3,doc2,"[@06bc2d,0,22) ""Im walking...""",doc1,"[@3ba775,44,67) "" Im walkin..."""
4,doc2,"[@06bc2d,0,22) ""Im walking...""",doc2,"[@06bc2d,36,59) "" Im walkin..."""
5,doc2,"[@06bc2d,36,59) "" Im walkin...""",doc1,"[@3ba775,44,67) "" Im walkin..."""
6,doc2,"[@06bc2d,36,59) "" Im walkin...""",doc2,"[@06bc2d,0,22) ""Im walking..."""
7,doc3,"[@9c32df,21,65) "" The quick...""",doc1,"[@3ba775,0,43) ""The quick ..."""


### Oops, handling symmetric queries

Notice that we got each pair twice. This is because, while we think of a sentence pair as a set of two sentences,
the tuples `(x,y)` and `(y,x)` are actually different. To remedy this, we can limit our pairs to those where the first sentence is in a smaller position (i.e. position in memory).

We can do that by introducing a `span_lt` IE function:

In [None]:
#| export
def span_lt(span1, span2):
    yield span1 < span2


In [None]:
sess.register('span_lt', span_lt, [Span,Span],[bool])

In [None]:
%%spannerlog
EqualSentsUniqe(doc_id1,sent1,doc_id2,sent2)<-\
    Sents(doc_id1,sent1),\
    Sents(doc_id2,sent2),\
    span_lt(sent1,sent2)->(True),\
    eq_content_spans(sent1,sent2)->(True)
    
?EqualSentsUniqe(doc_id1,sent1,doc_id2,sent2)

'?EqualSentsUniqe(doc_id1,sent1,doc_id2,sent2)'

Unnamed: 0,doc_id1,sent1,doc_id2,sent2
0,doc2,"[@06bc2d,0,22) ""Im walking...""",doc1,"[@3ba775,44,67) "" Im walkin..."""
1,doc2,"[@06bc2d,0,22) ""Im walking...""",doc2,"[@06bc2d,36,59) "" Im walkin..."""
2,doc2,"[@06bc2d,36,59) "" Im walkin...""",doc1,"[@3ba775,44,67) "" Im walkin..."""
3,doc3,"[@9c32df,21,65) "" The quick...""",doc1,"[@3ba775,0,43) ""The quick ..."""
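
The same symmetry-breaking idea can be sketched in plain Python, independent of spannerlib. The tuples below are stand-ins for `Span` objects, and ordering them by `(doc_id, start, end)` is an assumption of this sketch, not necessarily how `Span` comparison works (the output above suggests spannerlib orders documents differently).

```python
from itertools import product

# (doc_id, start, end, content) tuples standing in for Span objects
sents = [
    ('doc1', 0, 43, 'The quick brown fox jumps over the lazy dog'),
    ('doc1', 44, 67, ' Im walking on Sunshine'),
    ('doc2', 0, 22, 'Im walking on Sunshine'),
]

def same_content(a, b):
    # different positions, same text once surrounding whitespace is ignored
    return a[:3] != b[:3] and a[3].strip() == b[3].strip()

# every matching pair shows up twice: (x, y) and (y, x)
all_pairs = [(a, b) for a, b in product(sents, sents) if same_content(a, b)]

# keeping only pairs whose first element is "smaller" leaves one copy of each
unique_pairs = [(a, b) for a, b in all_pairs if a[:3] < b[:3]]

print(len(all_pairs), len(unique_pairs))
```

Any strict total order on the spans works here; the only requirement is that exactly one of `x < y` and `y < x` holds for distinct spans.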


## Building LLM agents using spannerlib

### Motivation

Recall that most LLM-based applications do not consist of a single call to an LLM, but are rather built using LLM agents.
LLM agents are programs that
* Wrap LLMs in control logic.
  * Either deterministic
  * Or decided by an LLM
* Offload structured reasoning to tools, whose output is fed into some of the prompts used by the agent's LLMs.

In this tutorial we will show how to build a simple agent that might serve as the backend of a more nuanced ChatGPT-like system.

Our agent takes:
* A question from a user

It also has prior knowledge encoding:
* How to best answer questions on different topics.
* How a user prefers their answers to be formatted.

For example, given the question `How do we measure the distance between stars` we might note that:
* In the case of scientific questions, it is better to answer using mathematical notation and formulas rather than plain English.

Moreover, some users would like the answer to be styled in an academic way, with dense prose ending with citations, like so:
<blockquote>
The distance between stars is measured in astronomical units (AU), parsecs (pc), or light-years (ly). 

1. **Astronomical Unit (AU)**: An AU is the average distance between the Earth and the Sun, which is approximately 93 million miles or 150 million kilometers. It is more commonly used to measure distances within our solar system.

2. **Parsec (pc)**: A parsec is a unit of distance used in astronomy, defined as the distance at which an object would have a parallax angle of one arcsecond. One parsec is equal to approximately 3.26 light-years or 3.09 x 10^13 kilometers.

3. **Light-year (ly)**: A light-year is the distance that light travels in one year, which is approximately 9.46 trillion kilometers or about 5.88 trillion miles.

To measure the distance to a star, astronomers use methods such as parallax and spectroscopic parallax. Parallax involves observing a star from two different points in Earth's orbit and measuring the apparent shift in position. Spectroscopic parallax uses the star's spectral type and luminosity to estimate its distance.

In summary, astronomers use a combination of trigonometric and spectroscopic methods to measure the distance between stars in order to better understand the vastness of our universe.

**References:**
1. Carroll, B. W., & Ostlie, D. A. (2007). *An introduction to modern astrophysics* (2nd ed.). Pearson Addison-Wesley.
2. Bennett, J., Donahue, M., Schneider, N., & Voit, M. (2014). *The cosmic perspective* (7th ed.). Pearson.
</blockquote>


Other users would prefer a shorter answer in simple language, like so:
<blockquote>
One common way to measure the distance between stars is through parallax. Parallax is the apparent shift in position of an object when viewed from two different points. 

The parallax angle, denoted by p, is the angle formed between a line from the observer to the nearer star and a line from the observer to the farther star. The parallax angle is related to the distance to the star (d) by the formula:

d = 1 / p

where d is the distance to the star in parsecs (pc) and p is the parallax angle in arcseconds ("). 

A parsec is a unit of distance often used in astronomy, equal to about 3.26 light-years. So when we measure the parallax angle of a star, we can use the formula above to calculate its distance from Earth.
</blockquote>


### Problem definition

So given:
* A relation of the form `(user,question)`
* A relation of the form `(topic,topicSpecificInstructions)`
* A relation of the form `(user,userSpecificInstructions)`

We would like to output for each user and question, an answer that fits both the user's specified preferences and the instructions fitting for that topic.

### Defining our IE functions

To build the agent we need 2 basic building blocks as IE functions:
* an `llm` function of the form `llm(model:str,prompt:str)->(answer:str)`
* a `printf`-like function `format` for building a prompt from a template and other strings,
which will be of the form `format(template:str,s_1:str,...s_n:str)->(prompt:str)`

The following section involves technical details such as data type conversions and interfacing with OpenAI's API.
If this is not of interest to the reader, please skip to the next section.

To implement the `llm` IE function, we need to wrap some LLM API as an IE function.
We want an IE function that takes a string and returns a string.
Since most LLM APIs expect a list of dicts of the form `{'role':role,'content':message}`,
we will write converters that parse them from prompt strings of the form
```txt
role: content
role: content
```
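
To make the target format concrete, here is a minimal sketch of such a converter using only Python's `re` module. The helper name `parse_messages` is hypothetical, and this is not the tutorial's own converter, which is built on `rgx_split`.

```python
import re

# a role marker is one of the three role names followed by a colon and whitespace
ROLE_PATTERN = re.compile(r'(system|assistant|user):\s')

def parse_messages(prompt):
    """Turn a 'role: content' prompt string into a list of message dicts."""
    # re.split keeps the captured role names between the content pieces:
    # ['', role_1, content_1, role_2, content_2, ...]
    parts = ROLE_PATTERN.split(prompt.strip())
    roles, contents = parts[1::2], parts[2::2]
    return [
        {'role': role, 'content': content.strip()}
        for role, content in zip(roles, contents)
    ]

messages = parse_messages("system: Be brief.\nuser: Hello, who are you?")
print(messages)
```

One limitation of this approach is that a message whose content itself contains a marker such as `user: ` would be split into two messages.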

In [None]:
#| export
# load the OpenAI API key from the env file
from dotenv import load_dotenv
load_dotenv('.env_dev')
import os
assert os.getenv('OPENAI_API_KEY') is not None

In [None]:
#| export
from spannerlib.ie_func.basic import rgx_split
import openai
from joblib import Memory
memory = Memory("cachedir", verbose=0)
client = openai.Client()

# we use the rgx_split function to split the string into messages
def str_to_messages (string_prompt):
    return [
        {
            'role': str(role).replace(': ',''),
            'content': str(content)
        } for role,content in rgx_split(r'system:\s|assistant:\s|user:\s', string_prompt.strip())
    ]
def messages_to_string(msgs):
    return ''.join([f"{msg['content']}" for msg in msgs])

# the specific API we are going to call using the messages interface
def openai_chat(model, messages):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        seed=42
    )
    return [dict(response.choices[0].message)]

# we disk cache our function to spare my openAI credits
@memory.cache
def llm(model, question):
    q_msgs = str_to_messages(question)
    a_msgs = openai_chat(model, q_msgs)
    answer = messages_to_string(a_msgs)
    return answer

def llm_ie(model, question):
    return [llm(model, question)]




In [None]:
llm('gpt-3.5-turbo','user: Hello, who are you?')

'Hello! I am a AI-powered virtual assistant designed to help and provide information to users. How can I assist you today?'
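
The effect of caching the `llm` function can be illustrated without any API call. The sketch below swaps in the standard library's in-memory `functools.lru_cache` for joblib's disk cache, and `cached_llm` is a hypothetical stand-in that records how often its body actually runs.

```python
from functools import lru_cache

calls = []  # records every time the function body really executes

@lru_cache(maxsize=None)
def cached_llm(model, prompt):
    # stand-in for an expensive API call
    calls.append((model, prompt))
    return f"[{model}] reply to: {prompt}"

first = cached_llm('demo-model', 'user: Hello')
second = cached_llm('demo-model', 'user: Hello')   # served from the cache
third = cached_llm('demo-model', 'user: Goodbye')  # new arguments, new call

print(len(calls))
```

Unlike this in-memory version, a disk cache also survives restarts of the notebook kernel, which is what makes it useful for saving API credits across sessions.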

Now we can register the `llm` function as an IE function and use it from spannerlog code.

Now let's see how we can use spannerlib to build a simple pipeline that varies the LLM prompt in a data-dependent way.
Imagine that we have a chatbot that we want to act differently based on the topic of conversation and the user preference.

To enable us to format prompts from data, we will use a prompt formatting function with a `printf`-like syntax.
It is available in spannerlib's stdlib, but we will reproduce it here.

In [None]:
#| export
def format_ie(f_string,*params):
    yield f_string.format(*params),

# note that since the schema is dynamic we need to define a function that returns the schema based on the arity
string_schema = lambda x: ([str]*x)

In [None]:
sess.register('llm', llm_ie, [str,str],[str])
sess.register('format', format_ie, string_schema,[str])
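
As a quick sanity check outside of spannerlog, the sketch below mirrors the two definitions above and shows what they produce. The template and argument strings are made up for illustration.

```python
def format_ie(f_string, *params):
    # mirrors the definition above: yields a 1-tuple holding the formatted prompt
    yield f_string.format(*params),

def string_schema(arity):
    # one `str` per input argument, whatever the arity
    return [str] * arity

template = "system: Answer in this style: {}\nuser: {}"
rows = list(format_ie(template, "Be brief.", "What is a parsec?"))

print(rows)
print(string_schema(3))
```

Each `{}` placeholder is filled positionally, so the order of the arguments passed to `format` must match the order of the placeholders in the template.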

### Seeing our llm IE function in action

In [None]:
%%spannerlog

model = 'gpt-3.5-turbo'
prompt = "user: sing it with me, love, what is it good for?"

TestLLM(answer)<-\
    llm($model,$prompt)->(answer)

?TestLLM(answer)


'?TestLLM(answer)'

Unnamed: 0,answer
0,Absolutely nothing!


### Example data


We will model our data as follows:

In [None]:

# a binary relation that stores how each user prefers answers to be formatted
user_answer_prefernce = pd.DataFrame([
    ('Bob', 'please answer in prose. Make sure you add references in the end like a bibliography.'),
    ('Joe', 'Prose is hard for me to read quickly. Format the answer in bullet points.'),
    ('Sally', 'Try to avoid complicated jargon and use simple language.'),
])

# relation that stores per topic style
topic_specific_style = pd.DataFrame([
    ('history', 'please answer in a narrative form, use shakespearing language and lots of examples'),
    ('science', 'Use mathematical notation and formulas to explain the concepts.'),
    ('hiphop', 'introduce the answer with a rap verse.'),
    ('other', 'Be polite and neutral in your answer. Avoid controversial topics.'),
])

sess.import_rel('UserAnswerPreference', user_answer_prefernce)
sess.import_rel('TopicSpecificStyle', topic_specific_style)

### Building the agent logic in spannerlog

Now we will build our pipeline as follows:
* We will make an llm call to help us decide the topic of the question
* Based on the topic and the user, we will formulate the final prompt for the llm to answer.
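
Before expressing this in spannerlog, the two steps can be sketched in plain Python to make the control flow explicit. Everything here is hypothetical: `fake_llm` is a keyword-matching stand-in for a real LLM call, and the dictionaries are toy versions of the relations defined above.

```python
def fake_llm(prompt):
    # stand-in for the real `llm` IE function; no API is called
    if prompt.startswith('topic?'):
        return 'science' if 'stars' in prompt else 'other'
    return f"<answer to: {prompt}>"

topic_styles = {'science': 'Use formulas.', 'other': 'Be polite.'}
user_styles = {'Bob': 'Add references.', 'Sally': 'Use simple language.'}
demo_questions = [
    ('Bob', 'How do we measure the distance between stars?'),
    ('Sally', 'How do we measure the distance between stars?'),
]

def answer(user, question):
    # step 1: one LLM call decides the topic of the question
    topic = fake_llm(f"topic? {question}")
    # step 2: topic and user styles are combined into the final prompt
    prompt = f"style: {topic_styles[topic]} {user_styles[user]} question: {question}"
    return topic, fake_llm(prompt)

results = [(user, *answer(user, question)) for user, question in demo_questions]
for user, topic, reply in results:
    print(user, topic, reply)
```

In the spannerlog version, the dictionary lookups become joins over relations and each step becomes a rule, so the same logic is applied to every user and question without an explicit loop.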


In [None]:
prompts = pd.DataFrame([
    ('topic_selection',
"""
system: Please select a topic from the following list: [history, science, hiphop, other]
based on the question provided by the user. You are only allwed to say the topic name, nothing else.

user: {}
"""),
    ('custom_style_prompt',
"""
system: Answer the question of the user in the following style:

topic specific style instructions: {}

user specific style instructions: {}

user: {}
"""
)
])
sess.import_rel('Prompts', prompts)
prompts

Unnamed: 0,0,1
0,topic_selection,system: Please select a topic from the follow...
1,custom_style_prompt,system: Answer the question of the user in th...


In [None]:
%%spannerlog
?Prompts(prompt_id,prompt)

'?Prompts(prompt_id,prompt)'

Unnamed: 0,prompt_id,prompt
0,custom_style_prompt,system: Answer the question of the user in the following style: topic specific style instructions: {} user specific style instructions: {} user: {}
1,topic_selection,"system: Please select a topic from the following list: [history, science, hiphop, other] based on the question provided by the user. You are only allwed to say the topic name, nothing else. user: {}"


Now, given a relation of users and questions:

In [None]:
questions= pd.DataFrame([
    ('Bob', 'Who won the civil war?'),
    ('Joe', 'Who won the civil war?'),
    ('Sally', 'Who won the civil war?'),
    ('Bob', 'How do we measure the distance between stars?'),
    ('Joe', 'How do we measure the distance between stars?'),
    ('Sally', 'How do we measure the distance between stars?'),
    ('Bob', 'Who are the most well known rappers?'),
    ('Joe', 'Who are the most well known rappers?'),
    ('Sally', 'Who are the most well known rappers?'),
])
sess.import_rel('Questions', questions)

Now let's combine all of this into a pipeline.


In [None]:
%%spannerlog
model = 'gpt-3.5-turbo'

TopicPrompt(q,p)<-\
    Questions(user,q),\
    Prompts('topic_selection',template),\
    format(template,q)->(p)
    
?TopicPrompt(q,p)

TopicSelection(q,t)<-\
    TopicPrompt(q,p),\
    llm($model,p)->(t)

?TopicSelection(q,t)
StylePrompt(q,p,topic,user)<-\
    Questions(user,q),\
    TopicSelection(q,topic),\
    UserAnswerPreference(user,user_style),\
    TopicSpecificStyle(topic,topic_style),\
    Prompts('custom_style_prompt',prompt_template),\
    format(prompt_template,topic_style,user_style,q)->(p)


Style_Based_QA(question,topic,user,prompt,answer)<-\
    StylePrompt(question,prompt,topic,user),\
    llm($model,prompt)->(answer)



'?TopicPrompt(q,p)'

Unnamed: 0,q,p
0,How do we measure the distance between stars?,"system: Please select a topic from the following list: [history, science, hiphop, other] based on the question provided by the user. You are only allwed to say the topic name, nothing else. user: How do we measure the distance between stars?"
1,Who are the most well known rappers?,"system: Please select a topic from the following list: [history, science, hiphop, other] based on the question provided by the user. You are only allwed to say the topic name, nothing else. user: Who are the most well known rappers?"
2,Who won the civil war?,"system: Please select a topic from the following list: [history, science, hiphop, other] based on the question provided by the user. You are only allwed to say the topic name, nothing else. user: Who won the civil war?"


'?TopicSelection(q,t)'

Unnamed: 0,q,t
0,How do we measure the distance between stars?,science
1,Who are the most well known rappers?,hiphop
2,Who won the civil war?,history


### Executing our agent and returning the results to python

In [None]:
completions = sess.export('?Style_Based_QA(question,topic,user,prompt,answer)')
completions

Unnamed: 0,question,topic,user,prompt,answer
0,How do we measure the distance between stars?,science,Bob,system: Answer the question of the user in th...,The distance between stars is measured in astr...
1,How do we measure the distance between stars?,science,Joe,system: Answer the question of the user in th...,"- To measure the distance between stars, astro..."
2,How do we measure the distance between stars?,science,Sally,system: Answer the question of the user in th...,One common way to measure the distance between...
3,Who are the most well known rappers?,hiphop,Bob,system: Answer the question of the user in th...,"Straight outta Compton, a city so gritty, Let ..."
4,Who are the most well known rappers?,hiphop,Joe,system: Answer the question of the user in th...,"Yo, when it comes to rappers, there's a whole ..."
5,Who are the most well known rappers?,hiphop,Sally,system: Answer the question of the user in th...,"Yo, when it comes to rappers, there's a whole ..."
6,Who won the civil war?,history,Bob,system: Answer the question of the user in th...,"In sooth, fair user, the tale of the Civil War..."
7,Who won the civil war?,history,Joe,system: Answer the question of the user in th...,"- The Civil War, a great conflict betwixt the ..."
8,Who won the civil war?,history,Sally,system: Answer the question of the user in th...,"In sooth, good sir, the Civil War wrought much..."


And as you can see, we have our agent running.

In [None]:
for (question,topic,user,prompt,answer) in completions.itertuples(name=None, index=False):
    print(f"{user}: {question}\nassistant: {answer}\n\n")
    print("="*80)

Bob: How do we measure the distance between stars?
assistant: The distance between stars is measured in astronomical units (AU), parsecs (pc), or light-years (ly). 

1. **Astronomical Unit (AU)**: An AU is the average distance between the Earth and the Sun, which is approximately 93 million miles or 150 million kilometers. It is more commonly used to measure distances within our solar system.

2. **Parsec (pc)**: A parsec is a unit of distance used in astronomy, defined as the distance at which an object would have a parallax angle of one arcsecond. One parsec is equal to approximately 3.26 light-years or 3.09 x 10^13 kilometers.

3. **Light-year (ly)**: A light-year is the distance that light travels in one year, which is approximately 9.46 trillion kilometers or about 5.88 trillion miles.

To measure the distance to a star, astronomers use methods such as parallax and spectroscopic parallax. Parallax involves observing a star from two different points in Earth's orbit and measuring t

In [None]:
#|hide
import nbdev; nbdev.nbdev_export()