# Basic Tasks using spannerlib


In [None]:
#| hide
from nbdev.showdoc import show_doc
from IPython.display import display, HTML
%load_ext autoreload
%autoreload 2

In [None]:
# importing dependencies
import re
import pandas as pd
from pandas import DataFrame
from pathlib import Path
from spannerlib import get_magic_session,Session,Span


This tutorial aims to show how to use the spannerlib framework for simple use cases.
To illustrate how simple spannerlib is to use, all required IE functions will be implemented from scratch.

## Finding identical sentences in a corpus of documents

Imagine we have a collection of documents, each with a document id (perhaps a path on a file system)
and a content. And imagine we would like to find identical sentences across these documents.

Finding identical sentences is a toy example that can be extended to finding identical entity mentions and similar tasks.

In [None]:
input_documents = pd.DataFrame([
    ('doc1', 'The quick brown fox jumps over the lazy dog. Im walking on Sunshine.'),
    ('doc2', 'Im walking on Sunshine. Lorem ipsum. Im walking on Sunshine.'),
    ('doc3', 'All you need is love. The quick brown fox jumps over the lazy dog.'),
])
input_documents

Unnamed: 0,0,1
0,doc1,The quick brown fox jumps over the lazy dog. I...
1,doc2,Im walking on Sunshine. Lorem ipsum. Im walkin...
2,doc3,All you need is love. The quick brown fox jump...


To do so, we need two IE functions:
* One that extracts the Span of each sentence from a document, called `split`
* Another that tells us when two sentences have identical content but are not actually the same sentence.
  * this requires them to have the same content when ignoring surrounding whitespace
  * but not to be equal spans. Let's call this function `eq_content_spans`

Let's implement them and register them with our session object. We will use the `Span` class to save indexed substrings.



In [None]:
' Im walking on Sunshine. '.strip()

'Im walking on Sunshine.'

In [None]:
# this implementation is naive;
# the standard library has an rgx_split IE function that does this more efficiently
def split(text):
    # yield one Span per sentence, ending just before each '.'
    start = 0
    for pos,char in enumerate(text):
        if char == '.':
            yield Span(text, start, pos)
            start = pos+1

print(list(split('The quick brown fox jumps over the lazy dog. Im walking on Sunshine.')))

def eq_content_spans(span1, span2):
    # notice that we are yielding a boolean value
    yield span1 != span2 and str(span1).strip() == str(span2).strip()

# we register the functions and their input and output schema
sess = get_magic_session()
sess.register('split', split, [str],[Span])
sess.register('eq_content_spans', eq_content_spans,[Span,Span],[bool])


[[@3ba775,0,43) "The quick ...", [@3ba775,44,67) " Im walkin..."]


Now let us import our data

In [None]:
sess.import_rel('Docs', input_documents)

Let's make sure we can see our data using Spannerlog:

In [None]:
%%spannerlog
?Docs(doc_id,text)

'?Docs(doc_id,text)'

Unnamed: 0,doc_id,text
0,doc1,The quick brown fox jumps over the lazy dog. Im walking on Sunshine.
1,doc2,Im walking on Sunshine. Lorem ipsum. Im walking on Sunshine.
2,doc3,All you need is love. The quick brown fox jumps over the lazy dog.


Now let us build rules that allow us to find identical sentences:

In [None]:
%%spannerlog
# this rule gives us the sentences of each document
Sents(doc_id,sent)<-\
    Docs(doc_id,text),split(text)->(sent)
?Sents(doc_id,sent)

# this rule finds pairs of sentences with equal content
EqualSents(doc_id1,sent1,doc_id2,sent2)<-\
    Sents(doc_id1,sent1),\
    Sents(doc_id2,sent2),\
    eq_content_spans(sent1,sent2)->(True)
?EqualSents(doc_id1,sent1,doc_id2,sent2)

'?Sents(doc_id,sent)'

Unnamed: 0,doc_id,sent
0,doc1,"[@3ba775,0,43) ""The quick ..."""
1,doc1,"[@3ba775,44,67) "" Im walkin..."""
2,doc2,"[@06bc2d,0,22) ""Im walking..."""
3,doc2,"[@06bc2d,23,35) "" Lorem ips..."""
4,doc2,"[@06bc2d,36,59) "" Im walkin..."""
5,doc3,"[@9c32df,0,20) ""All you ne..."""
6,doc3,"[@9c32df,21,65) "" The quick..."""


'?EqualSents(doc_id1,sent1,doc_id2,sent2)'

Unnamed: 0,doc_id1,sent1,doc_id2,sent2
0,doc1,"[@3ba775,0,43) ""The quick ...""",doc3,"[@9c32df,21,65) "" The quick..."""
1,doc1,"[@3ba775,44,67) "" Im walkin...""",doc2,"[@06bc2d,0,22) ""Im walking..."""
2,doc1,"[@3ba775,44,67) "" Im walkin...""",doc2,"[@06bc2d,36,59) "" Im walkin..."""
3,doc2,"[@06bc2d,0,22) ""Im walking...""",doc1,"[@3ba775,44,67) "" Im walkin..."""
4,doc2,"[@06bc2d,0,22) ""Im walking...""",doc2,"[@06bc2d,36,59) "" Im walkin..."""
5,doc2,"[@06bc2d,36,59) "" Im walkin...""",doc1,"[@3ba775,44,67) "" Im walkin..."""
6,doc2,"[@06bc2d,36,59) "" Im walkin...""",doc2,"[@06bc2d,0,22) ""Im walking..."""
7,doc3,"[@9c32df,21,65) "" The quick...""",doc1,"[@3ba775,0,43) ""The quick ..."""


Notice that we got each pair twice. This is because, while we think of a sentence pair as a set of two sentences,
the tuples `(x,y)` and `(y,x)` are actually different. To remedy this, we can keep only the pairs in which the first sentence is smaller than the second under the ordering on spans (i.e., position in memory).

We can do that by introducing a `span_lt` IE function:

In [None]:
def span_lt(span1, span2):
    yield span1 < span2

sess.register('span_lt', span_lt, [Span,Span],[bool])
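
The effect of such an ordering constraint can be sketched in plain Python, independently of spannerlib. In this illustrative sketch, spans are replaced by hypothetical `(doc_id, start, end, text)` tuples, and Python's built-in tuple ordering stands in for the span ordering; the actual ordering used by `Span` may differ.

```python
from itertools import product

# Hypothetical stand-ins for spans: (doc_id, start, end, text) tuples.
sents = [
    ("doc1", 44, 67, " Im walking on Sunshine"),
    ("doc2", 0, 22, "Im walking on Sunshine"),
    ("doc2", 36, 59, " Im walking on Sunshine"),
]

def same_content(a, b):
    # different spans whose text matches once surrounding whitespace is ignored
    return a != b and a[3].strip() == b[3].strip()

# every match appears twice, once as (a, b) and once as (b, a)
all_pairs = [(a, b) for a, b in product(sents, repeat=2) if same_content(a, b)]

# requiring a < b keeps exactly one orientation of each pair
unique_pairs = [(a, b) for a, b in all_pairs if a < b]

print(len(all_pairs))     # 6
print(len(unique_pairs))  # 3
```

Any strict total order works for this purpose; it only has to pick one of `(x,y)` and `(y,x)` consistently.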

In [None]:
%%spannerlog
EqualSentsUniqe(doc_id1,sent1,doc_id2,sent2)<-\
    Sents(doc_id1,sent1),\
    Sents(doc_id2,sent2),\
    span_lt(sent1,sent2)->(True),\
    eq_content_spans(sent1,sent2)->(True)
    
?EqualSentsUniqe(doc_id1,sent1,doc_id2,sent2)

'?EqualSentsUniqe(doc_id1,sent1,doc_id2,sent2)'

Unnamed: 0,doc_id1,sent1,doc_id2,sent2
0,doc2,"[@06bc2d,0,22) ""Im walking...""",doc1,"[@3ba775,44,67) "" Im walkin..."""
1,doc2,"[@06bc2d,0,22) ""Im walking...""",doc2,"[@06bc2d,36,59) "" Im walkin..."""
2,doc2,"[@06bc2d,36,59) "" Im walkin...""",doc1,"[@3ba775,44,67) "" Im walkin..."""
3,doc3,"[@9c32df,21,65) "" The quick...""",doc1,"[@3ba775,0,43) ""The quick ..."""


In [None]:
# # cleaning the session before the next use case
# sess.clear()

## Calling an LLM as part of a data-dependent pipeline

In this use case, we will implement basic LLM-based pipelines using spannerlib.

First of all, we need to wrap an LLM API as an IE function.
We want an IE function that takes a string and returns a string.
Since most LLM APIs expect a list of messages of the form `{'role':role,'content':message}`,
we will write converters that parse them from prompt strings of the form
```txt
role: content
role: content
```
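
As a rough illustration of this conversion, here is a minimal sketch that uses only Python's standard `re` module rather than spannerlib's `rgx_split`. The function name `parse_prompt` and the role pattern are assumptions made for this example and are not part of spannerlib.

```python
import re

# Matches a role prefix such as "user: " at the start of a line (assumed roles).
ROLE_PATTERN = re.compile(r"^(system|assistant|user):\s", flags=re.MULTILINE)

def parse_prompt(prompt):
    """Parse a 'role: content' prompt string into a list of message dicts."""
    # With a capturing group, re.split keeps the roles:
    # ['', role1, content1, role2, content2, ...]
    parts = ROLE_PATTERN.split(prompt.strip())
    roles, contents = parts[1::2], parts[2::2]
    return [{"role": r, "content": c.strip()} for r, c in zip(roles, contents)]

messages = parse_prompt("system: Be brief.\nuser: Hello, who are you?")
print(messages)
# [{'role': 'system', 'content': 'Be brief.'}, {'role': 'user', 'content': 'Hello, who are you?'}]
```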

In [None]:
# load the OpenAI API key from an env file

from dotenv import load_dotenv
load_dotenv('.env_dev')
import os
assert os.getenv('OPENAI_API_KEY') is not None

In [None]:
from spannerlib.ie_func.basic import rgx_split
import openai
from joblib import Memory
memory = Memory("cachedir", verbose=0)
client = openai.Client()

# we use the rgx_split function to split the string into (role, content) messages
def str_to_messages (string_prompt):
    return [
        {
            'role': str(role).replace(': ',''),
            'content': str(content)
        # raw string, so that \s reaches the regex engine unescaped
        } for role,content in rgx_split(r'system:\s|assistant:\s|user:\s', string_prompt.strip())
    ]
def messages_to_string(msgs):
    return ''.join([f"{msg['content']}" for msg in msgs])

# the specific API we are going to call using the messages interface
def openai_chat(model, messages):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        seed=42
    )
    return [dict(response.choices[0].message)]

# we cache our function on disk to avoid spending OpenAI credits on repeated calls
@memory.cache
def llm(model, question):
    q_msgs = str_to_messages(question)
    a_msgs = openai_chat(model, q_msgs)
    answer = messages_to_string(a_msgs)
    return answer

def llm_ie(model, question):
    return [llm(model, question)]




In [None]:
llm('gpt-3.5-turbo','user: Hello, who are you?')

'Hello! I am a AI-powered virtual assistant designed to help and provide information to users. How can I assist you today?'

Now we can register the llm function as an IE function and use it from within Spannerlog code.

In [None]:
sess.register('llm', llm_ie, [str,str],[str])

In [None]:
%%spannerlog

model = 'gpt-3.5-turbo'
prompt = "user: sing it with me, love, what is it good for?"

TestLLM(answer)<-\
    llm($model,$prompt)->(answer)

?TestLLM(answer)


'?TestLLM(answer)'

Unnamed: 0,answer
0,Absolutely nothing!


Now let's see how we can use spannerlib to build a simple pipeline that varies the LLM prompt in a data-dependent way.
Imagine that we have a chatbot that we want to act differently based on the topic of conversation and the user's preferences.

To format prompts from data, we will use a prompt formatting function with a `printf`-like syntax.
It is available in spannerlib's stdlib, but we will reproduce it here.

In [None]:
def format_ie(f_string,*params):
    yield f_string.format(*params),

# note that since the schema is dynamic we need to define a function that returns the schema based on the arity
string_schema = lambda x: ([str]*x)

sess.register('format', format_ie, string_schema,[str])
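
To see what this function produces outside of Spannerlog, the following sketch repeats its definition and calls it directly. The template and argument strings are made up for illustration.

```python
def format_ie(f_string, *params):
    # the trailing comma makes this a 1-tuple: one output row with one column
    yield f_string.format(*params),

# illustrative template with three positional placeholders
template = "topic style: {}\nuser style: {}\nuser: {}"

rows = list(format_ie(template, "Use formulas.", "Use bullet points.", "How far is the Sun?"))
print(rows[0][0])
```

Each `{}` is filled positionally, so the number of parameters passed after the template must match the number of placeholders.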



We will model our data as follows:

In [None]:

# a binary relation that stores how each user prefers answers to be formatted
user_answer_prefernce = pd.DataFrame([
    ('Bob', 'please answer in prose. Make sure you add references in the end like a bibliography.'),
    ('Joe', 'Prose is hard for me to read quickly. Format the answer in bullet points.'),
    ('Sally', 'Try to avoid complicated jargon and use simple language.'),
])

# a relation that stores the answer style for each topic
topic_specific_style = pd.DataFrame([
    ('history', 'please answer in a narrative form, use shakespearing language and lots of examples'),
    ('science', 'Use mathematical notation and formulas to explain the concepts.'),
    ('hiphop', 'introduce the answer with a rap verse.'),
    ('other', 'Be polite and neutral in your answer. Avoid controversial topics.'),
])

sess.import_rel('UserAnswerPreference', user_answer_prefernce)
sess.import_rel('TopicSpecificStyle', topic_specific_style)

Now we will build our pipeline as follows:
* We will make an llm call to help us decide the topic of the question
* Based on the topic and the user, we will formulate the final prompt for the llm to answer.
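
The two steps above can be sketched in plain Python before expressing them as Spannerlog rules. This is only an illustration: `fake_llm` is a hypothetical keyword-based stand-in for a real LLM call so that the sketch runs offline, and the style strings and templates are shortened placeholders.

```python
# Hypothetical stand-in for a real LLM call: guesses the topic from keywords.
def fake_llm(prompt):
    question = prompt.rsplit("user: ", 1)[-1].lower()
    if "war" in question:
        return "history"
    if "stars" in question:
        return "science"
    return "other"

# shortened placeholder data, keyed by topic and by user
topic_styles = {
    "history": "Answer in a narrative form.",
    "science": "Use formulas.",
    "other": "Be neutral.",
}
user_styles = {"Bob": "Answer in prose.", "Joe": "Use bullet points."}

topic_template = "system: Pick a topic from [history, science, other].\nuser: {}"
style_template = "system: topic style: {}\nuser style: {}\nuser: {}"

def build_final_prompt(user, question):
    # Step 1: one LLM call decides the topic of the question.
    topic = fake_llm(topic_template.format(question)).strip()
    # Step 2: topic and user together select the style instructions.
    prompt = style_template.format(topic_styles[topic], user_styles[user], question)
    return topic, prompt

topic, prompt = build_final_prompt("Joe", "Who won the civil war?")
print(topic)
print(prompt)
```

In the Spannerlog version, the dictionary lookups become joins against relations, and each function call becomes an IE function invoked inside a rule.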


In [None]:
prompts = pd.DataFrame([
    ('topic_selection',
"""
system: Please select a topic from the following list: [history, science, hiphop, other]
based on the question provided by the user. You are only allwed to say the topic name, nothing else.

user: {}
"""),
    ('custom_style_prompt',
"""
system: Answer the question of the user in the following style:

topic specific style instructions: {}

user specific style instructions: {}

user: {}
"""
)
])
sess.import_rel('Prompts', prompts)
prompts

Unnamed: 0,0,1
0,topic_selection,system: Please select a topic from the follow...
1,custom_style_prompt,system: Answer the question of the user in th...


In [None]:
%%spannerlog
?Prompts(prompt_id,prompt)

'?Prompts(prompt_id,prompt)'

Unnamed: 0,prompt_id,prompt
0,custom_style_prompt,system: Answer the question of the user in the following style: topic specific style instructions: {} user specific style instructions: {} user: {}
1,topic_selection,"system: Please select a topic from the following list: [history, science, hiphop, other] based on the question provided by the user. You are only allwed to say the topic name, nothing else. user: {}"


Now, given a relation of users and their questions:

In [None]:
questions= pd.DataFrame([
    ('Bob', 'Who won the civil war?'),
    ('Joe', 'Who won the civil war?'),
    ('Sally', 'Who won the civil war?'),
    ('Bob', 'How do we measure the distance between stars?'),
    ('Joe', 'How do we measure the distance between stars?'),
    ('Sally', 'How do we measure the distance between stars?'),
    ('Bob', 'Who are the most well known rappers?'),
    ('Joe', 'Who are the most well known rappers?'),
    ('Sally', 'Who are the most well known rappers?'),
])
sess.import_rel('Questions', questions)

'?TopicPrompt(Q,P)'

Unnamed: 0,Q,P
0,How do we measure the distance between stars?,"system: Please select a topic from the following list: [history, science, hiphop, other] based on the question provided by the user. You are only allwed to say the topic name, nothing else. user: How do we measure the distance between stars?"
1,Who are the most well known rappers?,"system: Please select a topic from the following list: [history, science, hiphop, other] based on the question provided by the user. You are only allwed to say the topic name, nothing else. user: Who are the most well known rappers?"
2,Who won the civil war?,"system: Please select a topic from the following list: [history, science, hiphop, other] based on the question provided by the user. You are only allwed to say the topic name, nothing else. user: Who won the civil war?"


Now let's combine all of this into a pipeline.


In [None]:
%%spannerlog
model = 'gpt-3.5-turbo'

TopicPrompt(Q,P)<-\
    Questions(user,Q),\
    Prompts('topic_selection',Template),\
    format(Template,Q)->(P)
    
?TopicPrompt(Q,P)

TopicSelection(Q,T)<-\
    TopicPrompt(Q,P),\
    llm($model,P)->(T)

?TopicSelection(Q,T)

StylePrompt(Q,P,Topic,user)<-\
    Questions(user,Q),\
    TopicSelection(Q,Topic),\
    UserAnswerPreference(user,UserStyle),\
    TopicSpecificStyle(Topic,TopicStyle),\
    Prompts('custom_style_prompt',promptTemplate),\
    format(promptTemplate,TopicStyle,UserStyle,Q)->(P)


Style_Based_QA(Q,Topic,user,P,A)<-\
    StylePrompt(Q,P,Topic,user),\
    llm($model,P)->(A)



As you can see, we now have a topic- and user-dependent LLM pipeline:

In [None]:
%%spannerlog
?Style_Based_QA(Q,Topic,user,Prompt,Answer)

'?Style_Based_QA(Q,Topic,user,Prompt,Answer)'

Unnamed: 0,Q,Topic,user,Prompt,Answer
0,How do we measure the distance between stars?,science,Bob,system: Answer the question of the user in the following style: topic specific style instructions: Use mathematical notation and formulas to explain the concepts. user specific style instructions: please answer in prose. Make sure you add references in the end like a bibliography. user: How do we measure the distance between stars?,"The distance between stars is measured in astronomical units (AU), parsecs (pc), or light-years (ly). 1. **Astronomical Unit (AU)**: An AU is the average distance between the Earth and the Sun, which is approximately 93 million miles or 150 million kilometers. It is more commonly used to measure distances within our solar system. 2. **Parsec (pc)**: A parsec is a unit of distance used in astronomy, defined as the distance at which an object would have a parallax angle of one arcsecond. One parsec is equal to about 3.26 light-years or 3.09 x 10^13 kilometers. 3. **Light-year (ly)**: A light-year is the distance that light travels in one year, which is approximately 9.46 trillion kilometers or about 5.88 trillion miles. To measure the distance to nearby stars, astronomers use a method called parallax. This involves observing a star from two different points in Earth's orbit around the Sun, which causes the star to appear to shift slightly against the background of more distant stars. By measuring the angle of this shift (the parallax angle), astronomers can calculate the distance to the star using trigonometry. For more distant stars, astronomers use other methods such as spectroscopic parallax, standard candles (like Cepheid variables), and the cosmic distance ladder which combines various distance measurement techniques to determine distances to faraway objects in the universe. References: 1. ""Astronomical unit"" - NASA, https://solarsystem.nasa.gov/solar-system/our-solar-system/overview/ 2. 
""Parsec"" - European Space Agency, https://www.spacetelescope.org/about/general/parallax/ 3. ""Light-year"" - Space.com, https://www.space.com/15830-light-years.html 4. ""Parallax"" - European Southern Observatory, https://www.eso.org/public/usa/news/eso1718/"
1,How do we measure the distance between stars?,science,Joe,system: Answer the question of the user in the following style: topic specific style instructions: Use mathematical notation and formulas to explain the concepts. user specific style instructions: Prose is hard for me to read quickly. Format the answer in bullet points. user: How do we measure the distance between stars?,"- To measure the distance between stars, astronomers use a method called parallax, which is based on trigonometry. - Parallax involves observing a star from two different points in Earth's orbit and measuring the angle, known as the parallax angle, that the star appears to move. - The distance to the star can be calculated using the formula: distance = 1 / parallax angle. - The unit typically used for measuring stellar distances is the parsec (pc), where 1 parsec is equivalent to about 3.26 light years. - Other methods like photometric distance and spectroscopic parallax are also used to measure distances to stars that are much farther away."
2,How do we measure the distance between stars?,science,Sally,system: Answer the question of the user in the following style: topic specific style instructions: Use mathematical notation and formulas to explain the concepts. user specific style instructions: Try to avoid complicated jargon and use simple language. user: How do we measure the distance between stars?,"One common way to measure the distance between stars is through parallax. Parallax is the apparent shift in position of an object when viewed from two different points. The parallax angle, denoted by p, is the angle formed between a line from the observer to the star at two different times when Earth is on opposite sides of its orbit. The distance to the star, denoted by d, can be calculated using the formula: \[ d = \frac{1}{\tan(p)} \] By measuring the parallax angle and using this formula, astronomers can determine the distance to nearby stars."
3,Who are the most well known rappers?,hiphop,Bob,system: Answer the question of the user in the following style: topic specific style instructions: introduce the answer with a rap verse. user specific style instructions: please answer in prose. Make sure you add references in the end like a bibliography. user: Who are the most well known rappers?,"Straight outta Compton, a city so gritty, Let me introduce thee to the rap elite committee. Eminem, with his lyrical finesse, Jay-Z, the blueprint of success. Kendrick Lamar, with his socially conscious flow, And Tupac, whose words still resonate, yo. These artists have left a mark on the game, Their names forever etched in the hall of fame. Some of the most well-known rappers in the music industry include Eminem, Jay-Z, Kendrick Lamar, and Tupac Shakur. Each of these artists has had a significant impact on the genre of rap through their unique styles and powerful lyrics. Their influence can be seen in the way they have shaped the culture of hip-hop and inspired countless artists to follow in their footsteps. 1. Eminem: An American rapper known for his intricate wordplay and controversial lyrics. His albums, such as ""The Marshall Mathers LP"" and ""The Eminem Show,"" have garnered critical acclaim and commercial success. 2. Jay-Z: A legendary rapper and businessman, Jay-Z is renowned for his storytelling abilities and entrepreneurial ventures. His albums, including ""The Blueprint"" and ""4:44,"" have solidified his status as one of the greatest rappers of all time. 3. Kendrick Lamar: Hailing from Compton, California, Kendrick Lamar is celebrated for his socially conscious lyrics and complex rhyme schemes. His albums, like ""good kid, m.A.A.d city"" and ""To Pimp a Butterfly,"" have won numerous awards and accolades. 4. Tupac Shakur: A cultural icon, Tupac Shakur left a lasting legacy through his music and activism. 
His albums, such as ""All Eyez on Me"" and ""Me Against the World,"" continue to resonate with audiences around the world even decades after his death. These rappers have not only achieved commercial success but have also made a lasting impact on the music industry and popular culture as a whole. Their contributions to the rap genre are undeniable and continue to be celebrated by fans and fellow artists alike. References: - ""Eminem Biography."" Biography.com - ""Jay-Z - Biography."" Biography.com - ""Kendrick Lamar - Biography."" Biography.com - ""Tupac Shakur - Biography."" Biography.com"
4,Who are the most well known rappers?,hiphop,Joe,system: Answer the question of the user in the following style: topic specific style instructions: introduce the answer with a rap verse. user specific style instructions: Prose is hard for me to read quickly. Format the answer in bullet points. user: Who are the most well known rappers?,"Yo, when it comes to rappers, there's a whole crew, Here are some of the most well-known for you: - Eminem: Known for his lyrical genius and raw talent - Jay-Z: Business mogul and rap legend with a smooth flow - Tupac Shakur: Iconic figure known for his powerful lyrics and impact on the industry - Notorious B.I.G.: Legendary rapper with a unique flow and storytelling abilities - Kendrick Lamar: Modern-day artist known for his socially conscious lyrics and innovative sound - Drake: Chart-topping rapper with a versatile style and massive commercial success - Kanye West: Controversial yet influential artist known for pushing boundaries in music - Nicki Minaj: Female rap superstar known for her bold persona and rapid-fire delivery - Lil Wayne: Veteran rapper with a long-standing career and influential impact on the genre - Cardi B: Breakout star known for her charisma, humor, and hit songs"
5,Who are the most well known rappers?,hiphop,Sally,system: Answer the question of the user in the following style: topic specific style instructions: introduce the answer with a rap verse. user specific style instructions: Try to avoid complicated jargon and use simple language. user: Who are the most well known rappers?,"Yo, when it comes to rappers, there's a whole lot in the game, From OGs like Tupac and Biggie, to new stars on their rise to fame. Eminem's known for his lyrical prowess and flow, While Jay-Z's the blueprint for success, yo. Drake stays topping charts with hits in every season, And Kendrick Lamar speaks truth, shedding light on social reason. Cardi B's making waves with her bold and fierce style, While Nicki Minaj's been reigning as the queen for a while. So if you're looking for the most well-known in the rap scene, These artists are the ones that truly reign supreme."
6,Who won the civil war?,history,Bob,"system: Answer the question of the user in the following style: topic specific style instructions: please answer in a narrative form, use shakespearing language and lots of examples user specific style instructions: please answer in prose. Make sure you add references in the end like a bibliography. user: Who won the civil war?","In sooth, fair user, the tale of the Civil War is a tumultuous one, where brother fought against brother, and the land was torn asunder by the clash of arms. Verily, 'twas a war of great import that shook the very foundations of our fair nation. In the end, it was the valiant forces of the Union, led by the noble General Ulysses S. Grant, that emerged victorious over the Confederate armies. The Confederate army, under the command of the gallant General Robert E. Lee, bravely fought to defend their cause, but in the final battles at Gettysburg and Appomattox, they were bested by the Union forces. 'Twas a bitter victory, for the toll of the conflict was great, with countless lives lost and cities laid to waste. But in the end, the Union prevailed, and the forces of unity and freedom triumphed over those of division and oppression. And so, dear user, history records that the Civil War was won by the Union, and the nation was preserved under one flag, one nation indivisible. References: - McPherson, James M. Battle cry of freedom: The Civil War era. Oxford University Press, 2003. - Foote, Shelby. The Civil War: A Narrative. Vol. 3. Vintage, 1986."
7,Who won the civil war?,history,Joe,"system: Answer the question of the user in the following style: topic specific style instructions: please answer in a narrative form, use shakespearing language and lots of examples user specific style instructions: Prose is hard for me to read quickly. Format the answer in bullet points. user: Who won the civil war?","- The Civil War, a great conflict betwixt the Union and the Confederacy, did conclude with the victory of the Union forces led by the noble General Ulysses S. Grant. - The Union, under the wise leadership of President Abraham Lincoln, did vanquish the Confederate forces after many years of bloody battle. - The surrender of General Robert E. Lee at Appomattox Court House in 1865 did mark the end of the war, bringing peace and unity to the land once more."
8,Who won the civil war?,history,Sally,"system: Answer the question of the user in the following style: topic specific style instructions: please answer in a narrative form, use shakespearing language and lots of examples user specific style instructions: Try to avoid complicated jargon and use simple language. user: Who won the civil war?","In sooth, good sir, the Civil War didst end with the victory of the Union forces led by General Ulysses S. Grant over the Confederate troops under General Robert E. Lee. Verily, after many a fierce battle and much bloodshed, the Union did emerge triumphant o'er the Confederate armies. 'Twas in the year of 1865, when General Lee did yield his sword to General Grant at Appomattox Court House, marking the official end of the war. The Union's cause of preserving the United States and abolishing the scourge of slavery did prevail. 'Tis oft said that the pen is mightier than the sword, and in this case, 'twas the Union's determination and unity that did carry them to victory over the Confederate forces. Yet, the scars of war did linger on, as the nation sought to heal and rebuild in the aftermath of such a tumultuous conflict."
