# Spanner Workbench Presentation

### Presenting:
| #       |              Name |             Id |             email |
|---------|-------------------|----------------|------------------ |
|Student 1|  Tom Feldman | 325135960 | tom.p@campus.technion.ac.il |
|Student 2|  Niv Shetrit | 316354976 | shniv@campus.technion.ac.il |

# TODO: create google doc (add link here)

# The Project's Purpose

* Incorporate information extraction into the relational database model
* Our language is a framework that other programmers can use to implement their own version of Spannerlog
* The implementation is modular, so others can modify it however they like


// TODO@niv: Create a language which combines logic programming with imperative, information-extraction functions (e.g. adding passes)

# What the Project Looked Like Before Our Changes
* the bare-minimum working version of the project
* relied heavily on pydatalog, which isn't scalable
* adding IE functions was not supported - required modifying the code itself
* recursive rules were not supported
* no tests at all



# TODO: for every feature, explain why it was needed

# Information Extraction Functions <a class="anchor" id="IEFunctions"></a>

* IE functions are the very essence of Spannerlog, so users must be able to add them easily and programmatically
* Client-Server refactoring
* Design of `IEFunction` Class

TODO@tom: (in contrast to manually editing the code)

## Example: Logical Operators:

#### Implementation of the **not equals (NEQ)** relation.

We would like to have a rule such as:
<br>
```unique_pair(X, Y) <- pair(X, Y), X != Y```
<br><br>
Unfortunately, RGXLog doesn't support True/False values, so we can't use ```X != Y``` directly.
<br>
Our solution is to create an IE function that implements the NEQ relation:

In [1]:
import rgxlog
from rgxlog.engine.datatypes.primitive_types import DataTypes

def NEQ(x, y):
    if x == y:
        # return false (empty tuple represents false)
        yield tuple() 
    else:
        # return true
        yield x, y

in_out_types = [DataTypes.string, DataTypes.string]
rgxlog.magic_session.register(ie_function=NEQ, 
                       ie_function_name="NEQ", 
                       in_rel=in_out_types, 
                       out_rel=in_out_types)

In [2]:
%%rgxlog
new pair(str, str)
pair("Dan", "Tom")
pair("Cat", "Dog")
pair("Apple", "Apple")
pair("Cow", "Cow")
pair("123", "321")

unique_pair(X, Y) <- pair(First, Second), \
                     NEQ(First, Second) -> (X, Y)
?unique_pair(X, Y)

printing results for query 'unique_pair(X, Y)':
  X  |  Y
-----+-----
 Dan | Tom
 Cat | Dog
 123 | 321
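
To see why only three pairs survive, the generator can be exercised in plain Python, outside of RGXLog. This is a minimal sketch: the filtering step is an assumption that mimics the engine dropping the empty tuple that `NEQ` yields for "false", and the `pairs` list simply repeats the facts from the cell above.

```python
def NEQ(x, y):
    """Same generator as above: an empty tuple stands for 'false', otherwise (x, y)."""
    if x == y:
        yield tuple()
    else:
        yield x, y


# The facts of the `pair` relation from the cell above.
pairs = [("Dan", "Tom"), ("Cat", "Dog"), ("Apple", "Apple"), ("Cow", "Cow"), ("123", "321")]

# Assumption: empty tuples are discarded, so only non-empty rows reach unique_pair.
unique_pair = [row for first, second in pairs for row in NEQ(first, second) if row]
print(unique_pair)
```

The surviving rows are the same three pairs that the query printed.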



# Import-Export Functions <a class="anchor" id="Import-Export-Functions"></a>

* RGXlog is used as a (TODO@niv: ML/database) language, in which people use spreadsheets and dataframes very often
* import a relation from, or export it to, a CSV file
* import a relation from, or export it to, a dataframe

In [3]:
%%bash
cat enrolled.csv

abigail,operating_systems
jordan,chemistry
gale,operating_systems
howard,chemistry
howard,physics


In [4]:
rgxlog.magic_session.import_relation_from_csv("enrolled.csv", relation_name="enrolled", delimiter=",")

In [8]:
%%rgxlog
enrolled("abigail", "chemistry")
gpa_str = "abigail 100 jordan 80 gale 79 howard 60"

gpa(Student,Grade) <- py_rgx_string(gpa_str, "(\w+).*?(\d+)")->(Student, Grade),enrolled(Student,X)

?gpa(X,Y)

printing results for query 'gpa(X, Y)':
    X    |   Y
---------+-----
 abigail | 100
  gale   |  79
 howard  |  60
 jordan  |  80
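
The same computation can be traced in plain Python to show what the rule does step by step. This is only a sketch of the semantics, not of how the engine is implemented: it assumes a relation can be modelled as a set of tuples, inlines the contents of `enrolled.csv`, and uses `re.findall` in place of `py_rgx_string`.

```python
import csv
import io
import re

# Contents of enrolled.csv as shown above, inlined so the sketch is self-contained.
ENROLLED_CSV = """abigail,operating_systems
jordan,chemistry
gale,operating_systems
howard,chemistry
howard,physics
"""

# Assumption: a relation is modelled as a set of tuples, one tuple per CSV row.
enrolled = {tuple(row) for row in csv.reader(io.StringIO(ENROLLED_CSV), delimiter=",")}
enrolled.add(("abigail", "chemistry"))  # the extra fact added in the cell above

gpa_str = "abigail 100 jordan 80 gale 79 howard 60"
# re.findall returns one (student, grade) tuple per non-overlapping match.
extracted = re.findall(r"(\w+).*?(\d+)", gpa_str)

# Join on Student: keep the extracted pairs whose student appears in `enrolled`.
students = {student for student, _course in enrolled}
gpa = sorted({(student, grade) for student, grade in extracted if student in students})
print(gpa)
```

Because the result is a set, `howard` appears once even though he is enrolled in two courses.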



# Standard Library <a class="anchor" id="STDLIB"></a>

* contains implementations of many default IE functions, which users can use as templates for their own functions
* python and rust regex
* json path
* many nlp functions wrapping StanfordCoreNLP


TODO@tom: rust regex shows users how to create IE functions which use unix functions. json/stanford show how to wrap pythonic functions

## NLP Example - Named Entity Recognition:

The NER IE function recognizes named entities (person names, company names, etc.).

First, we register the IE function:

```python
magic_session.register(ie_function=ner_wrapper,
                       ie_function_name='NER',
                       in_rel=[DataTypes.string],
                       out_rel=[DataTypes.string, DataTypes.string, DataTypes.span])
```

In [9]:
%%rgxlog

sentence = "While in France, Christine Lagarde discussed short-term stimulus  \
            efforts in a recent interview with the Wall Street Journal."
               
ner(X, Y, Z) <- NER(sentence) -> (X, Y, Z)
# TODO@tom ?ner(Token, NER, Span)
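
The shape of such a wrapper can be illustrated without StanfordCoreNLP. The sketch below is a toy stand-in, not the library's `ner_wrapper`: `TOY_GAZETTEER` and `toy_ner` are hypothetical names, the entity types are hard-coded, and a span is represented as a plain `(start, end)` tuple. It only shows how a Python function can yield one `(string, string, span)` row per entity, matching the `out_rel` declared above.

```python
from typing import Iterator, Tuple

# Hypothetical stand-in for a real tagger: a tiny phrase -> entity-type lookup.
TOY_GAZETTEER = {
    "France": "COUNTRY",
    "Christine Lagarde": "PERSON",
    "Wall Street Journal": "ORGANIZATION",
}


def toy_ner(text: str) -> Iterator[Tuple[str, str, Tuple[int, int]]]:
    """Yield (entity text, entity type, (start, end)) for every known phrase in text."""
    for phrase, entity_type in TOY_GAZETTEER.items():
        start = text.find(phrase)
        while start != -1:
            yield phrase, entity_type, (start, start + len(phrase))
            start = text.find(phrase, start + 1)


sentence = "While in France, Christine Lagarde discussed short-term stimulus efforts."
# Sort by span so the rows appear in reading order.
rows = sorted(toy_ner(sentence), key=lambda row: row[2])
for row in rows:
    print(row)
```

A real wrapper would replace the lookup with a call to the NLP library and keep the same row layout.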

# Tests and Typing <a class="anchor" id="CI"></a>

* our RGXlog code contains many different functions, and we want to make sure that our changes don't introduce bugs
* this is why we run GitHub Actions CI after every push, which includes:
    * 63 unique pytest tests
    * pep8 test
    * mypy test - all functions are type annotated
<br><br>
* a successful run looks like this:

<img src="git_workflow.png" alt="workflow" width="600" height="600" style="display: block; margin-left: auto; margin-right: auto; width: 50%;">

# PyDatalog to SQL Engine <a class="anchor" id="Engine"></a>

* previously, we were heavily dependent on pydatalog, which doesn't expose its internal calculations
* TODO@tom: add example with the previous engine, explain importance

This is why we've implemented our main feature, which consists of:
* parse graph and term graph
* SQL engine
* adding rules to term graph
* bottom-up algorithm and execution function

## Parse Graph and Term Graph

Data structures which store the information needed for our calculations:
* parse graph - stores facts, relation declarations, rule declarations and queries.
* term graph - stores the connections between all the relations.

In [7]:
from rgxlog import Session

rgxlog.magic_session = Session()  # reset session

In [8]:
%%rgxlog 

new parent(str, str)
parent("Liam", "Noah")
parent("Noah", "Oliver")
parent("Oliver", "Mason")

ancestor(X,Y) <- parent(X,Y)
ancestor(X,Y) <- parent(X,Z), ancestor(Z,Y)
?ancestor(Parent, Son)

printing results for query 'ancestor(Parent, Son)':
  Parent  |  Son
----------+--------
   Liam   | Mason
   Liam   |  Noah
   Liam   | Oliver
   Noah   | Mason
   Noah   | Oliver
  Oliver  | Mason



In [9]:
print(f"parse graph:\n{rgxlog.magic_session._parse_graph}")
print(f"\nterm graph:\n{rgxlog.magic_session._term_graph}")

parse graph:
(__rgxlog_root) (computed) root
    (0) (computed) relation_declaration: parent(str, str)
    (1) (computed) add_fact: parent("Liam", "Noah")
    (2) (computed) add_fact: parent("Noah", "Oliver")
    (3) (computed) add_fact: parent("Oliver", "Mason")
    (4) (computed) rule: ancestor(X, Y) <- parent(X, Y)
    (5) (computed) rule: ancestor(X, Y) <- parent(X, Z), ancestor(Z, Y)
    (6) (computed) query: ancestor(Parent, Son)


term graph:
(__rgxlog_root) (not_computed) root
    (ancestor) (not_computed) rule_rel: ancestor(X, Y)
        (0) (not_computed) union
            (1) (not_computed) project: ['X', 'Y']
                (2) (not_computed) get_rel: parent(X, Y)
            (3) (not_computed) project: ['X', 'Y']
                (4) (not_computed) join: {'X': [(parent(X, Z), 0)], 'Z': [(ancestor(Z, Y), 0), (parent(X, Z), 1)], 'Y': [(ancestor(Z, Y), 1)]}
                    (5) (not_computed) get_rel: ancestor(Z, Y)
                        (ancestor) (not_computed) rule_re

## Adding rules to Term Graph

* traverse the parse graph and find new rules (which will be added to the term graph)
* compute a legal execution order for the rule body (i.e. each relation's input variables must already be bound)
* join all the relations (in the computed order)
* project onto the relevant variables
* connect to the rule_head node (when several rules share the same head, a union node combines them)
* select from relations (if the relations contain constant terms)
* calculate IE relations (the bounding relations are their children)
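
The execution-order step can be sketched as a greedy loop: repeatedly pick any body relation whose input variables are already bound, then mark its output variables as bound. This is an illustration under stated assumptions, not the project's actual code: `BodyRelation` and `execution_order` are hypothetical names, a regular relation is assumed to have no inputs, and an IE relation is assumed to need all of its inputs bound first.

```python
from typing import List, NamedTuple, Set


class BodyRelation(NamedTuple):
    """Hypothetical description of one relation in a rule body."""
    name: str
    inputs: Set[str]   # variables that must already be bound
    outputs: Set[str]  # variables this relation binds


def execution_order(body: List[BodyRelation]) -> List[str]:
    """Greedily order the body so every relation's inputs are bound before it runs."""
    bound: Set[str] = set()
    pending = list(body)
    order: List[str] = []
    while pending:
        ready = next((rel for rel in pending if rel.inputs <= bound), None)
        if ready is None:
            raise ValueError("no legal execution order: some input variable is never bound")
        order.append(ready.name)
        bound |= ready.outputs
        pending.remove(ready)
    return order


# unique_pair(X, Y) <- NEQ(First, Second) -> (X, Y), pair(First, Second)
# The body is deliberately written with the IE relation first.
body = [
    BodyRelation("NEQ", inputs={"First", "Second"}, outputs={"X", "Y"}),
    BodyRelation("pair", inputs=set(), outputs={"First", "Second"}),
]
print(execution_order(body))
```

Here `pair` is scheduled first because it binds `First` and `Second`, which `NEQ` needs as input.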

TODO@tom: parentheses are notes and will not appear in the slide.
TODO@tom: add example for execution order of a rule

## SQL Engine
TODO@niv: jinja, sql interface, why we moved from pydatalog to sql (relations are tables), rules are operators
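
The idea that relations are tables and rules are operators can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch and not the SQL the engine actually generates: the column names `col0` and `col1` and the `grandparent` rule are made up for the illustration. The rule body becomes a join on the shared variable, and the rule head becomes a projection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A relation is a table; each fact is a row. Column names are hypothetical.
conn.execute("CREATE TABLE parent (col0 TEXT, col1 TEXT)")
conn.executemany(
    "INSERT INTO parent VALUES (?, ?)",
    [("Liam", "Noah"), ("Noah", "Oliver"), ("Oliver", "Mason")],
)

# Illustrative rule: grandparent(X, Y) <- parent(X, Z), parent(Z, Y)
# Join on the shared variable Z, then project onto the head variables X and Y.
rows = conn.execute(
    """
    SELECT DISTINCT p1.col0 AS X, p2.col1 AS Y
    FROM parent AS p1
    JOIN parent AS p2 ON p1.col1 = p2.col0
    ORDER BY X, Y
    """
).fetchall()
print(rows)
conn.close()
```

Several rules with the same head would map to a `UNION` of such queries in this picture.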

## Execution Function

* Implements a naive bottom-up algorithm:
    * reset all the mutually recursive relations
    * traverse the term graph and update all the mutually recursive relations based on the previous step
    * stop when all the mutually recursive relations have converged in the same step
* uses the SQL engine

## Example - Ancestor Program

$ancestor_{i}(X,Y)$ <- $parent(X,Y)$ <br>
$ancestor_{i}(X,Y)$ <- $parent(X,Z), ancestor_{i - 1}(Z,Y)$

### ancestor_0 (empty table):

| Parent | Son |
| --- | --- |
|     |     |

### ancestor_1 (finds parents):


| Parent | Son |
| --- | --- |
| Liam | Noah |
| Noah | Oliver |
| Oliver | Mason |

### ancestor_2 (finds grandparents):

| Parent | Son |
| --- | --- |
| Liam | Noah |
| Noah | Oliver |
| Oliver | Mason |
| Liam | Oliver |
| Noah | Mason |

### ancestor_3 (finds great-grandparents):

| Parent | Son |
| --- | --- |
| Liam | Noah |
| Noah | Oliver |
| Oliver | Mason |
| Liam | Oliver |
| Noah | Mason |
| Liam | Mason |

### ancestor_4 (looks for great-great-grandparents, finds none):

| Parent | Son |
| --- | --- |
| Liam | Noah |
| Noah | Oliver |
| Oliver | Mason |
| Liam | Oliver |
| Noah | Mason |
| Liam | Mason |

## Stop the computation since no tuples were added (fixed point)
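
The iteration traced above can be reproduced in a few lines of plain Python. This is a sketch of naive bottom-up evaluation in general, using sets of tuples instead of the SQL engine and the term graph; the helper name `step` is made up for the illustration.

```python
parent = {("Liam", "Noah"), ("Noah", "Oliver"), ("Oliver", "Mason")}


def step(ancestor_prev):
    """Apply both ancestor rules once, reading only the previous iteration's table."""
    base = set(parent)  # ancestor(X,Y) <- parent(X,Y)
    # ancestor(X,Y) <- parent(X,Z), ancestor_prev(Z,Y)
    recursive = {(x, y) for x, z in parent for z2, y in ancestor_prev if z == z2}
    return base | recursive


history = [set()]  # ancestor_0 is the empty table
while True:
    current = step(history[-1])
    history.append(current)
    if current == history[-2]:  # nothing new was added: fixed point
        break

for i, table in enumerate(history):
    print(f"ancestor_{i}: {len(table)} tuples")
```

The table sizes per iteration are 0, 3, 5, 6 and 6, matching the tables above; the loop stops once two consecutive iterations are equal.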

## New Engine Features
We've also added tools which are useful for both users and developers:
* remove rules
* loggers/debug info
* enable union of rules (same head)
* print ie functions

## Tutorials Overview

<a href="./introduction.ipynb">introduction</a><br>
<a href="./Advanced usage.ipynb">advanced</a>

TODO: add overview between every 2 major slides
## Presentation Overview:
* Information Extraction Functions
* Import-Export Functions
* Tests and Typing
* Standard Library
* Tutorials Examples
* **PyDatalog to SQL Engine**:
    * **Parse Graph and Term Graph**
    * **Adding Rules to Term Graph**
    * **SQL Engine**
    * **Bottom-Up Execution**

link to optimization tutorials
show idea from md file and describe how to easily implement it with our interfaces