# Evaluation

This notebook evaluates the Soft Query Evaluation System.


To evaluate the system, we create queries, encoded as evaluation plans, and execute them on a [hand-crafted schema](https://github.com/HackerBschor/SofteningQueryEvaluation/blob/main/evaluation/schema.json).
Each query's result set $R$ is then compared against the ground truth set $G$ to determine the true positives $TP = R \cap G$, the false positives $FP = R \setminus G$ and the false negatives $FN = G \setminus R$. Using these sets, we calculate the following metrics:
1) Precision: $\frac{|TP|}{|TP| + |FP|}$
2) Recall: $\frac{|TP|}{|TP| + |FN|}$
3) F1: $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
4) Jaccard-Coefficient: $\frac{|TP|}{|G \cup R|}$
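
As a minimal sketch of how the four metrics follow from the three sets, the snippet below computes them for two small sets of result tuples. The function name `score` and the example tuples are assumptions made for illustration and are not part of the evaluation code.

```python
def score(result: set, ground_truth: set) -> dict:
    """Compute precision, recall, F1 and Jaccard from two sets of tuples."""
    tp = result & ground_truth   # true positives:  R ∩ G
    fp = result - ground_truth   # false positives: R \ G
    fn = ground_truth - result   # false negatives: G \ R

    precision = len(tp) / (len(tp) + len(fp)) if result else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    union = result | ground_truth
    jaccard = len(tp) / len(union) if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


# Hypothetical example: 2 of 3 returned tuples are correct, 1 expected tuple is missing
R = {("Toy Story",), ("Inception",), ("Cars",)}
G = {("Toy Story",), ("Inception",), ("The Matrix",)}
metrics = score(R, G)
```

The guards on empty sets avoid a division by zero when a query returns nothing or the ground truth is empty.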

Furthermore, we record the runtime of each query and compare it with the runtime of an equivalent query in which all soft bindings are replaced by strict bindings and which therefore does not use LLMs.

### Evaluation Schema

The schema is designed to maximize **Cross-Domain Generalization** in order to simulate a real-world application of the software.
Therefore, we include data from different domains:
1) Medicine
2) Chemistry
3) E-Commerce & Retail
4) Social Media & User-Generated Content
5) Geography
6) Entertainment & Media (Movies/Music)
7) Sports & Gaming


### Test Criteria
We identified eight test criteria for which soft bindings can improve the final result. We try to cover all test criteria in every domain so that the results are meaningful for each domain.
We define the test criteria as follows:
1) Semantic Matching ('Movie about toys that come to life' $\rightarrow$ Record of 'Toy Story')
2) Lexical & Contextual Ambiguity ('Apple' $\rightarrow$ 'Company', 'Apple' $\rightarrow$ 'Fruit')
3) Spelling Variations & Typos ('Aple' $\rightarrow$ 'Apple', 'Neighbour' $\rightarrow$ 'Neighbor')
4) Synonyms & Conceptual Overlap ('High Blood Pressure' $\approx$ 'Hypertension', 'Laptop' $\approx$ 'Notebook')
5) Aliases ('Lady Gaga' $\approx$ 'Stefani Joanne Angelina Germanotta', 'J. Smith' $\approx$ 'Jane Smith')
6) Abbreviations & Acronyms ('NYC' $\approx$ 'New York City', 'Dr' $\approx$ 'Doctor', 'AI' $\approx$ 'Artificial Intelligence')
7) Different Languages ('Apple' $\approx$ 'Apfel' $\approx$ 'Mela')
8) Unit & Format Inconsistencies ('2007-06-29' $\approx$ '06/29/2007' $\approx$ 'June 29, 2007' $\approx$ '29.06.2007' $\approx$ '29. Juni 2007', '2.2 lbs' $\approx$ '1 kg', '1M' $\approx$ '1.000.000' $\approx$ 1000000)

The defined test criteria mostly test for semantic equality $\approx$. However, a database also offers operators like $<$, $>$, $!=$.
So, where applicable, we also include range queries and queries for semantic inequality. These range queries mostly affect the criterion `Unit & Format Inconsistencies`.
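
To illustrate why strict comparison fails on format inconsistencies, here is a minimal sketch that normalizes a few of the date strings from criterion 8 with the standard library before comparing them. This rule-based parsing is not how the soft operators work; it only shows the equality and range semantics that a soft binding is expected to reproduce. The format list and the helper `parse_date` are assumptions for this example.

```python
from datetime import date, datetime

# Candidate formats; this list is an assumption made for the example
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]


def parse_date(text: str) -> date:
    """Try each known format in turn and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


variants = ["2007-06-29", "06/29/2007", "29.06.2007"]
parsed = {parse_date(v) for v in variants}

# All three strings denote the same day, although no two of them are equal as strings
same_day = len(parsed) == 1
# A range comparison only becomes meaningful on the normalized values
is_later = parse_date("06/24/2005") > parse_date("24.06.2003")
```

A fixed format list cannot cover every variant (for example '29. Juni 2007'), which is the gap the soft bindings are meant to close.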





### Import & Initialize: Models, Data and Functions

In [None]:
import json
import time

from sklearn.cluster import DBSCAN, SpectralClustering

from db.criteria import *
from db.operators import *
from db.operators.Aggregate import *
from db.structure import *

from models.embedding import SentenceTransformerEmbeddingModel
from models.semantic_validation import LLaMAValidationModel
from models.text_generation.LLaMA import LLaMATextGenerationModel

# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

In [None]:
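def load_dummy_operators():
    """Load every relation in schema.json and wrap it in a Dummy operator."""
    with open("schema.json", "r", encoding="UTF-8") as schema_file:
        schema = json.load(schema_file)

    # One Dummy operator per relation: (relation name, column names, rows)
    operators = {}
    for relation, data in schema.items():
        operators[relation] = Dummy(relation, [x["name"] for x in data["schema"]], data["data"])
    return operators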


In [None]:
def evaluate(op, result_cols, gt):
    """Execute `op`, project its rows onto `result_cols` and score them against the ground truth set `gt`."""
    st = time.time()
    op.open()
    # Materialize each row as a tuple: a bare generator expression is hashed by identity,
    # so the set operations against `gt` would never find a match
    result = {tuple(row[col] for col in result_cols) for row in op}
    exec_time = time.time() - st

    tps, fns, fps = gt & result, gt - result, result - gt
    tp, fn, fp = len(tps), len(fns), len(fps)
    # Precision, recall and F1 are reported as rounded percentages
    precision = round((tp / (tp + fp) if (tp + fp) > 0 else 0) * 100.0)
    recall = round((tp / (tp + fn) if (tp + fn) > 0 else 0) * 100.0)
    f1_score = round((2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0)
    # Guard against an empty union (empty result and empty ground truth)
    union = gt | result
    jaccard_coefficient = tp / len(union) if union else 0
    print("False Positives:", "\n".join(map(lambda x: f"\t{x}", fps)), "", sep="\n")
    print("False Negatives:", "\n".join(map(lambda x: f"\t{x}", fns)), "", sep="\n")
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "jaccard_coefficient": jaccard_coefficient,
        "exec_time": exec_time
    }

In [None]:
# Load Models
mm = ModelMgr(config="../config.ini")
em = SentenceTransformerEmbeddingModel(mm)
sv = LLaMAValidationModel(mm, temperature=0.0001)
gm = LLaMATextGenerationModel(mm)

# Load Data
ops = load_dummy_operators()

## Test Case 1: Medicine

### 1.1: Find disease where symptoms match description (Semantic Matching)

$$ 
\sigma_{\text{symptoms}  \approx  \text{'<description>'} } (\bowtie_{\text{disease}}(\text{Diseases}, \gamma_{\text{disease}, STR\_AGG(\text{symptom}) \rightarrow \text{symptoms} } (\text{Symptoms}))) 
$$
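
The strict part of this plan, the aggregation $\gamma$ and the join $\bowtie$, can be sketched in plain Python as follows. The rows, the disease names and the `", "` separator are hypothetical and only serve the example; the actual relations come from `schema.json`, and the final soft selection against the description is not shown.

```python
from collections import defaultdict

# Hypothetical rows; the real relations live in schema.json
diseases = [{"disease": "Influenza"}, {"disease": "Migraine"}]
symptoms = [
    {"disease": "Influenza", "symptom": "high fever"},
    {"disease": "Influenza", "symptom": "cough"},
    {"disease": "Migraine", "symptom": "headache"},
]

# γ_{disease, STR_AGG(symptom) → symptoms}(Symptoms): concatenate symptoms per disease
grouped = defaultdict(list)
for row in symptoms:
    grouped[row["disease"]].append(row["symptom"])
aggregated = {disease: ", ".join(values) for disease, values in grouped.items()}

# ⋈_{disease}(Diseases, aggregated): one record per disease with its symptom string
joined = [
    {**d, "symptoms": aggregated[d["disease"]]}
    for d in diseases
    if d["disease"] in aggregated
]
```

Each record in `joined` then carries a single `symptoms` string, which is the value the soft selection would compare against the free-text description.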

### 1.2: Find disease where all symptoms occur (Spelling Variations: e.g. 'high fefer' $\approx$ 'high fever'; Synonyms & Conceptual Overlap: 'Appetite suppression' $\approx$ 'loss of appetite')

$$ 
    \sigma_{\text{number\_symptoms} \geq 3} (
        \gamma_{\text{disease}, \text{COUNT(symptom)} \rightarrow \text{number\_symptoms}} (\sigma_{\text{description} \approx \text{'<symptom 1>'} \lor \text{description} \approx \text{'<symptom 2>'} \lor \text{description} \approx \text{'<symptom 3>'}}(\text{Symptoms}))
    )
$$


In [None]:
# TODO

## Test Case 2: Chemistry
### 2.1: Find Organic Chemicals (Semantic Matching)
$$
\sigma_{\approx \text{'Is organic'}}(\text{Chemicals})
$$

In [None]:
# TODO

### 2.2: Find all warnings for the chemicals (Synonyms: 'Drinking Alcohol' $\approx$ 'C2H5OH' $\approx$ 'Ethanol')

$$
\bowtie_{\text{scientific\_name} \approx \text{chemical\_name}} (\text{Chemicals}, \gamma_{\text{chemical\_name}, STR\_AGG(\text{warning}) \rightarrow \text{warning\_list} } (\text{Chemical Warnings}))
$$

In [None]:
# Ground truth: (scientific_name, name) pairs the join should produce
gt2_2 = {
    ("H2O", "Water"), ("H2SO4", "Sulfuric Acid"), ("CH3OH", "Methanol"),
    ("HCl", "Hydrochloric Acid"), ("NH3", "Ammonia"), ("C2H5OH", "Ethanol"),
    ("C6H6", "Benzene"), ("Cl2", "Chlorine"), ("C3H6O", "Acetone"),
    ("NaOH", "Sodium Hydroxide"), ("C8H10N4O2", "Caffeine"),
}

# Baseline: strict equality join without embedding or validation model
op2_2_hard = InnerHashJoin(ops["chemicals"], ops["chemical_warnings"], Column("scientific_name"), Column("name"))
evaluate(op2_2_hard, ["scientific_name", "name"], gt2_2)

# Soft join: uses the embedding model `em` (threshold 0.3) and the validation model `sv`
op2_2_soft = InnerSoftJoin(
    ops["chemicals"], ops["chemical_warnings"],
    Column("scientific_name"), Column("name"),
    em=em, sv=sv, threshold=0.3, use_semantic_validation=True, sv_template="Is {scientific_name} the scientific name for {name}")

evaluate(op2_2_soft, ["scientific_name", "name"], gt2_2)

### 2.3: Find the chemical (Spelling Variations & Typos: 'Etanol' $\approx$ 'Ethanol'; Different Languages: 'Hydrochloric Acid' $\approx$ 'Salzsäure')
* $ \sigma_{\text{chemical\_name} \approx 'Wtaer'}(\text{Chemical Warnings}) $ $\rightarrow$ Water
* $ \sigma_{\text{chemical\_name} \approx 'Sulfric Aciid'}(\text{Chemical Warnings}) $ $\rightarrow$ Sulfuric Acid
* $ \sigma_{\text{chemical\_name} \approx 'Methonol'}(\text{Chemical Warnings}) $  $\rightarrow$ Methanol
* $ \sigma_{\text{chemical\_name} \approx 'Hydrocloric Acd'}(\text{Chemical Warnings}) $ $\rightarrow$ Hydrochloric Acid
* $ \sigma_{\text{chemical\_name} \approx 'Muratic Acd'}(\text{Chemical Warnings}) $ $\rightarrow$ Muriatic Acid
* $ \sigma_{\text{chemical\_name} \approx 'Amnoia'}(\text{Chemical Warnings}) $ $\rightarrow$ Ammonia
* $ \sigma_{\text{chemical\_name} \approx 'Ethonol'}(\text{Chemical Warnings}) $ $\rightarrow$ Ethanol
* $ \sigma_{\text{chemical\_name} \approx 'Driniking Alcohal'}(\text{Chemical Warnings}) $ $\rightarrow$ Drinking Alcohol
* $ \sigma_{\text{chemical\_name} \approx 'Benzne'}(\text{Chemical Warnings}) $ $\rightarrow$ Benzene
* $ \sigma_{\text{chemical\_name} \approx 'Clorine'}(\text{Chemical Warnings}) $ $\rightarrow$ Chlorine
* $ \sigma_{\text{chemical\_name} \approx 'Acetne'}(\text{Chemical Warnings}) $ $\rightarrow$ Acetone
* $ \sigma_{\text{chemical\_name} \approx 'Soduim Hydrxoide'}(\text{Chemical Warnings}) $ $\rightarrow$ Sodium Hydroxide
* $ \sigma_{\text{chemical\_name} \approx 'Caffiene'}(\text{Chemical Warnings}) $ $\rightarrow$ Caffeine
* $\sigma_{\text{chemical\_name} \approx 'Salzsäure'}(\text{Chemical Warnings})$ $\rightarrow$ Hydrochloric Acid
* $\sigma_{\text{chemical\_name} \approx '盐酸'}(\text{Chemical Warnings})$ $\rightarrow$ Hydrochloric Acid
* $\sigma_{\text{chemical\_name} \approx 'Acido cloridrico'}(\text{Chemical Warnings})$ $\rightarrow$ Hydrochloric Acid
* $\sigma_{\text{chemical\_name} \approx 'Соляная кислота'}(\text{Chemical Warnings})$ $\rightarrow$ Hydrochloric Acid
* $\sigma_{\text{chemical\_name} \approx 'Eau'}(\text{Chemical Warnings})$ $\rightarrow$ Water
* $\sigma_{\text{chemical\_name} \approx '水'}(\text{Chemical Warnings})$ $\rightarrow$ Water
* $\sigma_{\text{chemical\_name} \approx 'вода'}(\text{Chemical Warnings})$ $\rightarrow$ Water
* $\sigma_{\text{chemical\_name} \approx 'Wasser'}(\text{Chemical Warnings})$ $\rightarrow$ Water
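
The two criteria in this list behave very differently under purely lexical matching. As a rough baseline sketch (not the method used by the soft operators), the standard library's `difflib.get_close_matches` recovers a misspelled name but finds nothing for a translation. The `names` list is a hand-picked subset of the target names above.

```python
from difflib import get_close_matches

# Subset of the chemical names used as targets in the list above
names = ["Water", "Sulfuric Acid", "Methanol", "Hydrochloric Acid", "Ammonia", "Ethanol"]

# A typo stays lexically close to the intended name
typo_match = get_close_matches("Methonol", names, n=1)

# A translation shares almost no characters with the English name
translation_match = get_close_matches("Salzsäure", names, n=1)
```

With the default cutoff of 0.6, `typo_match` contains 'Methanol' while `translation_match` is empty, which is why the language queries need a semantic rather than a character-based comparison.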


### 2.4: Match element symbols (Abbreviations & Acronyms: 'H' $\approx$ 'Hydrogen', 'He' $\approx$ 'Helium')

$$
\bowtie_{\text{element} \approx \text{symbol}} (\text{Elements}, \text{ElementPhases})
$$

### 2.5: Find pH-neutral chemicals (Unit & Format Inconsistencies: pH = 7.0 $\approx$ 'Neutral')

$$ 
\sigma_{\text{ph}\approx\text{'Neutral'}} (\text{Chemicals})
$$
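
The mapping that this soft selection has to recover implicitly can be written down explicitly. The following sketch is only a reference for the expected semantics; the function name `ph_label` and the tolerance of 0.05 are assumptions for this example.

```python
def ph_label(ph: float, tolerance: float = 0.05) -> str:
    """Map a numeric pH value to a coarse textual category."""
    if abs(ph - 7.0) <= tolerance:
        return "Neutral"
    return "Acidic" if ph < 7.0 else "Basic"


# Example values: neutral, strongly acidic, strongly basic
labels = [ph_label(value) for value in (7.0, 1.0, 13.0)]
```

A strict selection `ph = 'Neutral'` cannot match a numeric column at all, so this is a case where only the soft binding can return results.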

In [None]:
# TODO

## Test Case 3: E-Commerce & Retail

### 3.1: Match companies from two datasets (Semantic Matching: {"name": "microsoft", "year_founded": 1975, "domain": "computer software" ... } $\approx$ {...})

## Test Case 4: Social Media & User-Generated Content

## Test Case 5: Geography
### 5.2: Spelling Variations & Typos, 5.3: Synonyms & Conceptual Overlap

### 5.5: Join two country datasets (Abbreviations & Acronyms, Different Languages: 'AT' $\approx$ 'Österreich', 'EE' $\approx$ 'Estonia')


In [None]:
# TODO

## Test Case 6: Entertainment & Media (Movies/Music)
### 6.1: Search movies based on a disjunction of release criteria (Unit & Format Inconsistencies: 'Before 2000' $\approx$ '1999')

In [None]:
gt6_1 = {("Pirates of the Caribbean: Dead Man's Chest", ), ("Charlie and the Chocolate Factory", ), ("Inception", ), ("The Matrix", )}

In [None]:
crit1 = SoftEqual(Column("release"), Constant("2006"), em=em, threshold=0.5)
crit2 = SoftEqual(Column("release"), Constant("July"), em=em, threshold=0.5)
crit3 = SoftEqual(Column("release"), Constant("Before 2000"), em=em, threshold=0.5)
op6_1 = Select(ops["movies"], DisjunctiveCriteria([crit1, crit2, crit3]))
evaluate(op6_1, ["name"], gt6_1)

In [None]:
# op6_1 = Select(ops["movies"], SoftValidate("Is {release} in 2006 or is {release} in July or is {release} before 2000?", sv=sv, full_record=False))
# crit11 = SoftValidate("Is {release} in 2006?", sv=sv, full_record=False)
# crit22 = SoftValidate("Is {release} in July?", sv=sv, full_record=False)
# crit33 = SoftValidate("Is {release} before 2000?", sv=sv, full_record=False)
# op = Select(ops["movies"], DisjunctiveCriteria([crit11, crit22, crit33]))

# evaluate(op, ["name"], gt6_1)

### 6.2: Match English and German movies (Entity Matching, Different Languages)

In [None]:
gt6_2 = {
    ('The Lord of the Rings: The Fellowship of the Ring', 'Der Herr der Ringe: Die Gefährten'),
    ("Pirates of the Caribbean: Dead Man's Chest", 'Pirates of the Caribbean – Fluch der Karibik 2'),
    ('The Lord of the Rings: The Return of the King', 'Der Herr der Ringe: Die Rückkehr des Königs'),
    ('Charlie and the Chocolate Factory', 'Charlie und die Schokoladenfabrik'),
    ('Inception', 'Inception'),
    ('The Matrix', 'Matrix')
}

In [None]:
# op6_2 = {(x["movies.name"], x["movies_de.name"]) for x in InnerSoftJoin(ops["movies"], ops["movies_de"], Column("movies.name"), Column("movies_de.name"), em=em, threshold=0.4)}
op6_2 = InnerSoftJoin(ops["movies"], ops["movies_de"], None, None, em=em, threshold=0.6)
evaluate(op6_2, ["movies.name", "movies_de.name"], gt6_2)

### 6.3: Find actors (Aliases: 'Orlando Bloom' $\approx$ '@orlandobloom'; Synonyms & Conceptual Overlap: 'Leonardo DiCaprio' $\approx$ 'Jack Dawson in Titanic')

In [None]:
gt6_4 =  {
  ('Carrie-Anne Moss', 'Carrie-Anne Moss'),
  ('Elijah Wood', 'Elijah Jordan Wood'),
  ('Elijah Wood', 'Elijah Wood'),
  ('Elliot Page', 'Elliot Page'),
  ('Freddie Highmore', 'Alfred Highmore'),
  ('Ian McKellen', 'Sir Ian Murray McKellen'),
  ('Johnny Depp', 'John Christopher "Johnny" Depp II'),
  ('Johnny Depp', 'John Christopher Depp II'),
  ('Johnny Depp', 'The Mad Hatter in Alice in Wonderland Actor'),
  ('Joseph Gordon-Levitt', 'Joseph Gordon-Levitt'),
  ('Keanu Reeves', 'Keanu Charles Reeves'),
  ('Keira Knightley', 'Keira Christina Knightley'),
  ('Keira Knightley', 'Keira Knightley'),
  ('Laurence Fishburne', 'Laurence Fishburne'),
  ('Leonardo DiCaprio', 'Jack Dawson in Titanic'),
  ('Orlando Bloom', '@orlandobloom'),
  ('Orlando Bloom', 'Orlando Bloom'),
  ('Orlando Bloom', 'Orlando Jonathan Blanchard Copeland Bloom'),
  ('Viggo Mortensen', 'Viggo Mortensen')
}

In [None]:
op6_4 = InnerSoftJoin(ops["actors"], ops["plays_in"], Column("name"), Column("actor/actress"), threshold=0.7, em=em)
evaluate(op6_4, ["name", "actor/actress"], gt6_4)

## Test Unit & Format Inconsistencies: '06/24/2003' $\approx$ '24.06.2003', '6/24/2005' $>$ '24.06.2003', '2.2 lbs' $\approx$ '1 kg'

* $ \sigma_{date \approx '06/24/2003'} (Orders) $
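
For the weight example in the heading, '2.2 lbs' and '1 kg' are only approximately equal, so even after unit conversion a strict comparison fails. A minimal sketch of this, assuming a simple `"<value> <unit>"` string layout and a hypothetical helper `to_kg`:

```python
import math

KG_PER_LB = 0.45359237  # exact definition of the international avoirdupois pound

# Conversion factors to kilograms; the supported units are an assumption for the example
TO_KG = {"kg": 1.0, "lbs": KG_PER_LB, "g": 0.001}


def to_kg(text: str) -> float:
    """Parse strings such as '2.2 lbs' into a weight in kilograms."""
    value, unit = text.split()
    return float(value) * TO_KG[unit]


# '2.2 lbs' is only approximately '1 kg', so compare with a relative tolerance
approx_equal = math.isclose(to_kg("2.2 lbs"), to_kg("1 kg"), rel_tol=0.01)
strictly_equal = to_kg("2.2 lbs") == to_kg("1 kg")
```

The 1 % tolerance is an arbitrary choice for the sketch; in the soft query system the corresponding knob is the similarity threshold.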


# Further Experiments
## Soft Aggregation

In [None]:
x = SoftAggregateFaissKMeans(ops["movies"], ["name"], [StringAggregation("name", "movies")], em=em, num_clusters=5)
print([a for a in x])

In [None]:
x = SoftAggregateScikit(ops["movies"], ["name"], [StringAggregation("name", "movies")], em=em, cluster_class=DBSCAN, cluster_params={"eps":3, "min_samples": 2})
print([a for a in x])

In [None]:
x = SoftAggregateScikit(ops["movies"], ["name"], [CountAggregation("name", "movies")], em=em, cluster_class=SpectralClustering, cluster_params={"n_clusters": 5, "assign_labels": "discretize", "random_state": 0})
print([a for a in x])