# Evaluation

This notebook is used to evaluate the Soft Query Evaluation System. 


To evaluate the soft query evaluation system we create queries, encoded as evaluation plans, and execute them on a [hand-crafted schema](https://github.com/HackerBschor/SofteningQueryEvaluation/blob/main/evaluation/schema.json).
The queries result set $R$ is then compared against the ground truth set $G$ to determine the true positives $TPs = R \cap G$, the false positives $FPs = R \setminus G$ and the false negatives $FNs = G \setminus R$. Using these sets, we can calculate the following metrics:
1) Precision: $\frac{|TP|}{|TP| + |FP|}$
2) Recall: $\frac{|TP|}{|TP| + |FN|}$
3) F1: $\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
4) Jaccard-Coefficient: $\frac{|TP|}{|G \cup R|}$

Furthermore, we collect the runtime, to compare it with the execution time of a query that replaces all soft bindings with strict bindings and therefore, doesn't utilize LLMs.

### Evaluation Schema

The schema is created to maximize **Cross-Domain Generalization**, in order to simulate a real world application of the software.
Therefore, we include data from different domains such as:
1) Medicine
2) Chemistry
3) E-Commerce & Retail
4) Social Media & User-Generated Content
5) Geography
6) Entertainment & Media (Movies/ Music)
7) Sports & Gaming


### Test Criteria
We identified 8 different test criteria, where the usage of soft bindings can benefit the final result. We try to involve all test criteria in every domain to maximize the significance for every domain.
We define the test criteria as follows:
1) Semantic Matching ('Movie about toys that come to life' $\rightarrow$ Record of 'Toy Story')
2) Spelling Variations & Typos ('Aple' $\rightarrow$ 'Apple', 'Neighbour' $\rightarrow$ 'Neighbor')
3) Synonyms & Conceptual Overlap ('High Blood Pressure' $\approx$ 'Hypertension', 'Laptop' $\approx$ 'Notebook')
4) Aliases ('Lady Gaga' $\approx$ 'Stefani Joanne Angelina Germanotta', 'J. Smith' $\approx$ 'Jane Smith')
    5) Abbreviations & Acronyms ('NYC' $\approx$ 'New York City', 'Dr' $\approx$ 'Doctor', 'AI' $\approx$ 'Artificial Intelligence')
6) Different Languages (Apple $\approx$ Apfel $\approx$ Mela)
7) Unit & Format Inconsistencies ('2007-06-29' $\approx$ '06/29/2007' $\approx$ 'June 29, 2007' $\approx$ '29.06.2007' $\approx$ '29. Juni 2007', '2.2 lbs' $\approx$ '1 kg', '1M' $\approx$ '1.000.000' $\approx$ 1000000)

The defined test criteria mostly tests for semantic equality $\approx$. However, a database also offers operators like $<$, $>$, $!=$.
So, if applicable, we also include range queries and queries for semantic inequality. These range queries are mostly affect the criteria `Unit & Format Inconsistencies`.





### Import & Initialize: Models, Data and Functions

In [1]:
import json
import time

from sklearn.cluster import DBSCAN, SpectralClustering

from db.criteria import *
from db.operators import *
from db.operators.Aggregate import *
from db.operators.Project import *
from db.structure import *
from db.structure import Constant

from models import ModelMgr
from models.embedding import SentenceTransformerEmbeddingModel
from models.semantic_validation import LLaMAValidationModel
from models.text_generation.LLaMA import LLaMATextGenerationModel

# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

In [2]:
def load_dummy_operators():
    with open("../evaluation/schema.json", "r", encoding="UTF-8") as schema:
        schema = json.load(schema)

    operators = {}
    for relation, data in schema.items():
        operators[relation] = Dummy(relation, [x["name"] for x in data["schema"]], data["data"])
    return operators

# Load Data
ops = load_dummy_operators()

In [3]:
def evaluate(op, result_cols, gt):
    st = time.time()
    op.open()
    result = {tuple(row[col] for col in result_cols) for row in op}
    exec_time = f"{round((time.time() - st) * 1000)}ms"
    
    tps, fns, fps = gt & result, gt - result, result - gt
    tp, fn, fp = len(tps), len(fns), len(fps)
    precision = round((tp / (tp + fp) if (tp + fp) > 0 else 0) * 100.0)
    recall = round((tp / (tp + fn) if (tp + fn) > 0 else 0) * 100.0)
    f1_score = round(((2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0))
    jaccard_coefficient = round(len(tps) / len(gt | result) if len(gt | result) > 0 else 0 * 100.0)
    print("False Positives:", "\n".join(map(lambda x: f"\t{x}", fps)), "", sep="\n")
    print("False Negatives:", "\n".join(map(lambda x: f"\t{x}", fns)), "", sep="\n")
    return {
        "precision": precision, 
        "recall": recall, 
        "f1_score": f1_score, 
        "jaccard_coefficient": jaccard_coefficient,
        "exec_time": exec_time
    }


In [4]:
# Load Models
mm = ModelMgr(config="../config.ini")
em = SentenceTransformerEmbeddingModel(mm)
sv = LLaMAValidationModel(mm, temperature=0.0001)
gm = LLaMATextGenerationModel(mm)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Test Case 0: General Tests
### 0.1: Language Detection

$$
\sigma_{\text{detected\_language} \approx \text{'Japanese'} \lor \text{detected\_language} \approx \text{'Dutch'}} (\pi_{\text{id}, \mathcal{T}(\text{"What Langauge is this: '<text>'?"})\rightarrow \text{detected\_language}} (\text{language\_detection}))
$$

In [5]:
gt_0_1 = {(2, ), (3, ), (5, ), (9, ), (10, ), (17, ), (19, ), (20, ), (21, ), (29, )}

prompt_template = "What Langauge is this: '{}'?"
system_prompt = "For a given text, return the language of the sentence. Answer with the language only!"
languange_detection_mapping = TextGeneration(["text"], "detected_language", tgm=gm, prompt_template=prompt_template, system_prompt=system_prompt)

op0_1 = Project(ops["language_detection"], ["id", languange_detection_mapping], em=em)
op0_1 = Select(op0_1, DisjunctiveCriteria([
    SoftEqual(Column("detected_language"), Constant("Japanese"), em=em, threshold=0.8),
    SoftEqual(Column("detected_language"), Constant("Dutch"), em=em, threshold=0.8),
]))

evaluate(op0_1, ["id"], gt_0_1)

False Positives:
	(24,)
	(25,)
	(22,)

False Negatives:
	(29,)



{'precision': 75,
 'recall': 90,
 'f1_score': 82,
 'jaccard_coefficient': 1,
 'exec_time': '42033ms'}

## Test Case : Medicine

### Find disease where symptoms match human description (Semantic Matching)

[Dataset: https://www.kaggle.com/datasets/niyarrbarman/symptom2disease](https://www.kaggle.com/datasets/niyarrbarman/symptom2disease)

$$ 
\sigma_{\text{symptoms}  \approx  \text{'<description>'} } (\bowtie_{\text{disease}}(\text{Diseases}, \gamma_{\text{disease}, STR\_AGG(\text{symptom}) \rightarrow \text{symptoms} } (\text{Symptoms}))) 
$$



In [41]:
gt1_1 = {
    ("Psoriasis", "There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them."),
    ("Varicose Veins", "The veins in my legs are causing discomfort and difficulty sleeping at night. I have no idea why it is happening. I get cramps when I sprint."),
    ("Typhoid", "I have had some constipation and belly pain, which has been really uncomfortable. The pain has been getting worse and it's really affecting my daily life."),
    ("Chicken pox", "I have a high fever and a mild headache. I'm tired most of the time and completely lost my appetite."),
    ("Impetigo", "I have a high fever and am really weak. My face has gotten sores. The blisters are itchy and painful. A yellow ooze frequently leaks from the wounds."),
    ("Dengue", "I am experiencing very high fever and chills every night. It is really concerning me. Moreover, I don’t feel like eating anything and my back, arms, legs pain a lot. There is a strange pain behing my eyes. I can’t do any physical activities. "),
    ("Fungal infection", "I've been really itchy lately and there are these rashy spots all over my skin. There are also some areas that don't look like the normal color of my skin, and I've had some bumps that are kind of hard"),
    ("Common Cold", "I've been sneezing a lot and feeling really tired and sick. There's also a lot of gross stuff coming out of my nose and my throat feels really scratchy. And my neck feels swollen too."),
    ("Pneumonia", "I've been coughing a lot recently with chest pain and feeling incredibly chilly and exhausted. Additionally, my heart is thumping rapidly, and the phlegm I cough up has a reddish hue."),
    ("Dimorphic hemorrhoids(piles)", "I've been constipated and it's really hard to go to the bathroom. When I do go, it hurts and my stool has been bloody. I've also been having some pain in my butt and my anus has been really itchy and irritated."),
    ("Arthritis", "Lately, I've been having stiffness and weakness in my neck muscles. Since my joints have matured, it is difficult for the American state to operate without becoming stiff. Furthermore, walking has been quite painful."),
    ("Acne", "My skin has just acquired a nasty rash. It's full of pus-filled pimples and blackheads. My skin has been really sensitive as well."),
    ("Bronchial Asthma", "Recently, I have had a number of concerning symptoms, including a dry cough, impaired breathing, a high fever, and a lot of mucus. I also feel very weak and tired."),
    ("Hypertension", "I have been experiencing symptoms such as a headache, chest pain, dizziness, loss of balance, and difficulty focusing."),
    ("Migraine", "I have been struggling with digestive issues, including acid reflux and indigestion, as well as regular headaches and impaired vision, excessive hunger, a sore neck, depression, irritability, and visual disturbance"),
    ("Cervical spondylosis", "My muscles have been quite weak, and I've been coughing up phlegm along with significant back discomfort. In addition to feeling weak and disoriented, my neck has been hurting."),
    ("Jaundice", "I have been suffering from itching, vomiting, and fatigue. I have also lost weight and have a high fever. My skin has turned yellow and my urine is dark. I have also been experiencing abdominal pain"),
    ("Malaria", "I've experienced severe itching, chills, nausea, and a high fever. Besides having a headache, I'm also perspiring a lot. I've been terrible with nausea and muscle ache."),
    ("allergy", "I'm dizzy, nauseated, and shaky. I'm having trouble breathing since my throat is swollen. On occasion, throughout the night, my chest hurts and I feel sick."),
    ("Drug Reaction", "When I have a fever, I feel quite woozy and lightheaded. My heart is pounding, and my head is absolutely foggy. My ability to think properly is impaired, and everything appears to be somewhat blurry.")
}

dummy = Dummy("human_descriptions", ["human_description"], [(x[1], ) for x in gt1_1])
op1_1_hag = HashAggregate(ops["diseases_symptom"], ["disease"], [StringAggregation("symptom", "symptom")])
op1_1 = InnerSoftJoin(op1_1_hag, dummy, None, None, em=em, threshold=0.4)
evaluate(op1_1, ["disease", "human_description"], gt1_1)

False Positives:
	('Paralysis (brain hemorrhage)', "I've been sneezing a lot and feeling really tired and sick. There's also a lot of gross stuff coming out of my nose and my throat feels really scratchy. And my neck feels swollen too.")
	('Chronic cholestasis', 'I am experiencing very high fever and chills every night. It is really concerning me. Moreover, I don’t feel like eating anything and my back, arms, legs pain a lot. There is a strange pain behing my eyes. I can’t do any physical activities. ')
	('Heart attack', 'When I have a fever, I feel quite woozy and lightheaded. My heart is pounding, and my head is absolutely foggy. My ability to think properly is impaired, and everything appears to be somewhat blurry.')
	('Typhoid', "I've been coughing a lot recently with chest pain and feeling incredibly chilly and exhausted. Additionally, my heart is thumping rapidly, and the phlegm I cough up has a reddish hue.")
	('Fungal infection', 'There is a silver like dusting on my skin, espe

{'precision': 3,
 'recall': 80,
 'f1_score': 6,
 'jaccard_coefficient': 0,
 'exec_time': '231ms'}

### 1.2: Find disease where all 3 symptoms occur (Spelling Variations: e.g. 'high fefer' $\approx$ 'high fever'; Synonyms & Conceptual Overlap: 'Appetite suppression' $\approx$ 'loss of appetite')

$$
    \sigma_{\text{number\_symptoms} = 3} (
        \gamma_{\text{disease}, \text{COUNT(Symptom)} \rightarrow \text{number\_symptoms}} (\sigma_{\text{description} \approx \text{<symptom 2>} \lor \text{description} \approx \text{<symptom 2>} \lor \text{description} \approx \text{<symptom 3>}}(\text{Symptoms}))
    )
$$


In [22]:
gt1_2 = {
    (1, "Tuberculosis", "Appetitlosigkeit", "Gelbe Augen", "Kurzatmigkeit"), # German
    (2, "Tuberculosis", "perte d'appétit", "jaunissement des yeux", "essoufflement"), # French
    (3, "Tuberculosis", "Appetite suppression", "yellow eyes", "cant breath properly"), # English derivative
    (4, "Tuberculosis", "食欲不振", "眼睛发黄", "呼吸困难"), # Chinese (simplified)
    (5, "Hyperthyroidism", "sbalzi d'umore", "sudorazione", "fame eccessiva"), # Italian
    (6, "Hyperthyroidism", "перепады настроения", "потливость", "чрезмерное чувство голода"), # Russian
    (7, "Hyperthyroidism", "manic depression", "sweeatng", "extreme cravings for food"), # English derivative
    (8, "Hyperthyroidism", "I get angry fast", "I swet a lot", "I eat much more") # English derivative
}

dummy = Dummy("symptoms", ["test_no", "symptom1", "symptom2", "symptom3"], [(x[0], x[2], x[3], x[4]) for x in gt1_2])
op1_2 = Select(HashAggregate(
    Join(dummy, ops["diseases_symptom"],
    DisjunctiveCriteria([
        SoftValidate("Can {symptom} be described with {symptom1}", sv=sv, full_record=False),
        SoftValidate("Can {symptom} be described with {symptom2}", sv=sv, full_record=False),
        SoftValidate("Can {symptom} be described with {symptom3}", sv=sv, full_record=False),
    ])),
    ["test_no", "disease"],
    [CountAggregation("symptom", "number_symptoms")]),
    HardEqual(Column("number_symptoms"), Constant(3))
)


evaluate(op1_2, ["test_no", "disease"], {(x[0], x[1]) for x in gt1_2})

False Positives:
	(2, 'Hypoglycemia')
	(3, 'Urinary tract infection')
	(3, 'AIDS')
	(7, 'Varicose veins')
	(2, 'Hepatitis C')
	(1, 'Typhoid')
	(7, 'GERD')
	(2, 'Bronchial Asthma')
	(7, 'Paralysis (brain hemorrhage)')
	(6, '(vertigo) Paroymsal  Positional Vertigo')
	(1, 'Pneumonia')
	(2, 'Chronic cholestasis')
	(2, 'Hepatitis D')
	(3, 'Acne')
	(1, 'Peptic ulcer diseae')
	(2, 'hepatitis A')

False Negatives:
	(4, 'Tuberculosis')
	(5, 'Hyperthyroidism')
	(6, 'Hyperthyroidism')
	(2, 'Tuberculosis')
	(3, 'Tuberculosis')
	(1, 'Tuberculosis')
	(7, 'Hyperthyroidism')



{'precision': 6,
 'recall': 12,
 'f1_score': 8,
 'jaccard_coefficient': 0,
 'exec_time': '195458ms'}

### 2.3: Find normal/ increased blood pressure (Unit & Format Inconsistencies)

$$
\sigma_{\checkmark(\text{Systolic Blood Pressure: \{Systolic Blood Pressure\}, Diastolic Blood Pressure: \{Diastolic Blood Pressure\} is increased blood pressure?})}(\text{VitalSigns})
$$

$$
\sigma_{\checkmark(\text{Systolic Blood Pressure: \{Systolic Blood Pressure\}, Diastolic Blood Pressure: \{Diastolic Blood Pressure\} is normal blood pressure?})}(\text{VitalSigns})
$$

$$
\sigma_{\text{Systolic Blood Pressure,Diastolic Blood Pressure} \approx \text{'High Blood Pressure'}}(\text{VitalSigns})
$$

$$
\sigma_{\text{Systolic Blood Pressure,Diastolic Blood Pressure} \approx \text{'Normal Blood Pressure'}}(\text{VitalSigns})
$$

In [75]:
gt1_3_1 = {(194910, ), (14050, ), (96997, ), (39558, ), (20800, ),}

op_1_3_1 = Select(ops["human_vital_sign"], SoftValidate("Systolic Blood Pressure: {Systolic Blood Pressure}, Diastolic Blood Pressure: {Diastolic Blood Pressure} is increased blood pressure? ", sv=sv, full_record=False),)

evaluate(op_1_3_1, ["Patient ID"], gt1_3_1)

False Positives:
	(67833,)

False Negatives:




{'precision': 83,
 'recall': 100,
 'f1_score': 91,
 'jaccard_coefficient': 1,
 'exec_time': '269ms'}

In [73]:
gt1_3_2 = {(151348, ), (96291, ), (29566, ), (67833, ), (192523, ),}

op_1_3_2 = Select(ops["human_vital_sign"], SoftValidate("Systolic Blood Pressure: {Systolic Blood Pressure}, Diastolic Blood Pressure: {Diastolic Blood Pressure} is normal blood pressure? ", sv=sv, full_record=False),)

evaluate(op_1_3_1, ["Patient ID"], gt1_3_1)

False Positives:
	(67833,)

False Negatives:




{'precision': 83,
 'recall': 100,
 'f1_score': 91,
 'jaccard_coefficient': 1,
 'exec_time': '277ms'}

In [125]:
op_1_3_3 = Select(ops["human_vital_sign"], SoftEqual(["Systolic Blood Pressure", "Diastolic Blood Pressure"], Constant("High Blood Pressure"), em=em, threshold=0.4))
evaluate(op_1_3_3, ["Patient ID"], gt1_3_1)

False Positives:
	(96291,)
	(151348,)
	(192523,)
	(67833,)
	(29566,)

False Negatives:




{'precision': 50,
 'recall': 100,
 'f1_score': 67,
 'jaccard_coefficient': 0,
 'exec_time': '63ms'}

In [126]:
op_1_3_4 = Select(ops["human_vital_sign"], SoftEqual(["Systolic Blood Pressure", "Diastolic Blood Pressure"], Constant("Normal Blood Pressure"), em=em, threshold=0.4))
evaluate(op_1_3_4, ["Patient ID"], gt1_3_2)

False Positives:
	(39558,)
	(96997,)
	(20800,)
	(14050,)
	(194910,)

False Negatives:




{'precision': 50,
 'recall': 100,
 'f1_score': 67,
 'jaccard_coefficient': 0,
 'exec_time': '41ms'}

## Test Case 2: Chemistry
### 2.1: Warnings for Organic Chemicals (Semantic Matching, Synonyms: 'Drinking Alcohol' $\approx$ 'CH3OH' $\approx$ 'Ethanol')

$$
\bowtie_{\text{scientific\_name} \approx \text{name}} (\sigma_{\checkmark \text{'Is organic'}}(\text{Chemicals}), \text{Chemical Warnings})
$$

In [141]:
gt2_1 = { ("CH3OH", "Methanol"), ("C2H5OH", "Ethanol"),  ("C2H5OH", "Drinking Alcohol"), ("C6H6", "Benzene"), ("C3H6O", "Acetone"), ("C8H10N4O2", "Caffeine") }

op2_1 = InnerSoftJoin(
    Select(ops["chemicals"], SoftValidate("Is this chemical {scientific_name} organic?", sv=sv, full_record=False)), ops["chemical_warnings"],
    Column("scientific_name"), Column("name"),
    em=em, sv=sv, threshold=0.3, use_semantic_validation=True, sv_template="Is {scientific_name} the scientific name for {name}")

evaluate(op2_1, ["scientific_name", "name"], gt2_1)

False Positives:
	('CH3OH', 'Ethanol')
	('C2H5OH', 'Methanol')
	('C3H6O', 'Methanol')
	('NH3', 'Ammonia')

False Negatives:
	('C2H5OH', 'Drinking Alcohol')
	('C8H10N4O2', 'Caffeine')



{'precision': 50,
 'recall': 67,
 'f1_score': 57,
 'jaccard_coefficient': 0,
 'exec_time': '41290ms'}

### 2.2: Find the chemical (Spelling Variations & Typos: 'Etanol' $\approx$ 'Ethanol'; Languages: 'Hydrochloric Acid' $\approx$ 'Salzsäure')
* $ \sigma_{\text{chemical\_name} \approx 'Wtaer'}(\text{Chemical Warnings}) $ $\rightarrow$ Water
* $ \sigma_{\text{chemical\_name} \approx 'Sulfric Aciid'}(\text{Chemical Warnings}) $ $\rightarrow$ Sulfuric Acid
* $ \sigma_{\text{chemical\_name} \approx 'Methonol'}(\text{Chemical Warnings}) $  $\rightarrow$ Methanol
* $ \sigma_{\text{chemical\_name} \approx 'Hydrocloric Acd'}(\text{Chemical Warnings}) $ $\rightarrow$ Hydrochloric Acid
* $ \sigma_{\text{chemical\_name} \approx 'Muratic Acd'}(\text{Chemical Warnings}) $ $\rightarrow $ Muriatic Acid
* $ \sigma_{\text{chemical\_name} \approx 'Amnoia'}(\text{Chemical Warnings}) $ $\rightarrow$ Ammonia
* $ \sigma_{\text{chemical\_name} \approx 'Ethonol'}(\text{Chemical Warnings}) $ $\rightarrow$ Ethanol
* $ \sigma_{\text{chemical\_name} \approx 'Driniking Alcohal'}(\text{Chemical Warnings}) $ $\rightarrow$ Drinking Alcohol
* $ \sigma_{\text{chemical\_name} \approx 'Benzne'}(\text{Chemical Warnings}) $ $\rightarrow$ Benzene
* $ \sigma_{\text{chemical\_name} \approx 'Clorine'}(\text{Chemical Warnings}) $ $\rightarrow$ Chlorine
* $ \sigma_{\text{chemical\_name} \approx 'Acetne'}(\text{Chemical Warnings}) $ $\rightarrow$ Acetone
* $ \sigma_{\text{chemical\_name} \approx 'Soduim Hydrxoide'}(\text{Chemical Warnings}) $ $\rightarrow$ Sodium Hydroxide
* $ \sigma_{\text{chemical\_name} \approx 'Caffiene'}(\text{Chemical Warnings}) $ \rightarrow Caffeine
* $\sigma_{\text{chemical\_name} \approx 'Salzsäure'}(\text{Chemical Warnings})$ $\rightarrow$ Hydrochloric Acid
* $\sigma_{\text{chemical\_name} \approx '盐酸'}(\text{Chemical Warnings})$ $\rightarrow$ Hydrochloric Acid
* $\sigma_{\text{chemical\_name} \approx 'Acido cloridrico'}(\text{Chemical Warnings})$ $\rightarrow$ Hydrochloric Acid
* $\sigma_{\text{chemical\_name} \approx 'Соляная кислота'}(\text{Chemical Warnings})$ $\rightarrow$ Hydrochloric Acid
* $\sigma_{\text{chemical\_name} \approx 'Eau'}(\text{Chemical Warnings})$ $\rightarrow$ Water
* $\sigma_{\text{chemical\_name} \approx '水'}(\text{Chemical Warnings})$ $\rightarrow$ Water
* $\sigma_{\text{chemical\_name} \approx 'вода'}(\text{Chemical Warnings})$ $\rightarrow$ Water
* $\sigma_{\text{chemical\_name} \approx 'Wasser'}(\text{Chemical Warnings})$ $\rightarrow$ Water
* $\sigma_{\text{chemical\_name} \approx 'Eau'}(\text{Chemical Warnings})$ $\rightarrow$ Water

In [159]:
data = [
    ("Wtaer", "Water"),
    ("Sulfric Aciid", "Sulfuric Acid"),
    ("Methonol", "Methanol"),
    ("Hydrocloric Acd", "Hydrochloric Acid"),
    ("Muratic Acd", "Muriatic Acid"),
    ("Amnoia", "Ammonia"),
    ("Ethonol", "Ethanol"),
    ("Driniking Alcohal", "Drinking Alcohol"),
    ("Driniking Alcohal", "Ethanol"),
    ("Benzne", "Benzene"),
    ("Clorine", "Chlorine"),
    ("Acetne", "Acetone"),
    ("Soduim Hydrxoide", "Sodium Hydroxide"),
    ("Caffiene", "Caffeine"),
    ("Salzsäure", "Hydrochloric Acid"), #German
    ("盐酸", "Hydrochloric Acid"), # Chinese
    ("Acido cloridrico", "Hydrochloric Acid"), # Italian
    ("Соляная кислота", "Hydrochloric Acid"), # Russian
    ("Eau", "Water"), # French
    ("水", "Water"),
    ("вода", "Water"),
    ("Wasser", "Water")
]

dummy = Dummy("dummy", ["test_no", "chemical"], [(i, x[0]) for i, x in enumerate(data)])
op2_2 = InnerSoftJoin(ops["chemical_warnings"], dummy, Column("name"), Column("chemical"), em=em, threshold=.3) # sv=sv, use_semantic_validation=True, sv_template="Does {name} describe {chemical}"
evaluate(op2_2, ["test_no", "chemical", "name"], {(i, x[0], x[1]) for i, x in enumerate(data)})

False Positives:
	(6, 'Ethonol', 'Sodium Hydroxide')
	(6, 'Ethonol', 'Ammonia')
	(9, 'Benzne', 'Muriatic Acid')
	(7, 'Driniking Alcohal', 'Ammonia')
	(10, 'Clorine', 'Hydrochloric Acid')
	(9, 'Benzne', 'Sulfuric Acid')
	(8, 'Driniking Alcohal', 'Benzene')
	(10, 'Clorine', 'Caffeine')
	(3, 'Hydrocloric Acd', 'Benzene')
	(10, 'Clorine', 'Benzene')
	(9, 'Benzne', 'Methanol')
	(1, 'Sulfric Aciid', 'Chlorine')
	(11, 'Acetne', 'Methanol')
	(7, 'Driniking Alcohal', 'Ethanol')
	(16, 'Acido cloridrico', 'Sulfuric Acid')
	(16, 'Acido cloridrico', 'Sodium Hydroxide')
	(1, 'Sulfric Aciid', 'Acetone')
	(2, 'Methonol', 'Ammonia')
	(6, 'Ethonol', 'Chlorine')
	(2, 'Methonol', 'Sodium Hydroxide')
	(7, 'Driniking Alcohal', 'Chlorine')
	(10, 'Clorine', 'Water')
	(9, 'Benzne', 'Hydrochloric Acid')
	(19, '水', 'Sodium Hydroxide')
	(19, '水', 'Ammonia')
	(1, 'Sulfric Aciid', 'Muriatic Acid')
	(9, 'Benzne', 'Caffeine')
	(6, 'Ethonol', 'Acetone')
	(8, 'Driniking Alcohal', 'Sodium Hydroxide')
	(11, 'Acetne', 'Be

{'precision': 11,
 'recall': 59,
 'f1_score': 19,
 'jaccard_coefficient': 0,
 'exec_time': '114ms'}


### 2.3: Find Acronyms (Acronyms: H $\approx$ Hydrogen, He $\approx$ Helium)

$$
\bowtie_{\text{element} \approx \text{symbol}} (\text{Elements}, \text{ElementPhases})
$$


In [170]:
gt2_3 = {("Hydrogen", "H"), ("Helium", "He"), ("Lithium", "Li"), ("Beryllium", "Be"), ("Boron", "B"), ("Carbon", "C"), ("Nitrogen", "N"), ("Oxygen", "O"), ("Fluorine", "F"), ("Neon", "Ne"), ("Sodium", "Na"), ("Magnesium", "Mg"), ("Aluminum", "Al"), ("Silicon", "Si"), ("Phosphorus", "P"), ("Sulfur", "S"), ("Chlorine", "Cl"), ("Argon", "Ar"), ("Potassium", "K"), ("Calcium", "Ca"), ("Scandium", "Sc"), ("Titanium", "Ti"), ("Vanadium", "V"), ("Chromium", "Cr"), ("Manganese", "Mn"), ("Iron", "Fe"), ("Cobalt", "Co"), ("Nickel", "Ni"), ("Copper", "Cu"), ("Zinc", "Zn"), ("Gallium", "Ga"), ("Germanium", "Ge"), ("Arsenic", "As"), ("Selenium", "Se"), ("Bromine", "Br"), ("Krypton", "Kr"), ("Rubidium", "Rb"), ("Strontium", "Sr"), ("Yttrium", "Y"), ("Zirconium", "Zr"), ("Niobium", "Nb"), ("Molybdenum", "Mo"), ("Technetium", "Tc"), ("Ruthenium", "Ru"), ("Rhodium", "Rh"), ("Palladium", "Pd"), ("Silver", "Ag"), ("Cadmium", "Cd"), ("Indium", "In"), ("Tin", "Sn"), ("Antimony", "Sb"), ("Tellurium", "Te"), ("Iodine", "I"), ("Xenon", "Xe"), ("Cesium", "Cs"), ("Barium", "Ba"), ("Lanthanum", "La"), ("Cerium", "Ce"), ("Praseodymium", "Pr"), ("Neodymium", "Nd"), ("Promethium", "Pm"), ("Samarium", "Sm"), ("Europium", "Eu"), ("Gadolinium", "Gd"), ("Terbium", "Tb"), ("Dysprosium", "Dy"), ("Holmium", "Ho"), ("Erbium", "Er"), ("Thulium", "Tm"), ("Ytterbium", "Yb"), ("Lutetium", "Lu"), ("Hafnium", "Hf"), ("Tantalum", "Ta"), ("Wolfram", "W"), ("Rhenium", "Re"), ("Osmium", "Os"), ("Iridium", "Ir"), ("Platinum", "Pt"), ("Gold", "Au"), ("Mercury", "Hg"), ("Thallium", "Tl"), ("Lead", "Pb"), ("Bismuth", "Bi"), ("Polonium", "Po"), ("Astatine", "At"), ("Radon", "Rn"), ("Francium", "Fr"), ("Radium", "Ra"), ("Actinium", "Ac"), ("Thorium", "Th"), ("Protactinium", "Pa"), ("Uranium", "U"), ("Neptunium", "Np"), ("Plutonium", "Pu"), ("Americium", "Am"), ("Curium", "Cm"), ("Berkelium", "Bk"), ("Californium", "Cf"), ("Einsteinium", "Es"), ("Fermium", "Fm"), ("Mendelevium", "Md"), ("Nobelium", "No"), ("Lawrencium", "Lr"), ("Rutherfordium", "Rf"), ("Dubnium", "Db"), ("Seaborgium", "Sg"), ("Bohrium", "Bh"), ("Hassium", "Hs"), ("Meitnerium", "Mt"), ("Darmstadtium ", "Ds "), ("Roentgenium ", "Rg "), ("Copernicium ", "Cn "), ("Nihonium", "Nh"), ("Flerovium", "Fl"), ("Moscovium", "Mc"), ("Livermorium", "Lv"), ("Tennessine", "Ts"), ("Oganesson", "Og")}

op2_3 = InnerSoftJoin(ops["elements"], ops["elements_phase"], Column("element"), Column("symbol"), em=em, sv=sv, threshold=0.5, use_semantic_validation=True, sv_template="Is {symbol} the symbol for {element}")
evaluator = evaluate(op2_3, ["element", "symbol"], gt2_3)

False Positives:
	('Curium', 'Cu')
	('Cesium', 'Ce')
	('Thulium', 'Th')
	('Nihonium', 'Ni')
	('Nitrogen', 'Ni')
	('Astatine', 'As')
	('Moscovium', 'Mo')
	('Niobium', 'Ni')
	('Yttrium', 'Yb')
	('Fermium', 'Fe')

False Negatives:
	('Neptunium', 'Np')
	('Carbon', 'C')
	('Fermium', 'Fm')
	('Gold', 'Au')
	('Mercury', 'Hg')
	('Darmstadtium ', 'Ds ')
	('Tellurium', 'Te')
	('Promethium', 'Pm')
	('Phosphorus', 'P')
	('Neon', 'Ne')
	('Rhodium', 'Rh')
	('Fluorine', 'F')
	('Seaborgium', 'Sg')
	('Manganese', 'Mn')
	('Flerovium', 'Fl')
	('Sodium', 'Na')
	('Thallium', 'Tl')
	('Niobium', 'Nb')
	('Rutherfordium', 'Rf')
	('Sulfur', 'S')
	('Strontium', 'Sr')
	('Ruthenium', 'Ru')
	('Roentgenium ', 'Rg ')
	('Neodymium', 'Nd')
	('Radium', 'Ra')
	('Xenon', 'Xe')
	('Cadmium', 'Cd')
	('Krypton', 'Kr')
	('Scandium', 'Sc')
	('Oganesson', 'Og')
	('Nickel', 'Ni')
	('Tantalum', 'Ta')
	('Nihonium', 'Nh')
	('Dubnium', 'Db')
	('Tin', 'Sn')
	('Cesium', 'Cs')
	('Hassium', 'Hs')
	('Arsenic', 'As')
	('Livermorium', 'Lv')


### 2.4: Find p.H. neutral chemicals (Unit & Format Inconsistencies: p.H. = 7.0 $\approx$ 'Neutral')

$$
\sigma_{\text{ph}\approx\text{'Neutral'}} (\text{Chemicals})
$$

$$
\sigma_{\text{ph}\approx\text{'Base'}} (\text{Chemicals})
$$

$$
\sigma_{\text{ph}\approx\text{'Acidic'}} (\text{Chemicals})
$$

In [34]:
print("Find pH Neutral molecules")
gt2_4 = { ("H2O", "7.0"), ("CH3OH", "Neutral"), ("C2H5OH", "7.0"), ("C6H6", "7"), ("C3H6O", "7.000")}
op2_4 = Select(ops["chemicals"], SoftEqual(Constant("Neutral"), Column("pH"), em=em, threshold=0.5))
print(evaluate(op2_4, ["scientific_name", "pH"], gt2_4))

op2_4 = Select(ops["chemicals"], SoftValidate("Is {pH} is pH neutral", sv=sv, full_record=False))
print(evaluate(op2_4, ["scientific_name", "pH"], gt2_4))

print("\n\n-----------------------------\n\nFind pH Base molecules")
gt2_4 = { ("NaOH", "14.0"), ("NH3", "11.6"), ("C8H10N4O2", "Base")}
op2_4 = Select(ops["chemicals"], SoftEqual(Constant("Base"), Column("pH"), em=em, threshold=0.5))
print(evaluate(op2_4, ["scientific_name", "pH"], gt2_4))

op2_4 = Select(ops["chemicals"], SoftValidate("Is {pH} is base", sv=sv, full_record=False))
print(evaluate(op2_4, ["scientific_name", "pH"], gt2_4))

print("\n\n-----------------------------\n\nFind pH Acidic molecules")
gt2_4 = { ("H2SO4", "0.3"), ("HCl", "1.0"), ("Cl2", "Acidic in water")}
op2_4 = Select(ops["chemicals"], SoftEqual(Constant("Acidic"), Column("pH"), em=em, threshold=0.5))
print(evaluate(op2_4, ["scientific_name", "pH"], gt2_4))

op2_4 = Select(ops["chemicals"], SoftValidate("Is {pH} is acidic", sv=sv, full_record=False))
print(evaluate(op2_4, ["scientific_name", "pH"], gt2_4))

Find pH Neutral molecules
False Positives:


False Negatives:
	('C3H6O', '7.000')
	('C2H5OH', '7.0')
	('H2O', '7.0')
	('C6H6', '7')

{'precision': 100, 'recall': 20, 'f1_score': 33, 'jaccard_coefficient': 0, 'exec_time': '103ms'}
False Positives:


False Negatives:
	('C3H6O', '7.000')
	('C2H5OH', '7.0')
	('H2O', '7.0')
	('C6H6', '7')

{'precision': 100, 'recall': 20, 'f1_score': 33, 'jaccard_coefficient': 0, 'exec_time': '884ms'}


-----------------------------

Find pH Base molecules
False Positives:


False Negatives:
	('NaOH', '14.0')
	('NH3', '11.6')

{'precision': 100, 'recall': 33, 'f1_score': 50, 'jaccard_coefficient': 0, 'exec_time': '1078ms'}
False Positives:


False Negatives:
	('NaOH', '14.0')
	('NH3', '11.6')

{'precision': 100, 'recall': 33, 'f1_score': 50, 'jaccard_coefficient': 0, 'exec_time': '842ms'}


-----------------------------

Find pH Acidic molecules
False Positives:


False Negatives:
	('H2SO4', '0.3')
	('HCl', '1.0')

{'precision': 100, 'recall': 33, 'f1_score

## Test Case 3: Business & E-Commerce

### 3.1. Match companies from two datasets (Semantic Matching: {"name": "microsoft", "year_founded": 1975, "domain": "computer software" ... } $\approx$ {...})

## Test Case 4: Social Media & User-Generated Content

## Test Case 5: Geography
### 7.2. Spelling Variations & Typos, 7.3. Synonyms & Conceptual Overlap 


### 7.5. Join two country datasets: (Abbreviations & Acronyms, Different Languages: 'AT' $\approx$ 'Österreich', 'EE' $\approx$ 'Estonia'


In [None]:
# TODO

## Test Case 6: Entertainment & Media (Movies/ Music
### 6.1 Search Movies base don release disjunction (Unit & Format Inconsistencies: 'Before 2000' $\approx$ '1999')

In [None]:
gt6_1 = {("Pirates of the Caribbean: Dead Man's Chest", ), ("Charlie and the Chocolate Factory", ), ("Inception", ), ("The Matrix", )}

In [None]:
crit1 = SoftEqual(Column("release"), Constant("2006"), em=em, threshold=0.5)
crit2 = SoftEqual(Column("release"), Constant("July"), em=em, threshold=0.5)
crit3 = SoftEqual(Column("release"), Constant("Before 2000"), em=em, threshold=0.5)
op6_1 = Select(ops["movies"], DisjunctiveCriteria([crit1, crit2, crit3]))
evaluate(op6_1, ["name"], gt6_1)

In [None]:
# op6_1 = Select(ops["movies"], SoftValidate("Is {release} in 2006 or is {release} in July or is {release} before 2000?", sv=sv, full_record=False))
# crit11 = SoftValidate("Is {release} in 2006?", sv=sv, full_record=False)
# crit22 = SoftValidate("Is {release} in July?", sv=sv, full_record=False)
# crit33 = SoftValidate("Is {release} before 2000?", sv=sv, full_record=False)
# op = Select(ops["movies"], DisjunctiveCriteria([crit11, crit22, crit33]))

# result2_2 = {(x["name"], ) for x in op}
# evaluate(gt2_1, result2_2)

### 6.2 Match English and German Moview (Entity Matching, Different Languages)

In [None]:
gt6_2 = {
    ('The Lord of the Rings: The Fellowship of the Ring', 'Der Herr der Ringe: Die Gefährten'),
    ("Pirates of the Caribbean: Dead Man's Chest", 'Pirates of the Caribbean – Fluch der Karibik 2'),
    ('The Lord of the Rings: The Return of the King', 'Der Herr der Ringe: Die Rückkehr des Königs'),
    ('Charlie and the Chocolate Factory', 'Charlie und die Schokoladenfabrik'),
    ('Inception', 'Inception'),
    ('The Matrix', 'Matrix')
}

In [None]:
# op6_2 = {(x["movies.name"], x["movies_de.name"]) for x in InnerSoftJoin(ops["movies"], ops["movies_de"], Column("movies.name"), Column("movies_de.name"), em=em, threshold=0.4)}
op6_2 = InnerSoftJoin(ops["movies"], ops["movies_de"], None, None, em=em, threshold=0.6)
evaluate(op6_2, ["movies.name", "movies_de.name"],gt6_2)

### 6.3 Find Actors (Aliases: 'Orlando Bloom' $\approx$ '@orlandobloom'; Synonyms & Conceptual Overlap: 'Leonardo DiCaprio $ \approx $ 'Jack Dawson in Titanic')

In [None]:
gt6_4 =  {
  ('Carrie-Anne Moss', 'Carrie-Anne Moss'),
  ('Elijah Wood', 'Elijah Jordan Wood'),
  ('Elijah Wood', 'Elijah Wood'),
  ('Elliot Page', 'Elliot Page'),
  ('Freddie Highmore', 'Alfred Highmore'),
  ('Ian McKellen', 'Sir Ian Murray McKellen'),
  ('Johnny Depp', 'John Christopher "Johnny" Depp II'),
  ('Johnny Depp', 'John Christopher Depp II'),
  ('Johnny Depp', 'The Mad Hatter in Alice in Wonderland Actor'),
  ('Joseph Gordon-Levitt', 'Joseph Gordon-Levitt'),
  ('Keanu Reeves', 'Keanu Charles Reeves'),
  ('Keira Knightley', 'Keira Christina Knightley'),
  ('Keira Knightley', 'Keira Knightley'),
  ('Laurence Fishburne', 'Laurence Fishburne'),
  ('Leonardo DiCaprio', 'Jack Dawson in Titanic'),
  ('Orlando Bloom', '@orlandobloom'),
  ('Orlando Bloom', 'Orlando Bloom'),
  ('Orlando Bloom', 'Orlando Jonathan Blanchard Copeland Bloom'),
  ('Viggo Mortensen', 'Viggo Mortensen')
}

In [None]:
op6_4 = InnerSoftJoin(ops["actors"], ops["plays_in"], Column("name"), Column("actor/actress"), threshold=0.7, em=em)
evaluate(op6_4, ["name", "actor/actress"], gt6_4)

## Test Unit & Format Inconsistencies: 06/24/2003 $\approx$ 24.06.2003, 6/24/2005 > 24.06.2003, 2.2 lbs' $\approx$ '1 kg

* $ \sigma_{date \approx '06/24/2003'} (Orders) $
* $ \sigma_{date \approx '06/24/2003'} (Orders) $
* $ \sigma_{date \approx '06/24/2003'} (Orders) $
* $ \sigma_{date \approx '06/24/2003'} (Orders) $
* $ \sigma_{date \approx '06/24/2003'} (Orders) $
* $ \sigma_{date \approx '06/24/2003'} (Orders) $
* $ \sigma_{date \approx '06/24/2003'} (Orders) $


# Further Experiments
## Soft Aggregation

In [None]:
x = SoftAggregateFaissKMeans(ops["movies"], ["name"], [StringAggregation("name", "movies")], em=em, num_clusters=5)
print([a for a in x])

In [None]:
x = SoftAggregateScikit(ops["movies"], ["name"], [StringAggregation("name", "movies")], em=em, cluster_class=DBSCAN, cluster_params={"eps":3, "min_samples": 2})
print([a for a in x])

In [None]:
x = SoftAggregateScikit(ops["movies"], ["name"], [CountAggregation("name", "movies")], em=em, cluster_class=SpectralClustering, cluster_params={"n_clusters": 5, "assign_labels" :'discretize', "random_state": 0})
print([a for a in x])