<a href="https://colab.research.google.com/github/aknip/Coding-Cheatsheets/blob/main/Python-Compare-Text-or-JSON.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Compare Texts, Compare JSON

Content:
1. Compare Texts: Levenshtein Distance and ROUGE
2. Compare JSON (by text) - simple approach
3. Flatten JSON before text comparison
4. Compare JSON (by text) - the final approach
5. Compare JSON (by structure)
6. Compare Text or JSON by AI/LLM
7. Summarized - all functions in one place

Use case: Compare and benchmark LLM results

In [None]:
!pip install openai litellm numpy python-benedict icecream levenshtein --quiet

In [2]:
import json
import os
from getpass import getpass
import psutil
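# Detect a notebook session by inspecting the parent process command line
IN_NOTEBOOK = any("jupyter-notebook" in i for i in psutil.Process().parent().cmdline())
if IN_NOTEBOOK:
  # Ask for the secrets interactively and cache them in an environment variable
  CREDS = json.loads(getpass("Secrets (JSON string): "))
  os.environ['CREDS'] = json.dumps(CREDS)
# Outside a notebook, CREDS is expected to be set in the environment already
CREDS = json.loads(os.getenv('CREDS'))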

Secrets (JSON string): ··········


In [3]:
import json
import textwrap
from icecream import ic

import openai
from litellm import completion
os.environ["OPENAI_API_KEY"] = CREDS['OpenAI']['v1']['credential'] # my key
os.environ["TOGETHERAI_API_KEY"] = CREDS['together-ai']['key']['credential']

# 1. Compare Texts: Levenshtein Distance and ROUGE

- https://towardsdatascience.com/text-similarity-w-levenshtein-distance-in-python-2f7478986e75#
- Notebook: https://github.com/vatsal220/medium_articles/blob/main/levenshtein_distance/lev_dist.ipynb
- Github: The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity https://github.com/rapidfuzz/Levenshtein

### Levenshtein: Two implementations, same results...

In [31]:
# First implementation, using numpy

import numpy as np

def levenshtein(seq1, seq2):
    # source: https://github.com/vatsal220/medium_articles/blob/main/levenshtein_distance/lev_dist.ipynb
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros ((size_x, size_y))
    for x in range(size_x):
        matrix [x, 0] = x
    for y in range(size_y):
        matrix [0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            if seq1[x-1] == seq2[y-1]:
                matrix [x,y] = min(
                    matrix[x-1, y] + 1,
                    matrix[x-1, y-1],
                    matrix[x, y-1] + 1
                )
            else:
                matrix [x,y] = min(
                    matrix[x-1,y] + 1,
                    matrix[x-1,y-1] + 1,
                    matrix[x,y-1] + 1
                )
    #ic(matrix)
    return (matrix[size_x - 1, size_y - 1])

In [32]:
# Second implementation, using Levenshtein Python C extension module https://github.com/rapidfuzz/

from Levenshtein import distance

In [33]:
# 1. test: Simple one word
string1 = 'stamp'
string2 = 'stomp'
ic("1. test")
ic(levenshtein(string1, string2))
ic(distance(string1, string2))

# 2. test: Sentences, case, whitespace
string1 = 'Hello world, a longer example. With two sentences. Punctuation and upper-/lower case count. Spaces, too!'
string2 = 'Hello world, a longer example. With two sentences, punctuation and upper-/lower case count.    Spaces, too!'
ic("2. test")
ic(levenshtein(string1, string2))
ic(distance(string1, string2))

ic| '1. test'
ic| levenshtein(string1, string2): 1.0
ic| distance(string1, string2): 1
ic| '2. test'
ic| levenshtein(string1, string2): 5.0
ic| distance(string1, string2): 5


5
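The raw distance grows with the length of the strings, which makes results hard to compare across test cases. A minimal sketch of a length-normalized similarity follows. The names `levenshtein_distance` and `levenshtein_similarity` are hypothetical helpers, not part of the notebook or of the Levenshtein package, and dividing by the longer string is just one possible normalization.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit costs, keeping two rows instead of a full matrix."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            # Substitution costs 1 only if the characters differ
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 means identical, 0.0 means every position differs."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("stamp", "stomp"))
print(levenshtein_similarity("stamp", "stomp"))
```

Because the distance can never exceed the length of the longer string, this similarity always stays between 0 and 1.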

### ROUGE

Source: https://txt.cohere.com/evaluating-llm-outputs/#llm-generated-evaluation

In [114]:
from collections import Counter

def rouge_1(reference, candidate):
    # Turn into unigrams
    reference_words = reference.split()
    candidate_words = candidate.split()

    # Compute the number of overlapping words
    reference_count = Counter(reference_words)
    candidate_count = Counter(candidate_words)
    overlap = sum(min(candidate_count[w], reference_count[w]) for w in candidate_count)

    # Compute precision, recall, and F1 score
    recall = overlap / len(reference_words)
    precision = overlap / len(candidate_words)
    if (recall + precision) != 0:
      f1 = 2 * (recall * precision) / (recall + precision)
    else:
      f1 = 0

    # Return results
    return {"recall": recall, "precision": precision, "f1": f1}

ref_answer = "Because the sound quality is the best out there"
gen_answers = ["Because the audio experience is unrivaled",
                     "Because the microphone has the best quality"]

for index,gen_answer in enumerate(gen_answers):
    result = rouge_1(ref_answer, gen_answer)
    print(f"Answer #{index+1}")
    print(f"Precision: {result['precision']:.2f}")
    print(f"Recall: {result['recall']:.2f}")
    print(f"F1-Score: {result['f1']:.2f}")
    print("\n")

Answer #1
Precision: 0.50
Recall: 0.33
F1-Score: 0.40


Answer #2
Precision: 0.71
Recall: 0.56
F1-Score: 0.63
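`rouge_1` above splits on whitespace only, so `best` and `best.` or `The` and `the` count as different words, and an empty input raises a division error. A possible variant, sketched under the assumption that case and punctuation should not matter for the benchmark (the names `normalize_tokens` and `rouge_1_normalized` are made up for this example):

```python
import re
from collections import Counter


def normalize_tokens(text: str) -> list[str]:
    """Lowercase the text and keep only runs of word characters as tokens."""
    return re.findall(r"\w+", text.lower())


def rouge_1_normalized(reference: str, candidate: str) -> dict[str, float]:
    """ROUGE-1 on normalized unigrams; empty inputs yield zeros instead of an error."""
    reference_count = Counter(normalize_tokens(reference))
    candidate_count = Counter(normalize_tokens(candidate))
    # Counter intersection keeps the minimum count per word
    overlap = sum((reference_count & candidate_count).values())

    reference_total = sum(reference_count.values())
    candidate_total = sum(candidate_count.values())
    recall = overlap / reference_total if reference_total else 0.0
    precision = overlap / candidate_total if candidate_total else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


print(rouge_1_normalized("The sound quality is great.", "the sound quality is GREAT"))
```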




# 2. Compare JSON (by text) - simple approach

In [34]:
# 1. test: JSON, converted to string - and sorted for better comparison results
string1 = json.dumps({'dict1': {'foo': 1, 'bar': 2}, 'dict2': {'baz': 'lorem ipsum', 'quux': 3}}, sort_keys = True)
string2 = json.dumps({'dict1': {'bar': 9, 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': 3}}, sort_keys = True)
ic("3. test")
ic(levenshtein(string1, string2))
ic(distance(string1, string2))

# 2. test: JSON, more complex, with arrays in different orders
string1 = json.dumps({'dict1': {'foo': 1, 'bar': [1,2,3,4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}, sort_keys = True)
string2 = json.dumps({'dict1': {'bar': [4,3,2,1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}}, sort_keys = True)
ic("4. test")
ic(levenshtein(string1, string2))
ic(distance(string1, string2))


ic| '3. test'
ic| levenshtein(string1, string2): 1.0
ic| distance(string1, string2): 1
ic| '4. test'
ic| levenshtein(string1, string2): 28.0
ic| distance(string1, string2): 28


28

In [35]:
# more details for 2. test:
# data content is the same, but order inside of arrays is different => high Levenshtein distance
print("string 1:")
print(json.dumps(json.loads(string1), indent=2))
print()
print("string 2:")
print(json.dumps(json.loads(string2), indent=2))

string 1:
{
  "dict1": {
    "bar": [
      1,
      2,
      3,
      4
    ],
    "foo": 1
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "is": 2,
        "this": 1
      },
      {
        "another": 3,
        "test": 2
      }
    ]
  }
}

string 2:
{
  "dict1": {
    "bar": [
      4,
      3,
      2,
      1
    ],
    "foo": 1
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "another": 3,
        "test": 2
      },
      {
        "is": 2,
        "this": 1
      }
    ]
  }
}


# 3. Flatten JSON before text comparison

As the previous test case shows, comparing two JSON objects as text works well for simple objects without arrays. Sorting the keys before the comparison makes the result independent of key order.

But arrays, including arrays of nested objects, whose items appear in a different order lead to a misleadingly high Levenshtein distance, even though the data content is the same.

Idea:
- Flatten the JSON objects before the comparison
- Remove JSON hierarchy
- Resolve / isolate array values
- Truncate long strings so that the comparison focuses on the first n characters
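The steps above can be sketched without any third-party dependency. This is only an illustration of the idea, not the benedict-based function used in this notebook. The name `flatten` and the parameter `max_len` are assumptions, and list items use their index as the key.

```python
def flatten(data, max_len=9999):
    """Return 'key: value' strings for every scalar leaf of nested dicts/lists."""
    lines = []

    def walk(node, key):
        if isinstance(node, dict):
            for child_key, child in node.items():
                walk(child, child_key)  # hierarchy is dropped: only the last key is kept
        elif isinstance(node, list):
            for index, child in enumerate(node):
                walk(child, index)  # array values are isolated, keyed by position
        elif type(node) in (str, int, float):
            lines.append(f"{key}: {str(node)[:max_len]}")  # truncate long values

    walk(data, None)
    return lines


sample = {"a": {"b": 1, "c": [10, 20]}, "d": "a long text value"}
flat = flatten(sample, max_len=6)
print(", ".join(sorted(flat)))
```

Sorting the flat strings before joining them makes the final text independent of the key order in the input.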

In [58]:
from benedict import benedict

# source: https://stackoverflow.com/a/58611501

def flatten_dict(dictionary={}, template='{}: {}', string_truncation=9999):
  # removes hierarchy and arrays in dict, returns a flat list of strings
  dict_benedict = benedict(dictionary)
  flat_item = []  # local list; the nested callback appends to it via closure
  def traverse_item(dct, key, value):
    # keep only scalar leaves (str, int, float); dicts and lists are traversed, not emitted
    if type(value) in (str, int, float):
      value_trunc = str(value)[:string_truncation]
      flat_item.append(template.format(key, value_trunc)) # string template for result
  dict_benedict.traverse(traverse_item)
  return flat_item

def array_to_sorted_string(string_array=()):
  # sorts a list of strings and joins them into one comma-separated string
  return ', '.join(sorted(string_array))

In [60]:
# 1. test: Flatten a simple dict
d = {'dict1': {'foo': 1, 'bar': 2}, 'dict2': {'baz': 'hello very long text here which can be truncated by function', 'quux': 4}}
d_flat = flatten_dict(dictionary=d, string_truncation=20)
print(json.dumps(d_flat, indent=2))
print(array_to_sorted_string(string_array=d_flat))
print(array_to_sorted_string(string_array=d_flat))

[
  "foo: 1",
  "bar: 2",
  "baz: hello very long text",
  "quux: 4"
]
bar: 2, baz: hello very long text, foo: 1, quux: 4
bar: 2, baz: hello very long text, foo: 1, quux: 4


In [62]:
# 2. test: Complex dict with arrays (same as 2. test for Levenshtein distance above)
d = {'dict1': {'foo': 1, 'bar': [1,2,3,4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}
d_flat = flatten_dict(dictionary=d)
print(json.dumps(d_flat, indent=2))
print(array_to_sorted_string(string_array=d_flat))

d = {'dict1': {'bar': [4,3,2,1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}}
d_flat = flatten_dict(dictionary=d)
print(json.dumps(d_flat, indent=2))
print(array_to_sorted_string(string_array=d_flat))


[
  "foo: 1",
  "0: 1",
  "1: 2",
  "2: 3",
  "3: 4",
  "baz: lorem ipsum",
  "this: 1",
  "is: 2",
  "another: 3",
  "test: 2"
]
0: 1, 1: 2, 2: 3, 3: 4, another: 3, baz: lorem ipsum, foo: 1, is: 2, test: 2, this: 1
[
  "0: 4",
  "1: 3",
  "2: 2",
  "3: 1",
  "foo: 1",
  "baz: lorem ipsum",
  "another: 3",
  "test: 2",
  "this: 1",
  "is: 2"
]
0: 4, 1: 3, 2: 2, 3: 1, another: 3, baz: lorem ipsum, foo: 1, is: 2, test: 2, this: 1


# 4. Compare JSON (by text) - the final approach with Levenshtein, ROUGE ...

In [63]:
# Let's take the two complex JSON objects from above (nested, arrays, order shuffled - both have the same data content):
d1 = {'dict1': {'foo': 1, 'bar': [1,2,3,4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}
print(json.dumps(d1, indent=2))
d2 = {'dict1': {'bar': [4,3,2,1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}}
print()
print(json.dumps(d2, indent=2))

{
  "dict1": {
    "foo": 1,
    "bar": [
      1,
      2,
      3,
      4
    ]
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "this": 1,
        "is": 2
      },
      {
        "another": 3,
        "test": 2
      }
    ]
  }
}

{
  "dict1": {
    "bar": [
      4,
      3,
      2,
      1
    ],
    "foo": 1
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "another": 3,
        "test": 2
      },
      {
        "this": 1,
        "is": 2
      }
    ]
  }
}


In [68]:
# ... and compare them:
d1_text = array_to_sorted_string(string_array=flatten_dict(dictionary=d1, string_truncation=4))
print(d1_text)
d2_text = array_to_sorted_string(string_array=flatten_dict(dictionary=d2, string_truncation=4))
print(d2_text)
print("Levenshtein distance:")
print(levenshtein(d1_text, d2_text))
print(distance(d1_text, d2_text))

0: 1, 1: 2, 2: 3, 3: 4, another: 3, baz: lore, foo: 1, is: 2, test: 2, this: 1
0: 4, 1: 3, 2: 2, 3: 1, another: 3, baz: lore, foo: 1, is: 2, test: 2, this: 1
Levenshtein distance:
4.0
4
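The remaining distance of 4 comes from the array positions being used as keys (`0: 1` versus `0: 4`). If array order should not count at all, one option is to let list items inherit the key of the list instead of their index. A sketch of that idea follows. The function `flatten_ignoring_list_order` is hypothetical and not part of this notebook.

```python
def flatten_ignoring_list_order(data):
    """Flatten to sorted 'key: value' text; list items inherit the key of their list."""
    lines = []

    def walk(node, key):
        if isinstance(node, dict):
            for child_key, child in node.items():
                walk(child, child_key)
        elif isinstance(node, list):
            for child in node:
                walk(child, key)  # keep the list's own key instead of the index
        elif type(node) in (str, int, float):
            lines.append(f"{key}: {node}")

    walk(data, None)
    return ", ".join(sorted(lines))


d1 = {'dict1': {'foo': 1, 'bar': [1, 2, 3, 4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}
d2 = {'dict1': {'bar': [4, 3, 2, 1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}}

print(flatten_ignoring_list_order(d1))
print(flatten_ignoring_list_order(d1) == flatten_ignoring_list_order(d2))
```

With this variant both objects produce the same text, so any text distance between them is 0. The trade-off is that a genuinely reordered list is no longer detected.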


# 5. Compare JSON (by structure)

In [69]:
from benedict import benedict

def get_keys(my_dict):
  # removes hierarchy and arrays in dict, returns array of all keys
  # source: https://stackoverflow.com/a/71857572
  keys = []
  if isinstance(my_dict, list):
      for item in my_dict:
          keys.extend(get_keys(item))
  elif isinstance(my_dict, dict):
      for key in my_dict:
          keys.append(key)
          keys.extend(get_keys(my_dict[key]))
  return keys

def get_entries(my_dict):
  # removes hierarchy and arrays in dict, returns array of all values (dict) or single entries (list/array)
  # source: https://stackoverflow.com/a/58611501
  dict_benedict = benedict(my_dict)
  flat_item = []  # local list; the nested callback appends to it via closure
  def traverse_item(dct, key, value):
    # keep only scalar leaves (str, int, float)
    if type(value) in (str, int, float):
      flat_item.append('{}'.format(value)) # string template for result
      #flat_item.append('{}: {}'.format(key, value)) # string template for result
  dict_benedict.traverse(traverse_item)
  return flat_item

In [70]:
# Let's take the two complex JSON objects from above, but this time d2 differs in structure: one extra array item (5) and one extra key ('dict3'):
d1 = {'dict1': {'foo': 1, 'bar': [1,2,3,4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}
print(json.dumps(d1, indent=2))
d2 = {'dict1': {'bar': [5,4,3,2,1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}, 'dict3': 'added'}
print()
print(json.dumps(d2, indent=2))

{
  "dict1": {
    "foo": 1,
    "bar": [
      1,
      2,
      3,
      4
    ]
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "this": 1,
        "is": 2
      },
      {
        "another": 3,
        "test": 2
      }
    ]
  }
}

{
  "dict1": {
    "bar": [
      5,
      4,
      3,
      2,
      1
    ],
    "foo": 1
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "another": 3,
        "test": 2
      },
      {
        "this": 1,
        "is": 2
      }
    ]
  },
  "dict3": "added"
}


In [71]:
d1_entries = get_entries(d1)
d1_keys = get_keys(d1)
print(d1_entries)
print(d1_keys)
print(len(d1_entries))
print(len(d1_keys))

['1', '1', '2', '3', '4', 'lorem ipsum', '1', '2', '3', '2']
['dict1', 'foo', 'bar', 'dict2', 'baz', 'quux', 'this', 'is', 'another', 'test']
10
10
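Counting keys only shows that two structures differ in size, not where they differ. A minimal sketch that lists the missing and additional keys, using the two objects from this section. The names `collect_keys` and `key_difference` are assumptions for this example; `collect_keys` repeats the logic of `get_keys` so that the block runs on its own.

```python
from collections import Counter


def collect_keys(node):
    """Return every dict key found anywhere in nested dicts and lists."""
    keys = []
    if isinstance(node, list):
        for item in node:
            keys.extend(collect_keys(item))
    elif isinstance(node, dict):
        for key, value in node.items():
            keys.append(key)
            keys.extend(collect_keys(value))
    return keys


def key_difference(target, test):
    """Hypothetical helper: keys missing from / added to `test`, compared with `target`."""
    target_count = Counter(collect_keys(target))
    test_count = Counter(collect_keys(test))
    # Counter subtraction keeps only positive counts, so repeated keys are handled
    missing = sorted((target_count - test_count).elements())
    extra = sorted((test_count - target_count).elements())
    return {"missing": missing, "extra": extra}


d1 = {'dict1': {'foo': 1, 'bar': [1, 2, 3, 4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}
d2 = {'dict1': {'bar': [5, 4, 3, 2, 1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}, 'dict3': 'added'}

print(key_difference(d1, d2))
```

Note that this compares key names only. The extra array item `5` in `d2` has no key of its own and would show up in the entries count instead.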


# 6. Compare Text or JSON by AI/LLM

In [126]:
target_JSON = {
  "Versicherungssumme": "5500000",
  "Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft mbH",
  "Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Zutzenhausen",
  "Versicherungsnehmer_Branche": "Produktion von Aluminium-Großarmaturen",
  "Versicherungsnehmer_Umsatz": "78500000",
  "Vorschäden_benannt": "Ja, es gibt einen laufenden D&O-Schadensfall"
}
#target_JSON_text = str(target_JSON)
target_JSON_text = '{' + array_to_sorted_string(string_array=flatten_dict(dictionary=target_JSON, template='"{}": "{}"', string_truncation=20)) + '}'

test_JSON = {
  "Versicherungsnehmer_Branche": "Produktion von Aluminium-Großarmaturen",
  "Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft mbH",
  "Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Zutzenhausen",
  "Vorschäden_benannt": "Ja, es gibt einen laufenden D&O-Schadensfall aufgrund einer Inanspruchnahme eines ehemaligen Geschäftsführers aus dem Jahr 2017.",
  "Versicherungssumme": "5500000",
  "Versicherungsnehmer_Umsatz": "78500000"
}
#test_JSON_text = str(test_JSON)
test_JSON_text = '{' + array_to_sorted_string(string_array=flatten_dict(dictionary=test_JSON, template='"{}": "{}"', string_truncation=20)) + '}'


prompt=f"""You are a data analyst and help to benchmark results of a webserive. The ideal result is called 'Target-JSON', the actual result is called 'Test-JSON':\n
Target-JSON: {target_JSON_text}\n
Test-JSON: {test_JSON_text}\n
Please rate the Test-JSON in a range of 0.00-1.00 if it fits to the Target-JSON.
Answer in the JSON-format: {{'AI-Rating': 0.75, 'Differences': 'Field name is missing, field street has a different value.'}}
"""
print(prompt)


You are a data analyst and help to benchmark results of a webserive. The ideal result is called 'Target-JSON', the actual result is called 'Test-JSON':

Target-JSON: {"Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Z", "Versicherungsnehmer_Branche": "Produktion von Alumi", "Versicherungsnehmer_Name": "Zorrsen Beteiligungs", "Versicherungsnehmer_Umsatz": "78500000", "Versicherungssumme": "5500000", "Vorschäden_benannt": "Ja, es gibt einen la"}

Test-JSON: {"Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Z", "Versicherungsnehmer_Branche": "Produktion von Alumi", "Versicherungsnehmer_Name": "Zorrsen Beteiligungs", "Versicherungsnehmer_Umsatz": "78500000", "Versicherungssumme": "5500000", "Vorschäden_benannt": "Ja, es gibt einen la"}

Please rate the Test-JSON in a range of 0.00-1.00 if it fits to the Target-JSON. 
Answer in the JSON-format: {'AI-Rating': 0.75, 'Differences': 'Field name is missing, field street has a different value.'}



In [127]:
response = completion(
  #model="gpt-3.5-turbo",
  model="gpt-4",
  messages=[{ "content": prompt,"role": "user"}],
  temperature=0,
  max_tokens=50
)
print()
print(response.choices[0].message.content)


{'AI-Rating': 1.00, 'Differences': 'No differences found.'}
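The prompt asks for an answer with single-quoted keys, which is a Python dict literal rather than valid JSON, so `json.loads` would reject the reply shown above. A possible way to turn the reply into a dict, assuming the model sticks to the requested format (`parse_rating` is a hypothetical helper):

```python
import ast
import json


def parse_rating(reply: str) -> dict:
    """Parse a model reply that is either JSON or a Python-style dict literal."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        # Single-quoted keys are not valid JSON; literal_eval evaluates literals only.
        # It raises ValueError or SyntaxError if the reply is truncated or malformed.
        parsed = ast.literal_eval(reply)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict-like reply")
    return parsed


reply = "{'AI-Rating': 1.00, 'Differences': 'No differences found.'}"
rating = parse_rating(reply)
print(rating["AI-Rating"], rating["Differences"])
```

A reply cut off by a low `max_tokens` value would fail to parse, so the caller should be prepared to catch that error.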


# 7. Summarized - all functions in one place

In [81]:
import numpy as np
from benedict import benedict

def levenshtein(seq1, seq2):
    # source: https://github.com/vatsal220/medium_articles/blob/main/levenshtein_distance/lev_dist.ipynb
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros ((size_x, size_y))
    for x in range(size_x):
        matrix [x, 0] = x
    for y in range(size_y):
        matrix [0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            if seq1[x-1] == seq2[y-1]:
                matrix [x,y] = min(
                    matrix[x-1, y] + 1,
                    matrix[x-1, y-1],
                    matrix[x, y-1] + 1
                )
            else:
                matrix [x,y] = min(
                    matrix[x-1,y] + 1,
                    matrix[x-1,y-1] + 1,
                    matrix[x,y-1] + 1
                )
    #ic(matrix)
    return (matrix[size_x - 1, size_y - 1])


def flatten_dict(dictionary={}, template='{}: {}', string_truncation=9999):
  # removes hierarchy and arrays in dict, returns a flat list of strings
  dict_benedict = benedict(dictionary)
  flat_item = []  # local list; the nested callback appends to it via closure
  def traverse_item(dct, key, value):
    # keep only scalar leaves (str, int, float); dicts and lists are traversed, not emitted
    if type(value) in (str, int, float):
      value_trunc = str(value)[:string_truncation]
      flat_item.append(template.format(key, value_trunc)) # string template for result
  dict_benedict.traverse(traverse_item)
  return flat_item

def array_to_sorted_string(string_array=()):
  # sorts a list of strings and joins them into one comma-separated string
  return ', '.join(sorted(string_array))

def get_keys(my_dict):
  # removes hierarchy and arrays in dict, returns array of all keys
  # source: https://stackoverflow.com/a/71857572
  keys = []
  if isinstance(my_dict, list):
      for item in my_dict:
          keys.extend(get_keys(item))
  elif isinstance(my_dict, dict):
      for key in my_dict:
          keys.append(key)
          keys.extend(get_keys(my_dict[key]))
  return keys

def get_entries(my_dict):
  # removes hierarchy and arrays in dict, returns array of all values (dict) or single entries (list/array)
  # source: https://stackoverflow.com/a/58611501
  dict_benedict = benedict(my_dict)
  flat_item = []  # local list; the nested callback appends to it via closure
  def traverse_item(dct, key, value):
    # keep only scalar leaves (str, int, float)
    if type(value) in (str, int, float):
      flat_item.append('{}'.format(value)) # string template for result
      #flat_item.append('{}: {}'.format(key, value)) # string template for result
  dict_benedict.traverse(traverse_item)
  return flat_item

In [87]:
# Examples 1:

string1 = 'Hello world, a longer example. With two sentences. Punctuation and upper-/lower case count. Spaces, too!'
string2 = 'Hello world, a longer example. With two sentences, punctuation and upper-/lower case count.    Spaces, too!'
print("Text Levenshtein distance:")
print(levenshtein(string1, string2))

d1 = {'dict1': {'foo': 1, 'bar': [1,2,3,4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}
d2 = {'dict1': {'bar': [5,4,3,2,1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}, 'dict3': 'added'}
d1_text = array_to_sorted_string(string_array=flatten_dict(dictionary=d1, string_truncation=999))
d2_text = array_to_sorted_string(string_array=flatten_dict(dictionary=d2, string_truncation=999))

print("JSON Levenshtein distance:")
print(levenshtein(d1_text, d2_text),len(d1_text))

Text Levenshtein distance:
5.0
JSON Levenshtein distance:
22.0 85


In [125]:
# Examples 2: (loop)

target_JSON = {
    "Versicherungssumme": "5500000",
    "Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft mbH",
    "Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Zutzenhausen",
    "Versicherungsnehmer_Branche": "Produktion von Aluminium-Großarmaturen",
    "Versicherungsnehmer_Umsatz": "78500000",
    "Vorschäden_benannt": "Ja, es gibt einen laufenden D&O-Schadensfall"
}

result_JSONs = [
  {
    "Versicherungssumme": 5500000,
    "Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft mbH",
    "Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Zutzenhausen",
    "Versicherungsnehmer_Branche": "Produktion von Aluminium-Großfarmaturen",
    "Versicherungsnehmer_Umsatz": 78500000,
    "Vorschäden_benannt": "Ja, es gibt einen laufenden D&O-Schadensfall"
  },
  {
    "Versicherungssumme": "5500000"
  },
  {
    "Versicherungssumme": "5500000",
    "Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft mbH",
    "Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Zutzenhausen",
    "Versicherungsnehmer_Branche": "Produktion von Aluminium-Großarmaturen",
    "Versicherungsnehmer_Umsatz": "78500000",
    "Vorschäden_benannt": "Ja, es gibt einen laufenden D&O-Schadensfall aufgrund einer Inanspruchnahme eines ehemaligen Geschäftsführers aus dem Jahr 2017."
  },
  {
    "Versicherungssumme": 5500000,
    "Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft mbH",
    "Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Zutzenhausen",
    "Versicherungsnehmer_Branche": "Produktion von Aluminium-Großarmaturen",
    "Versicherungsnehmer_Umsatz": 78500000,
    "Vorschäden_benannt": "Ja, D&O-Schadensfall im Jahr 2017"
  },
  {
    "Versicherungssumme": 5500000,
    "Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft mbH",
    "Versicherungsnehmer_Anschrift": "Juliusweg 9, 28374 Zutzenhausen",
    "Versicherungsnehmer_Branche": "Produktion von Aluminium-Großarmaturen",
    "Versicherungsnehmer_Umsatz": 78500000,
    "Vorschäden_benannt": "Ja, laufender D&O-Schadensfall aufgrund einer Inanspruchnahme eines ehemaligen Geschäftsführers."
  }
]

trunc = 50 # max length of each value's string representation (keys are not truncated)
target_JSON_text = '{' + array_to_sorted_string(string_array=flatten_dict(dictionary=target_JSON, template='"{}": "{}"', string_truncation=trunc)) + '}'
target_plain_text = array_to_sorted_string(string_array=flatten_dict(dictionary=target_JSON, template='{}: {}', string_truncation=trunc))
target_entries_absolute = len(get_entries(target_JSON))
target_keys_absolute = len(get_keys(target_JSON))

for test_JSON in result_JSONs:
  test_JSON_text = '{' + array_to_sorted_string(string_array=flatten_dict(dictionary=test_JSON, template='"{}": "{}"', string_truncation=trunc)) + '}'
  test_plain_text = array_to_sorted_string(string_array=flatten_dict(dictionary=test_JSON, template='{}: {}', string_truncation=trunc))
  print(target_plain_text)
  print(test_plain_text)
  test_lev_absolute = int(levenshtein(target_JSON_text, test_JSON_text))
  # relative Levenshtein similarity: 1.0 = identical, 0 = as many edits as the target has characters (can drop below 0 for much longer test texts)
  test_lev_relative = 1-test_lev_absolute/len(target_JSON_text)
  test_entries_absolute = len(get_entries(test_JSON))
  test_entries_relative = test_entries_absolute/target_entries_absolute
  test_keys_absolute = len(get_keys(test_JSON))
  test_keys_relative = test_keys_absolute/target_keys_absolute
  #print(f'Lev dist abs: {test_lev_absolute}')
  print(f'Lev dist rel: {test_lev_relative:.2f}')
  #print(f'Entries rel: {test_entries_relative:.2f}')
  print(f'Keys rel: {test_keys_relative:.2f}')

  rouge_result = rouge_1(target_plain_text, test_plain_text)
  print(f"Precision: {rouge_result['precision']:.2f}")
  print(f"Recall: {rouge_result['recall']:.2f}")
  print(f"F1-Score: {rouge_result['f1']:.2f}")

  prompt=f"""You are a data analyst and help to benchmark results of a webserive. The ideal result is called 'Target-JSON', the actual result is called 'Test-JSON':\n
Target-JSON: {target_JSON_text}\n
Test-JSON: {test_JSON_text}\n
Please rate the Test-JSON in a range of 0.00-1.00 if it fits to the Target-JSON.
Answer in the JSON-format: {{'AI-Rating': 0.75, 'Differences': 'Field name is missing, field street has a different value.'}}
"""
  response = completion(
    #model="gpt-3.5-turbo",
    model="gpt-4",
    messages=[{ "content": prompt,"role": "user"}],
    temperature=0,
    #max_tokens=50
  )
  #print(prompt)
  print(response.choices[0].message.content)
  print()


Versicherungsnehmer_Anschrift: Juliusweg 9, 28374 Zutzenhausen, Versicherungsnehmer_Branche: Produktion von Aluminium-Großarmaturen, Versicherungsnehmer_Name: Zorrsen Beteiligungsgesellschaft mbH, Versicherungsnehmer_Umsatz: 78500000, Versicherungssumme: 5500000, Vorschäden_benannt: Ja, es gibt einen laufenden D&O-Schadensfall
Versicherungsnehmer_Anschrift: Juliusweg 9, 28374 Zutzenhausen, Versicherungsnehmer_Branche: Produktion von Aluminium-Großfarmaturen, Versicherungsnehmer_Name: Zorrsen Beteiligungsgesellschaft mbH, Versicherungsnehmer_Umsatz: 78500000, Versicherungssumme: 5500000, Vorschäden_benannt: Ja, es gibt einen laufenden D&O-Schadensfall
Lev dist rel: 1.00
Keys rel: 1.00
Precision: 0.96
Recall: 0.96
F1-Score: 0.96
{'AI-Rating': 0.83, 'Differences': 'Mismatch in the field "Versicherungsnehmer_Branche".'}

Versicherungsnehmer_Anschrift: Juliusweg 9, 28374 Zutzenhausen, Versicherungsnehmer_Branche: Produktion von Aluminium-Großarmaturen, Versicherungsnehmer_Name: Zorrsen Bete