<a href="https://colab.research.google.com/github/aknip/Coding-Cheatsheets/blob/main/Python-Compare-Text-or-JSON.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Compare Texts, Compare JSONS

Use case: compare and benchmark LLM results

In [None]:
!pip install numpy python-benedict icecream levenshtein --quiet

In [29]:
import json
import textwrap
from icecream import ic

# 1. Compare Texts: Levenshtein Distance

- https://towardsdatascience.com/text-similarity-w-levenshtein-distance-in-python-2f7478986e75#
- Notebook: https://github.com/vatsal220/medium_articles/blob/main/levenshtein_distance/lev_dist.ipynb
- GitHub: the Levenshtein Python C extension module provides functions for fast computation of Levenshtein distance and string similarity: https://github.com/rapidfuzz/Levenshtein
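
For reference, the Levenshtein distance is commonly defined by the recurrence below. The notation is only illustrative and not taken from the linked sources: $a$ and $b$ are the two strings, $D_{i,j}$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, and $[a_i \neq b_j]$ is 1 if the characters differ and 0 otherwise.

$$
D_{i,0} = i, \qquad D_{0,j} = j, \qquad
D_{i,j} = \min\bigl(D_{i-1,j} + 1,\; D_{i,j-1} + 1,\; D_{i-1,j-1} + [a_i \neq b_j]\bigr)
$$

The three terms correspond to a deletion, an insertion and a match or substitution. For example, 'stamp' and 'stomp' differ by a single substitution, so their distance is 1.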

### Two implementations, same results...

In [33]:
# First implementation, using numpy

import numpy as np

def levenshtein(seq1, seq2):
    # source: https://github.com/vatsal220/medium_articles/blob/main/levenshtein_distance/lev_dist.ipynb
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] = distance between the first x items of seq1 and the first y items of seq2
    matrix = np.zeros((size_x, size_y))
    # first column / row: distance to the empty sequence
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            # a substitution costs nothing if the two items match
            cost = 0 if seq1[x-1] == seq2[y-1] else 1
            matrix[x, y] = min(
                matrix[x-1, y] + 1,         # deletion
                matrix[x-1, y-1] + cost,    # match / substitution
                matrix[x, y-1] + 1          # insertion
            )
    #ic(matrix)
    # np.zeros defaults to float, so the result is a float (e.g. 1.0)
    return matrix[size_x - 1, size_y - 1]

In [None]:
# Second implementation, using Levenshtein Python C extension module https://github.com/rapidfuzz/

from Levenshtein import distance

In [136]:
# 1. test: Simple one word
string1 = 'stamp'
string2 = 'stomp'
ic("1. test")
ic(levenshtein(string1, string2))
ic(distance(string1, string2))

# 2. test: Sentences, case, whitespace
string1 = 'Hello world, a longer example. With two sentences. Punctuation and upper-/lower case count. Spaces, too!'
string2 = 'Hello world, a longer example. With two sentences, punctuation and upper-/lower case count.    Spaces, too!'
ic("2. test")
ic(levenshtein(string1, string2))
ic(distance(string1, string2))

ic| '1. test'
ic| levenshtein(string1, string2): 1.0
ic| distance(string1, string2): 1
ic| '2. test'
ic| levenshtein(string1, string2): 5.0
ic| distance(string1, string2): 5


5
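
Raw distances grow with text length, so when LLM outputs of different lengths are benchmarked it may help to normalize the distance to a score between 0 and 1. The following is a minimal sketch using only the standard library. The function names `levenshtein_distance` and `similarity`, and the choice to divide by the length of the longer string, are assumptions made for illustration and are not part of the notebook or of the Levenshtein package.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Plain-Python Levenshtein distance that keeps only two rows of the matrix."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match / substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, lower for more edits."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance('stamp', 'stomp'))  # 1
print(similarity('stamp', 'stomp'))            # 0.8
```

Dividing by the longer length keeps the score in the range 0 to 1, because the distance can never exceed that length.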

# 2. Compare JSON - simple approach

In [137]:
# 1. test: JSON, converted to string - and sorted for better comparison results
string1 = json.dumps({'dict1': {'foo': 1, 'bar': 2}, 'dict2': {'baz': 'lorem ipsum', 'quux': 3}}, sort_keys = True)
string2 = json.dumps({'dict1': {'bar': 9, 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': 3}}, sort_keys = True)
ic("3. test")
ic(levenshtein(string1, string2))
ic(distance(string1, string2))

# 2. test: JSON, more complex, with arrays in different orders
string1 = json.dumps({'dict1': {'foo': 1, 'bar': [1,2,3,4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}, sort_keys = True)
string2 = json.dumps({'dict1': {'bar': [4,3,2,1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}}, sort_keys = True)
ic("4. test")
ic(levenshtein(string1, string2))
ic(distance(string1, string2))


ic| '3. test'
ic| levenshtein(string1, string2): 1.0
ic| distance(string1, string2): 1
ic| '4. test'
ic| levenshtein(string1, string2): 28.0
ic| distance(string1, string2): 28


28

In [138]:
# more details for the 2. test above (printed as '4. test'):
# data content is the same, but the order inside the arrays differs => high Levenshtein distance
print("string 1:")
print(json.dumps(json.loads(string1), indent=2))
print()
print("string 2:")
print(json.dumps(json.loads(string2), indent=2))

string 1:
{
  "dict1": {
    "bar": [
      1,
      2,
      3,
      4
    ],
    "foo": 1
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "is": 2,
        "this": 1
      },
      {
        "another": 3,
        "test": 2
      }
    ]
  }
}

string 2:
{
  "dict1": {
    "bar": [
      4,
      3,
      2,
      1
    ],
    "foo": 1
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "another": 3,
        "test": 2
      },
      {
        "is": 2,
        "this": 1
      }
    ]
  }
}


# 3. Flatten JSON before text comparison

As the previous test cases show, comparing two JSON objects as strings works fine for simple objects (no arrays). Serializing with `sort_keys=True` before the comparison removes differences that come only from key order.

But arrays, and objects nested inside arrays, lead to misleading Levenshtein distances: the same data in a different order produces a high distance.
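
The effect can be reproduced with the standard library alone. This is a small sketch with made-up sample objects: `sort_keys=True` makes the serialization independent of key order, but it leaves the order of array elements untouched.

```python
import json

a = json.dumps({'foo': 1, 'bar': [1, 2]}, sort_keys=True)
b = json.dumps({'bar': [1, 2], 'foo': 1}, sort_keys=True)  # same data, keys reordered
c = json.dumps({'bar': [2, 1], 'foo': 1}, sort_keys=True)  # same values, array reordered

print(a)       # {"bar": [1, 2], "foo": 1}
print(a == b)  # True: key order no longer matters
print(a == c)  # False: array order still does
```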

Idea:
- Flatten the JSON objects before the comparison
- Remove JSON hierarchy
- Resolve / isolate array values
- Truncate long strings => focus the comparison on the first n characters
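
The four steps above can be sketched without any third-party library. This is only an illustration under stated assumptions: the names `flatten_json` and `flat_text` are hypothetical, array elements are keyed by their position, and boolean and null values are skipped.

```python
def flatten_json(value, key=None):
    """Yield 'key: value' strings for every scalar leaf, dropping the hierarchy."""
    if isinstance(value, dict):
        for child_key, child in value.items():
            yield from flatten_json(child, child_key)
    elif isinstance(value, list):
        # array elements are keyed by their position in the array
        for index, child in enumerate(value):
            yield from flatten_json(child, index)
    elif isinstance(value, (str, int, float)) and not isinstance(value, bool):
        yield '{}: {}'.format(key, value)


def flat_text(data, truncation=9999):
    """Sort the flattened entries, truncate each one and join them into one string."""
    return ', '.join(entry[:truncation] for entry in sorted(flatten_json(data)))


sample = {'dict1': {'foo': 1, 'bar': [1, 2]}, 'dict2': {'baz': 'lorem ipsum dolor'}}
print(flat_text(sample))
# 0: 1, 1: 2, baz: lorem ipsum dolor, foo: 1
print(flat_text(sample, truncation=10))
# 0: 1, 1: 2, baz: lorem, foo: 1
```

Because only the innermost key is kept, entries from different branches can collide. Keeping the full key path instead would avoid that, at the cost of longer strings.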

In [98]:
from benedict import benedict

# source: https://stackoverflow.com/a/58611501

def flatten_dict(my_dict):
  # removes hierarchy and arrays in dict, returns a flat array of 'key: value' strings
  # (only the innermost key or array index is kept, not the full path)
  dict_benedict = benedict(my_dict)
  flat_item = []  # filled by the nested callback below, no 'global' needed
  def traverse_item(dct, key, value):
    # called for every key/value pair; keep scalar values only
    if type(value) in (str, int, float):
      flat_item.append('{}: {}'.format(key, value)) # string template for result
  dict_benedict.traverse(traverse_item)
  return flat_item

def array_to_sorted_string(string_array=(), string_truncation=9999):
  # sorts an array of strings, truncates each string and finally joins everything into one string
  return ', '.join(txt[:string_truncation] for txt in sorted(string_array))

In [106]:
# 1. test: Flatten a simple dict
d = {'dict1': {'foo': 1, 'bar': 2}, 'dict2': {'baz': 'hello very long text here which can be truncated by function', 'quux': 4}}
d_flat = flatten_dict(d)
print(json.dumps(d_flat, indent=2))
print(array_to_sorted_string(string_array=d_flat))
print(array_to_sorted_string(string_array=d_flat, string_truncation=20))

[
  "foo: 1",
  "bar: 2",
  "baz: hello very long text here which can be truncated by function",
  "quux: 4"
]
bar: 2, baz: hello very long text here which can be truncated by function, foo: 1, quux: 4
bar: 2, baz: hello very long, foo: 1, quux: 4


In [135]:
# 2. test: Complex dict with arrays (same data as the 2. JSON test above)
d = {'dict1': {'foo': 1, 'bar': [1,2,3,4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}
d_flat = flatten_dict(d)
print(json.dumps(d_flat, indent=2))
print(array_to_sorted_string(string_array=d_flat))

d = {'dict1': {'bar': [4,3,2,1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}}
d_flat = flatten_dict(d)
print(json.dumps(d_flat, indent=2))
print(array_to_sorted_string(string_array=d_flat))


[
  "foo: 1",
  "0: 1",
  "1: 2",
  "2: 3",
  "3: 4",
  "baz: lorem ipsum",
  "this: 1",
  "is: 2",
  "another: 3",
  "test: 2"
]
0: 1, 1: 2, 2: 3, 3: 4, another: 3, baz: lorem ipsum, foo: 1, is: 2, test: 2, this: 1
[
  "0: 4",
  "1: 3",
  "2: 2",
  "3: 1",
  "foo: 1",
  "baz: lorem ipsum",
  "another: 3",
  "test: 2",
  "this: 1",
  "is: 2"
]
0: 4, 1: 3, 2: 2, 3: 1, another: 3, baz: lorem ipsum, foo: 1, is: 2, test: 2, this: 1


# 4. Compare JSON - the final approach

In [144]:
# Let's take the two complex JSON objects from above (nested, arrays, order shuffled - both have the same data content):
d1 = {'dict1': {'foo': 1, 'bar': [1,2,3,4]}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'this': 1, 'is': 2}, {'another': 3, 'test': 2}]}}
print(json.dumps(d1, indent=2))
d2 = {'dict1': {'bar': [4,3,2,1], 'foo': 1}, 'dict2': {'baz': 'lorem ipsum', 'quux': [{'another': 3, 'test': 2}, {'this': 1, 'is': 2}]}}
print()
print(json.dumps(d2, indent=2))

{
  "dict1": {
    "foo": 1,
    "bar": [
      1,
      2,
      3,
      4
    ]
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "this": 1,
        "is": 2
      },
      {
        "another": 3,
        "test": 2
      }
    ]
  }
}

{
  "dict1": {
    "bar": [
      4,
      3,
      2,
      1
    ],
    "foo": 1
  },
  "dict2": {
    "baz": "lorem ipsum",
    "quux": [
      {
        "another": 3,
        "test": 2
      },
      {
        "this": 1,
        "is": 2
      }
    ]
  }
}


In [145]:
# ... and compare them (each dict is flattened separately):
d1_text = array_to_sorted_string(string_array=flatten_dict(d1))
print(d1_text)
d2_text = array_to_sorted_string(string_array=flatten_dict(d2))
print(d2_text)
# the reordered objects in 'quux' no longer contribute to the distance;
# the remaining distance comes from the reversed 'bar' array, whose values are keyed by index
print("Levenshtein distance:")
print(levenshtein(d1_text, d2_text))
print(distance(d1_text, d2_text))

0: 1, 1: 2, 2: 3, 3: 4, another: 3, baz: lorem ipsum, foo: 1, is: 2, test: 2, this: 1
0: 4, 1: 3, 2: 2, 3: 1, another: 3, baz: lorem ipsum, foo: 1, is: 2, test: 2, this: 1
Levenshtein distance:
4.0
4
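
A single distance value does not tell which fields differ. As a possible complement, the flattened lists can be compared entry by entry. The sketch below assumes lists of 'key: value' strings in the format produced by `flatten_dict`; the helper name `diff_entries` and the sample lists are made up for illustration.

```python
from collections import Counter


def diff_entries(flat_a, flat_b):
    """Hypothetical helper: entries found only in flat_a and only in flat_b."""
    count_a, count_b = Counter(flat_a), Counter(flat_b)
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


# literal lists in the 'key: value' format used in this notebook
flat1 = ['foo: 1', '0: 1', '1: 2', 'baz: lorem ipsum']
flat2 = ['0: 2', '1: 1', 'foo: 1', 'baz: lorem ipsum']

only_1, only_2 = diff_entries(flat1, flat2)
print(only_1)  # ['0: 1', '1: 2']
print(only_2)  # ['0: 2', '1: 1']
```

`Counter` is used instead of `set` so that duplicate entries, which can occur once the hierarchy is removed, are still counted.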


# BACKUP

In [107]:
d_flat = flatten_dict(json_result)
print(json.dumps(d_flat, indent=2))
print(array_to_sorted_string(string_array=d_flat))
print(array_to_sorted_string(string_array=d_flat, string_truncation=50))

[
  "Versicherungsnehmer_Name: Zorrsen Beteiligungsgesellschaft",
  "testB: 34",
  "key3: value3",
  "key1: value1",
  "key2: value2",
  "key4: value4",
  "key3: value3",
  "Versicherungssumme: 5500000"
]
Versicherungsnehmer_Name: Zorrsen Beteiligungsgesellschaft, Versicherungssumme: 5500000, key1: value1, key2: value2, key3: value3, key3: value3, key4: value4, testB: 34
Versicherungsnehmer_Name: Zorrsen Beteiligungsgese, Versicherungssumme: 5500000, key1: value1, key2: value2, key3: value3, key3: value3, key4: value4, testB: 34


In [113]:
json_target = {"Versicherungssumme": "5000000", "Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft mbH", "test": {"testA": [{"key1": "value1", "key2": "value2"}, {"key3": "value3", "key4": "value4"}], "testB": 34}}
json_result = {"Versicherungsnehmer_Name": "Zorrsen Beteiligungsgesellschaft", "test": {'testB': 34, "testA": [{"key3": "value3", "key1": "value1", "key2": "value2"}, {"key4": "value4", "key3": "value3"}]}, "Versicherungssumme": "5500000"}

json_target_str = json.dumps(json_target, sort_keys = True)
json_result_str = json.dumps(json_result, sort_keys = True)

ic(levenshtein(json_target_str,json_result_str))
ic(distance(json_target_str,json_result_str))

ic| levenshtein(json_target_str,json_result_str): 23.0
ic| distance(json_target_str,json_result_str): 23


23