# Monolingual sentence comparison

Although this library was created to compare syntactic structures between languages, it can also
 be used to compare two sentences in the same language. This is useful, for instance, when comparing different
 translations of the same text, comparing a post-edited or revised version with the original, or comparing a machine
 translation with a reference translation.

In this case, we do not necessarily need Universal Dependencies (UD), which we otherwise rely on to make annotations
 comparable across languages. So instead of using stanza, which uses the UD annotation schema, we can use any other
 parser as well. This library provides built-in support for stanza and spaCy, so in this example we will make use of
 spaCy.

In [1]:
!pip install astred[spacy]



In [2]:
from astred import AlignedSentences, Sentence
from astred.utils import load_parser

# Just for this notebook, we do not want to be bothered with spaCy's UserWarnings
import warnings
warnings.simplefilter("ignore", UserWarning)

The default English spaCy models do not make use of Universal Dependencies, but since we are comparing two sentences in
 the same language, parsed with the same parser, that is not an issue: the tags and labels are comparable.

When using spaCy, models must be downloaded manually, though. When you use stanza in `astred` you can simply provide
 the language code and the required models will be downloaded behind the scenes. That is not possible with spaCy.


In [3]:
# Download a default English spaCy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


The only difference from what is shown in the other examples is that here we explicitly pass an
 initialized spaCy parser to `.from_text()` instead of a language code. We could have written
 `.from_text("<text>", "en")`, but then the parser would be loaded separately for each of the two sentences. That is
 quite some overhead! If you need to parse multiple sentences with the same parser (language), it is best to first
 create the parser and pass that parser to `.from_text()`, as we do in this example.
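
The effect of reusing one parser can be illustrated with a generic caching sketch. It does not show how `astred` works
 internally: `get_parser` and its whitespace tokenizer are hypothetical stand-ins for an expensive call such as
 `spacy.load()`, chosen so that the sketch runs without any model installed.

```python
from functools import lru_cache

load_calls = 0  # counts how often the (expensive) loader really runs


@lru_cache(maxsize=None)
def get_parser(model_name):
    """Hypothetical stand-in for an expensive loader such as spacy.load()."""
    global load_calls
    load_calls += 1
    # A real loader would return an NLP pipeline; a whitespace tokenizer stands in here
    return lambda text: text.split()


sentences = [
    "I saw the director hiding a lot of documents at night !",
    "Last night , the director was hiding a lot of our papers .",
]

# Every sentence asks for the parser, but the cache makes sure it is only built once
parsed = [get_parser("en_core_web_sm")(text) for text in sentences]

print("Loader calls:", load_calls)
print("Tokens per sentence:", [len(tokens) for tokens in parsed])
```

Creating the parser once and passing it around, as the notebook does, achieves the same thing without a cache.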

The spaCy parser can be created with `spacy.load()`, but as an example we will use the method `load_parser()`, which
 can also be used to initialize a stanza parser.

In [4]:
nlp = load_parser("en_core_web_sm", "spacy")
src_sent = Sentence.from_text("I saw the director hiding a lot of documents at night !", nlp)
tgt_sent = Sentence.from_text("Last night , the director was hiding a lot of our papers .", nlp)
aligns = "2-3 3-4 4-5 4-6 5-7 6-8 7-9 8-10 8-11 9-0 9-1 10-0 10-1 11-12"

aligned = AlignedSentences(src_sent, tgt_sent, word_aligns=aligns)

You'll notice that not all words can be aligned, perhaps because different translators chose to express a different
 perspective. In particular, "I saw" on the source side is not aligned. If a word is not aligned, it is implicitly
 connected to a NULL word. After creating the `AlignedSentences` object, the source and target sentence each receive a
 NULL element at the front, to which unaligned words are then connected. For a given word you can check whether it is
 aligned with `is_aligned` and, if it is, you can easily get its aligned words. To iterate over the words of a sentence
 without including the NULL word, we can use `Sentence.no_null_words`.
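
To see which words end up unaligned, the alignment string can also be inspected by hand. The following is a minimal
 sketch in plain Python that does not use `astred`; `parse_alignments` is a hypothetical helper, and the indices are
 assumed to be zero-based positions in the whitespace-tokenized sentences, before any NULL element is added.

```python
src_tokens = "I saw the director hiding a lot of documents at night !".split()
tgt_tokens = "Last night , the director was hiding a lot of our papers .".split()
aligns = "2-3 3-4 4-5 4-6 5-7 6-8 7-9 8-10 8-11 9-0 9-1 10-0 10-1 11-12"


def parse_alignments(align_str):
    """Turn 'i-j' pairs into a list of (source index, target index) tuples."""
    pairs = []
    for pair in align_str.split():
        src_idx, tgt_idx = pair.split("-")
        pairs.append((int(src_idx), int(tgt_idx)))
    return pairs


pairs = parse_alignments(aligns)
aligned_src = {src_idx for src_idx, _ in pairs}
aligned_tgt = {tgt_idx for _, tgt_idx in pairs}

# Words whose index never occurs in the alignments are the ones that attach to NULL
unaligned_src = [tok for idx, tok in enumerate(src_tokens) if idx not in aligned_src]
unaligned_tgt = [tok for idx, tok in enumerate(tgt_tokens) if idx not in aligned_tgt]

print("Unaligned source words:", unaligned_src)  # ['I', 'saw']
print("Unaligned target words:", unaligned_tgt)  # [',']
```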

In this example we will display for each source word its aligned target words.

In [5]:
for src_word in src_sent.no_null_words:
	if not src_word.is_aligned:
		continue
	print(src_word.text, " ".join([tgt_word.text for tgt_word in src_word.aligned]))

the the
director director
hiding was hiding
a a
lot lot
of of
documents our papers
at Last night
night Last night
! .


You can also easily find whether any of a word's tags have changed compared to its aligned words:

In [6]:
for src_word in src_sent.no_null_words:
	if not src_word.is_aligned:
		continue

	print("DEPENDENCIES")
	for tgt_id, change in src_word.changes("deprel").items():
		tgt_word = tgt_sent[tgt_id]
		print("CHANGE:" if change else "SAME:", f"{src_word.text} ({src_word.deprel})", f"{tgt_word.text} ({tgt_word.deprel})")

	print("PART-OF-SPEECH")
	for tgt_id, change in src_word.changes("xpos").items():
		tgt_word = tgt_sent[tgt_id]
		print("CHANGE:" if change else "SAME:", f"{src_word.text} ({src_word.xpos})", f"{tgt_word.text} ({tgt_word.xpos})")
	print("---")


DEPENDENCIES
SAME: the (det) the (det)
PART-OF-SPEECH
SAME: the (DT) the (DT)
---
DEPENDENCIES
SAME: director (nsubj) director (nsubj)
PART-OF-SPEECH
SAME: director (NN) director (NN)
---
DEPENDENCIES
CHANGE: hiding (ccomp) was (aux)
CHANGE: hiding (ccomp) hiding (ROOT)
PART-OF-SPEECH
CHANGE: hiding (VBG) was (VBD)
SAME: hiding (VBG) hiding (VBG)
---
DEPENDENCIES
SAME: a (det) a (det)
PART-OF-SPEECH
SAME: a (DT) a (DT)
---
DEPENDENCIES
SAME: lot (dobj) lot (dobj)
PART-OF-SPEECH
SAME: lot (NN) lot (NN)
---
DEPENDENCIES
SAME: of (prep) of (prep)
PART-OF-SPEECH
SAME: of (IN) of (IN)
---
DEPENDENCIES
CHANGE: documents (pobj) our (poss)
SAME: documents (pobj) papers (pobj)
PART-OF-SPEECH
CHANGE: documents (NNS) our (PRP$)
SAME: documents (NNS) papers (NNS)
---
DEPENDENCIES
CHANGE: at (prep) Last (amod)
CHANGE: at (prep) night (npadvmod)
PART-OF-SPEECH
CHANGE: at (IN) Last (JJ)
CHANGE: at (IN) night (NN)
---
DEPENDENCIES
CHANGE: night (pobj) Last (amod)
CHANGE: night (pobj) night (npadvmod)
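
What the loop above reads from `changes()` is a mapping from a target word's index to a flag that says whether the
 given attribute differs. The sketch below imitates that shape in plain Python. `Tok` and `attribute_changes` are
 hypothetical names, not part of `astred`, and the tags are copied from the output above for the word "hiding".

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed word with a dependency label and a POS tag."""
    text: str
    deprel: str
    xpos: str


def attribute_changes(src_tok, aligned_tgt, attr):
    """Map each aligned target index to True if `attr` differs from the source word."""
    return {
        tgt_id: getattr(src_tok, attr) != getattr(tgt_tok, attr)
        for tgt_id, tgt_tok in aligned_tgt.items()
    }


src_tok = Tok("hiding", "ccomp", "VBG")
# Target words aligned to "hiding", keyed by their (illustrative) position in the target sentence
aligned_tgt = {5: Tok("was", "aux", "VBD"), 6: Tok("hiding", "ROOT", "VBG")}

deprel_changes = attribute_changes(src_tok, aligned_tgt, "deprel")
xpos_changes = attribute_changes(src_tok, aligned_tgt, "xpos")

print(deprel_changes)  # {5: True, 6: True}
print(xpos_changes)    # {5: True, 6: False}
```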


Sentences can be represented as linguistic "trees". Here, we use dependency trees to formalize the structure of
 the sentences, so it is interesting to find differences between these structures. This is typically done with
 tree edit distance, but in our paper we suggest ASTrED (aligned syntactic tree edit distance), which also takes word
 alignment information into account during the tree edit distance calculation.

We can also quite easily check which structural changes are needed to convert the source dependency tree into the
 target tree. The operations required for the conversion are stored in `aligned.ted_ops`, which is a list of tuples,
 each containing a source and a target sub-`Tree`. As the cell below shows, either side of a tuple can be empty.
 The comments in that cell explain this further. Note again that this is not calculated with regular tree edit
 distance but with ASTrED. For the argumentation behind ASTrED, see our paper; the main goal is to ensure that only
 aligned elements can match each other, so that "accidental" structural overlap does not bias the outcome.

In [7]:
print("Edit distance:", aligned.ted)
for operation in aligned.ted_ops:
	src_node, tgt_node = operation
	src_text = src_node.node.text if src_node else None
	tgt_text = tgt_node.node.text if tgt_node else None

	# If both a source and target element are present in this operation...
	if src_text and tgt_text:
		# ... that can mean they match, or ...
		if src_text == tgt_text:
			print("MATCH", src_text, "===", tgt_text)
		# ... that a source element has been replaced by a target element
		else:
			print("SUBSTITUTION", src_text, "-->", tgt_text)
	# If only a source element is present, and no target, then that means the source element was deleted
	elif src_text:
		print("DELETION:", f"{src_text} (src)")
	# If only a target element is present, and no source, then that means the target element was inserted
	elif tgt_text:
		print("INSERTION:", f"{tgt_text} (tgt)")

Edit distance: 8
SUBSTITUTION saw --> hiding
DELETION: hiding (src)
DELETION: at (src)
DELETION: night (src)
INSERTION: was (tgt)
INSERTION: night (tgt)
INSERTION: Last (tgt)
SUBSTITUTION I --> ,
MATCH director === director
MATCH the === the
MATCH lot === lot
MATCH a === a
MATCH of === of
INSERTION: papers (tgt)
SUBSTITUTION documents --> our
SUBSTITUTION ! --> .
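
Because each entry pairs a source node with a target node, and the cell above treats a missing side as a deletion or
 an insertion, the operations can also be tallied by type. The sketch below works on plain `(source, target)` text
 pairs copied from the output above, with `None` standing for a missing side. `classify` is a hypothetical helper, not
 part of `astred`, and it labels operations by comparing word text exactly as the cell above does.

```python
from collections import Counter

# (source text, target text) pairs copied from the output above; None marks a missing side
ops = [
    ("saw", "hiding"), ("hiding", None), ("at", None), ("night", None),
    (None, "was"), (None, "night"), (None, "Last"), ("I", ","),
    ("director", "director"), ("the", "the"), ("lot", "lot"), ("a", "a"),
    ("of", "of"), (None, "papers"), ("documents", "our"), ("!", "."),
]


def classify(src_text, tgt_text):
    """Label one operation based on which sides are present and whether the texts are equal."""
    if src_text is not None and tgt_text is not None:
        return "MATCH" if src_text == tgt_text else "SUBSTITUTION"
    if src_text is not None:
        return "DELETION"
    return "INSERTION"


counts = Counter(classify(src_text, tgt_text) for src_text, tgt_text in ops)
for op_type in ("MATCH", "SUBSTITUTION", "DELETION", "INSERTION"):
    print(op_type, counts[op_type])
```

The tally is purely text-based, so it should not be read as a breakdown of the reported edit distance.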


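Because we have access to the underlying Token that spaCy produced, we can now also do some pretty neat stuff, like
 calculating the semantic similarity between a source word and its aligned word(s). In this case, we can see that
 there is a reasonable similarity between "documents" and "papers". Keep in mind that `en_core_web_sm` does not ship
 with word vectors, so these scores are derived from context-sensitive tensors and are only a rough indication.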

In [8]:
# Loop over the source words as before
for src_word in src_sent.no_null_words:
	# Skip words that are not aligned
	if not src_word.is_aligned:
		continue

	spacy_src = src_word._word
	for tgt_word in src_word.aligned:
		spacy_tgt = tgt_word._word
		print(src_word.text, tgt_word.text, spacy_src.similarity(spacy_tgt))



the the 1.0
director director 1.0
hiding was 0.16484383
hiding hiding 1.0
a a 1.0
lot lot 1.0
of of 1.0
documents our -0.00042243474
documents papers 0.6571306
at Last 0.17158762
at night -0.031542435
night Last 0.010333299
night night 1.0
! . 0.91295487
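
The scores above are cosine similarities between the vector representations of two tokens. As a minimal sketch of
 that measure in plain Python: `cosine_similarity` is a hypothetical helper, and the vectors are made-up toy values,
 not vectors taken from the spaCy model.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, purely illustrative
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])

print(same_direction, orthogonal, opposite)
```

Vectors pointing in the same direction score close to 1, unrelated (orthogonal) vectors score 0, and opposite
 vectors score close to -1, which is why a word compared with itself yields 1.0 in the output above.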
