# Coherence Check

Goal: We have a set of generated text samples. We want to keep the ones that are coherent and discard the rest.

Not a Goal: Checking for semantic correctness.

## Possible Approaches for Coherence Checks

1. [**Recommended**] Use a different language model to calculate the perplexity of each sentence, and apply a threshold to select only the coherent variants. Use a LM fine-tuned on your training corpus to make sure that the perple

2. Use dependency parsing from spaCy to find conditions/patterns that coherent sentences meet but incoherent sentences fail. Common example: the root verb in the sentence should be directly connected to the subject, and there should be no dangling clauses.

3. For longer text generation, additionally train with the next sentence prediction task. Generate multiple next sentences and use the [CLS] embedding plus a classifier to mark each sentence as coherent or not.

> We encode each sentence by adding [CLS] token to the last position, and feed the hidden state of this token to a double dot-product regression model. The final output is from a logistic regression predicting if the two sentences come from the same paragraph or not.
> - From [Improving Language Generation with Sentence Coherence Objective](https://www.arxiv-vanity.com/papers/2009.06358/)

In [14]:
# TODO
# Add example of perplexity change using GPT-2 or T5
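
A minimal sketch of the perplexity approach (option 1 above), assuming the off-the-shelf `gpt2` checkpoint from Hugging Face `transformers` rather than a model fine-tuned on your corpus. The helper names (`perplexity`, `filter_coherent`, `gpt2_perplexity`) and any threshold value are hypothetical and would need tuning on your own data. Perplexity here is the exponential of the mean negative log-likelihood per token; lower values mean the model finds the text more predictable.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_coherent(scored, max_perplexity):
    """Keep sentences whose perplexity is at or below the (hypothetical) threshold.

    `scored` is an iterable of (sentence, perplexity) pairs.
    """
    return [sentence for sentence, ppl in scored if ppl <= max_perplexity]


def gpt2_perplexity(text, model_name="gpt2"):
    """Score `text` with a pretrained GPT-2.

    Assumes `torch` and `transformers` are installed. The imports are local so
    the helpers above work without them. The model is reloaded on every call,
    which is slow; load it once and reuse it for real workloads. `text` must
    tokenize to at least two tokens, since the first token has no context.
    """
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy
        # (in nats) over the predicted tokens.
        loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())
```

To use it, score each candidate with `gpt2_perplexity`, collect `(sentence, score)` pairs, and pass them to `filter_coherent` with a threshold chosen by inspecting scores on sentences you already know to be coherent.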

In [3]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

In [12]:
doc = nlp("Stitch in time, saves Nine")
options = {"bg": "#09a3d5", "color": "white"}
displacy.render(doc, style="dep", options=options)

In [13]:
doc = nlp("Stitch in time")
displacy.render(doc, style="dep", options=options)

Notice that when only the phrase/clause is used, the verb "Stitch" does not have a subject, whereas in the previous sentence it does, via `conj` (conjunction).

As an example of a rule/filter, we can enforce the constraint that every verb needs to have a subject.
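
The rule above could be sketched roughly as follows. This is an illustration, not a tested filter: the function names are hypothetical, the set of subject labels is an assumption based on the dependency labels used by spaCy's English models, and letting a verb inherit a subject through `conj` is a design choice made here so that coordinated verbs are not flagged. The functions only read `pos_`, `dep_`, `head`, and `children`, so they accept a spaCy `Doc` or any iterable of token-like objects.

```python
# Dependency labels treated as "subject" (an assumption; adjust for your model).
SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "expl"}


def has_subject(token):
    """Return True if the verb, or a verb it is conjoined to, has a subject child."""
    current = token
    visited = set()
    while True:
        if any(child.dep_ in SUBJECT_DEPS for child in current.children):
            return True
        # Follow `conj` links upward so a coordinated verb can share the
        # subject of the verb it is attached to.
        if (
            current.dep_ == "conj"
            and current.head is not current
            and id(current) not in visited
        ):
            visited.add(id(current))
            current = current.head
        else:
            return False


def verbs_without_subject(doc):
    """List the VERB tokens in `doc` for which no subject was found."""
    return [tok for tok in doc if tok.pos_ == "VERB" and not has_subject(tok)]


def passes_subject_rule(doc):
    """True if every verb in `doc` has a subject under the rule above."""
    return not verbs_without_subject(doc)


# Usage with spaCy (assumes the en_core_web_sm model is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(passes_subject_rule(nlp("Stitch in time")))
```

Note that imperative sentences such as "Close the door." legitimately have no explicit subject, so a rule like this will reject them; whether that is acceptable depends on the kind of text being generated.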