# Step 6 Evaluation

If evaluation keys are available for the dataset, you can automatically run a set of evaluation metrics. These metrics primarily
assess the end-to-end result, although the scores also reflect the quality of the internal steps.

![step6](resources/Step6_Evaluate.png)

In [None]:
import opentldr
from datetime import datetime

## Parameters
OpenTLDR workflows use the notebook block tagged as "parameters" to inject variables (for example which scores to produce).

> **Do Not Change Variable Names in the Parameters Block.** You are welcome to change the values of these parameter variables, but please do not change their names. They are used elsewhere in the notebook and in other workflow processes.

In [None]:
# Workflow Parameters
eval_data_repo_config = {'repo_type': 'files', 'path': './sample_data/evalkey'}

sentence_embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

## Cosine Similarity of text embeddings
This is used to roughly estimate how similar two text blocks are to each other.
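
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch on hand-made vectors rather than model embeddings; the function name `cosine_similarity_plain` and the sample vectors are illustrative only and are not part of OpenTLDR or sentence-transformers.

```python
import math

def cosine_similarity_plain(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity_plain([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])
partial = cosine_similarity_plain([1.0, 1.0], [1.0, 0.0])
print(same_direction, orthogonal, round(partial, 4))
```

Vectors pointing the same way score 1.0 and orthogonal vectors score 0.0, which is why the value works as a rough similarity estimate for embeddings.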

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer(sentence_embedding_model)

def cosin_similarity(string_1:str, string_2:str) -> float:
    # Identical strings are maximally similar, so skip the model call.
    if string_1 == string_2:
        return 1.0

    # Compute the embeddings for each string.
    embedding_1 = model.encode(string_1, convert_to_tensor=True)
    embedding_2 = model.encode(string_2, convert_to_tensor=True)

    # Compute the cosine similarity of the two embeddings (a 1x1 result)
    # and convert it to a plain Python float.
    similarity = float(util.cos_sim(embedding_1, embedding_2).cpu().numpy()[0][0])

    return round(similarity, 4)

# Load the EvalKey entries from the data repository
EvalKeys indicate the correct answers, like a rubric, for the associated content and requests.

In [None]:
from opentldr import KnowledgeGraph, DataRepo

kg=KnowledgeGraph()

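# Remove previously loaded EvalKeys so that re-running this notebook does not duplicate them
# (comment this out if you want to keep existing EvalKeys)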
kg.delete_all_evalkeys()

if eval_data_repo_config is not None:
    repo = DataRepo(kg, eval_data_repo_config)
    list_of_uids =  repo.importData()
    print("Loaded {count} EvalKey nodes from the repository.".format(count=len(list_of_uids)))

# Score each Content/Request Combination Pairwise
Because the text blocks can be large, this step works from lists of uids and loads each node on demand. This reduces memory requirements at the cost of many small, slow queries.

In [None]:
from opentldr.Domain import Content, Request, EvalKey, TldrEntry, Recommendation, Summary

content_uids:list[str]=kg.get_all_node_uids_by_tag('Content')
request_uids:list[str]=kg.get_all_node_uids_by_tag('Request')

print("Evaluate pairwise {x} Content and {y} Request nodes.".format(x=len(content_uids),y=len(request_uids)))


## For each pair, compute the metrics
We are interested in the average across all pairs, so these values are accumulated as running totals and normalized later.
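
As a rough illustration of this accumulate-then-normalize pattern, the sketch below sums a per-pair agreement score and divides by the number of pairs at the end. The score pairs are made-up values and `agreement` is a hypothetical helper, not an OpenTLDR function.

```python
def agreement(key_score: float, recommendation_score: float) -> float:
    """1.0 when the two scores match, falling toward 0.0 as they diverge."""
    return 1.0 - abs(key_score - recommendation_score)

# Made-up (EvalKey score, recommendation score) pairs for illustration.
pairs = [(0.9, 0.8), (0.5, 0.5), (0.2, 0.6)]

running_total = 0.0
for key_score, recommendation_score in pairs:
    running_total += agreement(key_score, recommendation_score)

# Normalize only once, after every pair has been visited.
average = running_total / len(pairs) if pairs else 0.0
print(round(average, 3))
```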

In [None]:
tp:int = 0
fp:int = 0
tn:int = 0
fn:int = 0

selection:float = 0.0
focus:float = 0.0
reduction:float = 0.0
reduction_count:int = 0  # number of pairs that contributed a reduction value

content:Content = None
for c in content_uids:

    content = kg.get_content_by_uid(c)
    #print()
    #print ("content:\t",content.to_text())
    request:Request = None

    for r in request_uids:

        request= kg.get_request_by_uid(r)
        #print("request:\t",request.to_text())

        key:EvalKey = kg.cypher_query_one("""
            MATCH (c:Content) WHERE c.uid='{content_id}'
            MATCH (q:Request) WHERE q.uid='{request_id}'
            MATCH (c)<-[kc:KEY_FOR_CONTENT]-(k:EvalKey)-[kq:KEY_FOR_REQUEST]->(q)
            RETURN k """.format( content_id=c, request_id=r),"k")

        entry:TldrEntry = kg.cypher_query_one("""
            MATCH (c:Content) WHERE c.uid='{content_id}'
            MATCH (q:Request) WHERE q.uid='{request_id}'
            MATCH (q)<-[]-(tldr:Tldr)-[]->(e:TldrEntry)-[]->(r:Recommendation)-[]->(c)
            RETURN e """.format( content_id=c, request_id=r),"e")

        if entry is None:
            if key is None:
                # TRUE NEGATIVE
                tn += 1
                selection += 1.0
                focus +=  1.0
            else:
                # FALSE NEGATIVE
                #print ("evalkey:\t",key.to_text())
                fn += 1
                selection += 1.0 - key.score
                focus += 0.0
        else:
            #print ("tldr:\t",entry.to_text())

            recommendation:Recommendation = kg.get_recommendation(content,request)
            #print ("recommendation:\t",recommendation.to_text())

            summary:Summary = kg.get_summaries_by_recommendation(recommendation)[0]
            #print ("summary:\t",summary.to_text())

            if key is None:
                # FALSE POSITIVE
                fp += 1
                selection += 1.0 - recommendation.score
                focus += 1.0 - cosin_similarity(summary.text,request.text)
                reduction += 1.0 - (len(summary.text) / len(content.text))
                reduction_count += 1
            else:
                # TRUE POSITIVE
                #print ("evalkey:\t",key.to_text())
                tp += 1
                selection += 1.0 - abs(key.score - recommendation.score)
                focus += cosin_similarity(key.text,summary.text)
                reduction += 1.0 - (len(summary.text) / len(content.text))
                reduction_count += 1

total:int = tp+fp+tn+fn


# Evaluation Results

## Totals / Rates
Confusion matrix of counts and relative rates for the pairwise checks of Request and Content nodes:
- True Positives (tp): Content was included in the TLDR for a Request, AND an EvalKey exists for this pair.
- False Positives (fp): Content was included in the TLDR for a Request, but there was no corresponding EvalKey.
- True Negatives (tn): Content was skipped for this Request, AND there was no corresponding EvalKey.
- False Negatives (fn): Content was skipped for this Request, but an EvalKey exists for this pair.
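
The four cases above reduce to two yes/no questions per pair. The sketch below shows that mapping on made-up observations; `classify_pair` and the sample data are illustrative and not part of OpenTLDR.

```python
from collections import Counter

def classify_pair(in_tldr: bool, has_evalkey: bool) -> str:
    """Map one Content/Request pair onto a confusion-matrix cell."""
    if in_tldr:
        return "tp" if has_evalkey else "fp"
    return "fn" if has_evalkey else "tn"

# Made-up (in_tldr, has_evalkey) observations for five pairs.
observations = [
    (True, True),
    (True, False),
    (False, False),
    (False, True),
    (False, False),
]

counts = Counter(classify_pair(in_tldr, has_key) for in_tldr, has_key in observations)
print(dict(counts))
```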

## Metrics
Computed assessments for end-to-end result:
- Precision, Accuracy, Recall, and F1 Score: computed with the standard formulas (see Wikipedia) from the confusion matrix above.
- Selection: The average agreement between the recommendation scores and the scores provided in the EvalKeys (1.0 minus the absolute difference, treating a missing score as 0, so higher is better).
- Focus: The cosine similarity between the produced summary and the ideal (manually created) summary in the EvalKeys.
- Reduction: The fractional reduction in size (measured in characters) from the original content to the produced summary.
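
As a worked example of these formulas, the sketch below computes the confusion-matrix metrics and the reduction for made-up inputs. The helper names and numbers are illustrative assumptions, and returning 0.0 for an empty denominator is a choice made here for the sketch, not something the notebook prescribes.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard metrics derived from confusion-matrix counts."""
    def safe_div(numerator: float, denominator: float) -> float:
        # Assumption: report 0.0 rather than raising when nothing was counted.
        return numerator / denominator if denominator else 0.0

    total = tp + fp + tn + fn
    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "accuracy": safe_div(tp + tn, total),
        "f1_score": safe_div(2 * tp, 2 * tp + fp + fn),
    }

def reduction_ratio(summary_text: str, content_text: str) -> float:
    """Fraction of characters removed when going from content to summary."""
    return 1.0 - (len(summary_text) / len(content_text))

metrics = confusion_metrics(tp=6, fp=2, tn=10, fn=2)
reduction = reduction_ratio("x" * 50, "x" * 200)
print(metrics, reduction)
```

The F1 expression used here, `2*tp / (2*tp + fp + fn)`, is algebraically the same as `tp / (tp + 0.5 * (fp + fn))`.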

In [None]:
print("Totals:\n",
            " n:\t{x}\n".format(x=total),
            "tp:\t{x}\n".format(x=tp),
            "tn:\t{x}\n".format(x=tn),
            "fp:\t{x}\n".format(x=fp),
            "fn:\t{x}\n".format(x=fn))

if total > 0:
      # The ".1%" format multiplies the fraction by 100 and appends a percent sign.
      print("Rates:\n",
            "tp:\t{x:.1%}\n".format(x=tp/total),
            "tn:\t{x:.1%}\n".format(x=tn/total),
            "fp:\t{x:.1%}\n".format(x=fp/total),
            "fn:\t{x:.1%}\n".format(x=fn/total))
      
      if (tp > 0.0):
            precision:float = tp / (tp + fp)
            accuracy:float = (tp + tn) / total
            recall:float = tp / (tp + fn)
            f1_score:float = tp / (tp + (0.5 * (fp + fn)))

            reduction_avg:float = reduction / reduction_count
            selection_avg:float = selection / total
            focus_avg:float = focus / total

            print("Metrics:\n",
                  "precision:\t{x:.3f}\n".format(x=precision),
                  "accuracy: \t{x:.3f}\n".format(x=accuracy),
                  "recall:   \t{x:.3f}\n".format(x=recall),
                  "f1_score: \t{x:.3f}\n".format(x=f1_score),
                  "selection:\t{x:.3f}\n".format(x=selection_avg),
                  "focus:    \t{x:.3f}\n".format(x=focus_avg),
                  "reduction:\t{x:.3f}\n".format(x=reduction_avg)
            )
      else:
            print("No true-positive results, skipping metrics.")
else:
      print ("No results.")

In [None]:
kg.close()