# 🔎 FHIR Difference: Quality Inspection 🔎

Evaluating whether a generated/predicted FHIR resource is accurate is difficult because FHIR is verbose.
The note_to_fhir.evaluation module makes this easier with:
- A <b>FhirScore</b> object that tracks the number of changes/similarities w.r.t. the ground truth throughout the whole depth of the FHIR resource
- A <b>FhirDiff</b> (tree) object that contains all relevant details of the FHIR comparison, including the FhirScore at each level
- Visualizations for the above FhirDiff and FhirScore

In [None]:
import sys
sys.path.append("..")

In [None]:
import json
import os

from datasets import load_dataset

from healthsageai.note_to_fhir.evaluation.datamodels import FhirScore, FhirDiff
from healthsageai.note_to_fhir.evaluation.utils import get_diff, diff_to_list, diff_to_dataframe, compare_leaf, get_resource_details, get_resource_class
from healthsageai.note_to_fhir.evaluation.visuals import show_diff

In [None]:
testset = load_dataset("healthsage/example_fhir_output")

In [None]:
fhir_true = json.loads(testset["train"]["fhir_true"][0])
fhir_pred = json.loads(testset["train"]["note_to_fhir"][0])

### 🌲 FHIR JSON trees
A dictionary representation of FHIR has a tree structure. On the root level, we find the dataset, which consists of bundles. Each bundle has entries with resources that can contain nested resources of an arbitrary depth. E.g.:

- Dataset
  - Bundle
    - BundleEntry
      - Resource (Encounter, Condition, Patient, etc.)
        - Resource element
          - Nested Resource
            - Nested Resource
    - BundleEntry
      - ...
  - Bundle

The note_to_fhir.evaluation module follows this tree structure.
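
To make the tree structure concrete, here is a minimal sketch that walks a nested dict/list and lists every leaf with its path. The helper `iter_leaves` and the trimmed `entry` are hypothetical and only illustrate the idea; they are not part of note_to_fhir.evaluation or the example dataset.

```python
from typing import Any, Iterator, Tuple


def iter_leaves(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{path}[{index}]")
    else:
        # Anything that is not a dict or list is treated as a leaf.
        yield path, node


# Hypothetical, heavily trimmed bundle entry (not taken from the dataset).
entry = {
    "resource": {
        "resourceType": "Encounter",
        "status": "finished",
        "participant": [{"period": {"start": "2020-01-01T10:00:00Z"}}],
    }
}

leaves = list(iter_leaves(entry))
for leaf_path, leaf_value in leaves:
    print(leaf_path, "=", leaf_value)
```

Each printed line corresponds to one leaf node, which is the unit the comparison operates on.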

### 🍀 Evaluating leaf nodes

The most basic comparison is that of two leaf nodes. A FhirScore object can represent the difference between two leaf nodes.

### 🚦The FhirScore Object

In [None]:
score = FhirScore()
score

#### 🍀 Base example: comparing a and b with compare_leaf

In [None]:
print(compare_leaf.__doc__)

In [None]:
?compare_leaf

In [None]:
compare_leaf("a", "a")

In [None]:
compare_leaf("a", "b")

In [None]:
compare_leaf("a", None)

In [None]:
compare_leaf(None, "a")
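
The four calls above cover the four possible outcomes for a pair of leaves. The sketch below names those outcomes explicitly. It is an illustration only: `classify_leaf` is a hypothetical helper, and the argument order (ground truth first, prediction second) is an assumption borrowed from how `get_diff` is called in this notebook, not a statement about how `compare_leaf` is implemented.

```python
from typing import Any, Optional


def classify_leaf(true_value: Optional[Any], pred_value: Optional[Any]) -> str:
    """Classify a (ground truth, prediction) leaf pair. Hypothetical helper."""
    if true_value is None and pred_value is None:
        return "empty"  # nothing to compare on either side
    if pred_value is None:
        return "deletion"  # present in the ground truth, missing in the prediction
    if true_value is None:
        return "addition"  # present in the prediction only
    return "match" if true_value == pred_value else "modification"


for pair in [("a", "a"), ("a", "b"), ("a", None), (None, "a")]:
    print(pair, "->", classify_leaf(*pair))
```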

### 🧮 You can add / sum FhirScores together
The score of a given element in the FHIR tree is essentially the sum of the scores of its parts. Therefore, you can add FhirScores.

In [None]:
score_a = compare_leaf("a", "a")
score_b = compare_leaf("a", "b")

In [None]:
score_a + score_b

In [None]:
sum([compare_leaf("a", "a"), compare_leaf("a", "b")])

In [None]:
sum([compare_leaf("a", "a")])
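
For readers wondering why both `+` and the built-in `sum` work on a score object: `sum` starts from the integer `0`, so a summable class needs `__radd__` in addition to `__add__`. The sketch below shows that pattern on a hypothetical `LeafCounts` dataclass with made-up field names; it is not the FhirScore implementation.

```python
from dataclasses import dataclass


@dataclass
class LeafCounts:
    """Hypothetical stand-in for a score object; not the real FhirScore."""
    n_leaves: int = 0
    n_matches: int = 0
    n_modifications: int = 0

    def __add__(self, other):
        if not isinstance(other, LeafCounts):
            return NotImplemented
        return LeafCounts(
            self.n_leaves + other.n_leaves,
            self.n_matches + other.n_matches,
            self.n_modifications + other.n_modifications,
        )

    def __radd__(self, other):
        # sum() begins with the integer 0, so 0 + LeafCounts must be supported.
        if other == 0:
            return self
        return NotImplemented


identical = LeafCounts(n_leaves=1, n_matches=1)
changed = LeafCounts(n_leaves=1, n_modifications=1)

total = sum([identical, changed])
print(total)
```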

### ↔️ The FhirDiff object

Sample data

💬 Example ground-truth and predicted Encounter

In [None]:
encounter_true, encounter_pred = fhir_true['entry'][0], fhir_pred['entry'][0]

In [None]:
encounter_true

In [None]:
encounter_pred

In [None]:
print(get_diff.__doc__)

In [None]:
diff = get_diff(encounter_true, encounter_pred, resource_type="BundleEntry")

In [None]:
encounter_true

In [None]:
diff.score

### The score shows:
- There are 16 leaf elements in both resources:
  - 7 were identical
  - 6 are missing in the prediction (n_deletions). This impacts the recall score
  - 3 are changed (n_modifications). This impacts accuracy and precision
  - 0 are added (n_additions), meaning there are no hallucinations
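
As a worked example, the counts above can be turned into ratios by hand. The formulas below are an assumption: they are one common way to define accuracy, precision and recall from such counts, and the FhirScore object may define them differently. Only the four counts come from the output above.

```python
# Counts reported by diff.score in this example.
n_matches = 7
n_deletions = 6
n_modifications = 3
n_additions = 0

n_leaves = n_matches + n_deletions + n_modifications + n_additions

# Assumed definitions (not necessarily those used by FhirScore):
# accuracy: share of all leaves that are identical.
accuracy = n_matches / n_leaves
# precision: share of predicted leaves that are correct.
precision = n_matches / (n_matches + n_modifications + n_additions)
# recall: share of ground-truth leaves that were predicted correctly.
recall = n_matches / (n_matches + n_modifications + n_deletions)

print(f"n_leaves={n_leaves}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```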

### We can navigate the Diff tree and see the score at each node, for instance at the "participant" field

In [None]:
diff.children['resource'].children['participant'][0].score
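
The access pattern above suggests that `children` is a dict whose values are either a single node or a list of nodes. Under that assumption, every node can be visited with a short recursive walk. The `Node` class and the scores in the sketch below are hypothetical stand-ins, not the real FhirDiff or real results.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple, Union


@dataclass
class Node:
    """Hypothetical stand-in for a diff node with a score and children."""
    score: float
    children: Dict[str, Union["Node", List["Node"]]] = field(default_factory=dict)


def walk(node: Node, path: str = "root") -> Iterator[Tuple[str, float]]:
    """Yield (path, score) for a node and all of its descendants."""
    yield path, node.score
    for key, child in node.children.items():
        if isinstance(child, list):
            # Repeating elements (such as participant) are stored as lists.
            for index, item in enumerate(child):
                yield from walk(item, f"{path}.{key}[{index}]")
        else:
            yield from walk(child, f"{path}.{key}")


# Made-up scores, for illustration only.
tree = Node(0.44, {"resource": Node(0.44, {"participant": [Node(0.33)], "status": Node(0.0)})})

paths = [node_path for node_path, _ in walk(tree)]
for node_path, node_score in walk(tree):
    print(node_path, node_score)
```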

### 📊 Visualization

In [None]:
print(show_diff.__doc__)

In [None]:
show_diff(diff)

### The visualization clearly shows where the differences are:

- Red marks leaves with 0% accuracy
- Blue marks leaves with 100% accuracy
- Colors in between represent intermediate accuracy values
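
A color scale like the one described above can be sketched as a linear blend between red and blue. This is an assumption about the general technique only: `accuracy_to_hex` is a hypothetical helper, and `show_diff` may use a different color map.

```python
def accuracy_to_hex(accuracy: float) -> str:
    """Map an accuracy in [0, 1] to a hex color between red (0) and blue (1)."""
    clamped = min(max(accuracy, 0.0), 1.0)  # guard against out-of-range input
    red = round(255 * (1 - clamped))
    blue = round(255 * clamped)
    return f"#{red:02x}00{blue:02x}"


for value in (0.0, 0.25, 0.75, 1.0):
    print(value, accuracy_to_hex(value))
```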

### Learning from this example:

- The period in which the participant was present in the encounter differed by 36 seconds, a negligible difference.
- The participant type was not predicted, resulting in 4 "missing" nodes.
- The period of the Encounter itself was not predicted, although it was identical to the period of the participant.
- The predicted status was "unknown", whereas the actual status was "finished".

### Visualization of the entire Bundle

In [None]:
diff = get_diff(fhir_true, fhir_pred, resource_type="Bundle")

In [None]:
show_diff(diff)

### 📊 Visualizing Diffs in bar charts

In [None]:
df = diff_to_dataframe(diff)

In [None]:
df.head()

In [None]:
scores_per_type = df[['resource_type','score']].groupby('resource_type').sum()

In [None]:
scores_per_type['accuracy'] = scores_per_type['score'].apply(lambda x: x.accuracy)

In [None]:
scores_per_type

In [None]:
scores_per_type.plot(kind='bar')

### Future developments:
- Mapping corresponding arrays whose elements are not in identical order
- Weighted metrics: scores are currently weighted by n_leaves, but could also be weighted by resource element or bundle
- FHIR validation