# This notebook will attempt to show how to compare two models

We often end up with several different models without being sure which one to focus on or start from for a particular task.
This notebook aims to introduce some tools that (hopefully) help us do that.

Currently, the direct comparison is only between two CAT objects (or model packs).
We also need to provide the documents we wish to look at (in CSV form, id and text columns).
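
As a minimal sketch of the expected input, the snippet below writes a tiny documents file with the two columns and reads it back. The column names `id` and `text`, the file name, and the example texts are assumptions made for illustration; check what the loader in `compare` actually expects before relying on them.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical example rows: (document id, document text)
rows = [
    ("doc_0", "Patient complains of headache."),
    ("doc_1", "No history of diabetes or asthma."),
]

# Assumed column names: "id" and "text"
out_path = Path(tempfile.gettempdir()) / "example_documents.csv"
with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])
    writer.writerows(rows)

# Read the file back to confirm the layout
with out_path.open(newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["id"], "->", loaded[0]["text"])
```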

In [11]:
# the two models we wish to compare
model_path_1 = "../../../MedCAT/temp/model_packs/20230227__kch_gstt_trained_model_494c3717f637bb89.zip"
# SNOMED 2024 model trained on MIMIC-IV and 20% of KCH data
model_path_2 = "../../../MedCAT/temp/model_packs/snomed2024_kch_trained_fc8dcd4c84fd6502.zip"
# the documents file we'll be looking at
documents_file = "data/some_synthetic_data.csv"

Now that we've got the inputs, we need to figure out how the two models behave and where they differ.
We use the `get_diffs_for` method, which loads both models, runs `CAT.get_entities` on each document for either model, and then returns some results.
These results describe the differences in the raw CDB (i.e. the number of concepts (joint and unique), the amount of training, and so on), the total differences in the entities extracted (i.e. the number of recognitions and forms per CUI), as well as the per-document differences (i.e. the number of identical as well as different entity recognitions found).
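
To make the idea of a per-document comparison concrete, here is a simplified sketch that counts four of the categories that show up in the per-document output (`IDENTICAL`, `SAME_SPAN_DIFF_CONCEPT`, `FIRST_HAS`, `SECOND_HAS`). It is not the implementation used by `compare`: it assumes each model's annotations are reduced to a mapping from character span to CUI, it ignores overlapping spans and concept hierarchy, and the function name as well as the CUIs `C1` to `C3` are hypothetical placeholders.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def compare_annotations(first: Dict[Span, str],
                        second: Dict[Span, str]) -> Dict[str, int]:
    """Count simple comparison categories between two span -> CUI mappings."""
    counts = {
        "IDENTICAL": 0,
        "SAME_SPAN_DIFF_CONCEPT": 0,
        "FIRST_HAS": 0,
        "SECOND_HAS": 0,
    }
    for span, cui in first.items():
        if span not in second:
            counts["FIRST_HAS"] += 1  # only the first model annotated this span
        elif second[span] == cui:
            counts["IDENTICAL"] += 1  # same span, same concept
        else:
            counts["SAME_SPAN_DIFF_CONCEPT"] += 1  # same span, different concept
    # spans that only the second model annotated
    counts["SECOND_HAS"] = sum(1 for span in second if span not in first)
    return counts


# Toy annotations for one document
model1 = {(0, 8): "25064002", (20, 23): "53347009", (30, 36): "C1"}
model2 = {(0, 8): "25064002", (30, 36): "C2", (40, 45): "C3"}
result = compare_annotations(model1, model2)
print(result)
```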

In [12]:
from compare import get_diffs_for
from output import parse_and_show, show_dict_deep, compare_dicts

cui_filter = None
# optional CUI filter:
# cui_filter = {"25064002"}

cdb_comp, tally1, tally2, ann_diffs = get_diffs_for(model_path_1, model_path_2, documents_file, cui_filter=cui_filter)

Loading [1] ../../../MedCAT/temp/model_packs/20230227__kch_gstt_trained_model_494c3717f637bb89.zip




Loading [2] ../../../MedCAT/temp/model_packs/snomed2024_kch_trained_fc8dcd4c84fd6502.zip
Per annotations diff finding


100%|██████████| 60/60 [00:09<00:00,  6.03it/s]


Counting [1&2]


100%|██████████| 60/60 [00:00<00:00, 11011.08it/s]


CDB compare


keys: 100%|██████████| 794151/794151 [00:01<00:00, 554283.22it/s]
keys: 100%|██████████| 794151/794151 [00:02<00:00, 297079.11it/s]


For now, we'll use the common parser/display method to display an overview of the results.
We can look at more granular details later.

In [13]:
# show results
parse_and_show(cdb_comp, tally1, tally2, ann_diffs)

CDB overall differences:
names.keys.joint                        	752042                                  	                                        
names.keys.total                        	760283                                  	785910                                  
names.keys.not_in_                      	33868                                   	8241                                    
names.values.joint                      	2327941                                 	                                        
names.values.total                      	3149859                                 	2510372                                 
names.values.unique_in_                 	752906                                  	152108                                  
names.values.not_in_                    	170834                                  	810321                                  
snames.keys.joint                       	752042                                  	                                
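
The `names.keys` rows above are consistent with plain set arithmetic: for each model, `total` minus `joint` equals the `not_in_` value reported for the other model. The sketch below reproduces that relationship on toy key sets. The function name `key_overview` and the toy keys are made up for illustration, and this is only an assumption about how such numbers can be derived, not a description of how `compare` computes them.

```python
from typing import Dict, Set, Tuple, Union


def key_overview(keys1: Set[str],
                 keys2: Set[str]) -> Dict[str, Union[int, Tuple[int, int]]]:
    """Summarise the overlap of two key sets in the style of the table above."""
    joint = keys1 & keys2
    return {
        "joint": len(joint),
        "total": (len(keys1), len(keys2)),
        # keys missing from each model, i.e. present only in the other one
        "not_in_": (len(keys2 - keys1), len(keys1 - keys2)),
    }


# Toy key sets standing in for the names of two CDBs
keys1 = {"headache", "oak", "fever"}
keys2 = {"headache", "fever", "migraine", "cough"}
overview = key_overview(keys1, keys2)
print(overview)

# The same relationship holds for the numbers reported above
assert 785910 - 752042 == 33868  # total (model 2) - joint == not_in_ (model 1)
assert 760283 - 752042 == 8241   # total (model 1) - joint == not_in_ (model 2)
```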

## More granular details (per document view)

The above does not give us all the information we need.
For instance, we may also want to compare the performance across some documents.
We can do so as follows.

In [14]:
# you can play with individual parts as well.
# for example, isolate a specific document
ann_diffs.per_doc_results.keys()

for key in list(ann_diffs.per_doc_results.keys())[0:10]:
    print('='*20,f'\n{key}', f'\n{"="*20}')
    show_dict_deep(ann_diffs.per_doc_results[key].nr_of_comparisons)

doc_0 
IDENTICAL                               	41                                      	                                        
FIRST_HAS                               	6                                       	                                        
SECOND_HAS                              	6                                       	                                        
SAME_SPAN_DIFF_CONCEPT                  	3                                       	                                        
SAME_GRANDPARENT                        	1                                       	                                        
OVERLAPP_1ST_LARGER_DIFF_CONCEPT        	4                                       	                                        
doc_1 
IDENTICAL                               	28                                      	                                        
FIRST_HAS                               	10                                      	                                        
SE
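
The per-document counts can be condensed into a single agreement figure. The sketch below does this for the `doc_0` counts shown above, typed in by hand as a plain dictionary with string keys. Whether `nr_of_comparisons` is itself a plain dictionary keyed like this is an assumption, so adapt the access pattern to the real object.

```python
# Counts copied from the doc_0 output above
doc_0 = {
    "IDENTICAL": 41,
    "FIRST_HAS": 6,
    "SECOND_HAS": 6,
    "SAME_SPAN_DIFF_CONCEPT": 3,
    "SAME_GRANDPARENT": 1,
    "OVERLAPP_1ST_LARGER_DIFF_CONCEPT": 4,
}

# Total number of compared annotations and the share that is identical
total = sum(doc_0.values())
identical_share = doc_0["IDENTICAL"] / total
print(f"{total} comparisons, {identical_share:.1%} identical")
```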

## More granular details (per CUI view)

We may also want to look at how we did for a specific CUI.
This is how we can do that.

In [15]:
# cui = '37151006'  # Erythromelalgia
cui = '25064002'  # headache
per_cui1 = tally1.get_for_cui(cui)
per_cui2 = tally2.get_for_cui(cui)
compare_dicts(per_cui1, per_cui2)

name                                    	Headache                                	Headache                                
count                                   	9                                       	15                                      
acc                                     	1.0                                     	1.0                                     
forms                                   	1                                       	1                                       


## More granular details (per annotation view)
Sometimes we may want to look at things on a per annotation basis as well.
That is, we want to look at some annotations and compare them between the two models.

In [16]:
# we can iterate over annotation pairs.
# we may optionally specify the documents we wish to look at
# we will specify one document here so as to not generate too much output
docs = ['doc_2']
# by default, this will omit identical annotations
# but this can be changed by setting omit_identical=False
for doc_name, pair in ann_diffs.iter_ann_pairs(docs=docs, omit_identical=True):
    print('='*20,f'\n{doc_name} ({pair.comparison_type})', f'\n{"="*20}')
    # NOTE: if only one of the two has an annotation, the other one will be None
    #       the following will deal with that automatically, though
    compare_dicts(pair.one, pair.two)

doc_2 (AnnotationComparisonType.FIRST_HAS) 
pretty_name                             	Genus Quercus                           	                                        
cui                                     	53347009                                	                                        
type_ids                                	['81102976']                            	                                        
types                                   	['']                                    	                                        
source_value                            	Oak                                     	                                        
detected_name                           	oak                                     	                                        
acc                                     	0.6368384509248382                      	                                        
context_similarity                      	0.6368384509248382                      	             
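
When a document produces many differing pairs, it can help to group them by comparison type before inspecting them. The sketch below groups any iterable of `(doc_name, pair)` tuples by the pair's `comparison_type` attribute. `FakePair` and `group_by_type` are hypothetical names used only so the example runs without loading models. With the real data one would presumably pass `ann_diffs.iter_ann_pairs(docs=docs, omit_identical=True)` instead, in which case the keys would be `AnnotationComparisonType` members rather than strings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Tuple


@dataclass
class FakePair:
    """Hypothetical stand-in exposing the attributes used in the loop above."""
    comparison_type: str
    one: Optional[dict]
    two: Optional[dict]


def group_by_type(pairs: Iterable[Tuple[str, Any]]) -> Dict[Any, List[Tuple[str, Any]]]:
    """Group (doc_name, pair) tuples by the pair's comparison type."""
    grouped = defaultdict(list)
    for doc_name, pair in pairs:
        grouped[pair.comparison_type].append((doc_name, pair))
    return dict(grouped)


# Toy pairs; if only one model has an annotation, the other side is None
example = [
    ("doc_2", FakePair("FIRST_HAS", {"cui": "53347009"}, None)),
    ("doc_2", FakePair("SECOND_HAS", None, {"cui": "25064002"})),
    ("doc_2", FakePair("FIRST_HAS", {"cui": "25064002"}, None)),
]
grouped = group_by_type(example)
print({ctype: len(items) for ctype, items in grouped.items()})
```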