# This notebook will attempt to show how to compare two models

We often end up with several different models without being sure which one to focus on or start from for a particular task.
This notebook aims to introduce some tools that (hopefully) help us do that.

## Initial input - models and data

There are two different workflows this notebook can handle:
1. Compare two different model packs
  - Provide 2 model pack paths
  - Provide a documents file
2. Compare a model pack with and without supervised training
  - Provide 1 model pack path
  - Provide a file path to a MedCATtrainer (MCT) export
  - Provide a documents file

The model packs can be either the `.zip` file (which will be automatically unzipped) or the folder.

The documents file is expected in a `.csv` format with two columns (`id`, and `text`).
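
As a minimal sketch of that layout, the following writes and reads back a two-column documents file using only the standard library. The file name `docs_example.csv` and the sample rows are made up for illustration; only the `id` and `text` column names come from the description above.

```python
import csv
import os
import tempfile

# Hypothetical sample documents; only the column names come from the notebook.
rows = [
    {"id": "doc-1", "text": "Patient is independent and mobilising well."},
    {"id": "doc-2", "text": "History of falls, now housebound."},
]

# Write the file to a temporary directory so the sketch leaves no clutter behind.
path = os.path.join(tempfile.mkdtemp(), "docs_example.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back and check the header before handing the file to the notebook.
with open(path, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    header = reader.fieldnames
    loaded = list(reader)

print(header, len(loaded))
```

Text that contains commas (as in the second row) is quoted automatically by the `csv` module, so it still ends up in a single `text` cell.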

The MCT export is expected in the format given by MedCATtrainer.

The internal workflow differs slightly between the two approaches.
But other than ticking the checkbox, the process should be identical for the user.

### CUI filter settings

These are optional.

If you wish to filter based on CUIs (i.e. only run the comparison for some CUIs), you can do so.
You can either list the CUIs directly or provide a file that lists them; in both cases the CUIs are separated by commas.
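
How the notebook parses this input internally is not shown here, so the following is only a sketch of one reasonable way to read a comma-separated CUI list. The helper `parse_cuis` is hypothetical, and the CUIs in the example are just sample SNOMED identifiers.

```python
def parse_cuis(raw: str) -> set:
    """Split a comma-separated string of CUIs, ignoring whitespace and empty entries."""
    return {part.strip() for part in raw.split(",") if part.strip()}


# The same function works for text typed into a field or read from a file.
cuis = parse_cuis("1912002, 183376001,\n8510008,")
print(sorted(cuis))
```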

You can also include the children of the selected CUIs. The default is not to do so.
But you can opt to include children up to a certain order (i.e. `1` means direct children only, `2` means children of children as well, and so on).
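
To make the meaning of the order concrete, here is a small sketch of such an expansion over a parent-to-children mapping. The function `expand_with_children` and the toy hierarchy are assumptions for illustration and not the notebook's own implementation.

```python
from typing import Dict, Set


def expand_with_children(cuis: Set[str], pt2ch: Dict[str, Set[str]], order: int) -> Set[str]:
    """Return the given CUIs plus their descendants up to `order` levels deep."""
    selected = set(cuis)
    frontier = set(cuis)
    for _ in range(order):
        # Collect the direct children of everything added in the previous step.
        next_frontier: Set[str] = set()
        for cui in frontier:
            next_frontier |= pt2ch.get(cui, set())
        next_frontier -= selected  # skip concepts we already have
        if not next_frontier:
            break
        selected |= next_frontier
        frontier = next_frontier
    return selected


# Toy hierarchy: A -> {B, C}, B -> {D}, D -> {E}
toy_pt2ch = {"A": {"B", "C"}, "B": {"D"}, "D": {"E"}}
print(sorted(expand_with_children({"A"}, toy_pt2ch, 1)))
print(sorted(expand_with_children({"A"}, toy_pt2ch, 2)))
```

With order `0` the selection is returned unchanged, which corresponds to the default of not including children.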

In [1]:
from comp_nbhelper import NBInputter
inputs = NBInputter()
inputs.show_all()

### Running the difference finder

Now that we have the inputs, we need to find out how the models behave and where they differ.
We use the `get_comparison` method, which loads both models, runs `CAT.get_entities` on each document with each model, and then returns the results.

These results describe the differences in the raw CDB (e.g. the number of concepts (joint and unique), the amount of training, and so on), the total differences in the entities extracted (e.g. the number of recognitions and forms per CUI), as well as per-document differences (i.e. the number of identical as well as different entity recognitions found).

We will look into the details later.
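
Conceptually, the per-document part of such a comparison can be thought of as a set difference over the annotations each model produces. The sketch below is a simplified illustration, not the code behind `get_comparison`: it assumes an annotation is fully described by a hypothetical `(start, end, cui)` tuple and ignores cases such as partially overlapping spans, which the real comparison may treat separately.

```python
from collections import Counter
from typing import Iterable, Tuple

Annotation = Tuple[int, int, str]  # (start, end, cui), an assumed simplification


def tally_differences(first: Iterable[Annotation], second: Iterable[Annotation]) -> Counter:
    """Count annotations shared by both models and those found by only one of them."""
    first_set, second_set = set(first), set(second)
    return Counter({
        "IDENTICAL": len(first_set & second_set),
        "FIRST_HAS": len(first_set - second_set),
        "SECOND_HAS": len(second_set - first_set),
    })


# Made-up annotations for a single document.
model1_anns = [(0, 11, "371153006"), (20, 24, "1912002")]
model2_anns = [(0, 11, "371153006"), (30, 40, "183376001"), (50, 60, "8510008")]
result = tally_differences(model1_anns, model2_anns)
print(dict(result))
```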

In [32]:

comparison = inputs.get_comparison()
comparison.show_all()

For models, selected:
Model1: /Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/models/20230227__kch_gstt_trained_model_494c3717f637bb89.zip
Model2: /Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/models/snomed2024-06-mimic-iv.zip
Documents: /Users/martratas/Documents/CogStack/.MedCAT.nosync/working_with_cogstack/medcat/compare_models/data/intechopen_2cols_3.csv
For CUI filter, selected:
Filter: /Users/martratas/Documents/CogStack/.MedCAT.nosync/working_with_cogstack/medcat/compare_models/data/demo-physio-mobility/cui_filter.csv
Children: None
Loading [1] /Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/models/20230227__kch_gstt_trained_model_494c3717f637bb89.zip




Loading [2] /Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/models/snomed2024-06-mimic-iv.zip




Per annotations diff finding
Applying filter to CATs: 139 CUIs


14370it [2:35:53,  1.54it/s]


Counting [1&2]


100%|██████████| 14362/14362 [00:00<00:00, 17272.71it/s]


CDB compare


keys: 100%|██████████| 804513/804513 [00:01<00:00, 437204.92it/s]
keys: 100%|██████████| 804513/804513 [00:06<00:00, 118845.07it/s]


CDB overall differences:


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| names.keys.joint                         | 750494                                   |                                          |
| names.keys.not_in_                       | 44230                                    | 9789                                     |
| names.keys.total                         | 760283                                   | 794724                                   |
| names.values.joint                       | 2659786                                  |                                          |
| names.values.unique_in_                  | 421061                                   | 251871                                   |
| names.values.not_in_                     | 375036                                   | 457386                                   |
| names.values.total                       | 3149859                                  | 3067509                                  |
| snames.keys.joint                        | 750494                                   |                                          |
| snames.keys.not_in_                      | 44230                                    | 9789                                     |
| snames.keys.total                        | 760283                                   | 794724                                   |
| snames.values.joint                      | 5697138                                  |                                          |
| snames.values.unique_in_                 | 962832                                   | 552795                                   |
| snames.values.not_in_                    | 1273285                                  | 1291058                                  |
| snames.values.total                      | 13486640                                 | 13468867                                 |

Now tally differences


| Path | First | Second |
| ----- | ----- | ----- |
| pt2ch (Dict[str, Set])                   | 352226 keys (mean 2.0 values per key)    | 372046 keys (mean 2.0 values per key)    |
| cat_data                                 | {'Number of concepts': 760283, 'Number of names': 3080845, 'Number of concepts that received training': 38460, 'Number of seen training examples in total': 153875883, 'Average training examples per concept': 4000.932995319813} | {'Number of concepts': 794724, 'Number of names': 2911657, 'Number of concepts that received training': 62855, 'Number of seen training examples in total': 253418075, 'Average training examples per concept': 4031.788640521836} |
| per_cui_count (Dict[str, int])           | 33 keys (total 6285 in value)            | 33 keys (total 6393 in value)            |
| per_cui_acc (Dict[str, float])           | 33 keys (mean 0.9250648240405015 in value) | 33 keys (mean 0.950254014404179 in value) |
| per_cui_forms (Dict[str, Set])           | 33 keys (mean 2.0 values per key)        | 33 keys (mean 2.0 values per key)        |
| per_type_counts (Dict[str, int])         | 9 keys (total 6285 in value)             | 9 keys (total 6393 in value)             |
| total_count                              | 6285                                     | 6393                                     |

Now per-annotation differences:


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 5283                                     |                                          |
| FIRST_HAS                                | 1002                                     |                                          |
| SECOND_HAS                               | 1110                                     |                                          |

For now, we'll use the common parser/display method to display an overview of the results.
We can later look at more granular details as well.
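
The tables above can be cross-checked against each other. The sketch below copies the totals from the output and confirms that the per-annotation counts add up to each model's `total_count`. It assumes that `FIRST_HAS` and `SECOND_HAS` count annotations made by only one of the models, which is an interpretation of the labels rather than something the output states.

```python
# Values copied from the tables above.
identical, first_has, second_has = 5283, 1002, 1110
total_count_1, total_count_2 = 6285, 6393

# Every annotation of a model is either shared with the other model or unique to it.
consistent_1 = identical + first_has == total_count_1
consistent_2 = identical + second_has == total_count_2

# Share of all distinct annotations on which both models agree.
agreement = identical / (identical + first_has + second_has)
print(consistent_1, consistent_2, round(agreement, 3))
```

Both checks hold for this run, so the agreement ratio gives a quick single-number summary of how similarly the two models annotate the filtered CUIs.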

## More granular details (per document view)

The above does not give us all the information we need.
For instance, we may also want to compare the performance across individual documents.
We can do so as follows.

In [38]:
# you can play with individual parts as well.
# for example, show the per-document differences (limited to 10 documents here)
comparison.show_per_document(limit=10)

Chemokines Updates_7. Interleukin-5 and Interleukin-5 Receptor Polymorphism in Asthma 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 1                                        |                                          |

Cerebral Palsy_4. Neurosurgical Treatment of Cerebral Palsy 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 4                                        |                                          |

Endothelial Dysfunction_2. An Overview of Gene Variants of Endothelin-1: A Critical Regulator of Endothelial Dysfunction 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 1                                        |                                          |

Endothelial Dysfunction_3. Genetic Markers of Endothelial Dysfunction 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 1                                        |                                          |

Advances in Skeletal Muscle Health and Disease_4. Duchenne Muscular Dystrophy: Clinical and Therapeutic Approach 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 5                                        |                                          |

Topics in Trauma Surgery_2. Appropriate Protective Measures for the Prevention of Animal-related Goring Injuries 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 4                                        |                                          |

Topics in Trauma Surgery_5. OCD of the Knee in Adolescents 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 1                                        |                                          |

Recent Updates on Multiple Myeloma_6. Management of Renal Failure in Multiple Myeloma 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 1                                        |                                          |

Immune Checkpoint Inhibitors_5. Immune Checkpoint Inhibitors in Hodgkin Lymphoma and Non-Hodgkin Lymphoma 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| IDENTICAL                                | 2                                        |                                          |

Immune Checkpoint Inhibitors_8. Recent Developments in Application of Multiparametric Flow Cytometry in CAR-T Immunotherapy 


| Path | Value | [Optional] Comparison |
| ----- | ----- | ----- |
| FIRST_HAS                                | 1                                        |                                          |

## Saving annotation output to CSV file
You can also save the annotation output to a `.csv` file. That file includes the following columns:
```
doc_id  text    ann1    ann2
```
where `doc_id` refers to the ID of the document in question, `text` is the relevant text around the specific annotation, `ann1` is the annotation JSON for model 1 (if present), and `ann2` is the annotation JSON for model 2 (if present).

*Note:* One of the annotations may not be present. This is the case if one of the models did not annotate that specific span.
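
A file with these columns can be post-processed with the standard library. The sketch below builds a tiny in-memory example and counts how many rows were annotated by both models or by only one. It assumes a comma-delimited file in which a missing annotation is an empty cell, and the `cui` key inside the JSON is a made-up placeholder; check an actual export before relying on either assumption.

```python
import csv
import io
import json

# Build a small stand-in for the exported file (contents are invented).
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["doc_id", "text", "ann1", "ann2"])
writer.writerow(["doc-1", "is independent with",
                 json.dumps({"cui": "371153006"}), json.dumps({"cui": "371153006"})])
writer.writerow(["doc-2", "had a fall at", "", json.dumps({"cui": "1912002"})])
sample.seek(0)

both = only_first = only_second = 0
for row in csv.DictReader(sample):
    has1 = bool(row["ann1"].strip())
    has2 = bool(row["ann2"].strip())
    if has1 and has2:
        both += 1
    elif has1:
        only_first += 1
    elif has2:
        only_second += 1

print(both, only_first, only_second)
```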

In [20]:
comparison.diffs_to_csv("23vs24_annotations_2.csv")

## More granular details (per CUI view)

We may also want to look at how we did for a specific CUI.
This is how we can do that.

In [43]:
# cui = '37151006'  # Erythromelalgia
# cui = '25064002'  # headache
cuis_1 = comparison.tally1.per_cui_count.keys()
cuis_2 = comparison.tally2.per_cui_count.keys()
joint_cuis = set(cuis_1) & set(cuis_2)
print(f"CUIs in 1 ({len(cuis_1)}):", cuis_1)
print(f"CUIs in 2 ({len(cuis_2)}):", cuis_2)
print("JOINT", len(joint_cuis), ":", joint_cuis)
for cui in joint_cuis:
    print("CUI", cui)
    try:
        comparison.compare_for_cui(cui)
    except KeyError as e:
        print("ERROR", e)

CUIs in 1 (33): dict_keys(['371153006', '1149222004', '282884000', '1912002', '249902000', '223600005', '273302005', '160689007', '8510008', '229799001', '261001000', '165232002', '183376001', '229798009', '307439001', '229797004', '161903000', '72042002', '273469003', '25711000087100', '129041007', '257301003', '160685001', '718360006', '45850009', '160734000', '31031000119102', '718705001', '24029004', '302043000', '225602000', '282971008', '282966001'])
CUIs in 2 (33): dict_keys(['371153006', '1149222004', '282884000', '1912002', '249902000', '223600005', '160689007', '8510008', '229799001', '261001000', '183376001', '165232002', '229798009', '307439001', '229797004', '161903000', '45850009', '72042002', '273302005', '273469003', '25711000087100', '129041007', '257301003', '160685001', '718360006', '160734000', '31031000119102', '718705001', '24029004', '302043000', '225602000', '282971008', '282966001'])
JOINT 33 : {'223600005', '229798009', '31031000119102', '25711000087100', '273

| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | India                                    | India                                    |
| count                                    | 1394                                     | 1377                                     |
| acc                                      | 0.9973064050984463                       | 1.0                                      |
| forms                                    | 3                                        | 2                                        |

CUI 229798009


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Three times daily (and 3 children)       | Three times daily (and 4 children)       |
| count                                    | 82                                       | 49                                       |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 4                                        | 2                                        |

CUI 31031000119102


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Physical deconditioning                  | Physical deconditioning                  |
| count                                    | 6                                        | 6                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 25711000087100


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Assisted living facility                 | Assisted living facility                 |
| count                                    | 4                                        | 4                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 273302005


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Barthel index (Barthel United Kingdom index of activities of daily living, Modified Barthel index of activities of daily living) | Barthel index (Modified Barthel index of activities of daily living, Barthel United Kingdom index of activities of daily living) |
| count                                    | 116                                      | 13                                       |
| acc                                      | 0.48484750522355524                      | 1.0                                      |
| forms                                    | 2                                        | 1                                        |

CUI 282884000


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Does stand from sitting (Does stand from sitting on edge of bed) | Does stand from sitting (Does stand from sitting on edge of bed) |
| count                                    | 16                                       | 16                                       |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 3                                        | 3                                        |

CUI 1149222004


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Overdose (and 1075 children)             | Overdose (and 1084 children)             |
| count                                    | 237                                      | 227                                      |
| acc                                      | 0.9795706800607633                       | 1.0                                      |
| forms                                    | 2                                        | 1                                        |

CUI 225602000


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Unable to sit unsupported (Able to sit with support, unable to sit unsupported) | Unable to sit unsupported (Able to sit with support, unable to sit unsupported) |
| count                                    | 1                                        | 1                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 257301003


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | K-wire                                   | K-wire (and 21 children)                 |
| count                                    | 10                                       | 10                                       |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 2                                        | 2                                        |

CUI 8510008


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Reduced mobility                         | Reduced mobility                         |
| count                                    | 28                                       | 27                                       |
| acc                                      | 0.8248241140712277                       | 0.72670834838349                         |
| forms                                    | 5                                        | 5                                        |

CUI 307439001


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Four times daily (and 3 children)        | Four times daily (and 4 children)        |
| count                                    | 33                                       | 33                                       |
| acc                                      | 1.0                                      | 0.8600978040291086                       |
| forms                                    | 4                                        | 4                                        |

CUI 160734000


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Lives in a nursing home                  | Lives in nursing home                    |
| count                                    | 5                                        | 5                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 2                                        | 2                                        |

CUI 718705001


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Functionally dependent (and 20 children) | Functionally dependent (and 29 children) |
| count                                    | 1                                        | 1                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 1912002


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Fall (event) (and 80 children)           | Fall (and 112 children)                  |
| count                                    | 141                                      | 1173                                     |
| acc                                      | 0.5283850094587151                       | 0.7940999134595381                       |
| forms                                    | 6                                        | 6                                        |

CUI 183376001


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Mobilizing                               | Mobilizing                               |
| count                                    | 259                                      | 27                                       |
| acc                                      | 0.5849852029345775                       | 0.5465187114053693                       |
| forms                                    | 8                                        | 4                                        |

CUI 229799001


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Twice a day (and 4 children)             | Twice a day (and 6 children)             |
| count                                    | 655                                      | 277                                      |
| acc                                      | 0.7491845039123373                       | 0.938731518947015                        |
| forms                                    | 7                                        | 6                                        |

CUI 273469003


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Functional independence measure          | Functional independence measure          |
| count                                    | 8                                        | 8                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 249902000


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Difficulty standing                      | Difficulty standing (Astasia-abasia, Dissociative astasia-abasia) |
| count                                    | 5                                        | 5                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 2                                        | 2                                        |

CUI 229797004


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Once daily (Once a day, in the evening, Once a day, at bedtime) | Once daily (Once a day, in the evening, Once a day, at bedtime) |
| count                                    | 117                                      | 117                                      |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 3                                        | 3                                        |

CUI 129041007


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Independent bathing                      | Independent bathing                      |
| count                                    | 1                                        | 1                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 302043000


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Unable to mobilize (and 29 children)     | Unable to mobilize (and 35 children)     |
| count                                    | 1                                        | 1                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 72042002


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Incontinence of feces (and 16 children)  | Incontinence of feces (and 22 children)  |
| count                                    | 160                                      | 160                                      |
| acc                                      | 0.9983109712644286                       | 0.999875                                 |
| forms                                    | 12                                       | 12                                       |

CUI 718360006


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Functionally independent                 | Functionally independent                 |
| count                                    | 9                                        | 9                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 45850009


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Continent of urine (Bladder: fully continent) | Continent of urine                       |
| count                                    | 1                                        | 19                                       |
| acc                                      | 0.6176311321701055                       | 0.6859683035181096                       |
| forms                                    | 1                                        | 3                                        |

CUI 371153006


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Independent                              | Independent (and 30 children)            |
| count                                    | 2318                                     | 2328                                     |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 3                                        |

CUI 160689007


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Housebound (Temporarily housebound)      | Housebound (Temporarily housebound)      |
| count                                    | 2                                        | 2                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 161903000


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Unable to stand                          | Unable to stand                          |
| count                                    | 4                                        | 4                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 2                                        | 2                                        |

CUI 282971008


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Able to stand                            | Able to stand                            |
| count                                    | 2                                        | 2                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 1                                        | 1                                        |

CUI 282966001


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Unable to sit (and 3 children)           | Unable to sit (and 3 children)           |
| count                                    | 2                                        | 2                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 2                                        | 2                                        |

CUI 24029004


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Bowels: fully continent                  | Bowels: fully continent                  |
| count                                    | 9                                        | 9                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 2                                        | 2                                        |

CUI 160685001


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Bed-ridden                               | Bed-ridden                               |
| count                                    | 9                                        | 9                                        |
| acc                                      | 1.0                                      | 1.0                                      |
| forms                                    | 3                                        | 3                                        |

CUI 165232002


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Urinary incontinence (and 23 children)   | Urinary incontinence (and 36 children)   |
| count                                    | 403                                      | 231                                      |
| acc                                      | 0.9986527591730276                       | 1.0                                      |
| forms                                    | 9                                        | 7                                        |

CUI 261001000


| Path | First | Second |
| ----- | ----- | ----- |
| name                                     | Continent                                | Continent                                |
| count                                    | 247                                      | 251                                      |
| acc                                      | 0.7634409099693621                       | 0.8063828755952787                       |
| forms                                    | 1                                        | 1                                        |
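
With many CUIs, it can be hard to spot in the tables above where the two models diverge most. The following is a minimal sketch, not part of the comparison API: it assumes the per-CUI counts have been copied by hand into plain dictionaries, and the names `first_counts`, `second_counts` and `rank_count_diffs` are hypothetical. The counts are taken from three of the tables above.

```python
def rank_count_diffs(first_counts, second_counts):
    """Return (cui, first, second, diff) tuples, largest absolute difference first."""
    rows = []
    # use the union so CUIs found by only one model are still reported
    for cui in set(first_counts) | set(second_counts):
        first = first_counts.get(cui, 0)
        second = second_counts.get(cui, 0)
        rows.append((cui, first, second, second - first))
    # sort by absolute difference (descending), then by CUI for a stable order
    return sorted(rows, key=lambda row: (-abs(row[3]), row[0]))


# hypothetical containers; counts copied from the tables above
first_counts = {"165232002": 403, "261001000": 247, "160685001": 9}
second_counts = {"165232002": 231, "261001000": 251, "160685001": 9}

ranked = rank_count_diffs(first_counts, second_counts)
for cui, first, second, diff in ranked:
    print(f"{cui}: {first} -> {second} ({diff:+d})")
```

Sorting by the absolute difference puts CUIs such as 165232002 (403 vs 231) at the top, which are probably the ones worth inspecting first.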

## More granular details (per-annotation view)
Sometimes we may want to look at things on a per-annotation basis as well.
That is, we want to pick out individual annotations and compare them between the two models.
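
Two annotations can agree on span and CUI and still differ in other fields. As a rough sketch of what a per-annotation comparison involves, the snippet below lists the fields on which two annotation dictionaries disagree. It is independent of the notebook's own helpers: `differing_fields` and its `ignore` parameter are hypothetical names, and the dictionaries hold illustrative values only.

```python
def differing_fields(first_ann, second_ann, ignore=("id",)):
    """Map each differing field name to its (first, second) pair of values."""
    diffs = {}
    for key in sorted(set(first_ann) | set(second_ann)):
        if key in ignore:
            # e.g. running ids are expected to differ between models
            continue
        first_value = first_ann.get(key)
        second_value = second_ann.get(key)
        if first_value != second_value:
            diffs[key] = (first_value, second_value)
    return diffs


# illustrative annotation dictionaries (assumed shape: one flat dict per annotation)
first_ann = {
    "cui": "371153006",
    "start": 2204,
    "end": 2215,
    "types": [""],
    "ontologies": ["20220803_SNOMED_UK_CLINICAL_EXT"],
    "id": 175,
}
second_ann = {
    "cui": "371153006",
    "start": 2204,
    "end": 2215,
    "types": ["qualifier value"],
    "ontologies": ["SNOMED-CT"],
    "id": 171,
}

diffs = differing_fields(first_ann, second_ann)
for field, (first_value, second_value) in diffs.items():
    print(f"{field}: {first_value!r} != {second_value!r}")
```

Here the span and CUI match, so only the metadata fields (`ontologies` and `types`) are reported.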

In [45]:
# we can iterate over annotation pairs.
# we may optionally specify the documents we wish to look at
# we limit this to the first 10 documents so as to not generate too much output
all_docs = list(comparison.ann_diffs.per_doc_results.keys())
docs = all_docs[:10]  # just 10 for now
# by default, this will omit identical annotations
# here we set omit_identical=False so that identical annotations are shown as well
comparison.show_docs(docs, omit_identical=False)

Chemokines Updates_7. Interleukin-5 and Interleukin-5 Receptor Polymorphism in Asthma (AnnotationComparisonType.IDENTICAL) 


| Path | First | Second |
| ----- | ----- | ----- |
| pretty_name                              | Independent                              | Independent                              |
| cui                                      | 371153006                                | 371153006                                |
| type_ids                                 | ['7882689']                              | ['7882689']                              |
| types                                    | ['']                                     | ['qualifier value']                      |
| source_value                             | independent                              | independent                              |
| detected_name                            | independent                              | independent                              |
| acc                                      | 1.0                                      | 1.0                                      |
| context_similarity                       | 1.0                                      | 1.0                                      |
| start                                    | 2204                                     | 2204                                     |
| end                                      | 2215                                     | 2215                                     |
| icd10                                    | []                                       | []                                       |
| ontologies                               | ['20220803_SNOMED_UK_CLINICAL_EXT']      | ['SNOMED-CT']                            |
| snomed                                   | []                                       | []                                       |
| id                                       | 175                                      | 171                                      |
| meta_anns (Dict[str, dict])              | 3                                        | 0                                        |

Chemokines Updates_7. Interleukin-5 and Interleukin-5 Receptor Polymorphism in Asthma (AnnotationComparisonType.IDENTICAL) 


| Path | First | Second |
| ----- | ----- | ----- |
| pretty_name                              | Independent                              | Independent                              |
| cui                                      | 371153006                                | 371153006                                |
| type_ids                                 | ['7882689']                              | ['7882689']                              |
| types                                    | ['']                                     | ['qualifier value']                      |
| source_value                             | independent                              | independent                              |
| detected_name                            | independent                              | independent                              |
| acc                                      | 1.0                                      | 1.0                                      |
| context_similarity                       | 1.0                                      | 1.0                                      |
| start                                    | 2204                                     | 2204                                     |
| end                                      | 2215                                     | 2215                                     |
| icd10                                    | []                                       | []                                       |
| ontologies                               | ['20220803_SNOMED_UK_CLINICAL_EXT']      | ['SNOMED-CT']                            |
| snomed                                   | []                                       | []                                       |
| id                                       | 175                                      | 171                                      |
| meta_anns (Dict[str, dict])              | 3                                        | 0                                        |