![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.13.Deidentification_Model_Evaluation.ipynb)

# Benchmarking De‑Identification Models

In this notebook we assess all available de‑identification models on the **deidentification_benchmark_ground_truth_10_doc** dataset.  

Using the profiling pipeline, we gather predictions, align them with ground‑truth labels, tidy up tag formats, and produce standard evaluation metrics. The results for each model are summarised along the way for quick comparison.

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## Colab Setup


In [None]:
!pip -q install azure-health-deidentification azure-identity openai anthropic boto3

📍 A session restart may be required.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_6.0.4.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-6.0.4-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==6.0.4 installed! ✅ Heal the planet with NLP! 


In [None]:
from johnsnowlabs import nlp, medical
# Automatically load license data and start a session with all JARs the user has access to

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_6.0.4.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==6.0.4, 💊Spark-Healthcare==6.0.4, running on ⚡ PySpark==3.4.0


In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
from sparknlp_jsl.pipeline_tracer import PipelineTracer

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

import json
import string
import numpy as np


# 📝 **Dataset**

For the benchmark, we use 48 open-source documents annotated by domain experts at John Snow Labs. The annotations cover the entities `IDNUM`, `LOCATION`, `DATE`, `AGE`, `NAME`, and `CONTACT`, so the benchmark is conducted on these labels.

**NOTE:** The results in this notebook were prepared using the whole dataset, but only a small part of it is shared. If you are interested in using the dataset, please contact us by email at **support@johnsnowlabs.com**.

In [None]:
ground_truth_df = pd.read_csv("deidentification_benchmark_ground_truth_10_doc.csv")
ground_truth_df

Unnamed: 0,doc_id,text,begin,end,chunk,chunk_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,10,957770228,IDNUM
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,14,FIH,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,22,0408267,IDNUM
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,33,46769/5v7d,IDNUM
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,40,237890,IDNUM
...,...,...,...,...,...,...
271,38,"Pasco Bond DOB : 24/09/1974 Page 1 of 7\n\nOffice Visit Pasco Bond ( 24/9/1974 )\n\nPasco Bond Description : 52 year old\n\n1/10/2018 9:00 AM Office Visit male\n\nProvider : Delvin , Dept Phone : ...",10233,10245,Cheree Cords,NAME
272,38,"Pasco Bond DOB : 24/09/1974 Page 1 of 7\n\nOffice Visit Pasco Bond ( 24/9/1974 )\n\nPasco Bond Description : 52 year old\n\n1/10/2018 9:00 AM Office Visit male\n\nProvider : Delvin , Dept Phone : ...",10300,10311,Pauline Bos,NAME
273,38,"Pasco Bond DOB : 24/09/1974 Page 1 of 7\n\nOffice Visit Pasco Bond ( 24/9/1974 )\n\nPasco Bond Description : 52 year old\n\n1/10/2018 9:00 AM Office Visit male\n\nProvider : Delvin , Dept Phone : ...",10328,10337,1/10/2018,DATE
274,38,"Pasco Bond DOB : 24/09/1974 Page 1 of 7\n\nOffice Visit Pasco Bond ( 24/9/1974 )\n\nPasco Bond Description : 52 year old\n\n1/10/2018 9:00 AM Office Visit male\n\nProvider : Delvin , Dept Phone : ...",10554,10564,Pasco Bond,NAME


In [None]:
ground_truth_df.chunk_label.value_counts()

chunk_label,count
DATE,123
NAME,70
IDNUM,37
LOCATION,35
AGE,8
CONTACT,3


# 🧩 **Getting Token-Level Information**

This step extracts each token's text and its start and end offsets, then assigns each token a ground-truth label, so that model predictions can later be compared token by token.

In [None]:
deid_pipeline = nlp.PretrainedPipeline("ner_profiling_deidentification", "en", "clinical/models")

ner_profiling_deidentification download started this may take some time.
Approx size to download 2.5 GB
[OK!]


In [None]:
text_df = ground_truth_df[["doc_id", "text"]].drop_duplicates()
spark_df = spark.createDataFrame(text_df).repartition(32)
spark_df.show()

+------+--------------------+
|doc_id|                text|
+------+--------------------+
|     1|\n957770228\nFIH\...|
|     4|\n229937784\nFIH\...|
|    38|Pasco Bond DOB : ...|
|    20|\n263283549 ELMVH...|
|    10|\n305265793\nFIH\...|
|    29|\n417344403 RWH\n...|
|    25|\n649953363\nFIH\...|
|     9|\n333145593\nFIH\...|
+------+--------------------+



In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Tokenizer splits the document text into tokens
tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        tokenizer
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [None]:
sdf_result = model.transform(spark_df)

token_pred_df = sdf_result.select("doc_id", "text", F.explode(F.arrays_zip(sdf_result.token.begin,
                                                                           sdf_result.token.end,
                                                                           sdf_result.token.result,
                                                                          )).alias("cols"))\
                          .select("doc_id", "text", F.expr("cols['0']").alias("token_begin"),
                                                    F.expr("cols['1']").alias("token_end"),
                                                    F.expr("cols['2']").alias("token")).toPandas()

token_pred_df

Unnamed: 0,doc_id,text,token_begin,token_end,token
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890
...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end


In [None]:
%%time
from tqdm import tqdm

gt_token_labels = []
docs = token_pred_df.text.unique()

for doc in tqdm(docs):
    gt_slice_df = ground_truth_df[ground_truth_df.text==doc].copy()
    token_slice_df = token_pred_df[token_pred_df.text==doc].copy()

    for token_ind, token_row in token_slice_df.iterrows():
        # Default to "O"; a token takes the label of the first ground-truth span
        # whose half-open range [begin, end) contains the token's start offset.
        token_label = "O"
        for gt_ind, gt_row in gt_slice_df.iterrows():
            if gt_row.begin <= token_row.token_begin < gt_row.end:
                token_label = gt_row.chunk_label
                break
        gt_token_labels.append(token_label)


100%|██████████| 8/8 [00:20<00:00,  2.54s/it]

CPU times: user 20.3 s, sys: 50.3 ms, total: 20.3 s
Wall time: 20.4 s
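
The alignment rule applied above (a token receives the label of the ground-truth span whose range `[begin, end)` contains the token's start offset, otherwise `O`) can be illustrated on toy data. The sketch below is not the notebook's implementation: `label_tokens` is a hypothetical helper, it uses only the standard library, and it assumes the ground-truth spans of a document do not overlap. Under that assumption a binary search over sorted span starts replaces the nested row scan.

```python
from bisect import bisect_right

def label_tokens(token_begins, spans):
    """Label each token by the span whose [begin, end) range contains its start offset.

    spans: list of (begin, end, label) tuples, assumed non-overlapping.
    """
    spans = sorted(spans)                      # order spans by begin offset
    span_begins = [s[0] for s in spans]
    labels = []
    for tb in token_begins:
        i = bisect_right(span_begins, tb) - 1  # last span starting at or before the token
        if i >= 0 and tb < spans[i][1]:
            labels.append(spans[i][2])
        else:
            labels.append("O")
    return labels

# First three ground-truth spans of doc 1, plus a made-up token offset (60) outside any span
toy_spans = [(1, 10, "IDNUM"), (11, 14, "LOCATION"), (15, 22, "IDNUM")]
toy_token_begins = [1, 11, 15, 60]
toy_labels = label_tokens(toy_token_begins, toy_spans)
print(toy_labels)
```

If spans could overlap, the first-match behaviour of the original loop and this lookup might differ, so the loop remains the reference.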





In [None]:
token_pred_df["token_label"] = gt_token_labels
token_pred_df

Unnamed: 0,doc_id,text,token_begin,token_end,token,token_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM
...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O


In [None]:
token_pred_df.to_csv("token_pred_df.csv", index=False)

In [None]:
token_pred_df = pd.read_csv("token_pred_df.csv")

token_pred_df = token_pred_df.rename(columns={
    'token_begin': 'begin',
    'token_end': 'end'
})

token_pred_df

Unnamed: 0,doc_id,text,begin,end,token,token_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM
...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O


# 🧨 **Getting Predictions from Each Model in the Pipeline**

In [None]:
import pandas as pd
import plotly.graph_objects as go
from typing import List, Dict, Optional

class NERResultViewer:
    """
    A utility class to visualize NER (Named Entity Recognition) model outputs
    in different formats: chunks, token-level predictions, tuple results, and label distribution plots.
    """

    def __init__(self, light_result=None):
        """
        Initialize the class with the light_result dictionary from Spark NLP or similar output.
        """
        self.light_result = light_result


    def build_chunk_dataframe(self, results):
        """
        Helper function that builds a DataFrame from chunk results.
        """
        data = {
            'sentence': [r.metadata.get('sentence', '') for r in results],
            'begin': [r.begin for r in results],
            'end': [r.end for r in results],
            'chunks': [r.result for r in results],
            'entity': [r.metadata.get('entity') if r.metadata.get('entity') else r.metadata.get('field', '') for r in results],
            'confidence': [r.metadata.get('confidence', '') for r in results]
        }
        return pd.DataFrame(data)

    def show_chunk_results(self):
        """
        Prints chunk-level NER results as DataFrames for each model.
        """
        if self.light_result is None:
            print("light_result is None.")
            return

        for key in self.light_result.keys():
            if key in ['sentence', 'token'] or "_chunks" not in key:
                continue

            model_name = "_".join(key.split("_")[:-1])
            results = self.light_result[key]

            print(f"\n{'*' * 20} {model_name} Model Results {'*' * 20}")

            if len(results) == 0:
                print("No Result For This Model")
                continue

            df = self.build_chunk_dataframe(results)
            try:
                from IPython.display import display  # noqa
                display(df)
            except Exception:
                print(df.to_string(index=False))

    def get_token_results(self):
        """
        Generate a DataFrame showing token-level predictions for all models.
        """
        if self.light_result is None:
            return pd.DataFrame()

        # Extract base token data
        token_data = self.light_result["token"]
        df = pd.DataFrame({
            'sentence': [t.metadata["sentence"] for t in token_data],
            'begin': [t.begin for t in token_data],
            'end': [t.end for t in token_data],
            'token': [t.result for t in token_data]
        })

        # Add neural NER model predictions
        for key, model_output in self.light_result.items():
            if key in ["sentence", "token"] or "_chunks" in key or "ner" not in key:
                continue
            labels = [entry.result for entry in model_output]
            df[key] = labels

        # Add rule-based chunk predictions (e.g., *_chunks)
        for key, chunk_annotations in self.light_result.items():
            if key in ["sentence", "token"] or "ner" in key:
                continue

            chunk_df = self.build_chunk_dataframe(chunk_annotations)

            # Label each token with the entity of the first chunk that overlaps it (nested scan over tokens and chunks)
            labels = []
            for _, token_row in df.iterrows():
                matched_entity = "O"
                for _, chunk_row in chunk_df.iterrows():
                    if (chunk_row.begin <= token_row.begin < chunk_row.end) or (token_row.begin <= chunk_row.begin < token_row.end):
                        matched_entity = chunk_row.entity
                        break
                labels.append(matched_entity)

            column_name = "_".join(key.split("_")[:-1])
            df[column_name] = labels

        return df

    def show_results_as_tuples(self):
        """
        Print the NER token predictions as (token, label) tuples for each model.
        """
        if self.light_result is None:
            print("light_result is None.")
            return

        tokens = [j.result for j in self.light_result["token"]]

        for key in self.light_result.keys():
            if key == 'sentence' or "_chunks" in key:
                continue

            results = self.light_result[key]

            print(f"\n{'*'*20} {key} Model Results {'*'*20}")

            if len(results) == 0:
                print("No Result For This Model")
                continue

            labels = [r.result for r in results]
            paired = list(zip(tokens, labels))
            print(paired)

    def plot_entity_counts(self):
        """
        Plot a horizontal bar chart of entity counts (excluding "O") per model using Plotly.
        """
        result_df = self.get_token_results()
        if result_df.empty:
            print("No token results to plot.")
            return

        # Extract only label columns (excluding token info)
        label_result_df = result_df[result_df.columns[4:]].copy()

        # Count how many non-'O' labels exist for each model
        label_count_list = [label_result_df[label_result_df[col] != "O"].shape[0] for col in label_result_df.columns]

        # Create bar plot
        fig = go.Figure(go.Bar(
            x=label_count_list,
            y=label_result_df.columns,
            orientation='h'
        ))

        fig.update_layout(
            autosize=False,
            width=1500,
            height=1500,
            margin=dict(l=50, r=50, b=100, t=100, pad=4),
            paper_bgcolor="LightSteelBlue",
            # `titlefont` is deprecated in recent Plotly releases; fonts are set via `title.font` instead
            title=dict(text="Counts of Labelled Entities", y=0.98, x=0.5, xanchor='center', yanchor='top', font=dict(size=30)),
            yaxis=dict(title=dict(text="Clinical NER Models", font=dict(size=30))),
        )

        fig.show()


    @staticmethod
    def build_docs(token_pred_df: pd.DataFrame) -> pd.DataFrame:
        return (
            token_pred_df[['doc_id', 'text']]
            .drop_duplicates(subset='doc_id')
            .sort_values('doc_id')
            .reset_index(drop=True)
        )

    @staticmethod
    def run_light_pipeline_on_docs(docs_df: pd.DataFrame, pipeline) -> pd.DataFrame:
        """
        Runs a Spark NLP LightPipeline on a docs dataframe (doc_id, text)
        and returns concatenated token-level predictions for all documents.
        """
        results = []
        for doc_id, text in zip(docs_df['doc_id'].tolist(), docs_df['text'].tolist()):
            annotated = pipeline.fullAnnotate(text)[0]
            viewer = NERResultViewer(annotated)
            token_df = viewer.get_token_results()
            token_df.insert(0, 'doc_id', doc_id)
            results.append(token_df)

        prediction_df = pd.concat(results, ignore_index=True) if results else pd.DataFrame()
        return prediction_df

    @staticmethod
    def infer_model_columns(prediction_df: pd.DataFrame) -> List[str]:
        base = {'doc_id', 'sentence', 'begin', 'end', 'token'}
        return [c for c in prediction_df.columns if c not in base]

    @staticmethod
    def get_model_df(pred_df: pd.DataFrame, model_name: str, strip_prefix: bool = True) -> pd.DataFrame:
        df = pred_df[['doc_id', 'sentence', 'begin', 'end', 'token', model_name]].copy()
        if strip_prefix and df[model_name].dtype == 'object':
            df[model_name] = df[model_name].str.replace(r'^[BI]-', '', regex=True)
        return df

    @staticmethod
    def merge_all_models(
        gold_token_df: pd.DataFrame,
        pred_df: pd.DataFrame,
        model_cols: Optional[List[str]] = None,
        strip_prefix: bool = True
    ) -> Dict[str, pd.DataFrame]:

        if model_cols is None:
            model_cols = NERResultViewer.infer_model_columns(pred_df)

        results: Dict[str, pd.DataFrame] = {}

        gold = gold_token_df.copy()
        if 'token_label' not in gold.columns and 'label' in gold.columns:
            gold = gold.rename(columns={'label': 'token_label'})

        for model in model_cols:
            if model not in pred_df.columns:
                continue

            df_pred = NERResultViewer.get_model_df(pred_df, model, strip_prefix)

            merged = gold.merge(
                df_pred[['doc_id', 'begin', 'end', model]],
                on=['doc_id', 'begin', 'end'],
                how='inner',
                validate='one_to_one'
            )

            merged = merged.rename(columns={model: 'pred_label'})

            ordered_cols = [c for c in ['doc_id', 'text', 'begin', 'end', 'token', 'token_label', 'pred_label'] if c in merged.columns]
            results[model] = merged[ordered_cols].copy()

        return results
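
The two steps that `get_model_df` and `merge_all_models` combine, stripping the `B-`/`I-` prefix and joining gold and predicted labels on `(doc_id, begin, end)`, can be shown without pandas. This is a minimal sketch on hand-written toy dictionaries; `strip_bio`, `gold`, and `pred` are illustrative names, not part of the notebook.

```python
import re

def strip_bio(tag):
    """Remove a leading B- or I- prefix; other tags (such as O) pass through unchanged."""
    return re.sub(r"^[BI]-", "", tag)

# Toy token labels keyed by (doc_id, begin, end)
gold = {(1, 1, 9): "IDNUM", (1, 11, 13): "LOCATION", (1, 15, 21): "IDNUM"}
pred = {(1, 1, 9): "B-ID", (1, 11, 13): "B-LOCATION", (1, 23, 32): "O"}

# Inner join: keep only tokens present on both sides
merged = {
    key: (gold[key], strip_bio(pred[key]))
    for key in gold.keys() & pred.keys()
}

for key in sorted(merged):
    print(key, merged[key])
```

Because the join is an inner join, a token that appears on only one side is silently dropped, so both sides need to come from the same tokenization for the evaluation to cover every token.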


In [None]:
token_pred_df[['doc_id','text']].head()

Unnamed: 0,doc_id,text
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...


In [None]:
docs = NERResultViewer.build_docs(token_pred_df)

In [None]:
prediction_df = NERResultViewer.run_light_pipeline_on_docs(docs, deid_pipeline)
prediction_df.head()

Unnamed: 0,doc_id,sentence,begin,end,token,ner_deid_augmented,ner_deid_subentity_augmented_i2b2,ner_deidentify_dl,ner_deid_subentity_augmented_langtest,ner_deid_synthetic,ner_deid_aipii,ner_deid_large_langtest,ner_deid_large,ner_deid_subentity_docwise,ner_deid_subentity_augmented_v2,ner_deid_subentity_augmented_docwise,zeroshot_ner_deid_subentity_merged_medium,ner_deid_sd_large,ner_deid_enriched_langtest,ner_deid_sd,ner_deid_generic_augmented,ner_deid_enriched,ner_deid_generic_augmented_langtest,ner_deid_subentity_augmented,ner_deid_generic_docwise,ner_deid_generic_augmented_allUpperCased_langtest,state_matcher,ssn_parser,phone_parser,zip_matcher,ip_matcher,medical_record_parser,date_of_death_parser,account_parser,plate_parser,url_matcher,email_matcher,zip_parser,license_parser,phone_matcher,dln_parser,vin_parser,age_parser,country_matcher,date_matcher,date_of_birth_parser
0,1,0,1,9,957770228,B-ID,B-IDNUM,I-MEDICALRECORD,B-MEDICALRECORD,B-ID,O,B-ID,B-ID,B-MEDICALRECORD,B-MEDICALRECORD,B-MEDICALRECORD,O,B-ID,B-MEDICALRECORD,I-ID,B-ID,B-MEDICALRECORD,B-ID,B-MEDICALRECORD,B-ID,B-ID,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
1,1,1,11,13,FIH,B-LOCATION,B-HOSPITAL,I-HOSPITAL,B-HOSPITAL,B-LOCATION,O,B-LOCATION,B-LOCATION,B-HOSPITAL,B-HOSPITAL,B-HOSPITAL,O,I-LOCATION,B-HOSPITAL,I-LOCATION,B-LOCATION,B-HOSPITAL,B-LOCATION,B-HOSPITAL,B-LOCATION,B-LOCATION,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
2,1,1,15,21,0408267,B-ID,B-IDNUM,I-MEDICALRECORD,B-MEDICALRECORD,B-ID,O,O,B-ID,B-MEDICALRECORD,B-MEDICALRECORD,B-MEDICALRECORD,O,I-ID,I-MEDICALRECORD,I-ID,B-ID,I-MEDICALRECORD,B-ID,B-MEDICALRECORD,B-ID,B-ID,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
3,1,1,23,32,46769/5v7d,O,B-MEDICALRECORD,I-MEDICALRECORD,O,O,O,O,O,B-MEDICALRECORD,O,O,O,I-ID,O,O,I-ID,I-MEDICALRECORD,B-LOCATION,O,O,I-ID,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O
4,1,1,34,39,237890,B-ID,B-IDNUM,I-MEDICALRECORD,O,B-ID,O,O,B-ID,B-MEDICALRECORD,B-IDNUM,B-MEDICALRECORD,O,I-ID,B-IDNUM,O,I-ID,I-MEDICALRECORD,B-ID,B-MEDICALRECORD,O,B-ID,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O


In [None]:
all_merged = NERResultViewer.merge_all_models(token_pred_df, prediction_df)

### 🎈 **ner_deid_augmented**

In [None]:
df_aug = all_merged['ner_deid_augmented']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,ID
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,ID
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID': 'IDNUM'})

In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.97      0.78      0.87        37
    LOCATION       0.73      0.84      0.78        81
        DATE       0.98      1.00      0.99       213
         AGE       1.00      0.30      0.46        20
        NAME       0.89      0.58      0.70       165
     CONTACT       0.50      1.00      0.67         3

   micro avg       0.90      0.80      0.85       519
   macro avg       0.84      0.75      0.74       519
weighted avg       0.91      0.80      0.83       519
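
Note that the report is computed per token: `support` counts tokens rather than annotated chunks (for example, 213 `DATE` tokens versus 123 `DATE` chunks in the shared sample), and `O` is excluded through the `labels` argument. As a rough illustration of how each row is derived, the sketch below recomputes precision, recall, and F1 for one label in plain Python. `token_prf` and the toy label lists are hypothetical and are not taken from the benchmark data.

```python
def token_prf(y_true, y_pred, label):
    """Token-level precision, recall, F1 and support for a single label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of gold tokens carrying this label
    return precision, recall, f1, support

# Made-up gold and predicted token labels
toy_true = ["NAME", "NAME", "O", "DATE", "NAME"]
toy_pred = ["NAME", "O", "NAME", "DATE", "NAME"]

for label in ["NAME", "DATE"]:
    p, r, f1, n = token_prf(toy_true, toy_pred, label)
    print(f"{label:>5}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}  support={n}")
```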



### 🎈 **ner_deid_subentity_augmented_i2b2**

In [None]:
df_aug = all_merged['ner_deid_subentity_augmented_i2b2']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,IDNUM
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,HOSPITAL
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,IDNUM
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,MEDICALRECORD
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,IDNUM
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['token_label'].value_counts()

token_label,count
O,8113
DATE,213
NAME,165
LOCATION,81
IDNUM,37
AGE,20
CONTACT,3


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8171
DATE,218
DOCTOR,99
HOSPITAL,49
IDNUM,17
MEDICALRECORD,14
STATE,14
PHONE,13
PATIENT,13
STREET,9


In [None]:
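# Collapse the model's sub-entity labels into the coarser ground-truth label set;
# labels with no ground-truth counterpart (ORGANIZATION, DEVICE) are treated as 'O'.
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD': 'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION',
                                                     'STREET':'LOCATION', 'COUNTRY':'LOCATION', 'STATE':'LOCATION',
                                                     'PHONE':'CONTACT', 'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME',
                                                     'ORGANIZATION':'O', 'DEVICE':'O'})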

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8174
DATE,218
NAME,112
LOCATION,76
IDNUM,31
CONTACT,13
AGE,8


In [None]:
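from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Score only the entity labels found in the ground truth; 'O' is left out so the
# averages are not dominated by the non-entity tokens.
labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))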

              precision    recall  f1-score   support

       IDNUM       1.00      0.84      0.91        37
    LOCATION       0.70      0.65      0.68        81
        DATE       0.97      0.99      0.98       213
         AGE       0.88      0.35      0.50        20
        NAME       0.91      0.62      0.74       165
     CONTACT       0.23      1.00      0.38         3

   micro avg       0.89      0.78      0.83       519
   macro avg       0.78      0.74      0.70       519
weighted avg       0.90      0.78      0.83       519
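
The CONTACT row above pairs a recall of 1.00 with a precision of only 0.23, which can look odd at first. The following sketch reproduces that row from the counts shown earlier: 3 gold CONTACT tokens and 13 tokens predicted as CONTACT after mapping. The helper `prf` is a hypothetical name introduced here for illustration, and the true-positive count is inferred from the printed recall rather than read from the data.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw token counts (hypothetical helper)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

support = 3      # gold CONTACT tokens (token_label counts)
predicted = 13   # tokens labelled CONTACT after mapping (pred_label counts)
tp = 3           # assumption: recall of 1.00 on 3 gold tokens means all 3 were found

p, r, f1 = prf(tp, fp=predicted - tp, fn=support - tp)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # → 0.23 1.00 0.38
```

With so few gold tokens, ten false positives are enough to pull precision down sharply, so the CONTACT row is best read alongside its support rather than on its own.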



### 🎈 **ner_deidentify_dl**

In [None]:
df_aug = all_merged['ner_deidentify_dl']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,MEDICALRECORD
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,HOSPITAL
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,MEDICALRECORD
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,MEDICALRECORD
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,MEDICALRECORD
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['token_label'].value_counts()

token_label,count
O,8113
DATE,213
NAME,165
LOCATION,81
IDNUM,37
AGE,20
CONTACT,3


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8199
DATE,206
DOCTOR,91
HOSPITAL,55
MEDICALRECORD,27
PATIENT,22
STREET,13
PHONE,7
AGE,6
EMAIL,1


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD': 'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION',
                                                     'STREET':'LOCATION', 'STATE':'LOCATION', 'PHONE':'CONTACT', 'EMAIL':'CONTACT',
                                                     'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME', 'USERNAME':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8200
DATE,206
NAME,113
LOCATION,71
IDNUM,28
CONTACT,8
AGE,6
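
Because each model uses its own label set, a label missing from the mapping dictionary would pass through unchanged and count as an error against every ground-truth class. A minimal sketch of a sanity check is shown below. `find_unmapped` is a hypothetical helper, not part of the notebook, and the `FAX` label is invented purely to show what a miss looks like.

```python
# Ground-truth label set used throughout the notebook.
GOLD_LABELS = {"O", "DATE", "NAME", "LOCATION", "IDNUM", "AGE", "CONTACT"}

# Mapping copied from the ner_deidentify_dl cell above.
MAPPING = {
    "MEDICALRECORD": "IDNUM", "ZIP": "LOCATION", "CITY": "LOCATION",
    "STREET": "LOCATION", "STATE": "LOCATION", "PHONE": "CONTACT",
    "EMAIL": "CONTACT", "DOCTOR": "NAME", "HOSPITAL": "LOCATION",
    "PATIENT": "NAME", "USERNAME": "O",
}

def find_unmapped(pred_labels, mapping, gold_labels):
    """Return predicted labels that still fall outside the gold set after mapping."""
    mapped = {mapping.get(label, label) for label in pred_labels}
    return sorted(mapped - set(gold_labels))

# Labels visible in the value_counts() output before mapping.
observed = ["O", "DATE", "DOCTOR", "HOSPITAL", "MEDICALRECORD",
            "PATIENT", "STREET", "PHONE", "AGE", "EMAIL"]
print(find_unmapped(observed, MAPPING, GOLD_LABELS))            # → []

# "FAX" is an invented label, included only to demonstrate a detected miss.
print(find_unmapped(observed + ["FAX"], MAPPING, GOLD_LABELS))  # → ['FAX']
```

On the real data the first argument could be `df_aug['pred_label'].unique()`, which also covers labels that the truncated `value_counts()` display does not show.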


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.96      0.73      0.83        37
    LOCATION       0.77      0.68      0.72        81
        DATE       0.99      0.95      0.97       213
         AGE       1.00      0.30      0.46        20
        NAME       0.88      0.61      0.72       165
     CONTACT       0.38      1.00      0.55         3

   micro avg       0.91      0.76      0.83       519
   macro avg       0.83      0.71      0.71       519
weighted avg       0.92      0.76      0.82       519



### 🎈 **ner_deid_subentity_augmented_langtest**

In [None]:
df_aug = all_merged['ner_deid_subentity_augmented_langtest']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,MEDICALRECORD
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,HOSPITAL
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,MEDICALRECORD
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8170
DATE,210
DOCTOR,116
HOSPITAL,42
MEDICALRECORD,27
PATIENT,16
STREET,13
AGE,8
CITY,8
IDNUM,5


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD': 'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION',
                                                     'STREET':'LOCATION', 'STATE':'LOCATION', 'COUNTRY':'LOCATION',
                                                     'PHONE':'CONTACT', 'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME',
                                                     'USERNAME':'O', 'PROFESSION':'O', 'ORGANIZATION':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8177
DATE,210
NAME,132
LOCATION,71
IDNUM,32
AGE,8
CONTACT,2


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.91      0.78      0.84        37
    LOCATION       0.85      0.74      0.79        81
        DATE       0.99      0.97      0.98       213
         AGE       1.00      0.40      0.57        20
        NAME       0.85      0.68      0.75       165
     CONTACT       1.00      0.67      0.80         3

   micro avg       0.92      0.81      0.86       519
   macro avg       0.93      0.71      0.79       519
weighted avg       0.92      0.81      0.85       519



### 🎈 **ner_deid_synthetic**

In [None]:
df_aug = all_merged['ner_deid_synthetic']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,ID
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,ID
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8198
DATE,207
NAME,111
LOCATION,73
ID,31
AGE,9
CONTACT,3


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID': 'IDNUM'})

In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       1.00      0.84      0.91        37
    LOCATION       0.85      0.77      0.81        81
        DATE       1.00      0.97      0.98       213
         AGE       0.67      0.30      0.41        20
        NAME       0.89      0.60      0.72       165
     CONTACT       1.00      1.00      1.00         3

   micro avg       0.94      0.78      0.85       519
   macro avg       0.90      0.75      0.80       519
weighted avg       0.93      0.78      0.84       519



### 🎈 **ner_deid_aipii**

In [None]:
df_aug = all_merged['ner_deid_aipii']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8468
NAME,69
DATE,25
CITY,23
STREET,23
LICENSE,8
ZIP,7
IDNUM,5
SSN,4


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'SSN': 'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION',
                                                     'STREET':'LOCATION', 'LICENSE':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8476
NAME,69
LOCATION,53
DATE,25
IDNUM,9


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

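labels = df_aug['token_label'].unique()
# ner_deid_aipii emits no AGE or CONTACT predictions (see the counts above), so those
# labels are dropped along with 'O'; the averages therefore cover four labels only.
exclude = ['O', 'CONTACT', 'AGE']
labels = [l for l in labels if l not in exclude]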

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.22      0.05      0.09        37
    LOCATION       0.53      0.35      0.42        81
        DATE       1.00      0.12      0.21       213
        NAME       0.48      0.20      0.28       165

   micro avg       0.56      0.18      0.27       496
   macro avg       0.56      0.18      0.25       496
weighted avg       0.69      0.18      0.26       496
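
The support of 496 in this report differs from the 519 used for the other models because AGE and CONTACT were excluded. The sketch below shows one way the evaluated label list and the reduced support could be derived from the counts printed above instead of being hard-coded. The variable names are illustrative and not taken from the notebook.

```python
# Ground-truth entity labels with their token counts (token_label value counts).
gold_support = {"IDNUM": 37, "LOCATION": 81, "DATE": 213,
                "AGE": 20, "NAME": 165, "CONTACT": 3}

# Labels ner_deid_aipii produces after mapping (pred_label value counts).
predicted = {"O", "NAME", "LOCATION", "DATE", "IDNUM"}

# Keep only the gold labels the model is able to emit.
eval_labels = [label for label in gold_support if label in predicted]
skipped = [label for label in gold_support if label not in predicted]

print(eval_labels)                                   # → ['IDNUM', 'LOCATION', 'DATE', 'NAME']
print(skipped)                                       # → ['AGE', 'CONTACT']
print(sum(gold_support[l] for l in eval_labels))     # → 496
```

Since the averages here are taken over four labels rather than six, they are not strictly comparable with the averages reported for the other models.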



### 🎈 **ner_deid_large_langtest**

In [None]:
df_aug = all_merged['ner_deid_large_langtest']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8035
DATE,250
NAME,155
LOCATION,120
ID,24
PROFESSION,24
AGE,18
CONTACT,6


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID': 'IDNUM', 'PROFESSION':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8059
DATE,250
NAME,155
LOCATION,120
IDNUM,24
AGE,18
CONTACT,6


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.83      0.54      0.66        37
    LOCATION       0.47      0.69      0.56        81
        DATE       0.85      1.00      0.92       213
         AGE       0.33      0.30      0.32        20
        NAME       0.63      0.59      0.61       165
     CONTACT       0.50      1.00      0.67         3

   micro avg       0.69      0.76      0.72       519
   macro avg       0.60      0.69      0.62       519
weighted avg       0.70      0.76      0.72       519



### 🎈 **ner_deid_large**

In [None]:
df_aug = all_merged['ner_deid_large']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,ID
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,ID
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8172
DATE,206
NAME,139
LOCATION,73
ID,32
AGE,7
CONTACT,3


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID': 'IDNUM'})

In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       1.00      0.86      0.93        37
    LOCATION       0.90      0.81      0.86        81
        DATE       0.99      0.96      0.97       213
         AGE       0.86      0.30      0.44        20
        NAME       0.96      0.81      0.88       165
     CONTACT       1.00      1.00      1.00         3

   micro avg       0.97      0.86      0.91       519
   macro avg       0.95      0.79      0.85       519
weighted avg       0.96      0.86      0.90       519



### 🎈 **ner_deid_subentity_docwise**

In [None]:
df_aug = all_merged['ner_deid_subentity_docwise']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,MEDICALRECORD
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,HOSPITAL
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,MEDICALRECORD
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,MEDICALRECORD
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,MEDICALRECORD
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8206
DATE,205
DOCTOR,99
HOSPITAL,42
MEDICALRECORD,29
CITY,11
PATIENT,10
AGE,8
PHONE,7
IDNUM,5


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD': 'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION',
                                                     'STREET':'LOCATION', 'STATE':'LOCATION', 'USERNAME':'O',
                                                     'PHONE':'CONTACT', 'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME',
                                                     'PROFESSION':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8209
DATE,205
NAME,109
LOCATION,60
IDNUM,34
AGE,8
CONTACT,7


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.91      0.84      0.87        37
    LOCATION       0.93      0.69      0.79        81
        DATE       1.00      0.96      0.98       213
         AGE       1.00      0.40      0.57        20
        NAME       0.92      0.61      0.73       165
     CONTACT       0.29      0.67      0.40         3

   micro avg       0.95      0.77      0.85       519
   macro avg       0.84      0.69      0.72       519
weighted avg       0.95      0.77      0.84       519



### 🎈 **ner_deid_subentity_augmented_v2**

In [None]:
df_aug = all_merged['ner_deid_subentity_augmented_v2']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,MEDICALRECORD
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,HOSPITAL
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,MEDICALRECORD
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,IDNUM
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,USERNAME


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8097
DATE,216
DOCTOR,115
HOSPITAL,64
PATIENT,49
MEDICALRECORD,20
STREET,15
PHONE,11
AGE,9
ORGANIZATION,8


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD': 'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION',
                                                     'STREET':'LOCATION', 'STATE':'LOCATION', 'USERNAME':'O',
                                                     'PHONE':'CONTACT', 'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME',
                                                     'PROFESSION':'O', 'ORGANIZATION':'O', 'DEVICE':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8116
DATE,216
NAME,164
LOCATION,90
IDNUM,26
CONTACT,11
AGE,9


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       1.00      0.70      0.83        37
    LOCATION       0.80      0.89      0.84        81
        DATE       0.98      0.99      0.98       213
         AGE       0.89      0.40      0.55        20
        NAME       0.89      0.88      0.89       165
     CONTACT       0.27      1.00      0.43         3

   micro avg       0.90      0.90      0.90       519
   macro avg       0.80      0.81      0.75       519
weighted avg       0.92      0.90      0.90       519
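
For a quick side-by-side view, the micro-averaged rows from the reports in this part of the notebook can be gathered into one table. The sketch below copies those printed values by hand. In a live session they could probably be taken from `classification_report(..., output_dict=True)` instead, which is an assumption about how one might extend the notebook rather than something it currently does.

```python
# Micro-averaged (precision, recall, f1) copied from the reports printed above.
# ner_deid_aipii was scored on four labels only (AGE and CONTACT excluded).
micro = {
    "ner_deid_subentity_augmented_i2b2":     (0.89, 0.78, 0.83),
    "ner_deidentify_dl":                     (0.91, 0.76, 0.83),
    "ner_deid_subentity_augmented_langtest": (0.92, 0.81, 0.86),
    "ner_deid_synthetic":                    (0.94, 0.78, 0.85),
    "ner_deid_aipii":                        (0.56, 0.18, 0.27),
    "ner_deid_large_langtest":               (0.69, 0.76, 0.72),
    "ner_deid_large":                        (0.97, 0.86, 0.91),
    "ner_deid_subentity_docwise":            (0.95, 0.77, 0.85),
    "ner_deid_subentity_augmented_v2":       (0.90, 0.90, 0.90),
}

# Rank by micro F1, highest first.
ranked = sorted(micro.items(), key=lambda kv: kv[1][2], reverse=True)

print(f"{'model':<40}{'prec':>6}{'rec':>6}{'f1':>6}")
for name, (p, r, f1) in ranked:
    print(f"{name:<40}{p:>6.2f}{r:>6.2f}{f1:>6.2f}")

# Recall is worth checking separately when missed identifiers are the main concern.
best_recall = max(micro, key=lambda name: micro[name][1])
print(best_recall)  # → ner_deid_subentity_augmented_v2
```

On these ten documents `ner_deid_large` has the highest micro F1 (0.91), while `ner_deid_subentity_augmented_v2` has the highest micro recall (0.90).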



### 🎈 **ner_deid_subentity_augmented_docwise**

In [None]:
df_aug = all_merged['ner_deid_subentity_augmented_docwise']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,MEDICALRECORD
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,HOSPITAL
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,MEDICALRECORD
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,MEDICALRECORD
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8121
DATE,214
DOCTOR,133
PATIENT,44
HOSPITAL,43
MEDICALRECORD,29
LOCATION,10
AGE,8
STREET,8
USERNAME,6


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD': 'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION',
                                                     'STREET':'LOCATION', 'STATE':'LOCATION', 'USERNAME':'O',
                                                     'PHONE':'CONTACT', 'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME',
                                                     'ORGANIZATION':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8131
DATE,214
NAME,177
LOCATION,68
IDNUM,31
AGE,8
CONTACT,3


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       1.00      0.84      0.91        37
    LOCATION       0.91      0.77      0.83        81
        DATE       0.99      0.99      0.99       213
         AGE       1.00      0.40      0.57        20
        NAME       0.82      0.88      0.85       165
     CONTACT       1.00      1.00      1.00         3

   micro avg       0.92      0.89      0.90       519
   macro avg       0.95      0.81      0.86       519
weighted avg       0.92      0.89      0.90       519
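
Because `O` is left out of `labels`, the micro-averaged precision and recall are no longer forced to be equal: precision divides by the number of entity predictions, recall by the number of gold entity tokens. In the report above, the post-mapping counts sum to 501 non-`O` predictions against a support of 519, which is consistent with precision (0.92) exceeding recall (0.89). The sketch below reproduces that behaviour on toy data in plain Python; `micro_precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def micro_precision_recall(y_true, y_pred, labels):
    """Micro-averaged precision and recall restricted to the given labels."""
    labels = set(labels)
    # A true positive is a token whose gold and predicted label agree
    # and belong to the scored label set.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p and t in labels)
    n_pred = sum(1 for p in y_pred if p in labels)   # TP + FP
    n_true = sum(1 for t in y_true if t in labels)   # TP + FN
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall


# Toy data, made up for illustration only.
toy_true = ['NAME', 'NAME', 'O', 'O', 'DATE', 'O']
toy_pred = ['NAME', 'O', 'NAME', 'O', 'DATE', 'DATE']

p, r = micro_precision_recall(toy_true, toy_pred, labels=['NAME', 'DATE'])
print(f"micro precision={p:.2f} micro recall={r:.2f}")
# micro precision=0.50 micro recall=0.67
```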



### 🎈 **zeroshot_ner_deid_subentity_merged_medium**

In [None]:
df_aug = all_merged['zeroshot_ner_deid_subentity_merged_medium']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,USERNAME


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8066
DATE,217
DOCTOR,103
HOSPITAL,72
PATIENT,55
AGE,24
STREET,17
CITY,16
USERNAME,15
ORGANIZATION,13


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ZIP':'LOCATION', 'CITY':'LOCATION', 'COUNTRY':'LOCATION',
                                                     'STREET':'LOCATION', 'STATE':'LOCATION', 'USERNAME':'O',
                                                     'PHONE':'CONTACT', 'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME',
                                                     'ORGANIZATION':'O', 'PROFESSION':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8106
DATE,217
NAME,158
LOCATION,112
AGE,24
IDNUM,12
CONTACT,3


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       1.00      0.32      0.49        37
    LOCATION       0.68      0.94      0.79        81
        DATE       0.98      1.00      0.99       213
         AGE       0.33      0.40      0.36        20
        NAME       0.96      0.92      0.93       165
     CONTACT       0.67      0.67      0.67         3

   micro avg       0.88      0.89      0.88       519
   macro avg       0.77      0.71      0.71       519
weighted avg       0.90      0.89      0.88       519



### 🎈 **ner_deid_sd_large**

In [None]:
df_aug = all_merged['ner_deid_sd_large']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,ID
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,ID
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,ID
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8149
DATE,215
NAME,149
LOCATION,78
ID,31
AGE,6
CONTACT,4


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID': 'IDNUM'})

In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.94      0.78      0.85        37
    LOCATION       0.82      0.79      0.81        81
        DATE       0.96      0.97      0.97       213
         AGE       1.00      0.30      0.46        20
        NAME       0.92      0.83      0.87       165
     CONTACT       0.50      0.67      0.57         3

   micro avg       0.92      0.86      0.89       519
   macro avg       0.86      0.72      0.76       519
weighted avg       0.92      0.86      0.88       519



### 🎈 **ner_deid_generic_augmented**

In [None]:
df_aug = all_merged['ner_deid_generic_augmented']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,ID
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,ID
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,ID
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8147
DATE,214
NAME,136
LOCATION,85
ID,36
AGE,8
CONTACT,6


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID': 'IDNUM'})

In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.97      0.95      0.96        37
    LOCATION       0.80      0.84      0.82        81
        DATE       0.98      0.99      0.98       213
         AGE       1.00      0.40      0.57        20
        NAME       0.79      0.65      0.72       165
     CONTACT       0.50      1.00      0.67         3

   micro avg       0.89      0.83      0.86       519
   macro avg       0.84      0.80      0.79       519
weighted avg       0.89      0.83      0.85       519



### 🎈 **ner_deid_enriched**

In [None]:
df_aug = all_merged['ner_deid_enriched']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,MEDICALRECORD
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,HOSPITAL
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,MEDICALRECORD
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,MEDICALRECORD
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,MEDICALRECORD
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8145
DATE,215
DOCTOR,101
HOSPITAL,49
PATIENT,47
MEDICALRECORD,30
STREET,15
CITY,8
AGE,7
PHONE,4


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD':'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION', 'COUNTRY':'LOCATION',
                                                     'STREET':'LOCATION', 'STATE':'LOCATION', 'USERNAME':'O',
                                                     'PHONE':'CONTACT', 'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8149
DATE,215
NAME,148
LOCATION,77
IDNUM,32
AGE,7
CONTACT,4


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.97      0.84      0.90        37
    LOCATION       0.82      0.78      0.80        81
        DATE       0.99      1.00      0.99       213
         AGE       1.00      0.35      0.52        20
        NAME       0.94      0.84      0.89       165
     CONTACT       0.50      0.67      0.57         3

   micro avg       0.94      0.87      0.91       519
   macro avg       0.87      0.75      0.78       519
weighted avg       0.94      0.87      0.90       519
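
The gap between the macro and weighted averages comes from class imbalance: the macro average treats every label equally, while the weighted average scales each label by its support, so DATE (213 tokens) counts far more than CONTACT (3 tokens). As a rough check, the two F1 averages can be recomputed from the rounded per-label values in the report above; the dictionary name `rows` is illustrative, and small deviations are expected because the inputs are already rounded.

```python
# label: (f1-score, support), copied from the report above.
rows = {
    'IDNUM': (0.90, 37),
    'LOCATION': (0.80, 81),
    'DATE': (0.99, 213),
    'AGE': (0.52, 20),
    'NAME': (0.89, 165),
    'CONTACT': (0.57, 3),
}

total_support = sum(n for _, n in rows.values())

# Macro: unweighted mean over labels.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted: mean over labels, weighted by support.
weighted_f1 = sum(f1 * n for f1, n in rows.values()) / total_support

print(f"support={total_support} macro={macro_f1:.2f} weighted={weighted_f1:.2f}")
# support=519 macro=0.78 weighted=0.90
```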



### 🎈 **ner_deid_generic_augmented_langtest**

In [None]:
df_aug = all_merged['ner_deid_generic_augmented_langtest']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,ID
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,LOCATION
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,ID
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,NAME


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8135
DATE,197
NAME,152
LOCATION,94
ID,37
AGE,8
PROFESSION,6
CONTACT,3


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID':'IDNUM', 'PROFESSION':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8141
DATE,197
NAME,152
LOCATION,94
IDNUM,37
AGE,8
CONTACT,3


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.76      0.76      0.76        37
    LOCATION       0.73      0.85      0.79        81
        DATE       0.98      0.91      0.94       213
         AGE       1.00      0.40      0.57        20
        NAME       0.76      0.70      0.73       165
     CONTACT       0.67      0.67      0.67         3

   micro avg       0.85      0.80      0.82       519
   macro avg       0.82      0.71      0.74       519
weighted avg       0.85      0.80      0.82       519



### 🎈 **ner_deid_subentity_augmented**

In [None]:
df_aug = all_merged['ner_deid_subentity_augmented']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,MEDICALRECORD
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,HOSPITAL
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,MEDICALRECORD
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,MEDICALRECORD
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8186
DATE,198
DOCTOR,101
HOSPITAL,41
MEDICALRECORD,24
PATIENT,22
IDNUM,16
STREET,10
AGE,9
PHONE,7


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD':'IDNUM', 'ZIP':'LOCATION', 'CITY':'LOCATION', 'ORGANIZATION':'O',
                                                     'STREET':'LOCATION', 'STATE':'LOCATION', 'USERNAME':'O',
                                                     'PHONE':'CONTACT', 'DOCTOR':'NAME', 'HOSPITAL':'LOCATION', 'PATIENT':'NAME'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8192
DATE,198
NAME,123
LOCATION,63
IDNUM,40
AGE,9
CONTACT,7


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.68      0.73      0.70        37
    LOCATION       0.86      0.67      0.75        81
        DATE       0.98      0.91      0.94       213
         AGE       0.89      0.40      0.55        20
        NAME       0.84      0.62      0.72       165
     CONTACT       0.29      0.67      0.40         3

   micro avg       0.88      0.75      0.81       519
   macro avg       0.75      0.67      0.68       519
weighted avg       0.89      0.75      0.81       519



### 🎈 **ner_deid_generic_docwise**

In [None]:
df_aug = all_merged['ner_deid_generic_docwise']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,ID
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8156
DATE,216
NAME,129
LOCATION,75
ID,38
AGE,8
PROFESSION,6
CONTACT,4


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID':'IDNUM', 'PROFESSION':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8162
DATE,216
NAME,129
LOCATION,75
IDNUM,38
AGE,8
CONTACT,4


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.74      0.76      0.75        37
    LOCATION       0.71      0.65      0.68        81
        DATE       0.96      0.97      0.97       213
         AGE       1.00      0.40      0.57        20
        NAME       0.82      0.64      0.72       165
     CONTACT       0.50      0.67      0.57         3

   micro avg       0.86      0.78      0.82       519
   macro avg       0.79      0.68      0.71       519
weighted avg       0.86      0.78      0.81       519



### 🎈 **ner_deid_generic_augmented_allUpperCased_langtest**

In [None]:
df_aug = all_merged['ner_deid_generic_augmented_allUpperCased_langtest']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,ID
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,LOCATION
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,ID
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,ID
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,ID
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8165
DATE,193
NAME,132
LOCATION,89
ID,40
AGE,9
CONTACT,2
PROFESSION,2


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ID':'IDNUM', 'PROFESSION':'O'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8167
DATE,193
NAME,132
LOCATION,89
IDNUM,40
AGE,9
CONTACT,2


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

labels = df_aug['token_label'].unique()
labels = [l for l in labels if l != 'O']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       0.82      0.89      0.86        37
    LOCATION       0.67      0.74      0.71        81
        DATE       0.96      0.87      0.92       213
         AGE       0.89      0.40      0.55        20
        NAME       0.79      0.63      0.70       165
     CONTACT       1.00      0.67      0.80         3

   micro avg       0.85      0.76      0.80       519
   macro avg       0.86      0.70      0.76       519
weighted avg       0.85      0.76      0.80       519



### 🎈 **state_matcher**

In [None]:
df_aug = all_merged['state_matcher']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8628
STATE,4


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'STATE':'LOCATION'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8628
LOCATION,4


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Score only LOCATION, the ground-truth class that STATE was mapped to.
labels = [l for l in df_aug['token_label'].unique() if l == 'LOCATION']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

    LOCATION       1.00      0.05      0.09        81

   micro avg       1.00      0.05      0.09        81
   macro avg       1.00      0.05      0.09        81
weighted avg       1.00      0.05      0.09        81
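
The figures in this report follow from the counts shown above: the matcher emitted 4 `LOCATION` tokens, a precision of 1.00 means all 4 were correct, and the ground truth holds 81 `LOCATION` tokens. A short sketch of that arithmetic (the helper name `prf` is hypothetical):

```python
def prf(true_positives, n_predicted, n_actual):
    """Precision, recall and F1 for a single class from raw token counts."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_actual if n_actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# state_matcher: 4 predicted LOCATION tokens, all correct, 81 in the ground truth.
p, r, f1 = prf(true_positives=4, n_predicted=4, n_actual=81)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 1.00 0.05 0.09
```

The low recall likely reflects scope rather than errors: a state matcher is scored here against the whole `LOCATION` class, which covers more than state names.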



### 🎈 **ssn_parser**

In [None]:
df_aug = all_merged['ssn_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**
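
When an annotator emits only `O`, there is nothing to remap or score, so the report cell can be skipped. A minimal sketch of such a guard, assuming the `pred_label` column convention used above (`detected_labels` is a hypothetical helper and the series is toy data):

```python
import pandas as pd

def detected_labels(pred: pd.Series) -> list:
    """Return the sorted non-'O' labels present in a prediction column."""
    return sorted(set(pred) - {"O"})

toy_pred = pd.Series(["O", "O", "O"])  # toy stand-in for df_aug['pred_label']
if not detected_labels(toy_pred):
    print("No entities detected; skipping classification_report.")
```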

### 🎈 **phone_parser**

In [None]:
df_aug = all_merged['phone_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8630
PHONE,2


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'PHONE':'CONTACT'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8630
CONTACT,2


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Score only CONTACT, the ground-truth class that PHONE was mapped to.
labels = [l for l in df_aug['token_label'].unique() if l == 'CONTACT']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

     CONTACT       1.00      0.67      0.80         3

   micro avg       1.00      0.67      0.80         3
   macro avg       1.00      0.67      0.80         3
weighted avg       1.00      0.67      0.80         3
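
The remap-then-score cells repeat for each annotator, so they could be folded into one function. The sketch below is one possible shape, not the notebook's own code: `evaluate_annotator` and the toy frame are hypothetical, while `classification_report` and its `labels`, `output_dict` and `zero_division` parameters are standard scikit-learn.

```python
import pandas as pd
from sklearn.metrics import classification_report

def evaluate_annotator(df, mapping, target_labels):
    """Remap predicted tags to the ground-truth scheme, then score target_labels only."""
    y_true = df["token_label"]
    y_pred = df["pred_label"].replace(mapping)
    # Keep only target labels that occur in the ground truth, as the cells above do.
    present = set(y_true.unique())
    labels = [l for l in target_labels if l in present]
    return classification_report(
        y_true, y_pred, labels=labels, output_dict=True, zero_division=0
    )

# Toy data: three CONTACT tokens in the ground truth, two of them tagged PHONE.
toy = pd.DataFrame({
    "token_label": ["CONTACT", "CONTACT", "CONTACT", "O", "O"],
    "pred_label":  ["PHONE", "PHONE", "O", "O", "O"],
})
report = evaluate_annotator(toy, {"PHONE": "CONTACT"}, ["CONTACT"])
print(report["CONTACT"])
```

Returning a dictionary instead of printed text makes it easier to collect the per-annotator rows into a single comparison table later.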



### 🎈 **zip_matcher**

In [None]:
df_aug = all_merged['zip_matcher']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8630
ZIP,2


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'ZIP':'LOCATION'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8630
LOCATION,2


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Score only LOCATION, the ground-truth class that ZIP was mapped to.
labels = [l for l in df_aug['token_label'].unique() if l == 'LOCATION']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

    LOCATION       1.00      0.02      0.05        81

   micro avg       1.00      0.02      0.05        81
   macro avg       1.00      0.02      0.05        81
weighted avg       1.00      0.02      0.05        81



### 🎈 **ip_matcher**

In [None]:
df_aug = all_merged['ip_matcher']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**

### 🎈 **medical_record_parser**

In [None]:
df_aug = all_merged['medical_record_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8631
MEDICALRECORD,1


In [None]:
df_aug['pred_label'] = df_aug['pred_label'].replace({'MEDICALRECORD':'IDNUM'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8631
IDNUM,1


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Score only IDNUM, the ground-truth class that MEDICALRECORD was mapped to.
labels = [l for l in df_aug['token_label'].unique() if l == 'IDNUM']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

       IDNUM       1.00      0.03      0.05        37

   micro avg       1.00      0.03      0.05        37
   macro avg       1.00      0.03      0.05        37
weighted avg       1.00      0.03      0.05        37
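
To compare the rule-based annotators scored so far at a glance, their single-class rows can be gathered into one table. A sketch using the figures printed in the reports above (the `summary` frame and its column names are hypothetical):

```python
import pandas as pd

# Rows copied from the classification reports above.
summary = pd.DataFrame(
    [
        ("state_matcher", "LOCATION", 1.00, 0.05, 0.09, 81),
        ("phone_parser", "CONTACT", 1.00, 0.67, 0.80, 3),
        ("zip_matcher", "LOCATION", 1.00, 0.02, 0.05, 81),
        ("medical_record_parser", "IDNUM", 1.00, 0.03, 0.05, 37),
    ],
    columns=["annotator", "label", "precision", "recall", "f1", "support"],
)

# Precision is 1.00 in every row, so recall is the column that separates them.
summary = summary.sort_values("recall", ascending=False).reset_index(drop=True)
print(summary.to_string(index=False))
```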



### 🎈 **date_of_death_parser**

In [None]:
df_aug = all_merged['date_of_death_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**

### 🎈 **account_parser**

In [None]:
df_aug = all_merged['account_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**

### 🎈 **plate_parser**

In [None]:
df_aug = all_merged['plate_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**

### 🎈 **url_matcher**

In [None]:
df_aug = all_merged['url_matcher']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**

### 🎈 **email_matcher**

In [None]:
df_aug = all_merged['email_matcher']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**

### 🎈 **zip_parser**

In [None]:
df_aug = all_merged['zip_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8630
ZIP,2


In [None]:
# Map the model's ZIP tag onto the ground-truth LOCATION tag
df_aug['pred_label'] = df_aug['pred_label'].replace({'ZIP': 'LOCATION'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8630
LOCATION,2


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Restrict the report to the LOCATION tag, the only tag this model maps to
labels = [l for l in df_aug['token_label'].unique() if l == 'LOCATION']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

    LOCATION       1.00      0.02      0.05        81

   micro avg       1.00      0.02      0.05        81
   macro avg       1.00      0.02      0.05        81
weighted avg       1.00      0.02      0.05        81
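
The LOCATION row above can be reproduced by hand from raw token counts. The following is a minimal sketch, assuming that both tokens tagged as LOCATION are correct (which is what a precision of 1.00 with two predictions implies). `prf` is a hypothetical helper and not part of the notebook.

```python
def prf(true_positives, n_predicted, n_actual):
    """Return (precision, recall, f1) for one label from raw token counts."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_actual if n_actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# 2 tokens predicted as LOCATION (both assumed correct),
# 81 LOCATION tokens in the ground truth (the "support" column)
precision, recall, f1 = prf(true_positives=2, n_predicted=2, n_actual=81)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 1.00 0.02 0.05
```

The low recall comes from the narrow scope of the model rather than from wrong predictions: it recovers 2 of the 81 LOCATION tokens and tags nothing else.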



### 🎈 **license_parser**

In [None]:
df_aug = all_merged['license_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**
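
One way to flag such models automatically is to check whether any predicted tag differs from `O`. The sketch below uses plain lists as stand-ins for the `pred_label` columns. The names `predictions` and `has_entities` and the sample values are illustrative assumptions, not part of the notebook.

```python
def has_entities(pred_labels, outside_tag="O"):
    """True if at least one token was assigned a tag other than the outside tag."""
    return any(label != outside_tag for label in pred_labels)

# Toy stand-ins for the pred_label column of two models (made-up values)
predictions = {
    "model_without_hits": ["O", "O", "O", "O"],
    "model_with_hits": ["O", "PHONE", "O", "O"],
}

silent_models = [name for name, labels in predictions.items() if not has_entities(labels)]
print(silent_models)  # ['model_without_hits']
```

Since iterating over a pandas Series yields its values, the same helper should also accept a column such as `all_merged[name]['pred_label']` directly.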

### 🎈 **phone_matcher**

In [None]:
df_aug = all_merged['phone_matcher']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8631
PHONE,1


In [None]:
# Map the model's PHONE tag onto the ground-truth CONTACT tag
df_aug['pred_label'] = df_aug['pred_label'].replace({'PHONE': 'CONTACT'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8631
CONTACT,1


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Restrict the report to the CONTACT tag, the only tag this model maps to
labels = [l for l in df_aug['token_label'].unique() if l == 'CONTACT']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

     CONTACT       1.00      0.33      0.50         3

   micro avg       1.00      0.33      0.50         3
   macro avg       1.00      0.33      0.50         3
weighted avg       1.00      0.33      0.50         3
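
Because `labels` holds a single tag, the three average rows repeat the CONTACT row. The sketch below shows why for the macro and weighted averages, using the F1 and support values printed above. The `per_label` dictionary is an illustrative structure, not something scikit-learn returns in this form.

```python
# Per-label metrics copied from the report above (a single label)
per_label = {"CONTACT": {"f1": 0.50, "support": 3}}

# Macro average: unweighted mean over labels
macro_f1 = sum(m["f1"] for m in per_label.values()) / len(per_label)

# Weighted average: mean over labels weighted by support
total_support = sum(m["support"] for m in per_label.values())
weighted_f1 = sum(m["f1"] * m["support"] for m in per_label.values()) / total_support

print(macro_f1, weighted_f1)  # 0.5 0.5
```

With more than one entry in `per_label`, the two averages would generally differ.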



### 🎈 **dln_parser**

In [None]:
df_aug = all_merged['dln_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**

### 🎈 **vin_parser**

In [None]:
df_aug = all_merged['vin_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8632


**This model did not detect any entities in the current dataset.**

### 🎈 **age_parser**

In [None]:
df_aug = all_merged['age_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8624
AGE,8


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Restrict the report to the AGE tag, the only tag this model predicts
labels = [l for l in df_aug['token_label'].unique() if l == 'AGE']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

         AGE       1.00      0.40      0.57        20

   micro avg       1.00      0.40      0.57        20
   macro avg       1.00      0.40      0.57        20
weighted avg       1.00      0.40      0.57        20



### 🎈 **country_matcher**

In [None]:
df_aug = all_merged['country_matcher']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8630
COUNTRY,2


In [None]:
# Map the model's COUNTRY tag onto the ground-truth LOCATION tag
df_aug['pred_label'] = df_aug['pred_label'].replace({'COUNTRY': 'LOCATION'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8630
LOCATION,2


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Restrict the report to the LOCATION tag, the only tag this model maps to
labels = [l for l in df_aug['token_label'].unique() if l == 'LOCATION']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

    LOCATION       0.50      0.01      0.02        81

   micro avg       0.50      0.01      0.02        81
   macro avg       0.50      0.01      0.02        81
weighted avg       0.50      0.01      0.02        81
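
A precision of 0.50 with two LOCATION predictions means that one of them disagrees with the ground truth. The sketch below shows how such false positives could be listed, with made-up rows standing in for `df_aug`. The tokens and the `NAME` tag are invented for illustration and are not taken from the benchmark; `false_positives` is a hypothetical helper.

```python
# Made-up rows with the same three columns the notebook compares
rows = [
    {"token": "Spain", "token_label": "LOCATION", "pred_label": "LOCATION"},
    {"token": "Jordan", "token_label": "NAME", "pred_label": "LOCATION"},
    {"token": "the", "token_label": "O", "pred_label": "O"},
]

def false_positives(rows, label):
    """Rows predicted as `label` whose ground-truth tag is something else."""
    return [r for r in rows if r["pred_label"] == label and r["token_label"] != label]

fp = false_positives(rows, "LOCATION")
print([r["token"] for r in fp])  # ['Jordan']
```

On the notebook's DataFrame, the equivalent filter would be a boolean mask such as `df_aug[(df_aug['pred_label'] == 'LOCATION') & (df_aug['token_label'] != 'LOCATION')]`.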



### 🎈 **date_matcher**

In [None]:
df_aug = all_merged['date_matcher']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8496
DATE,136


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Restrict the report to the DATE tag, the only tag this model predicts
labels = [l for l in df_aug['token_label'].unique() if l == 'DATE']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

        DATE       0.99      0.63      0.77       213

   micro avg       0.99      0.63      0.77       213
   macro avg       0.99      0.63      0.77       213
weighted avg       0.99      0.63      0.77       213



### 🎈 **date_of_birth_parser**

In [None]:
df_aug = all_merged['date_of_birth_parser']
df_aug

Unnamed: 0,doc_id,text,begin,end,token,token_label,pred_label
0,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,1,9,957770228,IDNUM,O
1,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,11,13,FIH,LOCATION,O
2,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,15,21,0408267,IDNUM,O
3,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,23,32,46769/5v7d,IDNUM,O
4,1,\n957770228\nFIH\n0408267\n46769/5v7d\n237890\n2/5/1994 12:00:00 AM\nTRACHEOESOPHAGEAL FISTULA .\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n2-5-94\nDISCHARGE DATE :\n4-2-94\nPRIN...,34,39,237890,IDNUM,O
...,...,...,...,...,...,...,...
8627,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7602,7604,M.D,O,O
8628,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7605,7605,.,O,O
8629,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7607,7607,[,O,O
8630,9,\n333145593\nFIH\n9229891\n50096/c65m\n571634\n12/5/1993 12:00:00 AM\nDischarge Summary\nUnsigned\nDIS\nReport Status :\nUnsigned\nADMISSION DATE :\n12/05/93\nDISCHARGE DATE :\n12/17/93\nCAUSE OF ...,7609,7618,report_end,O,O


In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8618
DOB,14


In [None]:
# Map the model's DOB tag onto the ground-truth DATE tag
df_aug['pred_label'] = df_aug['pred_label'].replace({'DOB': 'DATE'})

In [None]:
df_aug['pred_label'].value_counts()

pred_label,count
O,8618
DATE,14


In [None]:
from sklearn.metrics import classification_report

y_true = df_aug['token_label']
y_pred = df_aug['pred_label']

# Restrict the report to the DATE tag, the only tag this model maps to
labels = [l for l in df_aug['token_label'].unique() if l == 'DATE']

print(classification_report(y_true, y_pred, labels=labels))

              precision    recall  f1-score   support

        DATE       1.00      0.07      0.12       213

   micro avg       1.00      0.07      0.12       213
   macro avg       1.00      0.07      0.12       213
weighted avg       1.00      0.07      0.12       213
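
The tag replacements applied model by model in this section (ZIP, COUNTRY, PHONE, DOB) could also be kept in one mapping. This is a sketch under that assumption, using plain lists instead of DataFrame columns. `TAG_MAP` and `normalize_tags` are hypothetical names, and only the four pairs used in this section are included.

```python
# Fine-grained tags emitted by individual models, mapped to the ground-truth tag set
TAG_MAP = {
    "ZIP": "LOCATION",
    "COUNTRY": "LOCATION",
    "PHONE": "CONTACT",
    "DOB": "DATE",
}

def normalize_tags(pred_labels, tag_map=TAG_MAP):
    """Replace mapped tags and leave every other tag (including 'O') unchanged."""
    return [tag_map.get(label, label) for label in pred_labels]

sample = ["O", "ZIP", "DOB", "AGE", "PHONE", "COUNTRY"]
normalized = normalize_tags(sample)
print(normalized)
# ['O', 'LOCATION', 'DATE', 'AGE', 'CONTACT', 'LOCATION']
```

With pandas, passing the same dictionary to `Series.replace`, as the cells above do with single-entry dictionaries, should give the same result in one call.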

