<img src="../public/colorlogo.png" width="50%"/>


[Homepage](https://www.datafog.ai) | 
[Discord](https://discord.gg/bzDth394R4) | 
[GitHub](https://github.com/datafog/datafog-python) | 
[Contact](mailto:sid@datafog.ai)



### Setup

In [None]:
!pip install --upgrade datafog --quiet

Pytesseract (OCR) dependencies:

In [None]:
# -y answers the confirmation prompt, which a notebook cell cannot do interactively
!apt install -y tesseract-ocr
!apt install -y libtesseract-dev

In [None]:
!pip install nest_asyncio

In [5]:
import asyncio
import json
import os

import nest_asyncio
import pandas as pd
from IPython.display import Markdown, display
from datafog import DataFog, OCRPIIAnnotator, TextPIIAnnotator

# Allow run_until_complete() inside the notebook's already-running event loop.
nest_asyncio.apply()

### OCR Examples

In [7]:
image_set = {
  "medical_invoice": "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png",
  "sales_receipt": "https://templates.invoicehome.com/sales-receipt-template-us-classic-white-750px.png",
  "press_release": "https://newsroom.cisco.com/c/dam/r/newsroom/en/us/assets/a/y2023/m09/cisco_splunk_1200x675_v3.png",
  "insurance_claim_scanned_form": "https://www.pdffiller.com/preview/101/35/101035394.png",
  "scanned_internal_record": "https://www.pdffiller.com/preview/435/972/435972694.png",
  "executive_email": "https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg"
}
url_list = list(image_set.values())
print(url_list)

['https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png', 'https://templates.invoicehome.com/sales-receipt-template-us-classic-white-750px.png', 'https://newsroom.cisco.com/c/dam/r/newsroom/en/us/assets/a/y2023/m09/cisco_splunk_1200x675_v3.png', 'https://www.pdffiller.com/preview/101/35/101035394.png', 'https://www.pdffiller.com/preview/435/972/435972694.png', 'https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg']


#### Use Case: Extract text from images

In [8]:
datafog = DataFog(operations='extract_text')

async def run_ocr_pipeline_demo():
  results = await datafog.run_ocr_pipeline(url_list)
  print("OCR Pipeline Results:", results)

In [9]:
loop = asyncio.get_event_loop()
loop.run_until_complete(run_ocr_pipeline_demo())

OCR Pipeline Results: ["MEDICAL BILLING INVOICE\n\nPATIENT INFORMATION\n\nKemba Harris\n(855) 595-5999\n\n11 Rosewood Drive,\nCollingwood, NY 33580\n\nPERSCRIBING PHYSICIAN'S INFORMATION\n\nDr. Alanah Gomez\n(855) 505-5000\n\n102 Trope Street,\nNew York, NY 45568\n\nDATE\n\n07/01/23\n\nINVOICE NUMBER\n\n12245\n\nINVOICE DUE DATE\n\n07/30/23\n\nAmount DUE\n\n$1,745.00\n\nITEM DESCRIPTION AMOUNT\nFull Check Up Full body check up $745.00\n1,000.00\nEar & Throat Examination Infection check due to inflammation sy\nNOTES SUBTOTAL $745.00\nA prescription has been written out for patient, TAXRATE 9%\nfor an acute throat infection.\nTAX $157.05\nTOTAL $1,902.05\n\nConcordia Hill Hospital\n\nwww.concordiahill.com\n\nFor more information or any issues or concerns,\nemail us at invoices@concordiahill.com\n\n", 'East Repair Inc.\n\n1912 Harvest Lane\nNew York, NY 12210\n\nSold To\nJohn Smith,\n\n2 Court Square\n\nNew York, NY 12210\n\nShip To\nJohn Smith\n\n3787 Pineview Drive\nCambridge, MA 12210\
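
The pipeline returns a plain list of strings, so it is not obvious which text came from which image. A minimal sketch of one way to label the results follows. It assumes the results come back in the same order as `url_list`, which the output above suggests but does not confirm. The helper `label_ocr_results` is hypothetical (not part of DataFog), and the texts are shortened stand-ins for real OCR output.

```python
# Labels taken from image_set above; the texts are shortened stand-ins for real OCR output.
image_labels = ["medical_invoice", "sales_receipt"]
ocr_texts = [
    "MEDICAL BILLING INVOICE\n\nPATIENT INFORMATION\n",
    "East Repair Inc.\n\n1912 Harvest Lane\n",
]


def label_ocr_results(labels, texts):
    """Pair each label with its OCR text, assuming both lists share the same order."""
    if len(labels) != len(texts):
        raise ValueError("expected exactly one OCR result per image")
    return dict(zip(labels, texts))


labeled = label_ocr_results(image_labels, ocr_texts)
for label, text in labeled.items():
    # Show only the first non-empty line of each document as a quick check.
    first_line = text.strip().splitlines()[0]
    print(f"{label}: {first_line}")
```

With the real data, `list(image_set.keys())` and the list returned by `run_ocr_pipeline` would take the place of the two sample lists.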

#### Use Case: Extract text from images -> annotate for PII

In [10]:
datafog = DataFog()  # the default operation is 'annotate_pii'

async def run_ocr_pipeline_demo():
  results = await datafog.run_ocr_pipeline(url_list)
  print("OCR Pipeline Results:", results)

In [11]:
loop = asyncio.get_event_loop()
loop.run_until_complete(run_ocr_pipeline_demo())

OCR Pipeline Results: {"MEDICAL BILLING INVOICE\n\nPATIENT INFORMATION\n\nKemba Harris\n(855) 595-5999\n\n11 Rosewood Drive,\nCollingwood, NY 33580\n\nPERSCRIBING PHYSICIAN'S INFORMATION\n\nDr. Alanah Gomez\n(855) 505-5000\n\n102 Trope Street,\nNew York, NY 45568\n\nDATE\n\n07/01/23\n\nINVOICE NUMBER\n\n12245\n\nINVOICE DUE DATE\n\n07/30/23\n\nAmount DUE\n\n$1,745.00\n\nITEM DESCRIPTION AMOUNT\nFull Check Up Full body check up $745.00\n1,000.00\nEar & Throat Examination Infection check due to inflammation sy\nNOTES SUBTOTAL $745.00\nA prescription has been written out for patient, TAXRATE 9%\nfor an acute throat infection.\nTAX $157.05\nTOTAL $1,902.05\n\nConcordia Hill Hospital\n\nwww.concordiahill.com\n\nFor more information or any issues or concerns,\nemail us at invoices@concordiahill.com\n\n": {'DATE_TIME': [], 'LOC': ['New York'], 'NRP': [], 'ORG': ['MEDICAL BILLING', 'Concordia Hill Hospital'], 'PER': ['Kemba Harris\n', 'Rosewood Drive', 'Gomez\n']}, 'East Repair Inc.\n\n1912 Ha
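
With annotation enabled, the result is a dictionary keyed by the extracted text, where each value maps an entity type to the list of matched strings. That nesting is awkward to scan, so the sketch below flattens it into one row per match. The sample dictionary copies the entities shown above with the key shortened, and `flatten_annotations` is a hypothetical helper rather than a DataFog function.

```python
# Sample shaped like the output above: text -> {entity type: [matched strings]}.
sample_results = {
    "MEDICAL BILLING INVOICE (text shortened for this example)": {
        "DATE_TIME": [],
        "LOC": ["New York"],
        "NRP": [],
        "ORG": ["MEDICAL BILLING", "Concordia Hill Hospital"],
        "PER": ["Kemba Harris\n", "Rosewood Drive", "Gomez\n"],
    }
}


def flatten_annotations(results):
    """Return one (document index, entity type, entity) tuple per match."""
    rows = []
    for doc_index, entities in enumerate(results.values()):
        for entity_type, matches in entities.items():
            for match in matches:
                # OCR text often carries trailing newlines; strip them for display.
                rows.append((doc_index, entity_type, match.strip()))
    return rows


rows = flatten_annotations(sample_results)
for row in rows:
    print(row)
```

Listing the matches this way also makes misclassifications easier to spot, such as "Rosewood Drive" being tagged as `PER` in the output above.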

### Text Annotation Examples

#### Use Case: Annotate a folder of text files for PII

In [12]:
!git clone https://gist.github.com/b43b72693226422bac5f083c941ecfdb.git

Cloning into 'b43b72693226422bac5f083c941ecfdb'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 7 (delta 4), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (7/7), done.
Resolving deltas: 100% (4/4), done.


In [None]:
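# Directory containing the clinical notes. The `git clone` above creates a
# folder named after the gist ID, so point this at that folder (or rename it)
# if 'clinical_notes/' does not exist.
folder_path = 'clinical_notes/'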

# List all files in the directory
file_list = os.listdir(folder_path)
text_files = sorted([file for file in file_list if file.endswith('.txt')])

with open(os.path.join(folder_path, text_files[0]), 'r') as file:
    clinical_note = file.read()

display(Markdown(clinical_note))
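
The cell above reads only the first note, although the use case is a whole folder. A minimal sketch of loading every `.txt` file follows. `load_text_files` is a hypothetical helper, and the demo writes made-up notes to a temporary folder so the sketch runs without the cloned data.

```python
import os
import tempfile


def load_text_files(folder_path):
    """Return the contents of every .txt file in folder_path, sorted by file name."""
    file_names = sorted(name for name in os.listdir(folder_path) if name.endswith(".txt"))
    loaded = []
    for name in file_names:
        with open(os.path.join(folder_path, name), "r") as handle:
            loaded.append(handle.read())
    return loaded


# Demo on a throwaway folder with made-up notes; the .md file should be skipped.
with tempfile.TemporaryDirectory() as demo_folder:
    demo_files = [("note_b.txt", "second note"), ("note_a.txt", "first note"), ("skip.md", "ignored")]
    for name, body in demo_files:
        with open(os.path.join(demo_folder, name), "w") as handle:
            handle.write(body)
    all_notes = load_text_files(demo_folder)

print(all_notes)
```

Calling `load_text_files(folder_path)` on the real folder would give a list that could be passed to `run_text_pipeline` in place of a single-note list.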

In [14]:
async def run_text_pipeline_demo(texts):
  # Annotate each text for PII with the DataFog instance created above.
  results = await datafog.run_text_pipeline(texts)
  print("Text Pipeline Results:", results)
  return results


texts = [clinical_note]
loop = asyncio.get_event_loop()
results = loop.run_until_complete(run_text_pipeline_demo(texts))

Text Pipeline Results: {'\n**Date:** April 10, 2024\n\n**Patient:** Emily Johnson, 35 years old\n\n**MRN:** 00987654\n\n**Chief Complaint:** "I\'ve been experiencing severe back pain and numbness in my legs."\n\n**History of Present Illness:** The patient is a 35-year-old who presents with a 2-month history of worsening back pain, numbness in both legs, and occasional tingling sensations. The patient reports working as a freelance writer and has been experiencing increased stress due to tight deadlines and financial struggles.\n\n**Past Medical History:** Hypothyroidism\n\n**Social History:**\nThe patient shares a small apartment with two roommates and relies on public transportation. They mention feeling overwhelmed with work and personal responsibilities, often sacrificing sleep to meet deadlines. The patient expresses concern over the high cost of healthcare and the need for affordable medication options.\n\n**Review of Systems:** Denies fever, chest pain, or shortness of breath. Re