![SparkPdf](https://stabrise.com/media/filer_public_thumbnails/filer_public/de/31/de3156f0-386d-4b3b-ac7e-8856a38f7c1e/sparkpdflogo.png__808x214_subsampling-2.webp)

<p align="center">
    <a target="_blank" href="https://colab.research.google.com/github/StabRise/spark-pdf-tutorials/blob/master/1.Ner.ipynb">
        <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
    </a>
    <a href="https://pypi.org/project/pyspark-pdf/" alt="Package on PyPI"><img src="https://img.shields.io/pypi/v/pyspark-pdf.svg" /></a>    
    <a href="https://github.com/stabrise/spark-pdf/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/stabrise/spark-pdf.svg?color=blue"></a>
    <a href="https://stabrise.com"><img alt="StabRise" src="https://img.shields.io/badge/powered%20by-StabRise-orange.svg?style=flat&colorA=E1523D&colorB=007D8A"></a>
</p>

# Named Entity Recognition (NER) with Spark Pdf

Spark Pdf can run NER models through the Hugging Face Transformers library, on both plain-text and PDF documents.
Any NER model from the Hugging Face model hub can be used, and the recognized entities can be visualized.

## Installation

In [None]:
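%%bash
# Run only on Google Colab: exit early when COLAB_RELEASE_TAG is unset.
[[ ! "${COLAB_RELEASE_TAG}" ]] && exit
# Install the Tesseract OCR engine without an interactive prompt.
sudo apt-get install -y tesseract-ocr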

In [None]:
!pip install "pyspark-pdf[ml]"

## Start Spark Session with Spark Pdf

In [None]:
from sparkpdf import *

spark = SparkPdfSession()
spark

## Read Text

In [2]:
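# Load the whole file as a single row instead of one row per line.
df = spark.read.text("./data/texts/example.txt", wholetext=True)
# Show the first row without truncating the text column.
df.show(1, truncate=False)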

## Show text

In [3]:
df.show_text()


## Run NER on a text document

In [4]:
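# Wrap the text in a document and tag entities with a Hugging Face NER model.
pipeline = PipelineModel(stages=[
    TextToDocument(),
    Ner(model="dslim/bert-base-NER")
])
# Cache the result so later cells reuse the computed entities.
result = pipeline.transform(df).cache()

# Print the recognized entities from the "ner" column as a table.
result.show_ner("ner")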

+------------+------------------+-------------------+-----+----+-----+
|entity_group|             score|               word|start| end|boxes|
+------------+------------------+-------------------+-----+----+-----+
|         ORG|0.9993033409118652|             OpenAI|   32|  38|   []|
|         LOC|0.9992387294769287|      San Francisco|  115| 128|   []|
|         PER|0.9997133016586304|         Sam Altman|  206| 216|   []|
|         ORG|0.9987567663192749|             OpenAI|  229| 235|   []|
|        MISC|0.9966957569122314|                 AI|  275| 277|   []|
|         ORG|0.9987478256225586|Stanford University|  371| 390|   []|
|         ORG|0.9986779093742371|                MIT|  395| 398|   []|
|         ORG|0.9991996884346008|          Microsoft|  412| 421|   []|
|         ORG| 0.998961329460144|             OpenAI|  488| 494|   []|
|         ORG|0.9134462475776672|             OpenAI|  488| 494|   []|
|         ORG|0.9980237483978271|             Amazon|  522| 528|   []|
|     
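
The table above lists the same `OpenAI` span (488 to 494) twice with different scores. The following is a minimal sketch of how such rows could be post-processed in plain Python once collected from the result: it keeps the highest-scoring row per span and drops rows below a score threshold. The `Entity` tuple, the `dedupe_and_filter` helper, and the threshold values are illustrative assumptions and not part of Spark Pdf; the rows are copied from the output above with scores rounded to four decimals.

```python
from typing import Iterable, List, NamedTuple


class Entity(NamedTuple):
    """One row of the NER table (hypothetical container, not a Spark Pdf type)."""
    entity_group: str
    score: float
    word: str
    start: int
    end: int


# Rows copied from the show_ner output above, scores rounded to 4 decimals.
ROWS = [
    Entity("ORG", 0.9993, "OpenAI", 32, 38),
    Entity("LOC", 0.9992, "San Francisco", 115, 128),
    Entity("PER", 0.9997, "Sam Altman", 206, 216),
    Entity("ORG", 0.9988, "OpenAI", 229, 235),
    Entity("MISC", 0.9967, "AI", 275, 277),
    Entity("ORG", 0.9987, "Stanford University", 371, 390),
    Entity("ORG", 0.9987, "MIT", 395, 398),
    Entity("ORG", 0.9992, "Microsoft", 412, 421),
    Entity("ORG", 0.9990, "OpenAI", 488, 494),
    Entity("ORG", 0.9134, "OpenAI", 488, 494),
    Entity("ORG", 0.9980, "Amazon", 522, 528),
]


def dedupe_and_filter(rows: Iterable[Entity], min_score: float = 0.5) -> List[Entity]:
    """Keep the best-scoring row per (group, start, end) at or above min_score."""
    best = {}
    for row in rows:
        if row.score < min_score:
            continue
        key = (row.entity_group, row.start, row.end)
        if key not in best or row.score > best[key].score:
            best[key] = row
    # Return the survivors in document order.
    return sorted(best.values(), key=lambda r: r.start)


clean = dedupe_and_filter(ROWS)
for entity in clean:
    print(entity.entity_group, entity.word, entity.start, entity.end)
```

A higher threshold is likely more useful for OCR output, where recognition errors tend to lower the model's confidence.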

                                                                                

## Visualize NER

In [5]:
result.visualize_ner("ner")

# Run NER on a PDF document

## Read a PDF file

In [6]:
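# Read the PDF as raw bytes with Spark's built-in binaryFile data source.
pdf_df = spark.read.format("binaryFile").load("./data/pdfs/example.pdf")

# Preview the PDF in the notebook.
pdf_df.show_pdf()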

## Run the NER pipeline on the PDF

In [7]:
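pipeline = PipelineModel(stages=[
    PdfDataToImage(),                  # render the PDF pages as images
    TesseractOcr(keepInputData=True),  # OCR the images into text
    Ner(model="dslim/bert-base-NER"),  # tag entities in the OCR text
    # Draw entity boxes on the page image, labeled with group and score.
    ImageDrawBoxes(inputCols=["image", "ner"], displayDataList=["entity_group", "score"],
                   textSize=30, lineWidth=3)
])

# Cache the result so later cells reuse the OCR and NER output.
result = pipeline.transform(pdf_df).cache()

result.show_ner("ner")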

+------------+------------------+-------------------+-----+----+--------------------+
|entity_group|             score|               word|start| end|               boxes|
+------------+------------------+-------------------+-----+----+--------------------+
|         ORG|0.9991413354873657|             OpenAl|   32|  38|[{OpenAl, 0.90458...|
|         LOC|0.9992393255233765|      San Francisco|  115| 128|[{San, 0.95797309...|
|         PER|0.9997143745422363|         Sam Altman|  206| 216|[{Sam, 0.96336723...|
|         ORG|0.9987333416938782|             OpenAl|  229| 235|[{OpenAl,, 0.7108...|
|        MISC|0.4643312692642212|                 Al|  275| 277|[{Al, 0.96258057,...|
|         ORG|0.9987882375717163|Stanford University|  362| 381|[{Stanford, 0.961...|
|         ORG|0.9988399147987366|                MIT|  386| 389|[{MIT., 0.9693145...|
|         ORG|0.9991363883018494|          Microsoft|  402| 411|[{Microsoft, 0.96...|
|         ORG|0.6873346567153931|               This| 
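
Unlike the text-only run, the `boxes` column here is populated with OCR words for each entity. As a rough illustration of the idea, the sketch below shows one way character spans could be matched to OCR word boxes: select the words whose character range overlaps the entity span, then merge their rectangles. This is not Spark Pdf's actual implementation. The `OcrWord` fields, the helper names, and every coordinate are hypothetical; only the `Sam Altman` span (206 to 216) is taken from the table above.

```python
from typing import List, NamedTuple, Tuple


class OcrWord(NamedTuple):
    """A recognized word with its character range and pixel box (hypothetical layout)."""
    text: str
    start: int   # character offset of the word in the OCR text
    end: int     # exclusive end offset
    x: int
    y: int
    width: int
    height: int


def words_for_span(words: List[OcrWord], start: int, end: int) -> List[OcrWord]:
    """Return the words whose character range overlaps [start, end)."""
    return [w for w in words if w.start < end and w.end > start]


def union_box(words: List[OcrWord]) -> Tuple[int, int, int, int]:
    """Merge word boxes into one (x, y, width, height) rectangle."""
    x0 = min(w.x for w in words)
    y0 = min(w.y for w in words)
    x1 = max(w.x + w.width for w in words)
    y1 = max(w.y + w.height for w in words)
    return (x0, y0, x1 - x0, y1 - y0)


# Hypothetical OCR words; the pixel coordinates are made up for illustration.
WORDS = [
    OcrWord("Sam", 206, 209, 100, 400, 60, 30),
    OcrWord("Altman", 210, 216, 170, 400, 120, 30),
    OcrWord("leads", 217, 222, 300, 400, 90, 30),
]

# Entity span for "Sam Altman" as reported in the table above.
matched = words_for_span(WORDS, 206, 216)
print([w.text for w in matched])
print(union_box(matched))
```

Keeping one box per word, as the `boxes` column appears to do, handles entities that wrap across lines better than a single merged rectangle.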

                                                                                

## Visualize NER

In [8]:
result.visualize_ner("ner")

## Visualize NER results on the original image

In [9]:
result.show_image("image_with_boxes")