## Metrics on RVL-CDIP for the V3 DiT-based classifier
This sample is prepared to run in Databricks; however, it can be adapted to other environments.

#### Disable logging
This step is required in some Databricks environments to remove some undesired logging messages.

In [0]:
import logging
logging.getLogger("py4j").setLevel(logging.ERROR)

#### Mount the S3 bucket
We will use DBFS to mount an S3 folder containing the RvlCdip test dataset.


In [0]:
# Placeholder credentials; replace them with your own.
access_key = "AKIAXXXXXXXXXXXXXXXXXXXX"
secret_key = "XYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZ"
# Slashes in the secret key must be URL-encoded before it is embedded in the s3a URI.
encoded_secret_key = secret_key.replace("/", "%2F")
aws_bucket_name = "dev.johnsnowlabs.com"
mount_name = "s3_dev"

source = "s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name)
mount_point = "/mnt/%s" % mount_name

try:
  dbutils.fs.mount(source, mount_point)
except Exception:
  # The mount point is most likely already in use: unmount it and mount again.
  dbutils.fs.unmount(mount_point)
  dbutils.fs.mount(source, mount_point)
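
The `replace("/", "%2F")` call above only escapes slashes. As a minimal sketch, assuming the secret key may also contain other reserved characters such as `+`, the standard library's `urllib.parse.quote` can percent-encode the whole key. The helper name `build_s3a_source` and the credentials are hypothetical and not part of Databricks or Spark OCR.

```python
from urllib.parse import quote


def build_s3a_source(access_key, secret_key, bucket):
    """Build an s3a URI with the secret key fully percent-encoded.

    Hypothetical helper for illustration; not part of any library.
    """
    # safe="" makes quote() escape "/" as well, which it leaves alone by default.
    encoded_secret_key = quote(secret_key, safe="")
    return "s3a://%s:%s@%s" % (access_key, encoded_secret_key, bucket)


# Made-up credentials, used only to show the encoding.
source = build_s3a_source("AKIAEXAMPLE", "abc/def+ghi", "my-bucket")
print(source)
```

The resulting string could then be passed to `dbutils.fs.mount` as the source argument.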

Let's take a look at the contents of the dataset folder. Each document image is placed in a folder named after the class the document belongs to.<br>
The RvlCdipReader.readTestDataset() function reads this directory structure and creates an 'act_label' column in a PySpark DataFrame.<br>
More on this later.

In [0]:
!ls /dbfs/mnt/s3_dev/ocr/datasets/RVL-CDIP/test
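
To make the folder-to-label convention concrete, here is a minimal sketch of how a label could be derived from a file path. It is an illustration only, not the `RvlCdipReader` implementation; the helper `label_from_path` and the example file names are hypothetical.

```python
import posixpath


def label_from_path(path):
    """Return the name of the folder that directly contains the file."""
    return posixpath.basename(posixpath.dirname(path))


# Hypothetical file locations following the <class name>/<file> layout.
paths = [
    "dbfs:/mnt/s3_dev/ocr/datasets/RVL-CDIP/test/letter/0001.tif",
    "dbfs:/mnt/s3_dev/ocr/datasets/RVL-CDIP/test/scientific_report/0002.tif",
]
labels = [label_from_path(p) for p in paths]
print(labels)
```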

#### Set up license
If you are in Databricks, and haven't used any other mechanism for setting the license, paste your license key in this cell and update the license file accordingly.

In [0]:
%%bash
# Remove any previous license file (-f: no error if it does not exist), then write the new key.
rm -f /dbfs/FileStore/johnsnowlabs/license.key
echo "XYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZXYZ" > /dbfs/FileStore/johnsnowlabs/license.key

#### Read the dataset
Let's first take a look at the documentation:

In [1]:
from sparkocr.transformers.readers.rvlcdip_reader import RvlCdipReader
help(RvlCdipReader.readTestDataset)

Help on function readTestDataset in module sparkocr.transformers.readers.rvlcdip_reader:

readTestDataset(self, spark, path, partitions=8, storage_level=StorageLevel(True, False, False, False, 1))
    Reads the RVL-CDP train dataset from an external resource.
    
    Parameters
    ----------
    spark : :class:`pyspark.sql.SparkSession`
        Initiated Spark Session with Spark NLP
    path : str
        the path where you unzip the files for RvlCdip Test
    partitions : sets the minimum number of partitions for the case of lifting multiple files in parallel into a single dataframe. Defaults to 8.
    storage_level : sets the persistence level according to PySpark definitions. Defaults to StorageLevel.DISK_ONLY.
    
    
    Returns
    -------
    :class:`pyspark.sql.DataFrame`
        Spark Dataframe with the data



In [0]:
datasetPath = "dbfs:/mnt/s3_dev/ocr/datasets/RVL-CDIP/test/"
df = RvlCdipReader().readTestDataset(spark, datasetPath)

Let's unpack the output. We have the path, which is the individual file location on disk; the content, which is the binary data of the file; and the act_label, which is the label coming from the dataset.<br>
Let's check how many images we have:

In [0]:
total_images = df.select("path").count()
total_images

Let's take a look at the dataset, its partitioning, and some specific records to make sure everything is OK.

In [0]:
r = df.rdd
r.getNumPartitions()

In [None]:
df.select("path", "act_label").show(truncate=False)

### Define the pipeline
Let's define our pipeline using two transformers: BinaryToImage and VisualDocumentClassifierV3.<br>
The first one transforms the binary content into an image structure, decoding things like the number of channels, resolution, etc. The second one is the model per se. Note that the predicted label will go in the 'label' column, which is different from the dataset labels in the 'act_label' column.

In [0]:
from sparkocr.transformers import *
from sparkocr.enums import *
from pyspark.ml import PipelineModel

binary_to_image = BinaryToImage()\
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

doc_class = VisualDocumentClassifierV3() \
    .pretrained("dit_base_finetuned_rvlcdip_opt", "en", "clinical/ocr") \
    .setInputCols(["image"]) \
    .setOutputCol("label")

# Classification pipeline: decode the binary content into an image, then classify it
pipeline = PipelineModel(stages=[
    binary_to_image,
    doc_class
])

In [0]:
predicted = pipeline.transform(df)

At this point, 'predicted' holds only the 'recipe' for classifying our dataset; no computation has happened so far, and nothing will happen until an action is triggered on the DataFrame.<br>
We will use the write.parquet() action to store everything to disk, in case we want to consume the results multiple times or use them for something else later.<br>
Let's go!

In [0]:
predicted.select("label", "act_label", "exception").write.parquet("predictions_pipeline.parquet")

#### Compute Metrics
Here we will compute accuracy; you can try other metrics as well.

In [0]:
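from tqdm import tqdm

predictions = spark.read.parquet('dbfs:/predictions_pipeline.parquet')
it = predictions.toLocalIterator()

# total_images was computed earlier from the input DataFrame.
correct = 0
empty = 0
# Optionally wrap the iterator to show a progress bar:
# for row in tqdm(it, total=total_images):
for row in it:
  gold = row['act_label']
  label = row['label']
  # Count rows for which the model produced no label.
  if label == '':
    empty += 1
  # Strip underscores from the dataset label and spaces from the predicted label before comparing.
  if gold.replace("_", "") == label.replace(" ", ""):
    correct += 1

# Accuracy over the whole test set.
float(correct)/total_images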
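
Accuracy is a single number for the whole test set. As a sketch of one of the other metrics you could try, the snippet below computes per-class precision, recall and F1 from (gold, predicted) label pairs in plain Python. The function names `normalize` and `per_class_metrics` and the sample pairs are hypothetical; the normalization mirrors the comparison used above, with underscores and spaces removed.

```python
from collections import Counter


def normalize(label):
    """Drop underscores and spaces so dataset and model labels are comparable."""
    return label.replace("_", "").replace(" ", "")


def per_class_metrics(pairs):
    """Compute precision, recall and F1 per class from (gold, predicted) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in pairs:
        gold, pred = normalize(gold), normalize(pred)
        if gold == pred:
            tp[gold] += 1
        else:
            fn[gold] += 1
            if pred:  # an empty prediction is a miss, not a false positive
                fp[pred] += 1
    metrics = {}
    for cls in sorted(set(tp) | set(fp) | set(fn)):
        precision = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
        recall = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up sample; in the notebook the pairs would come from the
# 'act_label' and 'label' columns of the stored predictions.
sample = [
    ("scientific_report", "scientific report"),
    ("letter", "letter"),
    ("letter", "memo"),
    ("memo", ""),
]
for cls, m in per_class_metrics(sample).items():
    print(cls, m)
```

With the real data, the pairs could be collected in the same loop that counts correct predictions, so the parquet file is only read once.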