# Document Processing with AutoML and Vision API

## Problem Statement
Formally, the brief for this Open Project can be stated as follows: given a collection of PDF/PNG documents in varying formats that contain similar information, create a pipeline that extracts the relevant entities from the documents and stores them in a standardized, easily accessible format.

The data for this project is contained in the Cloud Storage bucket [gs://document-processing/patent_dataset.zip](https://storage.googleapis.com/document-processing/patent_dataset.zip). The file [gs://document-processing/ground_truth.csv](https://storage.googleapis.com/document-processing/ground_truth.csv) contains hand-labeled fields extracted from the patents. 
The labels in the ground_truth.csv file are filename, category, publication_date, classification_1, classification_2, application_number, filing_date, priority, representative, applicant, inventor, titleFL, titleSL, abstractFL, and publication_number.
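
As a starting point, the label file can be read and checked against the column list above. The following is a minimal sketch using only the standard library; `EXPECTED_COLUMNS` and `load_ground_truth` are illustrative names that are not part of the project, and the function assumes the CSV has a header row.

```python
import csv

# Column names as listed in the problem statement.
EXPECTED_COLUMNS = [
    "filename", "category", "publication_date", "classification_1",
    "classification_2", "application_number", "filing_date", "priority",
    "representative", "applicant", "inventor", "titleFL", "titleSL",
    "abstractFL", "publication_number",
]


def load_ground_truth(lines):
    """Reads label rows from an open CSV file and fails early on missing columns."""
    reader = csv.DictReader(lines)
    found = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in found]
    if missing:
        raise ValueError("Missing columns: {}".format(missing))
    return list(reader)


# Usage, assuming ground_truth.csv was downloaded next to the notebook:
# with open("ground_truth.csv", newline="") as fp:
#     rows = load_ground_truth(fp)
```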

Here is an example of two different patent formats:

<table><tr>
<td> <img src="eu_patent.png" alt="Drawing" style="width: 600px;"/> </td>
<td> <img src="us_patent.png" alt="Drawing" style="width: 600px;"/> </td>
</tr></table>

### Flexible Solution
There are many possible ways to develop a solution to this task, which lets students touch on the various functionality and GCP tools that we discuss during the ASL, including the Vision API, AutoML Vision, BigQuery, TensorFlow, Cloud Composer, and Pub/Sub.

Students more interested in modeling with TensorFlow could build a classification model from scratch to recognize the various types of document formats at hand. Once the document format is known (e.g. US or EU patents as in the example above), relevant entities can be extracted using the Vision API and some basic regex extractors. It might also be possible to train a [conditional random field in TensorFlow](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf) to learn how to tag and extract relevant entities from text and the given labels, instead of writing regex-based entity extractors for each document class.

Students more interested in productionization could use Cloud Functions to automate the extraction pipeline, or incorporate Pub/Sub so that when a new document is uploaded to a specific GCS bucket, it is parsed and the entities are uploaded to a BigQuery table.
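
The upload-triggered variant could look roughly like the sketch below. It assumes a Cloud Storage background event, which is delivered as a dict with `bucket` and `name` keys. `make_gcs_handler`, `extract`, and `upload` are hypothetical names: `extract` stands in for the extraction pipeline (which would first need to fetch the object) and `upload` for the BigQuery step.

```python
def make_gcs_handler(extract, upload):
    """Builds a handler for a Cloud Storage upload event.

    `extract` maps an object URI to a record and `upload` stores that
    record. Both are placeholders supplied by the caller.
    """
    def handle(event, context=None):
        name = event["name"]
        # Skip objects that are not documents the pipeline can parse.
        if not name.lower().endswith((".png", ".pdf")):
            return None
        uri = "gs://{}/{}".format(event["bucket"], name)
        record = extract(uri)
        upload(record)
        return record

    return handle
```

Passing the two steps in as arguments keeps the handler independent of any particular client library, so it can be exercised locally with plain functions.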

Below is a solution outline that uses the Vision API and AutoML, uploading the extracted entities to a table in BigQuery. 

## Install AutoML package

**Caution:** Run the following command and **restart the kernel** afterwards.

In [1]:
!pip freeze | grep google-cloud-automl==0.1.2 || pip install --upgrade google-cloud-automl==0.1.2

google-cloud-automl==0.1.2


## Set the correct environment variables

The following variables should be updated according to your own environment:


In [2]:
PROJECT_ID = "asl-open-projects"
SERVICE_ACCOUNT = "entity-extractor"
ZONE = "us-central1"  # a region name; used as compute/region and as the AutoML location
AUTOML_MODEL_ID = "ICN6705037528556716784"

The following variables are computed from the ones you set above, and should
not be modified:

In [3]:
import os

PWD = os.path.abspath(os.path.curdir)

SERVICE_KEY_PATH = os.path.join(PWD, "{0}.json".format(SERVICE_ACCOUNT))
SERVICE_ACCOUNT_EMAIL="{0}@{1}.iam.gserviceaccount.com".format(SERVICE_ACCOUNT, PROJECT_ID)

# Exporting the variables into the environment to make them available to all the subsequent cells
os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["SERVICE_ACCOUNT"] = SERVICE_ACCOUNT
os.environ["SERVICE_KEY_PATH"] = SERVICE_KEY_PATH
os.environ["SERVICE_ACCOUNT_EMAIL"] = SERVICE_ACCOUNT_EMAIL
os.environ["ZONE"] = ZONE

## Switching to the right project and region

In [4]:
%%bash
gcloud config set project $PROJECT_ID
gcloud config set compute/region $ZONE

Updated property [core/project].
Updated property [compute/region].


## Create a service account

In [5]:
%%bash
gcloud iam service-accounts list | grep $SERVICE_ACCOUNT ||
gcloud iam service-accounts create $SERVICE_ACCOUNT

               entity-extractor@asl-open-projects.iam.gserviceaccount.com


## Grant service account project ownership 

TODO: We should ideally restrict the permissions to AutoML and Vision roles only
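
One possible way to address this TODO is to bind a few narrower roles instead of `roles/owner`. The sketch below only builds the `gcloud` argument lists. The role list in `NARROW_ROLES` is an assumption about what prediction and BigQuery uploads need and should be checked against the IAM documentation; `binding_commands` is an illustrative helper name.

```python
# Assumed to cover AutoML prediction and BigQuery writes and queries.
# This list is a guess for illustration, not a verified minimal set.
NARROW_ROLES = (
    "roles/automl.predictor",
    "roles/bigquery.dataEditor",
    "roles/bigquery.user",
)


def binding_commands(project_id, service_account_email, roles=NARROW_ROLES):
    """Returns one `gcloud projects add-iam-policy-binding` command per role."""
    member = "serviceAccount:{}".format(service_account_email)
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         "--member", member, "--role", role]
        for role in roles
    ]


# Usage (runs gcloud, so it is left commented out here):
# import subprocess
# for command in binding_commands(PROJECT_ID, SERVICE_ACCOUNT_EMAIL):
#     subprocess.run(command, check=True)
```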

In [6]:
%%bash
gcloud projects add-iam-policy-binding $PROJECT_ID \
 --member "serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
 --role "roles/owner"

bindings:
- members:
  - serviceAccount:automl-vision@asl-open-projects.iam.gserviceaccount.com
  - user:dherin@google.com
  - user:munn@google.com
  role: roles/automl.admin
- members:
  - serviceAccount:automl-vision@asl-open-projects.iam.gserviceaccount.com
  role: roles/automl.editor
- members:
  - serviceAccount:service-171999062104@gcp-sa-automl.iam.gserviceaccount.com
  role: roles/automl.serviceAgent
- members:
  - serviceAccount:custom-vision@appspot.gserviceaccount.com
  role: roles/ml.admin
- members:
  - serviceAccount:entity-extractor@asl-open-projects.iam.gserviceaccount.com
  - user:dherin@google.com
  - user:munn@google.com
  role: roles/owner
- members:
  - serviceAccount:custom-vision@appspot.gserviceaccount.com
  role: roles/serviceusage.serviceUsageAdmin
- members:
  - serviceAccount:service-171999062104@sourcerepo-service-accounts.iam.gserviceaccount.com
  role: roles/sourcerepo.serviceAgent
- members:
  - serviceAccount:custom-vision@appspot.gserviceaccount.com
  

## Create service account keys if not existing

In [7]:
%%bash
test -f $SERVICE_KEY_PATH || 
gcloud iam service-accounts keys create $SERVICE_KEY_PATH \
  --iam-account $SERVICE_ACCOUNT_EMAIL

echo "Service key: $(ls $SERVICE_KEY_PATH)"

Service key: /content/datalab/clones/document-processing/notebooks/entity-extractor.json


## Make the key available to google clients for authentication

In [8]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = SERVICE_KEY_PATH
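
Before calling any client, it can help to confirm that the key file really is a service account key. This is a small sketch under the assumption that the file follows the usual JSON key layout with `type` and `client_email` fields; `check_service_key` is an illustrative name.

```python
import json
import os


def check_service_key(path):
    """Returns the service account email stored in the key file at `path`."""
    if not os.path.isfile(path):
        raise FileNotFoundError("No key file at {}".format(path))
    with open(path) as fp:
        key = json.load(fp)
    if key.get("type") != "service_account":
        raise ValueError("{} is not a service account key".format(path))
    return key["client_email"]


# Usage in this notebook would be: check_service_key(SERVICE_KEY_PATH)
```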

## Implement a document classifier with AutoML

Here is a simple wrapper around an AutoML model that was already trained
from the Cloud Console on the various document types:

In [9]:
from google.cloud import automl_v1beta1 as automl

class DocumentClassifier:
  
  def __init__(self, project_id, model_id, zone):
    self.client = automl.PredictionServiceClient()
    self.model = self.client.model_path(project_id, zone, model_id)

  def __call__(self, filename):
    with open(filename, 'rb') as fp:
      image = fp.read()
    payload = {
      'image': {
        'image_bytes': image
      }
    }
    response = self.client.predict(self.model, payload)
    predicted_class = response.payload[0].display_name
    return predicted_class
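
The wrapper above takes the first payload entry and would fail on an empty response. A more defensive selection step might look like the sketch below. It assumes each payload entry exposes `display_name` and `classification.score`, as AutoML classification annotations do; `best_label` and `min_score` are illustrative names, and `SimpleNamespace` objects stand in for real annotations.

```python
from types import SimpleNamespace


def best_label(payload, min_score=0.5):
    """Returns the display name of the highest-scoring annotation, or None."""
    if not payload:
        return None
    best = max(payload, key=lambda item: item.classification.score)
    if best.classification.score < min_score:
        return None
    return best.display_name


# Stand-in annotations with made-up scores, for illustration only.
fake_payload = [
    SimpleNamespace(display_name="us", classification=SimpleNamespace(score=0.1)),
    SimpleNamespace(display_name="eu", classification=SimpleNamespace(score=0.9)),
]
print(best_label(fake_payload))  # eu
```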

Let's see how to use that `DocumentClassifier`:

In [10]:
classifier = DocumentClassifier(PROJECT_ID, AUTOML_MODEL_ID, ZONE)

eu_image_label = classifier("./eu_patent.png")
us_image_label = classifier("./us_patent.png")

print("EU patent inferred label:", eu_image_label)
print("US patent inferred label:", us_image_label)

EU patent inferred label: eu
US patent inferred label: us


## Implement a document parser with Vision API

Documentation:
* https://cloud.google.com/vision/docs/base64
* https://stackoverflow.com/questions/49918950/response-400-from-google-vision-api-ocr-with-a-base64-string-of-specified-image


Here is a simple class wrapping calls to the OCR capabilities of Cloud Vision:

In [12]:
!pip freeze | grep google-api-python-client==1.7.7 || pip install --upgrade google-api-python-client==1.7.7

google-api-python-client==1.7.7


In [13]:
import base64
from googleapiclient.discovery import build

class DocumentParser:
  def __init__(self):
    self.client = build('vision', 'v1')
    
  def __call__(self, filename):
    with open(filename, 'rb') as fp:
      image = fp.read()
    encoded_image = base64.b64encode(image).decode('UTF-8')
    payload = {
      'requests': [{
        'image': {
          'content': encoded_image
        },
        'features': [{
          'type': 'TEXT_DETECTION',
        }]
      }],
    }
    request = self.client.images().annotate(body=payload)
    response = request.execute(num_retries=3)
    return response['responses'][0]['textAnnotations'][0]['description']
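
The last line of `__call__` assumes that the response always contains text. A sketch of a more tolerant accessor follows; it assumes the REST response shape used above, where a failed request carries an `error` entry and an image without text has no `textAnnotations` key. `extract_full_text` is an illustrative name.

```python
def extract_full_text(response):
    """Returns the full OCR text from an annotate response, or "" if none."""
    result = response["responses"][0]
    if "error" in result:
        raise RuntimeError(result["error"].get("message", "Vision API error"))
    annotations = result.get("textAnnotations", [])
    # The first annotation holds the whole detected text block.
    return annotations[0]["description"] if annotations else ""
```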

Let's now see how to use our `DocumentParser`:

In [14]:
parser = DocumentParser()

eu_patent_text = parser("./eu_patent.png")
us_patent_text = parser("./us_patent.png")

print(eu_patent_text)

(19)
Europäisches
Patentamt
0
European
Patent Office
Office européen
des brevets
EP 3 399 652 A1
(12)
EUROPEAN PATENT APPLICATION
(43) Date of publication:
(51) Int Cl.:
07.11.2018 Bulletin 2018/45
H04B 117107 (2011.01)H04B 117073 (2011.01)
(21) Application number: 18180150.7
(22) Date of filing: 24.06.2009
(84) Designated Contracting States
. YEE, Nathan, D.
AT BE BG CH CY CZ DE DK EE ES FI FR GB GR
HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL
PT RO SE SI SK TR
San Diego, CA 92121-1714 (US)
SUBRAHMANYA, Parvathanathaln
San Diego, CA 92121-1714 (US)
(30) Priority: 25.06.2008 US 146232
(74) Representative: Wegner, Hans
Bardehle Pagenberg Partnerschaft mbB
Patentanwälte, Rechtsanwälte
Prinzregentenplatz 7
81675 München (DE)
(62) Document number(s) of the earlier application(s) in
accordance with Art. 76 EPC:
09770969.5/2311 196
(71) Applicant: QUALCOMM Incorporated
San Diego, CA 92121-1714 (US)
Remarks:
This application was filed on 27-06-2018 as a
divisional application to the applicati

## Implement the rule-based extractors for each document category

For each patent type, we now want to write a simple function that takes the 
text extracted by the OCR system above and extracts the number and date of the patent.

We will write two rule-based extractors, one for each type of patent (us or eu), each of
which will return a PatentInfo object collecting the extracted fields into a `namedtuple` instance.

In [42]:
from collections import namedtuple

PatentInfo = namedtuple('PatentInfo', ['filename', 'category', 'date', 'number'])

Here are two helper functions for text splitting and pattern matching:

In [43]:
!pip freeze | grep pandas==0.23.4 || pip install --upgrade pandas==0.23.4

pandas==0.23.4


In [44]:
import pandas as pd
import re


def split_text_into_lines(text, sep=r"\(..\)"):
  """Splits `text` on two-character field codes such as "(21)"."""
  lines = [line.strip() for line in re.split(sep, text)]
  return lines


def extract_pattern_from_lines(lines, pattern):
  """Extracts the first line from `lines` matching the regex `pattern`.

  Returns None if no line matches.
  """
  lines = pd.Series(lines)
  mask = lines.str.contains(pattern)
  return lines[mask].values[0] if mask.any() else None
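
To see what the two helpers do, here is a self-contained sketch on a few lines taken from the OCR output above. It uses only `re`; `first_match` is an illustrative stand-in for `extract_pattern_from_lines` without the pandas dependency.

```python
import re

sample = (
    "(21) Application number: 18180150.7\n"
    "(22) Date of filing: 24.06.2009\n"
    "(30) Priority: 25.06.2008 US 146232"
)

# Split on two-character field codes such as "(21)" and trim whitespace.
# The text before the first code is empty, so the first element is "".
fields = [part.strip() for part in re.split(r"\(..\)", sample)]


def first_match(lines, pattern):
    """Returns the first line where the regex `pattern` is found, else None."""
    return next((line for line in lines if re.search(pattern, line)), None)


filing = first_match(fields, "Date of filing:")
print(filing)  # Date of filing: 24.06.2009
```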

### European patent extractor

In [45]:
def extract_info_from_eu_patent(filename, text):
  lines = split_text_into_lines(text)

  category = "eu"
  
  number_paragraph = extract_pattern_from_lines(lines, "EP")
  number_lines = number_paragraph.split('\n')
  number = extract_pattern_from_lines(number_lines, 'EP')  
  
  date_paragraph = extract_pattern_from_lines(lines, 'Date of filing:')
  date = date_paragraph.replace("Date of filing:", "").strip()
  
  return PatentInfo(
    filename=filename,
    category=category,
    date=date,
    number=number
  )

In [46]:
eu_patent_info = extract_info_from_eu_patent("./eu_patent.png", eu_patent_text)
eu_patent_info

PatentInfo(filename='./eu_patent.png', category='eu', date='24.06.2009', number='EP 3 399 652 A1')

### US patent extractor

In [47]:
def extract_info_from_us_patent(filename, text):
  lines = split_text_into_lines(text)

  category = "us"
  
  number_paragraph = extract_pattern_from_lines(lines, "Patent No.:")
  number = number_paragraph.replace("Patent No.:", "").strip()
  
  date_paragraph = extract_pattern_from_lines(lines, "Date of Patent:")
  date = date_paragraph.split('\n')[-1]
  
  return PatentInfo(
    filename=filename,
    category=category,
    date=date,
    number=number
  )

In [48]:
us_patent_info = extract_info_from_us_patent("./us_patent.png", us_patent_text)
us_patent_info

PatentInfo(filename='./us_patent.png', category='us', date='Nov. 27, 2018', number='US 10,142,913 B2')
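
The two extractors return dates in different formats (`24.06.2009` and `Nov. 27, 2018`). If a common format is wanted before storing them, one option is to try a short list of known formats, as sketched below. `DATE_FORMATS` and `normalize_date` are illustrative names, and the format list is an assumption based only on the two examples above.

```python
from datetime import datetime

# Formats seen in the two sample patents, plus the unabbreviated-month case.
# "%b" is locale-dependent; this assumes English month abbreviations.
DATE_FORMATS = ("%d.%m.%Y", "%b. %d, %Y", "%b %d, %Y")


def normalize_date(raw):
    """Returns `raw` as an ISO 8601 date string, or None if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_date("24.06.2009"))     # 2009-06-24
print(normalize_date("Nov. 27, 2018"))  # 2018-11-27
```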

## Tie all together into a DocumentExtractor

In [49]:
class DocumentExtractor:
  def __init__(self, classifier, parser):
    self.classifier = classifier
    self.parser = parser

  def __call__(self, filename):

    text = self.parser(filename)
    label = self.classifier(filename)
    
    if label == 'eu':
      info = extract_info_from_eu_patent(filename, text)
    elif label == 'us':
      info = extract_info_from_us_patent(filename, text)
    else:
      raise ValueError("Unknown document category: {!r}".format(label))
        
    return info

In [50]:
extractor = DocumentExtractor(classifier, parser)

eu_patent_info = extractor("./eu_patent.png")
us_patent_info = extractor("./us_patent.png")

print(eu_patent_info)
print(us_patent_info)

PatentInfo(filename='./eu_patent.png', category='eu', date='24.06.2009', number='EP 3 399 652 A1')
PatentInfo(filename='./us_patent.png', category='us', date='Nov. 27, 2018', number='US 10,142,913 B2')
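
As more document categories are added, the `if`/`elif` chain could be replaced by a lookup table from label to extractor. This is a sketch of that design choice, not the notebook's implementation; `make_dispatcher` is a hypothetical name.

```python
def make_dispatcher(extractors):
    """Builds a function that routes (label, filename, text) to an extractor.

    `extractors` maps a category label to a function taking
    (filename, text), like the two extractors defined above.
    """
    def dispatch(label, filename, text):
        try:
            extract = extractors[label]
        except KeyError:
            raise ValueError("No extractor registered for label {!r}".format(label))
        return extract(filename, text)

    return dispatch


# Usage with the functions from this notebook:
# dispatch = make_dispatcher({
#     "eu": extract_info_from_eu_patent,
#     "us": extract_info_from_us_patent,
# })
```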


## Upload found entities to BigQuery
Start by adding a dataset called patents to the current project.

In [52]:
!pip freeze | grep google-cloud-bigquery==1.8.1 || pip install google-cloud-bigquery==1.8.1

google-cloud-bigquery==1.8.1


Check to see if the dataset called "patents" exists in the current project. If not, create it. 

In [72]:
from google.cloud import bigquery


client = bigquery.Client()

# Collect datasets and project information
datasets = list(client.list_datasets())
project = client.project

# Create a list of the datasets. If the 'patents' dataset 
# does not exist, then create it.
if datasets:
    all_datasets = []
    for dataset in datasets:
        all_datasets.append(dataset.dataset_id)
else:
    print('{} project does not contain any datasets.'.format(project))

if datasets and 'patents' in all_datasets:
    print('The dataset "patents" already exists in project {}.'.format(project))
else:
    dataset_id = 'patents'
    dataset_ref = client.dataset(dataset_id)

    # Construct a Dataset object.
    dataset = bigquery.Dataset(dataset_ref)

    # Specify the geographic location where the dataset should reside.
    dataset.location = "US"

    # Send the dataset to the API for creation.
    dataset = client.create_dataset(dataset)  # API request
    print('The dataset "patents" was created in project {}.'.format(project))

The dataset "patents" already exists in project asl-open-projects.


Upload the extracted entities to a table called "found_entities" in the "patents" dataset.

Start by creating an empty table in the patents dataset.

In [None]:
# Create an empty table in the patents dataset and define schema
dataset_ref = client.dataset('patents')

schema = [
    bigquery.SchemaField('filename', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('category', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('date', 'STRING', mode='NULLABLE'),
    bigquery.SchemaField('number', 'STRING', mode='NULLABLE'),
]
table_ref = dataset_ref.table('found_entities')
table = bigquery.Table(table_ref, schema=schema)
table = client.create_table(table)  # API request

assert table.table_id == 'found_entities'

In [74]:
def upload_to_bq(patent_info, dataset_id, table_id):
    """Appends the information extracted in patent_info into the 
    dataset_id:table_id in BigQuery.
    patent_info should be a namedtuple as created above and should
    have components matching the schema set up for the table
    """
    table_ref = client.dataset(dataset_id).table(table_id)
    table = client.get_table(table_ref)  # API request

    rows_to_insert = [tuple(patent_info._asdict().values())]

    errors = client.insert_rows(table, rows_to_insert)  # API request

    assert errors == []

In [75]:
upload_to_bq(eu_patent_info, 'patents', 'found_entities')
upload_to_bq(us_patent_info, 'patents', 'found_entities')
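
Because `insert_rows` appends, rerunning the cell above stores the same patents again. One possible guard is to collect the results and drop exact duplicates before a single insert call. The sketch below only removes duplicates within one batch, not against rows already in the table; `rows_without_duplicates` is an illustrative name.

```python
from collections import namedtuple


def rows_without_duplicates(infos):
    """Converts namedtuples to row tuples, keeping the first of each duplicate."""
    seen = set()
    rows = []
    for info in infos:
        row = tuple(info)
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return rows


# Usage with the objects from this notebook:
# rows = rows_without_duplicates([eu_patent_info, us_patent_info])
# errors = client.insert_rows(table, rows)
```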

### Examine the results in BigQuery
We can now query the BigQuery table to see what values have been uploaded.

In [76]:
%load_ext google.cloud.bigquery

In [77]:
%%bigquery
SELECT
    *
FROM `asl-open-projects.patents.found_entities`

Unnamed: 0,filename,category,date,number
0,./us_patent.png,us,"Nov. 27, 2018","US 10,142,913 B2"
1,./eu_patent.png,eu,24.06.2009,EP 3 399 652 A1
2,./eu_patent.png,eu,24.06.2009,EP 3 399 652 A1
3,./us_patent.png,us,"Nov. 27, 2018","US 10,142,913 B2"


We can also look at the resulting entities in a dataframe. 

In [78]:
dataset_id = 'patents'
table_id = 'found_entities'

sql = """
SELECT
    *
FROM
    `{}.{}.{}`
LIMIT 10
""".format(project, dataset_id, table_id)

df = client.query(sql).to_dataframe()
df.head()

Unnamed: 0,filename,category,date,number
0,./eu_patent.png,eu,24.06.2009,EP 3 399 652 A1
1,./us_patent.png,us,"Nov. 27, 2018","US 10,142,913 B2"
2,./us_patent.png,us,"Nov. 27, 2018","US 10,142,913 B2"
3,./eu_patent.png,eu,24.06.2009,EP 3 399 652 A1


## Pipeline Evaluation

TODO: We should include some section on how to evaluate the performance of the extractor. Here we can use the ground_truth table and explore different kinds of string metrics (e.g. Levenshtein distance) to measure accuracy of the entity extraction.
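
As a possible starting point for this TODO, here is a plain-Python sketch of the Levenshtein distance and a length-normalized similarity score derived from it. `levenshtein` and `similarity` are illustrative names, and the normalization by the longer string is one choice among several.

```python
def levenshtein(a, b):
    """Returns the minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(predicted, truth):
    """Returns a score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, truth) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

Each extracted field could then be scored against the matching column of the ground_truth table and averaged per field.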

## Clean up

The following cell removes the table "found_entities" from the "patents" dataset created above.

In [79]:
dataset_id = 'patents'
table_id = 'found_entities'

tables = list(client.list_tables(dataset_id))  # API request(s)

if tables:
    all_tables = [table.table_id for table in tables]
    print('These tables were found in the {} dataset: {}'.format(dataset_id,all_tables))
    if table_id in all_tables:
        table_ref = client.dataset(dataset_id).table(table_id)
        client.delete_table(table_ref)  # API request
        print('Table {} was deleted from dataset {}.'.format(table_id, dataset_id))
else:
    print('{} dataset does not contain any tables.'.format(dataset_id))

These tables were found in the patents dataset: ['found_entities', 'ground_truth']
Table found_entities was deleted from dataset patents.


### Remove the patents dataset (not recommended)

The next cell removes the "patents" dataset and all of its tables. This is not recommended, because the dataset also holds the 'ground_truth' table with the labeled entities for the files. The code is therefore wrapped in a string literal so that it does not run.

In [80]:
'''
client = bigquery.Client()
# Collect datasets and project information
datasets = list(client.list_datasets())
project = client.project

if datasets:
    all_datasets = []
    for dataset in datasets:
        all_datasets.append(dataset.dataset_id)
    if 'patents' in all_datasets:
        # Delete the dataset "patents" and its contents
        dataset_id = 'patents'
        dataset_ref = client.dataset(dataset_id)
        client.delete_dataset(dataset_ref, delete_contents=True)
        print('Dataset {} deleted from project {}.'.format(dataset_id, project))
    else: print('{} project does not contain the "patents" datasets.'.format(project))
else:
    print('{} project does not contain any datasets.'.format(project))
'''
