# *Ortec Finance Invoice Data Extractor*
---
## **Architecting the present model**
Several fundamental aspects have been changed from the baseline provided to us. Firstly, the invoice generator was modified to generate both the training and validation datasets, and extended to label additional targets, e.g. item quantity, item names and terms & conditions. Secondly, two models are created, each predicting a different set of targets. Their hyperparameters were tuned (see readme.txt for specifics) to optimize training and validation accuracy. Once the predictive models had been constructed, their output could be used to search the web to confirm the existence of the company. This is done by creating a sorted list of predictions and extracting the sender name and KVK number targets. These model outputs feed the Google search step: the KVK number, concatenated with 'nl' to restrict the search to locally relevant results, is the search query, and the sender name is compared against the returned result titles.

## **Constraints encountered through our journey**
During development of the present solution to the "automatic invoice reader" problem we encountered a few constraints, which are elaborated on here. Firstly, a major bottleneck is the limited amount of training data. The present model is fed data on 5 separate entities and 28 possible items, and the invoices are formatted using only 2 templates. Thus, our model will not generalize well to observations outside the current sample, since this kind of model works best with a large and diverse dataset. These constraints ultimately lead to an unavoidable bias (*overfitting*) towards the items, companies and especially the templates provided. Besides this bias, the lack of data also limits the amount of information that can be learned: after a few epochs we observed that the model could not be trained further. Lastly, a KVK API key was not granted to us by the government agency, so we were unable to use this required channel for confirming an entity's registration.
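
To make the data limitation concrete, the variety the generator can produce can be counted. The figures below are a back-of-the-envelope sketch that follows the sampling in `create_invoice` (two distinct companies, four distinct items) and ignores the random prices, dates and references; the variable names are illustrative only.

```python
from math import comb

N_COMPANIES = 5   # entities defined in the generator
N_ITEMS = 28      # possible item descriptions
N_TEMPLATES = 2   # invoice layouts

# Ordered (sender, recipient) pairs drawn without replacement
sender_recipient_pairs = N_COMPANIES * (N_COMPANIES - 1)

# Unordered sets of 4 distinct items per invoice
item_sets = comb(N_ITEMS, 4)

# Distinct (sender name, layout) contexts the model ever sees
name_contexts = N_COMPANIES * N_TEMPLATES

print(sender_recipient_pairs)  # 20
print(item_sets)               # 20475
print(name_contexts)           # 10
```

Item combinations are plentiful, but each sender name only ever appears in 10 distinct (name, layout) contexts, which is consistent with the overfitting towards companies and templates described above.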


## **The present model's results**
In architecting an acceptable solution, we attempted to account for the constraints listed above. This is done by creating two models: one that reaches 100% validation accuracy in predicting simple targets, e.g. company name, and one that reaches 99.58% validation accuracy in predicting complex targets, e.g. item descriptions. In addition, the list of Google search results is filtered to return the result most similar to the company name output by the predictive model.

##### When life gives you lemons, you make lemonade....




In [4]:
#@title Default title text
# Setup

from google.colab import files
uploaded = files.upload()

Saving templates.zip to templates.zip


In [5]:
!unzip templates.zip

Archive:  templates.zip
   creating: templates/
  inflating: templates/.DS_Store     
   creating: templates/.ipynb_checkpoints/
  inflating: templates/.ipynb_checkpoints/Baseline Ortec-checkpoint.ipynb  
  inflating: templates/.ipynb_checkpoints/Untitled-checkpoint.ipynb  
  inflating: templates/.ipynb_checkpoints/Untitled1-checkpoint.ipynb  
  inflating: templates/invoicegen.py  
  inflating: templates/TEMPLATE_1.txt  
  inflating: templates/TEMPLATE_2.txt  
   creating: templates/__pycache__/
  inflating: templates/__pycache__/invoicegen.cpython-35.pyc  


In [1]:
!rm *

rm: cannot remove 'datalab': Is a directory
rm: cannot remove 'templates': Is a directory


In [0]:
!rm -r templates

In [59]:
!ls

datalab  templates  templates.zip


In [60]:
!pip install pandas



In [61]:
!pip install google-search



In [0]:
!pip install -q keras

In [0]:
# System hyper parameters here

# How many characters before and after the main char to feed the NN
PADDING = 35


'''
Target classes. F: predicted by the simple model (complexity = False),
T: predicted by the complex model (complexity = True).

Ignore:            0
Sender Name:       1   F      Date:        7   T
Sender KVK:        2   F      Item:        8   T
Sender IBAN:       3   F      Item Q:      9   T
Invoice Reference: 4   F      Item Total: 10   T
Total:             5   F
Conditions:        6   F

'''
N_CLASSES = 11

In [0]:
# Data generator -Creates invoices-
import numpy as np
from sklearn.utils import resample
from datetime import datetime
import random
import string

# If complexity is true, we predict the item descriptions, quantities, totals and the date of the invoice. See list of targets above.
def create_invoice(content, complexity, prediction):
    
    items = ['Water','Tea','Coffee',
         'Amazon Echo',' Instant Pot 7-in-1 Multi-Functional Pressure Cooker',
         'TechMatte MagGrip Air Vent Magnetic Universal Car Mount','SanDisk 32GB Ultra Class Memory Card',
        'Sony XB950B1 Extra Bass Wireless Headphones','iRobot Roomba 652 Robotic Vacuum Cleaner',
        'Anker Bluetooth SoundBuds Headphones','Kindle Paperwhite','Fire TV Stick with Alexa Voice Remote',
         'Oral-B Pro 7000 SmartSeries Electric Toothbrush',
         'TaoTronics Dimmable LED Desk Lamp with USB Charging Port','23andMe DNA Test','NVIDIA Tesla K80 GPU',
         'NVIDIA TITAN V VOLTA 12GB HBM2 VIDEO CARD','Equinox Down Alternative Comforter',
         'Anker Super Bright Tactical Flashlight, Rechargeable', 'Cucisina Lemon Squeezer',
         'Fengbao 2PCS Kitchen Sink Strainer - Stainless Steel','ZIONOR Lagopus Ski Snowboard Goggles',
         'BIC Marking Permanent Marker, Metallic','2-in-1 Pet Glove: Grooming Tool + Furniture Pet Hair Remover',
         'Criacr Bluetooth FM Transmitter, Wireless In-Car FM Transmitter','Get Out [Blu-ray]',
         'Car Charger for Nintendo Switch','JETech 2-Pack iPhone 8/7 Screen Protector']
         
    companies = []

    co = {'NAME':'Ortec Finance Big Data Analytics B.V.',
          'STREET':'Boompjes 40',
          'POST':'3011XB',
          'CITY':'Rotterdam',
          'KVKNR':'70498032',
          'VATNR':'000038761017', 
          'BANK':'RABOBANK', 
          'BIC':'RABONL2U',
          'IBAN':'NL97RABO0167773583', 
          'PHONE':'06-98486335',
          'EMAIL':'info@ofdataanalytics.com',
          'WEB':'http://ofdataanalytics.com/'}

    companies.append(co)

    co = {'NAME':'ORTEC Finance B.V.',
          'STREET':'Boompjes 40',
          'POST':'3011XB',
          'CITY':'Rotterdam',
          'KVKNR':'24421148',
          'VATNR':'000019986750',
          'BANK':'RABOBANK', 
          'BIC':'RABONL2U',
          'IBAN':'NL35RABO0386025669',
          'PHONE':'06-90366060',
          'EMAIL':'info@ortec-finance.com',
          'WEB':'http://www.ortec-finance.com/nl-nl'}

    companies.append(co)

    co = {'NAME':'ING Bank N.V.',
          'STREET':'Bijlmerplein 888',
          'POST':'1102MG',
          'CITY':'Amsterdam',
          'KVKNR':'33031431',
          'VATNR':'000019531656',
          'BANK':'ING NETHERLANDS', 
          'BIC':'INGBNL2A',
          'IBAN':'NL12INGB0758162765',
          'PHONE':'06-90366060',
          'EMAIL':'info@ing.nl',
          'WEB':'https://www.ing.nl'}

    companies.append(co)

    co = {'NAME':'Amazon NL International Holdings B.V.',
          'STREET':'Johanna Westerdijkplein 1',
          'POST':'2521EN',
          'CITY':'s-Gravenhage',
          'KVKNR':'69988978',
          'VATNR':'000038299550',
          'BANK':'RABOBANK', 
          'BIC':'RABONL2U',
          'IBAN':'NL41RABO0150437878',
          'PHONE':'06-35829070',
          'EMAIL':'info@amazon.com',
          'WEB':'https://www.amazon.com'}

    companies.append(co)

    co = {'NAME':'Unilever Nederland',
          'STREET':'Nassaukade 5',
          'POST':'3071JL',
          'CITY':'Rotterdam',
          'KVKNR':'24269393',
          'VATNR':'000019267231',
          'BANK':'ING NETHERLANDS', 
          'BIC':'INGBNL2A',
          'IBAN':'NL02INGB0681309748',
          'PHONE':'06-88163931',
          'EMAIL':'info@unilever.com',
          'WEB':'https://www.unilever.com'}


    companies.append(co)
    
    conditions = ['Payable within 30 days','Delivery after payment','']
    
    bill_items = resample(items,n_samples=4,replace=False)
    prices = np.random.rand(4)*1000
    quants = np.random.randint(1,10,size=4)
    totals = prices * quants
    vat_s = totals * 0.12 # 12% vat on everything
    total_vat = np.sum(vat_s)
    total_wo_vat = np.sum(totals)
    total = total_vat + total_wo_vat
    
    
    for i in range(4):
        item_name = bill_items[i]
        item_quant = quants[i]
        item_price = prices[i]
        item_vat = vat_s[i]
        item_total = totals[i]
        content = content.replace('<ITEM_{}_NAME>'.format(i+1),item_name)
        content = content.replace('<ITEM_{}_QUANT>'.format(i+1),"{0:.2f}".format(item_quant))
        content = content.replace('<ITEM_{}_PRICE>'.format(i+1),"{0:.2f}".format(item_price))
        content = content.replace('<ITEM_{}_VAT>'.format(i+1),"{0:.2f}".format(item_vat))
        content = content.replace('<ITEM_{}_TOTAL>'.format(i+1),"{0:.2f}".format(item_total))
    content = content.replace('<TOTAL_WO_VAT>',"{0:.2f}".format(total_wo_vat))
    content = content.replace('<TOTAL_VAT>',"{0:.2f}".format(total_vat))
    content = content.replace('<TOTAL>',"{0:.2f}".format(total))
    sender, reciever = resample(companies, n_samples=2,replace=False)
    
    content = content.replace('<SENDER_NAME>',sender['NAME'])
    content = content.replace('<SENDER_STREET>',sender['STREET'])
    content = content.replace('<SENDER_POST>',sender['POST'])
    content = content.replace('<SENDER_CITY>',sender['CITY'])
    content = content.replace('<KVKNR>',sender['KVKNR'])
    content = content.replace('<VATNR>',sender['VATNR'])
    content = content.replace('<BANK_NAME>',sender['BANK'])
    content = content.replace('<IBAN>',sender['IBAN'])
    content = content.replace('<BIC_CODE>',sender['BIC'])
    content = content.replace('<PHONE>',sender['PHONE'])
    content = content.replace('<EMAIL>',sender['EMAIL'])
    content = content.replace('<WEBSITE>',sender['WEB'])
    content = content.replace('<RECIPIENT_NAME>',reciever['NAME'])
    content = content.replace('<RECIPIENT_STREET>',reciever['STREET'])
    content = content.replace('<RECIPIENT_POST>',reciever['POST'])
    content = content.replace('<RECIPIENT_CITY>',reciever['CITY'])
    
    year = random.randint(2008, 2017)
    month = random.randint(1, 12)
    day = random.randint(1, 28)

    date = '{}/{}/{}'.format(day,month,year)

    content = content.replace('<INVOICE_DATE>',date)    
    
    if month < 12: 
        month += 1
    else:
        # An invoice dated in December is due in January of the next year
        month = 1
        year += 1

    due = '{}/{}/{}'.format(day,month,year)

    content = content.replace('<DUE_DATE>',due)
    
    reference = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5))
    invoice_no = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))

    content = content.replace('<CUSTOMER_REFERENCE>',reference)
    content = content.replace('<INVOICE_NR>',invoice_no)
    
    condition = resample(conditions, n_samples = 1)[0]
    content = content.replace('<CONDITIONS>',condition)
    
    

    target = [0]* len(content)
    
    if complexity == True or prediction == True:
      
      date_start = content.find(date)
      date_len = len(date)
      date_end = date_start + date_len
      if date_start != -1:
        target[date_start:date_end] = [7]*date_len 
 
      for i in range(4):
        item_name = bill_items[i]
        item_quant = quants[i]
        item_price = prices[i]
        item_vat = vat_s[i]
        item_total = totals[i]  # needed below; otherwise the last item's total is reused

        item_start = content.find(item_name)
        item_len = len(item_name)
        item_stop = item_start + item_len
        if item_start != -1:
            target[item_start:item_stop] = [8]*item_len

        item_q_start = content.find("{0:.2f}".format(item_quant)) 
        item_q_len = len("{0:.2f}".format(item_quant))
        item_q_stop = item_q_start + item_q_len
        if item_q_start != -1:
            target[item_q_start:item_q_stop] = [9]*item_q_len

        itemt_start = content.find("{0:.2f}".format(item_total))
        itemt_len = len("{0:.2f}".format(item_total))
        itemt_stop = itemt_start + itemt_len
        if itemt_start != -1:
            target[itemt_start:itemt_stop] = [10]*itemt_len    

    
    if complexity == False or prediction == True:
      co_start = content.find(condition)
      co_len = len(condition)
      co_end = co_start + co_len
      if co_start != -1:
        target[co_start:co_end] = [6]*co_len

      sn_start = content.find(sender['NAME'])
      sn_len = len(sender['NAME'])
      sn_end = sn_start + sn_len
      if sn_start != -1:
          target[sn_start:sn_end] = [1]*sn_len

      skvk_start = content.find(sender['KVKNR'])
      skvk_len = len(sender['KVKNR'])
      skvk_end = skvk_start + skvk_len
      if skvk_start != -1:
          target[skvk_start:skvk_end] = [2]*skvk_len

      siban_start = content.find(sender['IBAN'])
      siban_len = len(sender['IBAN'])
      siban_end = siban_start + siban_len
      if siban_start != -1:
          target[siban_start:siban_end] = [3]*siban_len

      ref_start = content.find(reference)
      ref_len = len(reference)
      ref_end = ref_start + ref_len
      if ref_start != -1:
          target[ref_start:ref_end] = [4]*ref_len

      total_start = content.find("{0:.2f}".format(total))
      total_len = len("{0:.2f}".format(total))
      total_end = total_start + total_len
      if total_start != -1:
          target[total_start:total_end] = [5]*total_len
    
    assert (len(content) == len(target))
    return content, target
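
The labeling blocks above all follow the same pattern: find a value in the invoice text and overwrite the matching slice of the target list. As an illustration only, that pattern can be captured in one helper; `label_span` and the sample text below are hypothetical and not part of the generator.

```python
def label_span(content, target, text, label):
    """Mark the first occurrence of `text` in `content` with `label`.

    Returns True if the text was found. Empty strings are skipped.
    """
    if not text:
        return False
    start = content.find(text)
    if start == -1:
        return False
    target[start:start + len(text)] = [label] * len(text)
    return True

content = "From: ING Bank N.V.\nKVK: 33031431\nTotal: 12.50"
target = [0] * len(content)
label_span(content, target, "ING Bank N.V.", 1)  # Sender Name
label_span(content, target, "33031431", 2)       # Sender KVK

# Collect the characters that were labeled as Sender Name
labelled = "".join(c for c, t in zip(content, target) if t == 1)
print(labelled)  # ING Bank N.V.
```

Because `str.find` returns only the first match, a short value such as a quantity can end up labeled at an earlier, unrelated position when the same digits occur elsewhere in the invoice.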
         



In [0]:
# Your friendly tokenizer
from keras.preprocessing.text import Tokenizer

# Numpy
import numpy as np

import pandas as pd

In [0]:
# Create 500 invoices for each template

invoices_simple = []
targets_simple = []
invoices_complex = []
targets_complex = []

# Load template 1
with open('templates/TEMPLATE_1.txt', 'r') as content_file:
    content = content_file.read()

# Create invoices from template
for i in range(500):
    inv_s, tar_s = create_invoice(content, complexity=False, prediction=False)
    invoices_simple.append(inv_s)
    targets_simple.append(tar_s)
    inv_c, tar_c = create_invoice(content, complexity=True, prediction=False)
    invoices_complex.append(inv_c)
    targets_complex.append(tar_c)
    
# Load template 2
with open('templates/TEMPLATE_2.txt', 'r') as content_file:
    content = content_file.read()
    
# Create invoices from template
for i in range(500):
    inv_s, tar_s = create_invoice(content, complexity=False, prediction=False)
    invoices_simple.append(inv_s)
    targets_simple.append(tar_s)
    inv_c, tar_c = create_invoice(content, complexity=True, prediction=False)
    invoices_complex.append(inv_c)
    targets_complex.append(tar_c)

In [0]:
# Counts the amount of generated invoices that have a certain target.

para_c = 4  # Amount of complex parameters
para_s = N_CLASSES-para_c

def counter(targets, N_CLASSES, complexity):
  count = np.zeros(N_CLASSES)
  for i in range(len(targets)):
    if complexity==True:
      for j in [7,8,9,10]: # Complex targets without 'Ignore'
        if j in targets[i]:
          count[j-7] += 1  # First complex target to get index back to 0.
    else:     
      for j in range(N_CLASSES):
        if j in targets[i]:
          count[j] += 1
  return count

In [68]:
'''
Ignore:           0   
Sender Name:      1   F
Sender KVK:       2   F
Sender IBAN:      3   F
Invoice Reference:4   F
Total:            5   F
Conditions:       6   F
Date:             7   T
Item:             8   T
Item Q:           9   T
Item Total:      10   T
'''
# Get the total count
total_count = np.hstack((counter(targets_simple, para_s, complexity=False),(counter(targets_complex, para_c,complexity=True))))
# Remove the ignore - since every model has them, and the complex do not count them per definition -.
total_count = np.delete(total_count,0)

print(total_count)


# The counter already shows an oddity in the labels: fewer invoices contain a labeled item quantity (515) than contain items (1000).
# This is an indication of the lack of information provided/accessible to the model when merely using 28 items.


[ 500.  500.  500. 1000. 1000.  335. 1000. 1000.  515. 1000.]


In [0]:
# Create our tokenizer
# We will tokenize on character level!
# We will NOT remove any characters
tokenizer = Tokenizer(char_level=True, filters=None)
tokenizer.fit_on_texts(invoices_simple)
tokenizer.fit_on_texts(invoices_complex)
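
To show what character-level one-hot encoding produces, here is a small NumPy stand-in. It is a sketch of the idea, not the Keras implementation; `one_hot_chars`, `vocab` and the two sample strings are made up for illustration, and leaving column 0 unused is an assumption meant to mirror the tokenizer's reserved index 0.

```python
import numpy as np

def one_hot_chars(text, char_index):
    """Return a (len(text), len(char_index) + 1) one-hot matrix.

    Column 0 is never set (assumption of this sketch); characters
    missing from char_index produce an all-zero row.
    """
    mat = np.zeros((len(text), len(char_index) + 1))
    for row, ch in enumerate(text):
        col = char_index.get(ch)
        if col is not None:
            mat[row, col] = 1.0
    return mat

corpus = ["KVK: 33031431", "IBAN: NL12INGB"]
vocab = sorted(set("".join(corpus)))
char_index = {ch: i + 1 for i, ch in enumerate(vocab)}

x = one_hot_chars("KVK: 33", char_index)
print(x.shape)  # one row per character, one column per known character plus column 0
```

Each window of 2 * PADDING + 1 characters is encoded this way, which is why a batch of windows has three dimensions: window, character position, character column.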

In [0]:
def gen_sub(inv,tar,pad, m = None):
    '''
    Generates a substring from invoice inv and target list tar 
    using the character at index m as a midpoint.
    
    Params:
    inv - an invoice string
    tar - a target list specifying the type of each character
    pad - the amount of padding to attach before and after the focus character
    m - index of the focus character; chosen at random if None
    
    Returns:
    sub - a string with pad characters, the focus character, pad characters
    tar[m] - the target of the focus character
    '''
    # If no focus character index is set, choose at random
    if m is None:
        m = np.random.randint(0,len(inv))
        
    l = m - pad # define the lower bound of our substring
    h = m + pad + 1 # define the upper (high) of our substring

    # Sometimes, our lower bound could be below zero
    # In this case we attach the remaining characters from the back of the string
    if l < 0:
        # Get the characters from the back of the file
        s1 = inv[l:None]
        
        # Edge case: Sample size larger than string
        # Our upper bound might be higher than the length of the text
        # In that case we start from the front again
        if h >= len(inv): 
            # How many characters do we need from the front
            overlap = h - len(inv)
            # The string is the entire invoice + some chars from the front
            s2 = inv
            s_over = inv[None:overlap]
            s2 = s2 + s_over
        else:
            # If we don't need chars from the front 
            # we can just select to the upper bound
            s2 = inv[None:h]
            
        # Create substring
        sub = s1 + s2
        # Ensure the substring has the right length
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
    
    # Our lower bound might be positive but our upper bound might 
    # still be above the length of the invoice
    elif h >= len(inv):
        # Calc how many chars we need from the front
        overlap = h - len(inv)
        
        # Get string from lower bound to end
        s1 = inv[l:None]
        # Get string from the front of the doc
        s2 = inv[None:overlap]
        sub = s1 + s2
        # Make sure our string has the correct length
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
    
    # Upper and lower bound lie within the length of the invoice
    else: 
        sub = inv[l:h]
        assert(len(sub) == pad*2 +1)
        return sub, tar[m]
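
The three branches of `gen_sub` implement a window that wraps around the ends of the invoice. As an illustration only, the same window can be written with modular indexing; `window` is a hypothetical helper, not part of the notebook.

```python
def window(inv, m, pad):
    """Return the 2*pad+1 characters centred on index m,
    wrapping around the ends of the string."""
    n = len(inv)
    return ''.join(inv[(m + k) % n] for k in range(-pad, pad + 1))

print(window('abcdef', 0, 2))  # efabc
print(window('abcdef', 3, 2))  # bcdef
print(window('abcdef', 5, 2))  # defab
```

The modular form always returns exactly 2*pad+1 characters, so it does not rely on the invoice being longer than the window.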

In [0]:
def gen_dataset(sample_size, n_classes, invoices, targets, tokenizer, complexity):
    '''
    Generate a dataset of inputs and outputs for our neural network
    
    Params:
    sample_size - desired sample size
    n_classes - number of classes
    invoices - list of invoices to sample from
    targets - list of corresponding targets to sample from
    tokenizer - a keras tokenizer fit on the invoices
    complexity - if True, targets 7..10 are shifted down by 7 for the complex model
    
    The function creates balanced samples by randomly sampling until 
    an equal amount of samples of all types is created.
    
    Characters are one hot encoded
    
    Returns:
    x_arr: a numpy array of shape (sample_size, sequence length, number of unique characters)
    y_arr: a numpy array of shape (sample_size,)
    '''
    
    # Create a budget
    # Integer division keeps the per-class budget a whole number
    budget = [sample_size // n_classes] * n_classes
    
    # Setup holding variables
    X_train = []
    y_train = []

    # While there is still a budget left...
    while sum(budget) > 0:
        # ... get a random invoice and target list
        index = np.random.randint(0,len(invoices))
        inv = invoices[index]
        tar = targets[index]
        # ... sample up to 10 items from this invoice 
        for j in range(10):
            # Get an item
            x, y = gen_sub(inv,tar,PADDING)
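            if complexity == True:
              # Shift the complex targets (7..10) down by 7.
              # Note: Date (7) then maps to 0, the same index as Ignore.
              if y != 0:
                y = y-7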
            # if we still have a budget for this items target
            if budget[y] > 0:
                # Tokenize to one hot
                xm = tokenizer.texts_to_matrix(x)
                # Add data and target
                X_train.append(xm)
                y_train.append(y)
                budget[y] -= 1
      
    # Create numpy arrays from all data and targets
    x_arr = np.array(X_train)
    y_arr = np.array(y_train)
    return x_arr,y_arr
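
`gen_dataset` balances the classes with a per-class budget, and its loop only ends once every budget is used up, so a class that never occurs in the data would keep it running. The sketch below isolates that budget logic on plain label lists and adds a draw limit; `balanced_indices`, `max_tries` and `seed` are hypothetical names introduced for this illustration.

```python
import random
from collections import Counter

def balanced_indices(labels, per_class, n_classes, max_tries=100000, seed=0):
    """Draw indices at random until every class has per_class samples
    or max_tries draws have been made."""
    rng = random.Random(seed)
    budget = [per_class] * n_classes
    picked = []
    tries = 0
    while sum(budget) > 0 and tries < max_tries:
        tries += 1
        i = rng.randrange(len(labels))
        y = labels[i]
        # Only keep the draw if this class still has budget left
        if budget[y] > 0:
            picked.append(i)
            budget[y] -= 1
    return picked, budget

labels = [0] * 90 + [1] * 10          # class 1 is rare
picked, left = balanced_indices(labels, per_class=5, n_classes=2)
print(Counter(labels[i] for i in picked))

# A class that never occurs leaves its budget untouched
_, left_missing = balanced_indices([0] * 10, per_class=5, n_classes=2, max_tries=1000)
print(left_missing)  # [0, 5]
```

Returning the remaining budget makes it visible when a class could not be filled, instead of looping indefinitely.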

In [0]:
# Get data
def generate_sets(N_CLASSES, invoices, targets, tokenizer, complexity):
  train_size = 24000
  val_size = 240

  x_tr, y_tr = gen_dataset(train_size, N_CLASSES, invoices, targets, tokenizer, complexity)
  x_val, y_val = gen_dataset(val_size, N_CLASSES, invoices, targets, tokenizer, complexity)
  return x_tr, y_tr, x_val, y_val

In [0]:
# Create training and validation data for both the complex and simple model.
x_tr_s, y_tr_s, x_val_s, y_val_s = generate_sets(para_s, invoices_simple, targets_simple, tokenizer, complexity = False)
x_tr_c, y_tr_c, x_val_c, y_val_c = generate_sets(para_c, invoices_complex, targets_complex, tokenizer, complexity = True)
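
In the notebook the complex targets are shifted with `y - 7`, so Date (7) and Ignore (0) end up sharing index 0. One possible alternative, shown purely as a sketch, keeps five distinct indices; adopting it would also mean calling `gen_dataset` with five classes and replacing the `+= 7` step after prediction. The function names are hypothetical.

```python
COMPLEX_TARGETS = [7, 8, 9, 10]  # Date, Item, Item Q, Item Total

def to_model_index(y):
    """Map an original target to a model class index; Ignore stays 0."""
    return 0 if y == 0 else COMPLEX_TARGETS.index(y) + 1

def from_model_index(k):
    """Inverse of to_model_index."""
    return 0 if k == 0 else COMPLEX_TARGETS[k - 1]

print([to_model_index(y) for y in [0, 7, 8, 9, 10]])  # [0, 1, 2, 3, 4]
print([from_model_index(k) for k in range(5)])        # [0, 7, 8, 9, 10]
```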

In [0]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense,Activation, Conv1D, MaxPool1D

In [0]:
def make_model(INPUTSHAPE, PADDING, N_CLASSES):  # A simple model
  model = Sequential()
  model.add(Conv1D(2*PADDING,2,input_shape=INPUTSHAPE)) # INPUTSHAPE is (window length, one-hot width), e.g. (71, 84) in this notebook
  model.add(MaxPool1D(2))
  model.add(SimpleRNN(4*N_CLASSES))
  model.add(Dense(N_CLASSES))
  model.add(Activation('softmax'))
  
  return model
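
The layer sizes in `make_model` can be checked by hand. The arithmetic below is a sketch for the simple model that assumes the Keras defaults (`padding='valid'`, stride 1 for `Conv1D`, pool stride equal to the pool size, biases enabled) and takes the one-hot width of 84 from the `x_test.shape` output in this notebook.

```python
PADDING = 35
SEQ_LEN = 2 * PADDING + 1   # 71 characters per window
N_CHARS = 84                # one-hot width
N_CLASSES = 7               # simple model: Ignore + 6 targets

filters = 2 * PADDING                     # Conv1D(2*PADDING, 2)
conv_steps = SEQ_LEN - 2 + 1              # kernel size 2, no padding
pool_steps = (conv_steps - 2) // 2 + 1    # MaxPool1D(2)
rnn_units = 4 * N_CLASSES                 # SimpleRNN(4*N_CLASSES)

# Trainable parameters per layer (weights + biases)
conv_params = 2 * N_CHARS * filters + filters
rnn_params = rnn_units * (filters + rnn_units) + rnn_units
dense_params = rnn_units * N_CLASSES + N_CLASSES

print(conv_steps, pool_steps)                 # 70 35
print(conv_params, rnn_params, dense_params)  # 11830 2772 203
```

Under these assumptions most of the roughly 15,000 parameters sit in the convolution, which reads pairs of adjacent characters.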

In [0]:
model_simple = make_model(INPUTSHAPE=x_tr_s.shape[1:3], PADDING=PADDING, N_CLASSES = para_s)
model_complex = make_model(INPUTSHAPE=x_tr_c.shape[1:3], PADDING=PADDING,N_CLASSES = para_c+1)

In [0]:
# sparse_categorical_crossentropy is like categorical crossentropy but without converting targets to one hot
model_simple.compile(loss='sparse_categorical_crossentropy',optimizer='adam', metrics=['acc'])
model_complex.compile(loss='sparse_categorical_crossentropy',optimizer='adam', metrics=['acc'])

In [78]:
model_simple.fit(x_tr_s,y_tr_s,batch_size=32,epochs=2,validation_data=(x_val_s,y_val_s))

Train on 23996 samples, validate on 238 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f44cbe84590>

In [79]:
# 1 epoch proved the most accurate
model_complex.fit(x_tr_c,y_tr_c,batch_size=32,epochs=1,validation_data=(x_val_c,y_val_c))

Train on 24000 samples, validate on 240 samples
Epoch 1/1


<keras.callbacks.History at 0x7f44ce01ca90>

In [0]:
'''
To make predictions from our model, we need to create 
sequences around every character from the invoice.

We then make a prediction for every character based on its surrounding sequence
'''

# Make a new invoice:
template = np.random.randint(0,2)  # the upper bound is exclusive, so this draws 0 or 1
if template == 0:
  with open('templates/TEMPLATE_1.txt', 'r') as content_file:
    content = content_file.read()
else:
  with open('templates/TEMPLATE_2.txt', 'r') as content_file:
    content = content_file.read()

# Complexity does not matter, since prediction arg overrules it.
inv,tar = create_invoice(content, complexity=False, prediction=True)


chars = [] # Holds the individual characters
data = [] # Holds the sequences around the characters
y_true = [] # Holds the true targets for each character

# Loop over characters indices
for i in range(len(inv) -1):
    # Create sequence around this character
    x,y = gen_sub(inv,tar,PADDING,m=i)
    # Tokenize the sequence to one hot
    xm = tokenizer.texts_to_matrix(x)
    # Get the character itself
    c = inv[i]
    
    chars.append(c)
    data.append(xm)
    y_true.append(y)

In [0]:
import pandas as pd

In [0]:
# For demo purposes we can look what our invoice looks like
df = pd.DataFrame({'Char':chars,'Target':y_true})

In [141]:
# Show all characters labeled as Sender IBAN (target 3)
df[df.Target == 3]

Unnamed: 0,Char,Target
241,N,3
242,L,3
243,4,3
244,1,3
245,R,3
246,A,3
247,B,3
248,O,3
249,0,3
250,1,3


In [0]:
# Create test data for predictions with neural net
x_test = np.array(data)

In [143]:
x_test.shape

(954, 71, 84)

In [0]:
# Make predictions
y_pred_simple = model_simple.predict(x_test)
y_pred_complex = model_complex.predict(x_test)

In [0]:
# Get the maximum likely class
y_pred_complex = y_pred_complex.argmax(axis=1)
y_pred_simple = y_pred_simple.argmax(axis=1)

In [0]:
for i in range(len(y_pred_complex)):
  if y_pred_complex[i] != 0:
    y_pred_complex[i] += 7

In [0]:
# Show what our model predictions look like
df['Predicted'] = y_pred_complex

In [148]:
# Show all chars that are predicted to be an item quantity (class 9)
df[df.Predicted == 9]

Unnamed: 0,Char,Target,Predicted
476,,8,9
486,7,9,9
487,.,9,9
488,0,9,9
489,0,9,9
490,\n,0,9
491,\n,0,9
501,1,0,9
586,\n,0,9
587,3,9,9


In [0]:
from itertools import groupby
# Create groups by the predicted output
# This code will return a list of tuples with the format
# (category, length, starting index)

# TODO: This code is ugly and very hard to understand
# But it works

def grouping(y_pred):

  # Group by predicted category
  g = groupby(enumerate(y_pred), lambda x:x[1])

  # Create list of groups
  l = [(x[0], list(x[1])) for x in g]

  # Create list with tuples of groups
  groups = [(x[0], len(x[1]), x[1][0][0]) for x in l]

  return groups
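
As the TODO above admits, the grouping code is hard to follow. A more explicit version of the same run-length grouping, shown here only as a sketch with the hypothetical name `runs`, tracks the start index while iterating:

```python
from itertools import groupby

def runs(labels):
    """Return (category, length, start index) for each run of equal labels."""
    out = []
    start = 0
    for category, members in groupby(labels):
        length = len(list(members))
        out.append((category, length, start))
        start += length
    return out

print(runs([0, 0, 8, 8, 8, 0]))  # [(0, 2, 0), (8, 3, 2), (0, 1, 5)]
```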

In [0]:
groups_simple = grouping(y_pred_simple)

In [0]:
groups_complex = grouping(y_pred_complex)

In [152]:
# Show grouping
groups_complex

[(0, 448, 0),
 (8, 28, 448),
 (9, 1, 476),
 (8, 7, 477),
 (10, 2, 484),
 (9, 6, 486),
 (0, 9, 492),
 (9, 1, 501),
 (0, 27, 502),
 (8, 57, 529),
 (9, 6, 586),
 (10, 2, 592),
 (0, 31, 594),
 (8, 1, 625),
 (0, 1, 626),
 (8, 1, 627),
 (0, 1, 628),
 (8, 63, 629),
 (10, 1, 692),
 (9, 6, 693),
 (0, 34, 699),
 (8, 1, 733),
 (0, 1, 734),
 (8, 1, 735),
 (0, 1, 736),
 (8, 9, 737),
 (9, 1, 746),
 (8, 9, 747),
 (10, 1, 756),
 (9, 7, 757),
 (10, 13, 764),
 (0, 1, 777),
 (10, 2, 778),
 (9, 10, 780),
 (10, 10, 790),
 (0, 1, 800),
 (9, 1, 801),
 (0, 152, 802)]

In [0]:
def candidates(groups,chars):
  '''
  We only want to consider sequences of predictions of the same type 
  that have a minimum length. This way we remove the noise
  But we also might remove some good predictions

  The min length is set to 5 here, certainly a value to experiment with
  '''
  candidates = []
  # Loop over all groups
  for group in groups:

      # Unpack group
      category, length, index = group

      # Ignore the ignore category and only consider category sequences longer than 5
      if category != 0 and length > 5:
          # Create text
          candidate_text = ''.join(chars[index:index+length])
          # Remove line breaks, this is just one way to prettify outputs!
          candidate_text = candidate_text.replace('\n','')
          candidates.append((candidate_text,category))
  return candidates
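
The grouping output contains runs of one category that are interrupted by a single differently classified character, for example `(8, 9, 737), (9, 1, 746), (8, 9, 747)`. One way to reduce such fragmentation before extracting candidates is to bridge short gaps. This is a sketch of a possible post-processing step, not something the notebook does; `bridge_gaps` and `max_gap` are hypothetical.

```python
def bridge_gaps(groups, max_gap=1):
    """Merge runs of the same category that are separated by a run of at
    most max_gap characters. groups holds (category, length, start) tuples."""
    merged = []
    i = 0
    while i < len(groups):
        cat, length, start = groups[i]
        # Absorb "same category, short gap, same category" patterns repeatedly
        while (i + 2 < len(groups)
               and groups[i + 1][1] <= max_gap
               and groups[i + 2][0] == cat):
            length += groups[i + 1][1] + groups[i + 2][1]
            i += 2
        merged.append((cat, length, start))
        i += 1
    return merged

# Runs copied from the groups_complex output above
sample = [(8, 9, 737), (9, 1, 746), (8, 9, 747), (10, 1, 756)]
print(bridge_gaps(sample))  # [(8, 19, 737), (10, 1, 756)]
```

The merge is greedy from left to right, so a long Ignore run will also absorb an isolated one-character prediction next to it; whether that is desirable depends on how noisy the predictions are.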

In [0]:
candidates_simple = candidates(groups_simple,chars)
candidates_complex = candidates(groups_complex,chars)

In [155]:
candidates_simple

[('torec%Aan:', 3),
 ('Amazon NL International Holdings B.V.Johanna We', 1),
 ('699889780', 2),
 ('KNL41RABO0150437878R', 3),
 ('ABONL2U0', 1),
 ('YZCL1', 4),
 ('\xac 8104.68', 5),
 ('ardenDelivery after payment', 6)]

In [0]:
'''
Ignore:           0
Sender Name:      1 
Sender KVK:       2 
Sender IBAN:      3 
Invoice Reference:4
Total:            5
Conditions:       6
Date:             7
Item:             8
Item Q:           9
Item Total:      10
'''


sorting_simple = sorted(candidates_simple, key=lambda tup: tup[1])
sorting_complex = sorted(candidates_complex, key=lambda tup: tup[1])

In [157]:
sorting_simple

[('Amazon NL International Holdings B.V.Johanna We', 1),
 ('ABONL2U0', 1),
 ('699889780', 2),
 ('torec%Aan:', 3),
 ('KNL41RABO0150437878R', 3),
 ('YZCL1', 4),
 ('\xac 8104.68', 5),
 ('ardenDelivery after payment', 6)]

In [0]:
# Put the prediction to panda dataframe for easier handling
df_s = pd.DataFrame(sorting_simple)
df_c = pd.DataFrame(sorting_complex)

In [159]:
# Rename the columns - important for later definitions - 
df_s.columns = ['Prediction', 'Target']
df_c.columns = ['Prediction', 'Target']
df = pd.concat([df_s,df_c])
print(df)

                                           Prediction  Target
0     Amazon NL International Holdings B.V.Johanna We       1
1                                            ABONL2U0       1
2                                           699889780       2
3                                          torec%Aan:       3
4                                KNL41RABO0150437878R       3
5                                               YZCL1       4
6                                           � 8104.68       5
7                         ardenDelivery after payment       6
0                        ZIONOR Lagopus Ski Snowboard       8
1                                             Goggles       8
2   TechMatte MagGrip Air Vent Magnetic Universal ...       8
3   Criacr Bluetooth FM Transmitter, Wireless In-C...       8
4                                           VIDIA Tes       8
5                                           a K80 GPU       8
6                                                7.00       9
7       

In [0]:
# Get the name from the prediction
names = df.loc[df['Target'] == 1, 'Prediction'].dropna()
if len(names) > 0:
  # Take the first (sorted) sender-name candidate
  Name = names.values[0]
else:
  # Fall back to an empty string so the later matching step still runs
  Name = ''
  print("No company name could be found.")

In [0]:
# Find KVK to search for with Google or KVK-API.
# The dataframe should have two columns: Prediction, Target.
# Prediction being column 0, and the predicted name, kvk number, ...
# Target is column 1, and the target number of the prediction. 0,1,2,...
def find_KVK(df):
  # Extract the leading run of digits from each KVK prediction
  KVK = df.loc[df['Target'] == 2, 'Prediction'].str.extract(r'(^\d*)', expand=False).dropna().tolist()
  # A KVK number has 8 digits. If more digits were captured, take the last and the first 8.
  KVK_candidates = []
  for number in KVK:
    if len(number) == 8:
      KVK_candidates.append(number)
    elif len(number) > 8:
      KVK_candidates.append(number[-8:])
      KVK_candidates.append(number[:8])

  # Remove empty elements
  KVK_candidates = list(filter(None, KVK_candidates))
  
  return KVK_candidates
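
To see what the candidate logic does with a noisy prediction, the same idea can be applied to a single string. The helper below is an illustrative stand-alone sketch using only `re`; `kvk_candidates` and `n_digits` are hypothetical names.

```python
import re

def kvk_candidates(text, n_digits=8):
    """Return plausible KVK numbers from the leading digits of a prediction."""
    digits = re.match(r'\d*', text).group(0)
    if len(digits) == n_digits:
        return [digits]
    if len(digits) > n_digits:
        # A stray digit may have been attached at either end
        return [digits[-n_digits:], digits[:n_digits]]
    return []

print(kvk_candidates('699889780'))  # ['99889780', '69988978']
```

For the prediction `'699889780'` shown earlier, the second candidate is the KVK number that the generator assigns to Amazon NL International Holdings B.V.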

In [128]:
# Run the above definition with the predictions
KVK_candidates = find_KVK(df)



In [0]:
KVK_cand_text = []
for i in range(len(KVK_candidates)):
  KVK_cand_text.append(str(KVK_candidates[i]) + " nl")
  # Add nl, because googlesearch method looks at international google results.

In [0]:
# Use this github method - credit to: https://github.com/anthonyhseb/googlesearch - to find google results.
from googlesearch.googlesearch import GoogleSearch
titles = [[],[]]
for i in range(len(KVK_cand_text)):
  response = GoogleSearch().search(KVK_cand_text[i])
  for result in response.results:
      titles[0].append(result.title)
      titles[1].append(KVK_candidates[i])
 # num_results could be used to limit search results


In [131]:
titles

[[u'The Winnipeg Tribune from Winnipeg, on June 3, 1920 \xb7 Page 8',
  u'chargemaster - Office of Statewide Health Planning and Development',
  u'Hengels - Hissink Videobewaking Service',
  u'Miembro - B\xfasqueda - Delcampe',
  u'Prijslijst geldig vanaf: Prof + Chemie 24-3-2014',
  u'PERM, RUSLAND - 9 Januari 2015: Kerstboom In De Buurt Van ...',
  u'PERM, RUSLAND - 5 Januari 2015: De Mensen Lopen In Ice ... - 123RF',
  u'Prijslijst geldig vanaf: Prof + Chemie 24-3-2014 - Rodeo - doczz',
  u'Jobmonitor. Search results for endabnahme endabnahme',
  u'06 Nummer reeks 06-44210000 06-44219999 - WPOI.NL',
  u'Ortec Finance Bv - Rotterdam 3011 XB (Rotterdam), Boompjes 40 ...',
  u'ORTEC Finance BV | Rotterdam - Drimble',
  u'ORTEC Finance BV | Amsterdam - Drimble',
  u'ORTEC Finance B.V. - Oozo.nl',
  u'ORTEC Finance B.V. - Oozo.nl',
  u'Ortec Finance in Amsterdam | De Telefoongids',
  u'Ton van Welie, CEO Ortec Finance - VBA beleggingsprofessionals',
  u'Loopbaanori\xebntatiedag - De Leid

In [132]:
Name

'f-torec%Aan:ORTEC Finance B.V.Boompjes 4030'

In [0]:
import difflib
# Find the closest match to the name prediction
closest_match = difflib.get_close_matches(Name, titles[0], 2)
# Cleaning the output
closest_match = [item.encode('utf-8') for item in closest_match]

In [134]:
closest_match

[]

In [135]:
if len(closest_match) == 0:
  print("No match was found!")
else:
  print("Company name is " + closest_match[0])
  print("Wit")

No match was found!
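
`difflib.get_close_matches` drops every candidate whose similarity ratio is below its `cutoff` (0.6 by default), so extra characters around the predicted name can be enough to leave the list empty. The sketch below compares a clean name with the noisy prediction shown above; the shortened `sample_titles` list and the `score` helper are illustrative only.

```python
import difflib

sample_titles = [
    'The Winnipeg Tribune from Winnipeg, on June 3, 1920 - Page 8',
    'ORTEC Finance BV | Rotterdam - Drimble',
    'ORTEC Finance B.V. - Oozo.nl',
]

clean_name = 'ORTEC Finance B.V.'
noisy_name = 'f-torec%Aan:ORTEC Finance B.V.Boompjes 4030'

def score(a, b):
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# With the default cutoff only sufficiently similar titles survive
matches = difflib.get_close_matches(clean_name, sample_titles, 2)
print(matches)  # ['ORTEC Finance B.V. - Oozo.nl']

# The surrounding noise lowers the score for the same title
for name in (clean_name, noisy_name):
    print(round(score(name, 'ORTEC Finance B.V. - Oozo.nl'), 2))
```

Trimming the predicted name before matching, or passing a lower `cutoff`, are two options worth experimenting with.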


The results above could be used to auto-fill the name once the accountant starts filling in the field.