In [341]:
import json
import re
import sys
from time import time

import en_core_web_sm
import fitz  # PyMuPDF
import html_to_json
import spacy
from markdown import markdown as markdown_to_html
from PyPDF2 import PdfReader

In [342]:
nlp = en_core_web_sm.load()

In [343]:
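def read_text(file: str = ''):
    """Return (is_journal, content) for the PDF at `file`."""
    is_journal = False
    content = ''

    # fitz is PyMuPDF; iterating a document yields its pages.
    doc = fitz.open(file)

    for page in doc:
        text = page.get_text()  # plain text, already a str

        # Join lines wrapped after trailing whitespace: "word \nnext" -> "word next".
        result = re.sub(r'([\w\W])\s\n([\w\W])', r'\g<1> \g<2>', text)
        # Join lines split at a hyphen: "peer-\nto" -> "peer-to".
        # Assumption: the hyphen is kept, since the text alone cannot tell
        # a real hyphen from a typesetting one.
        result = re.sub(r'-\n([\w\W])', r'-\g<1>', result)

        # An "Abstract"/"Abstrak" heading marks the document as a journal article.
        if re.search(r'Abstract|Abstrak|ABSTRACT|ABSTRAK', result):
            is_journal = True

        content += result

    doc.close()
    return (is_journal, content)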

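# Roman numerals (lowercase and uppercase) used as list or section markers.
ROMAN_LOWER = r'm{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})'
ROMAN_UPPER = r'M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})'

def extract(file: str = ''):
    """Print the cleaned text of a PDF and return its named entities."""
    (journal, content) = read_text(file)

    if journal:
        # Both numeral patterns can also match the empty string, so a bare
        # "." or ")" before a line break is treated like a list marker too.
        # 1. "IV.\nTitle" or "3.\nTitle" -> "IV. Title"
        content = re.sub(rf'({ROMAN_UPPER}|\d+)\.\n([A-Z])', r'\g<1>. \g<2>', content)
        # 2. "iv)\nTitle" or "3)\nTitle" -> "iv) Title"
        content = re.sub(rf'({ROMAN_LOWER}|\d+)\)\n([A-Z])', r'\g<1>) \g<2>', content)
        # 3. "[label]\ntarget\n" -> Markdown link "[label](target)\n"
        content = re.sub(r'\[(.*)\]\s?\n(.*?)\n', r'[\g<1>](\g<2>)\n', content)
        # 4. Start each numbered item on its own line.
        content = re.sub(rf'\s({ROMAN_LOWER}|{ROMAN_UPPER}|\d+)\.\s', r'\n\g<1>. ', content)
        # 5. Drop runs of dots such as table-of-contents leaders.
        content = re.sub(r'\.{2,}', '', content)

        result = ''
        for line in content.split('\n'):
            # Three or more consecutive whitespace characters suggest a formula or table row.
            has_formula = re.search(r'\s{3,}', line)

            # Keep lines that start like prose and are at least 8 characters long.
            if re.search(r'^([0-9a-zA-Z]|\[)', line) and len(line) > 7 and has_formula is None:
                result += line + '\n\n'

        # Rejoin a paragraph that was split before a lowercase continuation.
        result = re.sub(r'([\w\W]{40})\n{2}([a-z])', r'\g<1> \g<2>', result)
    else:
        result = content

    # Collect the named entities spaCy finds in the cleaned text.
    nlp_result = nlp(result)
    nlp_entities = [
        {'label': ent.label_, 'text': ent.text}
        for ent in nlp_result.ents
    ]

    print(result)
    return nlp_entities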
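
The line-joining step in `read_text` can be tried on a plain string, without opening a PDF. The cell below is a minimal sketch rather than the notebook's exact code: `join_wrapped_lines` is a hypothetical helper name, and it assumes that a hyphen at the end of a line should be kept when the two halves are joined. It handles the two cases (a soft wrap after trailing whitespace, and a break right after a hyphen) in separate substitutions, so each replacement only refers to groups that took part in the match.

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Rejoin lines broken by PDF layout (hypothetical helper, for illustration)."""
    # Case 1: a line ends in one whitespace character and continues on the next line.
    text = re.sub(r'([\w\W])\s\n([\w\W])', r'\g<1> \g<2>', text)
    # Case 2: a line ends in a hyphen; keep the hyphen and drop the line break (assumption).
    text = re.sub(r'-\n([\w\W])', r'-\g<1>', text)
    return text

sample = "digital \ndocuments and peer-\nto-peer\nnext"
print(join_wrapped_lines(sample))
# digital documents and peer-to-peer
# next
```

A line break that follows neither trailing whitespace nor a hyphen is left alone, which is why `next` stays on its own line.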

In [344]:
extract('./test/pdf/01-simple.pdf')

 
 PDF Test File  
Congratulations, your computer is equipped with a PDF (Portable Document Format) reader!  You should be able to view any of the PDF documents and forms available on our site.  PDF forms are indicated by these icons:   or  .    
Yukon Department of Education Box 2703 Whitehorse,Yukon Canada Y1A 2C6  
Please visit our website at:  http://www.education.gov.yk.ca/
   
 (PDF Test File, PDF, Document Format, PDF, PDF, Yukon Department of Education, 2703, Yukon Canada, 2C6)


In [337]:
extract('./test/pdf/02-text-image.pdf')

Welcome to Smallpdf
Digital Documents—All In One Place
Access Files Anytime, Anywhere Enhance Documents in One Click Collaborate With Others With the new Smallpdf experience, you can freely upload, organize, and share digital documents. When you enable the ‘Storage’ option, we’ll also store all processed files here. You can access files stored on Smallpdf from your computer, phone, or tablet. We’ll also sync files from the Smallpdf Mobile App to our online portal
When you right-click on a file, we’ll present you with an array of options to convert, compress, or modify it. Forget mundane administrative tasks. With Smallpdf, you can request e-signatures, send large files, or even enable the Smallpdf G Suite App for your entire organization. Ready to take document management to the next level? 



In [338]:
extract('./test/pdf/03-invoice.pdf')

Invoice
Payment is due within 30 days from date of invoice. Late payment is subject to fees of 5% per month.
Thanks for choosing DEMO - Sliced Invoices | admin@slicedinvoices.com
Page 1/1
From:
DEMO - Sliced Invoices
Suite 5A-1204
123 Somewhere Street
Your City AZ 12345
admin@slicedinvoices.com
Invoice Number
INV-3337
Order Number
12345
Invoice Date
January 25, 2016
Due Date
January 31, 2016
Total Due
$93.50
To:
Test Business
123 Somewhere St
Melbourne, VIC 3000
test@test.com
Hrs/Qty
Service
Rate/Price
Adjust
Sub Total
1.00
Web Design
This is a sample description...
$85.00
0.00%
$85.00
Sub Total
$85.00
Tax
$8.50
Total
$93.50
ANZ Bank
ACC # 1234 1234
BSB # 4321 432
Paid



In [339]:
extract('./test/pdf/04-journal.pdf')

Bitcoin: A Peer-to-Peer Electronic Cash System

Satoshi Nakamoto satoshin@gmx.com

www.bitcoin.org

Abstract.  A purely peer-to-peer version of electronic cash would allow online payments to be sent directly from one party to another without going through a financial institution.  Digital signatures provide part of the solution, but the main benefits are lost if a trusted third party is still required to prevent double-spending. We propose a solution to the double-spending problem using a peer-to-peer network. The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work.  The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power.  As long as a majority of CPU power is controlled by nodes that are not cooperating to attack the network, they'll generate the longest chain and outpace attackers.  
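
The journal output above contains only prose-like lines because `extract` filters the text line by line when an abstract heading is found. The cell below is a small sketch of that filter on an in-memory string, under stated assumptions: `keep_prose_lines` and `min_length` are hypothetical names, and the kept lines are joined with blank lines instead of being appended one by one.

```python
import re

def keep_prose_lines(text: str, min_length: int = 8) -> str:
    """Keep lines that look like running text (hypothetical helper, for illustration)."""
    kept = []
    for line in text.split('\n'):
        # Prose lines start with a letter, a digit, or "[" (a bracketed reference).
        starts_like_text = re.match(r'[0-9a-zA-Z\[]', line) is not None
        # Three or more consecutive whitespace characters suggest a formula or table row.
        has_wide_gap = re.search(r'\s{3,}', line) is not None
        if starts_like_text and len(line) >= min_length and not has_wide_gap:
            kept.append(line)
    return '\n\n'.join(kept)

sample = (
    "Abstract. A short summary sentence.\n"
    "x = 1      y = 2\n"
    "(12)\n"
    "Page 3\n"
    "[1] A reference entry"
)
print(keep_prose_lines(sample))
# Abstract. A short summary sentence.
#
# [1] A reference entry
```

The formula-like row, the equation number, and the short page footer are dropped; the summary sentence and the reference entry survive.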

In [340]:
extract('./test/pdf/05-complex.pdf')

arXiv:1812.09449v3  [cs.CL]  18 Mar 2020

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020

A Survey on Deep Learning for

Named Entity Recognition

Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li

Abstract—Named entity recognition (NER) is the task to identify mentions of rigid designators from text belonging to predeﬁned semantic types such as person, location, organization etc. NER always serves as the foundation for many natural language applications such as question answering, text summarization, and machine translation. Early NER systems got a huge success in achieving good performance with the cost of human engineering in designing domain-speciﬁc features and rules. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding stat-of-the-art performance. In this paper, we provide a comprehensive review on existing deep learning techniques fo