# PDF Parse - Using x and space as patterns

## pdfplumber + re + pandas

In [23]:
import pdfplumber

with pdfplumber.open('Principles-of-Financial-Accounting.pdf') as pdf:
    # pdf.pages is a 0-indexed list, so index 110 is the 111th page of the PDF
    page = pdf.pages[110]
    text = page.extract_text()

In [24]:
print(text)

PRINCIPLES OF FINANCIAL ACCOUNTING ACCOUNTING CYCLE - MERCHANDISING BUSINESS
Key questions to ask when dealing with merchandising transactions:
1. Are you the buyer or the seller?
2. Are there any returns?
3. What is the form of payment (cash or on account)?
4. Does the discount apply?
5. Who is to absorb the transportation cost?
6. If the buyer is to absorb the freight cost, did the seller prepay it?
Journal  Calculate 
ACCT 2101 Topics - Merchandising Fact  Entry Amount Format
Concept of a merchandising business x      
Concept of a perpetual inventory system x      
Merchandising income statement: net sales, gross profit, and net income     x x
Journalize purchase of inventory on account   x x  
Journalize purchaser’s return of inventory on account   x x  
Journalize payment on account   x x  
Journalize payment on account with a discount   x x  
Journalize purchaser’s payment of transportation charges terms FOB shipping   x x  
Journalize sale of merchandise on account under perpet

In [86]:
import re
from collections import namedtuple

table_content_namedtuple = namedtuple('t_content', ['Fact', 'Journal_Entry', 'Calculate_Amount', 'Format'])

# Group 1: the row's text. [A-wy-z] covers every letter except "x" (plus the
# punctuation between "Z" and "a"), so the text stops before an "x" mark.
# Groups 2-5: four one-character cells, each either "x" or a space.
re_pattern = re.compile(r'([A-wy-z\s’:,]+) ([x ]) ([x ]) ([x ]) ([x ])')

table_content_list = []

for line in text.split('\n'):
    match = re_pattern.search(line)
    if match:
        # Only groups 1-4 are stored; group 5 is matched but not kept.
        table_content_list.append(table_content_namedtuple(*match.groups()[:4]))
        print(match.group(1))

Concept of a merchandising business
Concept of a perpetual inventory system
Merchandising income statement: net sales, gross profit, and net income
Journalize purchase of inventory on account
Journalize purchaser’s return of inventory on account
Journalize payment on account
Journalize payment on account with a discount
Journalize purchaser’s payment of transportation charges terms FOB shipping
Journalize sale of merchandise on account under perpetual system
for cash under perpetual system
Journalize receipt of payment on account
Journalize receipt of payment on account with a discount
Journalize seller’s payment of transportation charges terms FOB destination
Journalize seller’s payment of transportation charges terms FOB shipping
Journalize bank charges
Financial statements
Journalize closing entries
Post closing entries to ledgers


In [88]:
import pandas as pd
pd.DataFrame(table_content_list)

| | Fact | Journal_Entry | Calculate_Amount | Format |
|---|---|---|---|---|
| 0 | Concept of a merchandising business | x | | |
| 1 | Concept of a perpetual inventory system | x | | |
| 2 | Merchandising income statement: net sales, gro... | | | x |
| 3 | Journalize purchase of inventory on account | | x | x |
| 4 | Journalize purchaser’s return of inventory on ... | | x | x |
| 5 | Journalize payment on account | | x | x |
| 6 | Journalize payment on account with a discount | | x | x |
| 7 | Journalize purchaser’s payment of transportati... | | x | x |
| 8 | Journalize sale of merchandise on account unde... | | x | x |
| 9 | for cash under perpetual system | | x | x |
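
The printed header suggests four marker columns after the topic text (Fact, Journal Entry, Calculate Amount, Format), while the loop above keeps only the first three captured marks, so each stored mark appears to sit one column to the right of where the PDF has it. A minimal sketch that keeps all five groups, assuming the same line layout as the extracted text; the `Row` type, its `Topic` field and the `sample` lines (copied from the printed page text) are illustrative choices, not part of the original notebook:

```python
import re
from collections import namedtuple

# Hypothetical five-field record: topic text plus the four marker columns.
Row = namedtuple('Row', ['Topic', 'Fact', 'Journal_Entry', 'Calculate_Amount', 'Format'])

# Same idea as the pattern above: topic text without the letter "x",
# then four single-character cells, each either "x" or a space.
row_pattern = re.compile(r'([A-wy-z\s’:,]+) ([x ]) ([x ]) ([x ]) ([x ])')

def parse_rows(text):
    """Return one Row per matching line, with marks converted to booleans."""
    rows = []
    for line in text.split('\n'):
        m = row_pattern.search(line)
        if m:
            topic, *marks = m.groups()
            rows.append(Row(topic, *(mark == 'x' for mark in marks)))
    return rows

# Three lines copied from the extracted page text (trailing spaces matter).
sample = (
    "Concept of a merchandising business x      \n"
    "Merchandising income statement: net sales, gross profit, and net income     x x\n"
    "Journalize purchase of inventory on account   x x  "
)
rows = parse_rows(sample)
for row in rows:
    print(row)
```

Storing booleans instead of `'x'` and `' '` is a design choice that makes later filtering simpler; keep the raw characters if the distinction matters.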


In [90]:
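# Four capitals and a space (e.g. "ACCT "), then letters, digits or whitespace.
# match() anchors at the start of the line; on this page only the table's
# header row begins that way.
table_title_pattern = re.compile(r'[A-Z]{4} [A-Za-z\s\d]+')
for line in text.split('\n'):
    if table_title_pattern.match(line):
        print(line)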

ACCT 2101 Topics - Merchandising Fact  Entry Amount Format


## Tabula

In [4]:
import tabula

file_name = 'Principles-of-Financial-Accounting.pdf'
table_list = tabula.read_pdf(
    file_name,
    pages=111,
    pandas_options={'header': None},
    # stream=True,
    lattice=True,
)
table_list[0]

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Concept of a merchandising business | x | | | |
| 1 | Concept of a perpetual inventory system | x | | | |
| 2 | Merchandising income statement: net sales, gro... | | | x | x |
| 3 | Journalize purchase of inventory on account | | x | x | |
| 4 | Journalize purchaser’s return of inventory on ... | | x | x | |
| 5 | Journalize payment on account | | x | x | |
| 6 | Journalize payment on account with a discount | | x | x | |
| 7 | Journalize purchaser’s payment of transportati... | | x | x | |
| 8 | Journalize sale of merchandise on account unde... | | x | x | |
| 9 | Journalize return of merchandise on account/fo... | | x | x | |
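
With `header=None` the columns come back numbered 0 to 4. A sketch of labelling them and turning the `x` marks into booleans, assuming the column order follows the header printed in the page text; the `sample` frame is a stand-in for `table_list[0]` and the column names are illustrative:

```python
import pandas as pd

# Stand-in for table_list[0]: three rows copied from the output above.
sample = pd.DataFrame([
    ['Concept of a merchandising business', 'x', None, None, None],
    ['Concept of a perpetual inventory system', 'x', None, None, None],
    ['Journalize purchase of inventory on account', None, 'x', 'x', None],
])

# Hypothetical column names, taken from the table header in the PDF text.
columns = ['Topic', 'Fact', 'Journal_Entry', 'Calculate_Amount', 'Format']

def label_marks(df, names=columns):
    """Name the columns and convert each marker column to True/False."""
    out = df.copy()
    out.columns = names
    # Every column after the first holds either "x" or a missing value.
    for name in names[1:]:
        out[name] = out[name].eq('x')
    return out

labelled = label_marks(sample)
print(labelled)
```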


In [16]:
import tabula

file_name = 'Principles-of-Financial-Accounting.pdf'
table_list = tabula.read_pdf(
    file_name,
    pages=35,
    pandas_options={'header': None},
    # stream=True,
    lattice=False,
)

In [17]:
table_list[0]

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | ACCOUNTS SUMMARY TABLE | | | | | | |
| 1 | ACCOUNT\rTYPE | ACCOUNTS | TO\rINCREASE | TO\rDECREASE | NORMAL\rBALANCE | FINANCIAL\rSTATEMENT | CLOSE\rOUT? |
| 2 | Asset | Cash\rAccounts Receivable | debit | credit | debit | Balance\rSheet | NO |
| 3 | Liability | Accounts Payable | credit | debit | credit | Balance\rSheet | NO |
| 4 | Stockholders’ Equity | Retained Earnings | credit | debit | credit | Balance\rSheet | NO |
| 5 | Revenue | Fees Earned | credit | debit | credit | Income\rStatement | YES |
| 6 | Expense | Wages Expense\rRent Expense\rUtilities Expense... | debit | credit | debit | Income\rStatement | |
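
The extracted frame keeps the title and the header as ordinary rows and leaves `\r` wherever a cell wrapped onto another line. A sketch of tidying this with pandas, assuming row 0 is the title and row 1 the header as in the output above, and that `\r` in the ACCOUNTS column separates distinct accounts; `sample` is a stand-in for `table_list[0]` with a few of the rows shown, and `tidy` is a hypothetical helper:

```python
import pandas as pd

# Stand-in for table_list[0], using rows copied from the output above.
sample = pd.DataFrame([
    ['ACCOUNTS SUMMARY TABLE', None, None, None, None, None, None],
    ['ACCOUNT\rTYPE', 'ACCOUNTS', 'TO\rINCREASE', 'TO\rDECREASE',
     'NORMAL\rBALANCE', 'FINANCIAL\rSTATEMENT', 'CLOSE\rOUT?'],
    ['Asset', 'Cash\rAccounts Receivable', 'debit', 'credit', 'debit',
     'Balance\rSheet', 'NO'],
    ['Revenue', 'Fees Earned', 'credit', 'debit', 'credit',
     'Income\rStatement', 'YES'],
])

def tidy(df, header_row=1, list_column='ACCOUNTS'):
    """Promote the header row and clean carriage returns out of the cells."""
    header = [str(h).replace('\r', ' ') for h in df.iloc[header_row]]
    body = df.iloc[header_row + 1:].reset_index(drop=True)
    body.columns = header
    for name in header:
        if name == list_column:
            # Here "\r" separates several accounts, so keep them as a list.
            body[name] = body[name].str.split('\r')
        else:
            # Elsewhere "\r" is just a wrapped phrase such as "Balance\rSheet".
            body[name] = body[name].str.replace('\r', ' ', regex=False)
    return body

accounts = tidy(sample)
print(accounts)
```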


In [21]:
import tabula

file_name = 'Principles-of-Financial-Accounting.pdf'
table_list = tabula.read_pdf(
    file_name,
    pages=102,
    pandas_options={'header': None},
    # stream=True,
    lattice=False,
)
table_list[0]

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Date | Account | | Debit | Credit |
| 1 | 11 | Merchandise Inventory | | 500 | |
| 2 | | Accounts Payable | | | 500 |
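
`table_list` is a list with one DataFrame per detected table, so a page of journal entries yields several small frames with the same layout. A sketch for stacking them into one table, assuming each frame repeats the header in row 0, has an empty spacer column, and leaves the date blank on the credit line. The `frames` list is a stand-in for `table_list`, built from two of this page's entries, and `entry` and `combine_entries` are hypothetical helpers. Frames that deviate from this layout would need separate handling.

```python
import pandas as pd

def entry(date, debit_account, credit_account, amount):
    # Mimics the layout tabula returned: header row, debit line, credit line.
    return pd.DataFrame([
        ['Date', 'Account', None, 'Debit', 'Credit'],
        [date, debit_account, None, amount, None],
        [None, credit_account, None, None, amount],
    ])

# Stand-in for table_list.
frames = [
    entry('11', 'Merchandise Inventory', 'Accounts Payable', '500'),
    entry('12', 'Accounts Receivable', 'Sales', '500'),
]

def combine_entries(tables):
    """Stack per-entry frames into one journal with an entry number."""
    parts = []
    for number, table in enumerate(tables):
        body = table.iloc[1:].copy()
        body.columns = table.iloc[0].tolist()
        # Drop the empty spacer column, whose header cell is missing.
        body = body.loc[:, body.columns.notna()].copy()
        # The date appears only on the debit line, so carry it down.
        body['Date'] = body['Date'].ffill()
        body.insert(0, 'Entry', number)
        parts.append(body)
    return pd.concat(parts, ignore_index=True)

journal = combine_entries(frames)
print(journal)
```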


In [22]:
table_list

[      0                      1   2      3       4
 0  Date                Account NaN  Debit  Credit
 1    11  Merchandise Inventory NaN    500     NaN
 2   NaN       Accounts Payable NaN    NaN     500,
       0                    1   2      3       4
 0  Date              Account NaN  Debit  Credit
 1    12  Accounts Receivable NaN    500     NaN
 2   NaN                Sales NaN    NaN     500,
       0                         1   2      3       4
 0  Date                   Account NaN  Debit  Credit
 1    12  Cost of Merchandise Sold NaN    200     NaN
 2   NaN     Merchandise Inventory NaN    NaN     200,
       0                 1   2      3       4
 0  Date           Account NaN  Debit  Credit
 1    12  Delivery Expense NaN     20     NaN
 2   NaN  Accounts Payable NaN    NaN      20,
      0                 1   2      3       4
 0  NaN           Account NaN  Debit  Credit
 1    ▼  Accounts Payable NaN    500     NaN
 2    ▼              Cash NaN    NaN     500,
      0        