
# PDF Parsing:
The objective of this notebook is to develop a PDF parser and Arabic text extractor.  
Once the PDF file has been decompressed, parsing can be treated as a 5-step process, performed in the following order:

1. Traverse the PDF logical tree and find all Page objects; this guarantees that the page ordering is correct.
Then, for each Page:  
2. Retrieve font information and ToUnicode tables.   
3. Retrieve contents.   
4. Decode contents. 
5. Arrange the text in the right order. 


The implementation here is optimized for parsing PDFs with CID fonts ("Type0"), where fonts are explicitly referenced by the Page object and the ToUnicode tables are embedded within the PDF.
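
Step 1 can be sketched as a depth-first walk over the page tree. This is only an illustration: the `objects` table and its ids are hypothetical, and it assumes the PDF objects have already been parsed into dictionaries, which the notebook itself does not do (it matches Page objects with a regular expression instead).

```python
def collect_pages(objects, node_id):
    """Depth-first walk of a page tree; returns Page ids in document order."""
    node = objects[node_id]
    if node["Type"] == "Page":
        return [node_id]
    pages = []
    for kid_id in node["Kids"]:  # /Kids is ordered, so the result is ordered too
        pages.extend(collect_pages(objects, kid_id))
    return pages


# Hypothetical object table: ids and layout are made up for illustration.
objects = {
    1: {"Type": "Pages", "Kids": [2, 7]},
    2: {"Type": "Pages", "Kids": [5, 3]},
    3: {"Type": "Page"},
    5: {"Type": "Page"},
    7: {"Type": "Page"},
}

page_order = collect_pages(objects, 1)
print(page_order)  # [5, 3, 7]
```

Note that the resulting order follows the `/Kids` arrays, not the object numbers, which is why a walk of the tree is safer than sorting Page objects by id.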

#### Helpful resources:
- [`pdf parsing - understanding pdfs`](https://docs.google.com/document/d/1gfxrJyJlx4NPCdnrElcwRx7ZByCKBO3RFrzmkKEUxuo/edit?usp=sharing)
- [PDFReference](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf)
- [PDF Explained](https://learning.oreilly.com/library/view/pdf-explained/9781449321581/)


In [1]:
## imports
import re
import numpy as np
from binascii import unhexlify, hexlify

## Info Retrievers Functions
Retrieve related data such as content streams and ToUnicode tables.

#### TODO:
- need to change the way fonts are named; not all font names start with C2_X. <mark> [Done] </mark>
- need to handle `beginbfrange` sections alongside `beginbfchar` sections
- need to change the way content is retrieved and handle the case where Contents is an array
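
As a possible starting point for the `beginbfrange` item, here is a minimal sketch that expands range entries into the same `{font_code: unicode}` shape that `get_cmap` returns. The function name `parse_bfranges` is hypothetical, and the sketch assumes that codes are looked up as uppercase hex and that a range only increments the numeric value of the destination.

```python
import re
from binascii import unhexlify

HEX = r"[0-9a-fA-F]+"
# One entry: <lo> <hi> followed by either <dst> or [<dst1> <dst2> ...]
ENTRY = re.compile(rf"<({HEX})>\s*<({HEX})>\s*(<{HEX}>|\[[^\]]*\])")


def parse_bfranges(cmap_text):
    """Expand the entries of all beginbfrange sections into a dictionary."""
    mapping = {}
    sections = re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, flags=re.S)
    for section in sections:
        for lo, hi, dst in ENTRY.findall(section):
            first, last, width = int(lo, 16), int(hi, 16), len(lo)
            if dst.startswith("["):
                # Array form: one explicit destination per source code
                targets = re.findall(rf"<({HEX})>", dst)
            else:
                # Simple form: destinations count up from the first one
                base = dst.strip("<>")
                targets = [format(int(base, 16) + i, f"0{len(base)}X")
                           for i in range(last - first + 1)]
            for i, target in enumerate(targets):
                code = format(first + i, f"0{width}X")
                mapping[code] = unhexlify(target).decode("utf-16-be")
    return mapping


sample = """2 beginbfrange
<0003> <0005> <0627>
<0010> <0011> [<0644> <0645>]
endbfrange"""
ranges = parse_bfranges(sample)
```

The result could be merged into the `beginbfchar` dictionary with `dict.update`, so that `decode_content` does not need to change.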


In [2]:
def get_fonts(page):
    """
    Retrieve the fonts of a given page
    #Args:
        - Page object
    #Returns:
        - Object numbers of the page's fonts
    """
    # Only resource names of the form C2_<digit> are matched
    font_section = page.split('Font')[1]
    fonts_ref = [font.split(' ')[1] for font in re.findall(r'C2_[0-9]\s[0-9]+', font_section)]
    return fonts_ref
 
    
def get_cmap(font_ref,pdf_content):
    """
    Retrieving ToUnicode table
    #Args:
        - Font object id 
        - Decompressed pdf file
    #Returns:
        - ToUnicode table saved in a dictionary
    """
    # Find the cmap reference associated with a specific font
    cmap_ref = re.findall(re.compile(fr'(obj\s{font_ref}\s0\n[a-zA-Z0-9\n\s:,.<_/\[\]+-]+/ToUnicode\s)([0-9]+)'),pdf_content)[0][1]
    
    # Go to the cmap object, extract the cmap body and save it into a dictionary
    cmap = re.findall(re.compile(fr"""(obj\s{cmap_ref}\s0\n[a-zA-Z0-9\n\s:,.<>_+-/\[\]\\']+)(nbegincmap.*?nendcmap)"""),pdf_content)[0][1]
    # Each entry is "<font code> <UTF-16BE value>"; capture both hex strings
    cmap_as_list = re.findall(r'<([a-fA-F0-9]+)> <([a-fA-F0-9]+)>', cmap)
    return {code: unhexlify(uni).decode('utf-16-be') for code, uni in cmap_as_list}


def get_content(page,pdf_content):
    """
    Retrieve the content object
    #Args:
        - Page object
        - Decompressed pdf file
    #Returns:
        - Page content 
    """
    contents_ref = page.split('Contents ')[1].split(' ')[0]
    try:
        content = re.findall(re.compile(fr"""(obj\s{contents_ref}\s0\n[a-zA-Z0-9\n\s:,.<>_+-/\[\]\\()]+)('.*?')"""),pdf_content)[0][1]
    except IndexError:
        # No single-quoted stream found; retry with a double-quoted one
        content = re.findall(re.compile(fr"""(obj\s{contents_ref}\s0\n[a-zA-Z0-9\n\s:,.<>_+-/\[\]\\()]+)(".*?")"""),pdf_content)[0][1]

    return content
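
For the TODO about `Contents` being an array, a minimal sketch of reading the references is shown below. The helper name `get_content_refs` is hypothetical, and the sketch assumes references are written in the standard `N G R` form (for example `12 0 R`); the page strings in the decompressed dump may need a different pattern.

```python
import re


def get_content_refs(page):
    """Return the object numbers referenced by Contents, in order."""
    array = re.search(r"Contents\s*\[([^\]]*)\]", page)
    if array:
        # Array form: every "N G R" triple is one content stream
        return re.findall(r"(\d+)\s+\d+\s+R", array.group(1))
    single = re.search(r"Contents\s+(\d+)\s+\d+\s+R", page)
    return [single.group(1)] if single else []


single_refs = get_content_refs("/Contents 12 0 R /Font")
array_refs = get_content_refs("/Contents [12 0 R 15 0 R] /Font")
print(single_refs, array_refs)  # ['12'] ['12', '15']
```

When several streams are referenced, they are meant to be read as one stream concatenated in array order, so the page content would be the joined result of retrieving each reference in turn.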

## Decoder Function
Decoding the text is done by looking up the ToUnicode table of the font associated with the text, where the table  
keys represent the font encoding and the values represent the corresponding Unicode characters. 

E.g.  
/C2_0 1 Tf   <AAAABBBB> Tj  

The font name is `C2_0` (the font size operand `1` is only illustrative)    
The dictionary **instance_mapping[`C2_0`]** contains the ToUnicode table for `C2_0`,  
and the lookup for the first code would be **instance_mapping[`C2_0`][`AAAA`]**

AAAA maps to unicode xxxx  
BBBB  maps to unicode zzzz
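
The lookup described above can be tried on a toy table. The font codes below are made up for illustration, and `decode_hex_string` is a hypothetical helper that keeps the characters in stream order.

```python
# Hypothetical ToUnicode table for one font; the codes are made up.
instance_mapping = {
    "C2_0": {
        "0003": "\u0633",  # seen
        "0004": "\u0644",  # lam
        "0005": "\u0627",  # alef
        "0006": "\u0645",  # meem
    },
}


def decode_hex_string(hex_string, table, width=4):
    """Split a hex string into fixed-width codes and map each to Unicode."""
    codes = [hex_string[i:i + width] for i in range(0, len(hex_string), width)]
    # Codes that are missing from the table are dropped
    return "".join(table.get(code, "") for code in codes)


decoded = decode_hex_string("0003000400050006", instance_mapping["C2_0"])
print(decoded)  # the four characters in stream order
```

The notebook's `decode_content` differs in one respect: it prepends each character to the result, so the decoded string comes out reversed relative to the order of the codes in the stream.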
 
 
#### TODO:
- consider different encoding cases such as \xAAAA, this might be done by checking the font type



In [3]:

def decode_content(tag,instance_mapping,used_font):
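    """
    #Args:
        - tag: text tag, i.e. <XXXX> inside a Tj/TJ
        - instance_mapping: font dictionary formatted as {font_name: {font_code:unicode}}
        - used_font: name of the font in effect, e.g. C2_0
    #Return:
        - Decoded text
    """
    t = ""
    tag = tag.replace("<","")
    tag = tag.replace(">","")
    # Each font code is 4 hex digits (2 bytes)
    for i in range(0,len(tag), 4):
        try:
            # Prepend each character, so the decoded string is reversed
            t= instance_mapping[used_font][tag[i:i+4]]+t
        except KeyError:
            # Codes missing from the ToUnicode table are skipped
            pass
    return t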

## Positioning Functions
Get the coordinates of text strings and update the text matrix. 

The suggested approach is to extract all TJ/Tj operators with their coordinates, store them in a nested dictionary whose keys represent the y and x coordinates, and then sort them.  
The calculations are done as explained in the `pdf parsing - understanding pdfs` document
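
As a worked example of the text matrix update, the sketch below applies a `10 -20 Td` offset to a hypothetical text matrix. The notebook does this with numpy's `dot`; plain nested lists are used here only so the example runs without dependencies, and all the numbers are made up.

```python
def matmul3(left, right):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(left[i][k] * right[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Hypothetical text matrix: scale 2, origin at (100, 700)
text_matrix = [[2, 0, 0],
               [0, 2, 0],
               [100, 700, 1]]

# Offset matrix for the operator "10 -20 Td"
tx, ty = 10, -20
offset = [[1, 0, 0],
          [0, 1, 0],
          [tx, ty, 1]]

moved = matmul3(offset, text_matrix)
x, y = moved[2][0], moved[2][1]
print(x, y)  # 120 660
```

The offset is scaled by the matrix before it is added: x becomes 10 * 2 + 100 and y becomes -20 * 2 + 700, while the scale entries in the first two rows are unchanged.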
#### TODO:
- remove the page number <mark> [Done] </mark>
- consider the cropbox and page ranges; check whether content lies inside the page cropbox
- TJ positioning

#### Things to highlight:
- Because of the way floats are stored in memory and how this affects the positioning calculations, I needed to convert x and y to integers (with `int()`) when using them as dictionary keys only; the calculations themselves are not rounded.
- a quick workaround was applied for the cropbox problem
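
The float issue can be reproduced in a few lines. The values below are arbitrary and only chosen because their floating-point behaviour is well known; the point is that two y values that should be equal end up as different dictionary keys unless they are converted to integers first.

```python
# Two y values that are mathematically equal
y_accumulated = 0.1 + 0.2  # reached by adding two offsets
y_direct = 0.3             # read directly from an operand

# Using the raw floats as keys splits one line into two
raw_lines = {}
raw_lines.setdefault(y_accumulated, {})[10] = "A"
raw_lines.setdefault(y_direct, {})[20] = "B"

# Rounding only the keys keeps both fragments on the same line
rounded_lines = {}
rounded_lines.setdefault(round(y_accumulated), {})[10] = "A"
rounded_lines.setdefault(round(y_direct), {})[20] = "B"

print(len(raw_lines), len(rounded_lines))  # 2 1
```

One design point to keep in mind: `int()` truncates rather than rounds, so values just below and just above a whole number (for example 699.9999 and 700.0) would still land under different keys, whereas `round()` would merge them.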

In [4]:
# a BT block can contain more than one Tm
# added .split('\\n')
def get_text_metrics(bt,text_matrix):
    """
    Get the updated text matrix
    #Arg:
        -bt: BT block or text string "TJ/Tj operators"
        -text_matrix : pre-initialized text matrix
    #Return:
        The 3x3 matrix [[a,b,0],[c,d,0],[e,f,1]] built from the operands of
        the first Tm found, or text_matrix unchanged if there is no Tm:
        - a, d: horizontal and vertical scale
        - b, c: rotation / skew
        - e: horizontal position -  x 
        - f: vertical position - y
    """
    
    if 'Tm' in bt:
        Tm = bt.split('Tm')[0].split('\\n')[-1]
        a,b,c,d,e,f = list(map(float,Tm.split()[-6:]))
        return np.array([[a,b,0],
                     [c,d,0],
                     [e,f,1]])
    else:
        return text_matrix


def get_text_coordinate(Tj, text_matrix,Tl):
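    """
    Update the text matrix for the positioning operators Td, TD and T*
    #Arg:
        - Tj: text string, either Tj or TJ, with the operators preceding it
        - text_matrix: current text matrix
        - Tl: text leading "line spacing"
    #Return:
        - Updated text matrix (x is text_matrix[2][0], y is text_matrix[2][1])
        - Updated text leading 
    """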
    if 'Td' in Tj:
        for Td in Tj.split(' Td')[:-1]:
            Tx = float(Td.split('\\n')[-1].split()[-2])
            Ty = float(Td.split('\\n')[-1].split()[-1])
            Tlm = np.array([[1,0,0],
                            [0,1,0],
                            [Tx,Ty,1]])
            text_matrix =  Tlm.dot(text_matrix)  
                    
    elif 'TD' in Tj:
        for TD in Tj.split(' TD')[:-1]:
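            # update text leading; it is stored as ty so that T* can add it directly
            # (the PDF specification defines the leading as -ty and T* as a move by -leading)
            Tl = float(TD.split('\\n')[-1].split()[-1])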
            # update x and y 
            Tx = float(TD.split('\\n')[-1].split()[-2])
            Ty = float(TD.split('\\n')[-1].split()[-1])
            Tlm = np.array([[1,0,0],
                            [0,1,0],
                            [Tx,Ty,1]])
            text_matrix = Tlm.dot(text_matrix)
            
    elif 'T*' in Tj:
        Tx =0 
        Ty = Tl
        Tlm = np.array([[1,0,0],
                        [0,1,0],
                        [Tx,Ty,1]])
        text_matrix = Tlm.dot(text_matrix)
        
    return (text_matrix,Tl)




def store_text_with_coordinates(Tx,Ty,Tpage,scale,text,content_positioning):
    """
    Store decoded text in a dictionary so that it can be arranged 
    in its right place, i.e. {page_number:{y:{x:text}}}
    #Args:
        - Tx: offset within a line
        - Ty: line position
        - Tpage: page number (to be removed)
        - scale: horizontal scale of the text matrix (currently unused)
        - text: decoded, readable text
        - content_positioning: dictionary in which to store the text
    #Return:
        None; it applies the changes directly to the dictionary
    """
    if (Tx > 0 and Ty > 0):
        # Truncate to integers for use as dictionary keys only
        Tx = int(Tx)
        Ty = int(Ty)

        page = content_positioning.setdefault(Tpage,{})
        y = page.setdefault(Ty,{})
        # Prepend to any text already stored at the same position
        y[Tx] = text + y.get(Tx, '')
    

## Sorting Function:
Arrange the text using its x and y coordinates.

#### TODO:
- find a method to position diacritics correctly. The problem is that the PDF places diacritics separately, so their y-coordinates are higher than the word's y-coordinate, which causes them to appear before the word. One way to solve this is to round the coordinates and combine lines whose distance is less than the line spacing.
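
A minimal sketch of the line-combining idea follows. The helper `merge_close_lines` and the `tolerance` value are hypothetical; the sketch only groups lines by their y keys and does not decide where a diacritic belongs within the word, which would still need the x coordinates to be handled.

```python
def merge_close_lines(lines, tolerance):
    """Merge lines whose y keys lie within `tolerance` of a group's top line.

    `lines` has the shape {y: {x: text}}. Lines are visited from top to
    bottom, and each one joins the current group when it is close enough.
    """
    merged = {}
    anchor = None
    for y in sorted(lines, reverse=True):
        if anchor is None or anchor - y >= tolerance:
            # Too far below the current group: start a new one
            anchor = y
            merged[anchor] = {}
        for x, text in lines[y].items():
            # Prepend to any text already stored at the same x
            merged[anchor][x] = text + merged[anchor].get(x, "")
    return merged


# Hypothetical page: a diacritic (fatha) stored 5 units above its word
lines = {
    705: {50: "\u064e"},
    700: {60: "ab", 20: "cd"},
    680: {60: "ef"},
}
merged = merge_close_lines(lines, tolerance=10)
print(sorted(merged, reverse=True))  # [705, 680]
```

Comparing against the top line of each group, rather than the previous line, keeps a long run of closely spaced lines from collapsing into a single group.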

In [60]:
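def arranging_text(text_with_coordinates):
    """
    Concatenate the stored text page by page, with lines from top to
    bottom (descending y) and fragments from right to left (descending x)
    """
    text = ""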
    for page in sorted(text_with_coordinates):
        for line in sorted(text_with_coordinates[page],reverse =True):
            for word in sorted(text_with_coordinates[page][line],reverse =True):
                text += text_with_coordinates[page][line][word]

    return text


# Putting it all together

In [6]:
# reading the decompressed pdf file (the with block closes it automatically)
file_name = "5.txt"
with open(f"1435_decompressed/{file_name}", "r") as f:
    pdf_content = f.read() 

# Finding all Page objects - Step 1
page_tags = re.findall(re.compile(r"""obj\s[0-9]+\s0\n\sType:\s/Page[a-zA-Z0-9\n\s:,.<>_/\[\]]+Contents[a-zA-Z0-9\n\s:,.<>_/\[\]]+Font[a-zA-Z0-9\n\s:,.<>_/\[\]]+"""), pdf_content)



In [8]:
full_docs_text = ''
for page in page_tags:


    # Retrieving fonts and their encodings - Step 2
    fonts_mapping_dic = {}
    instance_mapping = {}


    fonts_ref = get_fonts(page)
    for num, font_ref in enumerate(fonts_ref):
        if font_ref not in fonts_mapping_dic:
            fonts_mapping_dic[font_ref] = get_cmap(font_ref, pdf_content)

        # Assumes the fonts are listed in the order C2_0, C2_1, ...
        instance_mapping[f'C2_{num}'] = fonts_mapping_dic[font_ref]

    # Retrieving content - Step 3
    content = get_content(page,pdf_content)
    # Third and fourth CropBox entries (upper-right corner); not used yet
    cropbox_values = re.findall(r'[\d+\.\d+]+', page.split('/CropBox')[1])[:4]
    cropbox_x = float(cropbox_values[-2])
    cropbox_y = float(cropbox_values[-1])

    # Decoding and positioning - Steps 4 and 5
    BTs = content.split("BT")
    text_with_coordinates = dict()
    Tpage = 0  # All text of the page is stored under a single page key
    last_used_font = None  # No font selected yet on this page
    Tm = np.array([[1,0,0],  # Initial value: identity matrix -PDF specification-
               [0,1,0],
               [0,0,1]])  
    Tl = 0 # Initial value: 0 -PDF specification-
    for j in range(1,len(BTs)):
        bt = BTs[j]

        for Tj in bt.split('Tj'):

            # Finding text string in TJ 
            if 'TJ' in Tj:
                for TJ in Tj.split('TJ')[:-1]:
                    try:
                        used_font = re.findall(re.compile(r"""(C2_[0-9]+)\s"""), TJ)[0]
                        last_used_font = used_font
                    except IndexError:
                        # No font selected in this fragment; keep the previous font
                        pass
                    # Get text metrics
                    Tm = get_text_metrics(TJ,Tm)               
                    Tm,Tl = get_text_coordinate(TJ,Tm,Tl)
                    # Finding text strings
                    text_tags = re.findall("<[0-9a-fA-F]+>", TJ)
                    for text_tag in text_tags:
                        text = decode_content(text_tag,instance_mapping,last_used_font)
                        store_text_with_coordinates(Tm[2][0],Tm[2][1],Tpage,Tm[0][0],text,text_with_coordinates)



            # Finding text string in Tj
            Tj_ = Tj.split('TJ')[-1]
            try:
                used_font = re.findall(re.compile(r"""(C2_[0-9]+)\s"""), Tj_)[0]
                last_used_font = used_font
            except IndexError:
                # No font selected in this fragment; keep the previous font
                pass

            # Get text metrics
            Tm = get_text_metrics(Tj_,Tm)
            Tm,Tl = get_text_coordinate(Tj_,Tm,Tl)
            # Finding text strings
            text_tags = re.findall("<[0-9a-fA-F]+>", Tj_)
            for text_tag in text_tags:
                text = decode_content(text_tag,instance_mapping,last_used_font)
                store_text_with_coordinates(Tm[2][0],Tm[2][1],Tpage,Tm[0][0],text,text_with_coordinates)




    full_docs_text += arranging_text(text_with_coordinates)
        
        
