# PDF

This is a short introduction to PDF files in the form of an interactive Python notebook.

PDF is a file format designed to hold documents of all types.
The basic structure of a PDF file is written in plain ASCII text, although a file may also contain binary data, for example in compressed streams.

## Data Syntax

Before diving into the PDF document structure, some knowledge about the syntax of data in PDF is needed.
There are basic data types like strings or numeric values.
Those are simple enough and will be introduced when they are first used.

There are, however, three types of data and the method of referencing objects that need to be examined beforehand.

### Name Objects

Name objects are atomic objects that are uniquely defined.
This means that two Name objects that consist of the same characters are one and the same object.

Names begin with a `/`, followed by arbitrarily many characters; in this tutorial these are always alphanumeric.
They are usually used to denote keys in dictionary objects.

**Example:**
```
/Name
```
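
As a minimal sketch, the simple alphanumeric names described above could be recognised in Python with a regular expression. The helper `is_simple_name` is hypothetical and deliberately ignores the parts of the PDF name syntax that this tutorial does not use.

```python
import re

# "/" followed by zero or more ASCII letters or digits.
# Assumption: only the simple names used in this tutorial are covered.
SIMPLE_NAME = re.compile(r"/[A-Za-z0-9]*")


def is_simple_name(token: str) -> bool:
    """Return True if the token looks like a simple PDF name object."""
    return SIMPLE_NAME.fullmatch(token) is not None


print(is_simple_name("/Name"))  # True
print(is_simple_name("Name"))   # False
```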

### Array Objects

Arrays in PDF are one-dimensional collections of PDF objects.
They are delimited by square brackets (`[]`) while the objects inside are separated by whitespace.

**Example:**
```
[ 1 200 /Name ]
```

### Dictionary Objects

Dictionary objects are collections of key-value pairs.
The key of any pair has to be a name object, the value may be any kind of data.
Dictionaries are delimited by `<< >>` and the pairs inside are separated by whitespace.

**Example:**
```
<<
    /Name   /Foo
    /Age    23
>>
```
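
To make the three data types more tangible, here is a small sketch that turns Python values into PDF syntax. The function `to_pdf` and its mapping (strings starting with `/` are names, lists are arrays, dicts are dictionaries) are assumptions made for illustration, not part of any PDF library.

```python
def to_pdf(value) -> str:
    """Serialize a small subset of Python values to PDF syntax.

    Assumption: strings that start with "/" are names, numbers stay numbers,
    lists become arrays and dicts become dictionaries.
    """
    if isinstance(value, dict):
        pairs = " ".join(f"{key} {to_pdf(item)}" for key, item in value.items())
        return f"<< {pairs} >>"
    if isinstance(value, list):
        return "[ " + " ".join(to_pdf(item) for item in value) + " ]"
    if isinstance(value, bool):
        # bool is checked before int because bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str) and value.startswith("/"):
        return value
    raise TypeError(f"unsupported value: {value!r}")


print(to_pdf([1, 200, "/Name"]))
print(to_pdf({"/Name": "/Foo", "/Age": 23}))
```

The two calls reproduce the array and dictionary examples from this section, only written on a single line each.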

### Referenced Objects

Any object within a PDF file can be marked as a referenced (indirect) object.

It is identified by an identifier that consists of two parts: the object number and the generation number.
The object number is a unique number for the object, while the generation number tells apart objects that reuse the same object number after an earlier object was deleted.
In a newly created PDF file each object has generation number 0.

This identifier is then followed by the object that shall be referenced, surrounded by the keywords `obj` and `endobj`.

The object can then be referenced by its object number, its generation number and the keyword `R`.

**Example:**
```
1 0 obj
<<
    /Name   /Foo
    /Age    23
>>
endobj
```
```
1 0 R
```
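
The two forms can be sketched with a pair of small helpers. The names `indirect_object`, `reference` and `REFERENCE` are hypothetical and only serve to illustrate the syntax; the object body is assumed to be serialized already.

```python
import re


def indirect_object(number: int, body: str, generation: int = 0) -> str:
    """Wrap an already serialized object in `obj` ... `endobj`."""
    return f"{number} {generation} obj\n{body}\nendobj\n"


def reference(number: int, generation: int = 0) -> str:
    """Build the reference to an indirect object."""
    return f"{number} {generation} R"


# Pattern for a reference: object number, generation number, keyword R.
REFERENCE = re.compile(r"(\d+) (\d+) R")

print(indirect_object(1, "<< /Name /Foo /Age 23 >>"), end="")
print(reference(1))
print(REFERENCE.fullmatch("1 0 R").groups())
```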

## Blank PDF

To create a simple, blank PDF, we have to create a minimal base structure.

PDF files mainly consist of 4 parts:

1. Header: indicates the PDF version
2. Body: contains the objects in the PDF file
3. Cross-Reference Table (CRT): contains the locations of the referenced objects
4. Trailer: gives the locations of the CRT and other objects in the file

### Header

The header is a single line containing the PDF version that was used to create this PDF file.
The most recent PDF specification version is 2.0; this tutorial relies on the version 1.7 specification, which is freely available.
Since PDF files are meant to be backwards-compatible and the basic concepts stay the same, version 1.7 is sufficient here.

The single header-line has the form

```
%PDF-1.7
```

where `1.7` is the PDF version used.
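
Going the other way, a reader could extract the version from the header line. The following is only a sketch; `pdf_version` is a hypothetical helper and assumes single-digit major and minor numbers.

```python
import re

# "%PDF-" followed by major.minor, for example "%PDF-1.7".
HEADER = re.compile(r"%PDF-(\d)\.(\d)")


def pdf_version(first_line: str):
    """Return (major, minor) from a header line, or None if it does not match."""
    match = HEADER.match(first_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(pdf_version("%PDF-1.7"))  # (1, 7)
print(pdf_version("Hello"))     # None
```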

In this tutorial we use 1.7, therefore our header looks like this:

In [1]:
header = """%PDF-1.7
"""

### Body

The PDF body contains the whole contents of the PDF file.
Since we only want a blank page for our PDF file, one could think the body would be empty.
This is not the case.
Even the page specification is part of the PDF body.

A PDF document is a hierarchy of PDF objects.
This hierarchy is described in the PDF body.

The root object is called the **Document Catalog**.
It is also referenced in the trailer later to make finding it easier.
The whole structure of the PDF file starts here.

#### Document Catalog

The document catalog contains the whole description of the PDF document.
It is a referenced PDF object that contains several other objects.

It has a set of values which are required to describe a PDF document:

- **/Type**: has to be `/Catalog`
- **/Pages**: a reference to the document's **Page Tree Root**

In this tutorial the objects are created in sequential order.
The document catalog is object 1.
The page tree root will be object 2.
Therefore our document catalog has the identifier `1 0` and the page tree root will be referenced by `2 0 R`:

In [2]:
document_catalog = """1 0 obj                 % DOCUMENT CATALOG
<<
    /Type   /Catalog    % Specifies this dictionary as of type document catalog
    /Pages  2 0 R       % Reference to the page tree root that will be defined later
>>
endobj
"""

#### Page Tree

The page tree consists of one page tree root and can contain arbitrarily many page tree nodes.

Each page tree node has a certain structure.
The root node is special in that it has no parent.

**Page Tree Node Structure:**
- **/Type**: has to be `/Pages`
- **/Parent**: reference to the parent page tree node (prohibited in the root)
- **/Kids**: array containing the immediate children of the node; can contain page tree nodes or page objects
- **/Count**: number of page objects that are descendants of this page tree node

The page tree root in this case has the identifier `2 0`, as specified in the document catalog section.
The only page in the document will have identifier `3 0`:

In [3]:
page_tree_root = """2 0 obj                 % PAGE TREE ROOT
<<
    /Type   /Pages      % Specifies this dictionary as of type page tree object
    /Kids   [           % Direct children on the page tree
        3 0 R
    ]
    /Count  1           % Number of Pages that are descendant
>>
endobj
"""

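The meaning of `/Count` becomes clearer with a slightly larger tree. The sketch below counts the page objects below a node recursively; it assumes a simplified in-memory representation in which nodes are plain Python dicts and `/Kids` already holds the child nodes instead of references. The function `count_pages` is hypothetical.

```python
def count_pages(node: dict) -> int:
    """Count the page objects below a page tree node.

    Assumption: nodes are plain dicts with resolved children in "/Kids",
    a simplified stand-in for the referenced objects in a real file.
    """
    if node["/Type"] == "/Page":
        return 1
    return sum(count_pages(kid) for kid in node["/Kids"])


tree = {
    "/Type": "/Pages",
    "/Kids": [
        {"/Type": "/Page"},
        {"/Type": "/Pages", "/Kids": [{"/Type": "/Page"}, {"/Type": "/Page"}]},
    ],
}
print(count_pages(tree))  # 3
```

The root of this example tree has only two immediate children, but its `/Count` would be 3 because the nested node contributes two pages.
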
#### Page Objects

Page objects describe a page in the document with all of its contents.

**Page Object Structure:**
- **/Type**: has to be `/Page`
- **/Parent**: reference to the page tree node that is the immediate parent of this page
- **/Resources**: the resources needed by content on this page, if omitted it is inherited from its parent in the page tree
- **/MediaBox**: rectangle specifying the boundaries of the page, if omitted it is inherited
- **/Contents** (optional): content stream, or array of content streams, describing the contents of this page

These are only the basic keys of a page object, but they are all that is needed to define one.

Since our page will be empty, we omit `/Contents` entirely.
Because the resources and the media box (page boundaries) are inherited from the parent node, we don't need to specify them on the page itself, though they should at least be specified somewhere in the page tree.
If the media box is omitted from the whole page tree, viewers typically fall back to a default page size such as A4; for simplicity we don't specify a media box here.
The parent of this page is the page tree root defined earlier:

In [4]:
page = """3 0 obj                 % PAGE
<<
    /Type       /Page   % Specifies this dictionary as of type page object
    /Parent     2 0 R   % Reference to parent page tree node
>>
endobj
"""

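For comparison, this is roughly what the page would look like with an explicit media box. The sketch assumes the default user space unit of 1/72 inch and rounds to whole points; `media_box` and `page_with_media_box` are hypothetical names and are not used in the rest of this tutorial.

```python
POINTS_PER_INCH = 72  # default user space unit: 1/72 inch
MM_PER_INCH = 25.4


def media_box(width_mm: float, height_mm: float) -> str:
    """Return a /MediaBox array for a page of the given size in millimetres."""
    width = round(width_mm / MM_PER_INCH * POINTS_PER_INCH)
    height = round(height_mm / MM_PER_INCH * POINTS_PER_INCH)
    return f"[ 0 0 {width} {height} ]"


# A4 is 210 mm x 297 mm.
page_with_media_box = f"""3 0 obj
<<
    /Type       /Page
    /Parent     2 0 R
    /MediaBox   {media_box(210, 297)}
>>
endobj
"""
print(page_with_media_box)
```

Because all offsets later in the tutorial are computed from the string lengths, swapping in a longer page object like this one would not require any manual adjustments.
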
#### Summary

The body section contains all data for the PDF file.

Here's what we have so far:

In [5]:
body = "".join((document_catalog,  page_tree_root, page))
print(body)

print("=" * 20 + "\n")

pdf_file = "".join((header, body))
print(pdf_file)


1 0 obj                 % DOCUMENT CATALOG
<<
    /Type   /Catalog    % Specifies this dictionary as of type document catalog
    /Pages  2 0 R       % Reference to the page tree root that will be defined later
>>
endobj
2 0 obj                 % PAGE TREE ROOT
<<
    /Type   /Pages      % Specifies this dictionary as of type page tree object
    /Kids   [           % Direct children on the page tree
        3 0 R
    ]
    /Count  1           % Number of Pages that are descendant
>>
endobj
3 0 obj                 % PAGE
<<
    /Type       /Page   % Specifies this dictionary as of type page object
    /Parent     2 0 R   % Reference to parent page tree node
>>
endobj


%PDF-1.7
1 0 obj                 % DOCUMENT CATALOG
<<
    /Type   /Catalog    % Specifies this dictionary as of type document catalog
    /Pages  2 0 R       % Reference to the page tree root that will be defined later
>>
endobj
2 0 obj                 % PAGE TREE ROOT
<<
    /Type   /Pages      % Specifies this diction

### Cross-Reference Table

The cross-reference table holds the positions of all referenced objects in the file.
It acts as an index that allows random access to the objects, so an object located anywhere in the file can be found quickly.
This way the whole file does not need to be read to find an object.

The cross-reference table consists of one or more sections, each of which holds data for one version of the file.
Therefore a new PDF file that has not been edited has only one such section, holding all objects.

A section begins with the line `xref`, followed by one or more subsections in any order.

These subsections are also used for file updates, so a new PDF file once again has only one of them.
Each subsection holds a set of entries for consecutively numbered objects, which means each additional entry in a subsection increases the object number of the referenced object by 1.
This removes the need to specify the object number for every entry.

A subsection begins with a line holding two numbers:
the object number of the first referenced object and the number of entries in the subsection.

Each entry in the cross-reference table has a very specific syntax:

```
nnnnnnnnnn ggggg <f/n>
```

Here `n...n` is the byte offset of the object from the beginning of the file, written with 10 digits (zero-padded if necessary).
`g...g` is the 5-digit generation number, which is initially 0 and only increases when an object is deleted and its object number is reused.
`f` or `n` specifies the type of entry:
`n` marks objects that are in use, while `f` marks free entries whose objects have been deleted.

The entry with object id 0 shall always be free and have generation number 65535.
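
The entry format can be captured in a small helper. This is a sketch with a hypothetical function name, `xref_entry`; it follows the PDF 1.7 specification in making every entry exactly 20 bytes long, including a two-character end-of-line, for which a space followed by a line feed is chosen here.

```python
def xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> str:
    """Format one cross-reference entry of exactly 20 characters.

    Layout: 10-digit offset, space, 5-digit generation, space, type,
    then a two-character end-of-line (space + line feed).
    """
    kind = "n" if in_use else "f"
    return f"{offset:010d} {generation:05d} {kind} \n"


print(repr(xref_entry(0, 65535, in_use=False)))
print(repr(xref_entry(9)))
print(len(xref_entry(9)))  # 20
```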

Our PDF file has 3 objects, therefore we need 3 entries in the cross-reference table plus 1 for object 0:

In [6]:
header_length = len(header)
object1_length = len(document_catalog)
object2_length = len(page_tree_root)
object3_length = len(page)

object1_offset = header_length
object2_offset = object1_offset + object1_length
object3_offset = object2_offset + object2_length

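# The specification requires every entry to be exactly 20 bytes long,
# including a two-character end-of-line, hence the space before each "\n".
cross_reference_table = (
    "xref\n"
    "0 4\n"
    "0000000000 65535 f \n"
    f"{object1_offset:010d} 00000 n \n"
    f"{object2_offset:010d} 00000 n \n"
    f"{object3_offset:010d} 00000 n \n"
)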

print(cross_reference_table)

xref
0 4
0000000000 65535 f
0000000009 00000 n
0000000230 00000 n
0000000505 00000 n



### Trailer

The trailer of the PDF file is designed to allow quick access to all important parts of a PDF document.
It consists of the line `trailer`, the **Trailer Dictionary**, the offset of the cross-reference table and the line `%%EOF`.

#### Trailer Dictionary

The trailer dictionary holds important information about the PDF file.

**Trailer Dictionary Structure:**
- **/Size**: total number of entries in the cross-reference table
- **/Root**: reference to the document catalog

Our cross-reference table has 4 entries and the document catalog is `1 0`:

In [7]:
trailer_dictionary = """trailer                 % TRAILER DICTIONARY
<<
    /Size   4           % Size of the CRT
    /Root   1 0 R       % References the Document Catalog
>>
"""

#### CRT Offset

Finally, the offset of the CRT has to be specified in bytes from the start of the file.
It is preceded by the line `startxref` and followed by the end-of-file marker `%%EOF`.

In [8]:
body_length = len(body)

crt_offset = f"""startxref
{header_length + body_length}
%%EOF
"""

print(crt_offset)

startxref
685
%%EOF
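
A reader is expected to start at the end of the file and use this value to jump to the cross-reference table. A minimal sketch of that lookup follows; `find_startxref` is a hypothetical helper and the sample bytes are shortened placeholders, not a valid PDF.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the offset stored after the last `startxref` keyword."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not matches:
        raise ValueError("no startxref found")
    # The last match wins, since updated files may contain several.
    return int(matches[-1])


sample = b"%PDF-1.7\n...\nxref\n...\nstartxref\n685\n%%EOF\n"
print(find_startxref(sample))  # 685
```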



#### Summary

The finished trailer:

In [9]:
trailer = "".join((trailer_dictionary, crt_offset))

### Putting it together

The complete file:

In [10]:
pdf_file = "".join((header, body, cross_reference_table, trailer))
print(pdf_file)

%PDF-1.7
1 0 obj                 % DOCUMENT CATALOG
<<
    /Type   /Catalog    % Specifies this dictionary as of type document catalog
    /Pages  2 0 R       % Reference to the page tree root that will be defined later
>>
endobj
2 0 obj                 % PAGE TREE ROOT
<<
    /Type   /Pages      % Specifies this dictionary as of type page tree object
    /Kids   [           % Direct children on the page tree
        3 0 R
    ]
    /Count  1           % Number of Pages that are descendant
>>
endobj
3 0 obj                 % PAGE
<<
    /Type       /Page   % Specifies this dictionary as of type page object
    /Parent     2 0 R   % Reference to parent page tree node
>>
endobj
xref
0 4
0000000000 65535 f
0000000009 00000 n
0000000230 00000 n
0000000505 00000 n
trailer                 % TRAILER DICTIONARY
<<
    /Size   4           % Size of the CRT
    /Root   1 0 R       % References the Document Catalog
>>
startxref
685
%%EOF



Let's save our PDF file into an actual file and try to open it:

In [14]:
# All content is ASCII, so one character is one byte and the offsets stay valid.
with open("example.pdf", "w", encoding="ascii", newline="") as target_file:
    target_file.write(pdf_file)
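
Before opening the file in a viewer, the offsets can be sanity-checked in Python. The sketch below uses a hypothetical helper, `check_xref`, and assumes a classic cross-reference table with a single subsection, as built in this tutorial. The sample data is a shortened stand-in rather than a complete PDF.

```python
import re


def check_xref(data: bytes) -> bool:
    """Check that every in-use xref entry points at the matching `N G obj`.

    Assumption: one cross-reference section with a single subsection.
    """
    start = int(re.findall(rb"startxref\s+(\d+)", data)[-1])
    lines = data[start:].splitlines()
    if lines[0].strip() != b"xref":
        return False
    first, count = (int(part) for part in lines[1].split())
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, kind = line.split()
        if kind != b"n":
            continue  # free entries do not point at an object
        expected = b"%d %d obj" % (first + index, int(generation))
        if not data[int(offset):].startswith(expected):
            return False
    return True


sample_body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
sample_table = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
sample_tail = b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % len(sample_body)
sample = sample_body + sample_table + sample_tail
print(check_xref(sample))  # True
```

For the file built in this notebook, the same check could be run as `check_xref(pdf_file.encode("ascii"))`.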