# Page Number Adder

## Introduction

Requirements:

- Inputs: documents in PDF format (1.pdf to 7.pdf)
- Outputs: a summary table of pages in each document (details: [page_types.xlsx](../data/page_types.xlsx))
    - `docs_id`, `docs_pages`, `chapter_title_pages`
    - `regular_pages`: (min, max)
    - `chapter_title_pages`: (min, max)
    - `last_page_of_a_chapter`: (min, max)

### Inputs: Document List

Documents (PDF format)

- [1.pdf](../data/pdf/1.pdf)
- [2.pdf](../data/pdf/2.pdf)
- [3.pdf](../data/pdf/3.pdf)
- [4.pdf](../data/pdf/4.pdf)
- [5.pdf](../data/pdf/5.pdf)
- [6.pdf](../data/pdf/6.pdf)
- [7.pdf](../data/pdf/7.pdf)

### Outputs: A Summary Table

The Summary Table of Books

| Page number format | Page number list | List of chapter title pages | Regular page minimum | Regular page maximum | Chapter title page minimum | Chapter title page maximum | Last-page-of-a-chapter page minimum | Last-page-of-a-chapter page maximum |
|:-------------------|:-----------------|:----------------------------|:---------------------|:--------------------|:---------------------------|:---------------------------|:------------------------------------|:--------------------------------------|
| 1                  | 32, 54, 79       | 54              | | | | | | |
| 2                  | 4, 7, 120, 158   | 7               | | | | | | |
| 3                  | 63, 77, 88       | 88              | | | | | | |
| 4                  | 1, 2, 19, 23     | 19              | | | | | | |
| 5                  | 3, 18, 27, 53    | 27              | | | | | | |
| 6                  | 22, 33, 37       | 33              | | | | | | |
| 7                  | 9, 42, 104       | 42              | | | | | | |


Formats:

1. the end of the page;
2. the end of the page, except chapter title pages have no page number, only the chapter number at the top;
3. book-specific: one end-of-chapter page, identified through the user's input;
4. the end of the page, except chapter title pages have no page number, nor any chapter number at the top;
5. at the top of the page, except chapter title pages have no page number, only the chapter number at the top;
6. at the top of the page, except chapter title pages have no page number, nor any chapter number at the top;
7. the top of the page (and also chapter number below that);
8. the top of the page, except chapter title pages have the page number at the end of the page (chapter number is at top of page below the page number).

## Overview

**Steps**:

1. Data Loading
    - 1.1. importing packages
    - 1.2. loading data
2. Data Preprocessing
    - 2.1. Generating page numbers
        - 2.1.1. Entering page numbers
        - 2.1.2. Fixing symbols (comma, hyphen)
        - 2.1.3. Generating pages
    - 2.2. Generating page numbers and chapter title pages
        - 2.2.1. Entering page numbers
        - 2.2.2. Entering pages of chapter titles
        - 2.2.3. Specifying a document type
3. Data Manipulation
    - 3.1. Removing invalid records from data (by avoid indexes)
    - 3.2. Generating a morphing dictionary (record indexes + page_breaker indexes)
        - 3.2.1. Creating a row/record dictionary (key: indexes of records / value: order)
        - 3.2.2. Creating a page breaker dictionary (key: indexes of page breakers / value: order)
        - 3.2.3. Generating a morphing list (key: indexes of records / values: order or page breaker)
    - 3.3. Adding page numbers to a table
        - 3.3.1. Filling page numbers between page breakers (morphing dict, page numbers)
        - 3.3.2. Assigning page numbers to a csv file


**Examples**:

Data Preprocessing

```
2.1.1. Entering page numbers: ['-15', '20', '30-33']
2.1.2. Fixing symbols (comma, hyphen): ['0-15', '20', '30-33']
2.1.3. Generating pages: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 31, 32, 33]
```
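
The three preprocessing steps in this example can also be sketched as a single function. This is only an illustration: `parse_page_spec` is a hypothetical name, not part of the notebook, and it assumes that an open range such as `-15` starts at the last page seen so far (0 at the beginning), which is how `priorNumber` is used in section 2.1.2.

```python
def parse_page_spec(tokens):
    """Expand tokens such as '-15', '20', '30-33' into a list of page numbers."""
    pages = []
    prior = 0  # assumed start for a leading open range such as '-15'
    for token in tokens:
        start, hyphen, end = token.strip().partition("-")
        if not hyphen:
            # single page, e.g. '20'
            first = last = int(start)
        else:
            # range; an empty start falls back to the last page seen so far
            first = int(start) if start else prior
            last = int(end)
        pages.extend(range(first, last + 1))
        prior = last
    return pages


print(parse_page_spec(["-15", "20", "30-33"]))
```

For the input above, the printed list is the same as the one shown for step 2.1.3.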

## 1. Data Loading

### 1.1. Importing packages

In [12]:
import re

import numpy as np
import pandas as pd

### 1.2. Loading data

In [13]:
datafile = "../data/csv/pagenumbers.csv"
data = pd.read_csv(datafile, sep=",")

In [9]:
data

| Unnamed: 0 | Text | Summary | Author | Title | Page Numbers |
|:-----------|:-----|:--------|:-------|:------|:-------------|
| 0 | blah blah blah | blatherskite | Deferential Writer | A Superfluous Tale | |
| 1 | blah blah blah | blatherskite | Deferential Writer | A Superfluous Tale | |
| 2 | back in 1822, there was a warthog | blatherskite | Deferential Writer | A Superfluous Tale | |
| 3 | blah blah blah | blatherskite | Deferential Writer | A Superfluous Tale | |
| 4 | blah blah blah | blatherskite | Deferential Writer | A Superfluous Tale | |
| 5 | blah blah blah | blatherskite | Deferential Writer | A Superfluous Tale | |
| 6 | 5, 6, 7, 8, who do we appreciate? Platypuses! | blatherskite | Deferential Writer | A Superfluous Tale | |
| 7 | blah blah blah | blatherskite | Deferential Writer | A Superfluous Tale | |
| 8 | blah blah blah | blatherskite | Deferential Writer | A Superfluous Tale | |
| 9 | blah blah blah | blatherskite | Deferential Writer | A Superfluous Tale | |


## 2. Data Preprocessing

### 2.1. Generating page numbers

#### 2.1.1. Entering page numbers

In [11]:
txt = """
Please enter your list of page numbers, separated by
a comma and a space (e.g. 12, 34).
"""

print(txt)
# input() already returns a string, so split it directly
page_numbers = input().split(', ')


Please enter your list of page numbers, separated by
a comma and a space (e.g. 12, 34).

12, 34


In [15]:
print(page_numbers)

['12', '34']
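
Splitting on the exact string `', '` only works when every comma is followed by exactly one space. A more forgiving split could look like the sketch below; `split_page_input` is a hypothetical helper, and it takes the raw string as an argument so that it can be tried without calling `input()`.

```python
import re


def split_page_input(raw):
    """Split a comma-separated entry, tolerating missing or extra spaces."""
    tokens = re.split(r"\s*,\s*", raw.strip())
    return [t for t in tokens if t]  # drop empty tokens left by stray commas


print(split_page_input("12,34 ,  56-60,"))
```

In the notebook it would be called as `split_page_input(input())`.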


#### 2.1.2. Fixing symbols (comma, hyphen)

In [34]:
def commaHyphenFix(lliissttyy):
    # Turn open-ended ranges such as '-12' into explicit 'start-end' ranges;
    # every other item is passed through unchanged.
    pageNumberList = []
    priorNumber = 0  # last page seen so far (0 before the first item)
    for item in lliissttyy:
        if "-" in item:
            splitNumber1, splitNumber2 = item.split('-')
            if splitNumber1 == "":
                # missing start: begin at the last page seen so far
                item = str(priorNumber) + "-" + splitNumber2
            priorNumber = splitNumber2
        else:
            priorNumber = item
        pageNumberList.append(item)

    return pageNumberList

In [25]:
commaHyphenFix(['-12', '14', '15-17'])

['0-12', '14', '15-17']

In [26]:
commaHyphenFix(['14', '15-17'])

['14', '15-17']

In [27]:
commaHyphenFix(['15-17'])

['15-17']

In [29]:
commaHyphenFix(['-50'])

['0-50']

In [47]:
commaHyphenFix(['-14', '15-17', '-50', '-51', '-100'])

['0-14', '15-17', '17-50', '50-51', '51-100']

In [45]:
commaHyphenFix(['-14', '15-17', '50', '51', '-100'])

['0-14', '15-17', '50', '51', '51-100']

In [32]:
result1, result2 = '20'.split('-')
print(result1, result2)

ValueError: not enough values to unpack (expected 2, got 1)

In [35]:
result1, result2 = '20-25'.split('-')
print(result1, result2)

20 25


In [48]:
for stringchar, item in enumerate(['-14', '15-17', '-50', '-51', '-100']):
    print(stringchar, item)

0 -14
1 15-17
2 -50
3 -51
4 -100


#### 2.1.3. Generating pages

In [51]:
def addPages(lliisstt):
    pageNumberList = []
    for stringchar, item in enumerate(lliisstt):
        if "-" in item:
            splitNumber1, splitNumber2 = item.split('-')   
            rangy = int(splitNumber2) - int(splitNumber1) + 1
            for counter in range(rangy):          
                splitNumber = str(int(splitNumber1) + counter)
                pageNumberList.append(splitNumber) 
        else:
            pageNumberList.append(item) 
                
    return pageNumberList

In [53]:
yayList = addPages(['0-12'])
print(yayList)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']


In [55]:
yayList = addPages(['13-15'])
print(yayList)

['13', '14', '15']
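
If two entries overlap (say `'51'` followed by `'51-100'`), `addPages` emits page 51 twice. Should repeated pages be unwanted, they can be dropped afterwards while keeping the original order. `unique_pages` is a hypothetical helper, not something the notebook defines.

```python
def unique_pages(pages):
    """Drop repeated pages while keeping first-seen order."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(pages))


print(unique_pages(["50", "51", "51", "52"]))
```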


### 2.2. Generating page numbers and chapter title pages

#### 2.2.1. Entering page numbers

In [58]:
paginationFormat = input("Please indicate where the page numbers are located: ")
print(paginationFormat)

Please indicate where the page numbers are located: 12,34,56
12,34,56


#### 2.2.2. Entering pages of chapter titles

In [2]:
chapterTitlePageList = input("What pages are chapter title pages? ")
print(chapterTitlePageList)

What pages are chapter title pages? 12
12


In [3]:
def chTitlePage(chapterTitlePageList):
    pass

In [4]:
chTitlePage(12)
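
`chTitlePage` is still a stub. One possible direction, shown here only as a sketch, is to turn the entered text into a set of page numbers and test pages against it. Both function names below are assumptions, and the notebook does not say what `chTitlePage` is ultimately meant to return.

```python
def parse_chapter_title_pages(raw):
    """Turn an entry such as '12, 54' into a set of page numbers."""
    # commas and spaces are both treated as separators
    return {int(tok) for tok in raw.replace(",", " ").split()}


def is_chapter_title_page(page, title_pages):
    """Check whether a page (int or numeric string) is a chapter title page."""
    return int(page) in title_pages


title_pages = parse_chapter_title_pages("12, 54,88")
print(sorted(title_pages))
print(is_chapter_title_page("54", title_pages))
```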

#### 2.2.3. Specifying a document type

In [56]:
txt = """
Where are the page numbers?
1 the end of the page;
2 the end of the page, except chapter title pages have no page number, only the chapter number at the top;
3 the end of the page, except chapter title pages have no page number, nor any chapter number at the top;
4 the end of the page including chapter title pages, which also have chapter title page at the top;
5 at the top of the page, except chapter title pages have no page number, nor any chapter number at the top;
6 the top of the page, except chapter title pages have no page number, only the chapter number at the top;
7 the top of the page;
8 the top of the page, except chapter title pages have the page number at the end of the page.
"""

print(txt)


Where are the page numbers?
1 the end of the page;
2 the end of the page, except chapter title pages have no page number, only the chapter number at the top;
3 the end of the page, except chapter title pages have no page number, nor any chapter number at the top;
4 the end of the page including chapter title pages, which also have chapter title page at the top;
5 at the top of the page, except chapter title pages have no page number, nor any chapter number at the top;
6 the top of the page, except chapter title pages have no page number, only the chapter number at the top;
7 the top of the page;
8 the top of the page, except chapter title pages have the page number at the end of the page.



In [9]:
def numberLayout(pageNumberPattern):
    rangeOfNumbers = 8  # the prompt above lists 8 layouts

    # an int is not iterable, so loop over the layout numbers 1..rangeOfNumbers
    for a in range(1, rangeOfNumbers + 1):
        if pageNumberPattern == a:
            pass  # layout-specific handling is not implemented yet

    return

In [10]:
numberLayout(3)
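
Since each layout is identified by a number, a dictionary lookup may be simpler than looping over the possible values. The sketch below copies the eight descriptions from the prompt in this section; `PAGE_NUMBER_LAYOUTS` and `describe_layout` are hypothetical names, and what should happen for each layout is left open, as in the notebook.

```python
PAGE_NUMBER_LAYOUTS = {
    1: "the end of the page",
    2: "the end of the page, except chapter title pages have no page number, only the chapter number at the top",
    3: "the end of the page, except chapter title pages have no page number, nor any chapter number at the top",
    4: "the end of the page including chapter title pages, which also have chapter title page at the top",
    5: "at the top of the page, except chapter title pages have no page number, nor any chapter number at the top",
    6: "the top of the page, except chapter title pages have no page number, only the chapter number at the top",
    7: "the top of the page",
    8: "the top of the page, except chapter title pages have the page number at the end of the page",
}


def describe_layout(choice):
    """Return the description for a layout number given as int or string."""
    try:
        return PAGE_NUMBER_LAYOUTS[int(choice)]
    except (KeyError, ValueError):
        raise ValueError(
            f"expected a layout number from 1 to {len(PAGE_NUMBER_LAYOUTS)}, got {choice!r}"
        ) from None


print(describe_layout(7))
```

Accepting strings as well as integers means the value returned by `input()` can be passed in directly.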

## 3. Data Manipulation

### 3.1. Removing invalid records from data (by avoid indexes)

In [22]:
dataReg = open(datafile)
listOIndexes = []

In [23]:
def getRawIndexList():
    # tokens that disqualify a row: a digit directly next to ':' or '.'
    listToAvoid = ('0:', '1:', '2:', '3:', '4:', '5:', '6:', '7:', '8:', '9:', ':0', ':1', ':2', ':3', ':4', ':5', ':6', ':7', ':8', ':9', '0.', '1.', '2.', '3.', '4.', '5.', '6.', '7.', '8.', '9.', '.0', '.1', '.2', '.3', '.4', '.5', '.6', '.7', '.8', '.9')
    for rowIndexes, row in enumerate(dataReg):
        if re.search('[0-9]', row):  # only rows that contain a digit
            print(row)
            # `row in listToAvoid` compared the whole row with each token and never matched;
            # check whether any token occurs inside the row instead
            if any(token in row for token in listToAvoid):
                pass
            else:
                listOIndexes.append(rowIndexes)
    return listOIndexes

In [24]:
rawIndexList = getRawIndexList()
print(rawIndexList)

"back in 1822, there was a warthog",blatherskite,Deferential Writer,A Superfluous Tale,

"5, 6, 7, 8, who do we appreciate? Platypuses!",blatherskite,Deferential Writer,A Superfluous Tale,

"text containing a numeral, 8 in this case",blatherskite,Deferential Writer,A Superfluous Tale,

"text containing a numeral, 0 in this case",oil changing process,Deferential Writer,A Superfluous Tale,

"text containing a numeral, 9 in this case",oil changing process,Deferential Writer,A Superfluous Tale,

[3, 7, 14, 19, 24]
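
The avoid list can also be expressed as one regular expression: keep a line only if it contains a run of digits that is not directly next to `:` or `.`. This is a sketch under that reading of `listToAvoid`; `CANDIDATE` and `candidate_row_indexes` are hypothetical names. Note that it differs slightly from a substring check, because a line is kept as soon as one qualifying number is found.

```python
import re

# a run of digits with no ':', '.' or further digit directly before or after it
CANDIDATE = re.compile(r"(?<![:.\d])\d+(?![:.\d])")


def candidate_row_indexes(lines):
    """Return the indexes of lines that contain a stand-alone number."""
    return [i for i, line in enumerate(lines) if CANDIDATE.search(line)]


sample = [
    "Text,Summary",
    "blah blah blah",
    "back in 1822, there was a warthog",
    "meet at 10:30",
    "pi is 3.14",
    "5, 6, 7, 8, who do we appreciate?",
]
print(candidate_row_indexes(sample))
```

One limitation: a number at the very end of a sentence, such as `1822.`, is followed by a period and is therefore skipped as well.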


### 3.2. Generating a morphing dictionary (record indexes + page_breaker indexes)

#### 3.2.1. Creating a row/record dictionary (key: indexes of records / value: order)

In [25]:
def initializeDictionaryOfRows(initializeListOfRows):
    rowDict = {k: v for v, k in enumerate(initializeListOfRows)}
    return(rowDict)

In [29]:
listOfRows = [None]*4346
dictOfRows = initializeDictionaryOfRows(listOfRows)
print(dictOfRows)

{None: 4345}
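
The result has a single entry because every element of `listOfRows` is `None`: dictionary keys are unique, so each iteration overwrites the same key and only the last position survives. A small sketch of the effect, and of one possible alternative that uses the row positions themselves as keys (an assumption about what the row dictionary is meant to hold):

```python
# every key is None, so only the last enumerate() value is kept
collapsed = {k: v for v, k in enumerate([None] * 5)}
print(collapsed)

# using the row positions as keys keeps one entry per row
row_dict = {k: v for v, k in enumerate(range(5))}
print(row_dict)
```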


#### 3.2.2. Creating a page breaker dictionary (key: indexes of page breakers / value: order)

In [30]:
def assignListOfIndexesToDict(indexList):
    inDictMint = {k: v for v, k in enumerate(indexList)}
    return(inDictMint)

In [31]:
listOfIndexes = [1, 36, 73, 110, 147, 186, 222, 258, 294, 331, 369, 402, 437, 470, 506, 539, 576, 644, 678, 715, 751, 787, 823, 860, 895, 931, 965, 1001, 1028, 1064, 1100, 1135, 1170, 1204, 1240, 1277, 1314, 1349, 1386, 1422, 1458, 1494, 1529, 1564, 1591, 1627, 1664, 1702, 1739, 1776, 1810, 1844, 1880, 1917, 1952, 1984, 2023, 2060, 2095, 2131, 2168, 2205, 2239, 2276, 2312, 2345, 2381, 2417, 2453, 2490, 2527, 2562, 2599, 2636, 2673, 2710, 2747, 2782, 2818, 2855, 2891, 2926, 2962, 2996, 3033, 3071, 3109, 3144, 3182, 3219, 3255, 3289, 3326, 3363, 3398, 3435, 3471, 3508, 3545, 3582, 3617, 3654, 3717, 3750, 3786, 3820, 3857, 3894, 3931, 3966, 4003, 4040, 4075, 4110, 4146, 4183, 4219, 4254, 4291, 4328]
dictOfIndexes = assignListOfIndexesToDict(listOfIndexes)
print(dictOfIndexes)

{1: 0, 36: 1, 73: 2, 110: 3, 147: 4, 186: 5, 222: 6, 258: 7, 294: 8, 331: 9, 369: 10, 402: 11, 437: 12, 470: 13, 506: 14, 539: 15, 576: 16, 644: 17, 678: 18, 715: 19, 751: 20, 787: 21, 823: 22, 860: 23, 895: 24, 931: 25, 965: 26, 1001: 27, 1028: 28, 1064: 29, 1100: 30, 1135: 31, 1170: 32, 1204: 33, 1240: 34, 1277: 35, 1314: 36, 1349: 37, 1386: 38, 1422: 39, 1458: 40, 1494: 41, 1529: 42, 1564: 43, 1591: 44, 1627: 45, 1664: 46, 1702: 47, 1739: 48, 1776: 49, 1810: 50, 1844: 51, 1880: 52, 1917: 53, 1952: 54, 1984: 55, 2023: 56, 2060: 57, 2095: 58, 2131: 59, 2168: 60, 2205: 61, 2239: 62, 2276: 63, 2312: 64, 2345: 65, 2381: 66, 2417: 67, 2453: 68, 2490: 69, 2527: 70, 2562: 71, 2599: 72, 2636: 73, 2673: 74, 2710: 75, 2747: 76, 2782: 77, 2818: 78, 2855: 79, 2891: 80, 2926: 81, 2962: 82, 2996: 83, 3033: 84, 3071: 85, 3109: 86, 3144: 87, 3182: 88, 3219: 89, 3255: 90, 3289: 91, 3326: 92, 3363: 93, 3398: 94, 3435: 95, 3471: 96, 3508: 97, 3545: 98, 3582: 99, 3617: 100, 3654: 101, 3717: 102, 3750: 1

#### 3.2.3. Generating a morphing list (key: indexes of records / values: order or page breaker)

In [32]:
def addPagebreakers(ur_Dict_Of_Rows, ur_Dict_Of_Indexes):
    #Take dictOfRows and add pagebreakers based on dictOfIndexes
    newDict = {}
    for k,cellValue in ur_Dict_Of_Rows.items():
        if cellValue in ur_Dict_Of_Indexes:
            newDict[cellValue] = 'pagebreaker'
        else:
            newDict[cellValue] = cellValue
    return newDict

In [33]:
morphingDict = addPagebreakers(dictOfRows, dictOfIndexes)
morphingDict

{4345: 4345}

In [34]:
rows = {0:0, 1:1, 2:2}
indexes = {1:0}
addPagebreakers(rows, indexes)

{0: 0, 1: 'pagebreaker', 2: 2}
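
The same mapping can be built straight from the number of rows and the list of page breaker indexes, without the two intermediate dictionaries. This is a sketch only; `build_morphing_dict` is a hypothetical name.

```python
def build_morphing_dict(row_count, breaker_indexes):
    """Map each row index to itself, or to 'pagebreaker' if it is a breaker."""
    breakers = set(breaker_indexes)  # set membership tests are fast
    return {i: ("pagebreaker" if i in breakers else i) for i in range(row_count)}


print(build_morphing_dict(3, [1]))
```

For three rows and a breaker at index 1 this gives the same dictionary as the small `addPagebreakers` example in this section.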

### 3.3. Adding page numbers to a table

#### 3.3.1. Filling page numbers between page breakers (morphing dict, page numbers)

In [35]:
def addPageNumbers(dictInProgress, yayIndexListAgain, pageNumberList, rowList, indexList):
    finalList = []
    counter = 0

    for i, v in enumerate(dictInProgress):
        if dictInProgress[v] == "pagebreaker":
            counter += 1
            finalList.append("pageBreaker")
        else:
            if counter < len(yayIndexListAgain):
                finalList.append(str(pageNumberList[counter]))
                
    print(finalList)

    return finalList

In [36]:
listOfPageNumbers = [5, 6, 7, 8, 9, 10, 13, 14, 15, 17, 26, 27, 28, 36, 37, 38, 39, 42, 43, 44, 45, 46, 47, 48, 49, 54, 55, 56, 57, 58, 59, 60, 61, 62, 65, 67, 71, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 110, 111, 114, 117, 120, 121, 123, 124, 127, 128, 129, 130, 131, 132, 141, 145, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181]
youGotItToyotaList = addPageNumbers(morphingDict, dictOfIndexes, listOfPageNumbers, listOfRows, listOfIndexes)

['5']


In [37]:
def addPageNumbers(dictInProgress, yayIndexListAgain, pageNumberList):
    finalList = []
    counter = 0

    for i, v in enumerate(dictInProgress):
        if dictInProgress[v] == "pagebreaker":
            counter += 1
            finalList.append("pageBreaker")
        else:
            if counter < len(yayIndexListAgain):
                finalList.append(str(pageNumberList[counter]))
                
    print(finalList)

    return finalList

In [38]:
listOfPageNumbers = [5, 6, 7, 8, 9, 10, 13, 14, 15, 17, 26, 27, 28, 36, 37, 38, 39, 42, 43, 44, 45, 46, 47, 48, 49, 54, 55, 56, 57, 58, 59, 60, 61, 62, 65, 67, 71, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 110, 111, 114, 117, 120, 121, 123, 124, 127, 128, 129, 130, 131, 132, 141, 145, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181]
youGotItToyotaList = addPageNumbers(morphingDict, dictOfIndexes, listOfPageNumbers)

['5']


In [46]:
clean_dict = {
    1: 0, 
    2: 'pagebreaker', 
    3: 2, 
    4: 3, 
    5: 4, 
    6: 'pagebreaker'}

page_breakers = {2:0, 6:1}
pages_between_breakers = [1,3]
addPageNumbers(clean_dict, page_breakers, pages_between_breakers)

['1', 'pageBreaker', '3', '3', '3', 'pageBreaker']


['1', 'pageBreaker', '3', '3', '3', 'pageBreaker']
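
Two details of `addPageNumbers` are easy to miss: it tests for `"pagebreaker"` but appends `"pageBreaker"`, and once the counter passes the number of breakers it appends nothing, so the result can be shorter than the table. The sketch below uses one spelling throughout and pads rows that have no page number with an empty string. `fill_page_numbers` is a hypothetical name, and the padding is an assumption about the desired behaviour.

```python
def fill_page_numbers(morphing, page_numbers):
    """Give every row the page number of the segment it falls in."""
    filled = []
    segment = 0  # index into page_numbers; advances at each page breaker
    for key in sorted(morphing):
        if morphing[key] == "pagebreaker":
            filled.append("pagebreaker")
            segment += 1
        elif segment < len(page_numbers):
            filled.append(str(page_numbers[segment]))
        else:
            filled.append("")  # no page number left for this row
    return filled


clean_dict = {1: 0, 2: "pagebreaker", 3: 2, 4: 3, 5: 4, 6: "pagebreaker"}
print(fill_page_numbers(clean_dict, [1, 3]))
```

Because one entry is produced per row, the result always has the same length as the morphing dictionary.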

#### 3.3.2. Assigning page numbers to a csv file

In [41]:
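newcol = pd.Series(youGotItToyotaList)
# 'Page Numbers' contains a space, so it cannot be passed to assign() as a keyword;
# item assignment aligns the Series on the row index, and unmatched rows become NaN
data['Page Numbers'] = newcol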
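
Writing the result out can be sketched with the standard library `csv` module, used here in place of pandas so that the example runs on its own. `add_page_number_column` is a hypothetical helper, the two sample rows are shortened stand-ins for the real table, and an in-memory buffer takes the place of the output file.

```python
import csv
import io


def add_page_number_column(rows, page_numbers, column="Page Numbers"):
    """Return copies of the row dicts with the page-number column filled in."""
    if len(rows) != len(page_numbers):
        raise ValueError("need exactly one page number entry per row")
    return [{**row, column: page} for row, page in zip(rows, page_numbers)]


rows = [
    {"Text": "blah blah blah", "Page Numbers": ""},
    {"Text": "back in 1822, there was a warthog", "Page Numbers": ""},
]
updated = add_page_number_column(rows, ["5", "5"])

buffer = io.StringIO()  # stands in for an open output file
writer = csv.DictWriter(buffer, fieldnames=["Text", "Page Numbers"], lineterminator="\n")
writer.writeheader()
writer.writerows(updated)
print(buffer.getvalue())
```

The length check is there because a list that is shorter than the table would otherwise be truncated silently by `zip`.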