# Summary

In this Jupyter Notebook, we take a PDF of the complete works of H. P. Lovecraft, split the stories into individual `pdf` files, clean the text, and convert everything into `txt` files. The source PDF is available here:

https://arkhamarchivist.com/free-complete-lovecraft-ebook-nook-kindle/

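We use PyPDF2 and PDFMiner. The code below relies on the camelCase PyPDF2 API (`PdfFileReader`, `getPage`, and so on), which was removed in PyPDF2 3.0, so an older version is needed. To install the libraries:
<br>
`pip install "pypdf2<3"`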
<br>
`pip install pdfminer.six`

# Set up Environment

In [5]:
import PyPDF2
import data_func
import csv

reader = PyPDF2.PdfFileReader(
    './data/original/Complete_Works_Lovecraft.pdf')

print(reader.documentInfo)

num_of_pages = reader.numPages
print('Number of pages: ' + str(num_of_pages))

{'/Author': 'H.P. Lovecraft', '/Creator': 'Microsoft® Word 2010', '/CreationDate': "D:20110729214233-04'00'", '/ModDate': "D:20110729214233-04'00'", '/Producer': 'Microsoft® Word 2010'}
Number of pages: 708


# Table of Contents

First, we save the Table of Contents, which spans the third and fourth pages of the PDF, as a separate pdf file.

In [6]:
writer = PyPDF2.PdfFileWriter()

# the table of contents sits on 0-based page indices 2 and 3
for page in range(2, 4):
    writer.addPage(reader.getPage(page))

output_filename = './data/original/table_of_contents.pdf'

with open(output_filename, 'wb') as output:
    writer.write(output)

The next step is to extract the text. PyPDF2 can do this with `extractText`, for example `reader.getPage(7).extractText()`, but its output drops some lines. For this reason, we use PDFMiner instead.

In [7]:
text = data_func.convert_pdf_to_string(
    './data/original/table_of_contents.pdf')
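
`data_func` is a local helper module whose source is not shown in this notebook. As a rough idea of what `convert_pdf_to_string` could look like, here is a minimal sketch that assumes it simply wraps the `extract_text` function from `pdfminer.six`. The actual helper may be implemented differently.

```python
def convert_pdf_to_string(file_path):
    """Return all text of the PDF at file_path as one string.

    Hypothetical stand-in for data_func.convert_pdf_to_string.
    """
    # imported inside the function so pdfminer.six is only required when it is called
    from pdfminer.high_level import extract_text

    return extract_text(file_path)
```

PDFMiner typically ends each page with a form feed character (`\x0c`), which would explain why that character is stripped from the extracted text later on.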

In [8]:
text[:1000]

'Table of Contents \n\nPreface ............................................................................................................................. 2 \nThe Tomb .......................................................................................................................... 5 \nDagon ............................................................................................................................. 12 \nPolaris............................................................................................................................. 16 \nBeyond the Wall of Sleep ............................................................................................... 19 \nMemory ........................................................................................................................... 26 \nOld Bugs ......................................................................................................................... 27 \nThe Transition of Juan Romero 

In [9]:
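# remove the dot leaders between titles and page numbers
text = text.replace('.', '')
# remove the form feed characters left at the page breaks
text = text.replace('\x0c', '')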
table_of_contents_raw = text.split('\n')
table_of_contents_raw[:10]

['Table of Contents ',
 '',
 'Preface  2 ',
 'The Tomb  5 ',
 'Dagon  12 ',
 'Polaris 16 ',
 'Beyond the Wall of Sleep  19 ',
 'Memory  26 ',
 'Old Bugs  27 ',
 'The Transition of Juan Romero  32 ']

Next, we split each raw line into a title and a page number, and build a filename-friendly version of each title.

In [10]:
title_list = []
pagenum_list = []
title_formatted_list = []
for item in table_of_contents_raw:
    title, pagenum = data_func.split_to_title_and_pagenum(item)
    if title is not None:
        title_list.append(title)
        pagenum_list.append(pagenum)
        title_formatted_list.append(
            data_func.convert_title_to_filename(title))

# append one extra entry (one past the last page) so the last story has an end page
pagenum_list.append(num_of_pages + 1)
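
The two `data_func` helpers used above are not shown in this notebook. The sketch below is one possible way to write them, assuming that every table-of-contents line ends in a page number and that filenames are lowercase words joined by underscores (as in `the_haunter_of_the_dark`). The regular expressions and the handling of punctuation are assumptions, not the original implementation.

```python
import re


def split_to_title_and_pagenum(line):
    """Split a line such as 'The Tomb  5 ' into ('The Tomb', 5).

    Returns (None, None) for lines that do not end in a page number.
    """
    match = re.match(r'^(.*\S)\s+(\d+)\s*$', line)
    if match is None:
        return None, None
    return match.group(1), int(match.group(2))


def convert_title_to_filename(title):
    """Lowercase the title and join its ASCII words with underscores."""
    words = re.findall(r'[a-z0-9]+', title.lower())
    return '_'.join(words)


print(split_to_title_and_pagenum('The Tomb  5 '))            # ('The Tomb', 5)
print(split_to_title_and_pagenum('Table of Contents '))      # (None, None)
print(convert_title_to_filename('The Haunter of the Dark'))  # the_haunter_of_the_dark
```

Returning the page number as an `int` matters here, because the later cells do arithmetic on the entries of `pagenum_list`.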

# Saving Individual PDF Files

Next, we save each story as an individual PDF file, skipping the first entry, which is the Preface.

In [11]:
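for i in range(1, len(title_formatted_list)):
    title_formatted = title_formatted_list[i]
    # table-of-contents page numbers are 1-based, PyPDF2 page indices are 0-based
    page_start = pagenum_list[i] - 1
    # the story ends on the page before the next title starts
    page_end = pagenum_list[i+1] - 2
    
    writer = PyPDF2.PdfFileWriter()

    for page in range(page_start, page_end + 1):
        writer.addPage(reader.getPage(page))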
    
    output_filename = './data/original/pdfs/' + title_formatted + '.pdf'

    with open(output_filename, 'wb') as output:
        writer.write(output)
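
The page arithmetic in the loop above is easy to get wrong by one, so here is a small worked example. It uses the first three page numbers from the table of contents (Preface 2, The Tomb 5, Dagon 12). The helper name `story_page_indices` is made up for this illustration.

```python
def story_page_indices(pagenum_list, i):
    """Return the 0-based PDF page indices covered by entry i."""
    page_start = pagenum_list[i] - 1    # 1-based page number -> 0-based index
    page_end = pagenum_list[i + 1] - 2  # last page before the next title starts
    return list(range(page_start, page_end + 1))


# page numbers of Preface, The Tomb, and Dagon from the table of contents
pagenum_list = [2, 5, 12]
print(story_page_indices(pagenum_list, 1))  # [4, 5, 6, 7, 8, 9, 10]
```

The Tomb starts on page 5 and Dagon on page 12, so The Tomb covers pages 5 to 11, which are the 0-based indices 4 to 10.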

# Saving Individual TXT Files

We loop through the individual PDF files, convert each one to a string, clean the string, and save the result as a txt file.

In [29]:
year_written = []
# first element is Preface, where year is not applicable
year_written.append('n/a')

for title_formatted in title_formatted_list[1:]:
    
    text = data_func.convert_pdf_to_string(
        './data/original/pdfs/' + title_formatted + '.pdf')
    
    # remove the '(YYYY)' that follows the title, and collect the year in a list
    i = text.index('(')
    year = text[i+1:i+5]
    text = text[:i] + text[i+6:]
    year_written.append(year)
    
    # remove 'Return to Table of Contents', which is not part of the text
    text = text.replace('Return to Table of Contents', '')
    
    # strip the trailing 'Fin' from the end of the last story
    if title_formatted == 'the_haunter_of_the_dark':
        text = text[:-15]
    
    # save in a txt file
    with open('./data/original/txts/' + title_formatted + '.txt', 'w',
              encoding='utf-8') as text_file:
        text_file.write(text)
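
The cell above takes the first `(` in the text and assumes the next four characters are the year. A slightly more defensive alternative, sketched here with a regular expression, only matches a four-digit number in parentheses. The helper name `pop_year` and the sample input are made up for illustration.

```python
import re


def pop_year(text):
    """Remove the first '(YYYY)' from text and return (year, remaining_text).

    Returns (None, text) when no four-digit year in parentheses is found.
    """
    match = re.search(r'\((\d{4})\)', text)
    if match is None:
        return None, text
    return match.group(1), text[:match.start()] + text[match.end():]


# made-up input for illustration, not taken from the extracted files
year, cleaned = pop_year('A Title \n(1920) \n\nFirst line of the story.')
print(year)           # 1920
print(repr(cleaned))  # 'A Title \n \n\nFirst line of the story.'
```

Unlike the index-based version, this does not fail when a file contains no year, which makes a missing year easier to spot and handle.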

# Saving Table of Contents in a CSV

Finally, we save the titles, page numbers, filename-friendly titles, and the years in which the stories were written in a csv file.

In [30]:
# zip stops at the shortest list, so the extra final entry in pagenum_list is dropped
with open('./data/original/table_of_contents.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(zip(
        title_list, pagenum_list, title_formatted_list, year_written))
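
To check the result, the file can be read back with `csv.DictReader`. The sketch below shows the round trip on an in-memory buffer instead of the real file and adds a header row. The column names, the helper `write_toc_csv`, and the sample rows are assumptions for illustration only.

```python
import csv
import io


def write_toc_csv(f, titles, pagenums, filenames, years):
    """Write one header row plus one row per title to the open file f."""
    writer = csv.writer(f)
    writer.writerow(['title', 'page', 'filename', 'year'])  # hypothetical column names
    # zip stops at the shortest input, so a trailing extra page number is ignored
    writer.writerows(zip(titles, pagenums, filenames, years))


buffer = io.StringIO(newline='')
write_toc_csv(buffer,
              ['Preface', 'The Tomb'],
              [2, 5, 12],               # 12 plays the role of the extra final entry
              ['preface', 'the_tomb'],
              ['n/a', 'YYYY'])          # placeholder year
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows))        # 2
print(rows[1]['page'])  # 5
```

Note that `csv` returns every field as a string, so page numbers need to be converted back with `int` after loading.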