### 1) Importing libraries

In [1]:
import pdfplumber
import regex as re
from collections import defaultdict

### 2) Extracting text page-by-page

In [2]:
page_text = {}
with pdfplumber.open(r'DCISC_Annual_Reports/DCISC-31st-Annual-Report.pdf') as pdf:
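    # Key each page by its 1-based page number.
    for num, page in enumerate(pdf.pages, start=1):
        # Fall back to an empty string in case extract_text() returns None for a page without text.
        page_text[num] = page.extract_text() or ''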

In [3]:
page_text[196]

"st\n31  Annual Report by the Diablo Canyon Independent Safety Committee, July 1, 2020—June 30,\n2021\nPreface | Executive Summary\nVolume I TOC | Volume II TOC | PG&E Response | Contact the DCISC\nst\n31  Annual Report, Volume II, Exhibit B.3, Minutes of the October\n22-23, 2020 Public Meeting\nMinutes of the Diablo Canyon Independent Safety Committee's October\n22-23, 2020 Public Meeting [Approval at the February 16-17, 2021 Public\nMeeting.]\nThursday & Friday\nOctober 22-23, 2020\nConducted online as a Zoom webinar.\nIn response to Governor Newsom's Executive Order N.29-20 related to the COVID-19\n(coronavirus) pandemic, public participation in this DCISC public meeting was by\nelectronic means only and without a physical location for public participation, in\ncompliance with California state guidelines on social distancing. This meeting was\nproduced by AGP Video Inc. and webcast live on SLO-SPAN at http://www.slo-span.org\nand was subsequently broadcast on San Luis Obispo local g

In [4]:
page_text[394]

"st\n31  Annual Report by the Diablo Canyon Independent Safety Committee, July 1, 2020—June 30,\n2021\nPreface | Executive Summary\nVolume I TOC | Volume II TOC | PG&E Response | Contact the DCISC\nst\n31  Annual Report, Volume I, Section 4.5, Health, Nuclear Safety\nCulture, and Safety Conscious Work Environment\n4.5 Nuclear Safety Culture, and Safety Conscious Work Environment\n4.5.1 Overview and Previous Activities\nThe purpose of Nuclear Safety Culture, and Safety Conscious Work\nEnvironment (SCWE) is twofold: 1) the health of the individual employee, and 2)\nnuclear and personnel safety as the context and requirement for all DCPP\nemployees. Included in the area are all health related issues. This section also\nfocuses on Safety as a contextual, cultural requirement.\nIn the previous reviewing period (2017-2018) the DCISC reviewed the following:\nNuclear Safety Culture\nThe DCISC concluded the following:\nThe DCPP Nuclear Safety Culture Monitoring Panel and the Safety Culture\nLea

In [5]:
len(page_text)

903

### 3) Tokenizing page text into paragraphs

In [6]:
page_pargraphs = {}
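for num, text in page_text.items():
    # Split at line ends that follow a period, a digit or period plus ")", or a colon.
    # The pattern is a raw string so the backslashes reach the regex engine unchanged.
    split_paragraphs = re.split(r'\.\n|\d\)\n|\.\)\n|:\n|\.\s+\n', text)
    # Drop fragments containing "|", which come from the navigation header on each page.
    split_paragraphs = [x for x in split_paragraphs if '|' not in x]
    # Keep only pages that split into more than one paragraph.
    if len(split_paragraphs) > 1:
        page_pargraphs[num] = split_paragraphs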
page_pargraphs[8]

["responsible for the operation of his submarine's nuclear power plant.  Mr",
 'McWhorter then served the U. S. Nuclear Regulatory Commission for ten years first\nas an Operator Licensing Examiner and then as Senior Resident Inspector at North\nAnna Power Station.  He then was employed for two years as a Systems\nEngineering Manager for Dominion Virginia Power at North Anna Power Station',
 ' For ten years, Mr. McWhorter was employed at Old Dominion Electric Cooperative\nwhere he served as Vice President of Operations and Asset Management.  Mr',
 'McWhorter has been a Consultant to the DCISC since 2016',
 'Legal Counsel: Robert R. Wellington, Esq. has been Legal Counsel for the DCISC\nsince its organization in 1989. He is a graduate of Stanford University and the\nUniversity of California (Hastings) Law School. For over 40 years his practice has\nbeen limited to representing several cities, regional wastewater and solid waste\ndistricts and other public agencies, including the DCISC. H

In [7]:
page_pargraphs[394]

['nuclear and personnel safety as the context and requirement for all DCPP\nemployees. Included in the area are all health related issues. This section also\nfocuses on Safety as a contextual, cultural requirement',
 'In the previous reviewing period (2017-2018) the DCISC reviewed the following',
 'Nuclear Safety Culture\nThe DCISC concluded the following',
 "The DCPP Nuclear Safety Culture Monitoring Panel and the Safety Culture\nLeadership Team identified an Improvement Opportunity that employee perception\nof the station's ability to maintain a proficient workforce is causing distraction. This\nmatches the DCISC concern about retention of qualified, experienced personnel\nnecessary to operate DCPP at an appropriate level of safety. The DCISC will\ncontinue to monitor this area closely",
 '4.5.2 Current Period Activities\nDuring the current period, the DCISC reviewed the following topics that\nfocused specifically on Health, Nuclear Safety Culture, or Safety Conscious Work\nEnvironme

In [8]:
len(page_pargraphs)

668
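To see what the split pattern does, and why the fixes in the next section are needed, here is a minimal sketch on a made-up string. The sample text is invented for illustration and is not taken from the report. The sketch uses the standard-library `re`, which should behave the same as `regex` for this pattern.

```python
import re

# Same delimiters as the cell above: ".\n", "<digit>)\n", ".)\n", ":\n" and ".<whitespace>\n".
PARAGRAPH_BREAK = r'\.\n|\d\)\n|\.\)\n|:\n|\.\s+\n'

# Hypothetical sample text, invented for illustration.
sample = (
    "The committee reviewed the following:\n"
    "Plant operations were found to be adequate.\n"
    "The report was presented by Mr.\n"
    "Smith at the public meeting"
)

pieces = re.split(PARAGRAPH_BREAK, sample)
for piece in pieces:
    print(repr(piece))
```

The third piece ends in `Mr` and the fourth starts with `Smith`: a title followed by a line break is mistaken for the end of a paragraph, and the delimiter itself (including the period) is discarded by `re.split`.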

### 4) Fixing tokenized paragraphs
#### First fix: Merging paragraphs that were split after Mr./Ms./Dr.

In [9]:
prefixes = ("Mr","St","Mrs","Ms","Dr")

fixed_paragraphs = defaultdict(list)

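for num, paragraphs in page_pargraphs.items():
    pending = None  # paragraph cut off after a title, waiting for its continuation
    for paragraph in paragraphs:
        if pending is not None:
            # Re-attach the continuation to the paragraph that ended with a title.
            paragraph = pending + ' ' + paragraph
            pending = None
        if paragraph.endswith(prefixes):
            pending = paragraph
        else:
            fixed_paragraphs[num].append(paragraph)
    # Keep a trailing paragraph that ended with a title instead of dropping it.
    if pending is not None:
        fixed_paragraphs[num].append(pending)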
    
fixed_paragraphs[8]

["responsible for the operation of his submarine's nuclear power plant.  Mr McWhorter then served the U. S. Nuclear Regulatory Commission for ten years first\nas an Operator Licensing Examiner and then as Senior Resident Inspector at North\nAnna Power Station.  He then was employed for two years as a Systems\nEngineering Manager for Dominion Virginia Power at North Anna Power Station",
 ' For ten years, Mr. McWhorter was employed at Old Dominion Electric Cooperative\nwhere he served as Vice President of Operations and Asset Management.  Mr McWhorter has been a Consultant to the DCISC since 2016',
 'Legal Counsel: Robert R. Wellington, Esq. has been Legal Counsel for the DCISC\nsince its organization in 1989. He is a graduate of Stanford University and the\nUniversity of California (Hastings) Law School. For over 40 years his practice has\nbeen limited to representing several cities, regional wastewater and solid waste\ndistricts and other public agencies, including the DCISC. He advise

In [10]:
len(fixed_paragraphs)

668
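An alternative to merging after the fact is to avoid the bad split in the first place. The following is a minimal sketch of that idea, not the approach used in this notebook: it chains one negative lookbehind per title in front of the period delimiter. The names `TITLES`, `not_after_title`, and `PARAGRAPH_BREAK`, and the sample text, are assumptions made for illustration.

```python
import re

# Titles whose trailing period should not count as a paragraph break.
TITLES = ("Mr", "Mrs", "Ms", "Dr", "St")

# One negative lookbehind per title. The standard-library `re` requires each
# lookbehind to have a fixed width, so they are chained rather than alternated.
not_after_title = "".join(rf"(?<!\b{title})" for title in TITLES)

# The lookbehinds guard only the period alternative; the others are unchanged.
# `\.\s*\n` covers both `\.\n` and `\.\s+\n` from the original pattern.
PARAGRAPH_BREAK = rf'{not_after_title}\.\s*\n|\d\)\n|\.\)\n|:\n'

# Hypothetical sample text, invented for illustration.
sample = (
    "He was responsible for the plant.  Mr.\n"
    "Smith then served as an inspector.\n"
    "He joined the committee in 2016"
)

pieces = re.split(PARAGRAPH_BREAK, sample)
for piece in pieces:
    print(repr(piece))
```

With this pattern the period after the title is kept in the text, and the `\b` in each lookbehind means an acronym that merely ends in `Ms` or `St` would still be treated as a sentence end.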

#### Second fix: Merging paragraphs that were split across two pages

In [11]:
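merged_paragraphs = {}
for num, paragraphs in fixed_paragraphs.items():
    # Copy the list so that fixed_paragraphs is not modified by the merge.
    merged_paragraphs[num] = list(paragraphs)
    first_word = paragraphs[0].split(' ')[0]
    # A page whose text opens with a lowercase word is treated as continuing the previous page.
    continues_previous = (first_word.islower() and first_word.isalpha()
                          and page_text[num].startswith(first_word))
    # Merge only when the previous page has stored paragraphs (avoids a KeyError).
    if continues_previous and merged_paragraphs.get(num - 1):
        merged_paragraphs[num - 1][-1] += ' ' + merged_paragraphs[num].pop(0)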

In [12]:
merged_paragraphs[384]

['Seismically Induced Systems Interactions (SISI) Program (Volume II, Exhibit D.4,\nSection 3.',
 'Routine station operations with respect to the SISI Program were governed by\nprocedures AD4.ID3, Revision 16, "SISI Housekeeping Activities," dated October 8,\n2019, and AD4.ID1, Revision 17, "Housekeeping," dated April 2, 2020, copies of\nwhich were provided to and reviewed by the DCISC.  These procedures appeared\nadequate and addressed application of the SISI Program to daily housekeeping\nactivities within the plant such as the following',
 "Transient equipment being brought into the plant\nComponent parts of systems, structures, or components being brought into\nthe plant\nNon-design change alterations of systems, structures, or components\nThe objective of the SISI Program was to ensure that safe-shutdown systems,\nstructures, and components, as well as certain accident-mitigating systems, would\nproperly function during and following an earthquake. The procedure's intent was\nto e
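The cross-page merge is easier to follow on toy data. The sketch below is a simplified version of the idea, under stated assumptions: the helper `merge_across_pages` and the `toy_pages` dictionary are hypothetical, and the check against the raw page text is left out.

```python
def merge_across_pages(pages):
    """Return a copy of `pages` with page-opening fragments moved to the previous page.

    `pages` maps page number -> list of paragraphs. A page is treated as a
    continuation when its first paragraph starts with a lowercase alphabetic word.
    """
    merged = {}
    for num in sorted(pages):
        merged[num] = list(pages[num])  # copy, so the input is left untouched
        if not merged[num]:
            continue
        first_word = merged[num][0].split(' ')[0]
        previous = merged.get(num - 1)  # None when the previous page is missing
        if previous and first_word.isalpha() and first_word.islower():
            previous[-1] += ' ' + merged[num].pop(0)
    return merged


# Hypothetical toy data, invented for illustration.
toy_pages = {
    1: ["First paragraph", "The committee will"],
    2: ["continue to monitor this area", "Second topic"],
    4: ["lowercase start but page 3 is missing"],
}

result = merge_across_pages(toy_pages)
for num, paragraphs in result.items():
    print(num, paragraphs)
```

Page 2's opening fragment is appended to the last paragraph of page 1, while page 4 is left alone because there is no page 3 to attach it to.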