# Extracting Content from PDFs with Python

In [None]:
from IPython.display import IFrame
IFrame('https://zh.wikipedia.org/wiki/%E5%8F%AF%E7%A7%BB%E6%A4%8D%E6%96%87%E6%A1%A3%E6%A0%BC%E5%BC%8F', width=900, height=450)

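PDF (**P**ortable **D**ocument **F**ormat) is a file format that presents documents independently of application software, hardware, and operating system. A PDF file usually mixes vector graphics, text, and bitmaps: text is stored as content strings, vector graphics made of shapes and lines are used for illustration and design, and bitmaps hold photographs and other kinds of images. This description follows the [Wikipedia article on PDF](https://zh.wikipedia.org/wiki/%E5%8F%AF%E7%A7%BB%E6%A4%8D%E6%96%87%E6%A1%A3%E6%A0%BC%E5%BC%8F).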

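In my own experience, common PDF files fall into two categories. The first is generated from text (text-based), and its content can usually be copied and pasted directly. The second is produced by scanning (scanned), for example a photocopied book or a PDF assembled from inserted images. Although both are PDF files, they differ in how they have to be processed and in the quality of the result. The third-party Python libraries for handling PDF files can be summarized as follows: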

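 - text-based: libraries such as `PyPDF2`, `pdfminer`, `textract`, and `slate` can extract text; libraries such as `pdfplumber` and `camelot` can extract tables.
 - scanned: first convert the file to images, then extract the content with OCR (optical character recognition), for example with the `pytesseract` library; alternatively, use `OpenCV` for image processing.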
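
One rough way to tell the two kinds of file apart is to run a text extractor first and look at how much real text comes back. The sketch below is a heuristic, not a feature of any of the libraries above: `looks_scanned` and its `min_chars_per_page` threshold are hypothetical names and values, and the input is simply a list of per-page strings from whichever extractor you use.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned from the text extracted per page.

    page_texts: list of strings, one per page, from any text extractor.
    min_chars_per_page: assumed threshold; tune it for your documents.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    # Count only non-whitespace characters, because pages without a
    # text layer often still yield newlines and spaces.
    visible = sum(len("".join(text.split())) for text in page_texts)
    return visible / len(page_texts) < min_chars_per_page


pages_with_text = ["Political uncertainty and corporate investment cycles",
                   "We document cycles in corporate investment"]
pages_without_text = ["\n \n\n", "  \n\n \n"]
print(looks_scanned(pages_with_text))     # False
print(looks_scanned(pages_without_text))  # True
```

The result depends on the extractor as much as on the file: a text-based PDF whose fonts the extractor cannot decode will also come back nearly empty and be flagged as scanned.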
 
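Next, we try these libraries in turn and compare the results. Most of them are third-party packages, so they need to be installed first:

```shell
pip install "PyPDF2<3"
pip install pdfminer.six
pip install textract
pip install slate
pip install pdfplumber
pip install camelot-py
pip install pytesseract
pip install pdf2image
```

A few notes on the package names: the `PdfFileReader`-style API used below was removed in PyPDF2 3.0, hence the version pin; `pdfminer.six` is the maintained fork of `pdfminer` and is imported under the same name; Camelot is published on PyPI as `camelot-py`; `pdf2image` and `pytesseract` are wrappers that also need the Poppler and Tesseract programs installed on the system.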
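
Before running the cells below it can help to confirm that the packages actually import. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the list holds import names, which are not always the same as the pip package names (for instance `pdfminer.six` is imported as `pdfminer`).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the libraries used in this notebook.
wanted = ["PyPDF2", "pdfminer", "textract", "slate",
          "pdfplumber", "camelot", "pytesseract"]
print("Missing:", missing_modules(wanted))
```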
 
Recommended reading:

[Working with PDF files in Python](https://www.geeksforgeeks.org/working-with-pdf-files-in-python/)

[Extracting text from PDFs](https://zhuanlan.zhihu.com/p/136888486)

[Exporting Data from PDFs with Python](https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/)

[Reading PDFs with the pdfminer3k module](https://blog.csdn.net/qq_42415326/article/details/89432839?utm_medium=distribute.pc_relevant.none-task-blog-baidujs-8)

[How do I use pdfminer as a library](https://stackoverflow.com/questions/5725278/how-do-i-use-pdfminer-as-a-library#)

# Implementation

## Text-based PDF

In [None]:
import os

In [None]:
os.listdir()
os.listdir("./input")

### The PyPDF2 library

`PyPDF2` provides four classes, `PdfFileReader`, `PdfFileMerger`, `PageObject`, and `PdfFileWriter`, which cover simple reading, splitting, cropping, and merging of PDF files.
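
These camelCase names belong to older PyPDF2 releases; PyPDF2 3.x removed `PdfFileReader`, `getPage`, and `extractText` in favor of `PdfReader`, `reader.pages`, and `extract_text`. The following is a small sketch, not part of PyPDF2, of a helper that accepts either style of reader object. `extract_page_texts` is a hypothetical name, and the `FakePage` and `FakeReader` classes exist only so the example runs without PyPDF2 or a PDF file.

```python
def extract_page_texts(reader):
    """Return one string per page from an old- or new-style reader object."""
    if hasattr(reader, "pages"):
        pages = list(reader.pages)
    else:  # fall back to the camelCase accessors
        pages = [reader.getPage(i) for i in range(reader.getNumPages())]

    texts = []
    for page in pages:
        # Prefer the snake_case method; fall back to the camelCase one.
        extract = getattr(page, "extract_text", None) or page.extractText
        texts.append(extract() or "")
    return texts


# Stand-ins so the sketch runs on its own.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extractText(self):  # old-style method name
        return self._text


class FakeReader:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


print(extract_page_texts(FakeReader(["page one", "page two"])))
```

With a real file, the argument would be `PyPDF2.PdfFileReader(f)` or `PyPDF2.PdfReader(f)`, depending on the installed version.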


In [7]:
import PyPDF2

# Get basic information about the PDF
pdfFile = open('./input/2020中央一号文件.pdf', 'rb')
pdfText = PyPDF2.PdfFileReader(pdfFile)
page_count = pdfText.getNumPages()  # total number of pages
print(page_count)  # 19 pages in total

19


In [8]:
# Extract the text
for p in range(0, page_count):
    text = pdfText.getPage(p)
    print(text.extractText().encode('utf-8'))  # but the output is essentially empty

b'\n \n\n \n\n \n\n2020\n\n1\n\n2\n\n \n  \n\n\n\n\n \n  \n2020\n\n\n\n\n\n\n\n2020\n\n\n \n  \n\n2020\n\n\n\n\n'
b'\n\n\n\n\n\n \n  \n\n \n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n'
b'  \n\n\n\n\n\n\n\n\n\n \n  \n\n\n\n\n\n\n\n\n \n  \n\n\n\n\n\n\n'
b'\n\n \n  \n\n\n\n\n\n\n\n\n \n  \n\n\n \n  \n\n\n\n\n\n\n\n\n\n'
b'\n\n \n  \n\n\n\n\n\n\n\n \n  \n\n\n\n\n\n\n\n\n\n\n\n \n  \n\n\n'
b'\n\n\n\n\n\n\n\n\n\n\n\n \n  \n\n\n\n\n\n\n\n\n\n\n'
b'\n\n \n  \n\n\n\n\n\n\n\n\n \n  \n\n\n\n\n\n\n\n\n\n \n  \n\n\n\n'
b'\n\n\n\n\n \n  \n\n\n\n(\n5.790\n,\n \n-\n0.23\n,\n \n-\n3.82%\n)\n\n\n \n  \n\n\n\n2020\n\n\n\n\n\n\n\n\n\n\n\n\n'
b'\n\n \n  \n\n\n2020\nE\xc3\xbc\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n  \n\n\n'
b'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n  \n\n\n\n\n\n\n\n'
b'\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n  \n\n\n\n\n\n\n\n\n\n\n\n'
b'\n\n\n \n  \n\n \n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
b'\n\n \n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n  \n\n\n\n\n\n'
b'\n\n\n\n\n \n  \n\n\n\n\n\n\n\n\n\n\n \n  \n\n

In [10]:
# Try another PDF, this one in English
pdfFile = open('./input/Political Uncertainty and Corporate Investment Cycles.pdf', 'rb')
pdfObj = PyPDF2.PdfFileReader(pdfFile)
page_count = pdfObj.getNumPages()
print(page_count)
for p in range(0, page_count):
    text = pdfObj.getPage(p)
    print(text.extractText())

39
THEJOURNALOFFINANCE
•
VOL.LXVII,NO.1
•
FEBRUARY2012
PoliticalUncertaintyandCorporateInvestment
Cycles
BRANDONJULIOandYOUNGSUKYOOK

ABSTRACT
Wedocumentcyclesincorporateinvestmentcorrespondingwiththetimingofna-
tionalelectionsaroundtheworld.Duringelectionyears,Þrmsreduceinvestmentex-
pendituresbyanaverageof4.8%relativetononelectionyears,controllingforgrowth
opportunitiesandeconomicconditions.Themagnitudeoftheinvestmentcyclesvaries
withdifferentcountryandelectioncharacteristics.Weinvestigateseveralpotential
explanationsandÞndevidencesupportingthehypothesisthatpoliticaluncertainty
leadsÞrmstoreduceinvestmentexpendituresuntiltheelectoraluncertaintyisre-
solved.TheseÞndingssuggestthatpoliticaluncertaintyisanimportantchannel
throughwhichthepoliticalprocessaffectsrealeconomicoutcomes.
T
HERELATIONSHIPBETWEENPOLITICS
andeconomicoutcomeshasalonghis-
toryinresearchandpublicdebate.Oneimportantwayinwhichpoliticsis
hypothesizedtoinßuencerealdecisionsisthroughthechannelofuncertainty
andinstability

52
TheJournalofFinance
R

TheÞrsttaskfortheelectiondatacollectionistoidentitythechiefexecutive
ofeachcountryandthenationalelectionsassociatedwiththeselectionofthe
chiefexecutive.Inacountrywithapresidentialsystem,thesupremeexecutive
powerisnormallyvestedintheofÞceofthepresident.Thus,presidentialelec-
tionsarenaturallyconsideredinouranalysisforcountrieswithpresidential
systems.Inaparliamentarysystem,theexecutivepowerisnormallyvestedin
acabinetresponsibletoparliament.Insuchacountry,theprimeministeror
premier,beingtheheadofthecabinetandleaderoftheparliament,functions
astheactualchiefexecutiveofthenation.Thus,legislativeelectionsareused
forcountrieswithparliamentarysystemsastheoutcomeofsuchelectionshas
theforemostinßuenceovertheappointmentofprimeminister.
4
Somecountries
useahybridsystemcombiningelementsofbothparliamentaryandpresiden-
tialdemocracy;apresidentandaprimeministercoexistwithbothpresidential
andlegislativeelectionsheldnationally.Insuchcases,weexaminetheconsti-
tutionalframeworkan

60
TheJournalofFinance
R

TableIII
BaselineInvestmentRegressions
Thistablepresentsestimatesfrominvestmentregressionsofthetype
I
ijt
=

i
+

1
ElectionDummy
jt
+

2
Q
i
,
t
−
1
+

3
CF
it
+

4
%

GDP
j
,
t
−
1
+

t
+

ijt
,
where
i
indexestheÞrm,
j
indexesthecountry,and
t
denotestheyear.Theleft-hand-sidevariable
iscapitalexpendituresscaledbybeginning-of-yeartotalassets.
Q
i
,
t
−
1
istheproxyforTobinÕs
Q
,
CF
it
iscashßow,and

GDP
j
,
t
−
1
isthepercentagechangeinrealgrossdomesticproduct
foragivencountryoverthepreviousyear.SeetheAppendixforvariabledeÞnitions.Standard
errors,clusteredbycountryandyear,arereportedinbrackets.

,

,and

representstatistical
signiÞcanceatthe10%,5%,and1%level,respectively.
(1)(2)(3)(4)(5)(6)
ElectionYear
−
0
.
0067
−
0
.
0036
−
0
.
0036
−
0
.
0037
−
0
.
0038
−
0
.
0041
Dummy[0
.
0016]

[0
.
0014]

[0
.
0014]

[0
.
0012]

[0
.
0013]

[0
.
0014]

Q
0
.
01100
.
00560
.
00550
.
0041
[0
.
0010]

[0
.
0012]

[0
.
0012]

[0
.
0010]

CashFlow0
.
18850
.
18660
.
1737
[

68
TheJournalofFinance
R

Figure1.
Investmentaroundnationalelections.ThisÞguredisplaysestimatesfromtheregressionresultsreportedinTableVIIofthefollowing
speciÞcation:
I
ijt
=

i
+

1
Election
jt
+

2
Election
jt
×
Close
jt
+

3
Post-Election
jt
+

4
Post-Election
jt
×
Close
jt
+

5
Q
i
,
t
−
1
+

6
CF
it
+

7
%

GDP
j
,
t
−
1
+

t
+

ijt
.
Thesampleincludesonlythosecountriesforwhichthetimingofelectionsisexogenous.Thevariable
Close
isadummyvariablesetequaltooneif
themarginofvictoryforagivenelectionissmallerthanthe25
th
percentileofthedistributionforallexogenouslytimedelections.Theverticalaxis
representstheinvestmentratesrelativetotheaveragerateofnonelectionyears,whicharetheperiodsneitherimmediatelybeforenorimmediately
afteranelection.Thedashedlinedisplayschangesininvestmentforallelections(basedonestimatesfromcolumn(2)ofTableVII),andthesolidlin
e
displayschangesininvestmentfortheelectionswithcloseoutcomes(basedonestimatesfromcolumn(3)ofTableVII).

PoliticalUncertaintyandCorporateInvestmen

PoliticalUncertaintyandCorporateInvestment
75
TableIX
PoliticalConnectionsandIncumbentOpportunism
ThistablepresentstheestimationresultsofthebaselinespeciÞcationforvarioussubsamples.The
ÞrstcolumnreportstheestimationresultsoftheinvestmentspeciÞcationomittingthepolitically
connectedÞrmsfromFaccio(2006).Column(2)reportsestimationresultsforthesubsamplewitha
highdegreeofcentralbankindependence(CBI).SpeciÞcally,thesampleincludesanyobservation
inwhichtheCBImeasureisgreaterthanthe75
th
percentileoftheCBIdistribution.TheCBI
indexisdeÞnedaccordingtoCukiermanetal.(1992).TheelectionyeardummyintheÞnalcolumn
ismodiÞedsuchthatitissettooneonlyiftheincumbentleaderisnotrunningforre-election
inanygivenelection,andzerootherwise.TobinÕs
Q
,cashßow,andGDPgrowthareincludedas
controlvariables.Standarderrors,clusteredbycountryandyear,arereportedinbrackets.
(1)(2)(3)
PoliticallyConnectedHighCentralIncumbent
FirmsOmittedBankIndependenceNotRunning
ElectionYearDummy
−
0
.
0036
−
0
.
0040
−
0
.
0052
[0
.
0013]

[0



PoliticalUncertaintyandCorporateInvestment
81
Appendix
Ñ
Continued
RegularElectionAnelectionisclassiÞedasregularifitisheldwithin6monthsbeforeor
aftertheanticipatedelectiondate,whichiscalculatedbyaddingthe
nominaltermofthechiefexecutivetothepreviouselectiondate.
Otherwise,anelectionisclassiÞedasirregular.Anelectionisalso
classiÞedasirregularifitisheldfortheÞrsttime.
ExogenousElectionAnelectionisclassiÞedasexogenousifitstimingisÞxedby
constitutionorelectorallaw.TobespeciÞc,allcountrieswitharecord
ofearlyelectionsareclassiÞedashavingendogenoustiming.All
presidentialelections,withtheexceptionofSriLankaÕs,areheldona
regularbasisandareclassiÞedashavingexogenoustiming.This
leavesunclassiÞedsevencountrieswithparliamentarysystemsand
onecountrywithahybridsystem.Inordertoclassifythese
remainingcountries,werefertoelectorallawsandpracticesaswell
astheclassiÞcationprovidedbyAlesinaetal.(1992).Accordingly,
threeoftheremainingcountries,CzechRepublic,Finland,andNew
Zealand,areclassiÞedashavingendogeno
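
In the output above the words run together and some characters are wrong: `Þ` stands where `fi` should be (`Þrms`), `ß` where `fl` should be (`inßuence`), and `Õ` where an apostrophe should be (`TobinÕs`). The character substitutions can be patched after extraction. This is a sketch under the assumption that those three characters never occur legitimately in the text; `LIGATURE_FIXES` and `repair_text` are hypothetical names, and the missing spaces between words cannot be recovered this way.

```python
import re

# Assumed mapping, read off the garbled output above.
LIGATURE_FIXES = {"Þ": "fi", "ß": "fl", "Õ": "'"}


def repair_text(text):
    """Undo mis-decoded ligatures and rejoin words hyphenated at line ends."""
    for wrong, right in LIGATURE_FIXES.items():
        text = text.replace(wrong, right)
    # "na-\ntional" -> "national"; this also merges genuine hyphens that
    # happen to fall at a line end, so review the result.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "Duringelectionyears,Þrmsreduceinvestmentex-\npendituresbyanaverage"
print(repair_text(sample))
```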

### The pdfminer library

In [11]:
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()  # stores shared resources such as fonts and images
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)  # processes the content of each page
    password = ""  # password; leave empty if the file has none
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text


convert_pdf_to_txt("./input/2020中央一号文件.pdf")

'中共中央   国务院关于抓好“三农”领域重点工作  \n\n确保如期实现全面小康的意见  \n\n（2020 年 1 月 2 日） \n\n \n\n  党的十九大以来，党中央围绕打赢脱贫攻坚战、实施乡村\n\n振兴战略作出一系列重大部署，出台一系列政策举措。农业农\n\n村改革发展的实践证明，党中央制定的方针政策是完全正确\n\n的，今后一个时期要继续贯彻执行。  \n\n \n\n  2020 年是全 面建成小康社会目标实现之年，是全面打赢脱\n\n贫攻坚战收官之年。党中央认为，完成上述两大目标任务，脱\n\n贫攻坚最后堡垒必须攻克，全面小康“三农”领域突出短板必\n\n须补上。小康不 小康，关键看老乡。脱贫攻坚质量怎么样、小\n\n康成色如何，很大程度上要看“三农”工作成效。全党务必深\n\n刻认识做好 2020 年“三农”工作的特殊重要性，毫不松懈，\n\n持续加力，坚决夺取第一个百年奋斗目标的全面胜利。  \n\n \n\n  做好 2020 年 “三农”工作总的要求是，坚持以习近平新\n\n时代中国特色社会主义思想为指导，全面贯彻党的十九大和十\n\n九届二中、三中、四中全会精神，贯彻落实中央经济工作会议\n\n精神，对标对表全面建成小康社会目标，强化举措、狠抓落\n\n\x0c实，集中力量完成打赢脱贫攻坚战和补上全面小康“三农”领\n\n域突出短板两大重点任务，持续抓好农业稳产保 供和农民增\n\n收，推进农业高质量发展，保持农村社会和谐稳定，提升农民\n\n群众获得感、幸福感、安全感，确保脱贫攻坚战圆满收官，确\n\n保农村同步全面建成小康社会。  \n\n \n\n \n\n  一、坚决打赢脱贫攻坚战  \n\n  （一）全面完成脱贫任务。脱贫攻坚已经取得决定性成\n\n就，绝大多数贫困人口已经脱贫，现在到了攻城拔寨、全面收\n\n官的阶段。要坚持精准扶贫，以更加有力的举措、更加精细的\n\n工作，在普遍实现“两不愁”基础上，全面解决“三保障”和\n\n饮水安全问题，确保剩余贫困人口如期脱贫。进一步聚焦“三\n\n区三州”等深度贫困地区，瞄准突出问题和薄弱环节集中发\n\n力，狠抓政策落实 。对深度贫困地区贫困人口多、贫困发生率\n\n高、脱贫难度大的县和行政村，要组织精锐力量强力帮扶、挂\n\n牌督战。对特殊贫困群体，要落实落细低保、医保、养老
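
The returned string keeps the layout of the PDF: pages are separated by a form-feed character (`\x0c`), and each visual line ends in `\n\n`, so sentences are broken mid-way. Below is a minimal post-processing sketch using only the standard library. `split_pages` is a hypothetical helper, and joining the lines without a separator is an assumption that suits Chinese text but would glue English words together.

```python
def split_pages(raw_text):
    """Split pdfminer-style output into pages and undo the line wrapping."""
    pages = []
    for chunk in raw_text.split("\x0c"):
        # Strip each visual line and drop the empty ones.
        lines = [line.strip() for line in chunk.splitlines()]
        joined = "".join(line for line in lines if line)
        if joined:
            pages.append(joined)
    return pages


raw = "强化举措、狠抓落\n\n\x0c实，集中力量完成打赢脱贫攻坚战\n\n"
print(split_pages(raw))
```

Because blank lines are dropped, paragraph boundaries are lost as well; keeping them would need a rule for what counts as a paragraph break in the particular document.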

### The textract library

Installing textract does not install pdfminer automatically, so pdfminer has to be installed by hand.

[UnboundLocalError: local variable 'pipe' referenced before assignment when parsing a PDF with textract in Python](https://blog.csdn.net/FannLann/article/details/80238889)

In [12]:
import textract
text = textract.process("./input/2020中央一号文件.pdf", 'utf-8')
print(text.decode())

中共中央

国务院关于抓好“三农”领域重点工作

确保如期实现全面小康的意见

（ 2020 年 1 月 2 日 ）

党的十九大以来，党中央围绕打赢脱贫攻坚战、实施乡村
振兴战略作出一系列重大部署，出台一系列政策举措。农业农
村改革发展的实践证明，党中央制定的方针政策是完全正确
的，今后一个时期要继续贯彻执行。

2020 年 是 全 面 建 成 小 康 社 会 目 标 实 现 之 年 ， 是 全 面 打 赢 脱
贫攻坚战收官之年。党中央认为，完成上述两大目标任务，脱
贫攻坚最后堡垒必须攻克，全面小康“三农”领域突出短板必
须补上。小康不小康，关键看老乡。脱贫攻坚质量怎么样、小
康成色如何，很大程度上要看“三农”工作成效。全党务必深
刻 认 识 做 好 2020 年 “ 三 农 ” 工 作 的 特 殊 重 要 性 ， 毫 不 松 懈 ，
持续加力，坚决夺取第一个百年奋斗目标的全面胜利。

做 好 2020 年 “ 三 农 ” 工 作 总 的 要 求 是 ， 坚 持 以 习 近 平 新
时代中国特色社会主义思想为指导，全面贯彻党的十九大和十
九届二中、三中、四中全会精神，贯彻落实中央经济工作会议
精神，对标对表全面建成小康社会目标，强化举措、狠抓落

实，集中力量完成打赢脱贫攻坚战和补上全面小康“三农”领
域突出短板两大重点任务，持续抓好农业稳产保供和农民增
收，推进农业高质量发展，保持农村社会和谐稳定，提升农民
群众获得感、幸福感、安全感，确保脱贫攻坚战圆满收官，确
保农村同步全面建成小康社会。

一、坚决打赢脱贫攻坚战

（一）全面完成脱贫任务。脱贫攻坚已经取得决定性成
就，绝大多数贫困人口已经脱贫，现在到了攻城拔寨、全面收
官的阶段。要坚持精准扶贫，以更加有力的举措、更加精细的
工作，在普遍实现“两不愁”基础上，全面解决“三保障”和
饮水安全问题，确保剩余贫困人口如期脱贫。进一步聚焦“三
区三州”等深度贫困地区，瞄准突出问题和薄弱环节集中发
力，狠抓政策落实。对深度贫困地区贫困人口多、贫困发生率
高、脱贫难度大的县和行政村，要组织精锐力量强力帮扶、挂
牌督战。对特殊贫困群体，要落实落细低保、医保、养老保
险、特困人员救助供养、临时救助等综合社会保障政策，实现
应保尽保。各
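
Some lines in the output above have a space between every character (`2020 年 是 全 面 ...`). One way to clean that up is a regular expression, sketched below. `remove_cjk_spaces` is a hypothetical name, and the character class is an assumption that covers only the common CJK ideograph and punctuation ranges.

```python
import re

# Assumed character class: CJK ideographs, CJK punctuation,
# full-width forms, and curly double quotes.
CJK = r"\u4e00-\u9fff\u3000-\u303f\uff00-\uffef\u201c\u201d"
_CJK_GAP = re.compile(rf"(?<=[{CJK}]) +(?=[{CJK}])")


def remove_cjk_spaces(text):
    """Delete spaces that sit between two CJK characters."""
    return _CJK_GAP.sub("", text)


print(remove_cjk_spaces("2020 年 是 全 面 建 成 小 康 社 会"))
```

Spaces next to digits or Latin letters are left alone, so `2020 年` keeps its space.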

## Scanned PDF

> Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images. Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

In [13]:
from pdf2image import convert_from_path
from PIL import Image
import pytesseract

In [15]:
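# Split the PDF: copy the pages with zero-based indices 15-30 into a new file
pdfFile = open('./input/Political Uncertainty and Corporate Investment Cycles.pdf', 'rb')
pdf_input = PyPDF2.PdfFileReader(pdfFile)

pdf_output = PyPDF2.PdfFileWriter()
for i in range(15, 31):
    pdf_output.addPage(pdf_input.getPage(i))
# Use a context manager so the output file is flushed and closed
with open('./output/风险、不确定性和利润_15_30.pdf', 'wb') as out_file:
    pdf_output.write(out_file)
pdfFile.close()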
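
PyPDF2 numbers pages from zero, so `range(15, 31)` selects indices 15 to 30, which are the 16th to 31st pages as a PDF viewer counts them. A tiny helper can make that conversion explicit; `page_indices` is a hypothetical name, not a PyPDF2 function.

```python
def page_indices(first_page, last_page):
    """Convert an inclusive 1-based page range to zero-based indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return range(first_page - 1, last_page)


# Pages 16-31 of the document are indices 15-30.
print(list(page_indices(16, 31)) == list(range(15, 31)))  # True
```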

In [17]:
# Convert the PDF to images (one JPEG per page, rendered at 500 dpi)
PDF_file = './output/风险、不确定性和利润_15_30.pdf'
pages = convert_from_path(PDF_file, 500)
image_counter = 1
for page in pages:
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter += 1

OSError: [Errno 22] Invalid argument: '.\temp\\page_1.jpg'
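
The path in this error message hints at the cause: in an ordinary Python string literal `\t` is a tab character, so a Windows-style path such as `".\temp\page_1.jpg"` contains a tab and is rejected by the operating system. The message presumably comes from an earlier version of the cell that saved into a `temp` folder. Below is a sketch of building the output paths with `pathlib` instead, which avoids backslashes altogether; `page_image_path` and the `temp` folder name are assumptions for illustration.

```python
from pathlib import Path


def page_image_path(out_dir, page_number):
    """Build the path for one page image, e.g. temp/page_1.jpg."""
    return Path(out_dir) / f"page_{page_number}.jpg"


# "\t" inside a normal string literal is a single tab character,
# so the first string is one character shorter than the raw one.
print(len(".\temp"), len(r".\temp"))
print(page_image_path("temp", 1).as_posix())
```

In the loop above this would be used as `page.save(page_image_path("temp", image_counter), 'JPEG')`, after creating the folder with `Path("temp").mkdir(exist_ok=True)`.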

In [None]:
# Extract text from the images
filelimit = image_counter-1
outfile = "./output/out_text.txt"
with open(outfile, "a", encoding="utf-8") as f:
    for i in range(1, filelimit + 1):
        filename = "page_"+str(i)+".jpg"
        text = pytesseract.image_to_string(Image.open(filename), lang='chi_sim')
        # Remove all line breaks and spaces
        text = text.replace('\n', '')
        text = text.replace(' ', '')
        f.write(text)