Reading PDF documents with Python #124

Qingquan-Li · 2019-09-03T11:10:44Z

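Environment: Python 3.x

First, install the PDFMiner3k library (a Python 3.x port of PDFMiner).

It is very flexible: it can be used from the command line or integrated into your own code. It also handles a variety of language encodings, and it works conveniently with files fetched over the network.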

from urllib.request import urlopen
from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

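def read_pdf(pdf_file):
    """Extract the text of an open binary PDF file object and return it as a str."""
    rsrcmgr = PDFResourceManager()  # shared store for fonts and other resources
    retstr = StringIO()             # in-memory buffer that collects the extracted text
    laparams = LAParams()           # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    # Parse the PDF and feed every page through the text converter.
    process_pdf(rsrcmgr, device, pdf_file)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content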

# pdf_file = urlopen('https://bitcoin.org/files/bitcoin-paper/bitcoin_zh_cn.pdf')
pdf_file = urlopen('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
output_string = read_pdf(pdf_file)
print(output_string)
pdf_file.close()
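
The object returned by urlopen is a network stream that cannot be rewound. As a precaution, the download can be buffered in memory first so that the parser receives a seekable file-like object. The following is only a minimal sketch: the helper name load_into_memory is hypothetical and not part of PDFMiner3k, and whether your PDFMiner3k version actually needs a seekable stream is an assumption worth checking.

```python
from io import BytesIO
from urllib.request import urlopen


def load_into_memory(url):
    """Download the resource at `url` and return it as a seekable BytesIO.

    Hypothetical helper for illustration; it is not part of PDFMiner3k.
    """
    # The with-block closes the connection as soon as the bytes are read.
    with urlopen(url) as response:
        data = response.read()
    return BytesIO(data)
```

With this helper, the call above could become pdf_file = load_into_memory('http://pythonscraping.com/pages/warandpeace/chapter1.pdf'), and the result is passed to read_pdf exactly as before. The trade-off is that the whole file is held in memory, which is fine for small documents but may not suit very large ones.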

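The biggest advantage of the read_pdf function is that, if the PDF file is stored on your computer, you can simply replace the pdf_file object returned by urlopen with an ordinary open() file object: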

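pdf_file = open('../pages/warandpeace/chapter1.pdf', 'rb')
# 'r' (the default) opens the file for reading in text mode; adding 'b' selects binary mode, so 'rb' means "read binary".
# A PDF is binary data. In Python 3, text mode decodes the bytes to str and translates newlines, which can fail or corrupt the content; 'rb' returns the raw bytes unchanged.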
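
To make the difference between the two modes concrete, here is a small self-contained sketch. It does not use a real PDF: the sample bytes are a made-up stand-in (a PDF-style header followed by bytes that are not valid UTF-8), written to a temporary file and then read back in both modes.

```python
import os
import tempfile

# Hypothetical sample data: a PDF-style header plus bytes that are not valid UTF-8.
sample = b"%PDF-1.4\r\n\xe2\xe3\xcf\xd3\r\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Binary mode returns the bytes exactly as stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode tries to decode the bytes and fails on the non-UTF-8 part.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_ok = True
    except UnicodeDecodeError:
        text_mode_ok = False
finally:
    os.remove(path)

print(raw == sample)   # binary read is byte-for-byte identical
print(text_mode_ok)    # text-mode read did not succeed
```

The encoding is set explicitly to UTF-8 here so the behaviour does not depend on the platform default; with another default encoding, text mode might instead decode silently into meaningless characters.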

For most PDFs that contain only plain text, the output is practically indistinguishable from an ordinary plain-text file.
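
Because the result is a plain string, it can be post-processed with ordinary string methods. The sketch below assumes that the text converter ends each page with a form feed character ('\f'), as PDFMiner's TextConverter does; check this against the output of your installed version. The function name split_pages and the sample string are hypothetical and used only for illustration.

```python
def split_pages(text):
    """Split extracted text on form feeds and drop empty pages.

    Assumes each page in `text` ends with a form feed ('\\f').
    """
    pages = [page.strip() for page in text.split("\f")]
    return [page for page in pages if page]


# Made-up stand-in for the string returned by read_pdf().
sample_output = "CHAPTER I\n\nFirst page text.\n\f\nSecond page text.\n\f"
pages = split_pages(sample_output)
print(len(pages))
```

In real use, pages = split_pages(read_pdf(pdf_file)) would give one string per page, which is convenient for searching or saving pages separately.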

