Reading PDF documents with Python #124

Open
Qingquan-Li opened this issue Sep 3, 2019 · 0 comments

Environment: Python 3.x

First, install the PDFMiner3k library (a Python 3.x port of PDFMiner).

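It is very flexible: it can be used from the command line or integrated into your own code. It can also handle different language encodings, and it works conveniently with files fetched over the network.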


from io import StringIO
from urllib.request import urlopen

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf

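def read_pdf(pdf_file):
    """Extract the text of a PDF from a binary file-like object."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    # Parse the PDF and write its text into retstr via the converter.
    process_pdf(rsrcmgr, device, pdf_file)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content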

# pdf_file = urlopen('https://bitcoin.org/files/bitcoin-paper/bitcoin_zh_cn.pdf')
pdf_file = urlopen('http://pythonscraping.com/pages/warandpeace/chapter1.pdf')
output_string = read_pdf(pdf_file)
print(output_string)
pdf_file.close()

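The biggest advantage of the read_pdf function is that, if the PDF file is on your own computer, you can simply replace pdf_file (the object returned by urlopen) with an ordinary open() file object: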

pdf_file = open('../pages/warandpeace/chapter1.pdf', 'rb')
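# 'r' (the default) opens the file for reading in text mode; 'b' selects binary mode; 'rb' opens it for binary reading.
# A PDF is binary data. In text mode Python 3 tries to decode the bytes to str, which can fail or corrupt the content, so always use 'rb'.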
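
The swap between urlopen and open() described above can be wrapped in a small helper. This is only a minimal sketch: open_pdf_source and extract_with are hypothetical names that are not part of PDFMiner3k, and the sketch assumes that anything without an http or https scheme is a local path. The reader argument stands for a function such as read_pdf.

```python
from contextlib import closing
from urllib.parse import urlparse
from urllib.request import urlopen


def open_pdf_source(source):
    """Return a binary file-like object for a URL or a local path.

    Hypothetical helper for illustration; not part of PDFMiner3k.
    """
    if urlparse(source).scheme in ("http", "https"):
        return urlopen(source)
    # Local file: always open in binary mode, since PDF is a binary format.
    return open(source, "rb")


def extract_with(reader, source):
    """Open `source`, pass the stream to `reader`, and always close the stream."""
    with closing(open_pdf_source(source)) as pdf_file:
        return reader(pdf_file)
```

With read_pdf defined as above, a call would look like extract_with(read_pdf, '../pages/warandpeace/chapter1.pdf'), and the stream is closed even if extraction raises an exception.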

For most PDFs that contain only plain text, the output is essentially indistinguishable from a plain-text file.
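
Because the result is an ordinary string, it can be post-processed like any other text. The sketch below assumes that the converter marks the end of each page with a form feed character ("\f"); that is worth checking against your own output before relying on it. split_pages and save_text are hypothetical helper names, not library functions.

```python
def split_pages(text):
    """Split extracted text into pages on form feeds, dropping empty pages.

    Assumption: pages are separated by "\f" in the extracted text.
    """
    return [page.strip() for page in text.split("\f") if page.strip()]


def save_text(text, path):
    """Write the extracted text to a UTF-8 encoded plain-text file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)
```

For example, split_pages(output_string) would give one string per page, and save_text(output_string, 'chapter1.txt') would store the whole result on disk.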
