Simple multilingual PDF text extraction. It also extracts text from images.
import pdf2textlib
print(pdf2textlib.getText("Demo.pdf","eng+tel+urd"))
# Parameter 1: path to the PDF file
# Parameter 2: string of language codes joined with '+', e.g. "eng+tel+urd"
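When the language codes come from a list or from user input, it can help to build the '+'-separated string in one place. The following is a minimal sketch, not part of pdf2textlib: `build_lang_string` and `extract_text` are hypothetical helper names, and the only library call used is `pdf2textlib.getText(path, langs)` as shown above.

```python
from typing import Iterable


def build_lang_string(codes: Iterable[str]) -> str:
    """Join language codes with '+', skipping blanks and duplicates.

    Hypothetical helper; pdf2textlib itself only expects the final string.
    """
    unique = []
    for code in codes:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


def extract_text(pdf_path: str, codes: Iterable[str]):
    """Call pdf2textlib.getText with a language string built from `codes`."""
    # Imported lazily so build_lang_string works even without the package.
    import pdf2textlib

    return pdf2textlib.getText(pdf_path, build_lang_string(codes))
```

With these helpers, `extract_text("Demo.pdf", ["eng", "tel", "urd"])` would pass the same arguments as the call shown above.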
Debian, Ubuntu, and derivatives:
sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
Fedora, Red Hat, and derivatives:
sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
macOS:
brew install pkg-config poppler
Conda users may also need libgcc:
conda install -c anaconda libgcc
Windows (currently tested only when using conda):
- Install the Microsoft Visual C++ Build Tools
- Install poppler through conda:
conda install -c conda-forge poppler
pip install pdf2textlib
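After installing, a quick way to confirm that Python can find the package is to look up its import name. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper, and it only checks that the module can be located, not that the OS dependencies above are working.

```python
import importlib.util


def is_installed(module_name: str = "pdf2textlib") -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed():
        print("pdf2textlib is importable")
    else:
        print("pdf2textlib was not found; try: pip install pdf2textlib")
```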